FWI in Exploration Geophysics: a Tutorial
This is a tutorial on the principles of FWI (in perpetual construction). We develop the theory in the frequency domain for the sake of compact notation. However, we will periodically make the link with the time-domain formulation, since the frequency-domain matrix formulation requires a higher level of abstraction and hence may be difficult to conceptualize on a first reading.
========================================================
$$\tag{Content}\label{Content}$$
Content
Click on the section number to get there.
\( \ref{I} \) Notation and problem statement
\( \ref{II} \) FWI with local gradient-based optimization methods
\( \ref{III} \) Generic expression of gradient and Hessian
\( \ref{IV} \) The gradient and Hessian of FWI: imaging principle
\( \ref{V} \) Efficient computation of the sensitivity matrix
\( \ref{VI} \) Efficient computation of the gradient
\( \ref{VII} \) Computing the gradient of the misfit function with the adjoint-state method
\( \ref{VIII} \) The adjoint-state method with Lagrangian functional
\( \ref{IX} \) Role of the Hessian
\( \ref{X} \) Resolution analysis
\( \ref{XI} \) The cycle skipping pathology
\( \ref{XII} \) Regularization of FWI
\( \ref{XIII} \) FWI and least-squares Reverse Time Migration
\( \ref{XIV} \) Reflection Waveform Inversion (RWI)
\( \ref{XV} \) Source estimation in FWI
\( \ref{XVI} \) Multi-parameter FWI
\( \ref{XVII} \) Appendix 1: A few words about the method of Lagrange multipliers
========================================================
$$\tag{I}\label{I}$$
Notation and problem statement
Let's consider the time-harmonic (Helmholtz) equation written in matrix form as
$$\boldsymbol{A} \left( \omega,\boldsymbol{m}(\boldsymbol{x}) \right)\boldsymbol{u}(\omega,\boldsymbol{x}) = \boldsymbol{b}(\omega,\boldsymbol{x}), \tag{1.1}\label{1.1}$$
where \( \boldsymbol{A} = \omega^2 \text{diag}(\boldsymbol{m}) + \Delta \) denotes the Helmholtz operator, \( \omega \) is the angular frequency, \(\Delta = \partial_{xx}+\partial_{zz}\) is the Laplace operator, \( \boldsymbol{m} \in \Bbb{R}^{N \times 1} \) gathers the squared slownesses, \( \boldsymbol{u} \in \Bbb{C}^{N \times 1} \) is the monochromatic wavefield and \( \boldsymbol{b} \in \Bbb{C}^{N \times 1} \) is the monochromatic source. \( \boldsymbol{u} \) and \( \boldsymbol{b} \) are vectors of dimension N, where N denotes the number of degrees of freedom used to discretize the subsurface medium \( \boldsymbol{m} \). Accordingly, \( \boldsymbol{A} \) is a full-rank N x N square matrix.
Unless explicitly stated otherwise, we will drop the dependency on \( \omega \) in equation \( (\ref{1.1}) \) for the sake of compact notation:
$$\boldsymbol{A} (\boldsymbol{m}) \boldsymbol{u}= \boldsymbol{b}. \tag{1.2}\label{1.2}$$
This means that we consider a mono-frequency problem. However, multiple frequencies can be jointly processed by FWI through a discrete summation over frequencies, in a similar manner to the time-domain formulation, which proceeds by summation over time.
Equation \( (\ref{1.2}) \) is called the state equation, which can be written in generic form as
$$F(\boldsymbol{u},\boldsymbol{m}) =0, \tag{1.3}\label{1.3}$$
where \( \boldsymbol{u} \) is the state variable and \( \boldsymbol{m} \) the model parameters.
FWI aims to estimate the subsurface parameters \( \boldsymbol{m} \) from parsimonious measurements of \( \boldsymbol{u} \). Accordingly, FWI can be formulated as the following multivariate PDE-constrained optimization problem
$$\min_{\boldsymbol{u},\boldsymbol{m}} \frac{1}{2} \| \boldsymbol{P u} - \boldsymbol{d}^*\|_2^2 ~~ \text{subject to} ~~ \boldsymbol{A}(\boldsymbol{m}) \boldsymbol{u}= \boldsymbol{b}, \tag{1.4a}\label{1.4a}$$
or equivalently, $$\min_{\boldsymbol{u},\boldsymbol{m}} \frac{1}{2} \| \boldsymbol{P u} - \boldsymbol{d}^*\|_2^2 ~~ \text{subject to} ~~ F(\boldsymbol{u},\boldsymbol{m}) =0,\tag{1.4b}\label{1.4b}$$
where \( \boldsymbol{d}^* \) denotes the observations (the recorded data) and the linear observation operator \( \boldsymbol{P} \) samples the wavefield \( \boldsymbol{u} \) at the receiver positions. For the sake of compactness, we consider a single-source problem, but the extension to multiple sources follows easily through a summation over sources. \( \boldsymbol{P} \) is an M x N matrix, where M << N denotes the number of measurements.
The equation relating the wavefield solution \( \boldsymbol{u} \) (the state variable) to the measurements \( \boldsymbol{d}^*\) through the observation operator \( \boldsymbol{P} \) is called the observation equation
$$\boldsymbol{P u} = \boldsymbol{d}^*. \tag{1.5}\label{1.5}$$
From equation \( (\ref{1.2}) \), let's consider the closed-form expression of \( \boldsymbol{u} \) as a nonlinear function of \( \boldsymbol{m} \)
$$\boldsymbol{u}= \boldsymbol{A}^{-1}(\boldsymbol{m})\boldsymbol{b}, \tag{1.6}\label{1.6}$$
where the columns of \( \boldsymbol{A}^{-1} \) are the Green functions for point sources located at each grid point of the computational domain (indeed, \( \boldsymbol{A} \boldsymbol{A}^{-1} = \boldsymbol{I} \), where the identity matrix \( \boldsymbol{I} \) gathers the impulse sources).
In compact form,
$$\boldsymbol{u}= \boldsymbol{S}(\boldsymbol{m}). \tag{1.7}\label{1.7}$$
With this notation, the observation equation reads
$$\boldsymbol{d}= \boldsymbol{P} \boldsymbol{S}(\boldsymbol{m})=\phi(\boldsymbol{m}). \tag{1.8}\label{1.8}$$
In the framework of the so-called reduced-space formulation of FWI, we can recast the multivariate constrained problem \( (\ref{1.4a}) \) as a mono-variate unconstrained problem by enforcing the closed-form expression of \( \boldsymbol{u} \) as a function of \( \boldsymbol{m} \), \( (\ref{1.6}) \), in the data-misfit function (the first term of \( (\ref{1.4a}) \)). Simply put, we force the constraint to be strictly satisfied. This shrinks the FWI search space for the benefit of computational efficiency. This variable elimination is called variable projection:
$$\min_{\boldsymbol{m}} \frac{1}{2} \| \boldsymbol{P A}(\boldsymbol{m})^{-1} \boldsymbol{b} - \boldsymbol{d}^*\|_2^2, \tag{1.9}\label{1.9}$$
namely,
$$\min_{\boldsymbol{m}} \frac{1}{2} \| \phi(\boldsymbol{m}) - \boldsymbol{d}^*\|_2^2. \tag{1.10}\label{1.10}$$
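To make these notations concrete, below is a minimal NumPy/SciPy sketch of the state equation \( (\ref{1.2}) \), its closed-form solution \( (\ref{1.6}) \) and the observation equation \( (\ref{1.5}) \) on a small 2D grid. The 5-point Laplacian with Dirichlet boundaries, the grid size, the source position and the receiver indices are illustrative assumptions, not part of the tutorial.

```python
# Minimal sketch of the state and observation equations, assuming a 5-point
# Laplacian with Dirichlet boundaries on an n x n grid (illustrative choices).
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n, h = 51, 50.0                          # grid points per dimension, grid step (m)
N = n * n                                # number of degrees of freedom
omega = 2.0 * np.pi * 4.0                # angular frequency of a 4 Hz problem
m = np.full(N, 1.0 / 3000.0**2)          # squared slowness of a 3000 m/s medium

# Helmholtz operator A(m) = omega^2 diag(m) + Laplacian, as in equation (1.1)
L1d = sp.diags([1.0, -2.0, 1.0], [-1, 0, 1], shape=(n, n)) / h**2
I = sp.identity(n)
A = (omega**2 * sp.diags(m) + sp.kron(L1d, I) + sp.kron(I, L1d)).astype(complex).tocsc()

# State equation A u = b for a point source b, solved as u = A^{-1} b (1.6)
b = np.zeros(N, dtype=complex)
b[N // 2] = 1.0
u = spla.spsolve(A, b)

# Observation equation P u = d: P samples u at M receiver indices (M << N)
rec = np.arange(n + 5, n + 45, 4)        # assumed receiver locations (grid indices)
P = sp.csr_matrix((np.ones(rec.size), (np.arange(rec.size), rec)), shape=(rec.size, N))
d = P @ u                                # the simulated data phi(m)
```

Note that the columns of \( \boldsymbol{A}^{-1} \) never need to be formed explicitly: a sparse factorization of \( \boldsymbol{A} \) gives access to any Green function by substitution.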
Back to \ref{Content}
$$\tag{II}\label{II}$$
FWI with local gradient-based optimization methods
Let's denote by \( C(\boldsymbol{m}) \) the nonlinear objective function associated with the minimization problem \( (\ref{1.10}) \),
$$C(\boldsymbol{m}) = \frac{1}{2} \| \phi(\boldsymbol{m}) - \boldsymbol{d}^*\|_2^2, \tag{2.1}\label{2.1}$$
where the factor 1/2 is introduced to simplify the forthcoming derivatives (it does not change the minimizer).
To solve \( (\ref{1.10}) \) with local optimization techniques, we consider a surrogate locally-quadratic approximation of \( C(\boldsymbol{m}) \). This surrogate function \( \tilde{C} \) majorizes C (Figure 2.1) and therefore Newton methods belong to the category of Majorization-Minimization (MM) methods (Lange, 2004). To build the quadratic surrogate function around an initial guess \( {\boldsymbol{m}}_0 \), we apply a second-order Taylor expansion to \( C(\boldsymbol{m}) \)
$$\tilde{C}({\boldsymbol{m}}_0 + \Delta {\boldsymbol{m}},{\boldsymbol{m}}_0)=C({\boldsymbol{m}}_0)+\sum_{j=1}^{N} \frac{\partial C({\boldsymbol{m}}_0)}{\partial m_j} \Delta m_j
+ \frac{1}{2} \sum_{j=1}^{N} \sum_{k=1}^{N} \frac{\partial^2 C({\boldsymbol{m}}_0)}{\partial m_j \partial m_k} \Delta m_j \Delta m_k, \tag{2.2a}\label{2.22}$$
where \( \Delta {\boldsymbol{m}} \) is a small perturbation of \( {\boldsymbol{m}}_0 \). The second argument of \( \tilde{C} \) is a reminder of the point around which we linearize the problem. This point changes at each FWI iteration (Figure 2.1).
In matrix form,
$$\tilde{C}({\boldsymbol{m}},{\boldsymbol{m}}_0)=C({\boldsymbol{m}}_0)+\left( \frac{\partial C({\boldsymbol{m}}_0)}{\partial \boldsymbol{m}} \right)^T (\boldsymbol{m}-\boldsymbol{m}_0)
+\frac{1}{2}(\boldsymbol{m}-{\boldsymbol{m}}_0)^T \left( \frac{\partial^2 C({\boldsymbol{m}}_0)}{\partial \boldsymbol{m}^2} \right) ( \boldsymbol{m} - \boldsymbol{m}_0). \tag{2.2b}\label{2.2}$$
Note that \( \tilde{C}({\boldsymbol{m}}_0,{\boldsymbol{m}}_0)=C({\boldsymbol{m}}_0) \), \( \nabla_{m} \tilde{C}({\boldsymbol{m}}_0,{\boldsymbol{m}}_0) = \nabla_{m} C({\boldsymbol{m}}_0) \) and \( \partial^2_m \tilde{C}({\boldsymbol{m}}_0,{\boldsymbol{m}}_0) = \partial^2_m C({\boldsymbol{m}}_0) \).

Figure 2.1: At each iterate, the misfit function \( C(m) \) is majorized by a quadratic function \( \tilde{C}(m,m_k) \) such that \( \tilde{C}(m_k,m_k)=C(m_k) \) and \( \nabla_m \tilde{C}(m_k,m_k) = \nabla_m C(m_k) \). Provided that \( \tilde{C}(m,m_k)\) is sufficiently close to \( C(m) \) in the vicinity of \( m_k \), \( \tilde{C}(m,m_k)\) majorizes \( C(m) \) and the sequence of iterates converges toward the minimizer of \( C(m) \) as k tends to \( \infty \) (see the textbook of K. Lange for more details).
Let's differentiate \( \tilde{C} \) with respect to the parameter \( m_l \)
$$\frac{\partial \tilde{C}(\boldsymbol{m},\boldsymbol{m}_0)}{\partial m_l}=\frac{\partial C({\boldsymbol{m}}_0)}{\partial m_l} +\sum_{j=1}^{N} \frac{\partial^2 C({\boldsymbol{m}}_0)}{\partial m_j \partial m_l} \Delta m_j. \tag{2.3}\label{2.3}$$
In matrix form,
$$\frac{\partial \tilde{C}({\boldsymbol{m}},\boldsymbol{m}_0)}{\partial {\boldsymbol{m}} }=\frac{\partial C({\boldsymbol{m}}_0)}{\partial \boldsymbol{m}}+\frac{\partial^2 C({\boldsymbol{m}}_0)}{\partial {\boldsymbol{m}}^2} \Delta {\boldsymbol{m}}. \tag{2.4}\label{2.4}$$
The minimum of \( \tilde{C} \) is reached when the first derivative of \( \tilde{C} \) vanishes.
$$\frac{\partial \tilde{C}(\boldsymbol{m},\boldsymbol{m}_0)}{\partial \boldsymbol{m}}=\boldsymbol{0}. \tag{2.5}\label{2.5}$$
This leads to the so-called Newton system:
$$\frac{\partial^2 C({\boldsymbol{m}}_0)}{\partial {\boldsymbol{m}}^2} \, \Delta {\boldsymbol{m}} = - \frac{\partial C({\boldsymbol{m}}_0)}{\partial \boldsymbol{m}}. \tag{2.6}\label{2.6}$$
The right-hand side is the opposite of the gradient of C (the steepest-descent direction). The left operator is the Hessian matrix, the second derivative of \( C \).
A Newton algorithm iteratively updates \( \boldsymbol{m} \) through the recursion
$$\boldsymbol{m}^{k+1}= \boldsymbol{m}^k + \alpha \Delta \boldsymbol{m}^k,\tag{2.7}\label{2.7}$$
where the descent direction \( \Delta \boldsymbol{m}^k \) is given by
$$ \Delta {\boldsymbol{m}}^{k} = - \left( \frac{\partial^2 C({\boldsymbol{m}}^k)}{\partial {\boldsymbol{m}}^2} \right)^{-1} \, \frac{\partial C({\boldsymbol{m}}^k)}{\partial \boldsymbol{m}}, \tag{2.8}\label{2.8}$$
where \( k \) denotes the iteration count and \( \alpha \) the step length. The step length controls the quantity of descent, as illustrated in Figure 2.2.

Figure 2.2: Sketch illustration of the impact of the step length on the quantity of descent at a given iteration.
Remark: It is generally not possible to build explicitly the inverse of the Hessian in (\ref{2.6}). Therefore, we generally resort to iterative methods to solve the Newton system (\ref{2.6}), such as the steepest-descent (along the parabola) or linear conjugate gradient methods. This iterative algorithm introduces an inner loop within the outer nonlinear FWI loop, which requires performing Hessian-vector products. To check this, note that
$$\nabla_{\boldsymbol{m}} \tilde{C}({\boldsymbol{m}},{\boldsymbol{m}}_0) = \nabla_{\boldsymbol{m}} C({\boldsymbol{m}}_0)
+\left( \frac{\partial^2 C({\boldsymbol{m}}_0)}{\partial \boldsymbol{m}^2} \right) ( \boldsymbol{m} - \boldsymbol{m}_0). \tag{2.9}\label{2.9}$$
This Hessian-vector product can be computed with the second-order adjoint-state method (Metivier et al., 2017) or the BFGS algorithm (Nocedal, 2006).
Alternatively, we use the steepest descent \( ( \partial^2 C({\boldsymbol{m}}^k) / \partial {\boldsymbol{m}}^2 \approx \boldsymbol{I} ) \), nonlinear conjugate gradient or limited-memory quasi-Newton (l-BFGS) methods to approximate the Newton search direction (\ref{2.8}) in the outer FWI loop.
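As an illustration of the recursion \( (\ref{2.7}) \)-\( (\ref{2.8}) \), here is a schematic Newton loop on a toy two-parameter misfit. The toy function, the finite-difference derivatives and the fixed step length \( \alpha = 1 \) are stand-in assumptions; a real FWI code would use the adjoint-state gradient and Hessian-vector products discussed below, and a line search for \( \alpha \).

```python
# Schematic Newton iteration (2.7)-(2.8) on a toy misfit standing in for C(m).
import numpy as np

def C(m):
    # smooth, non-quadratic toy misfit with a minimizer near (1, 2)
    return (m[0] - 1.0)**2 + 10.0 * (m[1] - 2.0)**2 + 0.5 * (m[0] * m[1])**2

def grad(m, eps=1e-6):
    # finite-difference gradient (a stand-in for the adjoint-state gradient)
    g = np.zeros(m.size)
    for j in range(m.size):
        dm = np.zeros(m.size); dm[j] = eps
        g[j] = (C(m + dm) - C(m - dm)) / (2.0 * eps)
    return g

def hess(m, eps=1e-4):
    # finite-difference Hessian (a stand-in for the Newton/Gauss-Newton Hessian)
    H = np.zeros((m.size, m.size))
    for j in range(m.size):
        dm = np.zeros(m.size); dm[j] = eps
        H[:, j] = (grad(m + dm) - grad(m - dm)) / (2.0 * eps)
    return H

m, alpha = np.array([0.0, 0.0]), 1.0          # initial guess m_0 and step length
for k in range(10):
    dm = -np.linalg.solve(hess(m), grad(m))   # Newton system (2.6)/(2.8)
    m = m + alpha * dm                        # model update (2.7)
print(m, C(m))                                # close to the minimizer
```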
Back to \ref{Content}
$$\tag{III}\label{III}$$
Generic expression of the gradient and Hessian
Gradient:
Let's compute the \( k^{th} \) element of the gradient of the misfit function.
$$\frac{\partial C({\boldsymbol{m}}) }{\partial m_k} = \frac{1}{2} \frac{ \partial \left( \left( \phi({\boldsymbol{m}}) - {\boldsymbol{d}}^* \right)^\dagger \left( \phi({\boldsymbol{m}}) - {\boldsymbol{d}}^* \right) \right) }{\partial m_k}\tag{3.1}\label{3.1}$$
In discrete form,
$$\frac{\partial C(\boldsymbol{m})}{\partial m_k}=\frac{1}{2} \sum_{i=1}^{M} \frac{ \partial \left( \left(\phi_{i}-d_{i}^*\right) \left(\overline{\phi_{i}-{d}_{i}^*}\right) \right) }{\partial m_k},\tag{3.2}\label{3.2}$$
where \( \phi_{i} \) denotes the elements of the vector \( \phi(\boldsymbol{m}) \) (the synthetic data computed in the model \( {\boldsymbol{m}}) \). \( M \) denotes the number of observations. \( \bar{a} \) denotes the conjugate of \( a \). \( ^\dagger \) denotes the Hermitian (conjugate) transpose.
$$\frac{\partial C(\boldsymbol{m})}{\partial m_k}=\frac{1}{2} \sum_{i=1}^{M} \left\{ \frac{\partial \phi_{i}}{\partial m_k} \left(\overline{\phi_{i}-d_{i}^*}\right)+ \left(\phi_{i}-d_{i}^*\right) \left( \overline{\frac{\partial \phi_{i}}{\partial m_k} }\right) \right\}.\tag{3.3}\label{3.3}$$
$$\frac{\partial C(\boldsymbol{m})}{\partial m_k}= \sum_{i=1}^{M} \Re \left\{\left( \overline{\frac{\partial \phi_{i}}{\partial m_k}} \right) \left(\phi_{i}-d^*_{i}\right) \right\},\tag{3.4}\label{3.4}$$
where \( \Re\left\{a\right\} \) denotes the real part of \( a \) and we assume that \( m \) is real valued.
Going back to matrix notation,
$$\frac{\partial C(\boldsymbol{m})}{\partial m_k}= \Re \left\{ \left( \frac{\partial \phi({\boldsymbol{m}}) }{\partial m_k} \right)^\dagger \left( \phi({\boldsymbol{m}}) - {\boldsymbol{d}}^* \right) \right\}.\tag{3.5}\label{3.5}$$
$$\frac{\partial C({\boldsymbol{m}})}{\partial \boldsymbol{m}}= \Re \left\{ \left( \frac{\partial \phi({\boldsymbol{m}}) }{\partial {\boldsymbol{m}} } \right)^\dagger \left( \phi({\boldsymbol{m}}) - {\boldsymbol{d}}^* \right) \right\}=-\Re \left\{ {\boldsymbol{J}}^\dagger ({\boldsymbol{m}}) \Delta {\boldsymbol{d}}({\boldsymbol{m}}) \right\},\tag{3.6}\label{3.6}$$
where \( \Delta {\boldsymbol{d}}({\boldsymbol{m}}) = {\boldsymbol{d}}^* - \phi({\boldsymbol{m}}) \).
The \( M \times N \) matrix \( \boldsymbol{J} \) is called the sensitivity matrix, the Jacobian matrix, or the Fréchet derivative matrix. The \(M\)-vector \(\Delta {\boldsymbol{d}}({\boldsymbol{m}}) \) contains the data residuals (the difference between the observations and the simulated data). The gradient is formed by the zero-lag correlation between the partial derivatives of the data with respect to the parameters and the data residuals.
Remark:
- If we consider real-valued time-dependent data (as for seismograms), the gradient becomes
$$\frac{\partial C(\boldsymbol{m})}{\partial \boldsymbol{m}}= - {\boldsymbol{J}}^T ({\boldsymbol{m}}) \Delta {\boldsymbol{d}}({\boldsymbol{m}}), \tag{3.7}\label{3.7}$$
which should be interpreted as a zero-lag correlation between the partial derivatives of \( \phi \) and \( \Delta \boldsymbol{d} \).
Explicitly,
$$\frac{\partial C(\boldsymbol{m})}{\partial m_k}= -\int_t \sum_r \frac{\partial \phi_r(t + \tau)}{\partial m_k} \Delta d_r(t) dt, \tag{3.8}\label{3.8}$$
with \( \tau = 0 \), where \( r \) labels the positions at which the measurements are performed.
Behind correlation, there is indeed the idea of comparing two signals, one of them being delayed by \( \tau \). When there is similarity, the correlation function has a high amplitude. Here, we compare the signals \( \partial \phi / \partial \boldsymbol{m} \) and \( \Delta \boldsymbol{d} \) for only one lag, equal to 0, because the output signal (the gradient) is independent of time. Note that FWI approaches that extend the search space re-introduce a time dependency in the gradient to relax the requirement of an accurate velocity model (for example, see Biondi and Almomin (2014) and section XI).
- In the frequency domain, if multiple frequencies are processed jointly, the vector \( \Delta \boldsymbol{d} \) gathers all the frequencies involved in the inversion and the gradient is computed by summation over frequencies (for the same reason as above, the gradient is independent of frequency for a full-bandwidth signal).
- If \( m \) is complex-valued (for example, when attenuation is considered in frequency-domain FWI), we need to use the Wirtinger derivatives
$$\frac{\partial }{\partial z } = \frac{1}{2} \left( \frac{\partial }{\partial x } -i \frac{\partial }{\partial y } \right) ~~~~ \frac{\partial }{\partial \bar{z} } = \frac{1}{2} \left( \frac{\partial }{\partial x } +i \frac{\partial }{\partial y } \right), \tag{3.9}\label{3.9}$$
where \( z = x + i y \).
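The gradient formula \( (\ref{3.6}) \) can be checked numerically. The sketch below compares \( -\Re\{\boldsymbol{J}^\dagger \Delta\boldsymbol{d}\} \), with \( \Delta\boldsymbol{d} = \boldsymbol{d}^* - \phi(\boldsymbol{m}) \) and \( C = \frac{1}{2}\|\phi(\boldsymbol{m})-\boldsymbol{d}^*\|_2^2 \), against a finite-difference gradient. The toy complex-valued forward map is an arbitrary smooth function, not a wave solver.

```python
# Numerical check of the gradient formula (3.6) on a toy complex forward map.
import numpy as np

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 3)) + 1j * rng.standard_normal((4, 3))

def phi(m):
    # toy nonlinear forward map R^3 -> C^4 (not a wave solver)
    return G @ (m + 0.1 * m**2)

def jacobian(m):
    # analytic Jacobian J_ij = d phi_i / d m_j of the toy map
    return G * (1.0 + 0.2 * m)[None, :]

m = rng.standard_normal(3)
d_obs = phi(rng.standard_normal(3))              # synthetic "observations"
dd = d_obs - phi(m)                              # data residuals dd = d* - phi(m)
g_analytic = -np.real(jacobian(m).conj().T @ dd) # equation (3.6)

eps, g_fd = 1e-6, np.zeros(3)
for j in range(3):
    # central finite difference of C = 1/2 ||phi(m) - d*||^2
    dm = np.zeros(3); dm[j] = eps
    Cp = 0.5 * np.linalg.norm(phi(m + dm) - d_obs)**2
    Cm = 0.5 * np.linalg.norm(phi(m - dm) - d_obs)**2
    g_fd[j] = (Cp - Cm) / (2.0 * eps)

print(np.allclose(g_analytic, g_fd, atol=1e-5))  # True
```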
Hessian:
Let's now compute the element \( (k,l) \) of the second derivative of the misfit function.
$$\frac{\partial^2 C({\boldsymbol{m}})}{\partial m_k \partial m_l}=\sum_{i=1}^{M} \Re \left\{ \left( \overline{\frac{\partial^2 \phi_i }{\partial m_k \partial m_l} } \right) (\phi_i-d_i^*) \right\}
+\sum_{i=1}^{M} \Re \left\{ \left( \overline{ \frac{\partial \phi_i }{\partial m_k} } \right) \frac{\partial \phi_i }{\partial m_l} \right\}.\tag{3.10}\label{3.10}$$
Going back to matrix notation,
$$\frac{\partial^2 C({\boldsymbol{m}})}{\partial {\boldsymbol{m}}^2}=\Re \left\{ {\boldsymbol{J}}^\dagger {\boldsymbol{J}} \right\}-\Re \left\{ \frac{\partial {\boldsymbol{J}}^T}{\partial {\boldsymbol{m}}^T} \left(\overline{\Delta {\boldsymbol{d}}} ~ ... ~ \overline{\Delta {\boldsymbol{d}}} \right) \right\}.\tag{3.11}\label{3.11}$$
The linear term of the Hessian, which is sometimes referred to as the approximate Hessian or Gauss-Newton Hessian, is given by:
$${\boldsymbol{H}}_a = \Re \left\{ {\boldsymbol{J}}^\dagger {\boldsymbol{J}} \right\}.\tag{3.12}\label{3.12}$$
The linear term in the Hessian represents the sum of the zero-lag correlations between the partial derivatives of the data with respect to two different model parameters \( m_l \) and \( m_k \):
$$h_{a_{lk}} = \Re \left\{ \sum_{i=1}^{M} \left( \overline{ \frac{ \partial \phi_i }{\partial m_l} } \right) \, \frac{ \partial \phi_i }{\partial m_k} \right\}. \tag{3.13}\label{3.13}$$
The second term is nonlinear: it would vanish if the forward-problem equation were linear. It corrects the gradient for artifacts resulting from the erroneous interpretation of double-scattering effects as single-scattering ones.
Back to \ref{Content}
$$\tag{IV}\label{IV}$$
The gradient and Hessian of the FWI problem: imaging principle
This section is inspired by Pratt et al. (1998). To develop the formal expression of the FWI gradient, we can compute the partial derivatives of \( \boldsymbol{u} \) with respect to \( \boldsymbol{m} \), sample them at the receiver positions with \( \boldsymbol{P} \) to build the sensitivity matrix, and then perform the matrix-vector product between the sensitivity matrix and the conjugate of the data residuals according to (\ref{3.6}).
We start by differentiating the forward-problem equation with respect to \( \boldsymbol{m} \)
$$\boldsymbol{A}(\boldsymbol{m}) \boldsymbol{u} = \boldsymbol{b}.\tag{4.1}\label{4.1}$$
$$\frac{ \partial \boldsymbol{A}(\boldsymbol{m})}{\partial m_j} \boldsymbol{u} + \boldsymbol{A}(\boldsymbol{m}) \frac{\partial \boldsymbol{u}}{\partial m_j} = \boldsymbol{0}. \tag{4.2}\label{4.2}$$
Here we use the fact that \( \boldsymbol{b} \) does not depend on \( \boldsymbol{m}\).
$$\boldsymbol{A}(\boldsymbol{m}) \frac{\partial \boldsymbol{u}}{\partial m_j} = - \frac{ \partial \boldsymbol{A}(\boldsymbol{m})}{\partial m_j} \boldsymbol{u}. \tag{4.3}\label{4.3}$$
The partial derivative of \( \boldsymbol{u} \) with respect to \( m_j \) is a wavefield that satisfies the wave equation for the virtual scattering source \( \boldsymbol{f}^{(j)} = - \left( \partial \boldsymbol{A}(\boldsymbol{m}) / \partial m_j \right) \boldsymbol{u} \). The spatial support of this source is centred on \( m_j \) and its temporal support is centred on the arrival time of \( \boldsymbol{u}\) at \(m_j\). The first statement follows from the fact that the radiation-pattern matrix \( \partial \boldsymbol{A}(\boldsymbol{m}) / \partial m_j \) is a sparse matrix whose non-zero coefficients sample \( \boldsymbol{u} \) in the close vicinity of \( m_j \) (Figure 4.1). The second follows from the fact that \( \partial \boldsymbol{A}(\boldsymbol{m}) / \partial m_j \) is applied to \( \boldsymbol{u} \).

Figure 4.1: Pattern of the impedance matrix when discretized by finite differences on a Cartesian grid with a compact 9-point stencil. If we take the partial derivative of the coefficients of the matrix with respect to \( m_i \), where \( m \) is the squared slowness, and if the mass matrix is discretized with a lumped mass, then the resulting matrix has a single non-zero coefficient on the diagonal at position \( i \), whose amplitude is \( \omega^2 \).
The partial derivative wavefield has an obvious interpretation: it is the wavefield singly scattered by a point diffractor located at the position of the model parameter with respect to which we compute the partial derivative of the wavefield (Figure 4.2). The radiation pattern of the scattering source depends on the subsurface parametrization with which the wave equation is formulated (for example, wavespeed-density, wavespeed-impedance, density-impedance for the acoustic wave equation) and on the parameter type in this parametrization with respect to which the partial derivative of \( \boldsymbol{u} \) is computed (Figure 4.3).

Figure 4.2: Sketch of the partial derivative wavefield. In a homogeneous background model for a surface acquisition, the partial derivative of the data with respect to one parameter takes the form of a diffraction hyperbola (top panel). If the parameter is the wavespeed in a (velocity-density) parametrization, the scattering source has an isotropic radiation pattern (as a pressure source).

Figure 4.3: Snapshots of partial derivative wavefields computed numerically for a source located at the position of the model parameter. (a) Vp parameter in the (Vp-rho) parametrization. (b) Rho parameter in the (Vp-rho) parametrization. (c) Vp parameter in the (Vp-Ip) parametrization. (d) Ip parameter in the (Vp-Ip) parametrization. Vp: wavespeed. Rho: density. Ip: impedance. The green curves are analytical radiation patterns computed in the framework of the ray+Born approximation. In (a), the virtual source associated with Vp is an isotropic pressure source. In (b), the virtual source associated with rho is a force pointing toward the source (in this particular case, a vertical force).
The FWI gradient can be interpreted in the framework of diffraction tomography, as reviewed below. The gradient of the misfit function is the zero-lag correlation between the Fréchet-derivative seismograms and the data-residual seismograms (see previous section):
$$\frac{\partial C(\boldsymbol{m}) }{\partial m_j} = -\Re \left[ \left( \frac{ \partial \phi }{\partial m_j} \right)^T \overline{ \Delta \boldsymbol{d} } \right], \tag{4.4}\label{4.4}$$
where
$$\frac{\partial \phi }{\partial m_j} = \boldsymbol{P} \frac{\partial \boldsymbol{u} }{\partial m_j} = - \boldsymbol{P} \boldsymbol{A}^{-1} \left( \frac{ \partial \boldsymbol{A} }{ \partial m_j } \right) \boldsymbol{u} = \boldsymbol{P} \boldsymbol{A}^{-1} \boldsymbol{f}^{(j)},\tag{4.5}\label{4.5}$$
and \( \boldsymbol{P} \boldsymbol{u} = \boldsymbol{u}(\boldsymbol{x}_r) \) where \( \boldsymbol{x}_r \in {\mathcal{D}}_r \) denotes the receiver positions.
The data-residual seismograms (the difference between the observations and the data simulated in the initial model) can be interpreted as the recording at the receiver positions of the wavefield scattered by all the missing (unknown) heterogeneities of the initial model. The Fréchet-derivative seismograms can be interpreted as the recording at the receiver positions of the wavefield scattered by a small perturbation located at \( m_j \), all the other parameters being kept fixed. The zero-lag correlation aims to pick from the data residuals (assuming traveltime correspondence between the partial derivative wavefields and the data residuals) the piece of information that must be transformed into a model perturbation at the position of \( m_j \). This interpretation is illustrated schematically in Figure 4.4.

Figure 4.4: Schematic interpretation of the FWI gradient. The true medium is homogeneous with three small circular inclusions (circles in (c)). The background model \( \boldsymbol{m}_0 \) lacks the three inclusions. The data-residual seismograms are therefore the three diffraction hyperbolae shown in (b). The partial derivative of the data with respect to the parameter located at the position of the yellow inclusion is the diffraction hyperbola shown in (a). The zero-lag correlation between the seismograms shown in (a) and (b) aims to extract from (b) the residuals that have been generated by a missing heterogeneity located at the position of the yellow circle: if we take the zero-lag correlation of (a) and (b) (time-wise multiplication) and integrate over time, most of the contribution will come from the diffraction hyperbola in (b) that matches the one shown in (a). Indeed, this imaging principle works only if the background model \( \boldsymbol{m}_0 \) is sufficiently kinematically accurate (see the rest of the tutorial for the meaning of "sufficiently").
Back to \ref{Content}
$$\tag{V}\label{V}$$
Efficient computation of the sensitivity matrix
The sensitivity matrix can be efficiently computed by capitalizing on the spatial reciprocity of the Green functions, as shown below.
Let's plug the expression of the Fréchet derivatives into the gradient:
$$\frac{ \partial C(\boldsymbol{m}) }{ \partial m_j } = -\Re \left\{ \left( \frac{ \partial \phi ({\boldsymbol{m}}) }{\partial m_j } \right)^T \overline{\Delta \boldsymbol{d}} \right\} = \Re \left\{ \left( \boldsymbol{P} \boldsymbol{A}^{-1} \left( \frac{ \partial \boldsymbol{A} }{ \partial m_j} \right) \boldsymbol{u} \right)^T \overline{\Delta \boldsymbol{d}} \right\} = \Re \left\{ \boldsymbol{u}^T \left( \frac{ \partial \boldsymbol{A} }{ \partial m_j} \right)^T \boldsymbol{A}^{-T} \boldsymbol{P}^T \overline{\Delta \boldsymbol{d}} \right\}.\tag{5.1}\label{5.1}$$
By identification of the second and third terms, the coefficient \( ij \) of the Fréchet derivative matrix is given by:
$$J_{ij} = -\boldsymbol{u}_s^T \left( \frac{ \partial \boldsymbol{A} }{ \partial m_j} \right)^T \boldsymbol{A}^{-T} \boldsymbol{\delta}_r = -\left( \frac{ \partial \boldsymbol{A} }{ \partial m_j} \boldsymbol{u}_s \right)^T \boldsymbol{A}^{-T} \boldsymbol{\delta}_r = \left( \boldsymbol{f}_s^{(j)} \right)^T \left( \boldsymbol{A}^{-T} \boldsymbol{\delta}_r \right), \tag{5.2a}\label{5.2a}$$
where the source-receiver pair \( (s,r) \) corresponds to datum \( i \) and \( \boldsymbol{\delta}_r \) denotes an impulse source at receiver \( r \). One coefficient of the sensitivity matrix can be built from the inner product of the virtual source with a Green function computed from the receiver position.
In the particular case where \( \boldsymbol{m} \) is the squared slowness and \( \boldsymbol{A} \) is self-adjoint, as for the Helmholtz operator ( \( \boldsymbol{A}^T = \boldsymbol{A} \) ), the coefficient \( ij \) of the sensitivity matrix can be computed by multiplying the monochromatic Green functions generated from the source \( s \) and the receiver \( r \), taken at the position of the model parameter \( j \), and weighted by \( \omega^2 \) and the source signature \( s(\omega) \) of the source wavefield:
$$J_{i j} = -\omega^2 s(\omega) \left( \boldsymbol{A}^{-1} \boldsymbol{\delta}_s \right)_j \left( \boldsymbol{A}^{-1} \boldsymbol{\delta}_r \right)_j. \tag{5.2b}\label{5.2b}$$
The sensitivity matrix \( \boldsymbol{J} \) can thus be built explicitly by performing one forward problem per non-redundant shot and receiver position. This is far less expensive than computing \( N \) forward problems per shot, where \( N \) denotes the number of model parameters.
We can now compute the sensitivity kernel (one row of the sensitivity matrix) of acoustic FWI in the frequency-space domain for the Vp parameter in the (Vp,rho) parametrization (Figure 5.1). This amounts to computing the Green functions for two point sources located at the positions of the shot and the receiver and multiplying them. The product of the two monochromatic Green functions has a phase equal to the sum of their phases and an amplitude equal to the product of their amplitudes, consistently with the definition of the single-scattered wavefield computed in the framework of the Born approximation.
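The sketch below assembles the kernel of \( (\ref{5.2b}) \) on a small homogeneous grid: one Green function computed from the source, one from the receiver, multiplied point-wise and scaled by \( \omega^2 s(\omega) \). The discretization, the source/receiver indices and the Dirac signature \( s(\omega)=1 \) are illustrative assumptions.

```python
# Sketch of the sensitivity kernel (5.2b): two Green functions multiplied
# point-wise (illustrative discretization; no absorbing boundaries).
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n, h = 101, 50.0
N = n * n
omega = 2.0 * np.pi * 5.0                       # 5 Hz, as in Figure 5.1
m = np.full(N, 1.0 / 3000.0**2)                 # homogeneous squared slowness

L1d = sp.diags([1.0, -2.0, 1.0], [-1, 0, 1], shape=(n, n)) / h**2
I = sp.identity(n)
A = (omega**2 * sp.diags(m) + sp.kron(L1d, I) + sp.kron(I, L1d)).astype(complex).tocsc()
lu = spla.splu(A)                               # one factorization, many solves

def green(idx):
    # Green function A^{-1} delta_idx for an impulse at grid index idx
    rhs = np.zeros(N, dtype=complex); rhs[idx] = 1.0
    return lu.solve(rhs)

src, rec = 10 * n + 20, 10 * n + 80             # assumed shallow source and receiver
s_w = 1.0                                       # Dirac source signature s(omega)
row_J = -omega**2 * s_w * green(src) * green(rec)   # equation (5.2b), all j at once
wavepath = np.real(row_J).reshape(n, n)         # interference picture of Figure 5.1
```

In practice, absorbing boundary conditions (e.g., PML) would be used so that edge reflections do not contaminate the wavepath.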
Figure 5.1 shows an interference picture, called a wavepath (Woodward, 1992), which shows isophase surfaces of elliptical shape called isochrones (since we are in the frequency domain, all the traveltimes are synthesized in the figure). They represent Fresnel zones onto which a residual with temporal support spanning t ± T/2, where t is the central traveltime and T the period, is back-projected. All the diffractions occurring within a Fresnel zone have traveltimes that do not differ by more than half a period. This implies that the background velocity model should be accurate enough to predict the recorded traveltimes with an error lower than half a period (if this condition is not fulfilled, the simulated and recorded components of the residual will be back-projected onto two different isochrones). Anticipating the resolution analysis of FWI, one can note that the width of the isochrones decreases as the scattering angle θ decreases.

Figure 5.1: Sensitivity kernel of frequency-domain acoustic FWI for the Vp parameter in the (Vp-rho) parametrization. The source is denoted by the star, the receiver by the triangle. The frequency is 5 Hz. The interference picture shows isophase surfaces (Fresnel zones) onto which residuals are back-projected during FWI. All the diffractions occurring within one Fresnel zone have traveltimes that do not differ by more than half a period. Some diffraction paths are illustrated. Note how the width of the Fresnel zones decreases as the scattering angle θ decreases. The width of the Fresnel zone gives the resolution of FWI in the spatial direction pointed by the wavenumber vector k.
Back to \ref{Content}
$$\tag{VI}\label{VI}$$
Efficient computation of the gradient
We now show how to compute the gradient of FWI without explicitly building the sensitivity matrix.
$$\frac{ \partial C(\boldsymbol{m}) }{ \partial m_j } = \Re \left\{ \boldsymbol{u}^T \left( \frac{ \partial \boldsymbol{A} }{ \partial m_j} \right)^T \left( \boldsymbol{A}^{-1} \right)^T {\boldsymbol{P}}^T \overline{\Delta \boldsymbol{d}} \right\} = \Re \left\{ \boldsymbol{u}^T \left( \frac{ \partial\boldsymbol{A} }{ \partial m_j} \right)^T \overline{\boldsymbol{v}} \right\} = \Re \left\{ \left( \frac{ \partial\boldsymbol{A} }{ \partial m_j} \boldsymbol{u} \right)^T \overline{\boldsymbol{v}} \right\} = - \Re \left\{ \left( \boldsymbol{f}^{(j)} \right)^T \overline{\boldsymbol{v}} \right\}, \tag{6.1}\label{6.1}$$
where the adjoint wavefield satisfies the wave equation
$$\boldsymbol{A}^T \, \overline{\boldsymbol{v}} = {\boldsymbol{P}}^T \overline{\Delta \boldsymbol{d}},\tag{6.2}\label{6.2}$$
with a source term corresponding to the assemblage of all the residuals (we exploit here the linearity of the wave equation). We see now that we can compute the gradient by taking the zero-lag correlation between the virtual source \( \boldsymbol{f}^{(j)} \) and the back-propagated adjoint wavefield \( \boldsymbol{v} \), instead of taking the zero-lag correlation between the partial derivative wavefield at the receiver positions and the data residuals as indicated by (\ref{4.4}).
To achieve this, we have used the adjoint property of the operator \( \boldsymbol{P} \boldsymbol{A}^{-1} \)
$$ < \boldsymbol{P} \boldsymbol{A}^{-1} \boldsymbol{f} | \Delta \boldsymbol{d} >_{\partial \Omega} = < \boldsymbol{f} | \left( \boldsymbol{A}^{-1} \right)^{\dagger} \boldsymbol{P}^T \Delta \boldsymbol{d} >_\Omega ,$$
where \( \boldsymbol{f} \) is the virtual source, \(\Omega\) is the object domain spanned by \(\boldsymbol{m}\), and \(\partial \Omega\) is the acquisition domain.
The adjoint of the wave-equation operator back-propagates the data residuals in time (Figure 6.1). This is shown by the conjugation of the data residuals:
$$IFT ( \overline{u}(\omega)) = \int_{-\infty}^{+\infty} \overline{u}(\omega) \, \exp\left(i \omega t\right) d\omega = \int_{-\infty}^{+\infty} u(-\omega) \, \exp \left(i \omega t\right) d\omega = \int_{-\infty}^{+\infty} u(\omega) \, \exp \left(i \omega (-t)\right) d\omega = u(-t). \tag{6.3}\label{6.3}$$
It is recalled that, when \( u(t) \) is real valued, \( u(-\omega) = \bar{u}(\omega) \), as \( \bar{u}(\omega) = \overline{ \int_t u(t) e^{-i \omega t} dt } = \int_t u(t) e^{i \omega t} dt = \int_t u(t) e^{-i (-\omega) t} dt = u(-\omega) \).
Computing the gradient requires two forward problems per source: one to compute the incident wavefield \( \boldsymbol{u} \) and one to compute the adjoint wavefield \( {\boldsymbol{v}} \).
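The sketch below implements the two solves for the Helmholtz problem with \( \boldsymbol{m} \) the squared slowness, for which \( \partial \boldsymbol{A}/\partial m_j = \omega^2 \boldsymbol{e}_j\boldsymbol{e}_j^T \), so that \( (\ref{6.1}) \) reduces to \( \partial C/\partial m_j = \Re\{\omega^2 u_j \bar{v}_j\} \). The discretization and the stand-in observed data are assumptions.

```python
# Two-solve gradient of section VI: one solve for u, one adjoint solve for v,
# then a zero-lag correlation (illustrative grid and data).
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n, h = 51, 50.0
N = n * n
omega = 2.0 * np.pi * 4.0
m = np.full(N, 1.0 / 3000.0**2)

L1d = sp.diags([1.0, -2.0, 1.0], [-1, 0, 1], shape=(n, n)) / h**2
I = sp.identity(n)
A = (omega**2 * sp.diags(m) + sp.kron(L1d, I) + sp.kron(I, L1d)).astype(complex).tocsc()

b = np.zeros(N, dtype=complex); b[N // 2] = 1.0
u = spla.spsolve(A, b)                      # incident wavefield (first solve)

rec = np.arange(n + 5, n + 45, 4)           # assumed receiver indices
P = sp.csr_matrix((np.ones(rec.size), (np.arange(rec.size), rec)), shape=(rec.size, N))
d_obs = 1.05 * (P @ u)                      # stand-in observations (assumed)
dd = d_obs - P @ u                          # data residuals, dd = d* - P u

# Adjoint solve (6.2): A^T v_bar = P^T conj(dd); A is symmetric here
v_bar = spla.spsolve(A.T.tocsc(), P.T @ np.conj(dd))

# Zero-lag correlation: dC/dm_j = Re{omega^2 u_j v_bar_j} (squared slowness)
grad = np.real(omega**2 * u * v_bar)
```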
Figure 6.1: Sketch illustration of the imaging principle of FWI based upon adjoint simulation. The gradient is built by the zero-lag correlation between the incident wavefield and the adjoint wavefield back-propagated in time from the receiver positions. The source of the adjoint equation is illustrated in the top left panel. It represents the data residuals reversed in time. The incident wavefield and the back-propagated adjoint wavefield should be in phase at the position of the missing heterogeneity at T=0.775 s (right figure). FWI can be viewed as a time-reversal source localization method, where the targeted sources are the virtual scattering sources in depth.
Back to \ref{Content}
$$\tag{VII}\label{VII}$$
Computing the gradient of the misfit function with the adjoint-state method
We review what has been shown above in a more generic way using the so-called adjoint-state method, as developed by G. Chavent and reviewed by Plessix (2006) and Kern (2016).
We consider the least-squares functional
$$C(\boldsymbol{m}) = \frac{1}{2} \| \phi(\boldsymbol{m}) - \boldsymbol{d}^* \|_2^2.\tag{7.1}\label{7.1}$$
The gradient of the functional as a function of the Fréchet derivatives is given by
$$\nabla_{\boldsymbol{m}} C(\boldsymbol{m}) = \left[ \left( \frac{\partial \phi(\boldsymbol{m}) }{\partial \boldsymbol{m} } \right)^\dagger \right]_{N \times M} \left( \phi(\boldsymbol{m}) - \boldsymbol{d}^* \right)_M.\tag{7.2}\label{7.2}$$
Let's differentiate the state equation \( F(\boldsymbol{u},\boldsymbol{m})=0 \).
$$\frac{\partial F(\boldsymbol{u},\boldsymbol{m})}{\partial \boldsymbol{u} } \delta \boldsymbol{u} = - \frac{\partial F(\boldsymbol{u},\boldsymbol{m})}{\partial \boldsymbol{m} } \delta \boldsymbol{m}.\tag{7.3}\label{7.3}$$
$$\left(\frac{ \partial F(\boldsymbol{u},\boldsymbol{m}) }{\partial \boldsymbol{u} } \right) \frac{ \partial \boldsymbol{u} }{ \partial \boldsymbol{m} } = - \frac{ \partial F(\boldsymbol{u},\boldsymbol{m}) }{\partial \boldsymbol{m} }.\tag{7.4}\label{7.4}$$
From the observation equation \( \phi(\boldsymbol{m}) = \boldsymbol{P} \boldsymbol{u}(\boldsymbol{m}) \rightarrow \partial_{\boldsymbol{m}} \phi(\boldsymbol{m}) =\boldsymbol{P} \partial_{\boldsymbol{m}} \boldsymbol{u}(\boldsymbol{m}) \).
$$\nabla_{\boldsymbol{m}} C(\boldsymbol{m}) = \left[ \left( \boldsymbol{P} \frac{\partial \boldsymbol{u}(\boldsymbol{m}) }{\partial \boldsymbol{m} } \right)^\dagger \right]_{N \times M} \left( \phi(\boldsymbol{m}) - \boldsymbol{d}^* \right)_M.\tag{7.5}\label{7.5}$$
Computing the gradient, written in the form of (\ref{7.5}), requires solving the linearized equation (\ref{7.4}). This linearized equation has \( N \) right-hand sides and hence is expensive to solve.
Let's plug the closed-form solution of equation (\ref{7.4}) into the expression of the gradient, equation (\ref{7.5}). The minus sign of (\ref{7.4}) now appears explicitly:
$$ \nabla_{\boldsymbol{m}} C(\boldsymbol{m}) = - \left[ \left( \boldsymbol{P} \left(\frac{ \partial F(\boldsymbol{u},\boldsymbol{m}) }{\partial \boldsymbol{u} } \right)^{-1} \frac{ \partial F(\boldsymbol{u},\boldsymbol{m}) }{\partial \boldsymbol{m} } \right)^\dagger \right]_{N \times M} \left( \boldsymbol{d} - \boldsymbol{d}^* \right)_M.\tag{7.6}\label{7.6}$$
$$\nabla_{\boldsymbol{m}} C(\boldsymbol{m}) = - \left[ \frac{ \partial F(\boldsymbol{u},\boldsymbol{m}) }{\partial \boldsymbol{m} }^\dagger \left( \frac{ \partial F(\boldsymbol{u},\boldsymbol{m}) }{\partial \boldsymbol{u} }^{-1} \right)^\dagger \boldsymbol{P}^T \right]_{N \times M} \left( \boldsymbol{d} - \boldsymbol{d}^* \right)_M.\tag{7.7}\label{7.7}$$
Let's parenthesize (\ref{7.7}) as follows
$$\nabla_{\boldsymbol{m}} C(\boldsymbol{m}) = - \frac{ \partial F(\boldsymbol{u},\boldsymbol{m}) }{\partial \boldsymbol{m} }^\dagger \left[ \left( \frac{ \partial F(\boldsymbol{u},\boldsymbol{m}) }{\partial \boldsymbol{u} }^\dagger \right)^{-1} \boldsymbol{P}^T \left( \boldsymbol{d} - \boldsymbol{d}^* \right) \right].\tag{7.8}\label{7.8}$$
With inner-product notations, we have used the identity
$$< \boldsymbol{P} \left(\frac{ \partial F(\boldsymbol{u},\boldsymbol{m}) }{\partial \boldsymbol{u} }\right)^{-1} \frac{ \partial F(\boldsymbol{u},\boldsymbol{m}) }{\partial \boldsymbol{m} }, \boldsymbol{d} - \boldsymbol{d}^* >=< \frac{ \partial F(\boldsymbol{u},\boldsymbol{m}) }{\partial \boldsymbol{m} } , \left( \frac{ \partial F(\boldsymbol{u},\boldsymbol{m}) }{\partial \boldsymbol{u} }^\dagger \right)^{-1} \boldsymbol{P}^T \left( \boldsymbol{d} - \boldsymbol{d}^* \right) >.\tag{7.9}\label{7.9}$$
Let's introduce the adjoint state variable \( \boldsymbol{v} \).
$$\frac{ \partial F(\boldsymbol{u},\boldsymbol{m}) }{\partial \boldsymbol{u} }^\dagger \boldsymbol{v}= \boldsymbol{P}^T \left( \boldsymbol{d} - \boldsymbol{d}^* \right).\tag{7.10}\label{7.10}$$
With this new variable, the gradient reads
$$\nabla_{\boldsymbol{m}} C(\boldsymbol{m}) = - \frac{ \partial F(\boldsymbol{u},\boldsymbol{m}) }{\partial \boldsymbol{m} }^\dagger \boldsymbol{v}. \tag{7.11}\label{7.11}$$
Computing the gradient now requires solving one state equation (to compute \( \boldsymbol{u} \), from which \( \frac{ \partial F(\boldsymbol{u},\boldsymbol{m}) }{\partial \boldsymbol{m} } \) follows) and one adjoint-state equation (\ref{7.10}).
Back to \ref{Content}
$$\tag{VIII}\label{VIII}$$
The adjoint-state method with Lagrangian functional
Let's come back to the original full-space formulation of FWI as a constrained optimization where the state equation is the equality constraint (\ref{1.4b}).
$$ \min_{\boldsymbol{m},\boldsymbol{u}} \frac{1}{2} \| \boldsymbol{P} \boldsymbol{u} - \boldsymbol{d}^* \|_2^2 ~~ \text{subject to } ~~ F(\boldsymbol{u},\boldsymbol{m})=0. \tag{8.1}\label{8.1}$$
We solve this constrained problem with the
method of Lagrange multiplier. Let's introduce the following Lagrangian function
$$\mathcal{L}(\boldsymbol{m},\boldsymbol{u},\boldsymbol{v}) = \frac{1}{2} \| \boldsymbol{P} \boldsymbol{u} - \boldsymbol{d}^* \|_2^2 + < \boldsymbol{v} , F(\boldsymbol{u},\boldsymbol{m}) >,\tag{8.2}\label{8.2}$$
where \( \boldsymbol{v} \) is the Lagrange multiplier (or adjoint-state variable), \( < > \) denotes inner product, and \( \boldsymbol{u} \), \( \boldsymbol{v} \) and \( \boldsymbol{m} \) are independent variables.
We solve the following min max (saddle point) problem
$$ \min_{\boldsymbol{m},\boldsymbol{u}} \max_{\boldsymbol{v}} \frac{1}{2} \| \boldsymbol{P} \boldsymbol{u} - \boldsymbol{d}^* \|_2^2 + < \boldsymbol{v} , F(\boldsymbol{u},\boldsymbol{m}) >. \tag{8.3}\label{8.3}$$
The first-order optimality conditions state that the minimizer is reached at a saddle point of the Lagrangian function, namely,
$$ \partial \mathcal{L}(\boldsymbol{m},\boldsymbol{u},\boldsymbol{v}) / \partial \boldsymbol{v} = 0, \tag{8.4}\label{8.4}$$
$$ \partial \mathcal{L}(\boldsymbol{m},\boldsymbol{u},\boldsymbol{v}) / \partial \boldsymbol{u} = 0, \tag{8.5}\label{8.5}$$
$$ \partial \mathcal{L} (\boldsymbol{m},\boldsymbol{u},\boldsymbol{v})/ \partial \boldsymbol{m} = 0. \tag{8.6}\label{8.6}$$
Instead of jointly updating \( \boldsymbol{v} \), \( \boldsymbol{u} \), and \( \boldsymbol{m} \) , we restrict the search space to the case when \( \boldsymbol{u} \) is a realization of the state equation for any \( \boldsymbol{m} \). Let's denote by \( \tilde{\boldsymbol{u}} \) this realization.
In the full space spanned by the independent variables \( \boldsymbol{m} \), \( \boldsymbol{u} \), and \(\boldsymbol{v}\), the partial derivative of the Lagrangian with respect to \( \boldsymbol{v} \) is the state equation
$$\frac{\partial \mathcal{L}(\boldsymbol{m},\boldsymbol{u},\boldsymbol{v})}{\partial \boldsymbol{v}}=F(\boldsymbol{u},\boldsymbol{m}). \tag{8.7}\label{8.7}$$
When \( \boldsymbol{u} \) is a realization of the state equation, the first optimality condition \( \partial \mathcal{L}(\boldsymbol{m},\boldsymbol{u},\boldsymbol{v}) / \partial \boldsymbol{v} = 0 \) is satisfied since \( F(\tilde{\boldsymbol{u}},\boldsymbol{m}) = 0 \).
Let's now consider the second optimality condition. In the full space spanned by the independent variables \( \boldsymbol{m} \), \( \boldsymbol{u} \), and \(\boldsymbol{v}\), zeroing the derivative of the Lagrangian with respect to the state variable gives the adjoint-state equation
$$\left( \frac{\partial F(\boldsymbol{u},\boldsymbol{m}) }{\partial \boldsymbol{u}} \right)^\dagger \boldsymbol{v} = -\boldsymbol{P}^T (\boldsymbol{P} \boldsymbol{u} - \boldsymbol{d}^*). \tag{8.8}\label{8.8}$$
Limiting ourselves to the reduced space spanned by \( \tilde{\boldsymbol{u}} \), the associated adjoint-state variable, denoted by \( \tilde{\boldsymbol{v}} \), satisfies
$$\left( \frac{\partial F(\tilde{\boldsymbol{u}},\boldsymbol{m}) }{\partial \boldsymbol{u}} \right)^\dagger \tilde{\boldsymbol{v}} = -\boldsymbol{P}^T (\boldsymbol{P} \tilde{\boldsymbol{u}} - \boldsymbol{d}^*). \tag{8.9}\label{8.9}$$
Moreover, when \( \boldsymbol{u} \) is a realization of the state equation, we have
$$C(\boldsymbol{m}) = \left[ \mathcal{L}(\boldsymbol{m},\boldsymbol{u},\boldsymbol{v})\right]_{\tilde{\boldsymbol{u}}}, \tag{8.10}\label{8.10}$$
where \( C(\boldsymbol{m}) \) is the misfit function of the unconstrained problem (\ref{2.1}).
In particular, we also have
$$C(\boldsymbol{m}) = \left[ \mathcal{L}(\boldsymbol{m},\boldsymbol{u},\boldsymbol{v})\right]_{\tilde{\boldsymbol{u}}, \tilde{\boldsymbol{v}}}.\tag{8.11}\label{8.11}$$
Therefore, since \( \partial \mathcal{L} / \partial \boldsymbol{u} \) vanishes at \( \tilde{\boldsymbol{v}} \) (adjoint-state equation) and \( \partial \mathcal{L} / \partial \boldsymbol{v} \) vanishes at \( \tilde{\boldsymbol{u}} \) (state equation), the total derivative of \( C \) with respect to \( \boldsymbol{m} \) reduces to the partial derivative of \( \mathcal{L} \) with respect to \( \boldsymbol{m} \) evaluated at \( (\tilde{\boldsymbol{u}}, \tilde{\boldsymbol{v}}) \)
$$\nabla_\boldsymbol{m} C(\boldsymbol{m}) = \left[ \partial_\boldsymbol{m} \mathcal{L}(\boldsymbol{m},\boldsymbol{u},\boldsymbol{v})\right]_{\tilde{\boldsymbol{u}}, \tilde{\boldsymbol{v}}}.\tag{8.12}\label{8.12}$$
We find
$$\nabla_{\boldsymbol{m}} C(\boldsymbol{m}) = \left[ \partial_{\boldsymbol{m}} \mathcal{L}(\boldsymbol{m},\boldsymbol{u},\boldsymbol{v}) \right] _{\tilde{\boldsymbol{u}}, \tilde{\boldsymbol{v}}}= \left[ \left( \frac{\partial F(\boldsymbol{u},\boldsymbol{m})}{\partial \boldsymbol{m}} \right)^\dagger \boldsymbol{v} \right]_{\tilde{\boldsymbol{u}}, \tilde{\boldsymbol{v}}},\tag{8.13}\label{8.13}$$
which is identical to (\ref{7.11}): the sign change in the adjoint-state equation (\ref{8.8}) relative to (\ref{7.10}) flips the sign of \( \boldsymbol{v} \) and absorbs the minus sign of (\ref{7.11}). The above expression of the gradient can be used to iteratively update \( \boldsymbol{m} \) with any local optimization algorithm (steepest descent, conjugate gradient, quasi-Newton, Newton).
We can now apply this recipe to frequency-domain and time-domain FWI.
Computing the gradient of frequency-domain FWI with Lagrangian function
The Lagrangian function is given by
$${\mathcal{L}} \left(\boldsymbol{m},\boldsymbol{u},\boldsymbol{v}\right) = \frac{1}{2} \| \boldsymbol{P} \boldsymbol{u} - \boldsymbol{d}^* \|_2^2 + \Re \left\{ < \boldsymbol{v} , \boldsymbol{A}(\boldsymbol{m}) \, \boldsymbol{u} - \boldsymbol{b} > \right\},\tag{8.14}\label{8.14}$$
where it is reminded that \( \boldsymbol{u} \) and \( \boldsymbol{v} \) are complex-valued. Taking the derivative of a function with respect to complex-valued parameters makes use of the Wirtinger derivatives; see also the paper by Brandwood on computing the gradient of a function of complex-valued parameters.
The adjoint-state equation is
$$\partial_{\boldsymbol{u}} {\mathcal{L}} \left(\boldsymbol{m},\boldsymbol{u},\boldsymbol{v}\right) = 0\tag{8.15}\label{8.15}$$
Before taking the derivative with respect to \( \boldsymbol{u} \), we introduce the adjoint of \( \boldsymbol{A} \).
$${\mathcal{L}} \left(\boldsymbol{m},\boldsymbol{u},\boldsymbol{v}\right) = \frac{1}{2} \| \boldsymbol{P} \boldsymbol{u} - \boldsymbol{d}^* \|_2^2 + \Re \left\{ < \boldsymbol{A}(\boldsymbol{m})^\dagger \boldsymbol{v} , \boldsymbol{u} > - < \boldsymbol{v} , \boldsymbol{b} > \right\}.\tag{8.16}\label{8.16}$$
The adjoint-state equation is given by
$$\partial_{\boldsymbol{u}} {\mathcal{L}} \left(\boldsymbol{m},\boldsymbol{u},\boldsymbol{v}\right) = 0 \rightarrow \boldsymbol{A}(\boldsymbol{m})^\dagger \boldsymbol{v} = -\boldsymbol{P}^T \left( \boldsymbol{P} \boldsymbol{u} - \boldsymbol{d}^* \right) \rightarrow \boldsymbol{A}(\boldsymbol{m})^T \overline{\boldsymbol{v}} = \boldsymbol{P}^T \left( \overline{ \boldsymbol{d}^* - \boldsymbol{P} \boldsymbol{u} } \right),\tag{8.17}\label{8.17}$$
which is identical to (\ref{8.8}) and, after conjugation, to the adjoint equation (\ref{6.2}) with \( \Delta \boldsymbol{d} = \boldsymbol{d}^* - \boldsymbol{P} \boldsymbol{u} \).
The gradient is given by
$$\nabla_{\boldsymbol{m}} C(\boldsymbol{m})= \Re \left\{ < \boldsymbol{v} , \frac{\partial \boldsymbol{A}(\boldsymbol{m}) }{\partial \boldsymbol{m}} \, \boldsymbol{u} > \right\},\tag{8.18}\label{8.18}$$
which is identical to (\ref{8.13}).
Computing the gradient of time-domain FWI with Lagrangian function
We review how to compute the gradient of time-domain FWI with a Lagrangian function when the wave equation is formulated as a first-order hyperbolic velocity-stress system
$$\left\{ \begin{array}{l} \partial_t v(x,t) = b(x) \partial_x \sigma(x,t) + s(x,t) \\ \partial_t \sigma(x,t) = \kappa(x) \partial_x v(x,t), \end{array} \right.\tag{8.19}\label{8.19}$$
where \( v \) is the particle velocity, \( \sigma \) the stress, \( b = 1/\rho \) the buoyancy ( \( \rho \) is the density), \( \kappa \) the bulk modulus and \( s \) the source.
The wave equation can be written in matrix compact form as
$$\partial_t \boldsymbol{u}(x,t) = \boldsymbol{M}(m(x))\boldsymbol{D} \boldsymbol{u}(x,t) + \boldsymbol{b}(x,t), \tag{8.20}\label{8.20}$$
where
$$\boldsymbol{u}(x,t)=\left(\begin{array}{c} v(x,t) \\ \sigma(x,t)\end{array}\right), ~~ \boldsymbol{b}(x,t)=\left(\begin{array}{c} s(x,t) \\ 0 \end{array}\right), ~~ \boldsymbol{M}(x) = \left(\begin{array}{cc}b(x) & 0 \\ 0 & \kappa(x)\end{array}\right), ~~ \boldsymbol{D}= \left(\begin{array}{cc} 0 & \partial_x \\ \partial_x & 0 \end{array} \right).\tag{8.21}\label{8.21}$$
with initial conditions: \( \boldsymbol{u}(x,0)=\boldsymbol{0}; ~~ \partial_t \boldsymbol{u}(x,0)=\boldsymbol{0} \).
We rewrite the wave equation such that it is self-adjoint by moving the subsurface properties to the left-hand side
$$\boldsymbol{C}(m(x)) \partial_t \boldsymbol{u}(x,t) = \boldsymbol{D} \boldsymbol{u}(x,t) + \boldsymbol{C}(m(x)) \boldsymbol{b}(x,t) ~~~ \text{with} ~~~ \boldsymbol{C}(x) = \boldsymbol{M}^{-1}(x) = \left(\begin{array}{cc}\rho(x) & 0 \\ 0 & 1/\kappa(x)\end{array}\right).\tag{8.22}\label{8.22}$$
If we write the observation equation as
$$d(x_r,t) = d_r(t) = \boldsymbol{P}_r \boldsymbol{u}(x,t)\tag{8.23}\label{8.23}$$
then, the Lagrangian function is given by
$$L(\boldsymbol{u}(x,t),\boldsymbol{v}(x,t),\boldsymbol{m}(x)) = \frac{1}{2} \int_0^T \sum_r \left( \boldsymbol{P}_r \boldsymbol{u}(x,t) - \boldsymbol{d}_r(t)^* \right)^2 dt + \int_{t=0}^T < \boldsymbol{v}(x,t), \underbrace{\boldsymbol{C}(\boldsymbol{m}(x)) \partial_t \boldsymbol{u}(x,t) - \boldsymbol{D} \boldsymbol{u}(x,t) - \boldsymbol{C}(m(x)) \boldsymbol{b}(x,t)}_{State ~ equation} > dt,\tag{8.24}\label{8.24}$$
where \( \boldsymbol{v} \) is the Lagrange multiplier.
We apply integration by parts to the time and space derivatives.
- Integration by parts for the time derivative (using the initial condition \( \boldsymbol{u}(x,0)=\boldsymbol{0} \))
$$\int_0^T < \boldsymbol{v}(x,t), \partial_t \boldsymbol{u}(x,t) > dt = \left[ < \boldsymbol{v}(x,t) , \boldsymbol{u}(x,t) > \right]_0^T - \int_0^T < \partial_t \boldsymbol{v}(x,t), \boldsymbol{u}(x,t) > dt = < \boldsymbol{v}(x,T) , \boldsymbol{u}(x,T) > - \int_0^T < \partial_t \boldsymbol{v}(x,t), \boldsymbol{u}(x,t) > dt.\tag{8.25}\label{8.25}$$
- Integration by parts for the space derivative (assuming the fields vanish on the boundary of the domain)
$$\int \partial_x \boldsymbol{u}(x,t) \boldsymbol{v}(x,t) dx = - \int \boldsymbol{u}(x,t) \partial_x \boldsymbol{v}(x,t) dx. \tag{8.26}\label{8.26}$$
$$< \boldsymbol{D} \boldsymbol{u}(x,t), \boldsymbol{v}(x,t) > = < \boldsymbol{u}(x,t), \boldsymbol{D}^\dagger \boldsymbol{v}(x,t) > ~~ \text{with} ~~ \boldsymbol{D}^\dagger = - \boldsymbol{D}. \tag{8.27}\label{8.27}$$
Using these results, the Lagrangian after integration by parts in time and space reads
$$\begin{array}{l}L(\boldsymbol{u}(x,t),\boldsymbol{v}(x,t),\boldsymbol{m}(x)) = \frac{1}{2} \int_0^T \sum_r \left( \boldsymbol{P}_r \boldsymbol{u}(x,t) - \boldsymbol{d}_r(t)^* \right)^2 dt + < \boldsymbol{C}(\boldsymbol{m}(x)) \boldsymbol{v}(x,T), \boldsymbol{u}(x,T) > \\ + \int_{t=0}^T \left[ - < \boldsymbol{C}(\boldsymbol{m}(x)) \partial_t \boldsymbol{v}(x,t), \boldsymbol{u}(x,t) > + < \boldsymbol{D} \boldsymbol{v}(x,t) , \boldsymbol{u}(x,t) > - < \boldsymbol{v}(x,t) , \boldsymbol{C}(\boldsymbol{m}(x)) \boldsymbol{b}(x,t) > \right] dt. \end{array} \tag{8.28}\label{8.28}$$
The adjoint-state equation is given by
$$\boldsymbol{C}(\boldsymbol{m}(x)) \partial_t \boldsymbol{v}(x,t) = \boldsymbol{D} \boldsymbol{v}(x,t) + \sum_r \left( \boldsymbol{P}_r \right)^T \left( \boldsymbol{P}_r \boldsymbol{u}(x,t) - \boldsymbol{d}_r(t)^* \right),\tag{8.29}\label{8.29}$$
or, equivalently, in a form suitable for explicit time stepping (multiplying through by \( \boldsymbol{M} = \boldsymbol{C}^{-1} \)),
$$\partial_t \boldsymbol{v}(x,t) = \boldsymbol{M}(\boldsymbol{m}(x)) \boldsymbol{D} \boldsymbol{v}(x,t) + \boldsymbol{M}(\boldsymbol{m}(x)) \sum_r \left( \boldsymbol{P}_r \right)^T \left( \boldsymbol{P}_r \boldsymbol{u}(x,t) - \boldsymbol{d}_r(t)^* \right),\tag{8.30}\label{8.30}$$
with a final condition \( \boldsymbol{v}(x,T) = 0 \) (this final condition results from the second term of the Lagrangian).
We perform a change of variable to transform the final condition into an initial condition
$$\boldsymbol{w}(x,t)=\boldsymbol{v}(x,T-t)~~~ \text{such that} ~~~ \boldsymbol{w}(x,0) = \boldsymbol{v}(x,T).\tag{8.31}\label{8.31}$$
Substituting \( \boldsymbol{v}(x,t) = \boldsymbol{w}(x,T-t) \) in the adjoint-state equation gives (note the minus sign brought by the chain rule)
$$-\partial_t \boldsymbol{w}(x,T-t) = \boldsymbol{M}(\boldsymbol{m}(x)) \boldsymbol{D} \boldsymbol{w}(x,T-t) + \boldsymbol{M}(\boldsymbol{m}(x)) \sum_r \left( \boldsymbol{P}_r \right)^T \left( \boldsymbol{P}_r \boldsymbol{u}(x,t) - \boldsymbol{d}_r(t)^* \right).\tag{8.32}\label{8.32}$$
Then, we apply the change of variable \( t' = T - t \).
$$-\partial_{t'} \boldsymbol{w}(x,t') = \boldsymbol{M}(\boldsymbol{m}(x)) \boldsymbol{D} \boldsymbol{w}(x,t') + \boldsymbol{M}(\boldsymbol{m}(x)) \sum_r \left( \boldsymbol{P}_r \right)^T \left( \boldsymbol{P}_r \boldsymbol{u}(x,T-t') - \boldsymbol{d}_r(T-t')^* \right).\tag{8.33}\label{8.33}$$
$$\partial_{t'} \boldsymbol{w}(x,t') = - \boldsymbol{M}(\boldsymbol{m}(x)) \boldsymbol{D} \boldsymbol{w}(x,t') - \boldsymbol{M}(\boldsymbol{m}(x)) \sum_r \left( \boldsymbol{P}_r \right)^T \left( \boldsymbol{P}_r \boldsymbol{u}(x,T-t') - \boldsymbol{d}_r(T-t')^* \right),\tag{8.34}\label{8.34}$$
with \( \boldsymbol{w}(x,0)=0 \).
In summary, the practical procedure to implement the adjoint-state equation consists of
- Reverse in time the data residuals.
- Reverse the sign of \( dt\) in the wavefield simulation code.
Let's now compute the gradient. From (\ref{8.24}), we have
$$\nabla_{\boldsymbol{m}} C(\boldsymbol{m}) \approx - \int_{t=0}^T < \frac{ \partial \boldsymbol{C}(\boldsymbol{m}(x)) }{ \partial \boldsymbol{m} } \partial_t \boldsymbol{v}(x,t), \boldsymbol{u}(x,t) > dt. \tag{8.35}\label{8.35}$$
More explicitly,
$$\begin{array}{l}\nabla_{\rho} C(\boldsymbol{m}) \approx - \int_{t=0}^T < \partial_t \boldsymbol{v}(x,t), \boldsymbol{u}(x,t) > dt, \\ \nabla_{\kappa} C(\boldsymbol{m}) \approx \int_{t=0}^T < \frac{1}{\kappa^2(x)} \partial_t \boldsymbol{v}(x,t), \boldsymbol{u}(x,t) > dt. \end{array}\tag{8.36}\label{8.36}$$
To compute the gradient with respect to other classes of parameters, one can use the chain rule of derivatives.
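To make this procedure concrete, here is a schematic 1D implementation of the forward and adjoint simulations and of the imaging condition \( (\ref{8.36}) \) for the density gradient. The collocated-grid stepper, the np.gradient derivatives, the stand-in observations and all sizes are illustrative assumptions, not a production solver (a real code would use a staggered grid and absorbing boundaries).

```python
# Schematic 1D time-domain gradient: forward run, time-reversed adjoint run,
# zero-lag correlation (8.36) for the density part (illustrative, non-staggered).
import numpy as np

nx, nt, dx, dt = 200, 600, 10.0, 1e-3
rho = np.full(nx, 1000.0)                  # density (kg/m^3)
kappa = np.full(nx, 9.0e9)                 # bulk modulus, so that c = 3000 m/s
buoy = 1.0 / rho                           # buoyancy b(x)

def step(v, sig, rhs_v):
    # one explicit step of the velocity-stress system (8.20)
    v = v + dt * (buoy * np.gradient(sig, dx) + rhs_v)
    sig = sig + dt * kappa * np.gradient(v, dx)
    return v, sig

isrc, irec = 50, 150                       # assumed source/receiver indices
t = np.arange(nt) * dt
arg = (np.pi * 25.0 * (t - 0.05))**2       # 25 Hz Ricker wavelet
wavelet = (1.0 - 2.0 * arg) * np.exp(-arg)

# Forward simulation: store the incident particle-velocity wavefield
v = np.zeros(nx); sig = np.zeros(nx)
u_store = np.zeros((nt, nx)); d_cal = np.zeros(nt)
for it in range(nt):
    src = np.zeros(nx); src[isrc] = wavelet[it]
    v, sig = step(v, sig, src)
    u_store[it] = v
    d_cal[it] = v[irec]

d_obs = np.zeros(nt)                       # stand-in observations (assumed)
res_rev = (d_cal - d_obs)[::-1]            # step 1: time-reversed residuals P u - d*

# Adjoint simulation: the same stepper is reused (the sign change of dt is
# absorbed by the time reversal); residuals are injected at the receiver (8.30)
va = np.zeros(nx); siga = np.zeros(nx)
w_store = np.zeros((nt, nx))
for it in range(nt):
    adj = np.zeros(nx); adj[irec] = buoy[irec] * res_rev[it]
    va, siga = step(va, siga, adj)
    w_store[it] = va

# Undo the change of variable w(x,t') = v(x,T-t'), then apply (8.36); only the
# particle-velocity components enter the rho gradient since dC/drho = diag(1,0)
v_adj = w_store[::-1]
dvdt = np.gradient(v_adj, dt, axis=0)
grad_rho = -dt * np.sum(dvdt * u_store, axis=0)
```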
Back to \ref{Content}
$$\tag{IX}\label{IX}$$
Role of the Hessian
Let's recall the Newton system that provides the descent direction.
$$\frac{\partial^2 C({\boldsymbol{m}}_0)}{\partial {\boldsymbol{m}}^2} \, \Delta {\boldsymbol{m}}=- \frac{\partial C({\boldsymbol{m}}_0)}{\partial \boldsymbol{m}}. \tag{9.1}\label{9.1}$$
We have just seen the role of the gradient in FWI. We now analyze the role of the Hessian.
Let's now compute the element \( (k,l) \) of the second derivative of the misfit function.
$$\frac{\partial^2 C({\boldsymbol{m}})}{\partial m_k \partial m_l}=\sum_{i=1}^{M} \Re \left\{ \left( \overline{ \frac{\partial^2 d_i }{\partial m_k \partial m_l} } \right) (d_i-d_i^*) \right\}
+\sum_{i=1}^{M} \Re \left\{ \left( \overline{ \frac{\partial d_i}{\partial m_k} } \right) \frac{\partial d_i}{\partial m_l} \right\}. \tag{9.2}\label{9.2}$$
Going back to matrix notation,
$$\frac{\partial^2 C({\boldsymbol{m}})}{\partial {\boldsymbol{m}}^2}=\Re \left\{ \boldsymbol{J}^\dagger \boldsymbol{J} \right\}-\Re \left\{ \frac{\partial \boldsymbol{J}^T}{\partial {\boldsymbol{m}}^T} (\overline{ \Delta \boldsymbol{d} } ~ ... ~ \overline{\Delta \boldsymbol{d}} ) \right\}. \tag{9.3}\label{9.3}$$
In particular, at the point \( \boldsymbol{m}_0 \) of the model space, we have
$$\frac{\partial^2 C({\boldsymbol{m}}_0)}{\partial \boldsymbol{m}^2}=\Re \left\{ \boldsymbol{J}_0^\dagger \boldsymbol{J}_0 \right\}-\Re \left\{ \frac{\partial \boldsymbol{J}_0^T}{\partial \boldsymbol{m}^T} ( \overline{\Delta \boldsymbol{d}} ~ ... ~ \overline{ \Delta \boldsymbol{d} }) \right\}. \tag{9.4}\label{9.4}$$
The linear term of the Hessian, which is sometimes referred to as approximate or Gauss-Newton Hessian, is given by:
$${\boldsymbol{H}}_a = \Re \left\{ \boldsymbol{J}^\dagger \boldsymbol{J} \right\}. \tag{9.5}\label{9.5}$$
The linear term in the Hessian represents the sum of the zero-lag correlations between the partial derivative wavefields with respect to two different model parameters \( m_l \) and \( m_k \), taken at the receiver positions.
$$h_{a_{lk}} = \Re \left\{ \sum_{i=1}^{M} \overline{\frac{ \partial d_i }{\partial m_l}} \, \frac{ \partial d_i }{\partial m_k} \right\}, \tag{9.6}\label{9.6}$$
where \( M \) is the total number of data.
As seismic sources are band-limited, the partial derivative wavefields associated with close parameters are correlated. This limits the spatial resolution of the imaging.
Applying the inverse of the approximate Hessian on the gradient aims to correct for these limited bandwidth effects as in deconvolution.
For the sake of illustration, we consider a point diffractor in a homogeneous background model (Figure 9.1). The grid dimensions are \( 31 \times 31\). The grid interval is 50 m. The acquisition is circular and centered on the point diffractor. There are 41 sources and 41 receivers evenly distributed on the acquisition circle. We invert 15 frequencies within the frequency band \( \left[4.5;20\right] \) Hz. We perform 15 iterations per frequency. The source wavelet is a Dirac. We compare the convergence of the Gauss-Newton and steepest-descent methods in Figure 9.2.
Figure 9.1: (Left) Inclusion velocity model. (Right) Gauss-Newton Hessian for the 4.5 Hz frequency and a full circular acquisition surrounding the inclusion.

Figure 9.2: Gauss-Newton versus steepest-descent reconstruction of the inclusion in the 4.5-20 Hz frequency band from the circular acquisition (Figure 9.1).
The diagonal terms of the approximate Hessian are the sums of the auto-correlations of the partial derivative wavefields (squared amplitudes) at the receiver positions:
$$h_{a_{ll}} = \Re \left\{ \sum_{i=1}^{M} \overline{\frac{ \partial d_i }{\partial m_l}} \, \frac{ \partial d_i }{\partial m_l} \right\}= \sum_{i=1}^{M} \left| \frac{ \partial d_i }{\partial m_l} \right|^2.$$
The diagonal terms of the Hessian aim to remove from the gradient the geometrical-amplitude effects of the partial derivative wavefields. They can be used as a very efficient preconditioner in FWI from surface acquisitions (Figure 9.3). They also mitigate the acquisition footprint (Figure 9.4).
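With the Green-function expression \( (\ref{5.2b}) \), the diagonal of the Gauss-Newton Hessian, \( h_{a_{ll}} \), factors for a single frequency into a source-side and a receiver-side illumination map, so it can be assembled at the cost of one solve per non-redundant source and receiver position. The sketch below reuses the Helmholtz toy problem of the previous sections; the acquisition geometry and the damping level are assumptions.

```python
# Diagonal Gauss-Newton Hessian from (5.2b), with s(omega) = 1:
# h_jj = omega^4 (sum_s |G_s,j|^2) (sum_r |G_r,j|^2), then damped preconditioning.
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n, h = 51, 50.0
N = n * n
omega = 2.0 * np.pi * 4.0
m = np.full(N, 1.0 / 3000.0**2)

L1d = sp.diags([1.0, -2.0, 1.0], [-1, 0, 1], shape=(n, n)) / h**2
I = sp.identity(n)
A = (omega**2 * sp.diags(m) + sp.kron(L1d, I) + sp.kron(I, L1d)).astype(complex).tocsc()
lu = spla.splu(A)

def green(idx):
    rhs = np.zeros(N, dtype=complex); rhs[idx] = 1.0
    return lu.solve(rhs)

src_idx = np.arange(n + 5, n + 45, 8)          # assumed source indices
rec_idx = np.arange(n + 7, n + 47, 8)          # assumed receiver indices
Gs = np.stack([green(s) for s in src_idx])     # Green functions from the sources
Gr = np.stack([green(r) for r in rec_idx])     # Green functions from the receivers

diag_H = omega**4 * (np.abs(Gs)**2).sum(0) * (np.abs(Gr)**2).sum(0)

grad = np.ones(N)                              # placeholder gradient (assumed)
precond_grad = grad / (diag_H + 1e-2 * diag_H.max())   # damped diagonal scaling
```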

Figure 9.3: (Top) FWI velocity model of the Baragiano thrust belt. (Bottom) Diagonal Hessian as a FWI preconditioner. (See Ravaut et al., 2004).

Figure 9.4: Diagonal Hessian for the inclusion test. The image shows the acquisition footprint.
We now have a look at the nonlinear term of the Hessian. Let's recall the expression of the Hessian
$${\boldsymbol{H}} = \Re \left\{ \boldsymbol{J}^\dagger \boldsymbol{J} \right\}-\Re \left\{ \frac{\partial \boldsymbol{J}^T}{\partial {\boldsymbol{m}}^T} ( \overline{\Delta \boldsymbol{d}} ~ ... ~ \overline{\Delta \boldsymbol{d}} ) \right\} \tag{9.7}\label{9.7}$$
and of the partial derivative wavefield:
$$\boldsymbol{A} \left( \frac{ \partial \boldsymbol{u} }{\partial m_l} \right) = {\boldsymbol{f}}^{(l)} = - \left( \frac{ \partial \boldsymbol{A} }{ \partial m_l} \right) \boldsymbol{u}. \tag{9.8}\label{9.8}$$
Differentiation of the partial derivative wavefield with respect to \( m_k \) gives:
$$\boldsymbol{A} \frac{ \partial^2 \boldsymbol{u} }{\partial m_l \partial m_k} = - {\boldsymbol{f}}^{(lk)}. \tag{9.9}\label{9.9} $$
with
$${\boldsymbol{f}}^{(lk)} = \left( \frac{\partial \boldsymbol{A} }{\partial m_l} \right) \left( \frac{\partial \boldsymbol{u} }{\partial m_k} \right) + \left( \frac{\partial \boldsymbol{A} }{\partial m_k} \right) \left( \frac{\partial \boldsymbol{u} }{\partial m_l} \right) + \left( \frac{\partial^2 \boldsymbol{A} }{\partial m_l \partial m_k} \right) \boldsymbol{u}. \tag{9.10}\label{9.10}$$
The second partial derivatives of the wavefield describe the wavefield doubly scattered by the model perturbations \( m_l \) and \( m_k \)
(Figure 9.5).
These second derivatives are correlated with the data residuals in the Hessian. This term refocuses the footprint of doubly-scattered events, present in the single-scattering gradient, at the correct positions of the model parameters \( m_l \) and \( m_k \). Accounting for both terms of the Hessian leads to the Newton model update:
$$\Delta {\boldsymbol{m}} = \left[ \Re \left\{ \boldsymbol{J}_0^\dagger \boldsymbol{J}_0 \right\}+\Re \left\{ \frac{\partial \boldsymbol{J}_0^T}{\partial {\boldsymbol{m}}^T} \left( \overline{\Delta \boldsymbol{d}} ~ ... ~ \overline{\Delta \boldsymbol{d} } \right) \right\} \right]^{-1} \Re \left\{ \boldsymbol{J}_0^\dagger \Delta \boldsymbol{d} \right\}. \tag{9.13}\label{9.13}$$
Figure 9.5: Double scattering effects in FWI.
A dimensional analysis of the Newton update clarifies the role of the Hessian:
$$\left[ M \right] = \left( \frac{ \left[M\right] \ \left[M\right] }{ \left[D\right] \ \left[D\right] } \right) \left( \frac{ \left[D\right] }{ \left[M\right] } \left[D\right] \right), \tag{9.14}\label{9.14} $$
where \( \left[M\right] \) and \( \left[D\right] \) denote model and data units. The gradient of the misfit function does not have the units of the model. Only the action of the inverse Hessian on the gradient gives the model perturbation its physical units. Therefore, accounting for the Hessian is particularly important in multiparameter FWI, where the gradients associated with each parameter class need to be scaled properly.
Back to \ref{Content}
$$\tag{X}\label{X}$$
Resolution analysis of FWI
To state the problem of seismic imaging, let's start with the equation of the reflection traveltime curve
$$t(o)= \frac{1}{c} \sqrt{4 \, z^2 + o^2}, \tag{10.1}\label{10.1}$$
where \( t \) is traveltime, \( o \) is offset, \( z \) is the depth of the reflector, \( c \) is the wavespeed of the half-space above the reflector (
Figure 10.1).

Figure 10.1: Key attributes in seismic imaging.
For \( o \rightarrow 0 \), \(~~~~ t(o) \approx \frac{2 z }{c} \). This equation highlights the notorious velocity-depth ambiguity.
For \( o \rightarrow \infty \), \(~~~~ t(o) \approx \frac{o }{c} \). This equation highlights that waves recorded at long offsets (diving waves, refracted waves, post-critical reflections) carry information on the kinematic properties of the subsurface, represented by the wavespeed. The wavespeed also represents an average property of the overburden, namely its long wavelengths.
These equations provide the basic guidelines to design hierarchical multiscale FWI of wide-aperture data.
Step 1: invert the wide-aperture data to build the long wavelengths of the velocity model.
Step 2: invert the short-spread reflection data to image reflectivity.
These two end-member tasks can be implemented in a single integrated seismic imaging workflow with FWI when wide-aperture data are available, while they are performed in an alternating mode with conventional reflection data through an explicit scale separation between the long and short wavelengths. The intermediate wavelengths belong to the null space of the seismic imaging problem, even if broadband wide-azimuth data help shrink this null space.
Resolution analysis in the framework of diffraction tomography
This section is inspired by
Sirgue and Pratt (2004) who provided a tutorial on the principles of diffraction tomography (
Devaney, 1984;
Wu and Toksoz, 1987 ). Let's assume distant sources and receivers and a homogeneous background medium.
In this framework, incident and scattered Green functions can be approximated by monochromatic plane waves:
$$\begin{array}{l} g_0\left(\omega,\boldsymbol{x};{\boldsymbol{s}}\right) \approx \exp \left( i k \hat{\boldsymbol{s}} . \boldsymbol{x} \right), \\ g_0\left(\omega,\boldsymbol{x};{\boldsymbol{r}}\right) \approx \exp \left( i k \hat{\boldsymbol{r}} . \boldsymbol{x} \right), \end{array}\tag{10.2}\label{10.2}$$
where \( k = \omega / c_0 \), \( \hat{\boldsymbol{s}} \) is the unit vector in the incident propagation direction (source to scatterer), and \( \hat{\boldsymbol{r}} \) is the unit vector in the inverse scattering direction (receiver to scatterer)
(Figure 10.2).

Figure 10.2: Incident (left) and scattered (middle) plane waves. (Right) Corresponding wavenumber mapped in the image.
Let's recall the expression of the gradient of the misfit function:
$$\frac{ \partial C(\boldsymbol{m}) }{ \partial m_j } = - \Re \left\{ \left( \overbrace{ {\boldsymbol{A}}^{-1} \frac{ \partial{\boldsymbol{A}} }{ \partial m_j} \boldsymbol{u} }^{\text{partial derivative wavefields}} \right)^\dagger \overbrace{ {\boldsymbol{P}}^T \Delta \boldsymbol{d} }^{\text{data residuals}} \right\}. \tag{10.3}\label{10.3}$$
In the case of the scalar wave equation for squared-slowness update, this simplifies to:
$$\frac{ \partial C(\boldsymbol{x}_j) }{ \partial m_j } = \sum_\omega \omega^2 \sum_{s,r} \Re \left\{ s(\omega) \overline{g_0}\left(\omega,\boldsymbol{x}_j;{\boldsymbol{s}}\right) \Delta {\boldsymbol{d}}(\omega,{\boldsymbol{r}},{\boldsymbol{s}}) \overline{g_0}\left(\omega,\boldsymbol{x}_j;{\boldsymbol{r}}\right) \right\}. \tag{10.4}\label{10.4}$$
Substitution of the expression of the Green functions in the gradient gives:
$$\frac{ \partial C(\boldsymbol{x}_j) }{ \partial m_j } = \sum_\omega \omega^2 \sum_{s,r} \Re \left\{ s(\omega) \Delta {\boldsymbol{d}}(\omega,{\boldsymbol{r}},{\boldsymbol{s}}) \exp \left( - i k_0 \left( {\hat{\boldsymbol{s}}} + {\hat{\boldsymbol{r}}} \right) . \boldsymbol{x} \right) \right\}. \tag{10.5}\label{10.5}$$
The wavenumber vector \( k_0 \left( {\hat{\boldsymbol{s}}} + {\hat{\boldsymbol{r}}} \right) \) can be written in polar form as:
$${\boldsymbol{k}} = k \left( \cos (\phi), \sin (\phi) \right), \tag{10.6}\label{10.6}$$
where \( \phi = \phi_s + \phi_r \) (see
Figure 10.3).

Figure 10.3: Scattering and dip angles.
For a source-receiver pair and a frequency, the modulus of the wavenumber injected in the FWI model can be related to the local wavelength \( \lambda_0 = c_0 / f \) and the scattering angle \( \theta \) by:
$$k = \frac{ 4 \, \pi \, f }{c_0} \cos \left( \theta / 2 \right) = \frac{4 \, \pi}{\lambda_0} \cos \left( \theta / 2 \right). \tag{10.7}\label{10.7}$$
From (\ref{10.5}), we conclude that the gradient has the form of a truncated inverse Fourier series.
The argument of the basis functions is the local wavenumber vector
$$ {\boldsymbol{k}} = k \left( \cos (\phi), \sin (\phi) \right) = k_0 \left( {\hat{\boldsymbol{s}}} + {\hat{\boldsymbol{r}}} \right) \tag{10.8}\label{10.8}$$
injected in the FWI model at position \( \boldsymbol{x}_j\).
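A one-line implementation of the modulus relation (10.7) makes this mapping easy to explore; the background wavespeed and the (frequency, angle) pairs below are illustrative only.
```python
import numpy as np

c0 = 4000.0  # assumed homogeneous background wavespeed (m/s)

def wavenumber(f, theta):
    """Modulus of the wavenumber injected in the model (rad/m), eq. (10.7)."""
    return (4.0 * np.pi * f / c0) * np.cos(theta / 2.0)

# Redundancy: a low frequency at zero aperture and a higher frequency at
# wide aperture map the same wavenumber.
k1 = wavenumber(5.0, np.deg2rad(0.0))     # 5 Hz, zero scattering angle
k2 = wavenumber(10.0, np.deg2rad(120.0))  # 10 Hz, 120-degree aperture
print(k1, k2)  # identical: cos(60 deg) = 0.5 compensates the doubled frequency
```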
Truncation of the Fourier series is controlled by the source bandwidth, the scattering-angle \( \theta \) and the dip angle \( \phi \) illumination. The scattering-angle and the dip-angle illuminations are themselves controlled by the source-receiver acquisition geometry (
Figure 10.4).

Figure 10.4: Wavenumber coverage versus source-receiver geometry and frequency. (Left) Long-offset surface acquisition combining transmission and reflection (16 km maximum offset) for four frequencies (3.5, 5, 7, 10 Hz). The velocity linearly increases with depth in the background velocity model. Vertical wavenumber coverage is fostered at the expense of small horizontal wavenumbers. (Right) Short-offset acquisition. A flat reflector is injected in the background velocity model. It behaves as a secondary source. This secondary source allows for small horizontal wavenumber coverage along the transmission paths connecting the reflector and the sources/receivers. Migration of the short-spread reflections maps high wavenumbers. There is a wavenumber gap between the small wavenumbers mapped by reflection tomography and the high wavenumbers mapped by migration of short-spread reflections. The classical seismic imaging workflow relies on this alternating update of the velocity macromodel by tomography and the reflectivity by migration. Each of the two subproblems uses the output of the other subproblem as a passive quantity: the tomography uses the reflectivity as a secondary source, the migration uses the velocity macromodel as the propagator.
In 3D, the azimuth angle should be added
(Figure 10.5).
$$\boldsymbol{k} = k
\left(
\begin{array}{l}
\sin \left( \phi \right) \cos \left( \varphi \right) \\
\sin \left( \phi \right) \sin \left( \varphi \right) \\
\cos \left( \phi \right)
\end{array}
\right) ~~ \text{with} ~~ k = \frac{2 \, \omega}{V_P} \cos \left( \theta / 2 \right).
\tag{10.9}\label{10.9}$$

Figure 10.5: Angles controlling the FWI resolution in 3D: scattering, dip and azimuth angles.
Fundamental conclusions can be inferred from (\ref{10.7}) and (\ref{10.8}):
- One frequency and one scattering angle map one wavenumber.
- Low frequencies and wide-scattering angles map low wavenumbers. Conversely, high frequencies and small-scattering angles map high wavenumbers (Figure 10.6).

Figure 10.6: Low frequencies - wide angles map low wavenumbers. Conversely, high frequencies - short angles map high wavenumbers.
- This leads to a redundant control of frequencies and scattering angles on the wavenumber coverage for wide-aperture acquisitions, i.e., many (frequency, \( \theta \)) pairs map the same wavenumber (Figure 10.7).

Figure 10.7: Several pairs of frequency and scattering angles map the same wavenumber.
- CPU-efficient FWI algorithms can be designed by decimating the redundancy of the wavenumber coverage. This amounts to limiting the inversion to a few discrete frequencies (Figure 10.8). Sirgue & Pratt (2004) provide some guidelines to choose the frequency sampling in FWI (Figure 10.8). This criterion assumes that a dense array of point sources allows for a fine sampling of the scattering angles. Under this condition, one frequency maps one band of wavenumbers in the model space by virtue of the redundant control of frequency and scattering angle on the wavenumber coverage. Sirgue & Pratt (2004) considered the simple case of a reflection from a flat interface, in which case only the vertical wavenumbers of the target (the interface) are nonzero. Formally, the criterion of Sirgue & Pratt (2004) has no practical application in complex heterogeneous media; yet, it provides useful guidelines to design efficient frequency-domain FWI algorithms.
$$k_z(f) = \frac{2 f}{c} \cos \left( \theta / 2 \right). \tag{10.10}\label{10.10}$$
The criterion of Sirgue & Pratt (2004) proposes that the maximum wavenumber mapped by a given frequency should be equal to the minimum wavenumber mapped by the next frequency, to remove the redundancy in the wavenumber coverage and minimize the number of frequencies involved in frequency-domain FWI.
$$k_{max}(f)= k_{min}(f + \Delta f). \tag{10.11}\label{10.11}$$
$$\frac{2 f}{c} \cos \left( \theta_{min} / 2 \right) = \frac{2 ( f + \Delta f )}{c} \cos \left( \theta_{max} / 2 \right). \tag{10.12}\label{10.12}$$
From this equality, we easily infer the adaptive frequency interval to be used to remove the wavenumber redundancy:
$$\Delta f = w \, f \; \; \; \text{where} \; \; \; w = \frac{ \cos (\theta_{min}/2) }{ \cos (\theta_{max}/2) } - 1. \tag{10.13}\label{10.13}$$
Basic conclusions:
- frequency interval increases with frequency.
- frequency interval increases when the wide-angle coverage increases.

Figure 10.8: Vertical wavenumber bands mapped by a reflection from a flat reflector at the base of a homogeneous half space and by a finite number of frequencies such that the wavenumber bands do not overlap. A smaller number of frequencies is necessary to continuously sample the wavenumber spectrum for long-offset acquisitions relative to their short-offset counterpart, according to (\ref{10.13}).
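The selection rule (10.13) is easy to turn into a short sketch. The helper below is hypothetical and the aperture bands are illustrative; it assumes the aperture band \( [\theta_{min}, \theta_{max}] \) is available at all frequencies.
```python
import numpy as np

def select_frequencies(f_min, f_max, theta_min, theta_max):
    """Frequency list with non-overlapping wavenumber bands, eq. (10.13)."""
    w = np.cos(theta_min / 2.0) / np.cos(theta_max / 2.0) - 1.0
    freqs = [f_min]
    while freqs[-1] * (1.0 + w) < f_max:
        freqs.append(freqs[-1] * (1.0 + w))
    return np.array(freqs)

# Wide-aperture acquisition: very few frequencies suffice.
print(select_frequencies(4.0, 20.0, np.deg2rad(0.0), np.deg2rad(160.0)))
# Short-spread acquisition: a much denser frequency sampling is required.
print(select_frequencies(4.0, 20.0, np.deg2rad(0.0), np.deg2rad(60.0)))
```
Note how the interval between consecutive frequencies grows with frequency and with the aperture coverage, as stated in the conclusions above.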
The wavepath in the frequency domain is computed by multiplying the monochromatic Green functions emitted from the source and the receiver
(Figure 10.9). The real part of the complex-valued monochromatic wavepath is shown here. Note how the width of the first Fresnel zone and of the isochrones decreases with source-receiver offset.

Figure 10.9: Sensitivity kernel of FWI based upon single-scattering Born approximation. The top and middle panels show the shot and receiver monochromatic Green functions in homogeneous media for two different offsets. Bottom panels show the corresponding wavepaths (
Woodward, 1992). The width of the Fresnel zones correlates with the scattering angle.

Figure 10.10: Wavepath, isochrone, Fresnel zones.
The wavepath in Figure 10.10 can be described from two slightly different perspectives, in the framework of the inverse scattering theory or reverse-time (adjoint) imaging principle.
- The monochromatic wavepath for a point source represents an interference picture that shows the equi-phase surfaces associated with traveltimes of waves scattered by all the point diffractors in the model (i.e., the partial derivative wavefields). Cross-correlation of these partial derivative wavefields with the data residuals provides the contribution of the seismic trace to the gradient: a residual positioned at a given traveltime will be backprojected in space onto the corresponding equi-phase surface. The width of the equi-phase surfaces (i.e., the modulus of the wavenumber vector) gives the resolution in the direction perpendicular to the surface.
- Let's assume that the data residual is a Dirac delta function in the frequency domain. The corresponding residual in the time domain has an infinite support. The monochromatic wavepath provides a complete picture over time of the spatial locations where residuals observed at different times will be backprojected.
The relationship between frequency, scattering angle, and wavenumber can be established in the framework of asymptotic ray+Born migration/inversion. Ray+Born migration/inversion is a migration method which has been recast in the theoretical framework of least-squares inverse problem theory by
Lambare et al. (1992),
Jin et al. (1992),
Thierry et al. (1999a),
Thierry et al. (1999b),
Operto et al. (2000),
Xu et al. (2001) and
Lambare et al. (2003). Ray+Born inversion consists in a waveform inversion of the scattered wavefield where the Green's functions are computed with ray theory and the forward problem is linearized with the Born approximation. In the high frequency limit, the relationship between frequency, scattering angle and wavenumber can be clearly established through a diagonalization of the Hessian operator (see also the original work of
Beylkin (1985) and
Miller et al. (1987)). The kernel of the ray+Born migration/inversion can also be used to precondition reverse-time migration (
Metivier et al., 2015).
We seek to image a point diffractor located in the middle of a \( 101 \times 101 \) grid. The optimization algorithm is L-BFGS. The step length is estimated by line search. All frequencies are inverted in one go. Thirty iterations are performed. The prewhitening damping factor of the diagonal Hessian is \( 0.01 \times \max ( \text{diag} \left\{ \boldsymbol{H}_a \right\} ) \). We first illustrate the band-pass filtering effects induced by a narrow illumination of the wavenumber spectrum.
$$ \begin{array}{|c|c|c|c|} \hline V_{P_0} \text{ (m/s)} & \Delta V_P \text{ (m/s)} & \text{Frequency band (Hz)} & h \text{ (m)} \\ \hline 4000 & 5000 & 0.2 - 20 & 50 \\ \hline \end{array} \tag{10.14}\label{10.14}$$
The source wavelet is a Ricker wavelet with a 5 Hz dominant frequency (
Figure 10.11). Frequencies between 0.2 and 20 Hz are inverted simultaneously with a frequency sampling of 0.2 Hz.

Figure 10.11: Spectrum of the Ricker wavelet for the Dirac test.
We compare the FWI results for a single-shot acquisition and for a circular acquisition centered on the point diffractor. In the second case, bandpass filtering effects only result from the limited bandwidth of the source (Figure 10.12).
Figure 10.12: Reconstruction of a scatterer. (Left) Single-source surface acquisition. (Right) Full circular acquisition.
The left panel in
Figure 10.12 shows artifacts resulting from the sharp truncation of the FWI sensitivity kernel (the Fourier series), caused by the incomplete illumination of the dip and scattering angles. The radius of the first Fresnel zone in reflection mode is:
$$R = \sqrt{ \frac{ \lambda \, z }{ 2 } }. \tag{10.15}\label{10.15}$$
For this particular case (\( f_{max} \) = 20 Hz; \( z \approx 2 \) km; \( V_P \) = 4 km/s), \( \lambda = V_P / f_{max} = 200 \) m, hence \( R = \sqrt{200 \times 2000 / 2} \approx 447 \) m. This is consistent with the width of the migrated segment shown in
Figure 10.13. The circular acquisition gives a low-pass reconstruction of the scatterer. Some mild artifacts are also generated by aliasing (see below).

Figure 10.13: Size of the Fresnel zone associated with the reflection sampling the scatterer.
We now assess more precisely the resolution with which the scatterer has been reconstructed in Figure 10.12 (right). We look for the trapezoidal filter which allows us to roughly match the vertical log across the reconstructed scatterer with the log across the true scatterer after application of the trapezoidal filter (Figure 10.14). The filter shows that FWI reconstructs the low wavenumbers thanks to the full angular illumination. Moreover, the high-wavenumber cutoff is roughly consistent with the maximum frequency of the Ricker wavelet.

Figure 10.14: Resolution of the scatterer reconstruction (Ricker wavelet).
We repeat the experiment by replacing the Ricker wavelet with a Dirac wavelet. The results are shown in Figure 10.15. Clearly, the inversion manages to reconstruct higher wavenumbers compared to Figure 10.14. This is because the spectral components injected in FWI are no longer weighted by the spectrum of the Ricker wavelet (the amplitude spectrum of the wavelet can be seen as a weighting operator in the data misfit function).

Figure 10.15: Resolution of the scatterer reconstruction (Dirac wavelet).
Figure 10.16 compares the results of three experiments. In the first, multiple frequencies are simultaneously inverted with a Ricker wavelet as in
Figure 10.14. In the second, the Dirac wavelet replaces the Ricker wavelet as in
Figure 10.15. In the third, the frequencies are inverted sequentially (frequency continuation) with the Ricker wavelet. In this case, FWI is blind to the amplitude spectrum of the source signature. The best resolution is achieved in this latter case because the information carried by each frequency is optimally extracted.

Figure 10.16: Resolution of the scatterer reconstruction. (Left) Ricker wavelet. Simultaneous inversion of multiple frequencies. (Middle) Dirac wavelet. Simultaneous inversion of multiple frequencies. (Right) Ricker wavelet. Sequential inversion of multiple frequencies.
Aliasing issue in FWI
Beyond illumination, sparse acquisitions may raise sampling issues: in frequency-domain FWI, the gradient of the misfit function is sampled in the wavenumber domain. If the wavenumber spectrum (either \( k \) or \( \phi \)) is not sampled sufficiently finely in some directions, spatial aliasing (wraparound) will occur along these directions with a period of \( 1 / \Delta k \) or \( 1 / \Delta \phi \).
This is illustrated in
Figure 10.17 with the diffractor test. On the left, we subsample the sources in the circular acquisition. In this case, the dip angle is subsampled, leading to a complex wraparound pattern. On the right, we subsample the frequencies in the surface acquisition. In this case, we subsample the wavenumber modulus, generating wraparound of the reflector segment along vertical and subvertical directions.
Aghamiry et al. (2019) have shown how to mitigate these artifacts with sparsity-promoting \( \ell_1 \) regularization.
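A toy numpy experiment illustrates the wraparound period \( 1/\Delta k \) (values are illustrative): a gradient built as a truncated Fourier series sampled every \( \Delta k \) in wavenumber is periodic in space, and coarsening \( \Delta k \) folds replicas of the point-scatterer response back into the model.
```python
import numpy as np

x = np.linspace(0.0, 2000.0, 2001)          # space axis (m)
for dk in (1.0 / 2000.0, 1.0 / 500.0):       # fine vs coarse wavenumber sampling (1/m)
    kk = np.arange(0.0, 0.05, dk)            # truncated wavenumber band
    g = np.sum(np.cos(2.0 * np.pi * np.outer(kk, x)), axis=0)
    # Peaks of |g| appear at x = 0, 1/dk, 2/dk, ... : spatial replicas.
    print(f"dk={dk:.4g} 1/m -> spatial replicas every {1.0 / dk:.0f} m")
```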
Acquisition design in the framework of diffraction tomography, as reviewed above, is discussed in section 8 of
Versteeg (2012).

Figure 10.17: Aliasing artifacts in FWI. (Left) Single-offset circular acquisition. Sources are downsampled, which translates into a downsampling of the dip angle. (Right) Single-source surface acquisition. Frequencies are downsampled, which translates into a subsampling of the wavenumber modulus and manifests as wraparound of the reflector segment in the space domain.
Back to \ref{Content}
$$\tag{XI}\label{XI}$$
The cycle skipping pathology
Definition of cycle skipping
FWI is a nonlinear and ill-posed problem, mainly due to the oscillating nature of seismic waves and the incomplete illumination of the subsurface from surface acquisitions. The misfit function has many local minima (
Figure 11.1).

Figure 11.1: Multimodality of the
l2 data-difference based distance in FWI for a two-parameter problem. The targeted medium is a velocity-gradient model. The two parameters are the velocity \( v_0 \) at the top of the model and the vertical velocity gradient \(\alpha\) (
Mulder and Plessix, 2008).
Therefore, a sufficiently accurate initial model is required to converge towards the global minimum. A criterion that needs to be satisfied is that, in the framework of the Born approximation, the initial model should allow the traveltime of a given arrival to be matched with an error that does not exceed half the period. This condition results from the width of the Fresnel zones, on which data residuals are backprojected: Fresnel zones define the surfaces over which scattered waves whose traveltimes differ by no more than half a period interfere constructively to build an interpretable arrival
(Figure 10.10). If the kinematic error exceeds this criterion, the difference-based residuals will be inconsistently back-projected onto wrong isochrones. More precisely, the recorded and simulated components of a difference-based residual will be back-projected onto different Fresnel zones. Moreover, the residuals associated with a given phase recorded at different offsets will be back-projected inconsistently from one offset to the next.

Figure 11.2: Heuristic interpretation of cycle skipping from two different perspectives. (Left) If the recorded and simulated components of the data residuals (top right inset) differ by more than half a period, they will be back-projected onto two different isochrones. (Right) Phase ambiguity beyond \(\pi\). A half-period time lag between recorded and simulated
data means a phase shift of \(\pi\) beyond which the phase is ambiguous.
The previous criterion can be translated in terms of relative time error:
$$\frac{\Delta t}{T_L} < \frac{1}{2 \, N_\lambda}, \tag{11.1}\label{11.1}$$
where \( \Delta t \) is the traveltime error, \( T_L \) is the traveltime of the considered arrival, and \( N_\lambda = T_L / T \) is the number of propagated wavelengths of period \( T \). Equation (\ref{11.1}) shows that the number of propagated wavelengths \( N_\lambda \) should be as small as possible during the early stages of FWI.
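A toy illustration of this pathology: the least-squares misfit between a Ricker arrival and a time-shifted copy becomes multimodal beyond half the dominant period (all values below are illustrative).
```python
import numpy as np

def ricker(t, f0):
    """Ricker wavelet with dominant frequency f0."""
    a = (np.pi * f0 * t) ** 2
    return (1.0 - 2.0 * a) * np.exp(-a)

f0, dt = 5.0, 0.001                       # dominant frequency (Hz), sampling (s)
t = np.arange(-1.0, 1.0, dt)
obs = ricker(t, f0)
shifts = np.arange(-0.4, 0.4, dt)         # trial kinematic errors (s)
misfit = [np.sum((ricker(t - s, f0) - obs) ** 2) for s in shifts]
# The misfit is convex only for |shift| < T/2 = 1/(2*f0) = 0.1 s; beyond,
# secondary minima trap a local gradient-based descent.
```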
How to mitigate cycle skipping?
Data-driven hierarchical approaches
The most obvious way to decrease \( N_\lambda \) is to decrease the frequency. This prompted many authors to design multiscale approaches of FWI by hierarchical inversion of subdatasets of increasing high-frequency content
(Figure 10.8). By virtue of the relationships between local wavenumber, frequency and scattering angle, hierarchical inversion of increasing frequencies leads to a multiscale approach which proceeds from the long-wavelength to the short-wavelength reconstruction (
Figure 11.3). A complementary approach consists of limiting the propagation distances. This can be achieved by jointly limiting the source-receiver offset and the traveltime (we also need to limit traveltimes to avoid migrating a deep short-spread reflection: if the background velocity model is inaccurate, the migration will badly position the deep reflector and FWI may have difficulties moving it; on the other hand, this deep reflector can be used by FWI as a deep scattering source which enriches the low-wavenumber content of the sensitivity kernel along the so-called rabbit ears; see the Reflection Waveform Inversion (RWI) section). This leads to a layer-stripping approach where the subsurface model is built from top to bottom.
Figure 11.3: Illustration of multiscale FWI by frequency continuation with the SEG/EAGE overthrust model. (Top left) Initial model. (Bottom right) True model. The other panels show FWI models reconstructed after successive processing of the 3 Hz, 5 Hz, 8 Hz and 20 Hz frequencies.
The misfit function can be rescaled to introduce a weighting operator \( {\bf{W}}_d\) and design a better-conditioned optimization problem:
$$C(\bf{m}) = \frac{1}{2} \Delta {\bf{d}}^\dagger \, {\bf{W}}_d \, \Delta {\bf{d}} = \frac{1}{2} \Delta {\bf{d}}^\dagger \, {\bf{S}}_d^\dagger \, {\bf{S}}_d \, \Delta {\bf{d}} = \frac{1}{2} \| {\bf{S}}_d \, \Delta {\bf{d}} \|_2^2.\tag{11.2}\label{11.2}
$$
The weighting operator \( {\bf{W}}_d = {\bf{S}}_d^\dagger \, {\bf{S}}_d \) in the data domain is written as the auto-correlation of operator \( {\bf{S}}_d \). \( {\bf{S}}_d \) is an operator that is applied to the data misfit vector, with the aim to weight the contribution of each element of this vector.
Minimization of the weighted misfit function leads to:
$$\Delta {\bf{m}} =
\left[ \Re \left\{ {\bf{J}}^\dagger {\bf{W}}_d {\bf{J}} + \frac{\partial {\bf{J}}^T }{\partial {\bf{m}}^T } \left( {\bf{W}}_d \overline{\Delta {\bf{d}}} ~ ... ~ {\bf{W}}_d \overline{\Delta {\bf{d}}} \right) \right\} \right]^{-1}
\Re \left\{ {\bf{J}}^\dagger {\bf{W}}_d \Delta {\bf{d}} \right\}.
\tag{11.3}\label{11.3}
$$
In the gradient, \(\bf{S}_d\) is applied both to the data residuals and to the partial derivative wavefields (in terms of implementation, \(\bf{W}_d\) can be applied directly to the
source of the adjoint-state equation, i.e., the data residuals).
In the linear Hessian, \(\bf{S}_d\) is applied to the partial derivative wavefields before taking the cross-correlation.
In the nonlinear Hessian term, \(\bf{W}_d\) is applied to the sequence of data residuals.
As an illustration, short-offset data can have much stronger amplitudes than long-offset data. In this case, short-offset residuals (and partial derivative wavefields) have a dominant contribution in the misfit function and hence in the model building. In particular, the Hessian will correct the gradient for illumination effects only for these dominant arrivals because they have a dominant effect in the computation of \( \bf{J}^\dagger \, \bf{J} \) (each coefficient of this matrix is computed by a summation over the data space).
One possible data preconditioning is to correct the data for geometrical spreading through a diagonal weighting matrix \( {\bf{S}}_{d} \), which describes
a linear gain with offset in 3D and a gain as a square-root power of offset in 2D. This is illustrated in Figure 11.5 with an ultra-long offset OBS gather from the eastern Nankai trough.

Figure 11.5: Ultra-long offset gather collected in the eastern Nankai subduction zone with two different amplitude balancings. Top: no gain with offset. Bottom: linear gain with offset. In the latter case, more weight will be given to the long-offset arrivals during FWI.
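A minimal sketch of such an offset-dependent diagonal weighting \( {\bf{S}}_d \) (toy values, assuming the source-receiver offset of each trace is known):
```python
import numpy as np

offsets = np.linspace(100.0, 16000.0, 160)           # offsets of the traces (m)
rng = np.random.default_rng(4)
residuals = rng.standard_normal(offsets.size) / offsets  # toy residuals decaying with offset

S_3d = offsets            # linear gain with offset (3D geometrical spreading)
S_2d = np.sqrt(offsets)   # square-root gain with offset (2D geometrical spreading)

# The weighted residuals enter the misfit function as ||S_d * residuals||^2.
weighted = S_3d * residuals
```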
Example of Laplace-domain FWI
In the frequency domain, this data-driven continuation approach leads to Laplace-Fourier domain FWI.
In frequency-domain FWI, where very few frequencies are processed at a time, the trick to select subdatasets with respect to
time is to manipulate complex-valued frequencies, which turns out to be equivalent to time damping by decaying exponential functions.
$$\mathcal{F} \left( p(t) \, e^{-(t-t_0)/\tau} \right) = \int_{-\infty}^{+\infty} p(t) \, e^{-i(\omega - i / \tau)t} \, e^{t_0 / \tau} \, dt.\tag{11.4}\label{11.4}$$
This transform was called Laplace-Fourier transform by
Shin & Ha (2009). It was developed at the same time by
Brossier et al. (2009) to process land data
with surface waves.
Accordingly, the recorded data in the time domain must be preprocessed by exponential damping before the Fourier transform.
If the time damping is applied from the first-arrival traveltime \( t_0 \), then \( {\bf{S}}_d \) is a diagonal matrix with coefficients \( e^{t_0/\tau} \).
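The equivalence between time damping and complex-valued frequencies stated in (11.4) can be checked numerically; the toy trace below is illustrative.
```python
import numpy as np

dt, nt, tau, t0 = 0.004, 1024, 0.5, 0.2
t = np.arange(nt) * dt
p = np.sin(2 * np.pi * 12.0 * t) * np.exp(-((t - 0.8) / 0.1) ** 2)  # toy trace

omega = 2 * np.pi * 10.0                       # real angular frequency (10 Hz)
# Left-hand side: damp the trace, then transform at the real frequency.
damped = p * np.exp(-(t - t0) / tau)
lhs = np.sum(damped * np.exp(-1j * omega * t)) * dt
# Right-hand side: transform the raw trace at the complex frequency.
rhs = np.sum(p * np.exp(-1j * (omega - 1j / tau) * t)) * dt * np.exp(t0 / tau)
assert np.allclose(lhs, rhs)
```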
In
Figure 11.6, we show wavepaths associated with source and receiver monochromatic wavefields computed with complex-valued frequencies of different imaginary parts. The imaginary part of the complex-valued frequency damps the outer fringes of the wavepath, those associated with late-arriving phases. The effect of time damping is also illustrated in the time domain in
Figure 11.7, where we see that an aggressive time damping leaves only the early arrivals in the gather. Data preconditioning by frequency selection and time damping are combined in Laplace-Fourier FWI algorithms with two nested loops over frequencies and time dampings (see
Gorszczyk et al., 2017 for a real-data application). This defines two levels of hierarchy over frequencies and scattering angles in the FWI algorithm, which control the range of wavenumbers injected in the subsurface model during one multiscale step of FWI. Put simply, applying time damping from the first arrival is close in spirit to selecting events as a function of the aperture angle.
Also, a suitable choice of this imaginary part allows for the extraction of the phase of the first-arrival and hence makes frequency-domain waveform inversion algorithms amenable to first-arrival traveltime inversion or refraction tomography (
Shin et al., 2002; Min et al., 2006; Pyun et al., 2005; Ellefsen, 2009).




Figure 11.6: Effect of time damping on Laplace-Fourier FWI. (Top) No time damping. (Middle) Low time damping. (Bottom) Strong time damping. Each figure shows the source and receiver Green functions (real part) and the associated sensitivity kernel (real part). In the bottom figure, the aggressive time damping preserves mainly the first Fresnel zone.

Figure 11.7: Effects of time damping on time-domain gathers. Early (or wide-aperture) arrivals are selected to build the long wavelengths. Relaxing the time damping progressively leads to a multi-resolution reconstruction. Damping values are 6 s, 2 s, 1 s and 0.5 s. The top panel is shown without time damping.
Beyond reduced-space FWI: Extending the linear regime of FWI by search space extension
-
Wavefield Reconstruction Inversion (WRI) or extended source waveform inversion
-
Migration-based velocity analysis with space or time lag
Beyond difference-based residuals: Designing more convex distances
-
Deconvolution-based FWI
-
FWI based upon optimal transport
Back to \ref{Content}
$$\tag{XII}\label{XII}$$
Regularization of FWI
-
Regularization with penalty method
- Regularization by reparametrization
Back to \ref{Content}
$$\tag{XIII}\label{XIII}$$
FWI and Least-Squares Reverse Time Migration (RTM)
The goal of prestack depth migration is to build an image in depth of the reflectivity (the short wavelengths) of the subsurface from reflected waves. This reflectivity mainly represents lithological or tectonic (fault, thrust) discontinuities
(Figure 13.1). Migration is a linear process, which relies on the assumption of a scale separation between a background velocity model and the reflectivity
(Figure 13.2). Accordingly, it relies on the linearization of the wave equation with the Born approximation. Least-squares migration recasts migration in the framework of least-squares quadratic waveform inversion. A quantitative characterization of the reflectivity is achieved, beyond a purely geometrical description. Least-squares Reverse Time Migration computes Green functions with full-waveform modelling engines, as FWI does. An alternative approach relies on ray theory, such as ray+Born or ray+Kirchhoff migration/inversion, to compute Green functions. These approaches are much faster but can provide inaccurate images in complex media (for example, with salt bodies and sharp contrasts).
The goal of this section is to show the subtle differences between a linear and a nonlinear waveform inversion by comparing the FWI and RTM formalisms.

Figure 13.1: An application of least-squares ray+Born migration/inversion. Characterization of the decollement, offshore Ecuador (Ribodetti et al., 2011).

Figure 13.2: Scale separation between smooth background velocity model and reflectivity
For now, we assume that we know the velocity macromodel and we seek to estimate the reflectivity by solving a linear inverse problem (RTM). In the next section, we will assume that we know the reflectivity and we will address the more difficult velocity macromodel building problem by nonlinear waveform inversion (RWI).
Linearized forward problem equation
Here, we linearize the wave equation around a known smooth background velocity model \( \boldsymbol{m}_0 \). This will allow us to recast the migration problem as a quadratic optimization problem. The background wavefield \( \boldsymbol{u}_0 \) satisfies the time-harmonic wave equation $$\boldsymbol{A}(\boldsymbol{m}_0) \boldsymbol{u}_0 = \boldsymbol{b}, \tag{13.1}\label{13.1}$$ where $$\boldsymbol{A}(\boldsymbol{m}_0) = \omega^2 \text{diag}(\boldsymbol{m}_0) + \Delta. \tag{13.2}\label{13.2}$$
The true wavefield can be written as the sum of the background wavefield \( \boldsymbol{u}_0 \) and a scattered wavefield \( \delta \boldsymbol{u} \) $$ \boldsymbol{u} = \boldsymbol{u}_0 + \delta \boldsymbol{u} \tag{13.3}\label{13.3}$$.
Also, the subsurface model \( \boldsymbol{m} \) can be written as the sum of the background model \( \boldsymbol{m}_0 \) and a perturbation model \( \delta \boldsymbol{m} \) $$ \boldsymbol{m} = \boldsymbol{m}_0 + \delta \boldsymbol{m} \tag{13.4}\label{13.4}$$.
The true wavefield satisfies the wave equation$$\boldsymbol{A}(\boldsymbol{m}) \boldsymbol{u} = \boldsymbol{b}. \tag{13.5}\label{13.5}$$
Taking the difference between (\ref{13.1}) and (\ref{13.5}) gives the wave equation satisfied by the scattered wavefield \( \delta \boldsymbol{u} \) $$\boldsymbol{A}(\boldsymbol{m}_0) \delta \boldsymbol{u} = - ( \boldsymbol{A}(\boldsymbol{m}) - \boldsymbol{A}(\boldsymbol{m}_0)) \boldsymbol{u} \tag{13.6}\label{13.6}$$.
If we replace \( \boldsymbol{u} \) by \( \boldsymbol{u}_0 \) in the right-hand side, we obtain the single-scattered approximation of \( \delta \boldsymbol{u} \) in the framework of the Born approximation.$$\boldsymbol{A}(\boldsymbol{m}_0) \delta \boldsymbol{u} \approx - (\boldsymbol{A}(\boldsymbol{m}) - \boldsymbol{A}(\boldsymbol{m}_0) ) \boldsymbol{u}_0 \tag{13.7}\label{13.7}$$or equivalently$$\delta \boldsymbol{u} \approx - \boldsymbol{A}(\boldsymbol{m}_0)^{-1} (\boldsymbol{A}(\boldsymbol{m}) - \boldsymbol{A}(\boldsymbol{m}_0)) \boldsymbol{u}_0. \tag{13.8}\label{13.8}$$
For the scalar Helmholtz equation and considering the squared slowness as model parameters, we have$$\boldsymbol{A}(\boldsymbol{m}) - \boldsymbol{A}(\boldsymbol{m}_0) = \omega^2 \text{diag} (\delta \boldsymbol{m}). \tag{13.9}\label{13.9}$$.
Accordingly, the single-scattered wavefield is given by$$\delta \boldsymbol{u} = - \omega^2 \boldsymbol{A}(\boldsymbol{m}_0)^{-1} \text{diag} (\delta \boldsymbol{m}) \boldsymbol{u}_0. \tag{13.10}\label{13.10}$$Indeed, \( \omega^2 \text{diag} (\delta \boldsymbol{m}) \boldsymbol{u}_0 \) is a scattering source. If the difference between \( \boldsymbol{m} \) and \( \boldsymbol{m}_0 \) reduces to a velocity perturbation at a single diffractor point in the subsurface, then \( \partial \boldsymbol{u} / \partial \boldsymbol{m} \) is nothing other than the partial derivative wavefield (\ref{4.3}).
The single scattered data can be written as$$ \delta \boldsymbol{d} = \boldsymbol{P} \delta \boldsymbol{u} = -\omega^2 \boldsymbol{P} \boldsymbol{A}(\boldsymbol{m}_0)^{-1} \text{diag} (\delta \boldsymbol{m}) \boldsymbol{u}_0 = \boldsymbol{B}(\boldsymbol{m}_0) \delta \boldsymbol{m}, \tag{13.11}\label{13.11}$$where the observation operator \( \boldsymbol{P} \) samples the scattered wavefield at the receiver position.
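A minimal sketch of Born modelling (13.10)-(13.11) on a small 2D grid, assuming a 5-point Laplacian with Dirichlet boundaries (no absorbing layers, so boundary echoes are expected); all values are illustrative.
```python
import numpy as np
from scipy.sparse import diags, identity, kron
from scipy.sparse.linalg import splu

n, h, c0, freq = 101, 50.0, 4000.0, 4.0
omega = 2.0 * np.pi * freq
m0 = np.full(n * n, 1.0 / c0**2)               # background squared slowness

# Helmholtz operator A(m0) = omega^2 diag(m0) + Laplacian (eq. 13.2).
d2 = diags([1.0, -2.0, 1.0], [-1, 0, 1], shape=(n, n)) / h**2
lap = kron(identity(n), d2) + kron(d2, identity(n))
A0 = (omega**2 * diags(m0) + lap).astype(complex).tocsc()
lu = splu(A0)

b = np.zeros(n * n, dtype=complex)
b[(n // 2) * n + 5] = 1.0                      # point source
u0 = lu.solve(b)                               # background wavefield (13.1)

dm = np.zeros(n * n)
dm[(n // 2) * n + n // 2] = 1e-9               # point diffractor perturbation
du = lu.solve(-omega**2 * dm * u0)             # single-scattered wavefield (13.10)

rec = (n - 2) * n + np.arange(n)               # a line of receiver nodes
delta_d = du[rec]                              # sampled scattered data (13.11)
```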
Optimization problem
The least-squares migration problem is the quadratic problem$$\min_{\delta \boldsymbol{m}} \| \boldsymbol{B}(\boldsymbol{m}_0) \delta \boldsymbol{m} - \delta \boldsymbol{d}^* \|_2^2 + \epsilon \| \delta \boldsymbol{m} \|_2^2, $$where a zero-order Tikhonov regularization has been added and \( \epsilon \) is the penalty parameter.
The solution of the quadratic minimization problem at the first iteration is$$\delta \boldsymbol{m}^{(1)} = \left[ \boldsymbol{B}^\dagger \boldsymbol{B} + \epsilon \boldsymbol{I} \right]^{-1} \boldsymbol{B}^\dagger \delta \boldsymbol{d}^* = \boldsymbol{B}^{-g}\delta \boldsymbol{d}^*, \tag{13.12}\label{13.12}$$where \( \boldsymbol{B}^{-g} \) denotes the generalized (pseudo-) inverse of \( \boldsymbol{B} \).
The simulated data \( \delta \boldsymbol{d}^{(1)} \) from the first migrated image \( \delta \boldsymbol{m}^{(1)} \) are given by $$\delta \boldsymbol{d}^{(1)} = \boldsymbol{B} \delta \boldsymbol{m}^{(1)}. \tag{13.13}\label{13.13}$$
The data residuals to minimize are$$\Delta \delta \boldsymbol{d}^{(1)} = \delta \boldsymbol{d}^{(1)} - \delta \boldsymbol{d}^*. \tag{13.14}\label{13.14}$$
We seek the perturbation \( \Delta \delta \boldsymbol{m} \) of \( \delta \boldsymbol{m}^{(1)} \) which cancels out the residuals \( \Delta \delta \boldsymbol{d}^{(1)} \):
$$\min_{\Delta \delta \boldsymbol{m}} \| \boldsymbol{B} \Delta \delta \boldsymbol{m} - \Delta \delta \boldsymbol{d}^{(1)}\|_2^2. \tag{13.15}\label{13.15}$$
Indeed, the minimizer is$$\Delta \delta \boldsymbol{m} = \boldsymbol{B}^{-g} \Delta \delta \boldsymbol{d}^{(1)}. \tag{13.16}\label{13.16}$$
At iteration 2, we get the second migrated image$$\delta \boldsymbol{m}^{(2)} = \delta \boldsymbol{m}^{(1)} + \Delta \delta \boldsymbol{m} = \boldsymbol{B}^{-g} \left( \delta \boldsymbol{d}^* + \Delta \delta \boldsymbol{d}^{(1)} \right). \tag{13.17}\label{13.17}$$
At iteration \( k \),
$$\delta \boldsymbol{m}^{(k)} = \boldsymbol{B}^{-g} \left( \delta \boldsymbol{d}^* + \sum_{i=1}^{k-1} \Delta \delta \boldsymbol{d}^{(i)} \right). \tag{13.18}\label{13.18}$$
We have iteratively refined the accuracy of the reflectivity by updating the right-hand side of the quadratic misfit function (namely, the reflection data) with the running sum of the reflection residuals. This is the well-known
iterative refinement procedure used in linear algebra to improve the solution of an ill-conditioned linear system.
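A minimal sketch of this iterative refinement, with a generic tall matrix standing in for the Born operator \( \boldsymbol{B} \) and numpy's lstsq playing the role of the regularized pseudo-inverse \( \boldsymbol{B}^{-g} \):
```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((400, 100))            # stand-in for the Born operator
dm_true = rng.standard_normal(100)             # toy true reflectivity
d_obs = B @ dm_true                            # toy observed scattered data

dm = np.linalg.lstsq(B, d_obs, rcond=1e-3)[0]  # first migrated image (13.12)
for _ in range(5):                             # iterative refinement (13.18)
    residual = d_obs - B @ dm                  # remaining reflection residuals
    dm += np.linalg.lstsq(B, residual, rcond=1e-3)[0]
```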
Application to Marmousi (ray+Born migration/inversion): migrated images
Iterative asymptotic least-squares migration is illustrated in
Figures 13.3 and
13.4 for the complex Marmousi model. In such a complex model, the ray fields are multi-valued (i.e., folded due to caustics). In this case, and for a single-arrival migration scheme, a significant number of iterations is necessary to recover the true amplitudes of the reflectivity, although a good approximation of the Hessian has been used.

Figure 13.3: (a) True reflectivity. (b) Migrated reflectivity (iteration 1). (c) Migrated reflectivity (iteration 9).

Figure 13.4: (a-c) Ray+Born seismograms computed in the true reflectivity (a), the migrated reflectivities at iteration 1 (b) and 9 (c).
Back to \ref{Content}
$$\tag{XIV}\label{XIV}$$
Reflection waveform inversion
We have seen how to compute reflectivity images by iterative least-squares RTM. However, if the velocity macromodel is not kinematically accurate, the migrated image is not optimally focused. We seek now to improve the velocity model from reflection data and reflectivity. These two tasks are performed iteratively in alternating mode
(Figure 14.1). The FWI adaptation which achieves this goal is referred to as Reflection Waveform Inversion (RWI). For more details, see
Xu et al. (2012),
Brossier et al. (2015),
Zhou et al. (2015) and
Zhou et al. (2018) .

Figure 14.1: Reflection workflow with alternating-direction update of velocity macromodel and reflectivity.
We have seen during the resolution analysis that long wavelengths can be built from large scattering angles and low frequencies. How can we build the long wavelengths of the subsurface from short-offset reflection data without very low frequencies? The foundation of seismic reflection processing is to alternate the velocity macromodel building and the reflectivity imaging by migration
(Figure 14.1). During the velocity macromodel building, the reflectivity is used as buried virtual sources to mimic a transmission experiment from depth to surface (a horizontal cross-hole experiment), and the long wavelengths of the velocity model are updated along the two paths connecting the reflector to the source and receiver
(Figure 14.2).

Figure 14.2: Sketch of velocity macromodel building along the two one-way transmitted paths followed by a reflected wave. The position of the reflector is assumed known.
In the framework of waveform inversion, this approach however requires a slight reformulation of FWI, as explained in the following. A reflected wave contains coupled information on the reflector properties (geometry and impedance contrast) and on the smooth distribution of wavespeeds above the reflector. The issue is that the imprint of the former information (related to the migration task) often overprints the latter in the FWI sensitivity kernel
(Figure 14.3). Therefore, we need to explicitly separate the two components of the sensitivity kernel, as shown in
Figure 14.3f, where the transmitted wavepaths from the reflector to the source and the receiver have been extracted. We now show how to achieve this goal mathematically. The governing idea relies on the explicit separation between the velocity macromodel and the reflectivity, and on the exclusive simulation of reflected waves with the Born approximation.

Figure 14.3: RWI principle. Filtering out the migration isochrone from the FWI sensitivity kernel. (a) Source wavefield in homogeneous background. (b) Sensitivity kernel for a homogeneous background model. Outer fringes are migration isochrones on which reflected waves are back-projected during prestack depth migration and FWI. (c) Source wavefield for a homogeneous background model with a reflector (green line). (d) Sensitivity kernel for the homogeneous background model with a reflector. This kernel is dominated by migration isochrones like those shown in (b). (e) Difference between (a) and (c). (f) Difference between (b) and (d). This shows the sensitivity kernel with a rabbit-ear shape that should be used to update the velocity macromodel by RWI.
We now assume that \( \delta \boldsymbol{m} \) is known and we seek to update \( \boldsymbol{m}_0 \). We model the reflected wavefield with the Born approximation according to (\ref{13.9}). Accordingly, we need to involve now three state equations.
State equation 1: The background wavefield \( \boldsymbol{u}_0 \) satisfies the time-harmonic wave equation
$$\boldsymbol{A}(\boldsymbol{m}_0) \boldsymbol{u}_0 = \boldsymbol{b}, \tag{14.1}\label{14.1}$$
where$$\boldsymbol{A}(\boldsymbol{m}_0) = \omega^2 \text{diag}(\boldsymbol{m}_0) + \Delta. \tag{14.2}\label{14.2}$$
State equation 2: The reflected wavefield \( \delta \boldsymbol{u} \) satisfies the linearized wave equation (\ref{13.9})
$$\boldsymbol{A}(\boldsymbol{m}_0) \delta \boldsymbol{u} \approx - \omega^2 \text{diag}(\delta \boldsymbol{m}) \boldsymbol{u}_0, \tag{14.3}\label{14.3} $$
where we have used $$\boldsymbol{A}(\boldsymbol{m}_0 + \delta \boldsymbol{m}) - \boldsymbol{A}(\boldsymbol{m}_0) = \omega^2 \text{diag}(\delta \boldsymbol{m}). \tag{14.4}\label{14.4}$$
State equation 3: Observation equation
$$\delta \boldsymbol{d} = \boldsymbol{P} \delta \boldsymbol{u}. \tag{14.5}\label{14.5}$$
We seek to update \( \boldsymbol{m}_0 \) by minimizing the least-squares misfit between \( \delta \boldsymbol{d} \) and \( \delta \boldsymbol{d}^* \).
Let's compute the gradient of the misfit function with the adjoint-state method.
Lagrangian function:$$ {\mathcal{L}}(\boldsymbol{m}_0,\overbrace{\boldsymbol{u}_0,\delta \boldsymbol{u}, \delta \boldsymbol{d}}^{\text{state variables}}, \overbrace{\boldsymbol{v}, \delta \boldsymbol{v}, \delta \boldsymbol{w}}^{\text{adjoint variables}} )=
\frac{1}{2}\| \delta \boldsymbol{d} - \delta \boldsymbol{d}^* \|_2^2 + \Re < \boldsymbol{v} , \boldsymbol{A}(\boldsymbol{m}_0) \boldsymbol{u}_0 - \boldsymbol{b} > + \Re < \delta \boldsymbol{v} , \boldsymbol{A}(\boldsymbol{m}_0) \delta \boldsymbol{u} + \omega^2 \text{diag}(\delta \boldsymbol{m}) \boldsymbol{u}_0 >
+ \Re < \delta \boldsymbol{w}, \delta \boldsymbol{d} - \boldsymbol{P} \delta \boldsymbol{u} >. \tag{14.6}\label{14.6}$$
Adjoint-state equation 1: \( \partial_{\delta \boldsymbol{d}} {\mathcal{L}} = 0 \)
$$\delta \boldsymbol{w} = - \left(\delta \boldsymbol{d} - \delta \boldsymbol{d}^* \right). \tag{14.7}\label{14.7}$$
Adjoint-state equation 2: \( \partial_{\delta \boldsymbol{u}} {\mathcal{L}} = 0 \)
$$\boldsymbol{A}(\boldsymbol{m}_0)^\dagger \delta \boldsymbol{v} = \boldsymbol{P}^T \delta \boldsymbol{w} = - \boldsymbol{P}^T \left( \delta \boldsymbol{d} - \delta \boldsymbol{d}^* \right). \tag{14.8}\label{14.8}$$
Adjoint-state equation 3: \( \partial_{\boldsymbol{u}_0} {\mathcal{L}} = 0 \)
$$\boldsymbol{A}(\boldsymbol{m}_0)^\dagger \boldsymbol{v} = - \omega^2 \text{diag}(\delta \boldsymbol{m}) \delta \boldsymbol{v}. \tag{14.9}\label{14.9} $$
Gradient: \( \partial_{\boldsymbol{m}_0} {\mathcal{L}} \)
$$\nabla_{\boldsymbol{m}_0} { {\mathcal{C}}(\boldsymbol{m}_0)} = \Re \left\{ \boldsymbol{u}_0^T \left( \frac{ \partial \boldsymbol{A}(\boldsymbol{m}_0)}{\partial \boldsymbol{m}_0} \right)^T \overline{\boldsymbol{v}} \right\} + \Re \left\{ \delta \boldsymbol{u}^T \left( \frac{ \partial \boldsymbol{A}(\boldsymbol{m}_0)}{\partial \boldsymbol{m}_0} \right)^T \overline{\delta \boldsymbol{v}} \right\}. \tag{14.10}\label{14.10}$$
The first term in the gradient is the zero-lag correlation between downgoing incident wavefield \( \boldsymbol{u}_0 \) and upgoing adjoint wavefield \( \boldsymbol{v} \), while the second one is the zero lag correlation between upgoing reflected wavefield \( \delta \boldsymbol{u} \) and downgoing back-propagated residual wavefield \( \delta \boldsymbol{v} \)
(Figure 14.4).

Figure 14.4: Sensitivity wavepath generated by a reflector. Left: Correlation between the source wavefield and the reflection of the receiver wavefield, first term in (\ref{14.10}). Right: correlation of the receiver wavefield with the reflection of the source wavefield from the reflector, second term in (\ref{14.10}).
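A compact 1D sketch of this RWI gradient, assembling the three state equations and the two adjoint equations above (signs follow the convention adopted here; all values are illustrative).
```python
import numpy as np

n, h, freq = 201, 10.0, 5.0
omega = 2.0 * np.pi * freq
m0 = np.full(n, 1.0 / 2000.0**2)                    # background squared slowness

# 1D Helmholtz operator A = omega^2 diag(m0) + D2 (Dirichlet ends).
D2 = (np.diag(np.full(n - 1, 1.0), -1) - 2.0 * np.eye(n)
      + np.diag(np.full(n - 1, 1.0), 1)) / h**2
A0 = omega**2 * np.diag(m0) + D2

b = np.zeros(n, dtype=complex); b[5] = 1.0          # source near the surface
u0 = np.linalg.solve(A0, b)                         # state equation 1 (14.1)

dm = np.zeros(n); dm[150] = 1e-8                    # known reflectivity spike
du = np.linalg.solve(A0, -omega**2 * dm * u0)       # state equation 2 (14.3)

P = np.zeros((1, n)); P[0, 10] = 1.0                # one receiver near the surface
d_obs = np.array([0.0 + 0.0j])                      # toy observed reflection data
resid = P @ du - d_obs

dv = np.linalg.solve(A0.conj().T, -P.T @ resid)           # adjoint eq. (14.8)
v = np.linalg.solve(A0.conj().T, -omega**2 * dm * dv)     # adjoint eq. (14.9)

# Gradient (14.10), elementwise since dA/dm0_j = omega^2 e_j e_j^T.
grad = omega**2 * np.real(u0 * np.conj(v) + du * np.conj(dv))
```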
Back to \ref{Content}
$$\tag{XV}\label{XV}$$
Source estimation in FWI
The temporal source signature is often unknown in FWI. The source wavelet can be estimated by solving a linear inverse problem, assuming the subsurface model is known. We assume that we estimate one source signature \( s(\omega) \) per common-shot gather. The source signature estimation is straightforwardly performed in the frequency domain, as shown hereafter. If the signature is to be estimated in the time domain, each spectral component is estimated as shown hereafter before taking an inverse Fourier transform.
Forward problem: $$\boldsymbol{A}(\omega,\boldsymbol{m}(\boldsymbol{x})) \boldsymbol{u}(\omega,\boldsymbol{x}) = s(\omega) \delta (\boldsymbol{x}-\boldsymbol{x}_s). \tag{15.1}\label{15.1} $$
Inverse problem: We seek to minimize the misfit function for the complex-valued unknown \( s\left(\omega\right) \):
$$\min_{s(\omega)} C(s(\omega)) = \min_{s(\omega)} \sum_{r} \lVert \boldsymbol{P}_r \boldsymbol{u}(\omega,\boldsymbol{x}) - d_r^*(\omega) \rVert^2 = \min_{s(\omega)} \sum_r \lVert \boldsymbol{P}_r \boldsymbol{A}^{-1}(\omega) \, s(\omega) \, \delta (\boldsymbol{x}-\boldsymbol{x}_s) - d_r^*(\omega) \rVert^2. \tag{15.2}\label{15.2}$$
Gradient of \( C \)$$\nabla_{s(\omega)} C(s(\omega)) = \sum_{r} \left( \boldsymbol{P}_r \left(\overline{\boldsymbol{A}^{-1}(\omega)}\right) \, \bar{s}(\omega) \, \delta (\boldsymbol{x}-\boldsymbol{x}_s) - \bar{d^*}_r(\omega) \right) \boldsymbol{P}_r \boldsymbol{A}^{-1}(\omega) \, \delta \left(\boldsymbol{x}-\boldsymbol{x}_s\right). \tag{15.3}\label{15.3}$$
To derive the previous equation, we use the rules of differentiation with respect to complex variables:
$$\partial_z z = 1 \; \; \; \; \; \; \partial_{z} \bar{z} = 0. \tag{15.4}\label{15.4}$$
Zeroing the gradient gives for \( s \):
$$s(\omega) = \frac{\sum_r \overline{\boldsymbol{P}_r \boldsymbol{A}^{-1}(\omega) \, \delta (\boldsymbol{x}-\boldsymbol{x}_s)} \cdot d^*_r}{\sum_r \lVert \boldsymbol{P}_r \boldsymbol{A}^{-1}(\omega) \, \delta (\boldsymbol{x}-\boldsymbol{x}_s) \rVert^2} = \frac{\sum_r \bar{g}(\omega,\boldsymbol{x}_r;\boldsymbol{x}_s) \cdot d_r^* }{ \sum_r \bar{g}(\omega,\boldsymbol{x}_r;\boldsymbol{x}_s) \cdot g(\omega,\boldsymbol{x}_r;\boldsymbol{x}_s) }, \tag{15.5}\label{15.5} $$
where \( g(\omega,\boldsymbol{x}_r;\boldsymbol{x}_s) \) denotes the value of the Green functions generated by the source and recorded at the receiver \( r \).
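A minimal numerical check of (15.5) for one shot and one frequency; the Green's functions g and the data below are synthetic stand-ins.
```python
import numpy as np

rng = np.random.default_rng(2)
n_rec = 64
g = rng.standard_normal(n_rec) + 1j * rng.standard_normal(n_rec)  # Green's functions at receivers
s_true = 3.0 - 2.0j                          # spectral amplitude of the unknown wavelet
d_obs = s_true * g                           # noise-free synthetic data

# Closed-form least-squares estimate: vdot conjugates its first argument,
# so this is sum(conj(g) * d) / sum(conj(g) * g) as in (15.5).
s_est = np.vdot(g, d_obs) / np.vdot(g, g)
assert np.allclose(s_est, s_true)
```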
Above, we assume the medium is known. In a FWI algorithm, the source wavelet estimation is often nested with the update of the subsurface medium according to the following iterative sequence:
- Compute incident Green functions by solving (\ref{15.1}).
- Estimate source wavelet for each shot gather with (\ref{15.5}).
- Compute the data residuals and the misfit function.
- Compute the adjoint wavefields.
- Compute the gradient for the subsurface model perturbations and estimate the step length.
- Update the subsurface medium.
Plessix and Cao (2011) recast the source-estimation problem in the framework of the adjoint-state method with two possible implementations. We now introduce a weighting operator \( \boldsymbol{W}\) in the data misfit function.
Implementation 1: The source signature is processed as an optimization parameter.
$${\mathcal{L}}\left(\boldsymbol{m},s,\boldsymbol{g},\boldsymbol{d},\boldsymbol{v}_1,\boldsymbol{v}_2 \right) = \sum_r w_r \lVert d_r - d^*_r \rVert^2 + \Re <\boldsymbol{v}_1 , \boldsymbol{A} \boldsymbol{g} - \delta \left( \boldsymbol{x} - \boldsymbol{x}_s \right) > + \sum_r \overline{v_{2,r}} \cdot \left( d_r - s \,\boldsymbol{P}_r \boldsymbol{g} \right), \tag{15.6}\label{15.6}$$
where \( \boldsymbol{g} \) denotes the Green functions.
The adjoint-state equations are
$$\boldsymbol{A}^T \overline{\boldsymbol{v}}_1 = s \, \boldsymbol{P}^T \overline{\boldsymbol{v}_2}, ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
\boldsymbol{v}_2 = {\boldsymbol{W}}_d \, \left( \boldsymbol{d} - \boldsymbol{d}^* \right). \tag{15.7}\label{15.7}$$
- Gradient of the misfit function with respect to \( s \)
$$\nabla C_{s}(\boldsymbol{m},s) = \left[ \boldsymbol{W} \, (\boldsymbol{d} - \boldsymbol{d}^*) \right]^\dagger \boldsymbol{P} \boldsymbol{g} = 0. \tag{15.8}\label{15.8}$$
This gives (remember that \( \boldsymbol{d} = s \,\boldsymbol{P} \boldsymbol{g} \)):
$$s = \frac{ \left(\boldsymbol{P} \boldsymbol{g}\right)^\dagger \, \boldsymbol{W} \, \boldsymbol{d}^* }{ \left( \boldsymbol{P} \boldsymbol{g} \right)^\dagger \boldsymbol{W} \boldsymbol{P} \boldsymbol{g} }. \tag{15.9}\label{15.9}$$
- Gradient of the misfit function with respect to \( \boldsymbol{m} \)
$$\nabla_\boldsymbol{m} C(\boldsymbol{m},s) = \Re \left\{ \boldsymbol{g}^T \left( \frac{ \partial \boldsymbol{A} }{\partial \boldsymbol{m} } \right)^T \bar{\boldsymbol{v}}_1 \right\}
= \Re \left\{ s \, \boldsymbol{g}^T \left( \frac{ \partial \boldsymbol{A} }{\partial \boldsymbol{m} } \right)^T \left( \boldsymbol{A}^{-1} \right)^T \boldsymbol{P}^T \boldsymbol{W} \, ( \overline{\boldsymbol{d} - \boldsymbol{d}^*} ) \right\}. \tag{15.10}\label{15.10}$$
Implementation 2: the source is processed as a state variable.
The previous implementation requires using the same data-space weighting operator \( \boldsymbol{W} \) to compute \( s \) and \( \nabla_{\boldsymbol{m}} C \). This can be limiting for specific applications (see, for example, Operto et al., 2006): data weighting may not be desired when estimating \( s \) if the short-offset data are the most suitable for source estimation.
Plessix and Cao (2011) propose to use \( s \) as a state variable to gain some flexibility in the inversion design. The state equation satisfied by \( s \) is the solution of the linear inverse problem for \( s \), where we have the flexibility to choose a suitable weighting operator in the functional:
$$C_s(s)= \left( s \,\boldsymbol{P} \boldsymbol{g} - \boldsymbol{d}^* \right)^\dagger {\boldsymbol{W}}_s \left( s \,\boldsymbol{P} \boldsymbol{g} - \boldsymbol{d}^* \right). \tag{15.11}\label{15.11}$$
- State equation satisfied by the state \( s \):
$${\boldsymbol{r}}^\dagger {\boldsymbol{W}}_s {\boldsymbol{r}} \, s - {\boldsymbol{r}}^\dagger {\boldsymbol{W}}_s \boldsymbol{d}^* = 0, \tag{15.12}\label{15.12} $$
where we introduce the state variable \( {\boldsymbol{r}} =\boldsymbol{P} \boldsymbol{g} \).
$${\mathcal{L}}\left( \boldsymbol{m},\boldsymbol{g},{\boldsymbol{r}},\boldsymbol{d},s,\boldsymbol{v}_1,\boldsymbol{v}_2,\boldsymbol{v}_3, v_4\right) = \left( \boldsymbol{d} - \boldsymbol{d}^* \right)^\dagger {\boldsymbol{W}}_d \left( \boldsymbol{d} - \boldsymbol{d}^* \right) + \Re < \boldsymbol{v}_1 , \boldsymbol{A} \boldsymbol{g} - \delta \left( \boldsymbol{x} - \boldsymbol{x}_s \right) >
+ < \boldsymbol{v}_2 , {\boldsymbol{r}} -\boldsymbol{P} \, \boldsymbol{g} > + < \boldsymbol{v}_3 , \boldsymbol{d} - s \, {\boldsymbol{r}} > + < v_4, {\boldsymbol{r}}^\dagger {\boldsymbol{W}}_s {\boldsymbol{r}} \, s - {\boldsymbol{r}}^\dagger {\boldsymbol{W}}_s \boldsymbol{d}^* >. \tag{15.13}\label{15.13}$$
$$\frac{\partial {\mathcal{L}} }{\partial \boldsymbol{g} } = 0 \Longrightarrow \boldsymbol{A}^T \, \boldsymbol{v}_1 =\boldsymbol{P}^T \boldsymbol{v}_2. \tag{15.14}\label{15.14}$$
$$\frac{\partial {\mathcal{L}} }{\partial \boldsymbol{r} } = 0 \Longrightarrow \boldsymbol{v}_2 = s \, \left( \boldsymbol{v}_3- \, v_4 {\boldsymbol{W}}_s \boldsymbol{r} \right). \tag{15.15}\label{15.15}$$
$$\frac{\partial {\mathcal{L}} }{\partial \boldsymbol{d} } = 0 \Longrightarrow \boldsymbol{v}_3 = - {\boldsymbol{W}}_d \left( \boldsymbol{d} - \boldsymbol{d}^* \right). \tag{15.16}\label{15.16}$$
$$\frac{\partial {\mathcal{L}} }{\partial s } = 0 \Longrightarrow
v_4 = \frac{ \boldsymbol{v}_3^\dagger \boldsymbol{r}}{ \boldsymbol{r}^\dagger {\boldsymbol{W}}_s \boldsymbol{r} } = - \bar{s} \frac{ \boldsymbol{r}^\dagger {\boldsymbol{W}}_d \boldsymbol{r} }{ \boldsymbol{r}^\dagger {\boldsymbol{W}}_s \boldsymbol{r} } +\frac{ {\boldsymbol{d}^*}^\dagger {\boldsymbol{W}}_d \boldsymbol{r} }{ \boldsymbol{r}^\dagger {\boldsymbol{W}}_s \boldsymbol{r}}. \tag{15.17}\label{15.17}$$
If \( {\boldsymbol{W}}_d = {\boldsymbol{W}}_s \), then \( v_4 = 0 \) and we recover the same expression of the adjoint wavefield as in implementation 1.
$$\nabla_m C = \Re < \boldsymbol{v}_1 , \frac{ \partial \boldsymbol{A} }{ \partial \boldsymbol{m} } \, \boldsymbol{g} >. \tag{15.18}\label{15.18}$$
Wavelet estimation can also serve as a QC of FWI results. Sequence: [1] compute \( s \) with the state equation (\ref{15.12}); [2] \( v_4 \); [3] \( \boldsymbol{v}_3 \); [4] \( \boldsymbol{v}_2 \); [5] \( \boldsymbol{v}_1 \).
From Malinowski et al. (2011). Other examples are shown in Prieux et al. (2011) and Operto et al. (2015).
A remark: how to numerically check an adjoint-state implementation with the inner-product test (Kern, 2010, page 110).
$$\partial_u F(a,u) \, v = b$$
$$\partial_u F(a,u)^\dagger \, p = c $$
By definition of the adjoint,
$$<\partial_u F(a,u) \, v, p > = < \partial_u F(a,u)^\dagger p , v>$$
that is
$$<b,p> = <c,v>$$
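A minimal sketch of this inner-product test, with a random complex matrix standing in for the linearized operator:
```python
import numpy as np

rng = np.random.default_rng(3)
B = rng.standard_normal((30, 20)) + 1j * rng.standard_normal((30, 20))
v = rng.standard_normal(20) + 1j * rng.standard_normal(20)
p = rng.standard_normal(30) + 1j * rng.standard_normal(30)

lhs = np.vdot(B @ v, p)            # <B v, p>
rhs = np.vdot(v, B.conj().T @ p)   # <v, B^dagger p>
assert np.allclose(lhs, rhs)       # holds to machine precision if the adjoint is correct
```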
Back to \ref{Content}
$$\tag{XVI}\label{XVI}$$
Multi-parameter FWI
$$\tag{XVII}\label{XVII}$$
Appendix 1: Few words about the method of Lagrange multiplier