WIND>Main/FWItuto>WebHome (29 Jan 2023, operto)

# FWI in Exploration Geophysics: a Tutorial

This a tutorial on the principles of FWI (in perpetual construction). We develop the theory in the frequency domain for sake of compact notations. However, we will periodically make the link with the time domain formulation since the frequency-domain matrix formulation requires a higher level of abstraction and hence may be difficult to conceptualized after a first reading.

========================================================

$$\tag{Content}\label{Content}$$

### Content

click on the section number to get there

$$\ref{I}$$ Notation and problem statement

$$\ref{II}$$ FWI with local gradient-based optimization methods

$$\ref{III}$$ Generic expression of gradient and Hessian

$$\ref{IV}$$ The gradient and Hessien of FWI: imaging principle

$$\ref{V}$$ Efficient computation of the sensitivity matrix

$$\ref{VI}$$ Efficient computation of the gradient

$$\ref{VII}$$ Computing the gradient of the misfit function with the adjoint-state method

$$\ref{VIII}$$ The adjoint-state method with Lagrangian functional

$$\ref{IX}$$ Role of the Hessian

$$\ref{X}$$ Resolution analysis

$$\ref{XI}$$ The cycle skipping pathology

$$\ref{XII}$$ Regularization of FWI

$$\ref{XIII}$$ FWI and least-squares Reverse Time Migration

$$\ref{XIV}$$ Reflection Waveform Inversion (RWI)

$$\ref{XV}$$ Source estimation in FWI

$$\ref{XVI}$$ Multi-parameter FWI

$$\ref{XVII}$$ Appendix 1: Few words about the method of Lagrange multiplier

========================================================

$$\tag{I}\label{I}$$

### Notation and problem statement

Let's consider the time-harmonic (Helmholtz) equation written in matrix form as

$$\boldsymbol{A} \left( \omega,\boldsymbol{m}(\boldsymbol{x}) \right)\boldsymbol{u}(\omega,\boldsymbol{x}) = \boldsymbol{b}(\omega,\boldsymbol{x}), \tag{1.1}\label{1.1}$$

where $$\boldsymbol{A} = \omega^2 \text{diag}(\boldsymbol{m}) + \Delta$$ denotes the Helmholtz operator, $$\omega$$ is the angular frequency, $$\Delta = \partial_{xx}+\partial_{zz}$$ is the Laplace operator, $$\boldsymbol{m} \in \Bbb{R}^{N \times 1}$$ is the squared slownesses, $$\boldsymbol{u} \in \Bbb{C}^{N \times 1}$$ is the monochromatic wavefield and $$\boldsymbol{b} \in \Bbb{C}^{N \times 1}$$ is the monochromatic source.

u and b are vectors of dimension N, where N denotes the number of degrees of freedom discretizing the subsurface medium m. Accordingly, A is a full rank N x N squared matrix.

Unless explicitely stated otherwise, we will drop the dependency to ω in equation $$(\ref{1.1})$$ for sake of compact notations

$$\boldsymbol{A} (\boldsymbol{m}) \boldsymbol{u}= \boldsymbol{b}. \tag{1.2}\label{1.2}$$

This means that we consider a mono-frequency problem. However, multiple frequencies can be jointly processed by FWI through discrete summation in a similar manner to the time-domain formulation which proceeds by summation over time.

Equation $$(\ref{1.2})$$ is called the state equation which can be written in generic form as

$$F(\boldsymbol{u},\boldsymbol{m}) =0, \tag{1.3}\label{1.3}$$

where u is the state variable and m the model parameters.

FWI aims to estimate the subsurface parameters m from parsimonious measurements of u.

Accordingly, FWI can be formulated as the following multivariate PDE-constrained optimization problem

$$\min_{\boldsymbol{u},\boldsymbol{m}} \| \boldsymbol{P u} - \boldsymbol{d}^*\|_2^2 ~~ \text{subject to} ~~ \boldsymbol{A}(\boldsymbol{m}) \boldsymbol{u}= \boldsymbol{b}, \tag{1.4a}\label{1.4a}$$

or equivalently, $$\min_{\boldsymbol{u},\boldsymbol{m}} \| \boldsymbol{P u} - \boldsymbol{d}^*\|_2^2 ~~ \text{subject to} ~~ F(\boldsymbol{u},\boldsymbol{m}) =0,\tag{1.4b}\label{1.4b}$$

where d* denotes the observations (the recorded data) and the linear observation operator P samples the wavefield $$\boldsymbol{ u}$$ at the receiver positions. For sake of compactness, we consider a single source problem but the extension to multiple sources follows easily through a summation over sources. P is a M x N matrix where M << N denotes the number of measurements.

The equation relating the wavefield solution u (the state variable) to the measurements $$\boldsymbol{d}^*$$ through the observation operator P is called the observation equation

$$\boldsymbol{P u} - \boldsymbol{d}^*. \tag{1.5}\label{1.5}$$

From equation $$(\ref{1.2})$$, let's consider the closed-form expression of u as a nonlinear function of m

$$\boldsymbol{u}= \boldsymbol{A}^{-1}(\boldsymbol{m})\boldsymbol{b}, \tag{1.6}\label{1.6}$$

where the columns of $$\boldsymbol{A}^{-1}$$ are the Green functions for point sources located at each grid point of the computational domain (Indeed, $$\boldsymbol{A} \boldsymbol{A}^{-1} = \boldsymbol{I}$$, where the identity matrix $$\boldsymbol{I}$$ gathers the impulsional sources).

In compact form,

$$\boldsymbol{u}= \boldsymbol{S}(\boldsymbol{m}). \tag{1.7}\label{1.7}$$

With this notation, the observation equation reads

$$\boldsymbol{d}= \boldsymbol{P} \boldsymbol{S}(\boldsymbol{m})=\phi(\boldsymbol{m}). \tag{1.8}\label{1.8}$$

In the framework of the so-called reduced-space formulation of FWI, we can recast the multi-variate constrained problem, $$(\ref{1.4a})$$, as a mono-variate unconstrained problem by enforcing the closed-form expression of u as a function of m, $$(\ref{1.6})$$, in the data misfit function (the first term of $$(\ref{1.4a})$$). Simply put, we force the constraint to be strictly satisfied. This shrinks the FWI search space for the benefit of computational efficiency. This variable elimination is called variable projection .

$$\min_{\boldsymbol{m}} \| \boldsymbol{P A}(\boldsymbol{m})^{-1} \boldsymbol﻿{b} - \boldsymbol{d}^*\|_2^2. \tag{1.9}\label{1.9}$$

namely,

$$\min_{\boldsymbol{m}} \| \phi(\boldsymbol{m}) - \boldsymbol{d}^*\|_2^2. \tag{1.10}\label{1.10}$$

Back to \ref{Content}

$$\tag{II}\label{II}$$

### FWI with local gradient-based optimization methods

Let's denote by $$C(\boldsymbol{m})$$ the nonlinear objective function associated with the minimization problem $$(\ref{1.10})$$,

$$C(\boldsymbol{m}) = \| \phi(\boldsymbol{m}) - \boldsymbol{d}^*\|_2^2. \tag{2.1}\label{2.1}$$

To solve $$(\ref{1.10})$$ with local optimization techniques, we consider a surrogate locally-quadratic approximation of $$C(\boldsymbol{m})$$. This surrogate function $$\tilde{C}$$ majorizes C (Figure 1) and therefore Newton methods belong to the category of Majorization-Minimization (MM) method (Lange, 2004). To build the quadratic surrogate function around an initial guess $${\boldsymbol{m}}_0$$, we apply a second-order Taylor-Lagrange development on $$C(\boldsymbol{m})$$

$$\tilde{C}({\boldsymbol{m}}_0 + \Delta {\boldsymbol{m}},{\boldsymbol{m}}_0)=C({\boldsymbol{m}}_0)+\sum_{j=1}^{M} \frac{\partial C({\boldsymbol{m}}_0)}{\partial m_j} \Delta m_j + \frac{1}{2} \sum_{j=1}^{M} \sum_{k=1}^{M} \frac{\partial^2 C({\boldsymbol{m}}_0)}{\partial m_j \partial m_k} \Delta m_j \Delta m_k, \tag{2.2}\label{2.22}$$

where $$\Delta ({\boldsymbol{m}})$$ is a small perturbation of $${\boldsymbol{m}}_0$$. The second argument of $$\tilde{C}$$ is here to remind the point around which we linearize the problem. This point changes at each FWI iteration (Figure 2.1).

In matrix form,

$$\tilde{C}({\boldsymbol{m}},{\boldsymbol{m}}_0)=C({\boldsymbol{m}}_0)+\left( \frac{\partial C({\boldsymbol{m}}_0)}{\partial \boldsymbol{m}} \right)^T (\boldsymbol{m}-\boldsymbol{m}_0) +(\boldsymbol{m}-{\boldsymbol{m}}_0)^T \left( \frac{\partial^2 C({\boldsymbol{m}}_0)}{\partial \boldsymbol{m}} \right) ( \boldsymbol{m} - \boldsymbol{m}_0). \tag{2.2}\label{2.2}$$

Note that $$\tilde{C}({\boldsymbol{m}}_0,{\boldsymbol{m}}_0)=C({\boldsymbol{m}}_0)$$, $$\nabla_{m} \tilde{C}({\boldsymbol{m}}_0,{\boldsymbol{m}}_0) = \nabla_{m} C({\boldsymbol{m}}_0)$$ and $$\partial^2_m \tilde{C}({\boldsymbol{m}}_0,{\boldsymbol{m}}_0) = \partial^2_m C({\boldsymbol{m}}_0)$$. Figure 2.1: At each iterate, the misfit function $$C(m)$$ is majorized by a quadratic function $$\tilde{C}(m,m_k)$$ such that $$\tilde{C}(m_k,m_k)=C(m_k)$$ and $$\nabla_m \tilde{C}(m_k,m_k) = \nabla_m C(m_k)$$. Provided that $$\tilde{C}(m,m_k)$$ is sufficiently close to $$C(m)$$ in the vicinity of $$m_k$$, $$\tilde{C}(m,m_k)$$ majorizes $$C(m)$$ and converges toward the minimizer of $$C(m)$$ as k tends to $$\infty$$ (see the textbook of K. Lange for more details).

Let's differentiate $$\tilde{C}$$ with respect to the parameter $$m_l$$

$$\frac{\partial \tilde{C}(\boldsymbol{m},\boldsymbol{m}_0)}{\partial m_l}=\frac{\partial C({\boldsymbol{m}}_0)}{\partial m_l} +\sum_{j=1}^{M} \frac{\partial^2 C({\boldsymbol{m}}_0)}{\partial m_j \partial m_l} \Delta m_j. \tag{2.3}\label{2.3}$$

In matrix form,

$$\frac{\partial \tilde{C}({\boldsymbol{m}},\boldsymbol{m}_0)}{\partial {\boldsymbol{m}} }=\frac{\partial C({\boldsymbol{m}}_0)}{\partial \boldsymbol{m}}+\frac{\partial^2 C({\boldsymbol{m}}_0)}{\partial {\boldsymbol{m}}^2} \Delta {\boldsymbol{m}}. \tag{2.4}\label{2.4}$$

The minimum of $$\tilde{C}$$ is reached when the first derivative of $$\tilde{C}$$ vanishes.

$$\frac{\partial \tilde{C}(\boldsymbol{m},\boldsymbol{m}_0)}{\partial \boldsymbol{m}}=\boldsymbol{0}. \tag{2.5}\label{2.5}$$

This leads to the so-called Newton system:

$$\frac{\partial^2 C({\boldsymbol{m}}_0)}{\partial {\boldsymbol{m}}^2} \, \Delta {\boldsymbol{m}} = - \frac{\partial C({\boldsymbol{m}}_0)}{\partial \boldsymbol{m}}. \tag{2.6}\label{2.6}$$

The right-hand side is the opposite of the gradient of C (the steepest descent direction ). The left operator is the Hessian matrix , the second derivative of $$C$$.

A Newton algorithm iterative updates m through the recursion

$$\boldsymbol{m}^{k+1}= \boldsymbol{m}^k + \alpha \Delta \boldsymbol{m}^k,\tag{2.7}\label{2.7}$$

where the descent direction $$\Delta \boldsymbol{m}^k$$ is given by

$$\Delta {\boldsymbol{m}}^{k} = - \left( \frac{\partial^2 C({\boldsymbol{m}}^k)}{\partial {\boldsymbol{m}}^2} \right)^{-1} \, \frac{\partial C({\boldsymbol{m}}^k)}{\partial \boldsymbol{m}}, \tag{2.8}\label{2.8}$$

where $$k$$ denotes the iteration count, and $$\alpha$$ the step length . The step length controls the quantity of descent as illustrated in Figure 2.2. Figure 2.2: Sketch illustration of the impact of the step length on the quantity of descent at a given iteration.

Remark: It is generally not possible to build explicitly the inverse of the Hessian in (\ref{2.6}). Therefore, we generally resort to iterative methods to solve the Newton system (\ref{2.6}) such as steepest-descent (along the parabola) or linear conjugate gradient method. This iterative algorithm introduces an inner loop within the outer nonlinear FWI loop, which requires to perform Hessian-vector product. To check this, note that

$$\nabla_{\boldsymbol{m}} \tilde{C}({\boldsymbol{m}},{\boldsymbol{m}}_0) = \nabla_{\boldsymbol{m}} C({\boldsymbol{m}}_0) +\left( \frac{\partial^2 C({\boldsymbol{m}}_0)}{\partial \boldsymbol{m}} \right) ( \boldsymbol{m} - \boldsymbol{m}_0). \tag{2.9}\label{2.9}$$

This Hessian-vector product can be computed with second-order adjoint state method (Metivier et al., 2017) or the BFGS algorithm ( Nocedal, 2006).﻿

Alternatively, we use steepest descent $$( \partial^2 C({\boldsymbol{m}}^k) / \partial {\boldsymbol{m}}^2 \approx \boldsymbol{I} )$$ , nonlinear conjugate gradient or limited-memory quasi-Newton method (l-BFGS) to approximate the Newton search direction (\ref{2.8}) in the outer FWI loop.

Back to \ref{Content}

$$\tag{III}\label{III}$$

### Generic expression of the gradient and Hessian

Let's compute the $$k^{th}$$ element of the gradient of the misfit function.

$$\frac{\partial C({\boldsymbol{m}}) }{\partial m_k} = \frac{1}{2} \frac{ \partial \left( \left( \phi({\boldsymbol{m}}) - {\boldsymbol{d}}^* \right)^\dagger \left( \phi({\boldsymbol{m}}) - {\boldsymbol{d}}^* \right) \right) }{\partial m_k}\tag{3.1}\label{3.1}$$

In discrete form,

$$\frac{\partial C(\boldsymbol{m})}{\partial m_k}=\frac{1}{2} \sum_{i=1}^{M} \frac{ \partial \left( \left(\phi_{i}-d_{i}^*\right) \left(\overline{\phi_{i}-{d}_{i}^*}\right) \right) }{\partial m_k},\tag{3.2}\label{3.2}$$

where $$\phi_{i}$$ denotes the elements of the vector $$\phi(\boldsymbol{m})$$ (the synthetic data computed in the model $${\boldsymbol{m}})$$. $$M$$ denotes the number of observations. $$\bar{a}$$ denotes the conjugate of $$a$$. $$^\dagger$$ denotes the Hermitian (conjugate) transpose.

$$\frac{\partial C(\boldsymbol{m})}{\partial m_k}=-\frac{1}{2} \sum_{i=1}^{M} \left\{ \frac{\partial \phi_{i}}{\partial m_k} \left(\overline{\phi_{i}-d^*}\right)+ \left(\phi_{i}-d_{i}^*\right) \left( \overline{\frac{\partial \phi_{i}}{\partial m_k} }\right) \right\}.\tag{3.3}\label{3.3}$$

$$\frac{\partial C(\boldsymbol{m})}{\partial m_k}= -\sum_{i=1}^{M} \Re \left\{\left( \overline{\frac{\partial \phi_{i}}{\partial m_k}} \right) \left(\phi_{i}-d^*_{i}\right) \right\},\tag{3.4}\label{3.4}$$

where $$\Re\left\{a\right\}$$ denotes the real part of $$a$$ and we assume that $$m$$ is real valued. Going back to matrix notation,

$$\frac{\partial C(\boldsymbol{m})}{\partial m_k}= -\Re \left\{ \left( \frac{\partial \phi({\boldsymbol{m}}) }{\partial m_k} \right)^\dagger \left( \phi({\boldsymbol{m}}) - {\boldsymbol{d}}^* \right) \right\}.\tag{3.5}\label{3.5}$$

$$\frac{\partial C({\boldsymbol{m}})}{\partial \boldsymbol{m}}= -\Re \left\{ \left( \frac{\partial \phi({\boldsymbol{m}}) }{\partial {\boldsymbol{m}} } \right)^\dagger \left( \phi({\boldsymbol{m}}) - {\boldsymbol{d}}^* \right) \right\}=-\Re \left\{ {\boldsymbol{J}}^\dagger ({\boldsymbol{m}}) \Delta {\boldsymbol{d}}({\boldsymbol{m}}) \right\}.\tag{3.6}\label{3.6}$$

The $$M \times N$$ matrix $$\boldsymbol{J}$$ is called the sensitivity matrix or the Jacobian matrix or the Fréchet derivative matrix . The $$M$$ vector $$\Delta {\boldsymbol{d}}({\boldsymbol{m}})$$ contains the data residuals. The gradient is formed by the zero-lag correlation between the partial derivative of the data with respect to the parameters and the data residuals.

Remark:
• If we consider real-valued time-dependent data (as for seismograms), the gradient becomes
$$\frac{\partial C(\boldsymbol{m})}{\partial m_k}= - {\boldsymbol{J}}^T ({\boldsymbol{m}}) \Delta {\boldsymbol{d}}({\boldsymbol{m}}), \tag{3.7}\label{3.7}$$

which should be interpretated as a zero-lag correlation between the partial derivative of $$\phi$$ and $$\Delta \boldsymbol{d}$$.

Explicitly,

$$\frac{\partial C(\boldsymbol{m})}{\partial m_k}= -\int_t \sum_r \frac{\partial \phi_r(t + \tau)}{\partial m_k} \Delta d_r(t) dt, \tag{3.8}\label{3.8}$$

with $$\tau$$ = 0 and $$r$$ labels the positions where the measurements are performed.

Behind correlation, there is indeed the idea of comparison of two signals, one of them been delayed by $$\tau$$. When there is similarity, the correlation function has a high amplitude. Here, we perform the comparison of the signals $$\partial \phi / \partial \boldsymbol{m}$$ and $$\Delta \boldsymbol{d}$$ for only one lag equal to 0 because the output signal (the gradient) is independent to time. Note that FWI approaches that extend the search space re-introduces a dependency of the gradient to time to extend the search space of FWI and relax the requirement of accurate velocity model (for example, see Biondi and Almomin (2014) and section XI).
• In the frequency domain, if multiple frequencies are processed jointly, the vector $$\Delta \boldsymbol{d}$$ gathers all the frequencies involved in the inversion and the gradient is computed by summation over frequencies (for the same reason as above, the gradient is independent to frequencies for a full bandwidth signal).
• If $$m$$ is complex-valued (for example, when attenuation is considered in frequency-domain FWI), we need to use Wirtinger derivative.
$$\frac{\partial }{\partial z } = \frac{1}{2} \left( \frac{\partial }{\partial x } -i \frac{\partial }{\partial y } \right) ~~~~ \frac{\partial }{\partial \bar{z} } = \frac{1}{2} \left( \frac{\partial }{\partial x } +i \frac{\partial }{\partial y } \right), \tag{3.9}\label{3.9}$$

where $$z = x + i y$$.
##### Hessian:

Let's now compute the element $$m,l$$ of the second-derivative of the misfit function.

$$\frac{\partial^2 C({\boldsymbol{m}})}{\partial m_k \partial m_l}=-\sum_{i=1}^{M} \Re \left\{ \left( \overline{\frac{\partial^2 \phi_i }{\partial m_k \partial m_l} } \right) (\phi_i-d_i^*) \right\} +\sum_{i=1}^{M} \Re \left\{ \left( \overline{ \frac{\partial \phi_i }{\partial m_k} } \right) \frac{\partial \phi_i }{\partial m_l} \right\}.\tag{3.10}\label{3.10}$$

Going back to matrix notation,

$$\frac{\partial^2 C({\boldsymbol{m}})}{\partial {\boldsymbol{m}}^2}=\Re \left\{ {\boldsymbol{J}}^\dagger {\boldsymbol{J}} \right\}+\Re \left\{ \frac{\partial {\boldsymbol{J}}^t}{\partial {\boldsymbol{m}}^T} \left(\overline{\Delta {\boldsymbol{d}}} ~ ... ~ \overline{\Delta {\boldsymbol{d}}} \right) \right\}.\tag{3.11}\label{3.11}$$

The linear term of the Hessian, which is sometimes referred to as approximate Hessian or Gauss-Newton Hessian , is given by:

$${\boldsymbol{H}}_a = \Re \left\{ {\boldsymbol{J}}^\dagger {\boldsymbol{J}} \right\}.\tag{3.12}\label{3.12}$$

The linear term in the Hessian represents the sum of the zero-lag correlation between the partial derivative of the data with respect to two different
model parameters $$m_l$$ and $$m_k$$.

$$h_{a_{lk}} = \Re \left\{ \sum_{i=1}^{M} \left( \overline{ \frac{ \partial \phi_i }{\partial m_l} } \right) \, \frac{ \partial \phi_i }{\partial m_k}. \right\}. \tag{3.13}\label{3.13}$$

The second term is non linear. It means that it would be zero if the forward problem equation was linear. It corrects the gradient from artifacts resulting from the erroneous interpretation of double-scattering effects as single-scattering ones.

Back to \ref{Content}

$$\tag{IV}\label{IV}$$

### The gradient and Hessian of the FWI problem: imaging principle

This section is inspired from Pratt et al. (1998). To develop the formal expression of the FWI gradient, we can compute the partial derivative of $$\boldsymbol{u}$$ with respect to $$\boldsymbol{m}$$; then, sample them at the receiver positions with $$\boldsymbol{P}$$ to build the sensitivity matrix. Then, perform the matrix vector product between the sensitivity matrix and the conjugate of the data residuals according to (\ref{3.6}).

We start by differentiating the forward-problem equation with respect to $$\boldsymbol{m}$$

$$\boldsymbol{A}(\boldsymbol{m}) \boldsymbol{u} = \boldsymbol{b}.\tag{4.1}\label{4.1}$$

$$\frac{ \partial \boldsymbol{A}(\boldsymbol{m})}{\partial m_j} \boldsymbol{u} + \boldsymbol{A}(\boldsymbol{m}) \frac{\partial \boldsymbol{u}}{\partial m_j} = \boldsymbol{0}. \tag{4.2}\label{4.2}$$

We indeed use the fact that $$\boldsymbol{b}$$ does not depend on $$\boldsymbol{m}$$.

$$\boldsymbol{A}(\boldsymbol{m}) \frac{\partial \boldsymbol{u}}{\partial m_j} = - \frac{ \partial \boldsymbol{A}(\boldsymbol{m})}{\partial m_j} \boldsymbol{u}. \tag{4.3}\label{4.3}$$

The partial derivative of $$\boldsymbol{u}$$ with respect to $$m_j$$ is a wavefield that satisfies the wave equation for the virtual scattering source $$\boldsymbol{f}^{(j)} = - \left( \partial \boldsymbol{A}(\boldsymbol{m}) / \partial m_j \right) \boldsymbol{u}$$. The spatial support of this source is centred on $$m_j$$ and its temporal support is centred on the arrival time of $$\boldsymbol{u}$$ at $$m_j$$. The first statement is shown by the fact that the radiation pattern matrix $$\partial \boldsymbol{A}(\boldsymbol{m}) / \partial m_j$$ is a sparse matrix, whose non zero coefficients sample $$\boldsymbol{u}$$ in the close vicinity of $$m_j$$ (Figure 4.1). The second one is shown by the fact that $$\partial \boldsymbol{A}(\boldsymbol{m}) / \partial m_j$$ is applied on $$\boldsymbol{u}$$. Figure 4.1: Pattern of the impedance matrix when discretized by finite differences on Cartesian grid with a compact 9-point stencil. If we take the partial derivative of the coefficients of the matrix with respect to $$m_i$$ where $$m$$ is the squared slowness and if the mass matrix is discretized with a lumped mass, then the resulting matrix has one non zero coefficient on the diagonal $$i$$ whose amplitude is $$\omega^2$$.

The partial derivative wavefield has an obvious interpretation: it is the wavefield single scattered by a point diffractor located at the position of the model parameter which respect to which we compute the partial derivative of the wavefield (Figure 4.2). The radiation pattern of the scattering source depends on the subsurface parametrization with which the wave equation is formulated (for example, wavespeed-density, wavespeed-impedance, density-impedance for the acoustic wave equation) and the parameter type in this parametrization with respect to which the partial derivative of u is computed (Figure 4.3). Figure 4.2: Sketch of the partial derivative wavefield. In a homogeneous background model for a surface acquisition,the partial derivative of the data with respect to one parameter is shown by a diffraction hyperbolae (top panel). If the parameter is the wavespeed for a (velocity-density) parametrization the scattering source has en isotropic radiation pattern (as a pressure source). Figure 4.3: Snapshot of partial derivative wavefields computed numerically for a source located at the vertex of the model parameter. (a): Vp parameter in the (Vp-rho) parametrization. (b) Rho parameter in the (Vp-rho) parametrization. (c) Vp parameter in the (Vp-Ip) parametrization. (d) Ip parameter in the (Vp-Ip) parametrization. Vp: wavespeed. Rho: Density. Ip: impedance. The green curves are analytical radiation pattern computed in the framework of the ray+Born approximation. In (a), the virtual source associated with Vp is a isotropic pressure source. In (b) the virtual source associated with rho is a force pointing toward the source (in this paticular case, a vertical force).

The FWI gradient can be interpreted in the framework of diffraction tomography as reviewed below. The gradient of the misfit function is the the zero-lag correlation between the Fréchet-derivative seismograms and the data-residual seismograms (see previous section)/

$$\frac{\partial C(\boldsymbol{m}) }{\partial m_j} = -\Re \left[ \left( \frac{ \partial \phi }{\partial m_j} \right)^T \overline{ \Delta \boldsymbol{d} } \right], \tag{4.4}\label{4.4}$$

where

$$\frac{\partial \phi }{\partial m_j} = \boldsymbol{P} \frac{\partial \boldsymbol{u} }{\partial m_j} = - \boldsymbol{P} \boldsymbol{A}^{-1} \left( \frac{ \partial \boldsymbol{A} }{ \partial m_j } \right) \boldsymbol{u} = - \boldsymbol{P} \boldsymbol{A}^{-1} \boldsymbol{f}^{(j)},\tag{4.5}\label{4.5}$$

and $$\boldsymbol{P} \boldsymbol{u} = \boldsymbol{u}(\boldsymbol{x}_r)$$ where $$\boldsymbol{x}_r \in {\mathcal{D}}_r$$ denotes the receiver positions.

The data-residual seismograms (the difference by the observations and the simulated data in the initial model) can be interpreted as the recording at receiver positions of wavefield scattered by all the missing (unknown) heterogeneities in the initial model. The Fréchet-derivative seismograms can be interpreted as the recording at receiver positions of the wavefield scattered by a small perturbation located at $$m_j$$, all the other parameters being kept fixed. The zero-lag correlation aims to pick in the data residuals (assuming traveltime correspondance of the partial derivative wavefields and the data residuals) the piece of information that must be transformed into a model perturbation at position of $$m_j$$. This interpretation is illustrated schematically in Figure 4.4. Figure 4.4: Schematic interpretation of the FWI gradient. The true medium is homogeneous with three small circular inclusions (circles in (c)). The background model $$\boldsymbol{m}_0$$ lacks the three inclusions. The data residual seismograms are therefore the three diffraction hyperbolae shown in (b). The partial derivative of the data with respect to the parameter located at the position of the yellow inclusion is the diffraction hyperbolae shown in (a). The zero lag correlation between the seismograms shown in (a) and (b) aims to extract from (b) the residuals that have been generated by a missing heterogenity located at the position of the yellow circle: if we take a zero-like correlation of (a) and (b) (time wise multiplication) and integrate over time, most of the contribution will come from the diffraction hyperbole in (b) that matches that shown in (a). Indeed, this imaging principles works if the background model m0 is sufficiently kinematically accurate (see the following of the tutorial for the meaning of "sufficiently").

Back to \ref{Content}

$$\tag{V}\label{V}$$

### Efficient Computation of the sensitivity matrix

The sensitivity matrix can be efficiently computed capitalizing on the spatial reciprocity of the Green functions as shown below.

Let's plug the expression of the Fréchet derivatives in the gradient.

$$\frac{ \partial C(\boldsymbol{m}) }{ \partial m_j } = -\Re \left\{ \left( \frac{ \partial \phi ({\boldsymbol{m}}) }{\partial m_j } \right)^T \overline{\Delta \boldsymbol{d}} \right\} = - \Re \left\{ \left( \boldsymbol{P} \boldsymbol{A}^{-1} \left( \frac{ \partial \boldsymbol{A} }{ \partial m_j} \right) \boldsymbol{u} \right)^T \overline{\Delta \boldsymbol{d}} \right\} = - \Re \left\{ \boldsymbol{u}^T \left( \frac{ \partial \boldsymbol{A} }{ \partial m_j} \right)^T \boldsymbol{A}^{-T} \boldsymbol{P}^T \overline{\Delta \boldsymbol{d}} \right\}\tag{5.1}\label{5.1}$$

By analogy between the second and third terms, the coefficient $$ij$$ of the Fréchet derivative matrix is given by:

$$J_{ij} = \boldsymbol{u}_s^T \left( \frac{ \partial \boldsymbol{A} }{ \partial m_j} \right)^T \boldsymbol{A}^{-T} \boldsymbol{\delta}_r = \left( \frac{ \partial \boldsymbol{A} }{ \partial m_j} \boldsymbol{u}_s \right)^T \boldsymbol{A}^{-T} \boldsymbol{\delta}_r = \left( \boldsymbol{f}_s^{(j)} \right)^T \left( \boldsymbol{A}^{-T} \boldsymbol{\delta}_r \right), \tag{5.2a}\label{5.2a}$$

where source-receiver pair $$(s,r)$$ corresponds to data $$i$$ and $$\boldsymbol{\delta}_r$$ denotes an impulse source at receiver $$r$$. One coefficient of the sensitivity matrix can be build by the inner product of the virtual source by a Green function computed from the receiver position.

In the particular case where $$\boldsymbol{m}$$ is the squared slowness and $$\boldsymbol{A}$$ is self adjoint as for the Hemholtz operator ( $$\boldsymbol{A}^T = \boldsymbol{A}$$ ), the coefficient ij of the sensitivity matrix can be computed by the multiplication of the monochromatic Green functions generated from the source s and receiver r and taken at the position of the model parameter j, weighted by $$\omega^2$$ and the source signature of the source wavefield $$s(\omega)$$.

$$J_{i j} = \omega^2 s(\omega) \left( \boldsymbol{A}^{-1} \boldsymbol{\delta}_s \right)_j \left( \boldsymbol{A}^{-1} \boldsymbol{\delta}_r \right)_j, \tag{5.2b}\label{5.2b}$$.

The sensitivity matrix $$\boldsymbol{J}$$ can be built explicitly by performing one forward problem per non redundant positions of shot and receivers . This is far less expensive than computing $$N$$ forward problems per shot where $$N$$ denotes the number of model parameters.

We can now compute the sensitivity kernel (one row of the sensitivity matrix) of acoustic FWI in the frequency-space domain for Vp parameter in the (Vp,rho) parametrization (Figure 5.1). This amounts to compute Green functions for two point sources located at the position of the shot and the receiver and multiply the two Green functions. The product of the two monochromatic Green functions has a phase equal to the sum of the phase of the two Green functions and an amplitude equal to the product of their amplitudes consistently with the definition of the single-scattered wavefield computed in the framework of the Born approximation. Figure 5.1 shows an interference picture, called wavepath , which shows isophase surfaces of elliptical shape called isochrones (since we are in the frequency domain, all the traveltimes are synthetized in the figure). They represent Fresnel zones on which a residuals of temporal support spaning over t ± T/2 where t is the central traveltime and T is the period are back-projected. All the diffractions occurring within a Fresnel zone have a traveltime that do not differ by more than half a period. This implies that the background velocity model should be accurate enough to predict recorded traveltimes with an error lower than half a period (if this condition is not fulfilled, the simulated and recorded component of the residual will be back-projected onto two different isochrones). Anticipating the resolution analysis of FWI, one can note that the width of the isochrones decreases as the scattering angle Θ decreases. Figure 5.1: Sensitivity kernel of frequency-domain acoustic FWI for Vp parameter in the (Vp-rho) parametrization. The source is denoted by the star, the receiver by the triangle. The frequency is 5Hz. The interference picture shows isophase surfaces (Fresnel zones) on which residuals are backprojected during FWI. All the diffractions occuring within one Fresnel zone have traveltimes that do not differ by more than half a period. Some diffraction paths are illustrated. Note how the width of the Fresnel zones decreases as the scattering angle θ decreases. The width of the Fresnel zone gives the resolution of the FWI in the spatial direction pointed by the wavenumber vector k.

Back to \ref{Content}

$$\tag{VI}\label{VI}$$

### Efficient computation of the gradient

We now show how to compute the gradient of FWI without building the sensitivity matrix. Computing the gradient does not require to compute explicitly the sensitivity matrix.

$$\frac{ \partial C(\boldsymbol{m}) }{ \partial m_j } = - \Re \left\{ \boldsymbol{u}^T \left( \frac{ \partial \boldsymbol{A} }{ \partial m_j} \right)^T \left( \boldsymbol{A}^{-1} \right)^T {\boldsymbol{P}}^T \overline{\Delta \boldsymbol{d}} \right\} = - \Re \left\{ \boldsymbol{u}^T \left( \frac{ \partial\boldsymbol{A} }{ \partial m_j} \right)^T \overline{\boldsymbol{v}} \right\} = - \Re \left\{ \left( \frac{ \partial\boldsymbol{A} }{ \partial m_j} \boldsymbol{u} \right)^T \overline{\boldsymbol{v}} \right\} = - \Re \left( \left\{ \boldsymbol{f}_s^{(j)} \right)^T \overline{\boldsymbol{v}} \right\}, \tag{6.1}\label{6.1}$$

where the adjoint wavefield satisfies the wave equation

$$\boldsymbol{A}^T \, \overline{\boldsymbol{v}} = {\boldsymbol{P}}^T \overline{\Delta \boldsymbol{d}},\tag{6.2}\label{6.2}$$

with a source term corresponding to the assemblage of all of the residuals (we exploit here the linearity of the wave equation). We see now that we can compute $$\partial C$$ by taking the zero-lag correlation between the virtual source $$\frac{ \partial\boldsymbol{A} }{ \partial m_j} \boldsymbol{u}$$ and the backpropagated adjoint wavefield $$\boldsymbol{v}$$ instead of taking the zero-lag correlation between the partial derivative wavefield at the receiver positions with the data residuals as indicated by (\ref{4.4}).

To achieve this, we have used the property of the adjoint of the wave equation operator $$\boldsymbol{A}$$

$$< \boldsymbol{P} \boldsymbol{A} \boldsymbol{f} | \Delta \boldsymbol{d} >_{\partial \Omega} = < \boldsymbol{f} | \boldsymbol{A}^{\dagger} \boldsymbol{P}^T \Delta \boldsymbol{d} >_\Omega = < \boldsymbol{f} | \boldsymbol{A}^{T} \overline{\Delta \boldsymbol{d}} >_\Omega ,$$

where $$\boldsymbol{f} = \frac{ \partial\boldsymbol{A} }{ \partial m_j} \boldsymbol{u}$$ is the virtual source, $$\Omega$$ is the object domain spanned by $$\boldsymbol{m}$$, $$\partial \Omega$$ is the acquisition domain.
The adjoint of the wave equation operator backpropagates in time the data residuals (Figure 6.1). This is shown by the conjugate of the data residuals.

$$ITF ( \overline{u}(\omega)) = \int_{-\infty}^{+\infty} \overline{u}(\omega) \, \exp\left(i \omega t\right) d\omega = \int_{-\infty}^{+\infty} u(-\omega) \, \exp \left(i \omega t\right) d\omega = \int_{-\infty}^{+\infty} u(\omega) \, \exp \left(i \omega (-t)\right) d\omega = u(-t). \tag{6.3}\label{6.3}$$

It is reminded that, when $$u(t)$$ is real valued, $$u(-\omega) = \bar{u}(\omega)$$ as $$\bar{u}(\omega) = \overline{ \int_t u(t) e^{-i \omega t} dt } = \int_t u(t) e^{i \omega t} dt = \int_t u(t) e^{-i (-\omega) t} dt = u(-\omega)$$.

Computing the gradient requires two forward problems per source: one to compute the incident wavefield $$\boldsymbol{u}$$, one to compute the adjoint wavefield $${\boldsymbol{v}}$$.  Figure 6.1: Sketch illustration of the imaging principle of FWI based upon adjoint simulation. The gradient is built by the zero-lag correlation between the incident wavefield and the adjoint wavefield back propagated in time from the receiver positions. The source of the adjoint equation is illustrated in the top left panel. It represents the data residuals reversed in time. The incident wavefield and the back-propagated adjoint wavefield should be in phase at the position of the missing heterogeneity at T=0.775s (right figure). FWI can be viewed as a time reversal source localization method, where the targeted sources are the virtual scattering sources in depth.

Back to \ref{Content}

$$\tag{VII}\label{VII}$$

### Computing the gradient of the misfit function with the adjoint-state method

We review what has been shown above in a more generic way using the so-called adjoint-state method as developed by G. Chavent and reviewed by Plessix (2006) and Kern (2016).

We consider the least-squares functional.

$$C(\boldsymbol{m}) = \| \phi(\boldsymbol{m}) - \boldsymbol{d}^* \|_2^2.\tag{7.1}\label{7.1}$$

The gradient of the functional as a function of the Fréchet derivatives is given by

$$\nabla_{\boldsymbol{m}} C(\boldsymbol{m}) = \left[ \left( \frac{\partial \phi(\boldsymbol{m}) }{\partial \boldsymbol{m} } \right)^\dagger \right]_{N \times M} \left( \phi(\boldsymbol{m}) - \boldsymbol{d}^* \right)_M.\tag{7.2}\label{7.2}$$

Let's differentiate the state equation $$F(\boldsymbol{u},\boldsymbol{m})=0$$.

$$\frac{\partial F(\boldsymbol{u},\boldsymbol{m})}{\partial \boldsymbol{u} } \delta \boldsymbol{u} = - \frac{\partial F(\boldsymbol{u},\boldsymbol{m})}{\partial \boldsymbol{m} } \delta \boldsymbol{m}.\tag{7.3}\label{7.3}$$

$$\left(\frac{ \partial F(\boldsymbol{u},\boldsymbol{m}) }{\partial \boldsymbol{u} } \right) \frac{ \partial \boldsymbol{u} }{ \partial \boldsymbol{m} } = - \frac{ \partial F(\boldsymbol{u},\boldsymbol{m}) }{\partial \boldsymbol{m} }.\tag{7.4}\label{7.4}$$

From the observation equation $$\phi(\boldsymbol{m}) = \boldsymbol{P} \boldsymbol{u}(\boldsymbol{m}) \rightarrow \partial_{\boldsymbol{m}} \phi(\boldsymbol{m}) =\boldsymbol{P} \partial_{\boldsymbol{m}} \boldsymbol{u}(\boldsymbol{m})$$.

$$\nabla_{\boldsymbol{m}} C(\boldsymbol{m}) = -\left[ \left( \boldsymbol{P} \frac{\partial \boldsymbol{u}(\boldsymbol{m}) }{\partial \boldsymbol{m} } \right)^\dagger \right]_{N \times M} \left( \phi(\boldsymbol{m}) - \boldsymbol{d}^* \right)_M.\tag{7.5}\label{7.5}$$

Computing the gradient, written under the form of (\ref{7.5}), requires to solve the linearized equation (\ref{7.4}). This linearized equation has $$N$$ right-hand sides and hence is expensive to solve.

Let's plug the closed-form solution of equation (\ref{7.4}) in the expression of the gradient, equation (\ref{7.5}).

$$\nabla_{\boldsymbol{m}} C(\boldsymbol{m}) = - \left[ \left( \boldsymbol{P} \left(\frac{ \partial F(\boldsymbol{u},\boldsymbol{m}) }{\partial \boldsymbol{u} } \right)^{-1} \frac{ \partial F(\boldsymbol{u},\boldsymbol{m}) }{\partial \boldsymbol{m} } \right)^\dagger \right]_{M \times N} \left( \boldsymbol{d} - \boldsymbol{d}^* \right)_N.\tag{7.6}\label{7.6}$$

$$\nabla_{\boldsymbol{m}} C(\boldsymbol{m}) = - \left[ \frac{ \partial F(\boldsymbol{u},\boldsymbol{m}) }{\partial \boldsymbol{m} }^\dagger \left( \frac{ \partial F(\boldsymbol{u},\boldsymbol{m}) }{\partial \boldsymbol{u} }^{-1} \right)^\dagger \boldsymbol{P}^T \right]_{M \times N} \left( \boldsymbol{d} - \boldsymbol{d}^* \right)_N.\tag{7.7}\label{7.7}$$

Let's parenthesize (\ref{7.7}) as follow

$$\nabla_{\boldsymbol{m}} C(\boldsymbol{m}) = - \frac{ \partial F(\boldsymbol{u},\boldsymbol{m}) }{\partial \boldsymbol{m} }^\dagger \left[ \left( \frac{ \partial F(\boldsymbol{u},\boldsymbol{m}) }{\partial \boldsymbol{u} }^\dagger \right)^{-1} \boldsymbol{P}^T \left( \boldsymbol{d} - \boldsymbol{d}^* \right) \right].\tag{7.8}\label{7.8}$$

With inner-product notations, we have used the identity

$$< \boldsymbol{P} \left(\frac{ \partial F(\boldsymbol{u},\boldsymbol{m}) }{\partial \boldsymbol{u} }\right)^{-1} \frac{ \partial F(\boldsymbol{u},\boldsymbol{m}) }{\partial \boldsymbol{m} }, \boldsymbol{d} - \boldsymbol{d}^* >=< \frac{ \partial F(\boldsymbol{u},\boldsymbol{m}) }{\partial \boldsymbol{m} } , \left( \frac{ \partial F(\boldsymbol{u},\boldsymbol{m}) }{\partial \boldsymbol{u} }^\dagger \right)^{-1} \boldsymbol{P}^T \left( \boldsymbol{d} - \boldsymbol{d}^* \right) >.\tag{7.9}\label{7.9}$$

Let's introduce the adjoint state variable $$\boldsymbol{v}$$.

$$\frac{ \partial F(\boldsymbol{u},\boldsymbol{m}) }{\partial \boldsymbol{u} }^\dagger \boldsymbol{v}= \boldsymbol{P}^T \left( \boldsymbol{d} - \boldsymbol{d}^* \right).\tag{7.10}\label{7.10}$$

$$\nabla_{\boldsymbol{m}} C(\boldsymbol{m}) = - \frac{ \partial F(\boldsymbol{u},\boldsymbol{m}) }{\partial \boldsymbol{m} }^\dagger \boldsymbol{v}. \tag{7.11}\label{7.11}$$

The gradient requires now to solve one state equation (to compute $$\boldsymbol{u}$$ from which $$\frac{ \partial F(\boldsymbol{u},\boldsymbol{m}) }{\partial \boldsymbol{m} }$$ follows) and one adjoint state equation (\ref{7.10}).

Back to \ref{Content}

$$\tag{VIII}\label{VIII}$$

### The adjoint-state method with Lagrangian functional

Let's come back to the original full-space formulation of FWI as a constrained optimization where the state equation is the equality constraint (\ref{1.4b}).

$$\min_{\boldsymbol{m},\boldsymbol{u}} \| \boldsymbol{P} \boldsymbol{u} - \boldsymbol{d}^* \|_2^2 ~~ \text{subject to } ~~ F(\boldsymbol{u},\boldsymbol{m})=0\tag{8.1}\label{8.1}.$$

We solve this constrained problem with the method of Lagrange multiplier. Let's introduce the following Lagrangian function

$$\mathcal{L}(\boldsymbol{m},\boldsymbol{u},\boldsymbol{v}) = \| \boldsymbol{P} \boldsymbol{u} - \boldsymbol{d} \|_2^2 + < \boldsymbol{v} , F(\boldsymbol{u},\boldsymbol{m}) >,\tag{8.2}\label{8.2}$$

where $$\boldsymbol{v}$$ is the Lagrange multiplier (or adjoint-state variable), $$< >$$ denotes inner product, and $$\boldsymbol{u}$$, $$\boldsymbol{v}$$ and $$\boldsymbol{m}$$ are independent variables.

We solve the following min max (saddle point) problem

$$\min_{\boldsymbol{m},\boldsymbol{u}} \max_{\boldsymbol{v}} \| \boldsymbol{P} \boldsymbol{u} - \boldsymbol{d}^* \|_2^2 + < \boldsymbol{v} , F(\boldsymbol{u},\boldsymbol{m}) >. \tag{8.3}\label{8.3}$$

The first-optimality conditions state that the minimizer is reached at a saddle point of the Lagrangian function, namely,

$$\partial \mathcal{L}(\boldsymbol{m},\boldsymbol{u},\boldsymbol{v}) / \partial \boldsymbol{v} = 0, \tag{8.4}\label{8.4}$$

$$\partial \mathcal{L}(\boldsymbol{m},\boldsymbol{u},\boldsymbol{v}) / \partial \boldsymbol{u} = 0, \tag{8.5}\label{8.5}$$

$$\partial \mathcal{L} (\boldsymbol{m},\boldsymbol{u},\boldsymbol{v})/ \partial \boldsymbol{m} = 0. \tag{8.6}\label{8.6}$$

Instead of jointly updating $$\boldsymbol{v}$$, $$\boldsymbol{u}$$, and $$\boldsymbol{m}$$ , we restrict the search space to the case when $$\boldsymbol{u}$$ is a realization of the state equation for any $$\boldsymbol{m}$$. Let's denote by $$\tilde{\boldsymbol{u}}$$ this realization.

In the full space spanned by the independent variables $$\boldsymbol{m}$$, $$\boldsymbol{u}$$, and $$\boldsymbol{v}$$, the partial derivative of the Lagrangian with respect to $$\boldsymbol{v}$$ is the state equation

$$\frac{\partial \mathcal{L}(\boldsymbol{m},\boldsymbol{u},\boldsymbol{v})}{\partial \boldsymbol{v}}=F(\boldsymbol{u},\boldsymbol{m}). \tag{8.7}\label{8.7}$$

When $$\boldsymbol{u}$$ is a realization of the state equation, the first optimality condition $$\partial \mathcal{L}(\boldsymbol{m},\boldsymbol{u},\boldsymbol{v}) / \partial \boldsymbol{v}$$ is satisfied since $$F(\tilde{\boldsymbol{u}},\boldsymbol{m}) = 0$$.

Let's now consider the second optimality condition. In the full space spanned by the independent variables $$\boldsymbol{m}$$, $$\boldsymbol{u}$$, and $$\boldsymbol{v}$$, zeroing the Lagrangian with respect to the state variable gives the adjoint-state equation

$$\left( \frac{\partial F(\boldsymbol{u},\boldsymbol{m}) }{\partial \boldsymbol{u}} \right)^\dagger \boldsymbol{v} = \boldsymbol{P}^T (\boldsymbol{P} \boldsymbol{u} - \boldsymbol{d}^*). \tag{8.8}\label{8.8}$$.

In limiting ourselves to the reduced space spanned by $$\tilde{\boldsymbol{u}}$$, the associated adjoint-state variables, denoted by $$\tilde{\boldsymbol{v}}$$ satisfy

$$\left( \frac{\partial F(\tilde{\boldsymbol{u}},\boldsymbol{m}) }{\partial \boldsymbol{u}} \right)^\dagger \tilde{\boldsymbol{v}} = \boldsymbol{P}^T (\boldsymbol{P} \tilde{\boldsymbol{u}} - \boldsymbol{d}^*). \tag{8.9}\label{8.9}$$,

Moreover, when $$\boldsymbol{u}$$ is a realization of the state equation, we have

$$C(\boldsymbol{m}) = \left[ \mathcal{L}(\boldsymbol{m},\boldsymbol{u},\boldsymbol{v})\right]_{\tilde{\boldsymbol{u}}}, \tag{8.10}\label{8.10}$$

where $$C(\boldsymbol{m})$$ is the misfit function of the unconstrained problem (\ref{2.1}).

Indeed, we have also

$$C(\boldsymbol{m}) = \left[ \mathcal{L}(\boldsymbol{m},\boldsymbol{u},\boldsymbol{v})\right]_{\tilde{\boldsymbol{u}}, \tilde{\boldsymbol{v}}}.\tag{8.11}\label{8.11}$$

Therefore, in virtue of the definition of partial derivative, taking the partial derivative of $$\mathcal{L}$$ with respect to $$\boldsymbol{m}$$ and taking the values at $$(\tilde{\boldsymbol{u}}, \tilde{\boldsymbol{v}})$$ gives the gradient of $$\mathcal{L}$$ with respect to $$\boldsymbol{m}$$

$$\nabla_\boldsymbol{m} C(\boldsymbol{m}) = \left[ \partial_\boldsymbol{m} \mathcal{L}(\boldsymbol{m},\boldsymbol{u},\boldsymbol{v})\right]_{\tilde{\boldsymbol{u}}, \tilde{\boldsymbol{v}}}.\tag{8.12}\label{8.12}$$

We find

$$\nabla_{\boldsymbol{m}} C(\boldsymbol{m}) = \left[ \partial_{\boldsymbol{m}} \mathcal{L}(\boldsymbol{m},\boldsymbol{u},\boldsymbol{v}) \right] _{\tilde{\boldsymbol{u}}, \tilde{\boldsymbol{v}}}= \left[ \left( \frac{\partial F(\boldsymbol{u},\boldsymbol{m})}{\partial \boldsymbol{m}} \right)^\dagger \boldsymbol{v} \right]_{\tilde{\boldsymbol{u}}, \tilde{\boldsymbol{v}}},\tag{8.13}\label{8.13}$$

which is identical to (\ref{7.11}). The above expression of the gradient can be used to iteratively update $$\boldsymbol{m}$$ with any local optimization algorithm (steepest-descent, conjugate gradient, quasi Newton, Newton).

We can now apply this recipe to frequency-domain and time-domain FWI.
##### Computing the gradient of frequency-domain FWI with Lagrangian function

The Lagrangian function is given by

$${\mathcal{L}} \left(\boldsymbol{m},\boldsymbol{u},\boldsymbol{v}\right) = \| \boldsymbol{P} \boldsymbol{u} - \boldsymbol{d}^* \|_2^2 + \Re \left\{ < \boldsymbol{v} , \boldsymbol{A}(\boldsymbol{m}) \, \boldsymbol{u} - \boldsymbol{b} > \right\},\tag{8.14}\label{8.14}$$

where it is reminded that $$\boldsymbol{u}$$ and $$\boldsymbol{v}$$ are complex valued. Taking the derivative of a function with respect to complex-valued parameter makes use of Wirtinger derivative. See also the paper of Brandwood about computing the gradient of a function of complex-valued parameters.

$$\partial_{\boldsymbol{u}} {\mathcal{L}} \left(\boldsymbol{m},\boldsymbol{u},\boldsymbol{v}\right) = 0\tag{8.15}\label{8.15}$$

Before taking the derivative with respect to $$\boldsymbol{u}$$, we introduce the adjoint of $$\boldsymbol{A}$$.

$${\mathcal{L}} \left(\boldsymbol{m},\boldsymbol{u},\boldsymbol{v}\right) = \| \boldsymbol{P} \boldsymbol{u} - \boldsymbol{d}^* \|_2^2 + \Re \left\{ < \boldsymbol{A}(\boldsymbol{m})^\dagger \boldsymbol{v} , \boldsymbol{u} > - < \boldsymbol{v} , \boldsymbol{b} > \right\}.\tag{8.16}\label{8.16}$$

The adjoint-state equation is given by

$$\partial_{\boldsymbol{u}} {\mathcal{L}} \left(\boldsymbol{m},\boldsymbol{u},\boldsymbol{v}\right) = 0 \rightarrow \boldsymbol{A}(\boldsymbol{m})^\dagger \boldsymbol{v} = \boldsymbol{P}^T \left( \boldsymbol{P} \boldsymbol{u} - \boldsymbol{d}^* \right) \rightarrow \boldsymbol{A}(\boldsymbol{m})^T \overline{\boldsymbol{v}} = \boldsymbol{P}^T \left( \overline{\boldsymbol{P} \boldsymbol{u} - \boldsymbol{d}^*} \right),\tag{8.17}\label{8.17}$$

which is identical to (\ref{8.8}).

$$\nabla_{\boldsymbol{m}} C(\boldsymbol{m})= \Re \left\{ < \boldsymbol{v} , \frac{\partial \boldsymbol{A}(\boldsymbol{m}) }{\partial \boldsymbol{m}} \, \boldsymbol{u} > \right\},\tag{8.18}\label{8.18}$$

which is identifcal to (\ref{8.13}).
##### Computing the gradient of time-domain FWI with Lagrangian function

We review how to compute the gradient of time-domain FWI with Lagrangian function when the wave equation is formulated as a first-order hyperbolic velocity-stress system.

$$\left\{ \begin{array}{l} \partial_t v(x,t) = b(x) \partial_x \sigma(x,t) + s(x,t) \\ \partial_t \sigma(x,t) = \kappa(x) \partial_x v(x,t) \end{array} \right.\tag{8.19}\label{8.19}$$

The wave equation can be written in matrix compact form as

$$\partial_t \boldsymbol{u}(x,t) = \boldsymbol{M}(m(x))\boldsymbol{D} \boldsymbol{u}(x,t) + \boldsymbol{b}(x,t), \tag{8.20}\label{8.20}$$

where

$$\boldsymbol{u}(x,t)=\left(\begin{array}{c} v(x,t) \\ \sigma(x,t)\end{array}\right), ~~ \boldsymbol{b}(x,t)=\left(\begin{array}{c} s(x,t) \\ 0 \end{array}\right), ~~ \boldsymbol{M}(x) = \left(\begin{array}{cc}b(x) & 0 \\ 0 & \kappa(x)\end{array}\right), ~~ \boldsymbol{D}= \left(\begin{array}{cc} 0 & \partial_x \\ \partial_x & 0 \end{array} \right).\tag{8.21}\label{8.21}$$

with initial conditions: $$\boldsymbol{u}(x,0)=\boldsymbol{0}; ~~ \partial_t \boldsymbol{u}(x,0)=\boldsymbol{0}$$.

We rewrite the wave equation such that it is self-adjoint by moving the subsurface properties in the left-hand side

$$\boldsymbol{C}(m(x)) \partial_t \boldsymbol{u}(x,t) = \boldsymbol{D} \boldsymbol{u}(x,t) + \boldsymbol{C}(m(x)) \boldsymbol{b}(x,t) ~~~ \text{with} ~~~ \boldsymbol{C}(x) = \left(\begin{array}{cc}\rho(x) & 0 \\ 0 & 1/\kappa(x)\end{array}\right)\tag{8.22}\label{8.22}$$

If we write the observation equation as

$$d(x_r,t) = d_r(t) = \boldsymbol{P}_r \boldsymbol{u}(x,t)\tag{8.23}\label{8.23}$$

then, the Lagrangian function is given by

$$L(\boldsymbol{u}(x,t),\boldsymbol{v}(x,t),\boldsymbol{m}(x)) = \frac{1}{2} \int_0^T \sum_r \left( \boldsymbol{P}_r \boldsymbol{u}(x,t) - \boldsymbol{d}_r(t)^* \right)^2 dt + \int_{t=0}^T < \boldsymbol{v}(x,t), \underbrace{\boldsymbol{C}(\boldsymbol{m}(x)) \partial_t \boldsymbol{u}(x,t) - \boldsymbol{D} \boldsymbol{u}(x,t) - \boldsymbol{C}(m(x)) \boldsymbol{b}(x,t)}_{State ~ equation} > dt,\tag{8.24}\label{8.24}$$

where $$\boldsymbol{v}$$ is the Lagrange multiplier.

We apply an integration by part for the time and space derivatives.

• Integration by part for time derivative

$$\int_0^T < \boldsymbol{v}(x,t), \partial_t \boldsymbol{u}(x,t) > = \left[ < \boldsymbol{v}(x,t) , \boldsymbol{u}(x,t) > \right]_0^T - \int_0^T < \partial_t \boldsymbol{v}(x,t), \boldsymbol{u}(x,t) = < \boldsymbol{v}(x,T) , \boldsymbol{u}(x,T) > - \int_0^T < \partial_t \boldsymbol{v}(x,t), \boldsymbol{u}(x,t) >.\tag{8.25}\label{8.25}$$

• Integration by part for space derivative

$$\int \partial_x \boldsymbol{u}(x,t) \boldsymbol{v}(x,t) dx = - \int \boldsymbol{u}(x,t) \partial_x \boldsymbol{v}(x,t) dx. \tag{8.26}\label{8.26}$$

$$< \boldsymbol{D} \boldsymbol{u}(x,t), \boldsymbol{v}(x,t) > = < \boldsymbol{u}(x,t), \boldsymbol{D}^\dagger \boldsymbol{v}(x,t) > ~~ \text{with} ~~ \boldsymbol{D}^\dagger = - \boldsymbol{D}. \tag{8.27}\label{8.27}$$

Using these results, the Lagrangian after a double integration by part reads

$$\begin{array}{l}L(\boldsymbol{u}(x,t),\boldsymbol{v}(x,t),\boldsymbol{m}(x)) = \frac{1}{2} \int_0^T \sum_r \left( \boldsymbol{P}_r \boldsymbol{u}(x,t) - \boldsymbol{d}_r(t)^* \right)^2 dt + < \boldsymbol{C}(\boldsymbol{m}(x)) \boldsymbol{v}(x,T), \boldsymbol{u}(x,T) > \\ + \int_{t=0}^T \left[ - < \boldsymbol{C}(\boldsymbol{m}(x)) \partial_t \boldsymbol{v}(x,t), \boldsymbol{u}(x,t) > + < \boldsymbol{D} \boldsymbol{v}(x,t) , \boldsymbol{u}(x,t) > - < \boldsymbol{v}(x,t) , \boldsymbol{C}(\boldsymbol{m}(x)) \boldsymbol{b}(x,t) > \right] dt \end{array} \tag{8.28}\label{8.28}$$.

The adjoint-state equation is given by

$$\boldsymbol{C}(\boldsymbol{m}(x)) \partial_t \boldsymbol{v}(x,t) = \boldsymbol{D} \boldsymbol{v}(x,t) + \left( \boldsymbol{P}_r \right)^T \sum_r \left( \boldsymbol{P}_r \boldsymbol{u}(x,t) - \boldsymbol{d}_r(t)^* \right).\tag{8.29}\label{8.29}$$

or, equivalently in a form suitable for explicit time stepping

$$\partial_t \boldsymbol{v}(x,t) = \boldsymbol{M}(\boldsymbol{m}(x)) \boldsymbol{D} \boldsymbol{v}(x,t) + \boldsymbol{M}(\boldsymbol{m}(x)) \left( \boldsymbol{P}_r \right)^T \sum_r \left( \boldsymbol{P}_r \boldsymbol{u}(x,t) - \boldsymbol{d}_r(t)^* \right),\tag{8.30}\label{8.30}$$
with a final condition $$\boldsymbol{v}(x,T) = 0$$ (this final condition results from the second term of the Lagrangian).

We perform a change of variable to transform the final condition into an initial condition

$$\boldsymbol{w}(x,t)=\boldsymbol{v}(x,T-t)~~~ \text{such that} ~~~ \boldsymbol{w}(x,0) = \boldsymbol{v}(x,T).\tag{8.31}\label{8.31}$$

Sustituting $$\boldsymbol{v}$$ by $$\boldsymbol{w}$$ in the adjoint-state equation gives

$$\partial_t \boldsymbol{w}(x,T-t) = \boldsymbol{M}(\boldsymbol{m}(x)) \boldsymbol{D} \boldsymbol{w}(x,T-t) + \boldsymbol{M}(\boldsymbol{m}(x)) \left( \boldsymbol{P}_r \right)^T \sum_r \left( \boldsymbol{P}_r \boldsymbol{u}(x,t) - \boldsymbol{d}_r(t)^* \right).\tag{8.32}\label{8.32}$$

Then, we apply the change of variable $$t' = T - t$$.

$$-\partial_t' \boldsymbol{w}(x,t') = \boldsymbol{M}(\boldsymbol{m}(x)) \boldsymbol{D} \boldsymbol{w}(x,t') + \boldsymbol{M}(\boldsymbol{m}(x)) \left( \boldsymbol{P}_r \right)^T \sum_r \left( \boldsymbol{P}_r \boldsymbol{u}(x,T-t') - \boldsymbol{d}_r(T-t')^* \right).\tag{8.33}\label{8.33}$$

$$\partial_t' \boldsymbol{w}(x,t') = - \boldsymbol{M}(\boldsymbol{m}(x)) \boldsymbol{D} \boldsymbol{w}(x,t') - \boldsymbol{M}(\boldsymbol{m}(x)) \left( \boldsymbol{P}_r \right)^T \sum_r \left( \boldsymbol{P}^v_r \boldsymbol{u}(x,T-t') - \boldsymbol{d}_r(T-t')^* \right),\tag{8.34}\label{8.34}$$

with $$\boldsymbol{v}(x,0)=0$$.

In summary, the practical procedure to implement the adjoint-state equation consists of

• Reverse in time the data residuals.
• Reverse the sign of $$dt$$ in the wavefield simulation code.

Let's now compute the gradient. From (\ref{8.24}), we have

$$\nabla_{\boldsymbol{m}} C(\boldsymbol{m}) \approx - \int_{t=0}^T < \frac{ \partial_{\boldsymbol{m}} \boldsymbol{C}(\boldsymbol{m}(x)) }{ \partial \boldsymbol{m} } \partial_t \boldsymbol{v}(x,t), \boldsymbol{u}(x,t) > dt, \tag{8.35}\label{8.35}$$

More explicitly,

$$\begin{array}{l}\nabla_{\rho} C(\boldsymbol{m}) \approx - \int_{t=0}^T < \partial_t \boldsymbol{v}(x,t), \boldsymbol{u}(x,t) > dt, \\ \nabla_{\kappa} C(\boldsymbol{m}) \approx \int_{t=0}^T < \frac{1}{\kappa^2(x)} \partial_t \boldsymbol{v}(x,t), \boldsymbol{u}(x,t) > dt. \end{array}\tag{8.36}\label{8.36}$$

To compute the gradient with respect to other classes of parameters, use the chain rule of derivatives.

Back to \ref{Content}

$$\tag{IX}\label{IX}$$

### Role of the Hessian

Let's remind the Newton system providing the solution of the descent direction.

$$\frac{\partial^2 C({\boldsymbol{m}}_0)}{\partial {\boldsymbol{m}}^2} \, \Delta {\boldsymbol{m}}=- \frac{\partial C({\boldsymbol{m}}_0)}{\partial \boldsymbol{m}}. \tag{9.1}\label{9.1}$$

We have just seen the role of the gradient in FWI. We now analyze the role of the Hessian.

Let's now compute the element $$m,l$$ of the second-derivative of the misfit function.

$$\frac{\partial^2 C({\boldsymbol{m}})}{\partial m_k \partial m_l}=-\sum_{i=1}^{N} \Re \left\{ \left( \overline{ \frac{\partial^2 d_i }{\partial m_k \partial m_l} } \right) (d_i-d_i^*) \right\} +\sum_{i=1}^{N} \Re \left\{ \left( \overline{ \frac{\partial d_i}{\partial m_k} } \right) \frac{\partial d_i}{\partial m_l} \right\}. \tag{9.2}\label{9.2}$$

Going back to matrix notation,

$$\frac{\partial^2 C({\boldsymbol{m}})}{\partial {\boldsymbol{m}}^2}=\Re \left\{ \boldsymbol{J}^\dagger \boldsymbol{J} \right\}+\Re \left\{ \frac{\partial \boldsymbol{J}^t}{\partial {\boldsymbol{m}}^t} (\overline{ \Delta \boldsymbol{d} } ~ ... ~ \overline{\Delta \boldsymbol{d}} ) \right\}. \tag{9.3}\label{9.3}$$

In particular,at the point $$\boldsymbol{m}_0$$ of the model space, we have

$$\frac{\partial^2 C({\boldsymbol{m}}_0)}{\partial \boldsymbol{m}^2}=\Re \left\{ \boldsymbol{J}_0^\dagger \boldsymbol{J}_0 \right\}+\Re \left\{ \frac{\partial \boldsymbol{J}_0^t}{\partial \boldsymbol{m}^t} ( \overline{\Delta \boldsymbol{d}} ~ ... ~ \overline{ \Delta \boldsymbol{d} }) \right\}. \tag{9.4}\label{9.4}$$

The linear term of the Hessian, which is sometimes referred to as approximate or Gauss-Newton Hessian, is given by:

$${\boldsymbol{H}}_a = \Re \left\{ \boldsymbol{J}^\dagger \boldsymbol{J} \right\}. \tag{9.5}\label{9.5}$$

The linear term in the Hessian represents the sum of the zero-lag correlation between partial derivative wavefields with respect to two different

### Appendix 1: Few words about the method of Lagrange multiplier

Topic attachments
I Attachment Action Size Date Who Comment png cycle.png manage 72.6 K 18 Feb 2020 - 13:02 Main.UnknownUser png fig_adjoint_snap.png manage 118.2 K 03 Jan 2020 - 23:31 Main.UnknownUser png fig_aliasing.png manage 129.8 K 26 Jan 2020 - 16:46 Main.UnknownUser png fig_baragiano.png manage 645.8 K 23 Jan 2020 - 07:58 Main.UnknownUser png fig_data_weigth.png manage 6679.8 K 18 Apr 2020 - 12:35 Main.UnknownUser png fig_diag_hessian.png manage 56.0 K 23 Jan 2020 - 07:50 Main.UnknownUser png fig_doublescattering.png manage 173.1 K 23 Jan 2020 - 08:18 Main.UnknownUser png fig_freq_versus_angle.png manage 1097.5 K 23 Jan 2020 - 22:09 Main.UnknownUser png fig_fwi_Fresnelzone.png manage 52.8 K 24 Jan 2020 - 09:33 Main.UnknownUser png fig_fwi_pointdiffractor_1shot.png manage 34.4 K 24 Jan 2020 - 09:28 Main.UnknownUser png fig_fwi_pointdiffractor_fine_41shot.png manage 47.7 K 24 Jan 2020 - 09:28 Main.UnknownUser png fig_hessian1.png manage 360.7 K 23 Jan 2020 - 07:42 Main.UnknownUser png fig_imaging_principle.png manage 178.6 K 03 Jan 2020 - 21:42 Main.UnknownUser png fig_imagingprinciple2.png manage 43.0 K 17 Jan 2020 - 22:23 Main.UnknownUser png fig_imagingprinciple_adjoint.png manage 59.0 K 03 Jan 2020 - 23:10 Main.UnknownUser png fig_impedancematrix.png manage 194.4 K 19 Feb 2020 - 08:48 Main.UnknownUser png fig_inclusion_full.png manage 119.4 K 24 Jan 2020 - 09:44 Main.UnknownUser png fig_inclusion_full_dirac.png manage 118.9 K 24 Jan 2020 - 09:47 Main.UnknownUser png fig_inclusion_seq_versus_simul.png manage 84.9 K 24 Jan 2020 - 09:50 Main.UnknownUser png fig_inclusion_small_vtrue.png manage 50.4 K 23 Jan 2020 - 07:37 Main.UnknownUser png fig_marmousi_macro+mig.png manage 353.9 K 18 Feb 2020 - 12:30 Main.UnknownUser png fig_marmousi_tomo+mig.png manage 353.0 K 18 Feb 2020 - 13:00 Main.UnknownUser png fig_mig_ray+Born.png manage 372.3 K 18 Feb 2020 - 12:58 Main.UnknownUser png fig_migration_Marmousi.png manage 300.5 K 18 Feb 2020 - 13:03 Main.UnknownUser png fig_mulder.png manage 1094.0 K 18 Apr 2020 - 09:12 Main.UnknownUser png fig_overthrust.png manage 1025.7 K 18 Apr 2020 - 12:13 Main.UnknownUser png fig_param_acqui.png manage 256.8 K 20 Jan 2020 - 21:50 Main.UnknownUser png fig_partialderivativewavefield.png manage 65.4 K 03 Jan 2020 - 20:43 Main.UnknownUser png fig_plane_wave.png manage 588.4 K 20 Jan 2020 - 22:02 Main.UnknownUser png fig_planewave_angle.png manage 596.3 K 20 Jan 2020 - 22:10 Main.UnknownUser png fig_radiation_acoustic.png manage 261.3 K 03 Jan 2020 - 21:06 Main.UnknownUser png fig_radiation_pattern_acoustic.png manage 67.6 K 03 Jan 2020 - 21:02 Main.UnknownUser png fig_radiation_snap_h.png manage 66.3 K 03 Jan 2020 - 21:04 Main.UnknownUser png fig_rayborn_modeling_Marmousi.png manage 479.7 K 18 Feb 2020 - 13:04 Main.UnknownUser png fig_ricker.png manage 23.8 K 24 Jan 2020 - 09:23 Main.UnknownUser png fig_sd_versus_gn.png manage 159.7 K 23 Jan 2020 - 08:07 Main.UnknownUser png fig_sensitivity_kernel.png manage 465.9 K 24 Jan 2020 - 08:42 Main.UnknownUser png fig_sensitivity_kernel1.png manage 557.5 K 03 Jan 2020 - 22:18 Main.UnknownUser png fig_sensitivity_kernel_cycleskipping.png manage 794.6 K 18 Apr 2020 - 10:30 Main.UnknownUser png fig_sirgue_criterion.png manage 456.1 K 23 Jan 2020 - 22:23 Main.UnknownUser png fig_sketch_mbtt.png manage 90.9 K 18 Feb 2020 - 13:36 Main.UnknownUser png fig_step_length.png manage 29.8 K 04 Feb 2020 - 21:21 Main.UnknownUser png fig_time_damping.png manage 1284.0 K 18 Apr 2020 - 13:21 Main.UnknownUser png fig_wavenumber_3d.png manage 339.2 K 23 Jan 2020 - 21:40 Main.UnknownUser png fig_wavenumber_map.png manage 373.4 K 20 Jan 2020 - 22:26 Main.UnknownUser png fig_wavenumber_redundancy.png manage 1097.6 K 23 Jan 2020 - 22:15 Main.UnknownUser png fig_wavepath_gp.png manage 695.6 K 18 Feb 2020 - 14:01 Main.UnknownUser png fig_wavepath_ld.png manage 446.0 K 18 Apr 2020 - 12:51 Main.UnknownUser png fig_wavepath_nd.png manage 1248.1 K 18 Apr 2020 - 12:52 Main.UnknownUser png fig_wavepath_offset.png manage 1074.9 K 23 Jan 2020 - 22:34 Main.UnknownUser png fig_wavepath_reflector.png manage 511.0 K 18 Feb 2020 - 13:37 Main.UnknownUser png fig_wavepath_sd.png manage 299.6 K 18 Apr 2020 - 12:53 Main.UnknownUser png fig_wraparound2.png manage 337.4 K 18 Feb 2020 - 14:14 Main.UnknownUser png non_convex.png manage 63.6 K 18 Jan 2020 - 20:22 Main.UnknownUser English Français