We are searching data for your request:

**Forums and discussions:**

**Manuals and reference books:**

**Data from registers:**

**Wait the end of the search in all databases.**

Upon completion, a link will appear to access the found materials.

Upon completion, a link will appear to access the found materials.

In fMRI, univariate analysis typically makes use of the general linear model (GLM), where the relation between experimental condition and BOLD activity is estimated with a linear regression model for each voxel separately ("mass univariate model"). However, I found it surprisingly hard to find out what the response vector $Y$ contains. I am aware that it is some measure of the blood oxygenation level dependent signal (BOLD signal) - but what is measured exactly, i.e. what are the units? Is it the oxy/deoxy ratio, or is it another property?

**Short answer**

The units aren't important, and are typically normalized.

**Longer answer**

The name "BOLD" is actually very instructive: Blood Oxygenation Level Dependent contrast (or signal). BOLD is not directly measuring anything like oxygenated/deoxygenated ratio, or we would probably call it something more similar to that: oxy/deoxy ratio.

Instead, "BOLD" is just the MRI image you get when you set up the scanner properties in a way that happens to be influenced by the oxygenation level of blood. Oxygenation level is far from the only contributor to the BOLD signal, but experimentally you try to factor out any influences that are time-varying but not interesting, and you ignore baseline levels that don't change.

fMRI works because the transverse relaxation time T2 of water in blood depends on oxygenation of the blood. From the original paper showing BOLD contrast in a rat brain Ogawa et al. 1990:

At the 7-T field strength used in this study, T2 varies from 50 msec at 100% oxygenation to 4 msec at 0% oxygenation. The venous blood signal becomes very weak when the T2 value becomes comparable to or shorter than the echo time of the signal acquisition. At 60% oxygenation level, the estimated T2 value is 18 msec, similar to the echo time used in this report.

In other words, if you produce gradient-echo images with an echo time of 18 ms, the images you get show very different signal levels at 100% vs 60% oxygenation. Depending on the strength of your magnet, you can tune this time to go after the differences you want to see.

There is no straightforward way to compare images recorded in different conditions (i.e., different scanners, different parameters) on any useful scale of measurement. Instead, people typically report relative values, either in time or space. For this reason, you often see BOLD signals expressed as a % change or in terms of standard deviation (i.e., z-scores).

Ultimately, for fitting a general linear model it doesn't matter: you can freely scale a dependent variable without changing the structure of the model fit, all you influence is the interpretation of the magnitudes of the intercept and coefficients.

*References*

Ogawa, S., Lee, T. M., Kay, A. R., & Tank, D. W. (1990). Brain magnetic resonance imaging with contrast dependent on blood oxygenation. Proceedings of the National Academy of Sciences, 87(24), 9868-9872.

## The General Linear Model (GLM)

The described *t* test for assessing the difference of two mean values is a special case of an analysis of a qualitative (categorical) independent variable. A qualitative variable is defined by discrete levels, e.g., "stimulus off" vs. "stimulus on". If a design contains more than two levels assigned to a single or multiple factors, an analysis of variance (ANOVA) can be performed, which can be considered as an extension of the t test. The described correlation coefficient on the other hand is suited for the analysis of quantitative independent variables. A quantitative variable may be defined by any gradual time course. If more than one reference time course has to be considered, multiple regression analysis can be performed, which can be considered as an extension of the simple linear correlation analysis.

The General Linear Model (GLM) is mathematically identical to a multiple regression analysis but stresses its suitability for both multiple qualitative and multiple quantitative variables. The GLM is suited to implement any parametric statistical test with one dependent variable, including any factorial ANOVA design as well as designs with a mixture of qualitative and quantitative variables (covariance analysis, ANCOVA). Because of its flexibility to incorporate multiple quantitative and qualitative independent variables, the GLM has become the core tool for fMRI data analysis after its introduction into the neuroimaging community by Friston and colleagues (Friston et al. 1994, 1995). The following sections briefly describe the mathematical background of the GLM in the context of fMRI data analysis a comprehensive treatment of the GLM can be found in the standard statistical literature, e.g. Draper and Smith (1998) and Kutner et al. (2005).

Note: In the fMRI literature, the term "General Linear Model" refers to its univariate version. The term "univariate" does in this context not refer to the number of independent variables, but to the number of dependent variables. As mentioned earlier, a separate statistical analysis is performed for each voxel time series (dependent variable). In its general form, the General Linear Model has been defined for multiple dependent variables, i.e. it encompasses tests as general as multivariate covariance analysis (MANCOVA).

From the perspective of multiple regression analysis, the GLM aims to "explain" or "predict" the variation of a dependent variable in terms of a linear combination (weighted sum) of several reference functions. The dependent variable corresponds to the observed fMRI time course of a voxel and the reference functions correspond to time courses of expected (idealized) fMRI responses for different conditions of the experimental paradigm. The reference functions are also called predictors, regressors, explanatory variables, covariates or basis functions. A set of specified predictors forms the design matrix, also called the model. A predictor time course is typically obtained by convolution of a condition box-car time course with a standard hemodynamic response function (two-gamma HRF or single-gamma HRF). A condition box-car time course may be defined by setting values to 1 at time points at which the modeled condition is defined ("on") and 0 at all other time points. Each predictor time course X gets an associated coefficient or beta weight b, quantifying its potential contribution in explaining the voxel time course y. The voxel time course y is modeled as the sum of the defined predictors, each multiplied with the associated beta weight b. Since this linear combination will not perfectly explain the data due to noise fluctuations, an error value e is added to the GLM system of equations with n data points and p predictors:

The y variable on the left side corresponds to the data, i.e. the measured time course of a single voxel. Time runs from top to bottom, i.e. y1 is the measured value at time point 1, y2 the measured value at time point 2 and so on. The voxel time course (left column) is "explained" by the terms on the right side of the equation. The first column on the right side corresponds to the first beta weight b0. The corresponding predictor time course X0 has a value of 1 for each time point and is, thus, also called "constant". Since multiplication with 1 does not alter the value of b0, this predictor time course (X0) does not explicitly appear in the equation. After estimation (see below), the value of b0 typically represents the signal level of the baseline condition and is also called intercept. While its absolute value is not informative, it is important to include the constant predictor in a design matrix since it allows the other predictors to model small condition-related fluctuations as increases or decreases relative to the baseline signal level. The other predictors on the right side model the expected time courses of different conditions. For multi-factorial designs, predictors may be defined coding combinations of condition levels in order to estimate main and interaction effects. The beta weight of a condition predictor quantifies the contribution of its time course in explaining the voxel time course. While the exact interpretation of beta values depends on the details of the design matrix, a large positive (negative) beta weight typically indicates that the voxel exhibits strong activation (deactivation) during the modeled experimental condition relative to baseline. All beta values together characterize a voxels "preference" for one or more experimental conditions. The last column in the system of equations contains error values, also called residuals, prediction errors or noise. These error values quantify the deviation of the measured voxel time course from the predicted time course, the linear combination of predictors.

The GLM system of equations may be expressed elegantly using matrix notation. For this purpose, we represent the voxel time course, the beta values and the residuals as vectors, and the set of predictors as a matrix:

Representing the indicated vectors and matrix with single letters, we obtain this simple form of the GLM system of equations:

In this notation, the matrix **X** represents the design matrix containing the predictor time courses as column vectors. The beta values now appear in a separate vector **b**. The term **Xb** indicates matrix-vector multiplication. The figure above shows a graphical representation of the GLM. Time courses of the signal, predictors and residuals have been arranged in column form with time running from top to bottom as in the system of equations.

Given the data **y** and the design matrix **X**, the GLM fitting procedure has to find a set of beta values explaining the data as good as possible. The time course values predicted by the model are obtained by the linear combination of the predictors:

A good fit would be achieved with beta values leading to predicted values which are as close as possible to the measured values **y**. By rearranging the system of equations, it is evident that a good prediction of the data implies small error values:

An intuitive idea would be to find those beta values minimizing the sum of error values. Since the error values contain both positive and negative values (and because of additional statistical considerations), the GLM procedure does not estimate beta values minimizing the sum of error values, but finds those beta values minimizing the sum of squared error values:

The term **e**'**e** is the vector notation for the sum of squares (Sigma e 2 ). The apostrophe symbol denotes transposition of a vector or matrix. The optimal beta weights minimizing the squared error values (the "least squares estimates") are obtained non-iteratively by the following equation:

The term in brackets contains a matrix-matrix multiplication of the transposed, **X**', and non-transposed, **X**, design matrix. This term results in a square matrix with a number of rows and columns corresponding to the number of predictors. Each cell of the **X**'**X** matrix contains the scalar product of two predictor vectors. The scalar product is obtained by summing all products of corresponding entries of two vectors corresponding to the calculation of covariance. This **X**'**X** matrix, thus, corresponds to the predictor variance-covariance matrix. The variance-covariance matrix is inverted as denoted by the "-1" symbol. The resulting matrix (**X**'**X**) -1 plays an essential role not only for the calculation of beta values but also for testing the significance of contrasts (see below). The remaining term on the right side, **X**'**y**, evaluates to a vector containing as many elements as predictors. Each element of this vector is the scalar product (covariance) of a predictor time course with the observed voxel time course.

An interesting property of the least squares estimation method is that the variance of the measured time course can be decomposed into the sum of the variance of the predicted values (model-related variance) and the variance of the residuals:

Since the variance of the voxel time course is fixed, minimizing the error variance by least squares corresponds to maximizing the variance of the values explained by the model. The square of the multiple correlation coefficient *R* provides a measure of the proportion of the variance of the data which can be explained by the model:

The values of the multiple correlation coefficient vary between 0 (no variance explained) to 1 (all variance explained by the model). A coefficient of *R* = 0.7, for example, corresponds to an explained variance of 49% (0.7x0.7). An alternative way to calculate the multiple correlation coefficient consists in computing a standard correlation coefficient between the predicted values and the observed values: R = r_{yy}. This equation offers another view on the meaning of the multiple correlation coefficient quantifying the interrelationship (correlation) of the combined set of predictor variables with the observed time course.

## What is the response vector in the fMRI GLM model? - Biology

Many techniques have been proposed for statistically analysing fMRI data, and a variety of these are in general use. The aim of such analysis is to produce an image identifying the regions which show significant signal change in response to the task. Each pixel is assigned a value dependent on the likelihood of the null hypothesis, that the observed signal changes can be explained purely by random variation in the data consistent with its variance, being false. Such an image is called a statistical parametric map. The aim of this section is to show how such maps can be produced.

All the methods described below have been used, at one time or another, in the analysis of the data presented in this thesis. The majority have been implemented as 'C' programs, with the notable exception of using the *SPM*[10] implementation of the general linear model.

Throughout this section, the analysis techniques described are demonstrated on an example data set. The experiment performed was intended to detect activations resulting from a visually cued motor task. The whole brain of the subject was imaged, in 16 coronal slices of resolution 3 x 3 x 10 mm 3 , every four seconds. As cued by an LED display, they were required to squeeze a ball at the rate of 2 Hz. The experiment involved 16 s of rest, followed by 16 s of task performance, repeated 32 times.

Further detail of the statistics mentioned in this chapter can be found in many statistics text books, for example those by Zar[11], and Miller and Freund[12].

#### 6.3.1 Subtraction Techniques

One of the simplest methods for obtaining results from a two state fMRI experiment is to perform a simple subtraction. This is carried out by averaging together all the images acquired during the 'on' phase of the task, and subtracting the average of all the 'off' images. The disadvantage of such a technique is that it is extremely sensitive to head motion, leading to large amounts of artefact in the image. Figure 6.7a shows a single slice through the motor cortex from the example data set, and Figure 6.7b shows the result of subtracting the 'off' images from the 'on' images. Although signal increase can be seen in the primary motor cortex, there is also a large amount of artefact, particularly at the boundaries in the image.

Such a method does not yield a statistic that can be tested against the null hypothesis, so instead of straight subtraction it is more common to use a Student's t-test. This weights the difference in means, by the standard deviation in 'off' or 'on' values, giving high t-scores to large differences with small standard deviations, and low t-scores to small differences with large standard deviations. The t-score is calculated on a pixel by pixel basis, for a time series X, using the formula

and is the pooled variance

The suffix '1' refers to the *n* 1 images acquired during the 'on' period of the task, and '2' refers to the *n* 2 images acquired during the rest period. Figure 6.7c shows the statistical parametric map of t-scores for the sample data set. Again motor cortex activation is clearly seen, but the movement artefact is reduced compared to the subtraction technique.

##### Figure 6.7 Use of subtraction techniques to analyse fMRI data. (a) A single slice coronal EPI image through the primary motor cortex. (b) The mean of the images acquired during the 'off' period of the fMRI experiment subtracted from the mean of the images acquired during the 'on' period. (c) The t-statistical parametric map corresponding to image (b).

#### 6.3.2 Correlation techniques

Since we know that the BOLD response is mediated by blood flow, it is possible to improve the detection of activations by predicting the shape of the response to the stimulus, and calculating correlation coefficients between each pixel time course and this reference waveform. This is less sensitive to other physiological changes during the experiment, and to movement. For a time course X, and a reference waveform Y, the correlation coefficient is calculated as

and has a value of 1 for perfect correlation, a value of zero for no correlation, and a value of -1 for perfect anti-correlation.

The choice of an appropriate reference waveform is vital for the success of this technique in finding activations. The first approximation might be a square wave, which is high for scans acquired during the task, and low for scans acquired during rest (Figure 6.8a). Such a waveform however takes no account of the delay and smoothness of the haemodynamic response which regulates the BOLD contrast. An improvement to this would be to change the phase of the square wave (Figure 6.8b), with the delay being between 3 and 6 seconds.

To improve the reference waveform further, it is necessary to look more closely at the actual haemodynamic response. In an experiment such as the one used for the example data set, where there is both visual and motor activation, it is possible to use the response to one type of stimulus to form the reference waveform for finding the other. In this case the time series for one or more pixels in, say the visual cortex is extracted (Figure 6.8c), and correlation coefficients are calculated between this waveform and that of every other pixel in the image. Such an analysis detects only those regions in the brain which respond to the stimulus in the same way as the visual cortex. The major disadvantage of this technique is that it is particularly sensitive to motion artefact, since if such artefact is present in the reference waveform then the movement of other regions will be highly correlated. To attempt to reduce this, the response in the visual cortex to each stimulus can be averaged together, producing a mean response to the single cycle. The reference waveform is then made up of a repetition of these single cycle average responses (Figure 6.8d).

##### Figure 6.8. Various reference functions that can be used to correlate with a pixel time course to detect activations (see text for descriptions)

To be more general in predicting the haemodynamic response, so that a reference waveform can be constructed for any length of stimulus, it is necessary to know the response to a single stimulus. Friston[13] suggested that a haemodynamic response function could be considered as a point spread function, which smoothes and shifts the input function. By deconvolving the response from a known area of activation with the stimulus function, the haemodynamic response function can be obtained. The haemodynamic response function is not completely uniform across the entire brain however, and the shape obtained from one region may not be optimum for another. As an alternative, the response can be modelled by a mathematical function, such as a Poisson or beta function. The Poisson function

with width l = 6 seconds, seems to fit well to observed haemodynamic responses (Figure 6.8e).

Since in general each slice of the volume imaged is not acquired at the same instant, it is necessary to accommodate timing differences in the correlation with the reference waveform. In order to do this, the relative magnitude of the activation at the time each slice was acquired is predicted, by convolving the input stimulus with a Poisson function. Then from this series, the correlation coefficients can be calculated on a slice by slice basis, constructing the reference waveform from the appropriate points in the predicted time series.

Examples of the effect of the reference waveform on the result is shown in Figure 6.9. Here, pixels in the head which correlate to the reference waveforms (shown in Figure 6.8), with r > 0.20 are shown in red, on top of the base image. The square wave correlation is the least effective in detecting activations (a), however a considerable improvement is obtained by delaying the waveform by 4 seconds (b). The correlation of the visual cortex with itself (c) is, not surprisingly, high, but using the average visual cortex response (d) improves the correlation in the motor cortex. The Poisson function model of the haemodynamic response (e) improves slightly on the delayed square wave, and is a good model.

(a) Square wave |

(b) Delayed Square Wave |

(c) Visual Cortex Response |

(d) Average Visual Cortex Response |

(e) Poisson Distribution Model |

##### Figure 6.9 Activation images obtained by correlating the test data sets with the reference waveforms shown in Figure 6.8.

#### 6.3.3 The General Linear Model

The statistical techniques described above are both parametric tests. That is to say, they assume that the observations are taken from normal populations. Most parametric modelling techniques are special cases of the general linear model. This framework for analysing functional imaging data, first developed for PET and then extended for fMRI, is implemented in the software package *SPM*[14]. The general linear model is only outlined here, since the theory is extensively covered in the literature[15].

The aim of the general linear model is to explain the variation of the time course *y* 1 . y i . y n , in terms of a linear combination of explanatory variables and an error term. For a simple model with only one explanatory variable *x* 1 . x i . x n , the general linear model can be written

where *b* is the scaling, or slope parameter, and *e i* is the error term. If the model includes more variables it is convenient to write the general linear model in matrix form

where now **Y** is the vector of observed pixel values, **b** is the vector of parameters and **e** is the vector of error terms. The matrix **X** is known as the design matrix. It has one row for every time point in the original data, and one column for every explanatory variable in the model. In analysing an fMRI experiment, the columns of **X** contain vectors corresponding to the 'on' and 'off' elements of the stimulus presented. By finding the magnitude of the parameter in **b** corresponding to these vectors, the presence or absence of activation can be detected.

**b** can be determined by solving the 'normal equations'

where is the best linear estimate of **b**. Provided that (**X** T **X)** is invertable then is given by

Such parameter estimates are normally distributed, and since the error term can be determined, statistical inference can be made as to whether the **b** parameter corresponding to the model of an activation response is significantly different from the null hypothesis.

The general linear model provides a framework for most kinds of modelling of the data, and can eliminate effects that may confound the analysis, such as drift or respiration, provided that they can be modelled.

#### 6.3.4 The Serial T-Test

All the techniques described above require the prediction of the time course that active regions will follow. For many experiments, the use of rapid imaging, and carefully designed paradigms, makes the separation of the order of cognitive events possible. One such example, which is part of our study of Parkinson's disease and described in more detail in Chapter 7, is a paradigm involving the initiation of movement. In this experiment, the subject was required to respond, by pressing a hand held button, to the visual presentation of the number '5', and to make no response to the presentation of a '2'. This paradigm presented two differences to conventional, epoch based experiments. Firstly, the activations of interest, which are those responsible for the button pressing, occurred at an irregular rate. Secondly, all the cognitive processes involved in the task, including both the planning and execution of the movement, occurred in a time period of a few hundred milliseconds, as opposed to the sustained activation used in epoch based paradigms. Such an experiment requires a new form of analysis. Two techniques have been evaluated, both of which make no assumptions about the time course of activations during the task: the serial t-test, described here, and an analysis of variance technique, explained in the next section.

The basis of the serial t-test is to define a resting state baseline, and compare the images acquired at each time point before, during and after the task with this baseline. Figure 6.10 illustrates the technique. For each time point following the stimulus, a mean and standard deviation image is constructed, as is a baseline mean and standard deviation image. Then a set of t-statistical parametric maps are formed by calculating, on a pixel by pixel basis, the t-score (using equations 6.5 - 6.7) for the difference between mean image one and the mean baseline image, mean image two and baseline, and so on.

##### Figure 6.10. fMRI analysis using the serial t-test

Figure 6.11 shows the result of analysing the example data set using this technique. Such a data set does not really show the benefit of the serial t-test analysis. Results shown in Chapter 7 better illustrate its use in looking at event timings, and unpredictable waveforms.

##### Figure 6.11 Results of processing the test data set using the serial t-test. Eight volume image sets are shown as rows, the top four corresponding to the 'rest' periods of the experiment, and the next four corresponding to the 'task' periods. Activation can be seen in both the primary motor and visual cortices.

The technique has two major disadvantages. The first is that, in order to achieve a sufficient signal to noise ratio, it is necessary to have many more cycles than in an epoch based paradigm, thus leading to longer experiments. This can be uncomfortable for the subject, and puts additional demands on the scanner hardware. There is some scope for bringing the single event tasks closer together, but there must be a sufficient interval to allow the BOLD signal to return to baseline. This delay is at least ten seconds in length. The second disadvantage is that the analysis results in many statistical parametric maps, which have to be interpreted as a whole. However the fact that the technique makes few assumptions about the data time course makes it a strong technique, and opens up the possibility of more diverse experimental design, and a move away from the epoch based paradigms.

#### 6.3.5 Analysis of Variance

A second technique which does not require any assumptions about the shape of the activation time course, looks at the changes in variance upon averaging. The technique is based on simple signal averaging theory[16]. Take, for example, the response measured to a repeated signal as shown in Figure 6.12. The time series contains two components, one is a genuine response to the signal, and the other is the random fluctuations due to uncorrelated physiological events and noise in the image. Upon averaging 32 cycles together, the magnitude of the noisy component is reduced but that of the repeated signal is not. The reduction of the noisy component can be measured by calculating the variance of both the unaveraged and averaged data set.

##### Figure 6.12. Signal Averaging. The variance of the noise in the average signal is N times less than it is in the original signal, where N is the number of cycles.

To detect regions of activation, the ratio of the variance of the averaged data set to the variance of the unaveraged data set is calculated for each pixel in the image. For pixels in regions of purely random intensity variations, this ratio will be around 1/*n*, where *n* is the number of cycles averaged together. Pixels in regions of activation, however, will have a significantly higher ratio than this, since the variance of both unaveraged and averaged data sets is dominated by the stimulus locked intensity variations of the BOLD effect, which does not reduce upon averaging.

The technique is more formally explained as an analysis of variance (ANOVA)[17]. If *X* ij refers to the *i*th time point after the stimulus, of the *j*th cycle of an experiment

time t | X 11 , | X 12 , | . | X 1j , | . | X 1n | X 1 |

time 2t | X 21 , | X 22 , | . | X 2j , | . | X 2n | X 2 |

. | . | . | . | . | . | . | . |

time it | X i1 , | X i2 , | . | X ij , | . | X in | X i |

. | . | . | . | . | . | . | . |

time kt | X k1 , | X k2 , | . | X kj , | . | X kn | X k |

X |

with *n* cycles and *k* points per cycle. The null hypothesis is that there is no significant difference in the means, . This can be tested by comparing two estimates of the population variance, s 2 , one based on variations in measurements of the same time point, and one based on the variance between time points.

The variance within measurement of any time point can be calculated by

and so the mean variance within time points is given by

and is based on *k(n-1)* degrees of freedom. The variance of the time point means is given by

then s 2 can be estimated by

which is based on *k*-1 degrees of freedom. Under the null hypothesis, both and independently estimate the population variance s 2 . This means that the ratio

will have an F distribution with *k*-1 and *k*(*n*-1) degrees of freedom. If there is any signal change that is time locked to the stimulus, the value of will be larger than expected under the null hypothesis.

In the analysis of fMRI data, equations 6.15 - 6.19 are used to form an F-statistical parametric map. These equations are implemented using the following short-cut formulas

To assess the validity of this approach on real data, phantom and head images were analysed. The phantom data set consisted of 256 volume images, each of 128 x 128 x 16 matrix size, obtained at a repetition rate of 4 seconds. The head data set consisted of 256 head images, of the same matrix size, with the subject performing no specified task. Both data sets were pre-processed in the same way that a functional imaging data set would be, and the analysis of variance was then carried out assuming 16 points per cycle and 16 cycles. Histograms of *F* for each data set are shown in Figure 6.13, with the appropriate F distribution, shown as a dotted line, given by

where *n* 1 is the number of points per cycle minus one, and *n* 2 is the total number of data points minus the number of points per cycle[18]. All three histograms show a good fit to the F distribution, confirming the validity of applying this technique to fMRI data.

##### Figure 6.13 Plots of calculated F-scores (solid line) and the appropriate F-distribution (dotted line) for (a) simulated data, (b) phantom data, and (c) head data.

The results of analysing the example activation data set using the ANOVA technique are shown in Figure 6.14. As with the serial t-test, such a data set does not best show the potential of this technique. A better example comes from the analysis of a study to investigate short term memory.

##### Figure 6.14 Results of processing the test data set using the analysis of variance technique. Those pixels shaded in red correspond to regions which vary in some way that is time locked to the stimulus.

The stimulus paradigm for this experiment had three stages. First three digits in succession were presented to the subject. Eight seconds later a fourth digit was presented, and the subject was required to respond by pressing a button in their right hand if this final digit was the same as any of the three previously presented, or by pressing the button in their left hand if not[19]. The final stage was a period of rest, to provide a baseline. The whole test was repeated 32 times.

It would be expected that some regions of the brain would only be active during the presentation of the digits, some during the retention period, some only in the recall stage and others throughout the whole memory task. To analyse such data using a correlation technique would mean predicting a whole set of reference waveforms. The ANOVA technique, however, detected the responses of different shapes in one test. Figure 6.15 shows the activation maps obtained from the ANOVA analysis of the short term memory experiment, together with time course plots from several of the areas.

##### Figure 6.15 Analysis of Variance technique applied to data from the short term memory experiment described in the text, together with average cycle plots for several regions of interest. Areas of the brain which act in different ways to the stimulus can be seen in a single activation image.

The final image of an ANOVA analysis, essentially shows all the regions which vary in some way that is synchronous to the presentation of the stimulus. It is equally good at picking up deactivations as activations. This makes these images a good starting point for other forms of analysis, such as principal component analysis or cluster analysis, so as to glean all the information that is available.

#### 6.3.6 Software Implementation

Due to the diversity of tests that can be carried out on one data set, software for implementing the tests described above was written as a set of separate programs.

The program *correlate* constructs a reference waveform, from user specified values, and optionally convolves this with a Poisson function of user specified width. Correlation coefficients are calculated using equation 6.8. If the reference waveform is made to vary between 0 and 1 then a measure of the percentage change upon activation can be obtained by calculating a linear regression of the form

The percentage change can be calculated as (*b/a*) x 100%.

The software also calculates the appropriate image of z-scores using Fishers Z transform and the reduced degrees of freedom as explained in the following sections. The file names for each of these outputs is

cc_<file>.img | correlation coefficients saved as 'shorts' x 10,000 |

pc_<file>.img | percentage change saved as 'shorts' x 1,000 (such that 1% signal change x 1,000) |

cz_<file>.img | z-scores saved as 'floats' (not scaled) |

The serial t-tests are performed by *tmap* and *tmapnc*, the first appropriate for cyclic experiments and the second for non-cyclic experiments. For the non-cyclic version the stimulus times are obtained from a text file, and both output t-score maps saved as 'shorts', scaled by 1000, in a file named tt_*<file>*.img.

ANOVA tests are carried out by the programs *anova*, and *anovanc*, which both output f-scores scaled by 1000, in a file named va_*<file>*.img.

## 2 Answers 2

This is a pretty broad question - I would basically translate this into: what is a GLM, and what is a mixed model. First of all, you write that you want to fit a GLM, but I suspect you mean LM, because the formula

would typically denote an LM. For the GLM, we would have an additional link function.

In the formula above, $Y$ is your response, $X$ are your predictors (design matrix), and $eta$ are the regression coefficients for these predictors (contrasts if categorical).

Your notation for a random-effects model is a bit unorthodox (not sure from where this is taken), but I would suspect

$Y = Xeta + epsilon$ $eta = X'eta' + epsilon'$

means that you want to fit a so-called random slope model, in which the regression coefficients / contrasts can differ for each grouping factor. The assumption of the random slope model is that differences in $eta$ between the groups are drawn from a normal distribution $epsilon'$, which is the between-group-variability. So the final entire vector of predictors $eta$ is composed of the overall $eta'$ and the random effects $epsilon'$.

In general, I'm not sure if this notation is exceedingly useful to understand how a mixed model works - I would suggest to read start with a general textbook or tutorial about mixed models.

- A simple tutorial in R is Bates, D. Mächler, M. Bolker, B. & Walker, S. (2014) Fitting linear mixed-effects models using lme4.
- A more statistical reference is Gelman, A. & Hill, J. (2006) Data analysis using regression and multilevel/hierarchical models. Cambridge University Press, in particular ch 11,12
- To get a basic explanation about the computational methods, you could look at Bates, D. M. (2010) lme4: Mixed-effects modeling with R.

Of course there are many other good books, it depends on your field and what level of mathematics you are looking for.

## A hybrid SVM-GLM approach for fMRI data analysis

The hypothesis-driven fMRI data analysis methods, represented by the conventional general linear model (GLM), have a strictly defined statistical framework for assessing regionally specific activations but require prior brain response modeling that is usually hard to be accurate. On the contrary, exploratory methods, like the support vector machine (SVM), are independent of prior hemodynamic response function (HRF), but generally lack a statistical inference framework. To take the advantages of both kinds of methods, this paper presents a composite approach through combining conventional GLM with SVM. This hybrid SVM-GLM concept is to use the power of SVM to obtain a data-derived reference function and enter it into the conventional GLM for statistical inference. The data-derived reference function was extracted from the SVM classifier using a new temporal profile extraction method. In simulations with synthetic fMRI data, SVM-GLM demonstrated a better sensitivity and specificity performance for detecting the synthetic activations, as compared to the conventional GLM. With real fMRI data, SVM-GLM showed better sensitivity than regular GLM for detecting the sensorimotor activations.

### Figures

SVM-based classification and temporal profile…

SVM-based classification and temporal profile extraction. A) 2-dimensional data classification using a linear…

A) Definition of CNR, B) the artificial brain activation time course for simulations…

An SDPtp extracted from a…

An SDPtp extracted from a representative subject’s sensorimotor BOLD fMRI data. A) the…

Averaged (n=8) AUCs of regular…

Averaged (n=8) AUCs of regular GLM and SVM-GLM on the synthetic data generated…

Group level statistical analysis results…

Group level statistical analysis results of the left-hand sensorimotor BOLD fMRI data. The…

T-score histograms of group level…

T-score histograms of group level analysis based on individual GLM and SVM-GLM results…

Group level statistical analysis results…

Group level statistical analysis results of the right-hand sensorimotor ASL perfusion fMRI data.…

## How to create Generalized Liner Model (GLM)

- age: age of the individual. Numeric
- education: Educational level of the individual. Factor.
- marital.status: Marital status of the individual. Factor i.e. Never-married, Married-civ-spouse, .
- gender: Gender of the individual. Factor, i.e. Male or Female
- income: Target variable. Income above or below 50K. Factor i.e. >50K, <=50K

- Step 1: Check continuous variables
- Step 2: Check factor variables
- Step 3: Feature engineering
- Step 4: Summary statistic
- Step 5: Train/test set
- Step 6: Build the model
- Step 7: Assess the performance of the model
- step 8: Improve the model

Your task is to predict which individual will have a revenue higher than 50K.

In this tutorial, each step will be detailed to perform an analysis on a real dataset.

### Step 1) Check continuous variables

In the first step, you can see the distribution of the continuous variables.

- continuous <- select_if(data_adult, is.numeric): Use the function select_if() from the dplyr library to select only the numerical columns
- summary(continuous): Print the summary statistic

From the above table, you can see that the data have totally different scales and hours.per.weeks has large outliers (.i.e. look at the last quartile and maximum value).

- 1: Plot the distribution of hours.per.week
- 2: Standardize the continuous variables

- Plot the distribution

Let's look closer at the distribution of hours.per.week

The variable has lots of outliers and not well-defined distribution. You can partially tackle this problem by deleting the top 0.01 percent of the hours per week.

We compute the top 2 percent percentile

98 percent of the population works under 80 hours per week.

You can drop the observations above this threshold. You use the filter from the dplyr library.

You can standardize each column to improve the performance because your data do not have the same scale. You can use the function mutate_if from the dplyr library. The basic syntax is:

You can standardize the numeric columns as follow:

### Step 2) Check factor variables

- Select the categorical columns
- Store the bar chart of each column in a list
- Print the graphs

We can select the factor columns with the code below:

- data.frame(select_if(data_adult, is.factor)): We store the factor columns in factor in a data frame type. The library ggplot2 requires a data frame object.

The dataset contains 6 categorical variables

The second step is more skilled. You want to plot a bar chart for each column in the data frame factor. It is more convenient to automatize the process, especially in situation there are lots of columns.

- lapply(): Use the function lapply() to pass a function in all the columns of the dataset. You store the output in a list
- function(x): The function will be processed for each x. Here x is the columns
- ggplot(factor, aes(get(x))) + geom_bar()+ theme(axis.text.x = element_text(angle = 90)): Create a bar char chart for each x element. Note, to return x as a column, you need to include it inside the get()

The last step is relatively easy. You want to print the 6 graphs.

Note: Use the next button to navigate to the next graph

### Step 3) Feature engineering

### Recast education

From the graph above, you can see that the variable education has 16 levels. This is substantial, and some levels have a relatively low number of observations. If you want to improve the amount of information you can get from this variable, you can recast it into higher level. Namely, you create larger groups with similar level of education. For instance, low level of education will be converted in dropout. Higher levels of education will be changed to master.

- We use the verb mutate from dplyr library. We change the values of education with the statement ifelse

In the table below, you create a summary statistic to see, on average, how many years of education (z-value) it takes to reach the Bachelor, Master or PhD.

### Recast Marital-status

### Step 4) Summary Statistic

It is time to check some statistics about our target variables. In the graph below, you count the percentage of individuals earning more than 50k given their gender.

Next, check if the origin of the individual affects their earning.

The number of hours work by gender.

The box plot confirms that the distribution of working time fits different groups. In the box plot, both genders do not have homogeneous observations.

You can check the density of the weekly working time by type of education. The distributions have many distinct picks. It can probably be explained by the type of contract in the US.

- ggplot(recast_data, aes( x= hours.per.week)): A density plot only requires one variable
- geom_density(aes(color = education), alpha =0.5): The geometric object to control the density

To confirm your thoughts, you can perform a one-way ANOVA test:

The ANOVA test confirms the difference in average between groups.

### Non-linearity

Before you run the model, you can see if the number of hours worked is related to age.

- ggplot(recast_data, aes(x = age, y = hours.per.week)): Set the aesthetic of the graph
- geom_point(aes(color= income), size =0.5): Construct the dot plot
- stat_smooth(): Add the trend line with the following arguments:
- method='lm': Plot the fitted value if the linear regression
- formula = y

In a nutshell, you can test interaction terms in the model to pick up the non-linearity effect between the weekly working time and other features. It is important to detect under which condition the working time differs.

### Correlation

The next check is to visualize the correlation between the variables. You convert the factor level type to numeric so that you can plot a heat map containing the coefficient of correlation computed with the Spearman method.

- data.frame(lapply(recast_data,as.integer)): Convert data to numeric
- ggcorr() plot the heat map with the following arguments:
- method: Method to compute the correlation
- nbreaks = 6: Number of break
- hjust = 0.8: Control position of the variable name in the plot
- label = TRUE: Add labels in the center of the windows
- label_size = 3: Size labels
- color = "grey50"): Color of the label

### Step 5) Train/test set

Any supervised machine learning task require to split the data between a train set and a test set. You can use the "function" you created in the other supervised learning tutorials to create a train/test set.

### Step 6) Build the model

To see how the algorithm performs, you use the glm() package. The

**Generalized Linear Model**is a collection of models. The basic syntax is:You are ready to estimate the logistic model to split the income level between a set of features.

- AIC (Akaike Information Criteria): This is the equivalent of
**R2**in logistic regression. It measures the fit when a penalty is applied to the number of parameters. Smaller**AIC**values indicate the model is closer to the truth. - Null deviance: Fits the model only with the intercept. The degree of freedom is n-1. We can interpret it as a Chi-square value (fitted value different from the actual value hypothesis testing).
- Residual Deviance: Model with all the variables. It is also interpreted as a Chi-square hypothesis testing.
- Number of Fisher Scoring iterations: Number of iterations before converging.

The output of the glm() function is stored in a list. The code below shows all the items available in the logit variable we constructed to evaluate the logistic regression.

# The list is very long, print only the first three elements

Each value can be extracted with the $ sign follow by the name of the metrics. For instance, you stored the model as logit. To extract the AIC criteria, you use:

### Step 7) Assess the performance of the model

##### Confusion Matrix

The

**confusion matrix**is a better choice to evaluate the classification performance compared with the different metrics you saw before. The general idea is to count the number of times True instances are classified are False.To compute the confusion matrix, you first need to have a set of predictions so that they can be compared to the actual targets.

- predict(logit,data_test, type = 'response'): Compute the prediction on the test set. Set type = 'response' to compute the response probability.
- table(data_test$income, predict > 0.5): Compute the confusion matrix. predict > 0.5 means it returns 1 if the predicted probabilities are above 0.5, else 0.

Each row in a confusion matrix represents an actual target, while each column represents a predicted target. The first row of this matrix considers the income lower than 50k (the False class): 6241 were correctly classified as individuals with income lower than 50k (

**True negative**), while the remaining one was wrongly classified as above 50k (**False positive**). The second row considers the income above 50k, the positive class were 1229 (**True positive**), while the**True negative**was 1074.You can calculate the model

**accuracy**by summing the true positive + true negative over the total observationThe model appears to suffer from one problem, it overestimates the number of false negatives. This is called the

**accuracy test paradox**. We stated that the accuracy is the ratio of correct predictions to the total number of cases. We can have relatively high accuracy but a useless model. It happens when there is a dominant class. If you look back at the confusion matrix, you can see most of the cases are classified as true negative. Imagine now, the model classified all the classes as negative (i.e. lower than 50k). You would have an accuracy of 75 percent (6718/6718+2257). Your model performs better but struggles to distinguish the true positive with the true negative.### Precision vs Recall

**Precision**looks at the accuracy of the positive prediction.**Recall**is the ratio of positive instances that are correctly detected by the classifier- mat[1,1]: Return the first cell of the first column of the data frame, i.e. the true positive
- mat[1,2] Return the first cell of the second column of the data frame, i.e. the false positive

- mat[1,1]: Return the first cell of the first column of the data frame, i.e. the true positive
- mat[2,1] Return the second cell of the first column of the data frame, i.e. the false negative

You can test your functions

When the model says it is an individual above 50k, it is correct in only 54 percent of the case, and can claim individuals above 50k in 72 percent of the case.

You can create the /> score based on the precision and recall. The /> is a harmonic mean of these two metrics, meaning it gives more weight to the lower values.

### Precision vs Recall tradeoff

It is impossible to have both a high precision and high recall.

- Imagine, you need to predict if a patient has a disease. You want to be as precise as possible.
- If you need to detect potential fraudulent people in the street through facial recognition, it would be better to catch many people labeled as fraudulent even though the precision is low. The police will be able to release the non-fraudulent individual.

### The ROC curve

The

**Receiver Operating Characteristic**curve is another common tool used with binary classification. It is very similar to the precision/recall curve, but instead of plotting precision versus recall, the ROC curve shows the true positive rate (i.e., recall) against the false positive rate. The false positive rate is the ratio of negative instances that are incorrectly classified as positive. It is equal to one minus the true negative rate. The true negative rate is also called**specificity**. Hence the ROC curve plots**sensitivity**(recall) versus 1-specificityTo plot the ROC curve, we need to install a library called RORC. We can find in the conda library. You can type the code:

conda install -c r r-rocr --yes

We can plot the ROC with the prediction() and performance() functions.

- prediction(predict, data_test$income): The ROCR library needs to create a prediction object to transform the input data
- performance(ROCRpred, 'tpr','fpr'): Return the two combinations to produce in the graph. Here, tpr and fpr are constructed. Tot plot precision and recall together, use "prec", "rec".

**Step 8)**Improve the modelYou need to use the score test to compare both model

The score is slightly higher than the previous one. You can keep working on the data a try to beat the score.

## Support vector machine learning-based fMRI data group analysis ☆

To explore the multivariate nature of fMRI data and to consider the inter-subject brain response discrepancies, a multivariate and brain response model-free method is fundamentally required. Two such methods are presented in this paper by integrating a machine learning algorithm, the support vector machine (SVM), and the random effect model. Without any brain response modeling, SVM was used to extract a whole brain spatial discriminance map (SDM), representing the brain response difference between the contrasted experimental conditions. Population inference was then obtained through the random effect analysis (RFX) or permutation testing (PMU) on the individual subjects' SDMs. Applied to arterial spin labeling (ASL) perfusion fMRI data, SDM RFX yielded lower false-positive rates in the null hypothesis test and higher detection sensitivity for synthetic activations with varying cluster size and activation strengths, compared to the univariate general linear model (GLM)-based RFX. For a sensory–motor ASL fMRI study, both SDM RFX and SDM PMU yielded similar activation patterns to GLM RFX and GLM PMU, respectively, but with higher

*t*values and cluster extensions at the same significance level. Capitalizing on the absence of temporal noise correlation in ASL data, this study also incorporated PMU in the individual-level GLM and SVM analyses accompanied by group-level analysis through RFX or group-level PMU. Providing inferences on the probability of being activated or deactivated at each voxel, these individual-level PMU-based group analysis methods can be used to threshold the analysis results of GLM RFX, SDM RFX or SDM PMU.

## Independent vector analysis (IVA): multivariate approach for fMRI group study

Independent component analysis (ICA) of fMRI data generates session/individual specific brain activation maps without a priori assumptions regarding the timing or pattern of the blood-oxygenation-level-dependent (BOLD) signal responses. However, because of a random permutation among output components, ICA does not offer a straightforward solution for the inference of group-level activation. In this study, we present an independent vector analysis (IVA) method to address the permutation problem during fMRI group data analysis. In comparison to ICA, IVA offers an analysis of additional dependent components, which were assigned for use in the automated grouping of dependent activation patterns across subjects. Upon testing using simulated trial-based fMRI data, our proposed method was applied to real fMRI data employing both a single-trial task-paradigm (right hand motor clenching and internal speech generation tasks) and a three-trial task-paradigm (right hand motor imagery task). A generalized linear model (GLM) and the group ICA of the fMRI toolbox (GIFT) were also applied to the same data set for comparison to IVA. Compared to GLM, IVA successfully captured activation patterns even when the functional areas showed variable hemodynamic responses that deviated from a hypothesized response. We also showed that IVA effectively inferred group-activation patterns of unknown origins without the requirement for a pre-processing stage (such as data concatenation in ICA-based GIFT). IVA can be used as a potential alternative or an adjunct to current ICA-based fMRI group processing methods.

## 1.4 Variable selection

Variable selection for a GLM model is similar to the process for an OLS model. Nested model tests for significance of a coefficient are preferred to Wald test of coefficients. This is due to GLM coefficients standard errors being sensitive to even small deviations from the model assumptions. It is also more accurate to obtain p-values for the GLM coefficients from nested model tests.

The likelihood ratio test (LRT) is typically used to test nested models. For quasi family models an F-test is used for nested model tests (or when the fit is overdispersed or underdispersed). This use of the F statistic is appropriate if the group sizes are approximately equal.

Which variable to select for a model may depend on the family that is being used in the model. In these cases variable selection is connected with family selection. Variable selection criteria such as AIC and BIC are generally not applicable for selecting between families.

## Results

### Face decoding and reconstruction

We used the pre-trained VAE–GAN model described in Fig. 1 (with “frozen” parameters) to train a brain-decoding system. During training (Fig. 2a), the system learned the correspondence between brain activity patterns in response to numerous face images and the corresponding 1024-D latent representation of the same faces within the VAE network. More than 8000 distinct examples were used on average (range across subjects: 7664–8626), which involved 12 h of scanning over eight separate sessions for each subject. The learning procedure assumed that each brain voxel’s activation could be described as a weighted sum of the 1024 latent parameters, and we simply estimated the corresponding weights via linear regression (GLM function in SPM see Methods). After training (Fig. 2b), we inverted the linear system, such that the decoder was given the brain pattern of the subject viewing a specific, novel face image as input (a face that was not included in the training set), and its output was an estimate of the 1024-dimensional latent feature vector for that face. The image of the face was then generated (or “reconstructed”) through the generative (VAE–GAN) neural network.

Brain decoding of face images based on VAE–GAN latent representations.

**a**Training phase. Each subject saw8000 faces (one presentation each) in a rapid event-related design. The same face images were also run through the “Encoder” network (as described in Fig. 1) or a PCA decomposition, to obtain a 1024-dimensional latent face description. The “brain decoder” was a simple linear regression, trained to associate the 1024-dimensional latent vector with the corresponding brain response pattern. This linear regression, with 1024 parametric regressors for the BOLD signal (and an additional constant “bias” term), produced a weight matrix

*W*(1025 by*n*_{voxels}dimensions) optimized to predict brain patterns in response to face stimuli.**b**Testing phase. We also presented 20 distinct “test” faces (not part of the training set at least 45 randomly interleaved presentations each) to the subjects. The resulting brain activity patterns were simply multiplied by the transposed weight matrix*W*T (*n*_{voxels}by 1025 dimensions) and its inverse covariance matrix to produce a linear estimate of the latent face dimensions. The Generator network (Fig. 1a) or an inverse PCA transform was then applied to translate the predicted latent vector into a reconstructed face imageWe contrasted the results obtained from this deep neural network model with those produced by another, simpler model of face image decomposition: principal components analysis (PCA, retaining only the first 1024 principal components from the training data set see Supplementary Fig. 1). The PCA model also describes every face by a vector in a 1024-dimensional latent space, and can also be used to reconstruct faces based on an estimate of this 1024-D feature vector, as demonstrated in recent studies 13,14 .

For both the deep neural network and PCA-based models, we defined a subset of the gray matter voxels as our “region-of-interest” (ROI). Indeed, many parts of the brain perform computations that are not related to face processing or recognition entering such regions in our analysis would adversely affect signal-to-noise. Our selection criterion combined two factors: (i) voxels were expected to respond to face stimuli (as determined by a

*t*test between face and baseline conditions, i.e., fixation of an empty screen), and (ii) the explained variance of the voxels’ BOLD response was expected to improve when the 1024 latent face features were entered as regressors in the linear model (compared with a baseline model with only a binary face regressor: face present/absent). The distribution of voxels along these two dimensions, and the corresponding selection criterion, are illustrated for one representative subject in Supplementary Fig. 2. Across the four subjects, the number of resulting voxels in the selection was100,000 (mean: 106,612 range: 74,388-162,388). The selected voxels are depicted in Fig. 3 they include occipital, temporal, parietal, and frontal regions. A separate selection was made based on the PCA face parameters, and used for the PCA-based “brain decoder” (mean number of selected voxels: 106,685 range: 74,073–164,524) the selected regions were virtually identical for the two models. It is important to highlight that the above voxel selection criteria were applied based on BOLD responses to the training face images only, but not to the 20 test images therefore, the decoding analysis does not suffer from “circular reasoning” issues caused by this voxel selection 16 .

Voxels selected for brain decoding. Voxels were selected based on a combination of their visual responsiveness and their GLM goodness-of-fit during the brain decoder training stage (Fig. 2a). The color code (red to yellow) indicates the number of subjects (1–4) for whom each particular voxel was selected. The colored lines indicate the boundaries of standard cortical regions 43

Examples of the reconstructed face images from the test image set of each of the four subjects are shown in Fig. 4a. Although both the VAE–GAN and the PCA models could reconstruct an acceptable likeness of the original faces, the images reconstructed from the deep generative neural network (VAE–GAN) appear more realistic, and closer to the original image. We quantified the performance of our brain-decoding system by correlating the brain-estimated latent vectors of the 20 test faces with the 20 actual vectors, and used the pairwise correlation values to measure the percentage of correct classification. For each subject, for each of the 20 test faces, we compared the decoded 1024-D vector with the ground-truth vector from the actual test image, and to that of another test image (distractor): brain decoding was “correct” if the correlation with the actual target vector was higher than with the distractor vector. This was repeated for all (20 × 19) pairs of test images, and the average performance compared with chance (50%) with a non-parametric Monte–Carlo test (see Methods: Statistics). Reconstructions from the GAN model achieved 95.5% classification (range: 91.3–98.7%, all

*p*< 10 −6 ), whereas the PCA model only reached 87.5% (range 76.6–92.4%, still highly above chance, all*p*< 10 −4 , but much below the GAN model, Friedman non-parametric test,*χ*2 (1) = 4,*p*< 0.05). We also tested the ability of the brain decoder to pick the exact correct face among the 20 test faces: this “full recognition” task was deemed correct if and only if the reconstructed latent vector was more correlated to the true target vector than to all of the 19 distractor vectors. This is a more-stringent test of face recognition, with chance level at 5%: the VAE–GAN model achieved 65% correct (range: 40–75%, binomial test, all*p*< 10 −6 ), whereas the PCA model resulted in 41.25% correct recognition only (range 25–50%, all*p*< 10 −3 ) again, the VAE–GAN model performance was significantly higher than the PCA (*χ*2 (1) = 4,*p*< 0.05).Face reconstruction.

**a**Examples of reconstructed face images. For each of our four subjects (S1–4), the first column displays four example faces (two male+two female, chosen among the 20 test faces) actually shown to the subject during the scanning sessions. The next two columns are the face reconstructions based on the corresponding fMRI activation patterns for the brain-decoding system trained using the VAE–GAN latent space (middle column) or PCA decomposition (right column).**b**Pairwise recognition. The quality of brain decoding was quantified with a pairwise pattern classification (operating on the latent vector estimates), and the average performance compared with chance (50%). Brain decoding from the VAE–GAN model achieved 95.5% correct performance on average (*p*< 10 −6 ), the PCA model only 87.5% (*p*< 10 −4 ) the difference between the two models was significant (*χ*2 (1) = 4,*p*< 0.05).**c**Full recognition. A more-stringent performance criterion was also applied, whereby decoding was considered correct if and only if the procedure identified the exact target face among all 20 test faces (chance = 5%). Here again, performance of the VAE–GAN model (65%) was far above chance (*p*< 10 −6 ), and outperformed (*χ*2 (1) = 4,*p*< 0.05) the PCA model (41.25%*p*< 10 −3 )As linear regression models typically require many more data samples than their input dimensions, we had initially decided to train the brain-decoding system with

8000 faces per subject (compared with the 1024 latent dimensions). In order to establish whether smaller training sets might be sufficient, we repeated the linear regression step (computation of the

*W*matrix in Fig. 2a) using only one half, one quarter or one eighth of the training data set (see Supplementary Fig. 3). For both pairwise and full recognition measures, above-chance performance could already be obtained with1000 training faces however, decoding performance kept growing as the training set size was increased, and was highest for

8000 training faces. Importantly, the PCA model remained below the VAE–GAN model for all training set sizes.

These comparisons indicate that it is easier and more efficient to create a linear mapping from human brain activations to the VAE–GAN latent space than to the PCA space. This is compatible with our hypothesis that the deep generative neural network is more similar to the space of human face representations. In addition, this classification accuracy was measured here based on the distance (or vector correlation) in the latent space of each model it is even possible that the difference between the two models could be exacerbated if their accuracy was evaluated with a common metric, such as the perceptual quality of reconstructed images. To support this idea, we asked naive human observers to compare the quality of faces reconstructed by the two models: each original test image from each of the four subjects was shown together with the corresponding VAE–GAN and PCA reconstructions the observer decided which reconstruction was perceptually more similar to the original. Each pair was rated 15 times overall, by at least 10 distinct participants, with at least five participants seeing the two response options in either order, VAE–GAN first or PCA first. The VAE–GAN reconstruction was chosen in 76.1% of trials, whereas the PCA reconstruction only in 23.9% of trials. That is, observers were three times more likely to prefer the quality of VAE–GAN reconstructed faces than PCA reconstructions, a difference that was highly unlikely to occur by chance (binomial test, 1200 observations,

*p*< 10 −10 ).### Contributions from distinct brain regions

To determine which brain regions most contributed to the face reconstruction abilities of the two brain-decoding models, for each subject we divided our voxel selection into three equally sized subsets, as illustrated in Fig. 5a. The brain decoding and face reconstruction procedure was then applied separately for these three subsets. The pairwise recognition results revealed that occipital voxels, and to a lesser extent temporal voxels, were providing most of the information necessary for brain decoding (Fig. 5b). Occipital voxels' decoding performance was much above chance (50%) for both models (VAE–GAN: 91.8%, all individual

*p*< 10 −6 PCA: 87.2%, all*p*< 10 −4 ), and similarly for temporal voxels (VAE–GAN: 78.8%, all*p*< 10 −3 PCA: 73.6%, all*p*< 0.01). On the other hand, although frontoparietal voxels satisfied our selection criteria (see Fig. 3), they did not carry sufficiently reliable information on their own to allow for accurate classification (VAE–GAN: 60.1%, one subject with*p*< 10 −6 , all other*p*> 0.2 PCA: 56.4%, one subject with*p*< 10 −6 , all other*p*> 0.05 see, however, Lee et al. 14 ). The pattern of results was identical for both the VAE–GAN and the PCA-based decoding models: a non-parametric Friedman test suggested that performance differed across the three subsets (for VAE–GAN:*χ*2 (2) = 8,*p*< 0.02 for PCA:*χ*2 (2) = 6.5,*p*< 0.04), with post hoc tests revealing that occipital voxels performed significantly better than frontoparietal ones, with temporal voxels in between (not significantly different from either of the other two). Across all voxel selections, PCA always produced lower accuracies than VAE–GAN—though this difference did not reach statistical significance given our limited subject number (across all three voxel selections,*χ*2 (1) ≥ 3,*p*> 0.08).Contributions from distinct brain regions.

**a**Voxel segmentation procedure. To investigate the brain regions that most strongly supported our brain-decoding performance, while keeping the different subsets comparable, we linearly separated our voxel selection into three equally sized subsets. First, the 1/3 of most posterior voxels for each subject were labeled as “occipital”. Among the remaining voxels, the more rostral half (1/3 of the initial number) was labeled as “temporal”, and the remaining caudal half as “frontoparietal”. This three-way segmentation, different for each subject, was chosen because the performance of our brain-decoding procedure is highly sensitive to the number of included voxels.**b**Pairwise recognition performance for the different regions of interest. The full selection refers to the set of voxels depicted in Fig. 3 it is the same data as in Fig. 4b, averaged over subjects (error bars reflect standard error of the mean). Circles represent individual subjects’ performance. The dotted line is the*p*< 0.05 significance threshold for individual subjects’ performance. Among the three subsets, and for both the VAE–GAN and PCA models, performance is maximal in occipital voxels, followed by temporal voxels. Frontoparietal voxels by themselves do not support above-chance performance (except for one of the four subjects). In all cases, the VAE–GAN model performance remains higher than the PCA modelTo further distinguish the relative contributions of the three brain regions to the brain-decoding performance, we also employed a variance partitioning approach (Supplementary Fig. 4). Compatible with the results already described in Fig. 5b, we found that latent vector predictions derived from occipital voxels accounted for the largest portion of the variance of the corresponding ground-truth latent vectors, followed by temporal voxels, and finally frontoparietal voxels. Each of the three areas also had a unique, independent contribution to the explained variance, which was sizably larger for the VAE–GAN than the PCA model. That is, even though occipital voxels provided the most accurate reconstructions, temporal voxels did not merely convey redundant information.

### Possible applications: gender decoding as an example

The learned mapping between the brain-activation patterns and the deep generative neural network latent space (i.e., the matrix

*W*in Fig. 2a) can serve as a powerful tool to probe the human brain representation of faces, without necessarily having to perform costly additional experiments. A straightforward application, for example, could be the visualization of the facial feature selectivity of any voxel or ROI in the brain. The voxel or ROI defines a subset of columns in the*W*matrix (Fig. 2), each column storing a latent vector that represents the voxel’s facial selectivity. By simply running this latent vector (or its average over the ROI) into the face Generator network, the voxel or ROI selectivity can be revealed as an actual face image.Another extension would be to explore the brain representation of behaviorally important facial features, such as gender, race, emotion or age. Any such face property can be expressed as a latent vector, which can easily be computed based on a number of labeled face examples (by subtracting the average latent vector for faces without the attribute label from the average latent vector for faces with the label see Fig. 1b for examples of latent vectors computed with faces having a “smile” label, or a “male” label). The publicly available celebrity face data set (CelebA 17 ) used in our experiments is already associated with 40 such labels describing gender, expressions, skin or hair color, and numerous other properties of each face. Note that these 40 binary labels (feature present/absent) were collected via a manual annotation procedure for each face stimulus in the face data set, and were chosen to be representative of the variability in the data set. Given the latent vector describing such a facial property, we can use the brain-decoding model to find out which brain voxels are most sensitive to the associated face property. This procedure is illustrated in Supplementary Fig. 5 for the example of the “gender” attribute (“male” label). The voxels most sensitive to this facial property are recovered by computing the column-wise correlation of the matrix

*W*with the “male” latent vector: gender-selective voxels must have strongly positive or strongly negative correlation values (depending on their preference towards male or female faces). The voxels with largest (absolute-value) correlations are found in occipital and temporal regions, notably in both early visual areas and the fusiform cortex (Supplementary Fig. 5), consistent with a previous report of distributed representation of gender information 6 .Finally, another way to investigate the brain representation of a specific facial attribute is to create a simple classifier to label the brain-decoded latent vectors according to this face property. This is illustrated in Fig. 6, again for the example of the “gender” face attribute. Each brain-decoded latent vector is projected onto the “gender” axis of the latent space (Fig. 6a), and the sign of the projection determines the classification output (“male” for positive, “female” for negative signs). This rudimentary classifier provides sufficient information to classify face gender with 70% accuracy (binomial test,

*p*= 0.0001 Fig. 6b). A non-parametric Friedman test indicates that gender decoding performance differs across the three voxel subsets (*χ*2 (2) = 7.6,*p*< 0.03), and a post hoc test reveals that occipital voxels perform significantly better than frontoparietal ones, with temporal voxels in-between (not significantly different from either of the other two). Previous attempts at classifying face gender using multi-voxel pattern analysis had achieved limited success, with maximum classification accuracy below 60% 6,8 . Our simple linear brain decoder (Fig. 6a) already improves on these previous methods, while still leaving room for future enhancements, e.g., using more powerful classification techniques (such as SVM) on the brain-decoded latent vectors.Gender decoding.

**a**Basic linear classifier. A simple gender classifier was implemented as a proof-of-principle. The “gender” axis was computed by subtracting the average latent description of 10,000 female faces from the average latent description of 10,000 male faces. Each latent vector was simply projected onto this “gender” axis, and positive projections were classified as male, negative projections as female.**b**Decoding accuracy. When applied to the true latent vectors for each subject’s test faces, this basic classifier performed at 85% correct (range: 80–90%). This is the classifier’s ceiling performance, represented as a horizontal gray region (mean ± sem across subjects). When operating on the latent vectors estimated via our brain-decoding procedure, the same gender classifier performed at 70% correct, well above chance (binomial test,*p*= 0.0001 bars represent group-average accuracy ± sem across subjects, circles represent individual subjects’ performance). Gender classification was also accurate when restricting the analysis to occipital voxels (71.25%,*p*= 0.00005) or temporal voxels (66.25%,*p*< 0.001), but not frontoparietal voxels (51.25%,*p*= 0.37). The star symbols indicate group-level significance: ****p*< 0.001, ***p*< 0.01. The dotted line is the*p*< 0.05 significance threshold for individual subjects’ performance### Imagery decoding

To further demonstrate the versatility of our brain-decoding method, we next applied it to another notoriously difficult problem: retrieving information about stimuli that are not directly experienced by the subject, but only imagined in their “mind’s eye”. Previous studies have shown that this classification problem can be solved when the different classes of stimuli to be imagined are visually distinctive 18 , such as images from different categories 19,20,21,22,23,24 . However, the ability to distinguish between highly visually similar objects—such as different faces—during imagery, as far as we know, has not been reported before.

Prior to the experiment, each subject chose one face among a set of 20 possible images (different from both their training and test image sets). During the experiment, they were instructed to imagine this specific face, whenever a large gray square occurred in the middle of the screen (12s presentation). These imagery trials were repeated 52 times on average (range across subjects: 51–55) during the fMRI-scanning sessions, interleaved with normal stimulus presentations. The average BOLD response during imagery was then used to estimate a latent face vector (using the brain decoder illustrated in Fig. 2b), and this vector was compared with the 20 possible latent vectors in a pairwise manner, as described previously for test images (Figs. 4b, 5b). As illustrated in Fig. 7 (see also Supplementary Fig. 6), the pairwise decoding performance was not different from chance (50%) in each of our predefined regions of interest (full selection

*p*= 0.53, occipital*p*= 0.30 or frontoparietal regions*p*= 0.43), with the sole exception of the temporal voxel selection, which produced 84.2% correct decoding (*p*= 0.012). A non-parametric Friedman test indicated that imagery decoding performance differed across the three subsets (*χ*2 (2) = 6.5,*p*< 0.04), and a post hoc test revealed that temporal voxels performed significantly better than frontoparietal ones, with occipital voxels in-between (not significantly different from either of the other two). Altogether, temporal regions, but not occipital or frontoparietal ones, can support mental imagery reconstruction. This performance could reflect the strong involvement of temporal brain regions in high-level face processing 25,26,27 , as well as the primarily top–down nature of mental imagery 28 . In any case, the ability to classify imagined faces from brain response patterns highlights again the flexibility and potential of our approach.Imagery decoding. The fMRI BOLD response pattern recorded during mental imagery of a specific face (not visible on the screen) was passed through our brain-decoding system. The resulting estimated latent vector was compared with the true vector and 19 distractor vectors, in a pairwise manner. Only the temporal voxel selection supported above-chance imagery decoding, with 84.2% correct performance (

*p*= 0.012). Neither occipital, nor frontoparietal regions, nor the full voxel selection performed above chance (all*p*> 0.30). Bars represent group-average accuracy ( ± sem across subjects), circles represent individual subjects’ performance. The star symbols indicate group-level significance: * for*p*< 0.05

## Access to Document

- APA
- Standard
- Harvard
- Vancouver
- Author
- BIBTEX
- RIS

In: NeuroImage , Vol. 40, No. 1, 01.03.2008, p. 86-109.

Research output : Contribution to journal › Article › peer-review

T1 - Independent vector analysis (IVA)

T2 - Multivariate approach for fMRI group study

N1 - Funding Information: This work was partially supported in part by grants from NIH R01-NS048242 to Yoo, SS and NIH U41RR019703 to Jolesz FA.

N2 - Independent component analysis (ICA) of fMRI data generates session/individual specific brain activation maps without a priori assumptions regarding the timing or pattern of the blood-oxygenation-level-dependent (BOLD) signal responses. However, because of a random permutation among output components, ICA does not offer a straightforward solution for the inference of group-level activation. In this study, we present an independent vector analysis (IVA) method to address the permutation problem during fMRI group data analysis. In comparison to ICA, IVA offers an analysis of additional dependent components, which were assigned for use in the automated grouping of dependent activation patterns across subjects. Upon testing using simulated trial-based fMRI data, our proposed method was applied to real fMRI data employing both a single-trial task-paradigm (right hand motor clenching and internal speech generation tasks) and a three-trial task-paradigm (right hand motor imagery task). A generalized linear model (GLM) and the group ICA of the fMRI toolbox (GIFT) were also applied to the same data set for comparison to IVA. Compared to GLM, IVA successfully captured activation patterns even when the functional areas showed variable hemodynamic responses that deviated from a hypothesized response. We also showed that IVA effectively inferred group-activation patterns of unknown origins without the requirement for a pre-processing stage (such as data concatenation in ICA-based GIFT). IVA can be used as a potential alternative or an adjunct to current ICA-based fMRI group processing methods.

AB - Independent component analysis (ICA) of fMRI data generates session/individual specific brain activation maps without a priori assumptions regarding the timing or pattern of the blood-oxygenation-level-dependent (BOLD) signal responses. However, because of a random permutation among output components, ICA does not offer a straightforward solution for the inference of group-level activation. In this study, we present an independent vector analysis (IVA) method to address the permutation problem during fMRI group data analysis. In comparison to ICA, IVA offers an analysis of additional dependent components, which were assigned for use in the automated grouping of dependent activation patterns across subjects. Upon testing using simulated trial-based fMRI data, our proposed method was applied to real fMRI data employing both a single-trial task-paradigm (right hand motor clenching and internal speech generation tasks) and a three-trial task-paradigm (right hand motor imagery task). A generalized linear model (GLM) and the group ICA of the fMRI toolbox (GIFT) were also applied to the same data set for comparison to IVA. Compared to GLM, IVA successfully captured activation patterns even when the functional areas showed variable hemodynamic responses that deviated from a hypothesized response. We also showed that IVA effectively inferred group-activation patterns of unknown origins without the requirement for a pre-processing stage (such as data concatenation in ICA-based GIFT). IVA can be used as a potential alternative or an adjunct to current ICA-based fMRI group processing methods.

## Watch the video: Μοναδιαία διανύσματα (November 2022).