OTTERS: A powerful TWAS framework leveraging summary-level reference data

Most existing TWAS tools require individual-level eQTL reference data and thus are not applicable to summary-level reference eQTL datasets. The development of TWAS methods that can harness summary-level reference data is valuable to enable TWAS in broader settings and enhance power due to increased reference sample size. Thus, we develop a TWAS framework called OTTERS (Omnibus Transcriptome Test using Expression Reference Summary data) that adapts multiple polygenic risk score (PRS) methods to estimate eQTL weights from summary-level eQTL reference data and conducts an omnibus TWAS. We show that OTTERS is a practical and powerful TWAS tool by both simulations and application studies.

Stage I that uses the GReX regression models to estimate effect sizes of SNP predictors that, in the broad sense, are expression quantitative trait loci (eQTLs), Stage II of TWAS proceeds by using these trained eQTL effect sizes to impute GReX within an independent GWAS of a complex human disease or trait.One can then test for association between the imputed GReX and phenotype, which is equivalent to a gene-based association test taking these eQTL effect sizes as corresponding test SNP weights [19][20][21] .
For Stage I of TWAS, a variety of training tools exist for fitting GReX regression models using reference expression and genetic data, including PrediXcan 19 , FUSION 20 , and TIGAR 22 .
While these methods all employ different techniques for model fitting, they all require individuallevel reference expression and genetic data to estimate eQTL effect sizes for TWAS.Therefore, these methods cannot be applied to emerging reference summary-level eQTL results such as those generated by the eQTLGen 23 and CommonMind 24 consortia, which provide eQTL effect sizes and p-values relating individual SNPs to gene expression.The development of TWAS methods that can utilize such summary-level reference data is valuable to permit applicability of the technique to broader analysis settings.Moreover, as TWAS power increases with increasing reference sample size 25 , TWAS using summary-level reference datasets can lead to enhanced performance compared to using individual-level reference datasets since the sample sizes of the former often are considerably larger than the latter.For example, the sample size of the summarybased eQTLGen reference sample is 31,684 for blood, whereas the sample size of the individuallevel GTEx V6 reference is only 338 for the same tissue.Consequently, TWAS analysis leveraging the summary-based eQTLGen dataset as reference likely can provide novel insights into genetic regulation of complex human traits.
In this work, we propose a framework that can use summary-level reference data to train GReX regression models required for Stage I of TWAS analysis.Our method is motivated by a variety of published polygenic risk score (PRS) methods [26][27][28][29][30][31] that can predict phenotype in a test dataset using summary-level SNP effect-size estimates and p-values based on single SNP tests from an independent reference GWAS.We can adapt these PRS methods for TWAS since eQTL effect sizes are essentially SNP effect sizes resulting from a reference "GWAS" of gene expression.Thus, our predicted GReX in Stage II of TWAS is analogous to the PRS constructed based on training GWAS summary statistics of single SNP-trait association.Here, we adapt four representative summary-data based PRS methods --P-value Thresholding with linkage disequilibrium (LD) clumping (P+T) 26 , frequentist LASSO 32 regression based method lassosum 27 , nonparametric Bayesian Dirichlet Process Regression (DPR) model 33 based method SDPR 29 , and Bayesian multivariable regression model based method with continuous shrinkage (CS) priors PRS-CS 28 for TWAS analysis.We apply each of these PRS methods to first train eQTL effect sizes based on a multivariable regression model from summary-level reference eQTL data (Stage I), and subsequently use these eQTL effect sizes (i.e., eQTL weights) to impute GReX and then test GReX-trait association in an independent test GWAS (Stage II).
As we will show, the PRS method with optimal performance for TWAS depends on the underlying genetic architecture for gene expression.Since the genetic architecture of expression is unknown apriori, we maximize the performance of TWAS over different possible architectures by proposing a novel TWAS framework called OTTERS (Omnibus Transcriptome Test using Expression Reference Summary data).OTTERS first constructs individual TWAS tests and pvalues using eQTL weights trained by each of the PRS techniques outlined above, and then calculates an omnibus test p-value using the aggregated Cauchy association test 34 (ACAT-O) with all individual TWAS p-values (Figure 1).OTTERS is applicable to both summary-level and individual-level test GWAS data within Stage II TWAS analysis.
In subsequent sections, we first describe how to use the PRS methods on summary-level reference eQTL data in Stage I TWAS, and then describe how we can use the resulting eQTL weights to perform Stage II TWAS using OTTERS.We then evaluate the performance of individual PRS methods and OTTERS using simulated expression and real genetic data based on patterns observed in real datasets.Interestingly, when we assume individual-level reference data are available, we observe that OTTERS outperforms the popular FUSION 20 approach across all simulation settings considered.Many of the individual PRS methods also outperform FUSION in these settings.We then apply OTTERS to blood eQTL summary-level data (n=31,684) from the eQTLGen consortium 23 and GWAS summary data of cardiovascular disease from the UK Biobank (UKBB) 35 .By comparing OTTERS results to those of FUSION 20

using individual-level
GTEx reference data of whole blood tissue, we demonstrate that OTTERS using large summarylevel reference datasets and multiple gene expression imputation models can successfully reveal potential risk genes missed by FUSION based on smaller individual-level reference datasets and only one model.Finally, we conclude with a discussion.

Method Overview
For the standard two-stage TWAS approach, Stage I estimates a GReX imputation model using individual-level expression and genotype data available from a reference dataset, and then Stage II uses the eQTL effect sizes from Stage I to impute gene expression (GReX) in an independent GWAS and test for association between GReX and phenotype.GReX for test samples can be imputed from individual-level genotype data and eQTL effect size estimates.
When individual-level GWAS data are not available, one can instead use summary-level GWAS data for TWAS by applying the TWAS Z-score statistics proposed by FUSION 20 and S-PrediXcan 36 (see details in Methods).
Since eQTL summary data are analogous to GWAS summary data where gene expression represents the phenotype, we can follow the idea from PRS methods to estimate the eQTL effect sizes based on a multivariable regression model using only marginal least squared effect estimates and p-values (based on a single variant test) from the eQTL summary data as well as a reference LD panel from samples of the same ancestry [26][27][28][29] .Although all PRS methods are applicable to TWAS Stage I, we only consider four representative methods --P+T 26 , Frequentist lassosum 27 , Nonparametric Bayesian SDPR 29 , Bayesian PRS-CS 28 (see details in Methods).
As shown in Figure 1, OTTERS first trains GReX imputation models per gene g using P+T, lassosum, SDPR, and PRS-CS methods that each infers cis-eQTL weights using cis-eQTL summary data and an external LD reference panel of the same ancestry (Stage I).Once we derive cis-eQTL weights for each training method, we can impute the respective GReX using that method and perform the respective gene-based association analysis in the test GWAS dataset.
We thus derive a set of TWAS p-values for gene g, one per training method.We then use these TWAS p-values to create an omnibus test using the ACAT-O 34 approach that employs a Cauchy distribution for inference (see details in Supplemental Methods).We refer to the p-value derived from ACAT-O test as the OTTERS p-value.The ACAT-O 34 approach has been widely used in hypothesis testing to combine multiple testing methods for the same hypothesis [37][38][39] , which has been shown as an effective approach to leverage different test methods to increase the power while still managing to control for type I error.Adding TWAS p-values based on additional PRS methods to the ACAT-O test can possibly improve the power further at the cost of additional computation.

Simulation Study
We used real genotype data from 1894 whole genome sequencing (WGS) samples from the Religious Orders Study and Rush Memory and Aging Project (ROS/MAP) cohort 40,41 and Mount Sinai Brain Bank (MSBB) study 42 for simulation.We divided 14,772 genes into five groups according to gene length, and randomly selected 100 genes from each group (500 genes in total).We randomly split samples into 568 training (30%) and 1326 testing samples (70%) to mimic a relatively small sample size in the real reference panel for training gene expression imputation models.. From the real genotype data, we simulated 6 scenarios with 2 different proportions of causal cis-eQTL,   = (0.001, 0.01), as well as 3 different proportions of gene expression variance explained by causal eQTL, ℎ  2 = (0.01, 0.05, 0.1).
We generated gene expression of gene  (  ) using the multivariable regression model To evaluate the type I error of the individual PRS methods along with OTTERS, we picked one simulated replicate per gene from the scenario with ℎ  2 = 0.1 and   = 0.001 , simulated 2 × 10 3 phenotypes from (0,1) , and permuted the eQTL weights for TWAS to perform a total of 10 6 null simulations.OTTERS was shown well calibrated in the tails of the distribution as shown by quantile-quantile (Q-Q) plots of TWAS p-values in Figure S1.We also observed that OTTERS had well-controlled type I error for stringent significance levels between 10 −4 and 2.5 × 10 −6 (Table S1), which are typically utilized in TWAS.For more modest significance thresholds ( = 10 −2 ) , we noted that OTTERS had a slightly inflated type-I error rate.
This modest inflation is consistent with the findings of the original ACAT-O work, which showed that the Cauchy-distribution-based approximation that ACAT-O employs may not be accurate for larger p-values when correlation among tests is strong 34 .This suggests that modest OTTERS pvalues may be interpreted with caution.
We also compared the performance of our individual PRS training methods to those of FUSION assuming individual-level reference data were available for the latter method to train GReX models.As shown in Figure 2A, we interestingly observed that our training methods yielded similar or improved test  2 compared to FUSION in this situation, with SDPR and PRS-CS outperforming FUSION across all simulation settings.Comparing TWAS power, we found that OTTERS outperformed FUSION by a considerable margin in our simulations (Figure 2B).These simulation results suggest that, while we developed OTTERS based on PRS training methods to handle summary-level reference data, OTTERS can still improve TWAS power when individuallevel reference data are available.This is likely because OTTERS accounts for multiple possible models of genetic architectures of gene expression assumed by the different PRS training methods.

GReX Imputation Accuracy in GTEx V8 Blood Samples
To evaluate the imputation accuracy of P+T (0.001), P+T (0.05), lassosum, SDPR, and PRS-CS methods in real data, we applied these training methods to summary-level eQTL reference data from the eQTLGen consortium 23 with n=31,684 blood samples, to train GReX imputation models for 16,699 genes.For test data, we downloaded the transcriptomic data of 315 blood tissue samples that are in GTEx V8 but were not part of GTEx V6 (as GTEx V6 samples contributed to the reference eQTLGen consortium summary data).For these 315 samples, we compared imputed GReX to observed expression levels.We considered trained imputation models with test  2 > 0.01 as "valid" models, as suggested by previous TWAS methods 20,43 .We also compared imputation accuracy of these five training models to those using FUSION based on a smaller individual-level training dataset (individual-level GTEx V6 reference dataset; see Methods).For such models, we compared the test  2 for genes that had test  2 > 0.01 by at least one training method.
By comparing test  2 per "valid" GReX imputation model by PRS-CS versus the other methods (Figure 3), we observed that PRS-CS had the best overall performance for imputing GReX as it provided the most "valid" models with higher GReX imputation accuracy compared to P+T methods, lassosum, SDPR, and FUSION.Comparing the test  2 among the other four training methods, we observed that these two P+T methods obtained similar test  2 per "valid" model.Meanwhile, the test  2 per valid model varied widely among the P+T methods, lassosum, and SDPR (Figure S3), suggesting that none of these four were optimal across all genes and their performance likely depended on the underlying unknown genetic architecture.These results are consistent with our simulation results.

TWAS of Cardiovascular Disease
Using the eQTL weights trained by P+T (0.001), P+T (0.05), lassosum, SDPR, and PRS-CS methods with the eQTLGen 23 reference data and reference LD from GTEx V8 WGS samples 44 , we applied our OTTERS framework to the summary-level GWAS data of Cardiovascular Disease from UKBB (n=459,324, case fraction = 0.319) 35  significance level) were identified as significant TWAS genes for cardiovascular risk.
In total, we identified 40 significant TWAS genes by using OTTERS.To identify independently significant TWAS genes, we calculated the  2 (squared correlation) between the GReX predicted by PRS-CS for of each pair of genes.For a pair of genes with the predicted GReX  2 > 0.5, we only kept the gene with the smaller TWAS p-value as the independently significant gene.OTTERS obtained 38 independently significant TWAS genes ( These genes were identified by FUSION when considering the GTEx V6 reference data of artery, thyroid, adipose visceral, and nerve tibial tissues.For example, the most significant gene FES (OTTERS p-value = 2.87 × 10 −32 ) was identified by FUSION using GTEx reference data of artery tibial, thyroid, and adipose visceral omentum tissues, and was also identified as a TWAS risk gene for high blood pressure, which is strongly related to cardiovascular disease 46 .
By comparing OTTERS results with the ones obtained by individual methods (Table 2; Figure 4; Figure S4), we found that all individual methods contributed to the OTTERS results.For example, the novel risk gene LINC01093 was only identified by lassosum, while genes CPEB4, SIDT2, and ACE were only detected by PRS-CS and SDPR and the novel risk gene EDN3 was only identified by the P+T methods.To better understand the differences among individual methods, we plotted the eQTL weights estimated by P+T (0.001), P+T (0.05), lassosum, SDPR, and PRS-CS for three example genes that were only detected by one or two individual methods (Figures S5-S7).For these genes, we plotted the eQTL weights produced by each method with such weights color coded with respect to − 10 (GWAS p-values) from the UKBB GWAS summary statistics and shape coded with respect to the direction of UKBB GWAS Z-score statistics.Generally, significant TWAS p-values would be obtained by methods that obtained eQTL weights with relatively large magnitude for SNPs with relatively more significant GWAS pvalues.
In Figure S5, we showed the eQTL weights for gene SIDT2, which was a significant risk lassosum was the only method that produced relatively large eQTL weights that co-localized with GWAS significant SNPs.
These results were consistent with our simulation study results, demonstrating that the performance of different individual methods depended on the underlying genetic architecture.We do note that there were a handful of genes identified by an individual method that were not significant using OTTERS (Table S2).Nonetheless, the omnibus test borrows strength across all individual methods, thus generally achieves higher TWAS power and identifies the group of most robust TWAS risk genes.
By examining the Q-Q plots of TWAS p-values, we observed a moderate inflation for all methods (Figure S8).Such inflation in TWAS results is not uncommon [48][49][50] , which could be due to similar inflation in the GWAS summary data and not distinguishing the pleiotropy and mediation effects for considered gene expression and phenotype of interest 51 (Figure S9).We also observed a notable inflation in the GWAS p-values of cardiovascular disease from UKBB (Figure S9), as we estimated the LD score regression 52 intercept to be 1.1 from the GWAS summary data.
We did not consider directly comparing to FUSION in our above TWAS analyses of cardiovascular disease since we used the summary-level reference data eQTLGen.However, to assess the performance of OTTERS and FUSION in a real study where individual-level reference data are available, we performed an additional TWAS analysis of cardiovascular disease in the UK Biobank using the GTEx V8 data of 574 whole blood samples as the reference data.We trained OTTERS Stage I using cis-eQTL summary statistics obtained from these 574 GTEx V8 whole blood samples and reference LD from GTEx V8 WGS samples, and trained FUSION models using individual-level genotype data and gene expression data of the same 574 whole blood samples.
We tested TWAS association for 19,653 genes and identified genes with TWAS p-values < 2.53 × 10 −6 (Bonferroni corrected significance level) as significant TWAS genes.Training  2 > 0.01 was used to select "valid" GReX imputation models for TWAS (Figure S10).To identify independently significant TWAS genes, we calculated the training  2 between the GReX predicted by lassosum for of each pair of genes, since lassosum had the best training  2 (Figure S10).For a pair of genes with the predicted GReX  2 > 0.5, we only kept the gene with the smaller TWAS p-value as the independently significant gene.As a result, OTTERS obtained 34 independently significant TWAS genes, while FUSION identified 21 independently significant TWAS genes (Figure S11).A total of 14 genes were identified by both FUSION and OTTERS (Table S3).
These results demonstrate the advantages of OTTERS for using multiple PRS training methods to account for the unknown genetic architecture of gene expression, which is consistent in our simulation results.These results also showed the advantage of using eQTL summary data with a larger training sample size, as more independently significant TWAS genes were identified by using the eQTLGen summary reference data (38 vs. 34), even with a more stringent rule (test instead of training  2 > 0.01) applied to select test genes with "valid" GReX imputation models.

Computational Time
The computational time per gene of different PRS methods depends on the number of test variants considered for the target gene.Thus, we calculated the computational time and memory usage for 4 groups of genes whose test variants were <2000, between 2000 and 3000, between 3000 and 4000, and >4000, respectively.Among all tested genes in our real studies, the median number of test variants per gene is 3152, and the proportion of genes in each group is 10.3%, 33.4%, 34.5%, and 21.8%, respectively.For each group, we randomly selected 10 genes on Chromosome 4 to evaluate the average computational time and memory usage per gene.We benchmarked the computational time and memory usage of each method on one Intel(R) Xeon(R) processor (2.10 GHz).The evaluation was based on 1000 MCMC iterations for SDPR and PRS-CS (default) without parallel computation (Table S4).We showed that P+T and lassosum were computationally more efficient than SDPR and PRS-CS, whose speed were impeded by the need of MCMC iterations.Between the two Bayesian methods, SDPR implemented in C++ uses significantly less time and memory than PRS-CS implemented in Python.

Discussion
Our OTTERS framework represents an omnibus TWAS tool that can leverage summarylevel expression and genotype results from a reference sample, thereby robustly expanding the use of TWAS into more settings.To this end, we adapted and evaluated five different PRS methods assuming different underlying genetic models, including the relatively simple method P+T 26 with two different p-value thresholds (0.001 and 0.05), the frequentist method lassosum 27 , as well as the Bayesian methods PRS-CS 28 and SDPR 29 within our omnibus test for optimal inference.We note that additional PRS methods such as MegaPRS 30 or PUMAS 31 could also be implemented as additional OTTERS Stage I training methods.Higher TWAS power might be obtained by adding more PRS methods in OTTERS Stage I, with additional computation cost.We also note that the existing SMR-HEIDI 53 method, which uses summary-level data from GWAS and eQTL studies to test for possible causal genetic effects of a trait of interest that were mediated through gene expression, could also be used as an alternative method besides TWAS.However, the SMR method generally restricts eQTL for consideration, excluding those where the eQTL p-values larger than a certain threshold, e.g., 0.05.
In simulation studies, we demonstrated that the performance of each of these five PRS methods depended substantially on the underlying genetic architecture for gene expression, with P+T methods generally performing better for sparse architecture whereas the Bayesian methods performing better for denser architecture.Consequently, since genetic architecture of gene expression is unknown apriori, we believe this justifies the use of the omnibus TWAS test implemented in OTTERS for practical use as this test had near-optimal performance across all simulation scenarios considered.While we developed our methods with summary-level reference data in mind, we note that our prediction methods and OTTERS perform well (in terms of imputation accuracy and power) relative to existing TWAS methods like FUSION when individuallevel reference data are available.
In our real data application using UKBB GWAS summary-level data, we compared OTTERS TWAS results using reference eQTL summary data from eQTLGen consortium to FUSION TWAS results using a substantially smaller individual-level reference dataset from GTEx V6.OTTERS identified 13 significant TWAS risk genes that were missed by FUSION using individual-level GTEx V6 reference data of blood tissue, suggesting that the use of larger reference datasets like eQTLGen in TWAS can identify novel findings.Interestingly, the genes missed by FUSION were instead detected using individual-level GTEx reference data of other tissue types that are more directly related to cardiovascular disease.By comparing OTTERS to FUSION when the same individual-level GTEx V8 reference data of whole blood samples were used, we still observed that OTTERS identified more risk genes than FUSION, which we believe is due to the former method accounting for the unknown genetic architecture of gene expression by using multiple regression methods to train GReX imputation models.These applied results were consistent with our simulation results.
Among all individual methods, P+T is the most computationally efficient method.The Bayesian methods SDPR and PRS-CS require more computation time than the frequentist method lassosum as the former set of methods require a large number of MCMC iterations for model fit.By comparing the performance of these five methods in terms of the imputation accuracy and TWAS power in simulations and real applications, we conclude that none of these methods were optimal across different genetic architectures.We found that all methods provided distinct and considerable contributions to the final OTTERS TWAS results.These results demonstrate the benefits of OTTERS in practice, since OTTERS can combine the strength of these individual methods to achieve the optimal performance.SDPR training methods as these three provide complementary results in our studies.Second, the currently available eQTL summary statistics are mainly derived from individuals of European descent.Our OTTERS trained GReX imputations model based on these eQTL summary statistics, and the resulting imputed GReX could consequently have attenuated cross-population predictive performance 56 .This might limit the transferability of our TWAS results across populations.Third, our OTTERS cannot provide the direction of the identified gene-phenotype associations, which should be referred to the sign of the TWAS Z-score statistic per training method.Last, even though the method applies to integrate both cis-and trans-eQTL with GWAS data, the computation time and availability of summary-level trans-eQTL reference data are still the main obstacles.Our current OTTERS tool only considers cis-eQTL effects.Extension of OTTERS to enable cross-population TWAS and incorporation of trans-eQTL effects is part of our ongoing research but out of the scope of this work.
Our novel OTTERS framework using large-scale eQTL summary data has the potential to identify more significant TWAS risk genes than standard TWAS tools that use smaller individuallevel reference transcriptomic data and use only a single regression method for training GReX imputation models.This tool provides the opportunity to leverage not only available public eQTL summary data of various tissues for conducting TWAS of complex traits and diseases, but also the emerging summary-level data of other types of molecular QTL such as splicing QTLs, methylation QTLs, metabolomics QTLs, and protein QTLs.For example, OTTERS could be applied to perform proteome-wide association studies using summary-level reference data of genetic-protein relationships such as those reported by the SCALLOP consortium 57 , and epigenome-wide association studies using summary-level reference data of methylationphenotype relationships reported by Genetics of DNA Methylation Consortium (GoDMC) (see Web Resources).OTTERS would be most useful for the broad researchers who only have access to summary-level QTL reference data and summary-level GWAS data.The feasibility of integrating summary-level molecular QTL data and GWAS data makes our OTTERS tool valuable for wide application in current multi-omics studies of complex traits and diseases.

Traditional Two-Stage TWAS Analysis
Stage I of TWAS estimates a GReX imputation model using individual-level expression and genotype data available from a reference dataset.Consider the following GReX imputation model from  individuals and  SNPs (multivariable regression model assuming linear additive genetic effects) within the reference dataset: =  + ,  ∼ (0,   2 ).
(Equation 1) Here,   is a vector representing gene expression levels of gene ,  is an  ×  matrix of genotype data of SNP predictors proximal or within gene ,  is a vector of genetic effect sizes (referred to as a broad sense of eQTL effect sizes), and  is the error term.Here, we consider only cis-SNPs within 1 MB of the flanking 5' and 3' ends as genotype predictors that are coded within  19,20,22 .Once we configure the model in Equation 1, we can employ methods like PrediXcan, FUSION, and TIGAR to fit the model and obtain estimates of eQTL effect sizes ( ̂).
Stage II of TWAS uses the eQTL effect sizes ( ̂) from Stage I to impute gene expression (GReX) in an independent GWAS and then test for association between GReX and phenotype.
Given individual-level GWAS data with genotype data   and eQTL effect sizes ( ̂) from Stage I, the GReX for   can be imputed by  ̂=    ̂. √ ̂′ ̂ (Equation 2) where   is the single variant Z-score test statistic in GWAS for the j th SNP,  = 1, … , , for all test SNPs that have both eQTL weights with respect to the test gene  and GWAS Z-scores;  ̂ is the genotype standard deviation of the j th SNP; and  denotes the genotype correlation matrix in FUSION Z-score statistic and genotype covariance matrix in S-PrediXcan Z-score statistic of the test SNPs.In particular,  ̂ and  can be approximated from a reference panel with genotype data of samples of the same ancestry such as those available from the 1000 Genomes Project 58 .If  ̂ are standardized effect sizes estimated assuming standardized genotype  and gene expression   in Equation 1, FUSION and S-PrediXcan Z-score statistics are equivalent 13 .
Otherwise, the S-PrediXcan Z-score should be applied to avoid false positive inflation.

TWAS Stage I Analysis using Summary-Level Reference Data
We now consider a variation of TWAS Stage I to estimate cis-eQTL effect sizes  ̂ based on a multivariable regression model (Equation 1) from summary-level reference data.We assume that the summary-level reference data provide information on the association between a single genetic variant j ( = 1, … , ) and expression of gene g.This information generally consists of effect size estimates ( ̃,  = 1, … , ) and p-values derived from the following single variant regression models: Here,   is an  × 1 vector of genotype data for genetic variant .Since eQTL summary data are analogous to GWAS summary data where gene expression represents the phenotype, we can estimate the eQTL effect sizes  ̂ using marginal least squared effect estimates ( ̃,  = 1, … , ) and p-values from the QTL summary data as well as reference linkage disequilibrium (LD) information of the same ancestry [26][27][28][29] .Although all PRS methods apply to the TWAS Stage I framework, we only consider four representative methods as follows: P+T: The P+T method selects eQTL weights by LD-clumping and P-value Thresholding 26 .
Given threshold   for p-values and threshold   for LD  2 , we first exclude SNPs with marginal p-values from eQTL summary data greater than   or strongly correlated (LD  2 greater than   ) with another SNP having a more significant marginal p-value (or Z-score statistic value).For the remaining selected test SNPs, we use marginal standardized eQTL effect sizes from eQTL summary data as eQTL weights for TWAS in Stage II.We considered   = 0.99 and   = (0.001, 0.05) in this paper and implemented the P+T method using PLINK 1.9 55 (see Web Resources).We denote the P+T method with   equal to 0.001 and 0.05 as P+T (0.001) and P+T (0.05), respectively.
Frequentist lassosum: With standardized   and X, we can show that the marginal least squared eQTL effect size estimates from the single variant regression model (Equation 3) is  ̃=     / and that the LD correlation matrix is  =   /.That is,     =  ̃ and    = .
(Equation 4) By approximating  by   (  = (1 − )  +  with a tuning parameter 0 <  < 1, a reference LD correlation matrix   from an external panel such as one from the 1000 Genomes Project 58 , and an identity matrix ) in the LASSO 32 penalized loss function, the frequentist lassosum method 27 can tune the LASSO penalty parameter and  using a pseudovalidation approach and then solve for eQTL effect size estimates  ̂ by minimizing the approximated LASSO loss function requiring no individual-level data (see details in Supplemental Methods).

OTTERS Framework
As shown in Figure 1, OTTERS first trains GReX imputation models per gene g using P+T, lassosum, SDPR, and PRS-CS methods that each infers cis-eQTLs weights using cis-eQTL summary data and an external LD reference panel of similar ancestry (Stage I).Once we derive cis-eQTLs weights for each training method, we can impute the respective GReX using that method and perform the respective gene-based association analysis in the test GWAS dataset using the formulas given in Equation 2 (Stage II).We thus derive a set of TWAS p-values for gene g; one p-value for each training model that we applied.We then use these TWAS p-values to create an omnibus test using the ACAT-O 34 approach that employs a Cauchy distribution for inference (see details in Supplemental Methods).We refer to the p-value derived from ACAT-O test as the OTTERS p-value.

Marginal eQTL Effect Sizes
In practice of training GReX imputation models using reference eQTL summary data, the marginal standardized eQTL effect sizes were approximated by  ̃ ≈   /√( , ), where   denotes the corresponding eQTL Z-score statistic value by single variant test and ( , ) denotes the median sample size of all cis-eQTLs for the target gene .

LD Clumping
We performed LD-clumping with   =0.99 for all individual methods in both simulation and real studies.Using PRS-CS as an example, we also showed that LD-clumping does not affect the GReX imputation accuracy compared to no clumping in the real data testing (Figure S12).

LD Blocks for lassosum, PRS-CS, and SDPR
LD blocks were determined externally by ldetect 60       Manhattan plot of TWAS results by OTTERS using GWAS summary-level statistics of cardiovascular disease and imputation models fitted based on eQTLGen summary statistics.
Independently significant TWAS risk genes are labeled.

Figure Titles and LegendsFigure 1 .
Figure Titles and Legends

Figure 2 .
Figure 2. Test   (A) and TWAS power (B) comparison in simulation studies

Figure 4 .
Figure 4. Manhattan plot of TWAS results by OTTERS.

Table 2
We compared our OTTERS results with the TWAS results shown on TWAS hub (see Web Resource) obtained by FUSION using the same UKBB GWAS summary data of cardiovascular disease but using a smaller individual-level reference expression dataset from , Figure3B), compared to 17 independently significant genes by P+T (0.001), 11 by P+T (0.05), 10 by lassosum, 41 by SDPR, and 12 by PRS-CS.Among these 38 independent TWAS risk genes identified by OTTERS, gene RP11-378A13.1 (OTTERS p-value = 9.78 × 10 −9 ) was not within 1 MB of any known GWAS risk loci with genomic-control corrected p-value < 5 × 10 −8 in the UKBB summary-level GWAS data.This novel risk gene RP11-378A13.1 was also identified to be a significant TWAS risk gene in blood tissue for systolic blood pressure, high cholesterol, and cardiovascular disease by FUSION 1 .HAUS8, RPL28, CTSZ), when considering all available tissue types in GTEx V6 reference data.
For this gene, SDPR and PRS-CS estimated near-zero weights for most test SNPs with significant GWAS p-values in the test region.Most significant GWAS SNPs did not have eQTL test p-values < 0.001 or 0.05, and were thus filtered out by P+T methods.
gene identified by both PRS-CS and SDPR, and had p-values < 10 −4 by other methods.Compared to lassosum, SDPR had more significant GWAS SNPs colocalized with eQTLs having relatively large weights in the test region, and PRS-CS had more non-significant GWAS SNPs colocalized with eQTLs having zero weights.Compared to the P+T methods, SDPR and PRS-CS based on a multivariate regression model modeled LD among all test SNPs, and thus estimated eQTL weights leading to significant TWAS findings.In FigureS6, we provided the results of gene EDN3, which was only identified by P+T methods (p-values ≤ 9.15 × 10 −8 ).Compared to P+T methods, SDPR (p-value = 5.9 × 10 −3 ) and PRS-CS (p-value = 0.0158) had fewer significant GWAS SNPs colocalized with eQTLs that had relatively large weights in the test region, while lassosum (p-value = 8.6 × 10 −6 ) assigned relatively large weights to more non-significant GWAS SNPs.In FigureS7, we provided results for gene LINC01093, which was only identified by lassosum.
55 enable the use of OTTERS by the public, we provide an integrated tool (see Availability of data and materials) to: (1) Train GReX imputation models (i.e., estimate eQTL weights in Stage I) using eQTL summary data by P+T, lassosum, SDPR, and PRS-CS; (2) Conduct TWAS (i.e., testing gene-trait association in Stage II) using both individual-level and summary-level GWAS data with the estimated eQTL weights; and (3) Apply ACAT-O to aggregate the TWAS p-values from individual training methods.Since the existing tools for P+T, lassosum, SDPR, and PRS-CS were originally developed for PRS calculations, we adapted and optimized them for training GReX imputation models in our OTTERS tool.For example, we integrate TABIX54and PLINK55tools in OTTERS to extract input data per target gene more efficiently.We also enable parallel computation in OTTERS for training GReX imputation models and testing gene-trait association of multiple genes.
The OTTERS framework does have its limitations.First, training GReX imputation models by all individual methods on average cost ~20 minutes for all 5 training models per gene, which might be computationally challenging for studying eQTL summary data of multiple tissue types and for ~20K genome-wide genes.Users might consider prioritize P+T(0.001),lassosum, and 28timates  ̂ for the underlying multivariable regression model in Equation 1 by assuming a normal prior (0,   2 ) for   and a Dirichlet process prior 59 (, ) for   2 with base distribution  and concentration parameter .SDPR29assumes the same DPR model but can be applied to estimate the eQTL effect sizes  ̂ using only eQTL summary data (see details in Supplemental Methods).The PRS-CS method28assumes the following normal prior for   and non-informative scale-invariant Jeffreys prior on the residual variance   2 in Equation1where local shrinkage parameter   has an independent gamma-gamma prior and  is a globalshrinkage parameter controlling the overall sparsity of .PRS-CS sets hyper parameters  = 1 and  = 1/2 to ensure the prior density of   to have a sharp peak around zero to shrink small effect sizes of potentially false eQTL towards zero, as well as heavy, Cauchy-like tails which asserts little influence on eQTLs with larger effects.Posterior estimates  ̂ will be obtained from eQTL summary data (i.e., marginal effect size estimates  ̃ and p-values) and reference LD correlation matrix  by Gibbs Sampler (see details in Supplemental Methods).We set  as the square of the proportion of causal variants in the simulation and as 10 −4 per gene in the real data application.
The median cis-eQTL sample size per gene was also taken as the sample size value required by lassosum, SDPR, and PRS-CS methods, for robust performance.Since summary eQTL datasets (e.g., eQTLGen) were generally obtained by meta-analysis of multiple cohorts, the sample size per test SNP could vary across all cis-eQTLs of the test gene.The median cis-eQTL sample size ensures a robust performance for applying those eQTL summary data based methods.

Table 1 . Test 𝑹 𝟐 in 315 whole blood tissue samples from GTEx V8.
for lassosum and PRS-CS, while internally for SDPR which ensure that SNPs in one LD block do not have nonignorable correlation ( 2 > 0.1) with SNPs in other blocks.Given gene expression   simulated from the multivariate regression model   =    +   with standardized genotype matrix   and   ∼ (0, (1 − ℎ  2 ), we assume GWAS phenotype data of   samples are simulated from the following linear regression model  = ℎ  (  ) +   ,   ∼ (0, ). on true genetic effect sizes, the GWAS Z-score test statistics of all test SNPs will follow a multivariate normal distribution,  (  √  ℎ  2 ,   ), where   is the correlation matrix of the standardized genotype   from test samples, and ℎ  2 denotes the amount of phenotypic variance explained by simulated GReX=   38 .Thus, for a given GWAS sample size, we can generate GWAS Z-score statistic values from this multivariate normal distribution.size > 3000.As a result, we used cis-eQTL summary data of 16,699 genes from eQTLGen to train GReX imputation models for use in OTTERS in this study..319) 35were generated by BOLT-LMM based on the Bayesian linear mixed model per SNP 63 with assessment centered, sex, age, and squared age as covariates.Although BOLT-LMM was derived based on a quantitative trait model, it can be applied to analyze case-control traits and has well-controlled false positive rate when the trait is sufficiently balanced with case fraction ≥ 10% and samples are of the same ancestry .The tested dichotomous cardiovascular disease phenotype includes a list of sub-phenotypes: hypertension, heart/cardiac problem, peripheral vascular disease, venous thromboembolic disease, stroke, transient ischaemic attack (tia),

Table 2 . Independent TWAS risk genes of cardiovascular disease identified by OTTERS. 690
Reference eQTL summary data from eQTLGen consortium and GWAS summary data from 691 UKBB were used.The corresponding TWAS p-values by 5 individual PRS methods and 692 OTTERS are shown in the table with significant p-values in bold, and those for genes with test 693GReX  2 ≤ 0.01 were shown as a dash.Risk gene of UKBB cardiovascular disease in TWAS-hub identified using GTEx whole blood tissue.
b: Risk genes of UKBB cardiovascular disease in TWAS-hub identified using other GTEx tissue types.c: Novel risk gene 50.Center for Molecular and Biomolecular Informatics, Radboud Institute for Molecular Life Sciences, Radboud University Medical Center Nijmegen, Nijmegen, The Netherlands 51.Institute for Community Medicine, University Medicine Greifswald, Greifswald, Germany 52.Institute for Laboratory Medicine, LIFE -Leipzig Research Center for Civilization Diseases, Universität Leipzig, Leipzig, Germany 53.Department of Internal Medicine, Erasmus Medical Centre, Rotterdam, The Netherlands 54.UMC Utrecht Brain Center, University Medical Center Utrecht, Department of Neurology, Utrecht University, Utrecht, The Netherlands 55.Interfaculty Institute for Genetics and Functional Genomics, University Medicine Greifswald, Greifswald, Germany 56.School of Life Sciences, College of Liberal Arts and Science, University of Westminster, 115 New Cavendish Street, London, United Kingdom 57.Division of Medical Sciences, Department of Health Sciences, Luleå University of Technology, Luleå, Sweden 58.Institute for Advanced Research, Wenzhou Medical University, Wenzhou, Zhejiang 325027, China