MATLAB API¶
This page provides documentation for the ldgm MATLAB API to perform efficent matrix algebra with a LDGM precision matrix, as well as code to create an LDGM precision matrix from an LDGM.
Efficient Matrix Algebra Operations for StatGen Applications¶
- MATLAB.BLUPxldcov(R, whichIndices, mergedSumstats, betaCov, sampleSize, alpha_param)¶
BLUPx computes the cross-popn best linear unbiased predictor, E(beta|GWAS, gaussian prior).
Input arguments: R: LD correlation matrices, as a number-of-LD-blocks by number-of-popns cell array with each cell containing a correlation matrix; should be the same size across populations. For SNPs that are missing from the summary statistics in some of the populations, R can be padded with zeros in the appropriate entries
whichIndices: output from mergesnplists, encoding which indices (rows/columns of the correlation matrices) have corresponding SNPs in each population. Should be a number-of-blocks by number-of-popns cell array.
mergedSumstats: merged summary statistics tables for each LD block - population pair, output from mergesnplists. Each table should have height equal to the length of corresponding cell of whichIndices. Tables are expected to report either a Z score (column name Z) or an effect size and standard error (beta, se). Optionally, if they report an allele frequency (AF) and alpha_param is specified, an AF-dependent prior will be used.
sampleSize: GWAS sample size for each population, as a vector.
betaCov: covariance matrix for the per-allele effect size of a SNP across popns, which is assumed to be i.i.d.
Output arguments: betaExpectationPerSD: expected value of beta for each popn, in per-s.d. (standardized) units, as a cell array of the same size as input arguments. Values are reported for every index in the precision matrices, but when some of these indices are missing summary statistics or precision matrix entries, beta will be zero in those positions. betaExpectationPerAllele: same as betaExpectationPerSD, but in per-allele instead of per-s.d. units. This can only be reported if AF is specified in the sumstats tables.
- MATLAB.BLUPxldgm(P, whichIndices, mergedSumstats, betaCov, sampleSize, alpha_param)¶
BLUPx computes the best linear unbiased predictor, E(beta|GWAS), under a Gaussian model. It inputs data from one or more ancestry groups.
Input arguments: P: LDGM precision matrices, as a number-of-LD-blocks by number-of-popns cell array with each cell containing a precision matrix; should be the same size across populations. For SNPs that are missing from the summary statistics in some of the populations, P can be padded with zeros in the appropriate entries
whichIndices: output from mergesnplists, encoding which indices (rows/columns of the LDGM precision matrices) have corresponding SNPs in each population. Should be a number-of-blocks by number-of-popns cell array. If no SNPs are missinig (e.g., in simulations), specify cells as whichIndices{jj} = true(length(P{jj}),1);
mergedSumstats: merged summary statistics tables for each LD block - population pair, output from mergesnplists. Each table should have height equal to the length of corresponding cell of whichIndices. Tables are expected to report a Z score with column name Z_deriv_allele. Optionally, if they report an allele frequency with column name AF, then an AF-dependent prior can be used
betaCov: covariance matrix for the per-allele effect size of a SNP across popns, which is assumed to be i.i.d.
sampleSize (optional): GWAS sample size for each population, as a vector. If not specified, BLUPxldgm will look for a column of the summary statistics file called ‘N’ and use that if it is found.
alpha_param (optional): frequency-dependent architecture parameter; models per-allele effect size variance of each SNP as var(beta) == betaCov * heterozygosity ^ alpha. Default: 0.
Output arguments: betaExpectationPerSD: expected value of beta for each popn, in per-s.d. (standardized) units, as a cell array of the same size as input arguments. Values are reported for every index in the precision matrices, but when some of these indices are missing summary statistics or precision matrix entries, beta will be zero in those positions.
betaExpectationPerAllele: same as betaExpectationPerSD, but in per-allele instead of per-s.d. units. This can only be reported if AF is specified in the sumstats tables.
- MATLAB.GWASlikelihood(Z_deriv_allele, sigmasq, P, nn, whichIndices)¶
GWASlikelihood computes likelihood of the GWAS sumstats, Z_deriv_allele, under a gaussian model:
beta ~ MVN(mu,diag(sigmasq)) Z|beta ~ MVN(sqrt(nn)*R*beta, R) inv(R) = P.
The GWAS SNPs in Z_deriv_allele should be a subset of those in P, and in the same order; whichIndices should be true for rows/columns of P corresponding to one of these SNPs, false elsewhere (or specify indices instead of logicals)
sigmasq should be the same size as Z_deriv_allele, and missing SNPs are modeled as having zero effect size.
Optionally, arguments Z_deriv_allele, sigmasq, P, whichSNPs can be specified as cell arrays with one cell for each LD block.
- If it is desired to specify the mean of beta, ie
beta ~ MVN(mu, diag(sigmasq))
call GWASlikelihood( Z - P mu * sqrt(nn), sigmasq, P, …)
- MATLAB.GWASlikelihoodGradient(alphaHat, sigmasq, P, nn, delSigmaDelA, whichSNPs, fixedIntercept)¶
GWASlikelihoodGradient computes gradient of the likelihood of the GWAS sumstats alphaHat under a gaussian model:
beta ~ MVN(mu,diag(sigmasq)) alphaHat|beta ~ MVN(R*beta, R/nn) inv(R) = P.
This function requires the sparseinv() function from suitesparse: https://people.engr.tamu.edu/davis/suitesparse.html
The GWAS SNPs in alphahat should be a subset of those in P, and in the same order; boolean vector whichSNPs should be true for rows/columns of P corresponding to one of these SNPs, false elsewhere.
sigmasq should be the same size as alphahat, such that missing SNPs are modeled as having zero effect-size variance.
Argument delSigmaDelA is the gradient of sigmasq w.r.t. some parameters A. GWASlikelihoodGradient computes the gradient of the likelihood w.r.t. A, i.e.: grad_A loglikelihood = delSigmaDelA * grad_sigmasq loglikelihood
Units of alphaHat should be either (1) normalized effect size estimates, ie sample correlations, or (2) Z scores. In case (1), nn should be the number of individuals in the GWAS. In case (2), nn should be 1, and the vector of sigmasq values should be multiplied by nn. Optionally, arguments alphaHat, sigmasq, P, whichSNPs can be specified as cell arrays with one cell for each LD block.
Optionally, arguments alphaHat, sigmasq, P, whichSNPs can be specified as cell arrays with one cell for each LD block.
- Optionally, if it is desired to specify the mean of beta, ie
beta ~ MVN(mu, diag(sigmasq))
call GWASlikelihoodGradient( alphahat - P mu, …)
Optionally, if it is desired to model an unknown amount of population stratification/relatedness, or if the sample size is unknown, specify fixedIntercept = true. This modifies the model as: alphaHat|beta ~ MVN(R*beta, R * (1/nn + a)) where a==0, and it computes the derivative of the log-likelihood with respect to a; this will be the last element of grad.
- MATLAB.README¶
This directory contains MATLAB code to infer LDGM precision matrices and to analyze GWAS data using LDGMs.
New users should start in the tutorials directory, where there are tutorials on how to analyze real and simulated GWAS summary statistics using LDGM precision matrices.
This directory contains the following functions:
BLUPxldgm: computes polygenic scores from GWAS summary statistics and LDGM precision matrices, for one or more populations
GWASlikelihood: computes log-likelihood of the GWAS summary statistics under a Gaussian model
GWASlikelihoodGradient: this function requires the sparseinv function, which can be obtained by installing suitesparse: https://people.engr.tamu.edu/davis/suitesparse.html It calculates the gradient of the log-likelihood w.r.t. the variance parameters of the Gaussian prior distribution.
loadLDGMs: loads LDGM precision matrices and SNP lists from a specified directory
logDeterminant: calculates the log-determinant of a matrix M=R+D, where R is the inverse of the LDGM precision matrix and D is a diagonal matrix.
mergesnplists: merges LDGM precision matrix SNPs with those from a table of GWAS summary statistics. It also takes care of allele phasing. We highly recommend using this function, as documented in the tutorial
precisionDivide: divides a vector y by the precision matrix, accounting for missing SNPs (i.e., multiplies y by the inverse of the precision matrix restricted to nonmissing SNPs)
precisionMultiply: multiplies a vector y by the precision matrix accounting for missing SNPs
r2PGS: calculates the PGS r2 of a polygenic predictor in a population with the specified LD precision matrix and allele frequencies
simulateSumstats: simulates GWAS summary statistics with the specified LD patterns and effect size distribution, for one or more populations
The utility subdirectory contains functions that users will probably not use directly, but which are called from the functions above.
- MATLAB.loadLDGMs(filepath, popn_names)¶
loadLDGMs reads LDGMs or LDGM precision matrices from specified file or directory, together with corresponding SNP lists
It expects to find .edgelist files named [filepath, ‘*’, popn_name{j}, ‘.edgelist’] and .snplist files named [filepath, ‘*’, ‘.snplist’]. If you are loading data for multiple populations, specify the population names as a cell array of strings, e.g. {‘AFR’,’EUR’}; otherwise, specify it just as a string.
- MATLAB.logDeterminant(P, whichIndices, d)¶
logDeterminant computes the log-determinant of the matrix M = R(whichIndices,whichIndices) + D, where R == inv(P) is the correlation matrix, D == diag(d) is a diagonal matrix, and whichIndices is a list of indices.
Input arguments: P: LDGM precision matrices, as a cell array with each cell containing a precision matrix or as a single precision matrix
whichIndices (optional): output from mergesnplists, encoding which indices (rows/columns of the LDGM precision matrices) have corresponding SNPs in each summary statistic file respectively. If P is a cell array, this should be a cell array of the same size; if P is a single matrix, this should be a vector of logicals or indices
d (optional): vectors to be converted into a diagonal matrix and added to P. Can also specify scalars. If P is a cell array, this should be a cell array of the same size; if P is a single matrix, this should be a scalar or a vector of size equal to the number of nonmissing SNPs
Output arguments: logDetM: log-determinant of M for each LD block
A: cholesky factor of L, which has submatrix M; has size equal to number of nonzero elements on the diagonal of P
- MATLAB.mergesnplists(snplists, sumstats, P)¶
mergesnplists merges (1) a cell array of .snplist tables, one for each LD block, with (2) a table of summary statistics containing RSIDs.
Input arguments: snplists: cell array where each cell contains a snplist table.
sumstats: summary statistics table with mandatory column name (non-case-sensitive) SNP or RSID, to be used for merging.
Optionally, sumstats can also contain columns named A1/A2 or anc_alleles/deriv_alleles. If these columns are specified, they will be merged with the anc_alleles/deriv_alleles columns of snplists.
- P (optional): cell array of LDGM precision matrices. Any indices whose
corresponding rows/columns in P are empty will be ingored. Alternatively, specify a logical row vector.
Ouput arguments: whichIndices: cell array of indices, a subset of the ‘index’ column of each snplist. Indexes which row/column of LDGMs have a corresponding SNP in the sumstats table.
mergedSumstats: cell array of sumstats tables, each containing a subset of the rows of the original sumstats table, that match the indices of each LD block. Note: when multiple SNPs correspond to the same row/column of the LDGM, the first-listed one will arbitrarily be chosen as a representative.
If alleles are specified in the sumstats table, the mergedSumstats tables will have an extra column appended called ‘phase’, which indicates whether the alleles match (+1) or anti-match (-1). If they mismatch (i.e., flipping their labels doesn’t make them match), then the SNP will be discarded.
Note: sometimes it makes this function run much faster if you go chromosome-by-chromosome; this is not done automatically.
- MATLAB.precisionDivide(P, y, whichIndices)¶
precisionDivide solves (P/P00)*x = y where P = [P11, P10; P01 P00], P11 = P(whichIndices,whichIndices), and P/P00 is the Schur complement Input arguments: P: precision matrix or cell array of precision matrices y: vector or cell array of vectors with same dimension as P whichIndices: indices or cell array of indices, or boolean vector/cell array of boolean vectors, indicating which entries of P correspond to the entries of y Output arguments: x == (P/P00) y
- MATLAB.precisionMultiply(P, y, whichIndices)¶
precisionMultiply computes x = (P/P00)y where P = [P11, P10; P01 P00], P11 = P(whichIndices,whichIndices), and P/P00 is the Schur complement
Input arguments: P: precision matrix or cell array of precision matrices y: vector or cell array of vectors with same dimension as P whichIndices: indices or cell array of indices, or boolean vector/cell array of boolean vectors, indicating which entries of P correspond to the entries of y Output arguments: x == (P/P00)*y
- MATLAB.r2PGS(true_beta_perallele, estimated_beta_perallele, P, whichIndices, AF)¶
r2PGS calculates the PGS accuracy, i.e. the squared correlation between X*true_beta_perSD and X*estimated_beta_perSD, using the precision matrix.
Input arguments: true_beta_perSD: ‘true’ causal effect sizes in per-allele units, as a cell array of size number-of-blocks by number-of-populations
estimated_beta_perallele: estimated causal effect sizes, as a cell array of the same size, or as a number-of-blocks by 1 cell array (in which case the same effect size estimates are used for every population)
P: LDGM precision matrices, as a cell array with each cell containing a precision matrix, for the target population (as opposed to the training population)
whichIndices: output from mergesnplists, encoding which indices (rows/columns of the LDGM precision matrices) have corresponding SNPs in each summary statistic file respectively. Same size as P.
AF: allele frequencies, as a number-of-blocks by number-of-populations cell array
Output arguments: r2: the squared correlation between the predictions (this is usually larger than the squared correlation between the true and predicted weights themselves)
- MATLAB.simulateSumstats(sampleSize, alleleFrequency, varargin)¶
Simulates summary statistics from specified prior distribution for one or more populations.
Required input arguments: sampleSize: sample size for each population as a vector
alleleFrequency: allele frequency for each LD block and each population as a number-of-LD blocks by number-of-populations cell array
Optional input arguments as name-value pairs:
precisionMatrices: precision matrix for each LD block and each population as a number-of-LD blocks by number-of-populations cell array. Specify either this or correlationMatrices.
correlationMatrices: correlation matrix for each LD block and each population as a number-of-LD blocks by number-of-populations cell array. Specify either this or precisionMatrices
heritability: total heritability for each population, either as a scalar, a vector, or a square matrix. If a matrix, it specifies both the heritability for each population (along its diagonal) and the genetic correlations (off the diagonal, i.e. with corrcof(heritability) == r_pop). If a vector, the cross-population genetic correlations will be (almost) 1. If a scalar, the cross-population genetic correlations will be (almost) 1, and the heritability will be the same for each population.
componentVariance: per-allele effect-size covariance matrix for each mixture component, as a number-of-populations by number-of-populations by number-of-components array. These will be rescaled to match the total heritability.
componentWeight: mixture weight for each heritability component
missingness: fraction of SNPs that are missing, in addition to those for which precision matrix already has a zero diagonal element
Output arguments: sumstats: cell array of tables, one per LD block, with the following columns:
Z_deriv_allele: Z scores AF: allele frequencies N: sample size
whichIndices: number-of-blocks by number-of-populations cell array of indices for which sumstats are reported. If missingness input is zero (default), this is simply the indices for which the precision matrix is nonmissing.
componentVariance: per-allele effect-size covariance matrix across populations for each variance component. This is normalized to the desired heritability for each population.
true_beta_perallele: true per-allele effect size of each variant, including those that are missing in the summary statistics
true_beta_perSD: same as true_beta_perallele, but in per-SD units
Estimating an LDGM Precision Matrix¶
- MATLAB.precision.estimatePrecision(data_directory, varargin)¶
Estimates LDGM precision matrix from an LDGM and genotype data using modified DP-GLASSO algorithm
Input data (located at data_directory) can be one of three types, indicated by ‘data_type’ input:
-genotypes: an individials-by-SNPs binary matrix with filename extension .genos, and columns that correspond to the rows of .snplist file (see below). If this is specified, there several additional options (see below) -correlation: a SNPs-by-SNPs symmetric matrix with filename extension .txt, encoding the correlation between each pair of SNPs. -edgelist: an edges-by-3 .edgelist file encoding the correlation between each pair of SNPs with an edge in the LDGM. If any values are missing, the corresponding edge of the LDGM is dropped.
Differences between these options are as follows. If genotypes are specified, the .genos file must match the SNPs in the .snplist file. There is no option to specify a second .snplist file and merge the two sets of SNPs. On the other hand, there are several additional options (see below). If correlation or edgelist is specified, it is required to also specify LD_matrix_snplist_dir, whose rows correspond to the SNPs in the LD matrix (or edgelist). This will automatically be merged with the LDGM .snplist file, and alleles will be matched. The resulting precision matrix will have one row/column for each labeled brick (i.e., nonredundant SNP) in the LDGM such that at least one of its SNPs is present in the LD matrix. Optionally, specify ‘AF_column_name’ together with ‘minimum_AF’. With this option, the allele frequencies will be looked up in the LD matrix .snplist file (rather than the LDGM .snplist file), under the column with the corresponding name.
- If genotypes are specified, there are various additional options:
-population_data_file: a file describing the ancestry of each individual (column) of the .genos file. If this is specified, it can be used to restrict to a subset of these individuals by ancestry, or to compute a weighted meta-analysis across ancestries -population_name: which population (in population_data_file) to restrict to. This can also be specified with the other two options, but it only affects which column of the .snplist file is used as the allele frequencies (i.e., AF_POPULATIONNAME). -meta_weights: meta-analysis weights to compute LD as a weighted average across populations in the population_data_file. To use this option, set population_name to ‘META’ -downsample_fraction: option to randomly downsample individuals in the .genos file. If this option is used, the .stats.txt output file will contain some additional columns reporting cross-validated MSE. -local_ancestry_dir: not yet supported
- FILENAME HANDLING OPTIONS:
-edgelist_dir: directory containing LDGM .edgelist files. Defaults to data_directory. snplist_dir: directory containing .snplist files. Defaults to data_directory. -output_dir: directory in which to save the output files -custom_filename: if specified, this is prepended to the name of the output files within output_dir -data_pattern: if specified, only data files matching this regular expression will be found -data_file_index: out of the data files that are found, which one to compute a precision matrix for -LD_matrix_snplist_dir: only applicable if specifying an LD matrix or LD correlation matrix SNP list as the data file; directory in which to find a corresponding .snplist file, for merging with the LDGM .snplist file
- METHOD PARAMETERS: (all of these can probably be left at default values)
-minimum_maf: minimum allele frequency, as calculated either using the LDGM .snplist file (default) or the LD matrix .snplist file (if AF_column_name is specified) -l1_penalty: L1 penalty for graphical lasso. Method performs some coordinate descent steps with the penalty, throws out edges whose precision matrix entries are zero, and continues without the penalty. Default value is 0.1. -path_distance_threshold: LDGM entries with distance below this threshold will be retained. Default value is 4. -patch_paths: how to remove missing and low-frequency SNPs. Options are 0 (discard these SNPs, do not patch paths that are broken), 1 (patch paths for missing SNPs, not for low-frequency SNPs), and 2 (patch paths for both missing and low-frequency SNPs) (not recommended). Default value is 0. -penalized_iterations: number of coordinate descent iterations to perform for each SNP in the L1-penalized graphical lasso step. Default value is 5. -unpenalized_iterations: number of coordinate descent iterations to perform for each SNP in the unpenalized graphical lasso step. Default value is 25. -lasso_iterations: number of inner-loop iterations in the coordinate descent procedure. Default value is 10. -banded_control_ldgm: instead of using the LDGM that is specified, replace it with a banded-diagonal matrix. The band size is chosen to match the density of the LDGM (with the specified path_distance_threshold). This should produce worse results than the original LDGM. -r2_control_ldgm: similar to banded_control_ldgm, except that instead of a banded diagonal matrix it uses a matrix with a ‘1’ wherever the r^2 between two SNPs is high enough. This should produce much worse results than using the original LDGM.
OUTPUT FILES: two output files are written to output_dir directory. They are named [custom_filename, filename, extension] where custom_filename is an optional user-specified string, filename is the name of the data file (except without its filename extension), and extension is either ‘.precisionMatrix.edgelist’ or ‘.stats.txt’
-’.stats.txt’ is a table with one row. It contains information like the block size, the runtime, and the MSE. Note: if the data file is a SNP list, then the MSE will be calculated from whatever entries are provided in that file, and it might be unreliable if too few are provided. -’.precisionMatrix.edgelist’ is an edge list with each row corresponding to an entry of the precision matrix. This file is zero-indexed, and the rows/columns of the precision matrix correspond to the ‘index’ column of the LDGM .snplist file.
Instead of saving the output, it is also an option to return the precision matrix P directly.
- MATLAB.precision.precisionCoordinateDescent(A, R_sample, maxIters, lambda, lassoIters, P)¶
precisionCoordinateDescent estimates the precision matrix P from the LDGM A and the sample correlation matrix R_sample, using a modified DP-GLASSO algorithm (Mazumder and Hastie, 2012).
Input arguments: A: graphical model as a sparse symmetric matrix; precision matrix edges will be a subset of those of A R_sample: sample correlation matrix; can be sparse with entries corresponding to those of A maxIters (optional): number of coordinate descent outer loops (default 25) lambda (optional): L1 penalty term, usually something like 0.1 (default 0) lassoIters (optional): number of iterations for lasso regression inner loop (default 10) P (optional): initialization for P (default identity matrix)
- MATLAB.precision.reduce_weighted_graph(A, missing)¶
reduce_weighted_graph removes rows/columns “missing” from A and patches paths between remaining vertices. A: weighted graph, with ‘1’ corresponding to a strong edge. missing: which SNPs are missing from A (logical or index)
- MATLAB.precision.solveLasso(P, y, lambda, neighbors, noIter, min_stepsize, u)¶
solveLasso uses proximal gradient descent (Combettes and Wajs, 2005) to solve a lasso regression problem:
P_inv(neighbors,neighbors) * u - y + lambda*sign(u) == 0 where P_inv == inv(P). This is the gradient of the lasso regression problem: argmin 0.5*sum((P_inv(neighbors,neighbors) * beta - y).^2) + lambda * sum(abs(beta))
Input arguments: P, y, lambda, idx: problem definition noIter: number of iterations u: initial guess for the coefficients