MATLAB API

This page provides documentation for the ldgm MATLAB API to perform efficent matrix algebra with a LDGM precision matrix, as well as code to create an LDGM precision matrix from an LDGM. We recommend starting with with the tutorials, which walk through how to use LDGM precision matrices to efficiently analyze real or simulated GWAS summary statistics. Detailed documentation of the functions in this package are below.

Functions to analyze GWAS summary statistics using LDGM precision matrices

MATLAB.BLUPxldcov(R, whichIndices, mergedSumstats, betaCov, sampleSize, alpha_param)

BLUPx computes the cross-popn best linear unbiased predictor, E(beta|GWAS, gaussian prior).

Input arguments: R: LD correlation matrices, as a number-of-LD-blocks by number-of-popns cell array with each cell containing a correlation matrix; should be the same size across populations. For SNPs that are missing from the summary statistics in some of the populations, R can be padded with zeros in the appropriate entries

whichIndices: output from mergesnplists, encoding which indices (rows/columns of the correlation matrices) have corresponding SNPs in each population. Should be a number-of-blocks by number-of-popns cell array.

mergedSumstats: merged summary statistics tables for each LD block - population pair, output from mergesnplists. Each table should have height equal to the length of corresponding cell of whichIndices. Tables are expected to report either a Z score (column name Z) or an effect size and standard error (beta, se). Optionally, if they report an allele frequency (AF) and alpha_param is specified, an AF-dependent prior will be used.

sampleSize: GWAS sample size for each population, as a vector.

betaCov: covariance matrix for the per-allele effect size of a SNP across popns, which is assumed to be i.i.d.

Output arguments: betaExpectationPerSD: expected value of beta for each popn, in per-s.d. (standardized) units, as a cell array of the same size as input arguments. Values are reported for every index in the precision matrices, but when some of these indices are missing summary statistics or precision matrix entries, beta will be zero in those positions. betaExpectationPerAllele: same as betaExpectationPerSD, but in per-allele instead of per-s.d. units. This can only be reported if AF is specified in the sumstats tables.

MATLAB.BLUPxldgm(P, whichIndices, mergedSumstats, betaCov, sampleSize, alpha_param)

BLUPx computes the best linear unbiased predictor, E(beta|GWAS), under a Gaussian model. It inputs data from one or more ancestry groups.

Input arguments: P: LDGM precision matrices, as a number-of-LD-blocks by number-of-popns cell array with each cell containing a precision matrix; should be the same size across populations. For SNPs that are missing from the summary statistics in some of the populations, P can be padded with zeros in the appropriate entries

whichIndices: output from mergesnplists, encoding which indices (rows/columns of the LDGM precision matrices) have corresponding SNPs in each population. Should be a number-of-blocks by number-of-popns cell array. If no SNPs are missinig (e.g., in simulations), specify cells as whichIndices{jj} = true(length(P{jj}),1);

mergedSumstats: merged summary statistics tables for each LD block - population pair, output from mergesnplists. Each table should have height equal to the length of corresponding cell of whichIndices. Tables are expected to report a Z score with column name Z_deriv_allele. Optionally, if they report an allele frequency with column name AF, then an AF-dependent prior can be used

betaCov: covariance matrix for the per-allele effect size of a SNP across popns, which is assumed to be i.i.d.

sampleSize (optional): GWAS sample size for each population, as a vector. If not specified, BLUPxldgm will look for a column of the summary statistics file called ‘N’ and use that if it is found.

alpha_param (optional): frequency-dependent architecture parameter; models per-allele effect size variance of each SNP as var(beta) == betaCov * heterozygosity ^ alpha. Default: 0.

Output arguments: betaExpectationPerSD: expected value of beta for each popn, in per-s.d. (standardized) units, as a cell array of the same size as input arguments. Values are reported for every index in the precision matrices, but when some of these indices are missing summary statistics or precision matrix entries, beta will be zero in those positions.

betaExpectationPerAllele: same as betaExpectationPerSD, but in per-allele instead of per-s.d. units. This can only be reported if AF is specified in the sumstats tables.

MATLAB.DENTISTldcov(R, whichIndices, mergedSumstats, normalizeCorrelation, partition)

DENTISTldcov computes p-values for LD mismatch or bad summary statistics QC using LDGM precision matrices and GWAS summary statistics

Re-implements the DENTIST method of Chen et al. 2021 Nat Commun, “Improved analyses of GWAS summary statistics by reducing data heterogeneity and errors”

However, this implementation is simplified: it only implements the formula in equation (1), without performing the iterative SNP-removal procedure. This means that for loci containing bad SNPs, it might inflate

Input arguments: R: LDGM correlation matrices, as a number-of-LD-blocks by 1 cell array with each cell containing a correlation matrix, or as a single precision matrix.

whichIndices: output from mergesnplists, encoding which indices (rows/columns of the correlation matrices) have corresponding SNPs in the summary statistics. Should be a cell array of indices/logicals, or a single vector of indices/logicals

mergedSumstats: merged summary statistics tables for each LD block, output from mergesnplists. Each table should have height equal to the length of corresponding cell of whichIndices. Tables are expected to report a Z score with column name Z_deriv_allele.

Output arguments: P_dentist: DENTIST p-values for LD mismatch for each SNP

T_dentist: DENTIST chi^2 statistics

S: cell array containing SNPs that were assigned to each partition

MATLAB.DENTISTldgm(P, whichIndices, mergedSumstats, normalizePrecision, partition)

DENTISTldgm computes p-values for LD mismatch or bad summary statistics QC using LDGM precision matrices and GWAS summary statistics

Re-implements the DENTIST method of Chen et al. 2021 Nat Commun, “Improved analyses of GWAS summary statistics by reducing data heterogeneity and errors”

However, this implementation is simplified: it only implements the formula in equation (1), without performing the iterative SNP-removal procedure. This means that for loci containing bad SNPs, it might inflate

Input arguments: P: LDGM precision matrices, as a number-of-LD-blocks by 1 cell array with each cell containing a precision matrix, or as a single precision matrix.

whichIndices: output from mergesnplists, encoding which indices (rows/columns of the LDGM precision matrices) have corresponding SNPs in the summary statistics. Should be a cell array of indices/logicals, or a single vector of indices/logicals

mergedSumstats: merged summary statistics tables for each LD block, output from mergesnplists. Each table should have height equal to the length of corresponding cell of whichIndices. Tables are expected to report a Z score with column name Z_deriv_allele.

Output arguments: P_dentist: DENTIST p-values for LD mismatch for each SNP

T_dentist: DENTIST chi^2 statistics

S: cell array containing SNPs that were assigned to each partition

MATLAB.GWASlikelihood(Z_deriv_allele, sigmasq, P, nn, whichIndices, intercept)

GWASlikelihood computes likelihood of the GWAS sumstats, Z_deriv_allele, under a gaussian model:

beta ~ MVN(mu,diag(sigmasq)) Z|beta ~ MVN(sqrt(nn)*R*beta, R) inv(R) = P.

The GWAS SNPs in Z_deriv_allele should be a subset of those in P, and in the same order; whichIndices should be true for rows/columns of P corresponding to one of these SNPs, false elsewhere (or specify indices instead of logicals)

sigmasq should be the same size as Z_deriv_allele, and missing SNPs are modeled as having zero effect size.

Optionally, arguments Z_deriv_allele, sigmasq, P, whichSNPs can be specified as cell arrays with one cell for each LD block.

If it is desired to specify the mean of beta, ie

beta ~ MVN(mu, diag(sigmasq))

call GWASlikelihood( Z - P mu * sqrt(nn), sigmasq, P, …)

MATLAB.GWASlikelihoodGradient(Z, sigmasq, P, nn, delSigmaDelA, whichSNPs, intercept, fixedIntercept)

GWASlikelihoodGradient computes gradient of the likelihood of the GWAS sumstats alphaHat under a gaussian model:

beta ~ MVN(mu,diag(sigmasq)) alphaHat|beta ~ MVN(R*beta, R/nn) inv(R) = P.

This function requires the sparseinv() function from suitesparse: https://people.engr.tamu.edu/davis/suitesparse.html

The GWAS SNPs in alphahat should be a subset of those in P, and in the same order; boolean vector whichSNPs should be true for rows/columns of P corresponding to one of these SNPs, false elsewhere.

sigmasq should be the same size as alphahat, such that missing SNPs are modeled as having zero effect-size variance.

Argument delSigmaDelA is the gradient of sigmasq w.r.t. some parameters A. GWASlikelihoodGradient computes the gradient of the likelihood w.r.t. A, i.e.: grad_A loglikelihood = delSigmaDelA * grad_sigmasq loglikelihood

Units of alphaHat should be either (1) normalized effect size estimates, ie sample correlations, or (2) Z scores. In case (1), nn should be the number of individuals in the GWAS. In case (2), nn should be 1, and the vector of sigmasq values should be multiplied by nn. Optionally, arguments alphaHat, sigmasq, P, whichSNPs can be specified as cell arrays with one cell for each LD block.

Optionally, arguments alphaHat, sigmasq, P, whichSNPs can be specified as cell arrays with one cell for each LD block.

Optionally, if it is desired to specify the mean of beta, ie

beta ~ MVN(mu, diag(sigmasq))

call GWASlikelihoodGradient( alphahat - P mu, …)

Optionally, if it is desired to model an unknown amount of population stratification/relatedness, or if the sample size is unknown, specify fixedIntercept = true. This modifies the model as: alphaHat|beta ~ MVN(R*beta, R * (1/nn + a)) where a==0, and it computes the derivative of the log-likelihood with respect to a; this will be the last element of grad.

MATLAB.ImpGldcov(R, whichIndices, mergedSumstats, lambda)

ImpGindiv performs “Bogdan-style” linear imputation of missing summary statistics using the method of Pasaniuc et al. 2014 Bioinformatics

Let “0” index the SNPs being imputed, and “1” index the non-missing SNPs. The imputation formula is: Z_0 = R_01 * inv(R_11) * Z_1, where R is the LD correlation matrix.

Input arguments: R: LDGM correlation matrices, as a number-of-LD-blocks by 1 cell array with each cell containing a precision matrix, or as a single precision matrix.

whichIndices: output from mergesnplists, encoding which indices (rows/columns of the LDGM precision matrices) have corresponding SNPs in the summary statistics. Should be a cell array of indices/logicals, or a single vector of indices/logicals

mergedSumstats: merged summary statistics tables for each LD block, output from mergesnplists. Each table should have height equal to the length of corresponding cell of whichIndices. Tables are expected to report a Z score with column name Z_deriv_allele. Should be a cell array of the same size as P and whichIndices, or a single table.

parallelize_blocks: optionally, parallelize computation across LD blocks (if P, whichIndices and mergedSumstats are cell arrays) (default: 0)

Output arguments: Z_imputed: Bogdan-style imputed Z scores.

MATLAB.ImpGldgm(P, whichIndices, mergedSumstats, precomputed_ldlChol, parallelize_blocks)

ImpGldgm performs “Bogdan-style” linear imputation of missing summary statistics using the method of Pasaniuc et al. 2014 Bioinformatics

Let “0” index the SNPs being imputed, and “1” index the non-missing SNPs. The imputation formula is: Z_0 = R_01 * inv(R_11) * Z_1, where R is the LD correlation matrix.

This is implemented with LDGMs instead using the formula Z_0 = v_0 * (P/P_11) (v_1’ * (P/P_00) * Z_1) where v_0, v_1 are the projection matrices onto the coordinates of the missing and non-mising SNPs, respectively.

Input arguments: P: LDGM precision matrices, as a number-of-LD-blocks by 1 cell array with each cell containing a precision matrix, or as a single precision matrix.

whichIndices: output from mergesnplists, encoding which indices (rows/columns of the LDGM precision matrices) have corresponding SNPs in the summary statistics. Should be a cell array of indices/logicals, or a single vector of indices/logicals

mergedSumstats: merged summary statistics tables for each LD block, output from mergesnplists. Each table should have height equal to the length of corresponding cell of whichIndices. Tables are expected to report a Z score with column name Z_deriv_allele. Should be a cell array of the same size as P and whichIndices, or a single table.

parallelize_blocks: optionally, parallelize computation across LD blocks (if P, whichIndices and mergedSumstats are cell arrays) (default: 0)

Output arguments: Z_imputed: Bogdan-style imputed Z scores.

MATLAB.README

This directory contains MATLAB code to infer LDGM precision matrices and to analyze GWAS data using LDGMs.

New users should start in the tutorials directory, where there are tutorials on how to analyze real and simulated GWAS summary statistics using LDGM precision matrices.

This directory contains the following functions:

BLUPxldgm: computes polygenic scores from GWAS summary statistics and LDGM precision matrices, for one or more populations

BLUPxldcov: computes polygenic scores from GWAS summary statistics and LD correlation matrices, for one or more populations

GWASlikelihood: computes log-likelihood of the GWAS summary statistics under a Gaussian model

GWASlikelihoodGradient: this function requires the sparseinv function, which can be obtained by installing suitesparse: https://people.engr.tamu.edu/davis/suitesparse.html It calculates the gradient of the log-likelihood w.r.t. the variance parameters of the Gaussian prior distribution.

loadLDGMs: loads LDGM precision matrices and SNP lists from a specified directory

logDeterminant: calculates the log-determinant of a matrix M=R+D, where R is the inverse of the LDGM precision matrix and D is a diagonal matrix.

mergesnplists: merges LDGM precision matrix SNPs with those from a table of GWAS summary statistics. It also takes care of allele phasing. We highly recommend using this function, as documented in the tutorial

precisionDivide: divides a vector y by the precision matrix, accounting for missing SNPs (i.e., multiplies y by the inverse of the precision matrix restricted to nonmissing SNPs)

precisionMultiply: multiplies a vector y by the precision matrix accounting for missing SNPs

r2PGS: calculates the PGS r2 of a polygenic predictor in a population with the specified LD precision matrix and allele frequencies

simulateSumstats: simulates GWAS summary statistics with the specified LD patterns and effect size distribution, for one or more populations

ImpGldgm and ImpGldcov: perform linear imputation of GWAS summary statistics using an LDGM precision matrix or an LD correlation matrix, respectively

The utility subdirectory contains functions that users will probably not use directly, but which are called from the functions above.

MATLAB.loadLDGMs(filepath, varargin)

loadLDGMs reads LDGMs or LDGM precision matrices from specified file or directory, together with corresponding SNP lists

Input arguments: filepath: where to look for .edgelist files to load

Optional input arguments (name-value pairs):

popnNames: names of populations to be included, or arbitrary strings that must be included in the filenames whichBlocks: indices of which LD blocks to include, out of all of those that are found snsplistpath: where to look for SNP lists, if different from filepath normalizePrecision: whether to normalize precision matrices such that the diagonal of their inverse is exactly 1

MATLAB.logDeterminant(P, whichIndices, d)

logDeterminant computes the log-determinant of the matrix M = R(whichIndices,whichIndices) + D, where R == inv(P) is the correlation matrix, D == diag(d) is a diagonal matrix, and whichIndices is a list of indices.

Input arguments: P: LDGM precision matrices, as a cell array with each cell containing a precision matrix or as a single precision matrix

whichIndices (optional): output from mergesnplists, encoding which indices (rows/columns of the LDGM precision matrices) have corresponding SNPs in each summary statistic file respectively. If P is a cell array, this should be a cell array of the same size; if P is a single matrix, this should be a vector of logicals or indices

d (optional): vectors to be converted into a diagonal matrix and added to P. Can also specify scalars. If P is a cell array, this should be a cell array of the same size; if P is a single matrix, this should be a scalar or a vector of size equal to the number of nonmissing SNPs

Output arguments: logDetM: log-determinant of M for each LD block

A: cholesky factor of L, which has submatrix M; has size equal to number of nonzero elements on the diagonal of P

MATLAB.mergesnplists(snplists, sumstats, P, varargin)

mergesnplists merges a cell array of .snplist tables, one for each LD block, with a table of summary statistics or other SNP information.

Input arguments: snplists: cell array where each cell contains a snplist table.

sumstats: summary statistics table with mandatory column name (non-case-sensitive) SNP or RSID, to be used for merging.

Optionally, sumstats can also contain columns named A1/A2 or anc_alleles/deriv_alleles. If these columns are specified, they will be merged with the anc_alleles/deriv_alleles columns of snplists.

P: cell array of LDGM precision matrices. Any indices whose

corresponding rows/columns in P are empty will be ingored. Alternatively, specify a logical row vector.

Optional input arguments: Optional inputs specify the names of various columns in the summary statistics table. Only the variant ID column is required; the others are optional, and will be ignored if they do not match anything in the sumstats table

columnNameContainingVariantId: column of sumstats table containing SNP IDs (default: ‘SNP’)

columnNameContainingReferenceAllele, columnNameContainingAlternativeAllele: column names of reference and alternative alleles, required for phasing effect sizes. Any rows with matching SNP IDs but mismatching alleles will be discarded. (default: ‘A1’, ‘A2’)

columnNameContainingZ: Z scores column name. If not specified, function will attempt to compute Z scores from ‘beta’ and ‘SE’ columns. Either way, it will create a column named ‘Z_deriv_allele’ for the Z score of the derived allele (if ref and alt alleles are found) (default: ‘Z’)

columnNameContainingBeta, columnNameContainingSE: if Z score column not found, function will compute Z scores from these columns (default: ‘Beta’, ‘SE’)

columnNameContainingAlternativeAlleleFrequency: column name containing allele frequency of the alternative allele. If specified, function will create a column called AF_deriv_allele with the AF of the derived allele (if ref and alt alleles are found) (default: ‘EAF’)

columnNameContainingPGSWeight: column name containing polygenic score weights. Like Z scores, these are phased and assigned to a new column called ‘pgs_weight_deriv_allele’. However, they are handled differently for SNPs that correspond to the same row/column of the LDGM (which are expected to be in near-perfect LD). Instead of picking one of the SNPs arbitrarily, the entry of pgs_weight_deriv_allele will be the sum of the phased PGS weights for all SNPs corresponding to that index. (default: none)

Ouput arguments: whichIndices: cell array of indices, a subset of the ‘index’ column of each snplist. Indexes which row/column of LDGMs have a corresponding SNP in the sumstats table.

mergedSumstats: cell array of sumstats tables, each containing a subset of the rows of the original sumstats table, that match the indices of each LD block. Note: when multiple SNPs correspond to the same row/column of the LDGM, the first-listed one will arbitrarily be chosen as a representative, except for the column specified by columnNameContainingPGSWeight

If alleles are specified in the sumstats table, the mergedSumstats tables will have an extra column appended called ‘phase’, which indicates whether the alleles match (+1) or anti-match (-1). If they mismatch (i.e., flipping their labels doesn’t make them match), then the SNP will be discarded.

Note: it makes this function run faster if you go chromosome-by-chromosome; this is not done automatically.

MATLAB.precisionDivide(P, y, whichIndices, use_ldlchol)

precisionDivide solves (P/P00)*x = y where P = [P11, P10; P01 P00], P11 = P(whichIndices,whichIndices), and P/P00 is the Schur complement Input arguments: P: precision matrix or cell array of precision matrices y: vector, matrix or cell array of vectors/matrices with same dimension as P whichIndices: indices or cell array of indices, or boolean vector/cell array of boolean vectors, indicating which entries of P correspond to the entries of y use_ldlchol: optionally, specify use_ldlchol == true to use a precomputed LDL cholesky decomposition in place of the precision matrix. This requires installing SuiteSparse. Then, run:

[A, ~, perm] = ldlchol(P); x(perm) = precisionDivide(A, y(perm), whichIndices(perm), true);

where perm is a fill-reducing permutation and A is the LD factor of P.

Output arguments: x == (P/P00) y

MATLAB.precisionMultiply(P, y, whichIndices)

precisionMultiply computes x = (P/P00)y where P = [P11, P10; P01 P00], P11 = P(whichIndices,whichIndices), and P/P00 is the Schur complement

Input arguments: P: precision matrix or cell array of precision matrices y: vector or cell array of vectors with same dimension as P whichIndices: indices or cell array of indices, or boolean vector/cell array of boolean vectors, indicating which entries of P correspond to the entries of y Output arguments: x == (P/P00)*y

MATLAB.r2PGS(trueBeta, estimatedBeta, P, whichIndices, AF, useLdlChol)

r2PGS calculates the PGS accuracy, i.e. the squared correlation between X*true_beta_perSD and X*estimated_beta_perSD, using the precision matrix.

Input arguments: trueBeta: ‘true’ causal effect sizes, as a cell array of size number-of-blocks by number-of-populations. If AF is specified, trueBeta should be specified in per-allele units; this will be the most convenient choice when making comparisons across ancestry groups (with ancestry-specific LD + AF). If AF is empty, trueBeta should be specified in per-SD units.

estimatedBeta: estimated causal effect sizes, as a cell array of the same size, or as a number-of-blocks by 1 cell array (in which case the same effect size estimates are used for every population). If AF is specified, estimatedBeta should be specified in per-allele units; this will be the most convenient choice when making comparisons across ancestry groups (with ancestry-specific LD + AF). If AF is empty, trueBeta should be specified in per-SD units.

P: LDGM precision matrices, as a cell array with each cell containing a precision matrix, for the target population (as opposed to the training population). If useLdlChol == true, then P should instead be a cell array of Cholesky factors computed using ldlchol.

whichIndices: output from mergesnplists, encoding which indices (rows/columns of the LDGM precision matrices) have corresponding SNPs in each summary statistic file respectively. Same size as P.

AF: allele frequencies, as a number-of-blocks by number-of-populations cell array. Can also be empty. If nonempty, trueBeta and estimatedBeta should be specified in per-allele units; if empty, they should be in per-SD units.

useLdlChol: whether Cholesky factors have been pre-computed using ldlchol(). This option leads to significant speedups. To precompute the cholesky factors, apply the following code to each LD block:

nz = any(P); % indices not missing from precision matrix A = ldlchol(P(nz,nz)); % Cholesky factor whichIndicesNz = lift(whichIndices, find(nz)); % indices into A trueBeta = trueBeta(nz); estimatedBeta = estimatedBeta(nz); r2 = r2PGS(trueBeta, estimatedBeta, A, whichIndicesNz, AF, true);

Output arguments: r2: the squared correlation between the predictions (this is usually larger than the squared correlation between the true and predicted weights themselves)

MATLAB.simulateSumstats(sampleSize, varargin)

Simulates summary statistics from specified prior distribution for one or more populations.

Required input arguments: sampleSize: sample size for each population as a vector

% Optional input arguments as name-value pairs:

savePath: path and file name where summary statistics should be saved. One file is saved per population; if there are multiple populations, this should be a cell array of length noPops. Sumstats will be saved as a plain text file with the following columns:

SNP: identifier of each SNP, if snpID is specified Z_deriv_allele: Z scores N: sample size AF_deriv_allele: if alleleFrequency is specified, the AF of each SNP index: row/column of the precision matrix corresponding to each SNP;

indexing starts at zero

block: which LD block the row/col belongs to; indexing starts at

zero

beta_perSD_true: true causal effect size, in per-standard-deviation units, for each SNP

snplists: SNP lists containing additional information to be printed with the summary statistics. Must be specified in order to use ‘PRScs’ file format option

fileFormat: If saving summary statistics to a file, which file format to use. Current options are ‘ldgm’, ‘PRScs’, ‘LDpred’, and ‘LDSC’

alleleFrequency: allele frequency for each LD block and each population as a number-of-LD blocks by number-of-populations cell array. If specified, the effect sizes will be in per-allele rather than per-SD units

precisionMatrices: precision matrix for each LD block and each population as a number-of-LD blocks by number-of-populations cell array. Specify either this or correlationMatrices.

correlationMatrices: correlation matrix for each LD block and each population as a number-of-LD blocks by number-of-populations cell array. Specify either this or precisionMatrices

annotations: annotation matrix for each LD block as a number-of-LD blocks by 1 cell array

linkFn: link function mapping annotation vector of a SNP to its relative per-SNP heritability. Should be a nonnegative, scalar-valued function: for example, @(annot)log(1 + exp(annot * tau)), where tau is a column vector of length equal to the number of annotations

whichIndicesAnnot: which rows/columns of the precision matrices/correlation matrices have a corresponding annotations vector, as a number-of-LD-blocks by 1 cell array. This is useful when you are performing simulations with real functional annotations, and they are missing for some of the SNPs in the LDGM. SNPs not in whichIndices will not be assigned an effect, and will not have summary statistics.

annotationDependentPolygenicity: if false (default), annotation-dependent and AF-dependent differences in per-SNP heritability will be simulated by scaling the effect-size magnitude of causal SNPs. If true, differences will be simulated by modifying the proportion of causal SNPs. When this is set to true, sum(componentWeight) should be small enough (e.g., 0.01) that it is possible to produce the deesired enrichments without exceeding the desired polygenicity; otherwise, a warning will be printed.

heritability: total heritability for each population, either as a scalar, a vector, or a square matrix. If a matrix, it specifies both the heritability for each population (along its diagonal) and the genetic correlations (off the diagonal, i.e. with corrcof(heritability) == r_pop). If a vector, the cross-population genetic correlations will be (almost) 1. If a scalar, the cross-population genetic correlations will be (almost) 1, and the heritability will be the same for each population.

componentVariance: per-allele effect-size covariance matrix for each mixture component, as a number-of-populations by number-of-populations by number-of-components array. These will be rescaled to match the total heritability.

componentWeight: mixture weight for each heritability component

missingness: fraction of SNPs that are missing, in addition to those for which precision matrix already has a zero diagonal element

Output arguments: sumstats: cell array of tables, one per LD block, with the following columns:

Z_deriv_allele: Z scores AF: allele frequencies N: sample size

whichIndices: number-of-blocks by number-of-populations cell array of indices for which sumstats are reported. If missingness input is zero (default), this is simply the indices for which the precision matrix is nonmissing.

componentVariance: per-allele effect-size covariance matrix across populations for each variance component. This is normalized to the desired heritability for each population.

true_beta_perallele: true per-allele effect size of each variant, including those that are missing in the summary statistics

true_beta_perSD: same as true_beta_perallele, but in per-SD units

Functions to infer LDGM precision matrices

MATLAB.precision.estimatePrecision(data_directory, varargin)

Estimates LDGM precision matrix from an LDGM and genotype data using modified DP-GLASSO algorithm

Input data (located at data_directory) can be one of three types, indicated by ‘data_type’ input:

-genotypes: an individials-by-SNPs binary matrix with filename extension .genos, and columns that correspond to the rows of .snplist file (see below). If this is specified, there several additional options (see below) -correlation: a SNPs-by-SNPs symmetric matrix with filename extension .txt, encoding the correlation between each pair of SNPs. -edgelist: an edges-by-3 .edgelist file encoding the correlation between each pair of SNPs with an edge in the LDGM. If any values are missing, the corresponding edge of the LDGM is dropped.

Differences between these options are as follows. If genotypes are specified, the .genos file must match the SNPs in the .snplist file. There is no option to specify a second .snplist file and merge the two sets of SNPs. On the other hand, there are several additional options (see below). If correlation or edgelist is specified, it is required to also specify LD_matrix_snplist_dir, whose rows correspond to the SNPs in the LD matrix (or edgelist). This will automatically be merged with the LDGM .snplist file, using the column named ‘site_ids’. The resulting precision matrix will have one row/column for each labeled brick (i.e., nonredundant SNP) in the LDGM such that at least one of its SNPs is present in the LD matrix. Optionally, specify ‘AF_column_name’ together with ‘minimum_AF’. With this option, the allele frequencies will be looked up in the LD matrix .snplist file (rather than the LDGM .snplist file), under the column with the corresponding name. Optionally, specify ‘A1_column_name’ and ‘A2_column_name’. With this option, the alleles will be matched between A1/A2 and ancestral/derived allele, and the resulting precision matrix will be phased to ancestral/derived.

If genotypes are specified, there are various additional options:

-population_data_file: a file describing the ancestry of each individual (column) of the .genos file. If this is specified, it can be used to restrict to a subset of these individuals by ancestry, or to compute a weighted meta-analysis across ancestries -population_name: which population (in population_data_file) to restrict to. This can also be specified with the other two options, but it only affects which column of the .snplist file is used as the allele frequencies (i.e., AF_POPULATIONNAME). -meta_weights: meta-analysis weights to compute LD as a weighted average across populations in the population_data_file. To use this option, set population_name to ‘META’ -downsample_fraction: option to randomly downsample individuals in the .genos file. If this option is used, the .stats.txt output file will contain some additional columns reporting cross-validated MSE. -local_ancestry_dir: not yet supported

FILENAME HANDLING OPTIONS:

-edgelist_dir: directory containing LDGM .edgelist files. Defaults to data_directory. -snplist_dir: directory containing .snplist files. Defaults to data_directory. -output_dir: directory in which to save the output files -custom_filename: if specified, this is prepended to the name of the output files within output_dir -data_pattern: if specified, only data files matching this regular expression will be found -data_file_index: out of the data files that are found, which one to compute a precision matrix for -LD_matrix_snplist_dir: only applicable if specifying an LD matrix or LD correlation matrix SNP list as the data file; directory in which to find a corresponding .snplist file, for merging with the LDGM .snplist file

METHOD PARAMETERS: (all of these can probably be left at default values)

-minimum_maf: minimum allele frequency, as calculated either using the LDGM .snplist file (default) or the LD matrix .snplist file (if AF_column_name is specified) -l1_penalty: L1 penalty for graphical lasso. Method performs some coordinate descent steps with the penalty, throws out edges whose precision matrix entries are zero, and continues without the penalty. Default value is 0.1. -path_distance_threshold: LDGM entries with distance below this threshold will be retained. Default value is 4. -patch_paths: how to remove missing and low-frequency SNPs. Options are 0 (discard these SNPs, do not patch paths that are broken), 1 (patch paths for missing SNPs, not for low-frequency SNPs), and 2 (patch paths for both missing and low-frequency SNPs) (not recommended). Default value is 0. -penalized_iterations: number of coordinate descent iterations to perform for each SNP in the L1-penalized graphical lasso step. Default value is 5. -unpenalized_iterations: number of coordinate descent iterations to perform for each SNP in the unpenalized graphical lasso step. Default value is 25. -lasso_iterations: number of inner-loop iterations in the coordinate descent procedure. Default value is 10. -banded_control_ldgm: instead of using the LDGM that is specified, replace it with a banded-diagonal matrix. The band size is chosen to match the density of the LDGM (with the specified path_distance_threshold). This should produce worse results than the original LDGM. -r2_control_ldgm: similar to banded_control_ldgm, except that instead of a banded diagonal matrix it uses a matrix with a ‘1’ wherever the r^2 between two SNPs is high enough. This should produce much worse results than using the original LDGM.

OUTPUT FILES: two output files are written to output_dir directory. They are named [custom_filename, filename, extension] where custom_filename is an optional user-specified string, filename is the name of the data file (except without its filename extension), and extension is either ‘.precisionMatrix.edgelist’ or ‘.stats.txt’

-’.stats.txt’ is a table with one row. It contains information like the block size, the runtime, and the MSE. Note: if the data file is a SNP list, then the MSE will be calculated from whatever entries are provided in that file, and it might be unreliable if too few are provided. -’.precisionMatrix.edgelist’ is an edge list with each row corresponding to an entry of the precision matrix. This file is zero-indexed, and the rows/columns of the precision matrix correspond to the ‘index’ column of the LDGM .snplist file.

Instead of saving the output, it is also an option to return the precision matrix P directly.

MATLAB.precision.precisionCoordinateDescent(A, R_sample, maxIters, lambda, lassoIters, P)

precisionCoordinateDescent estimates the precision matrix P from the LDGM A and the sample correlation matrix R_sample, using a modified DP-GLASSO algorithm (Mazumder and Hastie, 2012).

Input arguments: A: graphical model as a sparse symmetric matrix; precision matrix edges will be a subset of those of A R_sample: sample correlation matrix; can be sparse with entries corresponding to those of A maxIters (optional): number of coordinate descent outer loops (default 25) lambda (optional): L1 penalty term, usually something like 0.1 (default 0) lassoIters (optional): number of iterations for lasso regression inner loop (default 10) P (optional): initialization for P (default identity matrix)

MATLAB.precision.reduce_weighted_graph(A, missing, is_weighted)

reduce_weighted_graph removes rows/columns “missing” from A and patches paths between remaining vertices. A: weighted graph, with ‘1’ corresponding to a strong edge. missing: which SNPs are missing from A (logical or index) is_weighted: true -> edge weights should be between 0 and 1 (default); 0 -> edge weights represent distances and should be between 0 and inf

MATLAB.precision.solveLasso(P, y, lambda, neighbors, noIter, min_stepsize, u)

solveLasso uses proximal gradient descent (Combettes and Wajs, 2005) to solve a lasso regression problem:

P_inv(neighbors,neighbors) * u - y + lambda*sign(u) == 0 where P_inv == inv(P). This is the gradient of the lasso regression problem: argmin 0.5*sum((P_inv(neighbors,neighbors) * beta - y).^2) + lambda * sum(abs(beta))

Input arguments: P, y, lambda, idx: problem definition noIter: number of iterations u: initial guess for the coefficients

MATLAB.precision.writeCorrelationEdgelist(data_directory, varargin)

writeCorrelationEdgelist inputs a genotype matrix and an LDGM, and it outputs the correlations between pairs of SNPs with edges in the LDGM, as a .edgelist file.

Input data (located at data_directory) is an individials-by-SNPs binary

matrix with filename extension .genos, and columns that correspond to the rows of .snplist file (see below).

-population_data_file: a file describing the ancestry of each individual (column) of the .genos file. If this is specified, it can be used to restrict to a subset of these individuals by ancestry, or to compute a weighted meta-analysis across ancestries -population_name: which population (in population_data_file) to restrict to. This can also be specified with the other two options, but it only affects which column of the .snplist file is used as the allele frequencies (i.e., AF_POPULATIONNAME). -meta_weights: meta-analysis weights to compute LD as a weighted average across populations in the population_data_file. To use this option, set population_name to ‘META’ -local_ancestry_dir: not yet supported

FILENAME HANDLING OPTIONS:

-edgelist_dir: directory containing LDGM .edgelist files. Defaults to data_directory. snplist_dir: directory containing .snplist files. Defaults to data_directory. -output_dir: directory in which to save the output files -custom_filename: if specified, this is prepended to the name of the output files within output_dir -data_pattern: if specified, only data files matching this regular expression will be found -data_file_index: out of the data files that are found, which one to compute a precision matrix for

METHOD PARAMETERS: (all of these can probably be left at default values)

-minimum_maf: minimum allele frequency, as calculated either using the LDGM .snplist file (default) or the LD matrix .snplist file (if AF_column_name is specified) -path_distance_threshold: LDGM entries with distance below this threshold will be retained. Default value is 4.

OUTPUT FILES:

-’*.correlationMatrix.edgelist’ is an edge list with each row corresponding to an entry of the LDGM. This file is zero-indexed, and the rows/columns of the precision matrix correspond to the ‘index’ column of the LDGM .snplist file.

Instead of saving the output, it is also an option to return the correlation matrix R directly.