Python API

The ldgm Python API provides methods for inferring an LDGM from a tree sequence. See the MATLAB API page for documentation on inferring LDGM precision matrices from LDGMs.

Creating an LDGM

ldgm.make_ldgm(ts, path_weight_threshold, recombination_freq_threshold=None, num_processes=1, chunksize=100, progress=False)

Take a tree sequence and produce an LDGM. The path_weight_threshold and recombination_freq_threshold parameters allow differing levels of regularization to be applied to this process.

Parameters:
  • tree_sequence (tskit.TreeSequence) – The input tskit.TreeSequence, which will be used to create the linkage disequilibrium graphical model.

  • path_weight_threshold (float) – The maximum path weight between nodes in the brick-haplo graph which should be retained in the LDGM. Lower path_weight_threshold values correspond to more regularization. The supplementary note of Nowbandegani et al. (2022) defines these path weights. Values of 4-6 have led to good results with data from the 1000 Genomes Project.

  • recombination_freq_threshold (float) – The minimum frequency above which bricks should be created by “bifurcating” edges. For example, if recombination_freq_threshold is 0.01, an edge with a different number of descendant samples to the left and right of a recombination event (marginal tree change) is bifurcated to create two bricks if the frequency of the edge’s child node is greater than 1% in the right hand marginal tree. If None, all edges with differing numbers of descendants to the left and right of a recombination event are bifurcated. Default: None

  • num_processes (int) – The number of threads to use. Default: 1

  • chunksize (int) – If using multiple threads, the algorithm will chop the dijkstra search step (the rate-limiting step of the algorithm) into chunks of nodes. The chunksize parameter determines the approximate size of chunks. Good results were observed with values around 100. Default: 100

  • progress (bool) – Whether to display a progress bar. Default: False

Returns:

A tuple of an LDGM and a bricked tree sequence.

Return type:

(networkx.DiGraph, tskit.TreeSequence)

Intermediate steps to creating an LDGM

Creating a “bricked tree sequence”

ldgm.brick_ts(ts, recombination_freq_threshold=None, progress=True)

Take an input tree sequence and bifurcate edges to create “bricks” that have the same set of descendants at every position.

Parameters:
  • ts (tskit.TreeSequence) – The input tskit.TreeSequence, which has not had ldgm.brick_ts() run on it.

  • recombination_threshold (float) – The minimum frequency above which bricks should be created by “bifurcating” edges. For example, if recombination_freq_threshold is 0.01, an edge with a different number of descendant samples to the left and right of a recombination event (marginal tree change) is bifurcated to create two bricks if the frequency of the edge’s child node is >1% in the right hand marginal tree. If None, all edges with differing numbers of descendants to the left and right of a recombination event are bifurcated. Default: None

  • progress (bool) – Whether to display a progress bar. Default: False

Returns:

A bricked version of the input tree sequence

Return type:

tskit.TreeSequence

Creating a brick graph from a bricked tree sequence

ldgm.brick_haplo_graph(bricked_ts, edge_weight_threshold=None, make_sibs=True, progress=True)

Takes a “bricked” tree sequence and creates a “brick-haplotype graph”. This graph consists of vertices for labeled bricks (which have a mutation), unlabeled bricks (without a mutation) and haplotypes. For each labeled brick, we define a cluster of six vertices: Up-Before, Down-Before, Up-After, Down-After, U-Turn, and Out. For each unlabeled brick, we define a cluster of four vertices: Up-Before, Down-Before, Up-After, and Down-After. Haplotypes have a cluster of two vertices: Before and After. The union of these clusters form the vertex set of the brick-haplotype graph. Edges in the brick-haplotype graph are used to find the conditional dependencies between SNPs in in the ldgm.reduce_graph() function

Vertex labels determined as the ID of a brick (edge) in the input brick tree sequence plus an integer, which determines which the vertex’s category:

labeled bricks: brick_id + 0 brick_id + 1 brick_id + 2 brick_id + 3 brick_id + 4

labeled bricks: brick_id + 0 brick_id + 1 brick_id + 2 brick_id + 3 brick_id + 4

Parameters:
  • bricked_ts (tskit.TreeSequence) – The input tskit.TreeSequence, which has been bricked (i.e. ldgm.brick_ts() has been run on it).

  • edge_weight_threshold (float) – The

  • make_sibs (bool) – Whether to run rule two, connecting siblings. Default: True

  • progress (bool) – Whether to display a progress bar. Default: False

Returns:

A Networkx graph encoding conditional dependence between labeled bricks.

Return type:

networkx.DiGraph

Creating a brick graph from a bricked tree sequence

ldgm.reduce_graph(brick_haplo_graph, brick_ts, path_weight_threshold, num_processes=1, progress=True, chunksize=100)

Make a reduced graph from a brick-haplo graph and bricked tree sequence

Parameters:
  • brick_haplo_graph (networkx.DiGraph) – An input brick-haplo graph, outputted by ldgm.brick_haplo_graph().

  • brick_ts (tskit.TreeSequence) – The input tskit.TreeSequence, which has been bricked (i.e. ldgm.brick_ts() has been run on it). This must be the tree sequence which was passed to ldgm.brick_haplo_graph().

  • path_weight_threshold (float) – The maximum path weight between nodes in the brick-haplo graph which should be retained in the LDGM. Lower path_weight_threshold values correspond to more regularization. The supplementary note of Nowbandegani et al. (2022) defines these path weights. Values of 4-6 have led to good results with data from the 1000 Genomes Project.

  • progress (bool) – Whether to display a progress bar. Default: False

Returns:

An LDGM

Return type:

networkx.DiGraph

Utility functions

ldgm.prune_sites(ts, threshold)
ldgm.make_snplist(ts, site_metadata_id=None, population_dict=None)
Returns information on variant sites in the input

tskit.TreeSequence.

Parameters:
  • bricked_ts (TreeSequence) – The input :class`tskit.TreeSequence`, for which the site list will be determined. This tree sequence must have been “bricked”, i.e. ldgm.brick_ts() must have been run on the tree sequence.

  • site_metadata_id (string) – The site ID field in the JSON-encoded site metadata. If None, does not return site IDs. Default: None.

  • population_dict (list) – A dictionary where keys are population ids and values are lists of nodes in each population. This defines populations from which allele frequencies are computed. If None, does not return site frequencies. Default: None.

Returns:

A dictionary containing the following key, value pairs: key: “index”, value: a numpy.ndarray of the IDs of the node in the LDGM to which each site belongs. When multiple mutations appear on the same brick, they will refer to the same node in the LDGM. key: “site_ids”, value: a numpy.ndarray containing the site_id of each site, only returned if site_metadata_id is specified. key: “anc_alleles”, value: a numpy.ndarray of the ancestral allele of each site. key: “deriv_alleles”, value: a numpy.ndarray of the derived allele of each site. key: “site_frequencies”, value: a mxp numpy.ndarray, where \(m\) is the number of sites in the input tree sequence and \(s\) is the number of sample sets passed to the population_dict parameter. Only returned if population_dict is specified.

Return type:

dict