Tutorial

Creating Graphs

Let’s say that you wanted to create the following graph:

../_images/exampleGraph.png

This graph is a combination of nodes (labelled as n0, n1, …, n9) and directed edges (arrows).

Graph Objects

Edges and nodes are accessed through an implementation of the bdsg.handlegraph.HandleGraph interface. Individual nodes in the graph are pointed at by bdsg.handlegraph.handle_t objects.

Paths exist in graphs that implement the bdsg.handlegraph.PathHandleGraph. Paths are accessed through bdsg.handlegraph.path_handle_t, which is a series of bdsg.handlegraph.step_handle_t linked together. Each bdsg.handlegraph.step_handle_t points to the node in that step, and also contains directional information regarding the nodes preceeding and following it.

Handles are pointers to specific pieces of the graph, and it is not possible to operate on them directly, aside from comparing whether the objects are equal. To get information regarding the object that each handle is pointing to, it is necessary to use the corresponding get accessor method on the graph that issued the handle.

Reference materials for these methods can be found at the Python API, as well as the Sorted Glossary of Methods, which contains lists sorted by object type for Accessor Methods, Mutator Methods, and Iteratator Methods.

Making a Graph

First, we must create the graph, then make each node and keep track of their handles. We’re going to be using the HashGraph, bdsg.bdsg.HashGraph, which is a good all-around graph that implements bdsg.handlegraph.MutablePathDeletableHandleGraph.

from bdsg.bdsg import HashGraph
gr = HashGraph()
seq = ["CGA", "TTGG", "CCGT", "C", "GT", "GATAA", "CGG", "ACA", "GCCG", "ATATAAC"]
n = []
for s in seq:
    n.append(gr.create_handle(s))

Now we link together these nodes using their handles. Note that each of these handles is directional, and we create each edge from the first handle to the second. In order to create both of the edges between n5 and n8 (since each can follow the other) we use create_edge twice.

gr.create_edge(n[0], n[1])
gr.create_edge(n[1], n[2])
gr.create_edge(n[2], n[3])
gr.create_edge(n[2], n[4])
gr.create_edge(n[3], n[5])
gr.create_edge(n[5], n[6])
# Connect the end of n5 to the start of n8
gr.create_edge(n[5], n[8])
gr.create_edge(n[6], n[7])
gr.create_edge(n[6], n[8])
gr.create_edge(n[7], n[9])
gr.create_edge(n[8], n[9])
# Connect the end of n8 back around to the start of n5
gr.create_edge(n[8], n[5])

Traversing Edges

If we wanted to traverse these edges, we could do it using the iterator method bdsg.handlegraph.HandleGraph.follow_edges().

def next_node_list(handle):
    lis = []
    gr.follow_edges(handle, False, lambda y: lis.append(y))
    return lis

print(f'n0: {gr.get_sequence(n[0])}')
next_node = next_node_list(n[0])[0]
print(f'n1: {gr.get_sequence(next_node)}')
next_node = next_node_list(next_node)[0]
print(f'n2: {gr.get_sequence(next_node)}')

Which will output the following:

n0: CGA
n1: TTGG
n2: CCGT

Creating a Path

Generating a linear sequence from this graph could be done in infinitely many ways, due to the interal loop between n5, n6, and n8. If we wanted to define a single consensus sequence, we would do this by defining a path.

../_images/exampleGraphPath.png

To create the hilighted path, we would need to create a bdsg.handlegraph.path_handle_t in the graph, and then append each bdsg.handlegraph.handle_t to the end of the path.

path = gr.create_path_handle("path")
gr.append_step(path, n[0])
gr.append_step(path, n[1])
gr.append_step(path, n[2])
gr.append_step(path, n[4])
gr.append_step(path, n[5])
gr.append_step(path, n[6])
gr.append_step(path, n[7])
gr.append_step(path, n[9])

Warning

bdsg.handlegraph.MutablePathHandleGraph.append_step() will not stop you from appending nodes that are not connected to the preceeding node.

# the following code runs without error
badpath = gr.create_path_handle("badpath")
gr.append_step(badpath, n[0])
gr.append_step(badpath, n[3])

Traversing a path

To traverse a path, we need to fetch a series of bdsg.handlegraph.step_handle_t from the graph. Note that although we are effectively asking the path for these items in it, all accessor methods are a part of the bdsg.handlegraph.PathHandleGraph object.

step = gr.path_begin(path)
while(gr.has_next_step(step)):
    # get the node handle from the step handle
    current_node_handle = gr.get_handle_of_step(step)
    # ask the node handle for the sequence
    print(gr.get_sequence(current_node_handle))
    # progress to the next step
    step = gr.get_next_step(step)
current_node_handle = gr.get_handle_of_step(step)
print(gr.get_sequence(current_node_handle))

Which will output the following:

CGA
TTGG
CCGT
GT
GATAA
CGG
ACA
ATATAAC

Saving and Loading Graphs

Graphs that implement bdsg.handlegraph.SerializableHandleGraph, such as bdsg.bdsg.HashGraph, can be saved and loaded through the bdsg.handlegraph.SerializableHandleGraph.serialize() and bdsg.handlegraph.SerializableHandleGraph.deserialize() methods.

Graph File Example

If you wish to save the graph from the above session, that can be done with:

gr.serialize("example_graph.hg")

This can be loaded into a new python session by using:

from bdsg.bdsg import HashGraph
gr = HashGraph()
gr.deserialize("example_graph.hg")

Loading in Pre-Existing Data

Each graph implementation knows how to read files in its respective file format.

For example, provided that data has been serialized in PackedGraph format, it is possible to read it directly from a file with bdsg.bdsg.PackedGraph. Download this graph and load it into python with:

from bdsg.bdsg import PackedGraph
brca2 = PackedGraph()
brca2.deserialize("cactus-brca2.pg")

We can poke around this data and get the sequence of the path with:

path_handle = []
handles = []
brca2.for_each_path_handle(lambda y: path_handle.append(y) or True)
brca2.for_each_step_in_path(path_handle[0],
    lambda y: handles.append(brca2.get_handle_of_step(y)) or True)
sequence = ""
for handle in handles:
    sequence += brca2.get_sequence(handle)
print(sequence[0:10])
print(len(sequence))

Note how we are using or True in the iteratee callback lambda functions to make sure they return True. If a callback returns False or None (which is what is returned when you don’t return anything), iteration will stop early and the for_each call will return False.

Reading in a Graph from vg

Graph assembies can be created with vg. Many .vg files that vg 1.28.0 or newer produces will be in HashGraph format, directly loadable by bdsg.bdsg.HashGraph.deserialize(). To check a file, you can use vg stats --format, like so:

vg stats --format graph.vg

If you see one of format: HashGraph or format: PackedGraph, you can probably (but not always; see the note on encapsulation) load the graph with bdsg.bdsg.HashGraph or bdsg.bdsg.PackedGraph, respectively.

However, in some circumstances, you will need to convert the graph to one of those formats. Some graphs (most notably, graphs from vg construct) will report format: VG-Protobuf, and .xg files will report format: XG. These graphs will need to be converted to HashGraph or PackedGraph format, and then loaded with the appropriate class. For example, you can do this:

vg convert --packed-out graph.vg > graph.pg

The resulting PackedGraph file can be loaded with bdsg.bdsg.PackedGraph.deserialize().

from bdsg.bdsg import PackedGraph
graph = PackedGraph()
graph.deserialize("graph.pg")

To use bdsg.bdsg.HashGraph instead, substitute --hash-out for --packed-out.

Older vg Graphs with Encapsulation

Versions of vg before 1.28.0 would encapsulate HashGraph and PackedGraph graphs in a file format that vg can read but libbdsg cannot. Consequently, some older graph files will be reported as format: HashGraph or format: PackedGraph by vg stats --format, but will still not be readable using libbdsg.

If you encounter one of these files, you can use vg view --extract-tag to remove the encapsulation and pull out the internal file which libbdsg can understand. For example, for a file that reports format: PackedGraph but is not loadable by libbdsg, you can do:

vg view graph.vg --extract-tag PackedGraph > graph.pg

This also works for HashGraph files, by replacing PackedGraph with HashGraph.

Running the file through vg convert with vg 1.28.0 or newer will also solve the problem, but could take longer.