Information for Developers
Please file issues in the issue tracker if you have questions or comments!
Pull requests are welcome!
Guidance on writing new parsers
The current built-in parsers are contained in
src/taxburst/parsers.py. In general, if you are familiar with
Python, the easiest way to proceed is to copy and rename an existing
parsing class (e.g. Parse_SourmashCSVSummary) and modify the build
method. Then add a new if/else branch in the top-level parse_file
method.
The current parsers work by creating a nodes_by_tax
dictionary that contains (key, value) pairs where each key is a
semicolon-separated lineage (e.g. d__Bacteria;p__Spirochaetota) and
each value is a "node dictionary", a dictionary containing at least
name, count, and rank. This dictionary must contain all lineage
subpaths - e.g. if there is an entry for d__Bacteria;p__Spirochaetota
then there must also be an entry for d__Bacteria.
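As an illustration of the subpath requirement, here is a minimal sketch of filling in a nodes_by_tax-style dictionary so that every lineage prefix gets an entry. The helper name add_lineage, the RANKS list, and the choice to add each count to all ancestor nodes are assumptions made for this example; they are not taken from the taxburst source.

```python
# Hypothetical rank list for illustration; taxburst's own list may differ.
RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]


def add_lineage(nodes_by_tax, lineage, count, ranks=RANKS):
    """Add `count` to the node for `lineage` and to each of its prefixes.

    Assumes counts are cumulative, i.e. a parent's count includes its
    children's counts. Lineages deeper than `ranks` raise IndexError.
    """
    names = lineage.split(";")
    for depth in range(1, len(names) + 1):
        key = ";".join(names[:depth])  # e.g. "d__Bacteria", then the full path
        node = nodes_by_tax.setdefault(
            key,
            {"name": names[depth - 1], "count": 0, "rank": ranks[depth - 1]},
        )
        node["count"] += count
    return nodes_by_tax


nodes_by_tax = {}
add_lineage(nodes_by_tax, "d__Bacteria;p__Spirochaetota", 3)
add_lineage(nodes_by_tax, "d__Bacteria;p__Bacillota", 2)
```

Because every prefix is created on the way down, the d__Bacteria entry exists even though no input row named it directly.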
If the nodes_by_tax dictionary is built properly, then the function
taxburst.parsers.assign_children will build the hierarchy of nodes
needed for conversion into XHTML.
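To give a sense of what building the hierarchy involves, the following sketch links each node to its parent by stripping the last component from its lineage key. The function link_children is hypothetical and is not the implementation of taxburst.parsers.assign_children; it only shows why a missing subpath would be a problem.

```python
def link_children(nodes_by_tax):
    """Return the top-level nodes, with each node appended to its
    parent's "children" list (hypothetical helper, for illustration)."""
    top_level = []
    for lineage in sorted(nodes_by_tax):
        node = nodes_by_tax[lineage]
        # Everything before the last ";" is the parent's lineage key.
        parent_key, _, _ = lineage.rpartition(";")
        if not parent_key:
            top_level.append(node)
        else:
            # Raises KeyError if the parent subpath was never added.
            parent = nodes_by_tax[parent_key]
            parent.setdefault("children", []).append(node)
    return top_level


example = {
    "d__Bacteria": {"name": "d__Bacteria", "count": 5, "rank": "superkingdom"},
    "d__Bacteria;p__Spirochaetota": {
        "name": "p__Spirochaetota", "count": 3, "rank": "phylum",
    },
    "d__Bacteria;p__Bacillota": {
        "name": "p__Bacillota", "count": 2, "rank": "phylum",
    },
}
tree = link_children(example)
```

Here tree holds a single top-level node for d__Bacteria whose children list contains the two phylum nodes.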
Many consistency checks are applied to this tree before output, and
additional consistency checks can be run with --check-tree on the
taxburst command line. If you find an error that is not caught by
these checks, please file an issue about it and we will add it to the
checks!
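If you want to sanity-check a tree yourself while developing a parser, a sketch along the following lines may help. The function find_problems is hypothetical, and the two rules it applies (required keys are present; children's counts do not exceed the parent's count) are assumptions chosen for illustration, not necessarily the checks taxburst performs.

```python
def find_problems(nodes, path=""):
    """Recursively collect human-readable problems in a nested node list."""
    problems = []
    for node in nodes:
        here = f"{path}/{node.get('name', '?')}"
        # Each node dictionary should have at least these keys.
        for key in ("name", "count", "rank"):
            if key not in node:
                problems.append(f"{here}: missing '{key}'")
        children = node.get("children", [])
        # Assumed rule: a parent's count covers all of its children.
        child_total = sum(child.get("count", 0) for child in children)
        if child_total > node.get("count", 0):
            problems.append(f"{here}: children sum to {child_total}, more than parent")
        problems.extend(find_problems(children, here))
    return problems


tree = [
    {
        "name": "A", "count": 5, "rank": "Phylum",
        "children": [
            {"name": "B", "count": 3, "rank": "Class"},
            {"name": "C", "count": 1, "rank": "Class"},
        ],
    },
]
problems = find_problems(tree)
```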
An alternative approach to writing a parser is to produce your own set of nested dictionaries in Python, or, if you prefer to program in a different language, write code to output the JSON format (see below).
Examples, documentation, and automated testing
For each new Python-based format parser, please add an example
(ideally, calculated for the metagenome SRR11125891) to the
examples/ top-level directory, and (ideally) add a step to the
snakemake workflow in examples/Snakefile so that taxburst is
automatically run on the example. You should also add a link to the
example output in the doc/README.md file.
Please also add a brief description of the new parser format to
the doc/command-line.md document.
Last but by no means least, please add a new tests/test_parse_*.py file
that runs the parser and checks a few values.
If you are writing a parser in another language that outputs JSON, please feel free to include the original file and the JSON output file in the examples, and link to your parsing code in the documentation. Note: for the moment, the JSON format doesn't support custom attribute display; if you need this, please let us know and we'll figure something out!
Feel free to ask for help on any of these tasks!
Additional points
There is no inherent restriction on ranks, although the current parsing
classes all inherit from GenericParser which supports the normal
NCBI/GTDB ranks from "superkingdom" on down. Eventually we want to support
custom ranks (LINS, ICTV, etc); drop us a note if you're interested in
helping out, or testing!
Other keys are allowed in the node dictionary but are ignored in the
output format unless an extra_attributes dictionary is returned by
the parsing function; see the tax_annotate format parser for an
example.
Internals of the input and output formats
The taxburst code works in the following stages:
- Load in an input file containing some taxonomic summary.
- Convert that summary into an internal tree format in Python, based on nested lists of dictionaries.
- Convert that internal tree format into XHTML, which is then saved in a static HTML file with accompanying JavaScript to support interactive visualization.
These two intermediate formats are useful to know about because there are two ways to support new input formats: you can either write a Python function to convert a new format into the internal tree format, or you can write code in any language to output JSON that can be loaded into the internal tree format.
taxburst consumes a JSON version of this format with -F json, and
produces this format with --save-json <filename>.
(In the future, it should be possible to modify the internal JavaScript in the static HTML file to read the JSON directly, which would simplify this even more and allow for more flexibility as well.)
Internal dictionary format
Here is a simple example of the internal dictionary format:
nodes = [
    {
        "name": "A",
        "count": 5,
        "score": 0.831,
        "rank": "Phylum",
        "children": [
            {"name": "B", "count": 3, "score": 0.2, "rank": "Class"},
            {"name": "C", "count": 1, "score": 0.1, "rank": "Class"},
        ],
    },
]
See
examples/simple-output.py
in the GitHub repo for example code that produces an HTML output file
from this.
JSON version of the internal dictionary format
This nested dictionary format converts to fairly simple JSON:
[
    {
        "name": "A",
        "count": 5,
        "rank": "Phylum",
        "children": [
            {
                "name": "B",
                "count": 3,
                "rank": "Class"
            },
            {
                "name": "C",
                "count": 1,
                "rank": "Class"
            }
        ]
    }
]
This can be loaded from a file and converted into an HTML file like so:
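import json
import taxburst

# Load the nested node list from JSON.
with open('nodes.json') as fp:
    nodes = json.load(fp)

# Render the nodes to a static HTML page.
content = taxburst.generate_html(nodes)

# Open the output file in write mode before writing.
with open('nodes.html', 'w') as fp:
    fp.write(content)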
This is equivalent to taxburst -F json nodes.json -o nodes.html.
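Going the other way, a file like nodes.json can be written from nested dictionaries using only the standard library. This is a minimal sketch; the helper name write_nodes_json is hypothetical and is not part of taxburst.

```python
import json


def write_nodes_json(nodes, path):
    """Write a nested list of node dictionaries to `path` as JSON."""
    with open(path, "w") as fp:
        json.dump(nodes, fp, indent=2)


nodes = [
    {
        "name": "A",
        "count": 5,
        "rank": "Phylum",
        "children": [
            {"name": "B", "count": 3, "rank": "Class"},
            {"name": "C", "count": 1, "rank": "Class"},
        ],
    },
]
write_nodes_json(nodes, "nodes.json")
```

The resulting file should then be usable as input via -F json, as described above. A parser written in another language only needs to emit the same structure.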
Output formatting
taxburst uses Jinja2 templates to provide customizable formatting of the HTML output. Please see the files in src/taxburst/templates/ for implementation.