Minimal syntax for phylogenetic trees

Darius J Chuck

2021-12-06

The Newick format was developed in 1986 as a minimal representation of phylogenetic trees for the PHYLogeny Inference Package (PHYLIP).

More generally, it can represent many kinds of tree structures.

Following up on the previous article, I show here how Jevko could be used as an even more general and minimal alternative for the task.

Example

The Wikipedia example tree, which is represented in the Newick format in several ways:

(,,(,));                           
(A,B,(C,D));                       
(A,B,(C,D)E)F;                     
(:0.1,:0.2,(:0.3,:0.4):0.5);       
(:0.1,:0.2,(:0.3,:0.4):0.5):0.0;   
(A:0.1,B:0.2,(C:0.3,D:0.4):0.5);   
(A:0.1,B:0.2,(C:0.3,D:0.4)E:0.5)F; 
((B:0.2,(C:0.3,D:0.4)E:0.5)F:0.1)A;

can be represented in Jevko as follows:

[[][][[][]]]
[[A][B][[C][D]]]
[[A][B][[C][D]E]F]
[0.1[]0.2[]0.5[0.3[]0.4[]]]
0.0[0.1[]0.2[]0.5[0.3[]0.4[]]]
[0.1[A]0.2[B]0.5[0.3[C]0.4[D]]]
[0.1[A]0.2[B]0.5[0.3[C]0.4[D]E]F]
[0.1[0.2[B]0.5[0.3[C]0.4[D]E]F]A]

Let’s call this format Phylo-Jevko.
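The correspondence between the two lists above is mechanical, so it can be sketched as a small converter. The following is a minimal Python sketch, not a reference implementation: the function name `newick_to_phylo_jevko` is made up for illustration, and it assumes unquoted Newick labels with no comments and no escaping.

```python
def newick_to_phylo_jevko(newick: str) -> str:
    """Convert an unquoted Newick string into the bracket notation above."""
    text = newick.strip()
    if not text.endswith(";"):
        raise ValueError("a Newick tree must end with ';'")
    text = text[:-1]  # drop the terminating ";"
    pos = 0

    def read_label() -> str:
        # Read characters up to the next Newick delimiter (or end of input).
        nonlocal pos
        start = pos
        while pos < len(text) and text[pos] not in "(),:":
            pos += 1
        return text[start:pos].strip()

    def read_subtree() -> str:
        # Emit one subtree as: Length "[" children Name "]".
        nonlocal pos
        children = ""
        if pos < len(text) and text[pos] == "(":
            pos += 1  # skip "("
            while True:
                children += read_subtree()
                if pos >= len(text):
                    raise ValueError("unbalanced parentheses")
                if text[pos] == ",":
                    pos += 1  # another sibling follows
                elif text[pos] == ")":
                    pos += 1  # end of this branch set
                    break
                else:
                    raise ValueError(f"unexpected {text[pos]!r} at {pos}")
        name = read_label()
        length = ""
        if pos < len(text) and text[pos] == ":":
            pos += 1  # skip ":"
            length = read_label()
        # The length moves in front of the opening bracket.
        return f"{length}[{children}{name}]"

    result = read_subtree()
    if pos != len(text):
        raise ValueError("unexpected trailing input")
    return result
```

Under these assumptions, calling `newick_to_phylo_jevko("(A,B,(C,D)E)F;")` yields `[[A][B][[C][D]E]F]`, matching the third pair of examples.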

Grammar comparison

Compared to the Newick format’s grammar:

Tree → Subtree ";"
Subtree → Leaf | Internal
Leaf → Name
Internal → "(" BranchSet ")" Name
BranchSet → Branch | Branch "," BranchSet
Branch → Subtree Length
Name → empty | string
Length → empty | ":" number

Phylo-Jevko is simpler:

Tree = *Branch Name
Branch = Length "[" Tree "]"
Length = number / ""
Name = string / ""

In Phylo-Jevko:

A Subtree is a Branch of the top-level Tree.
A Leaf node is a Tree with zero Branches.
An Internal node is a Branch. Its Name is the Name of the Tree it contains.
The Length of a Branch goes before the opening bracket, rather than after the Name and a ":" separator as in the Newick format.

Both Name and Length may be surrounded by whitespace. Whitespace is also allowed within a Name; no quoting is needed.
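To illustrate how little machinery the grammar needs, here is a minimal recursive-descent parser sketch in Python. The `Node` class and the `parse_phylo_jevko` function are hypothetical names introduced only for this example; the sketch assumes no escaping and no comments, and it trims the whitespace around Name and Length as described above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    name: str = ""
    length: Optional[float] = None  # None when the Length is empty
    children: List["Node"] = field(default_factory=list)


def parse_phylo_jevko(text: str) -> Node:
    """Parse `Tree = *Branch Name` and return the top-level Tree."""
    pos = 0

    def read_text() -> str:
        # Read up to the next bracket and trim surrounding whitespace.
        nonlocal pos
        start = pos
        while pos < len(text) and text[pos] not in "[]":
            pos += 1
        return text[start:pos].strip()

    def parse_tree(length: Optional[float]) -> Node:
        nonlocal pos
        node = Node(length=length)
        while True:
            prefix = read_text()
            if pos < len(text) and text[pos] == "[":
                # The prefix was the Length of a Branch.
                pos += 1
                branch_length = float(prefix) if prefix else None
                node.children.append(parse_tree(branch_length))
                if pos >= len(text) or text[pos] != "]":
                    raise ValueError("missing closing bracket")
                pos += 1
            else:
                # No more Branches: the prefix was the Name of this Tree.
                node.name = prefix
                return node

    root = parse_tree(None)
    if pos != len(text):
        raise ValueError("unexpected closing bracket")
    return root
```

Parsing `[0.1[A]0.2[B]0.5[0.3[C]0.4[D]E]F]` with this sketch gives an unnamed top-level Tree with a single Branch whose Tree is named F and has the children A, B, and E.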

A simple escape mechanism (as in the definition of Jevko) could be introduced to allow Names with brackets.
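One way such an escape mechanism might look is sketched below in Python. It assumes a backtick as the escape character, placed before a bracket or another backtick; that choice is an assumption made for this example, so check the definition of Jevko for the actual rule. The helper names `escape_name` and `unescape_name` are likewise hypothetical.

```python
ESCAPER = "`"  # assumed escape character; see the lead-in
SPECIAL = "[]" + ESCAPER  # characters that must be escaped inside a Name


def escape_name(name: str) -> str:
    """Prefix every special character in a Name with the escaper."""
    return "".join(ESCAPER + ch if ch in SPECIAL else ch for ch in name)


def unescape_name(raw: str) -> str:
    """Reverse escape_name, rejecting stray brackets and bad escapes."""
    out = []
    chars = iter(raw)
    for ch in chars:
        if ch == ESCAPER:
            nxt = next(chars, None)  # the character being escaped
            if nxt is None or nxt not in SPECIAL:
                raise ValueError("invalid escape sequence")
            out.append(nxt)
        elif ch in "[]":
            raise ValueError("unescaped bracket in name")
        else:
            out.append(ch)
    return "".join(out)
```

A parser that supports this would also have to skip escaped brackets while scanning for the next delimiter.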

Thanks to these simplifications, no extra delimiters such as ":", ";", ",", or "'" are needed; brackets alone suffice.

Comments could be implemented as branches prefixed with # instead of Length:

Comment = "#" "[" Name "]"

Nested comments could be allowed like so:

Comment = "#" "[" NestedComment "]"
NestedComment = *("[" NestedComment "]" / Name)
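As a rough illustration of the nested variant, the Python sketch below removes comment branches from a Phylo-Jevko string before it is parsed. The function name `strip_comments` is made up for this example, and the sketch assumes that no escape mechanism is in use and that whitespace around the # prefix is allowed.

```python
def strip_comments(text: str) -> str:
    """Remove every `#[...]` branch, including nested brackets inside it."""
    out = []
    pos = 0
    while pos < len(text):
        # Collect the text in front of the next bracket.
        start = pos
        while pos < len(text) and text[pos] not in "[]":
            pos += 1
        prefix = text[start:pos]
        is_comment = (
            pos < len(text) and text[pos] == "[" and prefix.strip() == "#"
        )
        if is_comment:
            # Skip to the bracket that closes the comment, tracking depth.
            depth = 0
            while pos < len(text):
                if text[pos] == "[":
                    depth += 1
                elif text[pos] == "]":
                    depth -= 1
                pos += 1
                if depth == 0:
                    break
            if depth != 0:
                raise ValueError("unterminated comment")
            continue  # drop both the "#" prefix and the comment body
        out.append(prefix)
        if pos < len(text):
            out.append(text[pos])  # keep the bracket itself
            pos += 1
    return "".join(out)
```

For example, `strip_comments("[#[two leaves]0.1[A]#[a [nested] note]0.2[B]F]")` returns `[0.1[A]0.2[B]F]`.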

Conclusion

Extremely minimal formats for encoding all kinds of tree structures can be built based on Jevko, offering the most bang for the buck in terms of complexity. Less accidental complexity means less trouble and better efficiency. This is a good direction to go in when building a new system.

For existing systems, whether the extra efficiency is worth the work of simplifying is a question best answered case by case.

Comments welcome on Mastodon.
