2021-06-04
CSV is a simple format that works fairly well for representing tabular data.
However, it does have some problems.
Let’s see how we can solve them by creating an even simpler format out of a subset of TAO.
The trouble, as always, appears in the edge cases.
CSV separates rows of data by newlines. This becomes a problem if we need to include newlines in the data itself.
Solution: use a TAO operator to separate rows. Let’s keep the newline character as the operator.
CSV separates data cells by commas. This becomes a problem if we need to include commas in the data itself.
Solution: use a TAO operator to separate cells. Let’s keep the comma character as the operator.
To solve the problem of newlines and commas in data, CSV introduces quoted values. This creates the need to escape the quote symbols in the data.
Solution: since we’ve solved the previous problems differently, we can drop the quoting and get rid of the incidental issue.
Whether and when to trim whitespace from CSV data is unclear, which causes inconsistencies in practice. This is somewhat related to the problem of quoting.
Solution: for simplicity, let’s specify that all whitespace is part of data. The problem of quoting is already solved by subtracting the unnecessary feature, so there is no further ambiguity.
The format we ended up with shall be called TOSA for Tabular Operator-Separated Annotations (using nomenclature consistent with TAO).
Compared to CSV it is much simpler and at the same time less brittle.
There is a fixed cost we have to pay for that: the separators now take twice as many symbols. However, we have a much simpler parser and have gotten rid of the major headaches of CSV.
The only special symbol in our format, and so the only character that needs to be escaped in the data, is the operator meta symbol.
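To make the claim about parser simplicity concrete, here is a minimal Python sketch of a TOSA reader. It rests on assumptions that the text does not spell out: a literal backtick in the data is written by doubling it, and any operator other than backtick-comma, backtick-newline, or the doubled backtick is rejected. The name parse_tosa is hypothetical.

```python
from typing import List

OP = "`"  # the operator symbol used in this post's examples


def parse_tosa(text: str) -> List[List[str]]:
    """Split TOSA text into rows of cells; all whitespace is kept as data."""
    rows: List[List[str]] = [[]]
    cell: List[str] = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch != OP:
            # Ordinary data, including plain commas and newlines.
            cell.append(ch)
            i += 1
            continue
        if i + 1 >= len(text):
            raise ValueError("operator symbol at end of input")
        nxt = text[i + 1]
        if nxt == ",":
            # Cell separator: close the current cell.
            rows[-1].append("".join(cell))
            cell = []
        elif nxt == "\n":
            # Row separator: close the current cell and start a new row.
            rows[-1].append("".join(cell))
            cell = []
            rows.append([])
        elif nxt == OP:
            # Assumed escape: a doubled operator symbol is a literal backtick.
            cell.append(OP)
        else:
            raise ValueError("unknown operator at offset %d" % i)
        i += 2
    # Whatever follows the last operator is the final cell.
    rows[-1].append("".join(cell))
    return rows
```

Because the operators are treated as separators rather than terminators, input that ends with a row operator yields a final row holding one empty cell.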
This CSV example from Wikipedia:
Year,Make,Model,Description,Price
1997,Ford,E350,"ac, abs, moon",3000.00
1999,Chevy,"Venture ""Extended Edition""","",4900.00
1999,Chevy,"Venture ""Extended Edition, Very Large""",,5000.00
1996,Jeep,Grand Cherokee,"MUST SELL!
air, moon roof, loaded",4799.00
can be translated to TOSA like so:
Year`,Make`,Model`,Description`,Price`
1997`,Ford`,E350`,ac, abs, moon`,3000.00`
1999`,Chevy`,Venture "Extended Edition"`,`,4900.00`
1999`,Chevy`,Venture "Extended Edition, Very Large"`,`,5000.00`
1996`,Jeep`,Grand Cherokee`,MUST SELL!
air, moon roof, loaded`,4799.00
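The translation above can also be done mechanically. The following is a rough sketch, assuming Python's standard csv module for reading the input; the helper name csv_to_tosa is hypothetical, and doubling a backtick that occurs in the data is an assumed escaping convention, not something this post specifies.

```python
import csv
import io

OP = "`"  # operator symbol used in the examples above


def csv_to_tosa(csv_text: str) -> str:
    """Re-encode CSV text as TOSA: cells joined by OP + ',', rows by OP + newline."""
    # newline="" lets the csv module handle newlines inside quoted fields itself.
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    lines = []
    for row in reader:
        # Assumed escape: double any operator symbol that occurs in the data.
        cells = [cell.replace(OP, OP + OP) for cell in row]
        lines.append((OP + ",").join(cells))
    return (OP + "\n").join(lines)
```

Quotes, commas, and newlines inside cells pass through unchanged; only the separators gain the operator symbol.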
Even simple formats like CSV suffer from issues related to escaping.
Equally expressive yet simpler formats are possible, however, as demonstrated here.
The format created herein can be made fully compatible with TAO by introducing two additional special symbols ([ and ]) for building nested structures.
It is also possible to maintain only a single special symbol and build nested structures with additional operators (`[ and `]) instead.
If we pick a control character in place of ` as the operator symbol, we get as close as possible to eliminating the escaping problem from data space.
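A small Python sketch can illustrate the effect. The choice of the ASCII DLE character (0x10) is purely an example, the encode helper is hypothetical, and escaping by doubling the operator symbol is an assumption.

```python
DLE = "\x10"  # ASCII "data link escape", picked here only for illustration


def encode(rows, op="`"):
    """Join cells with op + ',' and rows with op + newline."""

    def escape(cell):
        # Assumed escape: double the operator symbol wherever it occurs in data.
        return cell.replace(op, op + op)

    return (op + "\n").join((op + ",").join(escape(c) for c in row) for row in rows)


rows = [["code", "`ls -l`"], ["note", "a, b"]]
with_backtick = encode(rows)         # backticks in the data have to be doubled
with_control = encode(rows, op=DLE)  # the same data passes through untouched
```

With the backtick as the operator, the cell containing backticks grows by two characters; with the control character, no cell in this sample needs escaping at all.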
This, however, creates a few trade-offs that reduce the practical utility of the syntax, which is why TAO does not go to such an extreme. It aims to achieve a pragmatic equilibrium of ease and simplicity that will make it useful in as many domains as possible, today.
Design routes, however, remain open for the future.