Fixing CSV

Darius J Chuck

2021-06-04

CSV is a simple format that works fairly well for representing tabular data.

However, it does have some problems.

Let’s see how we can solve them by creating an even simpler format out of a subset of TAO.

CSV problems and solutions

The trouble, as always, appears in the edge cases.

Newlines in data

CSV separates rows of data by newlines. This becomes a problem if we need to include newlines in the data itself.

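Solution: use a TAO operator to separate rows. An operator is the meta symbol ` followed by a single character; let’s keep the newline as that character, so rows are separated by ` followed by a newline.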

Commas in data

CSV separates data cells by commas. This becomes a problem if we need to include commas in the data itself.

Solution: use a TAO operator to separate cells. Let’s keep the comma as the operator’s character, so cells are separated by `, rather than by a bare comma.

Quoted data

To solve the problem of newlines and commas in data, CSV introduces quoted values. This creates the need to escape the quote symbols in the data.

Solution: since we’ve solved the previous problems differently, we can drop the quoting and get rid of the incidental issue.

Trimming whitespace

Whether and when to trim whitespace from CSV data is not clearly specified, which causes inconsistencies in practice. This is somewhat related to the problem of quoting.

Solution: for simplicity, let’s specify that all whitespace is part of data. The problem of quoting is already solved by removing that unnecessary feature, so there is no further ambiguity.

TOSA

The format we ended up with shall be called TOSA for Tabular Operator-Separated Annotations (using nomenclature consistent with TAO).

Compared to CSV it is much simpler and at the same time less brittle.

There is a fixed cost we have to pay for that: each separator now takes two symbols instead of one. However, we have a much simpler parser and have got rid of the major headaches of CSV.

The only special symbol in our format, and so the only character that needs to be escaped in the data, is the operator meta symbol (`).
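As a minimal sketch of the writing side, the Python below joins cells with the comma operator and rows with the newline operator, escaping nothing but the operator symbol. The names `escape_cell` and `to_tosa` are hypothetical, and writing a literal backtick as a doubled backtick is an assumption; the text above does not fix the escape syntax.

```python
OP = "`"  # operator meta symbol, as used in the article's examples


def escape_cell(text: str) -> str:
    # Assumption: a literal operator symbol in data is written doubled.
    return text.replace(OP, OP + OP)


def to_tosa(rows) -> str:
    # Cells are separated by OP + "," and rows by OP + newline.
    # Commas, quotes, newlines and whitespace in cells pass through untouched.
    return (OP + "\n").join(
        (OP + ",").join(escape_cell(cell) for cell in row) for row in rows
    )


rows = [
    ["Year", "Make", "Model", "Description", "Price"],
    ["1997", "Ford", "E350", "ac, abs, moon", "3000.00"],
    ["1996", "Jeep", "Grand Cherokee", "MUST SELL!\nair, moon roof, loaded", "4799.00"],
]
print(to_tosa(rows))
```

The writer has a single escaping rule and no quoting decisions to make, which is where the simplification over CSV comes from.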

This CSV example from Wikipedia:

Year,Make,Model,Description,Price
1997,Ford,E350,"ac, abs, moon",3000.00
1999,Chevy,"Venture ""Extended Edition""","",4900.00
1999,Chevy,"Venture ""Extended Edition, Very Large""",,5000.00
1996,Jeep,Grand Cherokee,"MUST SELL!
air, moon roof, loaded",4799.00

can be translated to TOSA like so:

Year`,Make`,Model`,Description`,Price`
1997`,Ford`,E350`,ac, abs, moon`,3000.00`
1999`,Chevy`,Venture "Extended Edition"`,`,4900.00`
1999`,Chevy`,Venture "Extended Edition, Very Large"`,`,5000.00`
1996`,Jeep`,Grand Cherokee`,MUST SELL!
air, moon roof, loaded`,4799.00

Summary

Even simple formats like CSV suffer from issues related to escaping.

Equally expressive yet simpler formats are possible, however, as demonstrated here.

The format created herein can be made fully compatible with TAO by introducing two additional special symbols ([ and ]) for building nested structures.

It is also possible to maintain only a single special symbol and build nested structures with additional operators (`[ and `]) instead.

If we pick a control character in place of ` as the operator symbol, we get as close as possible to eliminating the escaping problem from data space.

This, however, creates a few trade-offs that reduce the practical utility of the syntax, which is why TAO does not go to such an extreme. It aims to achieve a pragmatic equilibrium of ease and simplicity that will make it useful in as many domains as possible, today.

Design routes, however, remain open for the future.

Demo

Below I include a TOSA to CSV converter that I hacked together this afternoon (JavaScript must be enabled):
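For reading along without the interactive page, here is a minimal Python sketch of the same conversion. It is not the code behind the demo: `parse_tosa` and `tosa_to_csv` are hypothetical names, a doubled backtick is assumed to stand for a literal backtick, and any other operator is treated as an error. Quoting on the CSV side is left to Python's standard `csv` module.

```python
import csv
import io

OP = "`"  # operator meta symbol, as used in the article's examples


def parse_tosa(text):
    """Split TOSA text into a list of rows, each a list of cell strings."""
    rows, row, cell = [], [], []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch != OP:
            cell.append(ch)  # ordinary data, including commas and newlines
            i += 1
            continue
        if i + 1 >= len(text):
            raise ValueError("dangling operator symbol at end of input")
        sym = text[i + 1]
        if sym == ",":  # cell separator
            row.append("".join(cell))
            cell = []
        elif sym == "\n":  # row separator
            row.append("".join(cell))
            cell = []
            rows.append(row)
            row = []
        elif sym == OP:  # assumed escape for a literal operator symbol
            cell.append(OP)
        else:
            raise ValueError(f"unknown operator {OP}{sym!r} at offset {i}")
        i += 2
    # Whatever follows the last operator is the final cell of the final row.
    row.append("".join(cell))
    rows.append(row)
    return rows


def tosa_to_csv(text):
    """Re-encode TOSA text as CSV, letting the csv module handle quoting."""
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parse_tosa(text))
    return out.getvalue()


TOSA_EXAMPLE = """Year`,Make`,Model`,Description`,Price`
1997`,Ford`,E350`,ac, abs, moon`,3000.00`
1999`,Chevy`,Venture "Extended Edition"`,`,4900.00`
1999`,Chevy`,Venture "Extended Edition, Very Large"`,`,5000.00`
1996`,Jeep`,Grand Cherokee`,MUST SELL!
air, moon roof, loaded`,4799.00"""

print(tosa_to_csv(TOSA_EXAMPLE))
```

Because all whitespace is data, a newline after the last cell would become part of that cell; the sketch makes no attempt to trim it.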
