blog

This is the TAO blog. See also djedr.github.io/writing and github.com/jevko/writing for more writing.

Meanwhile, here is the latest post.

Jevko for Large Language Models

Darius J Chuck

2024-03-16

I’ve been learning about LLMs a bit. After I watched a recent video about LLM tokenization by Andrej Karpathy, a thought occurred to me.

Namely: I think it might be promising to investigate Jevko-based formats as an efficient (cheap in terms of tokens) alternative to JSON/YAML/etc. for interacting with LLMs in a structured way.

How so? Let’s take a look at the relevant fragment of the video:

Here is the transcript:

[D]ifferent kinds of formats and different representations and different languages and so on might be more or less efficient with GPT tokenizers or any tokenizers for any other LLM for that matter.

So for example JSON is actually really dense in tokens and YAML is a lot more efficient in tokens.

So for example these are the same in JSON and in YAML.

The JSON is 116 and the YAML is 99 so quite a bit of an improvement.

And so in the token economy where we are paying per token in many ways and you are paying in the context length and you’re paying in dollar amount for the cost of processing all this kind of structured data when you have to.

So prefer to use YAMLs over JSONs and in general kind of like the tokenization density is something that you have to sort of care about and worry about at all times and try to find efficient encoding schemes and spend a lot of time in Tiktokenizer and measure the different token efficiencies of different formats and settings and so on…

Emphasis mine.

Now let’s create an equivalent piece of data in a Jevko-based format and see how that compares.

Here is one possibility, using formatting similar to the JSON example:

product[
  type[T-Shirt]
  price[20.00]
  sizes[[S][M][L]]
  reviews[
    [ username[user1] rating[4] created_at[2023-04-19T12:30:00Z] ]
    [ username[user2] rating[5] created_at[2023-05-02T15:00:00Z] ]
  ]
]

Down to 85 tokens. I think that qualifies as “quite a bit of an improvement”.

We can actually remove all the whitespace though:

product[type[T-Shirt]price[20.00]sizes[[S][M][L]]reviews[[username[user1]rating[4]created_at[2023-04-19T12:30:00Z]][username[user2]rating[5]created_at[2023-05-02T15:00:00Z]]]]

Down to 72. Not bad, eh?
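To make the mapping concrete, here is a minimal Python sketch that produces the compact form from nested data. It is an illustration only, not an official Jevko library: the names `to_compact`, `data` and `PRETTY` are made up for this example, the price is kept as a string so that `20.00` survives as written, and it assumes that names and values contain no brackets or backticks, so nothing is escaped. It also checks that stripping all whitespace from the formatted version gives the same text, which only works here because none of the values contain whitespace.

```python
# Illustrative sketch only: not an official Jevko library.
# Assumes names and values contain no brackets or backticks (no escaping is done).

def to_compact(value):
    """Serialize nested dicts, lists and scalars into the compact bracket notation."""
    if isinstance(value, dict):
        # Each key becomes a name followed by its bracketed value.
        return "".join(f"{key}[{to_compact(item)}]" for key, item in value.items())
    if isinstance(value, (list, tuple)):
        # List items have no name, just brackets.
        return "".join(f"[{to_compact(item)}]" for item in value)
    return str(value)

# Hypothetical in-memory version of the example data.
data = {
    "product": {
        "type": "T-Shirt",
        "price": "20.00",  # string, so the trailing zeros are preserved
        "sizes": ["S", "M", "L"],
        "reviews": [
            {"username": "user1", "rating": 4, "created_at": "2023-04-19T12:30:00Z"},
            {"username": "user2", "rating": 5, "created_at": "2023-05-02T15:00:00Z"},
        ],
    }
}

# The formatted version from this post.
PRETTY = """product[
  type[T-Shirt]
  price[20.00]
  sizes[[S][M][L]]
  reviews[
    [ username[user1] rating[4] created_at[2023-04-19T12:30:00Z] ]
    [ username[user2] rating[5] created_at[2023-05-02T15:00:00Z] ]
  ]
]"""

compact = to_compact(data)
# Dropping every whitespace character is safe only because no value contains any.
stripped = "".join(PRETTY.split())

print(compact)
print(compact == stripped)
```

A real converter would also have to decide how to handle escaping, empty values and whitespace inside values, all of which this sketch ignores.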

Let’s see how all these compare (token counts, with percentages relative to JSON):

JSON: 116 (100 %)
YAML: 99 (85 %)

Jevko #1: 85 (73 %)
Jevko #2: 72 (62 %)
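Each percentage is just the token count divided by the JSON count, rounded to a whole number. A quick Python sketch of that arithmetic, using only the counts from this post (the variable names are arbitrary):

```python
# Token counts as reported in this post.
counts = {"JSON": 116, "YAML": 99, "Jevko #1": 85, "Jevko #2": 72}

# JSON is the baseline (100 %).
baseline = counts["JSON"]

# Size of each format relative to JSON, rounded to a whole percent.
relative = {name: round(100 * n / baseline) for name, n in counts.items()}

for name, n in counts.items():
    print(f"{name}: {n} ({relative[name]} %)")
```

The counts themselves depend on the tokenizer, so a different model may well give different ratios.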

See also this old comparison of a Jevko-based format vs XML/JSON/EDN/S-expr.

Now I don’t know enough about LLMs yet to tell if this is a good lead. So I’m putting this idea out there in the hope that somebody who knows a bit more can tell me.

If you are that somebody, write me an email at tao at xtao.org. If you know such a somebody, I’d appreciate it if you sent them a link to this article. Thanks!



Archive

Jevko for Large Language Models 2024-03-16

Binary Lambda Calculus implemented as a shell on top of λDNA 2024-01-18

JsonHilo.js Githelp trial 2023-12-08

λDNA programming language 2023-12-04

Meditating on the Wizard Book and language design 2023-12-01

Introducing nuklĕus, a minimal IDE for LAST 2023-11-28

Revelation: Lambda Calculus Reduced To Four Primitive Operations 2023-07-23

Multistrings: a simple syntax for heredoc-style strings 2023-05-25

Encodings for numbers in lambda calculus 2022-09-29

Introduction to the LAST programming language 2022-09-01

Introducing Jevko: a minimal general-purpose syntax 2022-02-22

Writing about Jevko on GitHub 2022-01-21

Semantics of JSON 2021-12-18

Alternative description of Data Jevko 2021-12-10 12:45

Alternative grammar for Jevko 2021-12-10 12:00

Formal grammar for Data Jevko 2021-12-10

Jevko: a zero waste syntax 2021-12-09

Minimal syntax for phylogenetic trees 2021-12-06

Minimal syntax for rose trees 2021-11-27

JsonHilo.js – ultra-fast lossless JSON parse event streaming 2021-07-29

One year of TAO 2021-07-21

One-escape TAO 2021-07-16

TAO in one line 2021-07-09

The best format for multiple-word identifiers 2021-07-07

Square brackets 2021-07-05

Fixing S-expressions: overnesting on the left 2021-06-29

xtao.org 2021-06-10

Operator, please dial the number 2021-06-04

Fixing CSV 2021-06-04

Streaming spreadsheets 2021-06-03

Nested query params 2021-06-02

No escaping 2021-06-01

TAO blog and public newsletter 2021-04-08

Why NOT to add the pipeline operator to JavaScript (or TypeScript, etc.)? And what to maybe add instead. 2018-01-25