Multistrings: a simple syntax for heredoc-style strings

Darius J Chuck

2023-05-25

This post was originally published here.

2023-05-26 edit: based on the feedback from this discussion (thanks everyone!), I simplified things by leaving only the more universal multistring variant. I also moved some information to appendices and simplified the formal definition, removing tags and reintroducing them as a possible extension.

In this article I will share a recipe I developed for a syntactic feature that is very useful, notoriously difficult to get right, and still not fully evolved, yet has been gaining more and more adoption in recent years. I hope it will be especially useful to designers and implementers of full-blown programming languages as well as more specialized languages (e.g. configuration or text markup formats).

Feel free to skip the introduction and get straight to the idea or the formal definition.

Introduction

I shall call this feature the “multistring”. It is a generalization of a multitude of related syntactic constructs, known variously as: raw string, raw string literal, multiline string, multiline string literal, template literal, here string, heredoc, here document, here text, hereis, here script, code block, fenced code block, inline code block, inline code, and more.

It is available in some (often flawed) form in:

I’ll forgo trying to classify these in terms of how good the feature is, but it certainly varies. There are languages that have many different types of strings that seem to try to achieve approximations of what I’m about to describe, with none of them fully succeeding.

I believe that the general idea behind all manifestations of this feature is the same: to be able to embed an arbitrary sequence of characters in a syntax, without needing to modify that sequence to fit the limitations of the syntax. The purpose of this may be to create and populate a text file inside a script, to embed one language into another, to embed a fragment of source code of a language in itself as a string (suppressing normal interpretation), etc.

Multistrings aim to unify and simplify all these features into one extensible construct that could become (perhaps already is becoming) standard in programming languages, portable configuration or data-interchange formats, or other text-syntax-driven languages, new or already existing.

I think that if you were forced to choose only one type of string for your language, multistrings would be an excellent choice.

The idea

I propose one simple syntax construct, called the “multistring”, which shall look very similar to Markdown syntax for embedding code. I choose Markdown as the basis, because it offers syntax which is perhaps the simplest and most familiar, and among the least flawed.

Concretely, it follows the Markdown fenced code block syntax, except that it uses ' (apostrophes) instead of linebreaks to separate the multistring content from the delimiters. With that adjustment, almost all (except empty) valid Markdown code blocks would also be valid multistrings, e.g.

```'a multiline
multistring'```

is a valid multistring.

Compare that to a Markdown code block:

```
a multiline
multistring
```

The only difference is the use of apostrophes instead of linebreaks.

Multistrings may also admit blocks delimited with double and single backticks, e.g.:

``'also a multiline
multistring'``

`'another multiline
multistring'`

Formal definition

The syntax for multistrings cannot, in principle, be fully defined as a rule in a standard metasyntax such as EBNF or ABNF. We need a hyperrule instead: a parametrizable kind of rule that can accept arguments to produce concrete rules.

This is a lot less scary than it sounds and actually easy to implement in any programming language. A “hyperrule” is simply a function with parameters.

For this reason, I will first introduce the proposed multistring syntax rule (or hyperrule if you will) in the form of Python-like pseudocode:

# the `multistring` hyperrule accepts one parameter `n` which specifies
# the number of repetitions of the multistring delimiter "`"
def multistring(
  # `n` is an integer in range 1 to infinity
  # in practice an unsigned integer >= 1 (even a uint8) shall be sufficient
  n
):
  return sequence(
    "`".times(n),
    "'",
    # `anychar` is any unicode character
    zeroOrMore(anychar).terminatedBy(
      sequence(
        "'",
        "`".times(n)
      )
    ),
  )

Now here is the above more concisely expressed in customized ABNF:

multistring(n) = {n}"`" "'" *any⇥("'" {n}"`")

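where the customizations, mirroring the pseudocode above, are: {n}"x" stands for x repeated exactly n times; any stands for any Unicode character; and A⇥(B) stands for A terminated by B, i.e. the repetition stops at the first occurrence of B, which is then consumed.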

This hyperrule would “expand to” an infinite number of regular rules, such as:

multistring_1 = "`" "'" *any⇥("'" "`")
multistring_2 = "``" "'" *any⇥("'" "``")
multistring_3 = "```" "'" *any⇥("'" "```")
...
multistring_5 = "`````" "'" *any⇥("'" "`````")
...
multistring_10 = "``````````" "'" *any⇥("'" "``````````")
...

This formal definition should be sufficient to implement the feature.
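As a rough illustration of how little code the definition needs, here is a minimal Python sketch of a parser for it. The function name parse_multistring, its start parameter, and the use of ValueError for malformed input are assumptions made for this example, not part of the definition. The sketch also assumes that the parser reads n off the input by counting the opening backticks.

```python
def parse_multistring(source: str, start: int = 0) -> tuple[str, int]:
    """Parse one multistring beginning at source[start].

    Returns (content, index just past the closing delimiter).
    Raises ValueError if no well-formed multistring starts there.
    """
    # Count the opening backticks: this is the hyperrule parameter n.
    i = start
    while i < len(source) and source[i] == "`":
        i += 1
    n = i - start
    if n == 0:
        raise ValueError("expected at least one backtick")
    if i >= len(source) or source[i] != "'":
        raise ValueError("expected an apostrophe after the opening backticks")
    content_start = i + 1

    # The content runs up to the first apostrophe followed by n backticks.
    closer = "'" + "`" * n
    content_end = source.find(closer, content_start)
    if content_end == -1:
        raise ValueError("unterminated multistring")
    return source[content_start:content_end], content_end + len(closer)
```

Returning the end index lets a surrounding parser continue right after the closing delimiter.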

Possible extensions

Multistrings, as specified above, are very basic and solve only one problem, i.e. verbatim embedding of arbitrary text into another syntax (e.g. source code in some programming language) without needing to modify the embedded text. Thanks to multistrings, we can literally copy-paste anything into a syntax that supports them as a string and not worry about delimiter collision. When in doubt: add more backticks.
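The "add more backticks" rule can be mechanized. Below is a minimal sketch, assuming a hypothetical helper named to_multistring (not part of the proposal), that wraps arbitrary text using the smallest number of backticks that avoids a collision. The only thing to check is that the closing sequence, an apostrophe followed by n backticks, does not occur inside the text itself.

```python
def to_multistring(text: str) -> str:
    """Wrap text in the shortest multistring delimiters that cannot collide."""
    n = 1
    # A closer is "'" followed by n backticks; if the text already contains
    # that sequence, the multistring would end too early, so grow n.
    while "'" + "`" * n in text:
        n += 1
    fence = "`" * n
    return f"{fence}'{text}'{fence}"
```

Checking only the text is enough because the closing delimiter starts with an apostrophe, so a premature closer cannot be formed by the end of the text running into the delimiter.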

However, there are a number of possible extensions that we can add to the feature. One I’ll mention here briefly is what I call tags.

Tags

Tags can be seen as a generalization of Markdown language specifiers.

A tagged multistring may look like:

```tag'multistring'```

Such tags can be used as metadata to describe the content within the multistring. This metadata may direct transformation(s) of the multistring: e.g. to interpret it as a specific language, to adjust or remove indentation, to enable interpretation of \x escape sequences, to enable ${substitutions}, etc.

To allow tags, we’d extend our syntax (in a backwards-compatible way) like so (again in customized ABNF):

multistring(n) = {n}"`" tag "'" *any⇥("'" {n}"`")

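The only change is the addition of tag between the opening backticks and the apostrophe. For the extension to remain backwards-compatible, the tag rule must also match the empty string, so that untagged multistrings stay valid.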

I’ll leave the precise definition of the tag rule for another time. Certainly it should not contain backticks or apostrophes and generally we’d want tags to be kept on the same line as the multistring delimiter. For an actual implementation I’d recommend starting with a conservative syntax for tags, with limited special symbols – we may want to use one or more of those to create a generalized version of multistrings (but that’s perhaps for another article).
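To make the tagged variant concrete, here is a minimal Python sketch built on a regular expression, where a backreference plays the role of the hyperrule parameter n. The tag syntax used here (any run of characters other than backtick, apostrophe, or linebreak, possibly empty) is an assumption chosen only to satisfy the constraints above, and the name parse_tagged is hypothetical.

```python
import re

# Assumed tag syntax: zero or more characters, none of which is a backtick,
# an apostrophe, or a linebreak. This is a placeholder, not a final tag rule.
TAGGED = re.compile(
    r"(?P<fence>`+)(?P<tag>[^`'\r\n]*)'(?P<content>.*?)'(?P=fence)",
    re.DOTALL,  # let the content span multiple lines
)


def parse_tagged(source: str) -> tuple[str, str]:
    """Return (tag, content) for a multistring at the start of source."""
    match = TAGGED.match(source)
    if match is None:
        raise ValueError("no well-formed multistring at the start of the input")
    return match.group("tag"), match.group("content")
```

Because the tag may be empty, untagged multistrings still parse, with an empty string reported as the tag.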

I give a few examples of how tags could be used in Appendix III.

What’s next

Having defined the multistring rule, it may be time to use it as part of a new syntax! I encourage language designers and implementers to try it out. If you do, let me know how it went! Meanwhile, in a future article I intend to present a little format for configuration which makes extensive use of multistrings.

The described multistring syntax can be generalized – perhaps I’ll discuss the details in yet another article.

Should I write these, they shall be linked here. If you want to be notified of that, you can follow @jevko@layer8.space if you are on Mastodon or you can subscribe directly to that via RSS.

Conclusion

I hope multistrings will prove useful and we’ll be seeing more of them in the wild.

This is it for now.

Thank you for reading and until next time.

Appendix I: why choose apostrophe as the separator?

Other more or less viable separators for multistrings include:

Another nice advantage of ' is that if we find ourselves needing to convert a regular '-delimited string into a multistring, there is no need to delete or replace the delimiters. We only need to add a layer of `-delimiters around the regular string. This is particularly easy in a modern code editor.

For example if we have an '-delimited string such as:

'a string with an apostrophe: '

and we find that the next character we insert is ', making the string invalid:

'a string with an apostrophe: ''

we can quickly fix this by surrounding the whole string with `:

`'a string with an apostrophe: ''`

That said, | is a nice separator too and you can choose whatever you like. You could also replace ` with another delimiter that suits your language better.

Appendix II: edge cases

Implementers beware.

Empty multistrings are not like empty Markdown code blocks

In Markdown, an empty code block is denoted as:

```
```

Note the single linebreak between the delimiters.

However, an equivalent empty multistring is:

```''```

rather than:

```'```

The stated formal definition does not allow “fusing together” the opening and closing delimiters like this, which is what effectively happens in Markdown.

Instead, an empty multistring is always an opening delimiter immediately next to a closing delimiter. This is the same principle as in the familiar "" or '' empty strings.

Thanks to that, the following multistring:

```'```'```

is valid and contains ```.

Whereas in Markdown an analogous syntax:

```
```
```

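means an empty code block followed by an opening ``` that is never closed (CommonMark extends such an unclosed block to the end of the document, which is not the intended result). To make that work in Markdown, we would need to increase the number of backticks in the fences surrounding the middle ```.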

This edge case illustrates that the multistring syntax is more regular than Markdown, thanks to the simple formal definition.

By the way, the following is a minimal edge case of a multistring that can be used to test a parser:

`'`'`

It should parse as a multistring which contains `.

For completeness, this is the minimal empty multistring:

`''`

Appendix III: examples for how tags could be used

For example, a dedent tag could signify that the multistring should be postprocessed by removing the first linebreak, the last line (the whitespace before the closing delimiter), and, from every remaining line, the indentation it shares with that last line. This achieves the behavior of raw string literals from C# 11 (thanks @useerup on reddit for mentioning those). Using dedent, this multistring:

    ```dedent'
    {
      "key": "value"
    }
    '```

would be equivalent to this one without dedent:

```'{
  "key": "value"
}'```
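One possible reading of the dedent postprocessing, sketched in Python. The function name and the decision to reject malformed input with ValueError are assumptions; the function receives the raw content found between the delimiters.

```python
def dedent_multistring(content: str) -> str:
    """Apply the dedent postprocessing to raw multistring content."""
    lines = content.split("\n")
    # The content must start with a linebreak and end with a whitespace-only
    # line: the indentation in front of the closing delimiter.
    if len(lines) < 2 or lines[0] != "" or lines[-1].strip(" \t") != "":
        raise ValueError("content does not have the shape dedent expects")
    indent = lines[-1]

    result = []
    for line in lines[1:-1]:  # drop the first linebreak and the last line
        if line.startswith(indent):
            result.append(line[len(indent):])
        elif line.strip(" \t") == "":
            result.append("")  # blank lines may be shorter than the indent
        else:
            raise ValueError("line is indented less than the closing delimiter")
    return "\n".join(result)
```

Applied to the content of the dedent example above, it returns the three lines of the second multistring.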

An esc tag could signify that C-style escapes should be recognized and replaced within the multistring. E.g.:

```esc'\n\n\n'```

would be equivalent to:

```'


'```

i.e. a string which contains 3 linebreaks.
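A minimal sketch of what the esc postprocessing might look like in Python. The set of recognized escapes is an assumption (a small C-style subset), as are the names ESCAPES and unescape. Scanning with a single regular expression, rather than chaining replacements, ensures that an escaped backslash followed by n is not mistaken for a linebreak escape.

```python
import re

# Assumed subset of C-style escapes; a real implementation would pick its own.
ESCAPES = {"n": "\n", "t": "\t", "r": "\r", "0": "\0", "\\": "\\", "'": "'", '"': '"'}


def unescape(content: str) -> str:
    """Replace backslash escapes in content with the characters they denote."""

    def replace(match: re.Match) -> str:
        char = match.group(1)
        if char not in ESCAPES:
            raise ValueError(f"unknown escape sequence \\{char}")
        return ESCAPES[char]

    # Each match consumes the backslash and the character after it, so the
    # second backslash of an escaped pair is never re-read as an escape start.
    # A lone backslash at the very end has nothing after it and is left as is.
    return re.sub(r"\\(.)", replace, content, flags=re.DOTALL)
```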

A $ tag could turn the multistring into a template literal, in which variable names or expressions are replaced by their values, e.g.:

`$'Hello, ${username}!'`

could be equivalent to:

`'Hello, John!'`

assuming John as the value of the username variable.

Multiple tags could be allowed for one multistring, perhaps by comma-separating them. E.g.:

`$,esc'Hello,\n${username}!'`

Here we are using both the $ and esc tags to achieve something like:

`'Hello,
John!'`

In this way we could make up all kinds of useful tags and rules for them.
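Comma-separated tags could be wired up as a small pipeline of transformations. The Python sketch below is only one possible design: the names TRANSFORMS and apply_tags, the left-to-right application order, and the restriction of ${...} to plain variable names are all assumptions, and esc is reduced to two escapes to keep the example short.

```python
import re


def substitute(content: str, variables: dict) -> str:
    # `$` tag: replace ${name} with the value of the variable `name`.
    # An unknown name raises KeyError.
    return re.sub(r"\$\{(\w+)\}", lambda m: str(variables[m.group(1)]), content)


def unescape(content: str, variables: dict) -> str:
    # `esc` tag, reduced here to the linebreak and backslash escapes.
    return re.sub(r"\\([n\\])", lambda m: "\n" if m.group(1) == "n" else "\\", content)


# Registry mapping each tag name to its transformation.
TRANSFORMS = {"$": substitute, "esc": unescape}


def apply_tags(tag: str, content: str, variables: dict) -> str:
    """Apply each comma-separated tag to content, from left to right."""
    for name in filter(None, tag.split(",")):  # an empty tag applies nothing
        content = TRANSFORMS[name](content, variables)
    return content
```

Order matters with this design: with $,esc a substituted value that itself contains a backslash escape would be unescaped too, whereas esc,$ would leave it untouched. A real specification would have to pin that down.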

Comments

This post was discussed on reddit.com/r/ProgrammingLanguages.

Comments welcome on Mastodon.
