Parascriptural Tab Separated Value Format Specification (v2)

This post documents version 2 of the Parascriptural Tab Separated Value Format Specification. See below for history.

Format Overview

A Tab Separated Value (TSV) file is like a Comma Separated Value (CSV) file except that the tab character divides the values instead of a comma. This makes it easier to include prose text in the files because many languages require the use of commas, single quotes, and double quotes in their sentences and paragraphs. Note, however, that it still takes special effort to put some special characters, such as the newLine character and the tab character itself, into fields/cells within the TSV rows. TSV files can be envisaged as tables, whereby each line is a row in the table and each tab-separated field is a cell. They can usually be opened by spreadsheet software, including the free LibreOffice-Calc as well as proprietary software like MS-Excel.

Many people are familiar with 66 (or more) Bible books and there is a 3-character standard book code used for representing them. unfoldingWord® Open Bible Stories (OBS) don’t have chapters and verses like Bible books, but do have story numbers and frame numbers. Although OBS is conceptually very different from a Bible book, there are several advantages in treating it similarly to Bible books, including the possibility of being able to work with OBS in the same tools. Hence, OBS is treated similarly to GEN or 3JN in some instances below and “book” is used in quotes as a reminder that the term includes OBS in this context. Also BBB is used below to represent a 3-character UPPERCASE “book” code, and bbb for the lowercase form.

Many unfoldingWord resources use Markdown formatting. This may be abbreviated to md below. Repo below is short for Repository.

Repo Structure

History

We currently (2020) have TN (9-column TSV tables), TQ (with one folder per book containing a total of 17,000+ markdown files), TW markdown articles in three sub-folders (kt, names, and other all under bible), TA markdown articles in four folders (intro, process, translate, and checking), as well as OBS TN, OBS TQ, OBS SN, and OBS SQ, all in separate repositories. Links from words or phrases to TW articles are “hard-coded” in UHB/UGNT. OBS TW links are in the old v2 catalog here.

New TSV Repo Structures

With the new TSV files (described below), there will be no change to repo structures and OBS resources will be in separate repos for ease of publishing.

[Having general en_translation-annotations and en_study-annotations repos was suggested (when a general annotation TSV format was also in scope), but was rejected in late-2020 by the content team who preferred separate repos for each of the various different resources.]

[With the new TSV files (described below), we considered including OBS resources in the same repos as Bible book resources – appearing like an OBS “book name” alongside GEN, EXO, etc., but this was rejected in a meeting on 10 March 2021.]

Various TSV File Formats

This section describes the format of the .tsv files. The files are structured as one file per book of the Bible or OBS, and encoded in TSV format. They are named xx(x)_BBB.tsv (where xx(x) is tn, tq, sn, sq, or twl), e.g., tn_GEN.tsv, tq_OBS.tsv, or twl_3JN.tsv. (NOTE: A former version of this spec had BBB_xx(x).tsv until 5 March 2021.)

[Note that by losing the USFM number prefix, DCS will now list Bible books in alphabetical order, but specialist software will still fetch the correct Bible book order from the manifest. Note also that we have deliberately switched to UPPERCASE “book” codes in filenames to become more compatible with this. However, RC links still specify using lowercase.]

Loading software should not crash if it encounters a 3-character “book” abbreviation (BBB) that it is not able to process. Any code listed in the identification column here should be accepted as well as OBS. Unexpected “book” abbreviations can be logged, but after that, if they can’t be processed, they should just be ignored.
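As an illustration of the naming and fault-tolerance rules above, here is a minimal Python sketch. The function name parse_tsv_filename and the KNOWN_BOOKS set are hypothetical; in particular, KNOWN_BOOKS is only a small sample, and real software would use the full list of accepted “book” codes plus OBS.

```python
import logging
import re

# Resource prefixes and the xx(x)_BBB.tsv pattern come from the text above.
FILENAME_PATTERN = re.compile(r"(tn|tq|sn|sq|twl)_([A-Z0-9]{3})\.tsv")

# Hypothetical sample only: real software would hold the full list of
# accepted "book" codes, plus OBS.
KNOWN_BOOKS = {"GEN", "EXO", "PSA", "TIT", "2JN", "3JN", "OBS"}


def parse_tsv_filename(filename, known_books=KNOWN_BOOKS):
    """Return (resource, book) for a usable file name, else None."""
    match = FILENAME_PATTERN.fullmatch(filename)
    if match is None:
        return None  # not named like xx(x)_BBB.tsv
    resource, book = match.groups()
    if book not in known_books:
        # Log the unexpected code, then ignore the file instead of crashing.
        logging.warning("Ignoring unexpected book code %r in %s", book, filename)
        return None
    return resource, book
```

Returning None rather than raising an exception is one way to let a loader skip an unfamiliar file and carry on with the rest of the repo.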

\n (2-characters) is used for line breaks within fields as per the TSV convention listed here (we formerly used HTML <br>) – these \n’s should be automatically converted to newLine in the low-level TSV read functions immediately after each row has been divided into fields, and vice versa when writing rows. In other words, a \n for example should never reach a markdown processor – it should already be a newLine character by then. It is recommended that all four escaped characters are implemented by this low-level software: \n, \t, \r, and \\.
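The low-level conversion described above could look something like the following Python sketch. The function names are made up for illustration. It assumes that an unrecognised backslash sequence is left unchanged, which this specification does not actually state.

```python
# Map from the character after a backslash to the real character.
ESCAPES = {"n": "\n", "t": "\t", "r": "\r", "\\": "\\"}


def unescape_field(text):
    """Convert \\n, \\t, \\r and \\\\ to real characters in one left-to-right pass."""
    out = []
    i = 0
    while i < len(text):
        if text[i] == "\\" and i + 1 < len(text) and text[i + 1] in ESCAPES:
            out.append(ESCAPES[text[i + 1]])
            i += 2
        else:
            # Assumption: unknown sequences and a lone trailing backslash pass through.
            out.append(text[i])
            i += 1
    return "".join(out)


def escape_field(text):
    """Inverse of unescape_field; the backslash must be escaped first."""
    return (text.replace("\\", "\\\\")
                .replace("\n", "\\n")
                .replace("\t", "\\t")
                .replace("\r", "\\r"))


def split_row(line):
    """Divide a row into fields first, then unescape each field."""
    return [unescape_field(field) for field in line.rstrip("\r\n").split("\t")]


def join_row(fields):
    """Escape each field, then join with real tab characters."""
    return "\t".join(escape_field(field) for field in fields)
```

A single pass is used for reading because chained string replacements would misread an escaped backslash followed by the letter n as a line break.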

The columns are dependent on the particular resource, so loading software could read the first row and count the number of tabs to determine the number of columns (equals number of tab characters in first row, plus one). Then it could check that all NEEDED column headings are there. [Of course, the number of columns is generally known in advance for the resource in question so an alternative is for loading software to check for the expected number of columns, but this is less resilient if a new column later needed to be added to an existing resource.] The headings in the first row (which should NEVER be translated into GLs) indicate the type of content of each column/field.

  • The first line contains the names of the columns as specified below.
  • The word optional below means that the column can be empty, i.e., no text at all in that column from line 2 in the file onward. It does not mean that the column name in line 1, or the column itself (tab character) can be omitted.
  • The final newline character at the end of the file is expected to be appended by the low-level TSV writing software and will be ignored by the low-level TSV loading software – at the end of the file it does not indicate a blank row.
  • Software using these TSV resources should not crash if an extra column is added later – it should handle it gracefully as long as all the columns it needs are still there.
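The heading-based loading approach described above can be sketched in Python as follows. The names load_rows and TWL_COLUMNS are hypothetical, and padding short rows with empty fields is an assumption of this sketch rather than a rule of the specification. Unescaping of \n and similar sequences is left out here to keep the example short.

```python
# Column names needed for a TWL file, as listed in this specification.
TWL_COLUMNS = ["Reference", "ID", "Tags", "OrigWords", "Occurrence", "TWLink"]


def load_rows(text, needed_columns):
    """Return one dict per data row, keyed by the needed column headings."""
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # the final newline does not indicate a blank row
    if not lines:
        raise ValueError("empty TSV file: no heading row")
    headings = lines[0].split("\t")  # column count = tab count + 1
    missing = [name for name in needed_columns if name not in headings]
    if missing:
        raise ValueError("missing column(s): " + ", ".join(missing))
    # Look columns up by name, so an extra column added later is simply ignored.
    positions = {name: headings.index(name) for name in needed_columns}
    rows = []
    for line in lines[1:]:
        fields = line.split("\t")
        # Assumption: a short row is padded with empty fields rather than rejected.
        fields += [""] * (len(headings) - len(fields))
        rows.append({name: fields[pos] for name, pos in positions.items()})
    return rows
```

Looking columns up by heading rather than by fixed position is what allows a new column to be added to an existing resource without breaking older software.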

All of the new TSV resources include Reference, (four-character) ID, and Tags (usually optional) in the first three columns. There are another eight fields also used in different files, making a total of eleven fields in current use:

  • Reference (string)–This must be one of the following, in order of preference:
    • Generally two numeric sub-fields joined with a colon, e.g., 2:3 for chapter 2, verse 3 or 2:4-5 for a verse bridge. In OBS, 2:3 would be story 2, frame 3 (and the generic title “Reference” covers that better than using “Chapter” & “Verse”). Some limited text strings are allowed such as front:intro. Note that “1:” must be specified in the file even for single-chapter books (even if display software chooses not to display the superfluous chapter 1 number).
    • Note that even if software cannot handle verse ranges like “2:4-5” (with a hyphen, even if display software later converts it to an en-dash), etc., it should not crash when it encounters them (and should probably fall back to 2:4, stopping at the first non-digit character after the first verse number). It should also not crash when handling fields like “2:3a”, “2:3b-4a”, “2:7,12” (without a space after the comma), “7:11-8:2”, or “7-8” (a chapter range, also separated by a hyphen) even though we have not yet encountered reasons to process them. Note that a chapter range can be distinguished from a verse range, because it either has no colon or else has two colons, but never exactly one colon like a verse range.
    • Note that the above also means that software should not crash if there’s no colon in the reference field, even if it can’t yet handle a chapter range. Fault-tolerant software may also choose to handle en-dashes just-in-case, even though hyphen is actually specified as the correct computer-readable range indicator character.
    • This field should not be localised in the TSV files even if other parts of the world use different conventions, i.e., the file format must always follow these conventions. How display software chooses to present this information is, of course, quite a different matter.
  • ID - Four character alphanumeric string (e.g. swi9) that is unique within the file
    • The ID must start with a lowercase letter in the range a…z and each of the next three characters must be a lowercase letter or a digit, i.e., a…z or 0…9.
    • The Universal ID (UID) of a note is the combination of the protocol, the book code (from the file name and/or enclosing folder name), the Reference start (i.e., without the range), and the ID fields. For example, bcv://tit/1/3/swi9 – where did this come from???. (Note that the USFM book identifier must be converted to lowercase as specified here.)
    • An RC link can resolve to a specific note like this: rc://en/tn/help/tit/01/01/swi9.
    • Note that originally an ID only had to be unique within the Reference (i.e., usually within the verse), but the spec was tightened in early-2021 to make it unique within the file (more strict) for search convenience, etc. This means that when a new row is added, the user or software must check that the newly invented ID does not occur anywhere else in that entire column (not just in the current verse as per the prior definition).
  • Tags – (can be left empty) A list of one or more tags (strings, the allowed set of which depends on and varies with the type of resource), separated by a semicolon and a space if there’s more than one, e.g., keyterm; name for a translation note about Jesus.
  • SupportReference – only used in TN and SN – see below
  • OrigWords (string) – only used in TWL
    • Note that the original language is Hebrew/Aramaic for OT, Greek for NT, and English for OBS
    • Should not begin or end with sentence level punctuation unless it’s a matched pair from the original (e.g., speech marks), e.g., went home. is invalid, but He said, “Yes!” is valid
    • If there’s sentence punctuation within the snippet, it should be included, e.g., Cyrus, the king of Persia
    • Usually just one word, but can be multiple original-language words separated by space (or Hebrew maqaf in some cases) if they’re contiguous
    • A three-character sequence & (ampersand with surrounding spaces) indicates that the parts before and after are discontiguous, i.e., not directly connected in the original sentence
    • Note that the relation fields in the file manifest.yaml indicate the specific versions of those resources that were quoted
  • Quote (string) – used in TN, SN, TQ, and SQ
    • Has the same features as OrigWords above, except that it’s often a longer snippet and so more frequently discontiguous
  • Occurrence (integer string) – used in all resources to qualify the OrigWords or Quote fields
    • Note that word matches are defined to be WHOLE WORD matches – simple string searches will fail as they might find (English example) “one” in “undone” – the Occurrence indicates the occurrence of the entire word including accents and cantillation marks and any punctuation that’s included
    • Note that the pieces on either side of Hebrew maqaf “־” can be specified as stand-alone words, i.e., maqaf should also be considered a word divider
    • -1 : entry applies to every occurrence of OrigWords/Quote in the verse (cannot be used with ampersand discontiguity divider in OrigWords/Quote)
    • 0 : (not used in TWL) entry does not occur in Quote (for example, “Connecting Statement:”)
    • 1 : entry applies to first occurrence of OrigWords/Quote only
    • 2 : entry applies to second occurrence of OrigWords/Quote only
    • etc.
    • It should be noted that we only use ONE Occurrence number even though an OrigWords/Quote field may have multiple parts. So the Occurrence number only refers to the first part in that case. The next part is assumed to be the first occurrence of that part AFTER the first part. (This may prove inadequate to specify the specific second part in some cases???)
  • TWLink – only used in TWL – see below
  • Note (markdown) – only used in TN and SN – see below
  • Question (markdown single line) – only used in TQ and SQ – see below
  • Response (markdown single line) – only used in TQ and SQ – see below
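For the Reference field described above, a fault-tolerant reader might be sketched in Python as follows. The function names are hypothetical, and the choice to return the start of the reference as a pair of strings is only one possible design.

```python
import re

# A sub-field is either digits (2, 11) or a limited text string (front, intro).
SUBFIELD_START = re.compile(r"\d+|[a-z]+")


def reference_start(reference):
    """Return (chapter, verse) for the start of a Reference; verse may be None."""
    ref = reference.strip().replace("\u2013", "-")  # tolerate en-dash just in case
    if ":" not in ref:
        # No colon: possibly a chapter range such as "7-8". Do not crash.
        match = SUBFIELD_START.match(ref)
        return (match.group(0) if match else ref, None)
    chapter, rest = ref.split(":", 1)
    # Stop at the first character after the first verse number, so that
    # "2:4-5", "2:3a" and "2:7,12" fall back to verses 4, 3 and 7.
    match = SUBFIELD_START.match(rest)
    return (chapter, match.group(0) if match else None)


def is_chapter_range(reference):
    """A range with no colon or two colons, never exactly one colon."""
    ref = reference.replace("\u2013", "-")
    return "-" in ref and ref.count(":") != 1
```

Software that can handle full ranges would go further than this, but even this fallback satisfies the requirement not to crash on fields that are not yet processed.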

There are currently THREE basic TSV formats defined below:

  • TWLinks with SIX columns: Reference, ID, Tags, OrigWords, Occurrence, TWLink
  • Question/Response type resources – TQ and SQ with SEVEN columns: Reference, ID, Tags, Quote, Occurrence, Question, Response
  • Notes type resources – TN and SN with SEVEN columns: Reference, ID, Tags, SupportReference, Quote, Occurrence, Note

This currently results in ELEVEN different column names (with NINE unique formats) across all the formats: Reference (string usually containing separator :), ID (4-character string), Tags (strings which can contain separator sequence ; ), Question and Response (single-line md only), Occurrence (integer string), TWLink (link string to TW), SupportReference (link string, currently only defined to link from TNs to TA), OrigWords and Quote (strings which can contain discontiguity sequence &), Note (md). Any specific details on these fields below override the general details listed above.
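As a small illustration of the ID and Tags formats summarised above, the following Python sketch checks IDs and splits the Tags field. All function names are hypothetical.

```python
import re

# One lowercase letter, then three lowercase letters or digits (e.g. swi9).
ID_PATTERN = re.compile(r"[a-z][a-z0-9]{3}")


def is_valid_id(value):
    """True if the value matches the four-character ID rule."""
    return ID_PATTERN.fullmatch(value) is not None


def find_duplicate_ids(ids):
    """Return IDs that occur more than once in the whole file, in first-seen order."""
    seen = set()
    duplicates = []
    for value in ids:
        if value in seen and value not in duplicates:
            duplicates.append(value)
        seen.add(value)
    return duplicates


def split_tags(field):
    """Split a Tags field on the semicolon-plus-space separator."""
    return field.split("; ") if field else []
```

A tool that invents a new ID for an added row could run is_valid_id on the candidate and then confirm that it is absent from the full ID column of the file.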

Translation Word Links (TWL) resource

This resource is used to encode links to TWs that have been copied out of the UHB and UGNT (and eventually the TW links in those two resources will be deprecated).

These TSV files have SIX columns: Reference, ID, Tags, OrigWords, Occurrence, TWLink.

For example (from twl_2JN.tsv):
1:3 k7ar εἰρήνη 1 rc://*/tw/dict/bible/other/peace (no tags, single word)
and
1:3 v7tr keyterm Θεοῦ Πατρός 1 rc://*/tw/dict/bible/kt/godthefather (two words)
and
1:3 nmm7 keyterm; name Ἰησοῦ 1 rc://*/tw/dict/bible/kt/jesus (two tags)

  • Tags - the following (optional) tags are currently defined, and may be used together (separated by ; ): keyterm and name.
    • Joel Ruark wrote: I would echo Perry’s desire for semantic domains here, and I think we would want to have a larger discussion about this, and especially to include Johan. Because I think we would want semantic domains in tW to mirror semantic domains in GL lexicon projects, and vice versa.
  • OrigWords - Usually just one Hebrew or Greek word (or English for OBS)
  • Occurrence - Integer string that specifies which occurrence in the original language text the entry applies to.
  • TWLink - A full Resource Container link to a single TW entry
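To show how OrigWords and Occurrence work together, here is a rough Python sketch of whole-word matching. The names tokenize and find_occurrences are hypothetical. The set of punctuation stripped from word edges is an assumption, and real software would need a fuller list for Hebrew and Greek texts. The example uses English text, as for OBS.

```python
import re
import string

# Words are divided by whitespace or by Hebrew maqaf (U+05BE).
WORD_DIVIDER = re.compile(r"[\s\u05BE]+")
# Assumption: ASCII punctuation plus curly quotes is enough for this sketch.
EDGE_PUNCTUATION = string.punctuation + "“”‘’"


def tokenize(text):
    """Split text into whole words with edge punctuation removed."""
    words = (w.strip(EDGE_PUNCTUATION) for w in WORD_DIVIDER.split(text))
    return [w for w in words if w]


def find_occurrences(verse_text, orig_words, occurrence):
    """Return word indexes in the verse where the selected occurrence(s) start."""
    verse = tokenize(verse_text)
    target = tokenize(orig_words)
    if not target:
        return []
    size = len(target)
    hits = [i for i in range(len(verse) - size + 1) if verse[i:i + size] == target]
    if occurrence == -1:
        return hits  # every occurrence in the verse
    if occurrence < 1:
        return []  # 0 means the entry does not occur in the text
    return hits[occurrence - 1:occurrence]  # empty if there are too few matches


verse = "He was undone when one by one they left."
print(find_occurrences(verse, "one", int("2")))
```

Because the comparison is made word by word, the “one” inside “undone” is never counted. The Occurrence field is an integer string, so it is converted with int() before use.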

TWL repo

This is in en_twl repo.

  • Contains one file per “book” (e.g., twl_GEN.tsv, twl_PSA.tsv, twl_3JN.tsv)

OBS is twl_OBS.tsv in en_obs-twl repo.

TO BE DECIDED: how to encode OBS in the manifest, and where to put the OBS version number, e.g., rc://en/obs/book/obs?v4 ??? (TWL currently specifies the UHB and UGNT versions in the relation field in the manifest.)

Translation Questions (TQ) and Study Questions (SQ) question/response type resources

These TSV files have SEVEN columns: Reference, ID, Tags, Quote, Occurrence, Question, Response.

  • Tags - no (optional) tags are defined yet for TQ
    • Joel Ruark suggested: Grammar; Syntax; Vocabulary; Semantics; Discourse (just kinda thinking out loud here how we might categorize different types of tQ’s)
    • sq_OBS.tsv uses tags meaning (for “What the Story Says”) and application (for “What the Story Means to Us”)
  • Quote - (Optional – currently only found in Bible book SQs) Original language quote (e.g. ἐφανέρωσεν & τὸν λόγον αὐτοῦ )
    • Note that the original language is English for OBS
    • Software (such as tC) uses this for determining what is highlighted rather than using the former GLQuote field
    • A 3-character sequence of ampersand character WITH surrounding spaces indicates that the quote is discontinuous; software should interpret this in a non-greedy manner – these Quote parts must always occur in the same order that they appear in the text
    • Note that the relation fields in the file manifest.yaml indicate the specific versions of those resources that were quoted
  • Occurrence - (Optional – currently only found in Bible book SQs) Integer string that specifies which occurrence in the original language text the entry (or the first part of the entry if there are multiple discontinuous parts) applies to.
  • Question - The Markdown formatted question itself. It’s normally expected that this will be a single line (i.e., contain no paragraph formatting or newLine characters).
    • The text should be Markdown formatted, which means the following are also acceptable:
      • Plaintext - if you have no need for extra markup, just use plain text in this column
      • HTML - if you need to use inline HTML for markup, that works because it is supported in Markdown
  • Response - The Markdown formatted (as above for Question) answer or response.

TQ repo

This will be in en_tq repo. Note that the Quote and Occurrence columns will be empty. (To be clear: all the columns will exist.)

  • Contains one file per “book” (e.g., tq_GEN.tsv, tq_PSA.tsv, tq_3JN.tsv)

OBS is tq_OBS.tsv in en_obs-tq repo.

TO BE DECIDED: how to encode OBS in the manifest, and where to put the OBS version number, e.g., rc://en/obs/book/obs?v4 ???

SQ repo

This is in en_sq repo. Note that Bible books currently will have the Response column empty. OBS will have the Quote and Occurrence columns empty. (To be clear: all the columns will exist.)

  • Contains one file per “book” (e.g., sq_GEN.tsv, sq_PSA.tsv, sq_3JN.tsv)

OBS is sq_OBS.tsv in en_obs-sq repo.

TO BE DECIDED: how to encode OBS in the manifest, and where to put the OBS version number, e.g., rc://en/obs/book/obs?v4 ???

Translation Notes (TN) and Study Notes (SN) resources

These TSV files have SEVEN columns: Reference, ID, Tags, SupportReference, Quote, Occurrence, Note.

Translation Notes (TN) resource

This resource is adapted from the existing nine-column TN TSV files, most noticeably by dropping the (always out of date) GLQuote fields. (The other changes are adding Tags, dropping Book, and combining Chapter and Verse into Reference.)

For example (from tn_2JN.tsv):
front:intro vpa9 0 # Introduction to 2 John\n\n## Part 1: General Introduction\n\n### Outline of the Book of 2 John\n\n1. Opening of letter (1:1-3)\n2. Encouragement and the commandment to love one another (1:4-6)\n3. Warning about false teachers (1:7–11)\n4. Closing of letter (1:12-13)\n\n### Who wrote the Book of 2 John?\n\nThe author of this letter identifies himself only as “the Elder.” However, the content of 2 John is similar to the content in John’s Gospel. This suggests that the Apostle John probably wrote this letter, and he would have done so near the end of his life.\n\n### To whom was the Book of 2 John written?\n\nThe author addresses this letter to someone he calls “the chosen lady” and to “her children” (1:1). This could refer to a specific woman and her children. Or it could refer figuratively to a specific group of believers. (See: [[rc://*/ta/man/translate/figs-metaphor]])\n\n### What is the Book of 2 John about?\n\nJohn addressed this letter to someone he called “the chosen lady” and to “her children” (1:1). This could refer to a specific friend and her children. Or it could refer to a specific group of believers or to believers in general. John’s purpose in writing this letter was to warn his audience about false teachers. John did not want believers helping or giving money to false teachers. (See: [[rc://*/ta/man/translate/figs-metaphor]])\n\n### How should the title of this book be translated?\n\nTranslators may choose to call this book by its traditional title, “2 John” or “Second John.” Or they may choose a different title, such as “The Second Letter from John” or “The Second Letter John Wrote.” (See: [[rc://*/ta/man/translate/translate-names]])\n\n## Part 2: Important Religious and Cultural Concepts\n\n### What is hospitality?\n\nHospitality was an important concept in the ancient Near East. It was important to be friendly towards foreigners or outsiders and provide help to them if they needed it. John wanted believers to offer hospitality to guests. However, he did not want believers to offer hospitality to false teachers.\n\n### Who were the people John spoke against?\n\nThe people John spoke against were possibly those who would become known as Gnostics. These people believed that the physical world was evil. Since they believed Jesus was divine, they denied that he was truly human. This is because they thought God would not become human if the physical body were evil. (See: [[rc://*/tw/dict/bible/kt/evil]])\n\n## Part 3: Important Translation Issues\n\n### What are the major textual issues in the text of the Book of 2 John?\n\n### In [1:12](../01/12.md), most modern versions of the Bible read “our joy.” There is another traditional reading that says “your joy.” If a version of the Bible already exists in your region, you should consider using the reading of that version in your translation. If not, you may wish to follow the reading that most modern versions prefer and say “our joy.” (See: [[rc://*/ta/man/translate/translate-textvariants]])
and
1:1 uspy ὁ πρεσβύτερος 1 In this culture, letter writers would give their own names first. Your language may have a particular way of introducing the author of a letter, and if it would be helpful to your readers, you could use it here. Alternate translation: “I, the elder, am writing this letter”
and
1:13 qjdz rc://*/ta/man/translate/figs-you σε & σου 1 The pronouns **you** and **your** are singular. John tells the lady specifically that her sister’s children send greetings to her in particular. (See: [[rc://*/ta/man/translate/figs-you]])

  • Tags - no (optional) tags are defined yet.
    • Joel Ruark suggested: (some of these are defined in tCore already) –– Culture; Discourse; Connect ; Figure of Speech; Grammar; Numbers; Body Part; Human Quality; Human Behavior; Manmade Object; Natural Phenomena; Farming; Plants
  • SupportReference - (optional) A link to a supporting reference text, else empty
    • This will be a link to translationAcademy, like rc://*/ta/man/translate/figs-metaphor (where the asterisk tells the processing software to look for the tA article in the same language as is used for these tNs – see here)
  • Quote - Original language quote (e.g. ἐφανέρωσεν & τὸν λόγον αὐτοῦ )
    • Note that the original language is English for OBS
    • Software (such as tC) uses this for determining what is highlighted rather than using the former GLQuote field
    • A 3-character sequence of ampersand character WITH surrounding spaces indicates that the quote is discontinuous; software should interpret this in a non-greedy manner – these Quote parts must always occur in the same order that they appear in the text
    • Note that the relation fields in the file manifest.yaml indicate the specific versions of those resources that were quoted
  • Occurrence - Integer string that specifies which occurrence in the original language text the entry (or the first part of the entry if there are multiple discontinuous parts) applies to.
  • Note - The Markdown formatted note itself. For example: Paul speaks of God’s message as if it were an object that could be visibly shown to people. Alternate translation: “He caused me to understand his message” (See: [[rc://en/ta/man/translate/figs-metaphor]])
    • The text should be Markdown formatted, which means the following are also acceptable:
      • Plaintext - if you have no need for extra markup, just use plain text in this column
      • HTML - if you need to use inline HTML for markup, that works because it is supported in Markdown
    • By convention (Jane can probably write more in here)
      • ** is used to mark direct quotations from the source text
      • *** is used to mark other suggested renderings
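The handling of a discontinuous Quote (parts joined by an ampersand with surrounding spaces) can be sketched in Python as follows. This follows the rule given earlier that the Occurrence number applies only to the first part, and that each later part is taken at its first occurrence after the previous part. The function names and the punctuation list are assumptions of this sketch, and the example text is English, as for OBS.

```python
import re
import string

WORD_DIVIDER = re.compile(r"[\s\u05BE]+")  # whitespace or Hebrew maqaf
EDGE_PUNCTUATION = string.punctuation + "“”‘’"  # assumption: not a complete list


def tokenize(text):
    """Split text into whole words with edge punctuation removed."""
    words = (w.strip(EDGE_PUNCTUATION) for w in WORD_DIVIDER.split(text))
    return [w for w in words if w]


def locate_quote(verse_text, quote, occurrence):
    """Return (start, end) word spans for each Quote part, or None if not found."""
    verse = tokenize(verse_text)
    parts = [tokenize(part) for part in quote.split(" & ")]
    # -1 cannot be combined with the ampersand divider, and 0 means no match.
    if occurrence < 1 or any(not part for part in parts):
        return None
    spans = []
    search_from = 0
    wanted = occurrence  # applies to the first part only
    for part in parts:
        found = None
        seen = 0
        for i in range(search_from, len(verse) - len(part) + 1):
            if verse[i:i + len(part)] == part:
                seen += 1
                if seen == wanted:
                    found = i
                    break
        if found is None:
            return None
        spans.append((found, found + len(part)))
        search_from = found + len(part)  # later parts must follow in text order
        wanted = 1  # non-greedy: take the nearest following match
    return spans


text = "you know that I love you and your sister"
print(locate_quote(text, "you & your", 2))
```

The spans are word positions, which display software could map back to character offsets in order to highlight the quoted words.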

UTN repo

This will be in en_tn repo.

  • Contains one file per “book” (e.g., tn_GEN.tsv, tn_PSA.tsv, tn_3JN.tsv)

OBS is tn_OBS.tsv in en_obs-tn repo.

TO BE DECIDED: how to encode OBS in the manifest, and where to put the OBS version number, e.g., rc://en/obs/book/obs?v4 ??? (UTN currently specifies the UHB and UGNT versions in the relation field in the manifest.)

Note that an automatic process will need to calculate (using Proskomma) and add the GLQuote field for the v3 catalog.

Study Notes (SN) resource

For example (adapted from sn_TIT.tsv):
front:intro m2jl # Introduction to Titus\n\nThis is the introduction to the book of Titus
and
1:intro c7me # Introduction to Titus chapter 01\n\nPaul begins his letter by reminding Titus who Paul is to God, and who Titus is to Paul. He then instructs Titus about the kind of man that Titus must appoint as elders. These elders are necessary for the health of the new believers because there were so many people in Crete who are teaching things that were not true about God, and turning people away from God.
and
1:1 rtc9 δοῦλος Θεοῦ 1 Paul said that he was a servant or a slave of God because he did only what he knew that God, his master, wanted him to do. Other servants of God were Moses, David, and the other prophets.
and (from sn_OBS.tsv)
1:6 grg3 the sun, the moon, and the stars 1 God created these and placed them in the empty sky that he had created on the second day (See: [01:03](01/03)).

  • Tags - no (optional) tags are defined yet
  • SupportReference - (optional) A future link to a supporting reference text, else empty
  • Quote - Original language quote (e.g. ἐφανέρωσεν & τὸν λόγον αὐτοῦ )
    • Note that the original language is English for OBS
    • Software (such as tC) uses this for determining what is highlighted
  • Occurrence - Integer string that specifies which occurrence in the original language text the entry (or the first part of the entry if there are multiple discontinuous parts) applies to.
  • Note - The Markdown formatted note itself
    • The text should be Markdown formatted, which means the following are also acceptable:
      • Plaintext - if you have no need for extra markup, just use plain text in this column
      • HTML - if you need to use inline HTML for markup, that works because it is supported in Markdown
    • By convention
      • ** is used to mark direct quotations from the source text

SN repo

This is in en_sn repo.

  • Contains one file per “book” (e.g., sn_GEN.tsv, sn_PSA.tsv, sn_3JN.tsv)

OBS is sn_OBS.tsv in en_obs-sn repo.

TO BE DECIDED: how to encode OBS in the manifest, and where to put the OBS version number, e.g., rc://en/obs/book/obs?v4 ??? (SN currently specifies the UHB and UGNT versions in the relation field in the manifest.)

Relevant RFCs

Both RFC 4180 and RFC 7111 contain relevant information.

History

Formerly, version 1 of this specification was documented in the UTN v13 Readme file, but it has moved here to allow for easier updating and reference.

In 2020, there was a proposal for a general (7-column) annotation file format. (One of the driving forces was that in Translation Notes (TNs), the pasted GLQuote column was often out-of-date as the ULT itself was in constant revision, so we wanted the GLQuote to be automatically generated by software instead.) In Feb 2021, a general “annotation” format was rejected in favour of having custom formats (columns and headers) for each resource, thus allowing for things like separating the Question and Response in TQ and SQ type resources. Since the acceptance of those formats, we have continued trying to tighten up the spec as needs evolve and ambiguities become clearer.

ToDo

  • Create new home for specification (here)
  • Pick a descriptive name
  • Address the need to reference what original language (and GL?) resource and version is being referred to
  • Address the slipperiness of the ellipsis (document that ellipsis cannot be used with -1)
  • Address the resource that SupportReference must refer to
  • Is Occurrence sufficient?
  • Is GLQuote required or is it getting in the way? (Yes, it is required and must match)
  • What about “Connecting Statement” and “General Information” ? (Handled by the fact that GLQuote is now required)

In Reference, do we want to also support format such as “2:5;3:1” ?

For me, yes, I would have assumed that (although up to the content team, really) and should have included an example like that. (There are many longer accounts, especially in the OT like building the tabernacle, that go beyond artificial chapter breaks, and where one could envisage a notes writer wanting to discuss use of a particular word or concept in multiple chapters.)