Whitespace in unfoldingWord USFM files

RobH · December 9, 2019, 8:30am

The Door43 ecosystem tries to use modern standards wherever possible and practicable, even with our extensive deep linking between resources. Since USFM3 is the modern standard for Bibles (especially in-progress works), Bible files use that standard. The USFM format is documented in the USFM Documentation (Unified Standard Format Markers 3.0.0), and the USX format (the XML-based equivalent of USFM, used more for completed works, especially for the Digital Bible Library) is documented in the USX Documentation (Unified Scripture XML 3.0.0). This forum has some information on USFM here.

When a translated Bible has individual words and phrases aligned to the original Hebrew or Greek words (using the WordAlignment tool of translationCore), the \z user-defined milestone markers and x- user-defined character markers are used. This is defined more here and you can see how these attributes are used here.

The aligned data files (see ULT Jonah here and Titus here) become very large, and the actual translated text is not easy to read. It also turns out that it is not entirely straightforward for a program to extract the translated text, i.e., to discard the alignment data.

It’s not only translations aligned to original texts that are affected: we also have to consider the Hebrew (e.g., Jonah) and Greek (e.g., Titus) texts themselves. Although they don’t contain word alignment information, they do contain key term (KT) start and end milestone markers, and they have one word per line, similar to the aligned translations.

The reason that it’s not straightforward for a program to extract the text has to do with ambiguities regarding whitespace in USFM. (This ambiguity was part of the design of USFM and has to be handled by the various USFM processors as explained here.) So this article describes an attempt to remove milestones and word information from the original language files, and alignment information from translationCore v2.0 output files.

Firstly some comments about the whitespace decisions in the original language files:

KT start and KT end milestone markers occupy their own lines, so the newlines that end those lines should be removed along with the markers.
Words \w start on new lines and there is usually only one word per line. Thus it’s clear that the newline marker is intended to be converted to space in that case.
Hebrew maqqef-joined words are saved with all three parts on the same line, i.e., \w first word…\w* then the maqqef then the \w second word…\w* with NO SPACE between them so no adjustments to the whitespace there are required.
Sentence punctuation is usually attached to the end of lines, e.g., directly after \w … \w* words.
There’s an extra blank line before each verse which should be ignored.
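
The rules above for the original language files can be sketched in Python roughly as follows. This is only a minimal illustration, not the script discussed in this post: the sample text is made up (a plain hyphen stands in for the maqqef), and the names `WORD_RE`, `SAMPLE` and `strip_original` are hypothetical.

```python
import re

# Matches one \w ...\w* word and captures only the visible text,
# dropping any |attribute list that follows it.
WORD_RE = re.compile(r'\\w\s+([^|\\]+?)\s*(?:\|.*?)?\\w\*')

# Made-up sample shaped like the description above (not real file content).
# A plain hyphen stands in for the maqqef, a full stop for final punctuation.
SAMPLE = r'''\c 1
\p

\v 1
\w alpha|lemma="a" strong="H0001"\w*
\k-s | x-tw="rc://*/tw/dict/bible/kt/example"
\w beta|lemma="b" strong="H0002"\w*-\w gamma|lemma="g" strong="H0003"\w*
\k-e\*
\w delta|lemma="d" strong="H0004"\w*.

\v 2
\w epsilon|lemma="e" strong="H0005"\w*
'''

def strip_original(usfm):
    """Drop KT milestones and word attributes; put each verse on one line."""
    out = []
    for raw in usfm.splitlines():
        line = raw.strip()
        if not line:
            continue  # extra blank line before each verse
        if line.startswith(('\\k-s', '\\k-e')):
            continue  # KT milestone lines vanish together with their newline
        if line.startswith('\\w '):
            text = WORD_RE.sub(r'\1', line)  # joined words stay joined
            if out:
                out[-1] += ' ' + text  # newline before a word becomes a space
            else:
                out.append(text)
        else:
            out.append(line)  # \c, \p, \v etc. keep their own lines
    return '\n'.join(out)

print(strip_original(SAMPLE))
```

On the sample, verse 1 collapses to `\v 1 alpha beta-gamma delta.`: the two joined words stay joined, and the punctuation stays attached to the last word.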

Then some comments about the whitespace decisions in the translated output files:

The outer self-closed \zaln-s start milestones usually start on new lines. Nested \zaln-s milestones occur within lines.
The outer self-closed \zaln-e\* end milestones usually appear at or near the end of lines.
Words \w often start on new lines and there is usually only one word per line. Thus it’s clear that the newline marker is intended to be converted to space in that case.
Hyphenated words (like “self-controlled” in Titus) are saved with all three parts on the same line, i.e., \w first word…\w* then the hyphen then the \w second word…\w* with NO SPACE between them so no adjustments to the whitespace there are required.
Footnotes often start on new lines, but since a footnote should not normally be preceded by a space, the newline markers before footnotes should NOT be converted to spaces.
Sentence punctuation is usually attached to the end of lines, e.g., directly after \zaln-e\* milestones or \w … \w* words.
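
For the translated files, the same idea might look something like the sketch below. It assumes the end milestone is written as `\zaln-e\*` and that attribute lists never contain `\*`. The sample is invented, and the names `MILESTONE_RE`, `WORD_RE`, `SAMPLE` and `strip_alignment` are hypothetical and not taken from the actual script.

```python
import re

# Self-closed alignment milestones: \zaln-s |...\*  and  \zaln-e\*
MILESTONE_RE = re.compile(r'\\zaln-[se]\b.*?\\\*')
# One \w ...\w* word; only the visible text is captured.
WORD_RE = re.compile(r'\\w\s+([^|\\]+?)\s*(?:\|.*?)?\\w\*')

# Invented sample shaped like the description above (not real file content).
SAMPLE = r'''\c 1
\p
\v 1
\zaln-s | x-strong="G0001" x-content="aaa"\*\w Paul|x-occurrence="1" x-occurrences="1"\w*\zaln-e\*,
\zaln-s | x-strong="G0002" x-content="bbb"\*\w a|x-occurrence="1" x-occurrences="1"\w*
\w servant|x-occurrence="1" x-occurrences="1"\w*\zaln-e\*
\f + \ft Or slave\f*
\zaln-s | x-strong="G0003" x-content="ccc"\*\w self|x-occurrence="1" x-occurrences="1"\w*-\w controlled|x-occurrence="1" x-occurrences="1"\w*\zaln-e\*.
'''

def strip_alignment(usfm):
    """Remove alignment data and rejoin the text, one verse per line."""
    out = []
    for raw in usfm.splitlines():
        line = MILESTONE_RE.sub('', raw)
        line = WORD_RE.sub(r'\1', line).strip()
        if not line:
            continue
        if line.startswith('\\f '):
            joiner = ''    # no space before a footnote
        elif line.startswith('\\'):
            joiner = None  # any other marker starts a new output line
        else:
            joiner = ' '   # newline before ordinary text becomes a space
        if joiner is None or not out:
            out.append(line)
        else:
            out[-1] += joiner + line
    return '\n'.join(out)

print(strip_alignment(SAMPLE))
```

On the sample, the verse becomes `\v 1 Paul, a servant\f + \ft Or slave\f* self-controlled.`, with no space in front of the footnote and the hyphenated word left intact.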

Also note:

The original language files are not fully USFM3 compliant yet because the \k-s markers are not closed properly with \*.
The translated files are not fully USFM3 compliant yet – the use of \s5 markers as translation chunk milestones is non-conformant.
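
A quick way to spot both of these points in a file could look like the following sketch. It assumes that "closed properly" means the line ends with the USFM3 self-closing `\*`, and the function name `compliance_notes` is hypothetical.

```python
import re

def compliance_notes(usfm):
    """Return (line number, note) pairs for the two issues listed above."""
    notes = []
    for n, line in enumerate(usfm.split('\n'), start=1):
        stripped = line.strip()
        # Assumption: a properly closed start milestone ends with \*
        if stripped.startswith('\\k-s') and not stripped.endswith('\\*'):
            notes.append((n, 'unclosed \\k-s milestone'))
        if re.match(r'\\s5\b', stripped):
            notes.append((n, '\\s5 used as chunk marker'))
    return notes

# Invented three-line example
example = '\\s5\n\\k-s | x-tw="rc://*/tw/dict/bible/kt/example"\n\\k-e\\*'
for note in compliance_notes(example):
    print(note)
```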

A short Python script was written to strip the milestones and superfluous word information from the original language and translated Bible files, and then to adjust whitespace in order to end up with satisfactory basic USFM containing only the text and footnotes. This can be viewed at tools/test_uW_USFM_read.py at develop · unfoldingWord-dev/tools · GitHub.

Besides removing milestones and word information, moving words onto one line, and removing blank lines, the script checks for and highlights unnecessary spaces at the ends of lines and other hard-to-see features such as no-break spaces, and it produces a report after converting a file.
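
The kind of whitespace report described here could be sketched as below. This is an assumption about the general approach rather than what the linked script does. The list of hard-to-see characters is my own choice, and `HARD_TO_SEE` and `whitespace_report` are hypothetical names.

```python
# Hard-to-see characters to flag (an illustrative selection, not exhaustive).
HARD_TO_SEE = {
    '\u00A0': 'NO-BREAK SPACE',
    '\u200B': 'ZERO WIDTH SPACE',
    '\u2060': 'WORD JOINER',
}

def whitespace_report(text):
    """Return (line number, finding) pairs for suspicious whitespace."""
    findings = []
    for n, line in enumerate(text.split('\n'), start=1):
        # Strip only space and tab: a bare rstrip() would also strip a
        # trailing no-break space and report it as ordinary whitespace.
        if line != line.rstrip(' \t'):
            findings.append((n, 'trailing whitespace'))
        for ch, name in HARD_TO_SEE.items():
            count = line.count(ch)
            if count:
                findings.append((n, f'{count} x {name}'))
    return findings

# Invented three-line example
example = '\\v 1 In the beginning \n\\v 2 no\u00A0break\n\\v 3 clean'
for line_number, finding in whitespace_report(example):
    print(f'line {line_number}: {finding}')
```

Returning a list of findings keeps the check separate from how the report is displayed, so the same function could feed a console summary or a log file.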