USFM3 Alignment Data Encoding

jag3773 · January 30, 2018, 6:36pm

Edit: see the UBXF spec at http://resource-container.readthedocs.io/en/latest/ubxf.html .

To make it easier to exchange alignment data between software, we’re providing a working example of how that data can be encoded in USFM 3. To learn more about USFM 3, see the USFM documentation, in particular the section on Milestones.

Note that this encoding format is not yet finalized; we are open to suggestions on how it may be improved. To our knowledge, it accounts for the various ways that languages need to be aligned to one another (since USFM 3 milestones can be overlapping or non-overlapping).

Here is an example from Titus 1:1. Note the following characteristics:

Punctuation occurs outside the \w markers.
~~We are using the Keyword / keyterm marker for our milestones because we can’t find any that are more appropriate in the USFM 3 specification.~~
Even for 1 to 1 word correspondence, we use the milestone marker for encoding the alignment. This makes it easier for code to import, as it doesn’t have to look at both word attributes and milestone attributes.
We encode the actual Greek text in the x-ugnt attribute so that it is clear what the Greek source is (the UGNT in this case).
The occurrence and occurrences attributes can help software identify individual occurrences of identical words within a verse.

\v 1 \zaln-s | x-strong="G39720" x-occurrence="1" x-occurrences="1" x-ugnt="Παῦλος"\*
\w Paul|x-occurrence="1" x-occurrences="1"\w*
\zaln-e\*,
\zaln-s | x-strong="G14010" x-occurrence="1" x-occurrences="1" x-ugnt="δοῦλος"\*
\w a|x-occurrence="1" x-occurrences="1"\w*
\w servant|x-occurrence="1" x-occurrences="1"\w*
\zaln-e\*
\zaln-s | x-strong="G23160" x-occurrence="1" x-occurrences="2" x-ugnt="Θεοῦ"\*
\w of|x-occurrence="1" x-occurrences="4"\w*
\w God|x-occurrence="1" x-occurrences="2"\w*
\zaln-e\*
\zaln-s | x-strong="G06520" x-occurrence="1" x-occurrences="1" x-ugnt="ἀπόστολος"\*
\w and|x-occurrence="1" x-occurrences="2"\w*
\w an|x-occurrence="1" x-occurrences="1"\w*
\w apostle|x-occurrence="1" x-occurrences="1"\w*
\zaln-e\*
\zaln-s | x-strong="G11610" x-occurrence="1" x-occurrences="1" x-ugnt="δὲ"\*
\w of|x-occurrence="2" x-occurrences="4"\w*
\zaln-e\*
\zaln-s | x-strong="G24240" x-occurrence="1" x-occurrences="1" x-ugnt="Ἰησοῦ"\*
\w Jesus|x-occurrence="1" x-occurrences="1"\w*
\zaln-e\*
\zaln-s | x-strong="G55470" x-occurrence="1" x-occurrences="1" x-ugnt="Χριστοῦ"\*
\w Christ|x-occurrence="1" x-occurrences="1"\w*
\zaln-e\*,
\zaln-s | x-strong="G25960" x-occurrence="1" x-occurrences="1" x-ugnt="κατὰ"\*
\w for|x-occurrence="1" x-occurrences="1"\w*
\zaln-e\*
\zaln-s | x-strong="G41020" x-occurrence="1" x-occurrences="1" x-ugnt="πίστιν"\*
\w the|x-occurrence="1" x-occurrences="3"\w*
\w faith|x-occurrence="1" x-occurrences="1"\w*
\zaln-e\*
\zaln-s | x-strong="G23160" x-occurrence="2" x-occurrences="2" x-ugnt="Θεοῦ"\*
\w of|x-occurrence="3" x-occurrences="4"\w*
\w God's|x-occurrence="2" x-occurrences="2"\w*
\zaln-e\*
\zaln-s | x-strong="G15880" x-occurrence="1" x-occurrences="1" x-ugnt="ἐκλεκτῶν"\*
\w chosen|x-occurrence="1" x-occurrences="1"\w*
\w people|x-occurrence="1" x-occurrences="1"\w*
\zaln-e\*
\zaln-s | x-strong="G25320" x-occurrence="1" x-occurrences="1" x-ugnt="καὶ"\*
\w and|x-occurrence="2" x-occurrences="2"\w*
\zaln-e\*
\zaln-s | x-strong="G02250" x-occurrence="1" x-occurrences="1" x-ugnt="ἀληθείας"\*
\w the|x-occurrence="2" x-occurrences="3"\w*
\zaln-e\*
\zaln-s | x-strong="G19220" x-occurrence="1" x-occurrences="1" x-ugnt="ἐπίγνωσιν"\*
\w knowledge|x-occurrence="1" x-occurrences="1"\w*
\zaln-e\*
\zaln-s | x-strong="G02250" x-occurrence="1" x-occurrences="1" x-ugnt="ἀληθείας"\*
\w of|x-occurrence="4" x-occurrences="4"\w*
\w truth|x-occurrence="1" x-occurrences="1"\w*
\w the|x-occurrence="3" x-occurrences="3"\w*
\zaln-e\*
\zaln-s | x-strong="G35880" x-occurrence="1" x-occurrences="1" x-ugnt="τῆς"\*
\w that|x-occurrence="1" x-occurrences="1"\w*
\zaln-e\*
\zaln-s | x-strong="G25960" x-occurrence="1" x-occurrences="1" x-ugnt="κατ’"\*
\w agrees|x-occurrence="1" x-occurrences="1"\w*
\w with|x-occurrence="1" x-occurrences="1"\w*
\zaln-e\*
\zaln-s | x-strong="G21500" x-occurrence="1" x-occurrences="1" x-ugnt="εὐσέβειαν"\*
\w godliness|x-occurrence="1" x-occurrences="1"\w*
\zaln-e\*,
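
As a rough illustration of how an importer might read this encoding, here is a minimal Python sketch. It is not part of any existing tool: the function name `parse_alignments` and the regular expressions are assumptions modelled only on the sample above, and it handles only non-nested `\zaln` milestones as they appear in this example.

```python
import re

# Token shapes assumed from the Titus 1:1 sample (this is not a full USFM parser).
TOKEN = re.compile(
    r'\\zaln-s\s*\|\s*(?P<attrs>[^\\]*)\\\*'               # milestone start + attributes
    r'|(?P<end>\\zaln-e\\\*)'                               # milestone end
    r'|\\w\s+(?P<word>[^|\\]+)\|(?P<wattrs>[^\\]*)\\w\*'    # translation word + attributes
)
ATTR = re.compile(r'([\w-]+)="([^"]*)"')

SAMPLE = r'''\v 1 \zaln-s | x-strong="G39720" x-occurrence="1" x-occurrences="1" x-ugnt="Παῦλος"\*
\w Paul|x-occurrence="1" x-occurrences="1"\w*
\zaln-e\*,
\zaln-s | x-strong="G14010" x-occurrence="1" x-occurrences="1" x-ugnt="δοῦλος"\*
\w a|x-occurrence="1" x-occurrences="1"\w*
\w servant|x-occurrence="1" x-occurrences="1"\w*
\zaln-e\*'''


def parse_alignments(usfm):
    """Return a list of (source_attributes, [translation words]) pairs."""
    alignments = []
    current = None  # the milestone currently open, if any
    for m in TOKEN.finditer(usfm):
        if m.group('attrs') is not None:
            current = (dict(ATTR.findall(m.group('attrs'))), [])
        elif m.group('end'):
            if current is not None:
                alignments.append(current)
            current = None
        elif m.group('word') is not None and current is not None:
            current[1].append(m.group('word'))
    return alignments


for source, words in parse_alignments(SAMPLE):
    print(source['x-ugnt'], source['x-strong'], words)
```

Because every alignment is carried by a milestone, the sketch only ever reads source-language data from the `\zaln-s` attributes; the `\w` attributes are matched but not needed to recover the pairing.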

(edit): We are now using x-aln in the example as our custom milestone marker, following Alignment data in Milestones · Issue #59 · ubsicap/usfm · GitHub .

(edit 2): We are now using zaln in the example as our custom milestone marker, following Alignment data in Milestones · Issue #59 · ubsicap/usfm · GitHub .

jag3773 · February 1, 2018, 1:41pm

Note that I’m not really happy with the abuse of the Keyterm marker (\k) for the alignment data. David Haslam has suggested a general purpose milestone marker which would be a much better solution. I created a new issue in the USFM queue to ask about this problem in particular, see it at Alignment data in Milestones · Issue #59 · ubsicap/usfm · GitHub.

klappy · February 2, 2018, 2:39pm

In theory, we are using the milestone as if it were a word marker, since we are adding attributes that USFM 3 introduced for word markers. To prevent confusion and keep the usage clear, we used another marker, keyterm, instead. If it is more appropriate, we can go back to using word markers: \w-s and \w-e are clearly different from the typical word marker \w, and there is no text in the marker because everything is carried in attributes.

jag3773 · February 2, 2018, 4:04pm

That is a fascinating idea. Are \w-s and \w-e valid? I guess I was thinking that those were a different type of markup than, say, the \k marker.

klappy · February 2, 2018, 7:19pm

I’ve been all over the map with my understanding of the milestone markers. At one point, I thought we could just add -s and -e to any inline marker. In this use case, it might be nice if that were true. I don’t think that word markers were intended this way and the closest we could find was the keyterm marker.

andiwu · February 2, 2018, 8:01pm

A few clarification questions:

• All the Strong numbers have an additional 0 at the end. Is this digit there in case there are sub-divisions in that number?
• What’s the function of the vertical bar ( | )? Why are there spaces around it in the Greek part but not in the translation part?
• I see one-to-one and one-to-many mappings there, but not many-to-one and many-to-many? Can I see samples of those, too? In particular, I want to see how multiple words on the Greek side are represented.

jag3773 · February 2, 2018, 9:07pm

These are enhanced Strong’s numbers. Both the OSHB and the BHP use this system. And yes it helps account for further division of an entry.

That’s how the attributes are denoted in the USFM spec. The inconsistent spacing is an anomaly, but it is allowed. Even the word attribute specification has the difference.

They are accounted for, but they don’t show up in this example. Maybe @klappy can provide an example of those here?
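
For readers who want to map the enhanced numbers back to classic Strong's numbers, a small Python sketch follows. It assumes, based only on the exchange above, that the final digit is the subdivision and the digits before it are the classic number; the helper name `split_enhanced_strong` and the acceptance of an `H` prefix are assumptions for illustration (only `G` appears in this thread).

```python
import re


def split_enhanced_strong(value):
    """Split e.g. 'G39720' into ('G', 3972, 0).

    Assumption: the last digit is the subdivision, and the digits between the
    language prefix and that last digit form the classic Strong's number.
    """
    m = re.fullmatch(r'([GH])(\d+)(\d)', value)
    if m is None:
        raise ValueError(f"not an enhanced Strong's number: {value!r}")
    prefix, number, subdivision = m.groups()
    return prefix, int(number), int(subdivision)


print(split_enhanced_strong("G39720"))
print(split_enhanced_strong("G06520"))
```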

klappy · February 2, 2018, 9:25pm

All I have now are contrived examples in our tests, and they don’t include exporting to USFM 3 yet. Once they do, I’ll post them here. We’re finalizing our implementation now, so it may be next week or the week after before we have a real example to share.

andiwu · February 10, 2018, 3:11pm

Are we going to have one file for each chapter? If the file is to contain more than a chapter or more than a book, what should it look like? Can we have a sample consisting of multiple chapters or multiple books?

jag3773 · February 12, 2018, 1:13pm

Hi Andi,

The files should follow the USFM standard, which is 1 file per book of the Bible. Here is an example from our ULB text, which does not have the word attributes, but it demonstrates the file layout.

andiwu · February 13, 2018, 2:53pm

Thank you, Jesse! That’s very helpful!

jag3773 · February 13, 2018, 6:42pm

Following up from Alignment data in Milestones · Issue #59 · ubsicap/usfm · GitHub , we are allowed to use user defined attributes as milestone names.

I’ve updated the example in the original post to showcase the use of ~~x-aln~~zaln as a potential marker for encoding our alignment data. Using this will prevent improper use of the \k tag and prevent confusion down the road.

jag3773 · February 14, 2018, 6:54pm

I have added a new section to the Resource Container documentation that covers the Unlocked Bible Interchange Format (UBXF), which is the outgrowth of this post. Please check it out and provide any feedback here (the UBXF isn’t official yet, so we still have time to incorporate changes).

See the doc at http://resource-container.readthedocs.io/en/latest/ubxf.html .

jag3773 · February 14, 2018, 8:26pm

Sorry for the multiple updates today. Let me describe the recent changes that were made to better conform to the USFM 3 specification.

Three changes were made in the example above (and on the UBXF page):

We need to use \zaln as our milestone marker because \z is the private namespace for user created extensions. The previous x-aln was a misappropriation from word level attributes.
We need to use \* to end the milestone start marker. For example:
\v 1 \zaln-s | x-strong="G39720" x-occurrence="1" x-occurrences="1" x-ugnt="Παῦλος"\* , notice the \* at the end of the line.
We need to use x-strong in milestone attributes because strong is only a recognized attribute on word-level markers.
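
To make the three rules concrete, here is a hedged Python sketch of how an exporter might emit these lines. The function names `zaln_start` and `word_markers` are hypothetical, and the case-sensitive counting of words within a verse is an assumption; the output shape is copied from the example in the original post.

```python
from collections import Counter


def zaln_start(strong, occurrence, occurrences, ugnt):
    """Format a start milestone: \\zaln marker, x-strong attribute, closing \\*."""
    return (
        f'\\zaln-s | x-strong="{strong}" x-occurrence="{occurrence}" '
        f'x-occurrences="{occurrences}" x-ugnt="{ugnt}"\\*'
    )


def word_markers(words):
    """Wrap each verse word in \\w ... \\w* with occurrence counts.

    Assumption: occurrences are counted per verse and are case-sensitive.
    """
    totals = Counter(words)  # total count of each word in the verse
    seen = Counter()         # running count while walking the verse
    lines = []
    for word in words:
        seen[word] += 1
        lines.append(
            f'\\w {word}|x-occurrence="{seen[word]}" '
            f'x-occurrences="{totals[word]}"\\w*'
        )
    return lines


print(zaln_start("G39720", 1, 1, "Παῦλος"))
print("\n".join(word_markers(["of", "God", "of"])))
```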

andiwu · February 27, 2018, 2:41am

Jesse, how do you represent words that are not aligned? If a Greek word is not linked to any word in the translation or a translation word is not linked to any Greek word, we still want to keep those words in the file to make text complete, right? If so, what will it look like? Could you give some sample code, too? Thanks!

jag3773 · February 27, 2018, 9:45pm

@andiwu Unaligned words or phrases would show up outside of the \zaln milestones. For example, we’ll pretend like “of God / Θεοῦ” didn’t get aligned in the example below:

\v 1 \zaln-s | x-strong="G39720" x-occurrence="1" x-occurrences="1" x-ugnt="Παῦλος"\*
\w Paul|x-occurrence="1" x-occurrences="1"\w*
\zaln-e\*,
\zaln-s | x-strong="G14010" x-occurrence="1" x-occurrences="1" x-ugnt="δοῦλος"\*
\w a|x-occurrence="1" x-occurrences="1"\w*
\w servant|x-occurrence="1" x-occurrences="1"\w*
\zaln-e\*
\w of|x-occurrence="1" x-occurrences="4"\w*
\w God|x-occurrence="1" x-occurrences="2"\w*
\zaln-s | x-strong="G06520" x-occurrence="1" x-occurrences="1" x-ugnt="ἀπόστολος"\*
\w and|x-occurrence="1" x-occurrences="2"\w*
\w an|x-occurrence="1" x-occurrences="1"\w*
\w apostle|x-occurrence="1" x-occurrences="1"\w*
\zaln-e\*

Note that since the “base text” in these files is the translation (English in the example), that is what we would want to be text complete. Missing Greek or Hebrew words is OK because the app should be providing the full text for those via a different resource.
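
A consumer could separate aligned from unaligned translation words by tracking whether it is currently inside a `\zaln` milestone. The Python sketch below illustrates that idea; `split_by_alignment` and its regular expression are assumptions based on the sample above, and the depth counter is there only so that nested milestones, if they occur, would not break it.

```python
import re

# Token shapes assumed from the sample above (this is not a full USFM parser).
TOKEN = re.compile(
    r'(?P<start>\\zaln-s\b[^\\]*\\\*)'          # milestone start
    r'|(?P<end>\\zaln-e\\\*)'                    # milestone end
    r'|\\w\s+(?P<word>[^|\\]+)\|[^\\]*\\w\*'     # translation word
)

SAMPLE = r'''\zaln-s | x-strong="G14010" x-occurrence="1" x-occurrences="1" x-ugnt="δοῦλος"\*
\w a|x-occurrence="1" x-occurrences="1"\w*
\w servant|x-occurrence="1" x-occurrences="1"\w*
\zaln-e\*
\w of|x-occurrence="1" x-occurrences="4"\w*
\w God|x-occurrence="1" x-occurrences="2"\w*'''


def split_by_alignment(usfm):
    """Return (aligned_words, unaligned_words), each in document order."""
    depth = 0  # number of currently open \zaln-s milestones
    aligned, unaligned = [], []
    for m in TOKEN.finditer(usfm):
        if m.group('start'):
            depth += 1
        elif m.group('end'):
            depth = max(depth - 1, 0)
        elif m.group('word') is not None:
            (aligned if depth else unaligned).append(m.group('word'))
    return aligned, unaligned


aligned, unaligned = split_by_alignment(SAMPLE)
print("aligned:", aligned)
print("unaligned:", unaligned)
```

Concatenating both lists in document order would give back the complete translation text, which matches the point that the translation is the text-complete side of the file.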

jag3773 · March 2, 2018, 7:56pm

Note that the documentation is now at http://resource-container.readthedocs.io/en/latest/ubxf.html .

@andiwu would it be helpful if we added the example of unaligned words?

andiwu · March 6, 2018, 3:02am

Jesse,

Sorry for the slow response and thank you for your reply! I just flew back from China where Google is blocked.

I understand now. It will be great to add an example of this to the doc.

Andi