To make it easier to exchange alignment data between software, we’re providing a working example of how that data can be encoded in USFM 3. To learn more about USFM 3, see the USFM documentation, in particular the section on milestones.
Note that this encoding format is not yet finalized; we are open to suggestions for how it may be improved. To our knowledge, it accounts for the various ways that languages need to be aligned to one another (since USFM 3 milestones can be overlapping or non-overlapping).
Here is an example from Titus 1:1. Note the following characteristics:
• Punctuation occurs outside the \w markers.
• We are using the Keyword / keyterm marker for our milestones because we can’t find any that are more appropriate in the USFM 3 specification.
• Even for 1 to 1 word correspondence, we use the milestone marker for encoding the alignment. This makes it easier for code to import as it doesn’t have to look at word attributes and milestone attributes.
• We encode the actual Greek text in the x-ugnt attribute so that it is clear what the Greek source is (the UGNT in this case).
• The occurrence and occurrences attributes can help software identify individual occurrences of identical words within a verse.
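To illustrate how the occurrence and occurrences attributes could be derived, here is a minimal Python sketch. The function name tag_occurrences and the sample word list are hypothetical, and the sketch assumes the verse has already been split into words; it is not taken from any actual implementation.

```python
from collections import Counter


def tag_occurrences(words):
    """Return (word, occurrence, occurrences) for each word of one verse.

    occurrence  = how many times this word has appeared so far (1-based)
    occurrences = how many times this word appears in the whole verse
    """
    totals = Counter(words)   # total count of each word in the verse
    seen = Counter()          # running count while walking the verse
    tagged = []
    for word in words:
        seen[word] += 1
        tagged.append((word, seen[word], totals[word]))
    return tagged


# Hypothetical, already-tokenized verse fragment (punctuation removed).
verse_words = ["a", "servant", "of", "God", "and", "an", "apostle", "of", "Jesus", "Christ"]
for word, occurrence, occurrences in tag_occurrences(verse_words):
    print(f'{word}: x-occurrence="{occurrence}" x-occurrences="{occurrences}"')
```

With these two numbers, the first and second “of” in the verse can be told apart even though the text is identical.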
In effect, we are using the milestone as if it were a word marker, since we are adding attributes that USFM 3 introduced for word markers. To prevent confusion and keep the usage clear, we used another marker, keyterm, instead. If it is more appropriate, we can go back to using word markers, since \w-s and \w-e are clearly distinct from the typical \w word marker and carry no text of their own: everything is included in the attributes.
I’ve been all over the map with my understanding of the milestone markers. At one point, I thought we could just add -s and -e to any inline marker. In this use case, it might be nice if that were true. I don’t think that word markers were intended this way and the closest we could find was the keyterm marker.
• All the Strong numbers have an additional 0 at the end. Is this digit there in case there are sub-divisions in that number?
• What’s the function of the vertical bar ( | )? Why are there spaces around it in the Greek part but not in the translation part?
• I see one-to-one and one-to-many mappings there, but not many-to-one and many-to-many? Can I see samples of those, too? In particular, I want to see how multiple words on the Greek side are represented.
These are enhanced Strong’s numbers. Both the OSHB and the BHP use this system. And yes, it helps account for further division of an entry.
That’s how the attributes are denoted in the USFM spec. The inconsistent spacing is an anomaly, but it is allowed. Even the word attribute specification has the difference.
They are accounted for, but they don’t show up in this example. Maybe @klappy can provide an example of those here?
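As a small illustration of the enhanced Strong’s numbers described in the first answer, the following Python sketch splits a value such as G39720 into the base Strong’s number and the trailing sub-division digit. It assumes the extension is always exactly one final digit, which is inferred from this thread rather than confirmed against the OSHB or BHP data; the function name is hypothetical.

```python
import re


def split_enhanced_strongs(value):
    """Split an enhanced Strong's number into (base, extension).

    Assumption: the value is a language prefix (G or H), the usual
    Strong's digits, and then exactly one extra digit for sub-divisions.
    """
    match = re.fullmatch(r"([GH])(\d+)(\d)", value)
    if match is None:
        raise ValueError(f"not an enhanced Strong's number: {value!r}")
    prefix, base_digits, extension = match.groups()
    return prefix + base_digits, int(extension)


print(split_enhanced_strongs("G39720"))
```

Under that assumption, G39720 is read as base entry G3972 with sub-division 0.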
All I have now are contrived examples in our tests, and they don’t include exporting to USFM 3 yet. Once they do, I’ll post them here. We’re finalizing our implementation now, so it may be next week or the week after before we have a real example to share.
Are we going to have one file for each chapter? If the file is to contain more than a chapter or more than a book, what should it look like? Can we have a sample consisting of multiple chapters or multiple books?
The files should follow the USFM standard, which is 1 file per book of the Bible. Here is an example from our ULB text, which does not have the word attributes, but it demonstrates the file layout.
I’ve updated the example in the original post to showcase the use of zaln (previously x-aln) as a potential marker for encoding our alignment data. Using this will prevent improper use of the k tag and prevent confusion down the road.
I have added a new section to the Resource Container documentation that covers the Unlocked Bible Interchange Format (UBXF), which is the outgrowth of this post. Please check it out and provide any feedback here (the UBXF isn’t official yet, so we still have time to incorporate changes).
Sorry for the multiple updates today. Let me describe the recent changes that were made to better conform to the USFM 3 specification.
Three changes were made in the example above (and on the UBXF page):
We need to use \zaln as our milestone marker because \z is the private namespace for user created extensions. The previous x-aln was a misappropriation from word level attributes.
We need to use \* to end the milestone start marker. For example: \v 1 \zaln-s | x-strong="G39720" x-occurrence="1" x-occurrences="1" x-ugnt="Παῦλος"\* (notice the \* at the end of the line).
We need to use x-strong in milestone attributes because strong is only recognized as a word-level attribute.
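For anyone importing this format, here is a rough Python sketch of reading the attributes from a \zaln-s start milestone like the one quoted above. The regular expressions and the function name parse_zaln_start are assumptions made for illustration; they handle only this simple attribute layout and are not a full USFM parser.

```python
import re

# A start milestone: \zaln-s, a vertical bar, attributes, then the closing \*
MILESTONE_RE = re.compile(r'\\zaln-s\s*\|\s*(?P<attrs>.*?)\\\*')
# One attribute: name="value" (names may contain hyphens, e.g. x-strong)
ATTR_RE = re.compile(r'([\w-]+)="([^"]*)"')


def parse_zaln_start(text):
    """Return the attributes of the first \\zaln-s milestone, or None."""
    match = MILESTONE_RE.search(text)
    if match is None:
        return None
    return dict(ATTR_RE.findall(match.group("attrs")))


line = r'\v 1 \zaln-s | x-strong="G39720" x-occurrence="1" x-occurrences="1" x-ugnt="Παῦλος"\*'
attrs = parse_zaln_start(line)
print(attrs["x-strong"], attrs["x-occurrence"], attrs["x-occurrences"])
```

Because every attribute carries the x- prefix, the importer can treat them uniformly as user-defined values rather than relying on attributes that the specification reserves for \w.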
Jesse, how do you represent words that are not aligned? If a Greek word is not linked to any word in the translation or a translation word is not linked to any Greek word, we still want to keep those words in the file to make text complete, right? If so, what will it look like? Could you give some sample code, too? Thanks!
@andiwu Unaligned words or phrases would show up outside of the \zaln milestones. For example, we’ll pretend like “of God / Θεοῦ” didn’t get aligned in the example below:
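A minimal Python sketch of that idea follows. The USFM fragment inside it is hypothetical: it is modeled on the attribute names used in this thread, and the Strong’s number given for δοῦλος is an illustrative assumption, so it should not be read as the actual file contents. The sketch treats any \w word that sits outside a \zaln-s ... \zaln-e pair as unaligned.

```python
import re

# Tokens of interest: milestone start, milestone end, and \w ... \w* words.
TOKEN_RE = re.compile(
    r'(?P<start>\\zaln-s\s*\|[^\\]*\\\*)'
    r'|(?P<end>\\zaln-e\\\*)'
    r'|\\w\s+(?P<word>[^|\\]+)(?:\|[^\\]*)?\\w\*'
)

# Hypothetical fragment: "Paul" and "a servant" are aligned, "of God" is not.
usfm = (
    r'\v 1 '
    r'\zaln-s | x-strong="G39720" x-occurrence="1" x-occurrences="1" x-ugnt="Παῦλος"\*'
    r'\w Paul|x-occurrence="1" x-occurrences="1"\w*'
    r'\zaln-e\*, '
    r'\zaln-s | x-strong="G14010" x-occurrence="1" x-occurrences="1" x-ugnt="δοῦλος"\*'
    r'\w a|x-occurrence="1" x-occurrences="1"\w* '
    r'\w servant|x-occurrence="1" x-occurrences="1"\w*'
    r'\zaln-e\* '
    r'\w of|x-occurrence="1" x-occurrences="1"\w* '
    r'\w God|x-occurrence="1" x-occurrences="1"\w*'
)


def split_by_alignment(text):
    """Return (aligned_words, unaligned_words) for a USFM fragment."""
    aligned, unaligned = [], []
    depth = 0  # number of currently open \zaln-s milestones
    for token in TOKEN_RE.finditer(text):
        if token.group("start"):
            depth += 1
        elif token.group("end"):
            depth -= 1
        else:
            target = aligned if depth > 0 else unaligned
            target.append(token.group("word").strip())
    return aligned, unaligned


aligned_words, unaligned_words = split_by_alignment(usfm)
print("aligned:", aligned_words)
print("unaligned:", unaligned_words)
```

The depth counter is used instead of a simple flag so that nested milestones would still be counted as aligned.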
Note that since the “base text” in these files is the translation (English in the example), that is what we would want to be text complete. Missing Greek or Hebrew words is OK because the app should be providing the full text for those via a different resource.