Catalog Next Proposal

jag3773 · August 22, 2019, 2:57pm

Overview

This document briefly describes our current catalog/API configuration and offers a brief proposal for a new configuration that aims to simplify usage for diverse development needs.

Context

Documentation for our current APIs is at https://api-info.readthedocs.io/

Fun fact, one of my first public contributions to uW in 2013 was turning the DokuWiki formated OBS into JSON for consumption by mobile apps and PDF generators.

Current API Setup

Currently, we have 2 active endpoints as it relates to translation work:

Door43 Resource Catalog (v3)–A read only catalog described at Door43 Resource Catalog (v3) — uW and Door43 API Information 0.1 documentation
Door43 Content Service API–A read/write API described at Door43 Content Service (DCS) — uW and Door43 API Information 0.1 documentation

Proposed Changes

In a nutshell, the proposal is to merge the two endpoints above into a single API which provides access to all the data we intend to serve.

API

REST

The functionality provided by the Door43 Resource Catalog (v3) will be merged into the Door43 Content Service REST API.

Of course, one big advantage here is that developers only need to know about and code for a single endpoint. Ideally, we can provide a higher level component library that goes even further in easing implementation (e.g. Gitea React Toolkit).

The key here is that the API provides access to 100% of the necessary information.

GraphQL

We also intend to add a GraphQL layer which will provide access to 100% of the information in the REST API, but in a more configurable manner. We may use one of these GraphQL Go libraries for that.

Database and API Diagram

graph TB; subgraph AWS RDS DB A["Door43 Content Service DB"]-->E C["Scripture Burrito DB"]-->E E["Gitea XORM"] end E-->B["Door43 Content Service Rest API"] E-->D["GraphQL Engine"]

Staging

We are planning to provide three stages in the new catalog.

Default/Normal Stage: The normal stage is production which will serve resources matching the Minimum Requirements listed below.
Pre-Release Stage: Subject to the same Minimum Requirements except that any Release marked as Pre-Release is included.
Experimental Stage: Serves the default branch (usually master) of the repository.

This configuration will allow consistency checks to be made between multiple resources before they are marked for release.

Catalog Features

Following the above proposal means that we need clear ways to replicate the features of the preexisting Catalog, namely:

Publishing workflow
Versioning of resources
Digital signing of resources
Metadata for resource
Pivoting/Querying for resources based on metadata

We’ll look at each of these features below.

Publishing Workflow

Previous Workflow

The overall workflow for publishing a resource will change significantly. The v3 catalog requires that a Source Text Request form be filled out which starts a set of steps that looks like this:

User fills out Source Text Request form
Verify license agreements
Fork/copy the data into the STR organization
Massage data and metadata to meet publishing standards
Move repository to Door43-Catalog organization
Fork project back into STR organization so that future updates can be staged easily

Minimum Requirements

What is absolutely necessary to publish a resource is that it meets these requirements:

The metadata validates against the Scripture Burrito schema.
A valid Release version tag is in the repository (this needs further specified, is it simply v1 or v1.3 that counts, or should be enforce something like catalog_v1 ?)

Effectively, anyone could make the above happen if they had the technical know-how. Of course, we are anticipating that most people will need some expertise to make those two things happen. So, keep reading…

Interim Workflow

The data and a metadata manipulation will still need to happen, but we’ll no longer need the approval process nor the moving of repositories out of their original locations. Instead, a process like the following will (typically) occur:

User fills out Publishing Aid form
A tech/developer forks the repository and massages data and metadata to meet publishing standards
A PR is issued against original repository
User merges PR
User creates a Release in DCS with an appropriate version tag

At that point the resource will published.

Ideal Workflow

Ideally, tC will actually manage this publishing process. This will require that we will have a Scripture Burrito aware React Component library that is capable of reading and writing SB metadata. Naturally, the library will need to be able to migrate Resource Container and previous metadata to SB compliant metadata.

At that point, tC (or any tool) can “mix in” in the SB aware Component Library and the DCS Toolkit component library and it will then be capable of,

writing valid SB metadata, and
creating a valid Release in DCS,

which will automagically publish the resource.

Building this functionality as Component Libraries means that publishing no longer has to be an extraneous step that only a few can master. Instead, anyone, using many apps, can publish their resource with a few clicks!

Versioning

Scripture Burrito Metadata

Scripture Burrito metadata will replace our previous Resource Container metadata (and previous tS, tC, uW app metadata that is not app specific). There are at least two steps to making this a reality:

We need to add a Scripture Burrito aware API to DCS (see following section)
We need to create Scripture Burrito aware React Component library that all of our tools can use to read and write SB metadata.

The proposal is that our applications manage the SB metadata directly and DCS merely processes and presents the projects in the API.

Querying for Resources

The DCS API will create new catalog endpoints as needed to fill the needs of software applications looking for published content (ironically, this might be similar to Github’s new package registry). This means that we’ll be able to replicate something similar to the preexisting v3 Catalog endpoints.

In addition to the existing DCS API endpoints, we should consider something like the following:

/catalog/

GET–returns all valid SB projects

/catalog/search

GET–returns valid SB projects according to search criteria (likely similar to repos search criteria). However, would be ideal to support searching based on SB metadata fields too (e.g Flavor, FlavorType, etc).

/catalog/owners/{owner}

GET–returns all valid SB projects for specified {owner} (either DCS user or organization)

We also need to keep in mind that there may be some jointly created code for a Scripture Burrito Aware API, which might be exactly what we are looking for. Depending on how and when this code is written, we may be able to “bolt” this on to the side of DCS, possibly integrated via default webhooks or git hooks.

Git Related Workflow Notes

Much of this is covered above, but here is a bullet point list of some things that may be different than preexisting conventions:

Repositories will remain under the control of the User or Organization that created/owns it.
The Door43-Catalog Organization will be replaced. Projects that need “generic” hosting may persist in this organization or a similarly named one.
A release/version will always be a tag
Branches may be used for work in progress (as in Protected Branch Workflow)
A translation project is always a hard fork (copy) of a version tag (e.g. a translation will never be merged to it’s upstream since it is a different language, so don’t attempt to preserve the fork upstream in the database relation)
- Create new repo → es_tn
  - Use latest tag for source content (en_tn)
  - Create a new branch from latest version tag
    - master branch will be blank
    - …translation happens…
    - … Create release with a git tag
      - Tag points at master branch
    - (old RC idea, needs updated for SB) Under source, add location to each entry in manifest.yaml:
      - location: ‘https://git.door43.org/unfoldingWord/en_tn/src/tag/v15’
      - location: ‘https://git.door43.org/unfoldingWord/en_tn/src/tag/v15’

Interesting Stuff

Follow git flow loosely: A successful Git branching model » nvie.com ?

Continue using our versioning schema: Versioning — unfoldingWord

Future Work

Signing

The current catalogs provide cryptographic signing of all content presented. This is great and will be difficult to replicate in DCS. However, it seems to make more sense to recreate the same sort of functionality in git and Scripture Burrito aware methods. There are two specific needs that must be addressed here:

Need 1: Verify content hasn’t changed during transit

Essentially, this amounts to checksumming. Both DCS and SB have methods of identifying files and checksums for those files.

Pretty easily accomplished using SHAs from DCS and locally computed versions on a per file basis
Possibly to download versioned tag zip/tar.gz AND the tree from the API which provides the SHAs for all the files – both from same TAG
- Solves for happy path
  - Doesn’t solve bad actor path where someone unpacks and repacks zip and tree

Need 2: Verify that org I love has signed my content

Signing tags/commits
Create DCS api mechanism allows you to sign a tag/release
- Signature for each one
- Public key for each org/user for verification purposes (already built?)
Possibly create an easy to use method for storing private keys for users in DCS / also recommend something like Keybase
Possibly use something like https://www.zdnet.com/article/googles-web-packaging-standard-arises-as-a-new-tool-for-privacy-enthusiasts/ ?

Implementation Plan (Rough)

List of things that we’ll need to do to move in this direction:

Write scripture_burrito.go in the DCS models directory (this will define the database table(s) for all the SB metadata that we need to store)
- We need a translation type scale identifier field for how to use the resource (e.g. ULT vs. UST) (translation strategy, form centric, or meaning centric)
- Maybe same as above, but migrate subject field, possibly maps onto flavor
Probably add a Scripture Burrito tab to Repo page to provide view/edit access to metadata
- Maybe add SB export button on this page
- Include a SB validation badge on this page
Add REST API endpoints for SB information
Code to migrate from RC metadata to SB database
Undecided: do we save SB metadata in the repo? Jesse leans toward no.
Add SB download option to the Download button
Add a catalog badge to each valid Release entry
Add a graphql layer to DCS, direct access to DB
Figure out what do with OLD catalog endpoints and get rid of LAMBDA functions

joel · October 17, 2019, 4:15am

Ok here’s my review/opinion!

Minimum requirements > release tag
Tags should follow unfoldingWord Content Versioning . Anything else would be inconsistent. If an additional flag needs to be added to the tag we could follow the pattern seen in semantic versioning v1.3.1-catalog. Though, I’m not certain what benefit we get by having -catalog and it could find it’s way into SB at the hands of confused translators.

Publishing workflow > Ideal Workflow
Just to clarify a few parts.

a valid SB must be written to the repo.
a properly formatted tag must be created.

At this point the release has been published in minimal form. However, you could optionally complete the following.

draft a release on the above tag. Perhaps the text here could appear in some “release notes” section in the api.
upload compiled formats to the release. These files will appears as additional formats (sorry for using RC terminology) within the api. e.g. pdf, html, etc.
- additional thoughts on formats. Perhaps there could be some standard for including links to online media like youtube? Maybe as a separate field in the release page?

Querying for Resources
Would viewing and searching the API be limited to just the latest versions of content? How could older versions of a resource be found via the api?

Fetching all SB projects for an owner should be at /catalog/owners/{owner} to avoid name collisions. This is also consistent with the current DCS api.

Git Related Workflow Notes

generic projects
Generic projects could simply be hosted under an individual’s account. I don’t see a reason why we would need to keep those under Door43-Catalog.

publishing
As mentioned under the publishing workflow, in addition to releases/versions always being a tag, Additional formats can be included by drafting a release on that tag and uploading the additional formats. e.g. pdfs, html, etc.

translation
The translation process explained here is quite different from how it’s been done in the past with tS. The translation has always been a blank slate. But I guess this makes sense for translating in the web. However, when translating in an a tool we’ll follow a similar pattern as before correct? Which would not include forking the source repo.
This might be tool specific but perhaps some distinction should be made between the different ways of starting a translation.

verification
For verifying content you should be able to look up the release on DCS and get it’s SHA. Then compare it with your SB content. So any time you want to verify some content you’ll need to have an internet connection.

signing
For signing it should be sufficient to sign just the commit hash of the release.

Rather than storing private keys on DCS we can just use the public keys that already exists on profiles. Then we can let organizations generate signatures on their own (see above) and attach them to the release in some fashion. This could be done on the UI or via an API. The signatures can then be verified with the public key already on file with DCS, and content consumers can do the same.

jag3773 · October 17, 2019, 2:28pm

A few responses here:

It’s actually not any different, by hard fork I mean to indicate this is not a normal DCS/Github soft fork. In other words, it’s a copy with no intent of ever being merged back upstream since it’s a translation.

Good point, this will work great as long as users can figure this process out, which isn’t very user friendly.

What I mean by generic is that no one actually is the owner. So think of a public domain translation, where does that get housed?

Yes, I wasn’t going into the detail of creating a “release” but that would be included in the workflow. And yes, uploaded formats to the release would be a handy way to host and distribute them.

For additional formats, you are right that Scripture Burrito needs to think through how to do that, I’ll put that on our list of things to consider.

jag3773 · November 27, 2019, 8:47pm

FYI, I edited the main post and

modified the API section, including a new graph
added the “Implementation Plan (Rough)” section at the end

joel · November 28, 2019, 6:30am

You mention getting rid of the lambda functions, which will obviously halt any further updates to the legacy APIs. Is there concern for legacy software that is currently being used? I’d personally love to see everyone upgrade, but realistically this may take a very long time. Also, what will the upgrade path look like for existing tools? e.g. with the future of tS uncertain, would users be left with a stale api until tS can be replaced with newer tools?

Perhaps we should keep the old APIs around with frozen content until we can provide an upgrade/migration path for existing tools.

jag3773 · December 2, 2019, 3:47pm

I see two possibilities here, the one you mentioned, freeze the endpoints and they get the content that is there unless the developers upgrade. Or, option 2, is to convert the lambdas into docker containers and continue running them. The lambda setup has turned out to be not so great in lots of ways for this use case, so if we are going to keep processing the old catalogs we may be better off to convert them.