OCR Pipeline reference¶
The OCR Pipeline provides an interface for OCRing scanned texts in the BDRC library.
Given a BDRC Scan ID, the OCR Pipeline:
- Retrieves the scans that make up the text from the BDRC library.
- OCRs them with Google Cloud Vision.
- Converts the results into the OPF format using the OpenPecha toolkit.
- Creates a new repo on OpenPecha Data's GitHub.
- Puts the OPF files for the text into the new repo.
Note: Using the OCR Pipeline requires a Google Cloud Vision service account key. Learn how to get one here.
Adding your email prompts your browser to save your email and key in your browser settings so you don't have to reenter them every time you use this tool.
Google Cloud Service JSON key file¶
This is contained in the .json file that Google Cloud provides as a key for its Cloud Vision service.
Open the file in a text editor and copy the JSON code into this field.
Note: If you don't have a key or you need help getting one, read this guide.
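Before pasting the key, it can help to check that the text is the contents of the key file and not its file name. The sketch below is only an illustration and is not part of the OCR Pipeline: `check_key_text` is a hypothetical helper, and the field list is an assumption based on the fields a Google Cloud service account key file normally contains.

```python
import json

# Assumption: these fields appear in a Google Cloud service account key file.
REQUIRED_FIELDS = ("type", "project_id", "private_key", "client_email")


def check_key_text(text):
    """Return a list of problems found in the pasted key text (empty if none)."""
    try:
        key = json.loads(text)
    except json.JSONDecodeError:
        # A file name such as "my-key.json" is not valid JSON and ends up here.
        return ["not valid JSON (was a file name pasted instead of the file contents?)"]
    if not isinstance(key, dict):
        return ["JSON is valid but is not an object"]
    # Report every required field that is absent from the key.
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in key]
    if key.get("type") not in (None, "service_account"):
        problems.append("'type' is not 'service_account'")
    return problems
```

An empty list means the text at least looks like a service account key. It does not confirm that the key is still valid on Google Cloud.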
In this field, you can name the batch that you are scanning.
The OCR Pipeline currently only supports OCRing images in the BDRC library. Texts are retrieved using their BDRC Scan ID.
Example of a BDRC scan ID:
The scan ID follows bdr:. In this case, the ID is
Multiple scans can be OCRd in one batch. Add one BDRC Scan ID per line.
Warning: BDRC Work IDs and Version IDs aren't supported. If you enter one, the OCR job will fail.
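The input rules above (one ID per line, with the ID following `bdr:`) can be sketched as a small pre-check to run before submitting a batch. This is a minimal illustration, not the pipeline's own code: `parse_scan_ids` and `suspect_ids` are hypothetical helpers, and the `WA` and `MW` prefixes used to flag likely Work and Version IDs are an assumption about BDRC's naming scheme that you should verify against the BDRC library.

```python
def parse_scan_ids(text):
    """Split the input field into IDs: one per line, with any `bdr:` prefix removed."""
    ids = []
    for line in text.splitlines():
        value = line.strip()
        if not value:
            continue  # skip blank lines
        if value.startswith("bdr:"):
            value = value[len("bdr:"):]
        if value not in ids:
            ids.append(value)  # keep the first occurrence and preserve order
    return ids


# Assumption (not stated in this reference): Work IDs start with "WA"
# and Version IDs start with "MW".
SUSPECT_PREFIXES = ("WA", "MW")


def suspect_ids(ids):
    """Return the IDs that look like Work or Version IDs rather than Scan IDs."""
    return [i for i in ids if i.startswith(SUSPECT_PREFIXES)]
```

For example, `suspect_ids(parse_scan_ids(field_text))` returns the entries worth double-checking before you start the job, where `field_text` is the text you plan to paste into the input field.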
The OCR Pipeline currently only supports Google Cloud Vision.
These model types are accessible by the OCR Pipeline.
builtin/weekly seems to produce the best results, but this needs more testing. Feel free to experiment.
builtin/stable doesn't currently work for Tibetan.
The OCR Pipeline can use these language hints to improve results:
Auto seems to produce the best results, but this needs more testing. Feel free to experiment.
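To show where the model type and language hint end up, here is a rough sketch of a request body for the Google Cloud Vision `images:annotate` REST endpoint. It is an illustration under stated assumptions, not the pipeline's actual code: `build_request` is a hypothetical helper, the use of `DOCUMENT_TEXT_DETECTION` is assumed, and so is the mapping of Auto to a request with no language hints.

```python
import base64
import json


def build_request(image_bytes, model="builtin/weekly", language_hints=None):
    """Build the JSON body for one image in a Cloud Vision images:annotate call."""
    request = {
        # The REST API expects the image content as base64 text.
        "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
        # Assumption: the pipeline asks for document text detection.
        "features": [{"type": "DOCUMENT_TEXT_DETECTION", "model": model}],
    }
    if language_hints:
        # Assumption: "Auto" corresponds to sending no hints at all.
        request["imageContext"] = {"languageHints": list(language_hints)}
    return {"requests": [request]}


# Example: one image with a Tibetan ("bo") hint, serialized for sending.
body = json.dumps(build_request(b"<image bytes>", language_hints=["bo"]))
```

Leaving `language_hints` empty lets Cloud Vision detect the language itself, which matches the observation above that Auto tends to work well.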
This could be your name, your organization's name, or the name of the person who bought the Google Cloud credit used to OCR the text(s) in this job. The name that is entered gets added to the OPF metadata.
Allow BDRC and OpenPecha to use the results to improve this service¶
If you tick this box, the results are put in a public OpenPecha Data repository on GitHub and you agree to allow BDRC and OpenPecha to use the resulting data.
If you don't agree, the file will be put in a private repo on OpenPecha Data's GitHub. In this case, after your job is successfully completed, email us at openpecha[at]gmail.com for access.
The right side of the OCR Pipeline interface contains a list of recent batches of files that have been processed. Select Details next to your batch to see its progress and results.
Here you can:
- Select the link under Result to go to the repo(s) that contain(s) the OCRd file(s).
- Toggle the chevron next to Inputs to see the list of files that were OCRd.
- Toggle the chevron next to Pipeline Config to see the language hint, model type, and OCR engine that were used.
- Select Details under Actions to see more metadata about the batch.
FileNotFoundError: The supplied ID(s) weren't found. This could be because the supplied ID(s) were BDRC Work IDs or Version IDs.
Solution: Find the Scan ID for the text(s) you'd like to OCR and try again.
AttributeError: 'str' object has no attribute 'keys': The provided key wasn't in the correct format. This could be because you entered the name of the key file instead of the contents of the file.
Solution: See the directions above for entering your Google Cloud Service JSON key file.
GoogleVisionCredentialsError: The supplied key is correctly formatted, but may have expired.
Solution: Regenerate a key on Google Cloud Vision and try again.
Processing an OCR job may take several minutes or more, depending on the number of images that are scanned.