OCR Tibetan: How to turn scanned Tibetan and English images into searchable text

What is OCR?

OCR is Optical Character Recognition — turning an image of text into editable and searchable text. More technically, it is the process of recognizing an image of text and converting it into encoded text, which is a series of character codes representing the letters, symbols, or ligatures of the text. I sometimes talk about this as the process of turning an image of text into Unicode.

Unicode is a modern character encoding standard that was introduced and became widespread in the mid-90s. It was a major breakthrough for international computer users because it standardized the encoding of text for almost every written language in popular use in the world (including Tibetan and Sanskrit). It did this by leveraging the greatly increased capacity of modern computers to expand the space used to encode each character from 7 or 8 bits to 16 or 32 bits. 7-bit ASCII can encode 128 possible characters. Unicode defines a total of 1,114,112 possible code points.

Many Unicode characters are built from multiple code points, so that doesn’t necessarily map to 1,114,112 characters. Currently, only about 20% of those code points are assigned, and that is enough to cover all of the characters required to write most of the world’s written languages, including ideographic scripts like Chinese and Japanese, as well as the ever-expanding set of emojis, symbols, and control characters.
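The code-point arithmetic above can be checked directly in Python, using only the standard library:

```python
# A quick check of the numbers above. U+0F40 is the Tibetan letter KA;
# the Unicode code space runs from U+0000 to U+10FFFF, which gives
# 0x10FFFF + 1 = 1,114,112 code points.
import unicodedata

ka = chr(0x0F40)                    # ཀ
print(unicodedata.name(ka))         # TIBETAN LETTER KA
print(0x10FFFF + 1)                 # 1114112

# One visible "stack" can take multiple code points: KA plus subjoined RA.
kra = chr(0x0F40) + chr(0x0FB2)     # renders as the single stack ཀྲ
print(len(kra))                     # 2 code points
```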

If you’ve ever tried to copy and paste text from an old PDF file and been frustrated when the pasted text comes out as gibberish, this is probably because that PDF was made before Unicode and the Tibetan text had to be encoded in a custom, Tibetan-specific font. Unicode fixed this and standardized text encoding for most written languages in the world using a much larger encoding size per character. (You might still have problems copying and pasting from modern PDFs, but that is often because the author of the PDF has intentionally scrambled the encoding to block copying and pasting).

Once text has been converted into Unicode, we can edit it, copy and paste it, and search it. How do we do this?

OCR for English has been pretty good for years now. Tibetan OCR, however, lagged behind. Not anymore. One of the first projects I became aware of that achieved usable OCR for Tibetan was the Namsel project. This project is now abandoned in favor of more modern, and more accurate, projects. However, I wanted to mention it because it was public, open-source, and was used to OCR a lot of texts in databases previously. Unfortunately, it was also error-prone and tricky to use.

OCR Engines: Google Cloud Vision and Tesseract

Now, there are some great new options for OCR of Tibetan text. I don’t pretend this is an exhaustive list (if anyone reading this knows of other options, let me know). The two main options I’ve used are:

  • Google Drive with Google Cloud Vision
  • OCRmyPDF with Tesseract

Google Cloud Vision (also known as Vision AI) and Tesseract are OCR engines. Google Cloud Vision is Google’s proprietary, commercial OCR engine. Tesseract is an open-source OCR engine that was created initially by HP, which open-sourced it in 2005, and then developed for several years by Google. These engines are the underlying AI-powered programs that take raw image data and return structured text and position information. They typically are not very user-friendly and require front-ends to be useful to the average person.

Both Tesseract and Google Cloud Vision are mature technologies and impressively accurate. You can read a comparison here. Google Cloud Vision might be a little more accurate, and Tesseract might be faster. But both of these tools are great and are under active development. The entire field of AI is currently experiencing tectonic shifts, so I expect the performance of these tools will improve even more in the next few years.

End-User Applications

The two applications I use to actually perform OCR on scanned images are:

  • Google Docs / Drive
  • OCRmyPDF

Using Google Drive for OCR is very simple. You can read Google’s instructions here. All you have to do is upload a scanned image to your Google Drive account. Right-click on the file and select Open With > Google Docs.

In the image below, you see an example of me doing this from a scan of a page from the bka’ ’gyur dpe bsdur ma (a comparative version of the Kangyur).

After a few seconds, the text from the file will appear below the image.

As you can see above, Google Docs does not OCR in place, overlaying the text on the image. All of the detected text from the file is found below the image. This is a great way to extract the text from a file.

That file above was pretty simple. It was a nice, clean photo of only Tibetan text arranged in simple lines and paragraphs. Let’s see how Google does with something a little more complex.

The image below is from Paul Hackett’s Tibetan Verb Lexicon. It’s a little blurry. It is distorted because of the page bending. It also has a complex structure that mixes Tibetan, English, Sanskrit, and some fancy symbols and bracketed abbreviations.

This is the output from Google Docs.

Obviously, it’s not copying any of the page layout. There are some errors and plenty of wacky formatting. But, really, on the whole, it did a pretty incredible job of detecting the languages and recognizing a mixed set of Tibetan and English. It pretty much ignores the Sanskrit diacritics (oh, well, maybe next year).

The main limitations of using Google Docs are (1) that it can only OCR a single page at a time and (2) that it can’t, for example, OCR a pdf and make it searchable in place; it just spits out the text it detects in the page.

Turning an entire book into text using this method would not be feasible. There are programmatic interfaces to Google Cloud Vision that could be used to OCR a lot of scanned pages in a batch, but that’s beyond the scope of this tutorial.
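For the curious, here is a rough sketch of what a programmatic approach might look like, using the official google-cloud-vision Python client. The helper names (batch_pages, ocr_page) and the batch size are my own illustrations, not part of Google's API, and ocr_page requires `pip install google-cloud-vision` plus Google Cloud credentials to actually run.

```python
# Hypothetical sketch of batch OCR with the google-cloud-vision client.
# batch_pages() and ocr_page() are illustrative names, not part of the API.
from pathlib import Path

def batch_pages(paths, batch_size=16):
    """Group page-image paths into fixed-size batches for submission."""
    return [paths[i:i + batch_size] for i in range(0, len(paths), batch_size)]

def ocr_page(path):
    """OCR a single page image and return the detected text.

    Requires `pip install google-cloud-vision` and GCP credentials.
    """
    from google.cloud import vision
    client = vision.ImageAnnotatorClient()
    image = vision.Image(content=Path(path).read_bytes())
    response = client.document_text_detection(image=image)
    return response.full_text_annotation.text

print(batch_pages([f"page-{n}.png" for n in range(40)])[0][:3])
# ['page-0.png', 'page-1.png', 'page-2.png']
```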

The Google Docs documentation lists a few requirements for using this service.

  • Format: You can convert PDFs (multipage documents) or photo files (.jpeg, .png and .gif)
  • File size: The file should be 2 MB or smaller.
  • Resolution: Text should be at least 10 pixels high.
  • Orientation: Documents must be right-side up. If your image faces the wrong way, rotate it before you upload it to Google Drive.
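If you want to check files against these limits before uploading, the format and size rules are easy to test in a few lines of Python. This is just a convenience sketch; the helper name is my own, and the resolution and orientation checks are omitted since they would need an image library.

```python
# Hypothetical pre-flight check against the Google Drive OCR limits listed
# above (accepted formats, 2 MB size cap). The function name is my own.
from pathlib import Path

ALLOWED = {'.pdf', '.jpeg', '.jpg', '.png', '.gif'}
MAX_BYTES = 2 * 1024 * 1024  # 2 MB

def drive_ocr_ready(path):
    """Return True if the file's extension and size meet Google's limits."""
    p = Path(path)
    return p.suffix.lower() in ALLOWED and p.stat().st_size <= MAX_BYTES
```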

OCRmyPDF – turn a scanned PDF into a searchable PDF

When I first started learning Tibetan — at least after the stage when everything in Tibetan looked like incomprehensible Martian and I began to recognize some characters and learn a few words — I was frequently frustrated by the inability to search our main textbook, Joe Wilson’s fine tome on classical Tibetan grammar, Translating Buddhism from Tibetan.

I had questions such as:

  • Where was that section on the non-case uses of the agentive particles?
  • Where did he explain the use of auxiliary verbs?
  • Where all in the book does he mention ཅིག་?

Sadly, I had to search the book like it was 1985 — with my fingers turning pages. Until recently!

OCRmyPDF is a cross-platform application that uses the Tesseract OCR engine to turn a pdf of scanned images into a searchable pdf. It does this by recognizing the characters in the page and inserting an invisible layer in the page with the recognized text.

From the introduction in their documentation (which is also a nice introduction to OCR in general and to the challenges of performing OCR on PDF files):

OCRmyPDF is an application and library that adds text “layers” to images in PDFs, making scanned image PDFs searchable. It uses OCR to guess what text is contained in images. It is written in Python. OCRmyPDF analyzes each page of a PDF to determine the colorspace and resolution (DPI) needed to capture all of the information on that page without losing content. It uses Ghostscript to rasterize the page, and then performs OCR on the rasterized image to create an OCR “layer”. The layer is then grafted back onto the original PDF.

The simplest use case with OCRmyPDF is with a PDF document that is entirely scanned, essentially a bunch of images. This is what you get if you scan a book to a pdf. Things get a little more complicated if the pdf contains both text and image data, or already has some OCR data in it.

Making Wilson Searchable

Imagine you have a file called wilson.pdf that is a scanned copy of your Wilson textbook. First you need to install OCRmyPDF according to the instructions on their website. You may also need to install language packs for Tibetan or any other non-English languages.

OCRmyPDF is not a graphical program. It’s a command line program, so you have to be able to run basic commands in a shell. How you do this will differ on each platform.

The command to create a searchable PDF for Wilson would be:

ocrmypdf -l eng+bod --clean /path/to/wilson.pdf /path/to/wilson-searchable.pdf

The -l parameter is used to tell the program what languages to expect. This command specifies English and Tibetan. The codes are ISO 639-2 three-letter codes. Here is a list of supported languages and ISO codes.

The --clean parameter tells OCRmyPDF to perform some basic cleaning of the images in the PDF before attempting to recognize the characters. The cleaned image is used only for OCR; the images in the output PDF are not altered (there is a separate --clean-final option for that). It and the other pre-processing options are described in the docs.

Other than that, it’s just the input file and the output file.

There are a lot of other options. OCRmyPDF can deskew pages (straighten crooked scans), rotate pages that face the wrong way, and perform image optimization and compression. It can also be used to simply OCR images and export plain text.
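A whole folder of scanned PDFs can also be processed with a simple shell loop. The sketch below is a dry run that only prints each command; remove the echo to actually run it. The directory names scans/ and searchable/ are my own placeholders.

```shell
# Dry run: print the ocrmypdf command for every PDF in scans/.
# Remove "echo" to actually execute the commands.
mkdir -p searchable
for f in scans/*.pdf; do
  [ -e "$f" ] || continue   # skip if the glob matched nothing
  echo ocrmypdf -l eng+bod --clean "$f" "searchable/$(basename "$f")"
done
```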

Case 2: Mixed English Text and Non-Unicode Tibetan

Another common scenario with Tibetan files is a document that mixes English text with Tibetan encoded in non-Unicode fonts. If you try to copy and paste the Tibetan from the document and end up with something like ;#]-az#-zd^Xr-d-vn-R^c-az#-N´ç! when you paste, the text has been encoded in an older, non-Unicode format. This text is also not searchable.
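One quick way to tell the two apart programmatically: properly encoded Tibetan falls in the Unicode block U+0F00–U+0FFF, while legacy-font "Tibetan" pastes as ordinary Latin characters and punctuation. A minimal sketch (the helper name is my own):

```python
# Check whether pasted text contains real Tibetan Unicode (block U+0F00-U+0FFF).
# Legacy-font "Tibetan" comes through as plain ASCII and fails this check.
def has_tibetan_unicode(text):
    return any('\u0F00' <= ch <= '\u0FFF' for ch in text)

print(has_tibetan_unicode('བཀྲ་ཤིས་བདེ་ལེགས།'))   # properly encoded -> True
print(has_tibetan_unicode(';#]-az#-zd^Xr-d-vn'))  # legacy-font gibberish -> False
```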

In this example we’ll work with Elizabeth Napper’s Tibetan definitions file, compiled by Paul Hackett, titled Basic Buddhist Terms and Concepts: A Student’s Guide for the Study of Tibetan Buddhism. It is a collection of terms and definitions from Collected Topics, provided freely by UMA Tibet and Jeffrey Hopkins. The most recent edition I know of was published in 2003.

You can download the original file on the Tibetan wiki. Or here.

This file has English that is searchable and copy-and-paste-able. The Tibetan, however, is non-Unicode. Let’s bring it kicking and screaming out of the halcyon days of the 80s into the comic-dystopian-AI-generated-and-properly-encoded modern era!

If you just try to use the same command I showed you above, you’ll get an error that says `PriorOcrFoundError: page already has text!`

By default, OCRmyPDF will not perform OCR on any pages that already have text or OCR data. Because this document is a text document, it skips the entire file. You have to use either --redo-ocr or --force-ocr. You can read the docs on these two options here.

The command I used to make this file searchable is the following:

ocrmypdf -l eng+bod --force-ocr ~/Desktop/napper-definitions.pdf ~/Desktop/napper-definitions-searchable.pdf

The only new parameter is the --force-ocr flag. This tells OCRmyPDF to rasterize every page (that is, turn the encoded font characters into an image, as if one had taken a scan of the virtual page) before performing OCR. It’s sort of like printing the document, scanning it, and then performing OCR on the scan, but it’s all done electronically inside the program.

This sidesteps the problem with the medieval text encoding of the retro pdf. The drawback is that you end up with a pdf full of image files instead of text font data. This creates two problems: (1) a much larger file size, and (2) reduced resolution of the file.

The output file is 7.42× larger: 450 KB vs. 3.3 MB. Not such a big deal with this file, but with a larger file that increase in size can be an issue.

You can download the output file I generated here.

To see the effect of rasterization (turning text into an image of text) look at these two photos. The first shows the original on the left and the rasterized, output file on the right. Zoomed in, the decrease in resolution is noticeable but not terrible.

Just to get a better sense of it, here they are at a higher zoom level. The pixelation in the output file is clearly visible.

Of course, there isn’t generally a reason to zoom this far into a document like this. In practice, at normal reading levels of zoom, it’s actually essentially unnoticeable.

This works pretty well for small files. However, I did the same thing for Thubten Jinpa’s Tibetan grammar text, which, despite being a new publication, is encoded strangely and in a way that cannot be searched. The file size increased from 8 megabytes to 183 megabytes. That’s a HUGE jump!

Conclusion

That’s it. Let me know if you have any comments or questions, or if you find any other useful OCR tools for Tibetan.

About the author

Andrew Hughes
