Intro

OpenPecha Data is a growing collection of 14,000 repositories, each containing free, open-source Tibetan text files in the OpenPecha Format (OPF) and, in some cases, aligned translations.

Most repos contain individual texts; some contain collections. These collections include corpuses, such as those created to train translation models, and collections of texts, such as various editions of the Kangyur and Tengyur.

Developers use OpenPecha Data to make corpuses, train large language models, and create Tibetan AI. Publishers use it to create e-texts. Academics use it for data-driven research.

Download a featured dataset

Get the latest OpenPecha datasets to train Tibetan-language AI models.

Featured datasets
Get to know the OPF Format

Learn how the OpenPecha Format is structured and how it works.

The OPF Format
Understand OpenPecha Data on GitHub

Get up to speed on how OpenPecha Data is organized on GitHub.

OpenPecha Data on GitHub