Intro

OpenPecha Data is a collection of 14,000 repositories (and growing), each containing free, open-source Tibetan text files in the OpenPecha Format (OPF) and, in some cases, aligned translations.

Most repos contain individual texts, and some contain collections. These collections include corpuses, such as those created to train translation models, and collections of texts, such as various editions of the Kangyur and Tengyur.

Developers use OpenPecha Data to make corpuses, train large language models, and create Tibetan AI. Publishers use it to create e-texts. Academics use it for data-driven research.

  • Download a featured dataset


    Get the latest OpenPecha datasets to train Tibetan-language AI models.

    Featured datasets

  • Get to know the OPF Format


    Learn how the OpenPecha Format is structured and how it works.

    The OPF Format

  • Understand OpenPecha Data on GitHub


    Get up to speed on how OpenPecha Data is organized on GitHub.

    OpenPecha Data on GitHub