Featured datasets

  • Open Parallel Corpus

    This corpus is a continually updated, growing collection of multilingual texts aligned with Tibetan (bo) texts at the sentence level. It is intended for training machine translation (MT) models.

    Get it on GitHub

  • Vulgate Kangyur

    This Kangyur was created with OpenPecha's Vulgate Generator, which compares several instances of a work and compiles a new version by taking the most common character at each position in the work.

    Get it on GitHub
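
The character-by-character voting described for the Vulgate Kangyur can be illustrated with a minimal sketch. This is not OpenPecha's actual code: the function name `majority_vote` is hypothetical, and the sketch assumes the instances have already been aligned to equal length, which real editions would need a separate alignment step to achieve. Ties are resolved in favor of the earliest instance, which is also an assumption.

```python
from collections import Counter


def majority_vote(witnesses: list[str]) -> str:
    """Build a text from the most common character at each position.

    Hypothetical helper: assumes every witness is pre-aligned, so that
    position i refers to the same place in the work in all of them.
    """
    if not witnesses:
        return ""
    length = len(witnesses[0])
    if any(len(w) != length for w in witnesses):
        raise ValueError("witnesses must be aligned to the same length")

    result = []
    # zip(*witnesses) yields one tuple of characters per position.
    for chars in zip(*witnesses):
        # most_common(1) returns the top character; on a tie, the one
        # seen first (i.e. from the earliest witness) is chosen.
        result.append(Counter(chars).most_common(1)[0][0])
    return "".join(result)


# Three toy instances that differ in one character.
editions = ["bkra shis", "bkra shes", "bkra shis"]
print(majority_vote(editions))
```

In this toy input two of the three instances agree at the differing position, so the majority reading is kept. Note that Python strings are sequences of Unicode code points, so for Tibetan script a position here is a code point rather than a full stacked syllable.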