joshua-decoder/indian-parallel-corpora

Language: OCaml

git: https://github.com/joshua-decoder/indian-parallel-corpora

README.md

Parallel corpora for 6 Indian languages created on Mechanical Turk

This directory contains data sets for Bengali, Hindi, Malayalam,
Tamil, Telugu and Urdu. Each data set was created by taking around
100 Indian-language Wikipedia pages and obtaining four independent
translations of each of the sentences in those documents. The
procedure used to create them, along with descriptions of initial
experiments, is described in:

"Constructing Parallel Corpora for Six Indian Languages via
Crowdsourcing".
Matt Post, Chris Callison-Burch, and Miles Osborne
Proceedings of the NAACL Workshop for Statistical Machine
Translation (WMT). 2012.

The PDF and BibTeX files are in the doc/ directory.

The corpora are organized into directories by language pairs:

bn-en/      Bengali-English
hi-en/      Hindi-English
ml-en/      Malayalam-English
ta-en/      Tamil-English
te-en/      Telugu-English
ur-en/      Urdu-English

Within each directory, you'll find the following files:

PAIR/
PAIR.metadata
dict.PAIR.{LANG,en}
training.PAIR.{LANG,en,seg_ids}
dev.PAIR.{LANG,en.{0,1,2,3},seg_ids}
devtest.PAIR.{LANG,en.{0,1,2,3},seg_ids}
test.PAIR.{LANG,en.{0,1,2,3},seg_ids}
votes.LANG
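The training files are line-aligned across languages, so sentence pairs can be read by zipping the two sides. A minimal sketch (the `bn-en` paths in the usage comment are illustrative):

```python
def read_parallel(src_path, tgt_path):
    """Yield (source, target) sentence pairs from two line-aligned files:
    line i of the source file pairs with line i of the target file."""
    with open(src_path, encoding="utf-8") as src, \
         open(tgt_path, encoding="utf-8") as tgt:
        for src_line, tgt_line in zip(src, tgt):
            yield src_line.rstrip("\n"), tgt_line.rstrip("\n")

# Illustrative usage, following the bn-en layout described above:
# for bn, en in read_parallel("bn-en/training.bn-en.bn",
#                             "bn-en/training.bn-en.en"):
#     ...
```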

The metadata file is organized into rows with four columns each. The
rows correspond to the original documents that were translated, and
the columns denote (1) the (internal) segment ID assigned to the
document, (2) the document's original title, (3) a translation of the
title, and (4) the manual category we assigned to the document.

The data splits were constructed by manually assigning the documents
to one of eight categories (Technology, Sex, Language and Culture,
Religion, Places, People, Events, and Things), and then selecting
about 10% of the documents in each category for dev, devtest, and test
data (that is, roughly 30% of the data), with the remainder used as
training data. Corresponding to each split is a file containing the
segment ID of each sentence. The segment ID identifies the original
document ID and the sentence number within that document. The metadata
file in each directory maps document IDs to Wikipedia page names,
their corresponding English translations, and the manual
categorization.
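As a sketch of reading the metadata, assuming the four columns are tab-separated (the delimiter is not stated above, so adjust if the files differ):

```python
def read_metadata(path):
    """Parse a PAIR.metadata file into a dict keyed by document ID.
    The four columns (doc_id, original_title, translated_title,
    category) are assumed to be tab-separated."""
    docs = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            doc_id, orig_title, trans_title, category = \
                line.rstrip("\n").split("\t")
            docs[doc_id] = {
                "original_title": orig_title,
                "translated_title": trans_title,
                "category": category,
            }
    return docs
```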

The dictionaries were created in a separate MTurk job. We suggest
that you append them to the end of your training data when you train
the translation model (as was done in the paper).
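Appending the dictionary is plain file concatenation, applied to both sides of the language pair so the files stay line-aligned. A minimal sketch with illustrative paths:

```python
def append_dictionary(training_path, dict_path, out_path):
    """Write training data followed by dictionary entries to a new
    file. Run once per side of the pair, e.g. for both
    training.bn-en.bn + dict.bn-en.bn and training.bn-en.en +
    dict.bn-en.en, so the two sides remain aligned."""
    with open(out_path, "w", encoding="utf-8") as out:
        for path in (training_path, dict_path):
            with open(path, encoding="utf-8") as f:
                out.write(f.read())
```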

The votes files contain the results from a separate MTurk task wherein
new Turkers were asked to vote on which of the four translations of a
given sentence was the best. We have such information for all languages
except Malayalam. The format of the votes file is:

seg_id num_votes sentence votes [sentence votes ...]
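A sketch of parsing one votes line, assuming the fields are tab-separated (the candidate sentences themselves contain spaces, so a plain whitespace split would be ambiguous):

```python
def parse_votes_line(line):
    """Parse one line of a votes file: seg_id, num_votes, then
    alternating (sentence, votes) pairs, assumed tab-separated."""
    fields = line.rstrip("\n").split("\t")
    seg_id, num_votes = fields[0], int(fields[1])
    candidates = [(fields[i], int(fields[i + 1]))
                  for i in range(2, len(fields), 2)]
    return seg_id, num_votes, candidates
```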

Since the data was created by non-expert translators hired over
Mechanical Turk, it's of mixed quality. We are currently researching
ways of improving the quality of the translations that we solicit in
this way. However, this data should be enough to get you started
training models. You can download it here:

http://joshua-decoder.org/indian-parallel-corpora/

In addition, there are some scripts in the scripts/ directory that
manipulate the data in various ways.