Hi,
When I run clean-corpus-n.perl with a maximum length of 1000 on a dataset
of 14k lines (tourism corpus), I get only 2.5k lines in the cleaned corpus.
I see that, in addition to removing blank lines and lines longer than
1000 (max) words, the script removes lines that violate the 9-1 sentence
ratio of GIZA. I don
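
To see which of the three rules mentioned above drops the most lines, a rough
sketch of the filtering logic may help. This is not the code of
clean-corpus-n.perl; the function name keep_pair and its parameters are
hypothetical, and the check assumes whitespace tokenization and a ratio
limit of 9 on word counts.

```python
def keep_pair(src, tgt, min_len=1, max_len=1000, max_ratio=9.0):
    """Return True if a sentence pair passes all three filters (illustrative only)."""
    n_src = len(src.split())  # word count on the source side
    n_tgt = len(tgt.split())  # word count on the target side
    # Rule 1: drop pairs where either side is blank (or shorter than min_len).
    if n_src < min_len or n_tgt < min_len:
        return False
    # Rule 2: drop pairs where either side is longer than max_len words.
    if n_src > max_len or n_tgt > max_len:
        return False
    # Rule 3: drop pairs whose length ratio exceeds max_ratio in either direction.
    if n_src > max_ratio * n_tgt or n_tgt > max_ratio * n_src:
        return False
    return True
```

Counting how many pairs fail each rule separately on the 14k-line corpus would
show whether the ratio check or something else (for example misaligned lines)
accounts for most of the loss.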
Steven Huang writes:
>
> It seems that the XML is not correctly parsed and is taken as plain text.
> Is there anything wrong with my training configuration or training corpus?
> Thanks a lot.
Hi Steven,
The Moses XML format isn't pure XML and still cares about whitespace. Each
sentence should be
** apologies for cross-posting **
Call for Papers: 20th International Conference on Application of Natural
Language to Information Systems (NLDB'15)
Conference website: http://nldb2015.org/
NLDB 2015 invites researchers from academia and industry to submit
papers for oral or poster pres