Dear all,
In case one would like a good excuse to visit Paris March 2-3 2018,
there will be a workshop on OpenNMT.
Here is the registration website.
http://workshop-paris-2018.opennmt.net/
Cheers,
Vincent
___
Moses-support mailing list
nano also gives the "right" number, 270769, but I have a script which
finds a difference.
On 14/09/2017 at 08:48, Vincent Nguyen wrote:
> okay really weird.
> wc gives me the same numbers as you, but gedit gives another 2 different
> numbers for each file. Must be special c
>> 270769 news-commentary-v12.de-en.de
>> 270769 news-commentary-v12.de-en.en
>> 541538 total
>
> What are you running that shows you different line numbers?
>
> cheers - Barry
>
> On 12/09/17 10:06, Vincent Nguyen wrote:
>> Hi,
>> Is there an
Hi,
Is there an updated version of NCv12 for this file?
http://data.statmt.org/wmt17/translation-task/training-parallel-nc-v12.tgz
The number of lines for de-en is not the same in the 2 languages.
Cheers,
Vincent
Hello team,
I have read many posts and it looks like most people tend to use the
Stanford segmenter.
Do you have good experience with other tools?
Also, what "detokenizer" do you actually use? It seems that it is not
just a question of removing spaces, especially when Chinese target
I think you mixed up input/output because in your example at the end, you
would like to get pronunciation of a given new word.
input is the left hand side and output is the pron.
If you are able to rework a little bit the right hand side of your data
(you need to stretch the phones one by
Hi Michael,
I am checking whether your tests on this subject were successful or not.
Can you follow up?
Thanks
___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
re de-duping, and before we
> didn't.
>
> I would say if you want to compare to recent WMT experiments, take the
> most recent version of the data,
>
> cheers - Barry
>
> On 04/10/16 21:34, Vincent Nguyen wrote:
>>
>> ok
>> this one http://www.statmt.org/wmt11/t
sed files?
>
> cheers - Barry
>
> On 04/10/16 14:40, Vincent Nguyen wrote:
>> Hi,
>>
>> on this link:
>>
>> http://www.statmt.org/wmt11/translation-task.html
>>
>> on the download section for monolingual data, there is :
>>
Hi,
on this link:
http://www.statmt.org/wmt11/translation-task.html
on the download section for monolingual data, there is :
one big file : http://www.statmt.org/wmt11/training-monolingual.tgz
And separate files, including news crawls per year.
However, when you take a single file for a
2016 at 9:57 AM, Vincent Nguyen <vngu...@neuf.fr
<mailto:vngu...@neuf.fr>> wrote:
Hi,
I have a basic question on EMS.
If I want no recasing and no truecasing, I just put IGNORE next to
the 2
sections.
However I have the feeling it does n
Hi,
I have a basic question on EMS.
If I want no recasing and no truecasing, I just put IGNORE next to the 2
sections.
However I have the feeling it does not eliminate this step for the
EVALUATION step, and there is no ignore within this one.
Is this the case ?
Thanks,
Vincent
First, many thanks for the huge work. It opens up some new language
possibilities not in Europarl.
I just made one test comparison :
Config 1:
Corpus UN v1.0
LM : UN V1.0 + News2014FR
DEV+TEST=Newsdiscuss2015
Nist=29.61
Config 2:
Corpus Europarl
LM : Europarl + News2014FR
SSD drive ? if not, then forget it.
try cat > NULL
On 10/04/2016 08:29, Jorg Tiedemann wrote:
Hi,
I have a large language model from the common crawl data set and it
takes forever to load when running moses.
My model is a trigram kenlm binarized with quantization, trie
structures and
of phrase tables and language models matter, too, but not
as much, and it seems that in your scenario you are just wondering
about splitting up a fixed pool of data.
-phi
On Wed, Apr 6, 2016 at 6:50 AM, Vincent Nguyen <vngu...@neuf.fr
<mailto:vngu...@neuf.fr>> wrote:
Hi,
Hi,
What is (in terms of performance) the difference between the 3
following solutions:
2 corpora, 2 LMs, 2 weights calculated at tuning time
2 corpora merged into one, 1 LM
2 corpora, 2 LMs interpolated into 1 LM with tuning
Will the results be different in the end?
thanks.
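One way to see why the first and third options can behave differently is the toy sketch below. It is only an illustration under stated assumptions: the probabilities and weights are hypothetical, and the two helper functions are made up for this example, not part of Moses or any LM toolkit. With two separate LMs, each log-probability gets its own weight in the log-linear model; with one interpolated LM, the probabilities are mixed first and the log is taken afterwards.

```python
import math

def log_linear_score(p_lm1, p_lm2, w1, w2):
    """Two separate LM features, each with its own tuned weight (toy version)."""
    return w1 * math.log(p_lm1) + w2 * math.log(p_lm2)

def linear_interpolation_score(p_lm1, p_lm2, lam):
    """One interpolated LM: mix the probabilities first, then take the log."""
    return math.log(lam * p_lm1 + (1.0 - lam) * p_lm2)

# Hypothetical probabilities for one word: likely under LM 1, very unlikely under LM 2.
p_lm1, p_lm2 = 0.1, 1e-6

print("two LM features   :", log_linear_score(p_lm1, p_lm2, 0.5, 0.5))
print("one interpolated  :", linear_interpolation_score(p_lm1, p_lm2, 0.5))
```

In this toy case the log-linear combination is dominated by the LM that dislikes the word, while the interpolated LM stays close to the LM that likes it, so the two setups need not give the same results.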
Apostrophes are tricky to handle properly.
The tokenizer is language sensitive (see the -l option).
In French:
l'été => l&apos; été [with a space between ; and é]
In English:
today's story => today &apos;s story
BUT
the issue is that sometimes in corpora you will find some misplaced spaces
before or after the
, 2016 at 2:58 PM, Vincent Nguyen <vngu...@neuf.fr
<mailto:vngu...@neuf.fr>> wrote:
Hello,
Does someone have some support to this (found in the doc) :
Maximum Phrase Length
The maximum length of phrases is limited to 7 words. The maximum
phrase
length impa
Hello,
Does anyone have supporting evidence for this (found in the doc):
Maximum Phrase Length
The maximum length of phrases is limited to 7 words. The maximum phrase
length impacts the size of the phrase translation table, so shorter
limits may be desirable, if phrase table size is an issue.
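As a rough illustration of why the limit matters for table size, the sketch below counts the contiguous source spans of a sentence that are no longer than the limit. This is only an upper bound on the candidates coming from one sentence, not the number of phrase pairs Moses actually extracts (that also depends on the word alignment); the function name is hypothetical.

```python
def count_spans(sentence_length, max_phrase_length):
    """Number of contiguous spans of at most max_phrase_length words."""
    longest = min(sentence_length, max_phrase_length)
    return sum(sentence_length - length + 1 for length in range(1, longest + 1))

# For a 20-word sentence, compare a limit of 3 with the limit of 7 quoted above.
for limit in (3, 5, 7):
    print(limit, count_spans(20, limit))
```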
Hi,
I have been fighting with some reordering issues.
I have tried both LM interpolation and OSM but with no luck.
Here is an example
Source English :
Canada remains very active within the Working Group, and our law
enforcement officials also participate in the Working Group’s informal
law
output of the decoder, and see how it is
changed by the detokenizer.
-phi
On Wed, Mar 9, 2016 at 11:44 AM, Vincent Nguyen <vngu...@neuf.fr
<mailto:vngu...@neuf.fr>> wrote:
Hi,
I got the following situation:
This group age
is translated sometimes in:
ce groupe
d see how it is
changed by the detokenizer.
-phi
On Wed, Mar 9, 2016 at 11:44 AM, Vincent Nguyen <vngu...@neuf.fr
<mailto:vngu...@neuf.fr>> wrote:
Hi,
I got the following situation:
This group age
is translated sometimes in:
ce groupe d'âge (correct)
Hi,
I got the following situation:
This group age
is translated sometimes in:
ce groupe d'âge (correct)
ce groupe d" âge (incorrect)
ce groupe d "âge (incorrect)
I am wondering if this is more a detokenizer issue or a corpus issue, or
both.
Technically in French, there shouldn't be any space
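If part of the problem is in the corpus, one option is to normalise the spacing around French elisions before training. The sketch below is an assumption-laden illustration, not the Moses detokenizer: the regular expression and the function name are made up here, it only covers the common elided forms (c', d', j', l', m', n', s', t', qu'), and it would need checking against real data before use.

```python
import re

# Elided French forms followed by an apostrophe with stray spaces around it.
_ELISION = re.compile(r"\b(qu|[cdjlmnst])\s*['’]\s*(?=\w)", re.IGNORECASE)

def fix_elision_spacing(text):
    """Remove spaces before/after the apostrophe of a French elision."""
    return _ELISION.sub(r"\1'", text)

print(fix_elision_spacing("ce groupe d ' âge"))
print(fix_elision_spacing("ce groupe d' âge"))
```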
Guys,
I have a question for the mathematicians that you all are :)
I have been working with and testing Moses as well as Groundhog for months now.
When I compare results (when comparability is possible, using the same
corpus, in-domain, blablabla, ...) I do not see much difference between the two
systems.
So
However I believe this is still not right for unigram sentences.
____
From: "Vincent Nguyen"
Date: 26 Feb 2016 22:21:59
To: moses-support@mit.edu <mailto:moses-support@mit.edu>
Subject: Re: [Moses-support] bleu-annotation / analysis.perl
owever be somewhat faster than only
a single thread.
On 17.02.2016 22:44, Vincent Nguyen wrote:
I have the feeling it's not.
Am I correct in saying that when the sentence length is less than or equal to 4
tokens, the BLEU score should be 1 for exact matches and 0 when there is no
exact match?
(by definition of http://www1.cs.columbia.edu/nlp/sgd/bleu.pdf)
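To check this intuition, here is a minimal sketch of unsmoothed sentence-level BLEU with n-grams up to 4. It is not the code of analysis.perl or of the NIST script; in particular, returning 0 when a sentence has no 4-grams at all (fewer than 4 tokens) is a choice made for this sketch, and real implementations may handle that case differently.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(hypothesis, reference, max_n=4):
    """Unsmoothed BLEU for one hypothesis against one reference."""
    hyp, ref = hypothesis.split(), reference.split()
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        total = sum(hyp_counts.values())
        matched = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        # Sketch convention: no n-grams or no matches gives a score of 0.
        if total == 0 or matched == 0:
            return 0.0
        log_precision_sum += math.log(matched / total)
    brevity = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return brevity * math.exp(log_precision_sum / max_n)

print(sentence_bleu("a b c d", "a b c d"))  # exact match, 4 tokens
print(sentence_bleu("a b c e", "a b c d"))  # one token differs
print(sentence_bleu("a b c", "a b c"))      # exact match, 3 tokens
```

With exactly 4 tokens the only 4-gram is the whole sentence, so the score is 1 for an exact match and 0 otherwise. With fewer than 4 tokens there is no 4-gram to match, so the outcome depends on how the implementation treats the undefined precision.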
On 26/02/2016 10:02, Vincent Nguyen wrote:
> Hi,
>
> I w
Hi,
I would like to understand better the analysis.perl script that
generates the bleu-annotation file.
Is there an easy way to get the uncased bleu score of each line instead
of the cased calculation ?
Am I right that this script recomputes its own BLEU score without calling
the Nist-Bleu nor
in Junczys-Dowmunt wrote:
>> It is, just not very well done. It generally does not make sense to have
>> more than 8-10 threads. That should however be somewhat faster than only
>> a single thread.
>>
>> On 17.02.2016 22:44, V
I have the feeling it's not.
did you add -exec at the end (behind -continue 1) ?
On 08/01/2016 18:16, Nicholas Ruiz wrote:
> Thanks, Tomasz. Unfortunately modifying the config file in the steps
> directory didn't work for me. My block looks something like this:
>
> [EVALUATION:test4]
>
> tokenized-input =
This is fine for tuning. If you want to make it quicker, cut it down to
1000 sentences.
On 28/12/2015 16:37, Read, James C wrote:
Hi,
I'm setting up some Moses baseline systems for various language pairs
to compare the systems against my own work. I've largely been
following the
You managed to install it, so now you will need to make a little effort to learn
the basics by yourself.
Here is the starting point:
http://www.statmt.org/moses/?n=Moses.Baseline
On 10/12/2015 19:03, Shaimaa Marzouk wrote:
> Dear support team,
>
> I would be extremely grateful, if you could help me with
ively
> using across Windows and Posix systems.
>
> Tom
>
>
> On 12/5/2015 6:13 AM, moses-support-requ...@mit.edu wrote:
>> Date: Fri, 4 Dec 2015 23:13:10 +
>> From: Ulrich Germann<ulrich.germ...@gmail.com>
>> Subject: Re: [Moses-support] decoder questi
Actually I don't know if this is a decoder question or such.
Here is my issue
Let's say I have a text string with 2 sentences, with a period ending
the first sentence, but no CR+LF, just a space before the second sentence.
When I pass the full string to the pipe :
tokenizer + truecaser + moses
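A sentence-splitting step in front of that pipe could look like the sketch below. It is a deliberately naive stand-in, not a Moses script: the regular expression and the function name are assumptions for illustration, it only splits after ., ! or ? followed by whitespace and an ASCII capital letter, and it will wrongly split after abbreviations such as "Mr.".

```python
import re

# Boundary: sentence-final punctuation, whitespace, then an uppercase ASCII letter.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

def split_sentences(text):
    """Return one string per sentence, ready to be fed line by line to the pipe."""
    return [s for s in _BOUNDARY.split(text.strip()) if s]

for sentence in split_sentences("This is the first sentence. This is the second one."):
    print(sentence)
```

Each printed line can then go through the tokenizer, the truecaser and the decoder as a separate input sentence.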
n I have the feeling that we really need to
"sentence-tokenize" first before word-tokenizing.
On 04/12/2015 13:52, John D Burger wrote:
> I think you're asking if Moses translates one sentence at a time. The answer
> is yes.
>
> - John Burger
> MITRE
>
Hieu,
here :
http://www.statmt.org/moses/RELEASE-3.0/models/fr-en/config.pb.recase
I read :
input-tokenizer = "$moses-script-dir/tokenizer/normalize-punctuation.perl
$input-extension | $moses-script-dir/tokenizer/tokenizer.perl -a -l
$input-extension"
output-tokenizer =
Hi all,
I have a question regarding LMs.
Let's take the example of news.2014.shuffle.en
When we process it through punctuation normalization for English,
it will for instance put a " " before an apostrophe:
"it is'nt" => "it is 'nt"
BUT it contains some noise, for instance there is
no relative paths.
And of course, the binaries need to be executable on all nodes as well.
-phi
On Thu, Oct 29, 2015 at 10:12 AM, Vincent Nguyen <vngu...@neuf.fr
<mailto:vngu...@neuf.fr>> wrote:
OK guys, not easy stuff ...
I fought to get the prerequisites working but
out :
How do the Moses steps deal with "Nb of Jobs submitted" versus -threads in
the various steps?
On 29/10/2015 17:45, Vincent Nguyen wrote:
> Ken,
>
> I just did some further testing on the master node that HAS all installed.
> same error as is.
>
> /netshr/m
)
On 29/10/2015 15:18, Philipp Koehn wrote:
Hi,
make sure that all the paths are valid on all the nodes --- so
definitely no relative paths.
And of course, the binaries need to be executable on all nodes as well.
-phi
On Thu, Oct 29, 2015 at 10:12 AM, Vincent Nguyen <vngu...@neuf
n Wed, Oct 28, 2015 at 10:20 AM, Vincent Nguyen <vngu...@neuf.fr
<mailto:vngu...@neuf.fr>> wrote:
Hi there,
I need some clarification before screwing up some files.
I just set up an SGE cluster with a Master + 2 Nodes.
to make it clear let say my cluster name i
clear, it runs correctly on the local machine but not when you
> run it through SGE? In that case, I suspect it's library version
> differences.
>
> On 10/29/2015 03:09 PM, Vincent Nguyen wrote:
>> I get this error :
>>
>> moses@sgenode1:/netshr/working-en-fr$ /net
es on SAMBA is pretty low
> priority. However, if you can provide a backtrace (after compiling with
> "debug" added to the command) I can try to turn that segfault into an
> error message.
>
> Kenneth
>
> On 10/29/2015 08:15 PM, Vincent Nguyen wrote:
>> it's
tuning now
so working fine so far
btw, in SMB there was another issue with the split command in extraction.
On 29/10/2015 21:44, Vincent Nguyen wrote:
> I'll mount NFS instead and will confirm if working.
> thanks
>
> On 29/10/2015 21:31, Kenneth Heafield wrote:
>> Hi,
Hi there,
I need some clarification before screwing up some files.
I just set up an SGE cluster with a Master + 2 Nodes.
To make it clear, let's say my cluster name is "default", my master
headnode is "master", and my 2 other nodes are "node1" and "node2".
for EMS :
I opened the default
Hello,
Pretty sure there is no academic importance to this, but :
For the tokenizer we have the -x option to skip XML/HTML tags
For the detokenizer it WILL SKIP whatever.
cf :
while(<STDIN>) {
if (/^<.+>$/ || /^\s*$/) {
#don't try to detokenize XML/HTML tag lines
Michael,
what score-setting do you use to achieve these results ?
if search algo= 1 what cube pruning number ?
On 08/10/2015 19:05, Michael Denkowski wrote:
Hi all,
I extended the multi_moses.py script to support multi-threaded moses
instances for cases where memory limits the number of
LEU/TER/Meteor but this is just one
data point and a fairly simple system. I would be curious to see how
things work out in other users' systems.
Best,
Michael
On Thu, Oct 8, 2015 at 2:34 PM, Vincent Nguyen <vngu...@neuf.fr
<mailto:vngu...@neuf.fr>> wrote:
out of curiosity, what gain do
After many tests, as mentioned before, I made these changes in EMS:
score-settings = "--GoodTuring --MinScore 2:0.001"
and
cube pruning pop limit at 400 (instead of 5000 in EMS)
Speed is much, much higher (without impact on translation quality).
On 05/10/2015 17:20, Philipp Koehn wrote:
Hi,
Hello,
Quick question regarding this script behavior.
Les Banques de la zone Euro sont soumises à :
becomes
les banques de la zone euro sont soumises à :
lowercasing is fine
the space between >Les is fine
but it did not insert a space between the after the : in :
any clue ?
Vincent
actually after > space is always inserted, but before < never inserted.
On 26/09/2015 16:37, Vincent Nguyen wrote:
> Hello,
>
> Quick question regarding this script behavior.
>
> Les Banques de la zone Euro sont soumises à :
>
> becomes
>
> les banque
/15 at 16:50, Vincent Nguyen wrote:
I agree and would like to.
But this is tricky, look at the first 30 lines of my phrase table below.
and this happens a lot in the first line of tables where there are
or weird codes, EN/FR pairs do not match.
! ! ! ! ||| ! ! ! ! ||| 0.103413 0.132185
e used:
>> 1 ||| One Million Roofs
>>
>> oui ||| no
>>
>> To use this list, add the following to your moses.ini file
>>
>> [feature]
>> DeleteRules path=/path/to/list
>>
>> Not tested.
>>
>>
>>
er bad translation
> options which pop up.
>
> On Thu, 2015-09-24 at 16:08 +0200, Vincent Nguyen wrote:
>> Matthias,
>>
>> Pruning :
>> I use the cube pop limit at 400 instead of default values (1000 or 5000)
>> I use the MinScore 0.001
> It seems to me th
try modified Moore-Lewis filtering for data selection.
> https://aclweb.org/anthology/D/D11/D11-1033.pdf
>
>
> Cheers,
> Matthias
>
>
> On Thu, 2015-09-24 at 18:19 +0200, Vincent Nguyen wrote:
>> This is an interesting subject ..
>>
>> As a matter
ct: Re: [Moses-support] is there a way to remove a bad entry in
the phrase table ?
To: Vincent Nguyen<vngu...@neuf.fr>
Cc: moses-support<moses-support@mit.edu>
Hi,
you can remove it manually (just edit the text file), there will be no
negative consequences.
However, it
Hi,
I was wondering: if, after an analysis of the BLEU-Annotation file, we
realize that there must be a bad entry in the phrase table,
can we remove it manually or in some other way?
Thanks.
V.
.
big debate ?
On 16/09/2015 17:30, Vincent Nguyen wrote:
I am struggling with a pipeline .
Here is the text1.txt file I would like to translate from FR to EN
Les banques de la zone euro sont soumises :
au ratio de capital lié à la détention d’actifs risqués
(nous nous intéressons
I am struggling with a pipeline .
Here is the text1.txt file I would like to translate from FR to EN
Les banques de la zone euro sont soumises :
au ratio de capital lié à la détention d’actifs risqués (nous
nous intéressons ici au crédit) ;
au ratio de levier, qui détermine le capital
Guys,
While running EMS with a big test file, I realized that analysis.perl
ran very quickly, while the actual Nist-Bleu scoring took much, much longer.
Also, the "BLEU-Annotation" file generated during
analysis does not contain the right line numbering.
it takes 0 as the
> On 9/13/2015 11:01 PM, moses-support-requ...@mit.edu wrote:
>> Date: Sun, 13 Sep 2015 10:44:02 +0200
>> From: Vincent Nguyen<vngu...@neuf.fr>
>> Subject: Re: [Moses-support] sgm generation for personalized test sets
>> To: moses-support<moses-support@mit.edu>
In order to use makemteval.py we need to remove the 0D and E2 80 A8 bytes from the txt
files.
Python treats them as additional line breaks.
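The effect is easy to reproduce. The sketch below uses a made-up string (not data from the real files) containing a lone carriage return (byte 0D) and a U+2028 LINE SEPARATOR (bytes E2 80 A8 in UTF-8), and compares counting newline characters, as wc -l does, with what str.splitlines() returns. The clean() helper is a hypothetical name for the removal step described above.

```python
# Made-up sample: two "\n"-terminated lines that also contain \r and U+2028.
text = "a\rb\nc\u2028d\n"

# U+2028 is the three-byte sequence E2 80 A8 in UTF-8.
assert "\u2028".encode("utf-8") == b"\xe2\x80\xa8"

newline_count = text.count("\n")   # what `wc -l` counts
python_lines = text.splitlines()   # also splits on \r and U+2028

def clean(s):
    """Remove the characters that splitlines() treats as extra line breaks."""
    return s.replace("\r", "").replace("\u2028", "")

print(newline_count, len(python_lines), len(clean(text).splitlines()))
```

After cleaning, the number of lines returned by splitlines() agrees with the newline count again, which keeps source and reference files aligned line by line.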
On 12/09/2015 22:07, Vincent Nguyen wrote:
> Hi,
>
> What script do you guys use to generate sgm sets based on txt file ?
>
> I have tried makemteva
Hi,
What script do you guys use to generate sgm sets based on txt file ?
I have tried makemteval.py in contrib
but there are a few issues.
I think these lines:
lines =
[l.replace('&quot;','\"').replace('&apos;','\'').replace('&gt;','>').replace('&lt;','<').replace('&amp;','&')
for l in filein.read().splitlines()]
Hi experts,
I have a question about the phrase table theory.
If we take a corpus A to create a TM model TMA and a LM model LMA.
if we consider a corpus B.
Method 1 :
We add corpus B to A => corpus AB => TM-AB and LM-AB
Method 2:
We process corpus B => TMB and LMB
then we combine TMA + TMB and
, 2015 at 10:33 AM, Vincent Nguyen <vngu...@neuf.fr
<mailto:vngu...@neuf.fr>> wrote:
is there any benchmark on what value / what impact ?
what should I start with as a test 0.001 ?
the standard value 0.0001 seems really really low to me
maybe I am not getting what t
Hi,
Unless I am mistaken, it seems that the TM binarization step in EMS is not
multi-core.
ttable-binarizer = "$moses-bin-dir/processPhraseTableMin"
[training]
training-options = "-mgiza -mgiza-cpus 8 -sort-compress gzip
-sort-parallel 4 -cores 4"
binarize-all =
if you're new to linux you will fight for ever.
I would probably go to Slate instead for sure.
On 02/09/2015 17:34, Anita Pal wrote:
For the time being, I'm trying to finish building the baseline system.
I've just been following the commands as listed on the Moses website.
It's still not
On 01/09/2015 17:41, Christophe Servan wrote:
> Hello Vincent,
> Did you check whether you have enough disk space?
>
> Best,
>
> Christophe
>
>
> -Original Message-
> From: moses-support-boun...@mit.edu [mailto:moses-support-boun...@mit.edu] On
> behalf
-orphan-phrase-pairs-from-reordering-table.perl
-phi
On Mon, Aug 31, 2015 at 10:50 AM, Vincent Nguyen <vngu...@neuf.fr
<mailto:vngu...@neuf.fr>> wrote:
thanks, will try and post results.
just to be clear:
I can re-use the previous extract file
I have to rebuild the
2015 at 1:11 PM, Vincent Nguyen <vngu...@neuf.fr
<mailto:vngu...@neuf.fr>> wrote:
Hi Uli,
For your point3. here is what I would like to do / understand :
I have an LM and a TM built with EMS but alignment being done by
FastAlign. So there is no vcb files for the base
yes plenty.
On 01/09/2015 17:41, Christophe Servan wrote:
> Hello Vincent,
> Did you check whether you have enough disk space?
>
> Best,
>
> Christophe
>
>
> -Original Message-
> From: moses-support-boun...@mit.edu [mailto:moses-support-boun...@mi
Hi,
Here are some results for several values of the cube pruning pop limit:
(pop limit / decoding time for 3000 sentences / BLEU score)
5000 - 15m45 - 29.59
1000 - 4m27 - 29.59
500 - 3m35 - 29.59
200 - 3m15 - 29.51
100 - 3m00 - 29.40
Therefore I took 400 - 3m19 - 29.58
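The trade-off in these figures can be made explicit with a few lines of arithmetic. The sketch below only re-uses the numbers reported above; the helper names are hypothetical, and the speedup and BLEU loss are computed relative to the pop limit of 5000.

```python
# (pop limit, decoding time for 3000 sentences, BLEU) as reported in this message
RESULTS = [
    (5000, "15m45", 29.59),
    (1000, "4m27", 29.59),
    (500, "3m35", 29.59),
    (400, "3m19", 29.58),
    (200, "3m15", 29.51),
    (100, "3m00", 29.40),
]

def to_seconds(duration):
    """Convert a string such as '15m45' into seconds."""
    minutes, seconds = duration.split("m")
    return int(minutes) * 60 + int(seconds)

def tradeoff(results):
    """Return (pop limit, speedup vs. first row, BLEU loss vs. first row)."""
    _, base_time, base_bleu = results[0]
    base_seconds = to_seconds(base_time)
    return [
        (pop, round(base_seconds / to_seconds(t), 2), round(base_bleu - bleu, 2))
        for pop, t, bleu in results
    ]

for pop, speedup, loss in tradeoff(RESULTS):
    print(f"pop limit {pop}: {speedup}x faster, BLEU loss {loss}")
```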
If I am not mistaken
nScore 2:0.0001"
in EMS.
-phi
On Mon, Aug 31, 2015 at 3:03 AM, Vincent Nguyen <vngu...@neuf.fr
<mailto:vngu...@neuf.fr>> wrote:
Hi,
Here are some results with several values with cube pruning pop
limit :
(pop limit / decoding time for 3000 sentences / BLEU
:
Hi,
0.0001 should have no impact on translation quality,
0.001 will have some impact
0.01 is probably a bit too drastic.
But that's the range you should explore.
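To get a feel for what such a threshold removes, one can apply it after the fact to a text phrase table. The sketch below is only an illustration: it assumes the usual "source ||| target ||| scores" layout, it assumes that score field 2 (counting from 0) is the one the threshold in "--MinScore 2:..." refers to, the two table lines are invented, and the real pruning happens during scoring in training rather than in a script like this.

```python
def keep_entry(line, field=2, min_score=0.001):
    """True if the chosen score of a phrase-table line reaches the threshold."""
    parts = [p.strip() for p in line.split("|||")]
    scores = [float(x) for x in parts[2].split()]
    return scores[field] >= min_score

# Invented phrase-table lines, for illustration only.
table = [
    "la maison ||| the house ||| 0.8 0.7 0.6 0.5",
    "la maison ||| house the ||| 0.01 0.02 0.0005 0.01",
]

for threshold in (0.0001, 0.001, 0.01):
    kept = [line for line in table if keep_entry(line, min_score=threshold)]
    print(threshold, len(kept))
```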
-phi
On Mon, Aug 31, 2015 at 10:33 AM, Vincent Nguyen <vngu...@neuf.fr
<mailto:vngu...@neuf.fr>> wrote:
:
- EMS includes the mmsapt option to train and binarize the arrays
- EMS does NOT include the part of incrementally adding the new data in
an automated way. Has to be done manually.
Am I understanding things properly ?
On 23/08/2015 09:06, Vincent Nguyen wrote:
Hello,
I have a few
Guys,
I tried the mt adaptive server package from Matecat and I have been fighting
with it for the past 3 days, but I think now I know why.
the mt adaptive application uses some undocumented -print-passthrough
option in moses.
then I saw some functions to actually Output the passthrough info to
STDOUT in
, Prashant Mathur wrote:
Hi Vincent,
Forgot to tell you that the adaptive MT server works with Moses
Release 1.0
There is another version on github which works with the latest
version. Try this out.
https://github.com/hlt-mt/adaptiveMT
—Prashant
On Aug 25, 2015, at 9:39 AM, Vincent
more
about it. I am not familiar with the other parts of code.
—Prashant
On Aug 25, 2015, at 11:02 AM, Vincent Nguyen vngu...@neuf.fr
mailto:vngu...@neuf.fr wrote:
well 2 things :
- I still don't see any of the methods OutputPassthroughInformation
in the previous version of moses
Hello,
I have a few questions on running MMSAPT within EMS. I am referring to
the doc here : http://www.statmt.org/moses/?n=Advanced.Incremental
and to the sections of the config.basic file of EMS.
1) the doc says
initial training run EMS as usual but use modified version of Giza++ and
add
/~riezler/publications/papers/MTJOURNAL2014.pdf
http://www.cl.uni-heidelberg.de/%7Eriezler/publications/papers/MTJOURNAL2014.pdf
[2] http://mt4cat.org/software/adaptive-mt-server
On Wed, Aug 19, 2015 at 6:53 PM, Vincent Nguyen vngu...@neuf.fr
mailto:vngu...@neuf.fr wrote
-entries.perl (something like that, I am
writing this from memory). You give the pruned phrase-table and the
unpruned reordering model to the script, and the script takes care
that the contents match. The good thing is, it hardly requires any RAM.
Best,
Marcin
On 2015-08-19 13:44, Vincent Nguyen
Hi,
it crashed (whereas the sigtest filtering of the ttable continues ...) with no
message about disk space or out of memory,
just a simple "Killed" at the end of the stderr. Any clue?
-l = a+e
P(f|e) filter limit: 50
Loading Vocabulary...
Loading existing vocabulary file:
Hello support,
Going into advanced features of Moses, I am a bit confused by the
differences and therefore which path to follow, regarding the 2 features
CBPT and MMSAPT.
I have the feeling the ultimate goal of both is the same but maybe I am
wrong.
Can someone explain the actual difference
The build-osm step crashes in EMS with the following error.
Any clue?
23396000 23397000 23398000 23399000 2340Converting Bilingual
Sentence Pair into Operation Corpus
Executing: /home/moses/mosesdecoder/bin/generateSequences
/home/moses/working/model/OSM.2//e /home/moses/working/model/OSM.2//f
ran out of disk space. Can you find the stderr
of lmplz?
Kenneth
On 08/16/2015 11:11 AM, Vincent Nguyen wrote:
the build-osm crashes in EMS with following error
any clue ?
23396000 23397000 23398000 23399000 2340Converting Bilingual
Sentence Pair into Operation Corpus
Executing
/2015 20:02, Vincent Nguyen wrote:
right but the config file is the config.basic from which I
uncommented the 3 lines for OSM.
So I guess the parameters are redundant with what is in the perl script.
which one to keep ? either way there is something to correct in the
github.
On 16/08/2015 17
a double declaration of -S when running lmplz. That's either a
mistake in the config file or in the script
On 16/08/2015 14:11, Vincent Nguyen wrote:
the build-osm crashes in EMS with following error
any clue ?
23396000 23397000 23398000 23399000 2340Converting Bilingual
Sentence Pair
selection, instance weighting, model interpolation
and domain features are different methods that give you the benefits of
out-of-domain data, but reduce its harmful effects, and are often better
than just concatenating all the data you have.
best wishes,
Rico
On 14/08/15 16:22, Vincent
Hi,
I am wondering if I could get better results with a larger tuning data set.
Is there a way in EMS to combine several data set files, or do I need to
concatenate the sets?
If the latter, how can I do this easily? Just concat the sgm files?
thanks,
Vincent
, Vincent Nguyen wrote:
thanks for your insights.
I am just struck by the BLEU difference between my 26 and the 30 of
WMT11, and some results of WMT14 close to 36 or even 39
I am currently having trouble with hierarchical rule set instead of
lexical reordering
wondering if I will get better
for the system description (like in table 6 in the UEDIN
paper).
best wishes,
Rico
On 10/08/15 08:32, Vincent Nguyen wrote:
similarly reading the WMT14 paper from UEDIN, If not mistaken I read :
35.9 in the matrix : http://matrix.statmt.org/systems/show/2106
31.76 for B1 best system on page
and no more, you're gonna have a hard time doing the
rest of the experiments.
Hieu Hoang
Researcher
New York University, Abu Dhabi
http://www.hoang.co.uk/hieu
On 8 August 2015 at 13:55, Vincent Nguyen vngu...@neuf.fr
mailto:vngu...@neuf.fr wrote:
Hi,
I keep adding 100GB on my space
, Vincent Nguyen wrote:
Hi,
Just a heads up on some EMS results, to get your experienced opinions.
Corpus: Europarlv7 + NC2010
fr = en
Evaluation NC2011.
1) IRSTLM is much slower than KenLM for training / tuning.
that sounds right. KenLM is also multithreaded, IRSTLM can only be
used
some
newstest data sets from several years for tuning.
does it help a lot to tune with bigger sets ?
Cheers,
Vincent
On 09/08/2015 13:47, Vincent Nguyen wrote:
I think at 400GB I was not very far. 500GB was more than enough
without the -sort-compress gzip options.
Now it's binarizing
Hi,
I keep adding 100GB on my space, even at 400GB it crashed at sorting
time after the extract tables
now trying 500GB
Will I need more ?
is there a rule ?
cheers,
Vincent
it running with mgiza it will still take a week or so.
Just add
fast-align-settings = -d -o -v
to the TRAINING section of ems, and make sure that fast_align is in
your external-bin-dir.
cheers - Barry
On 06/08/15 08:40, Vincent Nguyen wrote:
so I dropped my hierarchical model since I got
external-bin-dir.
cheers - Barry
On 06/08/15 08:40, Vincent Nguyen wrote:
so I dropped my hierarchical model since I got an error.
Switched back to the more data by adding the Giga FR EN source
but now another error pops up when running Giza Inverse:
Using SCRIPTS_ROOTDIR: /home
if you manage to
get it running with mgiza it will still take a week or so.
Just add
fast-align-settings = -d -o -v
to the TRAINING section of ems, and make sure that fast_align is
in your external-bin-dir.
cheers - Barry
On 06/08/15 08:40, Vincent Nguyen wrote
you will need more
disk. For fr-en/en-fr it's probably not worth the extra effort,
cheers - Barry
On 04/08/15 15:58, Vincent Nguyen wrote:
thanks for your insights.
I am just struck by the BLEU difference between my 26 and the 30 of
WMT11, and some results of WMT14 close to 36 or even 39
I am