I think you're asking if Moses translates one sentence at a time. The answer is
yes.
- John Burger
MITRE
> On Dec 4, 2015, at 04:43, Vincent Nguyen wrote:
>
> Actually I don't know if this is a decoder question or such.
>
> Here is my issue
>
> Let's say I have a text
On Jun 24, 2015, at 10:47, Read, James C jcr...@essex.ac.uk wrote:
So you still think it's fine that the default would perform at 37 BLEU points
less than just selecting the most likely translation of each phrase?
Yes, I'm pretty sure we all think that's fine, because one of the steps of
I've observed this as well. It seems to me there are several competing
pressures affecting the number of n-gram types in a corpus. On the one hand, as
the size of the corpus increases, so does the vocabulary. This obviously
increases the number of unigram types (which is the same as the vocabulary size).
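A quick way to watch this empirically (a sketch; corpus.tok is an assumed
whitespace-tokenized file):
# unigram types, i.e. the vocabulary size
tr -s ' ' '\n' < corpus.tok | sort -u | wc -l
# bigram types
awk '{for(i=1;i<NF;i++) print $i, $(i+1)}' corpus.tok | sort -u | wc -l
Running these over growing prefixes of the corpus (e.g. via head -n) shows
both counts climbing, at different rates.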
This is also a reason to turn Unicode normalization on. If the
tokenizer did NFKC at the beginning, then the problem would go away.
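For what it's worth, normalizing up front is a one-liner (a sketch using
Perl's core Unicode::Normalize module; the file names are made up):
perl -CSD -MUnicode::Normalize -pe '$_ = NFKC($_)' < corpus.raw > corpus.nfkc
NFKC composes base+combining sequences where a precomposed character exists,
and also folds compatibility characters, e.g. the single ligature ﬁ becomes
the two letters fi.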
If I understand the situation correctly, this would only fix this particular
example and a few others like it. There are many base+combining grapheme
clusters
...@ed.ac.uk wrote:
What OS are you on, and do you have libtool (or glibtool from MacPorts on OS X)?
I sometimes see this on older machines.
On 8 April 2014 18:52, John D. Burger j...@mitre.org wrote:
I should add that simply creating the subdirectory doesn't work; later steps
expect to find something there.
Hi -
I'm having autotools troubles while installing irstlm-5.80.03 per the
directions here:
http://www.statmt.org/moses/?n=Moses.Baseline
On the very first step I get this:
./regenerate-makefiles.sh
Calling /usr/bin/libtoolize
You should add the contents of
I should add that simply creating the subdirectory doesn't work; later steps
expect to find something there.
- JB
On Apr 8, 2014, at 13:40, John D. Burger j...@mitre.org wrote:
Hi -
I'm having autotools troubles while installing irstlm-5.80.03 per the
directions here:
http://www.statmt.org/moses/?n=Moses.Baseline
On Mar 6, 2014, at 16:00, Momo Jeng momo_j...@outlook.com wrote:
I'm having a problem getting results from Moses, although I think it's really
a problem with GIZA++; please let me know if there's a better place for GIZA
questions.
When I run Moses instructing GIZA++ to only do model1
The default tokenizer script only knows specific rules for a few languages. The
fallback (English) rules may suffice for your purposes; they do the obvious
thing with spaces and English punctuation, and also handle some special cases
for abbreviations like Mr. and Mrs.
I'd suggest you
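For reference, the usual invocation is something like this (the path to your
Moses checkout will differ):
~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en < corpus.raw > corpus.tok
As far as I know, if -l names a language the script has no rules for, it falls
back to the English behavior described above.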
BLEU scores are not commensurate even between different corpora in the same
translation direction. BLEU is really only comparable across different systems
or system variants on the exact same data.
In the case of the same corpus in two directions, an imperfect analogy might be
gas mileage between
We've done something like this in the past. The fact that the check for a
non-empty LM happens at the very beginning is somewhat annoying if you have a
setup that builds the phrase models and language models in parallel, for
instance on a cluster.
- JB
On Nov 4, 2013, at 07:48, Tom Hoar
If you treat entire paragraphs as segments, then you'll presumably end up with
very long segments. This will make it difficult to get good alignments, and so
the resulting models may be of poor quality. Also note that there will be
nothing to prevent the extracted phrases from spanning sentence boundaries.
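If you do want sentence-level segments, Moses ships a splitter you can run
before tokenization (a sketch; paths and file names are assumptions):
~/mosesdecoder/scripts/ems/support/split-sentences.perl -l en < paragraphs.txt > sentences.txt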
We did some experiments a long time ago on tuning set size (for Chinese to
English). For the standard Moses setup, there are only a dozen or so
meta-features to find weights for, so it's no surprise that improvements
asymptote sharply after the tuning set gets much bigger than 1,000-2,000 segments.
This sounds like our workaround. Just to make sure I understand, Tom, it
sounds like you add your own extra markers to everything, both for alignment
and language modeling, so the parallel files look like this (using ss and
/ss instead of your music symbols):
ss das ist ein kleines haus .
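Adding such markers is a one-liner per file (a sketch; the file name and the
ss / /ss strings are stand-ins for whatever markers you pick):
sed -e 's/^/ss /' -e 's|$| /ss|' corpus.de > corpus.marked.de
The same command has to be run over the other side of the parallel data and
over the LM training text, so the markers are seen consistently everywhere.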
If you move the count field to the beginning of the line, you can use the
-text-has-weights switch of ngram-count:
-text-has-weights
Treat the first field in each text input line as a weight factor by which
the N-gram counts for that line are to be multiplied.
More here:
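A sketch of the whole thing (file names invented; the first field of each
line is the weight):
# weighted.txt lines look like: 3 das ist ein haus
ngram-count -order 3 -text weighted.txt -text-has-weights -lm weighted.lm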
Are there any such placeholders in your language modeling data and your
parallel training data? If not, all the models are going to treat them as
unknown words. In the case of the language model, it doesn't surprise me too
much that the placeholders all get pushed together, as that will
-Original Message-
From: moses-support-boun...@mit.edu [mailto:moses-support-boun...@mit.edu] On
Behalf Of John D Burger
Sent: 31 July 2012 16:09
To: Henry Hu
Cc: moses-support@mit.edu
Subject: Re: [Moses-support] Placeholder drift
Daniel Schaut wrote:
To conclude, one could say that I've created an engine suitable for a
specific domain? However, the engine's performance outside my domain is
almost zero?
This is always a problem, especially with statistical MT. For example, we've
evaluated high-performing
I =think= I recall that pairwise BLEU scores for human translators are usually
around 0.50, so anything much better than that is indeed suspect.
- JB
On Apr 26, 2012, at 14:18, Daniel Schaut wrote:
Hi all,
I’m running some experiments for my thesis and I’ve been told by a more
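(If you want to check the pairwise-human figure yourself, you can score one
reference translation against another with multi-bleu.perl from Moses; the
file names are placeholders:
~/mosesdecoder/scripts/generic/multi-bleu.perl human1.txt < human2.txt
Averaging over the choices of which translator plays "hypothesis" gives the
pairwise number.)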
Mark Fishel wrote:
the --parallel switch of the train-model.perl script is only
effective during the first 2 steps -- is there a good reason not to
make the phrase scoring (step 6) parallel? Currently it contains a
'for my $direction (f2e,e2f)...', and on a large corpus the
scoring can take
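In the meantime the two directions can simply be run as concurrent shell jobs
(a sketch; score_direction is a hypothetical stand-in for the real step-6
invocation, not an actual Moses command):
score_direction f2e &   # score source-to-target
score_direction e2f &   # score target-to-source
wait                    # block until both directions finish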
Hi -
We switched to using IRSTLM recently, in order to build bigger
language models. I am starting to think, however, that the entire
model is still being loaded into memory. Here's part of what Moses
prints out now:
Start loading LanguageModel /net/tidesserver/tidesserver_raid7/clasr/
Oops, forgot to CC the list.
From: John D. Burger [EMAIL PROTECTED]
Date: August 4, 2008 13:30:30 EDT
To: [EMAIL PROTECTED]
Subject: Re: [Moses-support] decoding: reordering only
Sanne Korzec wrote:
Is there a way to force the moses or pharaoh decoder, to use a
certain set of phrases
Hi -
I'm still trying to debug my differences between old and new versions
of Moses, which (for us) use SRILM and IRSTLM respectively. My
current puzzle is over the very different sizes of the language
models resulting from SRILM and IRSTLM - the latter has 5 times as
many 5-grams, for
Miles Osborne wrote:
by default SRILM prunes singletons
OK, that's good to know. But when I prune the IRST LM, I still get
lots =more= 4-grams than the SRI LM, but lots =fewer= 5-grams
(although less than a factor of two in either case).
But perhaps I'm a bit in the weeds here ... :)
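For what it's worth, SRILM's cutoffs can be overridden; I believe the relevant
ngram-count options are the -gtNmin counts (a sketch, with invented file names):
ngram-count -order 5 -text corpus.txt -lm full.lm -gt3min 1 -gt4min 1 -gt5min 1
i.e. keep singleton 3-, 4-, and 5-grams, which should make the SRI and IRST
counts more directly comparable.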
comparing it against?
It's almost exactly a year old, sadly. What's the easiest way to
tell what version it is?
Miles asked about the size of the tuning set - it's 812 segments.
That's not that small, is it?
Thanks for your prompt replies and suggestions.
- John D. Burger
MITRE
Miles Osborne wrote:
I'd check to see how unknown words are handled in either SRILM
or IRSTLM -- that may explain the differences
Ah, good suggestion, thanks - OOV is very high in this data.
(as for the size of a tuning set, the more the better; right now
I'm doing Europarl runs
Sanne Korzec wrote:
I am having trouble understanding what the recaser is doing exactly
when evaluating a (dev) test set.
Why do we need to train a recaser?
Because the default setup in Moses is to train caseless models. This
is done by lowercasing the parallel corpus before anything else.
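A sketch of the standard pipeline (paths are assumptions, and I may be
misremembering the exact train-recaser.perl options):
# lowercase the tokenized training data before building the models
~/mosesdecoder/scripts/tokenizer/lowercase.perl < corpus.tok.en > corpus.lc.en
# train a recasing model on cased target-side text to restore case afterwards
~/mosesdecoder/scripts/recaser/train-recaser.perl --dir recaser --corpus corpus.tok.en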
installed it, but I'm certain it was before
that. I will try the newer version - thanks!
- John D. Burger
MITRE
Ham, Michael wrote:
Those escape numbers are Unicode characters. The Chinese character set
does not exist in ASCII, so you have to use UTF-8.
Sorry if I wasn't clear: I'm talking about the Chinese side of
LDC2004E12, which is not in ASCII or Unicode; it's in GB18030.
Apparently,
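The usual fix is to transcode before any Moses processing touches the data:
iconv -f GB18030 -t UTF-8 chinese.gb18030 > chinese.utf8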
Yee Seng Chan wrote:
However, when I tried to parallelize it by submitting, say, 10
jobs, I don’t get faster MERT iterations. In fact, it’s slower.
Sometimes a job can be stuck on one of the grid nodes and after
hours it’s still not completed. Its corresponding output file, e.g…
Chris Dyer wrote:
I haven't looked into what's causing the particular problem on this
corpus, but another known problem with the GIZA HMM model is that it
doesn't do a fairly standard kind of normalization in the
forward-backward training, which causes underflow errors in some
sentences
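The standard remedy (textbook HMM scaling, sketched here; I'm not claiming
this is how the GIZA code is organized) is to renormalize the forward
probabilities at every position $t$:
$c_t = \sum_j \alpha_t(j), \qquad \hat{\alpha}_t(j) = \alpha_t(j) / c_t$
and recover the sentence log-probability as $\log P(o) = \sum_t \log c_t$,
so no individual $\alpha$ ever underflows.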