On Sat, Jul 27, 2013 at 10:39 AM, C. Scott Ananian
<canan...@wikimedia.org> wrote:

> My main point was just that there is a chicken-and-egg problem here.  You
> assume that machine translation can't work because we don't have enough
> parallel texts.  But, to the extent that machine-aided translation of WP is
> successful, it creates a large amount of parallel text.   I agree that
> there are challenges.  I simply disagree, as a matter of logic, with the
> blanket dismissal of the chickens because there aren't yet any eggs.
>

I think we both agree about the need for and usefulness of having a copious
amount of parallel text. The main difficulty is how to get there from
scratch. As I see it, there are several possible paths:
- volunteers create the corpus manually (some work has been done, but it is
not properly tagged)
- a statistical approach creates the base text, and volunteers improve that
text only
- rules and statistics create the base text, and volunteers improve the text
and optionally the rules

The end result of all three options is a parallel corpus that can be reused
for statistical translation. In my opinion, giving users the option to
improve or select the rules is far more effective than letting them improve
the text only. It complements statistical analysis rather than replacing it,
and it provides a good starting point for solving the chicken-and-egg
conundrum, especially in small Wikipedias.
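
To make the third option more concrete, here is a minimal sketch in Python
(all names hypothetical, not an existing tool) of a corpus record that keeps
track of how each translation was produced. That provenance is what would
let volunteer corrections flow back into both the statistical training data
and the rules:

# Minimal sketch (hypothetical schema): a parallel-corpus record that notes
# how the target text was produced, so volunteer corrections can feed both
# the statistical training data and the rules themselves.
from dataclasses import dataclass, field
from enum import Enum


class Origin(Enum):
    MANUAL = "manual"            # written by a volunteer from scratch
    STATISTICAL = "statistical"  # base text produced by a statistical engine
    RULE_BASED = "rule_based"    # base text produced by transfer rules


@dataclass
class ParallelSegment:
    source_lang: str
    target_lang: str
    source_text: str
    target_text: str
    origin: Origin
    rules_applied: list[str] = field(default_factory=list)  # rule IDs, if any
    post_edited: bool = False  # True once a volunteer has corrected the output


# A segment produced by rules and later corrected by a volunteer: the corrected
# pair goes into the statistical corpus, and the rule IDs tell us which rules
# may need improvement.
seg = ParallelSegment(
    source_lang="en",
    target_lang="ca",
    source_text="The cat sleeps.",
    target_text="El gat dorm.",
    origin=Origin.RULE_BASED,
    rules_applied=["svo-order", "det-agreement"],
    post_edited=True,
)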

Currently translatewiki relies on external tools over which we don't have
much control; besides being proprietary, they carry the risk of being
disabled at any time.

I think you're attributing the faults of a single implementation/UX to the
> technique as a whole.  (Which is why I felt that "step 1" should be to
> create better tools for maintaining information about parallel structures
> in the wikidata.)
>

Good call. Now that you mention it, yes, it would be great to have a place
to keep a parallel corpus, and it would be even more useful if it could be
annotated with Wikidata/Wiktionary senses. A Wikibase repo might be the way
to go. I have no idea whether Wikidata or translatewiki is the right place
to store/display it; Wikimania might be a good time to discuss it. I have
added it to the "elements" section.
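
Just to illustrate what I have in mind, a sense-annotated segment in such a
repo could look roughly like this (the schema is invented and the Q-ids are
only illustrative):

# Rough sketch (hypothetical schema): a parallel segment whose aligned tokens
# are annotated with Wikidata item IDs, roughly the kind of record a Wikibase
# repo could hold.  The Q-ids are illustrative only.
import json

segment = {
    "source": {"lang": "en", "text": "Barcelona is a city in Catalonia."},
    "target": {"lang": "ca", "text": "Barcelona és una ciutat de Catalunya."},
    "alignments": [
        # (source token indices, target token indices, Wikidata sense/item)
        {"source_tokens": [0], "target_tokens": [0], "sense": "Q1492"},  # Barcelona
        {"source_tokens": [3], "target_tokens": [3], "sense": "Q515"},   # city
        {"source_tokens": [5], "target_tokens": [5], "sense": "Q5705"},  # Catalonia
    ],
}

print(json.dumps(segment, ensure_ascii=False, indent=2))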


>
> In a world with an active Moore's law, WP *does* have the computing power
> to approximate this effort.  Again, the beauty of the statistical approach
> is that it scales.
>

My main concern about statistics-based machine translation is that it needs
volume to be effective, hence the proposal to use rule-based translation to
reach that critical point faster than statistics on the existing text alone
would.
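
As a toy illustration of that bootstrapping loop (not a real MT system, just
an invented word list and one reordering rule): the rules produce a rough
draft, a volunteer corrects it, and the corrected pair becomes new material
for the statistical corpus:

# Toy illustration of rule-based bootstrapping (invented lexicon and rule):
# a dictionary lookup plus one adjective-noun reordering rule produces a
# rough base translation for volunteers to post-edit.
LEXICON = {"the": "el", "black": "negre", "cat": "gat", "sleeps": "dorm"}
ADJECTIVES = {"black"}


def rule_based_translate(sentence: str) -> str:
    tokens = sentence.lower().rstrip(".").split()
    out = []
    i = 0
    while i < len(tokens):
        # Rule: English adjective + noun becomes noun + adjective.
        if tokens[i] in ADJECTIVES and i + 1 < len(tokens):
            out += [tokens[i + 1], tokens[i]]
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return " ".join(LEXICON.get(t, t) for t in out) + "."


corpus = []  # grows as volunteers post-edit the rough output
draft = rule_based_translate("The black cat sleeps.")  # "el gat negre dorm."
corrected = draft  # here a volunteer would fix any remaining errors
corpus.append(("The black cat sleeps.", corrected))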


>
> I'm sure we can agree to disagree here.  Probably our main differences are
> in answers to the question, "where should we start work"?  I think
> annotating parallel texts is the most interesting research question
> ("research" because I agree that wiki editing by volunteers makes the UX
> problem nontrivial).  I think your suggestion is to start work on the
> "semantic multilingual dictionary"?
>

It is quite possible to have multiple developments in parallel. That a
semantic dictionary is in development doesn't hinder the creation of a
parallel corpus or an annotation interface. The same applies to statistics
and rules: they are not incompatible; in fact, they complement each other
pretty well.


> ps. note that the inter-language links in the sidebar of wikipedia articles
> already comprise a very interesting corpus of noun translations.  I don't
> think this dataset is currently exploited fully.
>

I couldn't agree more. I would suggest taking a close look at CoSyne; I'm
sure some of it can be reused:
http://www.cosyne.eu/index.php/Main_Page
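
For what it's worth, a first pass at mining those links could be as simple
as asking Wikidata for the sitelinks of an article, since Wikidata already
aggregates the sidebar links. This is only a sketch, with the API details as
I remember them, so treat them as an assumption:

# Sketch: use the Wikidata API (wbgetentities with props=sitelinks) to turn
# an article title into its titles in every other language edition.
import requests

API = "https://www.wikidata.org/w/api.php"


def title_translations(title: str, source_wiki: str = "enwiki") -> dict:
    """Return {wiki: title} for every language edition linked to the article."""
    params = {
        "action": "wbgetentities",
        "sites": source_wiki,
        "titles": title,
        "props": "sitelinks",
        "format": "json",
    }
    data = requests.get(API, params=params, timeout=10).json()
    entity = next(iter(data.get("entities", {}).values()), {})
    sitelinks = entity.get("sitelinks", {})
    return {wiki: link["title"] for wiki, link in sitelinks.items()}


# e.g. title_translations("Cat") might return {"enwiki": "Cat", "cawiki": "Gat", ...}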

Cheers,
David