Here is a paper which describes Chinese sentence segmentation:
www.aclweb.org/anthology/P/P11/P11-2111.pdf

The authors say that in Chinese a comma can also be an end-of-sentence
marker, but it is ambiguous.

So we would need to add the comma as an EOS character, and
we should create a new feature generator for it.
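
To make the idea concrete, here is a rough sketch in plain Java, not OpenNLP code. The class name, the EOS set, and the three features are all assumptions for illustration; they are not the features from the paper and not OpenNLP's context generator API. It only shows the two steps: treat the full-width comma as a candidate, then emit context features that a classifier could use to decide whether that candidate really ends a sentence.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only (not part of OpenNLP).
class CommaCandidates {
    // Assumed candidate set: full stop U+3002, exclamation U+FF01,
    // question mark U+FF1F, plus the ambiguous full-width comma U+FF0C.
    static final String EOS_CHARS = "\u3002\uFF01\uFF1F\uFF0C";

    /** Returns the index of every character that could end a sentence. */
    static List<Integer> candidatePositions(String text) {
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            if (EOS_CHARS.indexOf(text.charAt(i)) >= 0) {
                positions.add(i);
            }
        }
        return positions;
    }

    /** Emits simple, made-up context features for the candidate at index i. */
    static List<String> features(String text, int i) {
        List<String> feats = new ArrayList<>();
        feats.add("eos=" + text.charAt(i));
        // Character before the candidate, or a start marker at the text boundary.
        feats.add("prev=" + (i > 0 ? String.valueOf(text.charAt(i - 1)) : "<s>"));
        // Character after the candidate, or an end marker at the text boundary.
        feats.add("next=" + (i + 1 < text.length() ? String.valueOf(text.charAt(i + 1)) : "</s>"));
        return feats;
    }
}
```

A real feature generator would likely need richer context than single neighbouring characters, since the comma is the hard case.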

Are there any free training data sets which could be used?

Jörn


On 03/21/2012 03:34 PM, Joern Kottmann wrote:
Wikipedia says: "Languages like Japanese and Chinese have unambiguous sentence-ending markers." In that case, might we be able to write a rule-based sentence detector for these languages?
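
A rule-based detector along these lines could be sketched as follows. This is plain Java, not OpenNLP code; the class name, the set of sentence-ending marks, and the set of closing quotes are assumptions, and the sketch takes the "unambiguous markers" claim at face value. It splits after each ending mark and keeps any repeated marks or closing quotes attached to the sentence they close.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical rule-based splitter, for illustration only (not part of OpenNLP).
class RuleBasedSentenceSplitter {
    // Assumed enders: ideographic full stop U+3002, full-width '!' U+FF01, full-width '?' U+FF1F.
    static final String ENDERS = "\u3002\uFF01\uFF1F";
    // Assumed closers that stay with the preceding sentence:
    // U+300D, U+300F (corner brackets), U+201D, U+2019 (quotes), U+FF09 (full-width ')').
    static final String CLOSERS = "\u300D\u300F\u201D\u2019\uFF09";

    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        int start = 0;
        int i = 0;
        while (i < text.length()) {
            if (ENDERS.indexOf(text.charAt(i)) >= 0) {
                int end = i + 1;
                // Absorb repeated enders and trailing closing quotes or brackets.
                while (end < text.length()
                        && (ENDERS.indexOf(text.charAt(end)) >= 0
                            || CLOSERS.indexOf(text.charAt(end)) >= 0)) {
                    end++;
                }
                String sentence = text.substring(start, end).trim();
                if (!sentence.isEmpty()) {
                    sentences.add(sentence);
                }
                start = end;
                i = end;
            } else {
                i++;
            }
        }
        // Keep any trailing text that has no ending mark.
        String tail = text.substring(start).trim();
        if (!tail.isEmpty()) {
            sentences.add(tail);
        }
        return sentences;
    }
}
```

Calling `RuleBasedSentenceSplitter.split(...)` on a paragraph returns the sentences in order. Output like this would still need a manual check before being used as training data.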

Jörn

On Wed, Mar 21, 2012 at 3:18 PM, [email protected] wrote:

    Hi

    There is a Thai model for the sentence detector. I don't know who
    created it, but someone on the list knows and can point to an
    article about it. What I can say is that OpenNLP had to be
    customized to work with Thai, including the EOS characters, which
    are ' ' and '\n'.

    
http://svn.apache.org/viewvc/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/sentdetect/lang/th/SentenceContextGenerator.java?view=markup


    William


    On Wed, Mar 21, 2012 at 8:05 AM, Jim - FooBar(); <[email protected]> wrote:

    > Basically you need to know the punctuation marks that indicate the end
    > of a sentence, or find someone who does, then use a regex to split the
    > sentences at those marks. It's not going to be perfect: you may have to
    > go over it once or twice with your own eyes to make sure everything is
    > OK before training. Everything depends on the language and how
    > ambiguous its punctuation is.
    >
    >
    > Jim
    >
    > On 20/03/12 18:38, Jairo Sarabia wrote:
    >
    >> Hi all,
    >>
    >> I see there aren't Sentence Detect Models for Asian languages in the
    >> OpenNLP repository, and I need them.
    >> I have to train Sentence Detect Models for Chinese, Japanese and
    >> Korean, but I don't know these languages.
    >> How could I get the training data files for these languages?
    >>
    >> Thanks in advance!
    >>
    >> Jairo Sarabia
    >>
    >>
    >


