Doesn't the person who posted the original question have any sample texts? I got the impression that he does, but doesn't know the language(s)...

Jim

On 21/03/12 23:22, James Kosin wrote:
Jorn,

If there isn't anything for Korean, I could put something together.
Only problem would be getting free text.
I can start looking if needed.

James

On 3/21/2012 2:38 PM, Jörn Kottmann wrote:
Here is a paper which describes Chinese sentence segmentation:
www.aclweb.org/anthology/P/P11/P11-2111.pdf

There they say that commas can be an end-of-sentence marker as well,
but they are ambiguous.

So we would need to add the comma as an EOS character, and
we should create a new feature generator to handle it.
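
To make the idea concrete, here is a rough sketch in plain Java of what treating the comma as an ambiguous EOS candidate could look like. It is not OpenNLP code: the class name EosCandidates, its methods, and the three context features are all made up for illustration, and the candidate set (U+3002, U+FF01, U+FF1F plus the full-width comma U+FF0C) is an assumption. A trained classifier would still have to decide, per candidate, whether it really ends a sentence.

```java
import java.util.ArrayList;
import java.util.List;

public class EosCandidates {

    // Assumed candidate set: ideographic full stop (U+3002), full-width
    // exclamation mark (U+FF01), full-width question mark (U+FF1F) and
    // the ambiguous full-width comma (U+FF0C).
    public static final String EOS_CHARS = "\u3002\uFF01\uFF1F\uFF0C";

    /** Returns the offset of every character that might end a sentence. */
    public static List<Integer> candidatePositions(String text) {
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            if (EOS_CHARS.indexOf(text.charAt(i)) >= 0) {
                positions.add(i);
            }
        }
        return positions;
    }

    /** Builds simple (hypothetical) context features for one candidate. */
    public static List<String> features(String text, int pos) {
        List<String> f = new ArrayList<>();
        // Which mark is it? Lets a model treat commas differently.
        f.add("eos=" + text.charAt(pos));
        // Character just before and just after the candidate.
        f.add("prev=" + (pos > 0
                ? String.valueOf(text.charAt(pos - 1)) : "<bos>"));
        f.add("next=" + (pos + 1 < text.length()
                ? String.valueOf(text.charAt(pos + 1)) : "<eot>"));
        return f;
    }
}
```

The sketch works on single char values, so it ignores characters outside the Basic Multilingual Plane; a real feature generator would probably also want wider context than one character on each side.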

Are there any free training data sets which could be used?

Jörn


On 03/21/2012 03:34 PM, Joern Kottmann wrote:
Wikipedia says: "Languages like Japanese and Chinese have unambiguous
sentence-ending markers."
In that case we might be able to write a rule-based sentence detector
for these languages?
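
A minimal sketch of such a rule-based splitter, using only java.util.regex and assuming that the sentence-ending marks are U+3002 (ideographic full stop), U+FF01 and U+FF1F (full-width exclamation and question marks). The class name RuleBasedSplitter is invented for this example and is not part of OpenNLP. The same approach could also be used to produce a rough one-sentence-per-line file that someone who knows the language then checks by hand before training.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RuleBasedSplitter {

    // A run of non-terminators followed by one or more assumed terminators:
    // U+3002, U+FF01, U+FF1F. Adjust the set for the target language.
    private static final Pattern SENTENCE =
            Pattern.compile("[^\u3002\uFF01\uFF1F]*[\u3002\uFF01\uFF1F]+");

    /** Splits text after each run of end-of-sentence marks. */
    public static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        Matcher m = SENTENCE.matcher(text);
        int last = 0;
        while (m.find()) {
            String s = m.group().trim();
            if (!s.isEmpty()) {
                sentences.add(s);
            }
            last = m.end();
        }
        // Keep trailing text that has no end-of-sentence mark.
        String rest = text.substring(last).trim();
        if (!rest.isEmpty()) {
            sentences.add(rest);
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Two short Chinese sentences, written with Unicode escapes so the
        // source file does not depend on the compiler's encoding setting.
        String text = "\u4F60\u597D\u3002\u4F60\u597D\u5417\uFF1F";
        for (String s : split(text)) {
            System.out.println(s); // one sentence per line
        }
    }
}
```

This does not handle closing quotes or brackets that follow the end mark, and it treats every listed mark as unambiguous, which is exactly the assumption in question.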

Jörn

On Wed, Mar 21, 2012 at 3:18 PM, [email protected] wrote:

     Hi

     There is a Thai model for the sentence detector. I don't know who
     created it, but someone on the list knows and can point to an
     article about it.
     What I can say is that OpenNLP had to be customized to work with
     Thai, including the EOS characters, which are ' ' and '\n':


http://svn.apache.org/viewvc/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/sentdetect/lang/th/SentenceContextGenerator.java?view=markup


     William


     On Wed, Mar 21, 2012 at 8:05 AM, Jim - FooBar(); <[email protected]> wrote:

     >  Basically you need to know the punctuation signs indicating end of
     >  sentence, or find someone who does... then use a regex to split the
     >  sentences at those signs! It's not gonna be perfect - you may have to
     >  go over it once or twice with your own eyes to make sure everything
     >  is ok before training. Everything depends on the language and how
     >  ambiguous its punctuation is.
     >
     >
     >  Jim
     >
     >  On 20/03/12 18:38, Jairo Sarabia wrote:
     >
     >>  Hi all,
     >>
     >>  I see there aren't Sentence Detect Models for Asian languages in
     >>  the OpenNLP repository, and I need them.
     >>  I have to train Sentence Detect Models for Chinese, Japanese and
     >>  Korean, but I don't know these languages.
     >>  How could I get the training data files for these languages?
     >>
     >>  Thanks in advance!
     >>
     >>  Jairo Sarabia
     >>
     >>
     >



