Jorn, If there isn't anything for Korean, I could put something together. Only problem would be getting free text. I can start looking if needed.
James

On 3/21/2012 2:38 PM, Jörn Kottmann wrote:
> Here is a paper which describes Chinese sentence segmentation:
> www.aclweb.org/anthology/P/P11/P11-2111.pdf
>
> There they say that commas can be an end-of-sentence marker as well,
> but they are ambiguous.
>
> So we would need to add the comma as an eos char, and
> we should create a new feature generator.
>
> Are there any free training data sets which could be used?
>
> Jörn
>
>
> On 03/21/2012 03:34 PM, Joern Kottmann wrote:
>> Wikipedia says: "Languages like Japanese and Chinese have unambiguous
>> sentence-ending markers."
>> In this case we might be able to write a rule-based sentence detector
>> for these languages?
>>
>> Jörn
>>
>> On Wed, Mar 21, 2012 at 3:18 PM, [email protected] <[email protected]> wrote:
>>
>> Hi
>>
>> There is a Thai model for the sentence detector. I don't know who
>> created it, but someone from the list knows and can point to some
>> article about it.
>> What I can say is that OpenNLP had to be customized to work with Thai,
>> including the EOS characters, which are ' ' and '\n'.
>>
>> http://svn.apache.org/viewvc/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/sentdetect/lang/th/SentenceContextGenerator.java?view=markup
>>
>> William
>>
>>
>> On Wed, Mar 21, 2012 at 8:05 AM, Jim - FooBar(); <[email protected]> wrote:
>>
>> > Basically you need to know the punctuation signs indicating end of
>> > sentence or find someone who does...then use regex to split the
>> > sentences at those signs! It's not gonna be perfect - you may have to
>> > pass over it once or twice with your own eyes to make sure everything
>> > is ok before training. Everything depends on the language and how
>> > ambiguous its punctuation is.
>> >
>> > Jim
>> >
>> > On 20/03/12 18:38, Jairo Sarabia wrote:
>> >
>> >> Hi all,
>> >>
>> >> I see there aren't Sentence Detect Models for Asian languages in the
>> >> OpenNLP repository and I need these ones.
>> >> I have to train Sentence Detect Models for Chinese, Japanese and
>> >> Korean, but I don't know these languages.
>> >> How could I get training data files for these languages?
>> >>
>> >> Thanks in advance!
>> >>
>> >> Jairo Sarabia
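
For what it's worth, Jim's suggestion in the quoted thread (split at known end-of-sentence marks with a regex, then review by hand before training) might look roughly like the sketch below. It is written in Python only for brevity and is not OpenNLP code. The set of end marks (`。`, `！`, `？`), the list of closing quotes, and the name `split_sentences` are all assumptions made for illustration; commas are left out on purpose because the paper Jörn cites says they are ambiguous.

```python
import re

# Assumed sentence-ending marks for Chinese/Japanese text (illustrative only).
# Commas are deliberately excluded: they can end a sentence but are ambiguous.
EOS_CHARS = "。！？"
# Closing quotes/brackets that may follow an end mark and belong to the sentence.
CLOSERS = "」』”’）"

# Either: text up to and including a run of end marks plus trailing closers,
# or: a final fragment that has no end mark at all.
_SENTENCE_RE = re.compile(
    rf"[^{EOS_CHARS}]*[{EOS_CHARS}]+[{CLOSERS}]*|[^{EOS_CHARS}]+$"
)


def split_sentences(text):
    """Split text at the assumed end-of-sentence marks (hypothetical helper)."""
    pieces = (m.group().strip() for m in _SENTENCE_RE.finditer(text))
    return [p for p in pieces if p]


if __name__ == "__main__":
    sample = "今日は晴れです。明日は雨でしょう！本当？"
    # Print one sentence per line so the result can be checked by eye.
    for sentence in split_sentences(sample):
        print(sentence)
```

The output is one sentence per line, so it can be corrected by hand and then used as a starting point for training data. Whether such a split is good enough depends, as Jim says, on how ambiguous the punctuation of the language is.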
