Re: Asian Sentence Detector Models

James Kosin Wed, 21 Mar 2012 16:22:54 -0700

Jorn,

If there isn't anything for Korean, I could put something together.
The only problem would be getting freely available text.
I can start looking if needed.


James

On 3/21/2012 2:38 PM, Jörn Kottmann wrote:
> Here is a paper which describes Chinese sentence segmentation:
> www.aclweb.org/anthology/P/P11/P11-2111.pdf
>
> There they say that commas can be an end-of-sentence marker as well,
> but they are ambiguous.
>
> So we would need to add it as an eos char and
> we should create a new feature generator.
>
> Are there any free training data sets which could be used?
>
> Jörn
>
>
> On 03/21/2012 03:34 PM, Joern Kottmann wrote:
>> Wikipedia says: "Languages like Japanese and Chinese have unambiguous
>> sentence-ending markers."
>> In this case we might be able to write a rule-based sentence detector
>> for these languages?
>>
>> Jörn
>>
>> On Wed, Mar 21, 2012 at 3:18 PM, [email protected] wrote:
>>
>>     Hi
>>
>>     There is a Thai model for the sentence detector. I don't know who
>>     created it, but someone on the list knows and can point to an
>>     article about it.
>>     What I can say is that OpenNLP had to be customized to work with
>>     Thai, including the EOS characters, which are ' ' and '\n'.
>>
>>    
>> http://svn.apache.org/viewvc/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/sentdetect/lang/th/SentenceContextGenerator.java?view=markup
>>
>>
>>     William
>>
>>
>>     On Wed, Mar 21, 2012 at 8:05 AM, Jim - FooBar(); <[email protected]> wrote:
>>
>>     > Basically you need to know the punctuation signs indicating end of
>>     > sentence, or find someone who does, then use a regex to split the
>>     > sentences at those signs. It's not going to be perfect: you may
>>     > have to go over it once or twice with your own eyes to make sure
>>     > everything is OK before training. Everything depends on the
>>     > language and how ambiguous its punctuation is.
>>     >
>>     >
>>     > Jim
>>     >
>>     > On 20/03/12 18:38, Jairo Sarabia wrote:
>>     >
>>     >> Hi all,
>>     >>
>>     >> I see there aren't any Sentence Detector models for Asian
>>     >> languages in the OpenNLP repository, and I need them.
>>     >> I have to train Sentence Detector models for Chinese, Japanese
>>     >> and Korean, but I don't know these languages.
>>     >> How could I get training data files for these languages?
>>     >>
>>     >> Thanks in advance!
>>     >>
>>     >> Jairo Sarabia
>>     >>
>>     >>
>>     >
>>
>>
>
>
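
The regex-splitting idea suggested in the thread can be sketched roughly as below. This is only an illustration, not OpenNLP code: the set of end marks in EOS_CHARS, the CLOSERS set, and the function name split_sentences are all assumptions made for the example. It deliberately leaves out the ASCII period (common in Korean text and ambiguous) and the comma (which the paper cited above treats as an ambiguous end marker in Chinese). The output is one sentence per list item, which can then be written one sentence per line for manual review before training.

```python
import re

# Assumed sentence-ending marks: ideographic full stop, fullwidth and ASCII
# exclamation and question marks. The ASCII period and the comma are left out
# on purpose because they are ambiguous.
EOS_CHARS = "。！？!?"
# Assumed closing quotes/brackets that stay attached to the preceding end mark.
CLOSERS = "」』）”’"

# Either: text up to and including a run of end marks (plus closers),
# or: trailing text at the end of a line that has no end mark.
_SENTENCE = re.compile(
    r"[^{eos}]*[{eos}]+[{closers}]*|[^{eos}]+$".format(
        eos=EOS_CHARS, closers=CLOSERS
    )
)


def split_sentences(text):
    """Split text after each run of end marks; return non-empty sentences."""
    sentences = []
    for line in text.splitlines():
        for match in _SENTENCE.finditer(line):
            sentence = match.group().strip()
            if sentence:
                sentences.append(sentence)
    return sentences


if __name__ == "__main__":
    sample = "今日は晴れです。明日は雨でしょうか？"
    # Print one sentence per line, ready for a manual check.
    print("\n".join(split_sentences(sample)))
```

As noted in the thread, a rule like this will not be perfect. For example, a question quoted inside a longer sentence is cut at the inner question mark, so the result still needs to be checked by someone who reads the language.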
