Doesn't the guy who posted the original question have any sample
texts? I got the idea that he has some, but doesn't know the language(s)...
Jim
On 21/03/12 23:22, James Kosin wrote:
Jorn,
If there isn't anything for Korean, I could put something together.
Only problem would be getting free text.
I can start looking if needed.
James
On 3/21/2012 2:38 PM, Jörn Kottmann wrote:
Here is a paper which describes Chinese sentence segmentation:
www.aclweb.org/anthology/P/P11/P11-2111.pdf
There they say that commas can be end-of-sentence markers as well,
but they are ambiguous.
So we would need to add the comma as an EOS char,
and we should create a new feature generator.
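Something along these lines could work for the training side. This is only a sketch: it assumes a SentenceDetectorFactory that accepts custom EOS characters (available in later OpenNLP releases), and the file names are placeholders.

    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.OutputStream;
    import java.nio.charset.StandardCharsets;

    import opennlp.tools.sentdetect.SentenceDetectorFactory;
    import opennlp.tools.sentdetect.SentenceDetectorME;
    import opennlp.tools.sentdetect.SentenceModel;
    import opennlp.tools.sentdetect.SentenceSample;
    import opennlp.tools.sentdetect.SentenceSampleStream;
    import opennlp.tools.util.MarkableFileInputStreamFactory;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;
    import opennlp.tools.util.TrainingParameters;

    public class TrainChineseSentDetect {

        public static void main(String[] args) throws Exception {
            // Training data: one sentence per line, UTF-8 (file name is illustrative).
            ObjectStream<String> lines = new PlainTextByLineStream(
                    new MarkableFileInputStreamFactory(new File("zh-sent.train")),
                    StandardCharsets.UTF_8);
            ObjectStream<SentenceSample> samples = new SentenceSampleStream(lines);

            // Declare the ideographic full stop, question/exclamation marks and the
            // comma as possible end-of-sentence characters; the model then learns
            // which occurrences really end a sentence.
            char[] eos = {'。', '！', '？', '，'};
            SentenceDetectorFactory factory =
                    new SentenceDetectorFactory("zh", true, null, eos);

            SentenceModel model = SentenceDetectorME.train(
                    "zh", samples, factory, TrainingParameters.defaultParams());

            try (OutputStream out = new FileOutputStream("zh-sent.bin")) {
                model.serialize(out);
            }
        }
    }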
Are there any free training data sets which could be used?
Jörn
On 03/21/2012 03:34 PM, Joern Kottmann wrote:
Wikipedia says: "Languages like Japanese and Chinese have unambiguous
sentence-ending markers."
In that case we might be able to write a rule-based sentence detector
for these languages?
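If the markers really are unambiguous, a rule-based detector is little more than a split on those characters. A rough sketch, where the marker set and the handling of closing quotes are my assumptions, not a fixed spec:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    /** Naive rule-based sentence splitter for Chinese/Japanese text. */
    public class RuleBasedZhJaSentenceSplitter {

        // 。！？ end a sentence; closing quotes/brackets right after the
        // marker stay with the sentence they close.
        private static final Pattern SENTENCE =
                Pattern.compile("[^。！？]*[。！？]+[」』”）】]*");

        public static List<String> split(String text) {
            List<String> sentences = new ArrayList<>();
            Matcher m = SENTENCE.matcher(text);
            int end = 0;
            while (m.find()) {
                sentences.add(m.group().trim());
                end = m.end();
            }
            // Keep any trailing text that has no final marker.
            if (end < text.length() && !text.substring(end).trim().isEmpty()) {
                sentences.add(text.substring(end).trim());
            }
            return sentences;
        }

        public static void main(String[] args) {
            split("今日は雨です。明日は晴れるでしょうか？はい！").forEach(System.out::println);
        }
    }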
Jörn
On Wed, Mar 21, 2012 at 3:18 PM, [email protected] wrote:
Hi
There is a Thai model for the sentence detector. I don't know who created it,
but someone from the list knows and can point to some article about it.
What I can say is that OpenNLP had to be customized to work with Thai,
including the EOS characters, which are ' ' and '\n':
http://svn.apache.org/viewvc/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/sentdetect/lang/th/SentenceContextGenerator.java?view=markup
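Using such a model is the usual OpenNLP call sequence; a small sketch, where the model file name and the sample text are just placeholders:

    import java.io.FileInputStream;
    import java.io.InputStream;

    import opennlp.tools.sentdetect.SentenceDetectorME;
    import opennlp.tools.sentdetect.SentenceModel;

    public class DetectThaiSentences {

        public static void main(String[] args) throws Exception {
            // Load the Thai sentence model (file name is a placeholder).
            try (InputStream in = new FileInputStream("th-sent.bin")) {
                SentenceModel model = new SentenceModel(in);
                SentenceDetectorME detector = new SentenceDetectorME(model);

                // With the Thai customization, ' ' and '\n' act as EOS characters.
                String[] sentences = detector.sentDetect("ผมไปโรงเรียน วันนี้อากาศดี");
                for (String s : sentences) {
                    System.out.println(s);
                }
            }
        }
    }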
William
On Wed, Mar 21, 2012 at 8:05 AM, Jim - FooBar(); <[email protected]> wrote:
> Basically you need to know the punctuation signs indicating end of
> sentence, or find someone who does... then use regex to split the
> sentences at those signs! It's not going to be perfect - you may have to
> go over it once or twice with your own eyes to make sure everything is
> OK before training. Everything depends on the language and how ambiguous
> its punctuation is.
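>
> A minimal sketch of that bootstrap step, assuming Chinese-style markers
> (the marker set and file names are only placeholders): split at the known
> EOS marks, write one sentence per line, review the file by hand, then feed
> it to the sentence detector trainer.
>
>     import java.io.IOException;
>     import java.nio.charset.StandardCharsets;
>     import java.nio.file.Files;
>     import java.nio.file.Paths;
>     import java.util.ArrayList;
>     import java.util.List;
>
>     /** Bootstraps sentence-detector training data from raw text. */
>     public class BootstrapSentenceTrainingData {
>
>         public static void main(String[] args) throws IOException {
>             String raw = new String(
>                     Files.readAllBytes(Paths.get("raw-zh.txt")), StandardCharsets.UTF_8);
>
>             // Keep each end-of-sentence marker with the sentence it terminates.
>             String[] candidates = raw.split("(?<=[。！？])");
>
>             List<String> lines = new ArrayList<>();
>             for (String c : candidates) {
>                 String s = c.trim();
>                 if (!s.isEmpty()) {
>                     lines.add(s);   // one sentence per line, as the trainer expects
>                 }
>             }
>
>             // Review this file by hand before training on it.
>             Files.write(Paths.get("zh-sent.train"), lines, StandardCharsets.UTF_8);
>         }
>     }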
>
>
> Jim
>
> On 20/03/12 18:38, Jairo Sarabia wrote:
>
>> Hi all,
>>
>> I see there aren't sentence detector models for Asian languages in the
>> OpenNLP repository, and I need these.
>> I have to train sentence detector models for the Chinese, Japanese and
>> Korean languages, but I don't know these languages.
>> How could I get the training data files for these languages?
>>
>> Thanks in advance!
>>
>> Jairo Sarabia
>>
>>
>