Re: CJK LPs

2018-02-20 Thread Tommaso Teofili
Thanks for the hints, Matt!

Regards,
Tommaso


Re: CJK LPs

2018-02-19 Thread Matt Post
You just have to make sure that the language pack makes it easy to apply the
same pre-processing to test data that you applied at training time. That means
bundling the segmentation model with the language pack (or doing something
simple, like treating each character as a word, which degrades performance but
would be easier). I typically use the Stanford segmenter, but I'm not sure the
choice of segmenter matters all that much.
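
For what it's worth, the single-character fallback needs no external model at
all. A minimal sketch in Java (hypothetical class, treating every
non-whitespace code point as its own token):

    import java.util.stream.Collectors;

    // Naive single-character segmentation: emit each non-whitespace code
    // point as a separate token. No external segmenter is needed, at some
    // cost in quality compared to a real word segmenter.
    public final class CharSegmenter {
        public static String segment(String line) {
            return line.codePoints()
                       .filter(cp -> !Character.isWhitespace(cp))
                       .mapToObj(Character::toChars)
                       .map(String::new)
                       .collect(Collectors.joining(" "));
        }
    }

Run over training, tuning, and test data alike, this keeps the pre-processing
trivially reproducible inside the language pack.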

matt


Re: CJK LPs

2018-02-19 Thread Tommaso Teofili
Thanks, Matt.
Would you be able to describe that additional step in a bit more detail
when you have time?
I'm not sure what you used for segmentation; perhaps we could use either
Lucene's CJK [1] or Kuromoji [2] analyzers.

Regards,
Tommaso

[1] :
https://lucene.apache.org/core/7_0_0/analyzers-common/org/apache/lucene/analysis/cjk/CJKAnalyzer.html
[2] : https://lucene.apache.org/core/7_0_0/analyzers-kuromoji/
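
Note the two behave quite differently: CJKAnalyzer emits overlapping character
bigrams rather than dictionary words, while Kuromoji does dictionary-based
morphological analysis for Japanese. Driving either one would look roughly
like this (untested sketch, hypothetical helper class, assuming
lucene-analyzers-common 7.x on the classpath):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.cjk.CJKAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    // Untested sketch: run a Lucene analyzer over one input line and
    // collect the emitted tokens.
    public final class LuceneSegmenter {
        public static List<String> tokens(Analyzer analyzer, String line)
                throws IOException {
            List<String> out = new ArrayList<>();
            try (TokenStream ts = analyzer.tokenStream("text", line)) {
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                while (ts.incrementToken()) {
                    out.add(term.toString());
                }
                ts.end();
            }
            return out;
        }

        public static void main(String[] args) throws IOException {
            try (Analyzer cjk = new CJKAnalyzer()) {
                // CJKAnalyzer yields bigrams, e.g. 我喜 喜欢 欢机 ...
                System.out.println(tokens(cjk, "我喜欢机器翻译"));
            }
        }
    }

For Japanese, JapaneseAnalyzer from the Kuromoji module [2] should drop into
the same helper unchanged.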


CJK LPs

2018-02-19 Thread Tommaso Teofili
Hi all,

I am not sure if I am missing something, but I recalled that language packs
for Chinese (and also Japanese / Korean) existed at [1]; however, I can't
find any.
Reading through the comments, it seems that was at least the plan.
If they were left out in the recent LP migration, we could try to fix that;
otherwise it'd be nice to build and provide such CJK LPs.
Can anyone help clarify?

Regards,
Tommaso

[1] : https://cwiki.apache.org/confluence/display/JOSHUA/Language+Packs