Re: CJK LPs

thanks for the hints Matt!

Regards,
Tommaso

On Mon, 19 Feb 2018 at 16:49, Matt Post <p...@cs.jhu.edu> wrote:

> You just have to make sure that the language pack makes it easy to apply
> the same pre-processing to test data that you applied at training time.
> Which means bundling the segmentation model with the language pack (or
> doing something simple, like single-character words — that degrades
> performance but would be easier). I typically use the Stanford segmenter
> but I'm not sure it would matter that much.
>
> matt
>
>> On Feb 19, 2018, at 1:45 PM, Tommaso Teofili <tommaso.teof...@gmail.com> wrote:
>>
>> thanks Matt.
>> Would you be able to point out such an additional step in a bit more
>> detail when you have time?
>> Not sure what you used for segmentation; perhaps we could use either
>> Lucene's CJK [1] or Kuromoji [2] analyzers.
>>
>> Regards,
>> Tommaso
>>
>> [1] : https://lucene.apache.org/core/7_0_0/analyzers-common/org/apache/lucene/analysis/cjk/CJKAnalyzer.html
>> [2] : https://lucene.apache.org/core/7_0_0/analyzers-kuromoji/
>>
>> On Mon, 19 Feb 2018 at 12:12, Matt Post <p...@cs.jhu.edu> wrote:
>>
>>> I don't think I ever built these. There is an additional step of
>>> properly and consistently segmenting Chinese which complicates things
>>> and creates an external dependency.
>>>
>>> matt (from my phone)
>>>
>>>> On Feb 19, 2018, at 10:46, Tommaso Teofili <tommaso.teof...@gmail.com> wrote:
>>>>
>>>> Hi all,
>>>>
>>>> I am not sure if I am missing something, but I somewhat recall that
>>>> language packs for Chinese (and also Japanese / Korean) existed at [1];
>>>> however, I can't find any.
>>>> Reading through the comments it seems at least that was the plan.
>>>> If that is a leftover from the recent LP migration we could try to fix
>>>> it; otherwise it'd be nice to build and provide such CJK LPs.
>>>> Can anyone help clarify?
>>>>
>>>> Regards,
>>>> Tommaso
>>>>
>>>> [1] : https://cwiki.apache.org/confluence/display/JOSHUA/Language+Packs
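[Editor's note: the "single-character words" fallback Matt mentions can be sketched in plain Java with no Lucene or Stanford dependency. This is an illustrative sketch only, not Joshua's actual pre-processing; the class name and tokenization rules are hypothetical. It emits each CJK codepoint as its own token while leaving space-delimited Latin words intact, which is the simple scheme that avoids bundling a segmentation model at the cost of some translation quality.]

```java
import java.util.ArrayList;
import java.util.List;

public class NaiveCjkSegmenter {

    /** Splits CJK ideographs/kana/hangul into single-character tokens,
     *  leaving whitespace-delimited non-CJK words (Latin text, numbers) intact. */
    static List<String> segment(String line) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < line.length(); ) {
            int cp = line.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isWhitespace(cp)) {
                flush(word, tokens);
            } else if (isCjk(cp)) {
                flush(word, tokens); // end any pending non-CJK word
                tokens.add(new String(Character.toChars(cp))); // one char = one token
            } else {
                word.appendCodePoint(cp);
            }
        }
        flush(word, tokens);
        return tokens;
    }

    private static void flush(StringBuilder word, List<String> tokens) {
        if (word.length() > 0) {
            tokens.add(word.toString());
            word.setLength(0);
        }
    }

    private static boolean isCjk(int cp) {
        Character.UnicodeScript s = Character.UnicodeScript.of(cp);
        return s == Character.UnicodeScript.HAN
            || s == Character.UnicodeScript.HIRAGANA
            || s == Character.UnicodeScript.KATAKANA
            || s == Character.UnicodeScript.HANGUL;
    }

    public static void main(String[] args) {
        // prints: 我 喜 欢 NLP 和 机 器 翻 译
        System.out.println(String.join(" ", segment("我喜欢NLP和机器翻译")));
    }
}
```

Crucially, per Matt's point above, the same routine would have to be applied to both training data and test input; shipping a rule this simple inside the language pack is what removes the external dependency.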