Re: JapaneseAnalyzer's system vs user dict

2019-05-26 Thread Tomoko Uchida
> Anyway, in my personal opinion, Lucene does not need to consider whether
> the system dictionary status is good or not.

Please don't get me wrong, but I disagree.
Creating a customized or re-trained system dictionary still requires deep
knowledge of the language and of machine learning. Even among us
native Japanese speakers, very few people can do so.
The system dictionary is a key component of tokenization, so a badly
customized system dictionary directly affects search quality,
and I think we should prevent that. Instead of messing up the system
dictionary without sufficient knowledge, please use the user
dictionary. That is the reason why it exists.

Anyway, to build the system dictionary (MeCab IPADIC extensions), you
do not need to read or fix the DictionaryBuilder class.
Just modify analysis/kuromoji/build.xml to use the
customized/re-trained dictionary (tarball).

Tomoko

On Mon, May 27, 2019 at 1:48 Namgyu Kim wrote:
>
> Oh, I think my explanation was not enough. Sorry...
>
> I mentioned the following sentences.
> =
> 1. Modify your dictionary file and rebuild.
>   1-1) Install MeCab
>   1-2) Install MeCab Dictionary
>   1-3) Modify your dictionary file
>   1-4) Make it to tar.gz
> =
> The "1-3)" does not mean user modifies the csv files and compresses it back
> to tar.gz.
> It means re-training, of course user has to be careful and have knowledge
> of the Natural Language Processing.
> Column 2, 3 and 4 in csv values are the values produced by training.
> (2 : left context id, 3 : right context id, 4 : cost)
> These values are dependent on the model and matrix.def values. (when use
> mecab-dict-index)
>
> That's why I mentioned "1-1)" and "1-2)" processes first.
>
> Anyway, in my personal opinion, Lucene does not need to consider whether
> the system dictionary status is good or not.
> I just think when some user wants to use a custom system dictionary, it is
> not user-friendly to modify the ant file or find some code for a long time
> to run the DictionaryBuilder.
> I think there should be at least a guide.
>
> Warm regards,
> Namgyu Kim
>
> P.S. Although not as good as the Tomoko's contents, there is a list of
> dictionaries supported by kuromoji.
> (https://github.com/atilika/kuromoji#supported-dictionaries)
>
>
> On Mon, May 27, 2019 at 12:12 AM Tomoko Uchida wrote:
>
> > Hi,
> >
> > The system dictionary is not a mere "word collection", it includes a
> > machine-learned language model which is carefully trained by
> > researchers. If you want to replace the system dictionary, you have to
> > start from "re-train" the model. This needs expert knowledge so I do
> > not recommend to just modify the CSVs and rebuild it (if you do not
> > have an expert about it).
> >
> > As far as relates to "modern words" which is not included the current
> > system dictionary, there are already a few options.
> >
> > 1. Use neologd dictionary (it's an extension of MeCab IPADIC,
> > Kuromoji's default dictionary)
> >
> > For Solr:
> > https://github.com/mocobeta/lucene-solr/tree/kuromoji-neologd_5_4_0
> > (The branch is mine. A little bit old, but you can cherry-pick the
> > changes in the kuromoji's build.xml.)
> >
> > For Elasticsearch:
> > https://github.com/codelibs/elasticsearch-analysis-kuromoji-ipadic-neologd
> >
> > 2. Use Sudachi dictionary
> >
> > For Elasticsearch:
> > https://github.com/WorksApplications/elasticsearch-sudachi
> > This includes Lucene jar, so I think you can extract the jar for Solr
> > (I've never tried to use with Solr).
> >
> > Both are actively maintained by linguistics & NLP researchers/engineers.
> > Please be careful, those are rather huge jars...
> >
> > Hope that helps.
> >
> > Tomoko
> >
> > On Sun, May 26, 2019 at 23:11 Trejkaz wrote:
> > >
> > > On Sun, 26 May 2019 at 23:49, Namgyu Kim  wrote:
> > >
> > > > I think so about that approach.
> > > > It's not user-friendly and it is not good for the user.
> > >
> > > I think it's better to get the parameters in
> > >
> > > JapaneseTokenizer.
> > > >
> > > > What do you think about this?
> > >
> > >
> > > A way to override the system dictionary would be useful for us as well. We
> > > often get people complaining that the current dictionary is missing a lot
> > > of common modern words, and there are alternate mecab dictionaries sitting
> > > around already which solve this problem.
> > >
> > > TX
> > >
> > >
> > > >
> > > >
> > > >
> >
> > -
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >
> >




Re: JapaneseAnalyzer's system vs user dict

2019-05-26 Thread Namgyu Kim
Oh, I think my explanation was not clear enough. Sorry...

I wrote the following steps:
=
1. Modify your dictionary file and rebuild.
  1-1) Install MeCab
  1-2) Install MeCab Dictionary
  1-3) Modify your dictionary file
  1-4) Make it to tar.gz
=
Step "1-3)" does not mean that the user edits the CSV files by hand and
compresses them back into a tar.gz.
It means re-training, and of course the user has to be careful and have
knowledge of Natural Language Processing.
Columns 2, 3 and 4 of each CSV row are values produced by training
(2: left context ID, 3: right context ID, 4: cost).
These values depend on the model and the matrix.def values (when using
mecab-dict-index).
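
To make the column layout concrete, here is a minimal Java sketch that reads
the three trained fields from one row. It assumes the MeCab layout of surface
form first, then left context ID, right context ID and cost, with the
remaining columns holding features. The class name MecabCsvRowSketch and the
sample row are made up for illustration: the numbers are not taken from any
real dictionary, and the naive comma split ignores quoted fields.

```java
// Sketch only: reads the trained fields from one MeCab-style CSV row.
class MecabCsvRowSketch {
  final String surface; // column 1: surface form
  final int leftId;     // column 2: left context ID
  final int rightId;    // column 3: right context ID
  final int cost;       // column 4: word cost

  MecabCsvRowSketch(String surface, int leftId, int rightId, int cost) {
    this.surface = surface;
    this.leftId = leftId;
    this.rightId = rightId;
    this.cost = cost;
  }

  // Naive parser: splits on every comma, so quoted fields are not supported.
  static MecabCsvRowSketch parse(String line) {
    String[] cols = line.split(",", -1);
    if (cols.length < 4) {
      throw new IllegalArgumentException("expected at least 4 columns: " + line);
    }
    return new MecabCsvRowSketch(
        cols[0],
        Integer.parseInt(cols[1].trim()),
        Integer.parseInt(cols[2].trim()),
        Integer.parseInt(cols[3].trim()));
  }

  public static void main(String[] args) {
    // Illustrative values only, not copied from a real dictionary.
    MecabCsvRowSketch row = parse("example,1285,1285,5000,noun");
    System.out.println(row.surface + " left=" + row.leftId
        + " right=" + row.rightId + " cost=" + row.cost);
  }
}
```

Because the IDs index into matrix.def and the cost comes out of training,
editing these numbers by hand would leave them out of step with the model.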

That's why I listed steps "1-1)" and "1-2)" first.

Anyway, in my personal opinion, Lucene does not need to consider whether
the system dictionary status is good or not.
I just think that when a user wants to use a custom system dictionary, it is
not user-friendly to have to modify the ant file or dig through the code for
a long time to find out how to run the DictionaryBuilder.
I think there should be at least a guide.

Warm regards,
Namgyu Kim

P.S. Although it is not as good as Tomoko's list, there is a list of
dictionaries supported by kuromoji.
(https://github.com/atilika/kuromoji#supported-dictionaries)


On Mon, May 27, 2019 at 12:12 AM Tomoko Uchida wrote:

> Hi,
>
> The system dictionary is not a mere "word collection", it includes a
> machine-learned language model which is carefully trained by
> researchers. If you want to replace the system dictionary, you have to
> start from "re-train" the model. This needs expert knowledge so I do
> not recommend to just modify the CSVs and rebuild it (if you do not
> have an expert about it).
>
> As far as relates to "modern words" which is not included the current
> system dictionary, there are already a few options.
>
> 1. Use neologd dictionary (it's an extension of MeCab IPADIC,
> Kuromoji's default dictionary)
>
> For Solr:
> https://github.com/mocobeta/lucene-solr/tree/kuromoji-neologd_5_4_0
> (The branch is mine. A little bit old, but you can cherry-pick the
> changes in the kuromoji's build.xml.)
>
> For Elasticsearch:
> https://github.com/codelibs/elasticsearch-analysis-kuromoji-ipadic-neologd
>
> 2. Use Sudachi dictionary
>
> For Elasticsearch:
> https://github.com/WorksApplications/elasticsearch-sudachi
> This includes Lucene jar, so I think you can extract the jar for Solr
> (I've never tried to use with Solr).
>
> Both are actively maintained by linguistics & NLP researchers/engineers.
> Please be careful, those are rather huge jars...
>
> Hope that helps.
>
> Tomoko
>
> On Sun, May 26, 2019 at 23:11 Trejkaz wrote:
> >
> > On Sun, 26 May 2019 at 23:49, Namgyu Kim  wrote:
> >
> > > I think so about that approach.
> > > It's not user-friendly and it is not good for the user.
> >
> > I think it's better to get the parameters in
> >
> > JapaneseTokenizer.
> > >
> > > What do you think about this?
> >
> >
> > A way to override the system dictionary would be useful for us as well. We
> > often get people complaining that the current dictionary is missing a lot
> > of common modern words, and there are alternate mecab dictionaries sitting
> > around already which solve this problem.
> >
> > TX
> >
> >
> > >
> > >
> > >
>
>
>


Re: JapaneseAnalyzer's system vs user dict

2019-05-26 Thread Tomoko Uchida
Hi,

The system dictionary is not a mere "word collection"; it includes a
machine-learned language model which is carefully trained by
researchers. If you want to replace the system dictionary, you have to
start by re-training the model. This needs expert knowledge, so I do
not recommend just modifying the CSVs and rebuilding it (unless you
have an expert on this).

As for "modern words" that are not included in the current
system dictionary, there are already a few options.

1. Use neologd dictionary (it's an extension of MeCab IPADIC,
Kuromoji's default dictionary)

For Solr: https://github.com/mocobeta/lucene-solr/tree/kuromoji-neologd_5_4_0
(The branch is mine. It is a little old, but you can cherry-pick the
changes to kuromoji's build.xml.)

For Elasticsearch:
https://github.com/codelibs/elasticsearch-analysis-kuromoji-ipadic-neologd

2. Use Sudachi dictionary

For Elasticsearch: https://github.com/WorksApplications/elasticsearch-sudachi
This includes a Lucene jar, so I think you can extract the jar for use with
Solr (I've never tried it with Solr).

Both are actively maintained by linguistics & NLP researchers/engineers.
Please be careful, those are rather huge jars...

Hope that helps.

Tomoko

On Sun, May 26, 2019 at 23:11 Trejkaz wrote:
>
> On Sun, 26 May 2019 at 23:49, Namgyu Kim  wrote:
>
> > I think so about that approach.
> > It's not user-friendly and it is not good for the user.
>
> I think it's better to get the parameters in
>
> JapaneseTokenizer.
> >
> > What do you think about this?
>
>
> A way to override the system dictionary would be useful for us as well. We
> often get people complaining that the current dictionary is missing a lot
> of common modern words, and there are alternate mecab dictionaries sitting
> around already which solve this problem.
>
> TX
>
>
> >
> >
> >




Re: JapaneseAnalyzer's system vs user dict

2019-05-26 Thread Trejkaz
On Sun, 26 May 2019 at 23:49, Namgyu Kim  wrote:

> I think so about that approach.
> It's not user-friendly and it is not good for the user.
> I think it's better to get the parameters in constructor of
> JapaneseTokenizer.
>
> What do you think about this?


A way to override the system dictionary would be useful for us as well. We
often get people complaining that the current dictionary is missing a lot
of common modern words, and there are alternate mecab dictionaries sitting
around already which solve this problem.

TX




Re: JapaneseAnalyzer's system vs user dict

2019-05-26 Thread Namgyu Kim
I've been able to build a dictionary using DictionaryBuilder (I guess that
is what the "regenerate" task must be using?)
=>
Yes. That's right.
The "regenerate" target runs the following steps in order:
1) Compile the code (compile-tools)
2) Download the dictionary archive (download-dict)
3) Save Noun.proper.csv diffs (patch-dict)
4) Run DictionaryBuilder (build-dict)

Not a very user-friendly approach
=>
I agree.
It is not user-friendly, and that is not good for the user.
I think it would be better to pass the dictionary as a parameter to the
constructor of JapaneseTokenizer.

What do you think about this?
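
To show the kind of API I have in mind, here is a rough Java sketch of
constructor injection. Everything in it is hypothetical: WordDictionary,
MapDictionary and InjectableTokenizerSketch are stand-in types invented for
this mail, not Lucene classes, and the cost values are placeholders. It only
contrasts a hard-coded bundled dictionary with one supplied by the caller.

```java
import java.util.Map;

// Hypothetical stand-in for a system dictionary; not a Lucene type.
interface WordDictionary {
  Integer costOf(String surface); // returns null when the word is unknown
}

// Trivial map-backed implementation, used only for this illustration.
class MapDictionary implements WordDictionary {
  private final Map<String, Integer> costs;

  MapDictionary(Map<String, Integer> costs) {
    this.costs = costs;
  }

  @Override
  public Integer costOf(String surface) {
    return costs.get(surface);
  }
}

class InjectableTokenizerSketch {
  // Stand-in for the dictionary bundled inside the jar (placeholder cost).
  static final WordDictionary BUNDLED = new MapDictionary(Map.of("tokyo", 3000));

  private final WordDictionary systemDictionary;

  // Current style: the bundled dictionary is always used.
  InjectableTokenizerSketch() {
    this(BUNDLED);
  }

  // Proposed style: the caller supplies the system dictionary.
  InjectableTokenizerSketch(WordDictionary systemDictionary) {
    this.systemDictionary = systemDictionary;
  }

  boolean knows(String surface) {
    return systemDictionary.costOf(surface) != null;
  }
}
```

With a constructor like the second one, a user could load a custom dictionary
at runtime instead of rebuilding the jar.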

Warm regards,
Namgyu Kim


On Sun, May 26, 2019 at 9:19 PM Michael Sokolov wrote:

> Thanks, Namgyu. I've been able to build a dictionary using
> DictionaryBuilder (I guess that is what the "regenerate" task must be
> using?) and I can replace the existing one on the classpath with jar
> surgery for now. Not a very user-friendly approach, but it will enable
> me to run some experiments and see whether this is truly necessary for
> my use case.
>
> On Sun, May 26, 2019 at 7:56 AM Namgyu Kim  wrote:
> >
> > Sorry for the wrong information, Mike.
> > Tomoko is right.
> > I checked it wrong.
> >
> > User dictionary is independent from the system dictionary. If you give
> > the user entries, JapaneseTokenizer builds two FSTs one for the
> > built-in dictionary and one for the user dictionary and they are
> > retrieved separately.
> >
> > Please ignore the following lines in my e-mail.
> > 
> > Japanese Analyzer does not load dictionaries by default.
> > ...
> > Since it is a way to create and pass the UserDictionary object, there is
> no
> > conflict between user dictionary and system dictionary.
> > (You may choose only one of them! -> means userFST instance in
> > JapaneseTokenizer)
> > =
> >
> > The System dictionary and the User dictionary are separated and can have
> > each.
> >
> > About System dictionary,
> > As I know, it is not possible to change the System dictionary at the code
> > level.
> > The part that reads the System dictionary is hard-coded.
> > (TokenInfoDictionary, UnknownDictionary, BinaryDictionary)
> > If you really need it, can you make a JIRA issue and proceed with me?
> >
> > But there is a way to build a new kuromoji jar.
> > 1. Modify your dictionary file and rebuild.
> >   1-1) Install MeCab
> >   1-2) Install MeCab Dictionary
> >   1-3) Modify your dictionary file
> >   1-4) Make it to tar.gz
> > 2. change kuromoji/ivy.xml from
> > https://jaist.dl.sourceforge.net/project/mecab/mecab-ipadic/2.7.0-20070801/mecab-ipadic-2.7.0-20070801.tar.gz
> > "/>
> > to
> > 
> > 3. "ant regenerate" in /your/path/lucene-solr/lucene/analysis/kuromoji
> > 4. "ant jar"
> >
> > I wish I could help you.
> >
> > Warm regards,
> > Namgyu Kim
> >
> > On Sun, May 26, 2019 at 9:03 AM Michael Sokolov wrote:
> >
> > > Thank you for the detailed responses! What Tomoko is saying seems
> > > consistent with my cursory reading of the code. The reason I asked is
> > > I have a customer that thinks they want to replace the system
> > > dictionary, and I am trying to see if that is necessary. It seems as
> > > if for the most part, we can supply a comprehensive user dictionary
> > > and it would pretty much take the place of the system dictionary,
> > > assuming it is a superset (covers at least the original system dict
> > > tokens), but there is probably no way to "remove" a token that is
> > > present in the system dictionary (or maybe it can effectively be
> > > removed by adding it to user dictionary with a high penalty?). I'm not
> > > sure why one would want to do this removal, just trying to understand
> > > the design parameters.
> > >
> > > On Sat, May 25, 2019 at 7:30 PM Tomoko Uchida
> > >  wrote:
> > > >
> > > > Hi,
> > > >
> > > > > If I provide entries in the user
> > > > dictionary is it just as if I had included them in the system
> > > > dictionary? If the same entry occurs in both, do the user dictionary
> > > > weights supersede those in the system dictionary? Is there some way
> to
> > > > suppress entries in the system dict?
> > > >
> > > > User dictionary is independent from the system dictionary. If you
> give
> > > > the user entries, JapaneseTokenizer builds two FSTs one for the
> > > > built-in dictionary and one for the user dictionary and they are
> > > > retrieved separately.
> > > >
> > > > First the user dictionary is retrieved, and if there are no entries
> > > > matched then the system dictionary is retrieved. So if any entry is
> > > > found in the user dictionary, all possible candidates in the system
> > > > dictionary are ignored (suppressed).
> > > >
> > > > (I think this is kuromoji specific behaviour, the original mecab pos
> > > > tagger retrieves both of the system dictionary and user dictionary
> and
> > > > compares their weights by performing Viterbi. In fact the behaviour -
> > > > always gives priority to 

Re: JapaneseAnalyzer's system vs user dict

2019-05-26 Thread Michael Sokolov
Thanks, Namgyu. I've been able to build a dictionary using
DictionaryBuilder (I guess that is what the "regenerate" task must be
using?) and I can replace the existing one on the classpath with jar
surgery for now. Not a very user-friendly approach, but it will enable
me to run some experiments and see whether this is truly necessary for
my use case.

On Sun, May 26, 2019 at 7:56 AM Namgyu Kim  wrote:
>
> Sorry for the wrong information, Mike.
> Tomoko is right.
> I checked it wrong.
>
> User dictionary is independent from the system dictionary. If you give
> the user entries, JapaneseTokenizer builds two FSTs one for the
> built-in dictionary and one for the user dictionary and they are
> retrieved separately.
>
> Please ignore the following lines in my e-mail.
> 
> Japanese Analyzer does not load dictionaries by default.
> ...
> Since it is a way to create and pass the UserDictionary object, there is no
> conflict between user dictionary and system dictionary.
> (You may choose only one of them! -> means userFST instance in
> JapaneseTokenizer)
> =
>
> The System dictionary and the User dictionary are separated and can have
> each.
>
> About System dictionary,
> As I know, it is not possible to change the System dictionary at the code
> level.
> The part that reads the System dictionary is hard-coded.
> (TokenInfoDictionary, UnknownDictionary, BinaryDictionary)
> If you really need it, can you make a JIRA issue and proceed with me?
>
> But there is a way to build a new kuromoji jar.
> 1. Modify your dictionary file and rebuild.
>   1-1) Install MeCab
>   1-2) Install MeCab Dictionary
>   1-3) Modify your dictionary file
>   1-4) Make it to tar.gz
> 2. change kuromoji/ivy.xml from
> https://jaist.dl.sourceforge.net/project/mecab/mecab-ipadic/2.7.0-20070801/mecab-ipadic-2.7.0-20070801.tar.gz
> "/>
> to
> 
> 3. "ant regenerate" in /your/path/lucene-solr/lucene/analysis/kuromoji
> 4. "ant jar"
>
> I wish I could help you.
>
> Warm regards,
> Namgyu Kim
>
> On Sun, May 26, 2019 at 9:03 AM Michael Sokolov wrote:
>
> > Thank you for the detailed responses! What Tomoko is saying seems
> > consistent with my cursory reading of the code. The reason I asked is
> > I have a customer that thinks they want to replace the system
> > dictionary, and I am trying to see if that is necessary. It seems as
> > if for the most part, we can supply a comprehensive user dictionary
> > and it would pretty much take the place of the system dictionary,
> > assuming it is a superset (covers at least the original system dict
> > tokens), but there is probably no way to "remove" a token that is
> > present in the system dictionary (or maybe it can effectively be
> > removed by adding it to user dictionary with a high penalty?). I'm not
> > sure why one would want to do this removal, just trying to understand
> > the design parameters.
> >
> > On Sat, May 25, 2019 at 7:30 PM Tomoko Uchida
> >  wrote:
> > >
> > > Hi,
> > >
> > > > If I provide entries in the user
> > > dictionary is it just as if I had included them in the system
> > > dictionary? If the same entry occurs in both, do the user dictionary
> > > weights supersede those in the system dictionary? Is there some way to
> > > suppress entries in the system dict?
> > >
> > > User dictionary is independent from the system dictionary. If you give
> > > the user entries, JapaneseTokenizer builds two FSTs one for the
> > > built-in dictionary and one for the user dictionary and they are
> > > retrieved separately.
> > >
> > > First the user dictionary is retrieved, and if there are no entries
> > > matched then the system dictionary is retrieved. So if any entry is
> > > found in the user dictionary, all possible candidates in the system
> > > dictionary are ignored (suppressed).
> > >
> > > (I think this is kuromoji specific behaviour, the original mecab pos
> > > tagger retrieves both of the system dictionary and user dictionary and
> > > compares their weights by performing Viterbi. In fact the behaviour -
> > > always gives priority to the entries in the user dictionary - is a bit
> > > too aggressive from the point of view of the consistency of
> > > tokenization. I do not know why, but there may be some performance
> > > reasons?)
> > >
> > > I think you can easily find the retrieval logic I described here in
> > > JapaneseTokenizer#parse() method. (Let me know if my understanding is
> > > not correct.)
> > >
> > > Regards,
> > > Tomoko
> > >
> > > On Sun, May 26, 2019 at 5:08 김남규 wrote:
> > > >
> > > > Hi, Mike :D
> > > >
> > > > Japanese Analyzer does not load dictionaries by default.
> > > > If you look at the constructor, you can see that it is created as null
> > if
> > > > not set parameters.
> > > > (check testUserDict3() in TestJapaneseAnalyzer.java)
> > > >
> > > > In JapaneseTokenizer,
> > > > =
> > > > if (userDictionary != null) {
> > > >   

Re: JapaneseAnalyzer's system vs user dict

2019-05-26 Thread Namgyu Kim
Sorry for the wrong information, Mike.
Tomoko is right.
I misread the code.

The user dictionary is independent of the system dictionary. If you supply
user entries, JapaneseTokenizer builds two FSTs, one for the
built-in dictionary and one for the user dictionary, and they are
looked up separately.
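
As Tomoko describes in the message quoted below, the user dictionary is
looked up first and the system dictionary is consulted only when the user
dictionary has no match. A minimal Java sketch of that order follows. It is
not Lucene code: plain sets stand in for the two FSTs, the class name
DictPrecedenceSketch is made up, and the words are romanized placeholders.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Simplified model of the lookup order; sets replace the FSTs.
class DictPrecedenceSketch {

  // Returns every dictionary word that starts at position `start` of `text`.
  static List<String> prefixMatches(Set<String> dict, String text, int start) {
    List<String> hits = new ArrayList<>();
    for (int end = start + 1; end <= text.length(); end++) {
      String candidate = text.substring(start, end);
      if (dict.contains(candidate)) {
        hits.add(candidate);
      }
    }
    return hits;
  }

  // User dictionary first; the system dictionary is used only when the
  // user dictionary yields nothing at this position.
  static List<String> candidates(Set<String> userDict, Set<String> systemDict,
                                 String text, int start) {
    List<String> userHits = prefixMatches(userDict, text, start);
    if (!userHits.isEmpty()) {
      return userHits; // system candidates are suppressed
    }
    return prefixMatches(systemDict, text, start);
  }

  public static void main(String[] args) {
    Set<String> user = Set.of("kansaikokusaikuukou");
    Set<String> system = Set.of("kansai", "kokusai", "kuukou");
    System.out.println(candidates(user, system, "kansaikokusaikuukou", 0));
    System.out.println(candidates(user, system, "kokusaikuukou", 0));
  }
}
```

In the first call the user entry wins and "kansai" from the system set is
never offered; in the second call there is no user match, so the system set
is used.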

Please ignore the following lines in my e-mail.

Japanese Analyzer does not load dictionaries by default.
...
Since it is a way to create and pass the UserDictionary object, there is no
conflict between user dictionary and system dictionary.
(You may choose only one of them! -> means userFST instance in
JapaneseTokenizer)
=

The system dictionary and the user dictionary are separate, and you can use
both.

About the system dictionary:
As far as I know, it is not possible to replace the system dictionary at the
code level.
The part that reads the system dictionary is hard-coded.
(TokenInfoDictionary, UnknownDictionary, BinaryDictionary)
If you really need it, could you open a JIRA issue so we can work on it together?

But there is a way to build a new kuromoji jar.
1. Modify your dictionary file and rebuild.
  1-1) Install MeCab
  1-2) Install MeCab Dictionary
  1-3) Modify your dictionary file
  1-4) Make it to tar.gz
2. Change the dictionary URL in kuromoji/ivy.xml from
https://jaist.dl.sourceforge.net/project/mecab/mecab-ipadic/2.7.0-20070801/mecab-ipadic-2.7.0-20070801.tar.gz
to the location of your own tar.gz
3. "ant regenerate" in /your/path/lucene-solr/lucene/analysis/kuromoji
4. "ant jar"

I hope this helps.

Warm regards,
Namgyu Kim

On Sun, May 26, 2019 at 9:03 AM Michael Sokolov wrote:

> Thank you for the detailed responses! What Tomoko is saying seems
> consistent with my cursory reading of the code. The reason I asked is
> I have a customer that thinks they want to replace the system
> dictionary, and I am trying to see if that is necessary. It seems as
> if for the most part, we can supply a comprehensive user dictionary
> and it would pretty much take the place of the system dictionary,
> assuming it is a superset (covers at least the original system dict
> tokens), but there is probably no way to "remove" a token that is
> present in the system dictionary (or maybe it can effectively be
> removed by adding it to user dictionary with a high penalty?). I'm not
> sure why one would want to do this removal, just trying to understand
> the design parameters.
>
> On Sat, May 25, 2019 at 7:30 PM Tomoko Uchida
>  wrote:
> >
> > Hi,
> >
> > > If I provide entries in the user
> > dictionary is it just as if I had included them in the system
> > dictionary? If the same entry occurs in both, do the user dictionary
> > weights supersede those in the system dictionary? Is there some way to
> > suppress entries in the system dict?
> >
> > User dictionary is independent from the system dictionary. If you give
> > the user entries, JapaneseTokenizer builds two FSTs one for the
> > built-in dictionary and one for the user dictionary and they are
> > retrieved separately.
> >
> > First the user dictionary is retrieved, and if there are no entries
> > matched then the system dictionary is retrieved. So if any entry is
> > found in the user dictionary, all possible candidates in the system
> > dictionary are ignored (suppressed).
> >
> > (I think this is kuromoji specific behaviour, the original mecab pos
> > tagger retrieves both of the system dictionary and user dictionary and
> > compares their weights by performing Viterbi. In fact the behaviour -
> > always gives priority to the entries in the user dictionary - is a bit
> > too aggressive from the point of view of the consistency of
> > tokenization. I do not know why, but there may be some performance
> > reasons?)
> >
> > I think you can easily find the retrieval logic I described here in
> > JapaneseTokenizer#parse() method. (Let me know if my understanding is
> > not correct.)
> >
> > Regards,
> > Tomoko
> >
> > On Sun, May 26, 2019 at 5:08 김남규 wrote:
> > >
> > > Hi, Mike :D
> > >
> > > Japanese Analyzer does not load dictionaries by default.
> > > If you look at the constructor, you can see that it is created as null if
> > > not set parameters.
> > > (check testUserDict3() in TestJapaneseAnalyzer.java)
> > >
> > > In JapaneseTokenizer,
> > > =
> > > if (userDictionary != null) {
> > >   userFST = userDictionary.getFST();
> > >   userFSTReader = userFST.getBytesReader();
> > > } else {
> > >   userFST = null;
> > >   userFSTReader = null;
> > > }
> > > =
> > > Since it is a way to create and pass the UserDictionary object, there is no
> > > conflict between user dictionary and system dictionary.
> > > (You may choose only one of them! -> means userFST instance in
> > > JapaneseTokenizer)
> > >
> > > About dictionary,
> > > Lucene has one pre-built dictionary by default since Lucene 3.6.
> > > You can check it in org.apache.lucene.analysis.ja.dict.
> > > It