Re: JapaneseAnalyzer's system vs user dict

2019-05-28 Thread Namgyu Kim
Hi Tomoko :D Thank you for your reply and for listening to my thoughts. I didn't know this question was an old one. Of course, I want to participate in the LUCENE-8816 issue. I think this issue will take some time. I'll check it. Warm regards, Namgyu Kim On Tue, May 28, 2019 at 10:43 PM Tomoko Uchida

Re: JapaneseAnalyzer's system vs user dict

2019-05-28 Thread Tomoko Uchida
Hi guys, I just created an issue related to this thread: "Decouple Kuromoji's morphological analyser and its dictionary", https://issues.apache.org/jira/browse/LUCENE-8816 The problem discussed here is essentially within the current architecture of Kuromoji (and Nori), "jar bundled system dictionar

Re: JapaneseAnalyzer's system vs user dict

2019-05-27 Thread Tomoko Uchida
Hi Namgyu, > There is a team that uses a well-ported system dictionary. > The Lucene version is up. (like 8.1 -> 8.2) > Suppose there was no modification to kuromoji in 8.2. > But the user has to port again. > The same goes for 8.2 to 8.3. I'm not sure about the situation in Korea; however, we al

Re: JapaneseAnalyzer's system vs user dict

2019-05-27 Thread Namgyu Kim
Thank you for your reply, Tomoko :D To be honest, I have not experienced it directly (that is, in a commercial setting), so I can't speak to the exact situation of the Japanese MeCab. I respect your opinion and it is true that customization is a difficult task. But I can talk a little bit about Korean MeCab. (The

Re: JapaneseAnalyzer's system vs user dict

2019-05-26 Thread Tomoko Uchida
> Anyway, in my personal opinion, Lucene does not need to consider whether the system dictionary status is good or not. Please don't get me wrong, but I don't think so. Creating a customized or re-trained system dictionary still needs deep knowledge about language and machine-learning. Even among

Re: JapaneseAnalyzer's system vs user dict

2019-05-26 Thread Namgyu Kim
Oh, I think my explanation was not clear enough. Sorry... I mentioned the following steps: 1. Modify your dictionary file and rebuild. 1-1) Install MeCab 1-2) Install MeCab Dictionary 1-3) Modify your dictionary file 1-4) Package it as tar.gz

Re: JapaneseAnalyzer's system vs user dict

2019-05-26 Thread Tomoko Uchida
Hi, The system dictionary is not a mere "word collection"; it includes a machine-learned language model that was carefully trained by researchers. If you want to replace the system dictionary, you have to start by re-training the model. This needs expert knowledge, so I do not recommend to just mo

Re: JapaneseAnalyzer's system vs user dict

2019-05-26 Thread Trejkaz
On Sun, 26 May 2019 at 23:49, Namgyu Kim wrote: > I think so about that approach. > It's not user-friendly and it is not good for the user. I think it's better to get the parameters in JapaneseTokenizer. > > What do you think about this? A way to override the system dictionary would be useful

Re: JapaneseAnalyzer's system vs user dict

2019-05-26 Thread Namgyu Kim
> I've been able to build a dictionary using DictionaryBuilder (I guess that is what the "regenerate" task must be using?) Yes, that's right. The "regenerate" task runs its commands in the following order: 1) Compile the code (compile-tools) 2) Download the jar file (download-dict) 3) Save Noun.proper.csv d

Re: JapaneseAnalyzer's system vs user dict

2019-05-26 Thread Michael Sokolov
Thanks, Namgyu. I've been able to build a dictionary using DictionaryBuilder (I guess that is what the "regenerate" task must be using?) and I can replace the existing one on the classpath with jar surgery for now. Not a very user-friendly approach, but it will enable me to run some experiments and

Re: JapaneseAnalyzer's system vs user dict

2019-05-26 Thread Namgyu Kim
Sorry for the wrong information, Mike. Tomoko is right. I checked it incorrectly. The user dictionary is independent of the system dictionary. If you supply user entries, JapaneseTokenizer builds two FSTs, one for the built-in dictionary and one for the user dictionary, and they are queried separately.
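A minimal sketch of the "two separate lookups" idea described above, not Kuromoji's actual code: plain hash maps stand in for the two FSTs, and the class name, method names, and sample entries are all hypothetical. It only illustrates that each dictionary is queried on its own; how the tokenizer weighs user matches against system matches is not modeled here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: HashMaps are an assumption standing in for Lucene FSTs.
public class TwoDictionaryLookup {
    private final Map<String, String> systemDict = new HashMap<>();
    private final Map<String, String> userDict = new HashMap<>();

    public void addSystem(String surface, String partOfSpeech) {
        systemDict.put(surface, partOfSpeech);
    }

    public void addUser(String surface, String partOfSpeech) {
        userDict.put(surface, partOfSpeech);
    }

    // Collects every dictionary entry that matches the text starting at
    // offset. The user dictionary and the system dictionary are searched
    // independently; neither lookup sees the other's entries.
    public List<String> candidates(String text, int offset) {
        List<String> out = new ArrayList<>();
        collect(userDict, "user", text, offset, out);
        collect(systemDict, "system", text, offset, out);
        return out;
    }

    private static void collect(Map<String, String> dict, String source,
                                String text, int offset, List<String> out) {
        // Try every prefix of the remaining text as a dictionary key.
        for (int end = offset + 1; end <= text.length(); end++) {
            String surface = text.substring(offset, end);
            String pos = dict.get(surface);
            if (pos != null) {
                out.add(source + ":" + surface + "/" + pos);
            }
        }
    }

    // Builds a small example with made-up entries.
    public static TwoDictionaryLookup demo() {
        TwoDictionaryLookup d = new TwoDictionaryLookup();
        d.addSystem("関西", "名詞");
        d.addSystem("国際", "名詞");
        d.addSystem("空港", "名詞");
        d.addUser("関西国際空港", "カスタム名詞");
        return d;
    }

    public static void main(String[] args) {
        // Prints [user:関西国際空港/カスタム名詞, system:関西/名詞]
        System.out.println(demo().candidates("関西国際空港", 0));
    }
}
```

In this toy model, adding a user entry never changes what the system map returns, which is the sense in which the two dictionaries are independent.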

Re: JapaneseAnalyzer's system vs user dict

2019-05-25 Thread Michael Sokolov
Thank you for the detailed responses! What Tomoko is saying seems consistent with my cursory reading of the code. The reason I asked is I have a customer that thinks they want to replace the system dictionary, and I am trying to see if that is necessary. It seems as if for the most part, we can sup

Re: JapaneseAnalyzer's system vs user dict

2019-05-25 Thread Tomoko Uchida
Hi, > If I provide entries in the user dictionary is it just as if I had included them in the system dictionary? If the same entry occurs in both, do the user dictionary weights supersede those in the system dictionary? Is there some way to suppress entries in the system dict? User dictionary is

Re: JapaneseAnalyzer's system vs user dict

2019-05-25 Thread 김남규
Hi, Mike :D JapaneseAnalyzer does not load dictionaries by default. If you look at the constructor, you can see that it is set to null when no parameter is given. (Check testUserDict3() in TestJapaneseAnalyzer.java.) In JapaneseTokenizer, = if (userDicti