Hi Tomoko :D
Thank you for your reply and for listening to my thoughts.
I didn't know this question was an old one.
Of course, I want to participate in the LUCENE-8816 issue.
I think this issue will take some time.
I'll check it.
Warm regards,
Namgyu Kim
On Tue, May 28, 2019 at 10:43 PM Tomoko Uchida
Hi guys,
I just created an issue related to this thread.
Decouple Kuromoji's morphological analyser and its dictionary
https://issues.apache.org/jira/browse/LUCENE-8816
The problem discussed here is essentially within the current
architecture of Kuromoji (and Nori), "jar bundled system dictionar
Hi Namgyu,
> There is a team that uses a well-ported system dictionary.
> The Lucene version is up. (like 8.1 -> 8.2)
> Suppose there was no modification to kuromoji in 8.2.
> But the user has to port again.
> The same goes for 8.2 to 8.3.
I'm not sure about the situation in Korea, however, we al
Thank you for your reply, Tomoko :D
To be honest, I have not experienced it directly (I mean in commercial use), so I
can't speak to the exact situation of the Japanese MeCab.
I respect your opinion and it is true that customization is a difficult
task.
But I can talk a little bit about Korean MeCab. (The
> Anyway, in my personal opinion, Lucene does not need to consider whether
> the system dictionary status is good or not.
Please don't get me wrong, but I don't think so.
Creating a customized or re-trained system dictionary still needs deep
knowledge about language and machine-learning. Even among
Oh, I think my explanation was not enough. Sorry...
I mentioned the following steps.
=
1. Modify your dictionary file and rebuild.
1-1) Install MeCab
1-2) Install MeCab Dictionary
1-3) Modify your dictionary file
1-4) Package it as a tar.gz
==
Hi,
The system dictionary is not a mere "word collection", it includes a
machine-learned language model which is carefully trained by
researchers. If you want to replace the system dictionary, you have to
start by re-training the model. This needs expert knowledge, so I do
not recommend to just mo
On Sun, 26 May 2019 at 23:49, Namgyu Kim wrote:
> I feel the same way about that approach.
> It's not user-friendly and it is not good for the user.
> I think it's better to get the parameters in
> JapaneseTokenizer.
>
> What do you think about this?
A way to override the system dictionary would be useful
I've been able to build a dictionary using DictionaryBuilder (I guess that
is what the "regenerate" task must be using?)
=>
Yes. That's right.
The "regenerate" task runs commands in the following order:
1) Compile the code (compile-tools)
2) Download the jar file (download-dict)
3) Save Noun.proper.csv d
Thanks, Namgyu. I've been able to build a dictionary using
DictionaryBuilder (I guess that is what the "regenerate" task must be
using?) and I can replace the existing one on the classpath with jar
surgery for now. Not a very user-friendly approach, but it will enable
me to run some experiments and
Sorry for the wrong information, Mike.
Tomoko is right.
I checked it incorrectly.
The user dictionary is independent of the system dictionary. If you
provide user entries, JapaneseTokenizer builds two FSTs, one for the
built-in dictionary and one for the user dictionary, and they are
retrieved separately.
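To make the "two structures, retrieved separately" idea concrete, here is a toy sketch. It is not Lucene code: the class TwoDictionaryLookup and its methods are hypothetical names, a TreeMap stands in for each FST, and the romanized entries are made up for illustration. It only shows that each dictionary is queried on its own and the candidates from both are collected; it says nothing about how the tokenizer later chooses between them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model only: TreeMap is a stand-in for an FST, and none of these
// names are Lucene classes or methods.
class TwoDictionaryLookup {
    // Built-in ("system") entries and user entries live in separate structures.
    private final TreeMap<String, Integer> systemDict = new TreeMap<>();
    private final TreeMap<String, Integer> userDict = new TreeMap<>();

    void addSystemEntry(String surface, int wordId) { systemDict.put(surface, wordId); }

    void addUserEntry(String surface, int wordId) { userDict.put(surface, wordId); }

    // Collect every dictionary word starting at `offset`,
    // querying the user dictionary and the system dictionary independently.
    List<String> candidatesAt(String text, int offset) {
        List<String> out = new ArrayList<>();
        collect(userDict, "USER", text, offset, out);
        collect(systemDict, "SYSTEM", text, offset, out);
        return out;
    }

    // Try each prefix of text[offset..] and record the ones present in `dict`.
    private static void collect(Map<String, Integer> dict, String tag,
                                String text, int offset, List<String> out) {
        for (int end = offset + 1; end <= text.length(); end++) {
            String prefix = text.substring(offset, end);
            if (dict.containsKey(prefix)) {
                out.add(tag + ":" + prefix);
            }
        }
    }

    public static void main(String[] args) {
        TwoDictionaryLookup d = new TwoDictionaryLookup();
        d.addSystemEntry("kansai", 1);
        d.addSystemEntry("kokusai", 2);
        d.addSystemEntry("kuukou", 3);
        d.addUserEntry("kansaikokusaikuukou", 100);
        // Prints: [USER:kansaikokusaikuukou, SYSTEM:kansai]
        System.out.println(d.candidatesAt("kansaikokusaikuukou", 0));
    }
}
```

In this toy, a user entry does not remove or overwrite anything in the system map; both candidates simply show up side by side at the same position.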
Thank you for the detailed responses! What Tomoko is saying seems
consistent with my cursory reading of the code. The reason I asked is
I have a customer that thinks they want to replace the system
dictionary, and I am trying to see if that is necessary. It seems as
if for the most part, we can sup
Hi,
> If I provide entries in the user
> dictionary, is it just as if I had included them in the system
> dictionary? If the same entry occurs in both, do the user dictionary
> weights supersede those in the system dictionary? Is there some way to
> suppress entries in the system dict?
User dictionary is
Hi, Mike :D
JapaneseAnalyzer does not load dictionaries by default.
If you look at the constructor, you can see that it is created as null if
the parameters are not set.
(check testUserDict3() in TestJapaneseAnalyzer.java)
In JapaneseTokenizer,
=
if (userDicti