Re: [PR] LUCENE-4056: Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary [lucene]
github-actions[bot] commented on PR #12517: URL: https://github.com/apache/lucene/pull/12517#issuecomment-2050752715 This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contribution! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] LUCENE-4056: Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary [lucene]
azagniotov commented on PR #12517: URL: https://github.com/apache/lucene/pull/12517#issuecomment-2026358207 @mocobeta thank you. I have not done any benchmarks, thus, I cannot comment on potential performance implications. One thing that probably be certain that a larger dictionary will require more memory allocated. Btw, have you had a chance to evaluate the correctness of tokenization? @mikemccand this sounds interesting. It seems like this is a treaded path. How easy/hard will it be to point `runAnalyzerPerf.py` to the current PR branch? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] LUCENE-4056: Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary [lucene]
mikemccand commented on PR #12517: URL: https://github.com/apache/lucene/pull/12517#issuecomment-2025238979 > The built kuromoji jar with unidic-cwj-3.1.1-full eventually becomes 442M. Besides the size, I think we should consider performance. I'm worried that there can be a significant impact on analysis/indexing speed. Do you have any benchmark result on that? luceneutil has a `runAnalyzerPerf.py` benchmark, and the nightly script runs it and [posts the results here](https://home.apache.org/~mikemccand/lucenebench/analyzers.html), including Kuromoji. We could maybe test tokens/sec throughput using that benchy for this PR running on UniDic? I'm not sure how we would conclude whether it's fast enough since current Kuromoji (`main`) can't build UniDic. Maybe we could defer performance testing / optimizing on UniDic to a later issue... -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] LUCENE-4056: Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary [lucene]
mocobeta commented on PR #12517: URL: https://github.com/apache/lucene/pull/12517#issuecomment-2016769417 Hi, sorry for my late reply. I quickly checked the built dictionary size. The latest Unidic is fairly (to me, insanely) large - its total size is 1.6G. https://clrd.ninjal.ac.jp/unidic/back_number.html#unidic_cwj The built kuromoji jar with unidic-cwj-3.1.1-full eventually becomes 442M. Besides the size, I think we should consider performance. I'm worried that there can be a significant impact on analysis/indexing speed. Do you have any benchmark result on that? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] LUCENE-4056: Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary [lucene]
github-actions[bot] commented on PR #12517: URL: https://github.com/apache/lucene/pull/12517#issuecomment-1984826022 This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contribution! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] LUCENE-4056: Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary [lucene]
azagniotov commented on PR #12517: URL: https://github.com/apache/lucene/pull/12517#issuecomment-1960506371 Thank you, @mocobeta -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] LUCENE-4056: Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary [lucene]
mocobeta commented on PR #12517: URL: https://github.com/apache/lucene/pull/12517#issuecomment-1960500994 @azagniotov Sorry, I've not been available for a while. Let me take a look; I will try to find time next week... -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] LUCENE-4056: Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary [lucene]
azagniotov commented on PR #12517: URL: https://github.com/apache/lucene/pull/12517#issuecomment-1960013536 @uschindler yes: I have rebased from the latest `main` branch and ran the `./gradlew clean regenerate`. The following is the (partial) output: ``` ... ... > Task :lucene:analysis:nori:compileMecabKo Download https://bitbucket.org/eunjeon/mecab-ko-dic/downloads/mecab-ko-dic-2.1.1-20180720.tar.gz > Task :lucene:analysis:kuromoji:compileMecab Download https://jaist.dl.sourceforge.net/project/mecab/mecab-ipadic/2.7.0-20070801/mecab-ipadic-2.7.0-20070801.tar.gz Automaton regenerated from dictionary: mecab-ipadic-2.7.0-20070801 > Task :lucene:analysis:nori:compileMecabKo Automaton regenerated from dictionary: mecab-ko-dic-2.1.1-20180720 BUILD SUCCESSFUL in 39s 127 actionable tasks: 121 executed, 6 up-to-date ``` I also confirmed that the dictionary files `*.dat` were regenerated under the `lucene/analysis/kuromoji/src/resources/org/apache/lucene/analysis/ja` based on the modified date of the files. @mocobeta What do you think with regards to next steps with this PR? Is there anything I can add? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] LUCENE-4056: Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary [lucene]
azagniotov commented on PR #12517: URL: https://github.com/apache/lucene/pull/12517#issuecomment-1960013241 @uschindler yes: I have rebased from the latest `main` branch and ran the `./gradlew clean regenerate`. The following is the (partial) output: ``` ... ... > Task :lucene:analysis:nori:compileMecabKo Download https://bitbucket.org/eunjeon/mecab-ko-dic/downloads/mecab-ko-dic-2.1.1-20180720.tar.gz > Task :lucene:analysis:kuromoji:compileMecab Download https://jaist.dl.sourceforge.net/project/mecab/mecab-ipadic/2.7.0-20070801/mecab-ipadic-2.7.0-20070801.tar.gz Automaton regenerated from dictionary: mecab-ipadic-2.7.0-20070801 > Task :lucene:analysis:nori:compileMecabKo Automaton regenerated from dictionary: mecab-ko-dic-2.1.1-20180720 BUILD SUCCESSFUL in 39s 127 actionable tasks: 121 executed, 6 up-to-date ``` I also confirmed that the dictionary files `*.dat` were regenerated under the `lucene/analysis/kuromoji/src/resources/org/apache/lucene/analysis/ja` based on the modified date of the files. @mocobeta What do you think with regards to next steps with this PR? Is there anything I can add? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] LUCENE-4056: Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary [lucene]
uschindler commented on PR #12517: URL: https://github.com/apache/lucene/pull/12517#issuecomment-1951194265 One thing! Have you checked that a simple recompile of the original mecab (gradlew regenerate) produced same files (working copy clean afterwards)? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] LUCENE-4056: Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary [lucene]
uschindler commented on PR #12517: URL: https://github.com/apache/lucene/pull/12517#issuecomment-1951188984 I can't tell you anything about the internals touched here, so I can't review it from the language specific point of view. The Gradle changes look fine. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] LUCENE-4056: Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary [lucene]
azagniotov commented on PR #12517: URL: https://github.com/apache/lucene/pull/12517#issuecomment-1925374230 Hi @uschindler , I wanted to ping you and see if you have any thoughts on this -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] LUCENE-4056: Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary [lucene]
github-actions[bot] commented on PR #12517: URL: https://github.com/apache/lucene/pull/12517#issuecomment-1913774063 This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contribution! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] LUCENE-4056: Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary [lucene]
azagniotov commented on PR #12517: URL: https://github.com/apache/lucene/pull/12517#issuecomment-1890360185 > Please don't add the application plugin. Instead just add a plain java runner task. The result of the project is a library jar, so please don't change this as it could have effects on the resulting maven pom. Hi @uschindler, I ended up simply reverting the 8d52f66 commit. The current `gradle/generation/kuromoji.gradle` already contains `def recompileDictionary(...)` which is used by all the `task compile(..)`, thus, the task and the application plugin that I added were not necessary, after all. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] LUCENE-4056: Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary [lucene]
github-actions[bot] commented on PR #12517: URL: https://github.com/apache/lucene/pull/12517#issuecomment-1880902132 This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contribution! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] LUCENE-4056: Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary [lucene]
azagniotov commented on PR #12517: URL: https://github.com/apache/lucene/pull/12517#issuecomment-1862168100 @uschindler thank you for also taking a look, in addition to others. Understood. Please allow me a few days to export a commit that would add a plain java runner task instead of the application plugin. Thank you -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] LUCENE-4056: Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary [lucene]
uschindler commented on PR #12517: URL: https://github.com/apache/lucene/pull/12517#issuecomment-1861627255 Hi, Please give me some time to review. Please don't add the application plugin. Instead just add a plain java runner task. The result of the project is a library jar, so please don't change this as it could have effects on the resulting maven pom. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] LUCENE-4056: Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary [lucene]
mikemccand commented on PR #12517: URL: https://github.com/apache/lucene/pull/12517#issuecomment-1860575273 I am far from knowing much about Kuromoji and its dictionaries but this sounds like a great change (being able to load a new (UniDic) dictionary format, and the PR still cleanly applies. Are there any concerns with this? @rmuir had mentioned [some difficult `assert` on the Jira issue](https://github.com/apache/lucene/issues/5128#issuecomment-1223369141) long ago -- is that resolved / worked around? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org