Re: [PR] LUCENE-4056: Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary [lucene]

2024-04-11 Thread via GitHub


github-actions[bot] commented on PR #12517:
URL: https://github.com/apache/lucene/pull/12517#issuecomment-2050752715

   This PR has not had activity in the past 2 weeks, labeling it as stale. If 
the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you 
for your contribution!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] LUCENE-4056: Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary [lucene]

2024-03-28 Thread via GitHub


azagniotov commented on PR #12517:
URL: https://github.com/apache/lucene/pull/12517#issuecomment-2026358207

   @mocobeta thank you. I have not run any benchmarks, so I cannot comment 
on potential performance implications. One thing that is probably certain is 
that a larger dictionary will require more memory. By the way, have you had a 
chance to evaluate the correctness of the tokenization?
   
   @mikemccand this sounds interesting. It seems like this is a well-trodden 
path. How easy or hard would it be to point `runAnalyzerPerf.py` at the current 
PR branch?





Re: [PR] LUCENE-4056: Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary [lucene]

2024-03-28 Thread via GitHub


mikemccand commented on PR #12517:
URL: https://github.com/apache/lucene/pull/12517#issuecomment-2025238979

   > The built kuromoji jar with unidic-cwj-3.1.1-full eventually becomes 442M. 
Besides the size, I think we should consider performance. I'm worried that 
there can be a significant impact on analysis/indexing speed. Do you have any 
benchmark result on that?
   
   luceneutil has a `runAnalyzerPerf.py` benchmark, and the nightly script runs 
it and [posts the results 
here](https://home.apache.org/~mikemccand/lucenebench/analyzers.html), 
including Kuromoji.  We could maybe test tokens/sec throughput using that 
benchy for this PR running on UniDic?  I'm not sure how we would conclude 
whether it's fast enough since current Kuromoji (`main`) can't build UniDic.  
Maybe we could defer performance testing / optimizing on UniDic to a later 
issue...
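
As a rough illustration of the tokens/sec measurement discussed above, here is a minimal Python sketch. It is not `runAnalyzerPerf.py` and it does not call Lucene: `tokens_per_second` and `relative_throughput` are hypothetical helper names, and `tokenize` stands in for whatever analyzer is being measured. Since `main` cannot build UniDic, one option might be to report UniDic throughput relative to an IPADIC baseline measured on the same text.

```python
import time
from typing import Callable, Iterable, List


def tokens_per_second(tokenize: Callable[[str], List[str]],
                      docs: Iterable[str],
                      warmup_rounds: int = 1,
                      timed_rounds: int = 3) -> float:
    """Return the best tokens/sec seen over `timed_rounds` passes.

    `tokenize` is a stand-in (assumption) for the analyzer under test.
    """
    docs = list(docs)
    # Warm-up passes are not timed, so one-off setup costs are excluded.
    for _ in range(warmup_rounds):
        for doc in docs:
            tokenize(doc)
    best = 0.0
    for _ in range(timed_rounds):
        count = 0
        start = time.perf_counter()
        for doc in docs:
            count += len(tokenize(doc))
        elapsed = time.perf_counter() - start
        if elapsed > 0:
            best = max(best, count / elapsed)
    return best


def relative_throughput(baseline: float, candidate: float) -> float:
    """Candidate throughput as a fraction of the baseline throughput."""
    if baseline <= 0:
        raise ValueError("baseline throughput must be positive")
    return candidate / baseline
```

For example, `tokens_per_second(str.split, docs)` measures a whitespace tokenizer as a placeholder; a ratio well below 1.0 from `relative_throughput` would indicate that the candidate dictionary is slower than the baseline.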





Re: [PR] LUCENE-4056: Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary [lucene]

2024-03-24 Thread via GitHub


mocobeta commented on PR #12517:
URL: https://github.com/apache/lucene/pull/12517#issuecomment-2016769417

   Hi, sorry for my late reply.
   I quickly checked the built dictionary size. The latest UniDic is fairly (to 
me, insanely) large - its total size is 1.6G.
   https://clrd.ninjal.ac.jp/unidic/back_number.html#unidic_cwj
   
   The built kuromoji jar with unidic-cwj-3.1.1-full eventually becomes 442M. 
Besides the size, I think we should consider performance. I'm worried that 
there can be a significant impact on analysis/indexing speed. Do you have any 
benchmark result on that?





Re: [PR] LUCENE-4056: Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary [lucene]

2024-03-07 Thread via GitHub


github-actions[bot] commented on PR #12517:
URL: https://github.com/apache/lucene/pull/12517#issuecomment-1984826022

   This PR has not had activity in the past 2 weeks, labeling it as stale. If 
the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you 
for your contribution!





Re: [PR] LUCENE-4056: Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary [lucene]

2024-02-22 Thread via GitHub


azagniotov commented on PR #12517:
URL: https://github.com/apache/lucene/pull/12517#issuecomment-1960506371

   Thank you, @mocobeta  





Re: [PR] LUCENE-4056: Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary [lucene]

2024-02-22 Thread via GitHub


mocobeta commented on PR #12517:
URL: https://github.com/apache/lucene/pull/12517#issuecomment-1960500994

   @azagniotov Sorry, I've not been available for a while. Let me take a look; 
I will try to find time next week...





Re: [PR] LUCENE-4056: Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary [lucene]

2024-02-22 Thread via GitHub


azagniotov commented on PR #12517:
URL: https://github.com/apache/lucene/pull/12517#issuecomment-1960013536

   @uschindler yes: I have rebased onto the latest `main` branch and run 
`./gradlew clean regenerate`. The following is the (partial) output:
   ```
   ...
   ...
   > Task :lucene:analysis:nori:compileMecabKo
   Download https://bitbucket.org/eunjeon/mecab-ko-dic/downloads/mecab-ko-dic-2.1.1-20180720.tar.gz
   
   > Task :lucene:analysis:kuromoji:compileMecab
   Download https://jaist.dl.sourceforge.net/project/mecab/mecab-ipadic/2.7.0-20070801/mecab-ipadic-2.7.0-20070801.tar.gz
   Automaton regenerated from dictionary: mecab-ipadic-2.7.0-20070801
   
   > Task :lucene:analysis:nori:compileMecabKo
   Automaton regenerated from dictionary: mecab-ko-dic-2.1.1-20180720
   
   BUILD SUCCESSFUL in 39s
   127 actionable tasks: 121 executed, 6 up-to-date
   ```
   
   I also confirmed that the dictionary files `*.dat` were regenerated under 
`lucene/analysis/kuromoji/src/resources/org/apache/lucene/analysis/ja`, 
judging by the files' modification dates.
   
   @mocobeta What do you think with regards to next steps with this PR? Is 
there anything I can add?
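
The modification-date check described above could be scripted roughly as follows. This is a minimal sketch, not part of the PR: `dat_files_modified_since` is a hypothetical helper, and the caller is assumed to record a timestamp just before running the regeneration.

```python
from pathlib import Path
from typing import List


def dat_files_modified_since(directory: str, since_epoch: float) -> List[str]:
    """Return sorted names of *.dat files modified at or after `since_epoch`.

    `since_epoch` is seconds since the Unix epoch, e.g. a time.time() value
    captured before the regeneration was started (assumption).
    """
    return sorted(
        path.name
        for path in Path(directory).glob("*.dat")
        if path.stat().st_mtime >= since_epoch
    )
```

Modification times only show that the files were rewritten; they do not show whether the contents changed.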





Re: [PR] LUCENE-4056: Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary [lucene]

2024-02-18 Thread via GitHub


uschindler commented on PR #12517:
URL: https://github.com/apache/lucene/pull/12517#issuecomment-1951194265

   One thing: have you checked that a simple recompile of the original MeCab 
dictionary (`gradlew regenerate`) produces the same files (i.e., the working 
copy is clean afterwards)?
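
One way to make the "working copy clean afterwards" check concrete is sketched below. It assumes the regeneration has already been run and only inspects `git status --porcelain` output; `changed_paths` and `working_copy_changes` are hypothetical helper names, not part of the Lucene build.

```python
import subprocess
from typing import List


def changed_paths(porcelain_output: str) -> List[str]:
    """Extract paths from `git status --porcelain` (v1) output.

    Each line is two status characters, a space, then the path.
    """
    paths = []
    for line in porcelain_output.splitlines():
        if len(line) > 3:
            paths.append(line[3:])
    return paths


def working_copy_changes(repo_dir: str = ".") -> List[str]:
    """Return the changed or untracked paths in the Git working copy."""
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_dir, check=True, capture_output=True, text=True,
    )
    return changed_paths(result.stdout)
```

An empty list from `working_copy_changes()` after regenerating would mean the generated files match what is committed.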





Re: [PR] LUCENE-4056: Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary [lucene]

2024-02-18 Thread via GitHub


uschindler commented on PR #12517:
URL: https://github.com/apache/lucene/pull/12517#issuecomment-1951188984

   I can't tell you anything about the internals touched here, so I can't 
review it from the language-specific point of view.
   
   The Gradle changes look fine.





Re: [PR] LUCENE-4056: Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary [lucene]

2024-02-03 Thread via GitHub


azagniotov commented on PR #12517:
URL: https://github.com/apache/lucene/pull/12517#issuecomment-1925374230

   Hi @uschindler, I wanted to ping you and see if you have any thoughts on 
this.





Re: [PR] LUCENE-4056: Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary [lucene]

2024-01-28 Thread via GitHub


github-actions[bot] commented on PR #12517:
URL: https://github.com/apache/lucene/pull/12517#issuecomment-1913774063

   This PR has not had activity in the past 2 weeks, labeling it as stale. If 
the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you 
for your contribution!





Re: [PR] LUCENE-4056: Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary [lucene]

2024-01-12 Thread via GitHub


azagniotov commented on PR #12517:
URL: https://github.com/apache/lucene/pull/12517#issuecomment-1890360185

   > Please don't add the application plugin. Instead just add a plain java 
runner task. The result of the project is a library jar, so please don't change 
this as it could have effects on the resulting maven pom.
   
   Hi @uschindler, I ended up simply reverting the 8d52f66 commit. The current 
`gradle/generation/kuromoji.gradle` already contains `def 
recompileDictionary(...)`, which all of the `task compile(..)` definitions use, 
so the task and the application plugin that I added were not necessary after 
all.
   





Re: [PR] LUCENE-4056: Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary [lucene]

2024-01-08 Thread via GitHub


github-actions[bot] commented on PR #12517:
URL: https://github.com/apache/lucene/pull/12517#issuecomment-1880902132

   This PR has not had activity in the past 2 weeks, labeling it as stale. If 
the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you 
for your contribution!





Re: [PR] LUCENE-4056: Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary [lucene]

2023-12-18 Thread via GitHub


azagniotov commented on PR #12517:
URL: https://github.com/apache/lucene/pull/12517#issuecomment-1862168100

   @uschindler thank you for also taking a look, in addition to others. 
Understood. Please allow me a few days to export a commit that would add a 
plain java runner task instead of the application plugin. Thank you





Re: [PR] LUCENE-4056: Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary [lucene]

2023-12-18 Thread via GitHub


uschindler commented on PR #12517:
URL: https://github.com/apache/lucene/pull/12517#issuecomment-1861627255

   Hi,
   Please give me some time to review.
   
   Please don't add the application plugin. Instead just add a plain java 
runner task. The result of the project is a library jar, so please don't change 
this as it could have effects on the resulting maven pom.





Re: [PR] LUCENE-4056: Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary [lucene]

2023-12-18 Thread via GitHub


mikemccand commented on PR #12517:
URL: https://github.com/apache/lucene/pull/12517#issuecomment-1860575273

   I am far from knowing much about Kuromoji and its dictionaries, but this 
sounds like a great change (being able to load a new (UniDic) dictionary 
format), and the PR still applies cleanly.  Are there any concerns with this?  
@rmuir had mentioned [some difficult `assert` on the Jira 
issue](https://github.com/apache/lucene/issues/5128#issuecomment-1223369141) 
long ago -- is that resolved / worked around?

