[jira] Commented: (LUCENE-966) A faster JFlex-based replacement for StandardAnalyzer
[ https://issues.apache.org/jira/browse/LUCENE-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516893 ] Stanislaw Osinski commented on LUCENE-966: --

When digging deeper into the issues of compatibility with the original StandardAnalyzer, I stumbled upon something strange. Take the following text:

78academyawards/rules/rule02.html,7194,7227,type

which was tokenized by the original StandardAnalyzer as one <NUM>. If you look at the definition of the <NUM> token:

// every other segment must have at least one digit
<NUM: (<ALPHANUM> <P> <HAS_DIGIT>
     | <HAS_DIGIT> <P> <ALPHANUM>
     | <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
     | <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
     | <ALPHANUM> <P> <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
     | <HAS_DIGIT> <P> <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
     )>

you'll see that, as explained in the comment, every other segment must have at least one digit. But actually, according to my understanding, this rule should not match the above text as a whole (and with JFlex it doesn't, actually). Below is the text split by punctuation characters, and it looks like there is no way of splitting this text into alternating segments, every second of which must have a digit (A = ALPHANUM, H = HAS_DIGIT):

78academyawards / rules / rule02 . html , 7194 , 7227 , type

H P A P H P A P H P A P H?*   (starting from the beginning)
H?* P A P H P A               (starting from the end)

* (would have to be H, but no digits in substring "type" or "html")

I have no idea why JavaCC matched the whole text as a <NUM>; JFlex behaved "more correctly" here. Now I can see two solutions:

* try to patch the JFlex grammar to emulate JavaCC quirks (though I may not be aware of most of them...)

* relax the <NUM> rule a little bit (JFlex notation):

// there must be at least one segment with a digit
NUM = ({P} ({HAS_DIGIT} | {ALPHANUM}))* {HAS_DIGIT} ({P} ({HAS_DIGIT} | {ALPHANUM}))*

With this definition, again, all StandardAnalyzer tests pass, plus all texts along the lines of:

2006-03-11t082958z_01_ban130523_rtridst_0_ozabs,2076,2123,type
78academyawards/rules/rule02.html,7194,7227,type
978-0-94045043-1,86408,86424,type
62.46,37004,37009,type (this one was parsed as a different token type by the original analyzer)

get parsed as a whole as one <NUM>, which is equivalent to what the JavaCC-based version would do. I will attach a corresponding patch in a second.

> A faster JFlex-based replacement for StandardAnalyzer > - > > Key: LUCENE-966 > URL: https://issues.apache.org/jira/browse/LUCENE-966 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis >Reporter: Stanislaw Osinski > Fix For: 2.3 > > Attachments: AnalyzerBenchmark.java, jflex-analyzer-patch.txt, > jflex-analyzer-r560135-patch.txt, jflex-analyzer-r561292-patch.txt > > > JFlex (http://www.jflex.de/) can be used to generate a faster (up to several > times) replacement for StandardAnalyzer. Will add a patch and a simple > benchmark code in a while. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
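As a quick sanity check of the relaxed rule above, a rough ASCII-only translation into a Java regex (the character classes below are simplifications of the real ALPHANUM / HAS_DIGIT / P macros, which also cover non-ASCII letters) accepts the sample strings as single tokens while rejecting "mid-20th":

import java.util.regex.Pattern;

public class RelaxedNumCheck {
    // simplified, lower-case ASCII approximations of the grammar macros
    static final String ALPHANUM  = "[a-z0-9]+";
    static final String HAS_DIGIT = "[a-z0-9]*[0-9][a-z0-9]*";
    static final String P         = "[_\\-/.,]";
    // relaxed NUM: at least one segment with a digit, segments separated by P
    static final Pattern NUM = Pattern.compile(
        "(?:" + P + "(?:" + HAS_DIGIT + "|" + ALPHANUM + "))*"
        + HAS_DIGIT
        + "(?:" + P + "(?:" + HAS_DIGIT + "|" + ALPHANUM + "))*");

    public static void main(String[] args) {
        String[] samples = {
            "78academyawards/rules/rule02.html,7194,7227,type", // true: one digit segment is enough
            "978-0-94045043-1,86408,86424,type",                // true
            "62.46,37004,37009,type",                           // true
            "mid-20th"                                          // false: leading "mid" has no digit
        };
        for (String s : samples) {
            System.out.println(s + " -> " + NUM.matcher(s).matches());
        }
    }
}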
various IndexReader methods -- was: Re: [jira] Updated: (LUCENE-832) NPE when calling isCurrent() on a ParallellReader
is it just me, or does it seem like the base class versions of getVersion(), isOptimized(), and isCurrent() in IndexReader should all throw UnsupportedOperationException? (it seems like ideally they should be abstract, but that ship/API has sailed)

: This patch fixes ParallelReader similar to LUCENE-781:
:
: * ParallelReader.getVersion() now throws an
:   UnsupportedOperationException.
:
: * ParallelReader.isOptimized() now checks if all underlying
:   indexes are optimized and returns true in such a case.
:
: * ParallelReader.isCurrent() now checks if all underlying
:   IndexReaders are up to date and returns true in such a case.

-Hoss

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
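For illustration, the isCurrent() behavior described in the quoted patch notes amounts to something like the sketch below (the "readers" field name is an assumption, not necessarily what the patch actually uses; isOptimized() would follow the same pattern, calling isOptimized() on each sub-reader):

  public boolean isCurrent() throws IOException {
    // current only if every underlying index is up to date
    for (int i = 0; i < readers.size(); i++) {
      IndexReader sub = (IndexReader) readers.get(i);
      if (!sub.isCurrent()) {
        return false;   // one stale sub-reader makes the whole ParallelReader stale
      }
    }
    return true;
  }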
[jira] Updated: (LUCENE-966) A faster JFlex-based replacement for StandardAnalyzer
[ https://issues.apache.org/jira/browse/LUCENE-966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stanislaw Osinski updated LUCENE-966: - Attachment: jflex-analyzer-r561693-compatibility.txt

A patch for better compatibility with the StandardAnalyzer, containing:

* relaxed definition of the <NUM> token
* new test cases in TestStandardAnalyzer

I noticed that with this patch org.apache.lucene.benchmark.quality.TestQualityRun.testTrecQuality fails, but I'm not sure if this is related to the tokenizer.

> A faster JFlex-based replacement for StandardAnalyzer > - > > Key: LUCENE-966 > URL: https://issues.apache.org/jira/browse/LUCENE-966 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis >Reporter: Stanislaw Osinski > Fix For: 2.3 > > Attachments: AnalyzerBenchmark.java, jflex-analyzer-patch.txt, > jflex-analyzer-r560135-patch.txt, jflex-analyzer-r561292-patch.txt, > jflex-analyzer-r561693-compatibility.txt > > > JFlex (http://www.jflex.de/) can be used to generate a faster (up to several > times) replacement for StandardAnalyzer. Will add a patch and a simple > benchmark code in a while. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[EMAIL PROTECTED]: Project lucene-java (in module lucene-java) failed
To whom it may engage... This is an automated request, but not an unsolicited one. For more information please visit http://gump.apache.org/nagged.html, and/or contact the folk at [EMAIL PROTECTED] Project lucene-java has an issue affecting its community integration. This issue affects 3 projects, and has been outstanding for 30 runs. The current state of this project is 'Failed', with reason 'Build Failed'. For reference only, the following projects are affected by this: - eyebrowse : Web-based mail archive browsing - jakarta-lucene : Java Based Search Engine - lucene-java : Java Based Search Engine Full details are available at: http://vmgump.apache.org/gump/public/lucene-java/lucene-java/index.html That said, some information snippets are provided here. The following annotations (debug/informational/warning/error messages) were provided: -DEBUG- Sole output [lucene-core-01082007.jar] identifier set to project name -DEBUG- Dependency on javacc exists, no need to add for property javacc.home. -INFO- Failed with reason build failed -INFO- Failed to extract fallback artifacts from Gump Repository The following work was performed: http://vmgump.apache.org/gump/public/lucene-java/lucene-java/gump_work/build_lucene-java_lucene-java.html Work Name: build_lucene-java_lucene-java (Type: Build) Work ended in a state of : Failed Elapsed: 34 secs Command Line: /usr/lib/jvm/java-1.5.0-sun/bin/java -Djava.awt.headless=true -Xbootclasspath/p:/srv/gump/public/workspace/xml-commons/java/external/build/xml-apis.jar:/srv/gump/public/workspace/xml-xerces2/build/xercesImpl.jar org.apache.tools.ant.Main -Dgump.merge=/srv/gump/public/gump/work/merge.xml -Dbuild.sysclasspath=only -Dversion=01082007 -Djavacc.home=/srv/gump/packages/javacc-3.1 package [Working Directory: /srv/gump/public/workspace/lucene-java] CLASSPATH: 
/usr/lib/jvm/java-1.5.0-sun/lib/tools.jar:/srv/gump/public/workspace/lucene-java/build/classes/java:/srv/gump/public/workspace/lucene-java/build/classes/demo:/srv/gump/public/workspace/lucene-java/build/classes/test:/srv/gump/public/workspace/lucene-java/contrib/db/bdb/lib/db-4.3.29.jar:/srv/gump/public/workspace/lucene-java/contrib/gdata-server/lib/gdata-client-1.0.jar:/srv/gump/public/workspace/lucene-java/build/contrib/analyzers/classes/java:/srv/gump/public/workspace/lucene-java/build/contrib/ant/classes/java:/srv/gump/public/workspace/lucene-java/build/contrib/benchmark/classes/java:/srv/gump/public/workspace/lucene-java/build/contrib/db/bdb/classes/java:/srv/gump/public/workspace/lucene-java/build/contrib/db/bdb-je/classes/java:/srv/gump/public/workspace/lucene-java/build/contrib/gdata-server/classes/java:/srv/gump/public/workspace/lucene-java/build/contrib/highlighter/classes/java:/srv/gump/public/workspace/lucene-java/build/contrib/javascript/classes/java:/srv/gump/public/workspace/lucene-java/build/contrib/lucli/classes/java:/srv/gump/public/workspace/lucene-java/build/contrib/memory/classes/java:/srv/gump/public/workspace/lucene-java/build/contrib/queries/classes/java:/srv/gump/public/workspace/lucene-java/build/contrib/regex/classes/java:/srv/gump/public/workspace/lucene-java/build/contrib/similarity/classes/java:/srv/gump/public/workspace/lucene-java/build/contrib/snowball/classes/java:/srv/gump/public/workspace/lucene-java/build/contrib/spellchecker/classes/java:/srv/gump/public/workspace/lucene-java/build/contrib/surround/classes/java:/srv/gump/public/workspace/lucene-java/build/contrib/swing/classes/java:/srv/gump/public/workspace/lucene-java/build/contrib/wordnet/classes/java:/srv/gump/public/workspace/lucene-java/build/contrib/xml-query-parser/classes/java:/srv/gump/public/workspace/ant/dist/lib/ant-jmf.jar:/srv/gump/public/workspace/ant/dist/lib/ant-swing.jar:/srv/gump/public/workspace/ant/dist/lib/ant-apache-resolver.jar:/srv/gump/public/workspace/ant/dist/lib/ant-trax.jar:/srv/gump/public/workspace/ant/dist/lib/ant-junit.jar:/srv/gump/public/workspace/ant/dist/lib/ant-launcher.jar:/srv/gump/public/workspace/ant/dist/lib/ant-nodeps.jar:/srv/gump/public/workspace/ant/dist/lib/ant.jar:/srv/gump/packages/junit3.8.1/junit.jar:/srv/gump/public/workspace/xml-commons/java/build/resolver.jar:/srv/gump/packages/je-1.7.1/lib/je.jar:/srv/gump/public/workspace/apache-commons/digester/dist/commons-digester.jar:/srv/gump/public/workspace/jakarta-regexp/build/jakarta-regexp-01082007.jar:/srv/gump/packages/javacc-3.1/bin/lib/javacc.jar:/srv/gump/public/workspace/jline/target/jline-0.9.92-SNAPSHOT.jar:/srv/gump/packages/jtidy-04aug2000r7-dev/build/Tidy.jar:/srv/gump/public/workspace/junit/dist/junit-01082007.jar:/srv/gump/public/workspace/xml-commons/java/external/build/xml-apis-ext.jar:/srv/gump/public/workspace/apache-commons/logging/target/commons-logging-01082007.jar:/srv/gump/public/workspace/apache-commons/logging/target/commons-logging-api-01082007.jar:/srv/gump/public/workspace/jakarta-servletapi-5/jsr154/dist/lib/servlet-api.jar:/srv/gump/packages/nekoh
[jira] Commented: (LUCENE-967) Add "tokenize documents only" task to contrib/benchmark
[ https://issues.apache.org/jira/browse/LUCENE-967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516945 ] Michael McCandless commented on LUCENE-967: --- > Also, I think the addition of printing of elapsed time is redundant, > because you get it anyhow as the elapsed time reported for the > outermost task sequence. (?) Duh, right :) I will remove that. > 1) in ReadTokensTask change doLogic() to return the number of tokens > processed in that specific call to doLogic() (differs from tokensCount > which aggregates all calls). Ahh good idea! > 2) in TestPerfTaskLogic the comment in testReadTokens seems > copy/pasted from testLineDocFile and should be changed. Woops, will fix. > - Also (I am not sure if it is worth your time, but) to really test it, > you > could open a reader against the created index and verify the number > of docs, and also the index sum-of-DF comparing to the total tokens > counts numbers in ReadTokensTask. OK I added this too. Will submit new patch shortly. > Add "tokenize documents only" task to contrib/benchmark > --- > > Key: LUCENE-967 > URL: https://issues.apache.org/jira/browse/LUCENE-967 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/benchmark >Affects Versions: 2.3 >Reporter: Michael McCandless >Assignee: Michael McCandless >Priority: Minor > Fix For: 2.3 > > Attachments: LUCENE-967.patch, LUCENE-967.take2.patch > > > I've been looking at performance improvements to tokenization by > re-using Tokens, and to help benchmark my changes I've added a new > task called ReadTokens that just steps through all fields in a > document, gets a TokenStream, and reads all the tokens out of it. > EG this alg just reads all Tokens for all docs in Reuters collection: > doc.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersDocMaker > doc.maker.forever=false > {ReadTokens > : * -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
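As a rough illustration of the verification step discussed above, the test could use a sketch along these lines ("dir" and "expectedDocCount" are placeholders, not names from the actual patch):

    // Sketch only: open a reader on the index the test built, check the doc
    // count, and compute the index-wide sum of document frequencies that the
    // comment above suggests comparing against ReadTokensTask's counters.
    IndexReader reader = IndexReader.open(dir);
    try {
      assertEquals(expectedDocCount, reader.numDocs());
      long sumDF = 0;
      TermEnum terms = reader.terms();
      while (terms.next()) {
        sumDF += terms.docFreq();   // number of docs containing the current term
      }
      terms.close();
      // compare sumDF with the token counts reported by ReadTokensTask here
    } finally {
      reader.close();
    }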
[jira] Updated: (LUCENE-967) Add "tokenize documents only" task to contrib/benchmark
[ https://issues.apache.org/jira/browse/LUCENE-967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-967: -- Attachment: LUCENE-967.take3.patch > Add "tokenize documents only" task to contrib/benchmark > --- > > Key: LUCENE-967 > URL: https://issues.apache.org/jira/browse/LUCENE-967 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/benchmark >Affects Versions: 2.3 >Reporter: Michael McCandless >Assignee: Michael McCandless >Priority: Minor > Fix For: 2.3 > > Attachments: LUCENE-967.patch, LUCENE-967.take2.patch, > LUCENE-967.take3.patch > > > I've been looking at performance improvements to tokenization by > re-using Tokens, and to help benchmark my changes I've added a new > task called ReadTokens that just steps through all fields in a > document, gets a TokenStream, and reads all the tokens out of it. > EG this alg just reads all Tokens for all docs in Reuters collection: > doc.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersDocMaker > doc.maker.forever=false > {ReadTokens > : * -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-971) Create enwiki indexable data as line-per-article rather than file-per-article
[ https://issues.apache.org/jira/browse/LUCENE-971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516996 ] Michael McCandless commented on LUCENE-971: --- This looks great! One alternate approach here would be to create a WikipediaDocMaker (implementing DocMaker interface) that pulls directly from the XML file and feeds documents into the alg. Then, to make a line file, one could create an alg that pulls docs from WikipediaDocMaker and uses WriteLineDoc task to create the line-by-line file. One benefit of this approach is creating docs of a certain size (10 tokens, 100 tokens, etc) would become a one-step process (single alg) instead of what I think is a 2-step process now (make first line file, then reprocess into second line file). Another benefit would be you could make wikipedia tasks that pull directly from the XML file and not even use a line file as an intermediary. Steve do you think this would be a hard change? I think it should be easy, except, I'm not sure how to do this w/ SAX since SAX is "in control". You sort of need coroutines. Or maybe one thread is running SAX and putting doc data into a shared queue, and then the other thread (the normal "main" thread that benchmark runs) would pull from this queue? > Create enwiki indexable data as line-per-article rather than file-per-article > - > > Key: LUCENE-971 > URL: https://issues.apache.org/jira/browse/LUCENE-971 > Project: Lucene - Java > Issue Type: Improvement >Reporter: Steven Parkes > Attachments: LUCENE-971.patch.txt > > > Create a line per article rather than a file. Consume with indexLineFile task. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
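One way the shared-queue idea could look, as a rough sketch (class, method, and field names here are purely illustrative, not the eventual WikipediaDocMaker code, and java.util.concurrent is assumed for brevity):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class WikipediaQueueSketch {
  // bounded queue so the SAX thread blocks instead of parsing far ahead
  private final BlockingQueue<String[]> queue = new ArrayBlockingQueue<String[]>(100);

  // Producer side: called from the SAX handler's thread whenever a complete
  // article (title + body) has been parsed out of the XML dump.
  void onArticleParsed(String title, String body) throws InterruptedException {
    queue.put(new String[] { title, body }); // blocks while the queue is full
  }

  // Consumer side: the benchmark's "main" thread pulls the next article's
  // fields whenever the doc maker is asked for a new document.
  String[] nextDocData() throws InterruptedException {
    return queue.take();                     // blocks until an article is ready
  }
}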
[jira] Commented: (LUCENE-971) Create enwiki indexable data as line-per-article rather than file-per-article
[ https://issues.apache.org/jira/browse/LUCENE-971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516997 ] Steven Parkes commented on LUCENE-971: -- I can look at what it would take to avoid the line file ... but ... what about the overhead of the XML parser? I don't tend to think of XML parsers as "light". Would bundling that into the test be a concern? I guess it's not an issue if you're just using this to create an index and then are going to do your performance measurements on the queries of the index. But for measuring index performance, I would probably be cautious of bundling in the XML processing (until proven insignificant). > Create enwiki indexable data as line-per-article rather than file-per-article > - > > Key: LUCENE-971 > URL: https://issues.apache.org/jira/browse/LUCENE-971 > Project: Lucene - Java > Issue Type: Improvement >Reporter: Steven Parkes > Attachments: LUCENE-971.patch.txt > > > Create a line per article rather than a file. Consume with indexLineFile task. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-966) A faster JFlex-based replacement for StandardAnalyzer
[ https://issues.apache.org/jira/browse/LUCENE-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517003 ] Michael McCandless commented on LUCENE-966: ---

Oddly, the patch for TestStandardAnalyzer failed to apply for me (but the rest did), so I manually merged those changes in. Oh, I see: it was the "Korean words" test -- somehow the characters got mapped to ?'s in your patch. This is why the patch didn't apply, I think? Maybe you used a diffing tool that wasn't happy with unicode or something?

I also see the quality test failing in contrib benchmark. I fear something about the new StandardAnalyzer is in fact causing this test to fail (it passes on a clean checkout). That test uses StandardAnalyzer.

OK, I re-tested the old vs new StandardAnalyzer on Wikipedia and I still found some differences, I think only on these very large URL-like tokens. Here's one:

OLD
(money.cnn.com,1382,1395,type=) (magazines,1396,1405,type=) (fortune,1406,1413,type=) (fortune,1414,1421,type=) (archive/2007/03/19/8402357,1422,1448,type=) (index.htm,1449,1458,type=)

NEW
(/money.cnn.com/magazines/fortune/fortune_archive/2007/03/19/8402357/index.htm,1381,1458,type=)

I like the NEW behavior better but I fear we should try to match the old one?

Here's another one:

OLD
(mid-20th,2436,2444,type=)

NEW
(mid,2436,2439,type=) (-20th,2439,2444,type=)

I like the old behavior better here.

Another one:

OLD
(safari-0-sheikh,12011,12026,type=) (zayed,12027,12032,type=) (grand,12033,12038,type=) (mosque.jpg,12039,12049,type=)

NEW
(safari,12011,12017,type=) (0-sheikh-zayed-grand-mosque.jpg,12018,12049,type=)

Another one:

OLD
(semitica-01.png,616,631,type=)

NEW
(-semitica-01.png,615,631,type=)

> A faster JFlex-based replacement for StandardAnalyzer > - > > Key: LUCENE-966 > URL: https://issues.apache.org/jira/browse/LUCENE-966 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis >Reporter: Stanislaw Osinski > Fix For: 2.3 > > Attachments: AnalyzerBenchmark.java, jflex-analyzer-patch.txt, > jflex-analyzer-r560135-patch.txt, jflex-analyzer-r561292-patch.txt, > jflex-analyzer-r561693-compatibility.txt > > > JFlex (http://www.jflex.de/) can be used to generate a faster (up to several > times) replacement for StandardAnalyzer. Will add a patch and a simple > benchmark code in a while. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-971) Create enwiki indexable data as line-per-article rather than file-per-article
[ https://issues.apache.org/jira/browse/LUCENE-971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517007 ] Michael McCandless commented on LUCENE-971: --- > I can look at what it would take to avoid the line file ... but > ... what about the overhead of the XML parser? I don't tend to think > of XML parsers as "light". Would bundling that into the test be a > concern? Right I too would not consider XML parsing overhead "light". So tests that are sensitive to the XML parsing cost should first create a line file. But, this is the case regardless of which approach we use (ie, both approaches allow you use a line file -- the WriteLineDocTask writes a line file from any DocMaker). It's just that the new approach would buy us more flexibility for those people who don't need (or want) to use the line file as an intermediary. > Create enwiki indexable data as line-per-article rather than file-per-article > - > > Key: LUCENE-971 > URL: https://issues.apache.org/jira/browse/LUCENE-971 > Project: Lucene - Java > Issue Type: Improvement >Reporter: Steven Parkes > Attachments: LUCENE-971.patch.txt > > > Create a line per article rather than a file. Consume with indexLineFile task. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-967) Add "tokenize documents only" task to contrib/benchmark
[ https://issues.apache.org/jira/browse/LUCENE-967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517031 ] Doron Cohen commented on LUCENE-967: Thanks for fixing this Michael, looks perfect to me now. > Add "tokenize documents only" task to contrib/benchmark > --- > > Key: LUCENE-967 > URL: https://issues.apache.org/jira/browse/LUCENE-967 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/benchmark >Affects Versions: 2.3 >Reporter: Michael McCandless >Assignee: Michael McCandless >Priority: Minor > Fix For: 2.3 > > Attachments: LUCENE-967.patch, LUCENE-967.take2.patch, > LUCENE-967.take3.patch > > > I've been looking at performance improvements to tokenization by > re-using Tokens, and to help benchmark my changes I've added a new > task called ReadTokens that just steps through all fields in a > document, gets a TokenStream, and reads all the tokens out of it. > EG this alg just reads all Tokens for all docs in Reuters collection: > doc.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersDocMaker > doc.maker.forever=false > {ReadTokens > : * -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-967) Add "tokenize documents only" task to contrib/benchmark
[ https://issues.apache.org/jira/browse/LUCENE-967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517035 ] Michael McCandless commented on LUCENE-967: --- Thank you for reviewing! I will commit shortly. > Add "tokenize documents only" task to contrib/benchmark > --- > > Key: LUCENE-967 > URL: https://issues.apache.org/jira/browse/LUCENE-967 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/benchmark >Affects Versions: 2.3 >Reporter: Michael McCandless >Assignee: Michael McCandless >Priority: Minor > Fix For: 2.3 > > Attachments: LUCENE-967.patch, LUCENE-967.take2.patch, > LUCENE-967.take3.patch > > > I've been looking at performance improvements to tokenization by > re-using Tokens, and to help benchmark my changes I've added a new > task called ReadTokens that just steps through all fields in a > document, gets a TokenStream, and reads all the tokens out of it. > EG this alg just reads all Tokens for all docs in Reuters collection: > doc.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersDocMaker > doc.maker.forever=false > {ReadTokens > : * -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-971) Create enwiki indexable data as line-per-article rather than file-per-article
[ https://issues.apache.org/jira/browse/LUCENE-971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517048 ] Doron Cohen commented on LUCENE-971: Mmm... an additional advantage of this is not needing to extract the entire enwiki collection in order to index it - setting the repetition count to 100 for AddDocTask in alternative 1 or for WriteLineDocTask in alternative 2 would mean that only 100 docs from the huge file are extracted. > Create enwiki indexable data as line-per-article rather than file-per-article > - > > Key: LUCENE-971 > URL: https://issues.apache.org/jira/browse/LUCENE-971 > Project: Lucene - Java > Issue Type: Improvement >Reporter: Steven Parkes > Attachments: LUCENE-971.patch.txt > > > Create a line per article rather than a file. Consume with indexLineFile task. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
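For instance, a hypothetical alg along these lines (the doc-maker class name is only a guess based on this discussion, and the repetition syntax mirrors the "{ReadTokens > : *" style used in LUCENE-967) would pull just the first 100 articles out of the dump:

  doc.maker=org.apache.lucene.benchmark.byTask.feeds.WikipediaDocMaker
  doc.maker.forever=false
  {WriteLineDoc > : 100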
[jira] Commented: (LUCENE-971) Create enwiki indexable data as line-per-article rather than file-per-article
[ https://issues.apache.org/jira/browse/LUCENE-971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517047 ] Doron Cohen commented on LUCENE-971: > But, this is the case regardless of which approach we use (ie, both > approaches allow you use a line file -- the WriteLineDocTask writes a > line file from any DocMaker). It's just that the new approach would > buy us more flexibility for those people who don't need (or want) to > use the line file as an intermediary. So there would now be two alternative ways to index wiki data: (1) using the proposed WikiDocMaker directly to feed AddDoc task. (2) using line file after first running WriteLineDocTask when the doc maker was WikiDocMaker. I like this approach. This means that WikiDocMaker would read the data straight from temp/enwiki-20070527-pages-articles.xml. So the extract-enwiki target in build.xml would no longer be needed, right? > Create enwiki indexable data as line-per-article rather than file-per-article > - > > Key: LUCENE-971 > URL: https://issues.apache.org/jira/browse/LUCENE-971 > Project: Lucene - Java > Issue Type: Improvement >Reporter: Steven Parkes > Attachments: LUCENE-971.patch.txt > > > Create a line per article rather than a file. Consume with indexLineFile task. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Resolved: (LUCENE-967) Add "tokenize documents only" task to contrib/benchmark
[ https://issues.apache.org/jira/browse/LUCENE-967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved LUCENE-967. --- Resolution: Fixed Lucene Fields: [New, Patch Available] (was: [Patch Available, New]) > Add "tokenize documents only" task to contrib/benchmark > --- > > Key: LUCENE-967 > URL: https://issues.apache.org/jira/browse/LUCENE-967 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/benchmark >Affects Versions: 2.3 >Reporter: Michael McCandless >Assignee: Michael McCandless >Priority: Minor > Fix For: 2.3 > > Attachments: LUCENE-967.patch, LUCENE-967.take2.patch, > LUCENE-967.take3.patch > > > I've been looking at performance improvements to tokenization by > re-using Tokens, and to help benchmark my changes I've added a new > task called ReadTokens that just steps through all fields in a > document, gets a TokenStream, and reads all the tokens out of it. > EG this alg just reads all Tokens for all docs in Reuters collection: > doc.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersDocMaker > doc.maker.forever=false > {ReadTokens > : * -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-743) IndexReader.reopen()
[ https://issues.apache.org/jira/browse/LUCENE-743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Busch updated LUCENE-743: - Attachment: lucene-743.patch

Now that LUCENE-781, LUCENE-970 and LUCENE-832 are committed, I updated the latest patch here, which was now easier because MultiReader is now separated into two classes. Notes:

* As Hoss suggested, I added the reopen() method to IndexReader as a non-static method.

* MultiReader and ParallelReader now override reopen() to reopen the subreaders recursively.

* FilteredReader also overrides reopen(). It checks if the underlying reader has changed, and in that case returns a new instance of FilteredReader.

I think the general contract of reopen() should be to always return a new IndexReader instance if it was successfully refreshed and return the same instance otherwise, because IndexReaders are used as keys in caches. A remaining question here is if the old reader(s) should be closed then or not. This patch closes the old readers for now; if we want to change that we probably have to add some reference counting mechanism, as Robert suggested already. Then I would also have to change the SegmentReader.reopen() implementation to clone resources like the dictionary, norms and delete bits.

I think closing the old reader is fine. What do others think? Is keeping the old reader after a reopen() a useful use case?

> IndexReader.reopen() > > > Key: LUCENE-743 > URL: https://issues.apache.org/jira/browse/LUCENE-743 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Otis Gospodnetic >Assignee: Michael Busch >Priority: Minor > Fix For: 2.3 > > Attachments: IndexReaderUtils.java, lucene-743.patch, > lucene-743.patch, lucene-743.patch, MyMultiReader.java, MySegmentReader.java > > > This is Robert Engels' implementation of IndexReader.reopen() functionality, > as a set of 3 new classes (this was easier for him to implement, but should > probably be folded into the core, if this looks good). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
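To illustrate the contract described above, client code would look roughly like this sketch (with the old-reader handling following the patch's current behavior of closing the old instance inside reopen()):

    IndexReader reader = IndexReader.open(directory);
    // ... the index is later modified by an IndexWriter elsewhere ...
    IndexReader newReader = reader.reopen();
    if (newReader != reader) {
      // the index changed: newReader reflects the new state, and under this
      // patch the old instance has already been closed by reopen(), so the
      // caller simply switches over
      reader = newReader;
    } else {
      // nothing changed: the same instance came back, so caches keyed on
      // this IndexReader remain valid
    }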
[jira] Commented: (LUCENE-743) IndexReader.reopen()
[ https://issues.apache.org/jira/browse/LUCENE-743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517087 ] Michael Busch commented on LUCENE-743: --

I ran some quick performance tests with this patch:

1) The test opens an IndexReader, deletes one document by random docid, closes the Reader. So this reader doesn't have to open the dictionary or the norms.

2) Another reader is opened (or alternatively reopened) and one TermQuery is executed, so this reader has to read the norms and the dictionary.

I run these two steps 5000 times in a loop.

First run: Index size: 4.5M, optimized
* 1) + TermQuery: 103 sec
* 1) + 2) (open): 806 sec, so open() takes 703 sec
* 1) + 2) (reopen): 118 sec, so reopen() takes 15 sec
==> Speedup: 46.9 X

Second run: Index size: 3.3M, 24 segments (14x 230.000, 10x 10.000)
* 1) + TermQuery: 235 sec
* 1) + 2) (open): 1162 sec, so open() takes 927 sec
* 1) + 2) (reopen): 321 sec, so reopen() takes 86 sec
==> Speedup: 10.8X

> IndexReader.reopen() > > > Key: LUCENE-743 > URL: https://issues.apache.org/jira/browse/LUCENE-743 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Otis Gospodnetic >Assignee: Michael Busch >Priority: Minor > Fix For: 2.3 > > Attachments: IndexReaderUtils.java, lucene-743.patch, > lucene-743.patch, lucene-743.patch, MyMultiReader.java, MySegmentReader.java > > > This is Robert Engels' implementation of IndexReader.reopen() functionality, > as a set of 3 new classes (this was easier for him to implement, but should > probably be folded into the core, if this looks good). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
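Not the actual test code, but the procedure described above could be sketched like this ("dir", the field name "body", and the query term are placeholders):

import java.io.IOException;
import java.util.Random;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.Directory;

public class ReopenBenchSketch {
  static void run(Directory dir, boolean useReopen) throws IOException {
    Random random = new Random(42);
    IndexReader searchReader = IndexReader.open(dir);
    for (int i = 0; i < 5000; i++) {
      // 1) short-lived reader deletes one document by a random docid; it never
      //    has to load the term dictionary or the norms
      IndexReader deleter = IndexReader.open(dir);
      deleter.deleteDocument(random.nextInt(deleter.maxDoc()));
      deleter.close();

      // 2) open (or reopen) the search reader and run a single TermQuery,
      //    which forces the norms and the term dictionary to be (re)loaded
      if (useReopen) {
        searchReader = searchReader.reopen();
      } else {
        searchReader.close();
        searchReader = IndexReader.open(dir);
      }
      new IndexSearcher(searchReader).search(new TermQuery(new Term("body", "term")));
    }
    searchReader.close();
  }
}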
[jira] Commented: (LUCENE-966) A faster JFlex-based replacement for StandardAnalyzer
[ https://issues.apache.org/jira/browse/LUCENE-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517097 ] Doron Cohen commented on LUCENE-966: The search quality test failure can be caused by the standard analyzer generating different tokens than before. (has nothing to do with token types.) This is because the test's topics (queries) and qrels (expected matches) were created by examining an index that was created using the current standard analyzer. Now, running this test with an analyzer that creates other tokens is likely to fail. It is not difficult to update this test for a modified analyzer, but it seems better to me to preserve the original standard analyzer behavior. > A faster JFlex-based replacement for StandardAnalyzer > - > > Key: LUCENE-966 > URL: https://issues.apache.org/jira/browse/LUCENE-966 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis >Reporter: Stanislaw Osinski > Fix For: 2.3 > > Attachments: AnalyzerBenchmark.java, jflex-analyzer-patch.txt, > jflex-analyzer-r560135-patch.txt, jflex-analyzer-r561292-patch.txt, > jflex-analyzer-r561693-compatibility.txt > > > JFlex (http://www.jflex.de/) can be used to generate a faster (up to several > times) replacement for StandardAnalyzer. Will add a patch and a simple > benchmark code in a while. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-969) Optimize the core tokenizers/analyzers & deprecate Token.termText
[ https://issues.apache.org/jira/browse/LUCENE-969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-969: -- Attachment: LUCENE-969.take2.patch Updated patch based on recent commits; fixed up the javadocs and a few other small things. I think this is ready to commit but I'll wait a few days for more comments... > Optimize the core tokenizers/analyzers & deprecate Token.termText > - > > Key: LUCENE-969 > URL: https://issues.apache.org/jira/browse/LUCENE-969 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis >Affects Versions: 2.3 >Reporter: Michael McCandless >Assignee: Michael McCandless >Priority: Minor > Fix For: 2.3 > > Attachments: LUCENE-969.patch, LUCENE-969.take2.patch > > > There is some "low hanging fruit" for optimizing the core tokenizers > and analyzers: > - Re-use a single Token instance during indexing instead of creating > a new one for every term. To do this, I added a new method "Token > next(Token result)" (Doron's suggestion) which means TokenStream > may use the "Token result" as the returned Token, but is not > required to (ie, can still return an entirely different Token if > that is more convenient). I added default implementations for > both next() methods in TokenStream.java so that a TokenStream can > choose to implement only one of the next() methods. > - Use "char[] termBuffer" in Token instead of the "String > termText". > Token now maintains a char[] termBuffer for holding the term's > text. Tokenizers & filters should retrieve this buffer and > directly alter it to put the term text in or change the term > text. > I only deprecated the termText() method. I still allow the ctors > that pass in String termText, as well as setTermText(String), but > added a NOTE about performance cost of using these methods. I > think it's OK to keep these as convenience methods? > After the next release, when we can remove the deprecated API, we > should clean up Token.java to no longer maintain "either String or > char[]" (and the initTermBuffer() private method) and always use > the char[] termBuffer instead. > - Re-use TokenStream instances across Fields & Documents instead of > creating a new one for each doc. To do this I added an optional > "reusableTokenStream(...)" to Analyzer which just defaults to > calling tokenStream(...), and then I implemented this for the core > analyzers. > I'm using the patch from LUCENE-967 for benchmarking just > tokenization. > The changes above give 21% speedup (742 seconds -> 585 seconds) for > LowerCaseTokenizer -> StopFilter -> PorterStemFilter chain, tokenizing > all of Wikipedia, on JDK 1.6 -server -Xmx1024M, Debian Linux, RAID 5 > IO system (best of 2 runs). > If I pre-break Wikipedia docs into 100 token docs then it's 37% faster > (1236 sec -> 774 sec), I think because of re-using TokenStreams across > docs. > I'm just running with this alg and recording the elapsed time: > analyzer=org.apache.lucene.analysis.LowercaseStopPorterAnalyzer > doc.tokenize.log.step=5 > docs.file=/lucene/wikifull.txt > doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker > doc.tokenized=true > doc.maker.forever=false > {ReadTokens > : * > See this thread for discussion leading up to this: > http://www.gossamer-threads.com/lists/lucene/java-dev/51283 > I also fixed Token.toString() to work correctly when termBuffer is > used (and added unit test). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. 
- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
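To make the reuse pattern described in LUCENE-969 concrete, here is a minimal filter written against the API the issue describes (the "Token next(Token result)" method and the char[] termBuffer accessors are taken from the patch description above; this sketch is not part of the patch itself). It lower-cases each term in place instead of allocating a new String:

import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public final class LowerCaseReuseFilter extends TokenFilter {
  public LowerCaseReuseFilter(TokenStream input) {
    super(input);
  }

  public Token next(Token result) throws IOException {
    // let the upstream stream fill the reusable Token (it may also hand back
    // an entirely different Token instance, which the contract allows)
    result = input.next(result);
    if (result == null) {
      return null;                       // end of stream
    }
    char[] buffer = result.termBuffer(); // edit the term text in place
    int length = result.termLength();
    for (int i = 0; i < length; i++) {
      buffer[i] = Character.toLowerCase(buffer[i]);
    }
    return result;
  }
}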