[jira] Commented: (LUCENE-966) A faster JFlex-based replacement for StandardAnalyzer

2007-08-01 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516893
 ] 

Stanislaw Osinski commented on LUCENE-966:
--

When digging deeper into the issues of compatibility with the original 
StandardAnalyzer, I stumbled upon something strange. Take the following text:

78academyawards/rules/rule02.html,7194,7227,type

which was tokenized by the original StandardAnalyzer as one <NUM> token. If you look 
at the definition of the <NUM> token:

// every other segment must have at least one digit
<NUM: (<ALPHANUM> <P> <HAS_DIGIT>
     | <HAS_DIGIT> <P> <ALPHANUM>
     | <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
     | <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
     | <ALPHANUM> <P> <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
     | <HAS_DIGIT> <P> <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
    )
>

you'll see that, as explained in the comment, every other segment must have at 
least one digit. But actually, according to my understanding, this rule should 
not match the above text as a whole (and with JFlex it indeed doesn't). 
Below is the text split on punctuation characters, and it looks like there is 
no way of splitting it into alternating segments, every second of which 
must have a digit (A = ALPHANUM, H = HAS_DIGIT, P = punctuation):

78academyawards / rules / rule02 . html , 7194 , 7227 , type

H  P  A  P  H  P  A  P  H  P  A  P  H?*        (starting from the beginning)
                        H?* P  A  P  H  P  A   (starting from the end)

* (would have to be H, but there are no digits in the substring "type" or "html")

I have no idea why JavaCC matched the whole text as a <NUM>; JFlex behaved 
"more correctly" here.

Now I can see two solutions:

* try to patch the JFlex grammar to emulate JavaCC quirks (though I may not be 
aware of most of them...)
* relax the <NUM> rule a little bit (JFlex notation):

// there must be at least one segment with a digit
NUM = ({P} ({HAS_DIGIT} | {ALPHANUM}))* {HAS_DIGIT} ({P} ({HAS_DIGIT} | {ALPHANUM}))*

With this definition, again, all StandardAnalyzer tests pass, plus all texts 
along the lines of:

2006-03-11t082958z_01_ban130523_rtridst_0_ozabs,2076,2123,type
78academyawards/rules/rule02.html,7194,7227,type
978-0-94045043-1,86408,86424,type
62.46,37004,37009,type (this one was parsed as a different token type by the 
original analyzer)

get parsed as a whole as one <NUM>, which is equivalent to what the JavaCC-based 
version would do. I will attach a corresponding patch in a second.



> A faster JFlex-based replacement for StandardAnalyzer
> -
>
> Key: LUCENE-966
> URL: https://issues.apache.org/jira/browse/LUCENE-966
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis
>Reporter: Stanislaw Osinski
> Fix For: 2.3
>
> Attachments: AnalyzerBenchmark.java, jflex-analyzer-patch.txt, 
> jflex-analyzer-r560135-patch.txt, jflex-analyzer-r561292-patch.txt
>
>
> JFlex (http://www.jflex.de/) can be used to generate a faster (up to several 
> times) replacement for StandardAnalyzer. Will add a patch and a simple 
> benchmark code in a while.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





various IndexReader methods -- was: Re: [jira] Updated: (LUCENE-832) NPE when calling isCurrent() on a ParallelReader

2007-08-01 Thread Chris Hostetter

is it just me, or does it seem like the base class versions of
getVersion(), isOptimized(), and isCurrent() in IndexReader should all
throw UnsupportedOperationException?

(it seems like ideally they should be abstract, but that ship/API has sailed)
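
to be concrete, here's the pattern i mean, sketched on a made-up mini reader 
class (illustrative only, not the real IndexReader):

    // The base class has no sensible answer, so it refuses instead of guessing.
    abstract class BaseReader {
      public long getVersion() {
        throw new UnsupportedOperationException(getClass().getName() + " does not support getVersion()");
      }
      public boolean isOptimized() {
        throw new UnsupportedOperationException(getClass().getName() + " does not support isOptimized()");
      }
      public boolean isCurrent() {
        throw new UnsupportedOperationException(getClass().getName() + " does not support isCurrent()");
      }
    }

    // A concrete reader overrides only the methods it can actually answer;
    // anything it doesn't override fails loudly instead of returning a bogus value.
    class SingleIndexReader extends BaseReader {
      private final long version;
      SingleIndexReader(long version) { this.version = version; }
      public long getVersion() { return version; }
      public boolean isCurrent() { return true; }
    }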


: This patch fixes ParallelReader similar to LUCENE-781:
:
:* ParallelReader.getVersion() now throws an
:  UnsupportedOperationException.
:
:* ParallelReader.isOptimized() now checks if all underlying
:  indexes are optimized and returns true in such a case.
:
:* ParallelReader.isCurrent() now checks if all underlying
:  IndexReaders are up to date and returns true in such a case.



-Hoss





[jira] Updated: (LUCENE-966) A faster JFlex-based replacement for StandardAnalyzer

2007-08-01 Thread Stanislaw Osinski (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislaw Osinski updated LUCENE-966:
-

Attachment: jflex-analyzer-r561693-compatibility.txt

A patch for better compatibility with the StandardAnalyzer containing:

* relaxed definition of the <NUM> token
* new test cases in TestStandardAnalyzer

I noticed that with this patch 
org.apache.lucene.benchmark.quality.TestQualityRun.testTrecQuality fails, but 
I'm not sure if this is related to the tokenizer.

> A faster JFlex-based replacement for StandardAnalyzer
> -
>
> Key: LUCENE-966
> URL: https://issues.apache.org/jira/browse/LUCENE-966
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis
>Reporter: Stanislaw Osinski
> Fix For: 2.3
>
> Attachments: AnalyzerBenchmark.java, jflex-analyzer-patch.txt, 
> jflex-analyzer-r560135-patch.txt, jflex-analyzer-r561292-patch.txt, 
> jflex-analyzer-r561693-compatibility.txt
>
>
> JFlex (http://www.jflex.de/) can be used to generate a faster (up to several 
> times) replacement for StandardAnalyzer. Will add a patch and a simple 
> benchmark code in a while.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





[EMAIL PROTECTED]: Project lucene-java (in module lucene-java) failed

2007-08-01 Thread Jason van Zyl
To whom it may engage...

This is an automated request, but not an unsolicited one. For 
more information please visit http://gump.apache.org/nagged.html, 
and/or contact the folk at [EMAIL PROTECTED]

Project lucene-java has an issue affecting its community integration.
This issue affects 3 projects,
 and has been outstanding for 30 runs.
The current state of this project is 'Failed', with reason 'Build Failed'.
For reference only, the following projects are affected by this:
- eyebrowse :  Web-based mail archive browsing
- jakarta-lucene :  Java Based Search Engine
- lucene-java :  Java Based Search Engine


Full details are available at:
http://vmgump.apache.org/gump/public/lucene-java/lucene-java/index.html

That said, some information snippets are provided here.

The following annotations (debug/informational/warning/error messages) were 
provided:
 -DEBUG- Sole output [lucene-core-01082007.jar] identifier set to project name
 -DEBUG- Dependency on javacc exists, no need to add for property javacc.home.
 -INFO- Failed with reason build failed
 -INFO- Failed to extract fallback artifacts from Gump Repository



The following work was performed:
http://vmgump.apache.org/gump/public/lucene-java/lucene-java/gump_work/build_lucene-java_lucene-java.html
Work Name: build_lucene-java_lucene-java (Type: Build)
Work ended in a state of : Failed
Elapsed: 34 secs
Command Line: /usr/lib/jvm/java-1.5.0-sun/bin/java -Djava.awt.headless=true 
-Xbootclasspath/p:/srv/gump/public/workspace/xml-commons/java/external/build/xml-apis.jar:/srv/gump/public/workspace/xml-xerces2/build/xercesImpl.jar
 org.apache.tools.ant.Main -Dgump.merge=/srv/gump/public/gump/work/merge.xml 
-Dbuild.sysclasspath=only -Dversion=01082007 
-Djavacc.home=/srv/gump/packages/javacc-3.1 package 
[Working Directory: /srv/gump/public/workspace/lucene-java]
CLASSPATH: 
/usr/lib/jvm/java-1.5.0-sun/lib/tools.jar:/srv/gump/public/workspace/lucene-java/build/classes/java:/srv/gump/public/workspace/lucene-java/build/classes/demo:/srv/gump/public/workspace/lucene-java/build/classes/test:/srv/gump/public/workspace/lucene-java/contrib/db/bdb/lib/db-4.3.29.jar:/srv/gump/public/workspace/lucene-java/contrib/gdata-server/lib/gdata-client-1.0.jar:/srv/gump/public/workspace/lucene-java/build/contrib/analyzers/classes/java:/srv/gump/public/workspace/lucene-java/build/contrib/ant/classes/java:/srv/gump/public/workspace/lucene-java/build/contrib/benchmark/classes/java:/srv/gump/public/workspace/lucene-java/build/contrib/db/bdb/classes/java:/srv/gump/public/workspace/lucene-java/build/contrib/db/bdb-je/classes/java:/srv/gump/public/workspace/lucene-java/build/contrib/gdata-server/classes/java:/srv/gump/public/workspace/lucene-java/build/contrib/highlighter/classes/java:/srv/gump/public/workspace/lucene-java/build/contrib/javascript/classes/java:/srv/gump/public/workspace/lucene-java/build/contrib/lucli/classes/java:/srv/gump/public/workspace/lucene-java/build/contrib/memory/classes/java:/srv/gump/public/workspace/lucene-java/build/contrib/queries/classes/java:/srv/gump/public/workspace/lucene-java/build/contrib/regex/classes/java:/srv/gump/public/workspace/lucene-java/build/contrib/similarity/classes/java:/srv/gump/public/workspace/lucene-java/build/contrib/snowball/classes/java:/srv/gump/public/workspace/lucene-java/build/contrib/spellchecker/classes/java:/srv/gump/public/workspace/lucene-java/build/contrib/surround/classes/java:/srv/gump/public/workspace/lucene-java/build/contrib/swing/classes/java:/srv/gump/public/workspace/lucene-java/build/contrib/wordnet/classes/java:/srv/gump/public/workspace/lucene-java/build/contrib/xml-query-parser/classes/java:/srv/gump/public/workspace/ant/dist/lib/ant-jmf.jar:/srv/gump/public/workspace/ant/dist/lib/ant-swing.jar:/srv/gump/public/workspace/ant/dist/lib/ant-apache-resolver.jar:/srv/gump/public/workspace/ant/dist/lib/ant-trax.jar:/srv/gump/public/workspace/ant/dist/lib/ant-junit.jar:/srv/gump/public/workspace/ant/dist/lib/ant-launcher.jar:/srv/gump/public/workspace/ant/dist/lib/ant-nodeps.jar:/srv/gump/public/workspace/ant/dist/lib/ant.jar:/srv/gump/packages/junit3.8.1/junit.jar:/srv/gump/public/workspace/xml-commons/java/build/resolver.jar:/srv/gump/packages/je-1.7.1/lib/je.jar:/srv/gump/public/workspace/apache-commons/digester/dist/commons-digester.jar:/srv/gump/public/workspace/jakarta-regexp/build/jakarta-regexp-01082007.jar:/srv/gump/packages/javacc-3.1/bin/lib/javacc.jar:/srv/gump/public/workspace/jline/target/jline-0.9.92-SNAPSHOT.jar:/srv/gump/packages/jtidy-04aug2000r7-dev/build/Tidy.jar:/srv/gump/public/workspace/junit/dist/junit-01082007.jar:/srv/gump/public/workspace/xml-commons/java/external/build/xml-apis-ext.jar:/srv/gump/public/workspace/apache-commons/logging/target/commons-logging-01082007.jar:/srv/gump/public/workspace/apache-commons/logging/target/commons-logging-api-01082007.jar:/srv/gump/public/workspace/jakarta-servletapi-5/jsr154/dist/lib/servlet-api.jar:/srv/gump/packages/nekoh


[jira] Commented: (LUCENE-967) Add "tokenize documents only" task to contrib/benchmark

2007-08-01 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516945
 ] 

Michael McCandless commented on LUCENE-967:
---

> Also, I think the addition of printing of elapsed time is redundant, 
> because you get it anyhow as the elapsed time reported for the 
> outermost task sequence. (?)

Duh, right :)  I will remove that.

>  1) in ReadTokensTask change doLogic() to return the number of tokens 
>   processed in that specific call to doLogic() (differs from tokensCount 
>   which aggregates all calls).

Ahh good idea!

>  2) in TestPerfTaskLogic the comment in testReadTokens seems 
>  copy/pasted from testLineDocFile and should be changed. 

Woops, will fix.

>  - Also (I am not sure if it is worth your time, but) to really test it, 
> you 
>  could open a reader against the created index and verify the number 
>  of docs, and also the index sum-of-DF comparing to the total tokens 
>  counts numbers in ReadTokensTask. 

OK I added this too.  Will submit new patch shortly.
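
The added check is roughly along these lines (a sketch, not the actual test code; 
the index path is made up):

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.TermEnum;

    // Open a reader on the index the alg just built, count docs and sum docFreq
    // over all terms, then compare against the counters kept by ReadTokensTask.
    public class VerifyCounts {
      public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("work/index");
        long sumOfDF = 0;
        TermEnum terms = reader.terms();
        while (terms.next()) {
          sumOfDF += terms.docFreq();   // adds one per (term, doc) pair
        }
        terms.close();
        System.out.println("numDocs=" + reader.numDocs() + " sumOfDF=" + sumOfDF);
        reader.close();
      }
    }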

> Add "tokenize documents only" task to contrib/benchmark
> ---
>
> Key: LUCENE-967
> URL: https://issues.apache.org/jira/browse/LUCENE-967
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/benchmark
>Affects Versions: 2.3
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.3
>
> Attachments: LUCENE-967.patch, LUCENE-967.take2.patch
>
>
> I've been looking at performance improvements to tokenization by
> re-using Tokens, and to help benchmark my changes I've added a new
> task called ReadTokens that just steps through all fields in a
> document, gets a TokenStream, and reads all the tokens out of it.
> EG this alg just reads all Tokens for all docs in Reuters collection:
>   doc.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersDocMaker
>   doc.maker.forever=false
>   {ReadTokens > : *

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





[jira] Updated: (LUCENE-967) Add "tokenize documents only" task to contrib/benchmark

2007-08-01 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-967:
--

Attachment: LUCENE-967.take3.patch

> Add "tokenize documents only" task to contrib/benchmark
> ---
>
> Key: LUCENE-967
> URL: https://issues.apache.org/jira/browse/LUCENE-967
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/benchmark
>Affects Versions: 2.3
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.3
>
> Attachments: LUCENE-967.patch, LUCENE-967.take2.patch, 
> LUCENE-967.take3.patch
>
>
> I've been looking at performance improvements to tokenization by
> re-using Tokens, and to help benchmark my changes I've added a new
> task called ReadTokens that just steps through all fields in a
> document, gets a TokenStream, and reads all the tokens out of it.
> EG this alg just reads all Tokens for all docs in Reuters collection:
>   doc.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersDocMaker
>   doc.maker.forever=false
>   {ReadTokens > : *

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





[jira] Commented: (LUCENE-971) Create enwiki indexable data as line-per-article rather than file-per-article

2007-08-01 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516996
 ] 

Michael McCandless commented on LUCENE-971:
---

This looks great!

One alternate approach here would be to create a WikipediaDocMaker
(implementing DocMaker interface) that pulls directly from the XML
file and feeds documents into the alg.

Then, to make a line file, one could create an alg that pulls docs
from WikipediaDocMaker and uses WriteLineDoc task to create the
line-by-line file.

One benefit of this approach is creating docs of a certain size (10
tokens, 100 tokens, etc) would become a one-step process (single alg)
instead of what I think is a 2-step process now (make first line file,
then reprocess into second line file).  Another benefit would be you
could make wikipedia tasks that pull directly from the XML file and
not even use a line file as an intermediary.

Steve, do you think this would be a hard change?  I think it should be
easy, except I'm not sure how to do this w/ SAX, since SAX is "in
control".  You sort of need coroutines.  Or maybe one thread runs
SAX and puts doc data into a shared queue, and the other thread (the
normal "main" thread that the benchmark runs in) pulls from this
queue?
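
Something like this very rough sketch is what I have in mind (all names here are 
hypothetical, none of this is existing benchmark API):

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // One thread runs SAX over the enwiki XML and pushes per-article data into a
    // bounded queue; the benchmark's "main" thread pulls from the queue whenever
    // the (hypothetical) WikipediaDocMaker is asked for its next document.
    class ArticleQueue {
      private static final String[] END = new String[0];   // end-of-input marker
      private final BlockingQueue<String[]> queue = new ArrayBlockingQueue<String[]>(64);

      // Called from the SAX ContentHandler thread at the end of each <page> element.
      void addArticle(String title, String date, String body) throws InterruptedException {
        queue.put(new String[] { title, date, body });
      }

      // Called from the SAX thread once the whole dump has been parsed.
      void finished() throws InterruptedException {
        queue.put(END);
      }

      // Called from the benchmark thread; returns null when there are no more articles.
      String[] nextArticle() throws InterruptedException {
        String[] next = queue.take();
        return next == END ? null : next;
      }
    }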


> Create enwiki indexable data as line-per-article rather than file-per-article
> -
>
> Key: LUCENE-971
> URL: https://issues.apache.org/jira/browse/LUCENE-971
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Steven Parkes
> Attachments: LUCENE-971.patch.txt
>
>
> Create a line per article rather than a file. Consume with indexLineFile task.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





[jira] Commented: (LUCENE-971) Create enwiki indexable data as line-per-article rather than file-per-article

2007-08-01 Thread Steven Parkes (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516997
 ] 

Steven Parkes commented on LUCENE-971:
--

I can look at what it would take to avoid the line file ... but ... what about 
the overhead of the XML parser? I don't tend to think of XML parsers as 
"light". Would bundling that into the test be a concern?

I guess it's not an issue if you're just using this to create an index and then 
are going to do your performance measurements on queries against the index. But 
for measuring indexing performance, I would probably be cautious about bundling 
in the XML processing (until it's proven insignificant).

> Create enwiki indexable data as line-per-article rather than file-per-article
> -
>
> Key: LUCENE-971
> URL: https://issues.apache.org/jira/browse/LUCENE-971
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Steven Parkes
> Attachments: LUCENE-971.patch.txt
>
>
> Create a line per article rather than a file. Consume with indexLineFile task.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





[jira] Commented: (LUCENE-966) A faster JFlex-based replacement for StandardAnalyzer

2007-08-01 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517003
 ] 

Michael McCandless commented on LUCENE-966:
---

Oddly, the patch for TestStandardAnalyzer failed to apply for me (but
the rest did), so I manually merged those changes in.  Oh, I see: it
was the "Korean words" test -- somehow the characters got mapped to
?'s in your patch.  This is why the patch didn't apply, I think?
Maybe you used a diffing tool that wasn't happy with unicode or
something?

I also see the quality test failing in contrib benchmark.  I fear
something about the new StandardAnalyzer is in fact causing this test
to fail (it passes on a clean checkout).  That test uses
StandardAnalyzer.

OK, I re-tested the old vs. new StandardAnalyzer on Wikipedia and I
still found some differences, I think only on these very large
URL-like tokens.  Here's one:

  OLD
(money.cnn.com,1382,1395,type=)
(magazines,1396,1405,type=)
(fortune,1406,1413,type=)
(fortune,1414,1421,type=)
(archive/2007/03/19/8402357,1422,1448,type=)
(index.htm,1449,1458,type=)

  NEW

(/money.cnn.com/magazines/fortune/fortune_archive/2007/03/19/8402357/index.htm,1381,1458,type=)

I like the NEW behavior better but I fear we should try to match the
old one?


Here's another one:

  OLD
(mid-20th,2436,2444,type=)

  NEW
(mid,2436,2439,type=)
(-20th,2439,2444,type=)

I like the old behavior better here.

Another one:

  OLD
(safari-0-sheikh,12011,12026,type=)
(zayed,12027,12032,type=)
(grand,12033,12038,type=)
(mosque.jpg,12039,12049,type=)

  NEW
(safari,12011,12017,type=)
(0-sheikh-zayed-grand-mosque.jpg,12018,12049,type=)

Another one:

  OLD
(semitica-01.png,616,631,type=)

  NEW
(-semitica-01.png,615,631,type=)



> A faster JFlex-based replacement for StandardAnalyzer
> -
>
> Key: LUCENE-966
> URL: https://issues.apache.org/jira/browse/LUCENE-966
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis
>Reporter: Stanislaw Osinski
> Fix For: 2.3
>
> Attachments: AnalyzerBenchmark.java, jflex-analyzer-patch.txt, 
> jflex-analyzer-r560135-patch.txt, jflex-analyzer-r561292-patch.txt, 
> jflex-analyzer-r561693-compatibility.txt
>
>
> JFlex (http://www.jflex.de/) can be used to generate a faster (up to several 
> times) replacement for StandardAnalyzer. Will add a patch and a simple 
> benchmark code in a while.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





[jira] Commented: (LUCENE-971) Create enwiki indexable data as line-per-article rather than file-per-article

2007-08-01 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517007
 ] 

Michael McCandless commented on LUCENE-971:
---


> I can look at what it would take to avoid the line file ... but
> ... what about the overhead of the XML parser? I don't tend to think
> of XML parsers as "light". Would bundling that into the test be a
> concern?

Right I too would not consider XML parsing overhead "light".  So tests
that are sensitive to the XML parsing cost should first create a line
file.

But, this is the case regardless of which approach we use (ie, both
approaches allow you to use a line file -- the WriteLineDocTask writes a
line file from any DocMaker).  It's just that the new approach would
buy us more flexibility for those people who don't need (or want) to
use the line file as an intermediary.
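
Roughly, the two-step line-file flow could then look like this in an alg file 
(the WikipediaDocMaker class doesn't exist yet and the property names are from 
memory, so treat this as a sketch; LineDocMaker/docs.file are the existing pieces):

    # step 1 (run once): parse the XML dump and write the line file
    doc.maker=org.apache.lucene.benchmark.byTask.feeds.WikipediaDocMaker
    line.file.out=work/enwiki.txt
    doc.maker.forever=false
    {WriteLineDoc > : *

    # step 2 (later runs): index straight from the line file, no XML parsing cost
    doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker
    docs.file=work/enwiki.txt
    doc.maker.forever=false
    {AddDoc > : *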


> Create enwiki indexable data as line-per-article rather than file-per-article
> -
>
> Key: LUCENE-971
> URL: https://issues.apache.org/jira/browse/LUCENE-971
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Steven Parkes
> Attachments: LUCENE-971.patch.txt
>
>
> Create a line per article rather than a file. Consume with indexLineFile task.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





[jira] Commented: (LUCENE-967) Add "tokenize documents only" task to contrib/benchmark

2007-08-01 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517031
 ] 

Doron Cohen commented on LUCENE-967:


Thanks for fixing this Michael, looks perfect to me now.

> Add "tokenize documents only" task to contrib/benchmark
> ---
>
> Key: LUCENE-967
> URL: https://issues.apache.org/jira/browse/LUCENE-967
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/benchmark
>Affects Versions: 2.3
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.3
>
> Attachments: LUCENE-967.patch, LUCENE-967.take2.patch, 
> LUCENE-967.take3.patch
>
>
> I've been looking at performance improvements to tokenization by
> re-using Tokens, and to help benchmark my changes I've added a new
> task called ReadTokens that just steps through all fields in a
> document, gets a TokenStream, and reads all the tokens out of it.
> EG this alg just reads all Tokens for all docs in Reuters collection:
>   doc.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersDocMaker
>   doc.maker.forever=false
>   {ReadTokens > : *

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





[jira] Commented: (LUCENE-967) Add "tokenize documents only" task to contrib/benchmark

2007-08-01 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517035
 ] 

Michael McCandless commented on LUCENE-967:
---

Thank you for reviewing!  I will commit shortly.

> Add "tokenize documents only" task to contrib/benchmark
> ---
>
> Key: LUCENE-967
> URL: https://issues.apache.org/jira/browse/LUCENE-967
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/benchmark
>Affects Versions: 2.3
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.3
>
> Attachments: LUCENE-967.patch, LUCENE-967.take2.patch, 
> LUCENE-967.take3.patch
>
>
> I've been looking at performance improvements to tokenization by
> re-using Tokens, and to help benchmark my changes I've added a new
> task called ReadTokens that just steps through all fields in a
> document, gets a TokenStream, and reads all the tokens out of it.
> EG this alg just reads all Tokens for all docs in Reuters collection:
>   doc.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersDocMaker
>   doc.maker.forever=false
>   {ReadTokens > : *

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





[jira] Commented: (LUCENE-971) Create enwiki indexable data as line-per-article rather than file-per-article

2007-08-01 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517048
 ] 

Doron Cohen commented on LUCENE-971:


Mmm... an additional advantage of this is not needing to extract 
the entire enwiki collection in order to index it - setting the 
repetition count to 100 for AddDocTask in alternative 1, or for 
WriteLineDocTask in alternative 2, would mean that only 100 
docs from the huge file are extracted.
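
For instance (a sketch, reusing the hypothetical names from this discussion), 
alternative 1 with a fixed count would be just:

    doc.maker=org.apache.lucene.benchmark.byTask.feeds.WikipediaDocMaker
    doc.maker.forever=false
    {AddDoc > : 100

and only the first 100 articles would ever be pulled out of the XML dump.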

> Create enwiki indexable data as line-per-article rather than file-per-article
> -
>
> Key: LUCENE-971
> URL: https://issues.apache.org/jira/browse/LUCENE-971
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Steven Parkes
> Attachments: LUCENE-971.patch.txt
>
>
> Create a line per article rather than a file. Consume with indexLineFile task.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





[jira] Commented: (LUCENE-971) Create enwiki indexable data as line-per-article rather than file-per-article

2007-08-01 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517047
 ] 

Doron Cohen commented on LUCENE-971:


> But, this is the case regardless of which approach we use (ie, both
> approaches allow you to use a line file -- the WriteLineDocTask writes a
> line file from any DocMaker).  It's just that the new approach would
> buy us more flexibility for those people who don't need (or want) to
> use the line file as an intermediary.

So there would now be two alternative ways to index wiki data:
(1) using the proposed WikiDocMaker directly to feed the AddDoc task;
(2) using a line file, after first running WriteLineDocTask with 
WikiDocMaker as the doc maker.

I like this approach.

This means that WikiDocMaker would read the data straight from 
temp/enwiki-20070527-pages-articles.xml. So the extract-enwiki 
target in build.xml would no longer be needed, right?



> Create enwiki indexable data as line-per-article rather than file-per-article
> -
>
> Key: LUCENE-971
> URL: https://issues.apache.org/jira/browse/LUCENE-971
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Steven Parkes
> Attachments: LUCENE-971.patch.txt
>
>
> Create a line per article rather than a file. Consume with indexLineFile task.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





[jira] Resolved: (LUCENE-967) Add "tokenize documents only" task to contrib/benchmark

2007-08-01 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-967.
---

   Resolution: Fixed
Lucene Fields: [New, Patch Available]  (was: [Patch Available, New])

> Add "tokenize documents only" task to contrib/benchmark
> ---
>
> Key: LUCENE-967
> URL: https://issues.apache.org/jira/browse/LUCENE-967
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/benchmark
>Affects Versions: 2.3
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.3
>
> Attachments: LUCENE-967.patch, LUCENE-967.take2.patch, 
> LUCENE-967.take3.patch
>
>
> I've been looking at performance improvements to tokenization by
> re-using Tokens, and to help benchmark my changes I've added a new
> task called ReadTokens that just steps through all fields in a
> document, gets a TokenStream, and reads all the tokens out of it.
> EG this alg just reads all Tokens for all docs in Reuters collection:
>   doc.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersDocMaker
>   doc.maker.forever=false
>   {ReadTokens > : *

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





[jira] Updated: (LUCENE-743) IndexReader.reopen()

2007-08-01 Thread Michael Busch (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Busch updated LUCENE-743:
-

Attachment: lucene-743.patch

Now that LUCENE-781, LUCENE-970 and LUCENE-832 are committed, I updated the latest
patch here, which was easier now because MultiReader is separated into two
classes.

Notes:
   * As Hoss suggested, I added reopen() to IndexReader as a non-static method.
   * MultiReader and ParallelReader now override reopen() to reopen their
     subreaders recursively.
   * FilteredReader also overrides reopen(). It checks whether the underlying
     reader has changed, and in that case returns a new instance of FilteredReader.
 
I think the general contract of reopen() should be to always return a new 
IndexReader instance if it was successfully refreshed, and to return the same 
instance otherwise, because IndexReaders are used as keys in caches.
A remaining question here is whether the old reader(s) should then be closed or not.
This patch closes the old readers for now; if we want to change that, we probably 
have to add some reference counting mechanism, as Robert already suggested. Then I 
would also have to change the SegmentReader.reopen() implementation to clone 
resources like the dictionary, norms and delete bits.
I think closing the old reader is fine. What do others think? Is keeping the old 
reader around after a reopen() a useful use case?
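
For reference, this is how I picture client code using reopen() under that 
contract (sketch only; it assumes the patch is applied, and the path is made up):

    import org.apache.lucene.index.IndexReader;

    public class ReopenExample {
      public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("/path/to/index");
        // ... later, when the index may have changed:
        IndexReader refreshed = reader.reopen();
        if (refreshed != reader) {
          // The index changed: reopen() returned a new instance and (with the
          // current patch) has already closed the old one, so just switch over.
          reader = refreshed;
        }
        // Getting the same instance back means nothing changed, so caches keyed
        // on the reader stay valid.
        reader.close();
      }
    }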

> IndexReader.reopen()
> 
>
> Key: LUCENE-743
> URL: https://issues.apache.org/jira/browse/LUCENE-743
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Otis Gospodnetic
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 2.3
>
> Attachments: IndexReaderUtils.java, lucene-743.patch, 
> lucene-743.patch, lucene-743.patch, MyMultiReader.java, MySegmentReader.java
>
>
> This is Robert Engels' implementation of IndexReader.reopen() functionality, 
> as a set of 3 new classes (this was easier for him to implement, but should 
> probably be folded into the core, if this looks good).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





[jira] Commented: (LUCENE-743) IndexReader.reopen()

2007-08-01 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517087
 ] 

Michael Busch commented on LUCENE-743:
--

I ran some quick performance tests with this patch:

1) The test opens an IndexReader, deletes one document by random docid, closes 
the Reader.
   So this reader doesn't have to open the dictionary or the norms.
2) Another reader is opened (or alternatively reopened) and one TermQuery is 
executed, so 
   this reader has to read the norms and the dictionary. 

I run these two steps 5000 times in a loop.
   
First run: index size 4.5M docs, optimized

   * 1) + TermQuery:    103 sec
   * 1) + 2) (open):    806 sec, so open()   takes 703 sec
   * 1) + 2) (reopen):  118 sec, so reopen() takes  15 sec ==> speedup: 46.9x


Second run: index size 3.3M docs, 24 segments (14x 230,000, 10x 10,000)

   * 1) + TermQuery:    235 sec
   * 1) + 2) (open):   1162 sec, so open()   takes 927 sec
   * 1) + 2) (reopen):  321 sec, so reopen() takes  86 sec ==> speedup: 10.8x
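
The loop is roughly the following (a sketch of the test, not the exact code; the 
index path and the query term are made up):

    import java.util.Random;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;

    public class ReopenBenchmark {
      public static void main(String[] args) throws Exception {
        Random random = new Random(42);
        IndexReader reader = IndexReader.open("/path/to/index");
        for (int i = 0; i < 5000; i++) {
          // 1) delete one document by random docid with a short-lived reader
          IndexReader deleter = IndexReader.open("/path/to/index");
          deleter.deleteDocument(random.nextInt(deleter.maxDoc()));
          deleter.close();

          // 2) refresh the searching reader (reopen() vs. a full open()) and run one TermQuery
          IndexReader refreshed = reader.reopen();   // or: IndexReader.open("/path/to/index")
          if (refreshed != reader) {
            reader = refreshed;
          }
          new IndexSearcher(reader).search(new TermQuery(new Term("body", "lucene")));
        }
        reader.close();
      }
    }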

> IndexReader.reopen()
> 
>
> Key: LUCENE-743
> URL: https://issues.apache.org/jira/browse/LUCENE-743
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Otis Gospodnetic
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 2.3
>
> Attachments: IndexReaderUtils.java, lucene-743.patch, 
> lucene-743.patch, lucene-743.patch, MyMultiReader.java, MySegmentReader.java
>
>
> This is Robert Engels' implementation of IndexReader.reopen() functionality, 
> as a set of 3 new classes (this was easier for him to implement, but should 
> probably be folded into the core, if this looks good).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





[jira] Commented: (LUCENE-966) A faster JFlex-based replacement for StandardAnalyzer

2007-08-01 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517097
 ] 

Doron Cohen commented on LUCENE-966:


The search quality test failure can be caused by the standard 
analyzer generating different tokens than before (it has nothing 
to do with token types).

This is because the test's topics (queries) and qrels (expected matches) 
were created by examining an index that was created using the current 
standard analyzer. Now, running this test with an analyzer that creates 
other tokens is likely to fail. 

It is not difficult to update this test for a modified analyzer, but it seems 
better to me to preserve the original standard analyzer behavior.


> A faster JFlex-based replacement for StandardAnalyzer
> -
>
> Key: LUCENE-966
> URL: https://issues.apache.org/jira/browse/LUCENE-966
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis
>Reporter: Stanislaw Osinski
> Fix For: 2.3
>
> Attachments: AnalyzerBenchmark.java, jflex-analyzer-patch.txt, 
> jflex-analyzer-r560135-patch.txt, jflex-analyzer-r561292-patch.txt, 
> jflex-analyzer-r561693-compatibility.txt
>
>
> JFlex (http://www.jflex.de/) can be used to generate a faster (up to several 
> times) replacement for StandardAnalyzer. Will add a patch and a simple 
> benchmark code in a while.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





[jira] Updated: (LUCENE-969) Optimize the core tokenizers/analyzers & deprecate Token.termText

2007-08-01 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-969:
--

Attachment: LUCENE-969.take2.patch

Updated patch based on recent commits; fixed up the javadocs and a few
other small things.  I think this is ready to commit but I'll wait a
few days for more comments...


> Optimize the core tokenizers/analyzers & deprecate Token.termText
> -
>
> Key: LUCENE-969
> URL: https://issues.apache.org/jira/browse/LUCENE-969
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis
>Affects Versions: 2.3
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.3
>
> Attachments: LUCENE-969.patch, LUCENE-969.take2.patch
>
>
> There is some "low hanging fruit" for optimizing the core tokenizers
> and analyzers:
>   - Re-use a single Token instance during indexing instead of creating
> a new one for every term.  To do this, I added a new method "Token
> next(Token result)" (Doron's suggestion) which means TokenStream
> may use the "Token result" as the returned Token, but is not
> required to (ie, can still return an entirely different Token if
> that is more convenient).  I added default implementations for
> both next() methods in TokenStream.java so that a TokenStream can
> choose to implement only one of the next() methods.
>   - Use "char[] termBuffer" in Token instead of the "String
> termText".
> Token now maintains a char[] termBuffer for holding the term's
> text.  Tokenizers & filters should retrieve this buffer and
> directly alter it to put the term text in or change the term
> text.
> I only deprecated the termText() method.  I still allow the ctors
> that pass in String termText, as well as setTermText(String), but
> added a NOTE about performance cost of using these methods.  I
> think it's OK to keep these as convenience methods?
> After the next release, when we can remove the deprecated API, we
> should clean up Token.java to no longer maintain "either String or
> char[]" (and the initTermBuffer() private method) and always use
> the char[] termBuffer instead.
>   - Re-use TokenStream instances across Fields & Documents instead of
> creating a new one for each doc.  To do this I added an optional
> "reusableTokenStream(...)" to Analyzer which just defaults to
> calling tokenStream(...), and then I implemented this for the core
> analyzers.
> I'm using the patch from LUCENE-967 for benchmarking just
> tokenization.
> The changes above give 21% speedup (742 seconds -> 585 seconds) for
> LowerCaseTokenizer -> StopFilter -> PorterStemFilter chain, tokenizing
> all of Wikipedia, on JDK 1.6 -server -Xmx1024M, Debian Linux, RAID 5
> IO system (best of 2 runs).
> If I pre-break Wikipedia docs into 100 token docs then it's 37% faster
> (1236 sec -> 774 sec), I think because of re-using TokenStreams across
> docs.
> I'm just running with this alg and recording the elapsed time:
>   analyzer=org.apache.lucene.analysis.LowercaseStopPorterAnalyzer
>   doc.tokenize.log.step=5
>   docs.file=/lucene/wikifull.txt
>   doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker
>   doc.tokenized=true
>   doc.maker.forever=false
>   {ReadTokens > : *
> See this thread for discussion leading up to this:
>   http://www.gossamer-threads.com/lists/lucene/java-dev/51283
> I also fixed Token.toString() to work correctly when termBuffer is
> used (and added unit test).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

