[jira] Commented: (LUCENE-2287) Unexpected terms are highlighted within nested SpanQuery instances

2010-04-16 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12857904#action_12857904
 ] 

Mark Miller commented on LUCENE-2287:
-

Hey Michael - it looks like there is a lot of reformatting in this patch - if 
it's not too much of a hassle, is it possible to get a patch without the 
formatting changes?

 Unexpected terms are highlighted within nested SpanQuery instances
 --

 Key: LUCENE-2287
 URL: https://issues.apache.org/jira/browse/LUCENE-2287
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/highlighter
Affects Versions: 2.9.1
 Environment: Linux, Solaris, Windows
Reporter: Michael Goddard
Priority: Minor
 Attachments: LUCENE-2287.patch, LUCENE-2287.patch, LUCENE-2287.patch, 
 LUCENE-2287.patch, LUCENE-2287.patch, LUCENE-2287.patch

   Original Estimate: 336h
  Remaining Estimate: 336h

 I haven't yet been able to resolve why I'm seeing spurious highlighting in 
 nested SpanQuery instances.  Briefly, the issue is illustrated by the second 
 instance of Lucene being highlighted in the test below, when it doesn't 
 satisfy the inner span.  There's been some discussion about this on the 
 java-dev list, and I'm opening this issue now because I have made some 
 initial progress on this.
 This new test, added to the  HighlighterTest class in lucene_2_9_1, 
 illustrates this:
 /*
  * Ref: http://www.lucidimagination.com/blog/2009/07/18/the-spanquery/
  */
 public void testHighlightingNestedSpans2() throws Exception {
   String theText = "The Lucene was made by Doug Cutting and Lucene great Hadoop was"; // Problem
   //String theText = "The Lucene was made by Doug Cutting and the great Hadoop was"; // Works okay
   String fieldName = "SOME_FIELD_NAME";
   SpanNearQuery spanNear = new SpanNearQuery(new SpanQuery[] {
       new SpanTermQuery(new Term(fieldName, "lucene")),
       new SpanTermQuery(new Term(fieldName, "doug")) }, 5, true);
   Query query = new SpanNearQuery(new SpanQuery[] { spanNear,
       new SpanTermQuery(new Term(fieldName, "hadoop")) }, 4, true);
   String expected = "The <B>Lucene</B> was made by <B>Doug</B> Cutting and Lucene great <B>Hadoop</B> was";
   //String expected = "The <B>Lucene</B> was made by <B>Doug</B> Cutting and the great <B>Hadoop</B> was";
   String observed = highlightField(query, fieldName, theText);
   System.out.println("Expected: \"" + expected + "\"\nObserved: \"" + observed + "\"");
   assertEquals("Why is that second instance of the term \"Lucene\" highlighted?",
       expected, observed);
 }
 Is this an issue that's arisen before?  I've been reading through the source 
 to QueryScorer, WeightedSpanTerm, WeightedSpanTermExtractor, Spans, and 
 NearSpansOrdered, but haven't found the solution yet.  Initially, I thought 
 that the extractWeightedSpanTerms method in WeightedSpanTermExtractor should 
 be called on each clause of a SpanNearQuery or SpanOrQuery, but that didn't 
 get me too far.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2159) Tool to expand the index for perf/stress testing.

2010-04-14 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12856916#action_12856916
 ] 

Mark Miller commented on LUCENE-2159:
-

There is an excellent section on it in LIA2 :)

 Tool to expand the index for perf/stress testing.
 -

 Key: LUCENE-2159
 URL: https://issues.apache.org/jira/browse/LUCENE-2159
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Affects Versions: 3.0
Reporter: John Wang
 Attachments: ExpandIndex.java


 Sometimes it is useful to take a small-ish index and expand it into a large 
 index with K segments for perf/stress testing. 
 This tool does that. See attached class.




[jira] Commented: (LUCENE-2393) Utility to output total term frequency and df from a lucene index

2010-04-14 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12857155#action_12857155
 ] 

Mark Miller commented on LUCENE-2393:
-

Perhaps this should be combined with the high-freq terms tool ... we could make 
a ton of these little guys, so it's probably best to consolidate them.

 Utility to output total term frequency and df from a lucene index
 -

 Key: LUCENE-2393
 URL: https://issues.apache.org/jira/browse/LUCENE-2393
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Reporter: Tom Burton-West
Priority: Trivial
 Attachments: LUCENE-2393.patch


 This is a command line utility that takes a field name, term, and index 
 directory and outputs the document frequency for the term and the total 
 number of occurrences of the term in the index (i.e. the sum of the tf of the 
 term for each document).  It is useful for estimating the size of the term's 
 entry in the *prx files and the consequent disk I/O demands.
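
As a rough illustration of the two numbers the utility reports (this is a 
self-contained toy model, not the attached patch or Lucene's API): df is the 
count of documents containing the term, and total term frequency is the sum of 
the per-document tf values.

```java
import java.util.List;
import java.util.Map;

// Toy model: each document is a term -> tf map.
class TermStats {
    // Returns { df, totalTf } for the given term over the toy "index".
    static int[] dfAndTotalTf(List<Map<String, Integer>> docs, String term) {
        int df = 0, totalTf = 0;
        for (Map<String, Integer> doc : docs) {
            Integer tf = doc.get(term);
            if (tf != null) {
                df++;          // document contains the term at least once
                totalTf += tf; // accumulate its within-document frequency
            }
        }
        return new int[] { df, totalTf };
    }
}
```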




[jira] Commented: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory

2010-04-11 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12855732#action_12855732
 ] 

Mark Miller commented on LUCENE-2386:
-

Is this change worth it with all of its repercussions? What are the upsides? 
There do appear to be downsides...

 IndexWriter commits unnecessarily on fresh Directory
 

 Key: LUCENE-2386
 URL: https://issues.apache.org/jira/browse/LUCENE-2386
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
 Fix For: 3.1

 Attachments: LUCENE-2386.patch, LUCENE-2386.patch, LUCENE-2386.patch


 I've noticed IndexWriter's ctor commits a first commit (empty one) if a fresh 
 Directory is passed, w/ OpenMode.CREATE or CREATE_OR_APPEND. This seems 
 unnecessary, and kind of brings back an autoCommit mode, in a strange way 
 ... why do we need that commit? Do we really expect people to open an 
 IndexReader on an empty Directory which they just passed to an IW w/ 
 create=true? If they want, they can simply call commit() right away on the IW 
 they created.
 I ran into this when writing a test which committed N times, then compared 
 the number of commits (via IndexReader.listCommits) and was surprised to see 
 N+1 commits.
 Tried to change doCommit to false in IW ctor, but it got IndexFileDeleter 
 jumping on me .. so the change might not be that simple. But I think it's 
 manageable, so I'll try to attack it (and IFD specifically !) back :).
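
The surprise described above can be modeled in a few lines (names are 
illustrative only, not IndexWriter's API): if the constructor records an 
initial commit on a fresh directory, N explicit commits leave N+1 in the list.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative toy: a writer whose ctor commits once on a fresh directory,
// mirroring the behavior the issue complains about.
class ToyWriter {
    final List<String> commits = new ArrayList<>();

    ToyWriter() {
        commits.add("initial (from ctor)"); // the unexpected extra commit
    }

    void commit(String label) {
        commits.add(label);
    }
}
```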




[jira] Commented: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory

2010-04-11 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12855740#action_12855740
 ] 

Mark Miller commented on LUCENE-2386:
-

{quote}I do think this is a good change - IW was previously inconsistent, first 
that it would even make a commit when we no longer have an autoCommit=true, 
and, second, that it would not make the commit for a directory that already had 
an index (we fixed this case a while back). So I like that this fix makes IW's 
init behavior more consistent / simpler.{quote}

That's not a very strong argument for a back-compat break on a minor release 
though...

 IndexWriter commits unnecessarily on fresh Directory
 

 Key: LUCENE-2386
 URL: https://issues.apache.org/jira/browse/LUCENE-2386
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
 Fix For: 3.1

 Attachments: LUCENE-2386.patch, LUCENE-2386.patch, LUCENE-2386.patch, 
 LUCENE-2386.patch






[jira] Commented: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory

2010-04-11 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12855748#action_12855748
 ] 

Mark Miller commented on LUCENE-2386:
-

bq. Hmmm... I think the back compat break is very minor

Yes - it is - but so was the argument for it IMO.

Your extended argument is more compelling though.

 IndexWriter commits unnecessarily on fresh Directory
 

 Key: LUCENE-2386
 URL: https://issues.apache.org/jira/browse/LUCENE-2386
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
 Fix For: 3.1

 Attachments: LUCENE-2386.patch, LUCENE-2386.patch, LUCENE-2386.patch, 
 LUCENE-2386.patch






[jira] Created: (LUCENE-2391) Spellchecker uses default IW mergefactor/ramMB settings of 300/10

2010-04-11 Thread Mark Miller (JIRA)
Spellchecker uses default IW mergefactor/ramMB settings of 300/10
-

 Key: LUCENE-2391
 URL: https://issues.apache.org/jira/browse/LUCENE-2391
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/spellchecker
Reporter: Mark Miller
Priority: Trivial


These settings seem odd - I'd like to investigate what makes most sense here.




[jira] Commented: (LUCENE-2372) Replace deprecated TermAttribute by new CharTermAttribute

2010-04-09 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12855489#action_12855489
 ] 

Mark Miller commented on LUCENE-2372:
-

bq.If I make it final and

+1 - let's just remember to add these breaks to the CHANGES BW-break section...

 Replace deprecated TermAttribute by new CharTermAttribute
 -

 Key: LUCENE-2372
 URL: https://issues.apache.org/jira/browse/LUCENE-2372
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: 3.1
Reporter: Uwe Schindler
 Fix For: 3.1

 Attachments: LUCENE-2372.patch, LUCENE-2372.patch, LUCENE-2372.patch


 After LUCENE-2302 is merged to trunk with flex, we need to carry over all 
 tokenizers and consumers of the TokenStreams to the new CharTermAttribute.
 We should also think about adding a AttributeFactory that creates a subclass 
 of CharTermAttributeImpl that returns collation keys in toBytesRef() 
 accessor. CollationKeyFilter is then obsolete, instead you can simply convert 
 every TokenStream to indexing only CollationKeys by changing the attribute 
 implementation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2074) Use a separate JFlex generated Unicode 4 by Java 5 compatible StandardTokenizer

2010-04-08 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12854899#action_12854899
 ] 

Mark Miller commented on LUCENE-2074:
-

{quote}Uwe, must this be coupled with that issue? This one waits for a long 
time (why? for JFlex 1.5 release?) and protecting against a huge buffer 
allocation can be a real quick and tiny fix. And this one also focuses on 
getting Unicode 5 to work, which is unrelated to the buffer size. But the 
buffer size is not a critical issue either that we need to move fast with it 
... so it's your call. Just thought they are two unrelated problems.{quote}

Agreed. Whether it's fixed as part of this commit or not, it really deserves its 
own issue anyway, for changes and tracking. It has nothing to do with this 
issue other than convenience. 

 Use a separate JFlex generated Unicode 4 by Java 5 compatible 
 StandardTokenizer
 ---

 Key: LUCENE-2074
 URL: https://issues.apache.org/jira/browse/LUCENE-2074
 Project: Lucene - Java
  Issue Type: Bug
Affects Versions: 3.0
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 3.1

 Attachments: jflex-1.4.1-vs-1.5-snapshot.diff, jflexwarning.patch, 
 LUCENE-2074-lucene30.patch, LUCENE-2074.patch, LUCENE-2074.patch, 
 LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, 
 LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch


 The current trunk version of StandardTokenizerImpl was generated by Java 1.4 
 (according to the warning). In Lucene 3.0 we switch to Java 1.5, so we should 
 regenerate the file.
 After regeneration the Tokenizer behaves different for some characters. 
 Because of that we should only use the new TokenizerImpl when 
 Version.LUCENE_30 or LUCENE_31 is used as matchVersion.




[jira] Commented: (LUCENE-1895) Point2D defines equals by comparing double types with ==

2010-04-05 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12853402#action_12853402
 ] 

Mark Miller commented on LUCENE-1895:
-

I put this up not knowing really anything about the specific use case(s) of the 
Point2D class - I have never used Spatial - so close if it makes sense to do so.

My generic worry is that you can come to the *same* double value in two 
different ways, but == will not find them to be equal.
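
A minimal illustration of that worry (standard floating-point behavior, not 
code from the issue): two routes to what is intended to be the same value 
differ in the last bits, so == rejects them while an epsilon comparison, as 
the issue suggests, accepts them.

```java
class DoubleEquals {
    // Compare doubles with a tolerance instead of ==.
    static boolean nearlyEqual(double a, double b, double eps) {
        return Math.abs(a - b) <= eps;
    }
}
```

For example, 0.1 + 0.2 evaluates to 0.30000000000000004 in double arithmetic, 
so it is != 0.3 even though both "should" be the same point coordinate.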

 Point2D defines equals by comparing double types with ==
 

 Key: LUCENE-1895
 URL: https://issues.apache.org/jira/browse/LUCENE-1895
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/spatial
Reporter: Mark Miller
Assignee: Chris Male
Priority: Trivial

 Ideally, this should allow for a margin of error right?




[jira] Commented: (LUCENE-1709) Parallelize Tests

2010-03-23 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12848712#action_12848712
 ] 

Mark Miller commented on LUCENE-1709:
-

+1 on removing those flags - personally I find them unnecessary - and they 
complicate the build.

And I would love to see Lucene run tests in parallel like Solr does now.

 Parallelize Tests
 -

 Key: LUCENE-1709
 URL: https://issues.apache.org/jira/browse/LUCENE-1709
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.4.1
Reporter: Jason Rutherglen
 Fix For: 3.1

 Attachments: LUCENE-1709.patch, runLuceneTests.py

   Original Estimate: 48h
  Remaining Estimate: 48h

 The Lucene tests can be parallelized to make for a faster testing system.  
 This task from ANT can be used: 
 http://ant.apache.org/manual/CoreTasks/parallel.html
 Previous discussion: 
 http://www.gossamer-threads.com/lists/lucene/java-dev/69669
 Notes from Mike M.:
 {quote}
 I'd love to see a clean solution here (the tests are embarrassingly
 parallelizable, and we all have machines with good concurrency these
 days)... I have a rather hacked up solution now, that uses
 -Dtestpackage=XXX to split the tests up.
 Ideally I would be able to say use N threads and it'd do the right
 thing... like the -j flag to make.
 {quote}




[jira] Commented: (LUCENE-1814) Some Lucene tests try and use a Junit Assert in new threads

2010-03-21 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847931#action_12847931
 ] 

Mark Miller commented on LUCENE-1814:
-

Chris Male mentioned to me that he thinks Uwe has fixed this?

 Some Lucene tests try and use a Junit Assert in new threads
 ---

 Key: LUCENE-1814
 URL: https://issues.apache.org/jira/browse/LUCENE-1814
 Project: Lucene - Java
  Issue Type: Bug
Reporter: Mark Miller
Priority: Minor

 There are a few cases in Lucene tests where JUnit Asserts are used inside a 
 new thread's run method - this won't work because JUnit throws an exception 
 when a call to Assert fails - that will kill the thread, but the exception 
 will not propagate to JUnit - so unless a failure is caused later by the 
 thread termination, the Asserts are invalid.
 TestThreadSafe
 TestStressIndexing2
 TestStringIntern
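
A minimal sketch of the failure mode (illustrative, not one of the listed 
tests): an AssertionError thrown in a spawned thread kills only that thread, 
so unless the test thread collects it, the test passes. The helper name here 
is hypothetical; it also shows the usual remedy of capturing the failure via 
an uncaught-exception handler and checking it after join().

```java
import java.util.concurrent.atomic.AtomicReference;

class ThreadAssertDemo {
    // Runs body in a new thread; returns the failure it died with, or null.
    static Throwable runAndCollect(Runnable body) {
        AtomicReference<Throwable> failure = new AtomicReference<>();
        Thread t = new Thread(body);
        // Capture the worker's failure so the test thread can rethrow/check it.
        t.setUncaughtExceptionHandler((thread, e) -> failure.set(e));
        t.start();
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return failure.get();
    }
}
```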




[jira] Commented: (LUCENE-2305) Introduce Version in more places long before 4.0

2010-03-17 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846542#action_12846542
 ] 

Mark Miller commented on LUCENE-2305:
-

Ah, yes - I didn't remember your comment right:

{quote}
We could make the change under Version?  (Change to true, starting in 3.1).

Or maybe not make the change.  If set to true, we use pct deletion on
a segment to reduce its perceived size when selecting merges, which
generally causes segments with pending deletions to be merged away
sooner
{quote}

Sounds like a good move.

 Introduce Version in more places long before 4.0
 

 Key: LUCENE-2305
 URL: https://issues.apache.org/jira/browse/LUCENE-2305
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Shai Erera
 Fix For: 3.1


 We need to introduce Version in as many places as we can (wherever it makes 
 sense of course), and preferably long before 4.0 (or shall I say 3.9?) is 
 out. That way, we can have a bunch of deprecated API now, that will be gone 
 in 4.0, rather than doing it one class at a time and never finish :).
 The purpose is to introduce Version wherever it is mandatory now, and also in 
 places where we think it might be useful in the future (like most of our 
 Analyzers, configured classes and configuration classes).
 I marked this issue for 3.1, though I don't expect it to end in 3.1. I still 
 think it will be done one step at a time, perhaps for cluster of classes 
 together. But on the other hand I don't want to mark it for 4.0.0 because 
 that needs to be resolved much sooner. So if I had a 3.9 version defined, I'd 
 mark it for 3.9. We can do several commits in one issue right? So this one 
 can live for a while in JIRA, while we gradually convert more and more 
 classes.
 The first candidate is InstantiatedIndexWriter which probably should take an 
 IndexWriterConfig. While I converted the code to use IWC, I've noticed 
 Instantiated defaults its maxFieldLength to the current default (10,000) 
 which is deprecated. I couldn't change it for back-compat reasons. But we can 
 upgrade it to accept IWC, and set to unlimited if the version is onOrAfter 
 3.1, otherwise stay w/ the deprecated default.
 if it's acceptable to have several commits in one issue, I can start w/ 
 Instantiated, post a patch and then we can continue to more classes.
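
The version-gated default described for Instantiated can be sketched like this 
(toy enum and config class, standing in for Lucene's Version and the proposed 
IWC; all names here are illustrative):

```java
// Toy stand-in for org.apache.lucene.util.Version to show the gating pattern.
enum ToyVersion {
    LUCENE_29, LUCENE_30, LUCENE_31;

    boolean onOrAfter(ToyVersion other) {
        return compareTo(other) >= 0;
    }
}

class ToyInstantiatedConfig {
    static final int UNLIMITED = Integer.MAX_VALUE;
    final int maxFieldLength;

    ToyInstantiatedConfig(ToyVersion matchVersion) {
        // New default only for 3.1+; older versions keep the deprecated 10,000.
        this.maxFieldLength =
            matchVersion.onOrAfter(ToyVersion.LUCENE_31) ? UNLIMITED : 10000;
    }
}
```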




[jira] Commented: (LUCENE-2320) Add MergePolicy to IndexWriterConfig

2010-03-17 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846622#action_12846622
 ] 

Mark Miller commented on LUCENE-2320:
-

+1 - I've had to do this in the past too. Just dropping tests doesn't seem like 
the way to go in many cases.

 Add MergePolicy to IndexWriterConfig
 

 Key: LUCENE-2320
 URL: https://issues.apache.org/jira/browse/LUCENE-2320
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Shai Erera
Assignee: Michael McCandless
 Fix For: 3.1

 Attachments: LUCENE-2320.patch, LUCENE-2320.patch, LUCENE-2320.patch, 
 LUCENE-2320.patch


 Now that IndexWriterConfig is in place, I'd like to move MergePolicy to it as 
 well. The change is not straightforward and so I've kept it for a separate 
 issue. MergePolicy requires in its ctor an IndexWriter, however none can be 
 passed to it before an IndexWriter actually exists. And today IW may create 
 an MP just for it to be overridden by the application one line afterwards. I 
 don't want to make iw member of MP non-final, or settable by extending 
 classes, however it needs to remain protected so they can access it directly. 
 So the proposed changes are:
 * Add a SetOnce object (to o.a.l.util), or Immutable, which can only be set 
 once (hence its name). It'll have the signature SetOnceT w/ *synchronized 
 setT* and *T get()*. T will be declared volatile, so that get() won't be 
 synchronized.
 * MP will define a *protected final SetOnceIndexWriter writer* instead of 
 the current writer. *NOTE: this is a bw break*. any suggestions are welcomed.
 * MP will offer a public default ctor, together with a set(IndexWriter).
 * IndexWriter will set itself on MP using set(this). Note that if set will be 
 called more than once, it will throw an exception (AlreadySetException - or 
 does someone have a better suggestion, preferably an already existing Java 
 exception?).
 That's the core idea. I'd like to post a patch soon, so I'd appreciate your 
 review and proposals.
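
The proposed SetOnce can be sketched roughly as follows (names and the 
exception taken from the proposal above; this is a sketch, not the committed 
class):

```java
// Sketch of the proposed SetOnce<T>: set() may succeed exactly once; the
// field is volatile so get() needs no synchronization.
final class SetOnce<T> {
    static class AlreadySetException extends IllegalStateException {
        AlreadySetException() {
            super("The object cannot be set twice");
        }
    }

    private volatile T obj;
    private boolean set; // guarded by the synchronized set()

    public synchronized void set(T obj) {
        if (set) throw new AlreadySetException();
        this.obj = obj;
        this.set = true;
    }

    public T get() {
        return obj;
    }
}
```

With this, MergePolicy's writer member stays final while IndexWriter calls 
set(this) exactly once during its own construction.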




[jira] Commented: (LUCENE-2323) reorganize contrib modules

2010-03-17 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846711#action_12846711
 ] 

Mark Miller commented on LUCENE-2323:
-

This reorg is a great step for contrib IMO!

+1

 reorganize contrib modules
 --

 Key: LUCENE-2323
 URL: https://issues.apache.org/jira/browse/LUCENE-2323
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/*
Reporter: Robert Muir

 it would be nice to reorganize contrib modules, so that they are bundled 
 together by functionality.
 For example:
 * the wikipedia contrib is a tokenizer, i think really belongs in 
 contrib/analyzers
 * there are two highlighters, i think could be one highlighters package.
 * there are many queryparsers and queries in different places in contrib




[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers

2010-03-12 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844516#action_12844516
 ] 

Mark Miller commented on LUCENE-2309:
-

bq.  Also IRC is not logged/archived and searchable (I think?) which makes it 
impossible to trace back a discussion, and/or randomly stumble upon it in 
Google.

Apache's rule is: if it didn't happen on the lists, it didn't happen. IRC is a 
great way for people to communicate and hash stuff out, but it's not necessary 
that you follow it. If you have questions or want further elaboration, just 
ask. No one can expect you to follow IRC, nor is it a valid reference for where 
something was decided. IRC is great - I think it has really benefited us to 
have devs discuss things there - but the official position is: if it didn't 
happen on the list, it didn't actually happen.

 Fully decouple IndexWriter from analyzers
 -

 Key: LUCENE-2309
 URL: https://issues.apache.org/jira/browse/LUCENE-2309
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael McCandless

 IndexWriter only needs an AttributeSource to do indexing.
 Yet, today, it interacts with Field instances, holds a private
 analyzers, invokes analyzer.reusableTokenStream, has to deal with a
 wide variety (it's not analyzed; it is analyzed but it's a Reader,
 String; it's pre-analyzed).
 I'd like to have IW only interact with attr sources that already
 arrived with the fields.  This would be a powerful decoupling -- it
 means others are free to make their own attr sources.
 They need not even use any of Lucene's analysis impls; eg they can
 integrate to other things like [OpenPipeline|http://www.openpipeline.org].
 Or make something completely custom.
 LUCENE-2302 is already a big step towards this: it makes IW agnostic
 about which attr is the term, and only requires that it provide a
 BytesRef (for flex).
 Then I think LUCENE-2308 would get us most of the remaining way -- ie, if the
 FieldType knows the analyzer to use, then we could simply create a
 getAttrSource() method (say) on it and move all the logic IW has today
 onto there.  (We'd still need existing IW code for back-compat).




[jira] Commented: (LUCENE-2294) Create IndexWriterConfiguration and store all of IW configuration there

2010-03-10 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12843717#action_12843717
 ] 

Mark Miller commented on LUCENE-2294:
-

bq. If we say Analyzer is mandatory, what will stop us tomorrow from saying 
IndexDeletionPolicy is mandatory?

Nothing ;) But I think Analyzer should be mandatory and that 
IndexDeletionPolicy should not be mandatory, looking at them case by case.

 Create IndexWriterConfiguration and store all of IW configuration there
 ---

 Key: LUCENE-2294
 URL: https://issues.apache.org/jira/browse/LUCENE-2294
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Shai Erera
Assignee: Michael McCandless
 Fix For: 3.1

 Attachments: LUCENE-2294.patch, LUCENE-2294.patch, LUCENE-2294.patch


 I would like to factor out of all IW configuration parameters into a single 
 configuration class, which I propose to name IndexWriterConfiguration (or 
 IndexWriterConfig). I want to store there almost everything besides the 
 Directory, and to reduce all the ctors down to one: IndexWriter(Directory, 
 IndexWriterConfiguration). What I was thinking of storing there are the 
 following parameters:
 * All of ctors parameters, except for Directory.
 * The different setters where it makes sense. For example I still think 
 infoStream should be set on IW directly.
 I'm thinking that IWC should expose everything via setter/getter methods, 
 defaulting to whatever IW defaults to today - except for Analyzer, which will 
 need to be defined in the ctor of IWC and won't have a setter.
 I am not sure why MaxFieldLength is required in all IW ctors, yet IW declares 
 a DEFAULT (which is an int and not a MaxFieldLength). Do we still think that 
 10,000 should be the default? Why not default to UNLIMITED and otherwise let 
 the application decide what LIMITED means for it? I would like to make MFL 
 optional on IWC and default to something, and I hope that default will be 
 UNLIMITED. We can document that on IWC, so that if anyone chooses to move to 
 the new API, he should be aware of that ...
 I plan to deprecate all the ctors and getters/setters and replace them by:
 * One ctor as described above
 * getIndexWriterConfiguration, or simply getConfig, which can then be queried 
 for the setting of interest.
 * About the setters, I think maybe we can just introduce a setConfig method 
 which will override everything that is overridable today, except for 
 Analyzer. So someone could do iw.getConfig().setSomething(); 
 iw.setConfig(newConfig);
 ** The setters on IWC can return an IWC to allow chaining set calls ... so 
 the above will turn into 
 iw.setConfig(iw.getConfig().setSomething1().setSomething2()); 
 BTW, this is needed for Parallel Indexing (see LUCENE-1879), but I think it 
 will greatly simplify IW's API.
 I'll start to work on a patch.
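
The chained-setter proposal above can be sketched as a tiny standalone class. This is only an illustration of the pattern - the names here (IWConfig, setRamBufferMB, setMaxFieldLength) are placeholders, not the actual Lucene API:

```java
// Minimal sketch of a fluent configuration object whose setters return
// "this" so calls can be chained, as proposed for IndexWriterConfig.
// All names are illustrative, not the real Lucene API.
class IWConfig {
    private final String analyzerName;              // mandatory, ctor-only (no setter)
    private double ramBufferMB = 16.0;              // optional, with a default
    private int maxFieldLength = Integer.MAX_VALUE; // default to "unlimited"

    IWConfig(String analyzerName) {
        this.analyzerName = analyzerName;
    }

    IWConfig setRamBufferMB(double mb) {
        this.ramBufferMB = mb;
        return this; // returning this is what enables chaining
    }

    IWConfig setMaxFieldLength(int n) {
        this.maxFieldLength = n;
        return this;
    }

    double getRamBufferMB() { return ramBufferMB; }
    int getMaxFieldLength() { return maxFieldLength; }
    String getAnalyzerName() { return analyzerName; }

    public static void main(String[] args) {
        // iw.setConfig(iw.getConfig().setSomething1().setSomething2()) style:
        IWConfig cfg = new IWConfig("standard")
                .setRamBufferMB(32.0)
                .setMaxFieldLength(10000);
        if (cfg.getRamBufferMB() != 32.0) throw new AssertionError();
        if (cfg.getMaxFieldLength() != 10000) throw new AssertionError();
        System.out.println(cfg.getAnalyzerName());
    }
}
```

The point of returning `this` from each setter is that construction and configuration collapse into one expression, which is what makes a single IndexWriter(Directory, IndexWriterConfiguration) ctor workable.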




[jira] Commented: (LUCENE-2294) Create IndexWriterConfiguration and store all of IW configuration there

2010-03-10 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12843729#action_12843729
 ] 

Mark Miller commented on LUCENE-2294:
-

bq. Question - does SOLR requires everyone to specify an Analyzer, or does it 
come w/ a default one?

Hmm... Solr doesn't really use Lucene analyzers directly.

It comes with a default schema.xml that defines FieldTypes, and field names 
can then be assigned to FieldTypes. So technically speaking, no, Solr does not - 
but because most people build off the example, you could say that it does have 
default example FieldTypes and defaults for which field names map to those. But 
it also only accepts certain example fields with the example schema - you really 
have to go in and customize it to your needs - it's set up to basically show off 
what options are available and work with some demo stuff.

Solr comes with almost no defaults in a way - but it does ship with an example 
setup that is meant to show you how to set things up and what is available. You 
could consider those defaults, since most will build off it.

Example of a Solr analyzer declaration:

{code}
<!-- A general unstemmed text field - good if one does not know the
     language of the field -->
<fieldType name="textgen" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"
            />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="0" catenateNumbers="0"
            catenateAll="0" splitOnCaseChange="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
{code}
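
Conceptually, the index-time chain above tokenizes on whitespace, drops stop words, splits on word delimiters, and lowercases. A rough standalone Java sketch of the tokenize/stop/lowercase flow (plain Java for intuition only - not Solr's factory API, and omitting the word-delimiter step):

```java
import java.util.*;

// Conceptual sketch of an index-time analysis chain: whitespace tokenizer,
// then a stop filter, then a lowercase filter - mirroring the order of the
// tokenizer/filter elements in a fieldType, not Solr's real API.
class MiniChain {
    static List<String> analyze(String text, Set<String> stopwords) {
        List<String> out = new ArrayList<>();
        for (String tok : text.split("\\s+")) {      // whitespace tokenizer
            String lower = tok.toLowerCase(Locale.ROOT);
            if (stopwords.contains(lower)) continue; // stop filter (ignoreCase)
            out.add(lower);                          // lowercase filter
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> stops = new HashSet<>(Arrays.asList("the", "a", "of"));
        List<String> tokens = analyze("The Quick Fox", stops);
        if (!tokens.equals(Arrays.asList("quick", "fox"))) throw new AssertionError();
        System.out.println(tokens);
    }
}
```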


[jira] Commented: (LUCENE-2294) Create IndexWriterConfiguration and store all of IW configuration there

2010-03-10 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12843756#action_12843756
 ] 

Mark Miller commented on LUCENE-2294:
-

I'm assuming you would set an Analyzer for the document - and then you could 
override per field - or something along those lines.





[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery

2010-03-09 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12843331#action_12843331
 ] 

Mark Miller commented on LUCENE-2089:
-

Sweet!

 explore using automaton for fuzzyquery
 --

 Key: LUCENE-2089
 URL: https://issues.apache.org/jira/browse/LUCENE-2089
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: Flex Branch
Reporter: Robert Muir
Assignee: Robert Muir
Priority: Minor
 Fix For: Flex Branch

 Attachments: ContrivedFuzzyBenchmark.java, createLevAutomata.py, 
 gen.py, gen.py, gen.py, gen.py, gen.py, gen.py, 
 Lev2ParametricDescription.java, Lev2ParametricDescription.java, 
 Lev2ParametricDescription.java, Lev2ParametricDescription.java, 
 LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089.patch, 
 LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089.patch, 
 LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089.patch, 
 LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089.patch, 
 LUCENE-2089_concat.patch, Moman-0.2.1.tar.gz, moman-57f5dc9dd0e7.diff, 
 TestFuzzy.java


 we can optimize fuzzyquery by using AutomatonTermsEnum. The idea is to speed 
 up the core FuzzyQuery in similar fashion to Wildcard and Regex speedups, 
 maintaining all backwards compatibility.
 The advantages are:
 * we can seek to terms that are useful, instead of brute-forcing the entire 
 terms dict
 * we can determine matches faster, as true/false from a DFA is an array 
 lookup; we don't even need to run Levenshtein.
 We build Levenshtein DFAs in linear time with respect to the length of the 
 word: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652
 To implement support for 'prefix' length, we simply concatenate two DFAs, 
 which doesn't require us to do NFA-DFA conversion, as the prefix portion is 
 a singleton. The concatenation is also constant time with respect to the size 
 of the fuzzy DFA; it need only examine its start state.
 with this algorithm, parametric tables are precomputed so that DFAs can be 
 constructed very quickly.
 if the required number of edits is too large (we don't have a table for it), 
 we use dumb mode at first (no seeking, no DFA, just brute force like now).
 As the priority queue fills up during enumeration, the similarity score 
 required to be a competitive term increases, so, the enum gets faster and 
 faster as this happens. This is because terms in core FuzzyQuery are sorted 
 by boost value, then by term (in lexicographic order).
 For a large term dictionary with a low minimal similarity, you will fill the 
 pq very quickly since you will match many terms. 
 This not only provides a mechanism to switch to more efficient DFAs (edit 
 distance of 2 -> edit distance of 1 -> edit distance of 0) during 
 enumeration, but also to switch from dumb mode to smart mode.
 With this design, we can add more DFAs at any time by adding additional 
 tables. The tradeoff is the tables get rather large, so for very high K, we 
 would start to increase the size of Lucene's jar file. The idea is that we 
 don't have to include large tables for very high K, by using the 'competitive 
 boost' attribute of the priority queue.
 For more information, see http://en.wikipedia.org/wiki/Levenshtein_automaton
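
For reference, the "brute force" check that dumb mode falls back to is just the classic dynamic-programming edit distance. A minimal Java version (a sketch for intuition only, not the patch's code - the patch builds parametric DFAs precisely to avoid doing this O(m*n) work for every term):

```java
// Brute-force Levenshtein distance, the kind of per-term check that
// "dumb mode" implies when no precomputed parametric table exists for k.
class Levenshtein {
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1,  // insertion
                                            prev[j] + 1),     // deletion
                                   prev[j - 1] + cost);       // substitution
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        if (distance("lucene", "lucine") != 1) throw new AssertionError();
        if (distance("fuzzy", "fuzzy") != 0) throw new AssertionError();
        if (distance("kitten", "sitting") != 3) throw new AssertionError();
        System.out.println("ok");
    }
}
```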




[jira] Commented: (LUCENE-2294) Create IndexWriterConfiguration and store all of IW configuration there

2010-03-04 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12841249#action_12841249
 ] 

Mark Miller commented on LUCENE-2294:
-

I can see the value in this - there are a bunch of IW constructors - but 
personally I still think I prefer them.

Creating config classes to init another class is its own pain in the butt. 
Reminds me of Windows C programming and structs. When I'm just coding away, it's 
so much easier to just enter the params in the constructor. And it seems like it 
would be more difficult to know what's *required* to set on the config class - 
without the same ctor business ...





[jira] Issue Comment Edited: (LUCENE-2294) Create IndexWriterConfiguration and store all of IW configuration there

2010-03-04 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12841249#action_12841249
 ] 

Mark Miller edited comment on LUCENE-2294 at 3/4/10 1:45 PM:
-

I can see the value in this - there are a bunch of IW constructors - but 
personally I still think I prefer them.

Creating config classes to init another class is its own pain in the butt. 
Reminds me of Windows C programming and structs. When I'm just coding away, it's 
so much easier to just enter the params in the constructor. And it seems like it 
would be more difficult to know what's *required* to set on the config class - 
without the same ctor business ...

*edit*

Though I suppose the chaining *does* make this more swallowable...

new IW(new IWConfig(Analyzer).set().set().set()) isn't really so bad ...

  was (Author: markrmil...@gmail.com):
I can see the value in this - there are a bunch of IW constructors - but 
personally I still think I prefer them.

Creating config classes to init another class is its own pain in the butt. 
Reminds me of windows C programming and structs. When I'm just coding away, its 
so much easier to just enter the params in the cnstr. And it seems like it 
would be more difficult to know whats *required* to set on the config class - 
without the same cstr business ...
  




[jira] Commented: (LUCENE-2287) Unexpected terms are highlighted within nested SpanQuery instances

2010-03-01 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12839744#action_12839744
 ] 

Mark Miller commented on LUCENE-2287:
-

bq. Breaks backward compatibility, so need to find a way around that

Wouldn't be the end of the world depending on the break.

 Unexpected terms are highlighted within nested SpanQuery instances
 --

 Key: LUCENE-2287
 URL: https://issues.apache.org/jira/browse/LUCENE-2287
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/highlighter
Affects Versions: 2.9.1
 Environment: Linux, Solaris, Windows
Reporter: Michael Goddard
Priority: Minor
 Attachments: LUCENE-2287.patch, LUCENE-2287.patch, LUCENE-2287.patch, 
 LUCENE-2287.patch, LUCENE-2287.patch

   Original Estimate: 336h
  Remaining Estimate: 336h

 I haven't yet been able to resolve why I'm seeing spurious highlighting in 
 nested SpanQuery instances.  Briefly, the issue is illustrated by the second 
 instance of "Lucene" being highlighted in the test below, when it doesn't 
 satisfy the inner span.  There's been some discussion about this on the 
 java-dev list, and I'm opening this issue now because I have made some 
 initial progress on this.
 This new test, added to the HighlighterTest class in lucene_2_9_1, 
 illustrates this:
 /*
  * Ref: http://www.lucidimagination.com/blog/2009/07/18/the-spanquery/
  */
 public void testHighlightingNestedSpans2() throws Exception {
   String theText = "The Lucene was made by Doug Cutting and Lucene great 
 Hadoop was"; // Problem
   //String theText = "The Lucene was made by Doug Cutting and the great 
 Hadoop was"; // Works okay
   String fieldName = "SOME_FIELD_NAME";
   SpanNearQuery spanNear = new SpanNearQuery(new SpanQuery[] {
     new SpanTermQuery(new Term(fieldName, "lucene")),
     new SpanTermQuery(new Term(fieldName, "doug")) }, 5, true);
   Query query = new SpanNearQuery(new SpanQuery[] { spanNear,
     new SpanTermQuery(new Term(fieldName, "hadoop")) }, 4, true);
   String expected = "The <B>Lucene</B> was made by <B>Doug</B> Cutting and 
 Lucene great <B>Hadoop</B> was";
   //String expected = "The <B>Lucene</B> was made by <B>Doug</B> Cutting and 
 the great <B>Hadoop</B> was";
   String observed = highlightField(query, fieldName, theText);
   System.out.println("Expected: \"" + expected + "\"\n" + "Observed: \"" + 
 observed + "\"");
   assertEquals("Why is that second instance of the term \"Lucene\" 
 highlighted?", expected, observed);
 }
 Is this an issue that's arisen before?  I've been reading through the source 
 to QueryScorer, WeightedSpanTerm, WeightedSpanTermExtractor, Spans, and 
 NearSpansOrdered, but haven't found the solution yet.  Initially, I thought 
 that the extractWeightedSpanTerms method in WeightedSpanTermExtractor should 
 be called on each clause of a SpanNearQuery or SpanOrQuery, but that didn't 
 get me too far.




[jira] Commented: (LUCENE-2226) move contrib/snowball to contrib/analyzers

2010-01-18 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12801972#action_12801972
 ] 

Mark Miller commented on LUCENE-2226:
-

Contrib's back compat policy is that there is no back compat policy unless that 
contrib specifically states one.

 move contrib/snowball to contrib/analyzers
 --

 Key: LUCENE-2226
 URL: https://issues.apache.org/jira/browse/LUCENE-2226
 Project: Lucene - Java
  Issue Type: Task
  Components: contrib/analyzers
Reporter: Robert Muir
Assignee: Robert Muir
 Fix For: 3.1

 Attachments: LUCENE-2226.patch


 to fix bugs in some duplicate, handcoded impls of these stemmers (nl, fr, ru, 
 etc) we should simply merge snowball and analyzers, and replace the buggy 
 impls with the proper snowball stemfilters.




[jira] Commented: (LUCENE-2226) move contrib/snowball to contrib/analyzers

2010-01-18 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12802011#action_12802011
 ] 

Mark Miller commented on LUCENE-2226:
-

{quote}Mark, that is my understanding too. I wasn't commenting on the policy 
but on the fact of the possible breakage. I think it is a courtesy to notify 
users of a change to which they might need to pay attention. I don't know 
that's spelled out in the policy, but I think it should be. Not that a lack of 
notice is a guarantee of no breakage but that a notice is a guarantee of 
breakage (at least under some circumstances).{quote}

Right - I was just pointing out that jar drop in is far from a requirement in 
contrib. We do always try and play nice anyway.

bq. Is there any contrib that specifically states one? I couldn't find it. 

Don't think so - meaning there is no back compat policy in contrib. I think as 
a contrib matures, it's up to those working on it to decide that it has reached 
a state that deserves a policy of some kind. The Highlighter could probably use 
one at this point, but at the same time, nothing has created too much of an 
outcry yet.

bq.  The analysis/common is not clear as it has the Version stuff.

Right - just because there is no policy doesn't mean we shouldn't make any 
attempts at back compat - but the issue you brought up is not something easily 
addressed, nor, I think, large enough to worry about with the proper warning in 
Changes. Users should be wary of contrib on upgrading - unless it presents a 
strong back compat policy.

bq.  But after all the dust settles and this i18n stuff is solid, I think it 
might be reasonable to make a stronger bw compat statement.

I agree - now that contrib has been getting some much needed love recently, I 
think it should start heading towards some back compat promises - especially 
concerning analyzers. We already do tend to bend over backwards when we can 
anyway.

I think we are on the same page - I'm just not very worried about the break you 
mention - I think it's a perfectly acceptable growing pain. And I think our back 
compat has been so weak because contrib has been a bit of a wasteland in the 
past - no one was willing to take ownership of a lot of this stuff - especially 
the language analyzers. That has changed recently. As the devs clean up and 
consolidate this stuff properly, I think we can work towards stronger promises 
in the future.





[jira] Resolved: (LUCENE-2035) TokenSources.getTokenStream() does not assign positionIncrement

2010-01-06 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller resolved LUCENE-2035.
-

Resolution: Fixed

Thanks Christopher!

 TokenSources.getTokenStream() does not assign positionIncrement
 ---

 Key: LUCENE-2035
 URL: https://issues.apache.org/jira/browse/LUCENE-2035
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/highlighter
Affects Versions: 2.4, 2.4.1, 2.9
Reporter: Christopher Morris
Assignee: Mark Miller
 Fix For: 3.1

 Attachments: LUCENE-2035.patch, LUCENE-2035.patch, LUCENE-2305.patch

   Original Estimate: 24h
  Remaining Estimate: 24h

 TokenSources.StoredTokenStream does not assign positionIncrement information. 
 This means that all tokens in the stream are considered adjacent. This has 
 implications for the phrase highlighting in QueryScorer when using 
 non-contiguous tokens.
 For example:
 Consider  a token stream that creates tokens for both the stemmed and 
 unstemmed version of each word - the fox (jump|jumped)
 When retrieved from the index using TokenSources.getTokenStream(tpv,false), 
 the token stream will be - the fox jump jumped
 Now try a search and highlight for the phrase query fox jumped. The search 
 will correctly find the document; the highlighter will fail to highlight the 
 phrase because it thinks that there is an additional word between fox and 
 jumped. If we use the original (from the analyzer) token stream then the 
 highlighter works.
 Also, consider the converse - the fox did not jump
 not is a stop word and there is an option to increment the position to 
 account for stop words - (the,0) (fox,1) (did,2) (jump,4)
 When retrieved from the index using TokenSources.getTokenStream(tpv,false), 
 the token stream will be - (the,0) (fox,1) (did,2) (jump,3).
 So the phrase query did jump will cause the did and jump terms in the 
 text did not jump to be highlighted. If we use the original (from the 
 analyzer) token stream then the highlighter works correctly.
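
 The bookkeeping behind this can be sketched in a few lines of plain Java 
 (illustrative only, not the TokenSources code): absolute positions are the 
 running sum of the increments, so losing the stop word's increment of 2 makes 
 "did" and "jump" look adjacent.

```java
import java.util.*;

// Sketch of why positionIncrement matters: a token's absolute position is
// the running sum of increments. Dropping the removed stop word's extra
// increment makes "did" and "jump" appear adjacent when they are not.
class Positions {
    static Map<String, Integer> positions(String[] terms, int[] increments) {
        Map<String, Integer> pos = new HashMap<>();
        int p = -1;
        for (int i = 0; i < terms.length; i++) {
            p += increments[i];
            pos.put(terms[i], p);
        }
        return pos;
    }

    public static void main(String[] args) {
        String[] terms = {"the", "fox", "did", "jump"};
        // Original stream: "not" was removed, so "jump" gets increment 2.
        Map<String, Integer> good = positions(terms, new int[]{1, 1, 1, 2});
        // Rebuilt stream with increments lost: everything looks adjacent.
        Map<String, Integer> bad = positions(terms, new int[]{1, 1, 1, 1});
        // A phrase query "did jump" requires consecutive positions:
        boolean goodMatch = good.get("jump") - good.get("did") == 1;
        boolean badMatch = bad.get("jump") - bad.get("did") == 1;
        if (goodMatch) throw new AssertionError("gap should prevent the match");
        if (!badMatch) throw new AssertionError("lost increments look adjacent");
        System.out.println("ok");
    }
}
```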




[jira] Updated: (LUCENE-860) site should call project Lucene Java, not just Lucene

2010-01-06 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller updated LUCENE-860:
---

Attachment: LUCENE-860-1.patch

updated patch that also includes doc site level changes

 site should call project Lucene Java, not just Lucene
 -

 Key: LUCENE-860
 URL: https://issues.apache.org/jira/browse/LUCENE-860
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Website
Reporter: Doug Cutting
Assignee: Mark Miller
Priority: Minor
 Fix For: 3.1

 Attachments: LUCENE-860-1.patch, LUCENE-860-2.patch, LUCENE-860.patch


 To avoid confusion with the top-level Lucene project, the Lucene Java website 
 should refer to itself as Lucene Java.




[jira] Updated: (LUCENE-860) site should call project Lucene Java, not just Lucene

2010-01-06 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller updated LUCENE-860:
---

Attachment: LUCENE-860-2.patch

 site should call project Lucene Java, not just Lucene
 -

 Key: LUCENE-860
 URL: https://issues.apache.org/jira/browse/LUCENE-860
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Website
Reporter: Doug Cutting
Assignee: Mark Miller
Priority: Minor
 Fix For: 3.1

 Attachments: LUCENE-860-1.patch, LUCENE-860-2.patch, LUCENE-860.patch


 To avoid confusion with the top-level Lucene project, the Lucene Java website 
 should refer to itself as Lucene Java.




[jira] Commented: (LUCENE-2035) TokenSources.getTokenStream() does not assign positionIncrement

2009-12-17 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12791939#action_12791939
 ] 

Mark Miller commented on LUCENE-2035:
-

I'll commit this soon.

 TokenSources.getTokenStream() does not assign positionIncrement
 ---

 Key: LUCENE-2035
 URL: https://issues.apache.org/jira/browse/LUCENE-2035
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/highlighter
Affects Versions: 2.4, 2.4.1, 2.9
Reporter: Christopher Morris
Assignee: Mark Miller
 Fix For: 3.1

 Attachments: LUCENE-2035.patch, LUCENE-2035.patch, LUCENE-2305.patch

   Original Estimate: 24h
  Remaining Estimate: 24h

 TokenSources.StoredTokenStream does not assign positionIncrement information. 
 This means that all tokens in the stream are considered adjacent. This has 
 implications for the phrase highlighting in QueryScorer when using 
 non-contiguous tokens.
 For example:
 Consider a token stream that creates tokens for both the stemmed and 
 unstemmed version of each word - "the fox (jump|jumped)".
 When retrieved from the index using TokenSources.getTokenStream(tpv,false), 
 the token stream will be - "the fox jump jumped".
 Now try a search and highlight for the phrase query "fox jumped". The search 
 will correctly find the document; the highlighter will fail to highlight the 
 phrase because it thinks that there is an additional word between "fox" and 
 "jumped". If we use the original (from the analyzer) token stream then the 
 highlighter works.
 Also, consider the converse - "the fox did not jump".
 "not" is a stop word and there is an option to increment the position to 
 account for stop words - (the,0) (fox,1) (did,2) (jump,4).
 When retrieved from the index using TokenSources.getTokenStream(tpv,false), 
 the token stream will be - (the,0) (fox,1) (did,2) (jump,3).
 So the phrase query "did jump" will cause the "did" and "jump" terms in the 
 text "did not jump" to be highlighted. If we use the original (from the 
 analyzer) token stream then the highlighter works correctly.
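The failure mode described above can be sketched in plain Java with no Lucene dependency; `Token` and `positions` below are illustrative stand-ins for Lucene's position-increment attribute and phrase matching, not its real API:

```java
import java.util.*;

public class PositionIncrementDemo {
    // A token plus the position increment the analyzer would have produced.
    record Token(String term, int positionIncrement) {}

    // Resolve absolute positions from increments, as a phrase matcher would.
    static Map<Integer, String> positions(List<Token> stream) {
        Map<Integer, String> out = new LinkedHashMap<>();
        int pos = -1;
        for (Token t : stream) {
            pos += t.positionIncrement();
            out.put(pos, t.term());
        }
        return out;
    }

    public static void main(String[] args) {
        // Analyzer output for "the fox did not jump" with "not" removed but
        // its position accounted for: "jump" gets increment 2.
        List<Token> original = List.of(
            new Token("the", 1), new Token("fox", 1),
            new Token("did", 1), new Token("jump", 2));
        // StoredTokenStream (before the fix) treats every increment as 1.
        List<Token> stored = List.of(
            new Token("the", 1), new Token("fox", 1),
            new Token("did", 1), new Token("jump", 1));
        System.out.println(positions(original)); // {0=the, 1=fox, 2=did, 4=jump}
        System.out.println(positions(stored));   // {0=the, 1=fox, 2=did, 3=jump}
    }
}
```

With the increments preserved, a phrase matcher sees the gap at position 3 and does not treat "did" and "jump" as adjacent.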




[jira] Updated: (LUCENE-1922) exposing the ability to get the number of unique term count per field

2009-12-17 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller updated LUCENE-1922:


Affects Version/s: Flex Branch (was: 2.4.1)

 exposing the ability to get the number of unique term count per field
 -

 Key: LUCENE-1922
 URL: https://issues.apache.org/jira/browse/LUCENE-1922
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Affects Versions: Flex Branch
Reporter: John Wang

 Add an api to get the number of unique term count given a field name, e.g.:
 IndexReader.getUniqueTermCount(String field)
 This issue has a dependency on LUCENE-1458
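The requested behavior can be modeled in a few lines of plain Java; the postings map and method below are illustrative stand-ins for an index and for the proposed IndexReader.getUniqueTermCount(String field), not Lucene code:

```java
import java.util.*;

public class UniqueTermCountSketch {
    // Toy model of the proposed API: count distinct terms for one field.
    static int uniqueTermCount(Map<String, List<String>> termsByField, String field) {
        // De-duplicate the field's term occurrences via a HashSet.
        return new HashSet<>(termsByField.getOrDefault(field, List.of())).size();
    }

    public static void main(String[] args) {
        Map<String, List<String>> termsByField = Map.of(
            "title", List.of("lucene", "search", "lucene"),
            "body", List.of("fast", "search"));
        System.out.println(uniqueTermCount(termsByField, "title")); // 2
        System.out.println(uniqueTermCount(termsByField, "body"));  // 2
    }
}
```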




[jira] Commented: (LUCENE-2035) TokenSources.getTokenStream() does not assign positionIncrement

2009-12-16 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12791680#action_12791680
 ] 

Mark Miller commented on LUCENE-2035:
-

Hey Christopher, why are you going through the trouble of the custom collector 
to check that there are no hits? Why not just do a standard search?

 TokenSources.getTokenStream() does not assign positionIncrement
 ---

 Key: LUCENE-2035
 URL: https://issues.apache.org/jira/browse/LUCENE-2035
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/highlighter
Affects Versions: 2.4, 2.4.1, 2.9
Reporter: Christopher Morris
Assignee: Mark Miller
 Fix For: 3.1

 Attachments: LUCENE-2035.patch, LUCENE-2305.patch

   Original Estimate: 24h
  Remaining Estimate: 24h

 TokenSources.StoredTokenStream does not assign positionIncrement information. 
 This means that all tokens in the stream are considered adjacent. This has 
 implications for the phrase highlighting in QueryScorer when using 
 non-contiguous tokens.
 For example:
 Consider a token stream that creates tokens for both the stemmed and 
 unstemmed version of each word - "the fox (jump|jumped)".
 When retrieved from the index using TokenSources.getTokenStream(tpv,false), 
 the token stream will be - "the fox jump jumped".
 Now try a search and highlight for the phrase query "fox jumped". The search 
 will correctly find the document; the highlighter will fail to highlight the 
 phrase because it thinks that there is an additional word between "fox" and 
 "jumped". If we use the original (from the analyzer) token stream then the 
 highlighter works.
 Also, consider the converse - "the fox did not jump".
 "not" is a stop word and there is an option to increment the position to 
 account for stop words - (the,0) (fox,1) (did,2) (jump,4).
 When retrieved from the index using TokenSources.getTokenStream(tpv,false), 
 the token stream will be - (the,0) (fox,1) (did,2) (jump,3).
 So the phrase query "did jump" will cause the "did" and "jump" terms in the 
 text "did not jump" to be highlighted. If we use the original (from the 
 analyzer) token stream then the highlighter works correctly.




[jira] Updated: (LUCENE-2035) TokenSources.getTokenStream() does not assign positionIncrement

2009-12-16 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller updated LUCENE-2035:


Attachment: LUCENE-2035.patch

I've broken the new tests back out into their own file, changed the hit 
collector code to just do a standard search, and improved the test coverage of 
TokenSources a bit.

 TokenSources.getTokenStream() does not assign positionIncrement
 ---

 Key: LUCENE-2035
 URL: https://issues.apache.org/jira/browse/LUCENE-2035
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/highlighter
Affects Versions: 2.4, 2.4.1, 2.9
Reporter: Christopher Morris
Assignee: Mark Miller
 Fix For: 3.1

 Attachments: LUCENE-2035.patch, LUCENE-2035.patch, LUCENE-2305.patch

   Original Estimate: 24h
  Remaining Estimate: 24h

 TokenSources.StoredTokenStream does not assign positionIncrement information. 
 This means that all tokens in the stream are considered adjacent. This has 
 implications for the phrase highlighting in QueryScorer when using 
 non-contiguous tokens.
 For example:
 Consider a token stream that creates tokens for both the stemmed and 
 unstemmed version of each word - "the fox (jump|jumped)".
 When retrieved from the index using TokenSources.getTokenStream(tpv,false), 
 the token stream will be - "the fox jump jumped".
 Now try a search and highlight for the phrase query "fox jumped". The search 
 will correctly find the document; the highlighter will fail to highlight the 
 phrase because it thinks that there is an additional word between "fox" and 
 "jumped". If we use the original (from the analyzer) token stream then the 
 highlighter works.
 Also, consider the converse - "the fox did not jump".
 "not" is a stop word and there is an option to increment the position to 
 account for stop words - (the,0) (fox,1) (did,2) (jump,4).
 When retrieved from the index using TokenSources.getTokenStream(tpv,false), 
 the token stream will be - (the,0) (fox,1) (did,2) (jump,3).
 So the phrase query "did jump" will cause the "did" and "jump" terms in the 
 text "did not jump" to be highlighted. If we use the original (from the 
 analyzer) token stream then the highlighter works correctly.




[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery

2009-12-15 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12790748#action_12790748
 ] 

Mark Miller commented on LUCENE-2089:
-

Sorry Earwin - to be clear, we don't actually use chapter 6 - AutomatonQuery 
needs the automata.

You can get all the states just by taking the power set of the subsumption 
triangle for every base position, and then removing from each set any position 
that's subsumed by another. That's what I mean by brute force. But in the paper, 
they boil this down to nice little parameterized tables, extracting some sort of 
pattern from that process. They give no hint on how they do this, or whether it's 
applicable to larger n's, though. No big deal I guess - the computer can do the 
brute force method - but I wouldn't be surprised if it starts to bog down at 
much higher n's.

 explore using automaton for fuzzyquery
 --

 Key: LUCENE-2089
 URL: https://issues.apache.org/jira/browse/LUCENE-2089
 Project: Lucene - Java
  Issue Type: Wish
  Components: Search
Reporter: Robert Muir
Assignee: Mark Miller
Priority: Minor
 Attachments: LUCENE-2089.patch, Moman-0.2.1.tar.gz, TestFuzzy.java


 Mark brought this up on LUCENE-1606 (i will assign this to him, I know he is 
 itching to write that nasty algorithm)
 we can optimize fuzzyquery by using AutomatonTermsEnum, here is my idea
 * up front, calculate the maximum required K edits needed to match the users 
 supplied float threshold.
 * for at least small common E up to some max K (1,2,3, etc) we should create 
 a DFA for each E. 
 if the required E is above our supported max, we use dumb mode at first (no 
 seeking, no DFA, just brute force like now).
 As the pq fills, we swap progressively lower DFAs into the enum, based upon 
 the lowest score in the pq.
 This should work well on avg, at high E, you will typically fill the pq very 
 quickly since you will match many terms. 
 This not only provides a mechanism to switch to more efficient DFAs during 
 enumeration, but also to switch from dumb mode to smart mode.
 i modified my wildcard benchmark to generate random fuzzy queries.
 * Pattern: 7N stands for NNNNNNN, etc.
 * AvgMS_DFA: this is the time spent creating the automaton (constructor)
 ||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA||
 |7N|10|64.0|4155.9|38.6|20.3|
 |14N|10|0.0|2511.6|46.0|37.9|
 |28N|10|0.0|2506.3|93.0|86.6|
 |56N|10|0.0|2524.5|304.4|298.5|
 as you can see, this prototype is no good yet, because it creates the DFA in 
 a slow way. right now it creates an NFA, and all this wasted time is in 
 NFA-to-DFA conversion.
 So, for a very long string, it just gets worse and worse. This has nothing to 
 do with lucene, and here you can see the TermEnum is fast (AvgMS - 
 AvgMS_DFA); there is no problem there.
 instead we should just build a DFA to begin with, maybe with this paper: 
 http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652
 we can precompute the tables with that algorithm up to some reasonable K, and 
 then I think we are ok.
 the paper references using http://portal.acm.org/citation.cfm?id=135907 for 
 linear minimization, if someone wants to implement this they should not worry 
 about minimization.
 in fact, we need to at some point determine if AutomatonQuery should even 
 minimize FSM's at all, or if it is simply enough for them to be deterministic 
 with no transitions to dead states. (The only code that actually assumes 
 minimal DFA is the Dumb vs Smart heuristic and this can be rewritten as a 
 summation easily). we need to benchmark really complex DFAs (i.e. write a 
 regex benchmark) to figure out if minimization is even helping right now.
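The first bullet above, computing the maximum required K edits from the user's float threshold, can be sketched as follows. The length-scaled similarity convention is an assumption here, mirroring but not reproducing FuzzyQuery's actual scoring:

```java
public class FuzzyMaxEdits {
    // Largest edit distance K that can still satisfy the similarity
    // threshold for a query term of the given length, assuming
    // similarity = 1 - editDistance / termLength (a simplification).
    static int maxEdits(float minSimilarity, int termLength) {
        return (int) ((1.0f - minSimilarity) * termLength);
    }

    public static void main(String[] args) {
        System.out.println(maxEdits(0.5f, 7));  // 3
        System.out.println(maxEdits(0.75f, 8)); // 2
    }
}
```

A DFA would then be built per edit distance E up to this K, with lower-E DFAs swapped in as the priority queue fills.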




[jira] Updated: (LUCENE-2165) SnowballAnalyzer lacks a constructor that takes a Set of Stop Words

2009-12-15 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller updated LUCENE-2165:


Fix Version/s: 3.1

 SnowballAnalyzer lacks a constructor that takes a Set of Stop Words
 ---

 Key: LUCENE-2165
 URL: https://issues.apache.org/jira/browse/LUCENE-2165
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/analyzers
Affects Versions: 2.9.1, 3.0
Reporter: Nick Burch
Priority: Minor
 Fix For: 3.1


 As discussed on the java-user list, the SnowballAnalyzer has been updated to 
 use a Set of stop words. However, there is no constructor which accepts a 
 Set; there's only the original String[] one.
 This is an issue, because most of the common sources of stop words (eg 
 StopAnalyzer) have deprecated their String[] stop word lists, and moved over 
 to Sets (eg StopAnalyzer.ENGLISH_STOP_WORDS_SET). So, for now, you either 
 have to use a deprecated field on StopAnalyzer, or manually turn the Set into 
 an array so you can pass it to the SnowballAnalyzer
 I would suggest that a constructor is added to SnowballAnalyzer which accepts 
 a Set. Not sure if the old String[] one should be deprecated or not.
 A sample patch against 2.9.1 to add the constructor is:
 --- SnowballAnalyzer.java.orig  2009-12-15 11:14:08.0 +
 +++ SnowballAnalyzer.java   2009-12-14 12:58:37.0 +
 @@ -67,6 +67,12 @@
      stopSet = StopFilter.makeStopSet(stopWords);
    }
  
 +  /** Builds the named analyzer with the given stop words. */
 +  public SnowballAnalyzer(Version matchVersion, String name, Set stopWordsSet) {
 +    this(matchVersion, name);
 +    stopSet = stopWordsSet;
 +  }
 +
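Until such a constructor exists, callers have to copy the Set into an array first. The workaround looks like this in plain Java; the commented-out SnowballAnalyzer calls are illustrative only and are not compiled here:

```java
import java.util.Set;

public class StopWordSetWorkaround {
    public static void main(String[] args) {
        Set<String> stopWords = Set.of("the", "and", "of");
        // The current non-deprecated path: copy the Set into a String[]
        // before handing it to the String[] constructor.
        String[] asArray = stopWords.toArray(new String[0]);
        // new SnowballAnalyzer(matchVersion, "English", asArray);   // existing ctor
        // new SnowballAnalyzer(matchVersion, "English", stopWords); // proposed ctor
        System.out.println(asArray.length); // 3
    }
}
```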




[jira] Commented: (LUCENE-1769) Fix wrong clover analysis because of backwards-tests, upgrade clover to 2.4.3 or better

2009-12-15 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12791115#action_12791115
 ] 

Mark Miller commented on LUCENE-1769:
-

Would be cool to get this issue wrapped up ...

 Fix wrong clover analysis because of backwards-tests, upgrade clover to 2.4.3 
 or better
 ---

 Key: LUCENE-1769
 URL: https://issues.apache.org/jira/browse/LUCENE-1769
 Project: Lucene - Java
  Issue Type: Bug
  Components: Build
Affects Versions: 2.9
Reporter: Uwe Schindler
 Attachments: clover.license, LUCENE-1769.patch, LUCENE-1769.patch, 
 nicks-LUCENE-1769.patch


 This is a followup for 
 [http://www.lucidimagination.com/search/document/6248d6eafbe10ef4/build_failed_in_hudson_lucene_trunk_902]
 The problem with clover running on hudson is, that it does not instrument all 
 tests ran. The autodetection of clover 1.x is not able to find out which 
 files are the correct tests and only instruments the backwards test. Because 
 of this, the current coverage report is only from the backwards tests running 
 against the current Lucene JAR.
 You can see this, if you install clover and start the tests. During test-core 
 no clover data is added to the db, only when backwards-tests begin, new files 
 are created in the clover db folder.
 Clover 2.x supports a new ant task, testsources, that can be used to specify 
 which files are the tests. It works here locally with clover 2.4.3 and 
 produces a really nice coverage report, also linking with test files work, it 
 tells which tests failed and so on.
 I will attach a patch, that changes common-build.xml to the new clover 
 version (other initialization resource) and tells clover where to find the 
 tests (using the test folder include/exclude properties).
 One problem with the current patch: It does *not* instrument the backwards 
 branch, so you see only coverage of the core/contrib tests. Getting the 
 coverage also from the backwards tests is not easily possible, because of two 
 things:
 - the tag test dir is not easy to find out and add to testsources element 
 (there may be only one of them)
 - the test names in BW branch are identical to the trunk tests. This 
 completely corrupts the linkage between tests and code in the coverage report.
 In principle the best would be to generate a second coverage report for the 
 backwards branch with a separate clover DB. The attached patch does not 
 instrument the bw branch, it only does trunk tests.




[jira] Assigned: (LUCENE-2035) TokenSources.getTokenStream() does not assign positionIncrement

2009-12-15 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller reassigned LUCENE-2035:
---

Assignee: Mark Miller

 TokenSources.getTokenStream() does not assign positionIncrement
 ---

 Key: LUCENE-2035
 URL: https://issues.apache.org/jira/browse/LUCENE-2035
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/highlighter
Affects Versions: 2.4, 2.4.1, 2.9
Reporter: Christopher Morris
Assignee: Mark Miller
 Fix For: 3.1

 Attachments: LUCENE-2305.patch

   Original Estimate: 24h
  Remaining Estimate: 24h

 TokenSources.StoredTokenStream does not assign positionIncrement information. 
 This means that all tokens in the stream are considered adjacent. This has 
 implications for the phrase highlighting in QueryScorer when using 
 non-contiguous tokens.
 For example:
 Consider a token stream that creates tokens for both the stemmed and 
 unstemmed version of each word - "the fox (jump|jumped)".
 When retrieved from the index using TokenSources.getTokenStream(tpv,false), 
 the token stream will be - "the fox jump jumped".
 Now try a search and highlight for the phrase query "fox jumped". The search 
 will correctly find the document; the highlighter will fail to highlight the 
 phrase because it thinks that there is an additional word between "fox" and 
 "jumped". If we use the original (from the analyzer) token stream then the 
 highlighter works.
 Also, consider the converse - "the fox did not jump".
 "not" is a stop word and there is an option to increment the position to 
 account for stop words - (the,0) (fox,1) (did,2) (jump,4).
 When retrieved from the index using TokenSources.getTokenStream(tpv,false), 
 the token stream will be - (the,0) (fox,1) (did,2) (jump,3).
 So the phrase query "did jump" will cause the "did" and "jump" terms in the 
 text "did not jump" to be highlighted. If we use the original (from the 
 analyzer) token stream then the highlighter works correctly.




[jira] Updated: (LUCENE-2035) TokenSources.getTokenStream() does not assign positionIncrement

2009-12-15 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller updated LUCENE-2035:


Fix Version/s: 3.1

 TokenSources.getTokenStream() does not assign positionIncrement
 ---

 Key: LUCENE-2035
 URL: https://issues.apache.org/jira/browse/LUCENE-2035
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/highlighter
Affects Versions: 2.4, 2.4.1, 2.9
Reporter: Christopher Morris
Assignee: Mark Miller
 Fix For: 3.1

 Attachments: LUCENE-2305.patch

   Original Estimate: 24h
  Remaining Estimate: 24h

 TokenSources.StoredTokenStream does not assign positionIncrement information. 
 This means that all tokens in the stream are considered adjacent. This has 
 implications for the phrase highlighting in QueryScorer when using 
 non-contiguous tokens.
 For example:
 Consider a token stream that creates tokens for both the stemmed and 
 unstemmed version of each word - "the fox (jump|jumped)".
 When retrieved from the index using TokenSources.getTokenStream(tpv,false), 
 the token stream will be - "the fox jump jumped".
 Now try a search and highlight for the phrase query "fox jumped". The search 
 will correctly find the document; the highlighter will fail to highlight the 
 phrase because it thinks that there is an additional word between "fox" and 
 "jumped". If we use the original (from the analyzer) token stream then the 
 highlighter works.
 Also, consider the converse - "the fox did not jump".
 "not" is a stop word and there is an option to increment the position to 
 account for stop words - (the,0) (fox,1) (did,2) (jump,4).
 When retrieved from the index using TokenSources.getTokenStream(tpv,false), 
 the token stream will be - (the,0) (fox,1) (did,2) (jump,3).
 So the phrase query "did jump" will cause the "did" and "jump" terms in the 
 text "did not jump" to be highlighted. If we use the original (from the 
 analyzer) token stream then the highlighter works correctly.




[jira] Updated: (LUCENE-2035) TokenSources.getTokenStream() does not assign positionIncrement

2009-12-15 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller updated LUCENE-2035:


Attachment: LUCENE-2035.patch

 TokenSources.getTokenStream() does not assign positionIncrement
 ---

 Key: LUCENE-2035
 URL: https://issues.apache.org/jira/browse/LUCENE-2035
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/highlighter
Affects Versions: 2.4, 2.4.1, 2.9
Reporter: Christopher Morris
Assignee: Mark Miller
 Fix For: 3.1

 Attachments: LUCENE-2035.patch, LUCENE-2305.patch

   Original Estimate: 24h
  Remaining Estimate: 24h

 TokenSources.StoredTokenStream does not assign positionIncrement information. 
 This means that all tokens in the stream are considered adjacent. This has 
 implications for the phrase highlighting in QueryScorer when using 
 non-contiguous tokens.
 For example:
 Consider a token stream that creates tokens for both the stemmed and 
 unstemmed version of each word - "the fox (jump|jumped)".
 When retrieved from the index using TokenSources.getTokenStream(tpv,false), 
 the token stream will be - "the fox jump jumped".
 Now try a search and highlight for the phrase query "fox jumped". The search 
 will correctly find the document; the highlighter will fail to highlight the 
 phrase because it thinks that there is an additional word between "fox" and 
 "jumped". If we use the original (from the analyzer) token stream then the 
 highlighter works.
 Also, consider the converse - "the fox did not jump".
 "not" is a stop word and there is an option to increment the position to 
 account for stop words - (the,0) (fox,1) (did,2) (jump,4).
 When retrieved from the index using TokenSources.getTokenStream(tpv,false), 
 the token stream will be - (the,0) (fox,1) (did,2) (jump,3).
 So the phrase query "did jump" will cause the "did" and "jump" terms in the 
 text "did not jump" to be highlighted. If we use the original (from the 
 analyzer) token stream then the highlighter works correctly.




[jira] Commented: (LUCENE-2035) TokenSources.getTokenStream() does not assign positionIncrement

2009-12-15 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12791152#action_12791152
 ] 

Mark Miller commented on LUCENE-2035:
-

Thanks for the tests and fix, Christopher!

I've got one more patch coming and I'll commit in a few days.

I'm going to break the tests back out into a separate file again (on second 
thought I think how you had it is a good idea) and remove an author tag. Then 
after one more review I think this is good to go in.

 TokenSources.getTokenStream() does not assign positionIncrement
 ---

 Key: LUCENE-2035
 URL: https://issues.apache.org/jira/browse/LUCENE-2035
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/highlighter
Affects Versions: 2.4, 2.4.1, 2.9
Reporter: Christopher Morris
Assignee: Mark Miller
 Fix For: 3.1

 Attachments: LUCENE-2035.patch, LUCENE-2305.patch

   Original Estimate: 24h
  Remaining Estimate: 24h

 TokenSources.StoredTokenStream does not assign positionIncrement information. 
 This means that all tokens in the stream are considered adjacent. This has 
 implications for the phrase highlighting in QueryScorer when using 
 non-contiguous tokens.
 For example:
 Consider a token stream that creates tokens for both the stemmed and 
 unstemmed version of each word - "the fox (jump|jumped)".
 When retrieved from the index using TokenSources.getTokenStream(tpv,false), 
 the token stream will be - "the fox jump jumped".
 Now try a search and highlight for the phrase query "fox jumped". The search 
 will correctly find the document; the highlighter will fail to highlight the 
 phrase because it thinks that there is an additional word between "fox" and 
 "jumped". If we use the original (from the analyzer) token stream then the 
 highlighter works.
 Also, consider the converse - "the fox did not jump".
 "not" is a stop word and there is an option to increment the position to 
 account for stop words - (the,0) (fox,1) (did,2) (jump,4).
 When retrieved from the index using TokenSources.getTokenStream(tpv,false), 
 the token stream will be - (the,0) (fox,1) (did,2) (jump,3).
 So the phrase query "did jump" will cause the "did" and "jump" terms in the 
 text "did not jump" to be highlighted. If we use the original (from the 
 analyzer) token stream then the highlighter works correctly.




[jira] Commented: (LUCENE-406) sort missing string fields last

2009-12-15 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12791153#action_12791153
 ] 

Mark Miller commented on LUCENE-406:


We should update this and incorporate it into Lucene.

 sort missing string fields last
 ---

 Key: LUCENE-406
 URL: https://issues.apache.org/jira/browse/LUCENE-406
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Search
Affects Versions: 1.4
 Environment: Operating System: All
 Platform: All
Reporter: Yonik Seeley
Assignee: Hoss Man
Priority: Minor
 Attachments: MissingStringLastComparatorSource.java, 
 MissingStringLastComparatorSource.java, 
 TestMissingStringLastComparatorSource.java


 A SortComparatorSource for string fields that orders documents with the sort
 field missing after documents with the field.  This is the reverse of the
 default Lucene implementation.
 The concept and first-pass implementation was done by Chris Hostetter.
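In plain Java terms, the ordering this comparator source provides is "missing values last", which the standard library can sketch directly; the list below stands in for documents' sort-field values, with null marking a missing field:

```java
import java.util.*;

public class MissingStringLastDemo {
    public static void main(String[] args) {
        // Sort-field values for five documents; null marks a missing field.
        List<String> values = new ArrayList<>(
            Arrays.asList("beta", null, "alpha", null, "gamma"));
        // Lucene's default puts missing values first; this issue's comparator
        // reverses that, which plain Java expresses as:
        values.sort(Comparator.nullsLast(Comparator.naturalOrder()));
        System.out.println(values); // [alpha, beta, gamma, null, null]
    }
}
```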




[jira] Resolved: (LUCENE-1942) NUM_THREADS is a static member of RunAddIndexesThreads and should be accessed in a static way

2009-12-15 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller resolved LUCENE-1942.
-

Resolution: Won't Fix

 NUM_THREADS is a static member of RunAddIndexesThreads and should be accessed 
 in a static way
 -

 Key: LUCENE-1942
 URL: https://issues.apache.org/jira/browse/LUCENE-1942
 Project: Lucene - Java
  Issue Type: Bug
  Components: Other
 Environment: Eclipse 3.4.2
Reporter: Hasan Diwan
Priority: Trivial
 Attachments: lucene.pat


 The summary contains the problem. No further description needed, I don't 
 think.




[jira] Resolved: (LUCENE-628) Intermittent FileNotFoundException for .fnm when using rsync

2009-12-15 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller resolved LUCENE-628.


Resolution: Incomplete

 Intermittent FileNotFoundException for .fnm when using rsync
 

 Key: LUCENE-628
 URL: https://issues.apache.org/jira/browse/LUCENE-628
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 1.9
 Environment: Linux RedHat ES3, Jboss402
Reporter: Simon Lorenz
Priority: Minor

 We use Lucene 1.9.1 to create and search indexes for web applications. The 
 application runs in Jboss402 on Redhat ES3. A single Master (Writer) Jboss 
 instance creates and writes the indexes using the compound file format, which 
 is optimised after all updates. These index files are replicated every 
 few hours using rsync, to a number of other application servers (Searchers). 
 The rsync job only runs if there are no lucene lock files present on the 
 Writer. The Searcher servers that receive the replicated files, perform only 
 searches on the index. Up to 60 searches may be performed each minute. 
 Everything works well most of the time, but we get the following issue on the 
 Searcher servers about 10% of the time. 
 Following an rsync replication, one or all of the Searcher servers throws:
 IOException caught when creating an IndexSearcher
 java.io.FileNotFoundException: //_1zm.fnm (No such file or directory)
 at java.io.RandomAccessFile.open(Native Method)
 at java.io.RandomAccessFile.&lt;init&gt;(RandomAccessFile.java:212)
 at 
 org.apache.lucene.store.FSIndexInput$Descriptor.&lt;init&gt;(FSDirectory.java:425)
 at org.apache.lucene.store.FSIndexInput.&lt;init&gt;(FSDirectory.java:434)
 at org.apache.lucene.store.FSDirectory.openInput(FSDirectory.java:324)
 at org.apache.lucene.index.FieldInfos.&lt;init&gt;(FieldInfos.java:56)
 at 
 org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:144)
 at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:129)
 at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:110)
 at org.apache.lucene.index.IndexReader$1.doBody(IndexReader.java:154)
 at org.apache.lucene.store.Lock$With.run(Lock.java:109)
 at org.apache.lucene.index.IndexReader.open(IndexReader.java:143)  
 As we use the compound file format I would not expect .fnm files to be 
 present. When replicating, we do not delete the old .cfs index files as these 
 could still be referenced by old Searcher threads. We do overwrite the 
 segments and deletable files on the Searcher servers. 
 My thoughts are: either we are occasionally overwriting a file at the exact 
 time a new searcher is being created, or the lock files are removed from the 
 Writer server before the compaction process is completed, and we then 
 replicate a segments file that still references a ghost .fnm file.
 I would greatly appreciate any ideas and suggestions to solve this annoying 
 issue.




[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery

2009-12-14 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12790368#action_12790368
 ] 

Mark Miller commented on LUCENE-2089:
-

bq. If you do take hold of it, do not hesitate to share. The original paper and 
C++ code likewise melt my brain, and I needed the algo in some other place.

The Java impl I was onto was about 75% complete according to the author, but I 
have not yet looked at the code. Last I heard, though, Robert was convinced it 
was a different, less efficient algorithm.

We have cracked much of the paper - that's how Robert implemented n=1 here - 
that's from the paper. The next step is to work out how to construct the tables 
for general n, as Robert says above, and to store those tables efficiently, as 
they start getting quite large rather fast - though we might only use as high 
as n=3 or 4 in Lucene; Robert suspects term seeking will outweigh any gains at 
that point. I think we know how to do the majority of the work for the general 
n case, but I don't really have much (or any) time for this, so it probably 
depends on if/when Robert gets to it. If he loses interest in finishing, I 
definitely plan to come back to it someday. I'd like to complete my 
understanding of the paper and see a full general-n Java impl of this in either 
case. The main piece left that I don't fully understand (computing all possible 
states for n) can be computed with just a brute-force check (that's how the 
Python impl does it), so there may not be much more to understand. I would like 
to know how the paper is getting 'i'-parametrized state generators though - 
that's much more efficient. The paper shows them for n=1 and n=2.

 explore using automaton for fuzzyquery
 --

 Key: LUCENE-2089
 URL: https://issues.apache.org/jira/browse/LUCENE-2089
 Project: Lucene - Java
  Issue Type: Wish
  Components: Search
Reporter: Robert Muir
Assignee: Mark Miller
Priority: Minor
 Attachments: LUCENE-2089.patch, Moman-0.2.1.tar.gz, TestFuzzy.java


 Mark brought this up on LUCENE-1606 (I will assign this to him, I know he is 
 itching to write that nasty algorithm).
 We can optimize FuzzyQuery by using AutomatonTermsEnum; here is my idea:
 * Up front, calculate the maximum required K edits needed to match the user's 
 supplied float threshold.
 * For at least small common E up to some max K (1, 2, 3, etc.) we should 
 create a DFA for each E. 
 If the required E is above our supported max, we use dumb mode at first (no 
 seeking, no DFA, just brute force like now).
 As the pq fills, we swap progressively lower DFAs into the enum, based upon 
 the lowest score in the pq.
 This should work well on average: at high E, you will typically fill the pq 
 very quickly since you will match many terms. 
 This not only provides a mechanism to switch to more efficient DFAs during 
 enumeration, but also to switch from dumb mode to smart mode.
 I modified my wildcard benchmark to generate random fuzzy queries.
 * Pattern: 7N stands for NNN, etc.
 * AvgMS_DFA: this is the time spent creating the automaton (constructor)
 ||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA||
 |7N|10|64.0|4155.9|38.6|20.3|
 |14N|10|0.0|2511.6|46.0|37.9| 
 |28N|10|0.0|2506.3|93.0|86.6|
 |56N|10|0.0|2524.5|304.4|298.5|
 As you can see, this prototype is no good yet, because it creates the DFA in 
 a slow way: right now it creates an NFA, and all this wasted time is in the 
 NFA→DFA conversion.
 So, for a very long string, it just gets worse and worse. This has nothing to 
 do with Lucene, and here you can see the TermEnum is fast (AvgMS - 
 AvgMS_DFA); there is no problem there.
 Instead we should just build a DFA to begin with, maybe with this paper: 
 http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652
 We can precompute the tables with that algorithm up to some reasonable K, and 
 then I think we are ok.
 The paper references using http://portal.acm.org/citation.cfm?id=135907 for 
 linear minimization; if someone wants to implement this, they should not 
 worry about minimization.
 In fact, we need to at some point determine if AutomatonQuery should even 
 minimize FSMs at all, or if it is simply enough for them to be deterministic 
 with no transitions to dead states. (The only code that actually assumes a 
 minimal DFA is the Dumb vs. Smart heuristic, and this can be rewritten as a 
 summation easily.) We need to benchmark really complex DFAs (i.e. write a 
 regex benchmark) to figure out if minimization is even helping right now.
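For illustration, a minimal sketch of two ingredients the idea above relies on: brute-force "dumb mode" edit-distance matching, and converting a float similarity threshold into a maximum edit count K. Class and method names here are hypothetical, and the maxEdits formula is only in the spirit of FuzzyTermEnum's similarity scoring, not copied from it:

```java
public class FuzzySketch {

    // Classic dynamic-programming Levenshtein distance ("dumb mode"
    // compares every candidate term against the query term like this).
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Max edits K implied by a similarity threshold, assuming a scoring
    // scheme of roughly: similarity = 1 - distance / termLen.
    static int maxEdits(float minSimilarity, int termLen) {
        return (int) ((1.0f - minSimilarity) * termLen);
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("lucene", "lucne")); // one deletion
        System.out.println(maxEdits(0.5f, 6));              // K = 3
    }
}
```

A DFA-based enum would accept exactly the terms this brute-force check accepts for a given K, but without visiting every term.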


[jira] Commented: (LUCENE-2126) Split up IndexInput and IndexOutput into DataInput and DataOutput

2009-12-13 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12789901#action_12789901
 ] 

Mark Miller commented on LUCENE-2126:
-

 I disagree with you here: introducing DataInput/Output makes IMO the API
 actually easier for the normal user to understand.

 I agree with everything you say in the second paragraph, but I don't see how
 any of that supports the assertion you make in the first paragraph.

Presumably, because the normal user won't touch or see the IndexInput/Output 
classes, but may well deal with DataInput/Output - and those classes being 
limited to what actually makes sense for them (only exposing methods they 
should use) - that's easier for them.

I was leaning towards Marvin's arguments - it really seems that documentation 
should be enough to steer users away from doing something stupid - there is no 
doubt that writing attributes into the posting list is a fairly advanced 
operation (though more normal than using IndexInput/Output). On the other hand, 
I'm not really sold on the longer-term downsides either. The complexity 
argument is a bit overblown: if you understand anything down to the level of 
these classes, this is a ridiculously simple change. The backcompat argument is 
not very persuasive either - not only does it look like a slim chance of any 
future issues, but at this level we are fairly loose about back compat when 
something comes up. I think advanced users have already realized that the more 
you dig into Lucene's guts, the more likely you won't be able to count on jar 
drop-in. That's just the way things have gone. I don't see a looming concrete 
issue myself anyway, and if there is a hidden one, I don't think anyone is 
going to get in a ruffle about it.

So net/net, I'm +1. Seems worth it to me to be able to give a user of 2125 the 
correct API.

I could go either way on the name change. Not a fan of LuceneInput/Output 
though.

 Split up IndexInput and IndexOutput into DataInput and DataOutput
 -

 Key: LUCENE-2126
 URL: https://issues.apache.org/jira/browse/LUCENE-2126
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: Flex Branch
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: Flex Branch

 Attachments: lucene-2126.patch


 I'd like to introduce the two new classes DataInput and DataOutput
 that contain all methods from IndexInput and IndexOutput that actually
 decode or encode data, such as readByte()/writeByte(),
 readVInt()/writeVInt().
 Methods like getFilePointer(), seek(), close(), etc., which are not
 related to data encoding, but to files as input/output source stay in
 IndexInput/IndexOutput.
 This patch also changes ByteSliceReader/ByteSliceWriter to extend
 DataInput/DataOutput. Previously ByteSliceReader implemented the
 methods that stay in IndexInput by throwing RuntimeExceptions.
 See also LUCENE-2125.
 All tests pass.
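The split the description proposes - pure encode/decode methods in a "data" layer, with file concerns (seek, close, getFilePointer) left to IndexInput/IndexOutput - can be sketched roughly as below. The writeVInt wire format shown (7 payload bits per byte, high bit as a continuation flag) is the one Lucene documents, but the class bodies are our own illustration, not the patch's code:

```java
import java.io.ByteArrayOutputStream;

// The "data" layer: only knows how to encode values, not where they go.
abstract class DataOutputSketch {
    public abstract void writeByte(byte b);

    // Variable-length int: 7 payload bits per byte, high bit = "more follows".
    public final void writeVInt(int i) {
        while ((i & ~0x7F) != 0) {
            writeByte((byte) ((i & 0x7F) | 0x80));
            i >>>= 7;
        }
        writeByte((byte) i);
    }
}

// A concrete sink; a real IndexOutput would add seek/close/getFilePointer.
public class VIntDemo extends DataOutputSketch {
    private final ByteArrayOutputStream buf = new ByteArrayOutputStream();

    @Override
    public void writeByte(byte b) { buf.write(b); }

    public byte[] toBytes() { return buf.toByteArray(); }

    // Matching decoder, as DataInput's readVInt would do it.
    public static int readVInt(byte[] bytes) {
        int value = 0, shift = 0, pos = 0;
        byte b;
        do {
            b = bytes[pos++];
            value |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return value;
    }

    public static void main(String[] args) {
        VIntDemo out = new VIntDemo();
        out.writeVInt(300);                          // encodes as 0xAC 0x02
        System.out.println(out.toBytes().length);    // 2
        System.out.println(readVInt(out.toBytes())); // 300
    }
}
```

A ByteSliceWriter-style class can then extend the data layer directly instead of stubbing out file methods with RuntimeExceptions.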




[jira] Commented: (LUCENE-2133) [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField

2009-12-11 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12789384#action_12789384
 ] 

Mark Miller commented on LUCENE-2133:
-

bq. Something along these lines maybe?

And we are back to 831 :)

 [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField
 -

 Key: LUCENE-2133
 URL: https://issues.apache.org/jira/browse/LUCENE-2133
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.9.1, 3.0
Reporter: Christian Kohlschütter
 Attachments: LUCENE-2133-complete.patch, LUCENE-2133.patch, 
 LUCENE-2133.patch, LUCENE-2133.patch


 Hi all,
 up to the current version Lucene contains a conceptual flaw, that is the 
 FieldCache. The FieldCache is a singleton which is supposed to cache certain 
 information for every IndexReader that is currently open.
 The FieldCache is flawed because it is incorrect to assume that:
 1. one IndexReader instance equals one index. In fact, there can be many 
 clones (of SegmentReader) or decorators (FilterIndexReader) which all access 
 the very same data.
 2. the cache information remains valid for the lifetime of an IndexReader. In 
 fact, some IndexReaders may be reopen()'ed and thus they may contain 
 completely different information.
 3. all IndexReaders need the same type of cache. In fact, because of the 
 limitations imposed by the singleton construct there was no implementation 
 other than FieldCacheImpl.
 Furthermore, FieldCacheImpl and FieldComparator are bloated by several static 
 inner-classes that could be moved to package level.
 There have been a few attempts to improve FieldCache, namely LUCENE-831, 
 LUCENE-1579 and LUCENE-1749, but the overall situation remains the same: 
 There is a central registry for assigning Caches to IndexReader instances.
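A hypothetical sketch of the central-registry pattern being criticized - one static map keyed on reader identity - and of why flaw #1 bites: a clone of a reader misses the cache even though it reads the very same data. Names are ours, not FieldCacheImpl's:

```java
import java.util.Map;
import java.util.WeakHashMap;
import java.util.function.Function;

public class RegistrySketch {
    // One global cache for every open reader, keyed on object identity.
    private static final Map<Object, int[]> CACHE = new WeakHashMap<>();

    // Returns the cached array for this reader, computing it on a miss.
    public static synchronized int[] getInts(Object reader,
                                             Function<Object, int[]> loader) {
        return CACHE.computeIfAbsent(reader, loader);
    }

    public static void main(String[] args) {
        Object reader = new Object();
        Object clone = new Object(); // stands in for a clone sharing the data
        int[] a1 = getInts(reader, r -> new int[]{1, 2, 3});
        int[] a2 = getInts(reader, r -> new int[]{9, 9, 9}); // hit: loader ignored
        int[] b  = getInts(clone,  r -> new int[]{1, 2, 3}); // miss: duplicate work
        System.out.println(a1 == a2); // true
        System.out.println(a1 == b);  // false - same data cached twice
    }
}
```

An IndexCache held by the reader itself, as proposed below, removes the global map entirely.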
 I now propose the following:
 1. Obsolete FieldCache and FieldCacheKey and provide index-specific, 
 extensible cache instances (IndexCache). IndexCaches provide common caching 
 functionality for all IndexReaders and may be extended (for example, 
 SegmentReader would have a SegmentReaderIndexCache and store different data 
 than a regular IndexCache)
 2. Add the index-specific field cache (IndexFieldCache) to the IndexCache. 
 IndexFieldCache is an interface just like FieldCache and may support 
 different implementations.
 3. The IndexCache instances may be flushed/closed by the associated 
 IndexReaders whenever necessary.
 4. Obsolete FieldCacheSanityChecker because no more insanities are expected 
 (or at least, they do not impact the overall performance)
 5. Refactor FieldCacheImpl and the related classes (FieldComparator, 
 SortField) 
 I have provided a patch which takes care of all these issues. It passes all 
 JUnit tests.
 The patch is quite large, admittedly, but the change required several 
 modifications and some more to preserve backwards-compatibility. 
 Backwards-compatibility is preserved by moving some of the updated 
 functionality in the package org.apache.lucene.search.fields (field 
 comparators and parsers, SortField) while adding wrapper instances and 
 keeping old code in org.apache.lucene.search.
 In detail and besides the above mentioned improvements, the following is 
 provided:
 1. An IndexCache specific for SegmentReaders. The two ThreadLocals are moved 
 from SegmentReader to SegmentReaderIndexCache.
 2. A housekeeping improvement to CloseableThreadLocal. Now delegates the 
 close() method to all registered instances by calling an onClose() method 
 with the threads' instances.
 3. Analyzer.close now may throw an IOException (this already is covered by 
 java.io.Closeable).
 4. A change to Collector: allow IndexCache instead of IndexReader to be 
 passed to setNextReader()
 5. SortField's numeric types have been replaced by direct assignments of 
 FieldComparatorSource. This removes the switch statements and the 
 possibility to throw IllegalArgumentExceptions because of unsupported type 
 values.
 The following classes have been deprecated and replaced by new classes in 
 org.apache.lucene.search.fields:
 - FieldCacheRangeFilter (→ IndexFieldCacheRangeFilter)
 - FieldCacheTermsFilter (→ IndexFieldCacheTermsFilter)
 - FieldCache (→ IndexFieldCache)
 - FieldCacheImpl (→ IndexFieldCacheImpl)
 - all classes in FieldCacheImpl (→ several package-level classes)
 - all subclasses of FieldComparator (→ several package-level classes)
 Final notes:
 - The patch would be simpler if no backwards compatibility was necessary. The 
 Lucene community has to decide which classes/methods can immediately be 
 removed, which ones later, which not at all. Whenever new classes depend on 
 the old ones, an appropriate notice exists in the javadocs.
 - The patch introduces a new, 

[jira] Commented: (LUCENE-1377) Add HTMLStripReader and WordDelimiterFilter from SOLR

2009-12-10 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12788762#action_12788762
 ] 

Mark Miller commented on LUCENE-1377:
-

bq.  with the exception of a few core committers.

I think the exception is the other way around, especially considering Lucene 
contrib. Let's look at the Solr list (and consider that some are not very 
active in Solr currently):
||name||status||
|Bill Au| |
|Doug Cutting|Lucene Core Committer|
|Otis Gospodnetić|Lucene Core Committer|
|Erik Hatcher| Lucene Core Committer|
|Chris Hostetter |Lucene Core Committer|
|Grant Ingersoll | Lucene Core Committer|
|Mike Klaas| |
|Shalin Shekhar Mangar| |
|Ryan McKinley| Lucene Contrib Committer|
|Mark Miller |Lucene Core Committer|
|Noble Paul| |
|Yonik Seeley| Lucene Core Committer|
|Koji Sekiguchi|Lucene Contrib Committer|


 Add HTMLStripReader and WordDelimiterFilter from SOLR
 -

 Key: LUCENE-1377
 URL: https://issues.apache.org/jira/browse/LUCENE-1377
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Affects Versions: 2.3.2
Reporter: Jason Rutherglen
Priority: Minor
   Original Estimate: 24h
  Remaining Estimate: 24h

 SOLR has two classes HTMLStripReader and WordDelimiterFilter which are very 
 useful for a wide variety of use cases.  It would be good to place them into 
 core Lucene.




[jira] Commented: (LUCENE-2133) [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField

2009-12-10 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12788874#action_12788874
 ] 

Mark Miller commented on LUCENE-2133:
-

I don't know that back compat is really a concern if we are just leaving the 
old API intact as part of that, with its own caching mechanism?

Just deprecate the old API and make a new one. This is a big pain, because you 
have to be sure you don't straddle the two APIs on upgrading, but that's the 
boat we will be in anyway.

Which means a new impl should provide enough benefits to make that large pain 
worth enduring. 831 was not committed for the same reason - it didn't bring 
enough to the table to be worth it after we got to a per-segment cache in 
another way. Since I don't see that this provides anything over 831, I don't 
see how it's not in the same boat.

I'm not sure we should target a specific release with this - we don't even know 
when 3.1 is going to happen. 2.9 took a year. It's anybody's guess - we should 
probably just do what makes sense and commit it when it's ready.

 [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField
 -

 Key: LUCENE-2133
 URL: https://issues.apache.org/jira/browse/LUCENE-2133
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.9.1, 3.0
Reporter: Christian Kohlschütter
 Attachments: LUCENE-2133-complete.patch, LUCENE-2133.patch, 
 LUCENE-2133.patch, LUCENE-2133.patch



[jira] Commented: (LUCENE-2133) [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField

2009-12-10 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12788908#action_12788908
 ] 

Mark Miller commented on LUCENE-2133:
-

bq. LUCENE-831 still requires a static FieldCache, the root of all evil  :) 

It doesn't require one though? It supports a cache per segment reader just like 
this - except it's called a ValueSource.

The CacheByReaderValueSource is just there to handle a back-compat issue - it's 
something we would want to get around eventually, using the reader ValueSource 
instead - but that patch still had a long way to go.

Overall, from what I can see, the approach was about the same.

bq. It probably makes sense to start from one of Hoss's original patches or 
even from scratch

That was said before a lot more work was done. The API was actually starting to 
shape up nicely.

bq. The more complex the patches are, the longer it will take to integrate them 
into a new version.

Of course - and this is a complex issue with a lot of upgrade pain. Like with 
831, it's not really worth the pain to users without more benefits.

bq. The more such patches you have, the longer it will take to get to a new 
release.

That's not really true. 3.1 doesn't need this patch - there would be no reason 
to hold it for it. Patches go in when they are ready.

bq. Let's make it simple, submit what we have and build upon that.

I don't think that's simple :) The patch can be iterated on outside of trunk as 
easily as in.



 [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField
 -

 Key: LUCENE-2133
 URL: https://issues.apache.org/jira/browse/LUCENE-2133
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.9.1, 3.0
Reporter: Christian Kohlschütter
 Attachments: LUCENE-2133-complete.patch, LUCENE-2133.patch, 
 LUCENE-2133.patch, LUCENE-2133.patch



[jira] Commented: (LUCENE-2133) [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField

2009-12-10 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12788951#action_12788951
 ] 

Mark Miller commented on LUCENE-2133:
-

bq. That is, it adds a lot of duplicated code / different possible 
implementations for the same thing.

Things that were still ugly were not likely to stick around - 831 was very much 
a work in progress. The solution there to handle back-compat issues was a 
working solution that would need to be improved upon. 831 was still in an 
experimentation state - issues that needed more thought had hacked-in working 
solutions. We had a more general cache at one point, and began working towards 
ValueSources based on discussion. The latest 831 patch is an exploration of 
that, not a final product.

 bq. They should store arbitrary data, allow cache inspection, eviction of 
entries and so on.

That's extremely simple to add to an IndexReader - we were thinking of a 
ValueSource as something different from a basic cache.

{quote}
It is indeed a complex problem but it can easily be split into several subtasks 
that can be addressed by different people in parallel. To allow such a 
development, we have to somehow get the base code it into SVN, not necessarily 
trunk, admittedly, a branch would also do. Of course, this requires also 
additional work to keep it in sync with trunk. If we can really assume to have 
3.1 in one year, we have lots of time for developing a stable, powerful new API 
directly in trunk. Of course, this is a decision related to release management 
and not to the actual problem. I can live with both ways (trunk vs. branch), 
but, in my opinion, managing the changes just as patch files in jira is not a 
viable option.
{quote}

A branch is certainly a possibility, but with only one person working on it, I 
think it's overkill. With some additional interest, a branch can make sense - 
otherwise it's not worth the merging headaches. You also have to have 
committer(s) willing to take on the merging.

At one point, 831 was much more like this patch. Discussion along the lines of 
what Mike brought up above started transforming it into something else. We 
essentially decided that unless that much was brought to the table, the 
disrupting change just wasn't worth it for a different cache API.

I'm definitely a proponent of FieldCache reform - but I think we want to fully 
flesh it out before committing to something in trunk.

 [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField
 -

 Key: LUCENE-2133
 URL: https://issues.apache.org/jira/browse/LUCENE-2133
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.9.1, 3.0
Reporter: Christian Kohlschütter
 Attachments: LUCENE-2133-complete.patch, LUCENE-2133.patch, 
 LUCENE-2133.patch, LUCENE-2133.patch



[jira] Issue Comment Edited: (LUCENE-2133) [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField

2009-12-10 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12788951#action_12788951
 ] 

Mark Miller edited comment on LUCENE-2133 at 12/10/09 9:48 PM:
---

bq. That is, it adds a lot of duplicated code / different possible 
implementations for the same thing.

Things that were still ugly were not likely to stick around - 831 was very much 
a work in progress. The solution there for handling back-compat issues was a 
working solution that would need to be improved upon. 831 was still in an 
experimentation state - issues that needed more thought had hacked-in working 
solutions. We had a more general cache at one point, and began working towards 
ValueSources based on discussion. The latest 831 patch is an exploration of 
that, not a final product.

 bq. They should store arbitrary data, allow cache inspection, eviction of 
entries and so on.

That's extremely simple to add to an IndexReader - we were thinking of a 
ValueSource as something different than a basic cache.

{quote}
It is indeed a complex problem but it can easily be split into several subtasks 
that can be addressed by different people in parallel. To allow such 
development, we have to somehow get the base code into SVN - not necessarily 
trunk, admittedly; a branch would also do. Of course, this also requires 
additional work to keep it in sync with trunk. If we can really assume to have 
3.1 in one year, we have lots of time for developing a stable, powerful new API 
directly in trunk. Of course, this is a decision related to release management 
and not to the actual problem. I can live with both ways (trunk vs. branch), 
but, in my opinion, managing the changes just as patch files in JIRA is not a 
viable option.
{quote}

A branch is certainly a possibility, but with only one person working on it, I 
think it's overkill. With some additional interest, a branch can make sense - 
otherwise it's not worth the merging headaches. You also have to have a 
committer (or committers) that's willing to take on the merging.

At one point, 831 was much more like this patch. Discussion along what Mike 
brought up above started transforming it to something else. We essentially 
decided that unless that much was brought to the table, the disrupting change 
just wasn't worth it for a different cache API.

I'm definitely a proponent of FieldCache reform - but I think we want to fully 
flesh it out before committing to something in trunk.

 [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField
 -


[jira] Commented: (LUCENE-2018) Reconsider boolean max clause exception

2009-12-08 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12787658#action_12787658
 ] 

Mark Miller commented on LUCENE-2018:
-

I still think this should be removed - or moved to the MTQ query itself - then 
a setting on the query parser could set it, or a user could set it. It 
shouldn't be a sys property, and I don't necessarily think it should be on by 
default either.
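The "static hell" problem mentioned in the issue can be sketched with a hedged, self-contained example: the limit is JVM-wide state, so two independent components (say, two Solr cores) overwrite each other. `BooleanQueryStub` is an illustrative stand-in for Lucene's static `BooleanQuery.setMaxClauseCount`, not real Lucene code:

```java
// Stand-in for BooleanQuery's static max clause limit: one value shared by
// the whole JVM, regardless of how many independent consumers there are.
class BooleanQueryStub {
    private static int maxClauseCount = 1024;  // JVM-wide, like Lucene's default

    static void setMaxClauseCount(int n) { maxClauseCount = n; }
    static int getMaxClauseCount() { return maxClauseCount; }
}

public class StaticLimitDemo {
    public static void main(String[] args) {
        BooleanQueryStub.setMaxClauseCount(2048);  // "core A" configures 2048
        BooleanQueryStub.setMaxClauseCount(512);   // "core B" configures 512

        // The last writer wins for everyone - core A's setting is gone.
        System.out.println(BooleanQueryStub.getMaxClauseCount()); // prints "512"
    }
}
```

Moving the limit onto the query instance (or onto the query parser), as suggested above, would make it per-query state instead of per-JVM state.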

 Reconsider boolean max clause exception
 ---

 Key: LUCENE-2018
 URL: https://issues.apache.org/jira/browse/LUCENE-2018
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Mark Miller
 Fix For: 3.1


 Now that we have smarter multi-term queries, I think it's time to reconsider 
 the boolean max clause setting. It made more sense before, because you could 
 hit it unawares when the multi-term queries got huge - now it's more likely 
 that if it happens, it's because a user built the boolean themselves. And 
 obviously, thousands more boolean clauses mean slower performance and more 
 resources needed. We don't throw an exception when you try to use a ton of 
 resources in a thousand other ways.
 The current setting also suffers from the static-hell argument - especially 
 when you consider something like Solr's multicore feature: you can have 
 different settings for this in different cores, and the last one is going to 
 win. It's ugly. Yes, that could be addressed better in Solr as well - but I 
 still think it should be less ugly in Lucene too.
 I'd like to consider either doing away with it, or at least raising it by 
 quite a bit. Or an alternative, better solution. Right now, it ain't so great.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2133) [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField

2009-12-08 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12787711#action_12787711
 ] 

Mark Miller commented on LUCENE-2133:
-

There are a bunch of unrelated changes (imports/names/exception thrown) that 
should be pulled from this patch.

 [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField
 -

 Key: LUCENE-2133
 URL: https://issues.apache.org/jira/browse/LUCENE-2133
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.9.1, 3.0
Reporter: Christian Kohlschütter
 Attachments: LUCENE-2133-complete.patch, LUCENE-2133.patch, 
 LUCENE-2133.patch


 Hi all,
 up to the current version Lucene contains a conceptual flaw: the FieldCache. 
 The FieldCache is a singleton which is supposed to cache certain information 
 for every IndexReader that is currently open.
 The FieldCache is flawed because it is incorrect to assume that:
 1. one IndexReader instance equals one index. In fact, there can be many 
 clones (of SegmentReader) or decorators (FilterIndexReader) which all access 
 the very same data.
 2. the cache information remains valid for the lifetime of an IndexReader. In 
 fact, some IndexReaders may be reopen()'ed and thus they may contain 
 completely different information.
 3. all IndexReaders need the same type of cache. In fact, because of the 
 limitations imposed by the singleton construct there was no implementation 
 other than FieldCacheImpl.
 Furthermore, FieldCacheImpl and FieldComparator are bloated by several static 
 inner-classes that could be moved to package level.
 There have been a few attempts to improve FieldCache, namely LUCENE-831, 
 LUCENE-1579 and LUCENE-1749, but the overall situation remains the same: 
 There is a central registry for assigning Caches to IndexReader instances.
 I now propose the following:
 1. Obsolete FieldCache and FieldCacheKey and provide index-specific, 
 extensible cache instances (IndexCache). IndexCaches provide common caching 
 functionality for all IndexReaders and may be extended (for example, 
 SegmentReader would have a SegmentReaderIndexCache and store different data 
 than a regular IndexCache)
 2. Add the index-specific field cache (IndexFieldCache) to the IndexCache. 
 IndexFieldCache is an interface just like FieldCache and may support 
 different implementations.
 3. The IndexCache instances may be flushed/closed by the associated 
 IndexReaders whenever necessary.
 4. Obsolete FieldCacheSanityChecker because no more insanities are expected 
 (or at least, they do not impact the overall performance)
 5. Refactor FieldCacheImpl and the related classes (FieldComparator, 
 SortField) 
 I have provided a patch which takes care of all these issues. It passes all 
 JUnit tests.
 The patch is quite large, admittedly, but the change required several 
 modifications and some more to preserve backwards-compatibility. 
 Backwards-compatibility is preserved by moving some of the updated 
 functionality into the package org.apache.lucene.search.fields (field 
 comparators and parsers, SortField) while adding wrapper instances and 
 keeping old code in org.apache.lucene.search.
 In detail and besides the above mentioned improvements, the following is 
 provided:
 1. An IndexCache specific for SegmentReaders. The two ThreadLocals are moved 
 from SegmentReader to SegmentReaderIndexCache.
 2. A housekeeping improvement to CloseableThreadLocal. Now delegates the 
 close() method to all registered instances by calling an onClose() method 
 with the threads' instances.
 3. Analyzer.close now may throw an IOException (this already is covered by 
 java.io.Closeable).
 4. A change to Collector: allow IndexCache instead of IndexReader being 
 passed to setNextReader()
 5. SortField's numeric types have been replaced by direct assignments of 
 FieldComparatorSource. This removes the switch statements and the 
 possibility to throw IllegalArgumentExceptions because of unsupported type 
 values.
 The following classes have been deprecated and replaced by new classes in 
 org.apache.lucene.search.fields:
 - FieldCacheRangeFilter (= IndexFieldCacheRangeFilter)
 - FieldCacheTermsFilter (= IndexFieldCacheTermsFilter)
 - FieldCache (= IndexFieldCache)
 - FieldCacheImpl (= IndexFieldCacheImpl)
 - all classes in FieldCacheImpl (= several package-level classes)
 - all subclasses of FieldComparator (= several package-level classes)
 Final notes:
 - The patch would be simpler if no backwards compatibility was necessary. The 
 Lucene community has to decide which classes/methods can immediately be 
 removed, which ones later, which not at all. Whenever new classes depend on 
 the old ones, an appropriate notice exists in the javadocs.
 - The 
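Item 5 of the description replaces SortField's numeric type constants with direct assignments of a comparator source. A minimal sketch of that design, using hypothetical stand-in classes (not the patch's actual API):

```java
// Illustrative sketch of proposal item 5: the sort field carries its
// comparator source directly instead of an int type constant dispatched
// through a switch statement.
interface ComparatorSource {
    String describe();
}

class IntComparatorSource implements ComparatorSource {
    public String describe() { return "int comparator"; }
}

class SortFieldSketch {
    private final String field;
    private final ComparatorSource source;

    // No type constant, no switch, and no IllegalArgumentException for an
    // unsupported type value - the comparator source is supplied directly.
    SortFieldSketch(String field, ComparatorSource source) {
        this.field = field;
        this.source = source;
    }

    String getField() { return field; }
    ComparatorSource getComparatorSource() { return source; }
}

public class SortFieldDemo {
    public static void main(String[] args) {
        SortFieldSketch sf = new SortFieldSketch("price", new IntComparatorSource());
        System.out.println(sf.getComparatorSource().describe()); // prints "int comparator"
    }
}
```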

[jira] Commented: (LUCENE-2133) [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField

2009-12-08 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12787715#action_12787715
 ] 

Mark Miller commented on LUCENE-2133:
-

Hmm ... never mind. The exception is related and most of the imports are 
correct - brain spin.

Didn't see that 

import org.apache.lucene.search.SortField; // for javadocs 

wasn't being used anymore anyway.

import org.apache.lucene.search.fields.IndexFieldCache in NumericQuery should 
get a // for javadocs comment so someone doesn't accidentally remove it.

And I guess the t to threadLocal rename doesn't hurt with the amount you're 
changing anyway. It's a better name.

This looks pretty nice overall.

 [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField
 -

 Key: LUCENE-2133
 URL: https://issues.apache.org/jira/browse/LUCENE-2133
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.9.1, 3.0
Reporter: Christian Kohlschütter
 Attachments: LUCENE-2133-complete.patch, LUCENE-2133.patch, 
 LUCENE-2133.patch



[jira] Commented: (LUCENE-2133) [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField

2009-12-08 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12787729#action_12787729
 ] 

Mark Miller commented on LUCENE-2133:
-

A couple more quick notes:

I know the FieldComparator class is ugly, but I'm not sure we should pull the 
rug by putting the impls in a new package. On the other hand, it's not likely 
to affect many and it was experimental - so it's a tough call. There are a lot 
of classes in there ;)

I'm also not sure if fields is the right package name. And do the Filters 
belong in that package?

Also, almost a non-issue, but extending a deprecated class is going to be an 
ultra-minor back-compat break when it's removed. Not likely a problem though. 
But we might put a note to that effect to be clear. It is almost 
self-documenting anyway though :)

Rather than changing the tests to the new classes, we should probably copy 
them and make new ones - then remove them when the deprecations are removed.

Also, you should pull the author tag(s) - all credit is through JIRA and 
CHANGES. (I only see it once, so I bet that's Eclipse?)

I haven't done a thorough review of it all, but this is pretty great stuff to 
appear so complete and out of nowhere :)

 [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField
 -

 Key: LUCENE-2133
 URL: https://issues.apache.org/jira/browse/LUCENE-2133
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.9.1, 3.0
Reporter: Christian Kohlschütter
 Attachments: LUCENE-2133-complete.patch, LUCENE-2133.patch, 
 LUCENE-2133.patch



[jira] Commented: (LUCENE-2133) [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField

2009-12-08 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12787734#action_12787734
 ] 

Mark Miller commented on LUCENE-2133:
-

It looks like FieldCacheTermsFilterDocIdSet is using the wrong StringIndex? And 
I think the FieldCache import in that class can be removed.

 [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField
 -

 Key: LUCENE-2133
 URL: https://issues.apache.org/jira/browse/LUCENE-2133
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.9.1, 3.0
Reporter: Christian Kohlschütter
 Attachments: LUCENE-2133-complete.patch, LUCENE-2133.patch, 
 LUCENE-2133.patch


 Hi all,
 up to the current version Lucene contains a conceptual flaw, that is the 
 FieldCache. The FieldCache is a singleton which is supposed to cache certain 
 information for every IndexReader that is currently open
 The FieldCache is flawed because it is incorrect to assume that:
 1. one IndexReader instance equals one index. In fact, there can be many 
 clones (of SegmentReader) or decorators (FilterIndexReader) which all access 
 the very same data.
 2. the cache information remains valid for the lifetime of an IndexReader. In 
 fact, some IndexReaders may be reopen()'ed and thus they may contain 
 completely different information.
 3. all IndexReaders need the same type of cache. In fact, because of the 
 limitations imposed by the singleton construct there was no implementation 
 other than FieldCacheImpl.
 Furthermore, FieldCacheImpl and FieldComparator are bloated by several static 
 inner-classes that could be moved to package level.
 There have been a few attempts to improve FieldCache, namely LUCENE-831, 
 LUCENE-1579 and LUCENE-1749, but the overall situation remains the same: 
 There is a central registry for assigning Caches to IndexReader instances.
 I now propose the following:
 1. Obsolete FieldCache and FieldCacheKey and provide index-specific, 
 extensible cache instances (IndexCache). IndexCaches provide common caching 
 functionality for all IndexReaders and may be extended (for example, 
 SegmentReader would have a SegmentReaderIndexCache and store different data 
 than a regular IndexCache)
 2. Add the index-specific field cache (IndexFieldCache) to the IndexCache. 
 IndexFieldCache is an interface just like FieldCache and may support 
 different implementations.
 3. The IndexCache instances may be flushed/closed by the associated 
 IndexReaders whenever necessary.
 4. Obsolete FieldCacheSanityChecker because no more insanities are expected 
 (or at least, they do not impact the overall performance)
 5. Refactor FieldCacheImpl and the related classes (FieldComparator, 
 SortField) 
 I have provided an patch which takes care of all these issues. It passes all 
 JUnit tests.
 The patch is quite large, admittedly, but the change required several 
 modifications and some more to preserve backwards-compatibility. 
 Backwards-compatibility is preserved by moving some of the updated 
 functionality in the package org.apache.lucene.search.fields (field 
 comparators and parsers, SortField) while adding wrapper instances and 
 keeping old code in org.apache.lucene.search.
 In detail and besides the above mentioned improvements, the following is 
 provided:
 1. An IndexCache specific for SegmentReaders. The two ThreadLocals are moved 
 from SegmentReader to SegmentReaderIndexCache.
 2. A housekeeping improvement to CloseableThreadLocal. Now delegates the 
 close() method to all registered instances by calling an onClose() method 
 with the threads' instances.
 3. Analyzer.close now may throw an IOException (this already is covered by 
 java.io.Closeable).
 4. A change to Collector: allow IndexCache instead of IndexReader being 
 passed to setNextReader()
 5. SortField's numeric types have been replaced by direct assignments of 
 FieldComparatorSource. This removes the switch statements and the 
 possibility to throw IllegalArgumentExceptions because of unsupported type 
 values.
 The following classes have been deprecated and replaced by new classes in 
 org.apache.lucene.search.fields:
 - FieldCacheRangeFilter (= IndexFieldCacheRangeFilter)
 - FieldCacheTermsFilter (= IndexFieldCacheTermsFilter)
 - FieldCache (= IndexFieldCache)
 - FieldCacheImpl (= IndexFieldCacheImpl)
 - all classes in FieldCacheImpl (= several package-level classes)
 - all subclasses of FieldComparator (= several package-level classes)
 Final notes:
 - The patch would be simpler if no backwards compatibility was necessary. The 
 Lucene community has to decide which classes/methods can immediately be 
 removed, which ones later, which not at all. Whenever new classes depend on 
 the old ones, an appropriate notice 

[jira] Issue Comment Edited: (LUCENE-2133) [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField

2009-12-08 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12787734#action_12787734
 ] 

Mark Miller edited comment on LUCENE-2133 at 12/8/09 8:42 PM:
--

It looks like FieldCacheTermsFilterDocIdSet is using the wrong StringIndex? And 
I think the FieldCache import in that class can be removed (same with 
IndexFieldCacheRangeFilter).

 [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField
 -

 Key: LUCENE-2133
 URL: https://issues.apache.org/jira/browse/LUCENE-2133
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.9.1, 3.0
Reporter: Christian Kohlschütter
 Attachments: LUCENE-2133-complete.patch, LUCENE-2133.patch, 
 LUCENE-2133.patch



[jira] Commented: (LUCENE-2133) [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField

2009-12-08 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12787748#action_12787748
 ] 

Mark Miller commented on LUCENE-2133:
-

bq.  I think it does not hurt either.

I didn't notice that you actually just deprecated the originals - I guess 
that's not a complete rug pull ...

By the way, I don't think you need to deprecate something in a new class 
(IndexFieldCacheImpl):

{code}

  /**
   * @deprecated Use {@link #clear()} instead.
   */
  public void purgeAllCaches() {
    init();
  }
{code}
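Since IndexFieldCacheImpl is a brand-new class, it can simply expose the replacement method directly instead of shipping with an already-deprecated one. A minimal, self-contained sketch (the class body and members here are illustrative, not the patch's actual code):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the suggestion above: a new class has no existing
// callers, so it can expose clear() directly rather than carrying a
// deprecated purgeAllCaches() alias from day one.
class IndexFieldCacheImpl {
    private final Map<String, Object> caches = new HashMap<>();

    /** Drops all cached entries. */
    public void clear() {
        caches.clear();
    }

    public void put(String key, Object value) { caches.put(key, value); }

    public int size() { return caches.size(); }
}
```

Deprecation markers would then only appear on the old classes left behind in org.apache.lucene.search, where existing callers actually live.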

 [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField
 -

 Key: LUCENE-2133
 URL: https://issues.apache.org/jira/browse/LUCENE-2133
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.9.1, 3.0
Reporter: Christian Kohlschütter
 Attachments: LUCENE-2133-complete.patch, LUCENE-2133.patch, 
 LUCENE-2133.patch


 Hi all,
 up to the current version, Lucene contains a conceptual flaw: the FieldCache. 
 The FieldCache is a singleton which is supposed to cache certain information 
 for every IndexReader that is currently open.
 The FieldCache is flawed because it is incorrect to assume that:
 1. one IndexReader instance equals one index. In fact, there can be many 
 clones (of SegmentReader) or decorators (FilterIndexReader) which all access 
 the very same data.
 2. the cache information remains valid for the lifetime of an IndexReader. In 
 fact, some IndexReaders may be reopen()'ed and thus they may contain 
 completely different information.
 3. all IndexReaders need the same type of cache. In fact, because of the 
 limitations imposed by the singleton construct there was no implementation 
 other than FieldCacheImpl.
 Furthermore, FieldCacheImpl and FieldComparator are bloated by several static 
 inner-classes that could be moved to package level.
 There have been a few attempts to improve FieldCache, namely LUCENE-831, 
 LUCENE-1579 and LUCENE-1749, but the overall situation remains the same: 
 There is a central registry for assigning Caches to IndexReader instances.
 I now propose the following:
 1. Obsolete FieldCache and FieldCacheKey and provide index-specific, 
 extensible cache instances (IndexCache). IndexCaches provide common caching 
 functionality for all IndexReaders and may be extended (for example, 
 SegmentReader would have a SegmentReaderIndexCache and store different data 
 than a regular IndexCache)
 2. Add the index-specific field cache (IndexFieldCache) to the IndexCache. 
 IndexFieldCache is an interface just like FieldCache and may support 
 different implementations.
 3. The IndexCache instances may be flushed/closed by the associated 
 IndexReaders whenever necessary.
 4. Obsolete FieldCacheSanityChecker because no more insanities are expected 
 (or at least, they do not impact the overall performance)
 5. Refactor FieldCacheImpl and the related classes (FieldComparator, 
 SortField) 
 I have provided a patch which takes care of all these issues. It passes all 
 JUnit tests.
 The patch is quite large, admittedly, but the change required several 
 modifications and some more to preserve backwards-compatibility. 
 Backwards-compatibility is preserved by moving some of the updated 
 functionality in the package org.apache.lucene.search.fields (field 
 comparators and parsers, SortField) while adding wrapper instances and 
 keeping old code in org.apache.lucene.search.
 In detail and besides the above mentioned improvements, the following is 
 provided:
 1. An IndexCache specific for SegmentReaders. The two ThreadLocals are moved 
 from SegmentReader to SegmentReaderIndexCache.
 2. A housekeeping improvement to CloseableThreadLocal. Now delegates the 
 close() method to all registered instances by calling an onClose() method 
 with the threads' instances.
 3. Analyzer.close now may throw an IOException (this already is covered by 
 java.io.Closeable).
 4. A change to Collector: allow IndexCache instead of IndexReader being 
 passed to setNextReader()
 5. SortField's numeric types have been replaced by direct assignments of 
 FieldComparatorSource. This removes the switch statements and the 
 possibility to throw IllegalArgumentExceptions because of unsupported type 
 values.
 The following classes have been deprecated and replaced by new classes in 
 org.apache.lucene.search.fields:
 - FieldCacheRangeFilter (= IndexFieldCacheRangeFilter)
 - FieldCacheTermsFilter (= IndexFieldCacheTermsFilter)
 - FieldCache (= IndexFieldCache)
 - FieldCacheImpl (= IndexFieldCacheImpl)
 - all classes in FieldCacheImpl (= several package-level classes)
 - all subclasses of FieldComparator (= several package-level classes)
 Final notes:
 - The patch would be simpler if no 
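The proposal above - per-reader, extensible caches instead of a static registry - can be sketched roughly as follows. This is a toy illustration: IndexCache and SegmentReaderIndexCache follow the names used in the issue, but the members and the DemoReader class are invented stand-ins, not the patch's actual API:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: each reader owns its own cache instance, so there is
// no central registry keyed on reader objects and no cross-reader key
// collisions between clones or decorators of the same data.
class IndexCache {
    private final Map<String, int[]> fieldValues = new HashMap<>();

    int[] getInts(String field) {
        // Cache is private to one reader; load lazily on first access.
        return fieldValues.computeIfAbsent(field, f -> loadInts(f));
    }

    int[] loadInts(String field) {
        // Stand-in for uninverting the field from the index.
        return new int[] {1, 2, 3};
    }

    void close() {
        fieldValues.clear(); // flushed by the owning reader (point 3 above)
    }
}

class SegmentReaderIndexCache extends IndexCache {
    // SegmentReader-specific state (e.g. the two ThreadLocals) would live here.
}

class DemoReader {
    private final IndexCache cache = new SegmentReaderIndexCache();

    IndexCache getIndexCache() { return cache; }

    void close() { cache.close(); } // the reader closes its own cache
}
```

Because the cache's lifetime is tied to the reader that owns it, a reopen()'ed reader naturally starts with a fresh cache, addressing flaw 2 in the description.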

[jira] Issue Comment Edited: (LUCENE-2133) [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField

2009-12-08 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12787752#action_12787752
 ] 

Mark Miller edited comment on LUCENE-2133 at 12/8/09 9:34 PM:
--

And what about the doubling-up insanity? It looks like you just commented out 
that check? It appears to me that that's still an issue we want to check for - 
we want to make sure Lucene core and users have a way to be sure they are not 
using a top-level reader and its sub-readers for caches unless they *really* 
intend to.

*edit*

This type of change actually exacerbates that problem (though if we want 
to improve things here, it's something we will have to deal with).

Now you might have a mixture of old-API/new-API caches as well if you don't 
properly upgrade everything at once.

 [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField
 -

 Key: LUCENE-2133
 URL: https://issues.apache.org/jira/browse/LUCENE-2133
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.9.1, 3.0
Reporter: Christian Kohlschütter
 Attachments: LUCENE-2133-complete.patch, LUCENE-2133.patch, 
 LUCENE-2133.patch



[jira] Commented: (LUCENE-2133) [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField

2009-12-08 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12787767#action_12787767
 ] 

Mark Miller commented on LUCENE-2133:
-

bq. not bind the cache so hard to the IndexReader (which was also the problem 
with the last FieldCache), instead just make it a plugin component

At a minimum, you should be able to set the cache for the reader.

bq. For the functionality of Lucene, FieldCache is not needed, sorting is just 
an addon on searching

The way he has it, this is not just for the FieldCache, but also the 
FieldsReader and vector reader - if we go down that road, we should consider 
norms as well.

bq.  I see no problems with applying it soon

I still think it might be a little early. This has a lot of consequences.

 [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField
 -

 Key: LUCENE-2133
 URL: https://issues.apache.org/jira/browse/LUCENE-2133
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.9.1, 3.0
Reporter: Christian Kohlschütter
 Attachments: LUCENE-2133-complete.patch, LUCENE-2133.patch, 
 LUCENE-2133.patch



[jira] Resolved: (LUCENE-2106) Benchmark does not close its Reader when OpenReader/CloseReader are not used

2009-12-07 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller resolved LUCENE-2106.
-

Resolution: Fixed

 Benchmark does not close its Reader when OpenReader/CloseReader are not used
 

 Key: LUCENE-2106
 URL: https://issues.apache.org/jira/browse/LUCENE-2106
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/benchmark
Affects Versions: 3.0
Reporter: Mark Miller
Assignee: Mark Miller
 Fix For: 3.0.1, 3.1

 Attachments: LUCENE-2106.patch


 Only the Searcher is closed, but because the reader is passed to the 
 Searcher, the Searcher does not close the Reader, causing a resource leak.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1844) Speed up junit tests

2009-12-07 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12787004#action_12787004
 ] 

Mark Miller commented on LUCENE-1844:
-

It should work fine.

 Speed up junit tests
 

 Key: LUCENE-1844
 URL: https://issues.apache.org/jira/browse/LUCENE-1844
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Mark Miller
Assignee: Michael McCandless
 Fix For: 3.1

 Attachments: FastCnstScoreQTest.patch, hi_junit_test_runtimes.png, 
 LUCENE-1844-Junit3.patch, LUCENE-1844.patch, LUCENE-1844.patch, 
 LUCENE-1844.patch


 As Lucene grows, so does the number of JUnit tests. This is obviously a good 
 thing, but it comes with longer and longer test times. Now that we also run 
 back compat tests in a standard test run, this problem is essentially doubled.
 There are some ways this may get better, including running parallel tests. 
 You will need the hardware to fully take advantage, but it should be a nice 
 gain. There is already an issue for this, and Junit 4.6, 4.7 have the 
 beginnings of something we might be able to count on soon. 4.6 was buggy, and 
 4.7 still doesn't come with nice ant integration. Parallel tests will come 
 though.
 Beyond parallel testing, I think we also need to concentrate on keeping our 
 tests lean. We don't want to sacrifice coverage or quality, but I'm sure 
 there is plenty of fat to skim.
 I've started making a list of some of the longer tests - I think with some 
 work we can make our tests much faster - and then with parallelization, I 
 think we could see some really great gains.




[jira] Created: (LUCENE-2130) Investigate Rewriting Constant Scoring MultiTermQueries per segment

2009-12-07 Thread Mark Miller (JIRA)
Investigate Rewriting Constant Scoring MultiTermQueries per segment
---

 Key: LUCENE-2130
 URL: https://issues.apache.org/jira/browse/LUCENE-2130
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Mark Miller
Priority: Minor


This issue is likely not to go anywhere, but I thought we might explore it. The 
only idea I have come up with is fairly ugly, and unless something better comes 
up, this is not likely to happen.

But if we could rewrite constant-score multi-term queries per segment, MTQs 
with auto, constant, or constant-boolean rewrite could enumerate terms against 
a single segment and then apply a BooleanQuery against each segment with just 
the terms that are known to be in that segment. This way, if you have a bunch 
of really large segments and a lot of really small segments, you wouldn't apply 
a huge BooleanQuery against all of the small segments which don't have those 
terms anyway. How advantageous this is, I'm not sure yet.

No biggie, not likely, but what the heck.

So the ugly way to do it is to add a property to queries and weights - 
lateCnstRewrite or something, that defaults to false. MTQ would return true if 
it's in a constant-score mode. On the top-level rewrite, if this is detected, 
an empty ConstantScoreQuery is made, and its Weight is turned to lateCnstRewrite 
and it keeps a ref to the original MTQ query. It also gets its boost set to the 
MTQ's boost. Then when we are searching per segment, if the Weight is 
lateCnstRewrite, we grab the original query and actually do the rewrite against 
the subreader and grab the actual ConstantScore weight. It works, I think - but 
it's a little ugly.
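The per-segment rewrite idea can be illustrated with a toy model. Segment and PrefixQuery below are simplified stand-ins (not Lucene classes); the point is just that a deferred, per-segment rewrite enumerates only the terms actually present in each segment, so a small segment yields a correspondingly small clause set:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the "late rewrite": instead of expanding the
// multi-term query once against the whole index, defer the expansion and
// redo it against each segment's own term dictionary.
class Segment {
    final Set<String> terms;
    Segment(Set<String> terms) { this.terms = terms; }
}

class PrefixQuery {
    final String prefix;
    boolean lateCnstRewrite = true; // the flag proposed in the comment above
    PrefixQuery(String prefix) { this.prefix = prefix; }

    // Per-segment rewrite: enumerate only this segment's matching terms.
    List<String> rewrite(Segment seg) {
        List<String> clauses = new ArrayList<>();
        for (String t : seg.terms) {
            if (t.startsWith(prefix)) clauses.add(t);
        }
        return clauses; // would back a small per-segment BooleanQuery/filter
    }
}
```

With a handful of tiny segments, most rewrites return few or no clauses, which is the saving the comment is after.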






[jira] Commented: (LUCENE-2130) Investigate Rewriting Constant Scoring MultiTermQueries per segment

2009-12-07 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12787239#action_12787239
 ] 

Mark Miller commented on LUCENE-2130:
-

Whoops - a little off in that summary - you wouldn't apply a huge BooleanQuery 
- you'd just have a sparser filter. This might not be that beneficial.

 Investigate Rewriting Constant Scoring MultiTermQueries per segment
 ---

 Key: LUCENE-2130
 URL: https://issues.apache.org/jira/browse/LUCENE-2130
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Mark Miller
Priority: Minor





[jira] Commented: (LUCENE-2130) Investigate Rewriting Constant Scoring MultiTermQueries per segment

2009-12-07 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12787248#action_12787248
 ] 

Mark Miller commented on LUCENE-2130:
-

Okay - so talking to Robert in chat - the advantage when you are enumerating a 
lot of terms is that you avoid DirectoryReader's MultiTermEnum and its priority 
queue.

 Investigate Rewriting Constant Scoring MultiTermQueries per segment
 ---

 Key: LUCENE-2130
 URL: https://issues.apache.org/jira/browse/LUCENE-2130
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Mark Miller
Priority: Minor





[jira] Updated: (LUCENE-2130) Investigate Rewriting Constant Scoring MultiTermQueries per segment

2009-12-07 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller updated LUCENE-2130:


Attachment: LUCENE-2130.patch

The ugly patch

 Investigate Rewriting Constant Scoring MultiTermQueries per segment
 ---

 Key: LUCENE-2130
 URL: https://issues.apache.org/jira/browse/LUCENE-2130
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Mark Miller
Priority: Minor
 Attachments: LUCENE-2130.patch






[jira] Issue Comment Edited: (LUCENE-2130) Investigate Rewriting Constant Scoring MultiTermQueries per segment

2009-12-07 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12787239#action_12787239
 ] 

Mark Miller edited comment on LUCENE-2130 at 12/8/09 2:11 AM:
--

Whoops - a little off in that summary - you wouldn't apply a huge BooleanQuery 
- you'd just have a sparser filter. This might not be that beneficial.

* edit *

Smaller, sparser filter?

 Investigate Rewriting Constant Scoring MultiTermQueries per segment
 ---

 Key: LUCENE-2130
 URL: https://issues.apache.org/jira/browse/LUCENE-2130
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Mark Miller
Priority: Minor
 Attachments: LUCENE-2130.patch






[jira] Issue Comment Edited: (LUCENE-2130) Investigate Rewriting Constant Scoring MultiTermQueries per segment

2009-12-07 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12787250#action_12787250
 ] 

Mark Miller edited comment on LUCENE-2130 at 12/8/09 2:16 AM:
--

The ugly patch (which doesn't yet handle the filter-supplied case)

 Investigate Rewriting Constant Scoring MultiTermQueries per segment
 ---

 Key: LUCENE-2130
 URL: https://issues.apache.org/jira/browse/LUCENE-2130
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Mark Miller
Priority: Minor
 Attachments: LUCENE-2130.patch






[jira] Issue Comment Edited: (LUCENE-2130) Investigate Rewriting Constant Scoring MultiTermQueries per segment

2009-12-07 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12787239#action_12787239
 ] 

Mark Miller edited comment on LUCENE-2130 at 12/8/09 3:25 AM:
--

Whoops - a little off in that summary - you wouldn't apply a huge boolean query 
- you'd just have a sparser filter. This might not be that beneficial.

* edit *

Smaller, sparser filter?

*edit*

Err - in the ConstantScore mode, I guess you're really just subdividing the 
filter - so no real benefit. I didn't realize it, but the constant BooleanQuery 
mode does still use the BooleanQuery (of course, why else have it) - but it's 
only going to be with few clauses, so neither is really a benefit.

  was (Author: markrmil...@gmail.com):
Whoops - a little off in that summary - you wouldn't apply a huge boolean 
query - you'd just have a sparser filter. This might not be that beneficial.

* edit *

Smaller, sparser filter?
  
 Investigate Rewriting Constant Scoring MultiTermQueries per segment
 ---

 Key: LUCENE-2130
 URL: https://issues.apache.org/jira/browse/LUCENE-2130
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Mark Miller
Priority: Minor
 Attachments: LUCENE-2130.patch


 This issue is likely not to go anywhere, but I thought we might explore it. 
 The only idea I have come up with is fairly ugly, and unless something better 
 comes up, this is not likely to happen.
 But if we could rewrite constant score multi-term queries per segment, MTQs 
 with auto, constant, or constant boolean rewrite could enum terms against a 
 single segment and then apply a BooleanQuery against each segment with just 
 the terms that are known to be in that segment. This way, if you have a bunch 
 of really large segments and a lot of really small segments, you wouldn't 
 apply a huge BooleanQuery against all of the small segments which don't have 
 those terms anyway. How advantageous this is, I'm not sure yet.
 No biggie, not likely, but what the heck.
 So the ugly way to do it is to add a property to queries and weights - 
 lateCnstRewrite or something, that defaults to false. MTQ would return true 
 if it's in constant score mode. On the top-level rewrite, if this is 
 detected, an empty ConstantScoreQuery is made, its Weight is turned to 
 lateCnstRewrite, and it keeps a ref to the original MTQ. It also gets 
 its boost set to the MTQ's boost. Then, when we are searching per segment, if 
 the Weight is lateCnstRewrite, we grab the orig query and actually do the 
 rewrite against the subreader and grab the actual ConstantScore Weight. It 
 works, I think - but it's a little ugly.




[jira] Updated: (LUCENE-2130) Investigate Rewriting Constant Scoring MultiTermQueries per segment

2009-12-07 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller updated LUCENE-2130:


Comment: was deleted

(was: Whoops - a little off in that summary - you wouldn't apply a huge boolean 
query - you'd just have a sparser filter. This might not be that beneficial.

* edit *

Smaller, sparser filter?

*edit*

Err - in the ConstantScore mode, I guess you're really just subdividing the 
filter - so no real benefit. I didn't realize it, but the constant BooleanQuery 
mode does still use the BooleanQuery (of course, why else have it) - but it's 
only going to be with few clauses, so neither is really a benefit.)

 Investigate Rewriting Constant Scoring MultiTermQueries per segment
 ---

 Key: LUCENE-2130
 URL: https://issues.apache.org/jira/browse/LUCENE-2130
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Mark Miller
Priority: Minor
 Attachments: LUCENE-2130.patch


 This issue is likely not to go anywhere, but I thought we might explore it. 
 The only idea I have come up with is fairly ugly, and unless something better 
 comes up, this is not likely to happen.
 But if we could rewrite constant score multi-term queries per segment, MTQs 
 with auto, constant, or constant boolean rewrite could enum terms against a 
 single segment and then apply a BooleanQuery against each segment with just 
 the terms that are known to be in that segment. This way, if you have a bunch 
 of really large segments and a lot of really small segments, you wouldn't 
 apply a huge BooleanQuery against all of the small segments which don't have 
 those terms anyway. How advantageous this is, I'm not sure yet.
 No biggie, not likely, but what the heck.
 So the ugly way to do it is to add a property to queries and weights - 
 lateCnstRewrite or something, that defaults to false. MTQ would return true 
 if it's in constant score mode. On the top-level rewrite, if this is 
 detected, an empty ConstantScoreQuery is made, its Weight is turned to 
 lateCnstRewrite, and it keeps a ref to the original MTQ. It also gets 
 its boost set to the MTQ's boost. Then, when we are searching per segment, if 
 the Weight is lateCnstRewrite, we grab the orig query and actually do the 
 rewrite against the subreader and grab the actual ConstantScore Weight. It 
 works, I think - but it's a little ugly.




[jira] Updated: (LUCENE-2130) Investigate Rewriting Constant Scoring MultiTermQueries per segment

2009-12-07 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller updated LUCENE-2130:


Description: 
This issue is likely not to go anywhere, but I thought we might explore it. The 
only idea I have come up with is fairly ugly, and unless something better comes 
up, this is not likely to happen.

But if we could rewrite constant score multi-term queries per segment, MTQs 
with auto (when the heuristic doesn't cut over to a constant filter) or 
constant boolean rewrite could enum terms against a single segment and then 
apply a BooleanQuery against each segment with just the terms that are known to 
be in that segment. This also allows you to avoid DirectoryReader's 
MultiTermEnum and its PQ. (See Robert's comment below.)

No biggie, not likely, but what the heck.

So the ugly way to do it is to add a property to queries and weights - 
lateCnstRewrite or something, that defaults to false. MTQ would return true if 
it's in constant score mode. On the top-level rewrite, if this is detected, an 
empty ConstantScoreQuery is made, its Weight is turned to lateCnstRewrite, and 
it keeps a ref to the original MTQ. It also gets its boost set to the MTQ's 
boost. Then, when we are searching per segment, if the Weight is 
lateCnstRewrite, we grab the orig query and actually do the rewrite against the 
subreader and grab the actual ConstantScore Weight. It works, I think - but 
it's a little ugly.

Not sure it's worth the baggage for the win - but perhaps the objective can be 
met in another way.



  was:
This issue is likely not to go anywhere, but I thought we might explore it. The 
only idea I have come up with is fairly ugly, and unless something better comes 
up, this is not likely to happen.

But if we could rewrite constant score multi-term queries per segment, MTQs 
with auto, constant, or constant boolean rewrite could enum terms against a 
single segment and then apply a BooleanQuery against each segment with just 
the terms that are known to be in that segment. This way, if you have a bunch 
of really large segments and a lot of really small segments, you wouldn't apply 
a huge BooleanQuery against all of the small segments which don't have those 
terms anyway. How advantageous this is, I'm not sure yet.

No biggie, not likely, but what the heck.

So the ugly way to do it is to add a property to queries and weights - 
lateCnstRewrite or something, that defaults to false. MTQ would return true if 
it's in constant score mode. On the top-level rewrite, if this is detected, an 
empty ConstantScoreQuery is made, its Weight is turned to lateCnstRewrite, and 
it keeps a ref to the original MTQ. It also gets its boost set to the MTQ's 
boost. Then, when we are searching per segment, if the Weight is 
lateCnstRewrite, we grab the orig query and actually do the rewrite against the 
subreader and grab the actual ConstantScore Weight. It works, I think - but 
it's a little ugly.




I've spewed too much confusion in this issue - just going to rewrite the 
summary.

 Investigate Rewriting Constant Scoring MultiTermQueries per segment
 ---

 Key: LUCENE-2130
 URL: https://issues.apache.org/jira/browse/LUCENE-2130
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Mark Miller
Priority: Minor
 Attachments: LUCENE-2130.patch


 This issue is likely not to go anywhere, but I thought we might explore it. 
 The only idea I have come up with is fairly ugly, and unless something better 
 comes up, this is not likely to happen.
 But if we could rewrite constant score multi-term queries per segment, MTQs 
 with auto (when the heuristic doesn't cut over to a constant filter) or 
 constant boolean rewrite could enum terms against a single segment and then 
 apply a BooleanQuery against each segment with just the terms that are known 
 to be in that segment. This also allows you to avoid DirectoryReader's 
 MultiTermEnum and its PQ. (See Robert's comment below.)
 No biggie, not likely, but what the heck.
 So the ugly way to do it is to add a property to queries and weights - 
 lateCnstRewrite or something, that defaults to false. MTQ would return true 
 if it's in constant score mode. On the top-level rewrite, if this is 
 detected, an empty ConstantScoreQuery is made, its Weight is turned to 
 lateCnstRewrite, and it keeps a ref to the original MTQ. It also gets 
 its boost set to the MTQ's boost. Then, when we are searching per segment, if 
 the Weight is lateCnstRewrite, we grab the orig query and actually do the 
 rewrite against the subreader and grab the actual ConstantScore Weight. It 
 works, I think - but it's a little ugly.
 Not sure it's worth the baggage for the win - but perhaps the objective can 
 be met in another way.


[jira] Commented: (LUCENE-2132) the demo application does not work as of 3.0

2009-12-07 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12787287#action_12787287
 ] 

Mark Miller commented on LUCENE-2132:
-

tsk tsk - got to run that demo, release manager ;) The webapp demo too (which I 
think we should drop just because of that - it's outdated and annoying to 
maintain).

 the demo application does not work as of 3.0
 

 Key: LUCENE-2132
 URL: https://issues.apache.org/jira/browse/LUCENE-2132
 Project: Lucene - Java
  Issue Type: Bug
  Components: Other
Affects Versions: 3.0
Reporter: Robert Muir
 Fix For: 3.1

 Attachments: LUCENE-2132.patch


 the demo application does not work. QueryParser needs a Version argument.
 While I am here, remove @author too




[jira] Closed: (LUCENE-874) Automatic reopen of IndexSearcher/IndexReader

2009-12-07 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller closed LUCENE-874.
--

Resolution: Won't Fix

 Automatic reopen of IndexSearcher/IndexReader
 -

 Key: LUCENE-874
 URL: https://issues.apache.org/jira/browse/LUCENE-874
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: João Fonseca
Priority: Minor

 To improve performance, a single instance of IndexSearcher should be used. 
 However, if the index is updated, it's hard to close/reopen it, because 
 multiple threads may be accessing it at the same time.
 Lucene should include an out-of-the-box solution to this problem. Either a 
 new class should be implemented to manage this behaviour (singleton 
 IndexSearcher, plus detection of a modified index, plus safely closing and 
 reopening the IndexSearcher), or this could be handled behind the scenes by 
 the IndexSearcher class.




[jira] Closed: (LUCENE-252) [PATCH] Problem with Sort logic on tokenized fields

2009-12-07 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller closed LUCENE-252.
--

Resolution: Fixed
  Assignee: (was: Lucene Developers)

This issue is too old - if a new patch/proposal is brought up we can reopen it.

 [PATCH] Problem with Sort logic on tokenized fields
 ---

 Key: LUCENE-252
 URL: https://issues.apache.org/jira/browse/LUCENE-252
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 1.4
 Environment: Operating System: other
 Platform: All
Reporter: Aviran Mordo
 Attachments: dif.txt, 
 FieldCacheImpl_Tokenized_fields_lucene_2.0.patch, 
 FieldCacheImpl_Tokenized_fields_lucene_2.0_v1.1.patch, 
 FieldCacheImpl_Tokenized_fields_lucene_2.2-dev.patch


 When you set a SortField to a Text field which gets tokenized, 
 FieldCacheImpl uses the term to do the sort, but then sorting is off, 
 especially with more than one word in the field. I think it is much 
 more logical to sort by the field's string value if the sort field is 
 tokenized and stored. This way you'll get the CORRECT sort order.




[jira] Updated: (LUCENE-1000) queryparsersyntax.html escaping section needs beefed up

2009-12-07 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller updated LUCENE-1000:


Fix Version/s: 3.1

 queryparsersyntax.html escaping section needs beefed up
 ---

 Key: LUCENE-1000
 URL: https://issues.apache.org/jira/browse/LUCENE-1000
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Website
Reporter: Hoss Man
 Fix For: 3.1


 the query syntax documentation is currently lacking several key pieces of 
 info:
  1) that unicode-style escapes are valid
  2) that any character can be escaped with a backslash, not just special 
 chars.
 We should probably beef up the Escaping Special Characters section.




[jira] Commented: (LUCENE-1923) Add toString() or getName() method to IndexReader

2009-12-07 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12787290#action_12787290
 ] 

Mark Miller commented on LUCENE-1923:
-

How's that patch coming ;)

 Add toString() or getName() method to IndexReader
 -

 Key: LUCENE-1923
 URL: https://issues.apache.org/jira/browse/LUCENE-1923
 Project: Lucene - Java
  Issue Type: Wish
  Components: Index
Reporter: Tim Smith

 It would be very useful for debugging if IndexReader either had a getName() 
 method, or a toString() implementation that would get a string identification 
 for the reader.
 For SegmentReader, this would return the same as getSegmentName().
 For directory readers, this would return the generation id?
 For MultiReader, this could return something like multi(sub reader name, sub 
 reader name, sub reader name, ...).
 Right now, I have to check instanceof for SegmentReader, then call 
 getSegmentName(), and for all other IndexReader types I would have to do 
 something like get the IndexCommit and get the generation off it (and this 
 may throw UnsupportedOperationException, at which point I would have to 
 recursively walk sub readers and try again).
 I could work up a patch if others like this idea.
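A sketch of the naming scheme proposed above, using toy reader classes rather than Lucene's real IndexReader hierarchy: leaf readers report their segment, composite readers recurse into their children.

```java
import java.util.StringJoiner;

public class ReaderNameSketch {

    interface Reader { String getName(); }

    // A leaf reader names itself after its segment.
    static class SegmentReader implements Reader {
        final String segment;
        SegmentReader(String segment) { this.segment = segment; }
        public String getName() { return segment; }
    }

    // A composite reader recursively names its children.
    static class MultiReader implements Reader {
        final Reader[] subs;
        MultiReader(Reader... subs) { this.subs = subs; }
        public String getName() {
            StringJoiner j = new StringJoiner(", ", "multi(", ")");
            for (Reader r : subs) j.add(r.getName());
            return j.toString();
        }
    }

    public static void main(String[] args) {
        Reader r = new MultiReader(new SegmentReader("_0"),
                                   new SegmentReader("_1"));
        System.out.println(r.getName()); // multi(_0, _1)
    }
}
```

With something like this, the instanceof checks and generation lookups described above collapse into a single polymorphic call.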




[jira] Updated: (LUCENE-2018) Reconsider boolean max clause exception

2009-12-07 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller updated LUCENE-2018:


Fix Version/s: 3.1

 Reconsider boolean max clause exception
 ---

 Key: LUCENE-2018
 URL: https://issues.apache.org/jira/browse/LUCENE-2018
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Mark Miller
 Fix For: 3.1


 Now that we have smarter multi-term queries, I think it's time to reconsider 
 the boolean max clause setting. It made more sense before, because you could 
 hit it unawares when the multi-term queries got huge - now it's more 
 likely that if it happens, it's because a user built the boolean themselves. 
 And, no duh, thousands more boolean clauses mean slower perf and more 
 resources needed. We don't throw an exception when you try to use a ton of 
 resources in a thousand other ways.
 The current setting also suffers from the static hell argument - especially 
 when you consider something like Solr's multicore feature - you can have 
 different settings for this in different cores, and the last one is going to 
 win. It's ugly. Yes, that could be addressed better in Solr as well - but I 
 still think it should be less ugly in Lucene.
 I'd like to consider either doing away with it, or raising it by quite a bit 
 at the least. Or an alternative, better solution. Right now, it ain't so great.
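The static-hell point above can be illustrated with a self-contained toy. BooleanishQuery and Core are hypothetical stand-ins (the real setting is a static max clause count on the boolean query class): because the limit is one JVM-wide static, two cores with different configs clobber each other and the last one loaded wins.

```java
public class StaticLimitSketch {

    static class BooleanishQuery {
        static int maxClauseCount = 1024; // one value for the whole JVM
    }

    // Each "core" applies its own configured limit at load time...
    static class Core {
        final String name;
        Core(String name, int configuredLimit) {
            this.name = name;
            BooleanishQuery.maxClauseCount = configuredLimit; // clobbers others
        }
    }

    public static void main(String[] args) {
        new Core("coreA", 2048);
        new Core("coreB", 512);
        // ...so coreA asked for 2048, but the last core loaded wins for both:
        System.out.println(BooleanishQuery.maxClauseCount); // 512
    }
}
```

An instance-level or per-searcher limit would avoid this entirely, which is part of why the static setting feels out of place.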




[jira] Closed: (LUCENE-421) Numeric range searching with large value sets

2009-12-07 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller closed LUCENE-421.
--

Resolution: Fixed
  Assignee: (was: Lucene Developers)

Closing - a few years old now and we currently have NumericRangeQuery.

 Numeric range searching with large value sets
 -

 Key: LUCENE-421
 URL: https://issues.apache.org/jira/browse/LUCENE-421
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 1.4
 Environment: Operating System: other
 Platform: Other
Reporter: Randy Puttick
Priority: Minor
 Attachments: FieldCache.java, FieldCacheImpl.java, 
 FloatRangeQuery.java, FloatRangeScorer.java, FloatRangeScorer.java, 
 IntegerRangeQuery.java, IntegerRangeQueryTestCase.java, 
 IntegerRangeScorer.java, IntegerRangeScorer.java, IntStack.java, 
 RangeQuery.java, Sort.java


 I have a set of enhancements that build on the numeric sorting cache 
 introduced
 by Tim Jones and that provide integer and floating point range searches over
 numeric ranges that are far too large to be implemented via the current term
 range rewrite mechanism.  I'm new to Apache and trying to find out how to 
 attach
 the source files for the changes for your consideration.




[jira] Closed: (LUCENE-379) Contribution: Efficient Sorting of DateField/DateTools Encoded Timestamp Long Values

2009-12-07 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller closed LUCENE-379.
--

Resolution: Fixed
  Assignee: (was: Lucene Developers)

Closing - the patch is a few years old and we have Numeric for this now.

 Contribution: Efficient Sorting of DateField/DateTools Encoded Timestamp Long 
 Values
 

 Key: LUCENE-379
 URL: https://issues.apache.org/jira/browse/LUCENE-379
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 1.4
 Environment: Operating System: All
 Platform: Other
Reporter: Rasik Pandey
Priority: Minor
 Attachments: org.apache.lucene.search.LongSortComparator.zip, 
 org.apache.lucene.search.ZIP, org.apache.lucene.search.ZIP, 
 patchTestSort.txt, patchTestSort.txt, patchTestSort.txt


 Hello Tim,
 As promised, the sort functionality for long values is included in the
 attached files.
 patchTestSort.txt contains the diff info. for my modifications to the
 TestSort.java class
 org.apache.lucene.search.ZIP contains the three new class files for
 efficient sorting of long field values and of encoded timestamp
 field values as long values.
 Let me know if you have any questions.
 Regards,
 Rus




[jira] Updated: (LUCENE-2085) Update PayloadSpanUtil

2009-12-07 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller updated LUCENE-2085:


Fix Version/s: 3.1

 Update PayloadSpanUtil
 --

 Key: LUCENE-2085
 URL: https://issues.apache.org/jira/browse/LUCENE-2085
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.9.1
Reporter: Mark Miller
Assignee: Mark Miller
 Fix For: 3.1







[jira] Closed: (LUCENE-1286) LargeDocHighlighter - another span highlighter optimized for large documents

2009-12-07 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller closed LUCENE-1286.
---

Resolution: Fixed

This isn't likely to go anywhere anytime soon - Koji's FastVectorHighlighter, 
while requiring termvectors, accomplishes this pretty nicely.

 LargeDocHighlighter - another span highlighter optimized for large documents
 

 Key: LUCENE-1286
 URL: https://issues.apache.org/jira/browse/LUCENE-1286
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/highlighter
Affects Versions: 2.4
Reporter: Mark Miller
Priority: Minor

 The existing Highlighter API is rich and well designed, but the approach 
 taken is not very efficient for large documents.
 I believe that this is because the current Highlighter rebuilds the document 
 by running through and scoring every token in the tokenstream.
 With a break in the current API, an alternate approach can be taken: rebuild 
 the document by running through the query terms and using their offsets. The 
 benefit is clear - a large doc will have a large tokenstream, but a query 
 will likely be very small in comparison.
 I expect this approach to be quite a bit faster for very large documents, 
 while still supporting Phrase and Span queries.
 First rough patch to follow shortly.
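The offsets-driven approach could look roughly like this toy sketch. The offsets are found with indexOf purely for illustration (a real highlighter would take them from the analyzer or term vectors), and overlapping terms are not handled; the point is that the work is proportional to the query terms, not the document's tokens.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

public class OffsetHighlightSketch {

    // Collect the offsets of each query term, then rebuild the document from
    // those offsets alone - O(query term hits), not O(doc tokens scored).
    static String highlight(String doc, String... queryTerms) {
        TreeMap<Integer, Integer> spans = new TreeMap<>(); // start -> end
        String lower = doc.toLowerCase(Locale.ROOT);
        for (String term : queryTerms) {
            int from = 0, at;
            while ((at = lower.indexOf(term, from)) >= 0) {
                spans.put(at, at + term.length());
                from = at + term.length();
            }
        }
        StringBuilder out = new StringBuilder();
        int pos = 0;
        for (Map.Entry<Integer, Integer> e : spans.entrySet()) {
            out.append(doc, pos, e.getKey()).append("<b>")
               .append(doc, e.getKey(), e.getValue()).append("</b>");
            pos = e.getValue();
        }
        return out.append(doc.substring(pos)).toString();
    }

    public static void main(String[] args) {
        System.out.println(highlight("big lucene doc about lucene", "lucene"));
        // big <b>lucene</b> doc about <b>lucene</b>
    }
}
```

Supporting Phrase and Span queries on top of this means checking that the collected offsets actually satisfy the positional constraints before marking them up, which is where the real work would be.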




[jira] Updated: (LUCENE-375) fish*~ parses to PrefixQuery - should be a parse exception

2009-12-07 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller updated LUCENE-375:
---

Assignee: Luis Alves  (was: Lucene Developers)

 fish*~ parses to PrefixQuery - should be a parse exception
 --

 Key: LUCENE-375
 URL: https://issues.apache.org/jira/browse/LUCENE-375
 Project: Lucene - Java
  Issue Type: Bug
  Components: QueryParser
Affects Versions: 1.4
 Environment: Operating System: other
 Platform: Other
Reporter: Erik Hatcher
Assignee: Luis Alves
Priority: Minor

 QueryParser parses fish*~ into a fish* PrefixQuery and silently drops the 
 ~.  This really should be a 
 parse exception.




[jira] Updated: (LUCENE-858) link from Lucene web page to API docs

2009-12-07 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller updated LUCENE-858:
---

Fix Version/s: 3.1
 Assignee: (was: Grant Ingersoll)

 link from Lucene web page to API docs
 -

 Key: LUCENE-858
 URL: https://issues.apache.org/jira/browse/LUCENE-858
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Daniel Naber
 Fix For: 3.1


 There should be a way to link from e.g. 
 http://lucene.apache.org/java/docs/gettingstarted.html to the API docs, but 
 not just to the start page with the frame set but to a specific page, e.g. 
 this:
 http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/overview-summary.html#overview_description
 To make this work, a way to set a relative link is needed.




[jira] Updated: (LUCENE-1307) Remove Contributions page

2009-12-07 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller updated LUCENE-1307:


Fix Version/s: 3.1

 Remove Contributions page
 -

 Key: LUCENE-1307
 URL: https://issues.apache.org/jira/browse/LUCENE-1307
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Website
Reporter: Otis Gospodnetic
Priority: Minor
 Fix For: 3.1


  On Fri, May 16, 2008 at 10:06 PM, Otis Gospodnetic
  otis_gospodne...@yahoo.com wrote:
  Hola,
 
  Does anyone think the Contributions page should be removed?
  http://lucene.apache.org/java/2_3_2/contributions.html
 
  It looks so outdated that I think it may give newcomers a bad  
  impression of Lucene (What, this is it for contributions?).
  The only really valuable piece there is Luke, but Luke must be  
  mentioned in a dozen places on the Wiki anyway.
 
 
  Should we remove the Contributions page?
 Yonik and Grant gave their +1s.




[jira] Updated: (LUCENE-1941) MinPayloadFunction returns 0 when only one payload is present

2009-12-07 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller updated LUCENE-1941:


Fix Version/s: 3.1
   3.0.1

 MinPayloadFunction returns 0 when only one payload is present
 -

 Key: LUCENE-1941
 URL: https://issues.apache.org/jira/browse/LUCENE-1941
 Project: Lucene - Java
  Issue Type: Bug
  Components: Query/Scoring
Affects Versions: 2.9
Reporter: Erik Hatcher
 Fix For: 3.0.1, 3.1


 In some experiments with payload scoring through PayloadTermQuery, I'm seeing 
 0 returned when using MinPayloadFunction.  I believe there is a bug there.  
 No time at the moment to flesh out a unit test, but wanted to report it for 
 tracking.




[jira] Updated: (LUCENE-1271) ClassCastException when using ParallelMultiSearcher.search(Query query, Filter filter, int n, Sort sort)

2009-12-07 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller updated LUCENE-1271:


Fix Version/s: 3.1

 ClassCastException when using ParallelMultiSearcher.search(Query query, 
 Filter filter, int n, Sort sort)
 

 Key: LUCENE-1271
 URL: https://issues.apache.org/jira/browse/LUCENE-1271
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 2.3, 2.3.1
 Environment: MS Windows XP (SP 2), JDK 1.5.0 Update 12
Reporter: Kai Burjack
Priority: Minor
 Fix For: 3.1


 Stacktrace-Output in Console:
Exception in thread "MultiSearcher thread #1" java.lang.ClassCastException: org.apache.lucene.search.ScoreDoc
  at org.apache.lucene.search.FieldDocSortedHitQueue.lessThan(FieldDocSortedHitQueue.java:105)
  at org.apache.lucene.util.PriorityQueue.upHeap(PriorityQueue.java:139)
  at org.apache.lucene.util.PriorityQueue.put(PriorityQueue.java:53)
  at org.apache.lucene.util.PriorityQueue.insertWithOverflow(PriorityQueue.java:78)
  at org.apache.lucene.util.PriorityQueue.insert(PriorityQueue.java:63)
  at org.apache.lucene.search.MultiSearcherThread.run(ParallelMultiSearcher.java:272)
Exception in thread "MultiSearcher thread #2" java.lang.ClassCastException: org.apache.lucene.search.ScoreDoc
  at org.apache.lucene.search.FieldDocSortedHitQueue.lessThan(FieldDocSortedHitQueue.java:105)
  at org.apache.lucene.util.PriorityQueue.upHeap(PriorityQueue.java:139)
  at org.apache.lucene.util.PriorityQueue.put(PriorityQueue.java:53)
  at org.apache.lucene.util.PriorityQueue.insertWithOverflow(PriorityQueue.java:78)
  at org.apache.lucene.util.PriorityQueue.insert(PriorityQueue.java:63)
  at org.apache.lucene.search.MultiSearcherThread.run(ParallelMultiSearcher.java:272)
Stack trace of the resulting exception while running the JUnit test:
java.lang.ClassCastException: org.apache.lucene.search.ScoreDoc
  at org.apache.lucene.search.FieldDocSortedHitQueue.lessThan(FieldDocSortedHitQueue.java:105)
  at org.apache.lucene.util.PriorityQueue.downHeap(PriorityQueue.java:155)
  at org.apache.lucene.util.PriorityQueue.pop(PriorityQueue.java:106)
  at org.apache.lucene.search.ParallelMultiSearcher.search(ParallelMultiSearcher.java:146)
  at org.apache.lucene.search.Searcher.search(Searcher.java:78)
  at [the class calling the Searcher.search(Query query, Filter filter, int n, Sort sort) method with filter:null and sort:null]
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
  at java.lang.reflect.Method.invoke(Unknown Source)
  at junit.framework.TestCase.runTest(TestCase.java:154)
  at junit.framework.TestCase.runBare(TestCase.java:127)
  at junit.framework.TestResult$1.protect(TestResult.java:106)
  at junit.framework.TestResult.runProtected(TestResult.java:124)
  at junit.framework.TestResult.run(TestResult.java:109)
  at junit.framework.TestCase.run(TestCase.java:118)
  at junit.framework.TestSuite.runTest(TestSuite.java:208)
  at junit.framework.TestSuite.run(TestSuite.java:203)
  at org.eclipse.jdt.internal.junit.runner.junit3.JUnit3TestReference.run(JUnit3TestReference.java:130)
  at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
  at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:460)
  at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:673)
  at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:386)
  at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:196)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-860) site should call project Lucene Java, not just Lucene

2009-12-07 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller updated LUCENE-860:
---

Fix Version/s: 3.1

 site should call project Lucene Java, not just Lucene
 -

 Key: LUCENE-860
 URL: https://issues.apache.org/jira/browse/LUCENE-860
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Website
Reporter: Doug Cutting
Assignee: Mark Miller
Priority: Minor
 Fix For: 3.1

 Attachments: LUCENE-860.patch


 To avoid confusion with the top-level Lucene project, the Lucene Java website 
 should refer to itself as Lucene Java.




[jira] Updated: (LUCENE-1736) DateTools.java general improvements

2009-12-07 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller updated LUCENE-1736:


Fix Version/s: 3.1

 DateTools.java general improvements
 ---

 Key: LUCENE-1736
 URL: https://issues.apache.org/jira/browse/LUCENE-1736
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: 2.9
Reporter: David Smiley
Priority: Minor
 Fix For: 3.1

 Attachments: cleanerDateTools.patch


 Applying the attached patch shows the improvements to DateTools.java that I 
 think should be done. All logic that does anything at all is moved to 
 instance methods of the inner class Resolution. I argue this is more 
 object-oriented.
 1. In cases where Resolution is an argument to the method, I can simply 
 invoke the appropriate call on the Resolution object. Formerly there was a 
 big branch if/else.
 2. Instead of synchronized being used seemingly everywhere, synchronized is 
 used to sync on the object that is not threadsafe, be it a DateFormat or 
 Calendar instance.
 3. Since different DateFormat and Calendar instances are created 
 per-Resolution, there is now less lock contention since threads using 
 different resolutions will not use the same locks.
 4. The old implementation of timeToString rounded the time before formatting 
 it. That's unnecessary since the format only includes the resolution desired.
 5. round() now uses a switch statement that benefits from fall-through (no 
 break).
 Another debatable improvement that could be made is putting the resolution 
 instances into an array indexed by format length. This would mean I could 
 remove the switch in lookupResolutionByLength() and avoid the length 
 constants there. Maybe that would be a bit too over-engineered when the 
 switch is fine.
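The per-Resolution locking idea in points 2 and 3 can be sketched in plain Java. This is a hypothetical sketch, not the actual patch: each resolution owns its own (non-threadsafe) SimpleDateFormat and synchronizes only on that, so threads using different resolutions never contend for the same lock.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Each resolution carries its own formatter, as the patch proposes.
enum Resolution {
    DAY("yyyyMMdd"), SECOND("yyyyMMddHHmmss");

    private final SimpleDateFormat format;

    Resolution(String pattern) {
        format = new SimpleDateFormat(pattern);
        format.setTimeZone(TimeZone.getTimeZone("GMT"));
    }

    // Lock only this resolution's formatter, not a global lock.
    String dateToString(Date date) {
        synchronized (format) {
            return format.format(date);
        }
    }
}

public class DateToolsSketch {
    public static void main(String[] args) {
        // 1260144000000 ms since epoch = 2009-12-07 00:00:00 GMT
        System.out.println(Resolution.DAY.dateToString(new Date(1260144000000L))); // 20091207
    }
}
```

Point 4 falls out for free: because each resolution's pattern only includes the fields it resolves, formatting already truncates, so no separate rounding step is needed before formatting.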




[jira] Resolved: (LUCENE-636) [PATCH] Differently configured Lucene 'instances' in same JVM

2009-12-07 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller resolved LUCENE-636.


   Resolution: Fixed
Fix Version/s: 3.0

 [PATCH] Differently configured Lucene 'instances' in same JVM
 -

 Key: LUCENE-636
 URL: https://issues.apache.org/jira/browse/LUCENE-636
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: 2.0.0
Reporter: Johan Stuyts
 Fix For: 3.0

 Attachments: Lucene2DifferentConfigurations.patch


 Currently Lucene can be configured using system properties. When running 
 multiple 'instances' of Lucene for different purposes in the same JVM, it is 
 not possible to use different settings for each 'instance'.
 I made changes to some Lucene classes so you can pass a configuration to that 
 class. The Lucene 'instance' will use the settings from that configuration. 
 The changes do not affect the API and/or the current behavior, so they are 
 backwards compatible.
 In addition to the changes above, I also made SegmentReader and 
 SegmentTermDocs extensible outside of their package. I would appreciate the 
 inclusion of these changes but don't mind creating a separate issue for them.




[jira] Resolved: (LUCENE-1052) Add an termInfosIndexDivisor to IndexReader

2009-12-07 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller resolved LUCENE-1052.
-

Resolution: Fixed

This issue was resolved - let's open a new one if we want to do more.

 Add an termInfosIndexDivisor to IndexReader
 -

 Key: LUCENE-1052
 URL: https://issues.apache.org/jira/browse/LUCENE-1052
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.2
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Attachments: LUCENE-1052.patch, LUCENE-1052.patch, 
 termInfosConfigurer.patch


 The termIndexInterval, set at indexing time, lets you trade off
 how much RAM is used by a reader to load the indexed terms vs. the cost of
 seeking to the specific term you want to load.
 But the downside is you must set it at indexing time.
 This issue adds an indexDivisor to TermInfosReader so that on opening
 a reader you can further sub-sample the termIndexInterval to use
 less RAM.  E.g. a setting of 2 means every 2 * termIndexInterval'th term is
 loaded into RAM.
 This is particularly useful if your index has a great many terms (e.g.
 you accidentally indexed binary terms).
 Spinoff from this thread:
   http://www.gossamer-threads.com/lists/lucene/java-dev/54371
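The sub-sampling described above can be sketched in plain Java. This is a hypothetical illustration, not the actual TermInfosReader code: the on-disk term index already holds every termIndexInterval'th term, and applying an indexDivisor at reader-open time keeps only every divisor'th of those entries in RAM, so the effective spacing becomes divisor * termIndexInterval at roughly 1/divisor the memory.

```java
import java.util.ArrayList;
import java.util.List;

public class IndexDivisorSketch {
    // Keep only every indexDivisor'th entry of the on-disk term index.
    static List<String> loadIndexTerms(List<String> onDiskIndexTerms, int indexDivisor) {
        List<String> inRam = new ArrayList<>();
        for (int i = 0; i < onDiskIndexTerms.size(); i += indexDivisor) {
            inRam.add(onDiskIndexTerms.get(i));
        }
        return inRam;
    }

    public static void main(String[] args) {
        // Pretend these are the indexed terms sampled at termIndexInterval.
        List<String> indexTerms = List.of("aa", "dd", "gg", "jj", "mm", "pp");
        System.out.println(loadIndexTerms(indexTerms, 2)); // [aa, gg, mm]
    }
}
```

Seeking then scans at most divisor * termIndexInterval terms from the nearest retained index entry, which is the RAM-vs-seek-cost trade-off the issue describes.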




[jira] Closed: (LUCENE-1859) TermAttributeImpl's buffer will never shrink if it grows too big

2009-12-07 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller closed LUCENE-1859.
---

Resolution: Won't Fix

 TermAttributeImpl's buffer will never shrink if it grows too big
 --

 Key: LUCENE-1859
 URL: https://issues.apache.org/jira/browse/LUCENE-1859
 Project: Lucene - Java
  Issue Type: Bug
  Components: Analysis
Affects Versions: 2.9
Reporter: Tim Smith
Priority: Minor

 This was also an issue with Token previously as well.
 If a TermAttributeImpl is populated with a very long buffer, it will never be 
 able to reclaim this memory.
 Obviously, it can be argued that Tokenizers should never emit large 
 tokens; however, it seems that TermAttributeImpl should have a reasonable 
 static MAX_BUFFER_SIZE such that, if the term buffer grows bigger than this, 
 it will shrink back down to this size once the next token smaller than 
 MAX_BUFFER_SIZE is set.
 I don't think I have actually encountered issues with this yet, but it 
 seems that if you have multiple indexing threads, you could end up with a 
 char[Integer.MAX_VALUE] per thread (in the very worst-case scenario).
 Perhaps growTermBuffer should have the logic to shrink if the buffer is 
 currently larger than MAX_BUFFER_SIZE and it needs less than MAX_BUFFER_SIZE.
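The proposed shrink logic can be sketched in isolation. This is a hypothetical standalone sketch, not the real TermAttributeImpl: the buffer still grows without bound to fit one huge token, but once a token at or under an assumed MAX_BUFFER_SIZE arrives, the oversized buffer is reallocated back down.

```java
import java.util.Arrays;

public class ShrinkingTermBuffer {
    // Assumed cap; the issue leaves the actual value open.
    static final int MAX_BUFFER_SIZE = 16 * 1024;

    private char[] termBuffer = new char[32];

    char[] resizeTermBuffer(int newSize) {
        if (newSize > termBuffer.length) {
            // Grow as today, at least doubling to amortize copies.
            termBuffer = Arrays.copyOf(termBuffer, Math.max(newSize, termBuffer.length * 2));
        } else if (termBuffer.length > MAX_BUFFER_SIZE && newSize <= MAX_BUFFER_SIZE) {
            // Proposed change: reclaim memory left over from an oversized token.
            termBuffer = new char[MAX_BUFFER_SIZE];
        }
        return termBuffer;
    }

    public static void main(String[] args) {
        ShrinkingTermBuffer b = new ShrinkingTermBuffer();
        b.resizeTermBuffer(1_000_000);  // one huge token grows the buffer
        int grown = b.termBuffer.length;
        b.resizeTermBuffer(100);        // next small token triggers the shrink
        System.out.println(grown >= 1_000_000 && b.termBuffer.length == MAX_BUFFER_SIZE);
    }
}
```

With this in place, the worst case per indexing thread becomes MAX_BUFFER_SIZE chars (plus one transiently large allocation), rather than a permanently retained char[Integer.MAX_VALUE].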




[jira] Updated: (LUCENE-2130) Investigate Rewriting Constant Scoring MultiTermQueries per segment

2009-12-07 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller updated LUCENE-2130:


Attachment: LUCENE-2130.patch

updated

 Investigate Rewriting Constant Scoring MultiTermQueries per segment
 ---

 Key: LUCENE-2130
 URL: https://issues.apache.org/jira/browse/LUCENE-2130
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Mark Miller
Priority: Minor
 Fix For: Flex Branch

 Attachments: LUCENE-2130.patch, LUCENE-2130.patch


 This issue is likely not to go anywhere, but I thought we might explore it. 
 The only idea I have come up with is fairly ugly, and unless something better 
 comes up, this is not likely to happen.
 But if we could rewrite constant-score multi-term queries per segment, MTQs 
 with auto rewrite (when the heuristic doesn't cut over to a constant filter) or 
 constant boolean rewrite could enum terms against a single segment and then 
 apply a boolean query against each segment with just the terms that are known 
 to be in that segment. This also lets you avoid DirectoryReader's 
 MultiTermEnum and its PQ. (See Robert's comment below.)
 No biggie, not likely, but what the heck.
 So the ugly way to do it is to add a property to queries and weights - 
 lateCnstRewrite or something - that defaults to false. MTQ would return true 
 if it's in a constant-score mode. On the top-level rewrite, if this is 
 detected, an empty ConstantScoreQuery is made, and its Weight is turned to 
 lateCnstRewrite and keeps a ref to the original MTQ query. It also gets 
 its boost set to the MTQ's boost. Then when we are searching per segment, if 
 the Weight is lateCnstRewrite, we grab the orig query and actually do the 
 rewrite against the subreader and grab the actual constant-score weight. It 
 works, I think - but it's a little ugly.
 Not sure it's worth the baggage for the win - but perhaps the objective can be 
 met in another way.
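The core of the per-segment idea, stripped of the Weight plumbing, is term intersection. A hypothetical illustration (plain Java, not Lucene API): each segment keeps only the expanded terms it actually contains, yielding a smaller per-segment boolean query and no cross-segment merged term enum.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class PerSegmentRewriteSketch {
    // Keep only the expanded terms that exist in this segment's dictionary.
    static List<String> rewriteForSegment(List<String> expandedTerms, Set<String> segmentTerms) {
        return expandedTerms.stream()
                .filter(segmentTerms::contains)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // e.g. terms matched by a wildcard query foo*
        List<String> expanded = List.of("foo1", "foo2", "foo3");
        // terms present in one particular segment
        Set<String> segment = Set.of("foo2", "bar");
        System.out.println(rewriteForSegment(expanded, segment)); // [foo2]
    }
}
```

The ugliness the comment describes comes from threading this per-segment decision through Query/Weight objects that were designed to rewrite once at the top level; the intersection itself is trivial.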




[jira] Updated: (LUCENE-2130) Investigate Rewriting Constant Scoring MultiTermQueries per segment

2009-12-07 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller updated LUCENE-2130:


Fix Version/s: Flex Branch

 Investigate Rewriting Constant Scoring MultiTermQueries per segment
 ---

 Key: LUCENE-2130
 URL: https://issues.apache.org/jira/browse/LUCENE-2130
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Mark Miller
Priority: Minor
 Fix For: Flex Branch

 Attachments: LUCENE-2130.patch, LUCENE-2130.patch


 This issue is likely not to go anywhere, but I thought we might explore it. 
 The only idea I have come up with is fairly ugly, and unless something better 
 comes up, this is not likely to happen.
 But if we could rewrite constant-score multi-term queries per segment, MTQs 
 with auto rewrite (when the heuristic doesn't cut over to a constant filter) or 
 constant boolean rewrite could enum terms against a single segment and then 
 apply a boolean query against each segment with just the terms that are known 
 to be in that segment. This also lets you avoid DirectoryReader's 
 MultiTermEnum and its PQ. (See Robert's comment below.)
 No biggie, not likely, but what the heck.
 So the ugly way to do it is to add a property to queries and weights - 
 lateCnstRewrite or something - that defaults to false. MTQ would return true 
 if it's in a constant-score mode. On the top-level rewrite, if this is 
 detected, an empty ConstantScoreQuery is made, and its Weight is turned to 
 lateCnstRewrite and keeps a ref to the original MTQ query. It also gets 
 its boost set to the MTQ's boost. Then when we are searching per segment, if 
 the Weight is lateCnstRewrite, we grab the orig query and actually do the 
 rewrite against the subreader and grab the actual constant-score weight. It 
 works, I think - but it's a little ugly.
 Not sure it's worth the baggage for the win - but perhaps the objective can be 
 met in another way.




[jira] Commented: (LUCENE-1859) TermAttributeImpl's buffer will never shrink if it grows too big

2009-12-06 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12786680#action_12786680
 ] 

Mark Miller commented on LUCENE-1859:
-

Without a proposed patch from someone, I'm tempted to close this issue...

 TermAttributeImpl's buffer will never shrink if it grows too big
 --

 Key: LUCENE-1859
 URL: https://issues.apache.org/jira/browse/LUCENE-1859
 Project: Lucene - Java
  Issue Type: Bug
  Components: Analysis
Affects Versions: 2.9
Reporter: Tim Smith
Priority: Minor

 This was also an issue with Token previously as well.
 If a TermAttributeImpl is populated with a very long buffer, it will never be 
 able to reclaim this memory.
 Obviously, it can be argued that Tokenizers should never emit large 
 tokens; however, it seems that TermAttributeImpl should have a reasonable 
 static MAX_BUFFER_SIZE such that, if the term buffer grows bigger than this, 
 it will shrink back down to this size once the next token smaller than 
 MAX_BUFFER_SIZE is set.
 I don't think I have actually encountered issues with this yet, but it 
 seems that if you have multiple indexing threads, you could end up with a 
 char[Integer.MAX_VALUE] per thread (in the very worst-case scenario).
 Perhaps growTermBuffer should have the logic to shrink if the buffer is 
 currently larger than MAX_BUFFER_SIZE and it needs less than MAX_BUFFER_SIZE.




[jira] Commented: (LUCENE-774) TopDocs and TopFieldDocs does not implement equals and hashCode

2009-12-06 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12786681#action_12786681
 ] 

Mark Miller commented on LUCENE-774:


Still want to push forward with this issue?

 TopDocs and TopFieldDocs does not implement equals and hashCode
 ---

 Key: LUCENE-774
 URL: https://issues.apache.org/jira/browse/LUCENE-774
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: 2.0.0
Reporter: Karl Wettin
Priority: Trivial
 Attachments: extendsObject.diff







[jira] Commented: (LUCENE-792) PrecedenceQueryParser misinterprets queries starting with NOT

2009-12-06 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12786683#action_12786683
 ] 

Mark Miller commented on LUCENE-792:


Based on its state, we should probably deprecate PrecedenceQueryParser in favor of 
the precedence support that's about to land in the new QueryParser impl.

 PrecedenceQueryParser misinterprets queries starting with NOT
 -

 Key: LUCENE-792
 URL: https://issues.apache.org/jira/browse/LUCENE-792
 Project: Lucene - Java
  Issue Type: Bug
  Components: QueryParser
Affects Versions: 2.0.0
Reporter: Eric Jain

 "NOT foo AND baz" is parsed as "-(+foo +baz)" instead of "-foo +baz".
 (I'm setting parser.setDefaultOperator(PrecedenceQueryParser.AND_OPERATOR), 
 but the issue applies otherwise too.)




[jira] Commented: (LUCENE-860) site should call project Lucene Java, not just Lucene

2009-12-06 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12786684#action_12786684
 ] 

Mark Miller commented on LUCENE-860:


I'm actually more confused when I see Lucene Java than I am by Lucene :) 

But, I'll commit this soon if no one has any objections.

 site should call project Lucene Java, not just Lucene
 -

 Key: LUCENE-860
 URL: https://issues.apache.org/jira/browse/LUCENE-860
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Website
Reporter: Doug Cutting
Priority: Minor
 Attachments: LUCENE-860.patch


 To avoid confusion with the top-level Lucene project, the Lucene Java website 
 should refer to itself as Lucene Java.




[jira] Assigned: (LUCENE-860) site should call project Lucene Java, not just Lucene

2009-12-06 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller reassigned LUCENE-860:
--

Assignee: Mark Miller

 site should call project Lucene Java, not just Lucene
 -

 Key: LUCENE-860
 URL: https://issues.apache.org/jira/browse/LUCENE-860
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Website
Reporter: Doug Cutting
Assignee: Mark Miller
Priority: Minor
 Attachments: LUCENE-860.patch


 To avoid confusion with the top-level Lucene project, the Lucene Java website 
 should refer to itself as Lucene Java.




[jira] Commented: (LUCENE-874) Automatic reopen of IndexSearcher/IndexReader

2009-12-06 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12786686#action_12786686
 ] 

Mark Miller commented on LUCENE-874:


Anyone interested in this issue? I think the new ref stuff actually makes this 
rather easy now ...

 Automatic reopen of IndexSearcher/IndexReader
 -

 Key: LUCENE-874
 URL: https://issues.apache.org/jira/browse/LUCENE-874
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: João Fonseca
Priority: Minor

 To improve performance, a single instance of IndexSearcher should be used. 
 However, if the index is updated, it's hard to close/reopen it, because 
 multiple threads may be accessing it at the same time.
 Lucene should include an out-of-the-box solution to this problem. Either a 
 new class should be implemented to manage this behaviour (a singleton 
 IndexSearcher, plus detection of a modified index, plus safely closing and 
 reopening the IndexSearcher), or this could be handled behind the scenes by 
 the IndexSearcher class.
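The shape of such a manager can be sketched generically. This is a hypothetical sketch, not an existing Lucene class: one shared searcher handle, swapped atomically when the index is detected to have changed, so reader threads never see a half-closed searcher. (Real code would also ref-count the old searcher and close it only after in-flight searches finish.)

```java
import java.util.concurrent.atomic.AtomicReference;

public class SearcherManagerSketch<S> {
    private final AtomicReference<S> current = new AtomicReference<>();

    SearcherManagerSketch(S initial) {
        current.set(initial);
    }

    // Called by search threads; always returns the latest searcher.
    S acquire() {
        return current.get();
    }

    // Called when an index change is detected; the old searcher would be
    // ref-counted and closed once released by in-flight searches.
    void maybeReopen(S reopened) {
        current.set(reopened);
    }

    public static void main(String[] args) {
        SearcherManagerSketch<String> mgr = new SearcherManagerSketch<>("searcher-v1");
        mgr.maybeReopen("searcher-v2");
        System.out.println(mgr.acquire()); // searcher-v2
    }
}
```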




[jira] Commented: (LUCENE-902) Check on PositionIncrement with StopFilter.

2009-12-06 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12786687#action_12786687
 ] 

Mark Miller commented on LUCENE-902:


This patch is severely out of date - could we get an update?

 Check on PositionIncrement  with StopFilter.
 

 Key: LUCENE-902
 URL: https://issues.apache.org/jira/browse/LUCENE-902
 Project: Lucene - Java
  Issue Type: Bug
  Components: Analysis
Affects Versions: 2.2
Reporter: Toru Matsuzawa
 Attachments: stopfilter.patch, stopfilter20070604.patch, 
 stopfilter20070605.patch, stopfilter20070608.patch


 The PositionIncrement set by a Tokenizer is not respected by StopFilter. 
 When a Token's PositionIncrement is 1 and it is a stop word, it is deleted by 
 StopFilter. However, when the PositionIncrement of the Token that follows is 
 0, that Token is not deleted. I think it needs to be deleted too, because a 
 Token with PositionIncrement 0 occupies the same position as the deleted Token.
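The requested behavior can be sketched in isolation. This is a hypothetical standalone sketch (not the real StopFilter): when a stop word is dropped, any immediately following tokens with position increment 0 are dropped as well, since they sit at the same position.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class StopFilterSketch {
    record Token(String term, int posIncr) {}

    static List<Token> filter(List<Token> in, Set<String> stopWords) {
        List<Token> out = new ArrayList<>();
        boolean lastWasDropped = false;
        for (Token t : in) {
            // A zero-increment token shares the dropped token's position.
            boolean samePositionAsDropped = lastWasDropped && t.posIncr() == 0;
            if (stopWords.contains(t.term()) || samePositionAsDropped) {
                lastWasDropped = true;  // keep dropping zero-increment followers
            } else {
                out.add(t);
                lastWasDropped = false;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // "thee" is a synonym stacked at the same position as stop word "the".
        List<Token> tokens = List.of(
                new Token("the", 1), new Token("thee", 0), new Token("cat", 1));
        System.out.println(filter(tokens, Set.of("the")));
    }
}
```

Today's StopFilter only tests each token against the stop set individually, which is why the stacked zero-increment token survives.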




[jira] Commented: (LUCENE-644) Contrib: another highlighter approach

2009-12-06 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12786688#action_12786688
 ] 

Mark Miller commented on LUCENE-644:


I think its time to close this issue - further work here should probably be 
applied to the FastVectorHighlighter (which is very similar and now in contrib).

 Contrib: another highlighter approach
 -

 Key: LUCENE-644
 URL: https://issues.apache.org/jira/browse/LUCENE-644
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Other
Reporter: Ronnie Kolehmainen
Priority: Minor
 Attachments: FulltextHighlighter.java, FulltextHighlighter.java, 
 FulltextHighlighterTest.java, FulltextHighlighterTest.java, svn-diff.patch, 
 svn-diff.patch, TokenSources.java, TokenSources.java.diff


 Mark Harwood's highlighter package is a great contribution to Lucene; I've 
 used it a lot! However, when you have *large* documents (fields), 
 highlighting can be quite time-consuming if you increase the number of bytes 
 to analyze with setMaxDocBytesToAnalyze(int). The default value of 50k is 
 often too low for indexed PDFs etcetera, which results in empty highlight 
 strings.
 This is an alternative approach using term position vectors only to build 
 fragment info objects. Then a StringReader can read the relevant fragments 
 and skip() between them. This is a lot faster. Also, this method uses the 
 *entire* field for finding the best fragments, so you're always guaranteed to 
 get a highlight snippet.
 Because this method only works with fields which have term positions stored 
 one can check if this method works for a particular field using following 
 code (taken from TokenSources.java):
 TermFreqVector tfv = (TermFreqVector) reader.getTermFreqVector(docId, 
 field);
 if (tfv != null && tfv instanceof TermPositionVector)
 {
   // use FulltextHighlighter
 }
 else
 {
   // use standard Highlighter
 }
 Someone else might find this useful so I'm posting the code here.



