[jira] Commented: (LUCENE-2287) Unexpected terms are highlighted within nested SpanQuery instances
[ https://issues.apache.org/jira/browse/LUCENE-2287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12857904#action_12857904 ] Mark Miller commented on LUCENE-2287:

Hey Michael - it looks like there is a lot of reformatting in this patch - if it's not that much of a hassle, is it possible to get a patch without the reformats?

Unexpected terms are highlighted within nested SpanQuery instances
------------------------------------------------------------------
Key: LUCENE-2287
URL: https://issues.apache.org/jira/browse/LUCENE-2287
Project: Lucene - Java
Issue Type: Improvement
Components: contrib/highlighter
Affects Versions: 2.9.1
Environment: Linux, Solaris, Windows
Reporter: Michael Goddard
Priority: Minor
Attachments: LUCENE-2287.patch, LUCENE-2287.patch, LUCENE-2287.patch, LUCENE-2287.patch, LUCENE-2287.patch, LUCENE-2287.patch
Original Estimate: 336h
Remaining Estimate: 336h

I haven't yet been able to resolve why I'm seeing spurious highlighting in nested SpanQuery instances. Briefly, the issue is illustrated by the second instance of "Lucene" being highlighted in the test below, when it doesn't satisfy the inner span. There's been some discussion about this on the java-dev list, and I'm opening this issue now because I have made some initial progress on this.
This new test, added to the HighlighterTest class in lucene_2_9_1, illustrates this:

{code}
/*
 * Ref: http://www.lucidimagination.com/blog/2009/07/18/the-spanquery/
 */
public void testHighlightingNestedSpans2() throws Exception {
  String theText = "The Lucene was made by Doug Cutting and Lucene great Hadoop was"; // Problem
  //String theText = "The Lucene was made by Doug Cutting and the great Hadoop was"; // Works okay
  String fieldName = SOME_FIELD_NAME;
  SpanNearQuery spanNear = new SpanNearQuery(new SpanQuery[] {
      new SpanTermQuery(new Term(fieldName, "lucene")),
      new SpanTermQuery(new Term(fieldName, "doug")) }, 5, true);
  Query query = new SpanNearQuery(new SpanQuery[] { spanNear,
      new SpanTermQuery(new Term(fieldName, "hadoop")) }, 4, true);
  String expected = "The <B>Lucene</B> was made by <B>Doug</B> Cutting and Lucene great <B>Hadoop</B> was";
  //String expected = "The <B>Lucene</B> was made by <B>Doug</B> Cutting and the great <B>Hadoop</B> was";
  String observed = highlightField(query, fieldName, theText);
  System.out.println("Expected: " + expected + "\n" + "Observed: " + observed);
  assertEquals("Why is that second instance of the term \"Lucene\" highlighted?",
      expected, observed);
}
{code}

Is this an issue that's arisen before? I've been reading through the source to QueryScorer, WeightedSpanTerm, WeightedSpanTermExtractor, Spans, and NearSpansOrdered, but haven't found the solution yet. Initially, I thought that the extractWeightedSpanTerms method in WeightedSpanTermExtractor should be called on each clause of a SpanNearQuery or SpanOrQuery, but that didn't get me too far.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2159) Tool to expand the index for perf/stress testing.
[ https://issues.apache.org/jira/browse/LUCENE-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12856916#action_12856916 ] Mark Miller commented on LUCENE-2159:

There is an excellent section on it in LIA2 :)

Tool to expand the index for perf/stress testing.
-------------------------------------------------
Key: LUCENE-2159
URL: https://issues.apache.org/jira/browse/LUCENE-2159
Project: Lucene - Java
Issue Type: New Feature
Components: contrib/*
Affects Versions: 3.0
Reporter: John Wang
Attachments: ExpandIndex.java

Sometimes it is useful to take a small-ish index and expand it into a large index with K segments for perf/stress testing. This tool does that. See attached class.
[jira] Commented: (LUCENE-2393) Utility to output total term frequency and df from a lucene index
[ https://issues.apache.org/jira/browse/LUCENE-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12857155#action_12857155 ] Mark Miller commented on LUCENE-2393:

Perhaps this should be combined with the high freq terms tool ... we could make a ton of these little guys, so prob best to consolidate them.

Utility to output total term frequency and df from a lucene index
-----------------------------------------------------------------
Key: LUCENE-2393
URL: https://issues.apache.org/jira/browse/LUCENE-2393
Project: Lucene - Java
Issue Type: New Feature
Components: contrib/*
Reporter: Tom Burton-West
Priority: Trivial
Attachments: LUCENE-2393.patch

This is a command line utility that takes a field name, term, and index directory and outputs the document frequency for the term and the total number of occurrences of the term in the index (i.e. the sum of the tf of the term for each document). It is useful for estimating the size of the term's entry in the *.prx files and the consequent disk I/O demands.
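The two numbers the utility reports - df (how many documents contain the term) and total term frequency (the sum of per-document tf) - can be sketched in plain Java. This is only an illustration of the arithmetic, with a hypothetical in-memory postings map standing in for the real index:

```java
import java.util.HashMap;
import java.util.Map;

public class TermStats {
    // Hypothetical postings for one term: docId -> tf of the term in that doc.
    // Docs that don't contain the term are simply absent from the map.

    // df = number of documents containing the term
    public static int docFreq(Map<Integer, Integer> postings) {
        return postings.size();
    }

    // total term freq = sum of tf across all documents containing the term
    public static long totalTermFreq(Map<Integer, Integer> postings) {
        long total = 0;
        for (int tf : postings.values()) {
            total += tf;
        }
        return total;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> postings = new HashMap<Integer, Integer>();
        postings.put(1, 3); // doc 1 contains the term 3 times
        postings.put(7, 2); // doc 7 contains the term 2 times
        System.out.println("df=" + docFreq(postings)
                + " totalTermFreq=" + totalTermFreq(postings));
    }
}
```

The real tool would read these values from the index's term dictionary and positions data rather than a map.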
[jira] Commented: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory
[ https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12855732#action_12855732 ] Mark Miller commented on LUCENE-2386:

Is this change worth it with all of its repercussions? What are the upsides? There do appear to be downsides...

IndexWriter commits unnecessarily on fresh Directory
----------------------------------------------------
Key: LUCENE-2386
URL: https://issues.apache.org/jira/browse/LUCENE-2386
Project: Lucene - Java
Issue Type: Bug
Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
Fix For: 3.1
Attachments: LUCENE-2386.patch, LUCENE-2386.patch, LUCENE-2386.patch

I've noticed IndexWriter's ctor commits a first commit (an empty one) if a fresh Directory is passed, w/ OpenMode.CREATE or CREATE_OR_APPEND. This seems unnecessary, and kind of brings back an autoCommit mode, in a strange way ... why do we need that commit? Do we really expect people to open an IndexReader on an empty Directory which they just passed to an IW w/ create=true? If they want, they can simply call commit() right away on the IW they created. I ran into this when writing a test which committed N times, then compared the number of commits (via IndexReader.listCommits) and was surprised to see N+1 commits. Tried to change doCommit to false in IW ctor, but it got IndexFileDeleter jumping on me .. so the change might not be that simple. But I think it's manageable, so I'll try to attack it (and IFD specifically!) back :).
[jira] Commented: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory
[ https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12855740#action_12855740 ] Mark Miller commented on LUCENE-2386:

{quote}I do think this is a good change - IW was previously inconsistent, first that it would even make a commit when we no longer have an autoCommit=true, and, second, that it would not make the commit for a directory that already had an index (we fixed this case a while back). So I like that this fix makes IW's init behavior more consistent / simpler.{quote}

That's not a very strong argument for a back compat break on a minor release though...

IndexWriter commits unnecessarily on fresh Directory
----------------------------------------------------
Key: LUCENE-2386
URL: https://issues.apache.org/jira/browse/LUCENE-2386
Project: Lucene - Java
Issue Type: Bug
Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
Fix For: 3.1
Attachments: LUCENE-2386.patch, LUCENE-2386.patch, LUCENE-2386.patch, LUCENE-2386.patch

I've noticed IndexWriter's ctor commits a first commit (an empty one) if a fresh Directory is passed, w/ OpenMode.CREATE or CREATE_OR_APPEND. This seems unnecessary, and kind of brings back an autoCommit mode, in a strange way ... why do we need that commit? Do we really expect people to open an IndexReader on an empty Directory which they just passed to an IW w/ create=true? If they want, they can simply call commit() right away on the IW they created. I ran into this when writing a test which committed N times, then compared the number of commits (via IndexReader.listCommits) and was surprised to see N+1 commits. Tried to change doCommit to false in IW ctor, but it got IndexFileDeleter jumping on me .. so the change might not be that simple. But I think it's manageable, so I'll try to attack it (and IFD specifically!) back :).
[jira] Commented: (LUCENE-2386) IndexWriter commits unnecessarily on fresh Directory
[ https://issues.apache.org/jira/browse/LUCENE-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12855748#action_12855748 ] Mark Miller commented on LUCENE-2386:

bq. Hmmm... I think the back compat break is very minor

Yes - it is - but so was the argument for it IMO. Your extended argument is more compelling though.

IndexWriter commits unnecessarily on fresh Directory
----------------------------------------------------
Key: LUCENE-2386
URL: https://issues.apache.org/jira/browse/LUCENE-2386
Project: Lucene - Java
Issue Type: Bug
Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
Fix For: 3.1
Attachments: LUCENE-2386.patch, LUCENE-2386.patch, LUCENE-2386.patch, LUCENE-2386.patch

I've noticed IndexWriter's ctor commits a first commit (an empty one) if a fresh Directory is passed, w/ OpenMode.CREATE or CREATE_OR_APPEND. This seems unnecessary, and kind of brings back an autoCommit mode, in a strange way ... why do we need that commit? Do we really expect people to open an IndexReader on an empty Directory which they just passed to an IW w/ create=true? If they want, they can simply call commit() right away on the IW they created. I ran into this when writing a test which committed N times, then compared the number of commits (via IndexReader.listCommits) and was surprised to see N+1 commits. Tried to change doCommit to false in IW ctor, but it got IndexFileDeleter jumping on me .. so the change might not be that simple. But I think it's manageable, so I'll try to attack it (and IFD specifically!) back :).
[jira] Created: (LUCENE-2391) Spellchecker uses default IW mergefactor/ramMB settings of 300/10
Spellchecker uses default IW mergefactor/ramMB settings of 300/10
-----------------------------------------------------------------
Key: LUCENE-2391
URL: https://issues.apache.org/jira/browse/LUCENE-2391
Project: Lucene - Java
Issue Type: Improvement
Components: contrib/spellchecker
Reporter: Mark Miller
Priority: Trivial

These settings seem odd - I'd like to investigate what makes most sense here.
[jira] Commented: (LUCENE-2372) Replace deprecated TermAttribute by new CharTermAttribute
[ https://issues.apache.org/jira/browse/LUCENE-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12855489#action_12855489 ] Mark Miller commented on LUCENE-2372:

bq. If I make it final and

+1 - let's just remember to add these breaks to the CHANGES BW break section...

Replace deprecated TermAttribute by new CharTermAttribute
---------------------------------------------------------
Key: LUCENE-2372
URL: https://issues.apache.org/jira/browse/LUCENE-2372
Project: Lucene - Java
Issue Type: Improvement
Affects Versions: 3.1
Reporter: Uwe Schindler
Fix For: 3.1
Attachments: LUCENE-2372.patch, LUCENE-2372.patch, LUCENE-2372.patch

After LUCENE-2302 is merged to trunk with flex, we need to carry over all tokenizers and consumers of the TokenStreams to the new CharTermAttribute. We should also think about adding an AttributeFactory that creates a subclass of CharTermAttributeImpl that returns collation keys in the toBytesRef() accessor. CollationKeyFilter is then obsolete; instead you can simply convert every TokenStream to indexing only CollationKeys by changing the attribute implementation.
[jira] Commented: (LUCENE-2074) Use a separate JFlex generated Unicode 4 by Java 5 compatible StandardTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12854899#action_12854899 ] Mark Miller commented on LUCENE-2074:

{quote}Uwe, must this be coupled with that issue? This one waits for a long time (why? for JFlex 1.5 release?) and protecting against a huge buffer allocation can be a real quick and tiny fix. And this one also focuses on getting Unicode 5 to work, which is unrelated to the buffer size. But the buffer size is not a critical issue either that we need to move fast with it ... so it's your call. Just thought they are two unrelated problems.{quote}

Agreed. Whether it's fixed as part of this commit or not, it really deserves its own issue anyway, for changes and tracking. It has nothing to do with this issue other than convenience.

Use a separate JFlex generated Unicode 4 by Java 5 compatible StandardTokenizer
-------------------------------------------------------------------------------
Key: LUCENE-2074
URL: https://issues.apache.org/jira/browse/LUCENE-2074
Project: Lucene - Java
Issue Type: Bug
Affects Versions: 3.0
Reporter: Uwe Schindler
Assignee: Uwe Schindler
Fix For: 3.1
Attachments: jflex-1.4.1-vs-1.5-snapshot.diff, jflexwarning.patch, LUCENE-2074-lucene30.patch, LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch

The current trunk version of StandardTokenizerImpl was generated by Java 1.4 (according to the warning). In Lucene 3.0 we switch to Java 1.5, so we should regenerate the file. After regeneration the Tokenizer behaves differently for some characters. Because of that we should only use the new TokenizerImpl when Version.LUCENE_30 or LUCENE_31 is used as matchVersion.
[jira] Commented: (LUCENE-1895) Point2D defines equals by comparing double types with ==
[ https://issues.apache.org/jira/browse/LUCENE-1895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12853402#action_12853402 ] Mark Miller commented on LUCENE-1895:

I put this up not knowing really anything about the specific use case(s) of the Point2D class - I have never used Spatial - so close if it makes sense to do so. My generic worry is that you can come to the *same* double value in two different ways, but == will not find them to be equal.

Point2D defines equals by comparing double types with ==
--------------------------------------------------------
Key: LUCENE-1895
URL: https://issues.apache.org/jira/browse/LUCENE-1895
Project: Lucene - Java
Issue Type: Bug
Components: contrib/spatial
Reporter: Mark Miller
Assignee: Chris Male
Priority: Trivial

Ideally, this should allow for a margin of error, right?
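The worry above - that two computations can arrive at the same logical double value yet compare unequal under == - is the usual argument for an epsilon-based equals. A minimal sketch of the idea (the class name mirrors the issue's subject, but the tolerance and everything else here are hypothetical, not the contrib/spatial source):

```java
// Hypothetical epsilon-tolerant Point2D; illustrates comparing doubles
// within a margin of error instead of with ==.
public class Point2D {
    private static final double EPSILON = 1e-9; // assumed tolerance

    private final double x;
    private final double y;

    public Point2D(double x, double y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public boolean equals(Object other) {
        if (!(other instanceof Point2D)) return false;
        Point2D p = (Point2D) other;
        // Equal if both coordinates agree within the tolerance
        return Math.abs(x - p.x) < EPSILON && Math.abs(y - p.y) < EPSILON;
    }

    @Override
    public int hashCode() {
        // NOTE: an epsilon-based equals cannot be made fully consistent with
        // hashCode (nearby-but-unequal points may need equal hashes); this is
        // one reason the actual use case matters before making such a change.
        return 0;
    }

    public static void main(String[] args) {
        double a = 0.1 + 0.2; // not bitwise-equal to 0.3
        double b = 0.3;
        System.out.println(a == b);                                      // false
        System.out.println(new Point2D(a, a).equals(new Point2D(b, b))); // true
    }
}
```

The classic `0.1 + 0.2 != 0.3` case in main shows exactly the failure mode the comment describes: the same logical value reached two different ways.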
[jira] Commented: (LUCENE-1709) Parallelize Tests
[ https://issues.apache.org/jira/browse/LUCENE-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12848712#action_12848712 ] Mark Miller commented on LUCENE-1709:

+1 on removing those flags - personally I find them unnecessary - and they complicate the build. And I would love to see Lucene parallel like Solr now.

Parallelize Tests
-----------------
Key: LUCENE-1709
URL: https://issues.apache.org/jira/browse/LUCENE-1709
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Affects Versions: 2.4.1
Reporter: Jason Rutherglen
Fix For: 3.1
Attachments: LUCENE-1709.patch, runLuceneTests.py
Original Estimate: 48h
Remaining Estimate: 48h

The Lucene tests can be parallelized to make for a faster testing system. This task from ANT can be used: http://ant.apache.org/manual/CoreTasks/parallel.html Previous discussion: http://www.gossamer-threads.com/lists/lucene/java-dev/69669

Notes from Mike M.:
{quote}
I'd love to see a clean solution here (the tests are embarrassingly parallelizable, and we all have machines with good concurrency these days)... I have a rather hacked up solution now, that uses -Dtestpackage=XXX to split the tests up. Ideally I would be able to say use N threads and it'd do the right thing... like the -j flag to make.
{quote}
[jira] Commented: (LUCENE-1814) Some Lucene tests try and use a Junit Assert in new threads
[ https://issues.apache.org/jira/browse/LUCENE-1814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847931#action_12847931 ] Mark Miller commented on LUCENE-1814:

Chris Male mentioned to me that he thinks Uwe has fixed this?

Some Lucene tests try and use a Junit Assert in new threads
-----------------------------------------------------------
Key: LUCENE-1814
URL: https://issues.apache.org/jira/browse/LUCENE-1814
Project: Lucene - Java
Issue Type: Bug
Reporter: Mark Miller
Priority: Minor

There are a few cases in Lucene tests where JUnit Asserts are used inside a new thread's run method - this won't work because JUnit throws an exception when a call to Assert fails - that will kill the thread, but the exception will not propagate to JUnit - so unless a failure is caused later by the thread termination, the Asserts are invalid.

TestThreadSafe
TestStressIndexing2
TestStringIntern
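The failure mode described above - an assertion error thrown inside a spawned thread dies with that thread instead of failing the test - is usually worked around by capturing the Throwable in the worker and rethrowing it from the test's main thread. A generic plain-Java sketch (not the actual fix in the listed tests):

```java
// Sketch: run work on a worker thread, capture any failure, and rethrow it
// in the calling (e.g. JUnit) thread so the test actually fails.
public class ThreadAssertDemo {
    public static void runAndRethrow(final Runnable work) throws Throwable {
        final Throwable[] failure = new Throwable[1];
        Thread t = new Thread(new Runnable() {
            public void run() {
                try {
                    work.run();
                } catch (Throwable e) {
                    // Without this, the error just kills the worker thread
                    // silently and the test framework never sees it.
                    failure[0] = e;
                }
            }
        });
        t.start();
        t.join();
        if (failure[0] != null) {
            throw failure[0]; // surface the failure in the caller's thread
        }
    }
}
```

A test would wrap its threaded assertions in `runAndRethrow`, so a failing assertion inside the worker propagates to the JUnit thread instead of vanishing.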
[jira] Commented: (LUCENE-2305) Introduce Version in more places long before 4.0
[ https://issues.apache.org/jira/browse/LUCENE-2305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846542#action_12846542 ] Mark Miller commented on LUCENE-2305:

Ah, yes - I didn't remember your comment right:

{quote}
We could make the change under Version? (Change to true, starting in 3.1). Or maybe not make the change. If set to true, we use pct deletion on a segment to reduce its perceived size when selecting merges, which generally causes segments with pending deletions to be merged away sooner
{quote}

Sounds like a good move.

Introduce Version in more places long before 4.0
------------------------------------------------
Key: LUCENE-2305
URL: https://issues.apache.org/jira/browse/LUCENE-2305
Project: Lucene - Java
Issue Type: Improvement
Reporter: Shai Erera
Fix For: 3.1

We need to introduce Version in as many places as we can (wherever it makes sense of course), and preferably long before 4.0 (or shall I say 3.9?) is out. That way, we can have a bunch of deprecated API now, that will be gone in 4.0, rather than doing it one class at a time and never finishing :). The purpose is to introduce Version wherever it is mandatory now, and also in places where we think it might be useful in the future (like most of our Analyzers, configured classes and configuration classes). I marked this issue for 3.1, though I don't expect it to end in 3.1. I still think it will be done one step at a time, perhaps for clusters of classes together. But on the other hand I don't want to mark it for 4.0.0 because that needs to be resolved much sooner. So if I had a 3.9 version defined, I'd mark it for 3.9. We can do several commits in one issue, right? So this one can live for a while in JIRA, while we gradually convert more and more classes.

The first candidate is InstantiatedIndexWriter, which probably should take an IndexWriterConfig. While I converted the code to use IWC, I've noticed Instantiated defaults its maxFieldLength to the current default (10,000) which is deprecated. I couldn't change it for back-compat reasons. But we can upgrade it to accept IWC, and set to unlimited if the version is onOrAfter 3.1, otherwise stay w/ the deprecated default. If it's acceptable to have several commits in one issue, I can start w/ Instantiated, post a patch and then we can continue to more classes.
[jira] Commented: (LUCENE-2320) Add MergePolicy to IndexWriterConfig
[ https://issues.apache.org/jira/browse/LUCENE-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846622#action_12846622 ] Mark Miller commented on LUCENE-2320:

+1 - I've had to do this in the past too. Just dropping tests doesn't seem like the way to go in many cases.

Add MergePolicy to IndexWriterConfig
------------------------------------
Key: LUCENE-2320
URL: https://issues.apache.org/jira/browse/LUCENE-2320
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Reporter: Shai Erera
Assignee: Michael McCandless
Fix For: 3.1
Attachments: LUCENE-2320.patch, LUCENE-2320.patch, LUCENE-2320.patch, LUCENE-2320.patch

Now that IndexWriterConfig is in place, I'd like to move MergePolicy to it as well. The change is not straightforward and so I've kept it for a separate issue. MergePolicy requires in its ctor an IndexWriter, however none can be passed to it before an IndexWriter actually exists. And today IW may create an MP just for it to be overridden by the application one line afterwards. I don't want to make the iw member of MP non-final, or settable by extending classes, however it needs to remain protected so they can access it directly. So the proposed changes are:

* Add a SetOnce object (to o.a.l.util), or Immutable, which can only be set once (hence its name). It'll have the signature SetOnce<T> w/ *synchronized set(T)* and *T get()*. T will be declared volatile, so that get() won't be synchronized.
* MP will define a *protected final SetOnce<IndexWriter> writer* instead of the current writer. *NOTE: this is a bw break*. Any suggestions are welcomed.
* MP will offer a public default ctor, together with a set(IndexWriter).
* IndexWriter will set itself on MP using set(this). Note that if set will be called more than once, it will throw an exception (AlreadySetException - or does someone have a better suggestion, preferably an already existing Java exception?).

That's the core idea. I'd like to post a patch soon, so I'd appreciate your review and proposals.
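The SetOnce shape proposed above fits in a few lines. This is a guess at the described design (volatile field, synchronized set, unsynchronized get, AlreadySetException on a second set), not the actual patch:

```java
// Sketch of the proposed SetOnce<T>: set() may succeed exactly once;
// get() is unsynchronized because the field is volatile.
public class SetOnce<T> {
    // Thrown on a second set() call; the issue asks whether an existing
    // JDK exception would fit - IllegalStateException is one candidate.
    public static class AlreadySetException extends RuntimeException {
        public AlreadySetException() {
            super("The object cannot be set twice!");
        }
    }

    private volatile T obj;
    private boolean set; // only touched inside the synchronized set()

    public final synchronized void set(T obj) {
        if (set) {
            throw new AlreadySetException();
        }
        this.obj = obj;
        set = true;
    }

    public final T get() {
        return obj; // no lock needed: obj is volatile
    }
}
```

With this, MergePolicy could hold a `protected final SetOnce<IndexWriter> writer`, IndexWriter would call `writer.set(this)` once, and any second set would throw.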
[jira] Commented: (LUCENE-2323) reorganize contrib modules
[ https://issues.apache.org/jira/browse/LUCENE-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846711#action_12846711 ] Mark Miller commented on LUCENE-2323:

This reorg is a great step for contrib IMO! +1

reorganize contrib modules
--------------------------
Key: LUCENE-2323
URL: https://issues.apache.org/jira/browse/LUCENE-2323
Project: Lucene - Java
Issue Type: Improvement
Components: contrib/*
Reporter: Robert Muir

it would be nice to reorganize contrib modules, so that they are bundled together by functionality. For example:

* the wikipedia contrib is a tokenizer, i think it really belongs in contrib/analyzers
* there are two highlighters, i think they could be one highlighters package.
* there are many queryparsers and queries in different places in contrib
[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers
[ https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844516#action_12844516 ] Mark Miller commented on LUCENE-2309:

bq. Also IRC is not logged/archived and searchable (I think?) which makes it impossible to trace back a discussion, and/or randomly stumble upon it in Google.

Apache's rule is, if it didn't happen on the lists, it didn't happen. IRC is a great way for people to communicate and hash stuff out, but it's not necessary that you follow it. If you have questions or want further elaboration, just ask. No one can expect you to follow IRC, nor is it a valid reference for where something was decided. IRC is great - I think it has really benefited having devs discuss there - but the official position is, if it didn't happen on the list, it didn't actually happen.

Fully decouple IndexWriter from analyzers
-----------------------------------------
Key: LUCENE-2309
URL: https://issues.apache.org/jira/browse/LUCENE-2309
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Reporter: Michael McCandless

IndexWriter only needs an AttributeSource to do indexing. Yet, today, it interacts with Field instances, holds a private analyzer, invokes analyzer.reusableTokenStream, and has to deal with a wide variety of cases (it's not analyzed; it is analyzed but it's a Reader, String; it's pre-analyzed). I'd like to have IW only interact with attr sources that already arrived with the fields. This would be a powerful decoupling -- it means others are free to make their own attr sources. They need not even use any of Lucene's analysis impls; eg they can integrate to other things like [OpenPipeline|http://www.openpipeline.org]. Or make something completely custom. LUCENE-2302 is already a big step towards this: it makes IW agnostic about which attr is the term, and only requires that it provide a BytesRef (for flex). Then I think LUCENE-2308 would get us most of the remaining way -- ie, if the FieldType knows the analyzer to use, then we could simply create a getAttrSource() method (say) on it and move all the logic IW has today onto there. (We'd still need existing IW code for back-compat.)
[jira] Commented: (LUCENE-2294) Create IndexWriterConfiguration and store all of IW configuration there
[ https://issues.apache.org/jira/browse/LUCENE-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12843717#action_12843717 ] Mark Miller commented on LUCENE-2294:

bq. If we say Analyzer is mandatory, what will stop us tomorrow from saying IndexDeletionPolicy is mandatory?

Nothing ;) But I think Analyzer should be mandatory and that IndexDeletionPolicy should not be mandatory, looking at them case by case.

Create IndexWriterConfiguration and store all of IW configuration there
-----------------------------------------------------------------------
Key: LUCENE-2294
URL: https://issues.apache.org/jira/browse/LUCENE-2294
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Reporter: Shai Erera
Assignee: Michael McCandless
Fix For: 3.1
Attachments: LUCENE-2294.patch, LUCENE-2294.patch, LUCENE-2294.patch

I would like to factor out all of IW's configuration parameters into a single configuration class, which I propose to name IndexWriterConfiguration (or IndexWriterConfig). I want to store there almost everything besides the Directory, and to reduce all the ctors down to one: IndexWriter(Directory, IndexWriterConfiguration). What I was thinking of storing there are the following parameters:

* All of the ctors' parameters, except for Directory.
* The different setters where it makes sense. For example I still think infoStream should be set on IW directly.

I'm thinking that IWC should expose everything in setter/getter methods, and default to whatever IW defaults to today. Except for Analyzer, which will need to be defined in the ctor of IWC and won't have a setter. I am not sure why MaxFieldLength is required in all IW ctors, yet IW declares a DEFAULT (which is an int and not MaxFieldLength). Do we still think that 10,000 should be the default? Why not default to UNLIMITED and otherwise let the application decide what LIMITED means for it? I would like to make MFL optional on IWC and default to something, and I hope that default will be UNLIMITED. We can document that on IWC, so that if anyone chooses to move to the new API, he should be aware of that ...

I plan to deprecate all the ctors and getters/setters and replace them by:

* One ctor as described above
* getIndexWriterConfiguration, or simply getConfig, which can then be queried for the setting of interest.
* About the setters, I think maybe we can just introduce a setConfig method which will override everything that is overridable today, except for Analyzer. So someone could do iw.getConfig().setSomething(); iw.setConfig(newConfig);
** The setters on IWC can return an IWC to allow chaining set calls ... so the above will turn into iw.setConfig(iw.getConfig().setSomething1().setSomething2());

BTW, this is needed for Parallel Indexing (see LUCENE-1879), but I think it will greatly simplify IW's API. I'll start to work on a patch.
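The chained-setter idea above (setters on IWC returning the IWC itself) looks like this in miniature; the class and field names here are made up for illustration, not the eventual IndexWriterConfig API:

```java
// Illustration of setters that return 'this' so calls can be chained,
// as proposed for the config object. Names and defaults are hypothetical.
public class WriterConfig {
    private int maxFieldLength = Integer.MAX_VALUE; // an "UNLIMITED" default
    private double ramBufferSizeMB = 16.0;          // assumed default

    public WriterConfig setMaxFieldLength(int maxFieldLength) {
        this.maxFieldLength = maxFieldLength;
        return this; // returning 'this' is what enables chaining
    }

    public WriterConfig setRAMBufferSizeMB(double mb) {
        this.ramBufferSizeMB = mb;
        return this;
    }

    public int getMaxFieldLength() { return maxFieldLength; }
    public double getRAMBufferSizeMB() { return ramBufferSizeMB; }
}
```

A caller can then configure everything in one expression, e.g. `new WriterConfig().setMaxFieldLength(10000).setRAMBufferSizeMB(32.0)`, which is exactly the `iw.setConfig(iw.getConfig().setSomething1().setSomething2())` pattern described.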
[jira] Commented: (LUCENE-2294) Create IndexWriterConfiguration and store all of IW configuration there
[ https://issues.apache.org/jira/browse/LUCENE-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12843729#action_12843729 ] Mark Miller commented on LUCENE-2294:

bq. Question - does SOLR require everyone to specify an Analyzer, or does it come w/ a default one?

Hmm... Solr doesn't really use Lucene analyzers directly. It comes with a default schema.xml that defines FieldTypes, and field names can then be assigned to FieldTypes. So technically speaking, no, Solr does not - but because most people build off the example, you could say it does have defaults: example FieldTypes and default mappings of field names to those types. But it also only accepts certain example fields with the example schema - you really have to go in and customize it to your needs; it's set up basically to show off what options are available and to work with some demo content. In a way Solr comes with almost no defaults - but it does ship with an example setup that is meant to show you how to set things up and what is available. You could consider those defaults, since most will build off it.
example of Solr analyzer declaration:

{code}
<!-- A general unstemmed text field - good if one does not know the language of the field -->
<fieldType name="textgen" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
{code}
[jira] Commented: (LUCENE-2294) Create IndexWriterConfiguration and store all of IW configuration there
[ https://issues.apache.org/jira/browse/LUCENE-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12843756#action_12843756 ] Mark Miller commented on LUCENE-2294: - I'm assuming you would set an Analyzer for the document - and then you could override per field - or something along those lines.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12843331#action_12843331 ] Mark Miller commented on LUCENE-2089: - Sweet! explore using automaton for fuzzyquery -- Key: LUCENE-2089 URL: https://issues.apache.org/jira/browse/LUCENE-2089 Project: Lucene - Java Issue Type: Improvement Components: Search Affects Versions: Flex Branch Reporter: Robert Muir Assignee: Robert Muir Priority: Minor Fix For: Flex Branch Attachments: ContrivedFuzzyBenchmark.java, createLevAutomata.py, gen.py, gen.py, gen.py, gen.py, gen.py, gen.py, Lev2ParametricDescription.java, Lev2ParametricDescription.java, Lev2ParametricDescription.java, Lev2ParametricDescription.java, LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089_concat.patch, Moman-0.2.1.tar.gz, moman-57f5dc9dd0e7.diff, TestFuzzy.java we can optimize fuzzyquery by using AutomatonTermsEnum. The idea is to speed up the core FuzzyQuery in similar fashion to Wildcard and Regex speedups, maintaining all backwards compatibility. The advantages are: * we can seek to terms that are useful, instead of brute-forcing the entire terms dict * we can determine matches faster, as true/false from a DFA is array lookup, don't even need to run levenshtein. We build Levenshtein DFAs in linear time with respect to the length of the word: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652 To implement support for 'prefix' length, we simply concatenate two DFAs, which doesn't require us to do NFA-DFA conversion, as the prefix portion is a singleton. the concatenation is also constant time with respect to the size of the fuzzy DFA, it only need examine its start state. 
With this algorithm, parametric tables are precomputed so that DFAs can be constructed very quickly. If the required number of edits is too large (we don't have a table for it), we use dumb mode at first (no seeking, no DFA, just brute force like now). As the priority queue fills up during enumeration, the similarity score required to be a competitive term increases, so the enum gets faster and faster as this happens. This is because terms in core FuzzyQuery are sorted by boost value, then by term (in lexicographic order). For a large term dictionary with a low minimal similarity, you will fill the pq very quickly since you will match many terms. This not only provides a mechanism to switch to more efficient DFAs (edit distance of 2 -> edit distance of 1 -> edit distance of 0) during enumeration, but also to switch from dumb mode to smart mode. With this design, we can add more DFAs at any time by adding additional tables. The tradeoff is that the tables get rather large, so for very high K, we would start to increase the size of Lucene's jar file. The idea is that we don't have to include large tables for very high K, by using the 'competitive boost' attribute of the priority queue. For more information, see http://en.wikipedia.org/wiki/Levenshtein_automaton -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
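The "dumb mode" fallback described above can be sketched as a plain bounded-edit-distance check. This is an illustration of what brute-force enumeration costs per term (and what the precomputed Levenshtein DFA avoids), not Lucene's actual implementation; the class and method names here are hypothetical.

```java
// Sketch of "dumb mode": test every candidate term with a full Levenshtein
// computation. The Levenshtein DFA replaces this O(|a|*|b|) DP per term with
// a per-character table lookup, and additionally lets the enum seek past
// whole runs of non-matching terms.
public class EditDistanceSketch {

    // Classic two-row dynamic-programming Levenshtein distance.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1,  // insertion
                                           prev[j] + 1),    // deletion
                                  prev[j - 1] + cost);      // substitution
            }
            int[] t = prev; prev = cur; cur = t;
        }
        return prev[b.length()];
    }

    // A term matches a fuzzy query on 'target' with k edits iff distance <= k.
    static boolean withinEdits(String term, String target, int k) {
        return levenshtein(term, target) <= k;
    }

    public static void main(String[] args) {
        System.out.println(withinEdits("lucene", "lucene", 0)); // true
        System.out.println(withinEdits("lucine", "lucene", 1)); // true
        System.out.println(withinEdits("hadoop", "lucene", 2)); // false
    }
}
```

As the priority queue raises the competitive boost, the effective k shrinks, which is exactly what makes switching to a smaller (or smarter) matcher mid-enumeration pay off.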
[jira] Commented: (LUCENE-2294) Create IndexWriterConfiguration and store all of IW configuration there
[ https://issues.apache.org/jira/browse/LUCENE-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12841249#action_12841249 ] Mark Miller commented on LUCENE-2294: - I can see the value in this - there are a bunch of IW constructors - but personally I still think I prefer them. Creating config classes to init another class is its own pain in the butt. Reminds me of Windows C programming and structs. When I'm just coding away, it's so much easier to just enter the params in the constructor. And it seems like it would be more difficult to know what's *required* to set on the config class - without the same constructor business ...

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Issue Comment Edited: (LUCENE-2294) Create IndexWriterConfiguration and store all of IW configuration there
[ https://issues.apache.org/jira/browse/LUCENE-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12841249#action_12841249 ] Mark Miller edited comment on LUCENE-2294 at 3/4/10 1:45 PM: - I can see the value in this - there are a bunch of IW constructors - but personally I still think I prefer them. Creating config classes to init another class is its own pain in the butt. Reminds me of Windows C programming and structs. When I'm just coding away, it's so much easier to just enter the params in the constructor. And it seems like it would be more difficult to know what's *required* to set on the config class - without the same constructor business ...

*edit* Though I suppose the chaining *does* make this more swallowable... new IW(new IWConfig(Analyzer).set().set().set()) isn't really so bad ...

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2287) Unexpected terms are highlighted within nested SpanQuery instances
[ https://issues.apache.org/jira/browse/LUCENE-2287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12839744#action_12839744 ] Mark Miller commented on LUCENE-2287: - bq. Breaks backward compatibility, so need to find a way around that Wouldn't be the end of the world depending on the break. Unexpected terms are highlighted within nested SpanQuery instances -- Key: LUCENE-2287 URL: https://issues.apache.org/jira/browse/LUCENE-2287 Project: Lucene - Java Issue Type: Improvement Components: contrib/highlighter Affects Versions: 2.9.1 Environment: Linux, Solaris, Windows Reporter: Michael Goddard Priority: Minor Attachments: LUCENE-2287.patch, LUCENE-2287.patch, LUCENE-2287.patch, LUCENE-2287.patch, LUCENE-2287.patch Original Estimate: 336h Remaining Estimate: 336h I haven't yet been able to resolve why I'm seeing spurious highlighting in nested SpanQuery instances. Briefly, the issue is illustrated by the second instance of Lucene being highlighted in the test below, when it doesn't satisfy the inner span. There's been some discussion about this on the java-dev list, and I'm opening this issue now because I have made some initial progress on this. 
This new test, added to the HighlighterTest class in lucene_2_9_1, illustrates this:

{code}
/*
 * Ref: http://www.lucidimagination.com/blog/2009/07/18/the-spanquery/
 */
public void testHighlightingNestedSpans2() throws Exception {
  String theText = "The Lucene was made by Doug Cutting and Lucene great Hadoop was"; // Problem
  //String theText = "The Lucene was made by Doug Cutting and the great Hadoop was"; // Works okay
  String fieldName = "SOME_FIELD_NAME";
  SpanNearQuery spanNear = new SpanNearQuery(new SpanQuery[] {
      new SpanTermQuery(new Term(fieldName, "lucene")),
      new SpanTermQuery(new Term(fieldName, "doug")) }, 5, true);
  Query query = new SpanNearQuery(new SpanQuery[] { spanNear,
      new SpanTermQuery(new Term(fieldName, "hadoop")) }, 4, true);
  String expected = "The <B>Lucene</B> was made by <B>Doug</B> Cutting and Lucene great <B>Hadoop</B> was";
  //String expected = "The <B>Lucene</B> was made by <B>Doug</B> Cutting and the great <B>Hadoop</B> was";
  String observed = highlightField(query, fieldName, theText);
  System.out.println("Expected: \"" + expected + "\"\nObserved: \"" + observed + "\"");
  assertEquals("Why is that second instance of the term \"Lucene\" highlighted?",
      expected, observed);
}
{code}

Is this an issue that's arisen before? I've been reading through the source to QueryScorer, WeightedSpanTerm, WeightedSpanTermExtractor, Spans, and NearSpansOrdered, but haven't found the solution yet. Initially, I thought that the extractWeightedSpanTerms method in WeightedSpanTermExtractor should be called on each clause of a SpanNearQuery or SpanOrQuery, but that didn't get me too far.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2226) move contrib/snowball to contrib/analyzers
[ https://issues.apache.org/jira/browse/LUCENE-2226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12801972#action_12801972 ] Mark Miller commented on LUCENE-2226: - Contribs back compat policy is that there is no back compat policy unless that contrib specifically states one. move contrib/snowball to contrib/analyzers -- Key: LUCENE-2226 URL: https://issues.apache.org/jira/browse/LUCENE-2226 Project: Lucene - Java Issue Type: Task Components: contrib/analyzers Reporter: Robert Muir Assignee: Robert Muir Fix For: 3.1 Attachments: LUCENE-2226.patch to fix bugs in some duplicate, handcoded impls of these stemmers (nl, fr, ru, etc) we should simply merge snowball and analyzers, and replace the buggy impls with the proper snowball stemfilters. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2226) move contrib/snowball to contrib/analyzers
[ https://issues.apache.org/jira/browse/LUCENE-2226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12802011#action_12802011 ] Mark Miller commented on LUCENE-2226: - {quote}Mark, that is my understanding too. I wasn't commenting on the policy but on the fact of the possible breakage. I think it is a courtesy to notify users of a change to which they might need to pay attention. I don't know that's spelled out in the policy, but I think it should be. Not that a lack of notice is a guarantee of no breakage but that a notice is a guarantee of breakage (at least under some circumstances).{quote} Right - I was just pointing out that jar drop-in is far from a requirement in contrib. We do always try and play nice anyway. bq. Is there any contrib that specifically states one? I couldn't find it. Don't think so - meaning there is no back compat policy in contrib - I think as a contrib matures, it's up to those working on it to decide that it's reached a state that deserves a policy of some kind. The Highlighter could probably use one at this point, but at the same time, nothing has created too much of an outcry so far. bq. The analysis/common is not clear as it has the Version stuff. Right - just because there is no policy doesn't mean we shouldn't make any attempts at back compat - but the issue you brought up is not something easily addressed, nor, I think, large enough to worry about with the proper warning in Changes. Users should be wary of contrib on upgrading - unless it presents a strong back compat policy. bq. But after all the dust settles and this i18n stuff is solid, I think it might be reasonable to make a stronger bw compat statement. I agree - now that contrib has been getting some much needed love recently, I think it should start heading towards some back compat promises - especially concerning analyzers. We already do tend to bend over backwards when we can anyway. 
I think we are on the same page - I'm just not very worried about the break you mention - I think it's a perfectly acceptable growing pain. And I think our back compat has been so weak because contrib has been a bit of a wasteland in the past - no one was willing to take ownership of a lot of this stuff - especially the language analyzers. That has changed recently. As the devs clean up and consolidate this stuff properly, I think we can work towards stronger promises in the future.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Resolved: (LUCENE-2035) TokenSources.getTokenStream() does not assign positionIncrement
[ https://issues.apache.org/jira/browse/LUCENE-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller resolved LUCENE-2035. - Resolution: Fixed Thanks Christopher! TokenSources.getTokenStream() does not assign positionIncrement --- Key: LUCENE-2035 URL: https://issues.apache.org/jira/browse/LUCENE-2035 Project: Lucene - Java Issue Type: Bug Components: contrib/highlighter Affects Versions: 2.4, 2.4.1, 2.9 Reporter: Christopher Morris Assignee: Mark Miller Fix For: 3.1 Attachments: LUCENE-2035.patch, LUCENE-2035.patch, LUCENE-2305.patch Original Estimate: 24h Remaining Estimate: 24h TokenSources.StoredTokenStream does not assign positionIncrement information. This means that all tokens in the stream are considered adjacent. This has implications for the phrase highlighting in QueryScorer when using non-contiguous tokens. For example: Consider a token stream that creates tokens for both the stemmed and unstemmed version of each word - the fox (jump|jumped) When retrieved from the index using TokenSources.getTokenStream(tpv,false), the token stream will be - the fox jump jumped Now try a search and highlight for the phrase query fox jumped. The search will correctly find the document; the highlighter will fail to highlight the phrase because it thinks that there is an additional word between fox and jumped. If we use the original (from the analyzer) token stream then the highlighter works. Also, consider the converse - the fox did not jump not is a stop word and there is an option to increment the position to account for stop words - (the,0) (fox,1) (did,2) (jump,4) When retrieved from the index using TokenSources.getTokenStream(tpv,false), the token stream will be - (the,0) (fox,1) (did,2) (jump,3). So the phrase query did jump will cause the did and jump terms in the text did not jump to be highlighted. If we use the original (from the analyzer) token stream then the highlighter works correctly. 
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
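The LUCENE-2035 bug above boils down to position arithmetic: a token's absolute position is the running sum of its positionIncrement values, so a synonym token needs an increment of 0 and a removed stop word leaves an increment greater than 1. A standalone sketch of that arithmetic (this models the stream as plain arrays; it is not the TokenSources API, and the helper names are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;

public class PositionSketch {

    // Turn (term, positionIncrement) pairs into "term@absolutePosition" strings.
    // An increment of 0 stacks a token on the previous position (synonym);
    // an increment of 2 leaves a one-position hole (removed stop word).
    static List<String> positions(String[] terms, int[] increments) {
        List<String> out = new ArrayList<>();
        int pos = -1; // so that a first increment of 1 yields position 0
        for (int i = 0; i < terms.length; i++) {
            pos += increments[i];
            out.add(terms[i] + "@" + pos);
        }
        return out;
    }

    public static void main(String[] args) {
        // "the fox (jump|jumped)": jumped is a synonym of jump, increment 0.
        System.out.println(positions(
                new String[] {"the", "fox", "jump", "jumped"},
                new int[]    { 1,     1,     1,      0 }));
        // => [the@0, fox@1, jump@2, jumped@2]  -> phrase "fox jumped" matches

        // What a stream that ignores increments (all 1) produces instead:
        // "jumped" lands at position 3, breaking the phrase "fox jumped".
        System.out.println(positions(
                new String[] {"the", "fox", "jump", "jumped"},
                new int[]    { 1,     1,     1,      1 }));
        // => [the@0, fox@1, jump@2, jumped@3]
    }
}
```

The same arithmetic explains the stop-word case: (the,0) (fox,1) (did,2) (jump,4) requires an increment of 2 before "jump", which the stored-token-vector reconstruction was flattening to 1.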
[jira] Updated: (LUCENE-860) site should call project Lucene Java, not just Lucene
[ https://issues.apache.org/jira/browse/LUCENE-860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller updated LUCENE-860: --- Attachment: LUCENE-860-1.patch updated patch that also includes doc site level changes site should call project Lucene Java, not just Lucene - Key: LUCENE-860 URL: https://issues.apache.org/jira/browse/LUCENE-860 Project: Lucene - Java Issue Type: Improvement Components: Website Reporter: Doug Cutting Assignee: Mark Miller Priority: Minor Fix For: 3.1 Attachments: LUCENE-860-1.patch, LUCENE-860-2.patch, LUCENE-860.patch To avoid confusion with the top-level Lucene project, the Lucene Java website should refer to itself as Lucene Java. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-860) site should call project Lucene Java, not just Lucene
[ https://issues.apache.org/jira/browse/LUCENE-860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller updated LUCENE-860: --- Attachment: LUCENE-860-2.patch

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2035) TokenSources.getTokenStream() does not assign positionIncrement
[ https://issues.apache.org/jira/browse/LUCENE-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12791939#action_12791939 ] Mark Miller commented on LUCENE-2035: - I'll commit this soon. 
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
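The failure mode described above can be sketched without Lucene at all: reconstruct token positions once honoring each token's positionIncrement, and once assuming every increment is 1 (which is effectively what the stored token stream does), then check phrase adjacency. A minimal sketch — all class and method names here are hypothetical illustration, not Lucene code:

```java
/**
 * Sketch (not Lucene code): why dropping positionIncrement breaks
 * phrase highlighting. Positions are reconstructed two ways:
 * (a) honoring each token's increment, (b) assuming every increment
 * is 1, which is what a stream lacking increments degenerates to.
 */
public class PositionIncrementSketch {

    // accumulate increments into absolute positions (first token at increment's offset)
    static int[] positions(String[] terms, int[] increments) {
        int[] pos = new int[terms.length];
        int p = -1;
        for (int i = 0; i < terms.length; i++) {
            p += increments[i];
            pos[i] = p;
        }
        return pos;
    }

    // an exact phrase "first second" matches only at adjacent positions
    static boolean phraseMatches(String[] terms, int[] pos, String first, String second) {
        for (int i = 0; i < terms.length; i++)
            for (int j = 0; j < terms.length; j++)
                if (terms[i].equals(first) && terms[j].equals(second)
                        && pos[j] == pos[i] + 1)
                    return true;
        return false;
    }

    public static void main(String[] args) {
        // "the fox did not jump" with the stop word "not" removed but the gap kept
        String[] terms = {"the", "fox", "did", "jump"};
        int[] realIncrements = {1, 1, 1, 2}; // (the,0) (fox,1) (did,2) (jump,4)
        int[] lostIncrements = {1, 1, 1, 1}; // (the,0) (fox,1) (did,2) (jump,3)

        int[] real = positions(terms, realIncrements);
        int[] lost = positions(terms, lostIncrements);

        // with real increments, "did jump" is correctly not adjacent
        System.out.println(phraseMatches(terms, real, "did", "jump"));
        // with lost increments, the phrase wrongly appears to match
        System.out.println(phraseMatches(terms, lost, "did", "jump"));
    }
}
```

With real increments the check prints false, with lost increments true — mirroring the spurious did/jump highlighting reported above.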
[jira] Updated: (LUCENE-1922) exposing the ability to get the number of unique term count per field
[ https://issues.apache.org/jira/browse/LUCENE-1922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller updated LUCENE-1922: Affects Version/s: (was: 2.4.1) Flex Branch exposing the ability to get the number of unique term count per field - Key: LUCENE-1922 URL: https://issues.apache.org/jira/browse/LUCENE-1922 Project: Lucene - Java Issue Type: New Feature Components: Index Affects Versions: Flex Branch Reporter: John Wang Add an api to get the number of unique term count given a field name, e.g.: IndexReader.getUniqueTermCount(String field) This issue has a dependency on LUCENE-1458 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2035) TokenSources.getTokenStream() does not assign positionIncrement
[ https://issues.apache.org/jira/browse/LUCENE-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12791680#action_12791680 ] Mark Miller commented on LUCENE-2035: - Hey Christopher, why are you going through the trouble of the custom collector to check that there are no hits? Why not just do a standard search? TokenSources.getTokenStream() does not assign positionIncrement --- Key: LUCENE-2035 URL: https://issues.apache.org/jira/browse/LUCENE-2035 Project: Lucene - Java Issue Type: Bug Components: contrib/highlighter Affects Versions: 2.4, 2.4.1, 2.9 Reporter: Christopher Morris Assignee: Mark Miller Fix For: 3.1 Attachments: LUCENE-2035.patch, LUCENE-2305.patch Original Estimate: 24h Remaining Estimate: 24h
[jira] Updated: (LUCENE-2035) TokenSources.getTokenStream() does not assign positionIncrement
[ https://issues.apache.org/jira/browse/LUCENE-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller updated LUCENE-2035: Attachment: LUCENE-2035.patch I've broken the new tests back out into their own file, changed the hit collector code to a basic search, and improved the test coverage of TokenSources a bit. TokenSources.getTokenStream() does not assign positionIncrement --- Key: LUCENE-2035 URL: https://issues.apache.org/jira/browse/LUCENE-2035 Project: Lucene - Java Issue Type: Bug Components: contrib/highlighter Affects Versions: 2.4, 2.4.1, 2.9 Reporter: Christopher Morris Assignee: Mark Miller Fix For: 3.1 Attachments: LUCENE-2035.patch, LUCENE-2035.patch, LUCENE-2305.patch Original Estimate: 24h Remaining Estimate: 24h
[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12790748#action_12790748 ] Mark Miller commented on LUCENE-2089: - Sorry Earwin - to be clear, we don't actually use chapter 6 - AutomataQuery needs the automata. You can get all the states just by taking the power set of the subsumption triangle for every base position, and then removing from each set any position that's subsumed by another. That's what I mean by brute force. But in the paper, they boil this down to nice little i-param tables, extracting some sort of pattern from that process. They give no hint on how they do this, or whether it's applicable to greater n's though. No big deal I guess - the computer can do the brute force method - but I wouldn't be surprised if it starts to bog down at much higher n's. explore using automaton for fuzzyquery -- Key: LUCENE-2089 URL: https://issues.apache.org/jira/browse/LUCENE-2089 Project: Lucene - Java Issue Type: Wish Components: Search Reporter: Robert Muir Assignee: Mark Miller Priority: Minor Attachments: LUCENE-2089.patch, Moman-0.2.1.tar.gz, TestFuzzy.java Mark brought this up on LUCENE-1606 (i will assign this to him, I know he is itching to write that nasty algorithm) we can optimize fuzzyquery by using AutomatonTermsEnum, here is my idea * up front, calculate the maximum required K edits needed to match the user's supplied float threshold. * for at least small common E up to some max K (1,2,3, etc) we should create a DFA for each E. if the required E is above our supported max, we use dumb mode at first (no seeking, no DFA, just brute force like now). As the pq fills, we swap progressively lower DFAs into the enum, based upon the lowest score in the pq. This should work well on avg, at high E, you will typically fill the pq very quickly since you will match many terms.
This not only provides a mechanism to switch to more efficient DFAs during enumeration, but also to switch from dumb mode to smart mode. i modified my wildcard benchmark to generate random fuzzy queries. * Pattern: 7N stands for NNN, etc. * AvgMS_DFA: this is the time spent creating the automaton (constructor) ||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA|| |7N|10|64.0|4155.9|38.6|20.3| |14N|10|0.0|2511.6|46.0|37.9| |28N|10|0.0|2506.3|93.0|86.6| |56N|10|0.0|2524.5|304.4|298.5| as you can see, this prototype is no good yet, because it creates the DFA in a slow way. right now it creates an NFA, and all this wasted time is in NFA-DFA conversion. So, for a very long string, it just gets worse and worse. This has nothing to do with lucene, and here you can see, the TermEnum is fast (AvgMS - AvgMS_DFA), there is no problem there. instead we should just build a DFA to begin with, maybe with this paper: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652 we can precompute the tables with that algorithm up to some reasonable K, and then I think we are ok. the paper references using http://portal.acm.org/citation.cfm?id=135907 for linear minimization, if someone wants to implement this they should not worry about minimization. in fact, we need to at some point determine if AutomatonQuery should even minimize FSM's at all, or if it is simply enough for them to be deterministic with no transitions to dead states. (The only code that actually assumes minimal DFA is the Dumb vs Smart heuristic and this can be rewritten as a summation easily). we need to benchmark really complex DFAs (i.e. write a regex benchmark) to figure out if minimization is even helping right now. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
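For reference, the "dumb mode" brute force described above amounts to a plain dynamic-programming Levenshtein check against a K derived from the user's similarity threshold. The maxEdits formula below is an assumption for the sketch (FuzzyQuery's actual scoring differs), and all names are illustrative:

```java
/**
 * Sketch of the brute-force ("dumb mode") fuzzy check: derive a maximum
 * edit count K from the similarity threshold, then test each term with
 * a classic DP Levenshtein distance. Assumption: K ~ (1 - threshold) * len,
 * which only approximates Lucene's real FuzzyQuery scoring.
 */
public class FuzzyBruteForce {

    static int maxEdits(double threshold, int queryLen) {
        return (int) ((1.0 - threshold) * queryLen);
    }

    // classic two-row dynamic-programming Levenshtein distance
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1),
                                  prev[j - 1] + cost);
            }
            int[] t = prev; prev = cur; cur = t;
        }
        return prev[b.length()];
    }

    static boolean matches(String query, String term, double threshold) {
        return levenshtein(query, term) <= maxEdits(threshold, query.length());
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("lucene", "lucen"));
        System.out.println(matches("lucene", "lucens", 0.8));
    }
}
```

The point of the DFA work above is precisely to avoid running this O(|a|·|b|) check against every term in the dictionary.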
[jira] Updated: (LUCENE-2165) SnowballAnalyzer lacks a constructor that takes a Set of Stop Words
[ https://issues.apache.org/jira/browse/LUCENE-2165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller updated LUCENE-2165: Fix Version/s: 3.1 SnowballAnalyzer lacks a constructor that takes a Set of Stop Words --- Key: LUCENE-2165 URL: https://issues.apache.org/jira/browse/LUCENE-2165 Project: Lucene - Java Issue Type: Bug Components: contrib/analyzers Affects Versions: 2.9.1, 3.0 Reporter: Nick Burch Priority: Minor Fix For: 3.1 As discussed on the java-user list, the SnowballAnalyzer has been updated to use a Set of stop words. However, there is no constructor which accepts a Set; there's only the original String[] one. This is an issue, because most of the common sources of stop words (eg StopAnalyzer) have deprecated their String[] stop word lists, and moved over to Sets (eg StopAnalyzer.ENGLISH_STOP_WORDS_SET). So, for now, you either have to use a deprecated field on StopAnalyzer, or manually turn the Set into an array so you can pass it to the SnowballAnalyzer. I would suggest that a constructor is added to SnowballAnalyzer which accepts a Set. Not sure if the old String[] one should be deprecated or not. A sample patch against 2.9.1 to add the constructor is:

--- SnowballAnalyzer.java.orig 2009-12-15 11:14:08.0 +
+++ SnowballAnalyzer.java 2009-12-14 12:58:37.0 +
@@ -67,6 +67,12 @@
     stopSet = StopFilter.makeStopSet(stopWords);
   }

+  /** Builds the named analyzer with the given stop words. */
+  public SnowballAnalyzer(Version matchVersion, String name, Set stopWordsSet) {
+    this(matchVersion, name);
+    stopSet = stopWordsSet;
+  }
+
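The workaround mentioned above — bridging between a stop-word Set and the String[] the existing constructor accepts — is a one-liner. A minimal sketch (the stop-word values here are stand-ins, not StopAnalyzer's real set):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class StopWordConversion {
    public static void main(String[] args) {
        // a stop-word set, standing in for e.g. StopAnalyzer.ENGLISH_STOP_WORDS_SET
        Set<String> stopSet = new HashSet<>(Arrays.asList("a", "an", "the"));

        // the workaround: turn the Set back into the String[] that the
        // old SnowballAnalyzer(String, String[]) constructor expects
        String[] stopWords = stopSet.toArray(new String[0]);
        Arrays.sort(stopWords); // HashSet order is unspecified; sort for display
        System.out.println(Arrays.toString(stopWords)); // [a, an, the]
    }
}
```

The proposed Set-accepting constructor removes this round trip entirely.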
[jira] Commented: (LUCENE-1769) Fix wrong clover analysis because of backwards-tests, upgrade clover to 2.4.3 or better
[ https://issues.apache.org/jira/browse/LUCENE-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12791115#action_12791115 ] Mark Miller commented on LUCENE-1769: - Would be cool to get this issue wrapped up ... Fix wrong clover analysis because of backwards-tests, upgrade clover to 2.4.3 or better --- Key: LUCENE-1769 URL: https://issues.apache.org/jira/browse/LUCENE-1769 Project: Lucene - Java Issue Type: Bug Components: Build Affects Versions: 2.9 Reporter: Uwe Schindler Attachments: clover.license, LUCENE-1769.patch, LUCENE-1769.patch, nicks-LUCENE-1769.patch This is a followup for [http://www.lucidimagination.com/search/document/6248d6eafbe10ef4/build_failed_in_hudson_lucene_trunk_902] The problem with clover running on hudson is, that it does not instrument all tests ran. The autodetection of clover 1.x is not able to find out which files are the correct tests and only instruments the backwards test. Because of this, the current coverage report is only from the backwards tests running against the current Lucene JAR. You can see this, if you install clover and start the tests. During test-core no clover data is added to the db, only when backwards-tests begin, new files are created in the clover db folder. Clover 2.x supports a new ant task, testsources that can be used to specify the files, that are the tests. It works here locally with clover 2.4.3 and produces a really nice coverage report, also linking with test files work, it tells which tests failed and so on. I will attach a patch, that changes common-build.xml to the new clover version (other initialization resource) and tells clover where to find the tests (using the test folder include/exclude properties). One problem with the current patch: It does *not* instrument the backwards branch, so you see only coverage of the core/contrib tests. 
Getting the coverage also from the backwards tests is not easily possible because of two things: - the tag test dir is not easy to find out and add to the testsources element (there may be only one of them) - the test names in the BW branch are identical to the trunk tests. This completely corrupts the linkage between tests and code in the coverage report. In principle the best approach would be to generate a second coverage report for the backwards branch with a separate clover DB. The attached patch does not instrument the BW branch; it only covers trunk tests.
[jira] Assigned: (LUCENE-2035) TokenSources.getTokenStream() does not assign positionIncrement
[ https://issues.apache.org/jira/browse/LUCENE-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller reassigned LUCENE-2035: --- Assignee: Mark Miller TokenSources.getTokenStream() does not assign positionIncrement --- Key: LUCENE-2035 URL: https://issues.apache.org/jira/browse/LUCENE-2035 Project: Lucene - Java Issue Type: Bug Components: contrib/highlighter Affects Versions: 2.4, 2.4.1, 2.9 Reporter: Christopher Morris Assignee: Mark Miller Fix For: 3.1 Attachments: LUCENE-2305.patch Original Estimate: 24h Remaining Estimate: 24h
[jira] Updated: (LUCENE-2035) TokenSources.getTokenStream() does not assign positionIncrement
[ https://issues.apache.org/jira/browse/LUCENE-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller updated LUCENE-2035: Fix Version/s: 3.1 TokenSources.getTokenStream() does not assign positionIncrement --- Key: LUCENE-2035 URL: https://issues.apache.org/jira/browse/LUCENE-2035 Project: Lucene - Java Issue Type: Bug Components: contrib/highlighter Affects Versions: 2.4, 2.4.1, 2.9 Reporter: Christopher Morris Assignee: Mark Miller Fix For: 3.1 Attachments: LUCENE-2305.patch Original Estimate: 24h Remaining Estimate: 24h
[jira] Updated: (LUCENE-2035) TokenSources.getTokenStream() does not assign positionIncrement
[ https://issues.apache.org/jira/browse/LUCENE-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller updated LUCENE-2035: Attachment: LUCENE-2035.patch TokenSources.getTokenStream() does not assign positionIncrement --- Key: LUCENE-2035 URL: https://issues.apache.org/jira/browse/LUCENE-2035 Project: Lucene - Java Issue Type: Bug Components: contrib/highlighter Affects Versions: 2.4, 2.4.1, 2.9 Reporter: Christopher Morris Assignee: Mark Miller Fix For: 3.1 Attachments: LUCENE-2035.patch, LUCENE-2305.patch Original Estimate: 24h Remaining Estimate: 24h
[jira] Commented: (LUCENE-2035) TokenSources.getTokenStream() does not assign positionIncrement
[ https://issues.apache.org/jira/browse/LUCENE-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12791152#action_12791152 ] Mark Miller commented on LUCENE-2035: - Thanks for the tests and fix, Christopher! I've got one more patch coming and I'll commit in a few days. I'm going to break the tests back out into a separate file again (on second thought, I think how you had it is a good idea) and remove an author tag. Then after one more review I think this is good to go in. TokenSources.getTokenStream() does not assign positionIncrement --- Key: LUCENE-2035 URL: https://issues.apache.org/jira/browse/LUCENE-2035 Project: Lucene - Java Issue Type: Bug Components: contrib/highlighter Affects Versions: 2.4, 2.4.1, 2.9 Reporter: Christopher Morris Assignee: Mark Miller Fix For: 3.1 Attachments: LUCENE-2035.patch, LUCENE-2305.patch Original Estimate: 24h Remaining Estimate: 24h
[jira] Commented: (LUCENE-406) sort missing string fields last
[ https://issues.apache.org/jira/browse/LUCENE-406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12791153#action_12791153 ] Mark Miller commented on LUCENE-406: We should update this and incorporate into Lucene. sort missing string fields last --- Key: LUCENE-406 URL: https://issues.apache.org/jira/browse/LUCENE-406 Project: Lucene - Java Issue Type: New Feature Components: Search Affects Versions: 1.4 Environment: Operating System: All Platform: All Reporter: Yonik Seeley Assignee: Hoss Man Priority: Minor Attachments: MissingStringLastComparatorSource.java, MissingStringLastComparatorSource.java, TestMissingStringLastComparatorSource.java A SortComparatorSource for string fields that orders documents with the sort field missing after documents with the field. This is the reverse of the default Lucene implementation. The concept and first-pass implementation was done by Chris Hostetter. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
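The idea — documents missing the sort field ordered after documents that have it — can be sketched with a plain comparator treating null as "missing". The real MissingStringLastComparatorSource works against Lucene's field cache, so this is only an analogy (Comparator.nullsLast requires Java 8+, an assumption well beyond the era of this issue):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class MissingStringLast {
    public static void main(String[] args) {
        // null stands in for "document has no value for the sort field"
        List<String> values = new ArrayList<>(Arrays.asList("beta", null, "alpha", null));

        // the default missing-first ordering is flipped: missing values sort last
        values.sort(Comparator.nullsLast(Comparator.<String>naturalOrder()));
        System.out.println(values); // [alpha, beta, null, null]
    }
}
```

Within Lucene the same effect requires a custom SortComparatorSource, as attached to this issue, since null handling has to happen per-document during hit collection.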
[jira] Resolved: (LUCENE-1942) NUM_THREADS is a static member of RunAddIndexesThreads and should be accessed in a static way
[ https://issues.apache.org/jira/browse/LUCENE-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller resolved LUCENE-1942. - Resolution: Won't Fix NUM_THREADS is a static member of RunAddIndexesThreads and should be accessed in a static way - Key: LUCENE-1942 URL: https://issues.apache.org/jira/browse/LUCENE-1942 Project: Lucene - Java Issue Type: Bug Components: Other Environment: Eclipse 3.4.2 Reporter: Hasan Diwan Priority: Trivial Attachments: lucene.pat The summary contains the problem. No further description needed, I don't think. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Resolved: (LUCENE-628) Intermittent FileNotFoundException for .fnm when using rsync
[ https://issues.apache.org/jira/browse/LUCENE-628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller resolved LUCENE-628. Resolution: Incomplete Intermittent FileNotFoundException for .fnm when using rsync Key: LUCENE-628 URL: https://issues.apache.org/jira/browse/LUCENE-628 Project: Lucene - Java Issue Type: Bug Components: Search Affects Versions: 1.9 Environment: Linux RedHat ES3, Jboss402 Reporter: Simon Lorenz Priority: Minor We use Lucene 1.9.1 to create and search indexes for web applications. The application runs in Jboss402 on Redhat ES3. A single Master (Writer) Jboss instance creates and writes the indexes using the compound file format , which is optimised after all updates. These index files are replicated every few hours using rsync, to a number of other application servers (Searchers). The rsync job only runs if there are no lucene lock files present on the Writer. The Searcher servers that receive the replicated files, perform only searches on the index. Up to 60 searches may be performed each minute. Everything works well most of the time, but we get the following issue on the Searcher servers about 10% of the time. 
Following an rsync replication, one or all of the Searcher servers throws: IOException caught when creating an IndexSearcher
java.io.FileNotFoundException: //_1zm.fnm (No such file or directory)
at java.io.RandomAccessFile.open(Native Method)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:212)
at org.apache.lucene.store.FSIndexInput$Descriptor.<init>(FSDirectory.java:425)
at org.apache.lucene.store.FSIndexInput.<init>(FSDirectory.java:434)
at org.apache.lucene.store.FSDirectory.openInput(FSDirectory.java:324)
at org.apache.lucene.index.FieldInfos.<init>(FieldInfos.java:56)
at org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:144)
at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:129)
at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:110)
at org.apache.lucene.index.IndexReader$1.doBody(IndexReader.java:154)
at org.apache.lucene.store.Lock$With.run(Lock.java:109)
at org.apache.lucene.index.IndexReader.open(IndexReader.java:143)
As we use the compound file format, I would not expect .fnm files to be present. When replicating, we do not delete the old .cfs index files as these could still be referenced by old Searcher threads. We do overwrite the segments and deletable files on the Searcher servers. My thoughts are: either we are occasionally overwriting a file at the exact time a new searcher is being created, or the lock files are removed from the Writer server before the compaction process is completed, and we then replicate a segments file that still references a ghost .fnm file. I would greatly appreciate any ideas and suggestions to solve this annoying issue.
[jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12790368#action_12790368 ] Mark Miller commented on LUCENE-2089: - bq. If you do take hold of it, do not hesitate to share The original paper and C++ code likewise melt my brain, and I needed the algo in some other place. The java impl I was onto was about 75% complete according to the author, but I have not yet looked at the code. Robert was convinced it was a different, less efficient algorithm last I heard though. We have cracked much of the paper - that's how Robert implemented n=1 here - that's from the paper. The next step is to work out how to construct the tables for n as Robert says above, and store those tables efficiently, as they start getting quite large rather fast - though we might only use as high as n=3 or 4 in Lucene - Robert suspects term seeking will outweigh any gains at that point. I think we know how to do the majority of the work for the n case, but I don't really have much/any time for this, so it probably depends on if/when Robert gets to it. If he loses interest in finishing, I definitely plan to come back to it someday. I'd like to complete my understanding of the paper and see a full n java impl of this in either case. The main piece left that I don't understand fully (computing all possible states for n) can be computed with just a brute force check (that's how the python impl is doing it), so there may not be much more to understand. I would like to know how the paper is getting 'i'-parametrized state generators though - that's much more efficient. The paper shows them for n=1 and n=2.
explore using automaton for fuzzyquery -- Key: LUCENE-2089 URL: https://issues.apache.org/jira/browse/LUCENE-2089 Project: Lucene - Java Issue Type: Wish Components: Search Reporter: Robert Muir Assignee: Mark Miller Priority: Minor Attachments: LUCENE-2089.patch, Moman-0.2.1.tar.gz, TestFuzzy.java Mark brought this up on LUCENE-1606 (i will assign this to him, I know he is itching to write that nasty algorithm) we can optimize fuzzyquery by using AutomatonTermsEnum, here is my idea * up front, calculate the maximum required K edits needed to match the users supplied float threshold. * for at least small common E up to some max K (1,2,3, etc) we should create a DFA for each E. if the required E is above our supported max, we use dumb mode at first (no seeking, no DFA, just brute force like now). As the pq fills, we swap progressively lower DFAs into the enum, based upon the lowest score in the pq. This should work well on avg, at high E, you will typically fill the pq very quickly since you will match many terms. This not only provides a mechanism to switch to more efficient DFAs during enumeration, but also to switch from dumb mode to smart mode. i modified my wildcard benchmark to generate random fuzzy queries. * Pattern: 7N stands for NNN, etc. * AvgMS_DFA: this is the time spent creating the automaton (constructor) ||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA|| |7N|10|64.0|4155.9|38.6|20.3| |14N|10|0.0|2511.6|46.0|37.9| |28N|10|0.0|2506.3|93.0|86.6| |56N|10|0.0|2524.5|304.4|298.5| as you can see, this prototype is no good yet, because it creates the DFA in a slow way. right now it creates an NFA, and all this wasted time is in NFA-DFA conversion. So, for a very long string, it just gets worse and worse. This has nothing to do with lucene, and here you can see, the TermEnum is fast (AvgMS - AvgMS_DFA), there is no problem there. 
instead we should just build a DFA to begin with, maybe with this paper: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652 We can precompute the tables with that algorithm up to some reasonable K, and then I think we are OK. The paper references using http://portal.acm.org/citation.cfm?id=135907 for linear minimization; if someone wants to implement this, they should not worry about minimization. In fact, we need to at some point determine if AutomatonQuery should even minimize FSMs at all, or if it is simply enough for them to be deterministic with no transitions to dead states. (The only code that actually assumes a minimal DFA is the Dumb vs. Smart heuristic, and this can be rewritten as a summation easily.) We need to benchmark really complex DFAs (i.e. write a regex benchmark) to figure out if minimization is even helping right now. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail:
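For context, the "brute force check" and "dumb mode" mentioned above amount to a plain dynamic-programming edit-distance computation against each candidate term. A minimal standalone sketch (illustrative only, not Lucene code; the class and method names are made up):

```java
// Classic O(m*n) Levenshtein distance with two rolling rows - the kind of
// brute-force check "dumb mode" uses when no DFA is available for the
// required number of edits.
public class LevenshteinDemo {
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j; // j deletions
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1,   // insertion
                                            prev[j] + 1),      // deletion
                                   prev[j - 1] + cost);        // substitution
            }
            int[] tmp = prev; prev = curr; curr = tmp; // roll the rows
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("lucene", "lucane")); // 1
    }
}
```

A DFA for edit distance K accepts exactly the strings this check would accept with `distance(term, candidate) <= K`, but without re-running the DP per term, which is why the DFA construction cost matters so much above.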
[jira] Commented: (LUCENE-2126) Split up IndexInput and IndexOutput into DataInput and DataOutput
[ https://issues.apache.org/jira/browse/LUCENE-2126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12789901#action_12789901 ] Mark Miller commented on LUCENE-2126: - I disagree with you here: introducing DataInput/Output makes IMO the API actually easier for the normal user to understand. I agree with everything you say in the second paragraph, but I don't see how any of that supports the assertion you make in the first paragraph. Presumably, because the normal user won't touch/see the IndexInput/Output classes, but more likely may deal with DataInput/Output - and those classes being limited to what actually makes sense for them (only exposing methods they should use) - that's easier for them. I was leaning towards Marvin's arguments - it really seems that documentation should be enough to steer users against doing something stupid - there is no doubt that writing attributes into the posting list is a fairly advanced operation (though more normal than using IndexInput/Output). On the other hand, though, I'm not really sold on the downsides longer term either. The complexity argument is a bit overblown. If you understand anything down to the level of these classes, this is a ridiculously simple change. The backcompat argument is not very persuasive either - not only does it look like a slim chance of any future issues - at this level we are fairly loose about back compat when something comes up. I think advanced users have already realized: the more you dig into Lucene's guts, the more likely you won't be able to count on jar drop-in. That's just the way things have gone. I don't see a looming concrete issue myself anyway. And if there is a hidden one, I don't think anyone is going to get in a ruffle about it. So net/net, I'm +1. Seems worth it to me to be able to give a LUCENE-2125 user the correct API. I could go either way on the name change. Not a fan of LuceneInput/Output though.
Split up IndexInput and IndexOutput into DataInput and DataOutput - Key: LUCENE-2126 URL: https://issues.apache.org/jira/browse/LUCENE-2126 Project: Lucene - Java Issue Type: Improvement Affects Versions: Flex Branch Reporter: Michael Busch Assignee: Michael Busch Priority: Minor Fix For: Flex Branch Attachments: lucene-2126.patch I'd like to introduce the two new classes DataInput and DataOutput that contain all methods from IndexInput and IndexOutput that actually decode or encode data, such as readByte()/writeByte(), readVInt()/writeVInt(). Methods like getFilePointer(), seek(), close(), etc., which are not related to data encoding but to files as input/output sources, stay in IndexInput/IndexOutput. This patch also changes ByteSliceReader/ByteSliceWriter to extend DataInput/DataOutput. Previously ByteSliceReader implemented the methods that stay in IndexInput by throwing RuntimeExceptions. See also LUCENE-2125. All tests pass.
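The split is easy to picture in code. Here is a rough, illustrative sketch (the method bodies and the demo harness are mine, not from the patch): the pure encoding methods such as writeVInt() live in DataOutput, while file-position concerns stay on IndexOutput:

```java
import java.io.ByteArrayOutputStream;

// Sketch of the proposed split. Class names match the patch; bodies are
// illustrative. DataOutput knows only how to encode bytes; it has no notion
// of a file.
abstract class DataOutput {
    public abstract void writeByte(byte b);

    // VInt encoding: 7 data bits per byte, high bit set on every byte but
    // the last (this is the encoding Lucene's writeVInt uses).
    public void writeVInt(int i) {
        while ((i & ~0x7F) != 0) {
            writeByte((byte) ((i & 0x7F) | 0x80));
            i >>>= 7;
        }
        writeByte((byte) i);
    }
}

abstract class IndexOutput extends DataOutput {
    // Not data encoding - these stay out of DataOutput, so a user who only
    // encodes data never sees them.
    public abstract long getFilePointer();
    public abstract void close();
}

public class VIntDemo {
    // Demo helper: capture what writeVInt emits into a byte array.
    static byte[] encodeVInt(int value) {
        final ByteArrayOutputStream sink = new ByteArrayOutputStream();
        DataOutput out = new DataOutput() {
            public void writeByte(byte b) { sink.write(b); }
        };
        out.writeVInt(value);
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        byte[] bytes = encodeVInt(300);
        // 300 needs two bytes: 0xAC (low 7 bits + continuation bit), then 0x02
        System.out.println(bytes.length + " bytes: "
                + (bytes[0] & 0xFF) + ", " + (bytes[1] & 0xFF));
    }
}
```

This also shows why ByteSliceWriter fits naturally under DataOutput: it encodes into in-memory slices and has no meaningful getFilePointer() or seek() to implement.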
[jira] Commented: (LUCENE-2133) [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField
[ https://issues.apache.org/jira/browse/LUCENE-2133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12789384#action_12789384 ] Mark Miller commented on LUCENE-2133: -
bq. Something along these lines maybe?
And we are back to 831 :) [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField - Key: LUCENE-2133 URL: https://issues.apache.org/jira/browse/LUCENE-2133 Project: Lucene - Java Issue Type: Improvement Components: Search Affects Versions: 2.9.1, 3.0 Reporter: Christian Kohlschütter Attachments: LUCENE-2133-complete.patch, LUCENE-2133.patch, LUCENE-2133.patch, LUCENE-2133.patch Hi all, up to the current version Lucene contains a conceptual flaw, that is the FieldCache. The FieldCache is a singleton which is supposed to cache certain information for every IndexReader that is currently open. The FieldCache is flawed because it is incorrect to assume that: 1. one IndexReader instance equals one index. In fact, there can be many clones (of SegmentReader) or decorators (FilterIndexReader) which all access the very same data. 2. the cache information remains valid for the lifetime of an IndexReader. In fact, some IndexReaders may be reopen()'ed and thus they may contain completely different information. 3. all IndexReaders need the same type of cache. In fact, because of the limitations imposed by the singleton construct there was no implementation other than FieldCacheImpl. Furthermore, FieldCacheImpl and FieldComparator are bloated by several static inner-classes that could be moved to package level. There have been a few attempts to improve FieldCache, namely LUCENE-831, LUCENE-1579 and LUCENE-1749, but the overall situation remains the same: There is a central registry for assigning Caches to IndexReader instances. I now propose the following: 1. Obsolete FieldCache and FieldCacheKey and provide index-specific, extensible cache instances (IndexCache).
IndexCaches provide common caching functionality for all IndexReaders and may be extended (for example, SegmentReader would have a SegmentReaderIndexCache and store different data than a regular IndexCache) 2. Add the index-specific field cache (IndexFieldCache) to the IndexCache. IndexFieldCache is an interface just like FieldCache and may support different implementations. 3. The IndexCache instances may be flushed/closed by the associated IndexReaders whenever necessary. 4. Obsolete FieldCacheSanityChecker because no more insanities are expected (or at least, they do not impact the overall performance) 5. Refactor FieldCacheImpl and the related classes (FieldComparator, SortField) I have provided a patch which takes care of all these issues. It passes all JUnit tests. The patch is quite large, admittedly, but the change required several modifications and some more to preserve backwards-compatibility. Backwards-compatibility is preserved by moving some of the updated functionality into the package org.apache.lucene.search.fields (field comparators and parsers, SortField) while adding wrapper instances and keeping old code in org.apache.lucene.search. In detail and besides the above-mentioned improvements, the following is provided: 1. An IndexCache specific for SegmentReaders. The two ThreadLocals are moved from SegmentReader to SegmentReaderIndexCache. 2. A housekeeping improvement to CloseableThreadLocal. Now delegates the close() method to all registered instances by calling an onClose() method with the threads' instances. 3. Analyzer.close now may throw an IOException (this already is covered by java.io.Closeable). 4. A change to Collector: allow IndexCache instead of IndexReader being passed to setNextReader() 5. SortField's numeric types have been replaced by direct assignments of FieldComparatorSource. This removes the switch statements and the possibility to throw IllegalArgumentExceptions because of unsupported type values.
The following classes have been deprecated and replaced by new classes in org.apache.lucene.search.fields: - FieldCacheRangeFilter (= IndexFieldCacheRangeFilter) - FieldCacheTermsFilter (= IndexFieldCacheTermsFilter) - FieldCache (= IndexFieldCache) - FieldCacheImpl (= IndexFieldCacheImpl) - all classes in FieldCacheImpl (= several package-level classes) - all subclasses of FieldComparator (= several package-level classes) Final notes: - The patch would be simpler if no backwards compatibility was necessary. The Lucene community has to decide which classes/methods can immediately be removed, which ones later, which not at all. Whenever new classes depend on the old ones, an appropriate notice exists in the javadocs. - The patch introduces a new,
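To make the per-reader idea in the proposal concrete, here is a minimal, hypothetical sketch (the class and method names below are stand-ins for illustration, not the actual patch API): each reader owns its own cache instance, so there is no global registry to sanity-check, and closing the reader can release everything it cached:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Stand-in for the proposed IndexCache: owned by one reader instance,
// rather than keyed off the reader in a static singleton.
class IndexCacheSketch {
    private final Map<String, Object> entries = new ConcurrentHashMap<>();

    Object getOrCompute(String field, Function<String, Object> loader) {
        // Compute once per field for this reader; later calls reuse the entry.
        return entries.computeIfAbsent(field, loader);
    }

    void close() { entries.clear(); } // invoked by the owning reader's close()
}

class DemoReader {
    final IndexCacheSketch cache = new IndexCacheSketch(); // per instance, not static

    void close() { cache.close(); } // cache lifetime == reader lifetime
}

public class IndexCacheDemo {
    public static void main(String[] args) {
        DemoReader r = new DemoReader();
        Object a = r.cache.getOrCompute("title", f -> new int[] {1, 2, 3});
        Object b = r.cache.getOrCompute("title", f -> new int[] {9});
        System.out.println(a == b); // same cached instance within this reader
    }
}
```

The design point is that a reopen()'ed or cloned reader simply gets (or shares) its own cache object, sidestepping assumptions 1 and 2 from the description above.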
[jira] Commented: (LUCENE-1377) Add HTMLStripReader and WordDelimiterFilter from SOLR
[ https://issues.apache.org/jira/browse/LUCENE-1377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12788762#action_12788762 ] Mark Miller commented on LUCENE-1377: -
bq. with the exception of a few core committers.
I think the exception is the other way around, especially considering Lucene contrib. Let's look at the Solr list (and consider some are not very active in Solr currently):
||name||status||
|Bill Au| |
|Doug Cutting|Lucene Core Committer|
|Otis Gospodnetić|Lucene Core Committer|
|Erik Hatcher|Lucene Core Committer|
|Chris Hostetter|Lucene Core Committer|
|Grant Ingersoll|Lucene Core Committer|
|Mike Klaas| |
|Shalin Shekhar Mangar| |
|Ryan McKinley|Lucene Contrib Committer|
|Mark Miller|Lucene Core Committer|
|Noble Paul| |
|Yonik Seeley|Lucene Core Committer|
|Koji Sekiguchi|Lucene Contrib Committer|
Add HTMLStripReader and WordDelimiterFilter from SOLR - Key: LUCENE-1377 URL: https://issues.apache.org/jira/browse/LUCENE-1377 Project: Lucene - Java Issue Type: Improvement Components: Analysis Affects Versions: 2.3.2 Reporter: Jason Rutherglen Priority: Minor Original Estimate: 24h Remaining Estimate: 24h SOLR has two classes, HTMLStripReader and WordDelimiterFilter, which are very useful for a wide variety of use cases. It would be good to place them into core Lucene.
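As a rough illustration of what WordDelimiterFilter contributes: it splits tokens on intra-word delimiters and case transitions. The standalone method below is a simplified approximation written for this note (the real Solr class is a TokenFilter with many configuration options such as catenation and number handling):

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of WordDelimiterFilter's core behavior: split a token on
// non-alphanumeric characters and on lower-to-upper case transitions.
public class WordDelimiterDemo {
    static List<String> split(String token) {
        List<String> parts = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        char prev = 0;
        for (char c : token.toCharArray()) {
            boolean delim = !Character.isLetterOrDigit(c);
            boolean caseChange = Character.isUpperCase(c) && Character.isLowerCase(prev);
            if ((delim || caseChange) && cur.length() > 0) {
                parts.add(cur.toString()); // close off the current subword
                cur.setLength(0);
            }
            if (!delim) cur.append(c);     // delimiters themselves are dropped
            prev = c;
        }
        if (cur.length() > 0) parts.add(cur.toString());
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("Wi-Fi"));     // [Wi, Fi]
        System.out.println(split("PowerShot")); // [Power, Shot]
    }
}
```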
[jira] Commented: (LUCENE-2133) [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField
[ https://issues.apache.org/jira/browse/LUCENE-2133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12788874#action_12788874 ] Mark Miller commented on LUCENE-2133: - I don't know that back compat is really a concern if we are just leaving the old API intact as part of that, with its own caching mechanism? Just deprecate the old API, and make a new one. This is a big pain, because you have to be sure you don't straddle the two APIs on upgrading, but that's the boat we will be in anyway. Which means a new impl should provide enough benefits to make that large pain worth enduring. 831 was not committed for the same reason - it didn't bring enough to the table to be worth it after we got to a per-segment cache in another way. Since I don't see that this provides anything over 831, I don't see how it's not in the same boat. I'm not sure we should target a specific release with this - we don't even know when 3.1 is going to happen. 2.9 took a year. It's anybody's guess - we should probably just do what makes sense and commit it when it's ready.
[jira] Commented: (LUCENE-2133) [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField
[ https://issues.apache.org/jira/browse/LUCENE-2133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12788908#action_12788908 ] Mark Miller commented on LUCENE-2133: -
bq. LUCENE-831 still requires a static FieldCache, the root of all evil :)
It doesn't require one though? It supports a cache per segment reader just like this. Except it's called a ValueSource. The CacheByReaderValueSource is just there to handle a back compat issue - it's something that we would want to get around and use the reader ValueSource for instead - but that patch still had a long way to go. Overall, from what I can see, the approach was about the same.
bq. It probably makes sense to start from one of Hoss's original patches or even from scratch
That was said before a lot more work was done. The API was actually starting to shape up nicely.
bq. The more complex the patches are, the longer it will take to integrate them into a new version.
Of course - and this is a complex issue with a lot of upgrade pain. Like with 831, it's not really worth the pain to users without more benefits.
bq. The more such patches you have, the longer it will take to get to a new release.
That's not really true. 3.1 doesn't need this patch - there would be no reason to hold it for it. Patches go in when they are ready.
bq. Let's make it simple, submit what we have and build upon that.
I don't think that's simple :) The patch can be iterated on outside of trunk as easily as in.
[jira] Commented: (LUCENE-2133) [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField
[ https://issues.apache.org/jira/browse/LUCENE-2133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12788951#action_12788951 ] Mark Miller commented on LUCENE-2133: -
bq. That is, it adds a lot of duplicated code / different possible implementations for the same thing.
Things that were still ugly were not likely to stick around - 831 was very much a work in progress. The solution there to handle back compat issues was a working solution that would need to be improved upon. 831 was still in an experimentation state - issues that needed more thought had hacked-in working solutions. We had a more general cache at one point, and began working towards ValueSources based on discussion. The latest 831 patch is an exploration of that, not a final product.
bq. They should store arbitrary data, allow cache inspection, eviction of entries and so on.
That's extremely simple to add to an IndexReader - we were thinking of a ValueSource as something different than a basic cache.
{quote} It is indeed a complex problem but it can easily be split into several subtasks that can be addressed by different people in parallel. To allow such a development, we have to somehow get the base code into SVN - not necessarily trunk, admittedly; a branch would also do. Of course, this requires also additional work to keep it in sync with trunk. If we can really assume to have 3.1 in one year, we have lots of time for developing a stable, powerful new API directly in trunk. Of course, this is a decision related to release management and not to the actual problem. I can live with both ways (trunk vs. branch), but, in my opinion, managing the changes just as patch files in JIRA is not a viable option. {quote}
A branch is certainly a possibility, but with only one person working on it, I think it's overkill. With some additional interest, a branch can make sense - otherwise it's not worth the merging headaches. You also have to have committer(s) that are willing to take on the merging. At one point, 831 was much more like this patch. Discussion along what Mike brought up above started transforming it into something else. We essentially decided that unless that much was brought to the table, the disrupting change just wasn't worth it for a different cache API. I'm definitely a proponent of FieldCache reform - but I think we want to fully flesh it out before committing to something in trunk.
[jira] Issue Comment Edited: (LUCENE-2133) [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField
[ https://issues.apache.org/jira/browse/LUCENE-2133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12788951#action_12788951 ] Mark Miller edited comment on LUCENE-2133 at 12/10/09 9:48 PM: ---
bq. That is, it adds a lot of duplicated code / different possible implementations for the same thing.
Things that were still ugly were not likely to stick around - 831 was very much a work in progress. The solution there to handle back compat issues was a working solution that would need to be improved upon. 831 was still in an experimentation state - issues that needed more thought had hacked-in working solutions. We had a more general cache at one point, and began working towards ValueSources based on discussion. The latest 831 patch is an exploration of that, not a final product.
bq. They should store arbitrary data, allow cache inspection, eviction of entries and so on.
That's extremely simple to add to an IndexReader - we were thinking of a ValueSource as something different than a basic cache.
{quote} It is indeed a complex problem but it can easily be split into several subtasks that can be addressed by different people in parallel. To allow such a development, we have to somehow get the base code into SVN - not necessarily trunk, admittedly; a branch would also do. Of course, this requires also additional work to keep it in sync with trunk. If we can really assume to have 3.1 in one year, we have lots of time for developing a stable, powerful new API directly in trunk. Of course, this is a decision related to release management and not to the actual problem. I can live with both ways (trunk vs. branch), but, in my opinion, managing the changes just as patch files in JIRA is not a viable option. {quote}
A branch is certainly a possibility, but with only one person working on it, I think it's overkill. With some additional interest, a branch can make sense - otherwise it's not worth the merging headaches.
You also have to have committer(s) that are willing to take on the merging. At one point, 831 was much more like this patch. Discussion along what Mike brought up above started transforming it into something else. We essentially decided that unless that much was brought to the table, the disrupting change just wasn't worth it for a different cache API. I'm definitely a proponent of FieldCache reform - but I think we want to fully flesh it out before committing to something in trunk.
[jira] Commented: (LUCENE-2018) Reconsider boolean max clause exception
[ https://issues.apache.org/jira/browse/LUCENE-2018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12787658#action_12787658 ] Mark Miller commented on LUCENE-2018: - I still think this should be removed - or moved to the MTQ query itself - then a setting on the query parser could set it, or a user could set it. It shouldn't be a sys property, and I don't necessarily think it should be on by default either. Reconsider boolean max clause exception --- Key: LUCENE-2018 URL: https://issues.apache.org/jira/browse/LUCENE-2018 Project: Lucene - Java Issue Type: Improvement Reporter: Mark Miller Fix For: 3.1 Now that we have smarter multi-term queries, I think it's time to reconsider the boolean max clause setting. It made more sense before, because you could hit it more unawares when the multi-term queries got huge - now it's more likely that if it happens, it's because a user built the boolean themselves. And no duh, thousands more boolean clauses means slower perf and more resources needed. We don't throw an exception when you try to use a ton of resources in a thousand other ways. The current setting also suffers from the static hell argument - especially when you consider something like Solr's multicore feature - you can have different settings for this in different cores, and the last one is going to win. It's ugly. Yes, that could be addressed better in Solr as well - but I still think it should be less ugly in Lucene as well. I'd like to consider either doing away with it, or raising it by quite a bit at the least. Or an alternative better solution. Right now, it ain't so great.
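The "static hell" / last-one-wins problem described above is easy to demonstrate. The toy class below is not the real BooleanQuery; it just models a JVM-wide static limit of the kind BooleanQuery.setMaxClauseCount sets (1024 is Lucene's default), with two "cores" configuring it in turn:

```java
// Illustrative stand-in for a class with a static, process-wide setting.
class BooleanQueryLike {
    private static int maxClauseCount = 1024; // one value for the whole JVM

    static void setMaxClauseCount(int n) { maxClauseCount = n; }
    static int getMaxClauseCount() { return maxClauseCount; }
}

public class MaxClauseDemo {
    public static void main(String[] args) {
        BooleanQueryLike.setMaxClauseCount(2048); // core A applies its config
        BooleanQueryLike.setMaxClauseCount(512);  // core B, initialized later
        // Core A now silently runs under core B's limit - last writer wins:
        System.out.println(BooleanQueryLike.getMaxClauseCount()); // 512
    }
}
```

Moving the limit onto the query instance (or the query parser), as the comment suggests, would give each core its own value and make the toy scenario above impossible.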
[jira] Commented: (LUCENE-2133) [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField
[ https://issues.apache.org/jira/browse/LUCENE-2133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787711#action_12787711 ] Mark Miller commented on LUCENE-2133:
-
There are a bunch of unrelated changes (imports/names/exception thrown) that should be pulled from this patch.

[PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField
-
Key: LUCENE-2133
URL: https://issues.apache.org/jira/browse/LUCENE-2133
Project: Lucene - Java
Issue Type: Improvement
Components: Search
Affects Versions: 2.9.1, 3.0
Reporter: Christian Kohlschütter
Attachments: LUCENE-2133-complete.patch, LUCENE-2133.patch, LUCENE-2133.patch

Hi all, up to the current version Lucene contains a conceptual flaw: the FieldCache. The FieldCache is a singleton which is supposed to cache certain information for every IndexReader that is currently open.

The FieldCache is flawed because it is incorrect to assume that:
1. One IndexReader instance equals one index. In fact, there can be many clones (of SegmentReader) or decorators (FilterIndexReader) which all access the very same data.
2. The cache information remains valid for the lifetime of an IndexReader. In fact, some IndexReaders may be reopen()'ed and thus they may contain completely different information.
3. All IndexReaders need the same type of cache. In fact, because of the limitations imposed by the singleton construct there was no implementation other than FieldCacheImpl.

Furthermore, FieldCacheImpl and FieldComparator are bloated by several static inner classes that could be moved to package level. There have been a few attempts to improve FieldCache, namely LUCENE-831, LUCENE-1579 and LUCENE-1749, but the overall situation remains the same: there is a central registry for assigning caches to IndexReader instances.

I now propose the following:
1. Obsolete FieldCache and FieldCacheKey and provide index-specific, extensible cache instances (IndexCache). IndexCaches provide common caching functionality for all IndexReaders and may be extended (for example, SegmentReader would have a SegmentReaderIndexCache and store different data than a regular IndexCache).
2. Add the index-specific field cache (IndexFieldCache) to the IndexCache. IndexFieldCache is an interface just like FieldCache and may support different implementations.
3. The IndexCache instances may be flushed/closed by the associated IndexReaders whenever necessary.
4. Obsolete FieldCacheSanityChecker because no more insanities are expected (or at least, they do not impact the overall performance).
5. Refactor FieldCacheImpl and the related classes (FieldComparator, SortField).

I have provided a patch which takes care of all these issues. It passes all JUnit tests. The patch is quite large, admittedly, but the change required several modifications, and some more to preserve backwards compatibility. Backwards compatibility is preserved by moving some of the updated functionality into the package org.apache.lucene.search.fields (field comparators and parsers, SortField) while adding wrapper instances and keeping old code in org.apache.lucene.search.

In detail, and besides the above-mentioned improvements, the following is provided:
1. An IndexCache specific to SegmentReaders. The two ThreadLocals are moved from SegmentReader to SegmentReaderIndexCache.
2. A housekeeping improvement to CloseableThreadLocal: it now delegates the close() method to all registered instances by calling an onClose() method with the threads' instances.
3. Analyzer.close may now throw an IOException (this is already covered by java.io.Closeable).
4. A change to Collector: allow an IndexCache instead of an IndexReader to be passed to setNextReader().
5. SortField's numeric types have been replaced by direct assignments of FieldComparatorSource. This removes the switch statements and the possibility of throwing IllegalArgumentExceptions because of unsupported type values.

The following classes have been deprecated and replaced by new classes in org.apache.lucene.search.fields:
- FieldCacheRangeFilter (= IndexFieldCacheRangeFilter)
- FieldCacheTermsFilter (= IndexFieldCacheTermsFilter)
- FieldCache (= IndexFieldCache)
- FieldCacheImpl (= IndexFieldCacheImpl)
- all classes in FieldCacheImpl (= several package-level classes)
- all subclasses of FieldComparator (= several package-level classes)

Final notes:
- The patch would be simpler if no backwards compatibility was necessary. The Lucene community has to decide which classes/methods can immediately be removed, which ones later, and which not at all. Whenever new classes depend on the old ones, an appropriate notice exists in the javadocs.
- The
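The per-reader cache idea in points 1 and 3 of the proposal can be sketched in a few lines of plain Java. This is a hedged illustration only: the class names (IndexCache, DemoReader) and methods are illustrative stand-ins, not the actual API of the attached patch.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: each reader OWNS its cache, so there is no global registry keyed
// by reader instances, and the cache dies with the reader that created it.
class IndexCache {
    private final Map<String, Object> entries = new HashMap<String, Object>();

    Object get(String key) { return entries.get(key); }
    void put(String key, Object value) { entries.put(key, value); }
    int size() { return entries.size(); }

    // Called by the owning reader on close(); nothing global to purge.
    void close() { entries.clear(); }
}

class DemoReader implements AutoCloseable {
    private final IndexCache cache = new IndexCache(); // one cache per reader

    IndexCache getIndexCache() { return cache; }

    @Override
    public void close() { cache.close(); } // cache lifetime == reader lifetime
}
```

Under this scheme a reopen()'ed reader would simply carry a fresh cache instance, sidestepping the stale-entry assumption criticized in point 2 above.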
[jira] Commented: (LUCENE-2133) [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField
[ https://issues.apache.org/jira/browse/LUCENE-2133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787715#action_12787715 ] Mark Miller commented on LUCENE-2133:
-
Hmm ... never mind. The exception is related and most of the imports are correct; brain spin. I didn't see that "import org.apache.lucene.search.SortField; // for javadocs" wasn't being used anymore anyway. The "import org.apache.lucene.search.fields.IndexFieldCache" in NumericQuery should get a // javadoc comment so someone doesn't accidentally remove it. And I guess the t-to-threadLocal rename doesn't hurt with the amount you're changing anyway; it's a better name. This looks pretty nice overall.
[jira] Commented: (LUCENE-2133) [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField
[ https://issues.apache.org/jira/browse/LUCENE-2133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787729#action_12787729 ] Mark Miller commented on LUCENE-2133:
-
A couple more quick notes: I know the FieldComparator class is ugly, but I'm not sure we should pull the rug by putting the impls in a new package. On the other hand, it's not likely to affect many and it was experimental, so it's a tough call. It's a lot of classes in there ;) I'm also not sure whether "fields" is the right package name, and do the Filters belong in that package? Also, almost a non-issue, but extending a deprecated class is going to be an ultra-minor back-compat break when it's removed. Not likely a problem, though we might put a note to that effect to be clear. It is almost self-documenting anyway :) Rather than changing the tests to the new classes, we should probably copy them and make new ones, then remove those copies when the deprecations are removed. Also, you should pull the author tag(s); all credit is through JIRA and CHANGES. (I only see it like once, so I bet that's Eclipse?) I haven't done a thorough review of it all, but this is pretty great stuff to appear so complete and out of nowhere :)
[jira] Commented: (LUCENE-2133) [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField
[ https://issues.apache.org/jira/browse/LUCENE-2133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787734#action_12787734 ] Mark Miller commented on LUCENE-2133:
-
It looks like FieldCacheTermsFilterDocIdSet is using the wrong StringIndex? And I think the FieldCache import in that class can be removed.
[jira] Issue Comment Edited: (LUCENE-2133) [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField
[ https://issues.apache.org/jira/browse/LUCENE-2133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787734#action_12787734 ] Mark Miller edited comment on LUCENE-2133 at 12/8/09 8:42 PM:
--
It looks like FieldCacheTermsFilterDocIdSet is using the wrong StringIndex? And I think the FieldCache import in that class can be removed (same with IndexFieldCacheRangeFilter).

was (Author: markrmil...@gmail.com): It looks like FieldCacheTermsFilterDocIdSet is using the wrong StringIndex? And I think the FieldCache import in that class can be removed.
[jira] Commented: (LUCENE-2133) [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField
[ https://issues.apache.org/jira/browse/LUCENE-2133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787748#action_12787748 ] Mark Miller commented on LUCENE-2133:
-
bq. I think it does not hurt either.

I didn't notice that you actually just deprecated the originals; I guess that's not a complete rug pull ... By the way, I don't think you need to deprecate something in a new class (IndexFieldCacheImpl):

{code}
/**
 * @deprecated Use {@link #clear()} instead.
 */
public void purgeAllCaches() {
  init();
}
{code}
[jira] Issue Comment Edited: (LUCENE-2133) [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField
[ https://issues.apache.org/jira/browse/LUCENE-2133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12787752#action_12787752 ] Mark Miller edited comment on LUCENE-2133 at 12/8/09 9:34 PM: -- And what about the doubling up insanity? It looks like you just commented out that check? It appears to me that thats still an issue we want to check for - we want to make sure Lucene core and users have a way to be sure they are not using a toplevel reader and its sub readers for caches unless they *really* intend to. *edit* This type of change actually even exaggerates that problem (though if we want to improve things here, its something we will have to deal with). Now you might have a mixture of old api/new api caches as well if you don't properly upgrade everything at once. was (Author: markrmil...@gmail.com): And what about the doubling up insanity? It looks like you just commented out that check? It appears to me that thats still an issue we want to check for - we want to make sure Lucene core and users have a way to be sure they are not using a toplevel reader and its sub readers for caches unless they *really* intend to. [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField - Key: LUCENE-2133 URL: https://issues.apache.org/jira/browse/LUCENE-2133 Project: Lucene - Java Issue Type: Improvement Components: Search Affects Versions: 2.9.1, 3.0 Reporter: Christian Kohlschütter Attachments: LUCENE-2133-complete.patch, LUCENE-2133.patch, LUCENE-2133.patch Hi all, up to the current version Lucene contains a conceptual flaw, that is the FieldCache. The FieldCache is a singleton which is supposed to cache certain information for every IndexReader that is currently open The FieldCache is flawed because it is incorrect to assume that: 1. one IndexReader instance equals one index. In fact, there can be many clones (of SegmentReader) or decorators (FilterIndexReader) which all access the very same data. 2. 
the cache information remains valid for the lifetime of an IndexReader. In fact, some IndexReaders may be reopen()'ed and thus they may contain completely different information. 3. all IndexReaders need the same type of cache. In fact, because of the limitations imposed by the singleton construct there was no implementation other than FieldCacheImpl. Furthermore, FieldCacheImpl and FieldComparator are bloated by several static inner-classes that could be moved to package level. There have been a few attempts to improve FieldCache, namely LUCENE-831, LUCENE-1579 and LUCENE-1749, but the overall situation remains the same: There is a central registry for assigning Caches to IndexReader instances. I now propose the following: 1. Obsolete FieldCache and FieldCacheKey and provide index-specific, extensible cache instances (IndexCache). IndexCaches provide common caching functionality for all IndexReaders and may be extended (for example, SegmentReader would have a SegmentReaderIndexCache and store different data than a regular IndexCache) 2. Add the index-specific field cache (IndexFieldCache) to the IndexCache. IndexFieldCache is an interface just like FieldCache and may support different implementations. 3. The IndexCache instances may be flushed/closed by the associated IndexReaders whenever necessary. 4. Obsolete FieldCacheSanityChecker because no more insanities are expected (or at least, they do not impact the overall performance) 5. Refactor FieldCacheImpl and the related classes (FieldComparator, SortField) I have provided an patch which takes care of all these issues. It passes all JUnit tests. The patch is quite large, admittedly, but the change required several modifications and some more to preserve backwards-compatibility. 
Backwards-compatibility is preserved by moving some of the updated functionality into the package org.apache.lucene.search.fields (field comparators and parsers, SortField) while adding wrapper instances and keeping old code in org.apache.lucene.search. In detail, and besides the above-mentioned improvements, the following is provided: 1. An IndexCache specific to SegmentReaders. The two ThreadLocals are moved from SegmentReader to SegmentReaderIndexCache. 2. A housekeeping improvement to CloseableThreadLocal. It now delegates the close() method to all registered instances by calling an onClose() method with the threads' instances. 3. Analyzer.close may now throw an IOException (this is already covered by java.io.Closeable). 4. A change to Collector: allow an IndexCache instead of an IndexReader to be passed to setNextReader(). 5. SortField's numeric types have been replaced by direct assignments of
[jira] Commented: (LUCENE-2133) [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField
[ https://issues.apache.org/jira/browse/LUCENE-2133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12787767#action_12787767 ] Mark Miller commented on LUCENE-2133: - bq. not bind the cache so hard to the IndexReader (which was also the problem with the last FieldCache), instead just make it a plugin component At a minimum, you should be able to set the cache for the reader. bq. For the functionality of Lucene, FieldCache is not needed, sorting is just an addon on searching The way he has it, this is not just for the FieldCache, but also the FieldsReader and vector reader - if we go down that road, we should consider norms as well. bq. I see no problems with applying it soon I still think it might be a little early. This has a lot of consequences. [PATCH] IndexCache: Refactoring of FieldCache, FieldComparator, SortField - Key: LUCENE-2133 URL: https://issues.apache.org/jira/browse/LUCENE-2133 Project: Lucene - Java Issue Type: Improvement Components: Search Affects Versions: 2.9.1, 3.0 Reporter: Christian Kohlschütter Attachments: LUCENE-2133-complete.patch, LUCENE-2133.patch, LUCENE-2133.patch Hi all, up to the current version Lucene contains a conceptual flaw: the FieldCache. The FieldCache is a singleton which is supposed to cache certain information for every IndexReader that is currently open. The FieldCache is flawed because it is incorrect to assume that: 1. one IndexReader instance equals one index. In fact, there can be many clones (of SegmentReader) or decorators (FilterIndexReader) which all access the very same data. 2. the cache information remains valid for the lifetime of an IndexReader. In fact, some IndexReaders may be reopen()'ed and thus they may contain completely different information. 3. all IndexReaders need the same type of cache. In fact, because of the limitations imposed by the singleton construct there was no implementation other than FieldCacheImpl.
Furthermore, FieldCacheImpl and FieldComparator are bloated by several static inner classes that could be moved to package level. There have been a few attempts to improve FieldCache, namely LUCENE-831, LUCENE-1579 and LUCENE-1749, but the overall situation remains the same: there is a central registry for assigning caches to IndexReader instances. I now propose the following: 1. Obsolete FieldCache and FieldCacheKey and provide index-specific, extensible cache instances (IndexCache). IndexCaches provide common caching functionality for all IndexReaders and may be extended (for example, SegmentReader would have a SegmentReaderIndexCache and store different data than a regular IndexCache). 2. Add the index-specific field cache (IndexFieldCache) to the IndexCache. IndexFieldCache is an interface just like FieldCache and may support different implementations. 3. The IndexCache instances may be flushed/closed by the associated IndexReaders whenever necessary. 4. Obsolete FieldCacheSanityChecker because no more insanities are expected (or at least, they do not impact the overall performance). 5. Refactor FieldCacheImpl and the related classes (FieldComparator, SortField). I have provided a patch which takes care of all these issues. It passes all JUnit tests. The patch is quite large, admittedly, but the change required several modifications and some more to preserve backwards-compatibility. Backwards-compatibility is preserved by moving some of the updated functionality into the package org.apache.lucene.search.fields (field comparators and parsers, SortField) while adding wrapper instances and keeping old code in org.apache.lucene.search. In detail, and besides the above-mentioned improvements, the following is provided: 1. An IndexCache specific to SegmentReaders. The two ThreadLocals are moved from SegmentReader to SegmentReaderIndexCache. 2. A housekeeping improvement to CloseableThreadLocal.
It now delegates the close() method to all registered instances by calling an onClose() method with the threads' instances. 3. Analyzer.close may now throw an IOException (this is already covered by java.io.Closeable). 4. A change to Collector: allow an IndexCache instead of an IndexReader to be passed to setNextReader(). 5. SortField's numeric types have been replaced by direct assignments of FieldComparatorSource. This removes the switch statements and the possibility to throw IllegalArgumentExceptions because of unsupported type values. The following classes have been deprecated and replaced by new classes in org.apache.lucene.search.fields: - FieldCacheRangeFilter (=> IndexFieldCacheRangeFilter) - FieldCacheTermsFilter (=> IndexFieldCacheTermsFilter) - FieldCache (=> IndexFieldCache) -
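The core of the proposal, a cache owned by each reader rather than a global singleton keyed by reader, can be sketched in plain Java. All class names below (IndexCacheSketch, ReaderSketch) are illustrative, not the patch's actual API:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// A per-reader cache: no global registry, so clones and reopened
// readers can never silently share or poison each other's entries.
class IndexCacheSketch {
    private final Map<String, Object> entries = new HashMap<>();

    Object computeIfAbsent(String key, Function<String, Object> loader) {
        return entries.computeIfAbsent(key, loader);
    }

    void close() { entries.clear(); }
    int size() { return entries.size(); }
}

class ReaderSketch {
    private final IndexCacheSketch cache = new IndexCacheSketch();

    IndexCacheSketch getIndexCache() { return cache; }

    // reopen() hands back a reader with a fresh, empty cache, so entries
    // computed against the old generation never leak into the new one.
    ReaderSketch reopen() { return new ReaderSketch(); }

    // The reader flushes its own cache on close (proposal item 3).
    void close() { cache.close(); }
}
```

This is the inversion the proposal describes: the reader owns and manages its cache's lifecycle, so no external sanity checker is needed to detect a top-level reader and its sub-readers caching the same data twice.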
[jira] Resolved: (LUCENE-2106) Benchmark does not close its Reader when OpenReader/CloseReader are not used
[ https://issues.apache.org/jira/browse/LUCENE-2106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller resolved LUCENE-2106. - Resolution: Fixed Benchmark does not close its Reader when OpenReader/CloseReader are not used Key: LUCENE-2106 URL: https://issues.apache.org/jira/browse/LUCENE-2106 Project: Lucene - Java Issue Type: Bug Components: contrib/benchmark Affects Versions: 3.0 Reporter: Mark Miller Assignee: Mark Miller Fix For: 3.0.1, 3.1 Attachments: LUCENE-2106.patch Only the Searcher is closed, but because the reader is passed to the Searcher, the Searcher does not close the Reader, causing a resource leak. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
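The ownership mistake behind this leak is easy to model in plain Java (the class names below are invented for illustration and are not the benchmark's actual code):

```java
// Stand-in for an IndexReader: tracks whether it was ever closed.
class LeakyReader {
    boolean closed = false;
    void close() { closed = true; }
}

class LeakySearcher {
    private final LeakyReader reader;

    // Mirrors IndexSearcher(IndexReader): the caller supplied the
    // reader, so the caller keeps ownership of it.
    LeakySearcher(LeakyReader reader) { this.reader = reader; }

    // Closing the searcher releases only searcher-local state; it must
    // not close a reader it does not own.
    void close() { /* no reader.close() here, by design */ }
}
```

The fix in the benchmark is therefore on the caller's side: whoever opened the reader must close it explicitly after closing the searcher, rather than assuming the searcher will do it.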
[jira] Commented: (LUCENE-1844) Speed up junit tests
[ https://issues.apache.org/jira/browse/LUCENE-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12787004#action_12787004 ] Mark Miller commented on LUCENE-1844: - It should work fine. Speed up junit tests Key: LUCENE-1844 URL: https://issues.apache.org/jira/browse/LUCENE-1844 Project: Lucene - Java Issue Type: Improvement Reporter: Mark Miller Assignee: Michael McCandless Fix For: 3.1 Attachments: FastCnstScoreQTest.patch, hi_junit_test_runtimes.png, LUCENE-1844-Junit3.patch, LUCENE-1844.patch, LUCENE-1844.patch, LUCENE-1844.patch As Lucene grows, so does the number of JUnit tests. This is obviously a good thing, but it comes with longer and longer test times. Now that we also run back compat tests in a standard test run, this problem is essentially doubled. There are some ways this may get better, including running parallel tests. You will need the hardware to fully take advantage, but it should be a nice gain. There is already an issue for this, and Junit 4.6, 4.7 have the beginnings of something we might be able to count on soon. 4.6 was buggy, and 4.7 still doesn't come with nice ant integration. Parallel tests will come though. Beyond parallel testing, I think we also need to concentrate on keeping our tests lean. We don't want to sacrifice coverage or quality, but I'm sure there is plenty of fat to skim. I've started making a list of some of the longer tests - I think with some work we can make our tests much faster - and then with parallelization, I think we could see some really great gains.
[jira] Created: (LUCENE-2130) Investigate Rewriting Constant Scoring MultiTermQueries per segment
Investigate Rewriting Constant Scoring MultiTermQueries per segment --- Key: LUCENE-2130 URL: https://issues.apache.org/jira/browse/LUCENE-2130 Project: Lucene - Java Issue Type: Improvement Reporter: Mark Miller Priority: Minor This issue is likely not to go anywhere, but I thought we might explore it. The only idea I have come up with is fairly ugly, and unless something better comes up, this is not likely to happen. But if we could rewrite constant score multi-term queries per segment, MTQ's with auto, constant, or constant boolean rewrite could enum terms against a single segment and then apply a boolean query against each segment with just the terms that are known to be in that segment. This way, if you have a bunch of really large segments and a lot of really small segments, you wouldn't apply a huge booleanquery against all of the small segments which don't have those terms anyway. How advantageous this is, I'm not sure yet. No biggie, not likely, but what the heck. So the ugly way to do it is to add a property to queries and weights - lateCnstRewrite or something, that defaults to false. MTQ would return true if it's in a constant score mode. On the top-level rewrite, if this is detected, an empty ConstantScoreQuery is made, and its Weight is turned to lateCnstRewrite and it keeps a ref to the original MTQ query. It also gets its boost set to the MTQ's boost. Then when we are searching per segment, if the Weight is lateCnstRewrite, we grab the orig query and actually do the rewrite against the subreader and grab the actual constantscore weight. It works, I think - but it's a little ugly.
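The per-segment expansion step can be illustrated in plain Java (no Lucene types; `rewriteForSegment` is a made-up helper, not anything in the patch): the terms an MTQ expands to are intersected with one segment's term dictionary, so each per-segment query carries only clauses that segment can actually match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class PerSegmentRewriteSketch {
    // expandedTerms: all terms the multi-term query matches across the
    // index; segmentTerms: the terms present in one segment's dictionary.
    // Returns the clauses of the hypothetical per-segment BooleanQuery.
    static List<String> rewriteForSegment(List<String> expandedTerms, Set<String> segmentTerms) {
        List<String> clauses = new ArrayList<>();
        for (String term : expandedTerms) {
            if (segmentTerms.contains(term)) {
                clauses.add(term);
            }
        }
        return clauses;
    }
}
```

A small segment that contains only one of the matching terms then gets a one-clause query instead of the full index-wide expansion.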
[jira] Commented: (LUCENE-2130) Investigate Rewriting Constant Scoring MultiTermQueries per segment
[ https://issues.apache.org/jira/browse/LUCENE-2130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12787239#action_12787239 ] Mark Miller commented on LUCENE-2130: - Whoops - a little off in that summary - you wouldn't apply a huge boolean query - you'd just have a sparser filter. This might not be that beneficial.
[jira] Commented: (LUCENE-2130) Investigate Rewriting Constant Scoring MultiTermQueries per segment
[ https://issues.apache.org/jira/browse/LUCENE-2130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12787248#action_12787248 ] Mark Miller commented on LUCENE-2130: - Okay - so talking to Robert in chat - the advantage when you are enumerating a lot of terms is that you avoid DirectoryReader's MultiTermEnum and its PQ.
[jira] Updated: (LUCENE-2130) Investigate Rewriting Constant Scoring MultiTermQueries per segment
[ https://issues.apache.org/jira/browse/LUCENE-2130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller updated LUCENE-2130: Attachment: LUCENE-2130.patch The ugly patch
[jira] Issue Comment Edited: (LUCENE-2130) Investigate Rewriting Constant Scoring MultiTermQueries per segment
[ https://issues.apache.org/jira/browse/LUCENE-2130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12787239#action_12787239 ] Mark Miller edited comment on LUCENE-2130 at 12/8/09 2:11 AM: -- Whoops - a little off in that summary - you wouldn't apply a huge boolean query - you'd just have a sparser filter. This might not be that beneficial. * edit * Smaller, sparser filter?
[jira] Issue Comment Edited: (LUCENE-2130) Investigate Rewriting Constant Scoring MultiTermQueries per segment
[ https://issues.apache.org/jira/browse/LUCENE-2130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12787250#action_12787250 ] Mark Miller edited comment on LUCENE-2130 at 12/8/09 2:16 AM: -- The ugly patch - (which doesn't yet handle the filter-supplied case)
[jira] Issue Comment Edited: (LUCENE-2130) Investigate Rewriting Constant Scoring MultiTermQueries per segment
[ https://issues.apache.org/jira/browse/LUCENE-2130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12787239#action_12787239 ] Mark Miller edited comment on LUCENE-2130 at 12/8/09 3:25 AM: -- Whoops - a little off in that summary - you wouldn't apply a huge boolean query - you'd just have a sparser filter. This might not be that beneficial. * edit * Smaller, sparser filter? *edit* Err - in the ConstantScore mode, I guess you're really just subdividing the filter - so no real benefit. I didn't realize it, but the constant booleanquery mode does still use the booleanquery (of course, why else have it) - but it's only going to be with few clauses, so neither is really a benefit.
[jira] Updated: (LUCENE-2130) Investigate Rewriting Constant Scoring MultiTermQueries per segment
[ https://issues.apache.org/jira/browse/LUCENE-2130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller updated LUCENE-2130: Comment: was deleted (was: Whoops - a little off in that summary - you wouldn't apply a huge boolean query - you'd just have a sparser filter. This might not be that beneficial. * edit * Smaller, sparser filter? *edit* Err - in the ConstantScore mode, I guess you're really just subdividing the filter - so no real benefit. I didn't realize it, but the constant booleanquery mode does still use the booleanquery (of course, why else have it) - but it's only going to be with few clauses, so neither is really a benefit.)
[jira] Updated: (LUCENE-2130) Investigate Rewriting Constant Scoring MultiTermQueries per segment
[ https://issues.apache.org/jira/browse/LUCENE-2130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller updated LUCENE-2130: Description: This issue is likely not to go anywhere, but I thought we might explore it. The only idea I have come up with is fairly ugly, and unless something better comes up, this is not likely to happen. But if we could rewrite constant score multi-term queries per segment, MTQ's with auto (when the heuristic doesn't cut over to constant filter) or constant boolean rewrite could enum terms against a single segment and then apply a boolean query against each segment with just the terms that are known to be in that segment. This also allows you to avoid DirectoryReader's MultiTermEnum and its PQ. (See Robert's comment below.) No biggie, not likely, but what the heck. So the ugly way to do it is to add a property to queries and weights - lateCnstRewrite or something, that defaults to false. MTQ would return true if it's in a constant score mode. On the top-level rewrite, if this is detected, an empty ConstantScoreQuery is made, and its Weight is turned to lateCnstRewrite and it keeps a ref to the original MTQ query. It also gets its boost set to the MTQ's boost. Then when we are searching per segment, if the Weight is lateCnstRewrite, we grab the orig query and actually do the rewrite against the subreader and grab the actual constantscore weight. It works, I think - but it's a little ugly. Not sure it's worth the baggage for the win - but perhaps the objective can be met in another way. was: This issue is likely not to go anywhere, but I thought we might explore it. The only idea I have come up with is fairly ugly, and unless something better comes up, this is not likely to happen. But if we could rewrite constant score multi-term queries per segment, MTQ's with auto, constant, or constant boolean rewrite could enum terms against a single segment and then apply a boolean query against each segment with just the terms that are known to be in that segment. This way, if you have a bunch of really large segments and a lot of really small segments, you wouldn't apply a huge booleanquery against all of the small segments which don't have those terms anyway. How advantageous this is, I'm not sure yet. No biggie, not likely, but what the heck. So the ugly way to do it is to add a property to queries and weights - lateCnstRewrite or something, that defaults to false. MTQ would return true if it's in a constant score mode. On the top-level rewrite, if this is detected, an empty ConstantScoreQuery is made, and its Weight is turned to lateCnstRewrite and it keeps a ref to the original MTQ query. It also gets its boost set to the MTQ's boost. Then when we are searching per segment, if the Weight is lateCnstRewrite, we grab the orig query and actually do the rewrite against the subreader and grab the actual constantscore weight. It works, I think - but it's a little ugly. I've spewed too much confusion in this issue - just going to rewrite the summary.
[jira] Commented: (LUCENE-2132) the demo application does not work as of 3.0
[ https://issues.apache.org/jira/browse/LUCENE-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12787287#action_12787287 ] Mark Miller commented on LUCENE-2132: - tsk tsk - got to run that demo, release manager ;) The webapp demo too (which I think we should drop just because of that - it's outdated and annoying to maintain). the demo application does not work as of 3.0 Key: LUCENE-2132 URL: https://issues.apache.org/jira/browse/LUCENE-2132 Project: Lucene - Java Issue Type: Bug Components: Other Affects Versions: 3.0 Reporter: Robert Muir Fix For: 3.1 Attachments: LUCENE-2132.patch The demo application does not work: QueryParser needs a Version argument. While I am here, remove @author too. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Closed: (LUCENE-874) Automatic reopen of IndexSearcher/IndexReader
[ https://issues.apache.org/jira/browse/LUCENE-874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller closed LUCENE-874. -- Resolution: Won't Fix Automatic reopen of IndexSearcher/IndexReader - Key: LUCENE-874 URL: https://issues.apache.org/jira/browse/LUCENE-874 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: João Fonseca Priority: Minor To improve performance, a single instance of IndexSearcher should be used. However, if the index is updated, it's hard to close and reopen it, because multiple threads may be accessing it at the same time. Lucene should include an out-of-the-box solution to this problem. Either a new class should be implemented to manage this behaviour (singleton IndexSearcher, plus detection of a modified index, plus safely closing and reopening the IndexSearcher), or this could be handled behind the scenes by the IndexSearcher class. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
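The check-version-and-swap pattern this issue asks for can be sketched with stub types (StubReader and ReaderManager are hypothetical names; in real Lucene the version check would come from the index itself, and later releases ship this behavior as IndexReader.reopen and, eventually, SearcherManager):

```java
import java.util.concurrent.atomic.AtomicReference;

// Stand-in for an index reader that knows which index version it was opened on.
class StubReader {
    final long version;
    StubReader(long version) { this.version = version; }
}

// Manages a shared reader: callers always get the current instance, and
// maybeReopen() swaps in a fresh one only when the index has changed.
class ReaderManager {
    private final AtomicReference<StubReader> current;
    ReaderManager(StubReader initial) { current = new AtomicReference<>(initial); }

    StubReader acquire() { return current.get(); }

    // indexVersion stands in for a check like IndexReader.getCurrentVersion(dir).
    boolean maybeReopen(long indexVersion) {
        StubReader old = current.get();
        if (old.version == indexVersion) return false; // still fresh, keep sharing it
        // Atomically publish the new reader so concurrent searchers are safe.
        return current.compareAndSet(old, new StubReader(indexVersion));
    }
}

public class ReopenDemo {
    public static void main(String[] args) {
        ReaderManager mgr = new ReaderManager(new StubReader(1));
        System.out.println(mgr.maybeReopen(1)); // index unchanged: no reopen
        System.out.println(mgr.maybeReopen(2)); // index advanced: swap readers
        System.out.println(mgr.acquire().version);
    }
}
```

A real implementation also needs reference counting so the old reader is closed only after in-flight searches finish; that bookkeeping is omitted here.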
[jira] Closed: (LUCENE-252) [PATCH] Problem with Sort logic on tokenized fields
[ https://issues.apache.org/jira/browse/LUCENE-252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller closed LUCENE-252. -- Resolution: Fixed Assignee: (was: Lucene Developers) This issue is too old - if a new patch/proposal is brought up we can reopen it. [PATCH] Problem with Sort logic on tokenized fields --- Key: LUCENE-252 URL: https://issues.apache.org/jira/browse/LUCENE-252 Project: Lucene - Java Issue Type: Bug Components: Search Affects Versions: 1.4 Environment: Operating System: other Platform: All Reporter: Aviran Mordo Attachments: dif.txt, FieldCacheImpl_Tokenized_fields_lucene_2.0.patch, FieldCacheImpl_Tokenized_fields_lucene_2.0_v1.1.patch, FieldCacheImpl_Tokenized_fields_lucene_2.2-dev.patch When you set a SortField to a Text field which gets tokenized, FieldCacheImpl uses the term to do the sort, but then sorting is off, especially with more than one word in the field. I think it is much more logical to sort by the field's string value if the sort field is tokenized and stored. This way you'll get the CORRECT sort order. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1000) queryparsersyntax.html escaping section needs beefed up
[ https://issues.apache.org/jira/browse/LUCENE-1000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller updated LUCENE-1000: Fix Version/s: 3.1 queryparsersyntax.html escaping section needs beefed up --- Key: LUCENE-1000 URL: https://issues.apache.org/jira/browse/LUCENE-1000 Project: Lucene - Java Issue Type: Improvement Components: Website Reporter: Hoss Man Fix For: 3.1 the query syntax documentation is currently lacking several key pieces of info: 1) that unicode-style escapes are valid 2) that any character can be escaped with a backslash, not just special chars. ...we should probably beef up the Escaping Special Characters section -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1923) Add toString() or getName() method to IndexReader
[ https://issues.apache.org/jira/browse/LUCENE-1923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12787290#action_12787290 ] Mark Miller commented on LUCENE-1923: - How's that patch coming ;) Add toString() or getName() method to IndexReader - Key: LUCENE-1923 URL: https://issues.apache.org/jira/browse/LUCENE-1923 Project: Lucene - Java Issue Type: Wish Components: Index Reporter: Tim Smith It would be very useful for debugging if IndexReader either had a getName() method, or a toString() implementation that would get a string identification for the reader. For SegmentReader, this would return the same as getSegmentName(). For directory readers, this would return the generation id? For MultiReader, this could return something like multi(sub reader name, sub reader name, sub reader name, ...). Right now, I have to check instanceof for SegmentReader, then call getSegmentName(), and for all other IndexReader types I would have to do something like get the IndexCommit and get the generation off it (and this may throw UnsupportedOperationException, at which point I would have to recursively walk sub readers and try again). I could work up a patch if others like this idea. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
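The recursive naming scheme proposed above is easy to sketch. The classes here (NamedReader, SegReader, MultiNamedReader) are illustrative stand-ins, not the real IndexReader hierarchy; they just show how a composite reader could compose its name from its sub-readers:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Illustrative reader hierarchy; not the actual Lucene API.
abstract class NamedReader {
    abstract String name();
}

// Leaf: names itself after its segment, like SegmentReader.getSegmentName().
class SegReader extends NamedReader {
    private final String segment;
    SegReader(String segment) { this.segment = segment; }
    String name() { return segment; }
}

// Composite: builds "multi(sub, sub, ...)" recursively, so nested
// MultiReaders produce nested names without any instanceof checks.
class MultiNamedReader extends NamedReader {
    private final List<NamedReader> subs;
    MultiNamedReader(NamedReader... subs) { this.subs = Arrays.asList(subs); }
    String name() {
        return subs.stream().map(NamedReader::name)
                   .collect(Collectors.joining(", ", "multi(", ")"));
    }
}

public class ReaderNameDemo {
    public static void main(String[] args) {
        NamedReader r = new MultiNamedReader(
            new SegReader("_1"), new SegReader("_2"),
            new MultiNamedReader(new SegReader("_3")));
        System.out.println(r.name());
    }
}
```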
[jira] Updated: (LUCENE-2018) Reconsider boolean max clause exception
[ https://issues.apache.org/jira/browse/LUCENE-2018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller updated LUCENE-2018: Fix Version/s: 3.1 Reconsider boolean max clause exception --- Key: LUCENE-2018 URL: https://issues.apache.org/jira/browse/LUCENE-2018 Project: Lucene - Java Issue Type: Improvement Reporter: Mark Miller Fix For: 3.1 Now that we have smarter multi-term queries, I think it's time to reconsider the boolean max clause setting. It made more sense before, because you could hit it more unaware when the multi-term queries got huge - now it's more likely that if it happens, it's because a user built the boolean themselves. And no duh, thousands more boolean clauses means slower perf and more resources needed. We don't throw an exception when you try to use a ton of resources in a thousand other ways. The current setting also suffers from the static-hell argument - especially when you consider something like Solr's multicore feature - you can have different settings for this in different cores, and the last one is going to win. It's ugly. Yes, that could be addressed better in Solr as well - but I still think it should be less ugly in Lucene too. I'd like to consider either doing away with it, or at least raising it by quite a bit. Or an alternative better solution. Right now, it ain't so great. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Closed: (LUCENE-421) Numeric range searching with large value sets
[ https://issues.apache.org/jira/browse/LUCENE-421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller closed LUCENE-421. -- Resolution: Fixed Assignee: (was: Lucene Developers) Closing - a few years old now and we currently have NumericRangeQuery. Numeric range searching with large value sets - Key: LUCENE-421 URL: https://issues.apache.org/jira/browse/LUCENE-421 Project: Lucene - Java Issue Type: Improvement Components: Search Affects Versions: 1.4 Environment: Operating System: other Platform: Other Reporter: Randy Puttick Priority: Minor Attachments: FieldCache.java, FieldCacheImpl.java, FloatRangeQuery.java, FloatRangeScorer.java, FloatRangeScorer.java, IntegerRangeQuery.java, IntegerRangeQueryTestCase.java, IntegerRangeScorer.java, IntegerRangeScorer.java, IntStack.java, RangeQuery.java, Sort.java I have a set of enhancements that build on the numeric sorting cache introduced by Tim Jones and that provide integer and floating point range searches over numeric ranges that are far too large to be implemented via the current term range rewrite mechanism. I'm new to Apache and trying to find out how to attach the source files for the changes for your consideration. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Closed: (LUCENE-379) Contribution: Efficient Sorting of DateField/DateTools Encoded Timestamp Long Values
[ https://issues.apache.org/jira/browse/LUCENE-379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller closed LUCENE-379. -- Resolution: Fixed Assignee: (was: Lucene Developers) closing - patch is a few years old and we have Numeric for this now. Contribution: Efficient Sorting of DateField/DateTools Encoded Timestamp Long Values Key: LUCENE-379 URL: https://issues.apache.org/jira/browse/LUCENE-379 Project: Lucene - Java Issue Type: Improvement Components: Search Affects Versions: 1.4 Environment: Operating System: All Platform: Other Reporter: Rasik Pandey Priority: Minor Attachments: org.apache.lucene.search.LongSortComparator.zip, org.apache.lucene.search.ZIP, org.apache.lucene.search.ZIP, patchTestSort.txt, patchTestSort.txt, patchTestSort.txt Hello Tim, As promised, the sort functionality for long values is included in the attached files. patchTestSort.txt contains the diff info. for my modifications to the TestSort.java class org.apache.lucene.search.ZIP contains the three new class files for efficient sorting of long field values and of encoded timestamp field values as long values. Let me know if you have any questions. Regards, Rus -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2085) Update PayloadSpanUtil
[ https://issues.apache.org/jira/browse/LUCENE-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller updated LUCENE-2085: Fix Version/s: 3.1 Update PayloadSpanUtil -- Key: LUCENE-2085 URL: https://issues.apache.org/jira/browse/LUCENE-2085 Project: Lucene - Java Issue Type: Improvement Components: Search Affects Versions: 2.9.1 Reporter: Mark Miller Assignee: Mark Miller Fix For: 3.1 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Closed: (LUCENE-1286) LargeDocHighlighter - another span highlighter optimized for large documents
[ https://issues.apache.org/jira/browse/LUCENE-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller closed LUCENE-1286. --- Resolution: Fixed This isn't likely to go anywhere anytime soon - Koji's FastVectorHighlighter, while requiring termvectors, accomplishes this pretty nicely. LargeDocHighlighter - another span highlighter optimized for large documents Key: LUCENE-1286 URL: https://issues.apache.org/jira/browse/LUCENE-1286 Project: Lucene - Java Issue Type: Improvement Components: contrib/highlighter Affects Versions: 2.4 Reporter: Mark Miller Priority: Minor The existing Highlighter API is rich and well designed, but the approach taken is not very efficient for large documents. I believe that this is because the current Highlighter rebuilds the document by running through and scoring every token in the tokenstream. With a break in the current API, an alternate approach can be taken: rebuild the document by running through the query terms, using their offsets. The benefit is clear - a large doc will have a large tokenstream, but a query will likely be very small in comparison. I expect this approach to be quite a bit faster for very large documents, while still supporting Phrase and Span queries. First rough patch to follow shortly. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-375) fish*~ parses to PrefixQuery - should be a parse exception
[ https://issues.apache.org/jira/browse/LUCENE-375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller updated LUCENE-375: --- Assignee: Luis Alves (was: Lucene Developers) fish*~ parses to PrefixQuery - should be a parse exception -- Key: LUCENE-375 URL: https://issues.apache.org/jira/browse/LUCENE-375 Project: Lucene - Java Issue Type: Bug Components: QueryParser Affects Versions: 1.4 Environment: Operating System: other Platform: Other Reporter: Erik Hatcher Assignee: Luis Alves Priority: Minor QueryParser parses fish*~ into a fish* PrefixQuery and silently drops the ~. This really should be a parse exception. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-858) link from Lucene web page to API docs
[ https://issues.apache.org/jira/browse/LUCENE-858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller updated LUCENE-858: --- Fix Version/s: 3.1 Assignee: (was: Grant Ingersoll) link from Lucene web page to API docs - Key: LUCENE-858 URL: https://issues.apache.org/jira/browse/LUCENE-858 Project: Lucene - Java Issue Type: Improvement Reporter: Daniel Naber Fix For: 3.1 There should be a way to link from e.g. http://lucene.apache.org/java/docs/gettingstarted.html to the API docs, but not just to the start page with the frame set but to a specific page, e.g. this: http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/overview-summary.html#overview_description To make this work a way to set a relative link is needed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1307) Remove Contributions page
[ https://issues.apache.org/jira/browse/LUCENE-1307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller updated LUCENE-1307: Fix Version/s: 3.1 Remove Contributions page - Key: LUCENE-1307 URL: https://issues.apache.org/jira/browse/LUCENE-1307 Project: Lucene - Java Issue Type: Improvement Components: Website Reporter: Otis Gospodnetic Priority: Minor Fix For: 3.1 On Fri, May 16, 2008 at 10:06 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Hola, Does anyone think the Contributions page should be removed? http://lucene.apache.org/java/2_3_2/contributions.html It looks so outdated that I think it may give newcomers a bad impression of Lucene (What, this is it for contributions?). The only really valuable piece there is Luke, but Luke must be mentioned in a dozen places on the Wiki anyway. Should we remove the Contributions page? Yonik and Grant gave their +1s. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1941) MinPayloadFunction returns 0 when only one payload is present
[ https://issues.apache.org/jira/browse/LUCENE-1941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller updated LUCENE-1941: Fix Version/s: 3.1 3.0.1 MinPayloadFunction returns 0 when only one payload is present - Key: LUCENE-1941 URL: https://issues.apache.org/jira/browse/LUCENE-1941 Project: Lucene - Java Issue Type: Bug Components: Query/Scoring Affects Versions: 2.9 Reporter: Erik Hatcher Fix For: 3.0.1, 3.1 In some experiments with payload scoring through PayloadTermQuery, I'm seeing 0 returned when using MinPayloadFunction. I believe there is a bug there. No time at the moment to flesh out a unit test, but wanted to report it for tracking. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1271) ClassCastException when using ParallelMultiSearcher.search(Query query, Filter filter, int n, Sort sort)
[ https://issues.apache.org/jira/browse/LUCENE-1271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller updated LUCENE-1271: Fix Version/s: 3.1 ClassCastException when using ParallelMultiSearcher.search(Query query, Filter filter, int n, Sort sort) Key: LUCENE-1271 URL: https://issues.apache.org/jira/browse/LUCENE-1271 Project: Lucene - Java Issue Type: Bug Components: Search Affects Versions: 2.3, 2.3.1 Environment: MS Windows XP (SP 2), JDK 1.5.0 Update 12 Reporter: Kai Burjack Priority: Minor Fix For: 3.1

Stacktrace output in console:

Exception in thread MultiSearcher thread #1 java.lang.ClassCastException: org.apache.lucene.search.ScoreDoc
    at org.apache.lucene.search.FieldDocSortedHitQueue.lessThan(FieldDocSortedHitQueue.java:105)
    at org.apache.lucene.util.PriorityQueue.upHeap(PriorityQueue.java:139)
    at org.apache.lucene.util.PriorityQueue.put(PriorityQueue.java:53)
    at org.apache.lucene.util.PriorityQueue.insertWithOverflow(PriorityQueue.java:78)
    at org.apache.lucene.util.PriorityQueue.insert(PriorityQueue.java:63)
    at org.apache.lucene.search.MultiSearcherThread.run(ParallelMultiSearcher.java:272)

Exception in thread MultiSearcher thread #2 java.lang.ClassCastException: org.apache.lucene.search.ScoreDoc
    at org.apache.lucene.search.FieldDocSortedHitQueue.lessThan(FieldDocSortedHitQueue.java:105)
    at org.apache.lucene.util.PriorityQueue.upHeap(PriorityQueue.java:139)
    at org.apache.lucene.util.PriorityQueue.put(PriorityQueue.java:53)
    at org.apache.lucene.util.PriorityQueue.insertWithOverflow(PriorityQueue.java:78)
    at org.apache.lucene.util.PriorityQueue.insert(PriorityQueue.java:63)
    at org.apache.lucene.search.MultiSearcherThread.run(ParallelMultiSearcher.java:272)

Stack trace of the resulting exception while running the JUnit test:

java.lang.ClassCastException: org.apache.lucene.search.ScoreDoc
    at org.apache.lucene.search.FieldDocSortedHitQueue.lessThan(FieldDocSortedHitQueue.java:105)
    at org.apache.lucene.util.PriorityQueue.downHeap(PriorityQueue.java:155)
    at org.apache.lucene.util.PriorityQueue.pop(PriorityQueue.java:106)
    at org.apache.lucene.search.ParallelMultiSearcher.search(ParallelMultiSearcher.java:146)
    at org.apache.lucene.search.Searcher.search(Searcher.java:78)
    at [class calling the Searcher.search(Query query, Filter filter, int n, Sort sort) method with filter:null and sort:null]
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at junit.framework.TestCase.runTest(TestCase.java:154)
    at junit.framework.TestCase.runBare(TestCase.java:127)
    at junit.framework.TestResult$1.protect(TestResult.java:106)
    at junit.framework.TestResult.runProtected(TestResult.java:124)
    at junit.framework.TestResult.run(TestResult.java:109)
    at junit.framework.TestCase.run(TestCase.java:118)
    at junit.framework.TestSuite.runTest(TestSuite.java:208)
    at junit.framework.TestSuite.run(TestSuite.java:203)
    at org.eclipse.jdt.internal.junit.runner.junit3.JUnit3TestReference.run(JUnit3TestReference.java:130)
    at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:460)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:673)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:386)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:196)

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-860) site should call project Lucene Java, not just Lucene
[ https://issues.apache.org/jira/browse/LUCENE-860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller updated LUCENE-860: --- Fix Version/s: 3.1 site should call project Lucene Java, not just Lucene - Key: LUCENE-860 URL: https://issues.apache.org/jira/browse/LUCENE-860 Project: Lucene - Java Issue Type: Improvement Components: Website Reporter: Doug Cutting Assignee: Mark Miller Priority: Minor Fix For: 3.1 Attachments: LUCENE-860.patch To avoid confusion with the top-level Lucene project, the Lucene Java website should refer to itself as Lucene Java. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1736) DateTools.java general improvements
[ https://issues.apache.org/jira/browse/LUCENE-1736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller updated LUCENE-1736: Fix Version/s: 3.1 DateTools.java general improvements --- Key: LUCENE-1736 URL: https://issues.apache.org/jira/browse/LUCENE-1736 Project: Lucene - Java Issue Type: Improvement Affects Versions: 2.9 Reporter: David Smiley Priority: Minor Fix For: 3.1 Attachments: cleanerDateTools.patch Applying the attached patch shows the improvements to DateTools.java that I think should be made. All logic that does anything at all is moved to instance methods of the inner class Resolution. I argue this is more object-oriented.
1. In cases where Resolution is an argument to the method, I can simply invoke the appropriate call on the Resolution object. Formerly there was a big if/else branch.
2. Instead of synchronized being used seemingly everywhere, synchronized is used only to sync on the object that is not threadsafe, be it a DateFormat or Calendar instance.
3. Since different DateFormat and Calendar instances are created per Resolution, there is now less lock contention, since threads using different resolutions will not use the same locks.
4. The old implementation of timeToString rounded the time before formatting it. That's unnecessary, since the format only includes the resolution desired.
5. round() now uses a switch statement that benefits from fall-through (no break).
Another debatable improvement that could be made is putting the resolution instances into an array indexed by format length. This would let me remove the switch in lookupResolutionByLength() and avoid the length constants there. Maybe that would be a bit too over-engineered when the switch is fine. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
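Points 2 and 3 above (per-resolution formatters, locking only on the non-threadsafe instance) can be sketched like this. The Res enum and its patterns are illustrative, not the patch's actual code; the key property is that each resolution owns its own SimpleDateFormat and synchronizes only on that, so threads formatting at different resolutions never share a lock:

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Sketch of per-Resolution formatting: each enum constant owns its own
// (non-threadsafe) SimpleDateFormat and locks only that instance.
enum Res {
    DAY("yyyyMMdd"), MINUTE("yyyyMMddHHmm");

    private final SimpleDateFormat fmt;

    Res(String pattern) {
        fmt = new SimpleDateFormat(pattern);
        fmt.setTimeZone(TimeZone.getTimeZone("UTC")); // deterministic output
    }

    String format(Date d) {
        synchronized (fmt) { // lock per resolution, not one global lock
            return fmt.format(d);
        }
    }
}

public class ResolutionDemo {
    public static void main(String[] args) {
        Date epoch = new Date(0L); // 1970-01-01T00:00:00Z
        System.out.println(Res.DAY.format(epoch));
        System.out.println(Res.MINUTE.format(epoch));
    }
}
```

Note point 4 falls out naturally: formatting at DAY resolution already truncates to the day, so no separate rounding step is needed before formatting.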
[jira] Resolved: (LUCENE-636) [PATCH] Differently configured Lucene 'instances' in same JVM
[ https://issues.apache.org/jira/browse/LUCENE-636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller resolved LUCENE-636. Resolution: Fixed Fix Version/s: 3.0 [PATCH] Differently configured Lucene 'instances' in same JVM - Key: LUCENE-636 URL: https://issues.apache.org/jira/browse/LUCENE-636 Project: Lucene - Java Issue Type: Improvement Affects Versions: 2.0.0 Reporter: Johan Stuyts Fix For: 3.0 Attachments: Lucene2DifferentConfigurations.patch Currently Lucene can be configured using system properties. When running multiple 'instances' of Lucene for different purposes in the same JVM, it is not possible to use different settings for each 'instance'. I made changes to some Lucene classes so you can pass a configuration to that class. The Lucene 'instance' will use the settings from that configuration. The changes do not affect the API and/or the current behavior, so they are backwards compatible. In addition to the changes above, I also made the SegmentReader and SegmentTermDocs extensible outside of their package. I would appreciate the inclusion of these changes but don't mind creating a separate issue for them. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Resolved: (LUCENE-1052) Add an termInfosIndexDivisor to IndexReader
[ https://issues.apache.org/jira/browse/LUCENE-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller resolved LUCENE-1052. - Resolution: Fixed This issue was resolved - let's open a new one if we want to do more. Add an termInfosIndexDivisor to IndexReader - Key: LUCENE-1052 URL: https://issues.apache.org/jira/browse/LUCENE-1052 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 2.2 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Attachments: LUCENE-1052.patch, LUCENE-1052.patch, termInfosConfigurer.patch The termIndexInterval, set during indexing time, lets you trade off how much RAM is used by a reader to load the indexed terms vs the cost of seeking to the specific term you want to load. But the downside is you must set it at indexing time. This issue adds an indexDivisor to TermInfosReader so that on opening a reader you could further sub-sample the termIndexInterval to use less RAM. E.g. a setting of 2 means every 2 * termIndexInterval is loaded into RAM. This is particularly useful if your index has a great many terms (e.g. you accidentally indexed binary terms). Spinoff from this thread: http://www.gossamer-threads.com/lists/lucene/java-dev/54371 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
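The sub-sampling arithmetic described above is simple to illustrate. This is a minimal sketch of the idea only, not TermInfosReader's code: the term index already holds every termIndexInterval-th term, and a divisor of N at open time keeps only every N-th of those entries, cutting the reader's RAM for the term index by roughly a factor of N (at the cost of a slightly longer scan per seek):

```java
import java.util.ArrayList;
import java.util.List;

public class DivisorDemo {
    // Keep every divisor-th entry of the already-sampled term index,
    // mirroring what an indexDivisor of N does when a reader opens.
    static List<String> subSample(List<String> indexedTerms, int divisor) {
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < indexedTerms.size(); i += divisor) {
            kept.add(indexedTerms.get(i));
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical term index: every termIndexInterval-th term on disk.
        List<String> termIndex = List.of("a", "d", "g", "j", "m", "p");
        System.out.println(subSample(termIndex, 2)); // divisor 2: half the RAM
    }
}
```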
[jira] Closed: (LUCENE-1859) TermAttributeImpl's buffer will never shrink if it grows too big
[ https://issues.apache.org/jira/browse/LUCENE-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller closed LUCENE-1859. --- Resolution: Won't Fix TermAttributeImpl's buffer will never shrink if it grows too big -- Key: LUCENE-1859 URL: https://issues.apache.org/jira/browse/LUCENE-1859 Project: Lucene - Java Issue Type: Bug Components: Analysis Affects Versions: 2.9 Reporter: Tim Smith Priority: Minor This was previously an issue with Token as well. If a TermAttributeImpl is populated with a very long buffer, it will never be able to reclaim this memory. Obviously, it can be argued that Tokenizers should never emit large tokens; however, it seems that TermAttributeImpl should have a reasonable static MAX_BUFFER_SIZE such that if the term buffer grows bigger than this, it will shrink back down to this size once the next token smaller than MAX_BUFFER_SIZE is set. I don't think I have actually encountered issues with this yet; however, it seems like if you have multiple indexing threads, you could end up with a char[Integer.MAX_VALUE] per thread (in the very worst case scenario). Perhaps growTermBuffer should have the logic to shrink if the buffer is currently larger than MAX_BUFFER_SIZE and it needs less than MAX_BUFFER_SIZE. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
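The proposed shrink behavior can be sketched in a few lines. This is not TermAttributeImpl's actual code, and the MAX_BUFFER_SIZE value here is an arbitrary illustrative constant: grow as needed, but once a term that fits under the cap arrives, release an oversized buffer so a single huge token can't pin memory for the life of the thread:

```java
import java.util.Arrays;

// Sketch of a term buffer with the shrink logic the issue proposes.
public class ShrinkingTermBuffer {
    static final int MAX_BUFFER_SIZE = 16; // illustrative cap, not Lucene's
    private char[] buffer = new char[8];

    void setTerm(char[] term, int len) {
        if (len > buffer.length) {
            buffer = new char[len];             // grow to fit a big token
        } else if (buffer.length > MAX_BUFFER_SIZE && len <= MAX_BUFFER_SIZE) {
            buffer = new char[MAX_BUFFER_SIZE]; // reclaim the oversized buffer
        }
        System.arraycopy(term, 0, buffer, 0, len);
    }

    int capacity() { return buffer.length; }

    public static void main(String[] args) {
        ShrinkingTermBuffer b = new ShrinkingTermBuffer();
        char[] huge = new char[100];
        Arrays.fill(huge, 'x');
        b.setTerm(huge, huge.length);
        System.out.println(b.capacity()); // grew past the cap for one token
        b.setTerm(new char[] {'h', 'i'}, 2);
        System.out.println(b.capacity()); // shrank back to MAX_BUFFER_SIZE
    }
}
```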
[jira] Updated: (LUCENE-2130) Investigate Rewriting Constant Scoring MultiTermQueries per segment
[ https://issues.apache.org/jira/browse/LUCENE-2130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller updated LUCENE-2130: Attachment: LUCENE-2130.patch updated Investigate Rewriting Constant Scoring MultiTermQueries per segment --- Key: LUCENE-2130 URL: https://issues.apache.org/jira/browse/LUCENE-2130 Project: Lucene - Java Issue Type: Improvement Reporter: Mark Miller Priority: Minor Fix For: Flex Branch Attachments: LUCENE-2130.patch, LUCENE-2130.patch -- This message is automatically generated by JIRA. 
- You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2130) Investigate Rewriting Constant Scoring MultiTermQueries per segment
[ https://issues.apache.org/jira/browse/LUCENE-2130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller updated LUCENE-2130: Fix Version/s: Flex Branch Investigate Rewriting Constant Scoring MultiTermQueries per segment --- Key: LUCENE-2130 URL: https://issues.apache.org/jira/browse/LUCENE-2130 Project: Lucene - Java Issue Type: Improvement Reporter: Mark Miller Priority: Minor Fix For: Flex Branch Attachments: LUCENE-2130.patch, LUCENE-2130.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. 
- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
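The per-segment idea above can be illustrated with a small self-contained sketch (plain Java, no Lucene dependencies; the segment dictionaries and the prefix-query expansion here are hypothetical stand-ins for Lucene's term enumeration): instead of enumerating matching terms once against a merged view of all segments via a priority queue, enumerate against each segment individually and build a per-segment term list containing only terms actually present in that segment.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

public class PerSegmentRewrite {
    // Hypothetical stand-in for each segment's term dictionary.
    static final List<Set<String>> SEGMENTS = List.of(
            Set.of("foo", "foobar", "baz"),
            Set.of("football", "bar"));

    // Expand a prefix-style multi-term query per segment: each segment gets a
    // boolean-style term list built only from its own dictionary, so no
    // cross-segment merged enum (and no PQ) is needed.
    static List<List<String>> rewritePerSegment(String prefix) {
        List<List<String>> perSegmentTerms = new ArrayList<>();
        for (Set<String> dict : SEGMENTS) {
            List<String> terms = new ArrayList<>();
            for (String t : dict) {
                if (t.startsWith(prefix)) terms.add(t);
            }
            Collections.sort(terms);
            perSegmentTerms.add(terms);
        }
        return perSegmentTerms;
    }

    public static void main(String[] args) {
        // Each inner list holds only terms known to exist in that segment.
        System.out.println(rewritePerSegment("foo"));  // [[foo, foobar], [football]]
    }
}
```

The win Mark describes is exactly this: segment two never sees "foo" or "foobar", so its per-segment boolean query stays as small as its own dictionary allows.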
[jira] Commented: (LUCENE-1859) TermAttributeImpl's buffer will never shrink if it grows too big
[ https://issues.apache.org/jira/browse/LUCENE-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12786680#action_12786680 ]

Mark Miller commented on LUCENE-1859:

Without a proposed patch from someone, I'm tempted to close this issue...

TermAttributeImpl's buffer will never shrink if it grows too big

    Key: LUCENE-1859
    URL: https://issues.apache.org/jira/browse/LUCENE-1859
    Project: Lucene - Java
    Issue Type: Bug
    Components: Analysis
    Affects Versions: 2.9
    Reporter: Tim Smith
    Priority: Minor

This was previously an issue with Token as well. If a TermAttributeImpl is populated with a very long buffer, it will never be able to reclaim this memory. Obviously, it can be argued that Tokenizers should never emit large tokens; however, it seems that TermAttributeImpl should have a reasonable static MAX_BUFFER_SIZE such that if the term buffer grows bigger than this, it will shrink back down to this size once the next token smaller than MAX_BUFFER_SIZE is set.

I don't think I have actually encountered issues with this yet, but it seems that with multiple indexing threads you could end up with a char[Integer.MAX_VALUE] per thread (in the very worst-case scenario). Perhaps growTermBuffer should have the logic to shrink if the buffer is currently larger than MAX_BUFFER_SIZE and it needs less than MAX_BUFFER_SIZE.
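A minimal sketch of the shrink-on-set behavior Tim proposes (the class name and the MAX_BUFFER_SIZE constant are hypothetical illustrations, not Lucene's actual TermAttributeImpl API): the buffer grows as needed for an oversized token, but when a later, smaller term is set while the buffer has ballooned past MAX_BUFFER_SIZE, it is reallocated back down so the memory can be reclaimed.

```java
public class ShrinkableTermBuffer {
    // Hypothetical cap; Lucene's TermAttributeImpl had no such constant.
    static final int MAX_BUFFER_SIZE = 16 * 1024;

    private char[] buffer = new char[16];
    private int length;

    public void setTerm(char[] term, int len) {
        if (len > buffer.length) {
            // Grow to fit an oversized token (next power of two here).
            buffer = new char[Integer.highestOneBit(len - 1) << 1];
        } else if (buffer.length > MAX_BUFFER_SIZE && len <= MAX_BUFFER_SIZE) {
            // Shrink: reclaim memory once a normal-sized token arrives.
            buffer = new char[MAX_BUFFER_SIZE];
        }
        System.arraycopy(term, 0, buffer, 0, len);
        length = len;
    }

    public int capacity() { return buffer.length; }
    public int length() { return length; }

    public static void main(String[] args) {
        ShrinkableTermBuffer b = new ShrinkableTermBuffer();
        b.setTerm(new char[100_000], 100_000);  // oversized token grows the buffer
        b.setTerm("hello".toCharArray(), 5);    // next small token shrinks it back
        System.out.println(b.capacity());       // back down to MAX_BUFFER_SIZE
    }
}
```

This keeps the worst case bounded at MAX_BUFFER_SIZE per thread between large tokens, rather than holding the largest buffer ever seen.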
[jira] Commented: (LUCENE-774) TopDocs and TopFieldDocs does not implement equals and hashCode
[ https://issues.apache.org/jira/browse/LUCENE-774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12786681#action_12786681 ]

Mark Miller commented on LUCENE-774:

Still want to push forward with this issue?

TopDocs and TopFieldDocs does not implement equals and hashCode

    Key: LUCENE-774
    URL: https://issues.apache.org/jira/browse/LUCENE-774
    Project: Lucene - Java
    Issue Type: Improvement
    Affects Versions: 2.0.0
    Reporter: Karl Wettin
    Priority: Trivial
    Attachments: extendsObject.diff
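For reference, an equals/hashCode pair for a TopDocs-like value class would look roughly like this (a sketch with simplified, hypothetical fields; this is not the extendsObject.diff attached to the issue):

```java
import java.util.Arrays;
import java.util.Objects;

public class SimpleTopDocs {
    // Simplified stand-ins for TopDocs-style fields.
    final int totalHits;
    final int[] docIds;
    final float maxScore;

    SimpleTopDocs(int totalHits, int[] docIds, float maxScore) {
        this.totalHits = totalHits;
        this.docIds = docIds;
        this.maxScore = maxScore;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof SimpleTopDocs)) return false;
        SimpleTopDocs other = (SimpleTopDocs) o;
        return totalHits == other.totalHits
            && Float.compare(maxScore, other.maxScore) == 0  // NaN-safe float compare
            && Arrays.equals(docIds, other.docIds);          // array contents, not identity
    }

    @Override
    public int hashCode() {
        // Must be consistent with equals: same fields, array by contents.
        return Objects.hash(totalHits, maxScore, Arrays.hashCode(docIds));
    }
}
```

The usual pitfalls for result-holder classes are comparing arrays by identity and forgetting to keep hashCode in sync with equals; both are handled above.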
[jira] Commented: (LUCENE-792) PrecedenceQueryParser misinterprets queries starting with NOT
[ https://issues.apache.org/jira/browse/LUCENE-792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12786683#action_12786683 ]

Mark Miller commented on LUCENE-792:

Based on its state, we should probably deprecate PrecedenceQueryParser in favor of the precedence support that's about to land in the new QueryParser impl.

PrecedenceQueryParser misinterprets queries starting with NOT

    Key: LUCENE-792
    URL: https://issues.apache.org/jira/browse/LUCENE-792
    Project: Lucene - Java
    Issue Type: Bug
    Components: QueryParser
    Affects Versions: 2.0.0
    Reporter: Eric Jain

"NOT foo AND baz" is parsed as "-(+foo +baz)" instead of "-foo +baz". (I'm setting parser.setDefaultOperator(PrecedenceQueryParser.AND_OPERATOR), but the issue applies otherwise too.)
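The expected precedence can be modeled in a few lines (a self-contained toy, not the actual PrecedenceQueryParser code): NOT should bind only to the term immediately following it, not to the whole AND-ed expression, so the two readings produce different mandatory/prohibited flags.

```java
import java.util.ArrayList;
import java.util.List;

public class NotPrecedenceDemo {
    // Parse a flat "term (AND term)*" query where NOT prefixes a single term
    // and AND is the default operator. Clauses come out as "-term" / "+term".
    static List<String> parse(String query) {
        List<String> clauses = new ArrayList<>();
        boolean negateNext = false;
        for (String tok : query.split("\\s+")) {
            if (tok.equals("NOT")) {
                negateNext = true;           // applies to the next term only
            } else if (tok.equals("AND")) {
                // default operator is already AND; nothing to do
            } else {
                clauses.add((negateNext ? "-" : "+") + tok);
                negateNext = false;
            }
        }
        return clauses;
    }

    public static void main(String[] args) {
        // Expected reading: NOT applies to foo alone, not to (foo AND baz).
        System.out.println(parse("NOT foo AND baz"));  // [-foo, +baz]
    }
}
```

The bug report amounts to PrecedenceQueryParser producing the -(+foo +baz) grouping, i.e. treating NOT as if it scoped over the whole conjunction instead of just `foo`.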
[jira] Commented: (LUCENE-860) site should call project Lucene Java, not just Lucene
[ https://issues.apache.org/jira/browse/LUCENE-860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12786684#action_12786684 ]

Mark Miller commented on LUCENE-860:

I'm actually more confused when I see "Lucene Java" than I am by "Lucene" :) But I'll commit this soon if no one has any objections.

site should call project Lucene Java, not just Lucene

    Key: LUCENE-860
    URL: https://issues.apache.org/jira/browse/LUCENE-860
    Project: Lucene - Java
    Issue Type: Improvement
    Components: Website
    Reporter: Doug Cutting
    Priority: Minor
    Attachments: LUCENE-860.patch

To avoid confusion with the top-level Lucene project, the Lucene Java website should refer to itself as Lucene Java.
[jira] Assigned: (LUCENE-860) site should call project Lucene Java, not just Lucene
[ https://issues.apache.org/jira/browse/LUCENE-860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller reassigned LUCENE-860:
    Assignee: Mark Miller

site should call project Lucene Java, not just Lucene

    Key: LUCENE-860
    URL: https://issues.apache.org/jira/browse/LUCENE-860
    Project: Lucene - Java
    Issue Type: Improvement
    Components: Website
    Reporter: Doug Cutting
    Assignee: Mark Miller
    Priority: Minor
    Attachments: LUCENE-860.patch

To avoid confusion with the top-level Lucene project, the Lucene Java website should refer to itself as Lucene Java.
[jira] Commented: (LUCENE-874) Automatic reopen of IndexSearcher/IndexReader
[ https://issues.apache.org/jira/browse/LUCENE-874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12786686#action_12786686 ]

Mark Miller commented on LUCENE-874:

Anyone interested in this issue? I think the new ref stuff actually makes this rather easy now...

Automatic reopen of IndexSearcher/IndexReader

    Key: LUCENE-874
    URL: https://issues.apache.org/jira/browse/LUCENE-874
    Project: Lucene - Java
    Issue Type: Improvement
    Components: Search
    Reporter: João Fonseca
    Priority: Minor

To improve performance, a single instance of IndexSearcher should be used. However, if the index is updated, it's hard to close and reopen it, because multiple threads may be accessing it at the same time. Lucene should include an out-of-the-box solution to this problem: either a new class should be implemented to manage this behaviour (a singleton IndexSearcher, plus detection of a modified index, plus safely closing and reopening the IndexSearcher), or this could be handled behind the scenes by the IndexSearcher class.
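Mark's "new ref stuff" refers to reference counting. A minimal self-contained sketch of the idea (the Searcher type here is a hypothetical stand-in, not Lucene's IndexSearcher API): a manager holds one current searcher, query threads acquire/release it, and when the index changes a fresh searcher is swapped in while the old one stays open until its last borrower releases it.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class RefCountedSearcherDemo {
    // Hypothetical stand-in for an IndexSearcher over some index version.
    static class Searcher {
        final int version;
        final AtomicInteger refCount = new AtomicInteger(1); // the manager's own ref
        volatile boolean closed;
        Searcher(int version) { this.version = version; }
        void incRef() { refCount.incrementAndGet(); }
        void decRef() { if (refCount.decrementAndGet() == 0) closed = true; }
    }

    private Searcher current = new Searcher(1);

    // Query threads borrow the current searcher; while borrowed it cannot
    // be closed out from under them.
    synchronized Searcher acquire() {
        current.incRef();
        return current;
    }

    void release(Searcher s) { s.decRef(); }

    // Called when the index changed: swap in a new searcher and drop the
    // manager's ref to the old one (it closes once the last borrower releases).
    synchronized void maybeReopen(int newVersion) {
        if (newVersion != current.version) {
            Searcher old = current;
            current = new Searcher(newVersion);
            old.decRef();
        }
    }

    public static void main(String[] args) {
        RefCountedSearcherDemo mgr = new RefCountedSearcherDemo();
        Searcher s = mgr.acquire();   // a query thread borrows v1
        mgr.maybeReopen(2);           // index changed; v2 swapped in
        System.out.println(s.closed); // false: the borrower still holds v1
        mgr.release(s);
        System.out.println(s.closed); // true: last ref gone, v1 closed
    }
}
```

This is essentially the pattern that later landed in Lucene as SearcherManager: the synchronization is confined to acquire/swap, and in-flight searches never see their reader closed.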
[jira] Commented: (LUCENE-902) Check on PositionIncrement with StopFilter.
[ https://issues.apache.org/jira/browse/LUCENE-902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12786687#action_12786687 ]

Mark Miller commented on LUCENE-902:

This patch is severely out of date - could we get an update?

Check on PositionIncrement with StopFilter.

    Key: LUCENE-902
    URL: https://issues.apache.org/jira/browse/LUCENE-902
    Project: Lucene - Java
    Issue Type: Bug
    Components: Analysis
    Affects Versions: 2.2
    Reporter: Toru Matsuzawa
    Attachments: stopfilter.patch, stopfilter20070604.patch, stopfilter20070605.patch, stopfilter20070608.patch

The PositionIncrement set by a Tokenizer is not taken into account by StopFilter. When a stop token's PositionIncrement is 1, it is deleted by StopFilter; however, when the token that follows it has a PositionIncrement of 0, that token is not deleted. I think it needs to be deleted as well, because a token with PositionIncrement 0 occupies the same position as the deleted token.
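The behavior Toru describes can be sketched with a toy filter (self-contained; the Token record here is a stand-in, not Lucene's TokenStream API): when a stop word is removed, any immediately following tokens with positionIncrement 0 (i.e., stacked at the same position, such as synonyms) are removed too, and the skipped increment is carried onto the next surviving token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class StopFilterPosIncDemo {
    record Token(String term, int posInc) {}

    // Remove stop words; also remove posInc-0 tokens stacked on a removed one,
    // and fold the removed increments into the next surviving token.
    static List<Token> stopFilter(List<Token> in, Set<String> stops) {
        List<Token> out = new ArrayList<>();
        boolean lastRemoved = false;
        int pendingInc = 0;
        for (Token t : in) {
            boolean stacked = t.posInc() == 0;
            if (stops.contains(t.term()) || (stacked && lastRemoved)) {
                lastRemoved = true;
                if (!stacked) pendingInc += t.posInc(); // keep positions honest
            } else {
                out.add(new Token(t.term(), t.posInc() + pendingInc));
                pendingInc = 0;
                lastRemoved = false;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // "the" is a stop word; "thee" is a synonym stacked at the same position.
        List<Token> tokens = List.of(
                new Token("the", 1), new Token("thee", 0), new Token("cat", 1));
        System.out.println(stopFilter(tokens, Set.of("the")));
    }
}
```

Without the stacked-token check, "thee" would survive at a position whose anchor token was already deleted, which is exactly the inconsistency the issue reports.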
[jira] Commented: (LUCENE-644) Contrib: another highlighter approach
[ https://issues.apache.org/jira/browse/LUCENE-644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12786688#action_12786688 ]

Mark Miller commented on LUCENE-644:

I think it's time to close this issue - further work here should probably be applied to the FastVectorHighlighter (which is very similar and now in contrib).

Contrib: another highlighter approach

    Key: LUCENE-644
    URL: https://issues.apache.org/jira/browse/LUCENE-644
    Project: Lucene - Java
    Issue Type: Improvement
    Components: Other
    Reporter: Ronnie Kolehmainen
    Priority: Minor
    Attachments: FulltextHighlighter.java, FulltextHighlighter.java, FulltextHighlighterTest.java, FulltextHighlighterTest.java, svn-diff.patch, svn-diff.patch, TokenSources.java, TokenSources.java.diff

Mark Harwood's highlighter package is a great contribution to Lucene, I've used it a lot! However, when you have *large* documents (fields), highlighting can be quite time consuming if you increase the number of bytes to analyze with setMaxDocBytesToAnalyze(int). The default value of 50k is often too low for indexed PDFs etcetera, which results in empty highlight strings.

This is an alternative approach using term position vectors only to build fragment info objects. Then a StringReader can read the relevant fragments and skip() between them. This is a lot faster. Also, this method uses the *entire* field for finding the best fragments, so you're always guaranteed to get a highlight snippet.

Because this method only works with fields which have term positions stored, one can check whether it works for a particular field using the following code (taken from TokenSources.java):

    TermFreqVector tfv = (TermFreqVector) reader.getTermFreqVector(docId, field);
    if (tfv != null && tfv instanceof TermPositionVector) {
        // use FulltextHighlighter
    } else {
        // use standard Highlighter
    }

Someone else might find this useful so I'm posting the code here.