[jira] Updated: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector
[ https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shai Erera updated LUCENE-1593:
-------------------------------

Attachment: LUCENE-1593.patch

Patch includes all discussed changes, and defaults TSDC and TFC to out-of-order collection. It also covers the changes to the tag. Note that currently BS and BS2 check whether they should init in next()/skipTo()/score() - I will fix that in the other issue after I separate them (i.e., not having BS2 instantiate BS), via a topScorer() or something. All tests pass.

> Optimizations to TopScoreDocCollector and TopFieldCollector
> -----------------------------------------------------------
>
> Key: LUCENE-1593
> URL: https://issues.apache.org/jira/browse/LUCENE-1593
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Search
> Reporter: Shai Erera
> Fix For: 2.9
>
> Attachments: LUCENE-1593.patch, LUCENE-1593.patch, LUCENE-1593.patch, PerfTest.java
>
> This is a spin-off of LUCENE-1575 and proposes to optimize TSDC and TFC code to remove unnecessary checks. The plan is:
> # Ensure that IndexSearcher returns segments in increasing doc Id order, instead of numDocs().
> # Change TSDC and TFC's code to not use the doc id as a tie breaker. New docs will always have larger ids and therefore cannot compete.
> # Pre-populate HitQueue with sentinel values in TSDC (score = Float.NEG_INF) and remove the check if reusableSD == null.
> # Also move to use "changing top" and then call adjustTop(), in case we update the queue.
> # Some methods in Sort explicitly add SortField.FIELD_DOC as a "tie breaker" for the last SortField. But doing so should not be necessary (since we already break ties by docID), and is in fact less efficient (once the above optimization is in).
> # Investigate PQ - can we deprecate insert() and have only insertWithOverflow()? Add an addDummyObjects method which will populate the queue without "arranging" it, just store the objects in the array (this can be used to pre-populate sentinel values)?
> I will post a patch as well as some perf measurements as soon as I have them.

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1542) NearSpansUnordered.getPayload does not always return the correct payloads when terms are located at the same position
[ https://issues.apache.org/jira/browse/LUCENE-1542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12706365#action_12706365 ]

Jonathan Mamou commented on LUCENE-1542:

I think the bug is related neither to payloads nor to the fact that terms are located at the same position. It seems to occur only for the first term of the document, if its positionIncrement is equal to 0. In this case, the position of the first term will be wrong: -1 if there is no payload, and 2147483647 if there is a payload.

> NearSpansUnordered.getPayload does not always return the correct payloads
> when terms are located at the same position
> --------------------------------------------------------------------------
>
> Key: LUCENE-1542
> URL: https://issues.apache.org/jira/browse/LUCENE-1542
> Project: Lucene - Java
> Issue Type: Bug
> Affects Versions: 2.4
> Reporter: Mark Miller
> Priority: Minor
>
> More info in LUCENE-1465
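A toy illustration of how the first term's position can end up at -1: positions accumulate from a starting value of -1 by adding each token's positionIncrement, so a first increment of 0 never moves past -1. This is a sketch of position accumulation in general (PositionSketch is a made-up class), not Lucene's actual TokenStream or span code:

```java
// Sketch (not Lucene code) of how term positions accumulate: positions
// start at -1, and each token adds its positionIncrement. If the very
// first token has positionIncrement == 0, its position stays -1.
public class PositionSketch {
    static int[] positions(int[] increments) {
        int[] out = new int[increments.length];
        int position = -1;                 // conventional starting value
        for (int i = 0; i < increments.length; i++) {
            position += increments[i];     // add this token's increment
            out[i] = position;
        }
        return out;
    }

    public static void main(String[] args) {
        // normal case: increments of 1 give positions 0, 1, 2
        System.out.println(java.util.Arrays.toString(positions(new int[]{1, 1, 1})));
        // the reported case: a first increment of 0 leaves the term at -1
        System.out.println(java.util.Arrays.toString(positions(new int[]{0, 1})));
    }
}
```

The 2147483647 (Integer.MAX_VALUE) value in the payload case would come from a different code path, but the -1 falls directly out of this accumulation.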
[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector
[ https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12706373#action_12706373 ]

Michael McCandless commented on LUCENE-1593:

Patch looks good! I can confirm that all tests pass. (Though the back-compat tag has moved forward -- I just carried that part of the patch forward.) Thanks Shai. Some small comments:

* No need to call out the API change to HitQueue in CHANGES.txt (it's package private).
* Can't the out-of-order TFC classes do the "pre-populate pqueue w/ sentinel" optimization?
* There are some leftover javadoc references to Query#scoresDocsInOrder (in at least TSDC).
* I don't think we should do this yet, but... it'd be conceivably possible at indexing time to record whether a given field ever uses the sentinel value for that field's type. If it doesn't, we can also safely pre-fill the queue w/ sentinels even for in-order scoring. (There are many barriers to doing this optimization today in Lucene.)

> Optimizations to TopScoreDocCollector and TopFieldCollector
> -----------------------------------------------------------
>
> Key: LUCENE-1593
> URL: https://issues.apache.org/jira/browse/LUCENE-1593
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Search
> Reporter: Shai Erera
> Fix For: 2.9
>
> Attachments: LUCENE-1593.patch, LUCENE-1593.patch, LUCENE-1593.patch, PerfTest.java
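The "pre-populate with sentinels" idea discussed here can be sketched outside Lucene. SentinelHeap below is a made-up class, not Lucene's HitQueue: filling a fixed-size min-heap with -Infinity up front means the hot collection loop never branches on "is the queue full yet?" or on a null reusable entry; it only compares against the current worst (top) score.

```java
// Illustrative sketch of the sentinel optimization: a fixed-size binary
// min-heap of scores, pre-filled with -Infinity, so insertWithOverflow
// needs exactly one comparison on the hot path.
public class SentinelHeap {
    private final float[] heap;            // binary min-heap, 0-based array

    public SentinelHeap(int size) {
        heap = new float[size];
        java.util.Arrays.fill(heap, Float.NEGATIVE_INFINITY); // sentinels
    }

    /** Hot path: no "queue full?" branch, just one compare against the top. */
    public void insertWithOverflow(float score) {
        if (score <= heap[0]) return;      // cannot compete with current worst;
                                           // ties lose, matching "new docs
                                           // (larger ids) cannot compete"
        heap[0] = score;
        downHeap();
    }

    private void downHeap() {
        int i = 0;
        while (true) {
            int l = 2 * i + 1, r = l + 1, smallest = i;
            if (l < heap.length && heap[l] < heap[smallest]) smallest = l;
            if (r < heap.length && heap[r] < heap[smallest]) smallest = r;
            if (smallest == i) break;
            float t = heap[i]; heap[i] = heap[smallest]; heap[smallest] = t;
            i = smallest;
        }
    }

    /** Ascending copy of the queue; unfilled slots remain -Infinity. */
    public float[] sortedTopScores() {
        float[] copy = heap.clone();
        java.util.Arrays.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        SentinelHeap q = new SentinelHeap(3);
        for (float s : new float[]{0.5f, 2f, 1f, 3f, 0.1f}) q.insertWithOverflow(s);
        System.out.println(java.util.Arrays.toString(q.sortedTopScores())); // [1.0, 2.0, 3.0]
    }
}
```

Note the `score <= heap[0]` test also encodes the docid tie-break from point 2 of the plan: a later doc with an equal score is rejected, because it necessarily has a larger doc id.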
[jira] Commented: (LUCENE-1313) Realtime Search
[ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12706377#action_12706377 ]

Michael McCandless commented on LUCENE-1313:

{quote}
> RAMDir changes (deletes are applied, or a new RAM segment is
> created), we must push down to DW that usage with a new synchronized
> method.

Sounds like we create a subclass of RAMDirectory with this functionality?
{quote}

I don't think that's needed. I think whenever IW makes a change to the RAMDir, which is easily tracked, it pushes to DW the new RAMDir size.

{quote}
> We don't need IW.getRamLogMergePolicy()?

Because we don't want the user customizing this?
{quote}

That, and because it's only used to determine CFS or not, which we've turned off for RAMDir.

{quote}
> We should no longer need IndexWriter.getFlushDirectory? IE, IW
> once again has a single "Directory" as seen by IFD,
> DocFieldProcessorPerThread, etc. In the NRT case, this is an FSD; in
> the non-NRT case it's the Dir that was passed in (unless, in a future
> issue, we explore using FSD, too, for better performance).

Pass in FSD in the constructor of DocumentsWriter (and others) as before?
{quote}

Right. All these places couldn't care less whether they are dealing w/ FSD or a "real" dir. They should simply use the Directory API as they previously did.

{quote}
> I still don't think we need a separate RAMMergeScheduler; I
> think CMS should simply always run such merges (ie not block on max
> thread count). IW.getNextMerge can then revert to its former
> self.

Where does the thread come from for this if we're using max threads? If we allocate one, we're over the limit and keeping it around. We'd need a more advanced threadpool that elastically grows the thread pool and kills threads that are unused over time. With Java 1.5 we can use ThreadPoolExecutor. Is a dedicated thread pool something we want to go to?
Even then we can potentially still max out a given thread pool with requests to merge one directory or the other. We'd probably still need two separate thread pools.
{quote}

The thread is simply launched w/o checking maxThreadCount, if the merge is in RAM. Right, with JDK 1.5 we can make CMS better about pooling threads. Right now it does no long-term pooling (unless another merge happens to be needed when a thread finishes its last merge).

{quote}
> MergePolicy.OneMerge.segString no longer needs to take a
> Directory (because it now stores a Directory).

Yeah, I noticed this, I'll change it. MergeSpecification.segString is public and takes a directory that is not required. What to do?
{quote}

Do the usual back-compat dance -- deprecate it and add the new one.

{quote}
> The dual directories are continuing to push deeper (when I'm
> wanting to do the reverse). EG, MergeScheduler.getDestinationDirs
> should not be needed?

If we remove getFlushDirectory, are you saying getDirectory should return the FSD if RAM NRT is turned on? This seems counterintuitive in that we still need a clear separation of the two directories? The user would expect the directory they passed into the ctor to be returned?
{quote}

I agree, we should leave getDirectory() as is (returns whatever Dir was passed in). We can keep getFlushDirectory, but it should not have duality inside it -- it should simply return the FSD (in the NRT case) or the normal dir. I don't really like the name getFlushDirectory... but can't think of a better one yet. Then, nothing outside of IW should ever know there are two directories at play. They all simply deal with the one and only Directory that IW hands out.

On the "when to flush to RAM" question... I agree it's tricky. This logic belongs in the RAMMergePolicy. That policy needs to be empowered to decide if a new flush goes to RAM or disk, to decide when to merge all RAM segments to a new disk segment, to be able to check if IW is in NRT mode, etc.
Probably the RAM merge policy also needs control over how much of the RAM buffer it's going to give to DW, too. At first the policy should not change the non-NRT case (ie one always flushes straight to disk). We can play w/ that in a separate issue. Need to think more about the logic...

> Realtime Search
> ---------------
>
> Key: LUCENE-1313
> URL: https://issues.apache.org/jira/browse/LUCENE-1313
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Index
> Affects Versions: 2.4.1
> Reporter: Jason Rutherglen
> Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1313.jar, LUCENE-1313.patch, LUCENE-1313.patch,
> LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch,
> LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, lucene-1313.patch,
> lucene-1313.patch, lucene-1313.patch, lucene-1313.patch
>
> Realtime search with transactional semantics.
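The "elastic" pool discussed above - grow on demand, reap threads that sit idle - is exactly what JDK 1.5's ThreadPoolExecutor provides when given a core size of 0 and a SynchronousQueue handoff. A self-contained sketch, not CMS code (the class and method names here are made up):

```java
// Sketch of an elastic merge-thread pool using java.util.concurrent
// (available since Java 1.5): core size 0 means no threads are kept
// when idle; each thread dies after 30s without work.
import java.util.concurrent.Callable;
import java.util.concurrent.Future;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ElasticPoolSketch {
    static String runOnPool() {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
            0, 4,                          // grow from 0 up to 4 threads
            30, TimeUnit.SECONDS,          // idle threads die after 30s
            new SynchronousQueue<Runnable>());
        try {
            Future<String> f = pool.submit(new Callable<String>() {
                public String call() { return "merge done"; }
            });
            return f.get();                // wait for the "merge" to finish
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(runOnPool());   // prints "merge done"
    }
}
```

With a SynchronousQueue there is no task buffering: a submit either hands off to an idle thread or spawns a new one up to the maximum, which matches "launch a thread when a merge arrives" rather than queueing merges behind a fixed pool.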
[jira] Resolved: (LUCENE-1621) deprecate term and getTerm in MultiTermQuery
[ https://issues.apache.org/jira/browse/LUCENE-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller resolved LUCENE-1621.
---------------------------------

Resolution: Fixed

> deprecate term and getTerm in MultiTermQuery
> --------------------------------------------
>
> Key: LUCENE-1621
> URL: https://issues.apache.org/jira/browse/LUCENE-1621
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Search
> Reporter: Mark Miller
> Assignee: Mark Miller
> Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1621.patch, LUCENE-1621.patch
>
> This means moving getTerm and term up to sub classes as appropriate and
> reimplementing equals, hashcode as appropriate in sub classes.
[jira] Updated: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector
[ https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shai Erera updated LUCENE-1593:
-------------------------------

Attachment: LUCENE-1593.patch

* Removed the leftover references to Query#scoresDocsInOrder.
* Removed the text from CHANGES.
* I don't think we should do anything in TFC for now. It will only save one 'if' and adding sentinel values is not so trivial. Maybe leave it for a specializer code?

> Optimizations to TopScoreDocCollector and TopFieldCollector
> -----------------------------------------------------------
>
> Key: LUCENE-1593
> URL: https://issues.apache.org/jira/browse/LUCENE-1593
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Search
> Reporter: Shai Erera
> Fix For: 2.9
>
> Attachments: LUCENE-1593.patch, LUCENE-1593.patch, LUCENE-1593.patch, LUCENE-1593.patch, PerfTest.java
[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector
[ https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12706447#action_12706447 ]

Michael McCandless commented on LUCENE-1593:

bq. I don't think we should do anything in TFC for now. It will only save one 'if' and adding sentinel values is not so trivial. Maybe leave it for a specializer code?

OK I agree, let's not do this one for now.

New patch looks good -- I'll review it some more and then wait a few days and commit. Thanks Shai!

> Optimizations to TopScoreDocCollector and TopFieldCollector
> -----------------------------------------------------------
>
> Key: LUCENE-1593
> URL: https://issues.apache.org/jira/browse/LUCENE-1593
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Search
> Reporter: Shai Erera
> Fix For: 2.9
>
> Attachments: LUCENE-1593.patch, LUCENE-1593.patch, LUCENE-1593.patch, LUCENE-1593.patch, PerfTest.java
[jira] Updated: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector
[ https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1593:
---------------------------------------

Attachment: LUCENE-1593.patch

OK I made some tiny fixes:

* Added CHANGES entry explaining that Sort no longer tacks on SortField.FIELD_DOC since that tie break is already handled internally.
* MultiSearcher.search was creating too big an array of ScoreDocs, and was incorrectly (because sentinels were not used) avoiding HitQueue.size().
* Renamed IndexSearcher.sortedStarts -> docStarts (they are no longer sorted).
* Made BS2.initCountingSumScorer private again.
* Small whitespace fixes.

I think it's ready to commit! I'll wait a day or two.

> Optimizations to TopScoreDocCollector and TopFieldCollector
> -----------------------------------------------------------
>
> Key: LUCENE-1593
> URL: https://issues.apache.org/jira/browse/LUCENE-1593
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Search
> Reporter: Shai Erera
> Fix For: 2.9
>
> Attachments: LUCENE-1593.patch, LUCENE-1593.patch, LUCENE-1593.patch, LUCENE-1593.patch, LUCENE-1593.patch, PerfTest.java
[jira] Commented: (LUCENE-1313) Realtime Search
[ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12706557#action_12706557 ]

Jason Rutherglen commented on LUCENE-1313:
------------------------------------------

{quote}I don't think that's needed. I think whenever IW makes a change to the RAMDir, which is easily tracked, it pushes to DW the new RAMDir size.{quote}

Because we know the IW.ramdir is a RAMDirectory implementation, we can use sizeInBytes? It's synchronized, maybe we want a different method that's not? It seems like keeping track of all file writes outside the ramdir is going to be difficult? For example, when we do deletes via SegmentReader, how would we keep track of that?

{quote}That, and because it's only used to determine CFS or not, which we've turned off for RAMDir.{quote}

So we let the user set the RAMMergePolicy but not get it?

{quote}The thread is simply launched w/o checking maxThreadCount, if the merge is in RAM.{quote}

Hmm... We can't just create threads and let them be garbage collected, as JVMs tend to throw OOMs with this. If we go down this route of a single CMS, maybe we can borrow some code from an Apache project that's implemented a threadpool.

> Realtime Search
> ---------------
>
> Key: LUCENE-1313
> URL: https://issues.apache.org/jira/browse/LUCENE-1313
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Index
> Affects Versions: 2.4.1
> Reporter: Jason Rutherglen
> Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1313.jar, LUCENE-1313.patch, LUCENE-1313.patch,
> LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch,
> LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, lucene-1313.patch,
> lucene-1313.patch, lucene-1313.patch, lucene-1313.patch
>
> Realtime search with transactional semantics.
> Possible future directions:
> * Optimistic concurrency
> * Replication
> Encoding each transaction into a set of bytes by writing to a RAMDirectory enables replication. It is difficult to replicate using other methods because while the document may easily be serialized, the analyzer cannot.
> I think this issue can hold realtime benchmarks which include indexing and searching concurrently.
[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector
[ https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12706568#action_12706568 ]

Shai Erera commented on LUCENE-1593:

bq. MultiSearcher.search was creating too big an array of ScoreDocs, and was incorrectly (because sentinels were not used) avoiding HitQueue.size().

Oh right ... I forgot to roll that back, since HitQueue is initialized in those cases to not pre-populate with sentinel values.

> Optimizations to TopScoreDocCollector and TopFieldCollector
> -----------------------------------------------------------
>
> Key: LUCENE-1593
> URL: https://issues.apache.org/jira/browse/LUCENE-1593
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Search
> Reporter: Shai Erera
> Fix For: 2.9
>
> Attachments: LUCENE-1593.patch, LUCENE-1593.patch, LUCENE-1593.patch, LUCENE-1593.patch, LUCENE-1593.patch, PerfTest.java
[jira] Created: (LUCENE-1630) Mating Collector and Scorer on doc Id orderness
Mating Collector and Scorer on doc Id orderness
-----------------------------------------------

Key: LUCENE-1630
URL: https://issues.apache.org/jira/browse/LUCENE-1630
Project: Lucene - Java
Issue Type: Improvement
Components: Search
Reporter: Shai Erera
Fix For: 2.9

This is a spin-off of LUCENE-1593. This issue proposes to expose appropriate API on Scorer and Collector such that one can create an optimized Collector based on a given Scorer's doc-id orderness and vice versa. Copied from LUCENE-1593, here is the list of changes:

# Deprecate Weight and create QueryWeight (abstract class) with a new scorer(reader, scoreDocsInOrder) method, replacing the current scorer(reader) method. QueryWeight implements Weight; scorer(reader) calls scorer(reader, false /* out-of-order */), and scorer(reader, scoreDocsInOrder) is defined abstract.
#* Also add QueryWeightWrapper to wrap a given Weight implementation. This one will also be deprecated, as well as package-private.
#* Add to Query variants of createWeight and weight which return QueryWeight. For now, I prefer to add a default impl which wraps the Weight variant instead of overriding in all Query extensions, and in 3.0, when we remove the Weight variants, override in all extending classes.
# Add to Scorer isOutOfOrder with a default of false, and override in BS to true.
# Modify BooleanWeight to extend QueryWeight and implement the new scorer method to return BS2 or BS based on the number of required scorers and setAllowOutOfOrder.
# Add to Collector an abstract _acceptsDocsOutOfOrder_ which returns true/false.
#* Use it in the IndexSearcher.search methods that accept a Collector, in order to create the appropriate Scorer, using the new QueryWeight.
#* Provide a static create method to TFC and TSDC which accepts this as an argument and creates the proper instance.
#* Wherever we create a Collector (TSDC or TFC), always ask for an out-of-order Scorer and check isOutOfOrder() on the resulting Scorer, so that we can create the optimized Collector instance.
# Modify IndexSearcher to use all of the above logic.

The only class I'm worried about, and would like to verify with you, is Searchable. If we want to deprecate all the search methods on IndexSearcher, Searcher and Searchable which accept Weight, and add new ones which accept QueryWeight, we must do the following:

* Deprecate Searchable in favor of Searcher.
* Add to Searcher the new QueryWeight variants. Here we have two choices: (1) break back-compat and add them as abstract (like we've done with the new Collector method) or (2) add them with a default impl that calls the Weight versions, documenting that these will become abstract in 3.0.
* Have Searcher extend UnicastRemoteObject and have RemoteSearchable extend Searcher. That's the part I'm a little bit worried about - Searchable implements java.rmi.Remote, which means there could be an implementation out there which implements Searchable and extends something different than UnicastRemoteObject, like Activeable. I think there is a very small chance this has actually happened, but I would like to confirm with you guys first.
* Add a deprecated, package-private SearchableWrapper which extends Searcher and delegates all calls to the Searchable member.
* Deprecate all uses of Searchable and add Searcher instead, defaulting the old ones to use SearchableWrapper.
* Make all the necessary changes to IndexSearcher, MultiSearcher etc. regarding overriding these new methods.

One other optimization that was discussed in LUCENE-1593 is to expose a topScorer() API (on Weight) which returns a Scorer whose score(Collector) method will be called, and additionally add a start() method to DISI. That will allow Scorers to initialize either in start() or in score(Collector). This was proposed mainly because of BS and BS2, which check whether they are initialized in every call to next(), skipTo() and score(). Personally I prefer to see that in a separate issue, following this one (as it might add methods to QueryWeight).
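The Collector/Scorer negotiation proposed in the issue can be sketched with simplified stand-in interfaces. These are hypothetical types, not the final Lucene 2.9 API; they only show the shape of the handshake: the searcher asks the Collector whether it tolerates out-of-order docs and requests a matching Scorer from the weight.

```java
// Minimal sketch of matching a Scorer's orderness to a Collector's needs.
// All interfaces here are simplified stand-ins for illustration only.
public class OrdernessSketch {
    interface Collector { boolean acceptsDocsOutOfOrder(); }
    interface Scorer { boolean isOutOfOrder(); }
    interface QueryWeight { Scorer scorer(boolean scoreDocsInOrder); }

    /** A weight that hands back a cheaper out-of-order scorer when allowed. */
    static class BooleanishWeight implements QueryWeight {
        public Scorer scorer(final boolean scoreDocsInOrder) {
            return new Scorer() {
                public boolean isOutOfOrder() { return !scoreDocsInOrder; }
            };
        }
    }

    /** Searcher-side logic: demand in-order only if the collector needs it. */
    static Scorer scorerFor(QueryWeight w, Collector c) {
        boolean inOrder = !c.acceptsDocsOutOfOrder();
        return w.scorer(inOrder);
    }

    public static void main(String[] args) {
        QueryWeight w = new BooleanishWeight();
        Collector tolerant = new Collector() {
            public boolean acceptsDocsOutOfOrder() { return true; }
        };
        Collector strict = new Collector() {
            public boolean acceptsDocsOutOfOrder() { return false; }
        };
        System.out.println(scorerFor(w, tolerant).isOutOfOrder()); // true
        System.out.println(scorerFor(w, strict).isOutOfOrder());   // false
    }
}
```

In the real proposal the decision also flows the other way: TSDC/TFC factory methods inspect the resulting scorer's orderness to pick the optimized collector variant.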
QueryWeight (Was... Re: [jira] Created: (LUCENE-1630) Mating Collector and Scorer on doc Id orderness)
> * Add to Searcher the new QueryWeight variants.

If you make QueryWeight a subclass of Query, do you need any new methods?

Before Weight existed, only Query and Scorer existed. Compiling a Scorer involved "weighting the query", by factoring in IDF etc., then calling query.Scorer(). To make Query objects reusable, Weight was introduced as an intermediate stage. Making QueryWeight a subclass of Query would be entirely within the spirit of the original design, since the role played by Weight was originally performed by a Query.

Marvin Humphrey
[jira] Commented: (LUCENE-1260) Norm codec strategy in Similarity
[ https://issues.apache.org/jira/browse/LUCENE-1260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12706604#action_12706604 ]

Johan Kindgren commented on LUCENE-1260:

Wouldn't the simplest solution be to refactor out the static methods, replace them with instance methods, and remove the getNormDecoder method? This would enable pluggable behavior without introducing a new Codec. It would cause minor changes to 11 classes in the core, and would also clean up the code from static stuff. As described in LUCENE-1261.

> Norm codec strategy in Similarity
> ---------------------------------
>
> Key: LUCENE-1260
> URL: https://issues.apache.org/jira/browse/LUCENE-1260
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Search
> Affects Versions: 2.3.1
> Reporter: Karl Wettin
> Attachments: LUCENE-1260.txt, LUCENE-1260.txt, LUCENE-1260.txt
>
> The static span and resolution of the 8 bit norms codec might not fit with all applications.
> My use case requires that 100f-250f is discretized in 60 bags instead of the default.. 10?
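The instance-based codec idea can be sketched as a linear quantizer covering the reporter's 100f-250f, 60-bin use case. LinearNormCodec and its method names are hypothetical, not Lucene's Similarity API; a real norm codec must still fit values into a single byte, as this one does:

```java
// Sketch of a pluggable norm codec (hypothetical names): linearly
// quantizes a float range onto byte codes, e.g. 100f-250f into 60 bins.
public class LinearNormCodec {
    private final float min, max;
    private final int bins;                // must be <= 256 to fit a byte

    public LinearNormCodec(float min, float max, int bins) {
        this.min = min; this.max = max; this.bins = bins;
    }

    public byte encode(float norm) {
        float clamped = Math.max(min, Math.min(max, norm));   // clip to range
        int code = Math.round((clamped - min) / (max - min) * (bins - 1));
        return (byte) code;
    }

    public float decode(byte code) {
        return min + (code & 0xFF) * (max - min) / (bins - 1);
    }

    public static void main(String[] args) {
        LinearNormCodec codec = new LinearNormCodec(100f, 250f, 60);
        System.out.println(codec.encode(100f));               // 0
        System.out.println(codec.encode(250f));               // 59
        System.out.println(codec.decode(codec.encode(175f))); // close to 175
    }
}
```

Making encode/decode instance methods, as the comment suggests, is what allows swapping such a codec in per-Similarity instead of relying on a static decoder table.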
[jira] Commented: (LUCENE-1313) Realtime Search
[ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12706611#action_12706611 ]

Michael McCandless commented on LUCENE-1313:

{quote}
> I don't think that's needed. I think whenever IW makes a
> change to the RAMDir, which is easily tracked, it pushes to DW
> the new RAMDir size.

Because we know the IW.ramdir is a RAMDirectory implementation, we can use sizeInBytes? It's synchronized, maybe we want a different method that's not? It seems like keeping track of all file writes outside the ramdir is going to be difficult? For example when we do deletes via SegmentReader how would we keep track of that?
{quote}

We should definitely just use the sizeInBytes() method. I'm saying that IW knows when it writes new files to the RAMDir (flushing deletes, flushing a new segment), and it's only at those times that it should call sizeInBytes() and push that value down to DW.

{quote}
> That, and because it's only used to determine CFS or not,
> which we've turned off for RAMDir.

So we let the user set the RAMMergePolicy but not get it?
{quote}

Oh, we should add a getter (getRAMMergePolicy, not getLogMergePolicy) for it, but it should return MergePolicy, not LogMergePolicy.

{quote}
> The thread is simply launched w/o checking
> maxThreadCount, if the merge is in RAM.

Hmm... We can't just create threads and let them be garbage collected, as JVMs tend to throw OOMs with this. If we go down this route of a single CMS, maybe we can borrow some code from an Apache project that's implemented a threadpool.
{quote}

This is how CMS has always been. It launches threads relatively rarely -- this shouldn't lead to OOMs. One can always subclass CMS if this is somehow a problem. Or we could modify CMS to pool its threads (as a new issue)?
> Realtime Search > --- > > Key: LUCENE-1313 > URL: https://issues.apache.org/jira/browse/LUCENE-1313 > Project: Lucene - Java > Issue Type: New Feature > Components: Index >Affects Versions: 2.4.1 >Reporter: Jason Rutherglen >Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1313.jar, LUCENE-1313.patch, LUCENE-1313.patch, > LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, > LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, lucene-1313.patch, > lucene-1313.patch, lucene-1313.patch, lucene-1313.patch > > > Realtime search with transactional semantics. > Possible future directions: > * Optimistic concurrency > * Replication > Encoding each transaction into a set of bytes by writing to a RAMDirectory > enables replication. It is difficult to replicate using other methods > because while the document may easily be serialized, the analyzer cannot. > I think this issue can hold realtime benchmarks which include indexing and > searching concurrently.
[jira] Commented: (LUCENE-1313) Realtime Search
[ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12706662#action_12706662 ] Jason Rutherglen commented on LUCENE-1313: -- In the patch, the merge policies are split up, which requires some of the RAM NRT logic to live in updatePendingMerges. One solution is an overarching merge policy that manages both merging to RAM and merging to disk, wrapping the primary MP and the RAM MP. This would push the logic of RAM merging and primary-dir merging into the meta merge policy, freeing IW from managing RAM segments vs. primary segments. Do IW.optimize and IW.expungeDeletes operate on the ramdir as well? (The expungeDeletes javadoc implies that calling IR.numDeletedDocs will return zero when there are no deletes.)
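The meta merge policy idea can be sketched as a delegating policy. Everything here is hypothetical (class names and the findMerge shape are not Lucene's actual MergePolicy API, and the size-threshold policies are toy stand-ins); the point is that the RAM-vs-primary routing lives in one place instead of in IndexWriter:

```java
// Sketch of an "overarching" merge policy that splits the segment set by
// directory and delegates to a RAM policy or the primary (disk) policy.
import java.util.ArrayList;
import java.util.List;

class MetaMergePolicySketch {
    /** A segment lives either in the RAMDir or in the primary Directory. */
    static class Seg {
        final String name; final boolean inRam;
        Seg(String name, boolean inRam) { this.name = name; this.inRam = inRam; }
    }

    /** Toy per-directory policy: merge whenever >= mergeFactor segments. */
    static class SimplePolicy {
        final int mergeFactor;
        SimplePolicy(int mergeFactor) { this.mergeFactor = mergeFactor; }
        List<Seg> findMerge(List<Seg> segs) {
            return segs.size() >= mergeFactor ? new ArrayList<>(segs) : null;
        }
    }

    final SimplePolicy ramPolicy = new SimplePolicy(2);      // merge RAM segs eagerly
    final SimplePolicy primaryPolicy = new SimplePolicy(10); // e.g. a LogMergePolicy

    /** The meta policy routes by directory; IW never sees the split. */
    List<Seg> findMerge(List<Seg> all) {
        List<Seg> ram = new ArrayList<>(), primary = new ArrayList<>();
        for (Seg s : all) (s.inRam ? ram : primary).add(s);
        List<Seg> merge = ramPolicy.findMerge(ram);  // RAM merges take priority
        return merge != null ? merge : primaryPolicy.findMerge(primary);
    }

    public static void main(String[] args) {
        MetaMergePolicySketch mp = new MetaMergePolicySketch();
        List<Seg> segs = new ArrayList<>();
        segs.add(new Seg("_ram0", true));
        segs.add(new Seg("_ram1", true));
        segs.add(new Seg("_0", false));
        System.out.println(mp.findMerge(segs).size()); // prints 2 (both RAM segs)
    }
}
```

With this shape, updatePendingMerges would call only the meta policy, and questions like whether optimize/expungeDeletes span the ramdir become routing decisions inside it.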
[jira] Updated: (LUCENE-1629) contrib intelligent Analyzer for Chinese
[ https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiaoping Gao updated LUCENE-1629: - Attachment: analysis-data.zip Lexical dictionary files; unzip them somewhere, then run TestSmartChineseAnalyzer with this command (the -D system property must come before the class name): java -Danalysis.data.dir=/path/to/analysis-data/ org.apache.lucene.analysis.cn.TestSmartChineseAnalyzer > contrib intelligent Analyzer for Chinese > > > Key: LUCENE-1629 > URL: https://issues.apache.org/jira/browse/LUCENE-1629 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/analyzers >Affects Versions: 2.4.1 > Environment: for java 1.5 or higher, lucene 2.4.1 >Reporter: Xiaoping Gao > Attachments: analysis-data.zip, LUCENE-1629.patch > > > I wrote an Analyzer for Apache Lucene that analyzes sentences in the Chinese > language. It's called "imdict-chinese-analyzer"; the project on Google Code > is here: http://code.google.com/p/imdict-chinese-analyzer/ > In Chinese, "我是中国人" (I am Chinese) should be tokenized as "我" (I) "是" (am) > "中国人" (Chinese), not "我" "是中" "国人". So the analyzer must handle each sentence > properly, or there will be misunderstandings everywhere in the index > constructed by Lucene, and the accuracy of the search engine will be > seriously affected! > Although there are two analyzer packages in the Apache repository that can > handle Chinese, ChineseAnalyzer and CJKAnalyzer, they take each character or > every two adjoining characters as a single word. This is obviously not true > in reality; this strategy also increases the index size and hurts > performance badly. > The algorithm of imdict-chinese-analyzer is based on the Hidden Markov Model > (HMM), so it can tokenize Chinese sentences in a really intelligent way. > Tokenization accuracy of this model is above 90% according to the paper > "HHMM-based Chinese Lexical analyzer ICTCLAL", while other analyzers' is about > 60%. > As imdict-chinese-analyzer is really fast and intelligent, I want to > contribute it to the Apache Lucene repository.
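The HHMM tokenizer itself is too large to sketch here, but a toy dictionary-driven forward-maximum-matching segmenter (a deliberately simpler stand-in for the analyzer's HMM algorithm, with a hypothetical five-word dictionary) reproduces the word-level vs. character-level distinction in the example above:

```java
// Toy forward-maximum-matching segmenter: at each position, greedily take
// the longest dictionary word. This illustrates why "我是中国人" should
// segment as 我 / 是 / 中国人, not into single or paired characters; it is
// NOT the imdict-chinese-analyzer HHMM algorithm.
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class MaxMatchSketch {
    // Hypothetical mini-dictionary for the example sentence.
    static final Set<String> DICT = Set.of("我", "是", "中国", "中国人", "国人");

    static List<String> segment(String s) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            int end = s.length();
            // Shrink the candidate until it is a dictionary word, or a
            // single character (the fallback for out-of-vocabulary chars).
            while (end > i + 1 && !DICT.contains(s.substring(i, end))) end--;
            out.add(s.substring(i, end));
            i = end;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(segment("我是中国人")); // prints [我, 是, 中国人]
    }
}
```

The HMM approach goes further than this greedy sketch: it scores alternative segmentations probabilistically, which is where the ~90% accuracy cited above comes from.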
[jira] Updated: (LUCENE-1629) contrib intelligent Analyzer for Chinese
[ https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiaoping Gao updated LUCENE-1629: - Attachment: LUCENE-1629.patch Here is all the source code of the intelligent analyzer for Chinese, about 2500 lines. The unit TestCase contains a main method, which needs the lexical dictionary to run, so I will post the binary lexical dictionary soon. > contrib intelligent Analyzer for Chinese > > > Key: LUCENE-1629 > URL: https://issues.apache.org/jira/browse/LUCENE-1629 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/analyzers >Affects Versions: 2.4.1 > Environment: for java 1.5 or higher, lucene 2.4.1 >Reporter: Xiaoping Gao > Attachments: LUCENE-1629.patch