[jira] Updated: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector

2009-05-06 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-1593:
---

Attachment: LUCENE-1593.patch

The patch includes all discussed changes, and defaults TSDC and TFC to 
out-of-order collection. It also covers the changes to the back-compat tag.
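
To make the new out-of-order default concrete, here is a minimal usage
sketch. It assumes the create(numHits, docsScoredInOrder) factory shape this
work moves toward - treat the exact signature as tentative, not as the
committed API:

{code}
// Passing false tells the collector the Scorer may deliver docs out of
// order, so it uses the tie-break-aware implementation; passing true lets
// it skip the doc-id tie break entirely.
TopScoreDocCollector tsdc = TopScoreDocCollector.create(10, false);
searcher.search(query, tsdc);
ScoreDoc[] hits = tsdc.topDocs().scoreDocs;
{code}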

Note that currently BS and BS2 check whether they should initialize in 
next()/skipTo()/score() - I will fix that in the other issue, after I separate 
the two (i.e., no longer having BS2 instantiate BS), via a topScorer() method 
or something similar.

All tests pass.

> Optimizations to TopScoreDocCollector and TopFieldCollector
> ---
>
> Key: LUCENE-1593
> URL: https://issues.apache.org/jira/browse/LUCENE-1593
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Shai Erera
> Fix For: 2.9
>
> Attachments: LUCENE-1593.patch, LUCENE-1593.patch, LUCENE-1593.patch, 
> PerfTest.java
>
>
> This is a spin-off of LUCENE-1575 and proposes to optimize TSDC and TFC code 
> to remove unnecessary checks. The plan is:
> # Ensure that IndexSearcher returns segments in increasing doc Id order, 
> instead of ordering them by numDocs().
> # Change TSDC and TFC's code to not use the doc id as a tie breaker. New docs 
> will always have larger ids and therefore cannot compete.
> # Pre-populate HitQueue with sentinel values in TSDC (score = 
> Float.NEGATIVE_INFINITY) and remove the reusableSD == null check (see the 
> sketch below).
> # Also move to using "changing top" and then calling adjustTop(), in case we 
> update the queue.
> # Some methods in Sort explicitly add SortField.FIELD_DOC as a "tie breaker" 
> for the last SortField. But doing so should not be necessary (since we 
> already break ties by docID), and it is in fact less efficient (once the 
> above optimization is in).
> # Investigate PQ - can we deprecate insert() and have only 
> insertWithOverflow()? Add an addDummyObjects method which will populate the 
> queue without "arranging" it, just storing the objects in the array (this 
> can be used to pre-populate sentinel values)?
> I will post a patch as well as some perf measurements as soon as I have them.
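
A minimal sketch of items 3 and 4 above, assuming a plain PriorityQueue of
ScoreDoc (HitQueue itself is package-private, so the names here are
illustrative only):

{code}
// Pre-fill with sentinels: a score of -infinity loses to every real hit,
// so collect() needs neither the reusableSD == null check nor a size check.
for (int i = 0; i < numHits; i++) {
  hq.put(new ScoreDoc(Integer.MAX_VALUE, Float.NEGATIVE_INFINITY));
}

public void collect(int doc) throws IOException {
  float score = scorer.score();
  ScoreDoc top = (ScoreDoc) hq.top();  // never null, thanks to the sentinels
  if (score > top.score) {             // ids only grow, so ties never compete
    top.doc = docBase + doc;           // change the top in place ...
    top.score = score;
    hq.adjustTop();                    // ... then restore heap order (item 4)
  }
}
{code}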




[jira] Commented: (LUCENE-1542) NearSpansUnordered.getPayload does not always return the correct payloads when terms are located at the same position

2009-05-06 Thread Jonathan Mamou (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12706365#action_12706365
 ] 

Jonathan Mamou commented on LUCENE-1542:


I think that the bug is related neither to payloads nor to the fact that terms 
are located at the same position. 
It seems to occur only for the first term of the document, if its 
positionIncrement is equal to 0. In this case, the position of the first term 
will be wrong: -1 if there is no payload, and 2147483647 (Integer.MAX_VALUE) 
if there is a payload.

> NearSpansUnordered.getPayload does not always return the correct payloads 
> when terms are located at the same position
> -
>
> Key: LUCENE-1542
> URL: https://issues.apache.org/jira/browse/LUCENE-1542
> Project: Lucene - Java
>  Issue Type: Bug
>Affects Versions: 2.4
>Reporter: Mark Miller
>Priority: Minor
>
> More info in LUCENE-1465




[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector

2009-05-06 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12706373#action_12706373
 ] 

Michael McCandless commented on LUCENE-1593:


Patch looks good!  I can confirm that all tests pass.  (Though the back-compat 
tag has moved forward -- I just carried that part of the patch forward).  
Thanks Shai.

Some small comments:

  * No need to call out the API change to HitQueue in CHANGES.txt
(it's package private)

  * Can't the out-of-order TFC classes do the "pre-populate pqueue w/
sentinel" optimization?

  * There are some leftover javadoc references to
Query#scoresDocsInOrder (in at least TSDC)

  * I don't think we should do this yet, but..: it'd be conceivably
possible at indexing time to record whether a given field ever uses the
sentinel value for that field's type.  If it doesn't, we can also
safely pre-fill the queue w/ sentinels even for in-order scoring.
(There are many barriers to doing this optimization today in
Lucene.)


> Optimizations to TopScoreDocCollector and TopFieldCollector
> ---
>
> Key: LUCENE-1593
> URL: https://issues.apache.org/jira/browse/LUCENE-1593
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Shai Erera
> Fix For: 2.9
>
> Attachments: LUCENE-1593.patch, LUCENE-1593.patch, LUCENE-1593.patch, 
> PerfTest.java
>
>
> This is a spin-off of LUCENE-1575 and proposes to optimize TSDC and TFC code 
> to remove unnecessary checks. The plan is:
> # Ensure that IndexSearcher returns segments in increasing doc Id order, 
> instead of ordering them by numDocs().
> # Change TSDC and TFC's code to not use the doc id as a tie breaker. New docs 
> will always have larger ids and therefore cannot compete.
> # Pre-populate HitQueue with sentinel values in TSDC (score = 
> Float.NEGATIVE_INFINITY) and remove the reusableSD == null check.
> # Also move to using "changing top" and then calling adjustTop(), in case we 
> update the queue.
> # Some methods in Sort explicitly add SortField.FIELD_DOC as a "tie breaker" 
> for the last SortField. But doing so should not be necessary (since we 
> already break ties by docID), and it is in fact less efficient (once the 
> above optimization is in).
> # Investigate PQ - can we deprecate insert() and have only 
> insertWithOverflow()? Add an addDummyObjects method which will populate the 
> queue without "arranging" it, just storing the objects in the array (this 
> can be used to pre-populate sentinel values)?
> I will post a patch as well as some perf measurements as soon as I have them.




[jira] Commented: (LUCENE-1313) Realtime Search

2009-05-06 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12706377#action_12706377
 ] 

Michael McCandless commented on LUCENE-1313:



{quote}
> RAMDir changes (deletes are applied, or a new RAM segment is
> created), we must push down to DW that usage with a new synchronized
> method.

Sounds like we create a subclass of RAMDirectory with this
functionality?
{quote}

I don't think that's needed.  I think whenever IW makes a change to
the RAMDir, which is easily tracked, it pushes to DW the new RAMDir
size.

{quote}
> We don't need IW.getRamLogMergePolicy()?

Because we don't want the user customizing this?
{quote}
That, and because it's only used to determine CFS or not, which we've
turned off for RAMDir.

{quote}
> We should no longer need IndexWriter.getFlushDirectory? IE, IW
> once again has a single "Directory" as seen by IFD,
> DocFieldProcessorPerThread, etc. In the NRT case, this is an FSD; in
> the non-NRT case it's the Dir that was passed in (unless, in a future
> issue, we explore using FSD, too, for better performance).

Pass in FSD in the constructor of DocumentsWriter (and others) as
before?
{quote}

Right.  None of these places care whether they are dealing w/ an FSD or
a "real" dir.  They should simply use the Directory API as they
previously did.

{quote}
> I still don't think we need a separate RAMMergeScheduler; I
> think CMS should simply always run such merges (ie not block on max
> thread count). IW.getNextMerge can then revert to its former
> self.

Where does the thread come from for this if we're already at max
threads? If we allocate one, we're over the limit and keeping it
around. We'd need a more advanced thread pool that elastically
grows and kills threads that go unused over time. With Java 1.5 we
can use ThreadPoolExecutor. Is a dedicated thread pool something we
want to go to? Even then we could still max out a given thread pool
with requests to merge one directory or the other. We'd probably
still need two separate thread pools.
{quote}

The thread is simply launched w/o checking maxThreadCount, if the
merge is in RAM.

Right, with JDK 1.5 we can make CMS better about pooling threads.
Right now it does no long-term pooling (unless another merge happens
to be needed when a thread finishes its last merge).
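
A sketch of that launch path, loosely modeled on CMS internals - the
isRAMMerge flag and the pendingMerges queue are hypothetical, just to show
where the bypass sits:

{code}
synchronized void launchOrQueue(MergePolicy.OneMerge merge) {
  if (merge.isRAMMerge || mergeThreadCount() < maxThreadCount) {
    // RAM merges are small and must not wait behind long disk merges,
    // so they skip the maxThreadCount gate entirely.
    new MergeThread(writer, merge).start();
  } else {
    pendingMerges.add(merge);  // disk merges still respect the cap
  }
}
{code}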

{quote}
> MergePolicy.OneMerge.segString no longer needs to take a
> Directory (because it now stores a Directory).

Yeah, I noticed this, I'll change it. MergeSpecification.segString is
public and takes a directory that is not required. What to do?
{quote}
Do the usual back-compat dance -- deprecate it and add the new one.
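
The dance, sketched on MergeSpecification (assuming a "merges" List field and
a new no-arg OneMerge.segString(), per the comment above):

{code}
/** @deprecated Use {@link #segString()} instead; each OneMerge now
 *  stores its own Directory, so the argument is redundant. */
public String segString(Directory dir) {
  return segString();
}

public String segString() {
  StringBuffer b = new StringBuffer("MergeSpec:");
  for (int i = 0; i < merges.size(); i++) {
    b.append("\n  ").append(((OneMerge) merges.get(i)).segString());
  }
  return b.toString();
}
{code}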

{quote}
> The dual directories is continuing to push deeper (when I'm
> wanting to do the reverse). EG, MergeScheduler.getDestinationDirs
> should not be needed?

If we remove getFlushDirectory, are you saying getDirectory should
return the FSD if RAM NRT is turned on? This seems counterintuitive
in that we still need a clear separation of the two directories? The
user would expect the directory they passed into the ctor to be
returned?
{quote}

I agree, we should leave getDirectory() as is (returns whatever Dir
was passed in).

We can keep getFlushDirectory, but it should not have duality inside it
-- it should simply return the FSD (in the NRT case) or the normal
dir.  I don't really like the name getFlushDirectory... but can't
think of a better one yet.

Then, nothing outside of IW should ever know there are two directories
at play.  They all simply deal with the one and only Directory that IW
hands out.
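
In code, the non-dual version is just one branch, decided once ("nrt" and
"fsd" are assumed IW fields here, not existing API):

{code}
Directory getFlushDirectory() {
  // IFD, DW, etc. all see whatever single Directory this returns.
  return nrt ? fsd : directory;
}
{code}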

On the "when to flush to RAM" question... I agree it's tricky.  This
logic belongs in the RAMMergePolicy.  That policy needs to be
empowered to decide if a new flush goes to RAM or disk, to decide when
to merge all RAM segments to a new disk segment, to be able to check
if IW is in NRT mode, etc.  Probably the RAM merge policy also needs
control over how much of the RAM buffer it's going to give to DW,
too. At first the policy should not change the non-NRT case (ie one
always flushes straight to disk).  We can play w/ that in a separate
issue.  Need to think more about the logic...
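
Roughly the hooks such an empowered policy might expose - every name below
is hypothetical, only meant to pin down where the decisions would live:

{code}
public abstract class RAMMergePolicy {
  /** Should this flush go to the RAMDir, or straight to disk? */
  public abstract boolean useRAMForFlush(IndexWriter writer);

  /** Time to merge all RAM segments down to a new disk segment? */
  public abstract boolean shouldMergeRAMToDisk(long ramBytesUsed);

  /** How much of IW's RAM buffer does DW get to use? */
  public abstract long ramBufferBytesForDW(long totalRAMBudget);
}
{code}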


> Realtime Search
> ---
>
> Key: LUCENE-1313
> URL: https://issues.apache.org/jira/browse/LUCENE-1313
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: 2.4.1
>Reporter: Jason Rutherglen
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1313.jar, LUCENE-1313.patch, LUCENE-1313.patch, 
> LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, 
> LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, lucene-1313.patch, 
> lucene-1313.patch, lucene-1313.patch, lucene-1313.patch
>
>
> Realtime search with transactional semantics.  
> Possible future directions:
>   * Optimistic concurrency
>   * Replication
> Encoding each transaction into a set of bytes by writing to a RAMDirectory 
> enables replication.  It is difficult to replicate using other methods 
> because while the document may easily be serialized, the analyzer cannot.
> I think this issue can hold realtime benchmarks which include indexing and 
> searching concurrently.

[jira] Resolved: (LUCENE-1621) deprecate term and getTerm in MultiTermQuery

2009-05-06 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller resolved LUCENE-1621.
-

Resolution: Fixed

> deprecate term and getTerm in MultiTermQuery
> 
>
> Key: LUCENE-1621
> URL: https://issues.apache.org/jira/browse/LUCENE-1621
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Mark Miller
>Assignee: Mark Miller
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1621.patch, LUCENE-1621.patch
>
>
> This means moving getTerm and term up to sub classes as appropriate and 
> reimplementing equals, hashcode as appropriate in sub classes.




[jira] Updated: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector

2009-05-06 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-1593:
---

Attachment: LUCENE-1593.patch

* Removed the leftover references to Query#scoresDocsInOrder.
* Removed the text from CHANGES.
* I don't think we should do anything in TFC for now. It will only save one 
'if', and adding sentinel values is not so trivial. Maybe leave it to a code 
specializer?

> Optimizations to TopScoreDocCollector and TopFieldCollector
> ---
>
> Key: LUCENE-1593
> URL: https://issues.apache.org/jira/browse/LUCENE-1593
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Shai Erera
> Fix For: 2.9
>
> Attachments: LUCENE-1593.patch, LUCENE-1593.patch, LUCENE-1593.patch, 
> LUCENE-1593.patch, PerfTest.java
>
>
> This is a spin-off of LUCENE-1575 and proposes to optimize TSDC and TFC code 
> to remove unnecessary checks. The plan is:
> # Ensure that IndexSearcher returns segments in increasing doc Id order, 
> instead of ordering them by numDocs().
> # Change TSDC and TFC's code to not use the doc id as a tie breaker. New docs 
> will always have larger ids and therefore cannot compete.
> # Pre-populate HitQueue with sentinel values in TSDC (score = 
> Float.NEGATIVE_INFINITY) and remove the reusableSD == null check.
> # Also move to using "changing top" and then calling adjustTop(), in case we 
> update the queue.
> # Some methods in Sort explicitly add SortField.FIELD_DOC as a "tie breaker" 
> for the last SortField. But doing so should not be necessary (since we 
> already break ties by docID), and it is in fact less efficient (once the 
> above optimization is in).
> # Investigate PQ - can we deprecate insert() and have only 
> insertWithOverflow()? Add an addDummyObjects method which will populate the 
> queue without "arranging" it, just storing the objects in the array (this 
> can be used to pre-populate sentinel values)?
> I will post a patch as well as some perf measurements as soon as I have them.




[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector

2009-05-06 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12706447#action_12706447
 ] 

Michael McCandless commented on LUCENE-1593:


bq. I don't think we should do anything in TFC for now. It will only save one 
'if', and adding sentinel values is not so trivial. Maybe leave it to a code 
specializer?

OK I agree, let's not do this one for now.

New patch looks good -- I'll review it some more and then wait a few days and 
commit.  Thanks Shai!

> Optimizations to TopScoreDocCollector and TopFieldCollector
> ---
>
> Key: LUCENE-1593
> URL: https://issues.apache.org/jira/browse/LUCENE-1593
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Shai Erera
> Fix For: 2.9
>
> Attachments: LUCENE-1593.patch, LUCENE-1593.patch, LUCENE-1593.patch, 
> LUCENE-1593.patch, PerfTest.java
>
>
> This is a spin-off of LUCENE-1575 and proposes to optimize TSDC and TFC code 
> to remove unnecessary checks. The plan is:
> # Ensure that IndexSearcher returns segments in increasing doc Id order, 
> instead of ordering them by numDocs().
> # Change TSDC and TFC's code to not use the doc id as a tie breaker. New docs 
> will always have larger ids and therefore cannot compete.
> # Pre-populate HitQueue with sentinel values in TSDC (score = 
> Float.NEGATIVE_INFINITY) and remove the reusableSD == null check.
> # Also move to using "changing top" and then calling adjustTop(), in case we 
> update the queue.
> # Some methods in Sort explicitly add SortField.FIELD_DOC as a "tie breaker" 
> for the last SortField. But doing so should not be necessary (since we 
> already break ties by docID), and it is in fact less efficient (once the 
> above optimization is in).
> # Investigate PQ - can we deprecate insert() and have only 
> insertWithOverflow()? Add an addDummyObjects method which will populate the 
> queue without "arranging" it, just storing the objects in the array (this 
> can be used to pre-populate sentinel values)?
> I will post a patch as well as some perf measurements as soon as I have them.




[jira] Updated: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector

2009-05-06 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1593:
---

Attachment: LUCENE-1593.patch

OK I made some tiny fixes:

  * Added CHANGES entry explaining that Sort no longer tacks on
SortField.FIELD_DOC since that tie break is already handled
internally

  * MultiSearcher.search was creating too big an array of ScoreDocs,
and was incorrectly (because sentinels were not used) avoiding
HitQueue.size().

  * Renamed IndexSearcher.sortedStarts -> docStarts (they are no
longer sorted)

  * Made BS2.initCountingSumScorer private again

  * Small whitespace fixes

I think it's ready to commit!  I'll wait a day or two.


> Optimizations to TopScoreDocCollector and TopFieldCollector
> ---
>
> Key: LUCENE-1593
> URL: https://issues.apache.org/jira/browse/LUCENE-1593
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Shai Erera
> Fix For: 2.9
>
> Attachments: LUCENE-1593.patch, LUCENE-1593.patch, LUCENE-1593.patch, 
> LUCENE-1593.patch, LUCENE-1593.patch, PerfTest.java
>
>
> This is a spin-off of LUCENE-1575 and proposes to optimize TSDC and TFC code 
> to remove unnecessary checks. The plan is:
> # Ensure that IndexSearcher returns segments in increasing doc Id order, 
> instead of ordering them by numDocs().
> # Change TSDC and TFC's code to not use the doc id as a tie breaker. New docs 
> will always have larger ids and therefore cannot compete.
> # Pre-populate HitQueue with sentinel values in TSDC (score = 
> Float.NEGATIVE_INFINITY) and remove the reusableSD == null check.
> # Also move to using "changing top" and then calling adjustTop(), in case we 
> update the queue.
> # Some methods in Sort explicitly add SortField.FIELD_DOC as a "tie breaker" 
> for the last SortField. But doing so should not be necessary (since we 
> already break ties by docID), and it is in fact less efficient (once the 
> above optimization is in).
> # Investigate PQ - can we deprecate insert() and have only 
> insertWithOverflow()? Add an addDummyObjects method which will populate the 
> queue without "arranging" it, just storing the objects in the array (this 
> can be used to pre-populate sentinel values)?
> I will post a patch as well as some perf measurements as soon as I have them.




[jira] Commented: (LUCENE-1313) Realtime Search

2009-05-06 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12706557#action_12706557
 ] 

Jason Rutherglen commented on LUCENE-1313:
--

{quote}I don't think that's needed. I think whenever IW makes a
change to the RAMDir, which is easily tracked, it pushes to DW
the new RAMDir size.{quote}

Because we know the IW.ramdir is a RAMDirectory implementation,
we can use sizeInBytes? It's synchronized - maybe we want a
different method that's not? It seems like keeping track of all
file writes outside the ramdir is going to be difficult. For
example, when we do deletes via SegmentReader, how would we keep
track of that?

{quote}That, and because it's only used to determine CFS or not,
which we've turned off for RAMDir.{quote}

So we let the user set the RAMMergePolicy but not get it?

{quote}The thread is simply launched w/o checking
maxThreadCount, if the merge is in RAM.{quote}

Hmm... We can't just create threads and let them be garbage
collected as JVMs tend to throw OOMs with this. If we go down
this route of a single CMS, maybe we can borrow some code from
an Apache project that's implemented a threadpool.
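
For reference, the JDK 1.5 pool that behaves that way - this is plain
java.util.concurrent, not anything from the patch:

{code}
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Grows on demand up to maxMergeThreads; idle workers die after 60s, so
// nothing lingers. Note the SynchronousQueue does no queueing: a submit
// beyond the cap is rejected (default AbortPolicy), so merges would still
// need a blocking or caller-runs handler.
ThreadPoolExecutor mergePool = new ThreadPoolExecutor(
    0, maxMergeThreads,
    60L, TimeUnit.SECONDS,
    new SynchronousQueue<Runnable>());
{code}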



> Realtime Search
> ---
>
> Key: LUCENE-1313
> URL: https://issues.apache.org/jira/browse/LUCENE-1313
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: 2.4.1
>Reporter: Jason Rutherglen
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1313.jar, LUCENE-1313.patch, LUCENE-1313.patch, 
> LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, 
> LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, lucene-1313.patch, 
> lucene-1313.patch, lucene-1313.patch, lucene-1313.patch
>
>
> Realtime search with transactional semantics.  
> Possible future directions:
>   * Optimistic concurrency
>   * Replication
> Encoding each transaction into a set of bytes by writing to a RAMDirectory 
> enables replication.  It is difficult to replicate using other methods 
> because while the document may easily be serialized, the analyzer cannot.
> I think this issue can hold realtime benchmarks which include indexing and 
> searching concurrently.




[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector

2009-05-06 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12706568#action_12706568
 ] 

Shai Erera commented on LUCENE-1593:


bq. MultiSearcher.search was creating too big an array of ScoreDocs, and was 
incorrectly (because sentinels were not used) avoiding HitQueue.size().

Oh right ... I forgot to roll that back since HitQueue is initialized in those 
cases to not pre-populate with sentinel values.

> Optimizations to TopScoreDocCollector and TopFieldCollector
> ---
>
> Key: LUCENE-1593
> URL: https://issues.apache.org/jira/browse/LUCENE-1593
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Shai Erera
> Fix For: 2.9
>
> Attachments: LUCENE-1593.patch, LUCENE-1593.patch, LUCENE-1593.patch, 
> LUCENE-1593.patch, LUCENE-1593.patch, PerfTest.java
>
>
> This is a spin-off of LUCENE-1575 and proposes to optimize TSDC and TFC code 
> to remove unnecessary checks. The plan is:
> # Ensure that IndexSearcher returns segments in increasing doc Id order, 
> instead of ordering them by numDocs().
> # Change TSDC and TFC's code to not use the doc id as a tie breaker. New docs 
> will always have larger ids and therefore cannot compete.
> # Pre-populate HitQueue with sentinel values in TSDC (score = 
> Float.NEGATIVE_INFINITY) and remove the reusableSD == null check.
> # Also move to using "changing top" and then calling adjustTop(), in case we 
> update the queue.
> # Some methods in Sort explicitly add SortField.FIELD_DOC as a "tie breaker" 
> for the last SortField. But doing so should not be necessary (since we 
> already break ties by docID), and it is in fact less efficient (once the 
> above optimization is in).
> # Investigate PQ - can we deprecate insert() and have only 
> insertWithOverflow()? Add an addDummyObjects method which will populate the 
> queue without "arranging" it, just storing the objects in the array (this 
> can be used to pre-populate sentinel values)?
> I will post a patch as well as some perf measurements as soon as I have them.




[jira] Created: (LUCENE-1630) Mating Collector and Scorer on doc Id orderness

2009-05-06 Thread Shai Erera (JIRA)
Mating Collector and Scorer on doc Id orderness
---

 Key: LUCENE-1630
 URL: https://issues.apache.org/jira/browse/LUCENE-1630
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Shai Erera
 Fix For: 2.9


This is a spin-off of LUCENE-1593. This issue proposes to expose appropriate 
API on Scorer and Collector such that one can create an optimized Collector 
based on a given Scorer's doc-id orderness, and vice versa. Copied from 
LUCENE-1593, here is the list of changes:

# Deprecate Weight and create QueryWeight (abstract class) with a new 
scorer(reader, scoreDocsInOrder), replacing the current scorer(reader) method. 
QueryWeight implements Weight, while scorer(reader) calls scorer(reader, false 
/* out-of-order */) and scorer(reader, scoreDocsInOrder) is defined abstract 
(see the sketch after this list).
#* Also add QueryWeightWrapper to wrap a given Weight implementation. This one 
will also be deprecated, as well as package-private.
#* Add to Query variants of createWeight and weight which return QueryWeight. 
For now, I prefer to add a default impl which wraps the Weight variant instead 
of overriding in all Query extensions, and in 3.0 when we remove the Weight 
variants - override in all extending classes.
# Add to Scorer isOutOfOrder with a default to false, and override in BS to 
true.
# Modify BooleanWeight to extend QueryWeight and implement the new scorer 
method to return BS2 or BS based on the number of required scorers and 
setAllowOutOfOrder.
# Add to Collector an abstract _acceptsDocsOutOfOrder_ which returns true/false.
#* Use it in IndexSearcher.search methods, that accept a Collector, in order to 
create the appropriate Scorer, using the new QueryWeight.
#* Provide static create methods on TFC and TSDC which accept this as an 
argument and create the proper instance.
#* Wherever we create a Collector (TSDC or TFC), always ask for out-of-order 
Scorer and check on the resulting Scorer isOutOfOrder(), so that we can create 
the optimized Collector instance.
# Modify IndexSearcher to use all of the above logic.
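
A rough sketch of the shapes proposed in this list (all signatures tentative):

{code}
public abstract class QueryWeight implements Weight {
  /** New variant: the caller states whether it needs docs scored in order. */
  public abstract Scorer scorer(IndexReader reader, boolean scoreDocsInOrder)
      throws IOException;

  /** Old variant delegates, permitting out-of-order scoring. */
  public Scorer scorer(IndexReader reader) throws IOException {
    return scorer(reader, false);
  }
}
{code}

and the mating step in IndexSearcher.search, given the new abstract 
Collector#acceptsDocsOutOfOrder:

{code}
// A Collector that tolerates out-of-order docs lets us ask for the
// cheaper, possibly out-of-order Scorer (e.g. BS instead of BS2).
Scorer scorer = queryWeight.scorer(reader, !collector.acceptsDocsOutOfOrder());
{code}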

The only class I'm worried about, and would like to verify with you, is 
Searchable. If we want to deprecate all the search methods on IndexSearcher, 
Searcher and Searchable which accept Weight and add new ones which accept 
QueryWeight, we must do the following:
* Deprecate Searchable in favor of Searcher.
* Add to Searcher the new QueryWeight variants. Here we have two choices: (1) 
break back-compat and add them as abstract (like we've done with the new 
Collector method) or (2) add them with a default impl to call the Weight 
versions, documenting these will become abstract in 3.0.
* Have Searcher extend UnicastRemoteObject and have RemoteSearchable extend 
Searcher. That's the part I'm a little bit worried about - Searchable 
implements java.rmi.Remote, which means there could be an implementation out 
there which implements Searchable and extends something other than 
UnicastRemoteObject, like Activatable. I think there is a very small chance 
this has actually happened, but I would like to confirm with you guys first.
* Add a deprecated, package-private, SearchableWrapper which extends Searcher 
and delegates all calls to the Searchable member.
* Deprecate all uses of Searchable and add Searcher instead, defaulting the old 
ones to use SearchableWrapper.
* Make all the necessary changes to IndexSearcher, MultiSearcher etc. regarding 
overriding these new methods.

One other optimization that was discussed in LUCENE-1593 is to expose a 
topScorer() API (on Weight) which returns a Scorer whose score(Collector) 
will be called, and additionally to add a start() method to DISI. That would 
allow Scorers to initialize either in start() or in score(Collector). This was 
proposed mainly because of BS and BS2, which check whether they are initialized 
in every call to next(), skipTo() and score(). Personally I prefer to see that 
in a separate issue, following this one (as it might add methods to 
QueryWeight).




QueryWeight (Was... Re: [jira] Created: (LUCENE-1630) Mating Collector and Scorer on doc Id orderness)

2009-05-06 Thread Marvin Humphrey
> * Add to Searcher the new QueryWeight variants. 

If you make QueryWeight a subclass of Query, do you need any new methods?

Before Weight existed, only Query and Scorer existed. Compiling a Scorer
involved "weighting the query" - factoring in IDF etc. - then calling
query.scorer().  To make Query objects reusable, Weight was introduced as an
intermediate stage.  Making QueryWeight a subclass of Query would be entirely
within the spirit of the original design, since the role played by Weight was
originally performed by a Query.

Marvin Humphrey





[jira] Commented: (LUCENE-1260) Norm codec strategy in Similarity

2009-05-06 Thread Johan Kindgren (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12706604#action_12706604
 ] 

Johan Kindgren commented on LUCENE-1260:


Wouldn't the simplest solution be to refactor out the static methods, replace 
them with instance methods, and remove the getNormDecoder method? This would 
enable pluggable behavior without introducing a new Codec.
It would cause minor changes to 11 classes in the core, and would also clean 
the static stuff out of the code.

As described in LUCENE-1261.
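
A sketch of what that refactor buys, assuming the statics become overridable
instance methods (the method names are hypothetical; today's
encodeNorm/decodeNorm are static):

{code}
public class ScaledNormSimilarity extends DefaultSimilarity {
  // LUCENE-1260's use case: discretize 100f..250f into 60 buckets
  // instead of the default SmallFloat span.
  public byte encodeNormValue(float f) {
    float clamped = Math.max(100f, Math.min(250f, f));
    return (byte) ((clamped - 100f) / 150f * 59f);
  }

  public float decodeNormValue(byte b) {
    return 100f + (b & 0xFF) * 150f / 59f;
  }
}
{code}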

> Norm codec strategy in Similarity
> -
>
> Key: LUCENE-1260
> URL: https://issues.apache.org/jira/browse/LUCENE-1260
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.3.1
>Reporter: Karl Wettin
> Attachments: LUCENE-1260.txt, LUCENE-1260.txt, LUCENE-1260.txt
>
>
> The static span and resolution of the 8-bit norms codec might not fit 
> all applications. 
> My use case requires that 100f-250f be discretized into 60 buckets instead 
> of the default.. 10?




[jira] Commented: (LUCENE-1313) Realtime Search

2009-05-06 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12706611#action_12706611
 ] 

Michael McCandless commented on LUCENE-1313:


{quote}
> I don't think that's needed. I think whenever IW makes a
> change to the RAMDir, which is easily tracked, it pushes to DW
> the new RAMDir size.

Because we know the IW.ramdir is a RAMDirectory implementation,
we can use sizeInBytes? It's synchronized - maybe we want a
different method that's not? It seems like keeping track of all
file writes outside the ramdir is going to be difficult. For
example, when we do deletes via SegmentReader, how would we keep
track of that?
{quote}

We should definitely just use the sizeInBytes() method.

I'm saying that IW knows when it writes new files to the RAMDir
(flushing deletes, flushing new segment) and it's only at those times
that it should call sizeInBytes() and push that value down to DW.
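
I.e. something like this inside IW (the DW hook name is hypothetical):

{code}
// Called only after the RAMDir actually changed: a new RAM segment was
// flushed or deletes were applied. sizeInBytes() is synchronized, but at
// this call frequency that cost is irrelevant -- and DW never has to poll.
private void pushRAMDirSize() {
  docWriter.setRAMDirSize(ramDirectory.sizeInBytes());
}
{code}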

{quote}
> That, and because it's only used to determine CFS or not,
> which we've turned off for RAMDir.

So we let the user set the RAMMergePolicy but not get it?
{quote}

Oh, we should add a getter (getRAMMergePolicy, not getLogMergePolicy)
for it, but it should return MergePolicy not LogMergePolicy.

{quote}
> The thread is simply launched w/o checking
> maxThreadCount, if the merge is in RAM.

Hmm... We can't just create threads and let them be garbage
collected as JVMs tend to throw OOMs with this. If we go down
this route of a single CMS, maybe we can borrow some code from
an Apache project that's implemented a threadpool.
{quote}

This is how CMS has always been.  It launches threads relatively
rarely -- this shouldn't lead to OOMs.  One can always subclass CMS if
this is somehow a problem.  Or we could modify CMS to pool its threads
(as a new issue)?


> Realtime Search
> ---
>
> Key: LUCENE-1313
> URL: https://issues.apache.org/jira/browse/LUCENE-1313
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: 2.4.1
>Reporter: Jason Rutherglen
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1313.jar, LUCENE-1313.patch, LUCENE-1313.patch, 
> LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, 
> LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, lucene-1313.patch, 
> lucene-1313.patch, lucene-1313.patch, lucene-1313.patch
>
>
> Realtime search with transactional semantics.  
> Possible future directions:
>   * Optimistic concurrency
>   * Replication
> Encoding each transaction into a set of bytes by writing to a RAMDirectory 
> enables replication.  It is difficult to replicate using other methods 
> because while the document may easily be serialized, the analyzer cannot.
> I think this issue can hold realtime benchmarks which include indexing and 
> searching concurrently.




[jira] Commented: (LUCENE-1313) Realtime Search

2009-05-06 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12706662#action_12706662
 ] 

Jason Rutherglen commented on LUCENE-1313:
--

In the patch the merge policies are split up, which requires some
of the RAM NRT logic to live in updatePendingMerges. 

One solution is to have a merge policy that manages merging both to
RAM and to disk - an overarching merge policy over the primary MP
and the RAM MP. This would push the logic of RAM merging and
primary-dir merging into the meta merge policy, relieving IW of
managing RAM segs vs. primary segs.

Do IW.optimize and IW.expungeDeletes operate on the ramdir as
well? (The expungeDeletes javadoc implies that calling
IR.numDeletedDocs will return zero when there are no deletes.)

> Realtime Search
> ---
>
> Key: LUCENE-1313
> URL: https://issues.apache.org/jira/browse/LUCENE-1313
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: 2.4.1
>Reporter: Jason Rutherglen
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1313.jar, LUCENE-1313.patch, LUCENE-1313.patch, 
> LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, 
> LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, lucene-1313.patch, 
> lucene-1313.patch, lucene-1313.patch, lucene-1313.patch
>
>
> Realtime search with transactional semantics.  
> Possible future directions:
>   * Optimistic concurrency
>   * Replication
> Encoding each transaction into a set of bytes by writing to a RAMDirectory 
> enables replication.  It is difficult to replicate using other methods 
> because while the document may easily be serialized, the analyzer cannot.
> I think this issue can hold realtime benchmarks which include indexing and 
> searching concurrently.




[jira] Updated: (LUCENE-1629) contrib intelligent Analyzer for Chinese

2009-05-06 Thread Xiaoping Gao (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoping Gao updated LUCENE-1629:
-

Attachment: analysis-data.zip

Lexical dictionary files. Unzip the archive somewhere, then run 
TestSmartChineseAnalyzer with this command (note the -D system property must 
come before the class name):

java -Danalysis.data.dir=/path/to/analysis-data/ org.apache.lucene.analysis.cn.TestSmartChineseAnalyzer


> contrib intelligent Analyzer for Chinese
> 
>
> Key: LUCENE-1629
> URL: https://issues.apache.org/jira/browse/LUCENE-1629
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Affects Versions: 2.4.1
> Environment: for java 1.5 or higher, lucene 2.4.1
>Reporter: Xiaoping Gao
> Attachments: analysis-data.zip, LUCENE-1629.patch
>
>
> I wrote an Analyzer for Apache Lucene for analyzing sentences in the Chinese 
> language. It's called "imdict-chinese-analyzer"; the project on Google Code 
> is here: http://code.google.com/p/imdict-chinese-analyzer/
> In Chinese, "我是中国人"(I am Chinese) should be tokenized as "我"(I)   "是"(am)   
> "中国人"(Chinese), not "我" "是中" "国人". So the analyzer must handle each sentence 
> properly, or there will be misunderstandings everywhere in the index 
> constructed by Lucene, and the accuracy of the search engine will be affected 
> seriously!
> Although there are two analyzer packages in the Apache repository which can 
> handle Chinese, ChineseAnalyzer and CJKAnalyzer, they take each character or 
> every two adjoining characters as a single word. This is obviously not true 
> in reality; this strategy will also increase the index size and hurt 
> performance badly.
> The algorithm of imdict-chinese-analyzer is based on the Hidden Markov Model 
> (HMM), so it can tokenize Chinese sentences in a really intelligent way. 
> Tokenization accuracy of this model is above 90% according to the paper 
> "HHMM-based Chinese Lexical Analyzer ICTCLAS", while other analyzers' is 
> about 60%.
> As imdict-chinese-analyzer is really fast and intelligent, I want to 
> contribute it to the Apache Lucene repository.




[jira] Updated: (LUCENE-1629) contrib intelligent Analyzer for Chinese

2009-05-06 Thread Xiaoping Gao (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoping Gao updated LUCENE-1629:
-

Attachment: LUCENE-1629.patch

Here is all the source code of the intelligent analyzer for Chinese - about 
2500 lines.
The unit TestCase contains a main method, which needs the lexical dictionary 
to run, so I will post the binary lexical dictionary soon.

> contrib intelligent Analyzer for Chinese
> 
>
> Key: LUCENE-1629
> URL: https://issues.apache.org/jira/browse/LUCENE-1629
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Affects Versions: 2.4.1
> Environment: for java 1.5 or higher, lucene 2.4.1
>Reporter: Xiaoping Gao
> Attachments: LUCENE-1629.patch
>
>
> I wrote an Analyzer for Apache Lucene for analyzing sentences in the Chinese 
> language. It's called "imdict-chinese-analyzer"; the project on Google Code 
> is here: http://code.google.com/p/imdict-chinese-analyzer/
> In Chinese, "我是中国人"(I am Chinese) should be tokenized as "我"(I)   "是"(am)   
> "中国人"(Chinese), not "我" "是中" "国人". So the analyzer must handle each sentence 
> properly, or there will be misunderstandings everywhere in the index 
> constructed by Lucene, and the accuracy of the search engine will be affected 
> seriously!
> Although there are two analyzer packages in the Apache repository which can 
> handle Chinese, ChineseAnalyzer and CJKAnalyzer, they take each character or 
> every two adjoining characters as a single word. This is obviously not true 
> in reality; this strategy will also increase the index size and hurt 
> performance badly.
> The algorithm of imdict-chinese-analyzer is based on the Hidden Markov Model 
> (HMM), so it can tokenize Chinese sentences in a really intelligent way. 
> Tokenization accuracy of this model is above 90% according to the paper 
> "HHMM-based Chinese Lexical Analyzer ICTCLAS", while other analyzers' is 
> about 60%.
> As imdict-chinese-analyzer is really fast and intelligent, I want to 
> contribute it to the Apache Lucene repository.
