[jira] Commented: (LUCENE-831) Complete overhaul of FieldCache API/Implementation

2009-04-16 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699545#action_12699545
 ] 

Uwe Schindler commented on LUCENE-831:
--

Hi, looks good:

I am just not sure what the right caching ValueSource would be. If you use a 
caching value source externally from the IndexReader, what should I use? The 
original trie patch used CachingValueSource (when that patch was written, only 
CachingValueSource existed):

{code}
+  public static final ValueSource TRIE_VALUE_SOURCE = new CachingValueSource(new TrieValueSource());
{code}

But would CacheByReaderValueSource as a per-JVM singleton be correct? For the 
tests it is not a problem, because there is only one index with one segment. 
But if I used CachingValueSource as a singleton, wouldn't it cache all values 
from all index readers mixed together?
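
For illustration, here is a minimal sketch of the difference, with the caveat that the patch's actual ValueSource/CachingValueSource API is still in flux and the class shapes below are assumptions, not the patch's code. A cache keyed per IndexReader (weakly, so entries die with the reader) avoids mixing values across readers:

{code}
// Hypothetical sketch only; the real patch classes may differ.
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.WeakHashMap;
import org.apache.lucene.index.IndexReader;

abstract class SketchValueSource {
  abstract Object getValues(IndexReader reader, String field) throws IOException;
}

class SketchCacheByReaderValueSource extends SketchValueSource {
  private final SketchValueSource delegate;
  // weak keys: cached values are released when the reader is collected
  private final Map<IndexReader, Map<String, Object>> cache =
      new WeakHashMap<IndexReader, Map<String, Object>>();

  SketchCacheByReaderValueSource(SketchValueSource delegate) {
    this.delegate = delegate;
  }

  synchronized Object getValues(IndexReader reader, String field) throws IOException {
    Map<String, Object> perReader = cache.get(reader);
    if (perReader == null) {
      perReader = new HashMap<String, Object>();
      cache.put(reader, perReader);
    }
    Object values = perReader.get(field);
    if (values == null) {
      values = delegate.getValues(reader, field); // decode once per reader+field
      perReader.put(field, values);
    }
    return values;
  }
}
{code}

A plain CachingValueSource singleton that keys only on the field name would hand values computed against one reader to a different reader, which is exactly the mixing described above.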



 Complete overhaul of FieldCache API/Implementation
 --

 Key: LUCENE-831
 URL: https://issues.apache.org/jira/browse/LUCENE-831
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Hoss Man
Assignee: Mark Miller
 Fix For: 3.0

 Attachments: ExtendedDocument.java, fieldcache-overhaul.032208.diff, 
 fieldcache-overhaul.diff, fieldcache-overhaul.diff, 
 LUCENE-831-trieimpl.patch, LUCENE-831.03.28.2008.diff, 
 LUCENE-831.03.30.2008.diff, LUCENE-831.03.31.2008.diff, LUCENE-831.patch, 
 LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, 
 LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, 
 LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, 
 LUCENE-831.patch


 Motivation:
 1) Completely overhaul the API/implementation of FieldCache-type things...
 a) eliminate global static map keyed on IndexReader (thus
 eliminating the synch block between completely independent IndexReaders)
 b) allow more customization of cache management (ie: use 
 expiration/replacement strategies, disk backed caches, etc)
 c) allow people to define custom cache data logic (ie: custom
 parsers, complex datatypes, etc... anything tied to a reader)
 d) allow people to inspect what's in a cache (list of CacheKeys) for
 an IndexReader so a new IndexReader can be likewise warmed. 
 e) Lend support for smarter cache management if/when
 IndexReader.reopen is added (merging of cached data from subReaders).
 2) Provide backwards compatibility to support existing FieldCache API with
 the new implementation, so there is no redundant caching as client code
 migrates to the new API.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1536) if a filter can support random access API, we should use it

2009-04-16 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699558#action_12699558
 ] 

Uwe Schindler commented on LUCENE-1536:
---

How about DocIdSet adding a
{code}
boolean isRandomAccess() { return false; }
{code}
That is implemented to return false in the default abstract class, for backwards 
compatibility.
If a DocIdSet is random access (backed by OpenBitSet, or is the empty iterator), 
isRandomAccess() is overridden to return true, and an additional method in 
DocIdSet is implemented; the default would be:
{code}
boolean acceptDoc(int docid) { throw new UnsupportedOperationException(); }
{code}
Both changes are backwards compatible, but filters using OpenBitSet would 
automatically be random access and support acceptDoc().
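
Put together, the proposal would make DocIdSet look roughly like this (a sketch of the idea, not a committed API):

{code}
// Sketch of the proposed backwards-compatible extension to DocIdSet.
public abstract class DocIdSet {
  public abstract DocIdSetIterator iterator() throws java.io.IOException;

  // Default false: existing subclasses stay iterator-only.
  public boolean isRandomAccess() {
    return false;
  }

  // Only called when isRandomAccess() returns true.
  public boolean acceptDoc(int docid) {
    throw new UnsupportedOperationException();
  }
}
{code}

Since OpenBitSet already extends DocIdSet, it would simply override both methods, as the follow-up comments spell out.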

 if a filter can support random access API, we should use it
 ---

 Key: LUCENE-1536
 URL: https://issues.apache.org/jira/browse/LUCENE-1536
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.4
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1536.patch


 I ran some performance tests, comparing applying a filter via
 random-access API instead of current trunk's iterator API.
 This was inspired by LUCENE-1476, where we realized deletions should
 really be implemented just like a filter, but then in testing found
 that switching deletions to iterator was a very sizable performance
 hit.
 Some notes on the test:
   * Index is first 2M docs of Wikipedia.  Test machine is Mac OS X
 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
   * I test across multiple queries.  "1-X" means an OR query, e.g. "1-4"
 means 1 OR 2 OR 3 OR 4, whereas "+1-4" is an AND query, i.e. 1 AND 2
 AND 3 AND 4.  "u s" means "united states" (phrase search).
   * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
 95, 98, 99, 99.9 (filter is non-null but all bits are set),
 100 (filter=null, control)).
   * Method "high" means I use the random-access filter API in
 IndexSearcher's main loop.  Method "low" means I use the random-access
 filter API down in SegmentTermDocs (just like deleted docs
 today).
   * Baseline (QPS) is current trunk, where the filter is applied as an iterator
 up high (i.e. in IndexSearcher's search loop).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1536) if a filter can support random access API, we should use it

2009-04-16 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699571#action_12699571
 ] 

Uwe Schindler commented on LUCENE-1536:
---

The empty DocIdSet instance should *not* be random access :), so the only 
change would be for OpenBitSet to override these two new methods from the 
default abstract class:
{code}
boolean isRandomAccess() { return true; }
boolean acceptDoc(int docid) { return get(docid); /* possibly inlined */ }
{code}

 if a filter can support random access API, we should use it
 ---

 Key: LUCENE-1536
 URL: https://issues.apache.org/jira/browse/LUCENE-1536
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.4
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1536.patch


 I ran some performance tests, comparing applying a filter via
 random-access API instead of current trunk's iterator API.
 This was inspired by LUCENE-1476, where we realized deletions should
 really be implemented just like a filter, but then in testing found
 that switching deletions to iterator was a very sizable performance
 hit.
 Some notes on the test:
   * Index is first 2M docs of Wikipedia.  Test machine is Mac OS X
 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
   * I test across multiple queries.  "1-X" means an OR query, e.g. "1-4"
 means 1 OR 2 OR 3 OR 4, whereas "+1-4" is an AND query, i.e. 1 AND 2
 AND 3 AND 4.  "u s" means "united states" (phrase search).
   * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
 95, 98, 99, 99.9 (filter is non-null but all bits are set),
 100 (filter=null, control)).
   * Method "high" means I use the random-access filter API in
 IndexSearcher's main loop.  Method "low" means I use the random-access
 filter API down in SegmentTermDocs (just like deleted docs
 today).
   * Baseline (QPS) is current trunk, where the filter is applied as an iterator
 up high (i.e. in IndexSearcher's search loop).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1536) if a filter can support random access API, we should use it

2009-04-16 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699573#action_12699573
 ] 

Uwe Schindler commented on LUCENE-1536:
---

And the switch for different densities:
OpenBitSet could calculate its density in isRandomAccess() and return true or 
false depending on the density factors above. The search code would then check 
isRandomAccess() only once initially (before starting filtering) and then switch 
between the iterator and random access APIs.
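
A rough sketch of how the consuming search code might use this (hypothetical, since the integration into the scoring loop is exactly what is still being designed; it assumes the proposed isRandomAccess()/acceptDoc() methods and the 2.4-era Scorer/HitCollector API):

{code}
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.*;

// Hypothetical consumer-side sketch.
void searchWithFilter(Scorer scorer, Filter filter, IndexReader reader,
    HitCollector collector) throws IOException {
  DocIdSet docIdSet = filter.getDocIdSet(reader);
  if (docIdSet.isRandomAccess()) {          // proposed method
    // dense filter: drive iteration from the scorer, probe the filter per hit
    while (scorer.next()) {
      int doc = scorer.doc();
      if (docIdSet.acceptDoc(doc)) {        // proposed method
        collector.collect(doc, scorer.score());
      }
    }
  } else {
    // sparse filter: current behavior, leapfrog scorer and filter iterator
    DocIdSetIterator filterIter = docIdSet.iterator();
    // ... existing skipTo()-based conjunction ...
  }
}
{code}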

 if a filter can support random access API, we should use it
 ---

 Key: LUCENE-1536
 URL: https://issues.apache.org/jira/browse/LUCENE-1536
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.4
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1536.patch


 I ran some performance tests, comparing applying a filter via
 random-access API instead of current trunk's iterator API.
 This was inspired by LUCENE-1476, where we realized deletions should
 really be implemented just like a filter, but then in testing found
 that switching deletions to iterator was a very sizable performance
 hit.
 Some notes on the test:
   * Index is first 2M docs of Wikipedia.  Test machine is Mac OS X
 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
   * I test across multiple queries.  "1-X" means an OR query, e.g. "1-4"
 means 1 OR 2 OR 3 OR 4, whereas "+1-4" is an AND query, i.e. 1 AND 2
 AND 3 AND 4.  "u s" means "united states" (phrase search).
   * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
 95, 98, 99, 99.9 (filter is non-null but all bits are set),
 100 (filter=null, control)).
   * Method "high" means I use the random-access filter API in
 IndexSearcher's main loop.  Method "low" means I use the random-access
 filter API down in SegmentTermDocs (just like deleted docs
 today).
   * Baseline (QPS) is current trunk, where the filter is applied as an iterator
 up high (i.e. in IndexSearcher's search loop).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-04-16 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-1606:


Attachment: automaton.patch

patch

 Automaton Query/Filter (scalable regex)
 ---

 Key: LUCENE-1606
 URL: https://issues.apache.org/jira/browse/LUCENE-1606
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Reporter: Robert Muir
Priority: Minor
 Attachments: automaton.patch


 Attached is a patch for an AutomatonQuery/Filter (the name can change if it's 
 not suitable).
 While the out-of-the-box contrib RegexQuery is nice, I have some very large 
 indexes (100M+ unique tokens) where queries are quite slow (2 minutes, etc.). 
 Additionally, all of the existing RegexQuery implementations in Lucene are 
 really slow if there is no constant prefix. This implementation does not 
 depend upon a constant prefix, and runs the same query in 640ms.
 Some use cases I envision:
  1. lexicography/etc. on large text corpora
  2. looking for things such as URLs where the prefix is not constant (http:// 
 or ftp://)
 The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
 regular expressions into a DFA. Then, the filter enumerates terms in a 
 special way, using the underlying state machine. Here is my short 
 description from the comments:
  The algorithm here is pretty basic. Enumerate terms, but instead of a 
 binary accept/reject do:
  1. Look at the portion that is OK (did not enter a reject state in the 
 DFA).
  2. Generate the next possible String and seek to that.
 The Query simply wraps the filter with ConstantScoreQuery.
 I did not include the automaton.jar inside the patch, but it can be downloaded 
 from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-04-16 Thread Robert Muir (JIRA)
Automaton Query/Filter (scalable regex)
---

 Key: LUCENE-1606
 URL: https://issues.apache.org/jira/browse/LUCENE-1606
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Reporter: Robert Muir
Priority: Minor
 Attachments: automaton.patch

Attached is a patch for an AutomatonQuery/Filter (the name can change if it's 
not suitable).

While the out-of-the-box contrib RegexQuery is nice, I have some very large 
indexes (100M+ unique tokens) where queries are quite slow (2 minutes, etc.). 
Additionally, all of the existing RegexQuery implementations in Lucene are 
really slow if there is no constant prefix. This implementation does not depend 
upon a constant prefix, and runs the same query in 640ms.

Some use cases I envision:
 1. lexicography/etc. on large text corpora
 2. looking for things such as URLs where the prefix is not constant (http:// 
or ftp://)

The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
regular expressions into a DFA. Then, the filter enumerates terms in a 
special way, using the underlying state machine. Here is my short 
description from the comments:

 The algorithm here is pretty basic. Enumerate terms, but instead of a 
binary accept/reject do:

 1. Look at the portion that is OK (did not enter a reject state in the DFA).
 2. Generate the next possible String and seek to that.

The Query simply wraps the filter with ConstantScoreQuery.

I did not include the automaton.jar inside the patch, but it can be downloaded 
from http://www.brics.dk/automaton/ and is BSD-licensed.
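
For readers skimming the issue, a heavily simplified sketch of that seek-driven enumeration follows; `collect` and `nextPossibleString` are hypothetical stand-ins for the patch's actual accept logic and next-string computation:

{code}
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import dk.brics.automaton.RunAutomaton;

// Sketch only: enumerate matching terms, seeking past rejected ranges.
void enumerate(IndexReader reader, String field, RunAutomaton dfa) throws IOException {
  TermEnum te = reader.terms(new Term(field, ""));
  try {
    while (te.term() != null && te.term().field().equals(field)) {
      String text = te.term().text();
      if (dfa.run(text)) {
        collect(te.term());                          // full match: accept it
        if (!te.next()) break;                       // next term in order
      } else {
        // smallest string > text that the DFA could still accept
        String next = nextPossibleString(dfa, text); // hypothetical helper
        if (next == null) break;                     // nothing left to match
        te.close();
        te = reader.terms(new Term(field, next));    // seek, skipping the gap
      }
    }
  } finally {
    te.close();
  }
}
{code}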

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (LUCENE-1536) if a filter can support random access API, we should use it

2009-04-16 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699571#action_12699571
 ] 

Uwe Schindler edited comment on LUCENE-1536 at 4/16/09 2:27 AM:


The empty DocIdSet instance should *not* be random access :), so the only 
change would be for OpenBitSet to override these two new methods from the 
default abstract class:
{code}
boolean isRandomAccess() { return true; }
boolean acceptDoc(int docid) { return fastGet(docid); /* possibly inlined */ }
{code}

  was (Author: thetaphi):
The empty DocIdSet instance should *not* be random access :), so the only 
change would be for OpenBitSet to override these two new methods from the 
default abstract class:
{code}
boolean isRandomAccess() { return true; }
boolean acceptDoc(int docid) { return get(docid); /* possibly inlined */ }
{code}
  
 if a filter can support random access API, we should use it
 ---

 Key: LUCENE-1536
 URL: https://issues.apache.org/jira/browse/LUCENE-1536
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.4
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1536.patch


 I ran some performance tests, comparing applying a filter via
 random-access API instead of current trunk's iterator API.
 This was inspired by LUCENE-1476, where we realized deletions should
 really be implemented just like a filter, but then in testing found
 that switching deletions to iterator was a very sizable performance
 hit.
 Some notes on the test:
   * Index is first 2M docs of Wikipedia.  Test machine is Mac OS X
 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
   * I test across multiple queries.  "1-X" means an OR query, e.g. "1-4"
 means 1 OR 2 OR 3 OR 4, whereas "+1-4" is an AND query, i.e. 1 AND 2
 AND 3 AND 4.  "u s" means "united states" (phrase search).
   * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
 95, 98, 99, 99.9 (filter is non-null but all bits are set),
 100 (filter=null, control)).
   * Method "high" means I use the random-access filter API in
 IndexSearcher's main loop.  Method "low" means I use the random-access
 filter API down in SegmentTermDocs (just like deleted docs
 today).
   * Baseline (QPS) is current trunk, where the filter is applied as an iterator
 up high (i.e. in IndexSearcher's search loop).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Filtering documents out of IndexReader

2009-04-16 Thread Michael McCandless
On Tue, Apr 14, 2009 at 9:25 PM, Jeremy Volkman jvolk...@gmail.com wrote:

 Implementing this way allows me to write RAM indexes out to disk without
 blocking readers, and only block readers when I need to remap any filtered
 docs that may have been updated or deleted during the flushing process. I
 think this may beat using a straight IW for my requirements, but I'm not
 positive yet.

I think testing out-of-the-box NRT's performance should be your next
step: if it's sufficient, why take on all the complexity of tracking
these RAM indices?

 So I've currently got a SuppressedIndexReader that extends FilterIndexReader, but
 due to 1483 and 1573 I had to implement IndexReader.getFieldCacheKey() to
 get any sort of decent search performance, which I'd rather not do since I'm
 aware it's only temporary.

It's temporary because it's needed for the current field cache API,
which we hope to replace with LUCENE-831.  Still, it will likely be
shipped w/ 2.9 and then removed in 3.0.

LUCENE-1313 aims to support the RAM buffering for real, for cases
where performance of the current NRT is in fact limiting, but we still
have some iterating to do on that one.

 Is it possible to perform a bunch of adds and deletes from an IW in an
 atomic action? Should I use addIndexesNoOptimize?

IW doesn't support this, so you'll have to externally synchronize to
achieve it.  Earlier patches on LUCENE-1313 did have a Transaction
class for an atomic set of updates.
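
For what it's worth, a minimal sketch of that external synchronization (an assumed pattern, not a Lucene API): keep one lock that both the batch and the reopen path take, so a reader can never observe a half-applied batch.

{code}
import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

// Illustrative sketch: one lock guards both the update batch and reader reopen.
class AtomicBatchSketch {
  private final Object writeLock = new Object();

  void applyBatch(IndexWriter writer, Term idToDelete, Document replacement)
      throws IOException {
    synchronized (writeLock) {
      writer.deleteDocuments(idToDelete);
      writer.addDocument(replacement);
      writer.commit();          // nothing becomes visible until here
    }
  }

  IndexReader reopen(IndexReader current) throws IOException {
    synchronized (writeLock) {  // never reopen in the middle of a batch
      return current.reopen();
    }
  }
}
{code}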

 If I go the filtered searcher direction, my filter will have to be aware of
 the portion of the MultiReader that corresponds to the disk index. Can I
 assume that my disk index will populate the lower portion of doc id space if
 it comes first in the list passed to the MultiReader constructor? The code
 says yes but the docs don't say anything.

This is true today, but is an implementation detail that's free to
change from release to release.

Also, I'd worry about search performance of the filtered searcher
approach, if that's an issue in your app.

Mike

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1591) Enable bzip compression in benchmark

2009-04-16 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699604#action_12699604
 ] 

Michael McCandless commented on LUCENE-1591:


All tests pass!  And patch looks good.  I'll commit shortly.  Thanks Shai!

 Enable bzip compression in benchmark
 

 Key: LUCENE-1591
 URL: https://issues.apache.org/jira/browse/LUCENE-1591
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/benchmark
Reporter: Shai Erera
 Fix For: 2.9

 Attachments: commons-compress-dev20090413.jar, 
 commons-compress-dev20090413.jar, LUCENE-1591.patch, LUCENE-1591.patch, 
 LUCENE-1591.patch, LUCENE-1591.patch, LUCENE-1591.patch, LUCENE-1591.patch, 
 LUCENE-1591.patch


 bzip compression can aid the benchmark package by not requiring bzip files 
 (such as enwiki) to be extracted in order to index them. The plan is to add a 
 config parameter bzip.compression=true/false and, in the relevant tasks, either 
 decompress the input file or compress the output file using the bzip streams.
 It will add a dependency on ant.jar, which contains two classes similar to 
 GZIPOutputStream and GZIPInputStream that compress/decompress files using 
 the bzip algorithm.
 bzip is known to be superior to the gzip algorithm in compression 
 performance (~20% better compression), although it does the 
 compression/decompression a bit slower.
 I will post a patch which adds this parameter and implements it in 
 LineDocMaker, EnwikiDocMaker and the WriteLineDoc task. Maybe even add the 
 capability to DocMaker or some of the superclasses, so it can be inherited 
 by all subclasses.
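
For illustration, reading a .bz2 line file directly might look like the sketch below. It assumes a BZip2CompressorInputStream class from the attached commons-compress dev jar; the exact package and class names in that jar are an assumption here:

{code}
import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
// assumed to come from the attached commons-compress dev jar:
import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream;

// Sketch: index lines straight from the compressed file, no extraction step.
public class BzipReadSketch {
  public static void main(String[] args) throws IOException {
    BufferedReader lines = new BufferedReader(new InputStreamReader(
        new BZip2CompressorInputStream(
            new BufferedInputStream(new FileInputStream(args[0]))), "UTF-8"));
    try {
      String line;
      while ((line = lines.readLine()) != null) {
        // hand the line to LineDocMaker / the doc maker ...
      }
    } finally {
      lines.close();
    }
  }
}
{code}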

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1591) Enable bzip compression in benchmark

2009-04-16 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699620#action_12699620
 ] 

Shai Erera commented on LUCENE-1591:


Mike, did you commit the commons-compress jar too?

 Enable bzip compression in benchmark
 

 Key: LUCENE-1591
 URL: https://issues.apache.org/jira/browse/LUCENE-1591
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/benchmark
Reporter: Shai Erera
 Fix For: 2.9

 Attachments: commons-compress-dev20090413.jar, 
 commons-compress-dev20090413.jar, LUCENE-1591.patch, LUCENE-1591.patch, 
 LUCENE-1591.patch, LUCENE-1591.patch, LUCENE-1591.patch, LUCENE-1591.patch, 
 LUCENE-1591.patch


 bzip compression can aid the benchmark package by not requiring bzip files 
 (such as enwiki) to be extracted in order to index them. The plan is to add a 
 config parameter bzip.compression=true/false and, in the relevant tasks, either 
 decompress the input file or compress the output file using the bzip streams.
 It will add a dependency on ant.jar, which contains two classes similar to 
 GZIPOutputStream and GZIPInputStream that compress/decompress files using 
 the bzip algorithm.
 bzip is known to be superior to the gzip algorithm in compression 
 performance (~20% better compression), although it does the 
 compression/decompression a bit slower.
 I will post a patch which adds this parameter and implements it in 
 LineDocMaker, EnwikiDocMaker and the WriteLineDoc task. Maybe even add the 
 capability to DocMaker or some of the superclasses, so it can be inherited 
 by all subclasses.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1604) Stop creating huge arrays to represent the absence of field norms

2009-04-16 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699607#action_12699607
 ] 

Michael McCandless commented on LUCENE-1604:


OK, patch looks good.  All tests pass, even if I temporarily default 
disableFakeNorms to true (but back-compat tests fail, which is expected and 
is why we won't flip the default until 3.0).  Thanks Shon!

I still need to test perf cost of this change...

 Stop creating huge arrays to represent the absence of field norms
 -

 Key: LUCENE-1604
 URL: https://issues.apache.org/jira/browse/LUCENE-1604
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.9
Reporter: Shon Vella
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1604.patch, LUCENE-1604.patch


 Creating and keeping around huge arrays that hold a constant value is very 
 inefficient, both from a heap usage standpoint and from a locality of 
 reference standpoint. It would be much more efficient to use null to 
 represent a missing norms table.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Resolved: (LUCENE-1591) Enable bzip compression in benchmark

2009-04-16 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-1591.


Resolution: Fixed

 Enable bzip compression in benchmark
 

 Key: LUCENE-1591
 URL: https://issues.apache.org/jira/browse/LUCENE-1591
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/benchmark
Reporter: Shai Erera
 Fix For: 2.9

 Attachments: commons-compress-dev20090413.jar, 
 commons-compress-dev20090413.jar, LUCENE-1591.patch, LUCENE-1591.patch, 
 LUCENE-1591.patch, LUCENE-1591.patch, LUCENE-1591.patch, LUCENE-1591.patch, 
 LUCENE-1591.patch


 bzip compression can aid the benchmark package by not requiring bzip files 
 (such as enwiki) to be extracted in order to index them. The plan is to add a 
 config parameter bzip.compression=true/false and, in the relevant tasks, either 
 decompress the input file or compress the output file using the bzip streams.
 It will add a dependency on ant.jar, which contains two classes similar to 
 GZIPOutputStream and GZIPInputStream that compress/decompress files using 
 the bzip algorithm.
 bzip is known to be superior to the gzip algorithm in compression 
 performance (~20% better compression), although it does the 
 compression/decompression a bit slower.
 I will post a patch which adds this parameter and implements it in 
 LineDocMaker, EnwikiDocMaker and the WriteLineDoc task. Maybe even add the 
 capability to DocMaker or some of the superclasses, so it can be inherited 
 by all subclasses.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-04-16 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-1606:


Attachment: automatonWithWildCard.patch

Here is an updated patch with AutomatonWildCardQuery.

This implements standard Lucene Wildcard query with AutomatonFilter.

This accelerates quite a few wildcard situations, such as ??(a|b)?cd*ef.
Sorry, it provides no help for a leading *, but it definitely helps for a leading ?.

All wildcard tests pass.
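
As a sketch of the idea (an assumed mapping, not necessarily the patch's exact code), a wildcard pattern can be translated onto brics automata like this:

{code}
import dk.brics.automaton.Automaton;
import dk.brics.automaton.BasicAutomata;

// Sketch: map Lucene wildcard syntax onto a brics DFA.
Automaton toAutomaton(String wildcard) {
  Automaton a = BasicAutomata.makeEmptyString();
  for (int i = 0; i < wildcard.length(); i++) {
    char c = wildcard.charAt(i);
    Automaton next;
    if (c == '*') {
      next = BasicAutomata.makeAnyString();   // any (possibly empty) suffix
    } else if (c == '?') {
      next = BasicAutomata.makeAnyChar();     // exactly one character
    } else {
      next = BasicAutomata.makeChar(c);       // literal character
    }
    a = a.concatenate(next);
  }
  a.determinize();                            // DFA for the term enumeration
  return a;
}
{code}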

 Automaton Query/Filter (scalable regex)
 ---

 Key: LUCENE-1606
 URL: https://issues.apache.org/jira/browse/LUCENE-1606
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Reporter: Robert Muir
Priority: Minor
 Attachments: automaton.patch, automatonWithWildCard.patch


 Attached is a patch for an AutomatonQuery/Filter (the name can change if it's 
 not suitable).
 While the out-of-the-box contrib RegexQuery is nice, I have some very large 
 indexes (100M+ unique tokens) where queries are quite slow (2 minutes, etc.). 
 Additionally, all of the existing RegexQuery implementations in Lucene are 
 really slow if there is no constant prefix. This implementation does not 
 depend upon a constant prefix, and runs the same query in 640ms.
 Some use cases I envision:
  1. lexicography/etc. on large text corpora
  2. looking for things such as URLs where the prefix is not constant (http:// 
 or ftp://)
 The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
 regular expressions into a DFA. Then, the filter enumerates terms in a 
 special way, using the underlying state machine. Here is my short 
 description from the comments:
  The algorithm here is pretty basic. Enumerate terms, but instead of a 
 binary accept/reject do:
  1. Look at the portion that is OK (did not enter a reject state in the 
 DFA).
  2. Generate the next possible String and seek to that.
 The Query simply wraps the filter with ConstantScoreQuery.
 I did not include the automaton.jar inside the patch, but it can be downloaded 
 from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1591) Enable bzip compression in benchmark

2009-04-16 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699643#action_12699643
 ] 

Michael McCandless commented on LUCENE-1591:


bq. Mike, did you commit the commons-compress jar too?

Woops, forgot, and now fixed -- thanks for catching that!

 Enable bzip compression in benchmark
 

 Key: LUCENE-1591
 URL: https://issues.apache.org/jira/browse/LUCENE-1591
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/benchmark
Reporter: Shai Erera
 Fix For: 2.9

 Attachments: commons-compress-dev20090413.jar, 
 commons-compress-dev20090413.jar, LUCENE-1591.patch, LUCENE-1591.patch, 
 LUCENE-1591.patch, LUCENE-1591.patch, LUCENE-1591.patch, LUCENE-1591.patch, 
 LUCENE-1591.patch


 bzip compression can aid the benchmark package by not requiring bzip files 
 (such as enwiki) to be extracted in order to index them. The plan is to add a 
 config parameter bzip.compression=true/false and, in the relevant tasks, either 
 decompress the input file or compress the output file using the bzip streams.
 It will add a dependency on ant.jar, which contains two classes similar to 
 GZIPOutputStream and GZIPInputStream that compress/decompress files using 
 the bzip algorithm.
 bzip is known to be superior to the gzip algorithm in compression 
 performance (~20% better compression), although it does the 
 compression/decompression a bit slower.
 I will post a patch which adds this parameter and implements it in 
 LineDocMaker, EnwikiDocMaker and the WriteLineDoc task. Maybe even add the 
 capability to DocMaker or some of the superclasses, so it can be inherited 
 by all subclasses.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-831) Complete overhaul of FieldCache API/Implementation

2009-04-16 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699644#action_12699644
 ] 

Mark Miller commented on LUCENE-831:


Right, you really want to use CacheByReaderValueSource. Better would probably 
be to get that cache on the segment reader as well, but I think that would mean 
bringing back some sort of general cache to IndexReader: you would have to be 
able to attach arbitrary ValueSources to the reader. We will see what ends up 
materializing. I am agonizingly slow at understanding anything, but quick to 
move anyway ;)

 Complete overhaul of FieldCache API/Implementation
 --

 Key: LUCENE-831
 URL: https://issues.apache.org/jira/browse/LUCENE-831
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Hoss Man
Assignee: Mark Miller
 Fix For: 3.0

 Attachments: ExtendedDocument.java, fieldcache-overhaul.032208.diff, 
 fieldcache-overhaul.diff, fieldcache-overhaul.diff, 
 LUCENE-831-trieimpl.patch, LUCENE-831.03.28.2008.diff, 
 LUCENE-831.03.30.2008.diff, LUCENE-831.03.31.2008.diff, LUCENE-831.patch, 
 LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, 
 LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, 
 LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, 
 LUCENE-831.patch


 Motivation:
 1) Completely overhaul the API/implementation of FieldCache-type things...
 a) eliminate global static map keyed on IndexReader (thus
 eliminating the synch block between completely independent IndexReaders)
 b) allow more customization of cache management (ie: use 
 expiration/replacement strategies, disk backed caches, etc)
 c) allow people to define custom cache data logic (ie: custom
 parsers, complex datatypes, etc... anything tied to a reader)
 d) allow people to inspect what's in a cache (list of CacheKeys) for
 an IndexReader so a new IndexReader can be likewise warmed. 
 e) Lend support for smarter cache management if/when
 IndexReader.reopen is added (merging of cached data from subReaders).
 2) Provide backwards compatibility to support existing FieldCache API with
 the new implementation, so there is no redundant caching as client code
 migrates to the new API.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector

2009-04-16 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699656#action_12699656
 ] 

Michael McCandless commented on LUCENE-1593:


bq. if so, can we agree on the new names (add, updateTop)?

I think it makes sense to add these, returning the min value (and deprecate the 
old ones).

 Optimizations to TopScoreDocCollector and TopFieldCollector
 ---

 Key: LUCENE-1593
 URL: https://issues.apache.org/jira/browse/LUCENE-1593
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Shai Erera
 Fix For: 2.9


 This is a spin-off of LUCENE-1575 and proposes to optimize the TSDC and TFC code 
 to remove unnecessary checks. The plan is:
 # Ensure that IndexSearcher returns segments in increasing doc Id order, 
 instead of by numDocs().
 # Change TSDC and TFC's code to not use the doc id as a tie breaker. New docs 
 will always have larger ids and therefore cannot compete.
 # Pre-populate HitQueue with sentinel values in TSDC (score = Float.NEG_INF) 
 and remove the check if reusableSD == null (see the sketch after this list).
 # Also move to use "changing top" and then call adjustTop(), in case we 
 update the queue.
 # Some methods in Sort explicitly add SortField.FIELD_DOC as a tie breaker 
 for the last SortField. But doing so should not be necessary (since we 
 already break ties by docID), and is in fact less efficient (once the above 
 optimization is in).
 # Investigate PQ - can we deprecate insert() and have only 
 insertWithOverflow()? Add an addDummyObjects method which will populate the 
 queue without arranging it, i.e. just store the objects in the array (this 
 can be used to pre-populate sentinel values)?
 I will post a patch as well as some perf measurements as soon as I have them.
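
A hedged sketch of the sentinel idea from item 3 (illustration only, not the actual patch): pre-filling the queue means every real hit competes against an entry it trivially beats, so the hot loop loses both the null check and the "is the queue full?" branch.

{code}
import org.apache.lucene.search.ScoreDoc;

// Illustration of pre-populated sentinels; heap maintenance is elided.
class SentinelQueueSketch {
  final ScoreDoc[] heap;   // stand-in for HitQueue's backing array

  SentinelQueueSketch(int numHits) {
    heap = new ScoreDoc[numHits];
    for (int i = 0; i < numHits; i++) {
      // worst possible entry: -Inf score, MAX_VALUE doc as tie breaker
      heap[i] = new ScoreDoc(Integer.MAX_VALUE, Float.NEGATIVE_INFINITY);
    }
  }

  void collect(int doc, float score) {
    ScoreDoc top = heap[0];        // least element of the heap
    if (score > top.score) {       // no null check, no size check
      top.doc = doc;
      top.score = score;
      // adjustTop();              // re-heapify, as in Lucene's PriorityQueue
    }
  }
}
{code}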

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-04-16 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699657#action_12699657
 ] 

Robert Muir commented on LUCENE-1606:
-

Mark: yeah, the enumeration helps a lot; it means a lot fewer comparisons, plus 
brics is *FAST*.

Inside the AutomatonFilter I describe how it could possibly be done better, but 
I was afraid I would mess it up.
It's affected somewhat by the size of the alphabet, so if you were using it 
against lots of CJK text, it might be worth it to instead use the 
State/Transition objects in the package. Transitions are described by min and 
max character intervals, and you can access the intervals in sorted order...

It's all so nice, but I figure this is a start.
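
As a sketch of that alternative (assumed usage of the brics State/Transition API, not code from the patch):

{code}
import dk.brics.automaton.Automaton;
import dk.brics.automaton.State;
import dk.brics.automaton.Transition;

// Sketch: walk character intervals instead of probing single characters,
// which matters when the alphabet is huge (e.g. CJK text).
void walkIntervals(Automaton a) {
  State state = a.getInitialState();
  for (Transition t : state.getTransitions()) {
    char lo = t.getMin();          // each transition covers [lo, hi]
    char hi = t.getMax();
    State dest = t.getDest();
    // enumerate only terms whose next character falls inside [lo, hi] ...
  }
}
{code}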

 Automaton Query/Filter (scalable regex)
 ---

 Key: LUCENE-1606
 URL: https://issues.apache.org/jira/browse/LUCENE-1606
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Reporter: Robert Muir
Priority: Minor
 Attachments: automaton.patch, automatonWithWildCard.patch, 
 automatonWithWildCard2.patch


 Attached is a patch for an AutomatonQuery/Filter (the name can change if it's 
 not suitable).
 While the out-of-the-box contrib RegexQuery is nice, I have some very large 
 indexes (100M+ unique tokens) where queries are quite slow (2 minutes, etc.). 
 Additionally, all of the existing RegexQuery implementations in Lucene are 
 really slow if there is no constant prefix. This implementation does not 
 depend upon a constant prefix, and runs the same query in 640ms.
 Some use cases I envision:
  1. lexicography/etc. on large text corpora
  2. looking for things such as URLs where the prefix is not constant (http:// 
 or ftp://)
 The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
 regular expressions into a DFA. Then, the filter enumerates terms in a 
 special way, using the underlying state machine. Here is my short 
 description from the comments:
  The algorithm here is pretty basic. Enumerate terms, but instead of a 
 binary accept/reject do:
  1. Look at the portion that is OK (did not enter a reject state in the 
 DFA).
  2. Generate the next possible String and seek to that.
 The Query simply wraps the filter with ConstantScoreQuery.
 I did not include the automaton.jar inside the patch, but it can be downloaded 
 from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-04-16 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699659#action_12699659
 ] 

Michael McCandless commented on LUCENE-1606:


Can this do everything that RegexQuery currently does?  (I.e., could we 
deprecate RegexQuery?)

 Automaton Query/Filter (scalable regex)
 ---

 Key: LUCENE-1606
 URL: https://issues.apache.org/jira/browse/LUCENE-1606
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Reporter: Robert Muir
Priority: Minor
 Fix For: 2.9

 Attachments: automaton.patch, automatonWithWildCard.patch, 
 automatonWithWildCard2.patch


 Attached is a patch for an AutomatonQuery/Filter (the name can change if it's 
 not suitable).
 While the out-of-the-box contrib RegexQuery is nice, I have some very large 
 indexes (100M+ unique tokens) where queries are quite slow (2 minutes, etc.). 
 Additionally, all of the existing RegexQuery implementations in Lucene are 
 really slow if there is no constant prefix. This implementation does not 
 depend upon a constant prefix, and runs the same query in 640ms.
 Some use cases I envision:
  1. lexicography/etc. on large text corpora
  2. looking for things such as URLs where the prefix is not constant (http:// 
 or ftp://)
 The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
 regular expressions into a DFA. Then, the filter enumerates terms in a 
 special way, using the underlying state machine. Here is my short 
 description from the comments:
  The algorithm here is pretty basic. Enumerate terms, but instead of a 
 binary accept/reject do:
  1. Look at the portion that is OK (did not enter a reject state in the 
 DFA).
  2. Generate the next possible String and seek to that.
 The Query simply wraps the filter with ConstantScoreQuery.
 I did not include the automaton.jar inside the patch, but it can be downloaded 
 from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-04-16 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1606:
---

Fix Version/s: 2.9

 Automaton Query/Filter (scalable regex)
 ---

 Key: LUCENE-1606
 URL: https://issues.apache.org/jira/browse/LUCENE-1606
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Reporter: Robert Muir
Priority: Minor
 Fix For: 2.9

 Attachments: automaton.patch, automatonWithWildCard.patch, 
 automatonWithWildCard2.patch


 Attached is a patch for an AutomatonQuery/Filter (the name can change if it's 
 not suitable).
 While the out-of-the-box contrib RegexQuery is nice, I have some very large 
 indexes (100M+ unique tokens) where queries are quite slow (2 minutes, etc.). 
 Additionally, all of the existing RegexQuery implementations in Lucene are 
 really slow if there is no constant prefix. This implementation does not 
 depend upon a constant prefix, and runs the same query in 640ms.
 Some use cases I envision:
  1. lexicography/etc. on large text corpora
  2. looking for things such as URLs where the prefix is not constant (http:// 
 or ftp://)
 The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
 regular expressions into a DFA. Then, the filter enumerates terms in a 
 special way, using the underlying state machine. Here is my short 
 description from the comments:
  The algorithm here is pretty basic. Enumerate terms, but instead of a 
 binary accept/reject do:
  1. Look at the portion that is OK (did not enter a reject state in the 
 DFA).
  2. Generate the next possible String and seek to that.
 The Query simply wraps the filter with ConstantScoreQuery.
 I did not include the automaton.jar inside the patch, but it can be downloaded 
 from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1603) Changes for TrieRange in FilteredTermEnum and MultiTermQuery improvement

2009-04-16 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699660#action_12699660
 ] 

Michael McCandless commented on LUCENE-1603:


I think the name is good, so it's clear you have to provide a MultiTermQuery 
yourself (via subclass) to use it.

 Changes for TrieRange in FilteredTermEnum and MultiTermQuery improvement
 

 Key: LUCENE-1603
 URL: https://issues.apache.org/jira/browse/LUCENE-1603
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: 2.4, 2.9
Reporter: Uwe Schindler
Assignee: Michael McCandless
 Fix For: 2.9

 Attachments: LUCENE-1603.patch, LUCENE-1603.patch, LUCENE-1603.patch


 This is a patch that is needed for the MultiTermQuery rewrite of TrieRange 
 (LUCENE-1602):
 - Make the private members protected, to have access to them from the very 
 special TrieRangeTermEnum 
 - Fix a small inconsistency (docFreq() now only returns a value if a valid 
 term exists)
 - Improve MultiTermFilter.getDocIdSet to return 
 DocIdSet.EMPTY_DOCIDSET if the TermEnum is empty (less memory usage, and 
 faster)
 - Add getLastNumberOfTerms() to MultiTermQuery for statistics on 
 different multi-term queries and how many terms they affect. Using this new 
 functionality, the improvement of TrieRange can be shown (extract from the test 
 case there, 1 docs index, long values):
 {code}
 [junit] Average number of terms during random search on 'field8':
 [junit]  Trie query: 244.2
 [junit]  Classical query: 3136.94
 [junit] Average number of terms during random search on 'field4':
 [junit]  Trie query: 38.3
 [junit]  Classical query: 3018.68
 [junit] Average number of terms during random search on 'field2':
 [junit]  Trie query: 18.04
 [junit]  Classical query: 3539.42
 {code}
 All core tests pass.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-831) Complete overhaul of FieldCache API/Implementation

2009-04-16 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699649#action_12699649
 ] 

Uwe Schindler commented on LUCENE-831:
--

This was the idea behind the FieldType: you register the parsers/value sources 
at the top-level IndexReader/MultiReader/whatever (e.g. in a map keyed by 
field), all subreaders also get this map (passed through), and if one asks for 
cache values for a specific field, he gets the correctly decoded fields (from 
CSF, Univerter, TrieUniverter, stored fields [not really, but that would be 
possible]). This was the original approach of this issue: attach caching to the 
single index/segment readers (with the possibility to register value sources 
for specific fields).
In this case the SortField ctors taking ValueSource or Parser can be dropped 
(and we can do this for 2.9, as the Parser ctor of SortField was not yet 
released!).
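
Purely as an illustration of that registration idea (every name below is a placeholder; none of this is committed API):

{code}
// Hypothetical sketch: register per-field value sources at the top-level
// reader; subreaders inherit the map.
Map<String, ValueSource> sourcesByField = new HashMap<String, ValueSource>();
sourcesByField.put("price", new TrieValueSource());        // trie-encoded numbers
sourcesByField.put("title", new UninverterValueSource());  // default uninversion

// hypothetical open() overload that passes the map to all subreaders:
IndexReader reader = IndexReader.open(directory, sourcesByField);

// sorting then needs no per-SortField Parser/ValueSource override:
Sort sort = new Sort(new SortField("price", SortField.LONG));
{code}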

 Complete overhaul of FieldCache API/Implementation
 --

 Key: LUCENE-831
 URL: https://issues.apache.org/jira/browse/LUCENE-831
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Hoss Man
Assignee: Mark Miller
 Fix For: 3.0

 Attachments: ExtendedDocument.java, fieldcache-overhaul.032208.diff, 
 fieldcache-overhaul.diff, fieldcache-overhaul.diff, 
 LUCENE-831-trieimpl.patch, LUCENE-831.03.28.2008.diff, 
 LUCENE-831.03.30.2008.diff, LUCENE-831.03.31.2008.diff, LUCENE-831.patch, 
 LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, 
 LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, 
 LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, 
 LUCENE-831.patch


 Motivation:
 1) Completely overhaul the API/implementation of FieldCache-type things...
 a) eliminate global static map keyed on IndexReader (thus
 eliminating the synch block between completely independent IndexReaders)
 b) allow more customization of cache management (ie: use 
 expiration/replacement strategies, disk backed caches, etc)
 c) allow people to define custom cache data logic (ie: custom
 parsers, complex datatypes, etc... anything tied to a reader)
 d) allow people to inspect what's in a cache (list of CacheKeys) for
 an IndexReader so a new IndexReader can be likewise warmed. 
 e) Lend support for smarter cache management if/when
 IndexReader.reopen is added (merging of cached data from subReaders).
 2) Provide backwards compatibility to support existing FieldCache API with
 the new implementation, so there is no redundant caching as client code
 migrates to the new API.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1603) Changes for TrieRange in FilteredTermEnum and MultiTermQuery improvement

2009-04-16 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699647#action_12699647
 ] 

Michael McCandless commented on LUCENE-1603:


Patch looks good -- I'll commit shortly.  Thanks Uwe!

 Changes for TrieRange in FilteredTermEnum and MultiTermQuery improvement
 

 Key: LUCENE-1603
 URL: https://issues.apache.org/jira/browse/LUCENE-1603
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: 2.4, 2.9
Reporter: Uwe Schindler
Assignee: Michael McCandless
 Fix For: 2.9

 Attachments: LUCENE-1603.patch, LUCENE-1603.patch, LUCENE-1603.patch


 This is a patch that is needed for the MultiTermQuery rewrite of TrieRange 
 (LUCENE-1602):
 - Make the private members protected, to have access to them from the very 
 special TrieRangeTermEnum 
 - Fix a small inconsistency (docFreq() now only returns a value if a valid 
 term exists)
 - Improve MultiTermFilter.getDocIdSet to return 
 DocIdSet.EMPTY_DOCIDSET if the TermEnum is empty (less memory usage, and 
 faster)
 - Add getLastNumberOfTerms() to MultiTermQuery for statistics on 
 different multi-term queries and how many terms they affect. Using this new 
 functionality, the improvement of TrieRange can be shown (extract from the test 
 case there, 1 docs index, long values):
 {code}
 [junit] Average number of terms during random search on 'field8':
 [junit]  Trie query: 244.2
 [junit]  Classical query: 3136.94
 [junit] Average number of terms during random search on 'field4':
 [junit]  Trie query: 38.3
 [junit]  Classical query: 3018.68
 [junit] Average number of terms during random search on 'field2':
 [junit]  Trie query: 18.04
 [junit]  Classical query: 3539.42
 {code}
 All core tests pass.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1603) Changes for TrieRange in FilteredTermEnum and MultiTermQuery improvement

2009-04-16 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699654#action_12699654
 ] 

Uwe Schindler commented on LUCENE-1603:
---

Do you think the name is good? MultiTermQueryWrapperFilter, or simpler, 
MultiTermFilter? It's not really either one; it's a mix between a wrapper and the 
real filter: it wraps the query, but does getDocIdSet and the TermEnums itself.
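
For context, the getDocIdSet shape being described would look roughly like this sketch (not the exact patch code), including the EMPTY_DOCIDSET short-circuit from the patch notes; `query` is the wrapped MultiTermQuery:

{code}
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.util.OpenBitSet;

// Sketch: the filter walks the wrapped MultiTermQuery's term enum itself.
public DocIdSet getDocIdSet(IndexReader reader) throws IOException {
  TermEnum enumerator = query.getEnum(reader);  // wrapped query's FilteredTermEnum
  try {
    if (enumerator.term() == null) {
      return DocIdSet.EMPTY_DOCIDSET;           // no matching terms at all
    }
    OpenBitSet bits = new OpenBitSet(reader.maxDoc());
    TermDocs termDocs = reader.termDocs();
    try {
      do {
        termDocs.seek(enumerator.term());
        while (termDocs.next()) {
          bits.set(termDocs.doc());             // mark every matching doc
        }
      } while (enumerator.next());
    } finally {
      termDocs.close();
    }
    return bits;                                // OpenBitSet is itself a DocIdSet
  } finally {
    enumerator.close();
  }
}
{code}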

 Changes for TrieRange in FilteredTermEnum and MultiTermQuery improvement
 

 Key: LUCENE-1603
 URL: https://issues.apache.org/jira/browse/LUCENE-1603
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: 2.4, 2.9
Reporter: Uwe Schindler
Assignee: Michael McCandless
 Fix For: 2.9

 Attachments: LUCENE-1603.patch, LUCENE-1603.patch, LUCENE-1603.patch


 This is a patch that is needed for the MultiTermQuery rewrite of TrieRange 
 (LUCENE-1602):
 - Make the private members protected, to have access to them from the very 
 special TrieRangeTermEnum 
 - Fix a small inconsistency (docFreq() now only returns a value if a valid 
 term exists)
 - Improve MultiTermFilter.getDocIdSet to return 
 DocIdSet.EMPTY_DOCIDSET if the TermEnum is empty (less memory usage, and 
 faster)
 - Add getLastNumberOfTerms() to MultiTermQuery for statistics on 
 different multi-term queries and how many terms they affect. Using this new 
 functionality, the improvement of TrieRange can be shown (extract from the test 
 case there, 1 docs index, long values):
 {code}
 [junit] Average number of terms during random search on 'field8':
 [junit]  Trie query: 244.2
 [junit]  Classical query: 3136.94
 [junit] Average number of terms during random search on 'field4':
 [junit]  Trie query: 38.3
 [junit]  Classical query: 3018.68
 [junit] Average number of terms during random search on 'field2':
 [junit]  Trie query: 18.04
 [junit]  Classical query: 3539.42
 {code}
 All core tests pass.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-831) Complete overhaul of FieldCache API/Implementation

2009-04-16 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699663#action_12699663
 ] 

Mark Miller commented on LUCENE-831:



That's somewhat possible now (with the exception that you can't yet set the 
value source for the segment reader - it would likely become an argument to 
the static open methods): ValueSource gets a field as an argument, so it is 
also easy enough to set a ValueSource that does trie encoding for arbitrary 
fields on the SegmentReader, e.g. FieldTypeValueSource could take arguments to 
configure it per field, and then you set it on the IndexReader when you open it. 
That's all still in the patch; it's just a bit more of a pain than being able to 
set it at any time on the SortField as an override.

I guess I almost see things going to just the segment reader ValueSource option 
though: once FieldCache goes back to standard, it might make sense to drop the 
SortField ValueSource support too, and just do the segment ValueSource. Being 
able to init the SegmentReader with a ValueSource really allows for anything 
needed; I just wasn't sure if it was too much of a pain in comparison to also 
having a dynamic SortField override.

 Complete overhaul of FieldCache API/Implementation
 --

 Key: LUCENE-831
 URL: https://issues.apache.org/jira/browse/LUCENE-831
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Hoss Man
Assignee: Mark Miller
 Fix For: 3.0

 Attachments: ExtendedDocument.java, fieldcache-overhaul.032208.diff, 
 fieldcache-overhaul.diff, fieldcache-overhaul.diff, 
 LUCENE-831-trieimpl.patch, LUCENE-831.03.28.2008.diff, 
 LUCENE-831.03.30.2008.diff, LUCENE-831.03.31.2008.diff, LUCENE-831.patch, 
 LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, 
 LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, 
 LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, 
 LUCENE-831.patch


 Motivation:
 1) Completely overhaul the API/implementation of FieldCache-type things...
 a) eliminate global static map keyed on IndexReader (thus
 eliminating the synch block between completely independent IndexReaders)
 b) allow more customization of cache management (ie: use 
 expiration/replacement strategies, disk backed caches, etc)
 c) allow people to define custom cache data logic (ie: custom
 parsers, complex datatypes, etc... anything tied to a reader)
 d) allow people to inspect what's in a cache (list of CacheKeys) for
 an IndexReader so a new IndexReader can be likewise warmed. 
 e) Lend support for smarter cache management if/when
 IndexReader.reopen is added (merging of cached data from subReaders).
 2) Provide backwards compatibility to support existing FieldCache API with
 the new implementation, so there is no redundant caching as client code
 migrates to the new API.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1602) Rewrite TrieRange to use MultiTermQuery

2009-04-16 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-1602:
--

Attachment: LUCENE-1602.patch

This is the final patch, with the changes for LUCENE-1603. I also added 
svn:eol-style to all files in trie and test-trie.
Because this is not yet committed, the patch may still fail to apply, but I 
will commit in the next few hours.

 Rewrite TrieRange to use MultiTermQuery
 ---

 Key: LUCENE-1602
 URL: https://issues.apache.org/jira/browse/LUCENE-1602
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Affects Versions: 2.9
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 2.9

 Attachments: LUCENE-1602.patch, LUCENE-1602.patch, LUCENE-1602.patch, 
 LUCENE-1602.patch, LUCENE-1602.patch, queries.zip, queries.zip


 Issue for discussion here: 
 http://www.lucidimagination.com/search/document/46a548a79ae9c809/move_trierange_to_core_module_and_integration_issues
 This patch is a rewrite of TrieRange using MultiTermQuery like all other core 
 queries. This should make TrieRange identical in functionality to core range 
 queries.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1536) if a filter can support random access API, we should use it

2009-04-16 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699669#action_12699669
 ] 

Michael McCandless commented on LUCENE-1536:


I like this approach!

But should we somehow decouple the density check from the is-random-access 
check?  I.e., isRandomAccess should return true or false based on the underlying 
data structure.  Then, somehow, I think the search code should determine whether 
a given docIdSet should be randomly accessed vs. iterated?  (I'm not sure how 
yet!)

Also, we somehow need a mechanism to denormalize the application of the 
filter from top to bottom, i.e., each leaf TermQuery involved in the full query 
needs to know to apply the random-access filter just like it applies deletes.
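
A rough sketch of the decoupling, purely illustrative - isRandomAccess() is the 
method proposed in this issue, and the density policy with its 1% threshold is 
an assumption:

{code}
import org.apache.lucene.search.DocIdSet;

// Illustrative sketch only: capability (reported by the data structure)
// is kept separate from policy (decided by the search code). The
// cardinality parameter and the 1% threshold are assumptions.
class FilterAccessPolicy {
  static boolean useRandomAccess(DocIdSet set, long cardinality, int maxDoc) {
    if (!set.isRandomAccess())         // capability: can the structure do get(doc)?
      return false;
    return cardinality > maxDoc / 100; // policy: dense enough to beat iteration
  }
}
{code}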

 if a filter can support random access API, we should use it
 ---

 Key: LUCENE-1536
 URL: https://issues.apache.org/jira/browse/LUCENE-1536
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.4
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1536.patch


 I ran some performance tests, comparing applying a filter via
 random-access API instead of current trunk's iterator API.
 This was inspired by LUCENE-1476, where we realized deletions should
 really be implemented just like a filter, but then in testing found
 that switching deletions to iterator was a very sizable performance
 hit.
 Some notes on the test:
   * Index is first 2M docs of Wikipedia.  Test machine is Mac OS X
 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
   * I test across multiple queries.  1-X means an OR query, eg 1-4
 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2
 AND 3 AND 4.  u s means united states (phrase search).
   * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
 95, 98, 99, 99.9 (filter is non-null but all bits are set),
 100 (filter=null, control)).
   * Method high means I use random-access filter API in
 IndexSearcher's main loop.  Method low means I use random-access
 filter API down in SegmentTermDocs (just like deleted docs
 today).
   * Baseline (QPS) is current trunk, where filter is applied as iterator up
 high (ie in IndexSearcher's search loop).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-04-16 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-1606:


Attachment: automatonWithWildCard2.patch

Oops - I did say in the javadocs that the score is constant / boost only, so 
when the wildcard pattern has no wildcards and rewrites to a TermQuery, it is 
wrapped with ConstantScoreQuery(QueryWrapperFilter(...)) to ensure this.
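
A sketch of that rewrite path, using the 2.x-era QueryWrapperFilter and the 
Filter-based ConstantScoreQuery constructor (the method name is made up):

{code}
import org.apache.lucene.index.Term;
import org.apache.lucene.search.ConstantScoreQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryWrapperFilter;
import org.apache.lucene.search.TermQuery;

class WildcardRewriteSketch {
  // When the pattern contains no wildcard characters, rewrite to a plain
  // term lookup, but keep constant-score semantics by wrapping it.
  static Query rewriteNoWildcard(Term term) {
    return new ConstantScoreQuery(new QueryWrapperFilter(new TermQuery(term)));
  }
}
{code}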



 Automaton Query/Filter (scalable regex)
 ---

 Key: LUCENE-1606
 URL: https://issues.apache.org/jira/browse/LUCENE-1606
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Reporter: Robert Muir
Priority: Minor
 Attachments: automaton.patch, automatonWithWildCard.patch, 
 automatonWithWildCard2.patch


 Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
 suitable).
 Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
 indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
 Additionally all of the existing RegexQuery implementations in Lucene are 
 really slow if there is no constant prefix. This implementation does not 
 depend upon constant prefix, and runs the same query in 640ms.
 Some use cases I envision:
  1. lexicography/etc on large text corpora
  2. looking for things such as urls where the prefix is not constant (http:// 
 or ftp://)
 The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
 regular expressions into a DFA. Then, the filter enumerates terms in a 
 special way, by using the underlying state machine. Here is my short 
 description from the comments:
  The algorithm here is pretty basic. Enumerate terms but instead of a 
 binary accept/reject do:
   
  1. Look at the portion that is OK (did not enter a reject state in the 
 DFA)
  2. Generate the next possible String and seek to that.
 the Query simply wraps the filter with ConstantScoreQuery.
 I did not include the automaton.jar inside the patch but it can be downloaded 
 from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-04-16 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699662#action_12699662
 ] 

Robert Muir commented on LUCENE-1606:
-

Mike, the thing it can't do is stuff that cannot be determinized. However, I 
think you only need an NFA for capturing-group-related things:

http://oreilly.com/catalog/regex/chapter/ch04.html

One thing is that the brics syntax is a bit different, i.e. ^ and $ are implied, 
and I think some things need to be escaped. 
So I think it can do everything RegexQuery does, but maybe different syntax is 
required.
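
For illustration, a small example of the implied anchoring (dk.brics.automaton 
is the package the patch uses):

{code}
import dk.brics.automaton.Automaton;
import dk.brics.automaton.RegExp;

public class BricsSyntaxDemo {
  public static void main(String[] args) {
    // brics regexes are implicitly anchored: the expression has to match
    // the whole string, so there is no ^ or $.
    Automaton a = new RegExp("http://.*").toAutomaton();
    System.out.println(a.run("http://example.com"));   // true
    System.out.println(a.run("x http://example.com")); // false: no implicit prefix
  }
}
{code}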


 Automaton Query/Filter (scalable regex)
 ---

 Key: LUCENE-1606
 URL: https://issues.apache.org/jira/browse/LUCENE-1606
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Reporter: Robert Muir
Priority: Minor
 Fix For: 2.9

 Attachments: automaton.patch, automatonWithWildCard.patch, 
 automatonWithWildCard2.patch


 Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
 suitable).
 Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
 indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
 Additionally all of the existing RegexQuery implementations in Lucene are 
 really slow if there is no constant prefix. This implementation does not 
 depend upon constant prefix, and runs the same query in 640ms.
 Some use cases I envision:
  1. lexicography/etc on large text corpora
  2. looking for things such as urls where the prefix is not constant (http:// 
 or ftp://)
 The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
 regular expressions into a DFA. Then, the filter enumerates terms in a 
 special way, by using the underlying state machine. Here is my short 
 description from the comments:
  The algorithm here is pretty basic. Enumerate terms but instead of a 
 binary accept/reject do:
   
  1. Look at the portion that is OK (did not enter a reject state in the 
 DFA)
  2. Generate the next possible String and seek to that.
 the Query simply wraps the filter with ConstantScoreQuery.
 I did not include the automaton.jar inside the patch but it can be downloaded 
 from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-04-16 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699650#action_12699650
 ] 

Mark Miller commented on LUCENE-1606:
-

Very nice Robert. This looks like it would make a very nice addition to our 
regex support.

Found the benchmarks here quite interesting: 
http://tusker.org/regex/regex_benchmark.html (though it sounds like your 
special enumeration technique makes this regex impl even faster for our uses?)

 Automaton Query/Filter (scalable regex)
 ---

 Key: LUCENE-1606
 URL: https://issues.apache.org/jira/browse/LUCENE-1606
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Reporter: Robert Muir
Priority: Minor
 Attachments: automaton.patch, automatonWithWildCard.patch


 Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
 suitable).
 Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
 indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
 Additionally all of the existing RegexQuery implementations in Lucene are 
 really slow if there is no constant prefix. This implementation does not 
 depend upon constant prefix, and runs the same query in 640ms.
 Some use cases I envision:
  1. lexicography/etc on large text corpora
  2. looking for things such as urls where the prefix is not constant (http:// 
 or ftp://)
 The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
 regular expressions into a DFA. Then, the filter enumerates terms in a 
 special way, by using the underlying state machine. Here is my short 
 description from the comments:
  The algorithm here is pretty basic. Enumerate terms but instead of a 
 binary accept/reject do:
   
  1. Look at the portion that is OK (did not enter a reject state in the 
 DFA)
  2. Generate the next possible String and seek to that.
 the Query simply wraps the filter with ConstantScoreQuery.
 I did not include the automaton.jar inside the patch but it can be downloaded 
 from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Resolved: (LUCENE-1603) Changes for TrieRange in FilteredTermEnum and MultiTermQuery improvement

2009-04-16 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-1603.


Resolution: Fixed

 Changes for TrieRange in FilteredTermEnum and MultiTermQuery improvement
 

 Key: LUCENE-1603
 URL: https://issues.apache.org/jira/browse/LUCENE-1603
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: 2.4, 2.9
Reporter: Uwe Schindler
Assignee: Michael McCandless
 Fix For: 2.9

 Attachments: LUCENE-1603.patch, LUCENE-1603.patch, LUCENE-1603.patch


 This is a patch that is needed for the MultiTermQuery rewrite of TrieRange 
 (LUCENE-1602):
 - Make the private members protected, to have access to them from the very 
 special TrieRangeTermEnum 
 - Fix a small inconsistency (docFreq() now only returns a value if a valid 
 term exists)
 - Improve MultiTermFilter.getDocIdSet to return 
 DocIdSet.EMPTY_DOCIDSET if the TermEnum is empty (less memory usage and 
 faster)
 - Add getLastNumberOfTerms() to MultiTermQuery for statistics on 
 different multi-term queries and how many terms they affect; using this new 
 functionality, the improvement of TrieRange can be shown (extract from the 
 test case there, 1 docs index, long values):
 {code}
 [junit] Average number of terms during random search on 'field8':
 [junit]  Trie query: 244.2
 [junit]  Classical query: 3136.94
 [junit] Average number of terms during random search on 'field4':
 [junit]  Trie query: 38.3
 [junit]  Classical query: 3018.68
 [junit] Average number of terms during random search on 'field2':
 [junit]  Trie query: 18.04
 [junit]  Classical query: 3539.42
 {code}
 All core tests pass.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-04-16 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699672#action_12699672
 ] 

Uwe Schindler commented on LUCENE-1606:
---

I looked into the patch; it looks good. Maybe it would be good to make the new 
AutomatonRegExQuery a subclass of MultiTermQuery. As you also seek/exchange 
the TermEnum, the needed FilteredTermEnum may be a little bit complicated. But 
you may do it in the same way as I will commit soon for TrieRange (LUCENE-1602).
The latest changes from LUCENE-1603 make it possible to write a 
FilteredTermEnum that hands over to differently positioned TermEnums like you 
do.
With MultiTermQuery you get all this for free: constant score, boolean rewrite, 
and optionally the Filter (which is not needed here, I think). And: you could 
also override difference() in FilteredTermEnum to rank the hits.
A note: the FilteredTermEnum created by TrieRange is not guaranteed to be 
ordered correctly according to Term.compareTo(), but this is not really needed 
for MultiTermQuery.
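
A skeleton of what that might look like against the 2.9-era MultiTermQuery / 
FilteredTermEnum hooks; the DFA-driven seeking and the real stop condition are 
elided, and the class names are made up:

{code}
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.FilteredTermEnum;
import org.apache.lucene.search.MultiTermQuery;

// Sketch: a MultiTermQuery subclass gets constant score and boolean
// rewrite for free; the enum only has to produce the matching terms.
class AutomatonQuerySketch extends MultiTermQuery {
  private final String field;
  AutomatonQuerySketch(String field) { this.field = field; }

  protected FilteredTermEnum getEnum(IndexReader reader) throws IOException {
    return new AutomatonTermEnumSketch(reader, field);
  }
}

class AutomatonTermEnumSketch extends FilteredTermEnum {
  private final String field;
  AutomatonTermEnumSketch(IndexReader reader, String field) throws IOException {
    this.field = field;
    setEnum(reader.terms(new Term(field, ""))); // seek to the field's first term
  }
  protected boolean termCompare(Term term) {
    // real impl: run term.text() through the DFA; here: field check only
    return field.equals(term.field());
  }
  public float difference() { return 1.0f; }    // constant "score" for all terms
  protected boolean endEnum() { return false; } // real impl: stop past the field
}
{code}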

 Automaton Query/Filter (scalable regex)
 ---

 Key: LUCENE-1606
 URL: https://issues.apache.org/jira/browse/LUCENE-1606
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Reporter: Robert Muir
Priority: Minor
 Fix For: 2.9

 Attachments: automaton.patch, automatonWithWildCard.patch, 
 automatonWithWildCard2.patch


 Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
 suitable).
 Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
 indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
 Additionally all of the existing RegexQuery implementations in Lucene are 
 really slow if there is no constant prefix. This implementation does not 
 depend upon constant prefix, and runs the same query in 640ms.
 Some use cases I envision:
  1. lexicography/etc on large text corpora
  2. looking for things such as urls where the prefix is not constant (http:// 
 or ftp://)
 The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
 regular expressions into a DFA. Then, the filter enumerates terms in a 
 special way, by using the underlying state machine. Here is my short 
 description from the comments:
  The algorithm here is pretty basic. Enumerate terms but instead of a 
 binary accept/reject do:
   
  1. Look at the portion that is OK (did not enter a reject state in the 
 DFA)
  2. Generate the next possible String and seek to that.
 the Query simply wraps the filter with ConstantScoreQuery.
 I did not include the automaton.jar inside the patch but it can be downloaded 
 from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



TermEnum.skipTo()

2009-04-16 Thread Robert Muir
While I was mucking with term enumeration I found that TermEnum.skipTo() has
a very simple implementation and says in its javadocs that 'some implementations
are considerably more efficient', yet SegmentTermEnum definitely doesn't
reimplement it in a more efficient way.

For my purposes, to skip around I simply close the term enum and get a new
one from the indexReader at a different starting point.

Not that I want to touch it, just mentioning I thought it was a little
non-obvious that skipTo() is so inefficient: it keeps enumerating until
compareTo() returns what it wants...
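
For reference, the base-class implementation is roughly this linear scan 
(paraphrased from the 2.4-era TermEnum source):

{code}
// TermEnum.skipTo() default implementation: a plain linear scan.
public boolean skipTo(Term target) throws IOException {
  do {
    if (!next())
      return false;
  } while (target.compareTo(term()) > 0);
  return true;
}
{code}

The workaround above amounts to termEnum.close() followed by 
indexReader.terms(target), which repositions via the term index instead of 
scanning.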

-- 
Robert Muir
rcm...@gmail.com


[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-04-16 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699673#action_12699673
 ] 

Robert Muir commented on LUCENE-1606:
-

Uwe, I agree with you, with one caveat: for this functionality to work the Enum 
must be ordered correctly according to Term.compareTo().

Otherwise it will not work correctly...

 Automaton Query/Filter (scalable regex)
 ---

 Key: LUCENE-1606
 URL: https://issues.apache.org/jira/browse/LUCENE-1606
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Reporter: Robert Muir
Priority: Minor
 Fix For: 2.9

 Attachments: automaton.patch, automatonWithWildCard.patch, 
 automatonWithWildCard2.patch


 Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
 suitable).
 Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
 indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
 Additionally all of the existing RegexQuery implementations in Lucene are 
 really slow if there is no constant prefix. This implementation does not 
 depend upon constant prefix, and runs the same query in 640ms.
 Some use cases I envision:
  1. lexicography/etc on large text corpora
  2. looking for things such as urls where the prefix is not constant (http:// 
 or ftp://)
 The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
 regular expressions into a DFA. Then, the filter enumerates terms in a 
 special way, by using the underlying state machine. Here is my short 
 description from the comments:
  The algorithm here is pretty basic. Enumerate terms but instead of a 
 binary accept/reject do:
   
  1. Look at the portion that is OK (did not enter a reject state in the 
 DFA)
  2. Generate the next possible String and seek to that.
 the Query simply wraps the filter with ConstantScoreQuery.
 I did not include the automaton.jar inside the patch but it can be downloaded 
 from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1604) Stop creating huge arrays to represent the absense of field norms

2009-04-16 Thread Shon Vella (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699674#action_12699674
 ] 

Shon Vella commented on LUCENE-1604:


Working on an update to the patch - MultiSegmentReader needs to set 
disableFakeNorms transitively on its subReaders as well as on new subReaders 
on reopen.

 Stop creating huge arrays to represent the absense of field norms
 -

 Key: LUCENE-1604
 URL: https://issues.apache.org/jira/browse/LUCENE-1604
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.9
Reporter: Shon Vella
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1604.patch, LUCENE-1604.patch


 Creating and keeping around huge arrays that hold a constant value is very 
 inefficient both from a heap usage standpoint and from a localility of 
 reference standpoint. It would be much more efficient to use null to 
 represent a missing norms table.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: TermEnum.skipTo()

2009-04-16 Thread Mark Miller

Robert Muir wrote:
while I was mucking with term enumeration i found that 
TermEnum.skipTo() has a very simple implementation and has in javadocs 
that 'some implementations are considerably more efficent', yet 
SegmentTermEnum definitely doesn't reimplement it in a more efficient way.


For my purposes to skip around i simply close the term enum and get a 
new one from the indexReader at a different starting point.


Not that I want to touch it, just mentioning i thought it was a little 
non-obvious that skipTo() is so inefficient, it keeps enumerating 
until compareTo() returns what it wants...


--
Robert Muir
rcm...@gmail.com

Indeed - somewhat related: https://issues.apache.org/jira/browse/LUCENE-1592

--
- Mark

http://www.lucidimagination.com




-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1536) if a filter can support random access API, we should use it

2009-04-16 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699675#action_12699675
 ] 

Uwe Schindler commented on LUCENE-1536:
---

I coupled the density check inside the OpenBitSet because the internals of 
OpenBitSet are responsible for determining how fast a sequential vs. random 
approach is. Maybe someone invents a new hyper-bitset that can do sequential 
accesses faster even in sparsely filled bitsets (e.g. a fragmented bitset, or a 
bitset with an RDBMS-like index). In this case, it has the responsibility to 
say: if the density is between this and this, I would use sequential.
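
A sketch of that coupling, with isRandomAccess() as proposed in this issue and 
a purely illustrative density threshold:

{code}
import org.apache.lucene.util.OpenBitSet;

// Illustrative sketch: the implementation itself decides whether random
// access pays off, based on its internals. The 1/100 density threshold
// is an assumption, not a measured value.
class OpenBitSetDocIdSetSketch /* would extend DocIdSet in the proposal */ {
  private final OpenBitSet bits;
  OpenBitSetDocIdSetSketch(OpenBitSet bits) { this.bits = bits; }

  public boolean isRandomAccess() {
    // dense enough that per-doc get() beats walking the set bits
    return bits.cardinality() >= bits.size() / 100;
  }
  public boolean get(int doc) { return bits.get(doc); }
}
{code}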

 if a filter can support random access API, we should use it
 ---

 Key: LUCENE-1536
 URL: https://issues.apache.org/jira/browse/LUCENE-1536
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.4
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1536.patch


 I ran some performance tests, comparing applying a filter via
 random-access API instead of current trunk's iterator API.
 This was inspired by LUCENE-1476, where we realized deletions should
 really be implemented just like a filter, but then in testing found
 that switching deletions to iterator was a very sizable performance
 hit.
 Some notes on the test:
   * Index is first 2M docs of Wikipedia.  Test machine is Mac OS X
 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
   * I test across multiple queries.  1-X means an OR query, eg 1-4
 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2
 AND 3 AND 4.  u s means united states (phrase search).
   * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
 95, 98, 99, 99.9 (filter is non-null but all bits are set),
 100 (filter=null, control)).
   * Method high means I use random-access filter API in
 IndexSearcher's main loop.  Method low means I use random-access
 filter API down in SegmentTermDocs (just like deleted docs
 today).
   * Baseline (QPS) is current trunk, where filter is applied as iterator up
 high (ie in IndexSearcher's search loop).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Lucene 2.9 status (to port to Lucene.Net)

2009-04-16 Thread Michael McCandless
Hi George,

There's been a sudden burst of activity lately on 2.9 development...

I know there are some biggish remaining features we may want to get
into 2.9:

  * The new field cache (LUCENE-831; still being iterated/mulled),

  * Possible major rework of Field / Document & index-time vs
search-time Document

  * Applying filters via random-access API when possible & performant
(LUCENE-1536)

  * Possible further optimizations to how collection works
   (LUCENE-1593)

  * Maybe breaking core + contrib into a more uniform set of modules
(and figuring out how Trie(Numeric)RangeQuery/Filter fits in here)
-- the Modularization uber-thread.

  * Further improvements to near-realtime search (using RAMDir for
small recently flushed segments)

  * Many other small things and probably some big ones that I'm
forgetting now :)

So things are still in flux, and I'm really not sure on a release date
at this point.  Late last year, I was hoping for early this year, but
it's no longer early this year ;)

Mike

On Wed, Apr 15, 2009 at 9:17 PM, George Aroush geo...@aroush.net wrote:
 Hi Folks,

 This is George Aroush, I'm one of the committers on Lucene.Net - a port of
 Java Lucene to C# Lucene.

 I'm looking at the current trunk code of the yet-to-be-released Lucene 2.9 and I
 would like to port it to Lucene.Net.  If I do this now, we get the benefit
 of keeping our code base and release dates much closer to Java Lucene.
 However, this comes with the cost of carrying over unfinished work and known
 defects, and I have to keep an eye on new code that gets committed into Java
 Lucene, which must be ported over in a timely fashion.

 To help me determine when is a good time to start the port -- keep in mind,
 I will be taking the latest code off SVN -- I'd like to hear from the Java
 Lucene committers (and users who are playing with or using Lucene 2.9 off SVN)
 about these questions:

 1) how stable the current code in the trunk is,
 2) do you still have feature work to deliver or just bug fixes, and
 3) what's your target date to release Java Lucene 2.9

 #1 is important: is anyone using it in production?

 Yes, I did look at the current open issues in JIRA, but that doesn't help me
 answer the above questions.

 Regards,

 -- George


 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-04-16 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699676#action_12699676
 ] 

Uwe Schindler commented on LUCENE-1606:
---

It will work; that was what I said. For MultiTermQuery, it must *not* be 
ordered - the ordering is irrelevant for it, MultiTermQuery only enumerates the 
terms. TrieRange is an example of that: the order of its terms is not 
guaranteed to be correct (it is at the moment because of the internal 
implementation of splitLongRange(), but I tested it with the inverse order and 
it still worked). If you want to use the enum for something else, it will fail.
The filters inside MultiTermQuery and the BooleanQuery do not need to have the 
terms ordered.

 Automaton Query/Filter (scalable regex)
 ---

 Key: LUCENE-1606
 URL: https://issues.apache.org/jira/browse/LUCENE-1606
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Reporter: Robert Muir
Priority: Minor
 Fix For: 2.9

 Attachments: automaton.patch, automatonWithWildCard.patch, 
 automatonWithWildCard2.patch


 Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
 suitable).
 Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
 indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
 Additionally all of the existing RegexQuery implementations in Lucene are 
 really slow if there is no constant prefix. This implementation does not 
 depend upon constant prefix, and runs the same query in 640ms.
 Some use cases I envision:
  1. lexicography/etc on large text corpora
  2. looking for things such as urls where the prefix is not constant (http:// 
 or ftp://)
 The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
 regular expressions into a DFA. Then, the filter enumerates terms in a 
 special way, by using the underlying state machine. Here is my short 
 description from the comments:
  The algorithm here is pretty basic. Enumerate terms but instead of a 
 binary accept/reject do:
   
  1. Look at the portion that is OK (did not enter a reject state in the 
 DFA)
  2. Generate the next possible String and seek to that.
 the Query simply wraps the filter with ConstantScoreQuery.
 I did not include the automaton.jar inside the patch but it can be downloaded 
 from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-831) Complete overhaul of FieldCache API/Implementation

2009-04-16 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699678#action_12699678
 ] 

Mark Miller commented on LUCENE-831:


So I'm flopping around on this, but I guess my latest take is that:

I want to drop the SortField ValueSource override option. Everything would need 
to be handled by overriding the segment reader ValueSource.

Drop the current back-compat code for FieldCache - it's mostly unnecessary, I 
think. Instead, perhaps go back to the original FieldCache impl, except if the 
Reader is a segment reader, use the new ValueSource API? Grrr - except if 
someone has mucked with the ValueSource or used a custom FieldCache Parser, it 
won't match correctly... that's it - you just can't straddle the two APIs. So 
I'll revert FieldCache to its former self and just deprecate it.

 Complete overhaul of FieldCache API/Implementation
 --

 Key: LUCENE-831
 URL: https://issues.apache.org/jira/browse/LUCENE-831
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Hoss Man
Assignee: Mark Miller
 Fix For: 3.0

 Attachments: ExtendedDocument.java, fieldcache-overhaul.032208.diff, 
 fieldcache-overhaul.diff, fieldcache-overhaul.diff, 
 LUCENE-831-trieimpl.patch, LUCENE-831.03.28.2008.diff, 
 LUCENE-831.03.30.2008.diff, LUCENE-831.03.31.2008.diff, LUCENE-831.patch, 
 LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, 
 LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, 
 LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, 
 LUCENE-831.patch


 Motivation:
 1) Complete overhaul the API/implementation of FieldCache type things...
 a) eliminate global static map keyed on IndexReader (thus
 eliminating synch block between completley independent IndexReaders)
 b) allow more customization of cache management (ie: use 
 expiration/replacement strategies, disk backed caches, etc)
 c) allow people to define custom cache data logic (ie: custom
 parsers, complex datatypes, etc... anything tied to a reader)
 d) allow people to inspect what's in a cache (list of CacheKeys) for
 an IndexReader so a new IndexReader can be likewise warmed. 
 e) Lend support for smarter cache management if/when
 IndexReader.reopen is added (merging of cached data from subReaders).
 2) Provide backwards compatibility to support existing FieldCache API with
 the new implementation, so there is no redundent caching as client code
 migrades to new API.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1536) if a filter can support random access API, we should use it

2009-04-16 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699680#action_12699680
 ] 

Michael McCandless commented on LUCENE-1536:


OK, if we do choose to couple, maybe we should name it useRandomAccess()?

Another filter optimization that'd be nice to get in is to somehow know that 
a filter has pre-incorporated deleted documents.  This way, once we have a 
solution for pushing the filter down to all TermScorers, we could have them 
only check the filter and not also the deleted docs.  (This is one of the 
optimizations in LUCENE-1594.)

We might eventually want/need some sort of external FilterManager that would 
handle this (i.e., convert a filter to sparse vs. random-access as appropriate, 
multiply in deleted docs, handle caching, etc.).
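
A rough sketch of one piece of such a manager - materializing a filter as a 
random-access bit set with deletions pre-multiplied in - using the 2.4-era 
DocIdSetIterator API; the manager concept itself is hypothetical:

{code}
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.OpenBitSet;

class FilterManagerSketch {
  // Materialize a filter as a random-access bit set with deleted docs
  // already removed, so leaf scorers need only a single check per doc.
  static OpenBitSet toRandomAccess(DocIdSet set, IndexReader reader)
      throws IOException {
    OpenBitSet bits = new OpenBitSet(reader.maxDoc());
    DocIdSetIterator it = set.iterator();
    while (it.next()) {             // 2.4-era iterator API
      int doc = it.doc();
      if (!reader.isDeleted(doc))   // pre-incorporate deletions
        bits.fastSet(doc);
    }
    return bits;
  }
}
{code}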

 if a filter can support random access API, we should use it
 ---

 Key: LUCENE-1536
 URL: https://issues.apache.org/jira/browse/LUCENE-1536
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.4
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1536.patch


 I ran some performance tests, comparing applying a filter via
 random-access API instead of current trunk's iterator API.
 This was inspired by LUCENE-1476, where we realized deletions should
 really be implemented just like a filter, but then in testing found
 that switching deletions to iterator was a very sizable performance
 hit.
 Some notes on the test:
   * Index is first 2M docs of Wikipedia.  Test machine is Mac OS X
 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
   * I test across multiple queries.  1-X means an OR query, eg 1-4
 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2
 AND 3 AND 4.  u s means united states (phrase search).
   * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
 95, 98, 99, 99.9 (filter is non-null but all bits are set),
 100 (filter=null, control)).
   * Method high means I use random-access filter API in
 IndexSearcher's main loop.  Method low means I use random-access
 filter API down in SegmentTermDocs (just like deleted docs
 today).
   * Baseline (QPS) is current trunk, where filter is applied as iterator up
 high (ie in IndexSearcher's search loop).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-04-16 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699685#action_12699685
 ] 

Robert Muir commented on LUCENE-1606:
-

Uwe, I'll look and see how you do it for TrieRange.

If it can make the code for this simpler, that will be fantastic. Maybe by then 
I will have also figured out some way to cleanly and non-recursively use 
min/max character intervals in the state machine to decrease the number of 
seeks and optimize a little bit.

 

 Automaton Query/Filter (scalable regex)
 ---

 Key: LUCENE-1606
 URL: https://issues.apache.org/jira/browse/LUCENE-1606
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Reporter: Robert Muir
Priority: Minor
 Fix For: 2.9

 Attachments: automaton.patch, automatonWithWildCard.patch, 
 automatonWithWildCard2.patch


 Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
 suitable).
 Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
 indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
 Additionally all of the existing RegexQuery implementations in Lucene are 
 really slow if there is no constant prefix. This implementation does not 
 depend upon constant prefix, and runs the same query in 640ms.
 Some use cases I envision:
  1. lexicography/etc on large text corpora
  2. looking for things such as urls where the prefix is not constant (http:// 
 or ftp://)
 The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
 regular expressions into a DFA. Then, the filter enumerates terms in a 
 special way, by using the underlying state machine. Here is my short 
 description from the comments:
  The algorithm here is pretty basic. Enumerate terms but instead of a 
 binary accept/reject do:
   
  1. Look at the portion that is OK (did not enter a reject state in the 
 DFA)
  2. Generate the next possible String and seek to that.
 the Query simply wraps the filter with ConstantScoreQuery.
 I did not include the automaton.jar inside the patch but it can be downloaded 
 from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Resolved: (LUCENE-1602) Rewrite TrieRange to use MultiTermQuery

2009-04-16 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler resolved LUCENE-1602.
---

Resolution: Fixed

Committed revision 765618.

 Rewrite TrieRange to use MultiTermQuery
 ---

 Key: LUCENE-1602
 URL: https://issues.apache.org/jira/browse/LUCENE-1602
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Affects Versions: 2.9
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 2.9

 Attachments: LUCENE-1602.patch, LUCENE-1602.patch, LUCENE-1602.patch, 
 LUCENE-1602.patch, LUCENE-1602.patch, queries.zip, queries.zip


 Issue for discussion here: 
 http://www.lucidimagination.com/search/document/46a548a79ae9c809/move_trierange_to_core_module_and_integration_issues
 This patch is a rewrite of TrieRange using MultiTermQuery like all other core 
 queries. This should make TrieRange identical in functionality to core range 
 queries.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-04-16 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699690#action_12699690
 ] 

Uwe Schindler commented on LUCENE-1606:
---

I committed TrieRange revision 765618. You can see the impl here:
http://svn.apache.org/viewvc/lucene/java/trunk/contrib/queries/src/java/org/apache/lucene/search/trie/TrieRangeTermEnum.java?view=markup

 Automaton Query/Filter (scalable regex)
 ---

 Key: LUCENE-1606
 URL: https://issues.apache.org/jira/browse/LUCENE-1606
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Reporter: Robert Muir
Priority: Minor
 Fix For: 2.9

 Attachments: automaton.patch, automatonWithWildCard.patch, 
 automatonWithWildCard2.patch


 Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
 suitable).
 Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
 indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
 Additionally all of the existing RegexQuery implementations in Lucene are 
 really slow if there is no constant prefix. This implementation does not 
 depend upon constant prefix, and runs the same query in 640ms.
 Some use cases I envision:
  1. lexicography/etc on large text corpora
  2. looking for things such as urls where the prefix is not constant (http:// 
 or ftp://)
 The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
 regular expressions into a DFA. Then, the filter enumerates terms in a 
 special way, by using the underlying state machine. Here is my short 
 description from the comments:
  The algorithm here is pretty basic. Enumerate terms but instead of a 
 binary accept/reject do:
   
  1. Look at the portion that is OK (did not enter a reject state in the 
 DFA)
  2. Generate the next possible String and seek to that.
 the Query simply wraps the filter with ConstantScoreQuery.
 I did not include the automaton.jar inside the patch but it can be downloaded 
 from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: TermEnum.skipTo()

2009-04-16 Thread Mark Miller

Mark Miller wrote:

Robert Muir wrote:
while I was mucking with term enumeration i found that 
TermEnum.skipTo() has a very simple implementation and has in 
javadocs that 'some implementations are considerably more efficent', 
yet SegmentTermEnum definitely doesn't reimplement it in a more 
efficient way.


For my purposes to skip around i simply close the term enum and get a 
new one from the indexReader at a different starting point.


Not that I want to touch it, just mentioning i thought it was a 
little non-obvious that skipTo() is so inefficient, it keeps 
enumerating until compareTo() returns what it wants...


--
Robert Muir
rcm...@gmail.com
Indeed - somewhat related: 
https://issues.apache.org/jira/browse/LUCENE-1592



I've changed

Some implementations are considerably more efficient than that.

to

Some implementations *could* be considerably more efficient than a 
linear scan.

Check the implementation to be sure.

--
- Mark

http://www.lucidimagination.com




-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1592) fix or deprecate TermsEnum.skipTo

2009-04-16 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-1592:
--

Summary: fix or deprecate TermsEnum.skipTo  (was: fix or deprecate 
TermsEnum.seek)

 fix or deprecate TermsEnum.skipTo
 -

 Key: LUCENE-1592
 URL: https://issues.apache.org/jira/browse/LUCENE-1592
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael McCandless
Priority: Minor
 Fix For: 2.9


 This method is a trap: it looks legitimate but it has hideously poor 
 performance (simple linear scan implemented in the TermsEnum base class since 
 none of the concrete impls override it with a more efficient implementation).
 The least we should do for 2.9 is deprecate the method with a strong warning 
 about its performance.
 See here for background: 
 http://www.lucidimagination.com/search/document/77dc4f8e893d3cf3/possible_terminfosreader_speedup
 And, here for historical context: 
 http://www.lucidimagination.com/search/document/88f1b95b404ebf16/remove_termenum_skipto_term_target

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-04-16 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699693#action_12699693
 ] 

Robert Muir commented on LUCENE-1606:
-

Uwe, thanks. I'll think on this and on other improvements. 
I'm not really confident in my ability to make the code much cleaner at the end 
of the day, but more efficient - and getting some things for free, as you say.
For now it is working much better than a linear scan, and the improvements won't 
change the order, but might help a bit.

Do you think I should do this in this issue or create a separate one?


 Automaton Query/Filter (scalable regex)
 ---

 Key: LUCENE-1606
 URL: https://issues.apache.org/jira/browse/LUCENE-1606
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Reporter: Robert Muir
Priority: Minor
 Fix For: 2.9

 Attachments: automaton.patch, automatonWithWildCard.patch, 
 automatonWithWildCard2.patch


 Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
 suitable).
 Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
 indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
 Additionally all of the existing RegexQuery implementations in Lucene are 
 really slow if there is no constant prefix. This implementation does not 
 depend upon constant prefix, and runs the same query in 640ms.
 Some use cases I envision:
  1. lexicography/etc on large text corpora
  2. looking for things such as urls where the prefix is not constant (http:// 
 or ftp://)
 The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
 regular expressions into a DFA. Then, the filter enumerates terms in a 
 special way, by using the underlying state machine. Here is my short 
 description from the comments:
  The algorithm here is pretty basic. Enumerate terms but instead of a 
 binary accept/reject do:
   
  1. Look at the portion that is OK (did not enter a reject state in the 
 DFA)
  2. Generate the next possible String and seek to that.
 the Query simply wraps the filter with ConstantScoreQuery.
 I did not include the automaton.jar inside the patch but it can be downloaded 
 from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1592) fix or deprecate TermsEnum.skipTo

2009-04-16 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699696#action_12699696
 ] 

Mark Miller commented on LUCENE-1592:
-

I made a quick update to the javadoc so it's a bit less misleading, but it still 
needs to be resolved in a stronger manner, à la this issue.

 fix or deprecate TermsEnum.skipTo
 -

 Key: LUCENE-1592
 URL: https://issues.apache.org/jira/browse/LUCENE-1592
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael McCandless
Priority: Minor
 Fix For: 2.9


 This method is a trap: it looks legitimate but it has hideously poor 
 performance (simple linear scan implemented in the TermsEnum base class since 
 none of the concrete impls override it with a more efficient implementation).
 The least we should do for 2.9 is deprecate the method with a strong warning 
 about its performance.
 See here for background: 
 http://www.lucidimagination.com/search/document/77dc4f8e893d3cf3/possible_terminfosreader_speedup
 And, here for historical context: 
 http://www.lucidimagination.com/search/document/88f1b95b404ebf16/remove_termenum_skipto_term_target

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-04-16 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699697#action_12699697
 ] 

Uwe Schindler commented on LUCENE-1606:
---

Let's stay with this issue!

 Automaton Query/Filter (scalable regex)
 ---

 Key: LUCENE-1606
 URL: https://issues.apache.org/jira/browse/LUCENE-1606
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Reporter: Robert Muir
Priority: Minor
 Fix For: 2.9

 Attachments: automaton.patch, automatonWithWildCard.patch, 
 automatonWithWildCard2.patch


 Attached is a patch for an AutomatonQuery/Filter (name can change if its not 
 suitable).
 Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
 indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
 Additionally all of the existing RegexQuery implementations in Lucene are 
 really slow if there is no constant prefix. This implementation does not 
 depend upon constant prefix, and runs the same query in 640ms.
 Some use cases I envision:
  1. lexicography/etc on large text corpora
  2. looking for things such as urls where the prefix is not constant (http:// 
 or ftp://)
 The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
 regular expressions into a DFA. Then, the filter enumerates terms in a 
 special way, by using the underlying state machine. Here is my short 
 description from the comments:
  The algorithm here is pretty basic. Enumerate terms but instead of a 
 binary accept/reject do:
   
  1. Look at the portion that is OK (did not enter a reject state in the 
 DFA)
  2. Generate the next possible String and seek to that.
 the Query simply wraps the filter with ConstantScoreQuery.
 I did not include the automaton.jar inside the patch but it can be downloaded 
 from http://www.brics.dk/automaton/ and is BSD-licensed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: TermEnum.skipTo()

2009-04-16 Thread Michael McCandless
Maybe we should deprecate it?

Mike

On Thu, Apr 16, 2009 at 9:04 AM, Mark Miller markrmil...@gmail.com wrote:
 Mark Miller wrote:

 Robert Muir wrote:

 while I was mucking with term enumeration i found that TermEnum.skipTo()
 has a very simple implementation and has in javadocs that 'some
 implementations are considerably more efficent', yet SegmentTermEnum
 definitely doesn't reimplement it in a more efficient way.

 For my purposes to skip around i simply close the term enum and get a new
 one from the indexReader at a different starting point.

 Not that I want to touch it, just mentioning i thought it was a little
 non-obvious that skipTo() is so inefficient, it keeps enumerating until
 compareTo() returns what it wants...

 --
 Robert Muir
 rcm...@gmail.com

 Indeed - somewhat related:
 https://issues.apache.org/jira/browse/LUCENE-1592

 I've changed

 Some implementations are considerably more efficient than that.

 to

 Some implementations *could* be considerably more efficient than a linear
 scan.
 Check the implementation to be sure.

 --
 - Mark

 http://www.lucidimagination.com




 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: TermEnum.skipTo()

2009-04-16 Thread Shai Erera
I think it's a convenient method. Even if it doesn't perform well, it's still
more convenient than forcing everyone who wants to use it to implement it
himself. Perhaps a better implementation will exist in the future, and thus
everyone who uses this method will be silently upgraded. Maybe such a
better implementation should be considered?

On Thu, Apr 16, 2009 at 4:46 PM, Michael McCandless 
luc...@mikemccandless.com wrote:

 Maybe we should deprecate it?

 Mike

 On Thu, Apr 16, 2009 at 9:04 AM, Mark Miller markrmil...@gmail.com
 wrote:
  Mark Miller wrote:
 
  Robert Muir wrote:
 
  while I was mucking with term enumeration i found that
 TermEnum.skipTo()
  has a very simple implementation and has in javadocs that 'some
  implementations are considerably more efficent', yet SegmentTermEnum
  definitely doesn't reimplement it in a more efficient way.
 
  For my purposes to skip around i simply close the term enum and get a
 new
  one from the indexReader at a different starting point.
 
  Not that I want to touch it, just mentioning i thought it was a little
  non-obvious that skipTo() is so inefficient, it keeps enumerating until
  compareTo() returns what it wants...
 
  --
  Robert Muir
  rcm...@gmail.com
 
  Indeed - somewhat related:
  https://issues.apache.org/jira/browse/LUCENE-1592
 
  I've changed
 
  Some implementations are considerably more efficient than that.
 
  to
 
  Some implementations *could* be considerably more efficient than a
 linear
  scan.
  Check the implementation to be sure.
 
  --
  - Mark
 
  http://www.lucidimagination.com
 
 
 
 
  -
  To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
  For additional commands, e-mail: java-dev-h...@lucene.apache.org
 
 

 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org




Re: TermEnum.skipTo()

2009-04-16 Thread Michael McCandless
That would be great... we need someone to pull a patch together (for
SegmentReader & Multi*Reader to implement it efficiently).
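
In the meantime, the workaround from earlier in the thread can be written up as 
a small sketch - repositioning through the term index with 
IndexReader.terms(Term), which seeks to the first term >= target instead of 
scanning:

{code}
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;

class SkipToWorkaround {
  // Emulate an efficient skipTo() from outside the enum: throw away the
  // current enum and reopen at the target via the term index.
  static TermEnum skipTo(TermEnum current, IndexReader reader, Term target)
      throws IOException {
    current.close();
    return reader.terms(target);
  }
}
{code}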

Mike

On Thu, Apr 16, 2009 at 9:50 AM, Shai Erera ser...@gmail.com wrote:
 I think it's a convenient method. Even if not performing, it's still more
 convenient than forcing everyone who wants to use it to implement it by
 himself. Perhaps a better implementation will exist in the future, and thus
 everyone who'll use this method will be silently upgraded. Maybe such a
 better implementation should be considered?

 On Thu, Apr 16, 2009 at 4:46 PM, Michael McCandless
 luc...@mikemccandless.com wrote:

 Maybe we should deprecate it?

 Mike

 On Thu, Apr 16, 2009 at 9:04 AM, Mark Miller markrmil...@gmail.com
 wrote:
  Mark Miller wrote:
 
  Robert Muir wrote:
 
  while I was mucking with term enumeration i found that
  TermEnum.skipTo()
  has a very simple implementation and has in javadocs that 'some
  implementations are considerably more efficent', yet SegmentTermEnum
  definitely doesn't reimplement it in a more efficient way.
 
  For my purposes to skip around i simply close the term enum and get a
  new
  one from the indexReader at a different starting point.
 
  Not that I want to touch it, just mentioning i thought it was a little
  non-obvious that skipTo() is so inefficient, it keeps enumerating
  until
  compareTo() returns what it wants...
 
  --
  Robert Muir
  rcm...@gmail.com
 
  Indeed - somewhat related:
  https://issues.apache.org/jira/browse/LUCENE-1592
 
  I've changed
 
  Some implementations are considerably more efficient than that.
 
  to
 
  Some implementations *could* be considerably more efficient than a
  linear
  scan.
  Check the implementation to be sure.
 
  --
  - Mark
 
  http://www.lucidimagination.com
 
 
 
 
  -
  To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
  For additional commands, e-mail: java-dev-h...@lucene.apache.org
 
 

 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org




-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



I wanna contribute a Chinese analyzer to lucene

2009-04-16 Thread Gao Pinker
Hi All!

I wrote an Analyzer for Apache Lucene that analyzes sentences in the
*Chinese* language. It's called *imdict-chinese-analyzer*, as it is a
subproject of *imdict* (http://www.imdict.net/), an intelligent online
dictionary.

The project on Google Code is here:
http://code.google.com/p/imdict-chinese-analyzer/

In Chinese, 我是中国人 (I am Chinese) should be tokenized as 我 (I) 是 (am)
中国人 (Chinese), *not* 我 是中 国人. So the analyzer must segment each
sentence properly, or there will be misunderstandings everywhere in the
index constructed by Lucene, and the accuracy of the search engine will be
affected seriously!

Although there are two analyzer packages in the Apache repository that can
handle Chinese:
ChineseAnalyzer (http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/analyzers/src/java/org/apache/lucene/analysis/cn/) and
CJKAnalyzer (http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/analyzers/src/java/org/apache/lucene/analysis/cjk/),
they take each character or every two adjoining characters as a single
word. This is obviously not how the language works, and this strategy also
increases the index size and hurts performance badly.

The algorithm of *imdict-chinese-analyzer* is based on a Hidden Markov
Model (HMM), so it can tokenize Chinese sentences in a really intelligent
way. Tokenization accuracy of this model is above 90% according to the
paper HHMM-based Chinese Lexical Analyzer ICTCLAL
(http://www.nlp.org.cn/project/project.php?proj_id=6).

As *imdict-chinese-analyzer* is a really fast, intelligent Chinese Analyzer
for Lucene written in Java, I want to share this project with everyone
using Lucene.

This Analyzer consists of two parts, *the source code* and the *lexical
dictionary*. I want to publish the source code under the Apache license,
but the dictionary, which is under an ambiguous license, was not created
by me.
So, can I submit only the source code to the Lucene contribution
repository, and let users download the dictionary from the Google Code
site?

Please help me with this contribution.
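
The bigram behavior described above is easy to observe. A minimal sketch,
assuming the Lucene 2.4-era contrib CJKAnalyzer and the TokenStream API
with Token reuse via next(Token):

{code}
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;

public class BigramDemo {
  public static void main(String[] args) throws Exception {
    Analyzer analyzer = new CJKAnalyzer();
    TokenStream ts = analyzer.tokenStream("content", new StringReader("我是中国人"));
    final Token reusable = new Token();
    // CJKAnalyzer emits overlapping bigrams: 我是, 是中, 中国, 国人 --
    // none of which is the dictionary word 中国人 (Chinese).
    for (Token t = ts.next(reusable); t != null; t = ts.next(reusable)) {
      System.out.println(t.term());
    }
  }
}
{code}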


[jira] Issue Comment Edited: (LUCENE-1604) Stop creating huge arrays to represent the absence of field norms

2009-04-16 Thread Shon Vella (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12699714#action_12699714
 ] 

Shon Vella edited comment on LUCENE-1604 at 4/16/09 7:16 AM:
-

Setting disableFakeNorms transitively isn't really needed, because 
MultiSegmentReader doesn't make any calls to the subreaders that would cause it 
to create its own fake norms. We probably ought to preserve the flag on 
clone() and reopen() though, which is going to be a little messy because 
IndexReader doesn't really implement either, so it would have to be handled at 
the root of each concrete class hierarchy that does implement those. Any 
thoughts on whether we need this or not?

  was (Author: svella):
Setting disableFakeNorms transitively isn't really needed, because 
MultiSegmentReader doesn't make any calls to the subreaders that would cause it 
to create its own fake norms. We probably ought to preserve the flag on 
clone() and reopen() though, which is going to be a little messy because 
IndexReader doesn't really implement either, so it would have to be handled at 
the root of each concrete class hierarchy that does implement those. Any 
thoughts?
  
 Stop creating huge arrays to represent the absence of field norms
 -

 Key: LUCENE-1604
 URL: https://issues.apache.org/jira/browse/LUCENE-1604
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.9
Reporter: Shon Vella
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1604.patch, LUCENE-1604.patch


 Creating and keeping around huge arrays that hold a constant value is very 
 inefficient both from a heap usage standpoint and from a locality of 
 reference standpoint. It would be much more efficient to use null to 
 represent a missing norms table.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1604) Stop creating huge arrays to represent the absence of field norms

2009-04-16 Thread Shon Vella (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12699714#action_12699714
 ] 

Shon Vella commented on LUCENE-1604:


Setting disableFakeNorms transitively isn't really needed, because 
MultiSegmentReader doesn't make any calls to the subreaders that would cause it 
to create its own fake norms. We probably ought to preserve the flag on 
clone() and reopen() though, which is going to be a little messy because 
IndexReader doesn't really implement either, so it would have to be handled at 
the root of each concrete class hierarchy that does implement those. Any 
thoughts?

 Stop creating huge arrays to represent the absence of field norms
 -

 Key: LUCENE-1604
 URL: https://issues.apache.org/jira/browse/LUCENE-1604
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.9
Reporter: Shon Vella
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1604.patch, LUCENE-1604.patch


 Creating and keeping around huge arrays that hold a constant value is very 
 inefficient both from a heap usage standpoint and from a locality of 
 reference standpoint. It would be much more efficient to use null to 
 represent a missing norms table.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: I wanna contribute a Chinese analyzer to lucene

2009-04-16 Thread Ken Krugler
I wrote an Analyzer for Apache Lucene that analyzes sentences in 
the Chinese language. It's called imdict-chinese-analyzer, as it is a 
subproject of imdict (http://www.imdict.net/), an 
intelligent online dictionary.


The project on Google Code is here: 
http://code.google.com/p/imdict-chinese-analyzer/


I took a quick look, but didn't see any code posted there yet.

[snip]

This Analyzer consists of two parts, the source code and the lexical 
dictionary. I want to publish the source code under the Apache license, 
but the dictionary, which is under an ambiguous license, was not created 
by me.
So, can I submit only the source code to the Lucene contribution 
repository, and let users download the dictionary from the 
Google Code site?


I believe your code can be a contrib, with a reference to the 
dictionary. So a first step would be to open an issue in Lucene's 
Jira (http://issues.apache.org/jira/browse/LUCENE), and post your 
source as a patch.


The best way to get the right answer to the legal issue is to post it 
to the legal-disc...@apache.org list (join it first), as Apache's 
lawyers can then respond to your specific question.


-- Ken
--
Ken Krugler
+1 530-210-6378

Re: TermEnum.skipTo()

2009-04-16 Thread Mark Miller
+1 on further handling (LUCENE-1592). I just wanted to get a doc change in
now rather than wait for that to complete. The statement that some
implementations provide more efficient impls is very misleading (it's almost
an assertion that one exists) when no impls that ship with Lucene in fact
do.
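
For reference, the default implementation under discussion is roughly this
(a sketch from memory of the 2.x TermEnum source; the exact code may differ
slightly):

{code}
// TermEnum.skipTo(): a linear scan that just calls next() until the
// enumeration reaches or passes the target -- O(n) in the terms skipped.
public boolean skipTo(Term target) throws IOException {
  do {
    if (!next())
      return false;                       // ran off the end of the enum
  } while (target.compareTo(term()) > 0); // still before the target term
  return true;
}
{code}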

On Thu, Apr 16, 2009 at 9:57 AM, Michael McCandless 
luc...@mikemccandless.com wrote:

 That would be great... we need someone to pull a patch together (for
 SegmentReader & Multi*Reader to implement it efficiently).

 Mike

 On Thu, Apr 16, 2009 at 9:50 AM, Shai Erera ser...@gmail.com wrote:
  I think it's a convenient method. Even if it doesn't perform well, it's still more
  convenient than forcing everyone who wants to use it to implement it by
  himself. Perhaps a better implementation will exist in the future, and
 thus
  everyone who'll use this method will be silently upgraded. Maybe such a
  better implementation should be considered?
 
  On Thu, Apr 16, 2009 at 4:46 PM, Michael McCandless
  luc...@mikemccandless.com wrote:
 
  Maybe we should deprecate it?
 
  Mike
 
  On Thu, Apr 16, 2009 at 9:04 AM, Mark Miller markrmil...@gmail.com
  wrote:
   Mark Miller wrote:
  
   Robert Muir wrote:
  
    while I was mucking with term enumeration I found that TermEnum.skipTo()
    has a very simple implementation and says in its javadocs that 'some
    implementations are considerably more efficient', yet SegmentTermEnum
    definitely doesn't reimplement it in a more efficient way.
   
    For my purposes, to skip around I simply close the term enum and get a
    new one from the indexReader at a different starting point.
   
    Not that I want to touch it, just mentioning I thought it was a little
    non-obvious that skipTo() is so inefficient; it keeps enumerating until
    compareTo() returns what it wants...
  
   --
   Robert Muir
    rcm...@gmail.com
  
   Indeed - somewhat related:
   https://issues.apache.org/jira/browse/LUCENE-1592
  
   I've changed
  
   Some implementations are considerably more efficient than that.
  
   to
  
   Some implementations *could* be considerably more efficient than a
   linear
   scan.
   Check the implementation to be sure.
  
   --
   - Mark
  
   http://www.lucidimagination.com
  
  
  
  
   -
   To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
   For additional commands, e-mail: java-dev-h...@lucene.apache.org
  
  
 
  -
  To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
  For additional commands, e-mail: java-dev-h...@lucene.apache.org
 
 
 

 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org




[jira] Assigned: (LUCENE-1605) Add subset method to BitVector

2009-04-16 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reassigned LUCENE-1605:
--

Assignee: Michael McCandless

 Add subset method to BitVector
 --

 Key: LUCENE-1605
 URL: https://issues.apache.org/jira/browse/LUCENE-1605
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Other
Affects Versions: 2.9
Reporter: Jeremy Volkman
Assignee: Michael McCandless
Priority: Trivial
 Fix For: 2.9

 Attachments: LUCENE-1605.txt


 Recently I needed the ability to efficiently compute subsets of a BitVector. 
 The method is:
   public BitVector subset(int start, int end)
 where start is the starting index, inclusive, and end is the ending index, 
 exclusive.
 Attached is a patch including the subset method as well as relevant unit 
 tests.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Resolved: (LUCENE-1605) Add subset method to BitVector

2009-04-16 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-1605.


Resolution: Fixed

 Add subset method to BitVector
 --

 Key: LUCENE-1605
 URL: https://issues.apache.org/jira/browse/LUCENE-1605
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Other
Affects Versions: 2.9
Reporter: Jeremy Volkman
Assignee: Michael McCandless
Priority: Trivial
 Fix For: 2.9

 Attachments: LUCENE-1605.txt


 Recently I needed the ability to efficiently compute subsets of a BitVector. 
 The method is:
   public BitVector subset(int start, int end)
 where start is the starting index, inclusive, and end is the ending index, 
 exclusive.
 Attached is a patch including the subset method as well as relevant unit 
 tests.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1605) Add subset method to BitVector

2009-04-16 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12699718#action_12699718
 ] 

Michael McCandless commented on LUCENE-1605:


Patch looks good; I'll commit shortly.  Thanks Jeremy!
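
For anyone following along, a small usage sketch based on the signature in
the issue description (values are illustrative):

{code}
import org.apache.lucene.util.BitVector;

public class SubsetDemo {
  public static void main(String[] args) {
    BitVector bits = new BitVector(1000);
    bits.set(42);
    bits.set(750);
    // subset(start, end): start is inclusive, end is exclusive
    BitVector firstHalf = bits.subset(0, 500);
    System.out.println(firstHalf.size());   // 500
    System.out.println(firstHalf.get(42));  // true; bit 750 is out of range
  }
}
{code}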

 Add subset method to BitVector
 --

 Key: LUCENE-1605
 URL: https://issues.apache.org/jira/browse/LUCENE-1605
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Other
Affects Versions: 2.9
Reporter: Jeremy Volkman
Assignee: Michael McCandless
Priority: Trivial
 Fix For: 2.9

 Attachments: LUCENE-1605.txt


 Recently I needed the ability to efficiently compute subsets of a BitVector. 
 The method is:
   public BitVector subset(int start, int end)
 where start is the starting index, inclusive, and end is the ending index, 
 exclusive.
 Attached is a patch including the subset method as well as relevant unit 
 tests.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1604) Stop creating huge arrays to represent the absence of field norms

2009-04-16 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12699720#action_12699720
 ] 

Michael McCandless commented on LUCENE-1604:


bq. Setting disableFakeNorms transitively isn't really needed because 
MultiSegmentReader doesn't make any calls to the subreaders that would cause it 
to create its own fake norms

But since we score per-segment, TermScorer would ask each SegmentReader (in the 
MultiSegmentReader) for its norms?  So I think the sub readers need to know the 
setting.

bq. Any thoughts on whether we need this or not?

I think we do need each class implementing clone() and reopen() to properly 
carry over this setting.
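
To make that concrete, a hypothetical sketch (not the actual TermScorer
code) of how a norms consumer would handle the null that disableFakeNorms
produces:

{code}
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Similarity;

public class NormLookup {
  // With disableFakeNorms, reader.norms(field) may return null instead of a
  // fake all-default array, so consumers fall back to a constant norm.
  static float norm(IndexReader reader, String field, int doc) throws java.io.IOException {
    byte[] norms = reader.norms(field);
    return (norms == null)
        ? 1.0f                                 // no norms stored: constant default
        : Similarity.decodeNorm(norms[doc]);   // per-document norm
  }
}
{code}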


 Stop creating huge arrays to represent the absence of field norms
 -

 Key: LUCENE-1604
 URL: https://issues.apache.org/jira/browse/LUCENE-1604
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.9
Reporter: Shon Vella
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1604.patch, LUCENE-1604.patch


 Creating and keeping around huge arrays that hold a constant value is very 
 inefficient both from a heap usage standpoint and from a locality of 
 reference standpoint. It would be much more efficient to use null to 
 represent a missing norms table.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



vacation

2009-04-16 Thread Michael McCandless
Just as a heads up, since we have so many neat Lucene improvements in
flight: tomorrow I leave for a week long vacation, in a nice warm
place that may or may not have internet access.  So if suddenly I stop
answering things, now you know why!

Keep hacking away ;)

Mike

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: vacation

2009-04-16 Thread Shai Erera
If it's nice and warm, I hope for your sake that it doesn't have internet
access, so you won't be tempted to be dragged away from it ;)

On Thu, Apr 16, 2009 at 5:45 PM, Michael McCandless 
luc...@mikemccandless.com wrote:

 Just as a heads up, since we have so many neat Lucene improvements in
 flight: tomorrow I leave for a week long vacation, in a nice warm
 place that may or may not have internet access.  So if suddenly I stop
 answering things, now you know why!

 Keep hacking away ;)

 Mike

 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org




Re: I wanna contribute a Chinese analyzer to lucene

2009-04-16 Thread Earwin Burrfoot
On Thu, Apr 16, 2009 at 18:16, Ken Krugler kkrugler_li...@transpac.com wrote:
 I wrote an Analyzer for Apache Lucene that analyzes sentences in the Chinese
 language. It's called imdict-chinese-analyzer, as it is a subproject of
 imdict, which is an intelligent online dictionary.

 The project on Google Code is here:
 http://code.google.com/p/imdict-chinese-analyzer/

 I took a quick look, but didn't see any code posted there yet.
http://code.google.com/p/imdict-chinese-analyzer/downloads/list  ?

-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: vacation

2009-04-16 Thread Michael McCandless
Yes I suppose that would be best ;)

Mike

On Thu, Apr 16, 2009 at 10:48 AM, Shai Erera ser...@gmail.com wrote:
 If it's nice and warm, I hope for your sake that it doesn't have internet
 access, so you won't be tempted to be dragged away from it ;)

 On Thu, Apr 16, 2009 at 5:45 PM, Michael McCandless
 luc...@mikemccandless.com wrote:

 Just as a heads up, since we have so many neat Lucene improvements in
 flight: tomorrow I leave for a week long vacation, in a nice warm
 place that may or may not have internet access.  So if suddenly I stop
 answering things, now you know why!

 Keep hacking away ;)

 Mike

 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org




-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



RE: Lucene 2.9 status (to port to Lucene.Net)

2009-04-16 Thread George Aroush
Thanks Mike.

A quick follow up question.  What's the status of
http://issues.apache.org/jira/browse/LUCENE-1313?  Can this work be applied
to Lucene 2.4.1 and still get its benefit, or are there other dependencies /
issues with it that prevent us from doing so?

If anyone else knows, I welcome your input.

-- George

 -Original Message-
 From: Michael McCandless [mailto:luc...@mikemccandless.com] 
 Sent: Thursday, April 16, 2009 8:36 AM
 To: java-dev@lucene.apache.org
 Subject: Re: Lucene 2.9 status (to port to Lucene.Net)
 
 Hi George,
 
 There's been a sudden burst of activity lately on 2.9 development...
 
 I know there are some biggish remaining features we may want 
 to get into 2.9:
 
   * The new field cache (LUCENE-831; still being iterated/mulled),
 
   * Possible major rework of Field / Document & index-time vs
 search-time Document
 
   * Applying filters via random-access API when possible & performant
 (LUCENE-1536)
 
   * Possible further optimizations to how collection works
(LUCENE-1593)
 
   * Maybe breaking core + contrib into a more uniform set of modules
 (and figuring out how Trie(Numeric)RangeQuery/Filter fits in here)
 -- the Modularization uber-thread.
 
   * Further improvements to near-realtime search (using RAMDir for
 small recently flushed segments)
 
   * Many other small things and probably some big ones that I'm
 forgetting now :)
 
 So things are still in flux, and I'm really not sure on a 
 release date at this point.  Late last year, I was hoping for 
 early this year, but it's no longer early this year ;)
 
 Mike
 
 On Wed, Apr 15, 2009 at 9:17 PM, George Aroush 
 geo...@aroush.net wrote:
  Hi Folks,
 
  This is George Aroush, I'm one of the committers on Lucene.Net - a 
  port of Java Lucene to C# Lucene.
 
  I'm looking at the current trunk code of yet to be released 
 Lucene 2.9 
  and I would like to port it to Lucene.Net.  If I do this 
 now, we get 
  the benefit of keeping our code base and release dates much 
 closer to Java Lucene.
  However, this comes with a cost of carrying over unfinished work, 
   known defects, and I have to keep an eye on new code that gets 
  committed into Java Lucene which must be ported over in a 
 timely fashion.
 
   To help me determine when is a good time to start the port -- keep in
   mind, I will be taking the latest code off SVN -- I'd like to hear from
   the Java Lucene committers (and users who are playing with or using
   Lucene 2.9 off SVN) about these questions:
 
  1) how stable the current code in the trunk is,
  2) do you still have feature work to deliver or just bug fixes, and
  3) what's your target date to release Java Lucene 2.9
 
   #1 is important; that is, is anyone using it in production?
 
  Yes, I did look at the current open issues in JIRA, but 
 that doesn't 
  help me answer the above questions.
 
  Regards,
 
  -- George
 
 
  
 -
  To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
  For additional commands, e-mail: java-dev-h...@lucene.apache.org
 
 
 
 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org
 


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Lucene 2.9 status (to port to Lucene.Net)

2009-04-16 Thread Mark Miller
I wouldn't be surprised if it didn't depend on a couple of other little 
issues - Jason or Mike would probably have to tell you that.


It does count a bit on LUCENE-1483 if you want to use it with 
FieldCaches or cached Filters though. It would still work with 1483, but 
would be much slower in those cases.


- Mark

George Aroush wrote:

Thanks Mike.

A quick follow up question.  What's the status of
http://issues.apache.org/jira/browse/LUCENE-1313?  Can this work be applied
to Lucene 2.4.1 and still get its benefit, or are there other dependencies /
issues with it that prevent us from doing so?

If anyone else knows, I welcome your input.

-- George

  

-Original Message-
From: Michael McCandless [mailto:luc...@mikemccandless.com] 
Sent: Thursday, April 16, 2009 8:36 AM

To: java-dev@lucene.apache.org
Subject: Re: Lucene 2.9 status (to port to Lucene.Net)

Hi George,

There's been a sudden burst of activity lately on 2.9 development...

I know there are some biggish remaining features we may want 
to get into 2.9:


  * The new field cache (LUCENE-831; still being iterated/mulled),

  * Possible major rework of Field / Document & index-time vs
search-time Document

  * Applying filters via random-access API when possible & performant
(LUCENE-1536)

  * Possible further optimizations to how collection works
   (LUCENE-1593)

  * Maybe breaking core + contrib into a more uniform set of modules
(and figuring out how Trie(Numeric)RangeQuery/Filter fits in here)
-- the Modularization uber-thread.

  * Further improvements to near-realtime search (using RAMDir for
small recently flushed segments)

  * Many other small things and probably some big ones that I'm
forgetting now :)

So things are still in flux, and I'm really not sure on a 
release date at this point.  Late last year, I was hoping for 
early this year, but it's no longer early this year ;)


Mike

On Wed, Apr 15, 2009 at 9:17 PM, George Aroush 
geo...@aroush.net wrote:


Hi Folks,

This is George Aroush, I'm one of the committers on Lucene.Net - a
port of Java Lucene to C# Lucene.

I'm looking at the current trunk code of yet to be released Lucene 2.9
and I would like to port it to Lucene.Net.  If I do this now, we get
the benefit of keeping our code base and release dates much closer to
Java Lucene.  However, this comes with a cost of carrying over
unfinished work, known defects, and I have to keep an eye on new code
that gets committed into Java Lucene which must be ported over in a
timely fashion.

To help me determine when is a good time to start the port -- keep in
mind, I will be taking the latest code off SVN -- I'd like to hear from
the Java Lucene committers (and users who are playing with or using
Lucene 2.9 off SVN) about these questions:

1) how stable the current code in the trunk is,
2) do you still have feature work to deliver or just bug fixes, and
3) what's your target date to release Java Lucene 2.9

#1 is important; that is, is anyone using it in production?

Yes, I did look at the current open issues in JIRA, but that doesn't
help me answer the above questions.

Regards,

-- George


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



--
- Mark

http://www.lucidimagination.com




-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Lucene 2.9 status (to port to Lucene.Net)

2009-04-16 Thread Mark Miller
Whoops - should read: It should still work *without* 1483 but would be 
much slower in those cases (reloading the filter/fieldcache per reader 
rather than per segment).


Mark Miller wrote:
I wouldn't be surprised if it didn't depend on a couple of other little 
issues - Jason or Mike would probably have to tell you that.


It does count a bit on LUCENE-1483 if you want to use it with 
FieldCaches or cached Filters though. It would still work with 1483, 
but would be much slower in those cases.


- Mark

George Aroush wrote:

Thanks Mike.

A quick follow up question.  What's the status of
http://issues.apache.org/jira/browse/LUCENE-1313?  Can this work be
applied to Lucene 2.4.1 and still get its benefit, or are there other
dependencies / issues with it that prevent us from doing so?

If anyone else knows, I welcome your input.

-- George






--
- Mark

http://www.lucidimagination.com




-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



RE: Lucene 2.9 status (to port to Lucene.Net)

2009-04-16 Thread Uwe Schindler
These issues all depend so much on each other that I would suggest simply trying
the Lucene-2.9-dev trunk (e.g. downloaded from Hudson). We have it running here
without any problems. The problem with unreleased Lucene is more that if you try
new features, there may be incompatible changes until the release, so you must
keep track of changes to the components you try out.
In general: if everything works for you, and you have backups of your indexes,
you can simply try it out. If it works correctly, just use it! Patching the
released version may make it more unstable than using the development tree,
which is more heavily tested by all our committers :)

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

 -Original Message-
 From: George Aroush [mailto:geo...@aroush.net]
 Sent: Thursday, April 16, 2009 5:05 PM
 To: java-dev@lucene.apache.org
 Subject: RE: Lucene 2.9 status (to port to Lucene.Net)
 
 Thanks Mike.
 
 A quick follow up question.  What's the status of
 http://issues.apache.org/jira/browse/LUCENE-1313?  Can this work be
 applied to Lucene 2.4.1 and still get its benefit, or are there other
 dependencies / issues with it that prevent us from doing so?
 
 If anyone else knows, I welcome your input.
 
 -- George
 
  -Original Message-
  From: Michael McCandless [mailto:luc...@mikemccandless.com]
  Sent: Thursday, April 16, 2009 8:36 AM
  To: java-dev@lucene.apache.org
  Subject: Re: Lucene 2.9 status (to port to Lucene.Net)
 
  Hi George,
 
  There's been a sudden burst of activity lately on 2.9 development...
 
  I know there are some biggish remaining features we may want
  to get into 2.9:
 
* The new field cache (LUCENE-831; still being iterated/mulled),
 
 * Possible major rework of Field / Document & index-time vs
  search-time Document
 
 * Applying filters via random-access API when possible & performant
  (LUCENE-1536)
 
* Possible further optimizations to how collection works
 (LUCENE-1593)
 
* Maybe breaking core + contrib into a more uniform set of modules
  (and figuring out how Trie(Numeric)RangeQuery/Filter fits in here)
  -- the Modularization uber-thread.
 
* Further improvements to near-realtime search (using RAMDir for
  small recently flushed segments)
 
* Many other small things and probably some big ones that I'm
  forgetting now :)
 
  So things are still in flux, and I'm really not sure on a
  release date at this point.  Late last year, I was hoping for
  early this year, but it's no longer early this year ;)
 
  Mike
 
  On Wed, Apr 15, 2009 at 9:17 PM, George Aroush
  geo...@aroush.net wrote:
   Hi Folks,
  
   This is George Aroush, I'm one of the committers on Lucene.Net - a
   port of Java Lucene to C# Lucene.
  
   I'm looking at the current trunk code of yet to be released
  Lucene 2.9
   and I would like to port it to Lucene.Net.  If I do this
  now, we get
   the benefit of keeping our code base and release dates much
  closer to Java Lucene.
   However, this comes with a cost of carrying over unfinished work,
   known defects, and I have to keep an eye on new code that gets
   committed into Java Lucene which must be ported over in a
  timely fashion.
  
   To help me determine when is a good time to start the port -- keep in
   mind, I will be taking the latest code off SVN -- I'd like to hear from
   the Java Lucene committers (and users who are playing with or using
   Lucene 2.9 off SVN) about these questions:
  
   1) how stable the current code in the trunk is,
   2) do you still have feature work to deliver or just bug fixes, and
   3) what's your target date to release Java Lucene 2.9
  
   #1 is important; that is, is anyone using it in production?
  
   Yes, I did look at the current open issues in JIRA, but
  that doesn't
   help me answer the above questions.
  
   Regards,
  
   -- George
  
  
  
  -
   To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
   For additional commands, e-mail: java-dev-h...@lucene.apache.org
  
  
 
  -
  To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
  For additional commands, e-mail: java-dev-h...@lucene.apache.org
 
 
 
 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



RE: I wanna contribute a Chinese analyzer to lucene

2009-04-16 Thread Steven A Rowe
In addition to Ken's suggestions, check out 
http://wiki.apache.org/lucene-java/HowToContribute for some help on getting set 
up. - Steve

From: Ken Krugler [mailto:kkrugler_li...@transpac.com]
Sent: Thursday, April 16, 2009 10:16 AM
To: java-dev@lucene.apache.org
Subject: Re: I wanna contribute a Chinese analyzer to lucene

I wrote an Analyzer for Apache Lucene that analyzes sentences in the Chinese 
language. It's called imdict-chinese-analyzer, as it is a subproject of 
imdict (http://www.imdict.net/), an intelligent online dictionary.

The project on Google Code is here: 
http://code.google.com/p/imdict-chinese-analyzer/

I took a quick look, but didn't see any code posted there yet.

[snip]

This Analyzer consists of two parts, the source code and the lexical 
dictionary. I want to publish the source code under the Apache license, but the 
dictionary, which is under an ambiguous license, was not created by me.
So, can I submit only the source code to the Lucene contribution repository, and 
let users download the dictionary from the Google Code site?

I believe your code can be a contrib, with a reference to the dictionary. So a 
first step would be to open an issue in Lucene's Jira 
(http://issues.apache.org/jira/browse/LUCENE), and post your source as a patch.

The best way to get the right answer to the legal issue is to post it to the 
legal-disc...@apache.org list (join it first), as Apache's lawyers can then 
respond to your specific question.

-- Ken

--
Ken Krugler
+1 530-210-6378


[jira] Commented: (LUCENE-1600) Reduce usage of String.intern(), performance is terrible

2009-04-16 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12699857#action_12699857
 ] 

Jason Rutherglen commented on LUCENE-1600:
--

contrib/MemoryIndex has a bunch of notes about how interning is
slow, and using (I believe) hashmaps of strings is better.
Comments on this approach?
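
A minimal sketch of that HashMap idea (illustrative; not the MemoryIndex
code): keep a local map so repeated strings collapse to one instance
without paying for String.intern().

{code}
import java.util.HashMap;
import java.util.Map;

public class StringPool {
  // Per-instance canonical map: avoids the global lock and hashing cost of
  // String.intern(). Must be confined to one thread or synchronized.
  private final Map pool = new HashMap();

  public String canonicalize(String s) {
    String prev = (String) pool.get(s);
    if (prev != null) return prev; // reuse the first instance seen
    pool.put(s, s);
    return s;
  }
}
{code}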

 Reduce usage of String.intern(), performance is terrible
 

 Key: LUCENE-1600
 URL: https://issues.apache.org/jira/browse/LUCENE-1600
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: 2.4, 2.4.1
 Environment: Windows Server 2003 x64
 Hotspot JDK 1.6.0_12 64-bit
Reporter: Patrick Eger
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: intern.png, intern_perf.patch


 I profiled a simple MatchAllDocsQuery() against ~1.5 million documents (8 
 fields of short text, Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS), 
 then retrieved all documents via searcher.doc(i, fs). String.intern() showed 
 up as a top hotspot (see attached screenshot), so I implemented a small 
 optimization to not intern() for every new Field(), instead forcing the 
 intern in the FieldInfos class and adding an optional internName constructor 
 to Field. This reduced execution time for searching and iterating through all 
 documents by 35%. Results were similar for -server and -client.
 TRUNK (2.9) w/out patch: matched 1435563 in 8884 ms/search
 TRUNK (2.9) w/patch: matched 1435563 in 5786 ms/search

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1600) Reduce usage of String.intern(), performance is terrible

2009-04-16 Thread Patrick Eger (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12699864#action_12699864
 ] 

Patrick Eger commented on LUCENE-1600:
--

HashMaps would work too, but then they either need to be synchronized or kept 
per-thread; the former would probably kill all your performance gains and the 
latter would be annoying, I think. A moderate usage of String.intern() is fine, 
I think; my patch just takes it out of the hot path (for my use case at least). 
Other uses of String.intern() in the codebase may certainly have different 
solutions/tradeoffs.

 Reduce usage of String.intern(), performance is terrible
 

 Key: LUCENE-1600
 URL: https://issues.apache.org/jira/browse/LUCENE-1600
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: 2.4, 2.4.1
 Environment: Windows Server 2003 x64
 Hotspot JDK 1.6.0_12 64-bit
Reporter: Patrick Eger
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: intern.png, intern_perf.patch


 I profiled a simple MatchAllDocsQuery() against ~1.5 million documents (8 
 fields of short text, Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS), 
 then retrieved all documents via searcher.doc(i, fs). String.intern() showed 
 up as a top hotspot (see attached screenshot), so I implemented a small 
 optimization to not intern() for every new Field(), instead forcing the 
 intern in the FieldInfos class and adding an optional internName constructor 
 to Field. This reduced execution time for searching and iterating through all 
 documents by 35%. Results were similar for -server and -client.
 TRUNK (2.9) w/out patch: matched 1435563 in 8884 ms/search
 TRUNK (2.9) w/patch: matched 1435563 in 5786 ms/search

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1600) Reduce usage of String.intern(), performance is terrible

2009-04-16 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12699865#action_12699865
 ] 

Uwe Schindler commented on LUCENE-1600:
---

In addition to Mike's fixes, there are more places in FieldsReader where 
intern() is used. The best would be to add the same ctor to AbstractField, too, 
and use it for LazyField and so on.
If I have time, I will attach a patch similar to Mike's (as he is on holidays).

 Reduce usage of String.intern(), performance is terrible
 

 Key: LUCENE-1600
 URL: https://issues.apache.org/jira/browse/LUCENE-1600
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: 2.4, 2.4.1
 Environment: Windows Server 2003 x64
 Hotspot JDK 1.6.0_12 64-bit
Reporter: Patrick Eger
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: intern.png, intern_perf.patch


 I profiled a simple MatchAllDocsQuery() against ~1.5 million documents (8 
 fields of short text, Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS), 
 then retrieved all documents via searcher.doc(i, fs). String.intern() showed 
 up as a top hotspot (see attached screenshot), so I implemented a small 
 optimization to not intern() for every new Field(), instead forcing the 
 intern in the FieldInfos class and adding an optional internName constructor 
 to Field. This reduced execution time for searching and iterating through all 
 documents by 35%. Results were similar for -server and -client.
 TRUNK (2.9) w/out patch: matched 1435563 in 8884 ms/search
 TRUNK (2.9) w/patch: matched 1435563 in 5786 ms/search

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Lucene 2.9 status (to port to Lucene.Net)

2009-04-16 Thread Jason Rutherglen
LUCENE-1313 relies on LUCENE-1516 which is in trunk.  If you have other
questions George, feel free to ask.

On Thu, Apr 16, 2009 at 8:04 AM, George Aroush geo...@aroush.net wrote:

 Thanks Mike.

 A quick follow up question.  What's the status of
 http://issues.apache.org/jira/browse/LUCENE-1313?  Can this work be
 applied to Lucene 2.4.1 and still get its benefit, or are there other
 dependencies / issues with it that prevent us from doing so?

 If anyone else knows, I welcome your input.

 -- George

  -Original Message-
  From: Michael McCandless [mailto:luc...@mikemccandless.com]
  Sent: Thursday, April 16, 2009 8:36 AM
  To: java-dev@lucene.apache.org
  Subject: Re: Lucene 2.9 status (to port to Lucene.Net)
 
  Hi George,
 
  There's been a sudden burst of activity lately on 2.9 development...
 
  I know there are some biggish remaining features we may want
  to get into 2.9:
 
* The new field cache (LUCENE-831; still being iterated/mulled),
 
  * Possible major rework of Field / Document & index-time vs
  search-time Document
 
  * Applying filters via random-access API when possible & performant
  (LUCENE-1536)
 
* Possible further optimizations to how collection works
 (LUCENE-1593)
 
* Maybe breaking core + contrib into a more uniform set of modules
  (and figuring out how Trie(Numeric)RangeQuery/Filter fits in here)
  -- the Modularization uber-thread.
 
* Further improvements to near-realtime search (using RAMDir for
  small recently flushed segments)
 
* Many other small things and probably some big ones that I'm
  forgetting now :)
 
  So things are still in flux, and I'm really not sure on a
  release date at this point.  Late last year, I was hoping for
  early this year, but it's no longer early this year ;)
 
  Mike
 
  On Wed, Apr 15, 2009 at 9:17 PM, George Aroush
  geo...@aroush.net wrote:
   Hi Folks,
  
   This is George Aroush, I'm one of the committers on Lucene.Net - a
   port of Java Lucene to C# Lucene.
  
   I'm looking at the current trunk code of yet to be released
  Lucene 2.9
   and I would like to port it to Lucene.Net.  If I do this
  now, we get
   the benefit of keeping our code base and release dates much
  closer to Java Lucene.
   However, this comes with a cost of carrying over unfinished work,
    known defects, and I have to keep an eye on new code that gets
   committed into Java Lucene which must be ported over in a
  timely fashion.
  
    To help me determine when is a good time to start the port -- keep in
    mind, I will be taking the latest code off SVN -- I'd like to hear from
    the Java Lucene committers (and users who are playing with or using
    Lucene 2.9 off SVN) about these questions:
  
   1) how stable the current code in the trunk is,
   2) do you still have feature work to deliver or just bug fixes, and
   3) what's your target date to release Java Lucene 2.9
  
    #1 is important; that is, is anyone using it in production?
  
   Yes, I did look at the current open issues in JIRA, but
  that doesn't
   help me answer the above questions.
  
   Regards,
  
   -- George
  
  
  
  -
   To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
   For additional commands, e-mail: java-dev-h...@lucene.apache.org
  
  
 
  -
  To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
  For additional commands, e-mail: java-dev-h...@lucene.apache.org
 


 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org




Re: vacation

2009-04-16 Thread Jason Rutherglen
Enjoy, I just got back from mine, tropical Minneapolis.

On Thu, Apr 16, 2009 at 7:45 AM, Michael McCandless 
luc...@mikemccandless.com wrote:

 Just as a heads up, since we have so many neat Lucene improvements in
 flight: tomorrow I leave for a week long vacation, in a nice warm
 place that may or may not have internet access.  So if suddenly I stop
 answering things, now you know why!

 Keep hacking away ;)

 Mike

 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org




[jira] Issue Comment Edited: (LUCENE-1600) Reduce usage of String.intern(), performance is terrible

2009-04-16 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12699865#action_12699865
 ] 

Uwe Schindler edited comment on LUCENE-1600 at 4/16/09 2:13 PM:


In addition to Mike's fixes, there are more places in FieldsReader where 
intern() is used. The best would be to add the same ctor to AbstractField, too, 
and use it for LazyField and so on.
If I have time, I will attach a patch similar to Patrick's.

  was (Author: thetaphi):
In addition to Mike's fixes, there are more places in FieldsReader where 
intern() is used. The best would be to add the same ctor to AbstractField, too, 
and use it for LazyField and so on.
If I have time, I will attach a patch similar to Mike's (as he is on holidays).
  
 Reduce usage of String.intern(), performance is terrible
 

 Key: LUCENE-1600
 URL: https://issues.apache.org/jira/browse/LUCENE-1600
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: 2.4, 2.4.1
 Environment: Windows Server 2003 x64
 Hotspot JDK 1.6.0_12 64-bit
Reporter: Patrick Eger
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: intern.png, intern_perf.patch


 I profiled a simple MatchAllDocsQuery() against ~1.5 million documents (8 
 fields of short text, Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS), 
 then retrieved all documents via searcher.doc(i, fs). String.intern() showed 
 up as a top hotspot (see attached screenshot), so I implemented a small 
 optimization to not intern() for every new Field(), instead forcing the 
 intern in the FieldInfos class and adding an optional internName constructor 
 to Field. This reduced execution time for searching and iterating through all 
 documents by 35%. Results were similar for -server and -client.
 TRUNK (2.9) w/out patch: matched 1435563 in 8884 ms/search
 TRUNK (2.9) w/patch: matched 1435563 in 5786 ms/search

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: I wanna contribute a Chinese analyzer to lucene

2009-04-16 Thread Otis Gospodnetic


 --
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch





From: Gao Pinker xiaoping...@gmail.com
To: java-dev@lucene.apache.org
Sent: Thursday, April 16, 2009 9:58:51 AM
Subject: I wanna contribute a Chinese analyzer to lucene

Hi All!

I wrote an Analyzer for Apache Lucene that analyzes sentences in the Chinese 
language. It's called imdict-chinese-analyzer, as it is a subproject of imdict, 
which is an intelligent online dictionary.

The project on Google Code is here: 
http://code.google.com/p/imdict-chinese-analyzer/

In Chinese, 我是中国人 (I am Chinese) should be tokenized as 我 (I) 是 (am) 
中国人 (Chinese), not 我 是中 国人. So the analyzer must segment each sentence 
properly, or there will be misunderstandings everywhere in the index 
constructed by Lucene, and the accuracy of the search engine will be affected 
seriously!

Although there are two analyzer packages in the Apache repository that can handle 
Chinese: ChineseAnalyzer and CJKAnalyzer, they take each character or every two 
adjoining characters as a single word. This is obviously not how the language 
works, and this strategy also increases the index size and hurts performance badly.

The algorithm of imdict-chinese-analyzer is based on a Hidden Markov Model (HMM), 
so it can tokenize Chinese sentences in a really intelligent way. Tokenization 
accuracy of this model is above 90% according to the paper HHMM-based Chinese 
Lexical Analyzer ICTCLAL.

As imdict-chinese-analyzer is a really fast, intelligent Chinese Analyzer for 
Lucene written in Java, I want to share this project with everyone using 
Lucene.

This Analyzer consists of two parts, the source code and the lexical 
dictionary. I want to publish the source code under the Apache license, but the 
dictionary, which is under an ambiguous license, was not created by me.
So, can I submit only the source code to the Lucene contribution repository, and 
let users download the dictionary from the Google Code site?

Please help me with this contribution.

RE: vacation

2009-04-16 Thread Uwe Schindler
Have fun and relax! My next holiday will be after a meeting in Japan; I will
visit Kyoto (end of May). It will be hot there, too...!

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

 -Original Message-
 From: Michael McCandless [mailto:luc...@mikemccandless.com]
 Sent: Thursday, April 16, 2009 4:46 PM
 To: java-dev@lucene.apache.org
 Subject: vacation
 
 Just as a heads up, since we have so many neat Lucene improvements in
 flight: tomorrow I leave for a week long vacation, in a nice warm
 place that may or may not have internet access.  So if suddenly I stop
 answering things, now you know why!
 
 Keep hacking away ;)
 
 Mike
 
 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: I wanna contribute a Chinese analyzer to lucene

2009-04-16 Thread Otis Gospodnetic
This would be a great contribution.
I took a quick look at the ZIP file and noticed it depends on, say, 
net.imdict.wordsegment.WordSegmenter, but I didn't see that class anywhere.  I 
assume you will patch and polish things, but I thought I'd point this out.


Thanks!
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch





From: Gao Pinker xiaoping...@gmail.com
To: java-dev@lucene.apache.org
Sent: Thursday, April 16, 2009 9:58:51 AM
Subject: I wanna contribute a Chinese analyzer to lucene

Hi All!

I wrote an Analyzer for Apache Lucene that analyzes sentences in the Chinese 
language. It's called imdict-chinese-analyzer, as it is a subproject of imdict, 
which is an intelligent online dictionary.

The project on Google Code is here: 
http://code.google.com/p/imdict-chinese-analyzer/

In Chinese, 我是中国人 (I am Chinese) should be tokenized as 我 (I) 是 (am) 
中国人 (Chinese), not 我 是中 国人. So the analyzer must segment each sentence 
properly, or there will be misunderstandings everywhere in the index 
constructed by Lucene, and the accuracy of the search engine will be affected 
seriously!

Although there are two analyzer packages in the Apache repository that can handle 
Chinese: ChineseAnalyzer and CJKAnalyzer, they take each character or every two 
adjoining characters as a single word. This is obviously not how the language 
works, and this strategy also increases the index size and hurts performance badly.

The algorithm of imdict-chinese-analyzer is based on a Hidden Markov Model (HMM), 
so it can tokenize Chinese sentences in a really intelligent way. Tokenization 
accuracy of this model is above 90% according to the paper HHMM-based Chinese 
Lexical Analyzer ICTCLAL.

As imdict-chinese-analyzer is a really fast, intelligent Chinese Analyzer for 
Lucene written in Java, I want to share this project with everyone using 
Lucene.

This Analyzer consists of two parts, the source code and the lexical 
dictionary. I want to publish the source code under the Apache license, but the 
dictionary, which is under an ambiguous license, was not created by me.
So, can I submit only the source code to the Lucene contribution repository, and 
let users download the dictionary from the Google Code site?

Please help me with this contribution.

[jira] Commented: (LUCENE-1604) Stop creating huge arrays to represent the absence of field norms

2009-04-16 Thread Shon Vella (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12699872#action_12699872
 ] 

Shon Vella commented on LUCENE-1604:


What should the transitive behavior of MultiReader, FilterReader, and 
ParallelReader be? I'm inclined to say they shouldn't pass through to their 
subordinate readers because they don't really own them. 

 Stop creating huge arrays to represent the absence of field norms
 -

 Key: LUCENE-1604
 URL: https://issues.apache.org/jira/browse/LUCENE-1604
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.9
Reporter: Shon Vella
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1604.patch, LUCENE-1604.patch


 Creating and keeping around huge arrays that hold a constant value is very 
 inefficient both from a heap usage standpoint and from a locality of 
 reference standpoint. It would be much more efficient to use null to 
 represent a missing norms table.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-831) Complete overhaul of FieldCache API/Implementation

2009-04-16 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12699880#action_12699880
 ] 

Mark Miller commented on LUCENE-831:


Okay, now that I halfway understand this issue, I think I have to go back to 
the basic motivations. The original big win was taken away by 1483, so let's see 
if we really need a new API for the wins we have left.

h3. Advantages of the new API (kind of as it is in the patch)
FieldCache is an interface and it would be nice to move to an abstract class; 
ExtendedFieldCache is ugly.
Avoids the global sync by IndexReader to access the cache.
It's easier/cleaner to block caching by multi-readers (though I am almost 
thinking I would prefer warnings/advice about performance and encouragement to 
move to per segment).
It becomes easier to share a ValueSource instance across readers.

h3. Disadvantages of the new API
If we want only SegmentReaders to have a ValueSource, you can't efficiently 
back the old API with the new, causing RAM requirements to jump if you straddle 
the two APIs and ask for the same array data from each.

It's probably a higher barrier for a custom Parser to implement and init a Reader 
with a ValueSource (presumably one that works per field) than to simply pass the 
Parser on a SortField. However, Parser stops making sense if we end up being 
able to back ValueSource with column-stride fields. We could allow ValueSource 
to be passed on the SortField (the current incarnation of this patch), but then 
you have to go back to a global cache keyed by reader for the ValueSources passed 
that way (you would also still have the per-segment-reader, settable ValueSource).

h3. Advantages of staying with the old API
Avoids forcing a large migration for users, with possible RAM req penalties if 
they don't switch from deprecated code (we are doing something similar with 
1483 even without deprecated code though - if you were using an external 
multireader FieldCache that matched a sort FieldCache key, you'd double your RAM 
reqs).

h3. Thoughts
If we stayed with the old API, we could still allow a custom FieldCache to be 
supplied. We could still back FieldCacheImpl with Uninverter to reduce code. We 
could still have CachingFieldCache, though CachingValueSource is a much better 
name :) FieldCache implies caching, and so the name would be confusing. We could 
also avoid CachingFieldCache though, as just making FieldCache pluggable would 
allow alternate caching implementations (with a bit more effort).

We could deprecate the Parser methods and force supplying a new FieldCache impl 
for custom uninversion to get to an API suitable to be backed by CSF.

Or:

We could also move to ValueSource, but allow a ValueSource on multi-readers. 
That would probably make straddling the APIs much more possible (and 
efficient) in the default case. We could advise that it's best to work per 
segment, but leave the option to the user.

h3. Conclusion
I am not sure. I thought I was convinced we might as well not move from 
FieldCache at all, but now that I've written a bit out, I'm thinking it would be 
worth going to ValueSource. I'm just not positive about what we should support. 
A SortField ValueSource override keyed by reader? ValueSources on MultiReaders?
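
For the "global cache by reader" option above, a minimal sketch (names and
shape assumed; this is not the patch's API) of keying cached arrays by
reader plus field plus ValueSource, with weak keys so entries die with
their reader:

{code}
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.WeakHashMap;
import org.apache.lucene.index.IndexReader;

public class ReaderKeyedCache {
  // The outer map holds readers weakly, so cached values can be
  // collected together with their reader.
  private final Map cacheByReader = new WeakHashMap();

  public synchronized Object get(IndexReader reader, String field, Object valueSource) {
    Map perReader = (Map) cacheByReader.get(reader);
    return perReader == null
        ? null
        : perReader.get(Arrays.asList(new Object[] { field, valueSource }));
  }

  public synchronized void put(IndexReader reader, String field, Object valueSource, Object value) {
    Map perReader = (Map) cacheByReader.get(reader);
    if (perReader == null) {
      perReader = new HashMap();
      cacheByReader.put(reader, perReader);
    }
    // the key combines the field with the ValueSource that produced the value
    perReader.put(Arrays.asList(new Object[] { field, valueSource }), value);
  }
}
{code}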


Re: vacation

2009-04-16 Thread Marvin Humphrey
On Thu, Apr 16, 2009 at 10:45:49AM -0400, Michael McCandless wrote:
 Just as a heads up, since we have so many neat Lucene improvements in
 flight: tomorrow I leave for a week long vacation, in a nice warm
 place that may or may not have internet access.  So if suddenly I stop
 answering things, now you know why!

I've got plenty to keep myself busy while you're gone. :)  We'll manage on
autopilot for a little while.  Enjoy your break.

Marvin Humphrey


[jira] Commented: (LUCENE-831) Complete overhaul of FieldCache API/Implementation

2009-04-16 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12699893#action_12699893
 ] 

Uwe Schindler commented on LUCENE-831:
--

We have the ValueSource-override problem not only with SortField: function 
queries and other places need the additional ValueSource override too. So a 
central place to register a ValueSource per field for an IndexReader 
(MultiReader, ..., passing down to segments) would really be nice.

For the caching problem: possibly the ValueSource given to SortField etc. 
behaves like the current parser. The cache in IndexReader should then also be 
keyed by the ValueSource, so the SortField/FunctionQuery ValueSource override 
is passed down to the IndexReader's cache. If the IndexReader has an entry in 
its cache for the same (field, ValueSource, ...) key, it can use the arrays 
from there; if not, it fills the cache with an array from the overridden 
ValueSource. I would really make the ValueSource per-field.
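
A minimal sketch of such a composite key, assuming ValueSource implementations 
define equals()/hashCode() (just as custom parsers are expected to today); the 
class and field names are illustrative:

{code}
// Sketch: cache entries keyed by field AND the ValueSource that produced
// them, so an override passed via SortField/FunctionQuery never collides
// with the default entry for the same field.
final class CacheEntryKey {
  final String field;
  final Object valueSource;  // the override, or a shared default marker

  CacheEntryKey(String field, Object valueSource) {
    this.field = field;
    this.valueSource = valueSource;
  }

  public boolean equals(Object o) {
    if (!(o instanceof CacheEntryKey)) return false;
    CacheEntryKey other = (CacheEntryKey) o;
    return field.equals(other.field) && valueSource.equals(other.valueSource);
  }

  public int hashCode() {
    return 31 * field.hashCode() + valueSource.hashCode();
  }
}
{code}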

The Uninverter inner class should be made public, the Uninverter should accept 
a starting term to iterate from (via override), and the newTerm() method should 
be able to return false to stop iterating (see my ValueSource example for 
trie). With that, one could easily create a subclass of Uninverter with its own 
parser logic (like trie).
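
Under those assumptions (none of these hooks exist yet, and Uninverter itself 
is the not-yet-public class from this patch), a trie subclass might look like:

{code}
import org.apache.lucene.index.Term;

// Sketch of the two requested hooks, with assumed names: seek to a custom
// start term, and return false from newTerm() to stop uninverting early.
public class TrieUninverter extends Uninverter {

  protected Term startTerm(String field) {
    // start at the full-precision prefix instead of the field's first term
    return new Term(field, fullPrecisionPrefix());
  }

  protected boolean newTerm(Term t) {
    if (!isFullPrecision(t.text())) {
      return false;   // lower-precision trie terms follow: stop iterating
    }
    // ... parse the trie-encoded term and record its value ...
    return true;      // keep iterating
  }

  // trie-encoding helpers, purely illustrative
  private String fullPrecisionPrefix() { return "\u0000"; }

  private boolean isFullPrecision(String text) {
    return text.length() > 0 && text.charAt(0) == '\u0000';
  }
}
{code}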


[jira] Commented: (LUCENE-1536) if a filter can support random access API, we should use it

2009-04-16 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12699892#action_12699892
 ] 

Jason Rutherglen commented on LUCENE-1536:
--

I thought we were going to get LUCENE-1518 working to compare the performance 
against passing the filter into TermDocs? 

 if a filter can support random access API, we should use it
 ---

 Key: LUCENE-1536
 URL: https://issues.apache.org/jira/browse/LUCENE-1536
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.4
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1536.patch


 I ran some performance tests, comparing applying a filter via
 random-access API instead of current trunk's iterator API.
 This was inspired by LUCENE-1476, where we realized deletions should
 really be implemented just like a filter, but then in testing found
 that switching deletions to iterator was a very sizable performance
 hit.
 Some notes on the test:
   * Index is first 2M docs of Wikipedia.  Test machine is Mac OS X
 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
   * I test across multiple queries.  "1-X" means an OR query, eg "1-4"
 means "1 OR 2 OR 3 OR 4", whereas "+1-4" is an AND query, ie "1 AND 2
 AND 3 AND 4".  "u s" means "united states" (phrase search).
   * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
 95, 98, 99, 99.9 (filter is non-null but all bits are set),
 100 (filter=null, control)).
   * Method "high" means I use the random-access filter API in
 IndexSearcher's main loop.  Method "low" means I use the random-access
 filter API down in SegmentTermDocs (just like deleted docs
 today).
   * Baseline (QPS) is current trunk, where the filter is applied as an
 iterator up high (ie in IndexSearcher's search loop).


[jira] Updated: (LUCENE-1518) Merge Query and Filter classes

2009-04-16 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1518:
---

Fix Version/s: 2.9

 Merge Query and Filter classes
 --

 Key: LUCENE-1518
 URL: https://issues.apache.org/jira/browse/LUCENE-1518
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.4
Reporter: Uwe Schindler
 Fix For: 2.9

 Attachments: LUCENE-1518.patch


 This issue presents a patch that merges Queries and Filters such that the 
 new Filter class extends Query. This would make it possible to use every 
 filter as a query.
 The new abstract Filter class would contain all methods of 
 ConstantScoreQuery and would deprecate ConstantScoreQuery. Somebody who 
 implements the Filter's getDocIdSet()/bits() methods has nothing more to 
 do: the filter can be used directly as a normal query.
 I do not want to completely convert Filters to ConstantScoreQueries. The 
 idea is to combine Queries and Filters in such a way that every Filter can 
 automatically be used anywhere a Query can be used (e.g. alone as a search 
 query without any other constraint). For that, the abstract Query methods 
 must be implemented and return a default weight for Filters, which is the 
 current ConstantScore logic. If the filter is used as a real filter (where 
 the API wants a Filter), the getDocIdSet part is used directly and the 
 weight is useless (as it is currently, too). The constant-score default 
 implementation is only used when the Filter is used as a Query (e.g. as a 
 direct parameter to Searcher.search()). For the special case of 
 BooleanQueries combining Filters and Queries, the idea is to optimize the 
 BooleanQuery logic so that it detects whether a BooleanClause is a Filter 
 (using instanceof) and then uses the Filter API directly, rather than 
 taking on the burden of the ConstantScoreQuery (see LUCENE-1345).
 Here are some ideas for how Searcher.search() could work with Query and 
 Filter:
 - The user runs Searcher.search() using a Filter as the only parameter: as 
 every Filter is also a ConstantScoreQuery, the query can be executed and 
 returns score 1.0 for all matching documents.
 - The user runs Searcher.search() using a Query as the only parameter: no 
 change, everything is the same as before.
 - The user runs Searcher.search() using a BooleanQuery as parameter: if the 
 BooleanQuery does not contain a Query that is a subclass of Filter (the 
 new Filter), everything is as usual. If the BooleanQuery contains exactly 
 one Filter and nothing else, the Filter is used as a constant-score query. 
 If the BooleanQuery contains clauses with both Queries and Filters, the 
 new algorithm can be used: the queries are executed and the results are 
 filtered with the filters.
 The main advantage for the user: a query can be constructed with a 
 simplified API, without thinking about Filters or Queries; clauses are 
 just combined together. The scorer/weight logic then identifies the cases 
 where the filter API or the query weight API should be used, just like the 
 query optimizer of an RDBMS.
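
A compressed sketch of the core idea, assuming the patch replaces the existing 
Filter class (signatures approximate; Query.weight() is the public entry point 
that builds a Weight):

{code}
import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.ConstantScoreQuery;
import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.Weight;

// Sketch: Filter extends Query and supplies a constant-score Weight by
// default, so a bare Filter can be passed anywhere a Query is accepted.
public abstract class Filter extends Query {

  /** the one method filter authors must implement */
  public abstract DocIdSet getDocIdSet(IndexReader reader) throws IOException;

  /** used only when the Filter stands alone as a Query: score 1.0 matches */
  public Weight createWeight(Searcher searcher) throws IOException {
    return new ConstantScoreQuery(this).weight(searcher);
  }

  public String toString(String field) {
    return "filter(" + getClass().getName() + ")";
  }
}
{code}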


[jira] Commented: (LUCENE-1536) if a filter can support random access API, we should use it

2009-04-16 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12699939#action_12699939
 ] 

Michael McCandless commented on LUCENE-1536:


Ahh right, we should re-test performance of this after LUCENE-1518 is done.


[jira] Commented: (LUCENE-1604) Stop creating huge arrays to represent the absence of field norms

2009-04-16 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12699941#action_12699941
 ] 

Michael McCandless commented on LUCENE-1604:


bq. I'm inclined to say they shouldn't pass through to their subordinate 
readers because they don't really own them.

I agree.

 Stop creating huge arrays to represent the absence of field norms
 -

 Key: LUCENE-1604
 URL: https://issues.apache.org/jira/browse/LUCENE-1604
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.9
Reporter: Shon Vella
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1604.patch, LUCENE-1604.patch


 Creating and keeping around huge arrays that hold a constant value is very 
 inefficient, both from a heap usage standpoint and from a locality of 
 reference standpoint. It would be much more efficient to use null to 
 represent a missing norms table.
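
A minimal sketch of what a caller does under this convention (the field name 
and the wrapper class are illustrative; norms() and decodeNorm() are the real 
2.9-era APIs):

{code}
import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Similarity;

// Sketch: treat a null norms array as "every document has the default
// norm" instead of allocating maxDoc() identical bytes.
class Norms {
  static float normFor(IndexReader reader, String field, int doc)
      throws IOException {
    byte[] norms = reader.norms(field);  // may be null under this proposal
    return norms == null
        ? 1.0f                           // implicit default norm, no array
        : Similarity.decodeNorm(norms[doc]);
  }
}
{code}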


[jira] Commented: (LUCENE-831) Complete overhaul of FieldCache API/Implementation

2009-04-16 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1263#action_1263
 ] 

Mark Miller commented on LUCENE-831:


I think we don't want to expose Uninverter though? The API should be neutral 
enough to naturally support loading from CSF, in which case Uninverter doesn't 
make sense... so we were going to go with having to override the ValueSource to 
handle uninverter-type stuff.
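
For example (every name below is an assumption about the in-progress API), the 
uninversion detail would stay behind the ValueSource override:

{code}
import java.io.IOException;

import org.apache.lucene.index.IndexReader;

// Sketch: the ValueSource override is the extension point; whether values
// come from uninverting the terms dict today or from column-stride fields
// later is a detail hidden behind it.
public class TrieIntValueSource extends ValueSource {

  public int[] getInts(IndexReader reader, String field) throws IOException {
    // today: walk terms/postings and fill the array (uninversion)
    // later: read the same array straight from a CSF stream
    return uninvertTrieInts(reader, field);
  }

  private int[] uninvertTrieInts(IndexReader reader, String field)
      throws IOException {
    // trie-aware uninversion would live here, never exposed to callers
    return new int[reader.maxDoc()];
  }
}
{code}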
