[jira] Commented: (LUCENE-1410) PFOR implementation

2008-10-10 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12638543#action_12638543
 ] 

Michael McCandless commented on LUCENE-1410:


Paul, in decompress I added "inputSize = -1" at the top, so that the header is 
re-read.  I need this so I can re-use a single PFor instance during decompress.

> PFOR implementation
> ---
>
> Key: LUCENE-1410
> URL: https://issues.apache.org/jira/browse/LUCENE-1410
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Other
> Reporter: Paul Elschot
> Priority: Minor
> Attachments: autogen.tgz, LUCENE-1410b.patch, LUCENE-1410c.patch,
>   TestPFor2.java, TestPFor2.java, TestPFor2.java
>
> Original Estimate: 21840h
> Remaining Estimate: 21840h
>
> Implementation of Patched Frame of Reference.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1410) PFOR implementation

2008-10-10 Thread Paul Elschot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12638564#action_12638564
 ] 

Paul Elschot commented on LUCENE-1410:
--

Did you also move to relative addressing in the buffer?

Another question: I suppose the place to add this initially would be in
IndexOutput and IndexInput?
In that case it would make sense to reserve (some bits of) the first byte in
the compressed buffer for the compression method, and use these bits there to
call PFor or another (de)compression method.
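
To make the idea of reserving bits of the first byte concrete, here is a minimal sketch. The class name, the 2-bit width, and the method ids are all assumptions made for illustration; nothing here is taken from the attached patches.

```java
/** Hypothetical header layout; none of these names come from the patch. */
final class MethodHeader {
    static final int METHOD_BITS = 2;                       // assumption: room for 4 methods
    static final int METHOD_MASK = (1 << METHOD_BITS) - 1;
    static final int METHOD_VINT = 0;                       // hypothetical method ids
    static final int METHOD_PFOR = 1;

    /** Packs a method id into the low bits and the remaining header bits above it. */
    static byte pack(int method, int rest) {
        if (method < 0 || method > METHOD_MASK) {
            throw new IllegalArgumentException("method: " + method);
        }
        if (rest < 0 || rest >= (1 << (8 - METHOD_BITS))) {
            throw new IllegalArgumentException("rest: " + rest);
        }
        return (byte) ((rest << METHOD_BITS) | method);
    }

    /** Extracts the method id from the low bits of the header byte. */
    static int method(byte header) {
        return header & METHOD_MASK;
    }

    /** Extracts the remaining header bits, treating the byte as unsigned. */
    static int rest(byte header) {
        return (header & 0xFF) >>> METHOD_BITS;
    }

    /** Chooses a decompressor (represented here only by its name) from the header byte. */
    static String decompressorFor(byte header) {
        switch (method(header)) {
            case METHOD_VINT: return "vint";
            case METHOD_PFOR: return "pfor";
            default: throw new IllegalArgumentException("unknown method " + method(header));
        }
    }
}
```

A reader would look at the first byte, call something like decompressorFor, and hand the rest of the buffer to the chosen method. The remaining six bits stay available for whatever else the header needs.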




[jira] Commented: (LUCENE-1410) PFOR implementation

2008-10-10 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12638573#action_12638573
 ] 

Michael McCandless commented on LUCENE-1410:



Another thing that bit me was bufferByteSize(): if this returns
something that's not 0 mod 4, you must increase it to the next
multiple of 4; otherwise you will lose data, since ByteBuffer is big
endian by default.  We should test little endian to see if performance
changes (on different CPUs).
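
A small sketch of the rounding, assuming the buffer is read through an int view of a java.nio.ByteBuffer. The class name and the sample size of 10 are made up for illustration, and the patch itself may size its buffer differently.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.IntBuffer;

final class BufferSizing {
    /** Rounds a non-negative byte count up to the next multiple of 4 (the size of an int). */
    static int roundUpToMultipleOf4(int byteSize) {
        return (byteSize + 3) & ~3;
    }

    public static void main(String[] args) {
        int rawSize = 10; // hypothetical value returned by bufferByteSize()

        // An int view only covers whole ints, so the last 2 of 10 bytes are unreachable.
        IntBuffer truncated = ByteBuffer.allocate(rawSize).asIntBuffer();
        IntBuffer padded = ByteBuffer.allocate(roundUpToMultipleOf4(rawSize)).asIntBuffer();
        System.out.println(truncated.capacity() + " ints vs " + padded.capacity() + " ints");

        // ByteBuffer is big endian by default; the order can be changed before taking the view.
        ByteBuffer little = ByteBuffer.allocate(12).order(ByteOrder.LITTLE_ENDIAN);
        System.out.println(little.order());
    }
}
```

Switching the order as in the last lines is one way to set up the little endian comparison.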

bq. Did you also move to relative addressing in the buffer? 

No, I haven't done that, but I think we should.  I believe it's faster.  I'm
now trying to get a rudimentary test working for TermQuery using pfor.

{quote}
Another question: I suppose the place to add this initially would be in
IndexOutput and IndexInput?
In that case it would make sense to reserve (some bits of) the first byte in
the compressed buffer for the compression method, and use these bits there to
call PFor or another (de)compression method.
{quote}

This gets into flexible indexing...

Ideally we do this in a pluggable way, so that PFor is just one such
plugin, simple vInts is another, etc.

I could see a compression layer living "above" IndexInput/Output,
since logically how you encode an int block into bytes is independent
from the means of storage.

But: such an abstraction may hurt performance too much since during
read it would entail an extra buffer copy.  So maybe we should just
add methods to IndexInput/Output, or, make a new
IntBlockInput/Output.
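
As a rough sketch of the pluggable approach, an int-block codec could sit behind a small interface, with vInt as one implementation and PFor as another. IntBlockCodec and VIntBlockCodec are hypothetical names, not existing Lucene classes. The byte layout follows the usual vInt scheme: 7 data bits per byte, low-order group first, high bit set when more bytes follow.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical plugin interface: turns a block of ints into bytes and back. */
interface IntBlockCodec {
    byte[] encode(int[] values);

    int[] decode(byte[] bytes, int count);
}

/** vInt implementation of the hypothetical interface. */
final class VIntBlockCodec implements IntBlockCodec {
    @Override
    public byte[] encode(int[] values) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int v : values) {
            // Emit 7 bits at a time, setting the high bit while more bits remain.
            while ((v & ~0x7F) != 0) {
                out.write((v & 0x7F) | 0x80);
                v >>>= 7;
            }
            out.write(v);
        }
        return out.toByteArray();
    }

    @Override
    public int[] decode(byte[] bytes, int count) {
        int[] values = new int[count];
        int pos = 0;
        for (int i = 0; i < count; i++) {
            byte b = bytes[pos++];
            int v = b & 0x7F;
            // Keep reading while the continuation bit is set.
            for (int shift = 7; (b & 0x80) != 0; shift += 7) {
                b = bytes[pos++];
                v |= (b & 0x7F) << shift;
            }
            values[i] = v;
        }
        return values;
    }
}
```

This sketch goes through a byte[] for simplicity, which is exactly the extra copy mentioned above. Methods on IndexInput/Output or a dedicated IntBlockInput/Output could avoid it by reading and writing the storage directly.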

Also, some things you now store in the header of each block should
presumably move to the start of the file instead (e.g. the compression
method). Or, if we move to a separate "schema" file that can record
which compressor was used per file, we'd put this there.

So I'm not yet exactly sure how we should tie this in "for real"...





[jira] Created: (LUCENE-1417) Allowing for distance measures that incorporate frequency/popularity for SuggestWord comparison

2008-10-10 Thread Jason Rennie (JIRA)
Allowing for distance measures that incorporate frequency/popularity for 
SuggestWord comparison
---

Key: LUCENE-1417
URL: https://issues.apache.org/jira/browse/LUCENE-1417
Project: Lucene - Java
Issue Type: Improvement
Components: contrib/spellchecker
Affects Versions: 2.4
Reporter: Jason Rennie


Spelling suggestions are currently ordered first by a string edit distance 
measure, then by popularity/frequency.  This limits the ability of 
popularity/frequency to affect suggestions.  I think it would be better for the 
distance measure to accept popularity/frequency as an argument and provide a 
distance/score that incorporates any popularity/frequency considerations.  I.e. 
change StringDistance.getDistance to accept an additional argument: frequency 
of the potential suggestion.

The new SuggestWord.compareTo function would only order by score.  We could 
achieve the existing behavior by adding a small inverse frequency value to the 
distances.
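
A minimal sketch of how a score that folds in frequency can reproduce the existing ordering. It assumes an integer edit distance where smaller is better and a non-negative frequency. ScoredSuggestion and EPSILON are hypothetical stand-ins, not Lucene's SuggestWord or StringDistance.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical stand-in for SuggestWord; not Lucene's class. */
final class ScoredSuggestion {
    // Assumption: stays below 1 so frequency can never outweigh a whole edit step.
    static final double EPSILON = 0.5;

    final String word;
    final int editDistance; // smaller is better in this sketch
    final int freq;         // assumed non-negative

    ScoredSuggestion(String word, int editDistance, int freq) {
        this.word = word;
        this.editDistance = editDistance;
        this.freq = freq;
    }

    /** Distance plus a small inverse-frequency term; smaller is better. */
    double score() {
        return editDistance + EPSILON / (1.0 + freq);
    }

    /** Orders by score only and returns the words, best suggestion first. */
    static List<String> rank(List<ScoredSuggestion> candidates) {
        List<ScoredSuggestion> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingDouble(ScoredSuggestion::score));
        List<String> words = new ArrayList<>();
        for (ScoredSuggestion s : sorted) {
            words.add(s.word);
        }
        return words;
    }
}
```

Because the added term is always below one edit step, candidates still sort by distance first and by frequency second. A different distance measure could weight frequency more heavily without any change to the comparison.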





[jira] Assigned: (LUCENE-1415) MultiPhraseQuery has incorrect hashCode() implementation - Leads to Solr Cache misses

2008-10-10 Thread Yonik Seeley (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yonik Seeley reassigned LUCENE-1415:


Assignee: Yonik Seeley

> MultiPhraseQuery has incorrect hashCode() implementation - Leads to Solr 
> Cache misses
> -
>
> Key: LUCENE-1415
> URL: https://issues.apache.org/jira/browse/LUCENE-1415
> Project: Lucene - Java
> Issue Type: Bug
> Components: Search
> Affects Versions: 2.4
> Reporter: Todd Feak
> Assignee: Yonik Seeley
> Attachments: LUCENE-1415.patch, LUCENE-1415.patch,
>   MultiPhraseQuery.java, MultiPhraseQueryTest.java
>
>
> I found this while hunting for the cause of Solr Cache misses.
> The MultiPhraseQuery class hashCode() implementation is non-deterministic. It 
> uses termArrays.hashCode() in the computation. The contents of that ArrayList 
> are actually arrays themselves, which return their reference ID as a hashCode 
> instead of returning a hashCode which is based on the contents of the array. 
> I would suggest an implementation involving the Arrays.hashCode() method.
> I will try to submit a patch soon, off for today.
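
The problem in the quoted description can be illustrated with a short sketch. It uses String[] in place of Lucene's Term[] to stay self-contained, and TermArraysHash is a hypothetical helper, not the committed fix.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class TermArraysHash {
    /** Content-based hash: combines Arrays.hashCode of each array in list order. */
    static int contentHash(List<String[]> termArrays) {
        int h = 1;
        for (String[] terms : termArrays) {
            h = 31 * h + Arrays.hashCode(terms);
        }
        return h;
    }

    public static void main(String[] args) {
        List<String[]> a = new ArrayList<>();
        a.add(new String[] {"quick", "fast"});
        List<String[]> b = new ArrayList<>();
        b.add(new String[] {"quick", "fast"});

        // List.hashCode() delegates to each array's identity-based hashCode(),
        // so two lists with the same contents usually hash differently.
        System.out.println(a.hashCode() == b.hashCode());
        // The content-based hash agrees for equal contents.
        System.out.println(contentHash(a) == contentHash(b));
    }
}
```

The same reasoning applies to equals(): arrays compare by reference, so a content-based equals would need Arrays.equals for each element.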




[jira] Resolved: (LUCENE-1415) MultiPhraseQuery has incorrect hashCode() implementation - Leads to Solr Cache misses

2008-10-10 Thread Yonik Seeley (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yonik Seeley resolved LUCENE-1415.
--

Resolution: Fixed

Thanks, I just committed this.




[jira] Created: (LUCENE-1418) QueryParser can throw NullPointerException during parsing of some queries in case if default field passed to constructor is null

2008-10-10 Thread Alexei Dets (JIRA)
QueryParser can throw NullPointerException during parsing of some queries in 
case if default field passed to constructor is null


Key: LUCENE-1418
URL: https://issues.apache.org/jira/browse/LUCENE-1418
Project: Lucene - Java
Issue Type: Bug
Components: QueryParser
Affects Versions: 2.4
Environment: CentOS 5.2 (probably any applies)
Reporter: Alexei Dets
Priority: Minor


If a QueryParser is constructed with the "QueryParser(String f, Analyzer a)"
constructor and f is null, it can fail with a NullPointerException while
parsing some queries that _do_ contain a field name but have unbalanced
parentheses.

Example 1:
Query:  field:(expr1) expr2)
Result:
java.lang.NullPointerException
    at org.apache.lucene.index.Term.<init>(Term.java:50)
    at org.apache.lucene.index.Term.<init>(Term.java:36)
    at org.apache.lucene.queryParser.QueryParser.getFieldQuery(QueryParser.java:543)
    at org.apache.lucene.queryParser.QueryParser.Term(QueryParser.java:1324)
    at org.apache.lucene.queryParser.QueryParser.Clause(QueryParser.java:1211)
    at org.apache.lucene.queryParser.QueryParser.Query(QueryParser.java:1168)
    at org.apache.lucene.queryParser.QueryParser.TopLevelQuery(QueryParser.java:1128)
    at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:170)

Example 2:
Query:  field:(expr1) "expr2")
Result:
java.lang.NullPointerException
    at org.apache.lucene.index.Term.<init>(Term.java:50)
    at org.apache.lucene.index.Term.<init>(Term.java:36)
    at org.apache.lucene.queryParser.QueryParser.getFieldQuery(QueryParser.java:543)
    at org.apache.lucene.queryParser.QueryParser.getFieldQuery(QueryParser.java:612)
    at org.apache.lucene.queryParser.QueryParser.Term(QueryParser.java:1459)
    at org.apache.lucene.queryParser.QueryParser.Clause(QueryParser.java:1211)
    at org.apache.lucene.queryParser.QueryParser.Query(QueryParser.java:1168)
    at org.apache.lucene.queryParser.QueryParser.TopLevelQuery(QueryParser.java:1128)
    at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:170)

Workaround: pass an empty string to the constructor as the default field name.
In that case QueryParser.parse throws ParseException (the expected result,
because the query string is malformed) instead of NullPointerException.

It is not obvious to me how to fix this, so I'll describe my use case; maybe
I'm doing something completely wrong.
Basically I have a set of per-field queries entered by the user and need to
programmatically construct (after some preprocessing) one real Lucene query
combined from these user-entered per-field subqueries.
To achieve this I basically do the following (simplified a bit):

// I'll always provide a field name in the query string, as it is different
// each time and I don't have any default.
QueryParser parser = new QueryParser(null, analyzer);
BooleanQuery query = new BooleanQuery();
Query subQuery1 = parser.parse(field1 + ":(" + queryString1 + ')');
// Each operator is BooleanClause.Occur.MUST, BooleanClause.Occur.MUST_NOT
// or BooleanClause.Occur.SHOULD.
query.add(subQuery1, operator1);
Query subQuery2 = parser.parse(field2 + ":(" + queryString2 + ')');
query.add(subQuery2, operator2);
Query subQuery3 = parser.parse(field3 + ":(" + queryString3 + ')');
query.add(subQuery3, operator3);
...

IMHO either the QueryParser constructor should be changed to throw
NullPointerException/IllegalArgumentException when a null field is passed (and
the API documentation updated), or QueryParser.parse should be fixed to
throw ParseException instead of NullPointerException. Also IMHO, _public_
setField/getField methods on QueryParser (to set/get the default field) would
be a great help in use cases like mine:

// Or add a constructor taking _only_ an analyzer for such cases.
QueryParser parser = new QueryParser(null, analyzer);
BooleanQuery query = new BooleanQuery();
parser.setField(field1);
Query subQuery1 = parser.parse(queryString1);
query.add(subQuery1, operator1);
parser.setField(field2);
Query subQuery2 = parser.parse(queryString2);
query.add(subQuery2, operator2);
...
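
The fail-fast option combined with a public setter could look roughly like the sketch below. DefaultFieldHolder is a hypothetical stand-in that only models the default-field handling; it is not Lucene's QueryParser and does no parsing.

```java
import java.util.Objects;

/** Hypothetical stand-in for a parser with a mutable default field. */
final class DefaultFieldHolder {
    private String field;

    DefaultFieldHolder(String field) {
        setField(field);
    }

    /** Rejects null up front so the failure is not deferred to parse time. */
    void setField(String field) {
        this.field = Objects.requireNonNull(field, "default field must not be null");
    }

    String getField() {
        return field;
    }
}
```

With a check like this, the null is reported where it is passed in, rather than surfacing later as a NullPointerException from inside Term.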
