lucene performance questions

2010-05-18 Thread Ravi Patel

 

I have a bunch of fields that hold single values, such as "date", "id", and
"flagged".

I've noticed that if I index them as Tokenized, my queries are much faster than
if they are UnTokenized.

In my queries, I'm using a BooleanQuery or a RangeFilter/Query, and I'm
querying, sorting, and filtering based on these values.

Example uses:

SortField minuteSort = new SortField("date", SortField.STRING, reverse);

filter = new RangeFilter("id", lowerId, upperId, false, false);

booleanQuery.Add(new TermQuery(new Term("flagged", "true")), 
BooleanClause.Occur.MUST_NOT);

 

Two Questions:

1.  Is there a search-time cost to making fields Tokenized when they don't need
to be?  I assume there's a cost at index time, but I'm not too worried about
that.

2.  Should the fields used in my three example lines above be Tokenized?  If
not, why am I seeing a huge performance difference when they are UnTokenized?
I'm really not running any queries that require analysis of these fields; they
only need to be indexed as-is.
  

[jira] Commented: (LUCENENET-54) ArgumentOurOfRangeException caused by SF.Snowball.Ext.DanishStemmer

2010-05-18 Thread Jason Fitzharris (JIRA)

[ https://issues.apache.org/jira/browse/LUCENENET-54?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12868685#action_12868685 ]

Jason Fitzharris commented on LUCENENET-54:
---

I encountered the same issue when using the Finnish stemmer. The problem is
similar to LUCENENET-102: Java and .NET define substring differently.
Java uses

string.substring(firstIndex, lastIndex)

whereas .NET uses

string.Substring(startIndex, length)

The solution is to change line 467 in SnowballProgram.slice_to from

s.Append(current.ToString(bra, ket));

to

s.Append(current.ToString(bra, len));

len is an existing but unused variable which is declared as

int len = ket - bra;
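
To make the difference concrete, here is a minimal sketch in plain Java. It is
not the Lucene.Net source: the class SliceDemo and the helpers sliceJava and
sliceDotNet are hypothetical names, and sliceDotNet only emulates the
(startIndex, length) convention so that both conventions can be compared side
by side. It shows why passing ket where a length is expected runs past the end
of the string, while passing ket - bra does not.

```java
final class SliceDemo {
    // Java convention: (beginIndex, endIndex), with endIndex exclusive.
    static String sliceJava(CharSequence current, int bra, int ket) {
        return current.subSequence(bra, ket).toString();
    }

    // Emulation of the .NET convention: (startIndex, length).
    // Assumption: this mimics the range check only; it is not the .NET implementation.
    static String sliceDotNet(String current, int startIndex, int length) {
        if (startIndex < 0 || length < 0 || startIndex + length > current.length()) {
            throw new StringIndexOutOfBoundsException(
                "Index and length must refer to a location within the string.");
        }
        return current.substring(startIndex, startIndex + length);
    }

    public static void main(String[] args) {
        String current = "stemming";
        int bra = 4;
        int ket = 8;
        int len = ket - bra;

        // Both calls select the same characters when the right second argument is used.
        System.out.println(sliceJava(current, bra, ket));
        System.out.println(sliceDotNet(current, bra, len));

        // Passing ket as if it were a length asks for 8 characters starting at index 4.
        try {
            sliceDotNet(current, bra, ket);
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("out of range: " + e.getMessage());
        }
    }
}
```

The last call fails because 4 + 8 exceeds the string length of 8, which matches
the "Parameter name: length" failure in the stack trace quoted below.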


> ArgumentOurOfRangeException caused by SF.Snowball.Ext.DanishStemmer
> ---
>
> Key: LUCENENET-54
> URL: https://issues.apache.org/jira/browse/LUCENENET-54
> Project: Lucene.Net
> Issue Type: Bug
> Environment: Windows XP SP2, lucene.net v2.0 004
> Reporter: Torsten Rendelmann
> Assignee: George Aroush
> Priority: Critical
>
> Exception Information
> System.SystemException: System.Reflection.TargetInvocationException: Exception has been thrown by the target of an invocation. ---> System.ArgumentOutOfRangeException: Index and length must refer to a location within the string.
> Parameter name: length
>    at System.String.Substring(Int32 startIndex, Int32 length)
>    at System.Text.StringBuilder.ToString(Int32 startIndex, Int32 length)
>    at SF.Snowball.SnowballProgram.slice_to(StringBuilder s)
>    at SF.Snowball.Ext.DanishStemmer.r_undouble()
>    at SF.Snowball.Ext.DanishStemmer.Stem()
>    --- End of inner exception stack trace ---
>    at System.Reflection.RuntimeMethodInfo.InternalInvoke(Object obj, BindingFlags invokeAttr, Binder binder, Object[] parameters, CultureInfo culture, Boolean isBinderDefault, Assembly caller, Boolean verifyAccess)
>    at System.Reflection.RuntimeMethodInfo.InternalInvoke(Object obj, BindingFlags invokeAttr, Binder binder, Object[] parameters, CultureInfo culture, Boolean verifyAccess)
>    at System.Reflection.RuntimeMethodInfo.Invoke(Object obj, BindingFlags invokeAttr, Binder binder, Object[] parameters, CultureInfo culture)
>    at System.Reflection.MethodBase.Invoke(Object obj, Object[] parameters)
>    at System.Reflection.MethodInfo.Invoke(Object obj, Object[] parameters)
>    at Lucene.Net.Analysis.Snowball.SnowballFilter.Next()
>    at Lucene.Net.Analysis.Snowball.SnowballFilter.Next()
>    at Lucene.Net.Index.DocumentWriter.InvertDocument(Document doc)
>    at Lucene.Net.Index.DocumentWriter.AddDocument(String segment, Document doc)
>    at Lucene.Net.Index.IndexWriter.AddDocument(Document doc, Analyzer analyzer)




RE: lucene performance questions

2010-05-18 Thread Digy
Whether you tokenize them or not, there shouldn't be any difference in search
performance (apart from parsing the few words of the user's query).
Is this some kind of XY problem
(http://dictionary.babylon.com/xy%20problem/)?

DIGY



-Original Message-
From: Ravi Patel [mailto:rpat...@live.com] 
Sent: Tuesday, May 18, 2010 4:34 PM
To: lucene-net-dev@lucene.apache.org
Subject: lucene performance questions


 

  