lucene performance questions
I have a number of fields that hold single values, such as "date", "id" and "flagged". I've noticed that if I index them Tokenized, my queries are much faster than if they are UnTokenized. In my query I'm using a BooleanQuery or RangeFilter/Query, and I query/sort/filter on these values. Example uses:

SortField minuteSort = new SortField("date", SortField.STRING, reverse);
filter = new RangeFilter("id", lowerId, upperId, false, false);
booleanQuery.Add(new TermQuery(new Term("flagged", "true")), BooleanClause.Occur.MUST_NOT);

Two questions:

1. Is there a cost at search time in making fields Tokenized that don't need to be? I assume there's a cost at index time, but I'm not too worried about the index cost.

2. Should the fields used in my three example lines above be Tokenized? If not, why am I seeing a huge performance difference when they are UnTokenized? I'm not running any queries that require analysis on these fields; I only need them indexed as-is.
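For reference, a minimal sketch (my own, not from the original mail) of how such single-value fields would typically be indexed UnTokenized in Lucene.Net 2.x. The index path, analyzer and example values (a yyyyMMdd date, a zero-padded id) are assumptions, and the exact enum names (Field.Index.UN_TOKENIZED vs. NOT_ANALYZED) differ between releases:

using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;

class IndexingSketch
{
    static void Main()
    {
        // "index-path" is a placeholder; true = create a new index there.
        Directory dir = FSDirectory.GetDirectory("index-path", true);
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);

        Document doc = new Document();
        // Single-value fields indexed as-is: one term per field per document.
        doc.Add(new Field("date", "20100518", Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.Add(new Field("id", "0000012345", Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.Add(new Field("flagged", "true", Field.Store.NO, Field.Index.UN_TOKENIZED));
        writer.AddDocument(doc);

        writer.Close();
        dir.Close();
    }
}

Indexed this way, each document contributes exactly one term per field, which is what string sorting, RangeFilter and the TermQuery on "flagged" expect; the zero padding on "id" is only there so the lexicographic range comparison lines up with numeric order.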
[jira] Commented: (LUCENENET-54) ArgumentOurOfRangeException caused by SF.Snowball.Ext.DanishStemmer
[ https://issues.apache.org/jira/browse/LUCENENET-54?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12868685#action_12868685 ]

Jason Fitzharris commented on LUCENENET-54:
-------------------------------------------

I encountered the same issue when using the Finnish stemmer. The problem is similar to LUCENENET-102, as Java and .NET define substring extraction differently: Java uses string.substring(firstIndex, lastIndex), whereas .NET uses string.Substring(startIndex, length).

The solution is to change line 467 in SnowballProgram.slice_to from

s.Append(current.ToString(bra, ket));

to

s.Append(current.ToString(bra, len));

len is an existing but unused variable, declared as int len = ket - bra;

> ArgumentOurOfRangeException caused by SF.Snowball.Ext.DanishStemmer
> -------------------------------------------------------------------
>
> Key: LUCENENET-54
> URL: https://issues.apache.org/jira/browse/LUCENENET-54
> Project: Lucene.Net
> Issue Type: Bug
> Environment: Windows XP SP2, lucene.net v2.0 004
> Reporter: Torsten Rendelmann
> Assignee: George Aroush
> Priority: Critical
>
> Exception Information
> System.SystemException: System.Reflection.TargetInvocationException:
> Exception has been thrown by the target of an invocation. --->
> System.ArgumentOutOfRangeException: Index and length must refer to a location within the string.
> Parameter name: length
>    at System.String.Substring(Int32 startIndex, Int32 length)
>    at System.Text.StringBuilder.ToString(Int32 startIndex, Int32 length)
>    at SF.Snowball.SnowballProgram.slice_to(StringBuilder s)
>    at SF.Snowball.Ext.DanishStemmer.r_undouble()
>    at SF.Snowball.Ext.DanishStemmer.Stem()
>    --- End of inner exception stack trace ---
>    at System.Reflection.RuntimeMethodInfo.InternalInvoke(Object obj, BindingFlags invokeAttr, Binder binder, Object[] parameters, CultureInfo culture, Boolean isBinderDefault, Assembly caller, Boolean verifyAccess)
>    at System.Reflection.RuntimeMethodInfo.InternalInvoke(Object obj, BindingFlags invokeAttr, Binder binder, Object[] parameters, CultureInfo culture, Boolean verifyAccess)
>    at System.Reflection.RuntimeMethodInfo.Invoke(Object obj, BindingFlags invokeAttr, Binder binder, Object[] parameters, CultureInfo culture)
>    at System.Reflection.MethodBase.Invoke(Object obj, Object[] parameters)
>    at System.Reflection.MethodInfo.Invoke(Object obj, Object[] parameters)
>    at Lucene.Net.Analysis.Snowball.SnowballFilter.Next()
>    at Lucene.Net.Analysis.Snowball.SnowballFilter.Next()
>    at Lucene.Net.Index.DocumentWriter.InvertDocument(Document doc)
>    at Lucene.Net.Index.DocumentWriter.AddDocument(String segment, Document doc)
>    at Lucene.Net.Index.IndexWriter.AddDocument(Document doc, Analyzer analyzer)

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
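For anyone who wants to see the difference the patch addresses in isolation, here is a small stand-alone C# sketch; the bra/ket values below are made up, while in SnowballProgram they are the stemmer's slice markers:

using System;
using System.Text;

class SliceToFixDemo
{
    static void Main()
    {
        StringBuilder current = new StringBuilder("overdoubling");
        int bra = 4, ket = 8; // hypothetical slice markers

        // Java:  current.substring(bra, ket)   -> the characters in [bra, ket)
        // .NET:  current.ToString(bra, length) -> 'length' characters starting at 'bra'
        int len = ket - bra;  // the already-declared but unused variable the patch reuses

        // Passing 'ket' (an end index) where .NET expects a length can run past the
        // end of the buffer and throw ArgumentOutOfRangeException, as in the trace above.
        string slice = current.ToString(bra, len); // the corrected call: "doub"
        Console.WriteLine(slice);
    }
}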
RE: lucene performance questions
Whether you tokenize them or not, there shouldn't be any performance change (ignoring the parsing of the few words in the user's query). Is this some kind of XY problem (http://dictionary.babylon.com/xy%20problem/)?

DIGY

-----Original Message-----
From: Ravi Patel [mailto:rpat...@live.com]
Sent: Tuesday, May 18, 2010 4:34 PM
To: lucene-net-dev@lucene.apache.org
Subject: lucene performance questions
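Either way, the search side looks the same. Here is a sketch (my own, in Lucene.Net 2.x style) that puts the three example lines from the original mail together; the searcher parameter, the MatchAllDocsQuery base clause and the zero-padded id bounds are assumptions, since a BooleanQuery containing only a MUST_NOT clause matches nothing and RangeFilter compares terms as plain strings:

using System;
using Lucene.Net.Index;
using Lucene.Net.Search;

class SearchSketch
{
    static void Run(IndexSearcher searcher, bool reverse, string lowerId, string upperId)
    {
        // Sort by the single-term "date" field, as in the original mail.
        SortField minuteSort = new SortField("date", SortField.STRING, reverse);
        Sort sort = new Sort(minuteSort);

        // String range over zero-padded ids, e.g. "0000001000".."0000002000";
        // without padding, lexicographic order does not match numeric order.
        Filter filter = new RangeFilter("id", lowerId, upperId, false, false);

        // Match everything except flagged documents.
        BooleanQuery booleanQuery = new BooleanQuery();
        booleanQuery.Add(new MatchAllDocsQuery(), BooleanClause.Occur.MUST);
        booleanQuery.Add(new TermQuery(new Term("flagged", "true")), BooleanClause.Occur.MUST_NOT);

        Hits hits = searcher.Search(booleanQuery, filter, sort);
        Console.WriteLine("matches: " + hits.Length());
    }
}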