I've read through the DataImportHandler page a few times, and still can't figure out how to separate a large document into smaller documents. Any hints? :-) Thanks!

-Peter
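P.S. Here's my best guess at the shape of the DIH config, pieced together from the wiki - completely untested, and the baseDir, file pattern, and field mappings below are all made up. Is this the right direction?

<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document>
    <!-- made-up paths/names; outer entity walks the log directory, and
         rootEntity="false" means the inner entity's rows become the
         actual Solr documents -->
    <entity name="files" processor="FileListEntityProcessor"
            baseDir="/var/log/myapp" fileName=".*\.log$"
            recursive="true" rootEntity="false" dataSource="null">
      <!-- one mini-document per line (assumes Solr 1.4+ for
           LineEntityProcessor); the shared filename field ties the
           mini-documents back to the file they came from -->
      <entity name="lines" processor="LineEntityProcessor"
              url="${files.fileAbsolutePath}"
              transformer="TemplateTransformer">
        <field column="rawLine" name="body"/>
        <field column="filename" template="${files.fileAbsolutePath}"/>
      </entity>
    </entity>
  </document>
</dataConfig>

I gather I'd still need to generate a unique id per line, and write the bit of Javascript mentioned below if I want N-line chunks instead of one document per line.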
On Aug 2, 2010, at 9:01 PM, Lance Norskog wrote:

> Spanning won't work - you would have to make overlapping mini-documents
> if you want to support this.
>
> I don't know how big the chunks should be - you'll have to experiment.
>
> Lance
>
> On Mon, Aug 2, 2010 at 10:01 AM, Peter Spam <ps...@mac.com> wrote:
>> What would happen if the search query phrase spanned separate document
>> chunks?
>>
>> Also, what would the optimal size of chunks be?
>>
>> Thanks!
>>
>> -Peter
>>
>> On Aug 1, 2010, at 7:21 PM, Lance Norskog wrote:
>>
>>> Not that I know of.
>>>
>>> The DataImportHandler has the ability to create multiple documents
>>> from one input stream. It is possible to create a DIH file that reads
>>> large log files and splits each one into N documents, with the file
>>> name as a common field. The DIH wiki page tells you in general how to
>>> make a DIH file.
>>>
>>> http://wiki.apache.org/solr/DataImportHandler
>>>
>>> From this, you should be able to make a DIH file that puts log files
>>> in as separate documents. As for splitting files up into
>>> mini-documents, you might have to write a bit of Javascript to achieve
>>> this. There is no data structure or software that implements
>>> structured documents.
>>>
>>> On Sun, Aug 1, 2010 at 2:06 PM, Peter Spam <ps...@mac.com> wrote:
>>>> Thanks for the pointer, Lance! Is there an example of this somewhere?
>>>>
>>>> -Peter
>>>>
>>>> On Jul 31, 2010, at 3:13 PM, Lance Norskog wrote:
>>>>
>>>>> Ah! You're not just highlighting, you're snippetizing. This makes it
>>>>> easier.
>>>>>
>>>>> Highlighting does not stream - it pulls the entire stored contents
>>>>> into one string and then pulls out the snippet. If you want this to
>>>>> be fast, you have to split up the text into small pieces and only
>>>>> snippetize from the most relevant text. So: separate documents, with
>>>>> a common group id pointing back to the document each piece came
>>>>> from. You might have to do 2 queries to achieve what you want, but
>>>>> the second query for the same query string will be blindingly fast -
>>>>> often <1ms.
>>>>>
>>>>> Good luck!
>>>>>
>>>>> Lance
>>>>>
>>>>> On Sat, Jul 31, 2010 at 1:12 PM, Peter Spam <ps...@mac.com> wrote:
>>>>>> However, I do need to search the entire document, or else the
>>>>>> highlighting will sometimes be blank :-(
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> - Peter
>>>>>>
>>>>>> ps. Sorry for the many responses - I'm rushing around trying to get
>>>>>> this working.
>>>>>>
>>>>>> On Jul 31, 2010, at 1:11 PM, Peter Spam wrote:
>>>>>>
>>>>>>> Correction - it went from 17 seconds to 10 seconds; I was changing
>>>>>>> hl.regex.maxAnalyzedChars the first time.
>>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>> -Peter
>>>>>>>
>>>>>>> On Jul 31, 2010, at 1:06 PM, Peter Spam wrote:
>>>>>>>
>>>>>>>> On Jul 30, 2010, at 1:16 PM, Peter Karich wrote:
>>>>>>>>
>>>>>>>>> Did you already try other values for
>>>>>>>>> hl.maxAnalyzedChars=2147483647?
>>>>>>>>
>>>>>>>> Yes, I tried dropping it down to 21, but it didn't have much of an
>>>>>>>> impact (one search I just tried went from 17 seconds to 15.8
>>>>>>>> seconds, and this is an 8-core Mac Pro with 6GB RAM - 4GB for
>>>>>>>> java).
>>>>>>>>
>>>>>>>>> Also, regular expression highlighting is more expensive, I think.
>>>>>>>>> What does the 'fuzzy' variable mean? If you use it to query via
>>>>>>>>> "~someTerm" instead of "someTerm", then you should try the trunk
>>>>>>>>> of Solr, which is a lot faster for fuzzy and other wildcard
>>>>>>>>> searches.
>>>>>>>>
>>>>>>>> "fuzzy" could be set to "*" but isn't right now.
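>>>>>>>> To spell that out (a rough sketch - "crash" is just a stand-in
>>>>>>>> term): in the query-building code below, "fuzzy" is concatenated
>>>>>>>> on both sides of the term, so:
>>>>>>>>
>>>>>>>> fuzzy = ''    # => q=body:crash (plain term query)
>>>>>>>> fuzzy = '*'   # => q=body:*crash* (double wildcard; the leading
>>>>>>>>               #    '*' in particular is slow to evaluate)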
>>>>>>>>
>>>>>>>> Thanks for the tips, Peter - this has been very frustrating!
>>>>>>>>
>>>>>>>> - Peter
>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Peter.
>>>>>>>>>
>>>>>>>>>> Data set: About 4,000 log files (will eventually grow to
>>>>>>>>>> millions). Average log file is 850k. Largest log file (so far)
>>>>>>>>>> is about 70MB.
>>>>>>>>>>
>>>>>>>>>> Problem: When I search for common terms, the query time goes
>>>>>>>>>> from under 2-3 seconds to about 60 seconds. TermVectors etc are
>>>>>>>>>> enabled. When I disable highlighting, performance improves a
>>>>>>>>>> lot, but is still slow for some queries (7 seconds). Thanks in
>>>>>>>>>> advance for any ideas!
>>>>>>>>>>
>>>>>>>>>> -Peter
>>>>>>>>>>
>>>>>>>>>> -------------------------------------------------------------------------
>>>>>>>>>>
>>>>>>>>>> 4GB RAM server
>>>>>>>>>> % java -Xms2048M -Xmx3072M -jar start.jar
>>>>>>>>>>
>>>>>>>>>> -------------------------------------------------------------------------
>>>>>>>>>>
>>>>>>>>>> schema.xml changes:
>>>>>>>>>>
>>>>>>>>>> <fieldType name="text_pl" class="solr.TextField">
>>>>>>>>>>   <analyzer>
>>>>>>>>>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>>>>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>>>     <filter class="solr.WordDelimiterFilterFactory"
>>>>>>>>>>             generateWordParts="0" generateNumberParts="0"
>>>>>>>>>>             catenateWords="0" catenateNumbers="0"
>>>>>>>>>>             catenateAll="0" splitOnCaseChange="0"/>
>>>>>>>>>>   </analyzer>
>>>>>>>>>> </fieldType>
>>>>>>>>>>
>>>>>>>>>> ...
>>>>>>>>>>
>>>>>>>>>> <field name="body" type="text_pl" indexed="true" stored="true"
>>>>>>>>>>        multiValued="false" termVectors="true"
>>>>>>>>>>        termPositions="true" termOffsets="true"/>
>>>>>>>>>> <field name="timestamp" type="date" indexed="true" stored="true"
>>>>>>>>>>        default="NOW" multiValued="false"/>
>>>>>>>>>> <field name="version" type="string" indexed="true" stored="true"
>>>>>>>>>>        multiValued="false"/>
>>>>>>>>>> <field name="device" type="string" indexed="true" stored="true"
>>>>>>>>>>        multiValued="false"/>
>>>>>>>>>> <field name="filename" type="string" indexed="true" stored="true"
>>>>>>>>>>        multiValued="false"/>
>>>>>>>>>> <field name="filesize" type="long" indexed="true" stored="true"
>>>>>>>>>>        multiValued="false"/>
>>>>>>>>>> <field name="pversion" type="int" indexed="true" stored="true"
>>>>>>>>>>        multiValued="false"/>
>>>>>>>>>> <field name="first2md5" type="string" indexed="false" stored="true"
>>>>>>>>>>        multiValued="false"/>
>>>>>>>>>> <field name="ckey" type="string" indexed="true" stored="true"
>>>>>>>>>>        multiValued="false"/>
>>>>>>>>>>
>>>>>>>>>> ...
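>>>>>>>>>> <!-- assuming the stock "ignored" field type from the example
>>>>>>>>>>      schema (indexed="false" stored="false"), the catch-all
>>>>>>>>>>      below silently drops any field not declared above -->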
>>>>>>>>>>
>>>>>>>>>> <dynamicField name="*" type="ignored" multiValued="true"/>
>>>>>>>>>> <defaultSearchField>body</defaultSearchField>
>>>>>>>>>> <solrQueryParser defaultOperator="AND"/>
>>>>>>>>>>
>>>>>>>>>> -------------------------------------------------------------------------
>>>>>>>>>>
>>>>>>>>>> solrconfig.xml changes:
>>>>>>>>>>
>>>>>>>>>> <maxFieldLength>2147483647</maxFieldLength>
>>>>>>>>>> <ramBufferSizeMB>128</ramBufferSizeMB>
>>>>>>>>>>
>>>>>>>>>> -------------------------------------------------------------------------
>>>>>>>>>>
>>>>>>>>>> The query:
>>>>>>>>>>
>>>>>>>>>> rowStr = "&rows=10"
>>>>>>>>>> facet = "&facet=true&facet.limit=10&facet.field=device" +
>>>>>>>>>>         "&facet.field=ckey&facet.field=version"
>>>>>>>>>> fields = "&fl=id,score,filename,version,device,first2md5,filesize,ckey"
>>>>>>>>>> termvectors = "&tv=true&qt=tvrh&tv.all=true"
>>>>>>>>>> hl = "&hl=true&hl.fl=body&hl.snippets=1&hl.fragsize=400"
>>>>>>>>>> regexv = "(?m)^.*\n.*\n.*$"
>>>>>>>>>> hl_regex = "&hl.regex.pattern=" + CGI::escape(regexv) +
>>>>>>>>>>            "&hl.regex.slop=1&hl.fragmenter=regex" +
>>>>>>>>>>            "&hl.regex.maxAnalyzedChars=2147483647" +
>>>>>>>>>>            "&hl.maxAnalyzedChars=2147483647"
>>>>>>>>>> justq = '&q=' + CGI::escape('body:' + fuzzy +
>>>>>>>>>>         p['q'].to_s.gsub(/\\/, '').gsub(/([:~!<>="])/, '\\\\\1') +
>>>>>>>>>>         fuzzy + minLogSizeStr)
>>>>>>>>>>
>>>>>>>>>> thequery = '/solr/select?timeAllowed=5000&wt=ruby' +
>>>>>>>>>>            (p['fq'].empty? ? '' : ('&fq=' + p['fq'].to_s)) +
>>>>>>>>>>            justq + rowStr + facet + fields + termvectors +
>>>>>>>>>>            hl + hl_regex
>>>>>>>>>>
>>>>>>>>>> baseurl = '/cgi-bin/search.rb?q=' + CGI::escape(p['q'].to_s) +
>>>>>>>>>>           '&rows=' + p['rows'].to_s +
>>>>>>>>>>           '&minLogSize=' + p['minLogSize'].to_s
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> http://karussell.wordpress.com/
>>>>>
>>>>> --
>>>>> Lance Norskog
>>>>> goks...@gmail.com
>>>
>>> --
>>> Lance Norskog
>>> goks...@gmail.com
>
> --
> Lance Norskog
> goks...@gmail.com
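For the archives, here is a rough Ruby sketch of the two-query flow Lance describes above: index each file as many small chunk documents that share a group field, run a cheap first query to find the best chunks, then ask for highlighting on only those chunks. The group_id field, the chunk ids, and the search term are all hypothetical, and the snippet is untested:

require 'cgi'

q = CGI::escape('body:timeout')   # hypothetical search term

# Pass 1: no highlighting - just find the most relevant chunk documents
first = '/solr/select?wt=ruby&rows=10&fl=id,group_id,score&q=' + q

# Pass 2: highlight only the chunks pass 1 returned; each chunk is small,
# so the highlighter has very little stored text to pull apart
chunk_ids = ['f123-chunk-07', 'f123-chunk-08']   # ids from pass 1's response
fq = CGI::escape(chunk_ids.map { |id| 'id:' + id }.join(' OR '))
second = '/solr/select?wt=ruby&hl=true&hl.fl=body&hl.snippets=1' +
         '&q=' + q + '&fq=' + fq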