However, I do need to search the entire document, or else the highlighting will
sometimes be blank :-(
Thanks!
- Peter
P.S. Sorry for the many responses - I'm rushing around trying to get this
working.
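
For timing experiments with the highlighting limits discussed in this thread, here is a minimal sketch (not the actual search.rb code) that builds the select URL from a hash, so both limits can be changed per request. The helper name build_select_url and its max_analyzed_chars keyword are made up for illustration; the parameter names and values come from the quoted script below, with the facet, field-list, and term-vector parameters left out for brevity.

```ruby
require 'uri'

# Hypothetical helper (not part of the original script): builds the Solr
# select URL from a hash so the highlighting limits can be varied per call.
def build_select_url(query, max_analyzed_chars: 2_147_483_647)
  params = {
    'timeAllowed' => 5000,
    'wt' => 'ruby',
    'q' => "body:#{query}",
    'rows' => 10,
    'hl' => true,
    'hl.fl' => 'body',
    'hl.snippets' => 1,
    'hl.fragsize' => 400,
    'hl.fragmenter' => 'regex',
    'hl.regex.pattern' => "(?m)^.*\n.*\n.*$",
    'hl.regex.slop' => 1,
    # Both limits are set together here; they can also be tuned separately.
    'hl.regex.maxAnalyzedChars' => max_analyzed_chars,
    'hl.maxAnalyzedChars' => max_analyzed_chars
  }
  # encode_www_form percent-encodes keys and values and joins them with "&".
  '/solr/select?' + URI.encode_www_form(params)
end

url = build_select_url('error', max_analyzed_chars: 51_200)
puts url
```

The 51_200 in the usage line is only an example value. As noted above, a lower limit may leave the highlighting blank when the match sits deep inside a large log file, so this is mainly useful for measuring how much of the query time the highlighter accounts for.
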
On Jul 31, 2010, at 1:11 PM, Peter Spam wrote:
> Correction - it went from 17 seconds to 10 seconds - I was changing the
> hl.regex.maxAnalyzedChars the first time.
> Thanks!
>
> -Peter
>
> On Jul 31, 2010, at 1:06 PM, Peter Spam wrote:
>
>> On Jul 30, 2010, at 1:16 PM, Peter Karich wrote:
>>
>>> did you already try other values for hl.maxAnalyzedChars=2147483647
>>
>> Yes, I tried dropping it down to 21, but it didn't have much of an impact
>> (one search I just tried went from 17 seconds to 15.8 seconds, and this is
>> an 8-core Mac Pro with 6GB RAM - 4GB for java).
>>
>>> ? Also regular expression highlighting is more expensive, I think.
>>> What does the 'fuzzy' variable mean? If you use this to query via
>>> "~someTerm" instead of "someTerm"
>>> then you should try the trunk of solr which is a lot faster for fuzzy or
>>> other wildcard search.
>>
>> "fuzzy" could be set to "*" but isn't right now.
>>
>> Thanks for the tips, Peter - this has been very frustrating!
>>
>>
>> - Peter
>>
>>> Regards,
>>> Peter.
>>>
>>>> Data set: About 4,000 log files (will eventually grow to millions).
>>>> Average log file is 850k. Largest log file (so far) is about 70MB.
>>>>
>>>> Problem: When I search for common terms, the query time goes from
>>>> 2-3 seconds to about 60 seconds. TermVectors etc. are enabled. When I
>>>> disable highlighting, performance improves a lot, but is still slow for
>>>> some queries (7 seconds). Thanks in advance for any ideas!
>>>>
>>>>
>>>> -Peter
>>>>
>>>>
>>>> -------------------------------------------------------------------------------------------------------------------------------------
>>>>
>>>> 4GB RAM server
>>>> % java -Xms2048M -Xmx3072M -jar start.jar
>>>>
>>>> -------------------------------------------------------------------------------------------------------------------------------------
>>>>
>>>> schema.xml changes:
>>>>
>>>> <fieldType name="text_pl" class="solr.TextField">
>>>> <analyzer>
>>>> <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>> <filter class="solr.LowerCaseFilterFactory"/>
>>>> <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0"
>>>> generateNumberParts="0" catenateWords="0" catenateNumbers="0"
>>>> catenateAll="0" splitOnCaseChange="0"/>
>>>> </analyzer>
>>>> </fieldType>
>>>>
>>>> ...
>>>>
>>>> <field name="body" type="text_pl" indexed="true" stored="true"
>>>> multiValued="false" termVectors="true" termPositions="true"
>>>> termOffsets="true" />
>>>> <field name="timestamp" type="date" indexed="true" stored="true"
>>>> default="NOW" multiValued="false"/>
>>>> <field name="version" type="string" indexed="true" stored="true"
>>>> multiValued="false"/>
>>>> <field name="device" type="string" indexed="true" stored="true"
>>>> multiValued="false"/>
>>>> <field name="filename" type="string" indexed="true" stored="true"
>>>> multiValued="false"/>
>>>> <field name="filesize" type="long" indexed="true" stored="true"
>>>> multiValued="false"/>
>>>> <field name="pversion" type="int" indexed="true" stored="true"
>>>> multiValued="false"/>
>>>> <field name="first2md5" type="string" indexed="false" stored="true"
>>>> multiValued="false"/>
>>>> <field name="ckey" type="string" indexed="true" stored="true"
>>>> multiValued="false"/>
>>>>
>>>> ...
>>>>
>>>> <dynamicField name="*" type="ignored" multiValued="true" />
>>>> <defaultSearchField>body</defaultSearchField>
>>>> <solrQueryParser defaultOperator="AND"/>
>>>>
>>>> -------------------------------------------------------------------------------------------------------------------------------------
>>>>
>>>> solrconfig.xml changes:
>>>>
>>>> <maxFieldLength>2147483647</maxFieldLength>
>>>> <ramBufferSizeMB>128</ramBufferSizeMB>
>>>>
>>>> -------------------------------------------------------------------------------------------------------------------------------------
>>>>
>>>> The query:
>>>>
>>>> rowStr = "&rows=10"
>>>> facet =
>>>> "&facet=true&facet.limit=10&facet.field=device&facet.field=ckey&facet.field=version"
>>>> fields = "&fl=id,score,filename,version,device,first2md5,filesize,ckey"
>>>> termvectors = "&tv=true&qt=tvrh&tv.all=true"
>>>> hl = "&hl=true&hl.fl=body&hl.snippets=1&hl.fragsize=400"
>>>> regexv = "(?m)^.*\n.*\n.*$"
>>>> hl_regex = "&hl.regex.pattern=" + CGI::escape(regexv) +
>>>> "&hl.regex.slop=1&hl.fragmenter=regex&hl.regex.maxAnalyzedChars=2147483647&hl.maxAnalyzedChars=2147483647"
>>>> justq = '&q=' + CGI::escape('body:' + fuzzy + p['q'].to_s.gsub(/\\/,
>>>> '').gsub(/([:~!<>="])/,'\\\\\1') + fuzzy + minLogSizeStr)
>>>>
>>>> thequery = '/solr/select?timeAllowed=5000&wt=ruby' +
>>>> (p['fq'].empty? ? '' : ('&fq=' + p['fq'].to_s)) +
>>>> justq + rowStr + facet + fields + termvectors + hl + hl_regex
>>>>
>>>> baseurl = '/cgi-bin/search.rb?q=' + CGI::escape(p['q'].to_s) + '&rows=' +
>>>> p['rows'].to_s + '&minLogSize=' + p['minLogSize'].to_s
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> http://karussell.wordpress.com/
>>>
>>
>
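
The escaping step inside justq in the quoted script can be read as a small standalone helper. A minimal sketch, assuming the intent is to drop backslashes from the user input and then backslash-escape the characters : ~ ! < > = and "; the name escape_term is hypothetical and does not appear in the original code.

```ruby
# Hypothetical helper mirroring the gsub chain in the quoted script:
# strip backslashes from the raw input, then put a backslash in front of
# each character from the set : ~ ! < > = "
def escape_term(raw)
  raw.to_s.gsub(/\\/, '').gsub(/([:~!<>="])/) { "\\#{$1}" }
end

puts escape_term('device:foo')
puts escape_term('say "hi"')
```

Keeping this in one place makes it easier to check what actually reaches the query parser before the term is wrapped in the fuzzy prefix and suffix and passed to CGI::escape.
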