Thanks for the response, Karsten.

1) What's the recommended maximum chunk size?
2) Does my tokenizer look reasonable?


Thanks!
Pete

On Oct 21, 2011, at 2:28 AM, karsten-s...@gmx.de wrote:

> Hi Peter,
> 
> highlighting in large text files can not be fast without dividing the 
> original text in small piece.
> So take a look in
> http://xtf.cdlib.org/documentation/under-the-hood/#Chunking
> and in
> http://www.lucidimagination.com/blog/2010/09/16/2446/
> 
> Which means that you should divide your files and use
> Result Grouping / Field Collapsing
> to list only one hit per original document.
> 
> (xtf also would solve your problem "out of the box" but xtf does not use 
> solr).
> 
> Best regards
>  Karsten
> 
> -------- Original-Nachricht --------
>> Datum: Thu, 20 Oct 2011 17:59:04 -0700
>> Von: Peter Spam <ps...@mac.com>
>> An: solr-user@lucene.apache.org
>> Betreff: Can Solr handle large text files?
> 
>> I have about 20k text files, some very small, but some up to 300MB, and
>> would like to do text searching with highlighting.
>> 
>> Imagine the text is the contents of your syslog.
>> 
>> I would like to type in some terms, such as "error" and "mail", and have
>> Solr return the syslog lines with those terms PLUS two lines of context. 
>> Pretty much just like Google's highlighting.
>> 
>> 1) Can Solr handle this?  I had extremely long query times when I tried
>> this with Solr 1.4.1 (yes I was using TermVectors, etc.).  I tried breaking
>> the files into 1MB pieces, but searching would be wonky => return the wrong
>> number of documents (ie. if one file had a term 5 times, and that was the
>> only file that had the term, I want 1 result, not 5 results).  
>> 
>> 2) What sort of tokenizer would be best?  Here's what I'm using:
>> 
>>   <field name="body" type="text_pl" indexed="true" stored="true"
>> multiValued="false" termVectors="true" termPositions="true" 
>> termOffsets="true" />
>> 
>>    <fieldType name="text_pl" class="solr.TextField">
>>      <analyzer>
>>        <tokenizer class="solr.StandardTokenizerFactory"/>
>>        <filter class="solr.LowerCaseFilterFactory"/>
>>        <filter class="solr.WordDelimiterFilterFactory"
>> generateWordParts="0" generateNumberParts="0" catenateWords="0" 
>> catenateNumbers="0"
>> catenateAll="0" splitOnCaseChange="0"/>
>>      </analyzer>
>>    </fieldType>
>> 
>> 
>> Thanks!
>> Pete

Reply via email to