Thanks for the response, Karsten. 1) What's the recommended maximum chunk size? 2) Does my tokenizer look reasonable?
Thanks!
Pete

On Oct 21, 2011, at 2:28 AM, karsten-s...@gmx.de wrote:

> Hi Peter,
>
> highlighting in large text files cannot be fast without dividing the
> original text into small pieces.
> So take a look at
> http://xtf.cdlib.org/documentation/under-the-hood/#Chunking
> and at
> http://www.lucidimagination.com/blog/2010/09/16/2446/
>
> Which means that you should divide your files and use
> Result Grouping / Field Collapsing
> to list only one hit per original document.
>
> (xtf would also solve your problem "out of the box", but xtf does not use
> Solr).
>
> Best regards
> Karsten
>
> -------- Original Message --------
>> Date: Thu, 20 Oct 2011 17:59:04 -0700
>> From: Peter Spam <ps...@mac.com>
>> To: solr-user@lucene.apache.org
>> Subject: Can Solr handle large text files?
>
>> I have about 20k text files, some very small, but some up to 300MB, and
>> would like to do text searching with highlighting.
>>
>> Imagine the text is the contents of your syslog.
>>
>> I would like to type in some terms, such as "error" and "mail", and have
>> Solr return the syslog lines with those terms PLUS two lines of context.
>> Pretty much just like Google's highlighting.
>>
>> 1) Can Solr handle this? I had extremely long query times when I tried
>> this with Solr 1.4.1 (yes, I was using TermVectors, etc.). I tried breaking
>> the files into 1MB pieces, but searching would be wonky => return the wrong
>> number of documents (i.e. if one file had a term 5 times, and that was the
>> only file that had the term, I want 1 result, not 5 results).
>>
>> 2) What sort of tokenizer would be best? Here's what I'm using:
>>
>> <field name="body" type="text_pl" indexed="true" stored="true"
>> multiValued="false" termVectors="true" termPositions="true"
>> termOffsets="true" />
>>
>> <fieldType name="text_pl" class="solr.TextField">
>> <analyzer>
>> <tokenizer class="solr.StandardTokenizerFactory"/>
>> <filter class="solr.LowerCaseFilterFactory"/>
>> <filter class="solr.WordDelimiterFilterFactory"
>> generateWordParts="0" generateNumberParts="0" catenateWords="0"
>> catenateNumbers="0"
>> catenateAll="0" splitOnCaseChange="0"/>
>> </analyzer>
>> </fieldType>
>>
>>
>> Thanks!
>> Pete
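
For anyone following the thread, the chunk-and-group idea from Karsten's reply could be sketched roughly as below. This is only an illustration under assumptions, not a tested setup: the `file_id` and `chunk_no` field names, the `chunk_lines` and `grouped_query_params` helpers, and the default chunk size of 1000 lines are all made up for the example (the thread does not settle on a recommended chunk size). The `group`, `group.field` and `group.limit` parameters belong to Solr's Result Grouping feature, which as far as I know needs a newer Solr release than 1.4.1.

```python
from urllib.parse import urlencode


def chunk_lines(file_id, lines, chunk_size=1000, overlap=2):
    """Split one file's lines into overlapping chunk documents.

    Each chunk repeats the last `overlap` lines of the previous chunk, so a
    hit near a chunk boundary still has some context in the same document.
    chunk_size=1000 is an arbitrary placeholder, not a recommendation.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    docs = []
    step = chunk_size - overlap
    for chunk_no, start in enumerate(range(0, len(lines), step)):
        docs.append({
            "id": f"{file_id}#{chunk_no}",  # unique key per chunk
            "file_id": file_id,             # hypothetical field to group on
            "chunk_no": chunk_no,           # hypothetical ordering field
            "body": "\n".join(lines[start:start + chunk_size]),
        })
        # Stop once this chunk already reaches the end of the file.
        if start + chunk_size >= len(lines):
            break
    return docs


def grouped_query_params(terms):
    """Build a query string that asks for one group per original file.

    Terms are used as-is; real input would need query-syntax escaping.
    """
    return urlencode({
        "q": " AND ".join(f"body:{t}" for t in terms),
        "group": "true",            # enable Result Grouping
        "group.field": "file_id",   # collapse chunks of the same file
        "group.limit": 1,           # show one chunk per file
        "hl": "true",               # highlighting on the chunk text
        "hl.fl": "body",
    })


if __name__ == "__main__":
    sample = [f"line {i}" for i in range(10)]
    for doc in chunk_lines("syslog", sample, chunk_size=4, overlap=2):
        print(doc["id"], repr(doc["body"]))
    print(grouped_query_params(["error", "mail"]))
```

With this layout each chunk is its own Solr document, so highlighting only has to process a small `body`, and grouping on `file_id` is what would turn "5 results" for one file back into 1. The field used for grouping would presumably need to be a single-valued, indexed string field in the schema.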