Re: Proper analyzer / tokenizer for syslog data?

Peter Spam Fri, 04 Nov 2011 16:36:45 -0700

Wow, I tried with minGramSize=1 and maxGramSize=1000 (I want someone to be able 
to search on any substring, just like "grep"), and the index is multiple orders 
of magnitude larger than my data!
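
A rough sketch of why the index grows so fast, assuming a token of n characters goes through NGramFilterFactory with minGramSize=1 and maxGramSize at least n (n is an illustrative symbol here, not something measured on this data). The filter emits one gram for every start position and every length, so the gram count and the total gram characters per token are:

```latex
\[
\#\text{grams}(n) = \sum_{k=1}^{n} (n-k+1) = \frac{n(n+1)}{2},
\qquad
\#\text{chars}(n) = \sum_{k=1}^{n} k\,(n-k+1) = \frac{n(n+1)(n+2)}{6}.
\]
```

For the 15-character string hello_there=50; taken as a single whitespace-delimited token, that is 120 grams totalling 680 characters, roughly 45 times the token's own length, before any per-term index overhead. With gram sizes capped at m, a token yields at most n*m grams, so a small maxGramSize keeps the growth linear in the token length.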


There's got to be a better way to support full grep-like searching?


Thanks!
Pete

On Nov 4, 2011, at 1:20 AM, Ahmet Arslan wrote:

>> Example data:
>> 01/23/2011 05:12:34 [Test] a=1; hello_there=50; data=[1,5,30%];
>> 
>> I would love to be able to just "grep" the data, i.e., if I
>> search for "ello", it finds and returns "ello", and if I
>> search for "hello_there=5", it matches too.
>> 
>> Here's what I'm using now:
>> 
>>    <fieldType name="text_sy" class="solr.TextField">
>>      <analyzer>
>>        <tokenizer class="solr.StandardTokenizerFactory"/>
>>        <filter class="solr.LowerCaseFilterFactory"/>
>>        <filter class="solr.WordDelimiterFilterFactory"
>>                generateWordParts="0" generateNumberParts="0"
>>                catenateWords="0" catenateNumbers="0" catenateAll="0"
>>                splitOnCaseChange="0"/>
>>      </analyzer>
>>    </fieldType>
>> 
>> The problem with this is that if I search for a substring,
>> I don't get anything back.  For example, searching for
>> "ello" or "*ello*" doesn't return anything.  Any ideas?
>> 
>> http://localhost:8983/solr/select?q=*ello*&start=0&rows=50&hl.maxAnalyzedChars=2147483647&hl.useFastVectorHighlighter=true&hl=true&hl.fl=body&hl.snippets=1&hl.fragsize=400
> 
> For substring matching, NGramFilterFactory is required at index time.
> 
> <filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="15"/>
> 
> Plus you may want to use WhitespaceTokenizerFactory instead of 
> StandardTokenizerFactory. The Analysis admin page displays the behavior of 
> each tokenizer.
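
Putting the two suggestions above together, a field type might look roughly like the sketch below. This is a minimal sketch under assumptions, not a tested configuration: the name text_sy_ngram is hypothetical, and splitting the analyzer into index and query halves (so that only indexed text is n-grammed) is an assumption about how the advice would be applied, not something stated in the thread.

```xml
<!-- Hypothetical field type name; adjust to your schema. -->
<fieldType name="text_sy_ngram" class="solr.TextField">
  <!-- Index time: split on whitespace, lowercase, then emit every
       substring of 1 to 15 characters of each token. -->
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="15"/>
  </analyzer>
  <!-- Query time: no n-gram filter, so a query such as ello is looked up
       as a single term and matches the gram stored at index time. -->
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With this layout a query term longer than maxGramSize has no matching gram in the index, so the cap should be at least as long as the longest substring you expect people to search for ("hello_there=5" is 13 characters, which fits under 15).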
