Advice on Exact Matching?

Scott Gonyea Thu, 30 Dec 2010 17:05:23 -0800

Hi,

I am trying to make sure that when I search for text, regardless of
what that text is, I get an exact match.  I'm *still* running into
issues, and this last mile is becoming very painful.  The Solr field
type I'm using for this is pasted below my explanation.  I would
appreciate any help.


Explanation:

I'm crawling websites with Nutch.  I'm performing some
mechanical-turk-like filtering and term matching.  The problem is,
there's some very gnarly behavior in Solr due to any number of
gotchas.

If I want to find *all* Solr documents that match
"[id]somejunk\hi[/id]" then life is instantly hell.

Likewise, runs of whitespace between words throw it off, as in " john
says hello,  how are you?".  I would love to be able to search for
these exact phrases.  If that's just not practical (I'm more than
willing to live with a bloated search index), what would some other
strategies be?

There's no MapReduce in Solr; I could attempt Hadoop streaming, but
that's far from ideal for a variety of reasons.


Solr schema.xml, fieldType "text" (no, this is not used everywhere;
only on 2 fields):


    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1"
                catenateWords="1" catenateNumbers="1" catenateAll="1"
                splitOnCaseChange="1"/>
        <filter class="solr.EnglishPorterFilterFactory"
                protected="protwords.txt"/>
        <filter class="solr.SynonymFilterFactory"
                synonyms="synonyms.txt" expand="true" ignoreCase="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
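
For contrast, here is a rough sketch of what a stripped-down field type
might look like if the filters above were dropped.  This is only an
illustration under stated assumptions, not a tested configuration: the
names "text_verbatim", "content" and "content_verbatim" are hypothetical
and do not come from my schema.  With only the whitespace tokenizer, a
token such as "[id]somejunk\hi[/id]" should be indexed as-is (no
splitting, stemming, synonyms or lowercasing), but runs of whitespace
would still collapse into token boundaries, and special characters in
the query string would still need escaping on the query side.

    <!-- Hypothetical type: whitespace tokenization only, no filters -->
    <fieldType name="text_verbatim" class="solr.TextField"
               positionIncrementGap="100">
      <analyzer>
        <!-- Splits on whitespace and leaves each token untouched -->
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      </analyzer>
    </fieldType>

    <!-- Hypothetical field names; "content" stands in for the crawled text -->
    <field name="content_verbatim" type="text_verbatim"
           indexed="true" stored="false"/>
    <copyField source="content" dest="content_verbatim"/>

Keeping this as a second field fed by copyField would leave the
existing "text" field in place for fuzzy searches, at the cost of a
larger index.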


Thank you,
Scott Gonyea
