Well, it works if I search for just two letters, but that only tells me
that something is wrong somewhere.
The Analysis tool shows "dog" being tokenized to "do" and "og", so if I'm
using the same tokenizer/filters when indexing and querying (which is my
case) I should get results even when searching for "dog".
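Just to illustrate what I expect the tokenizer to emit, here is a rough standalone sketch (not Lucene's own implementation, and ignoring any special whitespace handling the real NGramTokenizer may do) of character bigrams with minGramSize=2 and maxGramSize=2:

```java
import java.util.ArrayList;
import java.util.List;

public class BigramSketch {
    // Emit every consecutive 2-character gram of the (lowercased) input,
    // mimicking an n-gram tokenizer with min=2 and max=2.
    static List<String> bigrams(String text) {
        List<String> grams = new ArrayList<>();
        String s = text.toLowerCase();
        for (int i = 0; i + 2 <= s.length(); i++) {
            grams.add(s.substring(i, i + 2));
        }
        return grams;
    }

    public static void main(String[] args) {
        // "dog" should come out as the two grams "do" and "og",
        // matching what the Analysis tool shows.
        System.out.println(bigrams("dog")); // [do, og]
    }
}
```

Since the query term "dog" produces the grams "do" and "og", and those same grams are produced at index time from "nice dog", a query for "dog" ought to match.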

I've just created a small unit test in solr to try that out.

    public void testNGram() throws IOException, Exception {
        assertU("adding doc with ngram field",
            adoc("id", "42", "text_ngram", "nice dog"));
        assertU("committing", commit());

        assertQ("test query, expect one document",
            req("text_ngram:dog"),
            "//result[@numFound='1']");
    }

As you can see, I'm adding a document whose text_ngram field has the value
"nice dog".
Then I commit and query for "text_ngram:dog".

text_ngram is defined in the schema as:
    <fieldtype name="ngram_field" class="solr.TextField">
      <analyzer type="index">
        <tokenizer class="solr.NGramTokenizerFactory" minGramSize="2" maxGramSize="2"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.NGramTokenizerFactory" minGramSize="2" maxGramSize="2"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldtype>

This test passes. That means I am able to get results when searching for
"dog" on an n-gram field where min and max gram sizes are set to 2 and the
field value is "nice dog".
So it doesn't seem to be an issue in Solr itself, although I'm still seeing
this error when using Solr outside the unit test. An environment issue seems
very improbable.

Maybe I am doing something wrong. Any thoughts on that?

Thanks!

Jonathan

On Wed, Jun 25, 2008 at 9:44 PM, Norberto Meijome <[EMAIL PROTECTED]>
wrote:

> On Wed, 25 Jun 2008 15:37:09 -0300
> "Jonathan Ariel" <[EMAIL PROTECTED]> wrote:
>
> > I've been trying to use the NGramTokenizer and I ran into a problem.
> > It seems like solr is trying to match documents with all the tokens that
> the
> > analyzer returns from the query term. So if I index a document with a
> title
> > field with the value "nice dog" and search for "dog" (where the
> > NGramtokenizer is defined to generate tokens of min 2 and max 2) I won't
> get
> > any results.
>
> Hi Jonathan,
> I don't have the expertise yet to have gone straight into testing code with
> lucene, but my 'black box' testing with ngramtokenizer seems to agree with
> what
> you found - see my latest posts over the last couple of days about this.
>
> Have you tried searching for 'do' or 'ni' or any search term with size =
> minGramSize ? I've found that Solr matches results just fine then.
>
> > I can see in the Analysis tool that the tokenizer generates the right
> > tokens, but then when solr searches it tries to match the exact Phrase
> > instead of the tokens.
>
> +1
>
> B
>
> _________________________
> {Beto|Norberto|Numard} Meijome
>
> "Some cause happiness wherever they go; others, whenever they go."
>  Oscar Wilde
>
> I speak for myself, not my employer. Contents may be hot. Slippery when
> wet.
> Reading disclaimers makes you go blind. Writing them is worse. You have
> been
> Warned.
>
