Re: Word Delimiter struggles

Shalin Shekhar Mangar Sat, 17 Jan 2009 02:09:25 -0800

Hi Dave,

A quick experiment found that the following two fieldtypes work with your
queries. Declare a field of each type, populate one from the other with a
copyField, and search on both:


<fieldtype name="wdf_wordparts" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="0" catenateNumbers="0" catenateAll="0"
            splitOnCaseChange="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldtype>

<fieldtype name="wdf_catenatewords" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="0" generateNumberParts="0"
            catenateWords="1" catenateNumbers="1" catenateAll="1"
            splitOnCaseChange="1" preserveOriginal="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldtype>
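
As a rough sketch of the copyField wiring mentioned above, the schema.xml
entries could look something like the following. The field names "title"
and "title_catenated" are assumptions made up for illustration (they are not
taken from Dave's schema), and the indexed/stored settings are only one
plausible choice:

```xml
<!-- Hypothetical field names; adjust to the real schema. -->
<!-- Primary field: word parts plus the original token. -->
<field name="title" type="wdf_wordparts" indexed="true" stored="true"/>

<!-- Secondary field: catenated forms only; filled by the copyField below. -->
<field name="title_catenated" type="wdf_catenatewords" indexed="true" stored="false"/>

<!-- Copy the raw input of "title" into "title_catenated" at index time. -->
<copyField source="title" dest="title_catenated"/>
```

A query would then name both fields, in the same style as the test queries
in this mail, e.g. title:PHPGroupWare title_catenated:PHPGroupWare.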

I added the following test to TestWordDelimiterFilter.java:

public void testDave() {

    assertU(adoc("id", "191",
            "wdf_preserve", "phpGroupWare"));
    assertU(commit());

    assertQ("preserving original word",
            req("wdf_preserve:PHPGroupWare")
            , "//resu...@numfound=1]"
    );

    assertQ("preserving original word",
            req("wdf_wordparts:phpGroupWare wdf_catenatewords:phpGroupWare")
            , "//resu...@numfound=1]"
    );

    assertQ("preserving original word",
            req("wdf_wordparts:PHPGroupware wdf_catenatewords:PHPGroupware")
            , "//resu...@numfound=1]"
    );
    assertQ("preserving original word",
            req("wdf_wordparts:phpGroupware wdf_catenatewords:phpGroupware")
            , "//resu...@numfound=1]"
    );
    assertQ("preserving original word",
            req("wdf_wordparts:phpgroupware wdf_catenatewords:phpgroupware")
            , "//resu...@numfound=1]"
    );

    assertQ("preserving original word",
            req("wdf_wordparts:(php groupware) wdf_catenatewords:(php groupware)")
            , "//resu...@numfound=1]"
    );

    assertQ("preserving original word",
            req("wdf_wordparts:(php group ware) wdf_catenatewords:(php group ware)")
            , "//resu...@numfound=1]"
    );

    assertQ("preserving original word",
            req("wdf_wordparts:(PHPGroup ware) wdf_catenatewords:(PHPGroup ware)")
            , "//resu...@numfound=1]"
    );

  }

I'll let someone else comment if there is an easier way to do this (without
two fields).

On Sat, Jan 17, 2009 at 3:06 PM, Shalin Shekhar Mangar <
shalinman...@gmail.com> wrote:

> Sorry I typed without thinking too much. Please disregard my previous mail.
>
> I'll run a few tests and let you know.
>
>
> On Sat, Jan 17, 2009 at 2:46 PM, Shalin Shekhar Mangar <
> shalinman...@gmail.com> wrote:
>
>> Hi Dave,
>>
>> There is an attribute on WordDelimiterFilterFactory, preserveOriginal="true",
>> which should keep the original string. I think if you put LowerCaseFilter
>> before WordDelimiterFilterFactory with the preserveOriginal setting, it
>> should do what you have outlined.
>>
>>
>> On Sat, Jan 17, 2009 at 8:57 AM, David Shettler <dshett...@gmail.com> wrote:
>>
>>> This has likely been covered, and I've tried searching through the
>>> archives, but having trouble finding an answer.
>>>
>>> On OSVDB.org, if you search for:
>>>
>>> title:PHPGroupWare
>>>
>>> You get...nothing
>>>
>>> if you search for:
>>>
>>> title:phpGroupWare
>>>
>>> (which is how the entry is indexed originally), you get a match of
>>> course.
>>>
>>> same with phpgroupware
>>>
>>> If I get rid of word delimiter, then things are fine, unless you want
>>> to search for PHP GroupWare and get a match...
>>>
>>> Basically, I need to get a match on any of these searches:
>>>
>>> PHPGroupWare
>>> PHPGroupware
>>> phpGroupware
>>> phpGroupWare
>>> phpgroupware
>>> php groupware
>>> php group ware
>>> PHPGroup ware
>>>
>>> etc.
>>>
>>> We've been dealing with this problem for about 36 months now, but
>>> there has to be a better way...or am I dreaming? :)
>>>
>>> Can anyone suggest a schema that would accommodate this?  I've
>>> tried every combination of word delimiter that I can think of, but I'm
>>> no expert on the topic.
>>>
>>> I can also manipulate input prior to search and indexing if you can
>>> think of a way there.  It's wanting the best of select from LIKE, and
>>> solr's voodoo...perhaps I'm wanting too much!
>>>
>>> Cheers,
>>>
>>> Dave
>>> OSVDB.org
>>>
>>
>>
>>
>> --
>> Regards,
>> Shalin Shekhar Mangar.
>>
>
>
>
> --
> Regards,
> Shalin Shekhar Mangar.
>



-- 
Regards,
Shalin Shekhar Mangar.
