While testing my code I discovered that my copyField with
PatternTokenizerFactory does not do what I want.  This is what I am
indexing into Solr:

<field name="title">2.0|Solr In Action</field>

My copyField is simply:

   <copyField source="title" dest="titleRaw"/>

The field titleRaw is of type title_raw:

    <fieldType name="title_raw" class="solr.TextField">
      <analyzer type="index">
        <tokenizer class="solr.PatternTokenizerFactory" pattern="[^|]*\|(.*)" group="1"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
      </analyzer>
    </fieldType>

For my example input, "Solr In Action" is indexed into the titleRaw field
without the payload, but the stored value still includes the payload.  So
when I retrieve the field titleRaw I still get back "2.0|Solr In Action",
where what I really want is just "Solr In Action".
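
To illustrate the difference between the indexed token and the stored value,
here is a minimal standalone sketch using plain java.util.regex rather than
Solr itself.  It assumes the pattern uses the same '|' delimiter as the
example document; the class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PatternGroupDemo {
    // Returns capture group 1 of the first match, or null when nothing matches.
    // This only approximates what a pattern tokenizer with group="1" would emit.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String raw = "2.0|Solr In Action";
        // Analysis only changes the token that goes into the index ...
        System.out.println("indexed token: " + firstGroup("[^|]*\\|(.*)", raw));
        // ... while the stored value is the raw input, payload included.
        System.out.println("stored value:  " + raw);
    }
}
```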

Is it possible to have the copyField strip off the payload while it is
copying, since doing it in the analysis phase is too late?  Or should I
start looking into using UpdateProcessors as Chris had suggested?
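
For reference, this is roughly what I understand the two-step idea to be
(clone the field, then do a regex mutation on the copy).  It is only a sketch
on a plain Map, not the Solr UpdateProcessor API; the class name, method
names, and the document-as-Map representation are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class StripPayloadSketch {
    // Matches everything up to and including the first '|' delimiter.
    static final Pattern PAYLOAD_PREFIX = Pattern.compile("^[^|]*\\|");

    // Step 1: copy the raw value of one field into another.
    static void cloneField(Map<String, String> doc, String source, String dest) {
        String value = doc.get(source);
        if (value != null) {
            doc.put(dest, value);
        }
    }

    // Step 2: rewrite a field value in place with a regex replacement.
    static void regexReplace(Map<String, String> doc, String field,
                             Pattern pattern, String replacement) {
        String value = doc.get(field);
        if (value != null) {
            doc.put(field, pattern.matcher(value).replaceFirst(replacement));
        }
    }

    // Runs both steps on a document that has only a title field.
    static Map<String, String> process(String title) {
        Map<String, String> doc = new HashMap<>();
        doc.put("title", title);
        cloneField(doc, "title", "titleRaw");
        regexReplace(doc, "titleRaw", PAYLOAD_PREFIX, "");
        return doc;
    }

    public static void main(String[] args) {
        System.out.println(process("2.0|Solr In Action"));
    }
}
```

Because the stripping happens before the value reaches the field, the stored
value of the copy would no longer carry the payload, while title keeps it.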

Bill

On Fri, Aug 21, 2009 at 12:04 PM, Bill Au <bill.w...@gmail.com> wrote:

> I ended up not using an XML attribute for the payload since I need to
> return the payload in query response.  So I ended up going with:
>
> <field name="title">2.0|Solr In Action</field>
>
> My payload is numeric so I can pick a non-numeric delimiter (i.e. '|').
> Putting the payload in front means I don't have to worry about the delimiter
> appearing in the value.  The payload is required in my case so I can simply
> look for the first occurrence of the delimiter and ignore the possibility of
> the delimiter appearing in the value.
>
> I ended up writing a custom Tokenizer and a copy field with a
> PatternTokenizerFactory to filter out the delimiter and payload.  That is
> straightforward in terms of implementation.  On top of that I can still use
> the CSV loader, which I really like because of its speed.
>
> Bill.
>
> On Thu, Aug 20, 2009 at 10:36 PM, Chris Hostetter <
> hossman_luc...@fucit.org> wrote:
>
>>
>> : of the field are correct but the delimiter and payload are stored so they
>> : appear in the response also.  Here is an example:
>>         ...
>> : I am thinking maybe I can do this instead when indexing:
>> :
>> : XML for indexing:
>> : <field name="title" payload="2.0">Solr In Action</field>
>> :
>> : This will simplify indexing as I don't have to repeat the payload for each
>>
>> but now you're into a custom request handler for the updates to deal with
>> the custom XML attribute so you can't use DIH, or CSV loading.
>>
>> It seems like it might be simpler to have two new (generic)
>> UpdateProcessors: one that can clone fieldA into fieldB, and one that can
>> do regex mutations on fieldB ... neither needs to know about payloads at
>> all, but the first can make a copy of "2.0|Solr In Action" and the second
>> can strip off the "2.0|" from the copy.
>>
>> then you can write a new NumericPayloadRegexTokenizer that takes in two
>> regex expressions -- one that knows how to extract the payload from a
>> piece of input, and one that specifies the tokenization.
>>
>> those three classes seem easier to implement, easier to maintain, and more
>> generally reusable than a custom xml request handler for your updates.
>>
>>
>> -Hoss
>>
>>
>
