My language might be a bit off (I am saying "string" when I probably mean
"text" in the context of solr), but I'm pretty sure that my story is
unwavering ;)

`id` int(11) NOT NULL AUTO_INCREMENT
`created` int(10)
`data` varbinary(255)
`user_id` int(11)

So, imagine that 1000 entries come in where "data" above is exactly the
same for all 1000 entries, but user_id is different (id and created
being different is irrelevant).  I am thinking that, prior to inserting
into MySQL, I should be able to concatenate the user_ids together with
whitespace and then insert them into something like:

`id` int(11) NOT NULL AUTO_INCREMENT
`created` int(10)
`data` varbinary(255)
`user_id` blob
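The concatenation step could be sketched like this (a hypothetical helper, not the actual ingest code): rows sharing the same data payload are collapsed into one row whose user_id column holds the whitespace-joined ids, ready for the blob column.

```python
# Sketch: collapse rows that share the same "data" payload into one row
# whose user_id column holds all user ids joined by whitespace.
from collections import defaultdict

def compact(rows):
    """rows: iterable of (created, data, user_id) tuples."""
    grouped = defaultdict(list)
    for created, data, user_id in rows:
        grouped[(created, data)].append(str(user_id))
    return [(created, data, " ".join(ids))
            for (created, data), ids in grouped.items()]

rows = [(1370000000, b"more", 2002),
        (1370000000, b"more", 15000),
        (1370000000, b"more", 45)]
print(compact(rows))
```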

Then on Solr's end it will treat the user_id as text and parse it (I want
to say tokenize, but maybe my language is incorrect here?).

Then when I search

user_id:2002+AND+created:[${from}+TO+${until}]+data:"more"

I want to be sure that if I look for user_id "2002", I will get data that
only has a value "2002" in the user_id column and that a separate user with
id "20" cannot accidentally pull data for user_id "2002" as a result of a
fuzzy (my language ok?) match of 20 against (20)02.
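To make that concern concrete, whitespace tokenization can be modeled with a plain split: a term query matches whole tokens only, so "20" can never match the token "2002". This toy check only mimics (does not invoke) Solr's WhitespaceTokenizerFactory:

```python
# Toy model of whitespace tokenization: the field value is split on
# whitespace, and a term query matches only complete tokens.
def token_match(field_value: str, term: str) -> bool:
    return term in field_value.split()

print(token_match("2002 15000 45", "2002"))  # True: whole token
print(token_match("2002 15000 45", "20"))    # False: prefix of a token
```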

Current schema definition:

    <field name="user_id" type="int" indexed="true" stored="true"/>

New schema definition:

    <field name="user_id" type="user_id_string" indexed="true" stored="true"/>
...
    <fieldType name="user_id_string" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory" maxTokenLength="120"/>
      </analyzer>
    </fieldType>
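As a side note on the query shown earlier: the `+`-separated form is just URL encoding, so it can be generated rather than hand-built. A sketch (the endpoint path and timestamp values here are made up):

```python
from urllib.parse import urlencode

# Placeholder timestamps; urlencode percent-escapes the colons, brackets,
# and quotes, and turns spaces into "+", matching the hand-written query.
q = 'user_id:2002 AND created:[1370000000 TO 1370600000] data:"more"'
print("/solr/select?" + urlencode({"q": q}))
```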

I am obviously not a 1337 solr haxor :P

Why do this?  We have a lot of data coming in, and I want to compact it
as best I can.

Regards,
Nate

On Fri, Jun 7, 2013 at 1:23 PM, Jack Krupansky <j...@basetechnology.com> wrote:

> To be clear, one normally doesn't do queries on portions of an "ID" -
> usually it is one integrated string.
>
> Further, strings are definitely NOT tokenized in Solr.
>
> Your story keeps changing, which is why I have to keep hedging my answers.
>
> At least with your latest story, your user_id should be a text/TextField
> so that it will be tokenized. A query for "2002" will
> match on complete tokens, not parts of tokens. If you want to match
> exactly on the full user_id, use a quoted phrase for the full user_id.
>
> But... I still have to hedge, because you refer to "a string of
> concatenated user id values". You seem to have two distinct definitions for
> user id.
>
> So, until you disclose all of your requirements and your data model,
> including a clarification about user id vs. "a string of concatenated user
> id values", I can't answer your question definitively, other than "Maybe,
> depending on what you really mean by user id."
>
>
> -- Jack Krupansky
>
> -----Original Message----- From: z z
> Sent: Friday, June 07, 2013 12:11 AM
>
> To: solr-user@lucene.apache.org
> Subject: Re: Schema Change: Int -> String (i am the original poster, new
> email address)
>
> The unique key is an auto-incremented int in the db.  Sorry for having
> given the impression that user_id is the unique key per document.  This is
> a table of events that are happening as users interact with our system.
> It just so happens that we were inserting individual records for each user
> before we even began to think about using something like Solr.  Now,
> however, it seems to me that we should be able to ask questions like "give
> me all records for user "2002" that have this string value "more" in data2,
> across this time stamp range [ .... ].  Several simultaneously inserted
> rows into the db are exactly the same aside from the user_ids.  I just want
> to know beforehand if I can still maintain exact matches for a user if the
> user_id becomes a string of concatenated user id values.
>
> From what you are saying it sounds like the "user_id_str" is really all I
> need.  It is tokenized and allows for partial searches.  I just want to
> make sure that "2002 15000 45" when tokenized doesn't allow "20" to
> partially match the token "2002".
>
> On Fri, Jun 7, 2013 at 12:57 PM, Jack Krupansky <j...@basetechnology.com> wrote:
>
>> In that case, you will need to keep two copies of the user ID, one which
>> is a single, complete string, and one which is a tokenized field
>> text/TextField so that you can do a keyword search against it. Use the
>> string/StrField as the main copy and then use a <copyField> directive in
>> the schema to copy from the main copy to the other copy.
>>
>> So, maybe "user_id" is the full unique key - you would have to specify,
>> the full exact key to query against it, or use wildcards for partial
>> matches, and "user" or "user_id_str" would be the tokenized text version
>> that would allow a simple search by partial value, such as "2002".
>>
>> Even so, I'm still not convinced that you have given us your complete
>> requirements. Is the user_id in fact the unique key for the documents?
>>
