Re: Range query on a substring.

Roman Chyla Tue, 16 Jul 2013 14:41:19 -0700

On Tue, Jul 16, 2013 at 5:08 PM, Marcin Rzewucki <mrzewu...@gmail.com> wrote:


> Hi guys,
>
> First of all, thanks for your response.
>
> Jack: Data structure was created some time ago and this is a new
> requirement in my project. I'm trying to find a solution. I wouldn't like
> to split multivalued field into N similar records varying in this
> particular field only. That could impact performance and imply more changes
> in backend architecture as well. I'd prefer to create yet another
> collection and use pseudo-joins...
>
> Roman: Your ideas seem to be much closer to what I'm looking for. However,
> the following syntax: "text (1|2|3)" does not work for me. Are you sure it
> works like OR inside a regexp ?
>

I wasn't clear, sorry: the "text (1|2|3)" is a result of the term expansion
- you can see something like that when you look at the debugQuery=true output
after you send "phrase quer*" - Lucene will search for the variants by
enumerating the possible alternatives, hence "phrase (token|token|token)"

it is possible to construct such a query manually, it depends on your
application
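
A minimal sketch of building such a query by hand, assuming the field is
indexed as a single untokenized string like "some text (12)" and that your
Solr version accepts Lucene regexp syntax (field:/.../). The helper names
expand_range and build_regexp_query are made up for illustration, and the
escaping list is a conservative guess at the regexp metacharacters:

```python
import re


def expand_range(low, high):
    """Enumerate the integers in [low, high] as strings, like a term expansion."""
    return [str(n) for n in range(low, high + 1)]


def build_regexp_query(field, text, low, high):
    """Hypothetical helper: returns a query string such as
    myfield:/some text \\((10|11|12)\\)/
    Assumes the number is stored in literal brackets right after the text.
    """
    alternatives = "|".join(expand_range(low, high))
    # Escape characters that have a special meaning in a regexp
    # (conservative list, an assumption rather than an official one).
    escaped = re.sub(r'([.?+*|{}\[\]()"\\#@&<>~/])', r'\\\1', text)
    return f"{field}:/{escaped} \\(({alternatives})\\)/"


if __name__ == "__main__":
    print(build_regexp_query("myfield", "some text", 10, 12))
```

Enumerating every value only stays practical for small ranges; for wide
ranges the alternation gets long and one of the other options is a better fit.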

one more thing: the term expansion depends on the type of the field (i.e.
expanding a string field is different from expanding an int field), yet you
could very easily write a small processor that looks at the range values and
treats them as numbers (*after* they were parsed by the qparser, but
*before* they were built into a query - hmmm, now that I think of it...
your values will be indexed as strings, so you have to search/expand into
string byterefs - it's doable, just wanted to point out this detail - in
normal situations, SOLR will be building query tokens using the string/text
field, because your field will be of that type)
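
To illustrate the string-vs-number detail, here is a rough sketch in plain
Python (not Lucene code). It assumes you can get hold of the indexed terms as
strings; expand_against_terms is a hypothetical name. The range check is done
numerically, but what you get back are the string terms that would go into
the query:

```python
def expand_against_terms(indexed_terms, low, high):
    """Return the string terms whose numeric value lies in [low, high].

    indexed_terms stands in for the terms of a string field; anything that
    is not purely digits (e.g. "1abcd") is skipped.
    """
    matches = []
    for term in indexed_terms:
        if term.isdigit() and low <= int(term) <= high:
            matches.append(term)
    return matches


if __name__ == "__main__":
    terms = ["1", "10", "1abcd", "25", "9"]  # already in string (index) order
    # A plain string range "9".."12" would miss "10", because "10" sorts
    # before "9"; comparing as numbers picks up both.
    print(expand_against_terms(terms, 9, 12))
```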

roman



> By the way: Honestly, I have one more requirement for which I would have to
> extend Solr query syntax. Basically, it should be possible to do some math
> on few fields and do range query on the result (without indexing it,
> because a combination of different fields is allowed). I'd like to spend
> some time on ANTLR and the new way of parsing you mentioned. I will let you
> know if it was useful for me. Thanks.
>
> Kind regards.
>
>
> On 16 July 2013 20:07, Roman Chyla <roman.ch...@gmail.com> wrote:
>
> > Well, I think this is slightly too categorical - a range query on a
> > substring can be thought of as a simple range query. So, for example the
> > following query:
> >
> > "lucene 1*"
> >
> > becomes behind the scenes: "lucene (10|11|12|13|14|1abcd)"
> >
> > the issue there is that it is a string range, but it is a range query -
> it
> > just has to be indexed in a clever way
> >
> > So, Marcin, you still have quite a few options besides the strict boolean
> > query model
> >
> > 1. have a special tokenizer chain which creates one token out of these
> > groups (eg. "some text prefix_1") and search for "some text prefix_*"
> [and
> > do some post-filtering if necessary]
> > 2. another version, using regex /some text (1|2|3...)/ - you got the idea
> > 3. construct the lucene multi-term range query automatically, in your
> > qparser - to produce a phrase query "lucene (10|11|12|13|14)"
> > 4. use payloads to index your integer at the position of "some text" and
> > then retrieve only "some text" where the payload is in range x-y - an
> > example is here, look at getPayloadQuery()
> >
> >
> > > https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/adsabs/lucene/BenchmarkAuthorSearch.java
> > > - but this is a more complex situation and if you google, you will find a
> > better description
> > 5. use a qparser that is able to handle nested search and analysis at the
> > same time - eg. your query is: field:"some text" NEAR1 field:[0 TO 10] -
> i
> > know about a parser that can handle this and i invite others to check it
> > out (yeah, JIRA tickets need reviewers ;-))
> > https://issues.apache.org/jira/browse/LUCENE-5014
> >
> > there might be others i forgot, but it is certainly doable; but as Jack
> > points out, you may want to stop for a moment to reflect whether it is
> > necessary
> >
> > HTH,
> >
> >   roman
> >
> >
> > On Tue, Jul 16, 2013 at 8:35 AM, Jack Krupansky <j...@basetechnology.com
> > >wrote:
> >
> > > Sorry, but you are basically misusing Solr (and multivalued fields),
> > > trying to take a "shortcut" to avoid a proper data model.
> > >
> > > To properly use Solr, you need to put each of these multivalued field
> > > values in a separate Solr document, with a "text" field and a "value"
> > > field. Then, you can query:
> > >
> > >    text:"some text" AND value:[min-value TO max-value]
> > >
> > > Exactly how you should restructure your data model is dependent on all
> of
> > > your other requirements.
> > >
> > > You may be able to simply flatten your data.
> > >
> > > You may be able to use a simple join operation.
> > >
> > > > Or, maybe you need to do a multi-step query operation if your data is
> > > sufficiently complex.
> > >
> > > If you want to keep your multivalued field in its current form for
> > display
> > > purposes or keyword search, or exact match search, fine, but your
> stated
> > > goal is inconsistent with the Semantics of Solr and Lucene.
> > >
> > > To be crystal clear, there is no such thing as "a range query on a
> > > substring" in Solr or Lucene.
> > >
> > > -- Jack Krupansky
> > >
> > > -----Original Message----- From: Marcin Rzewucki
> > > Sent: Tuesday, July 16, 2013 5:13 AM
> > > To: solr-user@lucene.apache.org
> > > Subject: Re: Range query on a substring.
> > >
> > >
> > > By multivalued I meant an array of values. For example:
> > > <arr name="myfield">
> > >  <str>text1 (X)</str>
> > >  <str>text2 (Y)</str>
> > > </arr>
> > >
> > > > I'd like to avoid splitting it as you propose. I have 2.3mn collection
> > with
> > > pretty large records (few hundreds fields and more per record).
> > Duplicating
> > > them would impact performance.
> > >
> > > Regards.
> > >
> > >
> > >
> > > On 16 July 2013 10:26, Oleg Burlaca <oburl...@gmail.com> wrote:
> > >
> > >  Ah, you mean something like this:
> > >> record:
> > >> Id=10, text =  "this is a text N1 (X), another text N2 (Y), text N3
> (Z)"
> > >> Id=11, text =  "this is a text N1 (W), another text N2 (Q), third text
> > >> (M)"
> > >>
> > >> and you need to search for: "text N1" and X < B ?
> > >> How big is the core? the first thing that comes to my mind, again, at
> > >> indexing level,
> > >> split the text into pieces and index it in solr like this:
> > >>
> > >> record_id | text      | value
> > >> 10           | text N1 | X
> > >> 10           | text N2 | Y
> > >> 10           | text N3 | Z
> > >>
> > >> does it help?
> > >>
> > >>
> > >>
> > >> On Tue, Jul 16, 2013 at 10:51 AM, Marcin Rzewucki <
> mrzewu...@gmail.com
> > >> >wrote:
> > >>
> > >> > Hi Oleg,
> > >> > It's a multivalued field and it won't be easier to query when I
> split
> > >> this
> > >> > field into text and numbers. I may get wrong results.
> > >> >
> > >> > Regards.
> > >> >
> > >> >
> > >> > On 16 July 2013 09:35, Oleg Burlaca <oburl...@gmail.com> wrote:
> > >> >
> > >> > > IMHO the number(s) should be extracted and stored in separate
> > columns
> > >> in
> > >> > > SOLR at indexing time.
> > >> > >
> > >> > > --
> > >> > > Oleg
> > >> > >
> > >> > >
> > >> > > On Tue, Jul 16, 2013 at 10:12 AM, Marcin Rzewucki <
> > >> mrzewu...@gmail.com
> > >> > > >wrote:
> > >> > >
> > >> > > > Hi,
> > >> > > >
> > >> > > > I have a problem (wonder if it is possible to solve it at all)
> > with
> > >> the
> > >> > > > following query. There are documents with a field which
> contains a
> > >> text
> > >> > > and
> > >> > > > a number in brackets, eg.
> > >> > > >
> > >> > > > myfield: this is a text (number)
> > >> > > >
> > >> > > > There might be some other documents with the same text but
> > different
> > >> > > number
> > >> > > > in brackets.
> > >> > > > I'd like to find documents with the given text say "this is a
> > text"
> > >> and
> > >> > > > "number" between A and B. Is it possible in Solr ? Any ideas ?
> > >> > > >
> > >> > > > Kind regards.
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >>
> > >
> >
>
