On Tue, Jul 16, 2013 at 5:08 PM, Marcin Rzewucki <mrzewu...@gmail.com> wrote:

> Hi guys,
>
> First of all, thanks for your response.
>
> Jack: Data structure was created some time ago and this is a new
> requirement in my project. I'm trying to find a solution. I wouldn't like
> to split multivalued field into N similar records varying in this
> particular field only. That could impact performance and imply more changes
> in backend architecture as well. I'd prefer to create yet another
> collection and use pseudo-joins...
>
> Roman: Your ideas seem to be much closer to what I'm looking for. However,
> the following syntax: "text (1|2|3)" does not work for me. Are you sure it
> works like OR inside a regexp ?
>

I wasn't clear, sorry: the "text (1|2|3)" is a result of the term expansion -
you can see something like that when you look at the debugQuery=true output
after you send "phrase quer*" - lucene will search for the variants by
enumerating the possible alternatives, hence "phrase (token|token|token)"

it is possible to construct such a query manually, it depends on your
application

one more thing: the term expansion depends on the type of the field (ie.
expanding a string field is different from expanding an int field), yet you
could very easily write a small processor that looks at the range values and
treats them as numbers (*after* they were parsed by the qparser, but
*before* they were built into a query)

hmmm, now when I think of it... your values will be indexed as strings, so
you have to search/expand into string byterefs - it's doable, just wanted to
point out this detail - in normal situations, SOLR will be building query
tokens using the string/text field, because your field will be of that type

roman

> By the way: Honestly, I have one more requirement for which I would have to
> extend Solr query syntax. Basically, it should be possible to do some math
> on few fields and do range query on the result (without indexing it,
> because a combination of different fields is allowed).
> I'd like to spend some time on ANTLR and the new way of parsing you
> mentioned. I will let you know if it was useful for me. Thanks.
>
> Kind regards.
>
>
> On 16 July 2013 20:07, Roman Chyla <roman.ch...@gmail.com> wrote:
>
> > Well, I think this is slightly too categorical - a range query on a
> > substring can be thought of as a simple range query. So, for example the
> > following query:
> >
> > "lucene 1*"
> >
> > becomes behind the scenes: "lucene (10|11|12|13|14|1abcd)"
> >
> > the issue there is that it is a string range, but it is a range query - it
> > just has to be indexed in a clever way
> >
> > So, Marcin, you still have quite a few options besides the strict boolean
> > query model
> >
> > 1. have a special tokenizer chain which creates one token out of these
> > groups (eg. "some text prefix_1") and search for "some text prefix_*" [and
> > do some post-filtering if necessary]
> > 2. another version, using regex /some text (1|2|3...)/ - you got the idea
> > 3. construct the lucene multi-term range query automatically, in your
> > qparser - to produce a phrase query "lucene (10|11|12|13|14)"
> > 4. use payloads to index your integer at the position of "some text" and
> > then retrieve only "some text" where the payload is in range x-y - an
> > example is here, look at getPayloadQuery()
> >
> > https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/adsabs/lucene/BenchmarkAuthorSearch.java
> > - but this is more complex situation and if you google, you will find a
> > better description
> > 5. use a qparser that is able to handle nested search and analysis at the
> > same time - eg.
> > your query is: field:"some text" NEAR1 field:[0 TO 10] - i
> > know about a parser that can handle this and i invite others to check it
> > out (yeah, JIRA tickets need reviewers ;-))
> > https://issues.apache.org/jira/browse/LUCENE-5014
> >
> > there might be others i forgot, but it is certainly doable; but as Jack
> > points out, you may want to stop for a moment to reflect whether it is
> > necessary
> >
> > HTH,
> >
> > roman
> >
> >
> > On Tue, Jul 16, 2013 at 8:35 AM, Jack Krupansky <j...@basetechnology.com> wrote:
> >
> > > Sorry, but you are basically misusing Solr (and multivalued fields),
> > > trying to take a "shortcut" to avoid a proper data model.
> > >
> > > To properly use Solr, you need to put each of these multivalued field
> > > values in a separate Solr document, with a "text" field and a "value"
> > > field. Then, you can query:
> > >
> > > text:"some text" AND value:[min-value TO max-value]
> > >
> > > Exactly how you should restructure your data model is dependent on all of
> > > your other requirements.
> > >
> > > You may be able to simply flatten your data.
> > >
> > > You may be able to use a simple join operation.
> > >
> > > Or, maybe you need to do a multi-step query operation if your data is
> > > sufficiently complex.
> > >
> > > If you want to keep your multivalued field in its current form for display
> > > purposes or keyword search, or exact match search, fine, but your stated
> > > goal is inconsistent with the Semantics of Solr and Lucene.
> > >
> > > To be crystal clear, there is no such thing as "a range query on a
> > > substring" in Solr or Lucene.
> > >
> > > -- Jack Krupansky
> > >
> > > -----Original Message-----
> > > From: Marcin Rzewucki
> > > Sent: Tuesday, July 16, 2013 5:13 AM
> > > To: solr-user@lucene.apache.org
> > > Subject: Re: Range query on a substring.
> > >
> > >
> > > By multivalued I meant an array of values.
> > > For example:
> > > <arr name="myfield">
> > > <str>text1 (X)</str>
> > > <str>text2 (Y)</str>
> > > </arr>
> > >
> > > I'd like to avoid splitting it as you propose. I have 2.3mn collection with
> > > pretty large records (few hundreds fields and more per record). Duplicating
> > > them would impact performance.
> > >
> > > Regards.
> > >
> > >
> > >
> > > On 16 July 2013 10:26, Oleg Burlaca <oburl...@gmail.com> wrote:
> > >
> > >> Ah, you mean something like this:
> > >> record:
> > >> Id=10, text = "this is a text N1 (X), another text N2 (Y), text N3 (Z)"
> > >> Id=11, text = "this is a text N1 (W), another text N2 (Q), third text (M)"
> > >>
> > >> and you need to search for: "text N1" and X < B ?
> > >> How big is the core? the first thing that comes to my mind, again, at
> > >> indexing level,
> > >> split the text into pieces and index it in solr like this:
> > >>
> > >> record_id | text | value
> > >> 10 | text N1 | X
> > >> 10 | text N2 | Y
> > >> 10 | text N3 | Z
> > >>
> > >> does it help?
> > >>
> > >>
> > >>
> > >> On Tue, Jul 16, 2013 at 10:51 AM, Marcin Rzewucki <mrzewu...@gmail.com> wrote:
> > >>
> > >> > Hi Oleg,
> > >> > It's a multivalued field and it won't be easier to query when I split this
> > >> > field into text and numbers. I may get wrong results.
> > >> >
> > >> > Regards.
> > >> >
> > >> >
> > >> > On 16 July 2013 09:35, Oleg Burlaca <oburl...@gmail.com> wrote:
> > >> >
> > >> > > IMHO the number(s) should be extracted and stored in separate columns in
> > >> > > SOLR at indexing time.
> > >> > >
> > >> > > --
> > >> > > Oleg
> > >> > >
> > >> > >
> > >> > > On Tue, Jul 16, 2013 at 10:12 AM, Marcin Rzewucki <mrzewu...@gmail.com> wrote:
> > >> > >
> > >> > > > Hi,
> > >> > > >
> > >> > > > I have a problem (wonder if it is possible to solve it at all) with the
> > >> > > > following query.
> > >> > > > There are documents with a field which contains a text and
> > >> > > > a number in brackets, eg.
> > >> > > >
> > >> > > > myfield: this is a text (number)
> > >> > > >
> > >> > > > There might be some other documents with the same text but different number
> > >> > > > in brackets.
> > >> > > > I'd like to find documents with the given text say "this is a text" and
> > >> > > > "number" between A and B. Is it possible in Solr ? Any ideas ?
> > >> > > >
> > >> > > > Kind regards.
> > >> > > >
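
For readers following the thread, two of the ideas discussed above can be sketched in a few lines of Python. This is only an illustration, not Solr or Lucene code: the helper names (split_value, expand_range, matches) are hypothetical, and it assumes every value has the form "text (number)" with an integer in the brackets. split_value shows the index-time split into a text part and a numeric part, and expand_range shows how an integer range can be enumerated into string alternatives of the "lucene (10|11|12|13|14)" kind.

```python
import re

# Hypothetical helpers for illustration only; none of these names exist in Solr or Lucene.
# Assumes values look like "some text (42)" with an integer in the brackets.
VALUE_RE = re.compile(r"^\s*(?P<text>.*?)\s*\((?P<number>-?\d+)\)\s*$")


def split_value(raw):
    """Split 'some text (42)' into ('some text', 42); return None if the shape differs."""
    m = VALUE_RE.match(raw)
    if m is None:
        return None
    return m.group("text"), int(m.group("number"))


def expand_range(text, low, high):
    """Enumerate low..high (inclusive) as string alternatives after the given text."""
    if low > high:
        raise ValueError("low must not exceed high")
    alternatives = "|".join(str(n) for n in range(low, high + 1))
    return f"{text} ({alternatives})"


def matches(values, text, low, high):
    """Return the raw values whose text equals `text` and whose number is in [low, high]."""
    hits = []
    for raw in values:
        parsed = split_value(raw)
        if parsed is not None and parsed[0] == text and low <= parsed[1] <= high:
            hits.append(raw)
    return hits


if __name__ == "__main__":
    field_values = ["text1 (5)", "text2 (7)", "text1 (50)"]
    print(matches(field_values, "text1", 1, 10))
    print(expand_range("lucene", 10, 14))
```

Enumerating alternatives only stays practical for small ranges, since a wide range produces one alternative per integer. Storing the number in its own field at indexing time avoids that, at the cost of the data model change debated in the thread.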