The standard tokenizer treats underscore as a valid token character, not a delimiter.

The word delimiter filter will treat underscore as a delimiter though.

Make sure your query-time WDF does not have preserveOriginal="1" - but the index-time WDF should have preserveOriginal="1". Otherwise, the query phrase will generate an extra token which will participate in the matching and might cause a mismatch.
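
To make the difference concrete, here is a rough Python model of the two analysis steps. It is only a sketch, not Lucene's implementation: standard_like_tokens and wdf_like_split are hypothetical helpers that imitate the standard tokenizer and the word delimiter filter, and the preserve_original flag stands in for preserveOriginal="1". Token positions and the WDF's other splitting rules (case changes, letter/digit transitions) are ignored.

```python
import re

def standard_like_tokens(text):
    # Rough stand-in for the standard tokenizer: \w includes the
    # underscore, so "_" stays inside a token while "-" splits it.
    return re.findall(r"\w+", text)

def wdf_like_split(token, preserve_original=False):
    # Rough stand-in for the word delimiter filter: split on anything
    # that is not a letter or digit, which includes the underscore.
    parts = re.findall(r"[^\W_]+", token)
    if preserve_original and parts != [token]:
        # Hypothetical model of preserveOriginal="1": keep the unsplit
        # token in addition to its parts.
        return [token] + parts
    return parts

def analyze(text, preserve_original=False):
    # Tokenize first, then run every token through the delimiter split.
    out = []
    for tok in standard_like_tokens(text):
        out.extend(wdf_like_split(tok, preserve_original))
    return out

print(standard_like_tokens("IAE-UPC-0001"))  # ['IAE', 'UPC', '0001']
print(standard_like_tokens("IAE_UPC_0001"))  # ['IAE_UPC_0001']
print(analyze("IAE_UPC_0001"))               # ['IAE', 'UPC', '0001']
print(analyze("IAE_UPC_0001", preserve_original=True))
# ['IAE_UPC_0001', 'IAE', 'UPC', '0001']
```

The last line shows the extra token that a query-time preserveOriginal="1" would add to the phrase.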

-- Jack Krupansky

-----Original Message-----
From: Paul Rogers
Sent: Monday, August 4, 2014 5:55 PM
To: solr-user@lucene.apache.org
Subject: Re: How to search for phrase "IAE_UPC_0001"

Hi Guys

Thanks for the replies.  I've had a look at the WordDelimiterFilterFactory
and the Term Info for the url field.  It seems that all the terms exist and
I now understand that each url is being broken up using the delimiters
specified.  But I think I'm still missing something.

Am I correct in assuming the minus sign (-) is also a delimiter?

If so why then does  url:"IAE-UPC-0001" return a result (when the url
contains the substring IAE-UPC-0001) whereas  url:"IAE_UPC_0001" doesn't
(when the url contains the substring IAE_UPC_0001)?

Secondly if the url has indeed been broken into the terms IAE UPC and 0001
why do all the searches suggested or tried succeed when the delimiter is a
minus sign (-) but not when the delimiter is an underscore (_), returning
zero matches?

Finally, shouldn't the query url:"IAE UPC 0001"~1 work since all it is
looking for is the three terms?

Many thanks for any enlightenment.

P




On 4 August 2014 01:33, Harald Kirsch <harald.kir...@raytion.com> wrote:

This all depends on how the tokenizers take your URLs apart. To quickly
see what ended up in the index, go to a core in the UI, select Schema
Browser, select the field containing your URLs, click on "Load Term Info".

In your case, for the field holding the URL you could try to switch to a
tokenizer that defines tokens as a sequence of alphanumeric characters,
roughly [a-z0-9]+ plus diacritics. In particular punctuation and separation
characters like dash, underscore, slash, dot and the like would never be
part of a token, i.e. they don't make a difference.
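
As a rough illustration of such a tokenizer, here is a small Python sketch. It is not Solr code: alnum_tokens is a hypothetical helper, the lowercasing is my assumption (to mirror [a-z0-9]), and the example URL is made up.

```python
import re

# Runs of Unicode letters and digits; "_" and all punctuation are
# excluded, so they can only ever act as separators.
TOKEN = re.compile(r"[^\W_]+")

def alnum_tokens(text):
    # Lowercasing is an assumption here, to mirror [a-z0-9].
    return [m.group(0).lower() for m in TOKEN.finditer(text)]

print(alnum_tokens("http://example.com/docs/IAE_UPC_0001.pdf"))
# ['http', 'example', 'com', 'docs', 'iae', 'upc', '0001', 'pdf']
print(alnum_tokens("http://example.com/docs/IAE-UPC-0001.pdf"))
# ['http', 'example', 'com', 'docs', 'iae', 'upc', '0001', 'pdf']
```

Both spellings of the reference produce the same tokens, which is the point of the suggestion.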

Then you can search the url parts with a phrase query (
https://cwiki.apache.org/confluence/display/solr/The+Standard+Query+Parser#TheStandardQueryParser-SpecifyingTermsfortheStandardQueryParserwhich) like

 url:"IAE-UPC-0001"

In the same way as during indexing, the dashes are removed to end up with
three tokens, namely IAE, UPC and 0001. Further they have to be in that
order. Naturally this will then match anything like:

  "IAE_UPC_0001"
  "IAE UPC 0001"
  "IAE/UPC+0001"
  "IAE\UPC\0001"
  "IAE.UPC,0001"

Depending on how your URLs are structured, there is the chance for false
positives, of course.
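
The matching described above can be sketched in a few lines of Python. This is only a model under the assumption that both the indexed text and the query go through the same alphanumeric tokenizer; alnum_tokens and phrase_match are hypothetical names, and the document strings are made up.

```python
import re

def alnum_tokens(text):
    # Tokens are runs of letters and digits; everything else separates.
    return re.findall(r"[^\W_]+", text.lower())

def phrase_match(query, document):
    # A phrase matches when the query tokens occur in the document
    # as one consecutive run, in the same order.
    q = alnum_tokens(query)
    d = alnum_tokens(document)
    if not q:
        return False
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))

variants = ["IAE_UPC_0001", "IAE UPC 0001", "IAE/UPC+0001",
            "IAE\\UPC\\0001", "IAE.UPC,0001"]
for v in variants:
    print(v, phrase_match("IAE-UPC-0001", "docs/" + v + ".pdf"))  # all True

# Order still matters, so this does not match:
print(phrase_match("IAE-UPC-0001", "docs/UPC_IAE_0001.pdf"))  # False
```

A URL that merely happens to contain the three parts in sequence, separated by any punctuation, would match as well, which is where the false positives come from.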

The Really Good Thing here is that you don't need to use wildcards.

I have not yet looked at the wildcard-queries implementation in
Solr/Lucene, but with the commercial search engines I know, they are a
great way to lose the confidence of your users, because they just don't
work as expected by anyone not knowing the implementation. Either they
deliver only partial results or they kill the performance or they even go
OOM. Unless the Solr committers have done something really ingenious,
Solr/Lucene will have the same problems.

Harald.






On 31.07.2014 18:31, Paul Rogers wrote:

Hi Guys

I have a Solr application searching on data uploaded by Nutch. The search
I wish to carry out is for a particular document reference contained
within the "url" field, e.g. IAE-UPC-0001.

The problem is that the file names that make up the URLs are not
consistent, so a URL might contain the reference as IAE-UPC-0001 or
IAE_UPC_0001 (i.e. using either the minus or the underscore as the
delimiter) but not both.

I have created the query (in the solr admin interface):

url:"IAE-UPC-0001"

which works (returning the single expected document), as do:

url:"IAE*UPC*0001"
url:"IAE?UPC?0001"

when the doc ref is in the format IAE-UPC-0001 (i.e. using the minus sign
as a delimiter).

However:

url:"IAE_UPC_0001"
url:"IAE*UPC*0001"
url:"IAE?UPC?0001"

do not work (returning zero documents) when the doc ref is in the format
IAE_UPC_0001 (i.e. using the underscore character as the delimiter).

I'm assuming the underscore is a special character, but I have looked at
the Solr wiki and can't find anything that says what the problem is. The
minus sign also has a specific meaning, but that is nullified by adding
the quotes.

Can anyone suggest what I'm doing wrong?

Many thanks

Paul


--
Harald Kirsch
Raytion GmbH
Kaiser-Friedrich-Ring 74
40547 Duesseldorf
Fon +49 211 53883-216
Fax +49-211-550266-19
http://www.raytion.com

