I'd pull Nutch out of the mix here as a test. Create
some test docs (use the exampleDocs directory?) and
go from there at least long enough to ensure that Solr
does what you expect if the data gets there properly.

You can set this up in about 10 minutes, and test it
in about 15 more. May save you endless hours.

You're conflating two issues here:
1> whether Nutch is sending the data
2> whether Solr is indexing and searching as you expect.
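
A minimal sketch of that kind of Nutch-free test, in Python. The core
name "collection1", the update URL, the example.com URLs and the "id"
and "url" field names are all assumptions for illustration; adjust them
to your own setup. It builds two hand-made docs, one per delimiter
style, and posts them straight to Solr as JSON:

```python
import json
import urllib.request

# Assumed for illustration: a local Solr with a core named "collection1".
SOLR_UPDATE_URL = "http://localhost:8983/solr/collection1/update?commit=true"


def build_test_docs():
    """Return hand-made docs covering both delimiter styles from the thread."""
    refs = ["IAE-UPC-0001", "IAE_UPC_0001"]
    return [
        # "id" and "url" are assumed field names; example.com is a placeholder.
        {"id": f"test-{i}", "url": f"http://example.com/docs/{ref}.pdf"}
        for i, ref in enumerate(refs, start=1)
    ]


def post_docs(docs, update_url=SOLR_UPDATE_URL):
    """POST the docs to Solr as JSON. Needs a running Solr instance."""
    request = urllib.request.Request(
        update_url,
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


if __name__ == "__main__":
    # Print the payload only; call post_docs(build_test_docs()) once Solr is up.
    print(json.dumps(build_test_docs(), indent=2))
```

If both docs show up in a *:* query afterwards, any remaining mismatch
is on the Solr analysis side rather than in what Nutch sends.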

Some of the Solr/Lucene analysis chains do transformations
that may not be what you assume, particularly things
like StandardTokenizer and WordDelimiterFilterFactory.
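
To make that concrete, here is a rough Python simulation of the two
behaviours described further down the thread: the standard tokenizer
keeps an underscore inside a token but splits on a minus sign, while
the word delimiter filter splits on both. The regexes are deliberately
crude stand-ins, not the real Lucene implementations (the real filter
also splits on case changes and letter/digit boundaries, for example):

```python
import re


def standard_like_tokens(text):
    """Crude stand-in for StandardTokenizer: underscore stays inside a token."""
    return re.findall(r"[A-Za-z0-9_]+", text)


def wdf_like_split(token):
    """Crude stand-in for WordDelimiterFilter: split on non-alphanumerics."""
    return [part for part in re.split(r"[^A-Za-z0-9]+", token) if part]


def analyze(text, use_wdf):
    """Tokenize, then optionally apply the word-delimiter-style split."""
    tokens = standard_like_tokens(text)
    if not use_wdf:
        return tokens
    return [part for token in tokens for part in wdf_like_split(token)]


if __name__ == "__main__":
    print(analyze("IAE-UPC-0001", use_wdf=False))  # ['IAE', 'UPC', '0001']
    print(analyze("IAE_UPC_0001", use_wdf=False))  # ['IAE_UPC_0001']
    print(analyze("IAE_UPC_0001", use_wdf=True))   # ['IAE', 'UPC', '0001']
```

Without the word delimiter step the underscore form is a single token,
so a phrase query for the three separate parts has nothing to match.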

So I'd take the time to see that the values you're dealing
with are behaving as you expect. The admin/analysis page
will help you a _lot_ here.
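
If you'd rather script it than click through the UI, the same analysis
can be requested over HTTP. This sketch only builds the request URL. It
assumes your solrconfig.xml registers the field analysis handler at
/analysis/field (I believe the example config does, but check yours),
and the base URL and core name are placeholders:

```python
from urllib.parse import urlencode


def analysis_url(base_url, field_name, index_value, query_value):
    """Build a field-analysis request URL for one index value and one query."""
    params = {
        "analysis.fieldname": field_name,
        "analysis.fieldvalue": index_value,
        "analysis.query": query_value,
        "analysis.showmatch": "true",
        "wt": "json",
    }
    return f"{base_url}/analysis/field?{urlencode(params)}"


if __name__ == "__main__":
    # Placeholder base URL and core name; adjust to your installation.
    print(analysis_url(
        "http://localhost:8983/solr/collection1",
        "url",
        "IAE_UPC_0001",
        "IAE_UPC_0001",
    ))
```

Fetching that URL should show the tokens produced at each stage of the
index and query chains for the field, which is what the admin page
displays.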

Best,
Erick




On Mon, Aug 18, 2014 at 7:16 AM, Paul Rogers <paul.roge...@gmail.com> wrote:
> Hi Guys
>
> I've been checking into this further and have deleted the index a couple of
> times and rebuilt it with the suggestions you've supplied.
>
> I had a bit of an epiphany last week and decided to check if the document I
> was searching for was actually in the index (did this by doing a *:* query
> to a file and grep'ing for the 'IAE_UPC_0001' string).  It seems it isn't!!
> Not sure if it was in the original index or not, tho' I suspect not.
>
> As far as I can see anything with the reference in the form IAE_UPC_####
> has not been indexed while those with the reference in the form
> IAE-UPC-#### have.  Not sure if that's a coincidence or not.
>
> Need to see if I can get the docs into the index and then check if the
> search works or not.  Will see if the guys on the Nutch list can shed any
> light.
>
> All the best.
>
> P
>
>
> On 4 August 2014 17:09, Jack Krupansky <j...@basetechnology.com> wrote:
>
>> The standard tokenizer treats underscore as a valid token character, not a
>> delimiter.
>>
>> The word delimiter filter will treat underscore as a delimiter though.
>>
>> Make sure your query-time WDF does not have preserveOriginal="1" - but the
>> index-time WDF should have preserveOriginal="1". Otherwise, the query
>> phrase will generate an extra token which will participate in the matching
>> and might cause a mismatch.
>>
>> -- Jack Krupansky
>>
>> -----Original Message-----
>> From: Paul Rogers
>> Sent: Monday, August 4, 2014 5:55 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: How to search for phrase "IAE_UPC_0001"
>>
>> Hi Guys
>>
>> Thanks for the replies.  I've had a look at the WordDelimiterFilterFactory
>> and the Term Info for the url field.  It seems that all the terms exist and
>> I now understand that each url is being broken up using the delimiters
>> specified.  But I think I'm still missing something.
>>
>> Am I correct in assuming the minus sign (-) is also a delimiter?
>>
>> If so why then does  url:"IAE-UPC-0001" return a result (when the url
>> contains the substring IAE-UPC-0001) whereas  url:"IAE_UPC_0001" doesn't
>> (when the url contains the substring IAE_UPC_0001)?
>>
>> Secondly if the url has indeed been broken into the terms IAE UPC and 0001
>> why do all the searches suggested or tried succeed when the delimiter is a
>> minus sign (-) but not when the delimiter is an underscore (_), returning
>> zero matches?
>>
>> Finally, shouldn't the query url:"IAE UPC 0001"~1 work since all it is
>> looking for is the three terms?
>>
>> Many thanks for any enlightenment.
>>
>> P
>>
>>
>>
>>
>> On 4 August 2014 01:33, Harald Kirsch <harald.kir...@raytion.com> wrote:
>>
>>> This all depends on how the tokenizers take your URLs apart. To quickly
>>> see what ended up in the index, go to a core in the UI, select Schema
>>> Browser, select the field containing your URLs, click on "Load Term Info".
>>>
>>> In your case, for the field holding the URL you could try to switch to a
>>> tokenizer that defines tokens as a sequence of alphanumeric characters,
>>> roughly [a-z0-9]+ plus diacritics. In particular punctuation and separation
>>> characters like dash, underscore, slash, dot and the like would never be
>>> part of a token, i.e. they don't make a difference.
>>>
>>> Then you can search the url parts with a phrase query
>>> (https://cwiki.apache.org/confluence/display/solr/The+Standard+Query+Parser#TheStandardQueryParser-SpecifyingTermsfortheStandardQueryParser)
>>> like
>>>
>>>  url:"IAE-UPC-0001"
>>>
>>> In the same way as during indexing, the dashes are removed to end up with
>>> three tokens, namely IAE, UPC and 0001. Further they have to be in that
>>> order. Naturally this will then match anything like:
>>>
>>>   "IAE_UPC_0001"
>>>   "IAE UPC 0001"
>>>   "IAE/UPC+0001"
>>>   "IAE\UPC\0001"
>>>   "IAE.UPC,0001"
>>>
>>> Depending on how your URLs are structured, there is the chance for false
>>> positives, of course.
>>>
>>> The Really Good Thing here is that you don't need to use wildcards.
>>>
>>> I have not yet looked at the wildcard-query implementation in
>>> Solr/Lucene, but in the commercial search engines I know, wildcards are a
>>> great way to lose the confidence of your users, because they just don't
>>> work as expected by anyone who doesn't know the implementation. Either they
>>> deliver only partial results, or they kill performance, or they even go
>>> OOM. Unless the Solr committers have done something really ingenious,
>>> Solr/Lucene will have the same problems.
>>>
>>> Harald.
>>>
>>>
>>>
>>>
>>>
>>>
>>> On 31.07.2014 18:31, Paul Rogers wrote:
>>>
>>>> Hi Guys
>>>>
>>>> I have a Solr application searching on data uploaded by Nutch.  The search
>>>> I wish to carry out is for a particular document reference contained within
>>>> the "url" field, e.g. IAE-UPC-0001.
>>>>
>>>> The problem is that the file names that make up the URLs are not
>>>> consistent, so a URL might contain the reference as IAE-UPC-0001 or
>>>> IAE_UPC_0001 (i.e. using either the minus or underscore as the delimiter)
>>>> but not both.
>>>>
>>>> I have created the query (in the solr admin interface):
>>>>
>>>> url:"IAE-UPC-0001"
>>>>
>>>> which works (returning the single expected document), as do:
>>>>
>>>> url:"IAE*UPC*0001"
>>>> url:"IAE?UPC?0001"
>>>>
>>>> when the doc ref is in the format IAE-UPC-0001 (ie using the minus sign as
>>>> a delimiter).
>>>>
>>>> However:
>>>>
>>>> url:"IAE_UPC_0001"
>>>> url:"IAE*UPC*0001"
>>>> url:"IAE?UPC?0001"
>>>>
>>>> do not work (returning zero documents) when the doc ref is in the format
>>>> IAE_UPC_0001 (ie using the underscore character as the delimiter).
>>>>
>>>> I'm assuming the underscore is a special character, but I've looked through
>>>> the Solr wiki and can't find anything that says what the problem is. The
>>>> minus sign also has a specific meaning, but that is nullified by adding the
>>>> quotes.
>>>>
>>>> Can anyone suggest what I'm doing wrong?
>>>>
>>>> Many thanks
>>>>
>>>> Paul
>>>>
>>>>
>>> --
>>> Harald Kirsch
>>> Raytion GmbH
>>> Kaiser-Friedrich-Ring 74
>>> 40547 Duesseldorf
>>> Fon +49 211 53883-216
>>> Fax +49-211-550266-19
>>> http://www.raytion.com
>>>
>>>
>>
