RE: Problems with hyphen in JSR-170 XPath query using jcr:contains

Dunstall, Christopher Wed, 03 Nov 2010 22:56:53 -0700

Hi Ard,

I've returned to this problem after some time away from it...


If you recall, I have two users, Sophie-Anne and Sophie.

//*[@sling:resourceType='sakai/user-home' and 
(jcr:contains(public/*/*/*/*/*,'*Sophie*') or 
jcr:contains(public/*/*/*/*,'*Sophie*') or 
jcr:contains(public/*/*/*,'*Sophie*') or jcr:contains(public/*/*,'*Sophie*') or 
jcr:contains(public/*,'*Sophie*') or jcr:contains(pages/*/*/*/*/*,'*Sophie*') 
or jcr:contains(pages/*/*/*/*,'*Sophie*') or 
jcr:contains(pages/*/*/*,'*Sophie*') or jcr:contains(pages/*/*,'*Sophie*') or 
jcr:contains(pages/*,'*Sophie*'))] order by @jcr:score descending

Returns both users.

//*[@sling:resourceType='sakai/user-home' and 
(jcr:contains(public/*/*/*/*/*,'*Sophie-Anne*') or 
jcr:contains(public/*/*/*/*,'*Sophie-Anne*') or 
jcr:contains(public/*/*/*,'*Sophie-Anne*') or 
jcr:contains(public/*/*,'*Sophie-Anne*') or 
jcr:contains(public/*,'*Sophie-Anne*') or 
jcr:contains(pages/*/*/*/*/*,'*Sophie-Anne*') or 
jcr:contains(pages/*/*/*/*,'*Sophie-Anne*') or 
jcr:contains(pages/*/*/*,'*Sophie-Anne*') or 
jcr:contains(pages/*/*,'*Sophie-Anne*') or 
jcr:contains(pages/*,'*Sophie-Anne*'))] order by @jcr:score descending

Returns neither user.

Are you able to tell me how I can see the actual query being passed to Lucene? 
I need to see how the query is being interpreted and executed by Lucene.

The Analyzer method was of no use to me, btw.
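
One hypothesis that would be consistent with the two results above (it is an
assumption, not something confirmed in this thread) is that the indexed text
is lowercased and split at the hyphen, while a wildcard term is compared
against each indexed token as a whole. The plain-Java sketch below only
simulates that idea. It does not call Lucene or Jackrabbit, and the names
splitTokens, keywordToken and anyTokenMatches are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

public class WildcardTokenSketch {

    // ASSUMPTION: index-time analysis lowercases the value and splits it
    // at every character that is not a letter or a digit (so at '-').
    public static List<String> splitTokens(String value) {
        List<String> tokens = new ArrayList<>();
        for (String part : value.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Alternative assumption: the whole value is kept as one lowercased token.
    public static List<String> keywordToken(String value) {
        return Collections.singletonList(value.toLowerCase(Locale.ROOT));
    }

    // Translates '*' and '?' into a regular expression and requires the
    // pattern to cover one whole token.
    public static boolean wildcardMatches(String pattern, String token) {
        StringBuilder regex = new StringBuilder();
        for (char c : pattern.toLowerCase(Locale.ROOT).toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.matches(regex.toString(), token);
    }

    // A value is a hit if at least one of its tokens matches the pattern.
    public static boolean anyTokenMatches(String pattern, List<String> tokens) {
        for (String token : tokens) {
            if (wildcardMatches(pattern, token)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] names = {"Sophie", "Sophie-Anne"};
        String[] patterns = {"*Sophie*", "*Sophie-Anne*"};
        for (String name : names) {
            for (String pattern : patterns) {
                System.out.println(name + " / " + pattern
                        + " split=" + anyTokenMatches(pattern, splitTokens(name))
                        + " keyword=" + anyTokenMatches(pattern, keywordToken(name)));
            }
        }
    }
}
```

Under the split assumption, '*Sophie*' matches both names and '*Sophie-Anne*'
matches neither, because no single token contains the hyphen. With one keyword
token per value, '*Sophie-Anne*' would match Sophie-Anne.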

Regards,

Chris Dunstall | Service Support - Applications
Technology Integration/OLE Virtual Team
Division of Information Technology | Charles Sturt University | Bathurst, NSW

Ph: 02 63384818 | Fax: 02 63384181


-----Original Message-----
From: Ard Schrijvers [mailto:[email protected]]
Sent: Friday, 3 September 2010 5:45 PM
To: [email protected]; [email protected]
Subject: Re: Problems with hyphen in JSR-170 XPath query using jcr:contains

Hello Wilson,

On Thu, Sep 2, 2010 at 6:11 PM, H. Wilson <[email protected]> wrote:

> Some successful queries I ran in my unit tests (out of the 1200+ test
> queries I have ...) (all of these were tried once as shown and once as
> "string".toLowerCase() )
>
>   .North.South.East.West*
>   .North.South.East.West-*
>   .North.South.East.West-Land
>   *West-Land
>   .North*
>
>
> Unsuccessful include:
>
>   .North.South.East.West-Lan?
>   .North.South.East.West Land

I didn't look at the code, but I think the analyzer part is just fine. I
suspect the Jackrabbit query parser mangles dashes and spaces. I am,
however, not sure how you could avoid this; I'd have to look into it.
You might want to check what the JackrabbitQueryParser makes of
your '.North.South.East.West-Lan?' or
'.North.South.East.West Land'
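
To make the suspicion about spaces concrete, here is a minimal plain-Java
sketch. It assumes (this is not verified against the JackrabbitQueryParser)
that the query text is split at whitespace into clauses that must all match,
while the index holds the whole property value as one keyword token. The
class and method names are hypothetical, and the sketch does not try to
explain the failing '?' wildcard case.

```java
import java.util.Arrays;
import java.util.List;

public class QuerySplitSketch {

    // ASSUMPTION: whitespace in the query text separates clauses,
    // and every clause has to match (an implicit AND).
    public static List<String> clauses(String queryText) {
        return Arrays.asList(queryText.trim().split("\\s+"));
    }

    // Keyword-style index: the stored value is a single term, so a clause
    // only hits if it equals that term exactly.
    public static boolean matchesKeywordIndex(String queryText, String indexedTerm) {
        for (String clause : clauses(queryText)) {
            if (!clause.equals(indexedTerm)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String[] values = {
            ".North.South.East.West-Land",
            ".North.South.East.West Land"
        };
        for (String value : values) {
            // Search for the exact stored value, as in the unit tests above.
            System.out.println("[" + value + "] clauses=" + clauses(value)
                    + " hit=" + matchesKeywordIndex(value, value));
        }
    }
}
```

In this model the value with the dash stays one clause and is found, while the
value with the space becomes two clauses, neither of which equals the single
indexed token.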

Regards Ard

>
>
> Good Luck!
>
> *H. Wilson*
>
>
> On 09/02/2010 12:28 AM, Dunstall, Christopher wrote:
>>
>> Just to be clear, the Lowercase Filter makes it even worse, as searching
>> for 'Arlington-Smythe' or 'Sophie-Anne' returns nothing, whereas without the
>> filter, you actually got the record.
>>
>> Chris Dunstall | Service Support - Applications
>> Technology Integration/OLE Virtual Team
>> Division of Information Technology | Charles Sturt University | Bathurst,
>> NSW
>>
>> Ph: 02 63384818 | Fax: 02 63384181
>>
>>
>> -----Original Message-----
>> From: Dunstall, Christopher [mailto:[email protected]]
>> Sent: Thursday, 2 September 2010 2:19 PM
>> To: [email protected]
>> Subject: RE: Problems with hyphen in JSR-170 XPath query using
>> jcr:contains
>>
>> I've got the customised Analyzer and Tokenizer working, but it seems I'm
>> back at square one, maybe even further back because now it looks like it's
>> being case sensitive.
>>
>> My Analyzer:
>>
>> public class HyphenKeywordAnalyzer extends KeywordAnalyzer {
>>   private static final Logger LOGGER =
>>       LoggerFactory.getLogger(HyphenKeywordAnalyzer.class);
>>
>>   public TokenStream tokenStream(String field, final Reader reader) {
>>     LOGGER.info("Custom Analyzer [" + field + "], ["
>>         + ((reader != null) ? reader.toString() : "") + "]");
>>
>>     TokenStream keywordTokenStream = new HyphenKeywordTokenizer(reader);
>>     return keywordTokenStream;
>>     //return (new LowerCaseFilter(keywordTokenStream));
>>   }
>> }
>>
>> My HyphenKeywordTokenizer class is practically a direct copy of
>> KeywordTokenizer, where it emits the entire input as a single token.  As you
>> can see above, I'm not using the lower case filter, just to see what
>> happens.
>>
>> Once again, I have a user named 'Sophie-Anne' 'Roberts' and a user named
>> 'Bob' 'Arlington-Smythe'.
>>
>> A search for 'Sophie-Anne' produces the user's record; however, a search
>> for 'sophie-anne' returns nothing, and neither do 'Sophie-A' and, now,
>> even 'Sophie' or 'Sophie*'. Should I be using double quotes in the query
>> now? From what H. Wilson has found, it doesn't look like that will solve the
>> problem.
>>
>> The query being used is:
>> //*[@sling:resourceType="sakai/user-profile" and (jcr:contains(.,
>> 'Sophie\-Anne') or jcr:contains(*/*/*,'Sophie\-Anne'))] order by @jcr:score
>> descending]
>>
>>
>> Chris Dunstall | Service Support - Applications
>> Technology Integration/OLE Virtual Team
>> Division of Information Technology | Charles Sturt University | Bathurst,
>> NSW
>>
>> Ph: 02 63384818 | Fax: 02 63384181
>>
>>
>> -----Original Message-----
>> From: H. Wilson [mailto:[email protected]]
>> Sent: Wednesday, 1 September 2010 6:47 AM
>> To: [email protected]
>> Subject: Re: Problems with hyphen in JSR-170 XPath query using
>> jcr:contains
>>
>>
>> On 08/31/2010 03:05 AM, Ard Schrijvers wrote:
>>>>
>>>> Given the following parameters in the repository:
>>>>
>>>>    .North.South.East.WestLand
>>>>    .North.South.East.West_Land
>>>>    .North.South.East.West Land    //yes that's a space
>>>>
>>>> The following exact name, case sensitive queries worked as expected for each
>>>> of the three parameters:
>>>>
>>>>    filter.orJCRExpression ("jcr:like(@" + srchField
>>>> +",'"+Text.escapeIllegalXpathSearchChars (searchTerm)+"')");  //case
>>>> sens.
>>>
>>> jcr:like does not depend on any analyser but on the stored field, so
>>> this is not strange that it still works.
>>
>> I expected this too, I just try to be as thorough as possible when
>> posting anywhere. I am disappointed enough I haven't figured this out on
>> my own.
>>>>
>>>> The following exact name query, case insensitive, worked for only the
>>>> parameter with a fullName with a whitespace character:
>>>>
>>>>    filter.addJCRExpression ("fn:lower-case(@"+srchField+") =
>>>> '"+Text.escapeIllegalXpathSearchChars(searchTerm.toLowerCase())+"'");
>>>>
>>>> The following exact name queries, case insensitive, stopped working for the
>>>> fullnames WITHOUT a whitespace character:
>>>>
>>>>    filter.addContains ( srchField,
>>>> Text.escapeIllegalXpathSearchChars(searchTerm));
>>>>
>>>> Again, the only change I made was to the analyzer, I didn't remove my
>>>> "workaround" yet, and I just want to confirm I properly changed the analyzer
>>>> to figure out how the tokens were working. Oh I should note, the output from
>>>> the Analyzer only showed one Token per field, which I believe is what we
>>>> were looking for. Which leaves me as perplexed as before.
>>>>
>>>> LowerCaseKeywordAnalyzer.java:
>>>>
>>>>    ...
>>>>
>>>>    public TokenStream tokenStream ( String field, final Reader reader  )
>>>> {
>>>>             System.out.println ("TOKEN STREAM for field: " + field);
>>>>             TokenStream keywordTokenStream = super.tokenStream (field,
>>>> reader);
>>>>
>>>>         //changed for testing
>>>>             TokenStream lowerCaseStream =  new LowerCaseFilter (
>>>> keywordTokenStream ) ;
>>>>             final Token reusableToken = new Token();
>>>>             try {
>>>>                 Token mytoken = lowerCaseStream.next (reusableToken);
>>>>                 while ( mytoken != null  ) {
>>>>                     System.out.println ("[" + mytoken.term() + "]");
>>>>                     mytoken = lowerCaseStream.next (mytoken);
>>>>                 }
>>>>                 //lowerCaseStream.reset();  //uncommenting this did not
>>>> change results.
>>>>             }
>>>>             catch  (IOException ioe) {
>>>>                 System.err.println ("ERROR: " + ioe.toString());
>>>>             }
>>>>
>>> It's a stream!! So, your keywordTokenStream is now empty. Call reset()
>>> on the keywordTokenStream before using it again.
>>>
>>> Regards Ard
>>>
>>>>             return (new LowerCaseFilter ( keywordTokenStream ) );
>>>>         }
>>>>
>>>>    ...
>>
>> I was real excited when I saw your email this morning. However,
>> resetting keywordTokenStream as the last line in the "try" resulted in
>> no change. I also tried uncommenting the lowerCaseStream.reset line in
>> an act of desperation with no difference. I must be missing something
>> completely obvious at this point... look at a problem too long and the
>> obvious fails to jump out at you...
>>
>> H. Wilson
>>>>
>>>> Thanks.
>>>>
>>>> H. Wilson
>>>>
>>>> On 08/30/2010 09:38 AM, Ard Schrijvers wrote:
>>>>>
>>>>> On Mon, Aug 30, 2010 at 3:30 PM, H. Wilson<[email protected]>
>>>>> wrote:
>>>>>>
>>>>>>   Ard,
>>>>>>
>>>>>> You are absolutely right.. and this didn't make sense to me either. I think
>>>>>> I was too worn out from my week and too excited to have code that "worked"
>>>>>> to notice the obvious... this must be a workaround. However, I will need a
>>>>>> little guidance on how to inspect the tokens. I have Luke, but never really
>>>>>> understood how to use it properly. Could you give me a clear list of steps,
>>>>>> or point me to a resource I missed, on how I would go about inspecting
>>>>>> tokens during insert/search? Thanks.
>>>>>
>>>>> I'd just print them to your console with Token#term() or use a
>>>>> debugger . If you do that during indexing and searching, I think you
>>>>> must see some difference in the token that explains *why* Lucene
>>>>> doesn't find a hit for your usecase with spaces.
>>>>>
>>>>> Luke is hard to use with Jackrabbit because of its multi-index setup and
>>>>> its field value prefixing. The prefixing is unfortunate and no longer
>>>>> strictly necessary, but it has historical reasons: back in the days,
>>>>> Lucene could not handle very many unique field names.
>>>>>
>>>>> Regards Ard
>>>>>
>>>>>> H. Wilson
>>>>>>
>>>>>> On 08/30/2010 03:30 AM, Ard Schrijvers wrote:
>>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> On Fri, Aug 27, 2010 at 9:06 PM, H. Wilson<[email protected]>
>>>>>>>   wrote:
>>>>>>>>
>>>>>>>> OK, well I got the spaces part figured out, and will post it for anyone who
>>>>>>>> needs it. Putting quotes around the spaces unfortunately did not work.
>>>>>>>> During testing, I determined that if you performed the following query for
>>>>>>>> the exact fullName property:
>>>>>>>>
>>>>>>>>     filter.addContains ( @fullName,
>>>>>>>> '"+Text.escapeIllegalXpathSearchChars(".North.South.East.West
>>>>>>>> Land"));
>>>>>>>>
>>>>>>>> It would return nothing. But tweak it a little and add a wildcard, and it
>>>>>>>> would return results:
>>>>>>>>
>>>>>>>>    filter.addContains ( @fullName,
>>>>>>>>    '"+Text.escapeIllegalXpathSearchChars(".North.South.East.West
>>>>>>>> Lan*"));
>>>>>>>
>>>>>>> This does not make sense...see below
>>>>>>>
>>>>>>>> But since I did not want to throw in wild cards where they might not be
>>>>>>>> wanted, if a search string contained spaces, did not contain wild cards, and
>>>>>>>> the user was not concerned with case sensitivity, I used fn:lower-case.
>>>>>>>> So I ended up with the following excerpt (our clients wanted options for
>>>>>>>> case sensitive and case insensitive searching).
>>>>>>>>
>>>>>>>> public OurParameter[] getOurParameters (boolean
>>>>>>>> performCaseSensitiveSearch,
>>>>>>>> String searchTerm, String srchField ) { //srchField in this case was
>>>>>>>> fullName
>>>>>>>>
>>>>>>>>    .....
>>>>>>>>
>>>>>>>>    if ( performCaseSensitiveSearch) {
>>>>>>>>
>>>>>>>>        //jcr:like for case sensitive
>>>>>>>>        filter.orJCRExpression ("jcr:like(@" + srchField +",
>>>>>>>> '"+Text.escapeIllegalXpathSearchChars (searchTerm)+"')");
>>>>>>>>
>>>>>>>>    }
>>>>>>>>    else {
>>>>>>>>
>>>>>>>>        //only use fn:lower-case if there is spaces, with NO wild
>>>>>>>> cards
>>>>>>>>
>>>>>>>>        if ( searchTerm.contains (" ")&&         !searchTerm.contains
>>>>>>>> ("*")&&
>>>>>>>>   !searchTerm.contains ("?") ) {
>>>>>>>>
>>>>>>>>            filter.addJCRExpression ("fn:lower-case(@"+srchField+") =
>>>>>>>>
>>>>>>>> '"+Text.escapeIllegalXpathSearchChars(searchTerm.toLowerCase())+"'");
>>>>>>>>
>>>>>>>>        }
>>>>>>>>
>>>>>>>>        else {
>>>>>>>>
>>>>>>>>            //jcr:contains for case insensitive
>>>>>>>>            filter.addContains ( srchField,
>>>>>>>> Text.escapeIllegalXpathSearchChars(searchTerm));
>>>>>>>>
>>>>>>>>        }
>>>>>>>>
>>>>>>>>    }
>>>>>>>
>>>>>>> This seems to me a workaround around the real problem, because, it
>>>>>>> just doesn't make sense to me. Can you inspect the tokens that are
>>>>>>> created by your analyser. Make sure you inspect the tokens during
>>>>>>> indexing (just store something) and during searching: just search in
>>>>>>> the property. I am quite sure you'll see the issue then. Perhaps
>>>>>>> something with Text.escapeIllegalXpathSearchChars though it seems
>>>>>>> that
>>>>>>> it should leave spaces untouched
>>>>>>>
>>>>>>> Regards Ard
>>>>>>>
>>>>>>>
>>>>>>>>    ....
>>>>>>>>
>>>>>>>> }
>>>>>>>>
>>>>>>>>
>>>>>>>> Hope that helps anyone who needs it.
>>>>>>>>
>>>>>>>> H. Wilson
>>>>>>>>
>>>>>>>>>> OK so it looks like I have one other issue. Using the configuration as
>>>>>>>>>> posted below and sticking to my previous examples, with the addition of one
>>>>>>>>>> with whitespace. With the following three in our repository:
>>>>>>>>>>
>>>>>>>>>>    .North.South.East.WestLand
>>>>>>>>>>    .North.South.East.West_Land
>>>>>>>>>>    .North.South.East.West Land    //yes that's a space
>>>>>>>>>>
>>>>>>>>>> ...using a jcr:contains, with exact name search with NO wild cards: the
>>>>>>>>>> first two return properly, but the last one yields no result.
>>>>>>>>>>
>>>>>>>>>>    filter.addContains(@fullName,
>>>>>>>>>> '"+org.apache.jackrabbit.util.Text.escapeIllegalXpathSearchChars(".North.South.East.West Land") +"'));
>>>>>>>>>
>>>>>>>>> I think the space in a contains is seen as an AND by the
>>>>>>>>> Jackrabbit/Lucene QueryParser. I should test this however as I am
>>>>>>>>> not
>>>>>>>>> sure. Perhaps you can put quotes around it, not sure if that works
>>>>>>>>> though
>>>>>>>>>
>>>>>>>>> Regards Ard
>>>>>>>>>
>>>>>>>>>> According to the Lucene documentation, KeywordAnalyzer should be creating
>>>>>>>>>> one token, plus combined with escaping the Illegal Characters (i.e. spaces),
>>>>>>>>>> shouldn't this search work? Thanks again.
>>>>>>>>>>
>>>>>>>>>> H. Wilson
>
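
Ard's remark in the quoted thread ("It's a stream!!") is about input that can
only be read once: printing the tokens for debugging consumes the underlying
Reader, so whatever is built on the same Reader afterwards sees no text. The
sketch below shows that effect with plain java.io only. It does not use
Lucene, the helper names readAll and logAndReplay are invented for this
example, and whether reset() on a particular TokenStream also rewinds its
Reader depends on the implementation.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class ReaderConsumedSketch {

    // Reads everything that is still available from the reader.
    public static String readAll(Reader reader) {
        try {
            StringBuilder text = new StringBuilder();
            char[] buffer = new char[256];
            int count;
            while ((count = reader.read(buffer)) != -1) {
                text.append(buffer, 0, count);
            }
            return text.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Debug helper: buffers the text once, prints it, and hands back a
    // fresh Reader over the same text so later consumers still see it.
    public static Reader logAndReplay(Reader reader) {
        String text = readAll(reader);
        System.out.println("[" + text + "]");
        return new StringReader(text);
    }

    public static void main(String[] args) {
        Reader once = new StringReader(".North.South.East.West Land");
        String firstPass = readAll(once);
        String secondPass = readAll(once); // the reader is already at its end
        System.out.println("first pass length:  " + firstPass.length());
        System.out.println("second pass length: " + secondPass.length());

        Reader replayed = logAndReplay(new StringReader("Sophie-Anne"));
        System.out.println("after logging: " + readAll(replayed));
    }
}
```

The design choice here is to copy the text before logging it, rather than to
rely on rewinding a stream that has already been read to the end.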
