Hi Ard, I've returned to this problem after some time away from it...
If you recall, I have two users, Sophie-Anne and Sophie.

The following query returns both users:

//*...@sling:resourceType='sakai/user-home' and (jcr:contains(public/*/*/*/*/*,'*Sophie*')
  or jcr:contains(public/*/*/*/*,'*Sophie*')
  or jcr:contains(public/*/*/*,'*Sophie*')
  or jcr:contains(public/*/*,'*Sophie*')
  or jcr:contains(public/*,'*Sophie*')
  or jcr:contains(pages/*/*/*/*/*,'*Sophie*')
  or jcr:contains(pages/*/*/*/*,'*Sophie*')
  or jcr:contains(pages/*/*/*,'*Sophie*')
  or jcr:contains(pages/*/*,'*Sophie*')
  or jcr:contains(pages/*,'*Sophie*'))] order by @jcr:score descending

The following query returns neither user:

//*...@sling:resourceType='sakai/user-home' and (jcr:contains(public/*/*/*/*/*,'*Sophie-Anne*')
  or jcr:contains(public/*/*/*/*,'*Sophie-Anne*')
  or jcr:contains(public/*/*/*,'*Sophie-Anne*')
  or jcr:contains(public/*/*,'*Sophie-Anne*')
  or jcr:contains(public/*,'*Sophie-Anne*')
  or jcr:contains(pages/*/*/*/*/*,'*Sophie-Anne*')
  or jcr:contains(pages/*/*/*/*,'*Sophie-Anne*')
  or jcr:contains(pages/*/*/*,'*Sophie-Anne*')
  or jcr:contains(pages/*/*,'*Sophie-Anne*')
  or jcr:contains(pages/*,'*Sophie-Anne*'))] order by @jcr:score descending

Are you able to tell me how I can see the actual query being passed to Lucene? I need to see how the query is being interpreted and executed by Lucene. The Analyzer method was of no use to me, btw.

Regards,

Chris Dunstall | Service Support - Applications
Technology Integration/OLE Virtual Team
Division of Information Technology | Charles Sturt University | Bathurst, NSW

Ph: 02 63384818 | Fax: 02 63384181

-----Original Message-----
From: Ard Schrijvers [mailto:[email protected]]
Sent: Friday, 3 September 2010 5:45 PM
To: [email protected]; [email protected]
Subject: Re: Problems with hyphen in JSR-170 XPath query using jcr:contains

Hello Wilson,

On Thu, Sep 2, 2010 at 6:11 PM, H. Wilson <[email protected]> wrote:
> Some successful queries I ran in my unit tests (out of the 1200+ test
> queries I have ...)
> (all of these were tried once as shown and once as
> "string".toLowerCase() )
>
> .North.South.East.West*
> .North.South.East.West-*
> .North.South.East.West-Land
> *West-Land
> .North*
>
>
> Unsuccessful include:
>
> .North.South.East.West-Lan?
> .North.South.East.West Land

I didn't look at the code, but I think the analyzer part is just fine. I
suspect the Jackrabbit query parser mangles dashes and spaces. I am
however not sure how you could avoid this. I'd have to look into it.

Though, you might want to check the JackrabbitQueryParser to see what it
makes of your ' .North.South.East.West-Lan?' or
'.North.South.East.West Land'

Regards Ard

>
>
> Good Luck!
>
> *H. Wilson*
>
>
> On 09/02/2010 12:28 AM, Dunstall, Christopher wrote:
>>
>> Just to be clear, the Lowercase Filter makes it even worse, as searching
>> for 'Arlington-Smythe' or 'Sophie-Anne' returns nothing, whereas without the
>> filter, you actually got the record.
>>
>> -----Original Message-----
>> From: Dunstall, Christopher [mailto:[email protected]]
>> Sent: Thursday, 2 September 2010 2:19 PM
>> To: [email protected]
>> Subject: RE: Problems with hyphen in JSR-170 XPath query using
>> jcr:contains
>>
>> I've got the customised Analyzer and Tokenizer working, but it seems I'm
>> back at square one, maybe even further back because now it looks like it's
>> being case sensitive.
>>
>> My Analyzer:
>>
>> public class HyphenKeywordAnalyzer extends KeywordAnalyzer {
>>     private static final Logger LOGGER =
>>         LoggerFactory.getLogger(HyphenKeywordAnalyzer.class);
>>
>>     public TokenStream tokenStream(String field, final Reader reader) {
>>         LOGGER.info("Custom Analyzer [" + field + "], [" + ((reader != null) ?
>>             reader.toString() : "") + "]");
>>
>>         TokenStream keywordTokenStream = new HyphenKeywordTokenizer(reader);
>>         return keywordTokenStream;
>>         //return (new LowerCaseFilter(keywordTokenStream));
>>     }
>> }
>>
>> My HyphenKeywordTokenizer class is practically a direct copy of
>> KeywordTokenizer, where it emits the entire input as a single token. As you
>> can see above, I'm not using the lower case filter, just to see what
>> happens.
>>
>> Once again, I have a user named 'Sophie-Anne' 'Roberts' and a user named
>> 'Bob' 'Arlington-Smythe'.
>>
>> A search for 'Sophie-Anne' produces the user's record, however, a search
>> for 'sophie-anne' does not (returns nothing), as does 'Sophie-A' and now,
>> even 'Sophie' or 'Sophie*'. Should I be using double quotes in the query
>> now? From what H. Wilson has found, it doesn't look like it will solve the
>> problem.
>>
>> The query being used is:
>> //*...@sling:resourceType="sakai/user-profile" and (jcr:contains(.,
>> 'Sophie\-Anne') or jcr:contains(*/*/*,'Sophie\-Anne'))] order by @jcr:score
>> descending]
>>
>> -----Original Message-----
>> From: H. Wilson [mailto:[email protected]]
>> Sent: Wednesday, 1 September 2010 6:47 AM
>> To: [email protected]
>> Subject: Re: Problems with hyphen in JSR-170 XPath query using
>> jcr:contains
>>
>>
>> On 08/31/2010 03:05 AM, Ard Schrijvers wrote:
>>>>
>>>> Given the following parameters in the repository:
>>>>
>>>> .North.South.East.WestLand
>>>> .North.South.East.West_Land
>>>> .North.South.East.West Land //yes that's a space
>>>>
>>>> The following exact name, case sensitive queries worked as expected for
>>>> each of the three parameters:
>>>>
>>>> filter.orJCRExpression ("jcr:like(@" + srchField
>>>>     +",'"+Text.escapeIllegalXpathSearchChars (searchTerm)+"')"); //case sens.
>>>
>>> jcr:like does not depend on any analyser but on the stored field, so
>>> this is not strange that it still works.
>>
>> I expected this too, I just try to be as thorough as possible when
>> posting anywhere. I am disappointed enough I haven't figured this out on
>> my own.
>>>>
>>>> The following exact name query, case insensitive, worked for only the
>>>> parameter with a fullName with a whitespace character:
>>>>
>>>> filter.addJCRExpression ("fn:lower-case(@"+srchField+") =
>>>>     '"+Text.escapeIllegalXpathSearchChars(searchTerm.toLowerCase())+"'");
>>>>
>>>> The following exact name queries, case insensitive, stopped working for
>>>> the fullnames WITHOUT a whitespace character:
>>>>
>>>> filter.addContains ( srchField,
>>>>     Text.escapeIllegalXpathSearchChars(searchTerm));
>>>>
>>>> Again, the only change I made was to the analyzer, I didn't remove my
>>>> "workaround" yet, and I just want to confirm I properly changed the
>>>> analyzer to figure out how the tokens were working. Oh I should note, the
>>>> output from the Analyzer only showed one Token per field, which I believe
>>>> is what we were looking for. Which leaves me as perplexed as before.
>>>>
>>>> LowerCaseKeywordAnalyzer.java:
>>>>
>>>> ...
>>>>
>>>> public TokenStream tokenStream ( String field, final Reader reader )
>>>> {
>>>>     System.out.println ("TOKEN STREAM for field: " + field);
>>>>     TokenStream keywordTokenStream = super.tokenStream (field, reader);
>>>>
>>>>     //changed for testing
>>>>     TokenStream lowerCaseStream = new LowerCaseFilter ( keywordTokenStream ) ;
>>>>     final Token reusableToken = new Token();
>>>>     try {
>>>>         Token mytoken = lowerCaseStream.next (reusableToken);
>>>>         while ( mytoken != null ) {
>>>>             System.out.println ("[" + mytoken.term() + "]");
>>>>             mytoken = lowerCaseStream.next (mytoken);
>>>>         }
>>>>         //lowerCaseStream.reset(); //uncommenting this did not change results.
>>>>     }
>>>>     catch (IOException ioe) {
>>>>         System.err.println ("ERROR: " + ioe.toString());
>>>>     }
>>>>
>>> It's a stream!! So, your keywordTokenStream is now empty. Call reset()
>>> on the keywordTokenStream before using it again.
>>>
>>> Regards Ard
>>>
>>>>     return (new LowerCaseFilter ( keywordTokenStream ) );
>>>> }
>>>>
>>>> ...
>>
>> I was real excited when I saw your email this morning. However,
>> resetting keywordTokenStream as the last line in the "try" resulted in
>> no change. I also tried uncommenting the lowerCaseStream.reset line in
>> an act of desperation with no difference. I must be missing something
>> completely obvious at this point... look at a problem too long and the
>> obvious fails to jump out at you...
>>
>> H. Wilson
>>>>
>>>> Thanks.
>>>>
>>>> H. Wilson
>>>>
>>>> On 08/30/2010 09:38 AM, Ard Schrijvers wrote:
>>>>>
>>>>> On Mon, Aug 30, 2010 at 3:30 PM, H. Wilson<[email protected]>
>>>>> wrote:
>>>>>>
>>>>>> Ard,
>>>>>>
>>>>>> You are absolutely right.. and this didn't make sense to me either. I
>>>>>> think I was too worn out from my week and too excited to have code that
>>>>>> "worked" to notice the obvious... this must be a workaround. However, I
>>>>>> will need a little guidance on how to inspect the tokens.
>>>>>> I have Luke, but never really
>>>>>> understood how to use it properly. Could you give me a clear list of
>>>>>> steps, or point me to a resource I missed, on how I would go about
>>>>>> inspecting tokens during insert/search? Thanks.
>>>>>
>>>>> I'd just print them to your console with Token#term() or use a
>>>>> debugger. If you do that during indexing and searching, I think you
>>>>> must see some difference in the token that explains *why* Lucene
>>>>> doesn't find a hit for your usecase with spaces.
>>>>>
>>>>> Luke is hard to use for the multi-index jackrabbit indexing, as well
>>>>> as the field value prefixing: It is unfortunate and not completely
>>>>> necessary any more but has some historical reasons from Lucene back in
>>>>> the days when it could not handle very many unique fieldnames
>>>>>
>>>>> Regards Ard
>>>>>
>>>>>> H. Wilson
>>>>>>
>>>>>> On 08/30/2010 03:30 AM, Ard Schrijvers wrote:
>>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> On Fri, Aug 27, 2010 at 9:06 PM, H. Wilson<[email protected]>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> OK, well I got the spaces part figured out, and will post it for
>>>>>>>> anyone who needs it. Putting quotes around the spaces unfortunately
>>>>>>>> did not work. During testing, I determined that if you performed the
>>>>>>>> following query for the exact fullName property:
>>>>>>>>
>>>>>>>> filter.addContains ( @fullName,
>>>>>>>>     '"+Text.escapeIllegalXpathSearchChars(".North.South.East.West Land"));
>>>>>>>>
>>>>>>>> It would return nothing.
>>>>>>>> But tweak it a little and add a wildcard, and it
>>>>>>>> would return results:
>>>>>>>>
>>>>>>>> filter.addContains ( @fullName,
>>>>>>>>     '"+Text.escapeIllegalXpathSearchChars(".North.South.East.West Lan*"));
>>>>>>>
>>>>>>> This does not make sense...see below
>>>>>>>
>>>>>>>> But since I did not want to throw in wild cards where they might not
>>>>>>>> be wanted, if a search string contained spaces, did not contain wild
>>>>>>>> cards and the user was not concerned with case sensitivity, I used the
>>>>>>>> fn:lower-case.
>>>>>>>> So I ended up with the following excerpt (our clients wanted options
>>>>>>>> for case sensitive and case insensitive searching).
>>>>>>>>
>>>>>>>> public OurParameter[] getOurParameters (boolean performCaseSensitiveSearch,
>>>>>>>>         String searchTerm, String srchField ) { //srchField in this case was fullName
>>>>>>>>
>>>>>>>>     .....
>>>>>>>>
>>>>>>>>     if ( performCaseSensitiveSearch) {
>>>>>>>>
>>>>>>>>         //jcr:like for case sensitive
>>>>>>>>         filter.orJCRExpression ("jcr:like(@" + srchField +", '"+Text.escapeIllegalXpathSearchChars (searchTerm)+"')");
>>>>>>>>
>>>>>>>>     }
>>>>>>>>     else {
>>>>>>>>
>>>>>>>>         //only use fn:lower-case if there are spaces, with NO wild cards
>>>>>>>>
>>>>>>>>         if ( searchTerm.contains (" ") && !searchTerm.contains ("*") && !searchTerm.contains ("?") ) {
>>>>>>>>
>>>>>>>>             filter.addJCRExpression ("fn:lower-case(@"+srchField+") = '"+Text.escapeIllegalXpathSearchChars(searchTerm.toLowerCase())+"'");
>>>>>>>>
>>>>>>>>         }
>>>>>>>>
>>>>>>>>         else {
>>>>>>>>
>>>>>>>>             //jcr:contains for case insensitive
>>>>>>>>             filter.addContains ( srchField, Text.escapeIllegalXpathSearchChars(searchTerm));
>>>>>>>>
>>>>>>>>         }
>>>>>>>>
>>>>>>>>     }
>>>>>>>
>>>>>>> This seems to me a workaround around the real problem, because, it
>>>>>>> just doesn't make sense to me.
>>>>>>> Can you inspect the tokens that are
>>>>>>> created by your analyser? Make sure you inspect the tokens during
>>>>>>> indexing (just store something) and during searching: just search in
>>>>>>> the property. I am quite sure you'll see the issue then. Perhaps
>>>>>>> something with Text.escapeIllegalXpathSearchChars though it seems
>>>>>>> that it should leave spaces untouched
>>>>>>>
>>>>>>> Regards Ard
>>>>>>>
>>>>>>>
>>>>>>>>     ....
>>>>>>>>
>>>>>>>> }
>>>>>>>>
>>>>>>>>
>>>>>>>> Hope that helps anyone who needs it.
>>>>>>>>
>>>>>>>> H. Wilson
>>>>>>>>
>>>>>>>>>> OK so it looks like I have one other issue. Using the
>>>>>>>>>> configuration as posted below and sticking to my previous examples,
>>>>>>>>>> with the addition of one with whitespace. With the following three
>>>>>>>>>> in our repository:
>>>>>>>>>>
>>>>>>>>>> .North.South.East.WestLand
>>>>>>>>>> .North.South.East.West_Land
>>>>>>>>>> .North.South.East.West Land //yes that's a space
>>>>>>>>>>
>>>>>>>>>> ...using a jcr:contains, with exact name search with NO wild
>>>>>>>>>> cards: the first two return properly, but the last one yields no
>>>>>>>>>> result.
>>>>>>>>>>
>>>>>>>>>> filter.addContains(@fullName,
>>>>>>>>>>     '"+org.apache.jackrabbit.util.Text.escapeIllegalXpathSearchChars(".North.South.East.West Land") +"'));
>>>>>>>>>
>>>>>>>>> I think the space in a contains is seen as an AND by the
>>>>>>>>> Jackrabbit/Lucene QueryParser. I should test this however as I am
>>>>>>>>> not sure. Perhaps you can put quotes around it, not sure if that
>>>>>>>>> works though
>>>>>>>>>
>>>>>>>>> Regards Ard
>>>>>>>>>
>>>>>>>>>> According to the Lucene documentation, KeywordAnalyzer should be
>>>>>>>>>> creating one token, plus combined with escaping the Illegal
>>>>>>>>>> Characters (i.e. spaces), shouldn't this search work? Thanks again.
>>>>>>>>>>
>>>>>>>>>> H. Wilson
>
