Re: Problems with hyphen in JSR-170 XPath query using jcr:contains

H. Wilson Fri, 03 Sep 2010 08:52:05 -0700

Now this is interesting. Since today is a busy day for me, I "cheated" and quickly copied the JackrabbitQueryParser method "parse" into one of my unit tests so I could compare the Strings at the beginning of the method and at the end. Given the following query Strings:


   ".North.South.East.West*"
   ".North.South.East.West-*"
   ".North.South.East.West-Lan?"
   ".North.South.East.West-Land"
   ".North.South.East.West Land"  //space
   ".North.South.East.West_Land"

None showed any change at the end of the method. When I tested the method Text.escapeIllegalXpathSearchChars with the same query strings, all of them also came back unchanged except the one with the trailing question mark, which was escaped:


   ".North.South.East.West-Lan\?"

So I looked up the Text class and found this comment in the javadoc:

   "Escapes illegal XPath search characters at the end of a string.
   Example:
   A search string like 'test?' will run into a ParseException
   documented in http://issues.apache.org/jira/browse/JCR-1248";

Following the link through did not really help. It makes it sound like this is considered resolved as of 1.5.0 (I am on 2.0.0).
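
The behaviour observed above (only a special character at the very end of the string gets a backslash) can be sketched with plain Java. This is not the Jackrabbit source: the class name, the method name escapeTrailing and the character set are assumptions for illustration, and the thread only confirms the trailing '?' case.

```java
final class TrailingEscapeSketch {
    // Hypothetical set of characters to escape; the thread only confirms '?'.
    private static final String SPECIAL = "?";

    /** Prefixes the LAST character with a backslash if it is in SPECIAL. */
    static String escapeTrailing(String s) {
        if (s.isEmpty()) {
            return s;
        }
        char last = s.charAt(s.length() - 1);
        if (SPECIAL.indexOf(last) >= 0) {
            return s.substring(0, s.length() - 1) + "\\" + last;
        }
        return s; // anything earlier in the string is left untouched
    }

    public static void main(String[] args) {
        String[] terms = {
            ".North.South.East.West*",
            ".North.South.East.West-*",
            ".North.South.East.West-Lan?",
            ".North.South.East.West-Land",
            ".North.South.East.West Land",
            ".North.South.East.West_Land"
        };
        for (String term : terms) {
            System.out.println(term + " -> " + escapeTrailing(term));
        }
    }
}
```

Under these assumptions, hyphens and spaces pass through unchanged, which matches what the unit test showed: escaping alone does not explain the failing hyphen and space queries.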


*H. Wilson*
R&D Software Systems, Inc.


On 09/03/2010 03:45 AM, Ard Schrijvers wrote:

Hello Wilson,

On Thu, Sep 2, 2010 at 6:11 PM, H. Wilson<[email protected]>  wrote:

Some successful queries I ran in my unit tests (out of the 1200+ test
queries I have ...) (all of these were tried once as shown and once as
"string".toLowerCase() )

   .North.South.East.West*
   .North.South.East.West-*
   .North.South.East.West-Land
   *West-Land
   .North*


Unsuccessful include:

   .North.South.East.West-Lan?
   .North.South.East.West Land

I didn't look at the code, but I think the analyzer part is just fine. I
suspect the Jackrabbit query parser mangles dashes and spaces. I am
however not sure how you could avoid this. I'd have to look into it.
Though, you might want to check what the JackrabbitQueryParser
makes of your '.North.South.East.West-Lan?' or
'.North.South.East.West Land'

Regards Ard


Good Luck!

*H. Wilson*


On 09/02/2010 12:28 AM, Dunstall, Christopher wrote:

Just to be clear, the Lowercase Filter makes it even worse, as searching
for 'Arlington-Smythe' or 'Sophie-Anne' returns nothing, whereas without the
filter, you actually got the record.

Chris Dunstall | Service Support - Applications
Technology Integration/OLE Virtual Team
Division of Information Technology | Charles Sturt University | Bathurst,
NSW

Ph: 02 63384818 | Fax: 02 63384181


-----Original Message-----
From: Dunstall, Christopher [mailto:[email protected]]
Sent: Thursday, 2 September 2010 2:19 PM
To: [email protected]
Subject: RE: Problems with hyphen in JSR-170 XPath query using
jcr:contains

I've got the customised Analyzer and Tokenizer working, but it seems I'm
back at square one, maybe even further back because now it looks like it's
being case sensitive.

My Analyzer:

public class HyphenKeywordAnalyzer extends KeywordAnalyzer {
   private static final Logger LOGGER = LoggerFactory.getLogger(HyphenKeywordAnalyzer.class);

   public TokenStream tokenStream(String field, final Reader reader) {
     LOGGER.info("Custom Analyzer [" + field + "], [" + ((reader != null) ? reader.toString() : "") + "]");

     TokenStream keywordTokenStream = new HyphenKeywordTokenizer(reader);
     return keywordTokenStream;
     //return (new LowerCaseFilter(keywordTokenStream));
   }
}

My HyphenKeywordTokenizer class is practically a direct copy of
KeywordTokenizer, where it emits the entire input as a single token.  As you
can see above, I'm not using the lower case filter, just to see what
happens.

Once again, I have a user named 'Sophie-Anne' 'Roberts' and a user named
'Bob' 'Arlington-Smythe'.

A search for 'Sophie-Anne' produces the user's record. However, a search
for 'sophie-anne' returns nothing, and neither do 'Sophie-A' and now
even 'Sophie' or 'Sophie*'. Should I be using double quotes in the query
now? From what H. Wilson has found, it doesn't look like that will solve
the problem.

The query being used is:
//*...@sling:resourceType="sakai/user-profile" and (jcr:contains(.,
'Sophie\-Anne') or jcr:contains(*/*/*,'Sophie\-Anne'))] order by @jcr:score
descending]
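
One possible reading of these results, sketched in plain Java without Lucene: if the whole field value is kept as a single token, a query term only hits when it equals that token exactly. The class and method names are hypothetical, and the sketch assumes the query term is compared as typed (not lower-cased on the query side), which the thread does not confirm.

```java
import java.util.Locale;

final class KeywordMatchSketch {
    /** Index-time "analysis": the whole value is one token, optionally lower-cased. */
    static String analyze(String value, boolean lowerCase) {
        return lowerCase ? value.toLowerCase(Locale.ROOT) : value;
    }

    /** A term hits only if it equals the single stored token exactly. */
    static boolean matches(String storedToken, String queryTerm) {
        return storedToken.equals(queryTerm);
    }

    public static void main(String[] args) {
        String plain = analyze("Sophie-Anne", false);   // no LowerCaseFilter
        String lowered = analyze("Sophie-Anne", true);  // with LowerCaseFilter

        System.out.println(matches(plain, "Sophie-Anne"));   // exact case
        System.out.println(matches(plain, "sophie-anne"));   // different case
        System.out.println(matches(plain, "Sophie"));        // only part of the token
        System.out.println(matches(lowered, "Sophie-Anne")); // index lower-cased, query not
    }
}
```

In this model a lower-casing filter only helps if the same lower-casing is applied to the query term as well; lower-casing one side alone makes the exact-case search miss too.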


Chris Dunstall | Service Support - Applications
Technology Integration/OLE Virtual Team
Division of Information Technology | Charles Sturt University | Bathurst,
NSW

Ph: 02 63384818 | Fax: 02 63384181


-----Original Message-----
From: H. Wilson [mailto:[email protected]]
Sent: Wednesday, 1 September 2010 6:47 AM
To: [email protected]
Subject: Re: Problems with hyphen in JSR-170 XPath query using
jcr:contains


On 08/31/2010 03:05 AM, Ard Schrijvers wrote:

Given the following parameters in the repository:

    .North.South.East.WestLand
    .North.South.East.West_Land
    .North.South.East.West Land    //yes that's a space

The following exact name, case sensitive queries worked as expected for
each of the three parameters:

    filter.orJCRExpression ("jcr:like(@" + srchField +",'"+Text.escapeIllegalXpathSearchChars (searchTerm)+"')");  //case sens.

jcr:like does not depend on any analyser but on the stored field, so
this is not strange that it still works.

I expected this too, I just try to be as thorough as possible when
posting anywhere. I am disappointed enough I haven't figured this out on
my own.

The following exact name query, case insensitive, worked for only the
parameter with a fullName with a whitespace character:

    filter.addJCRExpression ("fn:lower-case(@"+srchField+") = '"+Text.escapeIllegalXpathSearchChars(searchTerm.toLowerCase())+"'");

The following exact name queries, case insensitive, stopped working for
the fullnames WITHOUT a whitespace character:

    filter.addContains ( srchField, Text.escapeIllegalXpathSearchChars(searchTerm));

Again, the only change I made was to the analyzer, I didn't remove my
"workaround" yet, and I just want to confirm I properly changed the
analyzer to figure out how the tokens were working. Oh I should note,
the output from the Analyzer only showed one Token per field, which I
believe is what we were looking for. Which leaves me as perplexed as
before.

LowerCaseKeywordAnalyzer.java:

    ...

    public TokenStream tokenStream ( String field, final Reader reader ) {
             System.out.println ("TOKEN STREAM for field: " + field);
             TokenStream keywordTokenStream = super.tokenStream (field, reader);

         //changed for testing
             TokenStream lowerCaseStream = new LowerCaseFilter ( keywordTokenStream );
             final Token reusableToken = new Token();
             try {
                 Token mytoken = lowerCaseStream.next (reusableToken);
                 while ( mytoken != null ) {
                     System.out.println ("[" + mytoken.term() + "]");
                     mytoken = lowerCaseStream.next (mytoken);
                 }
                 //lowerCaseStream.reset();  //uncommenting this did not change results.
             }
             catch (IOException ioe) {
                 System.err.println ("ERROR: " + ioe.toString());
             }

It's a stream!! So, your keywordTokenStream is now empty. Call reset()
on the keywordTokenStream before using it again.

Regards Ard

             return (new LowerCaseFilter ( keywordTokenStream ) );
         }

    ...
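
The "it's a stream" point can be illustrated with the JDK alone. The sketch below uses a plain java.io.Reader instead of a Lucene TokenStream, so it says nothing about what reset() does in any particular Lucene version; the class and helper names are hypothetical. It only shows that a debugging pass which drains a reader leaves nothing for the next consumer, and that buffering the text once and handing out fresh readers avoids that.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

final class DrainedReaderSketch {
    /** Reads everything that is still left in the reader. */
    static String readAll(Reader reader) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[256];
        try {
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Reader reader = new StringReader(".North.South.East.West Land");

        String debugPass = readAll(reader); // the logging loop consumes the input
        String realPass = readAll(reader);  // the "real" consumer finds nothing left
        System.out.println("[" + debugPass + "]");
        System.out.println("[" + realPass + "]");

        // Buffer once, then give every consumer its own fresh reader.
        String again = readAll(new StringReader(debugPass));
        System.out.println("[" + again + "]");
    }
}
```

If the tokenizer in question reads from a reader like this, printing the tokens inside tokenStream() and then returning a filter over the same source would hand the indexer an already exhausted stream.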

I was real excited when I saw your email this morning. However,
resetting keywordTokenStream as the last line in the "try" resulted in
no change. I also tried uncommenting the lowerCaseStream.reset line in
an act of desperation with no difference. I must be missing something
completely obvious at this point... look at a problem too long and the
obvious fails to jump out at you...

H. Wilson

Thanks.

H. Wilson

On 08/30/2010 09:38 AM, Ard Schrijvers wrote:

On Mon, Aug 30, 2010 at 3:30 PM, H. Wilson<[email protected]>
wrote:

   Ard,

You are absolutely right.. and this didn't make sense to me either. I
think I was too worn out from my week and too excited to have code that
"worked" to notice the obvious... this must be a workaround. However, I
will need a little guidance on how to inspect the tokens. I have Luke,
but never really understood how to use it properly. Could you give me a
clear list of steps, or point me to a resource I missed, on how I would
go about inspecting tokens during insert/search? Thanks.

I'd just print them to your console with Token#term() or use a
debugger. If you do that during indexing and searching, I think you
must see some difference in the token that explains *why* Lucene
doesn't find a hit for your usecase with spaces.

Luke is hard to use with the multi-index jackrabbit indexing, as well
as with the field value prefixing. The prefixing is unfortunate and not
completely necessary any more, but it has historical reasons from back
in the days when Lucene could not handle very many unique fieldnames.

Regards Ard

H. Wilson

On 08/30/2010 03:30 AM, Ard Schrijvers wrote:

Hello,

On Fri, Aug 27, 2010 at 9:06 PM, H. Wilson<[email protected]>
   wrote:

OK, well I got the spaces part figured out, and will post it for anyone
who needs it. Putting quotes around the spaces unfortunately did not
work. During testing, I determined that if you performed the following
query for the exact fullName property:

     filter.addContains ( @fullName, '"+Text.escapeIllegalXpathSearchChars(".North.South.East.West Land"));

It would return nothing. But tweak it a little and add a wildcard, and
it would return results:

    filter.addContains ( @fullName, '"+Text.escapeIllegalXpathSearchChars(".North.South.East.West Lan*"));

This does not make sense...see below

But since I did not want to throw in wild cards where they might not be
wanted, if a search string contained spaces, did not contain wild cards
and the user was not concerned with case sensitivity, I used the
fn:lower-case. So I ended up with the following excerpt (our clients
wanted options for case sensitive and case insensitive searching).

public OurParameter[] getOurParameters (boolean performCaseSensitiveSearch,
String searchTerm, String srchField ) { //srchField in this case was fullName

    .....

    if ( performCaseSensitiveSearch) {

        //jcr:like for case sensitive
        filter.orJCRExpression ("jcr:like(@" + srchField +", '"+Text.escapeIllegalXpathSearchChars (searchTerm)+"')");

    }
    else {

        //only use fn:lower-case if there are spaces, with NO wild cards

        if ( searchTerm.contains (" ") && !searchTerm.contains ("*") && !searchTerm.contains ("?") ) {

            filter.addJCRExpression ("fn:lower-case(@"+srchField+") = '"+Text.escapeIllegalXpathSearchChars(searchTerm.toLowerCase())+"'");

        }

        else {

            //jcr:contains for case insensitive
            filter.addContains ( srchField, Text.escapeIllegalXpathSearchChars(searchTerm));

        }

    }

This seems to me a workaround around the real problem, because it
just doesn't make sense to me. Can you inspect the tokens that are
created by your analyser? Make sure you inspect the tokens during
indexing (just store something) and during searching: just search in
the property. I am quite sure you'll see the issue then. Perhaps it is
something with Text.escapeIllegalXpathSearchChars, though it seems that
it should leave spaces untouched.

Regards Ard

    ....

}
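
The branching in the excerpt above can be condensed into a small, dependency-free sketch that only builds the XPath predicate string. The class and method names are hypothetical, the project's own filter object is left out, and the text emitted for the jcr:contains branch is an assumption, since the thread does not show what filter.addContains generates. Escaping via Text.escapeIllegalXpathSearchChars is omitted to keep the example free of Jackrabbit imports.

```java
import java.util.Locale;

final class PredicateSketch {
    /** Picks the predicate form the way the workaround above does. */
    static String buildPredicate(boolean caseSensitive, String field, String term) {
        if (caseSensitive) {
            // jcr:like compares against the stored value, so case is preserved.
            return "jcr:like(@" + field + ", '" + term + "')";
        }
        boolean hasSpace = term.contains(" ");
        boolean hasWildcard = term.contains("*") || term.contains("?");
        if (hasSpace && !hasWildcard) {
            // Exact, case-insensitive comparison for terms with spaces.
            return "fn:lower-case(@" + field + ") = '"
                    + term.toLowerCase(Locale.ROOT) + "'";
        }
        // Assumed shape of the full-text branch.
        return "jcr:contains(@" + field + ", '" + term + "')";
    }

    public static void main(String[] args) {
        System.out.println(buildPredicate(true, "fullName", ".North.South.East.West Land"));
        System.out.println(buildPredicate(false, "fullName", ".North.South.East.West Land"));
        System.out.println(buildPredicate(false, "fullName", ".North.South.East.West-Lan*"));
    }
}
```

Returning a string instead of mutating a filter makes each branch easy to assert on in a unit test.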


Hope that helps anyone who needs it.

H. Wilson

OK so it looks like I have one other issue. Using the configuration as
posted below and sticking to my previous examples, with the addition of
one with whitespace. With the following three in our repository:

    .North.South.East.WestLand
    .North.South.East.West_Land
    .North.South.East.West Land    //yes that's a space

...using a jcr:contains, with exact name search with NO wild cards: the
first two return properly, but the last one yields no result.

    filter.addContains(@fullName, '"+org.apache.jackrabbit.util.Text.escapeIllegalXpathSearchChars(".North.South.East.West Land") +"'));

I think the space in a contains is seen as an AND by the
Jackrabbit/Lucene QueryParser. I should test this however, as I am not
sure. Perhaps you can put quotes around it, not sure if that works
though.

Regards Ard
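
The hypothesis above (an unquoted space acting as AND) can be modelled in a few lines of plain Java. This is only a toy model with hypothetical names, not the Jackrabbit or Lucene parser: it assumes the search expression is split on whitespace and that every resulting term must equal the single token a keyword-style analyzer would store.

```java
import java.util.Arrays;
import java.util.List;

final class ImplicitAndSketch {
    /** Splits a search expression on whitespace, as an implicit-AND parser might. */
    static List<String> splitTerms(String expression) {
        return Arrays.asList(expression.trim().split("\\s+"));
    }

    /** True only if every term equals the single stored keyword token. */
    static boolean matchesAll(String storedToken, String expression) {
        for (String term : splitTerms(expression)) {
            if (!storedToken.equals(term)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String[] stored = {
            ".North.South.East.WestLand",
            ".North.South.East.West_Land",
            ".North.South.East.West Land"
        };
        for (String value : stored) {
            // Search for the exact stored value, as in the failing query.
            System.out.println(value + " -> " + matchesAll(value, value));
        }
    }
}
```

In this model the value containing a space is split into two terms, neither of which equals the stored token, so only that search misses. The model does not account for the wildcard variant that did return results.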

According to the Lucene documentation, KeywordAnalyzer should be
creating one token, plus combined with escaping the Illegal Characters
(i.e. spaces), shouldn't this search work? Thanks again.

H. Wilson
