Re: Problems with hyphen in JSR-170 XPath query using jcr:contains

H. Wilson Thu, 26 Aug 2010 13:23:16 -0700


On 08/26/2010 12:57 PM, Ard Schrijvers wrote:

Hello Wilson et al,
In that case, sorry for my late help. I am not always in a position to
take time to help. Also, query expansion with wildcard searching is
imo not Lucene's best part. Anyway, for those interested, I could try
to dig up some mails I sent internally in the past. It is something
that is hard to grasp without having some Lucene background, though.

No need to apologize. I was tempted to bump it after a month, but I wasn't sure if that violated forum etiquette. I hope the OP today is getting as much out of this as I am!

Yes, this is how I meant it, with the analyser part.
I meant that you would need this *only* if you also want the
original 'free text indexing' of the property. Thus, if you would like
to index some property both with the original Jackrabbit indexing and
with a keyword-like one, you need the property twice... but
normally you don't need this.
You're welcome.


Thank you for reporting back that it works.

Regards Ard

OK, so it looks like I have one other issue. I am using the configuration as posted below and sticking to my previous examples, with the addition of one with whitespace. We have the following three in our repository:


   .North.South.East.WestLand
   .North.South.East.West_Land
   .North.South.East.West Land    //yes that's a space

...using a jcr:contains, with exact name search with NO wild cards: the first two return properly, but the last one yields no result.


   filter.addContains(@fullName, '"+org.apache.jackrabbit.util.Text.escapeIllegalXpathSearchChars(".North.South.East.West Land") +"'));

According to the Lucene documentation, KeywordAnalyzer should be creating one token. Combined with escaping the illegal characters (i.e. spaces), shouldn't this search work? Thanks again.
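One possible cause, offered as a hedged sketch rather than a confirmed diagnosis: if the jcr:contains search text is split on whitespace on the query side before each term is analyzed, the single indexed token containing a space can never equal any one query term. The plain-Java model below only imitates that behavior; it does not call Lucene or Jackrabbit, and the names QuerySplitSketch, indexToken, queryTerms and matches are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class QuerySplitSketch {

    // Index side: keyword-style analysis keeps the whole value as one lowercased token.
    static String indexToken(String value) {
        return value.toLowerCase(Locale.ROOT);
    }

    // Query side (assumption): the search text is split on whitespace
    // before the individual terms are compared against the index.
    static List<String> queryTerms(String searchText) {
        return Arrays.asList(searchText.trim().toLowerCase(Locale.ROOT).split("\\s+"));
    }

    // In this model a value matches only if every query term equals the indexed token.
    static boolean matches(String indexedValue, String searchText) {
        String token = indexToken(indexedValue);
        for (String term : queryTerms(searchText)) {
            if (!term.equals(token)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String[] names = {
            ".North.South.East.WestLand",
            ".North.South.East.West_Land",
            ".North.South.East.West Land"
        };
        for (String name : names) {
            System.out.println(name + " -> terms " + queryTerms(name)
                    + ", matches itself: " + matches(name, name));
        }
    }
}
```

If this model is right, the value with the space is the only one that fails to match itself, which would be consistent with the results reported here.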


H. Wilson


repository.xml (modified both SearchIndex tags to include an
indexingConfiguration):

<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">

....
<param name="indexingConfiguration"
value="${rep.home}/indexing_configuration.xml"/>

</SearchIndex>

indexing_configuration.xml:

<configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
     <analyzers>
         <analyzer
class="org.mycompany.lucene.analysis.LowerCaseKeywordAnalyzer">
             <property>fullName</property>
         </analyzer>
     </analyzers>
</configuration>

LowerCaseKeywordAnalyzer.java:

package org.mycompany.lucene.analysis;

import java.io.Reader;
import org.apache.lucene.analysis.KeywordAnalyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;

// Indexes the whole property value as a single token, lowercased.
public class LowerCaseKeywordAnalyzer extends KeywordAnalyzer {

    public TokenStream tokenStream (String field, final Reader reader) {
        // KeywordAnalyzer emits the entire input as one token
        TokenStream keywordTokenStream = super.tokenStream (field, reader);
        // lowercase that token so searches can be case-insensitive
        return new LowerCaseFilter (keywordTokenStream);
    }
}

Our search class has a method which then does the following:

public OurParameter[] getOurParameters (String searchTerm, String srchField) {
    // srchField in this case was fullName

    TransientRepository repository =
        new TransientRepository (OUR_REPO_CONFIG, OUR_REPO_LOCATION);
    Session session = repository.login ();
    List<Class> classes = new ArrayList<Class>();
    classes.add (OurParameter.class);
    Mapper mapper = new AnnotationMapperImpl (classes);
    ObjectContentManager ocm = new ObjectContentManagerImpl (session, mapper);
    queryManager = ocm.getQueryManager();
    FilterImpl filter = (FilterImpl) queryManager.createFilter (OurParameter.class);
    filter.addContains (srchField,
        org.apache.jackrabbit.util.Text.escapeIllegalXpathSearchChars (searchTerm)
            .replaceAll ("'", "''"));
    // (that last was replace all single ticks with two ticks, I honestly can't
    // remember why though)
    Query query = queryManager.createQuery (filter);
    Collection<OurParameter> resultsCollection =
        (Collection<OurParameter>) ocm.getObjects (query);

    // convert to an array, do some other stuff, and return...

}
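
A likely reason for doubling the single ticks, given as a sketch under that assumption: the search term ends up inside a single-quoted XPath string literal, and inside such a literal an apostrophe is written as two apostrophes. The class and method names below (XPathLiteralSketch, quoteForXPath, containsPredicate) are hypothetical and only build a string; they are not part of Jackrabbit or OCM.

```java
public class XPathLiteralSketch {

    // Wrap a raw value in single quotes, doubling any embedded single quote
    // so that it does not terminate the XPath string literal early.
    static String quoteForXPath(String raw) {
        return "'" + raw.replace("'", "''") + "'";
    }

    // Build a jcr:contains predicate for one property (illustration only).
    static String containsPredicate(String property, String searchTerm) {
        return "jcr:contains(@" + property + ", " + quoteForXPath(searchTerm) + ")";
    }

    public static void main(String[] args) {
        System.out.println(containsPredicate("fullName", ".North.South.East.WestLand"));
        System.out.println(containsPredicate("fullName", "O'Brien"));
    }
}
```

Without the doubling, a search term containing an apostrophe would close the literal early and the resulting query would not parse as intended.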


On 08/26/2010 10:42 AM, Ard Schrijvers wrote:

On Thu, Aug 26, 2010 at 3:53 PM, H. Wilson<[email protected]>  wrote:

  Ard,

I have this same problem, however my scenario involves underscores rather
than hyphens. Although since Chris seems to be seeing the same exact

It is because hyphens, just like underscores, are characters that the
standard Lucene analyzer splits on. This, combined with the query
expansion that happens for wildcard searches in Lucene, causes your issues:

behavior as I was, I imagine we are both stuck on the same issue. After
scouring the forums for the solution, and not seeing your mentioned
solution, I actually posted my problem as detailed as possible here (
http://markmail.org/message/yh72wqd5b2hbr3j6 ) and received no response.
jcr:like was not an option for me, in this case, as our client wanted the
option for case-insensitive searches. Is there any chance you could please
narrow down where-about the post was which already covered this? Thanks for

I can't seem to find my post again. But, I'll give you a quite simple
solution:

If you want to have the normal indexing of the property for normal
searching, but also want to have the yyy* option, you need to
duplicate the property also in another property. If your property,
like

.North.South.East.WestLand

is only needed for the one you describe with wildcard searching, you
only need it once. Now, suppose, your property is called myProp.

To your indexing configuration XML add:

<configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
   <analyzers>
         <analyzer
class="org.mycompany.lucene.analysis.LowerCaseKeywordAnalyzer">
             <property>myProp</property>
         </analyzer>
   </analyzers>
</configuration>

Your LowerCaseKeywordAnalyzer is very simple: it extends
http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/KeywordAnalyzer.html
and in the method

  TokenStream tokenStream(String fieldName,Reader reader)

after calling the super, you invoke Lucene's LowerCaseFilter.

That is all (after you re-index your repository). Since a -, _, ~ or
whatever is now no longer seen as a character to split on, but you
still use the lowercase filter, you can do exactly what you want.

Do the words need to be split on spaces, however? No problem, just add
a WhitespaceTokenizer from Lucene. It is actually pretty simple.
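
The difference between these tokenization strategies can be illustrated without Lucene. The following is a minimal sketch in plain Java that only imitates them. All names (TokenizationSketch, splitOnPunctuation, keywordLowerCase, whitespaceLowerCase, matchesPrefix) are hypothetical, and the punctuation-splitting rule is a simplification of the behavior attributed above to the standard analyzer, not Lucene's actual implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class TokenizationSketch {

    // Simplified stand-in for an analyzer that splits on anything
    // that is not a letter or digit (so '.', '-', '_' and ' ' all split).
    static List<String> splitOnPunctuation(String value) {
        return nonEmpty(value.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+"));
    }

    // Stand-in for keyword analysis plus lowercasing: one token, whole value.
    static List<String> keywordLowerCase(String value) {
        List<String> tokens = new ArrayList<String>();
        tokens.add(value.toLowerCase(Locale.ROOT));
        return tokens;
    }

    // Stand-in for whitespace tokenizing plus lowercasing: split on spaces only.
    static List<String> whitespaceLowerCase(String value) {
        return nonEmpty(value.toLowerCase(Locale.ROOT).split("\\s+"));
    }

    // A prefix search such as yyy* matches if any indexed token starts with the prefix.
    static boolean matchesPrefix(List<String> tokens, String prefix) {
        String p = prefix.toLowerCase(Locale.ROOT);
        for (String token : tokens) {
            if (token.startsWith(p)) {
                return true;
            }
        }
        return false;
    }

    // Drop the empty strings that String.split can produce at the start.
    private static List<String> nonEmpty(String[] parts) {
        List<String> tokens = new ArrayList<String>();
        for (String part : parts) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String value = ".North.South.East.West_Land";
        String prefix = ".north.south.east.west_l";
        System.out.println("split tokens:   " + splitOnPunctuation(value));
        System.out.println("keyword token:  " + keywordLowerCase(value));
        System.out.println("prefix on split:   " + matchesPrefix(splitOnPunctuation(value), prefix));
        System.out.println("prefix on keyword: " + matchesPrefix(keywordLowerCase(value), prefix));
    }
}
```

In this model the prefix search fails against the split tokens, because no single token starts with the full prefix, and succeeds against the single lowercased keyword token.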

Hope this helps,

Regards Ard

your time.

*H. Wilson*


On 08/26/2010 04:59 AM, Ard Schrijvers wrote:

Hello,

You can search the archives (mail from me) for wildcard searching
issues related to the ones below. There was someone having similar
issues. I explained the wildcard difficulties. Take a look at jcr:like
for your use cases.

Regards Ard

On Thu, Aug 26, 2010 at 10:19 AM, Dunstall, Christopher
<[email protected]>    wrote:

Hi all,

I'm having some trouble with an XPath query, where I'm searching for
users with hyphens in their name.

I'm using:
jcr:contains(*/*/*,'query')

And it returns some odd results.

I have two users, Sophie-Allen and Sophie-Anne. When I search for
'sophie', I get both users back. Ok, fine, but if I search for 'sophie-a'
(with the hyphen escaped as 'sophie\-a' as per the JSR-170 Spec) I get zero
results returned.  Oddly, if I search for either 'sophie-allen' or
'sophie-anne' I get the respective user details back fine. Shouldn't I get
both users back when escaping the hyphen? Have I missed something in the
spec?

One other odd thing is the addition of an asterisk (*).  Searching for
'soph' and 'soph*' return the same result (both users), but if I search for
'sophie-allen*', I get zero results, unlike when searching for just
'sophie-allen'. Searching for 'sophie-a*' has the same result as without the
asterisk, i.e. nothing.

The JSR-170 spec doesn't say anything (that I can find) but is the
asterisk a wildcard in the jcr:contains function or does it serve some other
purpose?

Your assistance is greatly appreciated,

Regards,

Chris Dunstall | Service Support - Applications
Technology Integration/OLE Virtual Team
Division of Information Technology | Charles Sturt University | Bathurst,
NSW, Australia

Ph: 02 63384818 | Fax: 02 63384181
