Re: Problems with hyphen in JSR-170 XPath query using jcr:contains

H. Wilson Fri, 27 Aug 2010 07:36:02 -0700

 Chris,

I think I can answer this one (I'm sure Ard will confirm). Back when I was trying to get this working, one of the things I saw was on this page:


http://wiki.apache.org/jackrabbit/IndexingConfiguration

...near the bottom it talks about setting Analyzers for properties in the indexing_configuration. I think what it is getting at is, since you need it on all properties, you might not need the indexingConfig, and you can just add the line:

<param name="analyzer" value="org.apache.lucene.analysis.WhitespaceAnalyzer"/>

to your SearchIndex tags in your repository.xml, changing the Analyzer to the one which suits you.


H. Wilson


On 08/27/2010 08:27 AM, Dunstall, Christopher wrote:

Ard,

In indexing_configuration.xml, where you named the property where the
analyzer is used (e.g. FullName), how do I set it so that it's used on all
properties of a node?  As previously said, I'm using jcr:contains because I
need to search all parts of the node, so the analyzer needs to have effect
on all properties.

Regards,

Chris


On 27/08/10 2:22 AM, "H. Wilson"<[email protected]>  wrote:

   Finally! I have been hacking away at this here and there for months,
trying all different analyzers or not-using analyzers and modifying my
queries all to no avail! Since I always like precise examples when I am
searching forums, I will post my (nearly) exact solution both for others
and so that Ard might verify that this was indeed what he meant.

Ard, I was hoping you could embellish a little on why we would duplicate
the property? (I didn't actually do it to get this working perfectly)
You lost me a little there, was it for efficiency? Thanks for everything!

H. Wilson

repository.xml (modified both SearchIndex tags to include an
indexingConfiguration):

     <SearchIndex
     class="org.apache.jackrabbit.core.query.lucene.SearchIndex">

         ....
         <param name="indexingConfiguration"
         value="${rep.home}/indexing_configuration.xml"/>

     </SearchIndex>


indexing_configuration.xml:

     <configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
     <analyzers>
     <analyzer
     class="org.mycompany.lucene.analysis.LowerCaseKeywordAnalyzer">
     <property>fullName</property>
     </analyzer>
     </analyzers>
     </configuration>


LowerCaseKeywordAnalyzer.java:

     package org.mycompany.lucene.analysis;

     import java.io.Reader;
     import org.apache.lucene.analysis.KeywordAnalyzer;
     import org.apache.lucene.analysis.LowerCaseFilter;
     import org.apache.lucene.analysis.TokenStream;

     // Keeps the whole property value as a single token, lowercased.
     public class LowerCaseKeywordAnalyzer extends KeywordAnalyzer {

          public TokenStream tokenStream (String field, final Reader reader) {
               TokenStream keywordTokenStream = super.tokenStream (field, reader);
               return new LowerCaseFilter (keywordTokenStream);
          }
     }


Our search class has a method which then does the following:

     public OurParameter[] getOurParameters (String searchTerm,
             String srchField) { // srchField in this case was fullName

         TransientRepository repository = new TransientRepository (
                 OUR_REPO_CONFIG, OUR_REPO_LOCATION);
         Session session = repository.login ();
         List<Class> classes = new ArrayList<Class>();
         classes.add (OurParameter.class);
         Mapper mapper = new AnnotationMapperImpl (classes);
         ObjectContentManager ocm = new ObjectContentManagerImpl (
                 session, mapper);
         queryManager = ocm.getQueryManager();
         FilterImpl filter = (FilterImpl) queryManager.createFilter (
                 OurParameter.class);
         // Escape characters that are illegal in an XPath search term, then
         // replace all single ticks with two ticks (I honestly can't
         // remember why, though).
         filter.addContains (srchField,
                 org.apache.jackrabbit.util.Text
                         .escapeIllegalXpathSearchChars (searchTerm)
                         .replaceAll ("'", "''"));
         Query query = queryManager.createQuery (filter);
         Collection<OurParameter> resultsCollection =
                 (Collection<OurParameter>) ocm.getObjects (query);

         //convert to an array, do some other stuff, and return...

     }
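
On the doubled single ticks in the method above: the likely reason is that the search term ends up inside a single-quoted XPath string literal, where a literal apostrophe has to be written as two apostrophes. A minimal sketch in plain Java follows. The class and method names are hypothetical, and the statement shape is an assumption for illustration, not what the OCM filter does internally.

```java
// Illustrative only: class and method names are hypothetical.
final class XPathLiteralSketch {

    /** Wraps a value in single quotes, doubling any embedded single quote. */
    static String toXPathLiteral(String value) {
        return "'" + value.replace("'", "''") + "'";
    }

    /** Builds a jcr:contains predicate for one property (assumed statement shape). */
    static String containsPredicate(String property, String searchTerm) {
        return "jcr:contains(@" + property + ", " + toXPathLiteral(searchTerm) + ")";
    }

    public static void main(String[] args) {
        // prints: jcr:contains(@fullName, 'o''brien')
        System.out.println(containsPredicate("fullName", "o'brien"));
    }
}
```

Without the doubling, the apostrophe in o'brien would end the string literal early and the query would most likely fail to parse.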



On 08/26/2010 10:42 AM, Ard Schrijvers wrote:

On Thu, Aug 26, 2010 at 3:53 PM, H. Wilson<[email protected]>   wrote:

   Ard,

I have this same problem, however my scenario involves underscores rather
than hyphens. Although since Chris seems to be seeing the same exact

It is because hyphens, just like underscores, are characters the standard
Lucene analyzer splits on. This, combined with the query expansion that
happens for wildcard searches in Lucene, causes your issues:

behavior as I was, I imagine we are both stuck on the same issue. After
scouring the forums for the solution, and not seeing your mentioned
solution, I actually posted my problem as detailed as possible here (
http://markmail.org/message/yh72wqd5b2hbr3j6 ) and received no response.
jcr:like was not an option for me, in this case, as our client wanted the
option for case-insensitive searches. Is there any chance you could please
narrow down where-about the post was which already covered this? Thanks for

I can't seem to find my post again. But, I'll give you a quite simple
solution:

If you want to keep the normal indexing of the property for normal
searching, but also want the yyy* option, you need to duplicate the
property's value in a second property. If your property, like

.North.South.East.WestLand

is only needed for the wildcard searching you describe, you only need
it once. Now, suppose your property is called myProp.

To your indexing configuration XML add:

<configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
    <analyzers>
          <analyzer
class="org.mycompany.lucene.analysis.LowerCaseKeywordAnalyzer">
              <property>myProp</property>
          </analyzer>
    </analyzers>
</configuration>

Your LowerCaseKeywordAnalyzer is very simple: it extends
http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/KeywordAnalyzer.html
and in the method

   TokenStream tokenStream(String fieldName,Reader reader)

after calling the super, you invoke Lucene's LowerCaseFilter.

That is all (after you re-index your repository). Since a -, _, ~ or
whatever is no longer treated as a character to split on, while the
lowercase filter still applies, you can do exactly what you want.

Do the words need to be split on spaces, however? No problem, just add
a WhitespaceTokenizer from Lucene. It is actually pretty simple.
Hope this helps,

Regards Ard

your time.

*H. Wilson*


On 08/26/2010 04:59 AM, Ard Schrijvers wrote:

Hello,

You can search the archives (mails from me) for the wildcard searching
issues described below. Someone had similar issues, and I explained
the wildcard difficulties there. Take a look at jcr:like for your
use cases.

Regards Ard

On Thu, Aug 26, 2010 at 10:19 AM, Dunstall, Christopher
<[email protected]>     wrote:

Hi all,

I'm having some trouble with an XPath query, where I'm searching for
users with hyphens in their name.

I'm using:
jcr:contains(*/*/*,'query')

And it returns some odd results.

I have two users, Sophie-Allen and Sophie-Anne. When I search for
'sophie', I get both users back. Ok, fine, but if I search for 'sophie-a'
(with the hyphen escaped as 'sophie\-a' as per the JSR-170 Spec) I get
zero results returned.  Oddly, if I search for either 'sophie-allen' or
'sophie-anne' I get the respective user details back fine. Shouldn't I get
both users back when escaping the hyphen? Have I missed something in the
spec?

One other odd thing is the addition of an asterisk (*).  Searching for
'soph' and 'soph*' return the same result (both users), but if I search
for 'sophie-allen*', I get zero results, unlike when searching for just
'sophie-allen'. Searching for 'sophie-a*' has the same result as without
the asterisk, i.e. nothing.

The JSR-170 spec doesn't say anything (that I can find) but is the
asterisk a wildcard in the jcr:contains function or does it serve some
other purpose?

Your assistance is greatly appreciated,

Regards,

Chris Dunstall | Service Support - Applications
Technology Integration/OLE Virtual Team
Division of Information Technology | Charles Sturt University | Bathurst,
NSW, Australia

Ph: 02 63384818 | Fax: 02 63384181
