Re: Problems with hyphen in JSR-170 XPath query using jcr:contains

Ard Schrijvers Fri, 27 Aug 2010 07:38:57 -0700

On Fri, Aug 27, 2010 at 4:35 PM, H. Wilson <[email protected]> wrote:
>  Chris,
>
> I think I can answer this one (I'm sure Ard will confirm), but back when I
> was trying to get this working, one of the things I saw was on this page:
>
> http://wiki.apache.org/jackrabbit/IndexingConfiguration
>
> ...near the bottom it talks about setting Analyzers for properties in the
> indexing_configuration. I think what it is getting at is, since you need it
> on all properties, you might not need the indexingConfig, and you can just
> add the line:
>
> <param name="analyzer"
> value="org.apache.lucene.analysis.WhitespaceAnalyzer"/>
>
> to your SearchIndex targets in your repository.xml, modifying the Analyzer
> to the one which suits you.


That is correct. However, I doubt whether you would want to have this
analyser for all your content :-)
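
To make that caveat concrete, here is a minimal plain-Java sketch (no Lucene
on the classpath; the class and method names are made up for illustration).
It assumes only that a whitespace-only analyzer splits on whitespace and
neither lowercases nor strips punctuation, and mimics that with String.split.
Hyphenated names stay intact, but ordinary text becomes case-sensitive and
keeps its punctuation:

```java
import java.util.Arrays;
import java.util.List;

public class WhitespaceOnlySketch {

    // Hypothetical stand-in for whitespace-only analysis: split on runs of
    // whitespace, keeping case and punctuation exactly as they appear.
    static List<String> whitespaceTokens(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        // A hyphenated name survives as a single token.
        System.out.println(whitespaceTokens("Sophie-Allen Smith"));

        // Body text keeps its capital letter and trailing punctuation, so
        // looking for the exact token "hello" finds nothing here.
        List<String> body = whitespaceTokens("Hello, world.");
        System.out.println(body + " contains 'hello'? " + body.contains("hello"));
    }
}
```

So a global setting like this probably only makes sense if every property
holds identifier-like values; otherwise a per-property analyzer in the
indexing configuration seems the safer choice.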

Regards Ard

>
> H. Wilson
>
>
> On 08/27/2010 08:27 AM, Dunstall, Christopher wrote:
>>
>> Ard,
>>
>> In indexing_configuration.xml, where you named the property where the
>> analyzer is used (e.g. FullName), how do I set it so that it's used on all
>> properties of a node?  As previously said, I'm using jcr:contains because I
>> need to search all parts of the node, so the analyzer needs to have effect
>> on all properties.
>>
>> Regards,
>>
>> Chris
>>
>>
>> On 27/08/10 2:22 AM, "H. Wilson"<[email protected]>  wrote:
>>
>>>   Finally! I have been hacking away at this here and there for months,
>>> trying all different analyzers or not-using analyzers and modifying my
>>> queries all to no avail! Since I always like precise examples when I am
>>> searching forums, I will post my (nearly) exact solution both for others
>>> and so that Ard might verify that this was indeed what he meant.
>>>
>>> Ard, I was hoping you could embellish a little on why we would duplicate
>>> the property? (I didn't actually do it to get this working perfectly)
>>> You lost me a little there, was it for efficiency? Thanks for everything!
>>>
>>> H. Wilson
>>>
>>> repository.xml (modified both SearchIndex tags to include an
>>> indexingConfiguration):
>>>
>>>     <SearchIndex
>>>     class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
>>>
>>>         ....
>>>         <param name="indexingConfiguration"
>>>         value="${rep.home}/indexing_configuration.xml"/>
>>>
>>>     </SearchIndex>
>>>
>>>
>>> indexing_configuration.xml:
>>>
>>>     <configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
>>>     <analyzers>
>>>     <analyzer
>>>     class="org.mycompany.lucene.analysis.LowerCaseKeywordAnalyzer">
>>>     <property>fullName</property>
>>>     </analyzer>
>>>     </analyzers>
>>>     </configuration>
>>>
>>>
>>> LowerCaseKeywordAnalyzer.java:
>>>
>>>     package org.mycompany.lucene.analysis;
>>>
>>>     import java.io.Reader;
>>>     import org.apache.lucene.analysis.KeywordAnalyzer;
>>>     import org.apache.lucene.analysis.LowerCaseFilter;
>>>     import org.apache.lucene.analysis.TokenStream;
>>>
>>>     // Emits the whole property value as a single, lowercased token.
>>>     public class LowerCaseKeywordAnalyzer extends KeywordAnalyzer {
>>>
>>>         public TokenStream tokenStream (String field, final Reader reader) {
>>>             TokenStream keywordTokenStream = super.tokenStream (field, reader);
>>>             return new LowerCaseFilter (keywordTokenStream);
>>>         }
>>>     }
>>>
>>>
>>> Our search class has a method which then does the following:
>>>
>>>     public OurParameter[] getOurParameters (String searchTerm,
>>>             String srchField) { // srchField in this case was fullName
>>>
>>>         TransientRepository repository =
>>>             new TransientRepository (OUR_REPO_CONFIG, OUR_REPO_LOCATION);
>>>         Session session = repository.login ();
>>>         List<Class> classes = new ArrayList<Class>();
>>>         classes.add (OurParameter.class);
>>>         Mapper mapper = new AnnotationMapperImpl (classes);
>>>         ObjectContentManager ocm =
>>>             new ObjectContentManagerImpl (session, mapper);
>>>         queryManager = ocm.getQueryManager();
>>>         FilterImpl filter =
>>>             (FilterImpl) queryManager.createFilter (OurParameter.class);
>>>         filter.addContains (srchField,
>>>             org.apache.jackrabbit.util.Text
>>>                 .escapeIllegalXpathSearchChars (searchTerm)
>>>                 .replaceAll ("'", "''"));
>>>         // (that last was replace all single ticks with two ticks, I
>>>         // honestly can't remember why though)
>>>         Query query = queryManager.createQuery (filter);
>>>         Collection<OurParameter> resultsCollection =
>>>             (Collection<OurParameter>) ocm.getObjects (query);
>>>
>>>         //convert to an array, do some other stuff, and return...
>>>
>>>     }
>>>
>>>
>>>
>>> On 08/26/2010 10:42 AM, Ard Schrijvers wrote:
>>>>
>>>> On Thu, Aug 26, 2010 at 3:53 PM, H. Wilson<[email protected]>   wrote:
>>>>>
>>>>>   Ard,
>>>>>
>>>>> I have this same problem, however my scenario involves underscores rather
>>>>> than hyphens. Although since Chris seems to be seeing the same exact
>>>>
>>>> It is because hyphens, just like underscores, are characters the standard
>>>> Lucene analyzer splits on. This, combined with the query expansion that
>>>> happens for wildcard searches in Lucene, causes your issues:
>>>>
>>>>> behavior as I was, I imagine we are both stuck on the same issue. After
>>>>> scouring the forums for the solution, and not seeing your mentioned
>>>>> solution, I actually posted my problem as detailed as possible here (
>>>>> http://markmail.org/message/yh72wqd5b2hbr3j6 ) and received no response.
>>>>> jcr:like was not an option for me, in this case, as our client wanted the
>>>>> option for case-insensitive searches. Is there any chance you could please
>>>>> narrow down where-about the post was which already covered this? Thanks for
>>>>
>>>> I can't seem to find my post again. But, I'll give you a quite simple
>>>> solution:
>>>>
>>>> If you want to have the normal indexing of the property for normal
>>>> searching, but also want to have the yyy* option, you need to
>>>> duplicate the property also in another property. If your property,
>>>> like
>>>>
>>>> .North.South.East.WestLand
>>>>
>>>> is only needed for the one you describe with wildcard searching, you
>>>> only need it once. Now, suppose, your property is called myProp.
>>>>
>>>> To your configuration.xml add:
>>>>
>>>> <configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
>>>>    <analyzers>
>>>>          <analyzer
>>>> class="org.mycompany.lucene.analysis.LowerCaseKeywordAnalyzer">
>>>>              <property>myProp</property>
>>>>          </analyzer>
>>>>    </analyzers>
>>>> </configuration>
>>>>
>>>> Your LowerCaseKeywordAnalyzer is very simple: it extends
>>>>
>>>> http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/KeywordAnalyzer.html
>>>> and in the method
>>>>
>>>>   TokenStream tokenStream(String fieldName,Reader reader)
>>>>
>>>> after calling the super, you invoke Lucene's LowerCaseFilter.
>>>>
>>>> That is all (after you do a re-index of your repository). Since now a
>>>> -, or _ or ~ or whatever is not seen as a token to split on, but you
>>>> still use lowercase filter, you can do exactly what you want.
>>>>
>>>> Do the words need to be split on spaces, however? No problem, just add
>>>> a WhitespaceTokenizer from Lucene. It is actually pretty simple.
>>>>
>>>> Hope this helps,
>>>>
>>>> Regards Ard
>>>>
>>>>> your time.
>>>>>
>>>>> *H. Wilson*
>>>>>
>>>>>
>>>>> On 08/26/2010 04:59 AM, Ard Schrijvers wrote:
>>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> You can search the archives (mail from me) for posts on wildcard
>>>>>> searching related to the issue below. There was someone having similar
>>>>>> issues, and I explained the wildcard difficulties. Take a look at
>>>>>> jcr:like for your use cases.
>>>>>>
>>>>>> Regards Ard
>>>>>>
>>>>>> On Thu, Aug 26, 2010 at 10:19 AM, Dunstall, Christopher
>>>>>> <[email protected]>     wrote:
>>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I'm having some trouble with an XPath query, where I'm searching for
>>>>>>> users with hyphens in their name.
>>>>>>>
>>>>>>> I'm using:
>>>>>>> jcr:contains(*/*/*,'query')
>>>>>>>
>>>>>>> And it returns some odd results.
>>>>>>>
>>>>>>> I have two users, Sophie-Allen and Sophie-Anne. When I search for
>>>>>>> 'sophie', I get both users back. Ok, fine, but if I search for 'sophie-a'
>>>>>>> (with the hyphen escaped as 'sophie\-a' as per the JSR-170 Spec) I get
>>>>>>> zero results returned.  Oddly, if I search for either 'sophie-allen' or
>>>>>>> 'sophie-anne' I get the respective user details back fine. Shouldn't I
>>>>>>> get both users back when escaping the hyphen? Have I missed something in
>>>>>>> the spec?
>>>>>>>
>>>>>>> One other odd thing is the addition of an asterisk (*).  Searching for
>>>>>>> 'soph' and 'soph*' return the same result (both users), but if I search
>>>>>>> for 'sophie-allen*', I get zero results, unlike when searching for just
>>>>>>> 'sophie-allen'. Searching for 'sophie-a*' has the same result as without
>>>>>>> the asterisk, i.e. nothing.
>>>>>>>
>>>>>>> The JSR-170 spec doesn't say anything (that I can find) but is the
>>>>>>> asterisk a wildcard in the jcr:contains function or does it serve some
>>>>>>> other purpose?
>>>>>>>
>>>>>>> Your assistance is greatly appreciated,
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Chris Dunstall | Service Support - Applications
>>>>>>> Technology Integration/OLE Virtual Team
>>>>>>> Division of Information Technology | Charles Sturt University |
>>>>>>> Bathurst, NSW, Australia
>>>>>>>
>>>>>>> Ph: 02 63384818 | Fax: 02 63384181
>>>>>>>
>>
>
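
As a rough illustration of the explanation quoted above, the following
plain-Java sketch (no Lucene involved; all class and method names are
hypothetical) contrasts two ways of indexing the value Sophie-Allen. It
assumes, as described in the thread, that the default analysis lowercases and
splits on hyphens, that a lowercased keyword analysis keeps the whole value as
one token, and that a wildcard term such as sophie-a* is compared against the
indexed tokens as a literal prefix:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

public class PrefixMatchSketch {

    // Rough stand-in for the default analysis described in the thread:
    // lowercase, then split on anything that is not a letter or a digit.
    static List<String> splitTokens(String value) {
        List<String> tokens = new ArrayList<String>();
        for (String t : value.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Rough stand-in for a lowercased keyword analysis: the whole value
    // becomes exactly one token.
    static List<String> keywordToken(String value) {
        return Collections.singletonList(value.toLowerCase(Locale.ROOT));
    }

    // A wildcard term like "sophie-a*" treated as a literal prefix that
    // must match the start of at least one indexed token.
    static boolean prefixMatches(List<String> tokens, String prefix) {
        for (String t : tokens) {
            if (t.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String name = "Sophie-Allen";
        System.out.println("split tokens:  " + splitTokens(name)
                + " match: " + prefixMatches(splitTokens(name), "sophie-a"));
        System.out.println("keyword token: " + keywordToken(name)
                + " match: " + prefixMatches(keywordToken(name), "sophie-a"));
    }
}
```

In this simplified model no split token begins with sophie-a, while the single
keyword token does, which is consistent with the behaviour reported at the
start of the thread. Real Lucene analyzers have more rules than this sketch
covers.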
