Re: jena-text indexing fields with KeywordAnalyzer

Osma Suominen Fri, 14 Mar 2014 00:30:07 -0700

Hi Paul!

On 14/03/14 02:51, Paul Tyson wrote:

I just tried out the jena-text indexing and query capabilities of jena
2.11. Great stuff, but the property values I indexed contain part
numbers that frequently contain hyphens. Apparently Lucene's
StandardAnalyzer tokenizes on hyphens, so my initial search results were
quite puzzling.

Yes, StandardAnalyzer is "smart" for many scenarios but not good for everything.
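To illustrate why hyphenated part numbers give puzzling results, here is a minimal sketch in plain Java. It does not call Lucene. The two methods are hypothetical stand-ins that only approximate the behaviours discussed here: splitting a value into lowercased word tokens (roughly what StandardAnalyzer does) versus keeping the whole value as one unchanged token (what KeywordAnalyzer does). The class name, method names and sample part number are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class HyphenTokenDemo {

    // Hypothetical stand-in for a word-splitting analyzer:
    // lowercases the value and splits it on every non-alphanumeric character.
    static List<String> splitLikeStandard(String value) {
        List<String> tokens = new ArrayList<>();
        for (String part : value.toLowerCase(Locale.ROOT).split("[^\\p{Alnum}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Hypothetical stand-in for a keyword-style analyzer:
    // the whole value becomes a single token, unchanged.
    static List<String> keepWhole(String value) {
        return List.of(value);
    }

    public static void main(String[] args) {
        String partNumber = "AB-1234-X"; // made-up part number
        System.out.println(splitLikeStandard(partNumber)); // [ab, 1234, x]
        System.out.println(keepWhole(partNumber));         // [AB-1234-X]
    }
}
```

With the first approach a search for "AB-1234-X" is really a search for three separate tokens, so it can match other part numbers that merely share a fragment. With the second approach only the exact value matches.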

However, even with the limited results, I can see that the text queries
are much faster than strstarts() or regex() filters on the same property
values. So I would like to try indexing the property values using
Lucene's KeywordAnalyzer. I think I can see in the code how this could
be easily done.

Searching using an index is typically much faster than filters, because the text index will directly give you (at least approximately) the hits you need, whereas a filter requires traversing through a lot more rows and throwing most of them away.
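The difference can be sketched in plain Java, again without Lucene or Jena. This is only a simplified model under stated assumptions: the "filter" is a loop over every value, the "index" is a hash map from term to row numbers, and both do exact matching only. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class IndexVsFilterDemo {

    // Filter style: visit every row and discard the ones that do not match.
    static List<Integer> scanFilter(List<String> values, String wanted) {
        List<Integer> hits = new ArrayList<>();
        for (int row = 0; row < values.size(); row++) {
            if (values.get(row).equals(wanted)) {
                hits.add(row);
            }
        }
        return hits;
    }

    // Index style, step 1: build a term -> row numbers map once, up front.
    static Map<String, List<Integer>> buildIndex(List<String> values) {
        Map<String, List<Integer>> index = new HashMap<>();
        for (int row = 0; row < values.size(); row++) {
            index.computeIfAbsent(values.get(row), k -> new ArrayList<>()).add(row);
        }
        return index;
    }

    // Index style, step 2: a query is a single map lookup, not a scan.
    static List<Integer> indexLookup(Map<String, List<Integer>> index, String wanted) {
        return index.getOrDefault(wanted, List.of());
    }

    public static void main(String[] args) {
        List<String> values = List.of("AB-1", "CD-2", "AB-1"); // made-up data
        System.out.println(scanFilter(values, "AB-1"));               // [0, 2]
        System.out.println(indexLookup(buildIndex(values), "AB-1"));  // [0, 2]
    }
}
```

Both approaches return the same rows. The scan does work proportional to the number of rows on every query, while the index pays that cost once when it is built and then answers each query directly.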

Has anyone else encountered this problem? Have I missed some other way
to improve response time for a filtered string search, or overestimated
the possible performance improvement? (I'm new to Lucene.) Would the
developers consider an enhancement to make this option configurable in
the text assembler?

It's of course possible to just replace StandardAnalyzer with KeywordAnalyzer in the code and compile your own modified jena-text. Making it configurable would require some more work...

However, another possible solution is to switch to the Solr backend also supported by jena-text. Then you can configure all fields exactly as you like using Solr's schema.xml configuration file [1].


-Osma

[1] http://wiki.apache.org/solr/SchemaXml

--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Teollisuuskatu 23)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
[email protected]
http://www.nationallibrary.fi
