Hi Paul!

On 14/03/14 02:51, Paul Tyson wrote:
I just tried out the jena-text indexing and query capabilities of jena 2.11. Great stuff, but the property values I indexed contain part numbers that frequently contain hyphens. Apparently Lucene's StandardAnalyzer tokenizes on hyphens, so my initial search results were quite puzzling.
Yes, StandardAnalyzer is "smart" for many scenarios but not good for everything.
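To make the hyphen issue concrete, here is a rough sketch in plain Java. It does not call Lucene at all: `wordTokens` is only a toy stand-in for a word-based analyzer (lowercase, split on anything that is not a letter or digit), and `keywordTokens` is a toy stand-in for keyword-style analysis (the whole value stays one token). The class name, method names, and the part number `AB-1234-X` are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

final class PartNumberTokens {
    // Toy model of a word-based analyzer (NOT Lucene's real tokenizer):
    // lowercase the value and split on any run of non-letter, non-digit characters.
    static List<String> wordTokens(String value) {
        List<String> tokens = new ArrayList<>();
        for (String piece : value.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // Toy model of keyword-style analysis: the whole value is one unchanged token.
    static List<String> keywordTokens(String value) {
        return Collections.singletonList(value);
    }

    public static void main(String[] args) {
        String partNumber = "AB-1234-X"; // hypothetical part number
        System.out.println(wordTokens(partNumber));    // prints [ab, 1234, x]
        System.out.println(keywordTokens(partNumber)); // prints [AB-1234-X]
    }
}
```

With the word-style tokens, a search for "1234" would match this part number as well as any other value that contains a "1234" piece, which is probably the kind of puzzling result you saw. With a single keyword token, only the full part number matches.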
However, even with the limited results, I can see that the text queries are much faster than strstarts() or regex() filters on the same property values. So I would like to try indexing the property values using Lucene's KeywordAnalyzer. I think I can see in the code how this could be easily done.
Searching using an index is typically much faster than filters, because the text index will directly give you (at least approximately) the hits you need, whereas a filter requires traversing through a lot more rows and throwing most of them away.
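The difference can be sketched with a toy example, again in plain Java and not using Jena or Lucene. `scan` mimics a filter that tests every row, while `buildIndex` and `lookup` mimic a term index that maps a value straight to the rows containing it. All names and data here are hypothetical, and a real text index is of course far more elaborate than a hash map.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class IndexVersusFilter {
    // Filter style: look at every row and throw away the ones that do not match.
    static List<Integer> scan(List<String> rows, String wanted) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < rows.size(); i++) {
            if (rows.get(i).equals(wanted)) {
                hits.add(i);
            }
        }
        return hits;
    }

    // Index style: pay the cost once, mapping each value to the row numbers holding it.
    static Map<String, List<Integer>> buildIndex(List<String> rows) {
        Map<String, List<Integer>> index = new HashMap<>();
        for (int i = 0; i < rows.size(); i++) {
            index.computeIfAbsent(rows.get(i), key -> new ArrayList<>()).add(i);
        }
        return index;
    }

    // A query is then a single map lookup instead of a pass over all rows.
    static List<Integer> lookup(Map<String, List<Integer>> index, String wanted) {
        return index.getOrDefault(wanted, Collections.<Integer>emptyList());
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("AB-1234-X", "CD-5678-Y", "AB-1234-X");
        Map<String, List<Integer>> index = buildIndex(rows);
        System.out.println(scan(rows, "AB-1234-X"));    // prints [0, 2]
        System.out.println(lookup(index, "AB-1234-X")); // prints [0, 2]
    }
}
```

Both approaches return the same rows; the scan does work proportional to the number of rows on every query, while the lookup goes directly to the matching entries.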
Has anyone else encountered this problem? Have I missed some other way to improve response time for a filtered string search, or overestimated the possible performance improvement? (I'm new to Lucene.) Would the developers consider an enhancement to make this option configurable in the text assembler?
It's of course possible to just replace StandardAnalyzer with KeywordAnalyzer in the code and compile your own modified jena-text. Making it configurable would require some more work...
However, another possible solution is to switch to the Solr backend also supported by jena-text. Then you can configure all fields exactly as you like using Solr's schema.xml configuration file [1].
-Osma

[1] http://wiki.apache.org/solr/SchemaXml

-- 
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Teollisuuskatu 23)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
[email protected]
http://www.nationallibrary.fi
