The topic of this email was first mentioned in the recent thread about
JENA-1620 and Query Timeouts but is really a separate topic and
potentially gets a little complicated.
It used to be the case that JenaText supported querying of a Lucene text
index where the index was created independently of Jena and then made
available to JenaText via the dataset configuration. Is this still the
case?
Definitely up to Jena 3.9.0, and I suspect up to 3.12.0 (I have not yet
confirmed this), it was possible to express text queries with field
names, and they worked.
We have a Fuseki system in production (5+ years) that has "its own
mechanism"* for building a multi-field Lucene index that is then queried
using JenaText. Those queries specify Lucene field names, as in the
example I gave in the earlier thread:
[[
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX text: <http://jena.apache.org/text#>
PREFIX ppd: <http://landregistry.data.gov.uk/def/ppi/>
PREFIX lrcommon: <http://landregistry.data.gov.uk/def/common/>
SELECT * {
?ppd_propertyAddress
text:query ( "street: the" 3000000 ) .
} LIMIT 1
]]
You can try it on a system running Fuseki 3.9.0 here:
http://landregistry.data.gov.uk/app/qonsole
In a recent test with Jena 3.13.0-SNAPSHOT (from pull request #595)
installed in the dev version of that system, the query fails with a
query parse error. Do I need to do some extra configuration to get this
to work, for example selecting a particular text query parser?
Rereading the Jena Text documentation I find:
[[
As mentioned earlier, the text index uses the native Lucene query
language
<http://lucene.apache.org/core/6_4_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description>;
however, there are important constraints on how the Lucene query
language is used within jena-text. In particular, /explicit/ references
to Lucene |Fields| with the |query string| *are not* supported. So how
are Lucene queries that would otherwise refer to multiple |Fields|
expressed?
]]
The text goes on to explain the issues around the fact that JenaText
indexes each triple/quad as a separate document.
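
To make the consequence of one-document-per-triple concrete, here is a
toy Python model. It is only a sketch: it uses neither Lucene nor Jena,
and the data, the field names ("street", "town") and the helper
functions are all made up for illustration. It shows that a query with
clauses on two fields cannot match when each document holds a single
field, but can match when all fields of a resource share one document.

```python
from collections import defaultdict

# Hypothetical sample data: (subject, field, value) triples.
TRIPLES = [
    ("addr1", "street", "High Street"),
    ("addr1", "town", "Bristol"),
    ("addr2", "street", "High Road"),
    ("addr2", "town", "Bath"),
]

def docs_per_triple(triples):
    """One document per triple: each document carries exactly one field."""
    return [{"uri": s, field: value} for s, field, value in triples]

def docs_per_resource(triples):
    """One document per subject: all of its fields share one document."""
    merged = defaultdict(dict)
    for s, field, value in triples:
        merged[s]["uri"] = s
        merged[s][field] = value
    return list(merged.values())

def matches(doc, required):
    """True if every required field exists in doc and contains its term."""
    return all(
        term.lower() in doc.get(field, "").lower()
        for field, term in required.items()
    )

def search(docs, required):
    """Return the sorted subject URIs of all documents matching the query."""
    return sorted({doc["uri"] for doc in docs if matches(doc, required)})

# Conjunctive query over two fields, roughly "street:high AND town:bristol".
query = {"street": "high", "town": "bristol"}

print(search(docs_per_triple(TRIPLES), query))    # prints []
print(search(docs_per_resource(TRIPLES), query))  # prints ['addr1']
```

In the per-triple layout no single document ever has both a "street"
and a "town" field, so the conjunction can never be satisfied inside
one document.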
I have couched my question so far in terms of querying an externally
built text index, because that is the simplest, and possibly the most
compelling, way to ask the question and to suggest that the use of
Lucene fields in text queries should not be disallowed. Not supported
for creating indexes is not the same as not supported for querying
indexes.
I am (naively?) hoping that restoring the ability to specify Lucene
field names in a text query is a quick fix for someone familiar with
the code. I am not familiar with the code, but am willing to help where
I can.
In the interests of full disclosure however, I should say that the
reason we have our own mechanism for building the text index is exactly
the one given in the JenaText documentation. We needed an index where
multiple properties of the same resource were indexed as a single
document. I would be happy to discuss this further: why the solution
indicated in the JenaText documentation didn't work for us, and whether
there is a way to construct a general-purpose JenaText solution that
would. But there is a lot of potential for complexity there, and the
gears for a new Jena release are beginning to turn; I have been hoping
to deploy this new release when it becomes available.
Brian
* in fact we use JenaText with a custom TextDocProducer implementation.
--
------------------------------------------------------------------------
Brian McBride
[email protected]
Epimorphics Ltd www.epimorphics.com
Court Lodge, 105 High Street, Portishead, Bristol BS20 6PT
Tel: 01275 399069
Epimorphics Ltd. is a limited company registered in England (number 7016688)
Registered address: Court Lodge, 105 High Street, Portishead, Bristol
BS20 6PT, UK