The topic of this email was first mentioned in the recent thread about JENA-1620 and Query Timeouts, but it is really a separate topic and potentially gets a little complicated.

It used to be the case that JenaText supported querying of a Lucene text index where the index was created independently of Jena and then made available to JenaText via the dataset configuration.  Is this still the case?

Up until Jena 3.9.0 definitely, and I suspect 3.12.0 (I have not confirmed this yet), it was possible to express text queries with field names, and they worked.

We have a Fuseki system in production (5+ years) that has "its own mechanism"* for building a multi-field Lucene index that is then queried using JenaText.  Those queries specify Lucene field names, as in the example I gave in the earlier thread:

[[

PREFIX  xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX  text: <http://jena.apache.org/text#>
PREFIX  ppd: <http://landregistry.data.gov.uk/def/ppi/>
PREFIX  lrcommon: <http://landregistry.data.gov.uk/def/common/>
SELECT *  {
  ?ppd_propertyAddress
      text:query            ( "street:  the" 3000000 ) .
} LIMIT 1

]]

You can try it on a system running Fuseki 3.9.0 here:

http://landregistry.data.gov.uk/app/qonsole

In a recent test with Jena 3.13.0-SNAPSHOT (from pull request #595) installed in the dev version of that system, the query fails with a query parse error.  Do I need to do some extra configuration to get this to work, for example by specifying a particular text query parser?

Rereading the Jena Text documentation I find:

[[

As mentioned earlier, the text index uses the native Lucene query language <http://lucene.apache.org/core/6_4_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description>; however, there are important constraints on how the Lucene query language is used within jena-text. In particular, /explicit/ references to Lucene |Fields| with the |query string| *are not* supported. So how are Lucene queries that would otherwise refer to multiple |Fields| expressed?

]]

The text goes on to explain the issues around the fact that JenaText indexes each triple/quad as a separate document.
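
To make the one-document-per-triple point concrete, here is a toy sketch in plain Python. It is not Jena or Lucene code, and the resource names, field names and values are all invented for illustration; it only shows why a query that needs two fields to match cannot succeed when each field sits in its own document.

```python
# Toy illustration only: plain Python, not Jena or Lucene code.
# The resources, field names and values below are invented.
triples = [
    ("addr1", "street", "the avenue"),
    ("addr1", "town", "bristol"),
    ("addr2", "street", "high street"),
    ("addr2", "town", "bristol"),
]

def per_triple_docs(triples):
    """One document per triple: each document carries a single field."""
    return [{"uri": s, p: o} for s, p, o in triples]

def per_resource_docs(triples):
    """One document per resource: all of its fields share one document."""
    docs = {}
    for s, p, o in triples:
        docs.setdefault(s, {"uri": s})[p] = o
    return list(docs.values())

def search(docs, **terms):
    """Return URIs of documents in which every given field contains its term."""
    return sorted({
        d["uri"] for d in docs
        if all(term in d.get(field, "").split() for field, term in terms.items())
    })

# Both fields must match within a single document.
print(search(per_triple_docs(triples), street="the", town="bristol"))
print(search(per_resource_docs(triples), street="the", town="bristol"))
```

With one document per triple, no single document holds both fields, so the two-field search finds nothing; with one document per resource, the same search can match.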

I have couched my question so far in terms of querying an externally built text index because that is the simplest, and possibly the most compelling, way to ask the question and to suggest that the use of Lucene fields in text queries should not be disallowed.  Not supported for creating indexes is not the same as not supported for querying indexes.

I am (naively?) hoping that restoring the ability to specify Lucene field names in a text query is a quick fix for someone familiar with the code.  I am not familiar with the code, but am willing to help where I can.

In the interests of full disclosure, however, I should say that the reason we have our own mechanism for building the text index is exactly the one given in the JenaText documentation.  We needed an index where multiple properties of the same resource were indexed as a single document.  I would be happy to discuss this further: why the solution indicated in the JenaText documentation didn't work for us, and whether there is a way to construct a general-purpose JenaText solution that would.  But there is a lot of potential for complexity there, the gears for a new Jena release are beginning to turn, and I have been hoping to deploy this new release when it becomes available.

Brian

* in fact we use JenaText with a custom TextDocProducer implementation.

--
------------------------------------------------------------------------

Brian McBride
[email protected]

Epimorphics Ltd www.epimorphics.com
Court Lodge, 105 High Street, Portishead, Bristol BS20 6PT
Tel: 01275 399069

Epimorphics Ltd. is a limited company registered in England (number 7016688)
Registered address: Court Lodge, 105 High Street, Portishead, Bristol BS20 6PT, UK
