Re: jena-text limit by language and/or named graph

Andy Seaborne Mon, 18 Nov 2013 10:31:16 -0800

Hi there,

There could also be a separate "language" field so that the Lucene search has a "lang:" field.

It's a trade-off, as in the other thread on string prefixing: do a search on a word, get it regardless of language, and then filter on language. Hopefully the index is specific enough that the two-stage process benefits from the text:query generating only a few possibilities.
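
To make the two-stage idea concrete, here is a minimal Python sketch. It is a toy in-memory model, not the jena-text or Lucene API: the names `LABELS`, `build_index` and `search_then_filter` are made up for illustration, and the data is the ex:Gift / ex:Poison example from the quoted message.

```python
from collections import defaultdict

# Toy data: (subject, label text, language tag). Purely illustrative.
LABELS = [
    ("ex:Gift", "gift", "en"),
    ("ex:Poison", "Gift", "de"),
]

def build_index(labels):
    """Language-blind index: lowercased word -> set of subjects."""
    index = defaultdict(set)
    for subject, text, _lang in labels:
        for word in text.lower().split():
            index[word].add(subject)
    return index

def search_then_filter(index, labels, word, lang):
    """Stage 1: index lookup ignoring language. Stage 2: filter on language."""
    word = word.lower()
    # Stage 1: the role text:query plays, producing a small candidate set.
    candidates = index.get(word, set())
    # Stage 2: keep candidates that have a matching label in the wanted language.
    return sorted(
        subject
        for subject, text, label_lang in labels
        if subject in candidates
        and label_lang == lang
        and word in text.lower().split()
    )

index = build_index(LABELS)
candidates = sorted(index["gift"])  # both resources, regardless of language
english_hits = search_then_filter(index, LABELS, "gift", "en")
```

Stage 2 only pays off when stage 1 returns few candidates, which is the hope expressed above.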

At the point of execution, it's possible to find out which graph the pattern is for, so graph-specific searching is possible. The trade-off is the size of the index: adding more detail makes searching more powerful, but the index grows, which can slow things down.

It would help if the analyzer were configurable; that's a fairly essential starting point. I thought there was a JIRA waiting for contributions, but I can't find it. Then again, I'm on the end of a phone hotspot connection ATM.

It's probably the design of a way to make it usable, e.g. sane configuration, that's key, as much as the implementation.


The module is jena-text

https://svn.apache.org/repos/asf/jena/trunk/jena-text/

        Andy

On 18/11/13 07:07, Osma Suominen wrote:

Hi!

Currently jena-text stores only two things about the indexed resources:
their URI, and the literal values of the indexed properties that it has
been configured to look for.


This means that later on it is impossible to limit the text:query
results by language. For example, when searching in a multilingual
dataset, you can search for { ?s text:query "gift" }, and then get
results like this:

ex:Gift rdfs:label "gift"@en .
ex:Poison rdfs:label "Gift"@de .

I'd like to have a way of restricting the hits by language tag at
text:query time, e.g. using the syntax { ?s text:query "gift"@en }.

But with the current index structure this is impossible. Is there a way
to easily implement this? For example, there could be separate fields
for each language, so the index could have fields like uri, text_en,
text_de. Then you could search either using the above syntax (with
language tag in the query literal) or explicitly as { ?s text:query
"text_en:gift" }.

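As a rough illustration of the per-language-field idea, the following Python sketch models an index whose field names are derived from the language tag. This is a toy model under stated assumptions, not how jena-text or Lucene is implemented: `build_fielded_index` and `query` are hypothetical helpers, and the `text_<lang>` naming simply follows the text_en / text_de suggestion above.

```python
from collections import defaultdict

# Toy data: (subject, label text, language tag). Purely illustrative.
LABELS = [
    ("ex:Gift", "gift", "en"),
    ("ex:Poison", "Gift", "de"),
]

def build_fielded_index(labels):
    """Index keyed by (field, word), with one field per language tag."""
    index = defaultdict(set)
    for subject, text, lang in labels:
        field = "text_" + lang  # e.g. text_en, text_de
        for word in text.lower().split():
            index[(field, word)].add(subject)
    return index

def query(index, query_string):
    """Resolve a single 'field:word' clause, e.g. 'text_en:gift'."""
    field, _, word = query_string.partition(":")
    return sorted(index.get((field, word.lower()), set()))

index = build_fielded_index(LABELS)
english_hits = query(index, "text_en:gift")
german_hits = query(index, "text_de:gift")
```

A query literal with a language tag, such as "gift"@en, could then be mapped onto the `text_en` field before lookup.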

Another similar problem is that the jena-text index is shared for all
named graphs. So if there are different resources in the named graphs,
you cannot match just one of the graphs but instead you will get matches
for all of them mixed up, which could be many more than what you are
interested in.

I'm not entirely sure how to improve on the situation, as "being" in a
specific named graph is a triple-level property and the same resource
could potentially be described in many named graphs. However, I think it
could still be possible to add e.g. a "graph" field into the index
listing all the named graphs in which the resource has been mentioned
(in the triples that affect the index). Then you could query e.g. like
this: { ?s text:query "text:gift graph:http://example.com/mygraph" }. Do
you think this would be a workable idea?
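
The "graph" field idea can be sketched in Python as follows. This is a toy model only: the quads, the second graph URI and the helper names are invented for illustration, and the query parser just splits on whitespace and the first colon. In a real Lucene query string the colons inside the graph URI would probably need escaping or quoting.

```python
from collections import defaultdict

# Hypothetical quads: (graph, subject, label). Purely illustrative.
QUADS = [
    ("http://example.com/mygraph", "ex:Gift", "gift"),
    ("http://example.com/other", "ex:Gift", "gift"),
    ("http://example.com/other", "ex:Present", "gift"),
]

def build_index(quads):
    """Index keyed by (field, term); 'graph' lists every graph mentioning the resource."""
    index = defaultdict(set)
    for graph, subject, label in quads:
        index[("graph", graph)].add(subject)
        for word in label.lower().split():
            index[("text", word)].add(subject)
    return index

def query(index, query_string):
    """AND together whitespace-separated 'field:term' clauses."""
    result = None
    for clause in query_string.split():
        # Split on the first colon only, so the URI keeps its own colons.
        field, _, term = clause.partition(":")
        postings = index.get((field, term), set())
        result = postings if result is None else result & postings
    return sorted(result or [])

index = build_index(QUADS)
all_hits = query(index, "text:gift")
mygraph_hits = query(index, "text:gift graph:http://example.com/mygraph")
```

Because the graph field here is recorded per resource rather than per triple, a resource can match on text from one graph and on the graph field from another, which is the triple-level caveat raised above.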


If you think either of these ideas is sound, I'm willing to write
patches to implement these. I develop an application [1] that makes
heavy use of jena-text, named graphs, and multilingual RDF data, and
currently its performance is limited by these issues.

-Osma


[1] http://code.google.com/p/onki-light/
