Re: jena-text limit by language and/or named graph

Osma Suominen Tue, 26 Nov 2013 05:32:25 -0800

Hi Andy!

Thanks for your response. Indeed, I hadn't realized that jena-text indexes on the triple level - I actually thought it worked at the entity/resource level (one Lucene/Solr document per RDF entity). However, looking at the code, there is some code for indexing at the entity level, but that code is unused. So it would actually be pretty easy to add lang and/or graph fields into the index, because those are defined on the triple level.

How about adding optional support for this into jena-text? There could be new configuration options so you could do something like this:


<#entMap> a text:EntityMap ;
    text:entityField      "uri" ;
    text:languageField    "lang" ;
    text:graphField       "graph" ;
    text:defaultField     "text" ;
    text:map (
         [ text:field "text" ; text:predicate rdfs:label ]
         ) .

Without the languageField and graphField properties, there would be no indexing of language/graph information and thus no cost in index size compared to the current situation.

At query time, graph context information could be used to narrow the search when it is available and a graphField is defined in the configuration. Similarly for language, so you could do searches like

{ ?s text:query "gift lang:en" }.
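To make the proposal concrete, here is a minimal toy sketch in Python of a triple-level index with optional language and graph fields. It is not jena-text or Lucene code: the Doc class, build_index and search are hypothetical names invented for illustration, and the word matching is a naive stand-in for a real analyzer.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Doc:
    """One toy index document per indexed triple (not jena-text's real schema)."""
    uri: str
    text: str
    lang: Optional[str] = None   # stored only when a language field is configured
    graph: Optional[str] = None  # stored only when a graph field is configured


def build_index(quads, language_field=False, graph_field=False):
    """quads: iterable of (graph, subject, literal, lang) tuples."""
    return [
        Doc(
            uri=subject,
            text=literal,
            lang=lang if language_field else None,
            graph=graph if graph_field else None,
        )
        for graph, subject, literal, lang in quads
    ]


def search(index, term, lang=None, graph=None):
    """Return URIs whose text contains term, optionally narrowed by lang/graph."""
    wanted = term.lower()
    hits = []
    for doc in index:
        if wanted not in doc.text.lower().split():
            continue
        if lang is not None and doc.lang != lang:
            continue
        if graph is not None and doc.graph != graph:
            continue
        hits.append(doc.uri)
    return hits


# Hypothetical sample data: (graph, subject, literal, language tag).
QUADS = [
    ("http://example.com/g1", "ex:Gift", "gift", "en"),
    ("http://example.com/g2", "ex:Poison", "Gift", "de"),
]
INDEX = build_index(QUADS, language_field=True, graph_field=True)
```

In this toy model, search(INDEX, "gift", lang="en") keeps only the document that came from the English literal, and an index built without the two flags stores neither field, which mirrors the "no extra cost unless configured" idea.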

Does this sound like a sane plan? If it does, I can look at trying to implement it sometime in the next couple of months.


-Osma

18.11.2013 20:25, Andy Seaborne wrote:

Hi there,

There could also be a separate "language" field so that the Lucene
search has a "lang:" field.

It's a trade-off, as in the other thread on string prefixing: the
alternative is doing a search on a word, getting hits regardless of
language, and then filtering on language. Hopefully, the index is
specific enough that the text:query generates only a few possibilities,
so the two-stage process stays cheap.
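The two-stage process described above can be sketched as a toy model in Python. It assumes a small in-memory list of labels standing in for both the text index and the RDF data. The names LABELS, text_query and filter_by_language are hypothetical and have nothing to do with the actual jena-text API.

```python
# Toy label data: (subject, label, language tag). Purely illustrative.
LABELS = [
    ("ex:Gift", "gift", "en"),
    ("ex:Poison", "Gift", "de"),
    ("ex:Poison", "poison", "en"),
]


def text_query(term):
    """Stage 1: language-blind search, returning candidate subjects."""
    wanted = term.lower()
    return sorted({s for s, label, _ in LABELS if wanted in label.lower().split()})


def filter_by_language(candidates, term, lang):
    """Stage 2: keep candidates that have a matching label in the wanted language."""
    wanted = term.lower()
    keep = set(candidates)
    return sorted({
        s for s, label, tag in LABELS
        if s in keep and tag == lang and wanted in label.lower().split()
    })


candidates = text_query("gift")
english_hits = filter_by_language(candidates, "gift", "en")
```

The second stage has to re-check the matched word as well as the language tag. Otherwise ex:Poison would pass the filter just because it also has some English label.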


At the point of execution, it's possible to find out which graph the
pattern is for, so graph-specific search is possible.  The trade-off is
size of index - by adding more details, it's more powerful to search but
index size grows, which can slow things down.

It would help if the analyzer were configurable; that's a fairly
essential starting point. I thought there was a JIRA waiting for
contributions, but I can't find it; then again, I'm on the end of a
phone hotspot connection ATM.

It's probably the design of a way to make it usable, e.g. a sane
configuration, that's key as much as the implementation.

The module is jena-text

https://svn.apache.org/repos/asf/jena/trunk/jena-text/

     Andy

On 18/11/13 07:07, Osma Suominen wrote:

Hi!

Currently jena-text stores only two things about the indexed resources:
their URI, and the literal values of the indexed properties that it has
been configured to look for.


This means that later on it is impossible to limit the text:query
results by language. For example, when searching in a multilingual
dataset, you can search for { ?s text:query "gift" }, and then get
results like this:

ex:Gift rdfs:label "gift"@en .
ex:Poison rdfs:label "Gift"@de .

I'd like to have a way of restricting the hits by language tag at
text:query time, e.g. using the syntax { ?s text:query "gift"@en }.

But with the current index structure this is impossible. Is there a way
to easily implement this? For example, there could be separate fields
for each language, so the index could have fields like uri, text_en,
text_de. Then you could search either using the above syntax (with
language tag in the query literal) or explicitly as { ?s text:query
"text_en:gift" }.


Another similar problem is that the jena-text index is shared for all
named graphs. So if there are different resources in the named graphs,
you cannot match just one of the graphs but instead you will get matches
for all of them mixed up, which could be many more than what you are
interested in.

I'm not entirely sure how to improve on the situation, as "being" in a
specific named graph is a triple-level property and the same resource
could potentially be described in many named graphs. However, I think it
could still be possible to add e.g. a "graph" field into the index
listing all the named graphs in which the resource has been mentioned
(in the triples that affect the index). Then you could query e.g. like
this: { ?s text:query "text:gift graph:http://example.com/mygraph" }. Do
you think this would be a workable idea?
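As a rough illustration of the entity-level variant, the following Python toy keeps one document per resource with a multi-valued "graph" field. build_entity_index, search and the second graph URI are hypothetical, and the matching is deliberately naive.

```python
from collections import defaultdict


def build_entity_index(quads):
    """quads: iterable of (graph, subject, literal). One document per subject."""
    index = defaultdict(lambda: {"text": [], "graph": set()})
    for graph, subject, literal in quads:
        index[subject]["text"].append(literal.lower())
        index[subject]["graph"].add(graph)
    return dict(index)


def search(index, term, graph=None):
    """Return subjects with a matching literal, optionally limited to a graph."""
    wanted = term.lower()
    return sorted(
        uri for uri, doc in index.items()
        if any(wanted in text.split() for text in doc["text"])
        and (graph is None or graph in doc["graph"])
    )


G1 = "http://example.com/mygraph"
G2 = "http://example.com/othergraph"  # hypothetical second graph
INDEX = build_entity_index([
    (G1, "ex:Gift", "gift"),
    (G2, "ex:Gift", "present"),
    (G2, "ex:Poison", "Gift"),
])
```

One consequence shows up in this toy: searching for "gift" limited to G2 also returns ex:Gift, although its "gift" literal only occurs in G1. That happens because the graph list is attached to the whole resource rather than to the individual triple.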


If you think either of these ideas is sound, I'm willing to write
patches to implement these. I develop an application [1] that makes
heavy use of jena-text, named graphs, and multilingual RDF data, and
currently its performance is limited by these issues.

-Osma


[1] http://code.google.com/p/onki-light/



--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Teollisuuskatu 23)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
[email protected]
http://www.nationallibrary.fi
