Re: jena-text limit by language and/or named graph

Osma Suominen Tue, 03 Dec 2013 21:42:54 -0800

Hi all,

there have been no replies so far to my suggestion for jena-text enhancements, which I'd like to implement to get better performance when there are many named graphs. Should I maybe post this to jena-dev instead?


-Osma

29.11.2013 14:02, Osma Suominen wrote:

Hi Andy!

Should this be per map entry / per predicate?  I don't know which is
best - whether an index-wide configuration, or whether it might be
that some predicates are indexed one way and some another.


For now, I think this can be global, i.e. not possible to set per
predicate.

(and if there is no lang, presumably "").


Probably yes, though I'll defer the lang discussion for now and
concentrate on getting the graph information into the index first
because that is more critical for me - I have dozens of graphs, but only
a few languages in each graph.

Sounds sane.


Great!

What would the query predicate in SPARQL look like?


For the graph part, I think there is no need to introduce any new
syntax. Simply having the text:query within the context of a specific
graph should be enough, i.e. this should work:

GRAPH <http://example.com/mygraph> {
   ?s text:query "keyword" .
}

For the language part, I'm not so sure, but I'll defer the discussion
for now.

If it all defaults back to the current mode of operation, we have a
non-disruptive upgrade path, which would be better if possible.  It's a
change of disk format, which is always more of an issue for existing
use.


Yes, that is my intent, to not disrupt existing use in any way.

Attached is a first draft patch which is my attempt at adding graph
information to the index, iff graphField has been set in the config
file, as in the attached config file.

With this patch, you can use a query such as this:

SELECT ?s {
   ?s text:query '+res* +graph:"http\\://example.com/graphA"' .
}

and you will only get results from within the specified graph. This is
obviously a bit awkward since you have to know the name of the graph
field, and also the URI quoting is ugly. But at least it proves that the
graph information was successfully stored in the index and can be used
for retrieval.
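
To illustrate the two layers of quoting involved, here is a minimal Java
sketch (not part of the attached patch). The class and method names are
hypothetical, and the field name "graph" is simply the one used in the
query above. The sketch escapes every character that Lucene's classic
query syntax treats as special, so it escapes more than the hand-written
example above, which only escapes the colon. It then doubles the
backslashes for the SPARQL string literal.

```java
// Hypothetical helper for illustration; not part of the patch or of jena-text.
final class GraphQueryText {
    // Characters that Lucene's classic query syntax treats as special.
    private static final String LUCENE_SPECIAL = "\\+-!():^[]\"{}~*?|&/";

    /** Backslash-escape every Lucene special character in a term. */
    static String escapeLucene(String term) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            if (LUCENE_SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    /** Escape a string for use inside a single-quoted SPARQL literal. */
    static String escapeSparql(String s) {
        return s.replace("\\", "\\\\").replace("'", "\\'");
    }

    /** Build a text:query literal that restricts matches to one graph. */
    static String graphLimitedLiteral(String userQuery, String graphField, String graphUri) {
        String lucene = "+" + userQuery
                + " +" + graphField + ":\"" + escapeLucene(graphUri) + "\"";
        return "'" + escapeSparql(lucene) + "'";
    }
}
```

The user's own query part (here it would be res*) is passed through
unescaped on purpose, so that wildcards keep working. Only the graph URI
is escaped.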

However, I couldn't figure out how to get the URI of the current graph
at query time so that an explicit "graph:" query part could be avoided.

An ExecutionContext is passed to TextQueryPF methods and it has a
getActiveGraph() method which looks promising. But neither the Graph
interface nor the GraphBase implementation seem to be aware of the URI
(or Node in general) they are identified by. The only (possible,
untested) way that I could think of would be to also call
ExecutionContext.getDataset(); then call DatasetGraph.listGraphNodes();
and for each of the Nodes, call DatasetGraph.getGraph(node) and see if
the result matches the Graph that getActiveGraph() returned. But this
seems awfully inefficient, especially if there are lots of graphs. Is
there a better way to find out the URI of the current graph within
TextQueryPF methods?
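
For clarity, the linear scan described above could be sketched roughly as
follows. This uses stand-in types rather than Jena's real classes: the
DatasetLike interface is hypothetical and only mirrors the two calls
named above (listGraphNodes and getGraph). It also assumes that getGraph
returns the very same object as getActiveGraph(), which is untested.

```java
import java.util.Iterator;
import java.util.Map;

// Stand-in types for illustration only; these are NOT Jena's classes.
final class GraphNameLookup {
    /** Minimal stand-in exposing the two dataset calls mentioned above. */
    interface DatasetLike<N, G> {
        Iterator<N> listGraphNodes();
        G getGraph(N node);
    }

    /**
     * Linear scan over all graph names: returns the name whose graph is
     * the same object as activeGraph, or null if none matches.
     * Cost grows with the number of named graphs.
     */
    static <N, G> N findGraphName(DatasetLike<N, G> dataset, G activeGraph) {
        Iterator<N> names = dataset.listGraphNodes();
        while (names.hasNext()) {
            N name = names.next();
            // Identity comparison; whether this holds in Jena is an assumption.
            if (dataset.getGraph(name) == activeGraph) {
                return name;
            }
        }
        return null;
    }

    /** Map-backed stand-in dataset, used only to exercise the scan. */
    static <N, G> DatasetLike<N, G> fromMap(final Map<N, G> graphs) {
        return new DatasetLike<N, G>() {
            public Iterator<N> listGraphNodes() { return graphs.keySet().iterator(); }
            public G getGraph(N node) { return graphs.get(node); }
        };
    }
}
```

Every lookup walks the whole list of graph names, which is why this looks
like a poor fit for datasets with many graphs.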

Finally some misc notes:
- TextDocProducerEntities seems to be unused - not touched
- TextDocProducerTriples.[qQ]uadsToTriples is unused - not touched
- TextIndexLucene.get$ - it seems a bit stupid to use a QueryParser
   when you could directly create a Query programmatically - not touched
- I think get$ was broken anyway because it doesn't take into account
   that the query is tokenized by StandardAnalyzer - but this should now
   be fixed as a side effect of using PerFieldAnalyzerWrapper
- I made similar changes in TextIndexSolr as in TextIndexLucene, but
   have so far tested only the Lucene part

-Osma



--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Teollisuuskatu 23)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
[email protected]
http://www.nationallibrary.fi
