Hi there,
There could also be a separate "language" field so that the Lucene
search has a "lang:" field.
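As a rough illustration of the separate-field idea, here is a toy in-memory model in Python. It is not jena-text or Lucene code: the (field, value) index layout, the one-entry-per-literal choice, and the names build_index and search are all assumptions made for the sketch.

```python
from collections import defaultdict

# One toy "document" per indexed literal: (subject, label, language tag).
# The data is the example from this thread.
DOCS = [
    ("ex:Gift", "gift", "en"),
    ("ex:Poison", "Gift", "de"),
]

def build_index(docs):
    """Map (field, value) to the set of document ids carrying that value."""
    index = defaultdict(set)
    for doc_id, (_subject, text, lang) in enumerate(docs):
        for token in text.lower().split():
            index[("text", token)].add(doc_id)
        # The extra "lang" field is what makes a lang: restriction possible.
        index[("lang", lang)].add(doc_id)
    return index

def search(index, docs, word, lang=None):
    """Look up `word` in the text field; optionally intersect with the lang field."""
    hits = set(index.get(("text", word.lower()), set()))
    if lang is not None:
        hits &= index.get(("lang", lang), set())
    return sorted(docs[doc_id][0] for doc_id in hits)

index = build_index(DOCS)
print(search(index, DOCS, "gift"))        # ['ex:Gift', 'ex:Poison']
print(search(index, DOCS, "gift", "en"))  # ['ex:Gift']
```

The restriction is resolved inside the index, at the cost of one extra entry per indexed literal.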
It's a trade-off, as in the other thread on string prefixing. The
alternative is a two-stage process: search on a word, get hits
regardless of language, and then filter on language. Hopefully, the
index is specific enough that the text:query generates only a few
possibilities, so the second stage has little work to do.
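The two-stage approach can be sketched in the same toy style. This is only a Python model under stated assumptions, not how jena-text executes a query: the index here knows nothing about language, and the hypothetical search_then_filter function checks the language tag against the data afterwards.

```python
from collections import defaultdict

# Toy data: (subject, label, language tag), from the example in this thread.
LABELS = [
    ("ex:Gift", "gift", "en"),
    ("ex:Poison", "Gift", "de"),
]

def build_index(labels):
    """Map each lowercased token to the subjects whose label contains it."""
    index = defaultdict(set)
    for subject, text, _lang in labels:
        for token in text.lower().split():
            index[token].add(subject)
    return index

def search_then_filter(index, labels, word, lang):
    """Stage 1: language-blind index lookup. Stage 2: filter on the language tag."""
    word = word.lower()
    candidates = index.get(word, set())
    return sorted(
        subject
        for subject, text, label_lang in labels
        if subject in candidates
        and label_lang == lang
        and word in text.lower().split()
    )

index = build_index(LABELS)
print(sorted(index["gift"]))                             # ['ex:Gift', 'ex:Poison']
print(search_then_filter(index, LABELS, "gift", "en"))  # ['ex:Gift']
```

The cost of stage 2 grows with the number of candidates from stage 1, which is why it matters that the text lookup is selective.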
At the point of execution, it's possible to find out which graph the
pattern is for, so graph-specific search is possible. The trade-off is
the size of the index: adding more details makes search more powerful,
but the index grows, which can slow things down.
It would help if the analyzer were configurable; that's a fairly
essential starting point. I thought there was a JIRA waiting for
contributions, but I can't find it. Then again, I'm on the end of a
phone hotspot connection ATM.
It's probably the design of a way to make it usable, e.g. sane
configuration, that's key, as much as the implementation.
The module is jena-text
https://svn.apache.org/repos/asf/jena/trunk/jena-text/
Andy
On 18/11/13 07:07, Osma Suominen wrote:
Hi!
Currently jena-text stores only two things about the indexed resources:
their URI, and the literal values of the indexed properties that it has
been configured to look for.
This means that later on it is impossible to limit the text:query
results by language. For example, when searching in a multilingual
dataset, you can search for { ?s text:query "gift" }, and then get
results like this:
ex:Gift rdfs:label "gift"@en .
ex:Poison rdfs:label "Gift"@de .
I'd like to have a way of restricting the hits by language tag at
text:query time, e.g. using the syntax { ?s text:query "gift"@en }.
But with the current index structure this is impossible. Is there a way
to easily implement this? For example, there could be separate fields
for each language, so the index could have fields like uri, text_en,
text_de. Then you could search either using the above syntax (with
language tag in the query literal) or explicitly as { ?s text:query
"text_en:gift" }.
Another similar problem is that the jena-text index is shared for all
named graphs. So if there are different resources in the named graphs,
you cannot match just one of the graphs but instead you will get matches
for all of them mixed up, which could be many more than what you are
interested in.
I'm not entirely sure how to improve on the situation, as "being" in a
specific named graph is a triple-level property and the same resource
could potentially be described in many named graphs. However, I think it
could still be possible to add e.g. a "graph" field into the index
listing all the named graphs in which the resource has been mentioned
(in the triples that affect the index). Then you could query e.g. like
this: { ?s text:query "text:gift graph:http://example.com/mygraph" }. Do
you think this would be a workable idea?
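A toy Python model of the "graph" field idea, to make the proposal concrete. Everything here is an assumption for illustration: the index layout, the query function, the second graph name, and in particular the choice to treat the space-separated clauses as a conjunction.

```python
from collections import defaultdict

# Toy quads: (graph, subject, label). Only the first graph name comes from
# the message above; the second one is made up.
QUADS = [
    ("http://example.com/mygraph", "ex:Gift", "gift"),
    ("http://example.com/othergraph", "ex:Present", "gift"),
]

def build_index(quads):
    """One entry per resource: its text tokens plus every graph mentioning it."""
    index = defaultdict(set)
    for graph, subject, text in quads:
        for token in text.lower().split():
            index[("text", token)].add(subject)
        index[("graph", graph)].add(subject)
    return index

def query(index, query_string):
    """Evaluate 'field:value' clauses, requiring all of them to match."""
    result = None
    for clause in query_string.split():
        # Split on the first colon only, so graph URIs keep their own colons.
        field, _, value = clause.partition(":")
        if field == "text":
            value = value.lower()
        hits = index.get((field, value), set())
        result = set(hits) if result is None else result & hits
    return sorted(result or [])

index = build_index(QUADS)
print(query(index, "text:gift"))
# ['ex:Gift', 'ex:Present']
print(query(index, "text:gift graph:http://example.com/mygraph"))
# ['ex:Gift']
```

Because the sketch keeps one entry per resource, a resource with labels in several graphs would match on any of its graphs, which is the triple-level caveat raised above.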
If you think either of these ideas is sound, I'm willing to write
patches to implement these. I develop an application [1] that makes
heavy use of jena-text, named graphs, and multilingual RDF data, and
currently its performance is limited by these issues.
-Osma
[1] http://code.google.com/p/onki-light/