Re: LowerCaseKeywordAnalyzer

Osma Suominen Tue, 16 Jun 2015 00:11:08 -0700

Hi Todd!

Great that you got it working!

For the record, the default field is queried when you don't specify a property in the text:query. E.g. { ?s text:query 'word' } uses the default field.
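
As a rough illustration of that rule, here is a toy Python model of how a text:query could be routed to a Lucene field. It is only a sketch, not Jena's implementation: the function name field_to_search is hypothetical, the field names are copied from the configuration quoted later in this thread, and what happens for a property that is not in the map is deliberately not modelled.

```python
# Toy model only: predicate -> Lucene field, as declared in the entity map.
FIELD_FOR_PREDICATE = {
    "fma:preferred_name": "pref",
    "fma:synonym": "syn",
    "fma:non-English_equivalent": "noneng",
}

# Corresponds to text:defaultField in the entity map.
DEFAULT_FIELD = "pref"


def field_to_search(prop=None):
    """Return the field a query would target in this toy model.

    With no property given, the default field is used; otherwise the
    field mapped to that property is used.
    """
    if prop is None:
        return DEFAULT_FIELD
    return FIELD_FOR_PREDICATE[prop]


print(field_to_search())               # no property: default field
print(field_to_search("fma:synonym"))  # explicit property: its own field
```

In this model, changing the default only affects queries that omit the property.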


-Osma


On 15/06/15 19:20, Todd Detwiler wrote:

And that worked, thanks! I changed the Lucene field of each of the 3
properties I was indexing on (and changed the default to one of those 3,
though I can't see that it matters what the default is). I still don't
understand how the Lucene fields are related to SPARQL queries (as all
fields seem to be queried), but hey, it works, so I'm happy. For the
reference of others, here is the relevant bit of my ttl file:

:entMap a text:EntityMap ;
     text:entityField      "uri" ;
     text:defaultField     "pref" ;
     text:map (
          [ text:field "pref" ;
              text:predicate fma:preferred_name;
              text:analyzer [
                a text:LowerCaseKeywordAnalyzer
            ]
          ]
          [ text:field "syn" ;
              text:predicate fma:synonym;
              text:analyzer [
                a text:LowerCaseKeywordAnalyzer
             ]
          ]
          [ text:field "noneng" ;
              text:predicate fma:non-English_equivalent;
              text:analyzer [
                a text:LowerCaseKeywordAnalyzer
             ]
          ]
          ) ;
     text:queryAnalyzer [
         a text:LowerCaseKeywordAnalyzer
     ] .

-Todd

Landon Todd Detwiler
Structural Informatics Group (SIG)
University of Washington

phone: 206-351-7721

On 6/15/15 9:11 AM, Todd Detwiler wrote:

Thanks Osma,
The search is presently matching even if none of the fields begin with
"cor", but at least one has a word within it that starts with "cor".
I'll try some of the simpler configurations that you suggest. If I'm
honest, I don't really know what the Lucene field is for, which is why
I mapped them all to the text field. I was guessing that the text
field was the one accessible from queries (as the field is not
actually specified in the query itself). I can certainly break them
up. I'll let you know how it turns out.
Todd

On 6/15/15 7:07 AM, Osma Suominen wrote:

Hi Todd!

Okay so the problem is not the Fuseki version. I just wanted to check
that first.

I notice that you have mapped several properties to the same field
name "text". Is it possible that among the values of those
properties, there is one beginning with "cortex"? That could explain
why you are getting a hit.

Could you try a simpler configuration first, using only one property
(e.g. fma:preferred_name) and one Lucene field (e.g. "pref"). If you
get that working properly, you can try adding more properties. I
would store their values in separate Lucene fields (e.g. "synonym"
and "noneng"). Putting many property values in the same field may
give surprising results, as the values will be mixed up and
concatenated, and there is no way to tell which property was
originally used for which value.
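
The effect of sharing one field can be sketched with a toy index in Python. This is an assumption-laden illustration rather than Lucene itself: the subjects ex:A and ex:B and their label values are made up, each value is kept as a single lowercased token (as a keyword-style analyzer would), and build_index and prefix_search are hypothetical helper names.

```python
from collections import defaultdict


def build_index(triples, field_for):
    """Index (subject, predicate, value) triples as field -> [(subject, token)]."""
    index = defaultdict(list)
    for subject, predicate, value in triples:
        index[field_for[predicate]].append((subject, value.lower()))
    return index


def prefix_search(index, field, prefix):
    """Return subjects having a token in `field` that starts with `prefix`."""
    prefix = prefix.lower()
    return sorted({s for s, token in index[field] if token.startswith(prefix)})


# Hypothetical data, for illustration only.
triples = [
    ("ex:A", "fma:preferred_name", "Anterior cortex"),
    ("ex:A", "fma:synonym", "Cortex, anterior"),
    ("ex:B", "fma:preferred_name", "Cortex of lens"),
]

# Every property mapped to the same field "text".
shared = build_index(
    triples, {"fma:preferred_name": "text", "fma:synonym": "text"}
)
# One field per property.
separate = build_index(
    triples, {"fma:preferred_name": "pref", "fma:synonym": "syn"}
)

print(prefix_search(shared, "text", "cor"))    # ['ex:A', 'ex:B']
print(prefix_search(separate, "pref", "cor"))  # ['ex:B']
print(prefix_search(separate, "syn", "cor"))   # ['ex:A']
```

With the shared field, ex:A is returned because of its synonym even though its preferred name does not start with "cor", and the index cannot say which property produced the hit.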

You can also use the Luke tool to inspect the tokens stored in the
Lucene index, though you may have to hunt around a bit to find a
version compatible with the Lucene version used by Jena.

-Osma

15.06.2015, 05:01, Todd Detwiler wrote:

Hmm, so there must be something wrong with my configuration then. I'm
using Fuseki 2.0.0. Any thoughts on what I might be doing wrong?
Thanks,
Todd

On 6/11/15 12:27 AM, Osma Suominen wrote:

Hi Todd!

Your understanding of LowerCaseKeywordAnalyzer is correct. "cor*"
shouldn't match "Anterior superficial cortex proper of left lens".

Which version of Fuseki are you using? LowerCaseKeywordAnalyzer is a
fairly recent addition (IIRC it arrived in Fuseki 1.1.1).

-Osma

11.06.2015, 02:31, Todd Detwiler wrote:

I'm having difficulty getting the text indexer to use the
LowerCaseKeywordAnalyzer. I was hoping someone might be able to suggest
what I am doing wrong. Here are my details:

1. My dataset is in TDB
2. I am building a Lucene text index
3. I index based on multiple ontology properties
4. I am serving both via Fuseki
5. I am connecting from a remote application to the Fuseki service to
answer SPARQL queries.

TDB and Fuseki are running fine and accessible. The index exists and it
will answer text queries (and the index appears to cover all of the
properties that I included). But the results seem consistent with the
StandardAnalyzer, not a keyword analyzer. So, first, let me tell you
what I am expecting and what I am seeing:

Classes in my ontology have multiple label fields. I am indexing on all
of them. Here is an example, a value from one of the fields indexed:
"Anterior superficial cortex proper of left lens". The relevant portion
of my query looks like this: ?s text:query (?prop "cor*"). To me that
should match results that start with "cor". The standard analyzer would
divide the value into individual words, "anterior", "superficial", ...
Because one of those tokens matches (cortex), I would expect a search
hit. But if I use a keyword analyzer, it should consider the entire
label as a single token. And, therefore, it should NOT match (since it
does not start with "cor"). But that isn't what I am seeing.
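
The expectation described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Lucene: standard_tokens and keyword_tokens are hypothetical stand-ins for word-level and lowercase-keyword analysis (the real StandardAnalyzer has more rules, for example around stop words), and prefix_hit models a prefix query such as cor* as "some indexed token starts with the prefix".

```python
import re


def standard_tokens(value):
    """Rough stand-in for word-level analysis: lowercase, split into words."""
    return [t for t in re.split(r"[^0-9a-z]+", value.lower()) if t]


def keyword_tokens(value):
    """Rough stand-in for lowercase keyword analysis: one token per value."""
    return [value.lower()]


def prefix_hit(tokens, prefix):
    """True if any token starts with the (lowercased) prefix."""
    prefix = prefix.lower()
    return any(t.startswith(prefix) for t in tokens)


label = "Anterior superficial cortex proper of left lens"

# Word-level tokens include "cortex", so cor* finds the label.
print(prefix_hit(standard_tokens(label), "cor"))          # True
# As one keyword token the label starts with "anterior", so cor* misses.
print(prefix_hit(keyword_tokens(label), "cor"))           # False
# A prefix of the whole label still matches the keyword token.
print(prefix_hit(keyword_tokens(label), "anterior sup"))  # True
```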

Am I misunderstanding how the keyword analyzer is supposed to work?

I build my index like this:
java -cp $FUSEKI_HOME/fuseki-server.jar jena.textindexer \
    --desc=fuseki-assembler.ttl

and my assembler looks like this:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix tdb:     <http://jena.hpl.hp.com/2008/tdb#> .
@prefix ja: <http://jena.hpl.hp.com/2005/11/Assembler#> .
@prefix text:    <http://jena.apache.org/text#> .
@prefix fuseki:  <http://jena.apache.org/fuseki#> .
@prefix foaf:    <http://xmlns.com/foaf/0.1/> .
@prefix fma:      <http://purl.org/sig/ont/fma/> .
@prefix :        <http://localhost/jena_example/#> .

[] rdf:type fuseki:Server ;
    fuseki:services (
      :service_text_tdb
    ) .

## Example of a TDB dataset and text index
## Initialize TDB
[] ja:loadClass "com.hp.hpl.jena.tdb.TDB" .
tdb:DatasetTDB  rdfs:subClassOf  ja:RDFDataset .
tdb:GraphTDB    rdfs:subClassOf  ja:Model .

## Initialize text query
[] ja:loadClass "org.apache.jena.query.text.TextQuery" .
# A TextDataset is a regular dataset with a text index.
text:TextDataset      rdfs:subClassOf   ja:RDFDataset .
# Lucene index
text:TextIndexLucene  rdfs:subClassOf   text:TextIndex .

## ---------------------------------------------------------------
## This URI must be fixed - it's used to assemble the text dataset.

:text_dataset rdf:type     text:TextDataset ;
     text:dataset   :dataset ;
     text:index     :indexLucene ;
     .

# A TDB dataset used for RDF storage
:dataset rdf:type      tdb:DatasetTDB ;
     tdb:location "/usr/local/tdb/fma" ;
     .

<#graph1> rdf:type tdb:GraphTDB ;
     tdb:dataset <#dataset> ;
     tdb:graphName <http://purl.org/sig/ont/fma.owl> ;
     .

# Text index description
:indexLucene a text:TextIndexLucene ;
     text:directory <file:Lucene> ;
     ##text:directory "mem" ;
     text:entityMap :entMap ;
     .

# Mapping in the index
# URI stored in field "uri"
# rdfs:label is mapped to field "text"
:entMap a text:EntityMap ;
     text:entityField      "uri" ;
     text:defaultField     "text" ;
     text:map (
          [ text:field "text" ;
              text:predicate fma:preferred_name;
              text:analyzer [
                a text:LowerCaseKeywordAnalyzer
            ]
          ]
          [ text:field "text" ;
              text:predicate fma:synonym;
              text:analyzer [
                a text:LowerCaseKeywordAnalyzer
             ]
          ]
          [ text:field "text" ;
              text:predicate fma:non-English_equivalent;
              text:analyzer [
                a text:LowerCaseKeywordAnalyzer
             ]
          ]
          ) ;
     text:queryAnalyzer [
         a text:LowerCaseKeywordAnalyzer
     ] .

:service_text_tdb rdf:type fuseki:Service ;
     rdfs:label                      "TDB/text service" ;
     fuseki:name                     "sig" ;
     fuseki:serviceQuery             "query" ;
     fuseki:serviceQuery             "sparql" ;
     fuseki:serviceUpdate            "update" ;
     fuseki:serviceUpload            "upload" ;
     fuseki:serviceReadGraphStore    "get" ;
     fuseki:serviceReadWriteGraphStore    "data" ;
     fuseki:dataset                  :text_dataset ;
     .



If anyone can spot what I am doing wrong, I'd really appreciate a
heads-up.

Thanks,
Todd



--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
http://www.nationallibrary.fi
