Re: How to do text search with Jena and Fuseki

Andy Seaborne Tue, 10 Nov 2015 14:58:39 -0800

I was trying to evaluate Jena+Fuseki for a project. The number of
triples that I put in Fuseki is 3161033. Our queries are of search
type, for example, given a search term/phrase get count of results,
first 20 results and some facets. All queries took between 3-10
seconds to execute, which was disappointing.

3 million triples. That's not very many. It will depend on how much is indexed into Lucene and what the query actually is, but elsewhere I've seen much larger datasets with text query running much faster.

There are lots of possible system factors such as hardware, server or client restarts (this is Java!) and how you send the query to the server.


        Andy


On 10/11/15 14:51, Kamble, Ajay, Crest wrote:

Hello,

1. Setup for Free Text Search

        In assembler file I had to put two entries, 1 for TDB dataset and 1 for 
Lucene indexed. After this change I was able to do free text queries for my TDB 
dataset. However, I am not sure if this is the correct way.

        <#service> rdf:type fuseki:Service ;
         fuseki:name "mydb" ; # http://host:port/tdb
         fuseki:serviceQuery "query" ; # SPARQL query service
         fuseki:serviceQuery "sparql" ; # SPARQL query service
         fuseki:serviceUpdate "update" ; # SPARQL update service
         fuseki:serviceUpload "upload" ; # Non-SPARQL upload service
         fuseki:serviceReadWriteGraphStore "data" ; # SPARQL Graph store protocol (read and write)
         fuseki:dataset <#dataset> ;
         #fuseki:dataset :text_dataset ;
        .

         <#service_text_tdb> rdf:type fuseki:Service ;
         fuseki:name "fts" ; # http://host:port/tdb
         fuseki:serviceQuery "query" ; # SPARQL query service
         fuseki:serviceQuery "sparql" ; # SPARQL query service
         fuseki:serviceUpdate "update" ; # SPARQL update service
         fuseki:serviceUpload "upload" ; # Non-SPARQL upload service
         fuseki:serviceReadWriteGraphStore "data" ; # SPARQL Graph store protocol (read and write)
         #fuseki:dataset <#dataset> ;
         fuseki:dataset :text_dataset ;
        .

2. Performance

        I was trying to evaluate Jena+Fuseki for a project. The number of 
triples that I put in Fuseki is 3161033. Our queries are of search type, for 
example, given a search term/phrase get count of results, first 20 results and 
some facets. All queries took between 3-10 seconds to execute, which was 
disappointing.

To be fair, I do not have much knowledge and I have just done basic setup at 
this point.
        Are there any ways to get a better performance?
        Is the data size a problem here? The count of triples is only going to 
increase.
        Can it give better or comparable performance than Neo4J for same data?

Interestingly, free text search returned much earlier than other queries, it 
took roughly 1 second.

3. Other Triplestore

        What other triplestore can be used if high performance is required 
along with ability to do free text search?

-Ajay

On Nov 4, 2015, at 10:10 PM, Andy Seaborne <[email protected]> wrote:

On 04/11/15 16:11, Kamble, Ajay, Crest wrote:

I created text index with this command:

java -cp fuseki-server.jar jena.textindexer --desc=/tmp/fuseki-assembler.ttl


This must be done after you removed tdb:unionDefaultGraph

Then check the place where you have stored the text index (and check there are 
not two on your disk - you gave it a relative file name) and see if it has any 
data in it.
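
A minimal sketch of the index description with an absolute location, so that jena.textindexer and fuseki-server resolve the same directory regardless of the working directory each one is started from. The path /tmp/fuseki-lucene is an assumption for illustration and does not come from this thread; <#entMap> refers to the entity map already defined in the assembler file.

```turtle
@prefix text: <http://jena.apache.org/text#> .

# Text index description with an absolute directory.
# ASSUMPTION: /tmp/fuseki-lucene is a hypothetical path; use the real index location.
<#indexLucene> a text:TextIndexLucene ;
    text:directory <file:///tmp/fuseki-lucene> ;
    # <#entMap> is the entity map from the existing assembler file.
    text:entityMap <#entMap> ;
    .
```

With a relative name such as <file:Lucene>, the directory is resolved against the current working directory, which is how two separate index directories can end up on disk.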


        Andy


-Regards
Ajay

On Nov 4, 2015, at 9:28 PM, Kamble, Ajay, Crest wrote:

Hi Andy,

Thanks for help. My server was able to access data after commenting 
‘tdb:unionDefaultGraph’.

But the free text search that I tried did not work. I tried the following query 
but I got 0 results.

PREFIX text: <http://jena.apache.org/text#>

SELECT ?s
{
    ?s text:query 'gold' .
}

Is my configuration for text search correct? Also, how do I specify 2 datasets 
in a single service?

Here is snippet from configuration:

# Text index description
<#indexLucene> a text:TextIndexLucene ;
    text:directory <file:Lucene> ;
    ##text:directory "mem" ;
    text:entityMap <#entMap> ;
    .

# Mapping in the index
# URI stored in field "uri"
# rdfs:label is mapped to field "text"
<#entMap> a text:EntityMap ;
    text:entityField      "uri" ;
    text:defaultField     "text" ;
    text:map (
         [ text:field "text" ; text:predicate no:name ]
         [ text:field "text" ; text:predicate no:alt-name ]
         [ text:field "text" ; text:predicate no:name ]
         [ text:field "text" ; text:predicate no:title ]
         [ text:field "text" ; text:predicate no:author ]
         [ text:field "text" ; text:predicate no:inventor ]
         ) .

[] rdf:type fuseki:Server ;
   # Server-wide context parameters can be given here.
   # For example, to set query timeouts: on a server-wide basis:
   # Format 1: "1000" -- 1 second timeout
   # Format 2: "10000,60000" -- 10s timeout to first result, then 60s timeout for rest of query.
   # See java doc for ARQ.queryTimeout
   # ja:context [ ja:cxtName "arq:queryTimeout" ;  ja:cxtValue "10000" ] ;

   # Load custom code (rarely needed)
   # ja:loadClass "your.code.Class" ;

   # Services available.  Only explicitly listed services are configured.
   #  If there is a service description not linked from this list, it is ignored.
   fuseki:services (
     <#service>
     #<#service_text_tdb>
   ) .

<#service>  rdf:type fuseki:Service ;
    fuseki:name              "mydb" ;       # http://host:port/tdb
    fuseki:serviceQuery               "query" ;    # SPARQL query service
    fuseki:serviceQuery               "sparql" ;   # SPARQL query service
    fuseki:serviceUpdate              "update" ;   # SPARQL update service
    fuseki:serviceUpload              "upload" ;   # Non-SPARQL upload service
    fuseki:serviceReadWriteGraphStore "data" ;     # SPARQL Graph store protocol (read and write)
    fuseki:dataset           <#dataset> ;
    #fuseki:dataset                  :text_dataset ;
.

-Regards
Ajay

On Nov 4, 2015, at 7:54 PM, Andy Seaborne wrote:

On 04/11/15 14:11, Kamble, Ajay, Crest wrote:
That worked for me. Also the option is —config and not —conf.

--config and --conf are synonyms.

And it's "-" or "--" but not the en-dash or em-dash character your email is 
putting in.


Fuseki starts but it does not read my existing data. If I execute simple query 
to get count of triples, I get 0. Also, Fuseki gives this warning - Dataset not 
found: No session.

Check the config file.

Try without "tdb:unionDefaultGraph true"


If I start Fuseki with —loc option and not —config, then it correctly reads all 
data and the same query gives correct count.

--loc is shorthand for TDB only, no text dataset, no default union graph.


Is there anything wrong with the way I have configured dataset in assembler 
file?

Also, do I need to create 2 different services for normal SPARQL query and text 
search?

If the query has no text:query, it executes like a plain SPARQL query on the 
TDB datasets.
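
A minimal sketch of what that implies for the service definition: a single service attached only to the text dataset should serve both plain SPARQL and text:query. It assumes the :text_dataset and <#dataset> definitions from the assembler file in this thread and keeps the service name "mydb"; it is an illustration, not a tested configuration.

```turtle
@prefix :       <http://mydb.com/ns/dataset#> .
@prefix rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix fuseki: <http://jena.apache.org/fuseki#> .

<#service> rdf:type fuseki:Service ;
    fuseki:name          "mydb" ;
    fuseki:serviceQuery  "query" ;     # SPARQL query service
    fuseki:serviceUpdate "update" ;    # SPARQL update service
    # Only the text dataset is attached. It wraps <#dataset> via text:dataset,
    # so queries without text:query still run against the TDB data.
    fuseki:dataset       :text_dataset ;
    .
```

The service must also be listed under fuseki:services in the server description, as in the original file.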

In other words, can I execute both types of queries in single console or not?

-Regards
Ajay


On Nov 4, 2015, at 7:35 PM, Andy Seaborne wrote:

On 04/11/15 13:59, Kamble, Ajay, Crest wrote:
Hi Andy,

I tried that but it did not work. I got another error,

fuseki-server --update —conf=/tmp/fuseki-assembler.ttl /mydb
Required: either --config=FILE or one of --mem, --file, --loc or --desc

fuseki-server --conf=/tmp/fuseki-assembler.ttl

The service name is in the assembler file - you can't give it again on the 
command line.

Andy


-Regards
Ajay

On Nov 4, 2015, at 5:43 PM, Andy Seaborne wrote:

Change "--desc" to "--conf"

"--desc" works in the restricted case when there is one dataset description - 
but in this case there are two - the TDB dataset and the text dataset built over that.

Andy

On 04/11/15 12:10, Kamble, Ajay, Crest wrote:
Hi All,

1. Triplestore

I have an existing Triplestore that I setup by putting data in Fuseki. I used 
Java code to put all triples in Fuseki (here is url that I used - 
http://localhost:3030/mydb/data). Before starting loading of data I start 
Fuseki with this command:

fuseki-server --update --loc=/tmp/fuseki-tdb /mydb
(on Mac OS X).

My database is located at /tmp/fuseki-tdb

This setup works well and I can query all triples from console.

2. Free Text Search

I need to setup free text search on top of this Triplestore, so that normal 
Sparql queries and free text queries are both possible.

Here is the assembler file that I used.

@prefix :        <http://mydb.com/ns/dataset#> .
@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix tdb:     <http://jena.hpl.hp.com/2008/tdb#> .
@prefix ja:      <http://jena.hpl.hp.com/2005/11/Assembler#> .
@prefix text:    <http://jena.apache.org/text#> .
@prefix fuseki:  <http://jena.apache.org/fuseki#> .
@prefix no: <http://mydb.com/ns/concepts#> .
@prefix d: <http://mydb.com/ns/data#> .

## Example of a TDB dataset and text index
## Initialize TDB
[] ja:loadClass "com.hp.hpl.jena.tdb.TDB" .
tdb:DatasetTDB  rdfs:subClassOf  ja:RDFDataset .
tdb:GraphTDB    rdfs:subClassOf  ja:Model .

## Initialize text query
[] ja:loadClass       "org.apache.jena.query.text.TextQuery" .
# A TextDataset is a regular dataset with a text index.
text:TextDataset      rdfs:subClassOf   ja:RDFDataset .
# Lucene index
text:TextIndexLucene  rdfs:subClassOf   text:TextIndex .
# Solr index
text:TextIndexSolr    rdfs:subClassOf   text:TextIndex .

## ---------------------------------------------------------------
## This URI must be fixed - it's used to assemble the text dataset.

:text_dataset rdf:type     text:TextDataset ;
    text:dataset   <#dataset> ;
    text:index     <#indexLucene> ;
    .

# A TDB dataset used for RDF storage
<#dataset> rdf:type      tdb:DatasetTDB ;
    tdb:location "/tmp/fuseki-tdb" ;
    tdb:unionDefaultGraph true ; # Optional
    .

# Text index description
<#indexLucene> a text:TextIndexLucene ;
    text:directory <file:Lucene> ;
    ##text:directory "mem" ;
    text:entityMap <#entMap> ;
    .

# Mapping in the index
# URI stored in field "uri"
# rdfs:label is mapped to field "text"
<#entMap> a text:EntityMap ;
    text:entityField      "uri" ;
    text:defaultField     "text" ;
    text:map (
         [ text:field "text" ; text:predicate no:name ]
         [ text:field "text" ; text:predicate no:alt-name ]
         [ text:field "text" ; text:predicate no:name ]
         [ text:field "text" ; text:predicate no:title ]
         [ text:field "text" ; text:predicate no:author ]
         [ text:field "text" ; text:predicate no:inventor ]
         ) .

[] rdf:type fuseki:Server ;
   # Server-wide context parameters can be given here.
   # For example, to set query timeouts: on a server-wide basis:
   # Format 1: "1000" -- 1 second timeout
   # Format 2: "10000,60000" -- 10s timeout to first result, then 60s timeout for rest of query.
   # See java doc for ARQ.queryTimeout
   # ja:context [ ja:cxtName "arq:queryTimeout" ;  ja:cxtValue "10000" ] ;

   # Load custom code (rarely needed)
   # ja:loadClass "your.code.Class" ;

   # Services available.  Only explicitly listed services are configured.
   #  If there is a service description not linked from this list, it is ignored.
   fuseki:services (
     <#service>
     #<#service_text_tdb>
   ) .

<#service>  rdf:type fuseki:Service ;
    fuseki:name              "mydb" ;       # http://host:port/tdb
    fuseki:serviceQuery               "query" ;    # SPARQL query service
    fuseki:serviceQuery               "sparql" ;   # SPARQL query service
    fuseki:serviceUpdate              "update" ;   # SPARQL update service
    fuseki:serviceUpload              "upload" ;   # Non-SPARQL upload service
    fuseki:serviceReadWriteGraphStore "data" ;     # SPARQL Graph store protocol (read and write)
    fuseki:dataset           <#dataset> ;
    fuseki:dataset                  :text_dataset ;
.

With this assembler file, I start my server with the following command,

fuseki-server --update --desc=/Users/kamb16/projects/nano/data/fuseki-assembler.ttl /mydb

I get the following error,

com.hp.hpl.jena.sparql.ARQException: Found two matches: var ?root -> http://mydb.com/ns/dataset#text_dataset, file:///tmp/fuseki-assembler.ttl#dataset
at com.hp.hpl.jena.sparql.util.QueryExecUtils.getOne(QueryExecUtils.java:360)
at com.hp.hpl.jena.sparql.util.graph.GraphUtils.findRootByType(GraphUtils.java:194)
at com.hp.hpl.jena.sparql.core.assembler.AssemblerUtils.build(AssemblerUtils.java:91)
at arq.cmdline.ModAssembler.create(ModAssembler.java:68)
at arq.cmdline.ModDatasetAssembler.createDataset(ModDatasetAssembler.java:43)
at org.apache.jena.fuseki.FusekiCmd.processModulesAndArgs(FusekiCmd.java:307)
at arq.cmdline.CmdArgModule.process(CmdArgModule.java:50)
at arq.cmdline.CmdMain.mainMethod(CmdMain.java:101)
at arq.cmdline.CmdMain.mainRun(CmdMain.java:63)
at arq.cmdline.CmdMain.mainRun(CmdMain.java:50)
at org.apache.jena.fuseki.FusekiCmd.main(FusekiCmd.java:166)

I do not understand how to fix this issue. Could you please help? I want to do 
regular Sparql queries as well as Free text search.

Regards,
Ajay
