[ANN] Jena and text search

Andy Seaborne Thu, 11 Apr 2013 09:41:45 -0700

There is a new, experimental module - jena-text - for anyone interested to try out.

This is a possible replacement for LARQ (whether to call it "LARQ2" or something else is for discussion). It is not compatible with the current LARQ1.


== Features

* works in Fuseki, with assembler setup,
  without the need for additional Java code.

* tracks additions to the dataset

* works with Lucene4, and with Solr4 for sharing
  the text index with non-SPARQL apps.

* simpler and smaller index design

== Documentation

http://jena.staging.apache.org/documentation/query/text-query.html

== Example query

# text search on rdfs:label for occurrences of "word"
# then retrieve the actual value from the RDF data
PREFIX :     <http://example/>
PREFIX text: <http://jena.apache.org/text#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT *
{ ?s text:query (rdfs:label 'word') ;
     rdfs:label ?label
}

== Download

It's available from the Apache snapshot Maven repository:

https://repository.apache.org/content/repositories/snapshots/org/apache/jena/

Depends on Jena 2.10.1 SNAPSHOT.

SVN is currently:
https://svn.apache.org/repos/asf/jena/Experimental/jena-text/

== Fuseki

There is a special build of Fuseki in the jena-text artifact area:

(You will need a copy of the pages/ directory from the Fuseki distribution if you want the web pages as well.)


There is an example of a Fuseki config at the end of this message.

== Notes

Currently, it does not expose the match score. The real requirement we found for that is to retain ordering in text search results; score is only a partial solution to that (two hits can have the same score). Maybe we need a "row id".
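
To illustrate the ordering point, here is a minimal Python sketch. It is not jena-text code: the hit list, the `row_id` field and both helper functions are hypothetical, made up only to show why re-sorting by score cannot always recover the index's ordering, while a per-hit row id could.

```python
# Hypothetical hits in the order the text index returned them.
# Two hits share the same score, which is the problematic case.
index_order = [
    {"uri": "http://example/b", "score": 0.8},
    {"uri": "http://example/a", "score": 0.8},
    {"uri": "http://example/c", "score": 0.5},
]

# Attach a row id (position in the index result); an assumed design, not an existing feature.
hits = [dict(hit, row_id=i) for i, hit in enumerate(index_order)]

# Later query processing (joins etc.) may deliver rows in some other order.
scrambled = [hits[2], hits[1], hits[0]]

def restore_by_score(rows):
    """Sort by descending score; ties keep whatever order the rows arrived in."""
    return sorted(rows, key=lambda r: -r["score"])

def restore_by_row_id(rows):
    """Sort by row id; this is a total order, so the index order is recovered."""
    return sorted(rows, key=lambda r: r["row_id"])

by_score = [r["uri"] for r in restore_by_score(scrambled)]
by_row_id = [r["uri"] for r in restore_by_row_id(scrambled)]
original = [r["uri"] for r in index_order]

print(by_score == original)   # the tie between a and b is not resolved correctly
print(by_row_id == original)  # the row id restores the index order
```

With tied scores, sorting by score leaves the tied hits in arrival order, so the original ranking is lost; the row id avoids that because it is unique per hit.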


Not tested heavily at scale.

Many thanks to Brian McBride (Epimorphics) who has contributed testing, bug fixes and generally made it better.

Comments and feedback especially welcome - it is easier to change things before the first release, when APIs become depended upon.


    Andy


## Example of a TDB dataset and text index published using Fuseki

@prefix :        <#> .
@prefix fuseki:  <http://jena.apache.org/fuseki#> .
@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix tdb:     <http://jena.hpl.hp.com/2008/tdb#> .
@prefix ja:      <http://jena.hpl.hp.com/2005/11/Assembler#> .
@prefix text:    <http://jena.apache.org/text#> .

[] rdf:type fuseki:Server ;
   fuseki:services (
     <#service_text_tdb>
   ) .

# TDB
[] ja:loadClass "com.hp.hpl.jena.tdb.TDB" .
tdb:DatasetTDB  rdfs:subClassOf  ja:RDFDataset .
tdb:GraphTDB    rdfs:subClassOf  ja:Model .

# Text
[] ja:loadClass "org.apache.jena.query.text.TextQuery" .
text:TextDataset      rdfs:subClassOf   ja:RDFDataset .
#text:TextIndexSolr    rdfs:subClassOf   text:TextIndex .
text:TextIndexLucene  rdfs:subClassOf   text:TextIndex .

## ---------------------------------------------------------------

<#service_text_tdb> rdf:type fuseki:Service ;
    rdfs:label                      "TDB/text service" ;
    fuseki:name                     "ds" ;
    fuseki:serviceQuery             "query" ;
    fuseki:serviceQuery             "sparql" ;
    fuseki:serviceUpdate            "update" ;
    fuseki:serviceUpload            "upload" ;
    fuseki:serviceReadGraphStore    "get" ;
    fuseki:serviceReadWriteGraphStore    "data" ;
    fuseki:dataset                  <#text_dataset> ;
    .

<#text_dataset> rdf:type     text:TextDataset ;
    text:dataset   <#dataset> ;
    ##text:index   <#indexSolr> ;
    text:index     <#indexLucene> ;
    .

<#dataset> rdf:type      tdb:DatasetTDB ;
    tdb:location "DB" ;
    tdb:unionDefaultGraph true ;
    .

<#indexSolr> a text:TextIndexSolr ;
    #text:server <http://localhost:8983/solr/COLLECTION> ;
    text:server <embedded:SolrARQ> ;
    text:entityMap <#entMap> ;
    .

<#indexLucene> a text:TextIndexLucene ;
    text:directory <file:Lucene> ;
    ##text:directory "mem" ;
    text:entityMap <#entMap> ;
    .

<#entMap> a text:EntityMap ;
    text:entityField      "uri" ;
    text:defaultField     "text" ; ## Must be defined in the text:map
    text:map (
         # rdfs:label
         [ text:field "text" ; text:predicate rdfs:label ]
         ) .
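
For trying the example query against a server started with a configuration like the one above, the following Python sketch may help. It assumes Fuseki is listening on localhost port 3030 and that the service is reachable as "ds" with the "query" endpoint, as named in the config; the function names are illustrative only, and only the standard library is used.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# The example query from this message.
QUERY = """
PREFIX text: <http://jena.apache.org/text#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT *
{ ?s text:query (rdfs:label 'word') ;
     rdfs:label ?label
}
"""

def build_request(base="http://localhost:3030", dataset="ds",
                  service="query", query=QUERY):
    """Build a SPARQL protocol GET request; host and port are assumptions."""
    url = f"{base}/{dataset}/{service}?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})

def run(request):
    """Send the request and return the result bindings (needs a running server)."""
    with urlopen(request) as response:
        return json.load(response)["results"]["bindings"]

if __name__ == "__main__":
    req = build_request()
    print(req.full_url)
    # Uncomment once a Fuseki server with the text dataset is running:
    # for row in run(req):
    #     print(row["s"]["value"], row["label"]["value"])
```

Building the request separately from sending it makes it easy to inspect the URL before a server is available.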
