Re: Jena Text Search Help

Andy Seaborne Thu, 08 Aug 2013 14:57:09 -0700

On 06/08/13 17:18, Brad Moran wrote:

Ok, since I already have the TDB built, it seems the best plan would be to
create an assembler file and then use the jena.textindexer application.
Sorry, these are the namespaces:


     xmlns:mms="http://rdf.cdisc.org/mms#"
     xmlns="http://rdf.cdisc.org/sdtm-1-2/std#"
     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
     xmlns:skos="http://www.w3.org/2004/02/skos/core#"
     xmlns:owl="http://www.w3.org/2002/07/owl#"
     xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
     xmlns:sdtms="http://rdf.cdisc.org/sdtm-1-2/schema#"
     xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
     xmlns:cts="http://rdf.cdisc.org/ct/schema#"
     xml:base="http://rdf.cdisc.org/sdtm-1-2/std">

I have no experience with assembler files, so I based mine on the example
in the documentation. Does this look right?


Yes.

Starting with the working example and tweaking it bit by bit (binary search!) until it is what you want is a good approach.


        Andy


@prefix :        <http://localhost/jena_example/#> .
@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix tdb:     <http://jena.hpl.hp.com/2008/tdb#> .
@prefix ja:      <http://jena.hpl.hp.com/2005/11/Assembler#> .
@prefix text:    <http://jena.apache.org/text#> .
@prefix mms:     <http://rdf.cdisc.org/mms#> .

## Example of a TDB dataset and text index

## Initialize TDB
[] ja:loadClass "com.hp.hpl.jena.tdb.TDB" .
tdb:DatasetTDB  rdfs:subClassOf  ja:RDFDataset .
tdb:GraphTDB    rdfs:subClassOf  ja:Model .

## Initialize text query
[] ja:loadClass       "org.apache.jena.query.text.TextQuery" .
# A TextDataset is a regular dataset with a text index.
text:TextDataset      rdfs:subClassOf   ja:RDFDataset .
# Lucene index
text:TextIndexLucene  rdfs:subClassOf   text:TextIndex .

## ---------------------------------------------------------------
## This URI must be fixed - it's used to assemble the text dataset.

:text_dataset rdf:type     text:TextDataset ;
     text:dataset   <#dataset> ;
     text:index     <#indexLucene> ;
     .

# A TDB dataset used for RDF storage
<#dataset> rdf:type      tdb:DatasetTDB ;
     tdb:location "tdb" ;
     tdb:unionDefaultGraph true ; # Optional
     .

# Text index description
<#indexLucene> a text:TextIndexLucene ;
     text:directory <file:luceneIndexes> ;
     text:entityMap <#entMap> ;
     .

# Mapping in the index
# URI stored in field "uri"
# mms:dataElementName and mms:dataElementDescription are mapped to field "text"
<#entMap> a text:EntityMap ;
     text:entityField      "uri" ;
     text:defaultField     "text" ;
     text:map (
          [ text:field "text" ; text:predicate mms:dataElementName ]
          [ text:field "text" ; text:predicate mms:dataElementDescription ]
          # the rest of the fields?
          ) .
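Once the index has been built, an assembler description like the one above can be loaded from Java by naming the fixed dataset URI. The following is a minimal sketch, not a definitive implementation. It assumes the Jena 2.10.x package names (com.hp.hpl.jena.*; Jena 3 and later moved these to org.apache.jena.*), and the file name "text-config.ttl" and the class name are hypothetical.

```java
import com.hp.hpl.jena.query.Dataset;
import com.hp.hpl.jena.query.DatasetFactory;
import com.hp.hpl.jena.query.QueryExecution;
import com.hp.hpl.jena.query.QueryExecutionFactory;
import com.hp.hpl.jena.query.ReadWrite;
import com.hp.hpl.jena.query.ResultSetFormatter;

public class AssembledTextSearch {
    // Hypothetical file name for the assembler description shown above.
    static final String ASSEMBLER_FILE = "text-config.ttl";

    // The ":text_dataset" resource, expanded with the ":" prefix of the assembler file.
    static final String DATASET_URI = "http://localhost/jena_example/#text_dataset";

    // Same query as in the original question.
    static final String QUERY =
          "PREFIX text: <http://jena.apache.org/text#> "
        + "PREFIX mms: <http://rdf.cdisc.org/mms#> "
        + "SELECT * WHERE { ?s text:query (mms:dataElementName 'AE') }";

    public static void main(String[] args) {
        // Builds the TDB dataset and the Lucene index wrapper as described in the file.
        Dataset dataset = DatasetFactory.assemble(ASSEMBLER_FILE, DATASET_URI);

        dataset.begin(ReadWrite.READ);
        try {
            QueryExecution qExec = QueryExecutionFactory.create(QUERY, dataset);
            try {
                ResultSetFormatter.out(qExec.execSelect());
            } finally {
                qExec.close();
            }
        } finally {
            dataset.end();
        }
    }
}
```

The relative paths "tdb" and "luceneIndexes" in the assembler file are resolved against the working directory, so the program should be started from the directory that contains them.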



On Tue, Aug 6, 2013 at 7:15 AM, Andy Seaborne wrote:

On 05/08/13 21:49, Brad Moran wrote:

I have an existing Jena TDB based on this example RDF:

  ...

I have compiled a Jena TDB from several of these RDF files, so it is a
large TDB, and I have several SPARQL queries that work as desired. I am now
trying to implement a full text search on this TDB. I have downloaded the
Jena 2.10.2 snapshot jars and figured out my dependencies. I would like to
implement this text search through Java code using the new Jena Text Search
feature. This is my best attempt at solving the problem so far:

   public class TextSearchTest {
      public static void main(String[] args)
      {
          try{
              String DBDirectory = "tdb";

              // Construct the Lucene Index to be queried

              String indexDir = "luceneIndexes";
              File file = new File(indexDir);
              Directory dir = FSDirectory.open(file);

              // Create the in memory text index described
              Dataset ds1 = TDBFactory.createDataset(DBDirectory);
              String uri = "<http://rdf.cdisc.org/mms#dataElement>";

              String property = "<http://rdf.cdisc.org/mms#dataElementName>";
              EntityDefinition entDef = new EntityDefinition(uri, property, RDFS.Literal);//RDFS.label


This defines the text index to be working on a particular property.

You want to pass in a resource (Resource or Property object) for
http://rdf.cdisc.org/mms#dataElementName here.




               // Construct the Lucene Index to be queried

              Dataset dataset = TextDatasetFactory.createLucene(ds1, dir, entDef);


I hope you loaded the data into this dataset, not the underlying TDB one,
because otherwise the text indexer would not have seen the RDF triples to
index.
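A minimal sketch of that point, for the case where data is loaded after the text dataset has been created: the triples are added through the wrapping text dataset, so the indexer sees them. This assumes the Jena 2.10.x package names and Lucene 4.x (FSDirectory.open taking a File); the class name, the method names and the input file argument are hypothetical.

```java
import java.io.File;
import java.io.IOException;

import com.hp.hpl.jena.query.Dataset;
import com.hp.hpl.jena.query.ReadWrite;
import com.hp.hpl.jena.rdf.model.ResourceFactory;
import com.hp.hpl.jena.tdb.TDBFactory;
import org.apache.jena.query.text.EntityDefinition;
import org.apache.jena.query.text.TextDatasetFactory;
import org.apache.jena.riot.RDFDataMgr;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class LoadThroughTextDataset {

    // Wrap a base dataset so that additions are passed to the Lucene index as well.
    static Dataset wrap(Dataset base, Directory dir) {
        EntityDefinition entDef = new EntityDefinition(
            "uri", "text",
            ResourceFactory.createProperty("http://rdf.cdisc.org/mms#dataElementName"));
        return TextDatasetFactory.createLucene(base, dir, entDef);
    }

    // Load a file into the default graph of the text dataset, not the base dataset.
    static void load(Dataset textDataset, String rdfFile) {
        textDataset.begin(ReadWrite.WRITE);
        try {
            RDFDataMgr.read(textDataset.getDefaultModel(), rdfFile);
            textDataset.commit();
        } finally {
            textDataset.end();
        }
    }

    public static void main(String[] args) throws IOException {
        Dataset base = TDBFactory.createDataset("tdb");
        Directory dir = FSDirectory.open(new File("luceneIndexes"));
        Dataset textDataset = wrap(base, dir);
        load(textDataset, args[0]); // args[0]: path of an RDF file to load
    }
}
```

Triples written directly to the base dataset bypass the wrapper and are not indexed by this code.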

              // try query
              dataset.begin(ReadWrite.READ);
                  QueryExecution qExec = QueryExecutionFactory.create(
                          "PREFIX text: <http://jena.apache.org/text#> PREFIX mms: <http://rdf.cdisc.org/mms#> "
                          + "SELECT * WHERE{?s text:query (mms:dataElementName 'AE')}", dataset);

                  ResultSet rs = qExec.execSelect();
                  ResultSetFormatter.out(rs);

              dataset.end();
          }
          catch(Exception e){
              System.out.println(e);
          }
      }
}


This results in: WARN  o.apache.jena.query.text.TextQueryPF - Predicate not
indexed: http://rdf.cdisc.org/mms#dataElementName


Because that field isn't being indexed.

You can have several fields indexed if you call .set on the EntityDefinition
with additional predicates.
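A minimal sketch of both fixes, assuming the Jena 2.10.x package names: the first two constructor arguments are Lucene field names (as in the documentation's "uri" and "text"), and the third is the predicate to index, given as a Property. The class name and the second field name "description" are my assumptions, not something from the original code.

```java
import com.hp.hpl.jena.rdf.model.Property;
import com.hp.hpl.jena.rdf.model.ResourceFactory;
import org.apache.jena.query.text.EntityDefinition;

public class EntityDefinitionSketch {
    static final String MMS = "http://rdf.cdisc.org/mms#";

    // Predicates from the data; note there are no angle brackets in the URI strings.
    static final Property NAME =
        ResourceFactory.createProperty(MMS + "dataElementName");
    static final Property DESCRIPTION =
        ResourceFactory.createProperty(MMS + "dataElementDescription");

    static EntityDefinition build() {
        // entityField = "uri": Lucene field holding the subject URI.
        // primaryField = "text": Lucene field searched by default.
        // primaryPredicate = NAME: the property whose literals go into "text".
        EntityDefinition entDef = new EntityDefinition("uri", "text", NAME);

        // Index a further predicate under its own (assumed) field name.
        entDef.set("description", DESCRIPTION.asNode());
        return entDef;
    }

    public static void main(String[] args) {
        EntityDefinition entDef = build();
        // Shows which Lucene field each predicate is mapped to.
        System.out.println(NAME + " -> " + entDef.getField(NAME.asNode()));
        System.out.println(DESCRIPTION + " -> " + entDef.getField(DESCRIPTION.asNode()));
    }
}
```

With a definition like this, mms:dataElementName is a known predicate for text:query, which is what the "Predicate not indexed" warning was complaining about.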


and an empty result set is printed out by ResultSetFormatter. It does not
seem to create an index for the TDB.

I believe my problem occurs with my EntityDefinition (mainly because I am
not sure where the parameters entityField, primaryField, and
primaryPredicate should come from). Also, in the example code it seems a
Lucene index is created and then the data is loaded by an assembler file.
Maybe I am just implementing this wrong. So to try to wrap this up:

1. Do I need to use an assembler file?


No, but it may be easier that way.


2. Can I create an index from an existing TDB, or do I need to create the
index as I create the TDB?


As the data is loaded.

There is a simple application 'jena.textindexer' which will create the
index from existing data.

http://jena.staging.apache.org/documentation/query/text-query.html#building-a-text-index


3. Could you give me a description of the parameters of the EntityDefinition
class and where they come from? (In the RDF, maybe?)


Create a Property object for http://rdf.cdisc.org/mms#dataElementName and
pass that in as the 3rd argument.


4. Any general advice on how I can solve this problem from my code?


I tried to be as specific as possible here in hopes that you may be able to
guide me in the right direction. If I left anything out just let me know and
hopefully I can explain better. Thanks.


Minor in this case, but the data is incomplete RDF/XML (no namespaces), so
I didn't try using it.

Our mantra is "complete, minimal example".  Both "complete" and "minimal"
make it much, much easier to give good answers.

--Brad

         Andy
