Re: Jena Text Search Help

Brad Moran Tue, 06 Aug 2013 09:21:10 -0700

Ok, since I already have the TDB built, it seems the best plan would be to
create an assembler file and then use the jena.textindexer application.
Sorry, these are the namespaces:


    xmlns:mms="http://rdf.cdisc.org/mms#"
    xmlns="http://rdf.cdisc.org/sdtm-1-2/std#"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:skos="http://www.w3.org/2004/02/skos/core#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
    xmlns:sdtms="http://rdf.cdisc.org/sdtm-1-2/schema#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns:cts="http://rdf.cdisc.org/ct/schema#"
    xml:base="http://rdf.cdisc.org/sdtm-1-2/std">

I have no experience with assembler files, so I based mine on the example
in the documentation. Does this look right?

@prefix :        <http://localhost/jena_example/#> .
@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix tdb:     <http://jena.hpl.hp.com/2008/tdb#> .
@prefix ja:      <http://jena.hpl.hp.com/2005/11/Assembler#> .
@prefix text:    <http://jena.apache.org/text#> .
@prefix mms:     <http://rdf.cdisc.org/mms#> .

## Example of a TDB dataset and text index
## Initialize TDB
[] ja:loadClass "com.hp.hpl.jena.tdb.TDB" .
tdb:DatasetTDB  rdfs:subClassOf  ja:RDFDataset .
tdb:GraphTDB    rdfs:subClassOf  ja:Model .

## Initialize text query
[] ja:loadClass       "org.apache.jena.query.text.TextQuery" .
# A TextDataset is a regular dataset with a text index.
text:TextDataset      rdfs:subClassOf   ja:RDFDataset .
# Lucene index
text:TextIndexLucene  rdfs:subClassOf   text:TextIndex .

## ---------------------------------------------------------------
## This URI must be fixed - it's used to assemble the text dataset.

:text_dataset rdf:type     text:TextDataset ;
    text:dataset   <#dataset> ;
    text:index     <#indexLucene> ;
    .

# A TDB dataset used for RDF storage
<#dataset> rdf:type      tdb:DatasetTDB ;
    tdb:location "tdb" ;
    tdb:unionDefaultGraph true ; # Optional
    .

# Text index description
<#indexLucene> a text:TextIndexLucene ;
    text:directory <file:luceneIndexes> ;
    text:entityMap <#entMap> ;
    .

# Mapping in the index
# URI stored in field "uri"
# rdfs:label is mapped to field "text"
<#entMap> a text:EntityMap ;
    text:entityField      "uri" ;
    text:defaultField     "text" ;
    text:map (
         [ text:field "text" ; text:predicate mms:dataElementName ]
         [ text:field "text" ; text:predicate mms:dataElementDescription ]
         # the rest of the fields?
         ) .
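
As a rough sketch of the second step (running jena.textindexer against the assembler file), the indexer could be launched from Java as a child process. The class name TextIndexerLauncher, the file name text-config.ttl and the lib/* classpath are all hypothetical placeholders; the --desc= option is taken from the Jena text query documentation as I understand it, so check it against the version in use.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of Jena): launches the jena.textindexer
// command-line tool against an assembler file describing the TDB dataset
// and its Lucene index.
class TextIndexerLauncher {

    // Builds: java -cp <classpath> jena.textindexer --desc=<assemblerFile>
    static List<String> buildCommand(String classpath, String assemblerFile) {
        return Arrays.asList(
                "java", "-cp", classpath,
                "jena.textindexer",
                "--desc=" + assemblerFile);
    }

    // Runs the indexer as a child process, sharing this process's
    // stdin/stdout/stderr, and returns the child's exit code.
    static int run(String classpath, String assemblerFile)
            throws IOException, InterruptedException {
        Process p = new ProcessBuilder(buildCommand(classpath, assemblerFile))
                .inheritIO()
                .start();
        return p.waitFor();
    }

    public static void main(String[] args) {
        // Assumed locations; adjust to the real jar directory and file name.
        List<String> cmd = buildCommand("lib/*", "text-config.ttl");
        // Only prints the command; call run(...) to actually start indexing.
        System.out.println(String.join(" ", cmd));
    }
}
```

Running the same command directly from a shell is equivalent; the Java wrapper is only useful if the indexing step has to be triggered from existing code.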



On Tue, Aug 6, 2013 at 7:15 AM, Andy Seaborne <[email protected]> wrote:

> On 05/08/13 21:49, Brad Moran wrote:
>
>> I have an existing Jena TDB based on this example RDF:
>>
>>  ...
>
>
>> I have compiled a Jena TDB based on several of these RDF files so it is a
>> large TDB and have several SPARQL queries that work as desired. I am now
>> trying to implement a full text search on this TDB. I have downloaded the
>> Jena 2.10.2 Snapshot jars and figured out my dependencies. I would like to
>> implement this text search through java code using the new Jena Text
>> Search
>> feature. This is my best attempt at solving the problem so far:
>>
>>   public class TextSearchTest {
>>      public static void main(String[] args)
>>      {
>>          try{
>>              String DBDirectory = "tdb";
>>
>>              // Construct the Lucene Index to be queried
>>
>>              String indexDir = "luceneIndexes";
>>              File file = new File(indexDir);
>>              Directory dir = FSDirectory.open(file);
>>
>>              // Create the in memory text index described
>>              Dataset ds1 = TDBFactory.createDataset(DBDirectory);
>>              String uri = "<http://rdf.cdisc.org/mms#dataElement>";
>>              String property = "<http://rdf.cdisc.org/mms#dataElementName>";
>>              EntityDefinition entDef = new EntityDefinition(uri, property,
>> RDFS.Literal);//RDFS.label
>>
>
> This defines the text index to be working on a particular property.
>
> You want to pass in a resource (Resource or Property object) for
> http://rdf.cdisc.org/mms#dataElementName here.
>
>
>
>
>>              // Construct the Lucene Index to be queried
>>              Dataset dataset = TextDatasetFactory.createLucene(ds1, dir,
>> entDef);
>>
>
> I hope you loaded the data into this dataset, not the underlying TDB one,
> because otherwise the text indexer would not have seen the RDF triples to
> index.
>
>
>
>>              // try query
>>              dataset.begin(ReadWrite.READ);
>>                  QueryExecution qExec = QueryExecutionFactory.create(
>>                          "PREFIX text: <http://jena.apache.org/text#>
>> PREFIX
>> mms: <http://rdf.cdisc.org/mms#> "
>>                          + "SELECT * WHERE{?s text:query
>> (mms:dataElementName 'AE')}", dataset);
>>
>>                  ResultSet rs = qExec.execSelect();
>>                  ResultSetFormatter.out(rs);
>>
>>              dataset.end();
>>          }
>>          catch(Exception e){
>>              System.out.println(e);
>>          }
>>      }
>> }
>>
>>
>> This results in: WARN  o.apache.jena.query.text.TextQueryPF -
>> Predicate not indexed: http://rdf.cdisc.org/mms#dataElementName
>>
>
> Because that field isn't being indexed.
>
> You can have several fields indexed if you .set the EntityDefinition with
> additional predicates.
>
>
>> and an empty result set is printed out by ResultSetFormatter. It does not
>> seem to create an index for the TDB.
>>
>> I believe my problem occurs with my
>> EntityDefinition (mainly because I am not sure where the parameters
>> entityField, primaryField, and primaryPredicate should come from). Also in
>> the example code it seems a lucene index is created then the data is
>> loaded
>> by an assembler file. Maybe I am just implementing this wrong. So to try
>> to
>> wrap this up:
>>
>> 1. Do I need to use an assembler file?
>>
>
> No, but it may be easier that way.
>
>
>> 2. Can I create an index from an existing TDB or do I need to create the
>> index as I create the TDB?
>>
>
> As the data is loaded.
>
> There is a simple application 'jena.textindexer' which will create the
> index from existing data.
>
> http://jena.staging.apache.org/documentation/query/text-query.html#building-a-text-index
>
>
>> 3. Could you give me a description of the parameters of EntityDefinition
>> class and where they come from? (in the rdf maybe?)
>>
>
> Create a Property object for http://rdf.cdisc.org/mms#dataElementName and
> pass that in as the 3rd argument.
>
>
>> 4. Any general advice on how I can solve this problem from my code.
>>
>> I tried to be as specific as possible here in hopes that you may be able
>> to
>> guide me in the right direction. If I left anything out just let me know
>> and
>> hopefully I can explain better. Thanks.
>>
>
> Minor in this case, but the data is incomplete RDF/XML (no namespaces), so
> I didn't try using it.
>
> Our mantra is "complete, minimal example".  Both "complete" and "minimal"
> make it much, much easier to give good answers.
>
>
>> --Brad
>>
>>
>         Andy
>
