Jena Text Search Help

Brad Moran Mon, 05 Aug 2013 13:51:44 -0700

I have an existing Jena TDB based on this example RDF:

   <mms:DataElement rdf:ID="DE.Intervention.--MODIFY">
    <sdtms:dataElementRole rdf:resource="http://rdf.cdisc.org/sdtm-1-2/schema#Classifier.SynonymQualifier"/>
    <sdtms:supportedBySEND rdf:datatype="http://www.w3.org/2001/XMLSchema#boolean"
    >true</sdtms:supportedBySEND>
    <mms:ordinal rdf:datatype="http://www.w3.org/2001/XMLSchema#positiveInteger"
    >2</mms:ordinal>
    <mms:dataElementDescription rdf:datatype="http://www.w3.org/2001/XMLSchema#string"
    >If the value for --TRT is modified for coding purposes, then the
modified text is placed here.</mms:dataElementDescription>
    <mms:dataElementName rdf:datatype="http://www.w3.org/2001/XMLSchema#string"
    >--MODIFY</mms:dataElementName>
    <sdtms:dataElementType rdf:resource="http://rdf.cdisc.org/sdtm-1-2/schema#Classifier.Character"/>
    <sdtms:supportedBySDTMIG rdf:datatype="http://www.w3.org/2001/XMLSchema#boolean"
    >true</sdtms:supportedBySDTMIG>
    <mms:dataElementLabel rdf:datatype="http://www.w3.org/2001/XMLSchema#string"
    >Modified Treatment Name</mms:dataElementLabel>
    <mms:dataElementType rdf:datatype="http://www.w3.org/2001/XMLSchema#QName"
    >xsd:string</mms:dataElementType>
    <mms:context>
      <mms:VariableGrouping rdf:ID="InterventionVariables">
        <mms:contextLabel rdf:datatype="http://www.w3.org/2001/XMLSchema#string"
        >Interventions Observation Class Variables</mms:contextLabel>
        <mms:ordinal rdf:datatype="http://www.w3.org/2001/XMLSchema#positiveInteger"
        >1</mms:ordinal>
        <mms:context rdf:resource="#Model.SDTM-1-2"/>
      </mms:VariableGrouping>
    </mms:context>
    <sdtms:qualifies>
      <mms:DataElement rdf:ID="DE.Intervention.--TRT">
        <mms:dataElementName rdf:datatype="http://www.w3.org/2001/XMLSchema#string"
        >--TRT</mms:dataElementName>
        <sdtms:dataElementRole rdf:resource="http://rdf.cdisc.org/sdtm-1-2/schema#Classifier.TopicVariable"/>
        <sdtms:dataElementType rdf:resource="http://rdf.cdisc.org/sdtm-1-2/schema#Classifier.Character"/>
        <mms:dataElementDescription rdf:datatype="http://www.w3.org/2001/XMLSchema#string"
        >The topic for the intervention observation, usually the verbatim
name of the treatment, drug, medicine, or therapy given during the dosing
interval          for the observation.</mms:dataElementDescription>
        <mms:dataElementLabel rdf:datatype="http://www.w3.org/2001/XMLSchema#string"
        >Name of Treatment</mms:dataElementLabel>
        <sdtms:supportedBySDTMIG rdf:datatype="http://www.w3.org/2001/XMLSchema#boolean"
        >true</sdtms:supportedBySDTMIG>
        <mms:context rdf:resource="#InterventionVariables"/>
        <mms:dataElementType rdf:datatype="http://www.w3.org/2001/XMLSchema#QName"
        >xsd:string</mms:dataElementType>
        <sdtms:supportedBySEND rdf:datatype="http://www.w3.org/2001/XMLSchema#boolean"
        >true</sdtms:supportedBySEND>
        <mms:ordinal rdf:datatype="http://www.w3.org/2001/XMLSchema#positiveInteger"
        >1</mms:ordinal>
      </mms:DataElement>
    </sdtms:qualifies>
   </mms:DataElement>



This is one of two forms of RDF in the TDB. The second is:

   <mms:PermissibleValue rdf:ID="C81224.C81203">
    <mms:inValueDomain>
      <mms:EnumeratedValueDomain rdf:ID="C81224">
        <cts:cdiscDefinition rdf:datatype="http://www.w3.org/2001/XMLSchema#string"
        >Derivation Type: Analysis value derivation
method.</cts:cdiscDefinition>
        <cts:nciPreferredTerm rdf:datatype="http://www.w3.org/2001/XMLSchema#string"
        >CDISC ADaM Derivation Type Terminology</cts:nciPreferredTerm>
        <cts:nciCode rdf:datatype="http://www.w3.org/2001/XMLSchema#string"
        >C81224</cts:nciCode>
        <cts:cdiscSynonyms rdf:datatype="http://www.w3.org/2001/XMLSchema#string"
        >Derivation Type</cts:cdiscSynonyms>
        <cts:cdiscSubmissionValue rdf:datatype="http://www.w3.org/2001/XMLSchema#string"
        >DTYPE</cts:cdiscSubmissionValue>
        <cts:codelistName rdf:datatype="http://www.w3.org/2001/XMLSchema#string"
        >Derivation Type</cts:codelistName>
        <cts:isExtensibleCodelist rdf:datatype="http://www.w3.org/2001/XMLSchema#boolean"
        >true</cts:isExtensibleCodelist>
      </mms:EnumeratedValueDomain>
    </mms:inValueDomain>
    <cts:nciPreferredTerm rdf:datatype="http://www.w3.org/2001/XMLSchema#string"
    >Worst Case Imputation Technique</cts:nciPreferredTerm>
    <cts:nciCode rdf:datatype="http://www.w3.org/2001/XMLSchema#string"
    >C81203</cts:nciCode>
    <cts:cdiscDefinition rdf:datatype="http://www.w3.org/2001/XMLSchema#string"
    >Worst Case: A data imputation technique which populates missing values
with the worst possible outcome.</cts:cdiscDefinition>
    <cts:cdiscSubmissionValue rdf:datatype="http://www.w3.org/2001/XMLSchema#string"
    >WC</cts:cdiscSubmissionValue>
  </mms:PermissibleValue>


I have compiled a Jena TDB from several of these RDF files, so it is a
large TDB, and I have several SPARQL queries that work as desired. I am now
trying to implement a full text search on this TDB. I have downloaded the
Jena 2.10.2 Snapshot jars and figured out my dependencies. I would like to
implement this text search in Java code using the new Jena Text Search
feature. This is my best attempt at solving the problem so far:

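 // Imports assume the Jena 2.10.x package layout (com.hp.hpl.jena.*)
 // and Lucene 4.x.
 import java.io.File;

 import org.apache.jena.query.text.EntityDefinition;
 import org.apache.jena.query.text.TextDatasetFactory;
 import org.apache.lucene.store.Directory;
 import org.apache.lucene.store.FSDirectory;

 import com.hp.hpl.jena.query.Dataset;
 import com.hp.hpl.jena.query.QueryExecution;
 import com.hp.hpl.jena.query.QueryExecutionFactory;
 import com.hp.hpl.jena.query.ReadWrite;
 import com.hp.hpl.jena.query.ResultSet;
 import com.hp.hpl.jena.query.ResultSetFormatter;
 import com.hp.hpl.jena.tdb.TDBFactory;
 import com.hp.hpl.jena.vocabulary.RDFS;

 public class TextSearchTest {
    public static void main(String[] args)
    {
        try{
            String DBDirectory = "tdb";

            // Open the on-disk Lucene directory that should hold the index
            String indexDir = "luceneIndexes";
            File file = new File(indexDir);
            Directory dir = FSDirectory.open(file);

            // Open the existing TDB dataset
            Dataset ds1 = TDBFactory.createDataset(DBDirectory);

            // Entity definition (this is the part I am unsure about)
            String uri = "<http://rdf.cdisc.org/mms#dataElement>";
            String property = "<http://rdf.cdisc.org/mms#dataElementName>";
            EntityDefinition entDef = new EntityDefinition(uri, property,
                    RDFS.Literal); //RDFS.label

            // Wrap the TDB dataset with the Lucene text index
            Dataset dataset = TextDatasetFactory.createLucene(ds1, dir, entDef);

            // try query
            dataset.begin(ReadWrite.READ);
                QueryExecution qExec = QueryExecutionFactory.create(
                        "PREFIX text: <http://jena.apache.org/text#> "
                        + "PREFIX mms: <http://rdf.cdisc.org/mms#> "
                        + "SELECT * WHERE{?s text:query (mms:dataElementName 'AE')}",
                        dataset);

                ResultSet rs = qExec.execSelect();
                ResultSetFormatter.out(rs);

            dataset.end();
        }
        catch(Exception e){
            System.out.println(e);
        }
    }
}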


This results in:

WARN  o.apache.jena.query.text.TextQueryPF - Predicate not indexed: http://rdf.cdisc.org/mms#dataElementName

and an empty result set is printed by ResultSetFormatter. It does not
seem to create an index for the TDB. I believe the problem is in my
EntityDefinition (mainly because I am not sure where the parameters
entityField, primaryField, and primaryPredicate should come from). Also, in
the example code it seems a Lucene index is created and then the data is
loaded through an assembler file. Maybe I am just implementing this wrong.
So, to try to wrap this up:

1. Do I need to use an assembler file?
2. Can I create an index from an existing TDB, or do I need to create the
index as I create the TDB?
3. Could you give me a description of the parameters of the EntityDefinition
class and where they come from? (In the RDF, maybe?)
4. Any general advice on how I can solve this problem from my code?

I tried to be as specific as possible here in hopes that you may be able to
guide me in the right direction. If I left anything out, just let me know and
hopefully I can explain better. Thanks.

--Brad
