Re: Configuring Jena TDB for a benchmark

Saud Aljaloud Sun, 20 Apr 2014 09:14:24 -0700

Thanks Paul and Andy,

> Why are you suggesting non-public?
We are benchmarking several triple stores and asking each project privately 
for the best configuration of its own store. We therefore want to limit how 
much of the core of our work (for example the queries, or statistics about 
other stores) is publicly available until we publish everything at once. 
That said, we can discuss the general setup of Jena here.



> The benchmark driver in their SVN repository is somewhat ahead
> of the last formal release
Thanks for pointing this out.


> There isn't much on matching parts of strings in BSBM.
BSBM has a number of predicates whose literal values are reasonably long; see 
the predicates in the assembler below, or the dataset specification:
http://wifo5-03.informatik.uni-mannheim.de/bizer/berlinsparqlbenchmark/spec/Dataset/index.html


Here are the steps we follow to run a test on 1 million triples:

=======================
Jena Configurations:
1- edit fuseki-server:
JVM_ARGS=${JVM_ARGS:--Xmx20G} 


2- create an assembler file for Jena Text with Lucene, "BSBM-fulltext-1.ttl":


## Example of a TDB dataset and text index published using Fuseki
@prefix :        <http://localhost/jena_example/#> .
@prefix fuseki:  <http://jena.apache.org/fuseki#> .
@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix tdb:     <http://jena.hpl.hp.com/2008/tdb#> .
@prefix ja:      <http://jena.hpl.hp.com/2005/11/Assembler#> .
@prefix text:    <http://jena.apache.org/text#> .
@prefix rev: <http://purl.org/stuff/rev#> .
@prefix bsbm: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/> .
@prefix bsbm-inst: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

[] rdf:type fuseki:Server ;
   # Timeout - server-wide default: milliseconds.
   # Format 1: "1000" -- 1 second timeout
   # Format 2: "10000,60000" -- 10s timeout to first result, then 60s timeout
   #           for the rest of the query.
   # See java doc for ARQ.queryTimeout
   # ja:context [ ja:cxtName "arq:queryTimeout" ;  ja:cxtValue "10000" ] ;
   # ja:loadClass "your.code.Class" ;

   fuseki:services (
     <#service_text_tdb>
   ) .

# TDB
[] ja:loadClass "com.hp.hpl.jena.tdb.TDB" .
tdb:DatasetTDB  rdfs:subClassOf  ja:RDFDataset .
tdb:GraphTDB    rdfs:subClassOf  ja:Model .

# Text
[] ja:loadClass "org.apache.jena.query.text.TextQuery" .
text:TextDataset      rdfs:subClassOf   ja:RDFDataset .
#text:TextIndexSolr    rdfs:subClassOf   text:TextIndex .
text:TextIndexLucene  rdfs:subClassOf   text:TextIndex .

## ---------------------------------------------------------------

<#service_text_tdb> rdf:type fuseki:Service ;
    rdfs:label                      "TDB/text service" ;
    fuseki:name                     "BSBM1M" ;
    fuseki:serviceQuery             "query" ;
    fuseki:serviceQuery             "sparql" ;
    fuseki:serviceUpdate            "update" ;
    fuseki:serviceUpload            "upload" ;
    fuseki:serviceReadGraphStore    "get" ;
    fuseki:serviceReadWriteGraphStore    "data" ;
    fuseki:dataset                  :text_dataset ;
    .

:text_dataset rdf:type     text:TextDataset ;
    text:dataset   <#dataset> ;
    ##text:index   <#indexSolr> ;
    text:index     <#indexLucene> ;
    .

<#dataset> rdf:type      tdb:DatasetTDB ;
    tdb:location "/home/path/apache-jena-2.11.1/data" ;
    #tdb:unionDefaultGraph true ;
    .

<#indexSolr> a text:TextIndexSolr ;
    #text:server <http://localhost:8983/solr/COLLECTION> ;
    text:server <embedded:SolrARQ> ;
    text:entityMap <#entMap> ;
    .

<#indexLucene> a text:TextIndexLucene ;
    text:directory <file:/home/path/apache-jena-2.11.1/lucene> ;
    ##text:directory "mem" ;
    text:entityMap <#entMap> ;
    .

<#entMap> a text:EntityMap ;
    text:entityField      "uri" ;
    text:defaultField     "text" ;        ## Should be defined in the text:map.
    text:map (
         # rdfs:label            
                 [ text:field "text" ; text:predicate rdfs:label ]
                 [ text:field "text" ; text:predicate rdfs:comment ]
                 [ text:field "text" ; text:predicate foaf:name ]
                 [ text:field "text" ; text:predicate  dc:title ]
                 [ text:field "text" ; text:predicate  rev:text ]
                
         ) .
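
As an illustration of how this entity map is used at query time, here is a 
minimal sketch of a full-text lookup sent to the service once Fuseki is 
running with this file. The endpoint is the one used in the test-driver 
command in this message; the function names and the search term 'word' are 
hypothetical and only there for illustration. The query uses the jena-text 
property function text:query with an indexed predicate, a Lucene query 
string and a hit limit.

```shell
#!/bin/sh
# Endpoint as used in the test-driver step; override via the environment if needed.
ENDPOINT="${ENDPOINT:-http://localhost:3030/BSBM1M/sparql}"

# Build a SPARQL query that goes through the text index.
#   $1 = predicate listed in text:map, $2 = Lucene query string, $3 = max hits
# (hypothetical helper; $2 must not contain a single quote)
build_text_query() {
  cat <<EOF
PREFIX text: <http://jena.apache.org/text#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?s ?label
WHERE {
  ?s text:query ($1 '$2' $3) ;
     rdfs:label ?label .
}
LIMIT $3
EOF
}

# Send the query with an HTTP POST (form-encoded "query" parameter).
run_text_query() {
  curl --silent --fail \
       --header 'Accept: application/sparql-results+json' \
       --data-urlencode "query=$(build_text_query "$@")" \
       "$ENDPOINT"
}

# Example, with the server running:
#   run_text_query rdfs:label word 10
```

Because all five predicates in text:map share the single field "text", a hit 
returned for rdfs:label may in fact come from any of the mapped predicates.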






=======================
Jena test procedure with statistics for BSBM1M (one million triples), using a 
machine with specs [1,2]:
1- load data:
./tdbloader2 --loc ../data/ ~/bsbmtools-0.2/dataset_1M.ttl
        15:25:24 -- 35 seconds
        Size: 137M      .



2- build jena text index:
java -cp fuseki-server.jar jena.textindexer --desc=BSBM-fulltext-1.ttl
        INFO 31123 (3112 per second) properties indexed (3112 per second overall)
        INFO 72657 (5589 per second) properties indexed 
        Size: 17M       .

3- Flush OS memory and swap.

4- Run Server:
./fuseki-server --config=BSBM-fulltext-1.ttl


5- Run test using BSBM driver:
./testdriver -ucf usecases/literalSearch/fulltext/jena.txt -w 1000 \
    -o Jena_1Client_BSBM1M.xml http://localhost:3030/BSBM1M/sparql
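
Step 3 above does not list the commands used. One common way to do this on 
Linux, shown here only as a sketch and not necessarily what produced the 
numbers in this message, is to sync, drop the page cache and then cycle 
swap. The function names and the DRY_RUN switch are illustrative 
assumptions; the real commands need root.

```shell
#!/bin/sh
# Sketch: clear the Linux page cache and swap between benchmark runs.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

flush_os_caches() {
  run sync &&                          # write dirty pages to disk first
  run sysctl -w vm.drop_caches=3 &&    # drop page cache, dentries and inodes
  run swapoff -a &&                    # pull swapped-out pages back into RAM
  run swapon -a                        # re-enable swap, now empty
}

# Example:
#   DRY_RUN=1 flush_os_caches    # only prints the four commands
#   flush_os_caches              # as root, before starting the server
```

Running it before starting the server means each store starts from a cold 
cache, at the cost of needing enough free RAM for swapoff to succeed.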


=======================



Any comments would be appreciated.


Many thanks,
Saud


[1] 2x AMD Opteron 4280 Processor (2.8GHz, 8C, 8M L2/8M L3 Cache, 95W), 
DDR3-1600 MHz 128GB Memory for 2CPU (8x16GB Quad Rank LV RDIMMs) 1066MHz 2x 
300GB, SAS 6Gbps, 3.5-in, 15K RPM Hard Drive (Hot-plug) SAS 6/iR Controller, 
For Hot Plug HDD Chassis No Optical Drive Redundant Power Supply (2 PSU) 500W 
2M Rack Power Cord C13/C14 12A iDRAC6 Enterprise Sliding Ready Rack Rails C11 
Hot-Swap - R0 for SAS 6iR, Min. 2 Max. 4 SAS/SATA Hot Plug Drives
[2] Red Hat Enterprise Linux Server release 6.4 (Santiago), java version 
"1.7.0_51"




On 20 Apr 2014, at 11:49, Andy Seaborne wrote:

> On 19/04/14 18:33, Saud Aljaloud wrote:
>> Dear Jena folks,
>> 
>> We are investigating how efficient different triple stores, including
>> Jena TDB, handle literal strings within SPARQL. To this end, We are
>> now working on benchmarking these triple stores against a set of
>> specific queries, using the Berlin Benchmark (BSBM) test driver [1],
>> dataset and matrices[2].
>> 
> 
> BSBM measures a certain kind of workload (actually, 2 kinds, the explore
> and BI).  The benchmark driver in their SVN repository is somewhat ahead
> of the last formal release.  You are actually benchmarking TDB+Fuseki,
> not TDB in isolation , because the work load has a significant
> proportion of network communication.
> 
> There isn't much on matching parts of strings in BSBM.
> 
> As Paul observes, a text index can make a big difference.
> 
>> We are using the latest Jena releases: Jena VERSION: 2.11.1,  Fuseki:
>> VERSION: 1.0.1.
>> 
>> To get the best out of Jena, we would like to ask your valuable
>> feedback and other optimisations that can boost the performance of
>> Jena. I should provide more info, but non-public communication with
>> someone/group from Jena who are willing to be directly contacted by
>> email is preferable.
> 
> Jena is an open source project and works in public.  I don't work offlist 
> unless there is a specific (usually, commercial) reason.
> 
> We can discuss TDB here. Not being a product, there is no reason not to 
> discuss both good and bad features here with the developers.  Why are you 
> suggesting non-public?
> 
> There are other benchmark frameworks: eg.
> http://www.slideshare.net/RobVesse/practical-sparql-benchmarking
> which may be easier to use for a new set of queries and data.
> 
>       Andy
> 
>> Configurations are going to be publicly
>> available later within the benchmark.
>> 
>> 
>> Kind Regards,
>> 
>> Saud
>> 
>> 
>> [1] http://sourceforge.net/projects/bsbmtools/ [2]
>> http://wifo5-03.informatik.uni-mannheim.de/bizer/berlinsparqlbenchmark/
>> 
>> 
>> 
>> 
>
