Re: Configuring Jena TDB for a benchmark

Saud Al-Jaloud Mon, 21 Apr 2014 04:24:28 -0700

> Just use current releases.

We are using current releases. We are not looking to tune the systems but
rather to find the right configuration for each one, as this is something of an
extension. Otherwise, some might argue that we were unfair, or that we missed
features for some stores but not others: for example, the buffer size or the
way the full-text index is built. To some extent, we are trying to follow the
same rules as BSBM; see
http://wifo5-03.informatik.uni-mannheim.de/bizer/berlinsparqlbenchmark/spec/BenchmarkRules/


> No relationship to http://www.ldbc.eu/?
Nope, this is part of my PhD, which is mainly about optimising regex within 
SPARQL, but I also look at full-text search here.

> As it's fulltext, you have to use the custom features of each system so 
> comparing like-with-like is going to be hard.
Indeed, there are a couple of challenges; that’s why I see it as an interesting
area.

> ((That's really what stopped it getting standardized in SPARQL 1.1 - it's a 
> large piece of work (see xpath-full-text) and so it was this or most of the 
> other features.))
I’ve also read your post, 
http://mail-archives.apache.org/mod_mbox/jena-users/201306.mbox/%[email protected]%3E
Thanks for sharing that information.
But don’t you think that even a common syntax could make a huge difference,
regardless of how stores implement it internally?


> 1 million?  Isn't that rather small?  The whole thing will fit in memory.  1m 
> BSBM fits completely in 1G RAM and TDB isn't very space efficient as it 
> trades it for direct memory mapping.
That was just a test for the purpose of these emails, to make sure we are doing
things right. The actual test will target 200M triples, maybe more.

> 3G should be enough.  20G will slow it down (a small amount given your 
> hardware) as much of TDB's memory usage is outside the Java heap.
I’ll take this for Jena.
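
As a minimal sketch of what that change amounts to: the line in the
fuseki-server script uses the shell's `:-` default expansion, so the value
applies only when JVM_ARGS is unset or empty, and a value already set in the
environment still wins. The 3G figure is the one suggested in the quote above;
the resolve_jvm_args helper and the 8G override are hypothetical, for
illustration only.

```shell
#!/bin/sh
# Hypothetical helper (not part of Fuseki): mirrors the default expansion
# used in the fuseki-server script, with 3G in place of 20G.
resolve_jvm_args() {
    # ${VAR:-default} yields the default only if VAR is unset or empty.
    printf '%s\n' "${JVM_ARGS:--Xmx3G}"
}

# With nothing set, the 3G default applies.
unset JVM_ARGS
default_args=$(resolve_jvm_args)

# A value set beforehand overrides the default (8G is an arbitrary example).
JVM_ARGS="-Xmx8G"
override_args=$(resolve_jvm_args)
unset JVM_ARGS

echo "default:  $default_args"
echo "override: $override_args"
```

So a per-store run could keep the script untouched and set JVM_ARGS on the
command line instead, if that is easier to document.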


> 20G will slow it down
Generally, I thought the maximum heap size wouldn’t affect the speed, as long
as usage doesn’t reach that maximum. A larger maximum should reduce the number
of GCs being performed, shouldn’t it?


> (why not using SSD's?  They are common these days.  Does wonders for loading 
> speed!)
I’ve seen some stores recommending them. Unfortunately, I’ve got no control 
over this for now. 
Just out of curiosity, within Jena, do you think that the existing index 
structure, i.e. the B+tree, needs any changes to get the best out of SSDs?


Cheers,
Saud

On 21 Apr 2014, at 10:38, Andy Seaborne <[email protected]> wrote:

> On 20/04/14 17:13, Saud Aljaloud wrote:
>> Thanks Paul and Andy,
>> 
>>> Why are you suggesting non-public?
>> The idea is that because we are benchmarking a number of triple
>> stores, and our choice is to ask each of them about the best
>> configurations privately for their own store, we want to reduce the
>> amount of core information of our work being publicly available i.e:
>> the queries or statistics about other stores, until we publish them
>> later at once. That being said, we can discuss the general setup of
>> Jena here.
> 
> Unclear what anyone would do with such information ahead of publication 
> unless it's to copy you and publish earlier.  Just use current releases.
> 
> No relationship to http://www.ldbc.eu/?
> 
> As it's fulltext, you have to use the custom features of each system so 
> comparing like-with-like is going to be hard.
> 
> Each custom extension is going to have assumptions on usage - for Jena, you 
> can use Solr and have other applications going to the same index, it's not a 
> Jena specific structure anymore (LARQ was).  The text search languages have 
> different capabilities.
> 
> ((That's really what stopped it getting standardized in SPARQL 1.1 - it's a 
> large piece of work (see xpath-full-text) and so it was this or most of the 
> other features.))
> 
>>> The benchmark driver in their SVN repository is somewhat ahead of
>>> the last formal release
>> Thanks for pointing this out.
> 
> I run a modified version that runs TDB locally to the benchmark driver to 
> benchmark just TDB and not the protocol component.
> 
>>> There isn't much on matching parts of strings in BSBM.
>> BSBM has a number of literal predicates of good enough length;
>> see the predicates within the assembler below, or see
>> http://wifo5-03.informatik.uni-mannheim.de/bizer/berlinsparqlbenchmark/spec/Dataset/index.html
>> 
>> 
>> 
>> Here are the steps we follow to perform a test on 1 million triples:
> 
> 1 million?  Isn't that rather small?  The whole thing will fit in memory.  1m 
> BSBM fits completely in 1G RAM and TDB isn't very space efficient as it 
> trades it for direct memory mapping.
> 
>> 
>> =======================
>> Jena Configurations:
>> 1- edit fuseki-server:
>> JVM_ARGS=${JVM_ARGS:--Xmx20G}
> 
> 3G should be enough.  20G will slow it down (a small amount given your 
> hardware) as much of TDB's memory usage is outside the Java heap.
> 
> (why not using SSD's?  They are common these days.  Does wonders for loading 
> speed!)
> 
> >
> >
> > 2- create an Assembler for Jena Text with Lucene "BSBM-fulltext-1.ttl" :
> >
> >
> > ## Example of a TDB dataset and text index published using Fuseki
> > @prefix :        <http://localhost/jena_example/#> .
> > @prefix fuseki:  <http://jena.apache.org/fuseki#> .
> > @prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
> > @prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
> > @prefix tdb:     <http://jena.hpl.hp.com/2008/tdb#> .
> > @prefix ja:      <http://jena.hpl.hp.com/2005/11/Assembler#> .
> > @prefix text:    <http://jena.apache.org/text#> .
> > @prefix rev: <http://purl.org/stuff/rev#> .
> > @prefix bsbm: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/> .
> > @prefix bsbm-inst: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/> .
> > @prefix dc: <http://purl.org/dc/elements/1.1/> .
> > @prefix foaf: <http://xmlns.com/foaf/0.1/> .
> >
> > [] rdf:type fuseki:Server ;
> >     # Timeout - server-wide default: milliseconds.
> >     # Format 1: "1000" -- 1 second timeout
> >     # Format 2: "10000,60000" -- 10s timeout to first result, then 60s
> >     #           timeout for the rest of the query.
> >     # See java doc for ARQ.queryTimeout
> >     # ja:context [ ja:cxtName "arq:queryTimeout" ;  ja:cxtValue "10000" ] ;
> >     # ja:loadClass "your.code.Class" ;
> >
> >     fuseki:services (
> >       <#service_text_tdb>
> >     ) .
> >
> > # TDB
> > [] ja:loadClass "com.hp.hpl.jena.tdb.TDB" .
> > tdb:DatasetTDB  rdfs:subClassOf  ja:RDFDataset .
> > tdb:GraphTDB    rdfs:subClassOf  ja:Model .
> >
> > # Text
> > [] ja:loadClass "org.apache.jena.query.text.TextQuery" .
> > text:TextDataset      rdfs:subClassOf   ja:RDFDataset .
> > #text:TextIndexSolr    rdfs:subClassOf   text:TextIndex .
> > text:TextIndexLucene  rdfs:subClassOf   text:TextIndex .
> >
> > ## ---------------------------------------------------------------
> >
> > <#service_text_tdb> rdf:type fuseki:Service ;
> >      rdfs:label                      "TDB/text service" ;
> >      fuseki:name                     "BSBM1M" ;
> >      fuseki:serviceQuery             "query" ;
> >      fuseki:serviceQuery             "sparql" ;
> >      fuseki:serviceUpdate            "update" ;
> >      fuseki:serviceUpload            "upload" ;
> >      fuseki:serviceReadGraphStore    "get" ;
> >      fuseki:serviceReadWriteGraphStore    "data" ;
> >      fuseki:dataset                  :text_dataset ;
> >      .
> >
> > :text_dataset rdf:type     text:TextDataset ;
> >      text:dataset   <#dataset> ;
> >      ##text:index   <#indexSolr> ;
> >      text:index     <#indexLucene> ;
> >      .
> >
> > <#dataset> rdf:type      tdb:DatasetTDB ;
> >      tdb:location "/home/path/apache-jena-2.11.1/data" ;
> >      #tdb:unionDefaultGraph true ;
> >      .
> >
> > <#indexSolr> a text:TextIndexSolr ;
> >      #text:server <http://localhost:8983/solr/COLLECTION> ;
> >      text:server <embedded:SolrARQ> ;
> >      text:entityMap <#entMap> ;
> >      .
> >
> > <#indexLucene> a text:TextIndexLucene ;
> >      text:directory <file:/home/path/apache-jena-2.11.1/lucene> ;
> >      ##text:directory "mem" ;
> >      text:entityMap <#entMap> ;
> >      .
> >
> > <#entMap> a text:EntityMap ;
> >      text:entityField      "uri" ;
> >      text:defaultField     "text" ;        ## Should be defined in the text:map.
> >      text:map (
> >           # rdfs:label
> >              [ text:field "text" ; text:predicate rdfs:label ]
> >              [ text:field "text" ; text:predicate rdfs:comment ]
> >              [ text:field "text" ; text:predicate foaf:name ]
> >              [ text:field "text" ; text:predicate  dc:title ]
> >              [ text:field "text" ; text:predicate  rev:text ]
> >             
> >           ) .
> >
> >
> >
> >
> >
> >
> > =======================
> > Jena Test procedure with statistics for BSBM1M (one million triples): using 
> > a machine with specs [1,2]
> > 1- load data:
> > ./tdbloader2 --loc ../data/ ~/bsbmtools-0.2/dataset_1M.ttl
> >     15:25:24 -- 35 seconds
> >     Size: 137M      .
> >
> >
> >
> > 2- build jena text index:
> > java -cp fuseki-server.jar jena.textindexer --desc=BSBM-fulltext-1.ttl
> >     INFO 31123 (3112 per second)properties indexed (3112 per second overall)
> >     INFO 72657 (5589 per second) properties indexed
> >     Size: 17M       .
> >
> > 3- Flush OS memory and swap.
> >
> > 4- Run Server:
> > ./fuseki-server --config=BSBM-fulltext-1.ttl
> >
> >
> > 5- Run test using BSBM driver:
> > ./testdriver -ucf usecases/literalSearch/fulltext/jena.txt -w 1000 \
> >     -o Jena_1Client_BSBM1M.xml http://localhost:3030/BSBM1M/sparql
> >
> >
> > =======================
> >
> >
> >
> > Any comments would be appreciated.
> >
> >
> > Many thanks,
> > Saud
> >
> >
> 
> > [1] 2x AMD Opteron 4280 Processor (2.8GHz, 8C, 8M L2/8M L3 Cache, 95W), 
> > DDR3-1600 MHz 128GB Memory for 2CPU (8x16GB Quad Rank LV RDIMMs) 1066MHz 2x 
> > 300GB, SAS 6Gbps, 3.5-in, 15K RPM Hard Drive (Hot-plug) SAS 6/iR 
> > Controller, For Hot Plug HDD Chassis No Optical Drive Redundant Power 
> > Supply (2 PSU) 500W 2M Rack Power Cord C13/C14 12A iDRAC6 Enterprise 
> > Sliding Ready Rack Rails C11 Hot-Swap - R0 for SAS 6iR, Min. 2 Max. 4 
> > SAS/SATA Hot Plug Drives
> 
> > [2] Red Hat Enterprise Linux Server release 6.4 (Santiago), java version 
> > "1.7.0_51",
> >
> >
> >
> >

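The five steps of the quoted test procedure can be strung together in one
script. This is only a sketch under stated assumptions: the paths and file
names are the ones given in the quoted message, each command is assumed to be
run from the directory used there, and the cache and swap flush commands for
step 3 are an assumption (the message does not say how the flush is done; these
are the usual Linux ones and need root). DRY_RUN, run, benchmark_steps and the
30-second start-up wait are hypothetical choices made here. By default the
script only prints the commands.

```shell
#!/bin/sh
# Sketch of the quoted benchmark procedure. DRY_RUN, run() and
# benchmark_steps() are hypothetical names; with DRY_RUN=1 (the default)
# commands are printed (without their original quoting) instead of executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

benchmark_steps() {
    # 1- load the data into TDB
    run ./tdbloader2 --loc ../data/ "$HOME/bsbmtools-0.2/dataset_1M.ttl"
    # 2- build the jena-text (Lucene) index from the assembler file
    run java -cp fuseki-server.jar jena.textindexer --desc=BSBM-fulltext-1.ttl
    # 3- flush OS caches and swap (assumed commands; require root)
    run sync
    run sh -c 'echo 3 > /proc/sys/vm/drop_caches'
    run sh -c 'swapoff -a && swapon -a'
    # 4- start the server; it does not return, so background it in a real run
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "./fuseki-server --config=BSBM-fulltext-1.ttl &"
    else
        ./fuseki-server --config=BSBM-fulltext-1.ttl &
        server_pid=$!
        sleep 30    # assumed start-up allowance; adjust as needed
    fi
    # 5- run the BSBM test driver against the query endpoint
    run ./testdriver -ucf usecases/literalSearch/fulltext/jena.txt -w 1000 \
        -o Jena_1Client_BSBM1M.xml http://localhost:3030/BSBM1M/sparql
    # stop the server once the driver has finished
    if [ "$DRY_RUN" != "1" ]; then
        kill "$server_pid"
    fi
}

benchmark_steps
```

Running it with DRY_RUN=0 executes the steps for real; keeping the whole
sequence in one place makes it easier to repeat the same procedure unchanged
at the larger dataset sizes.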