Thanks Paul and Andy,

> Why are you suggesting non-public?

We are benchmarking a number of triple stores, and we have chosen to ask each of them privately about the best configuration for its own store. For that reason we want to limit how much of the core of our work (i.e. the queries, or statistics about other stores) is publicly available until we publish everything at once. That said, we can discuss the general setup of Jena here.

> The benchmark driver in their SVN repository is somewhat ahead
> of the last formal release

Thanks for pointing this out.

> There isn't much on matching parts of strings in BSBM.

BSBM has a number of literal predicates with a good enough length; see the predicates within the assembler below, or see
http://wifo5-03.informatik.uni-mannheim.de/bizer/berlinsparqlbenchmark/spec/Dataset/index.html

Here are the steps we follow to run a test on 1 million triples:

=======================
Jena Configurations:

1- edit fuseki-server:
JVM_ARGS=${JVM_ARGS:--Xmx20G}

2- create an Assembler for Jena Text with Lucene "BSBM-fulltext-1.ttl" :

## Example of a TDB dataset and text index published using Fuseki

@prefix :          <http://localhost/jena_example/#> .
@prefix fuseki:    <http://jena.apache.org/fuseki#> .
@prefix rdf:       <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:      <http://www.w3.org/2000/01/rdf-schema#> .
@prefix tdb:       <http://jena.hpl.hp.com/2008/tdb#> .
@prefix ja:        <http://jena.hpl.hp.com/2005/11/Assembler#> .
@prefix text:      <http://jena.apache.org/text#> .
@prefix rev:       <http://purl.org/stuff/rev#> .
@prefix bsbm:      <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/> .
@prefix bsbm-inst: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/> .
@prefix dc:        <http://purl.org/dc/elements/1.1/> .
@prefix foaf:      <http://xmlns.com/foaf/0.1/> .

[] rdf:type fuseki:Server ;
   # Timeout - server-wide default: milliseconds.
   # Format 1: "1000" -- 1 second timeout
   # Format 2: "10000,60000" -- 10s timeout to first result, then 60s timeout to for rest of query.
   # See java doc for ARQ.queryTimeout
   # ja:context [ ja:cxtName "arq:queryTimeout" ; ja:cxtValue "10000" ] ;
   # ja:loadClass "your.code.Class" ;

   fuseki:services (
     <#service_text_tdb>
   ) .

# TDB
[] ja:loadClass "com.hp.hpl.jena.tdb.TDB" .
tdb:DatasetTDB  rdfs:subClassOf  ja:RDFDataset .
tdb:GraphTDB    rdfs:subClassOf  ja:Model .

# Text
[] ja:loadClass "org.apache.jena.query.text.TextQuery" .
text:TextDataset      rdfs:subClassOf  ja:RDFDataset .
#text:TextIndexSolr   rdfs:subClassOf  text:TextIndex .
text:TextIndexLucene  rdfs:subClassOf  text:TextIndex .

## ---------------------------------------------------------------

<#service_text_tdb> rdf:type fuseki:Service ;
    rdfs:label                         "TDB/text service" ;
    fuseki:name                        "BSBM1M" ;
    fuseki:serviceQuery                "query" ;
    fuseki:serviceQuery                "sparql" ;
    fuseki:serviceUpdate               "update" ;
    fuseki:serviceUpload               "upload" ;
    fuseki:serviceReadGraphStore       "get" ;
    fuseki:serviceReadWriteGraphStore  "data" ;
    fuseki:dataset                     :text_dataset ;
    .

:text_dataset rdf:type text:TextDataset ;
    text:dataset  <#dataset> ;
    ##text:index  <#indexSolr> ;
    text:index    <#indexLucene> ;
    .

<#dataset> rdf:type tdb:DatasetTDB ;
    tdb:location "/home/path/apache-jena-2.11.1/data" ;
    #tdb:unionDefaultGraph true ;
    .

<#indexSolr> a text:TextIndexSolr ;
    #text:server <http://localhost:8983/solr/COLLECTION> ;
    text:server <embedded:SolrARQ> ;
    text:entityMap <#entMap> ;
    .

<#indexLucene> a text:TextIndexLucene ;
    text:directory <file:/home/path/apache-jena-2.11.1/lucene> ;
    ##text:directory "mem" ;
    text:entityMap <#entMap> ;
    .

<#entMap> a text:EntityMap ;
    text:entityField   "uri" ;
    text:defaultField  "text" ;   ## Should be defined in the text:map.
    text:map (
         # rdfs:label
         [ text:field "text" ; text:predicate rdfs:label ]
         [ text:field "text" ; text:predicate rdfs:comment ]
         [ text:field "text" ; text:predicate foaf:name ]
         [ text:field "text" ; text:predicate dc:title ]
         [ text:field "text" ; text:predicate rev:text ]
         ) .

=======================
Jena Test procedure with statistics for BSBM1M (one million triples):
using a machine with specs [1,2]

1- load data:
./tdbloader2 --loc ../data/ ~/bsbmtools-0.2/dataset_1M.ttl
15:25:24 -- 35 seconds
Size: 137M .

2- build jena text index:
java -cp fuseki-server.jar jena.textindexer --desc=BSBM-fulltext-1.ttl
INFO 31123 (3112 per second)properties indexed (3112 per second overall)
INFO 72657 (5589 per second) properties indexed
Size: 17M .

3- Flush OS memory and swap.

4- Run Server:
./fuseki-server --config=BSBM-fulltext-1.ttl

5- Run test using BSBM driver:
./testdriver -ucf usecases/literalSearch/fulltext/jena.txt -w 1000 -o Jena_1Client_BSBM1M.xml http://localhost:3030/BSBM1M/sparql

=======================

Any comments would be appreciated.

Many thanks,
Saud

[1]
2x AMD Opteron 4280 Processor (2.8GHz, 8C, 8M L2/8M L3 Cache, 95W), DDR3-1600 MHz
128GB Memory for 2CPU (8x16GB Quad Rank LV RDIMMs) 1066MHz
2x 300GB, SAS 6Gbps, 3.5-in, 15K RPM Hard Drive (Hot-plug)
SAS 6/iR Controller, For Hot Plug HDD Chassis
No Optical Drive
Redundant Power Supply (2 PSU) 500W
2M Rack Power Cord C13/C14 12A
iDRAC6 Enterprise
Sliding Ready Rack Rails
C11 Hot-Swap - R0 for SAS 6iR, Min. 2 Max. 4 SAS/SATA Hot Plug Drives

[2]
Red Hat Enterprise Linux Server release 6.4 (Santiago),
java version "1.7.0_51"

On 20 Apr 2014, at 11:49, Andy Seaborne <[email protected]> wrote:

> On 19/04/14 18:33, Saud Aljaloud wrote:
>> Dear Jena folks,
>>
>> We are investigating how efficient different triple stores, including
>> Jena TDB, handle literal strings within SPARQL. To this end, We are
>> now working on benchmarking these triple stores against a set of
>> specific queries, using the Berlin Benchmark (BSBM) test driver [1],
>> dataset and matrices[2].
>>
>
> BSBM measures a certain kind of workload (actually, 2 kinds, the explore
> and BI). The benchmark driver in their SVN repository is somewhat ahead
> of the last formal release. You are actually benchmarking TDB+Fuseki,
> not TDB in isolation , because the work load has a significant
> proportion of network communication.
>
> There isn't much on matching parts of strings in BSBM.
>
> As Paul observes, a text index can make a big difference.
>
>> We are using the latest Jena releases: Jena VERSION: 2.11.1, Fuseki:
>> VERSION: 1.0.1.
>>
>> To get the best out of Jena, we would like to ask your valuable
>> feedback and other optimisations that can boost the performance of
>> Jena.
>> I should provide more info, but non-public communication with
>> someone/group from Jena who are willing to be directly contacted by
>> email is preferable.
>
> Jena is an open source project and works in public. I don't work offlist
> unless there is a specific (usually, commercial) reason.
>
> We can discuss TDB here. Not being a product, there is no reason not to
> discuss both good and bad features here with the developers. Why are you
> suggesting non-public?
>
> There are other benchmark frameworks: eg.
> http://www.slideshare.net/RobVesse/practical-sparql-benchmarking
> which may be easier to use for a new set of queries and data.
>
> Andy
>
>> Configurations are going to be publicly
>> available later within the benchmark.
>>
>>
>> Kind Regards,
>>
>> Saud
>>
>>
>> [1] http://sourceforge.net/projects/bsbmtools/ [2]
>> http://wifo5-03.informatik.uni-mannheim.de/bizer/berlinsparqlbenchmark/
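
Note on step 3 of the test procedure ("Flush OS memory and swap"): the message does not say which commands were used. What follows is only a minimal sketch of one common way to do this, assuming a Linux host (such as the RHEL 6.4 machine listed in [2]) and root access. The helper names run and flush_os_caches and the DRY_RUN switch are hypothetical, introduced here for illustration. By default the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch of step 3 (flush OS page cache and swap) on Linux.
# Assumption: these commands are NOT taken from the original message.
# DRY_RUN=1 (default) only prints the commands; DRY_RUN=0 runs them (needs root).
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

flush_os_caches() {
    # Write dirty pages to disk so the cache can actually be dropped.
    run sync
    # Drop page cache, dentries and inodes.
    run sh -c 'echo 3 > /proc/sys/vm/drop_caches'
    # Disabling and re-enabling swap empties it; this needs enough free RAM
    # to hold whatever is currently swapped out.
    run swapoff -a
    run swapon -a
}

flush_os_caches
```

Running this between the index build (step 2) and the server start (step 4) is meant to give each run a cold OS cache; whether that matches the original setup is not confirmed by the message.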
