On 20/04/14 17:13, Saud Aljaloud wrote:
> Thanks Paul and Andy,
>
>> Why are you suggesting non-public?
> The idea is that because we are benchmarking a number of triple
> stores, and our choice is to ask each of them about the best
> configurations privately for their own store, we want to reduce the
> amount of core information of our work being publicly available, i.e.
> the queries or statistics about other stores, until we publish them
> later at once. This being said, we can discuss the general setup of
> Jena here.

Unclear what anyone would do with such information ahead of publication unless it's to copy you and publish earlier. Just use current releases.

No relationship to http://www.ldbc.eu/?

As it's fulltext, you have to use the custom features of each system so comparing like-with-like is going to be hard.

Each custom extension is going to have assumptions on usage. For Jena, you can use Solr and have other applications going to the same index; it's not a Jena-specific structure anymore (LARQ was). The text search languages have different capabilities.

(That's really what stopped full-text search getting standardized in SPARQL 1.1: it's a large piece of work (see xpath-full-text), so it was this or most of the other features.)

>> The benchmark driver in their SVN repository is somewhat ahead of
>> the last formal release
> Thanks for pointing this out.

I run a modified version of the driver that runs TDB locally to the benchmark driver, so that it benchmarks just TDB and not the protocol component.

>> There isn't much on matching parts of strings in BSBM.
> BSBM has a number of literal predicates of good enough length;
> see the predicates within the assembler below, or see
> http://wifo5-03.informatik.uni-mannheim.de/bizer/berlinsparqlbenchmark/spec/Dataset/index.html



> Here are the steps we do to perform a test on 1 million triples:

1 million? Isn't that rather small? The whole thing will fit in memory. 1M BSBM fits completely in 1G RAM, and TDB isn't very space efficient, as it trades space for direct memory mapping.
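
As a rough check, using the sizes reported in the test procedure quoted below (137M for the TDB database, 17M for the Lucene index), the on-disk total is well under 1G. A minimal sketch of that arithmetic; the variable names are only for illustration, and on-disk size is only a proxy for what ends up in the OS file cache:

```shell
# Sizes in MB, taken from the figures quoted in this message.
tdb_mb=137      # TDB database after tdbloader2
lucene_mb=17    # Lucene text index after jena.textindexer
ram_mb=1024     # 1G of RAM

total_mb=$((tdb_mb + lucene_mb))
echo "on-disk total: ${total_mb}M of ${ram_mb}M RAM"
```

So after a short warm-up the queries are likely to be served from memory, not from the disks.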


> =======================
> Jena Configurations:
> 1- edit fuseki-server:
> JVM_ARGS=${JVM_ARGS:--Xmx20G}

3G should be enough. 20G will slow it down (a small amount given your hardware) as much of TDB's memory usage is outside the Java heap.
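
For example, the line in the fuseki-server script could be changed along these lines. This is a sketch, assuming the script keeps the default-if-unset form shown above; the `unset` is only there to make the illustration self-contained:

```shell
# Start from a clean environment so the default below is what takes effect.
unset JVM_ARGS

# Use a 3G heap unless the caller has already set JVM_ARGS.
JVM_ARGS=${JVM_ARGS:--Xmx3G}
echo "$JVM_ARGS"
```

Because of the `:-` form, exporting JVM_ARGS before starting the server still overrides the default without editing the script.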

(Why not use SSDs? They are common these days and do wonders for loading speed!)

>
> =======================
> Jena Configurations:
> 1- edit fuseki-server:
> JVM_ARGS=${JVM_ARGS:--Xmx20G}
>
>
> 2- create an Assembler for Jena Text with Lucene "BSBM-fulltext-1.ttl" :
>
>
> ## Example of a TDB dataset and text index published using Fuseki
> @prefix :        <http://localhost/jena_example/#> .
> @prefix fuseki:  <http://jena.apache.org/fuseki#> .
> @prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
> @prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
> @prefix tdb:     <http://jena.hpl.hp.com/2008/tdb#> .
> @prefix ja:      <http://jena.hpl.hp.com/2005/11/Assembler#> .
> @prefix text:    <http://jena.apache.org/text#> .
> @prefix rev: <http://purl.org/stuff/rev#> .
> @prefix bsbm: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/> .
> @prefix bsbm-inst: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/> .
> @prefix dc: <http://purl.org/dc/elements/1.1/> .
> @prefix foaf: <http://xmlns.com/foaf/0.1/> .
>
> [] rdf:type fuseki:Server ;
>     # Timeout - server-wide default: milliseconds.
>     # Format 1: "1000" -- 1 second timeout
>     # Format 2: "10000,60000" -- 10s timeout to first result, then 60s timeout for the rest of the query.
>     # See java doc for ARQ.queryTimeout
>     # ja:context [ ja:cxtName "arq:queryTimeout" ; ja:cxtValue "10000" ] ;
>     # ja:loadClass "your.code.Class" ;
>
>     fuseki:services (
>       <#service_text_tdb>
>     ) .
>
> # TDB
> [] ja:loadClass "com.hp.hpl.jena.tdb.TDB" .
> tdb:DatasetTDB  rdfs:subClassOf  ja:RDFDataset .
> tdb:GraphTDB    rdfs:subClassOf  ja:Model .
>
> # Text
> [] ja:loadClass "org.apache.jena.query.text.TextQuery" .
> text:TextDataset      rdfs:subClassOf   ja:RDFDataset .
> #text:TextIndexSolr    rdfs:subClassOf   text:TextIndex .
> text:TextIndexLucene  rdfs:subClassOf   text:TextIndex .
>
> ## ---------------------------------------------------------------
>
> <#service_text_tdb> rdf:type fuseki:Service ;
>      rdfs:label                      "TDB/text service" ;
>      fuseki:name                     "BSBM1M" ;
>      fuseki:serviceQuery             "query" ;
>      fuseki:serviceQuery             "sparql" ;
>      fuseki:serviceUpdate            "update" ;
>      fuseki:serviceUpload            "upload" ;
>      fuseki:serviceReadGraphStore    "get" ;
>      fuseki:serviceReadWriteGraphStore    "data" ;
>      fuseki:dataset                  :text_dataset ;
>      .
>
> :text_dataset rdf:type     text:TextDataset ;
>      text:dataset   <#dataset> ;
>      ##text:index   <#indexSolr> ;
>      text:index     <#indexLucene> ;
>      .
>
> <#dataset> rdf:type      tdb:DatasetTDB ;
>      tdb:location "/home/path/apache-jena-2.11.1/data" ;
>      #tdb:unionDefaultGraph true ;
>      .
>
> <#indexSolr> a text:TextIndexSolr ;
>      #text:server <http://localhost:8983/solr/COLLECTION> ;
>      text:server <embedded:SolrARQ> ;
>      text:entityMap <#entMap> ;
>      .
>
> <#indexLucene> a text:TextIndexLucene ;
>      text:directory <file:/home/path/apache-jena-2.11.1/lucene> ;
>      ##text:directory "mem" ;
>      text:entityMap <#entMap> ;
>      .
>
> <#entMap> a text:EntityMap ;
>      text:entityField      "uri" ;
>      text:defaultField     "text" ; ## Should be defined in the text:map.
>      text:map (
>           # rdfs:label
>             [ text:field "text" ; text:predicate rdfs:label ]
>             [ text:field "text" ; text:predicate rdfs:comment ]
>             [ text:field "text" ; text:predicate foaf:name ]
>             [ text:field "text" ; text:predicate  dc:title ]
>             [ text:field "text" ; text:predicate  rev:text ]
>            
>           ) .
>
>
>
>
>
>
> =======================
> Jena Test procedure with statistics for BSBM1M (one million triples): using a machine with specs [1,2]
> 1- load data:
> ./tdbloader2 --loc ../data/ ~/bsbmtools-0.2/dataset_1M.ttl
>    15:25:24 -- 35 seconds
>    Size: 137M      .
>
>
>
> 2- build jena text index:
> java -cp fuseki-server.jar jena.textindexer --desc=BSBM-fulltext-1.ttl
>    INFO 31123 (3112 per second)properties indexed (3112 per second overall)
>    INFO 72657 (5589 per second) properties indexed
>    Size: 17M       .
>
> 3- Flush OS memory and swap.
>
> 4- Run Server:
> ./fuseki-server --config=BSBM-fulltext-1.ttl
>
>
> 5- Run test using BSBM driver:
> ./testdriver -ucf usecases/literalSearch/fulltext/jena.txt -w 1000 -o Jena_1Client_BSBM1M.xml http://localhost:3030/BSBM1M/sparql
>
>
> =======================
>
>
>
> Any comments would be appreciated.
>
>
> Many thanks,
> Saud
>
>

> [1] 2x AMD Opteron 4280 Processor (2.8GHz, 8C, 8M L2/8M L3 Cache, 95W), DDR3-1600 MHz
>     128GB Memory for 2CPU (8x16GB Quad Rank LV RDIMMs) 1066MHz
>     2x 300GB, SAS 6Gbps, 3.5-in, 15K RPM Hard Drive (Hot-plug)
>     SAS 6/iR Controller, For Hot Plug HDD Chassis
>     No Optical Drive
>     Redundant Power Supply (2 PSU) 500W
>     2M Rack Power Cord C13/C14 12A
>     iDRAC6 Enterprise
>     Sliding Ready Rack Rails
>     C11 Hot-Swap - R0 for SAS 6iR, Min. 2 Max. 4 SAS/SATA Hot Plug Drives

> [2] Red Hat Enterprise Linux Server release 6.4 (Santiago), java version "1.7.0_51",
>
>
>
>
