On 21/04/14 12:23, Saud Al-Jaloud wrote:
Just use current releases.

We are using current releases; we are not looking to tune systems
but rather to find the right configs, as this is something of an
extension. Otherwise, some might argue that we were unfair to, or
missed features of, some stores over others. For example, buffer
size, or the way the full-text index is built, etc. To some extent,
we are trying to follow the same rules as BSBM, see
http://wifo5-03.informatik.uni-mannheim.de/bizer/berlinsparqlbenchmark/spec/BenchmarkRules/


No relationship to http://www.ldbc.eu/?
Nope, this is part of my PhD, which is mainly about optimising
regexes within SPARQL, but I also look at full-text search here.

I'd be interested in hearing in what ways the problem is different from SQL.

Also - in SQL, there is LIKE. Would it be a good idea for SPARQL to have a separate "LIKE"? (That is, can a system do a lot better with that than with analysing a regex?)
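To make the LIKE-vs-regex question concrete: LIKE has only two wildcards ('%' and '_'), so a store can mechanically translate it to a regex, and, more importantly, can detect index-friendly shapes (e.g. a literal prefix) without general regex analysis. A toy Python sketch, not how any particular store implements it; the function names are made up for illustration:

```python
import re

def like_to_regex(pattern: str) -> str:
    """Translate a SQL LIKE pattern to an anchored regex.

    LIKE has only two wildcards: '%' (any sequence) and '_' (any
    single character); everything else is a literal match.
    """
    out = []
    for ch in pattern:
        if ch == "%":
            out.append(".*")
        elif ch == "_":
            out.append(".")
        else:
            out.append(re.escape(ch))
    return "^" + "".join(out) + "$"

def index_friendly_prefix(pattern: str) -> str:
    """Return the literal prefix before the first wildcard.

    A non-empty prefix means a store could answer the LIKE with a
    range scan over a sorted index instead of scanning and regex
    matching every literal - the kind of optimisation that is easy
    for LIKE but needs real analysis for an arbitrary regex.
    """
    for i, ch in enumerate(pattern):
        if ch in "%_":
            return pattern[:i]
    return pattern

print(like_to_regex("Prod%"))          # ^Prod.*$
print(index_friendly_prefix("Prod%"))  # 'Prod' -> range scan possible
print(index_friendly_prefix("%uct"))   # ''     -> full scan needed
```

The same prefix trick is what regex optimisers try to recover from patterns like `^Prod.*`; LIKE just makes it trivially decidable.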


As it's fulltext, you have to use the custom features of each
system so comparing like-with-like is going to be hard.
Indeed, there are a couple of challenges; that's why I see it as an
interesting area.

((That's really what stopped it getting standardized in SPARQL 1.1
- it's a large piece of work (see xpath-full-text) and so it was
this or most of the other features.))
I’ve also read your post,
http://mail-archives.apache.org/mod_mbox/jena-users/201306.mbox/%[email protected]%3E


Thanks for sharing such info.
But don't you think that even a common syntax could make a huge
difference, regardless of how stores implement it internally?

It's an argument that came up when the SPARQL 1.1 WG was deciding what to do. I happen to agree that a common syntax would have been good, but others felt that if the text search language wasn't standardised as well, it was not a good use of the fixed amount of time we had. There is also an argument that what is really needed is a general extension mechanism (text, spatial, statistical analytics, ...), and again defining that is non-trivial.

Standards involve compromises, as does working to a timescale with volunteers (not that we stuck to the timescale!).

1 million?  Isn't that rather small?  The whole thing will fit in
memory.  1m BSBM fits completely in 1G RAM and TDB isn't very space
efficient as it trades it for direct memory mapping.
That was just a test for the purpose of these emails, to make sure
we are doing things right. The real test will target 200M triples,
maybe more.

3G should be enough.  20G will slow it down (a small amount given
your hardware) as much of TDB's memory usage is outside the Java
heap.
I’ll take this for Jena.


20G will slow it down
Generally, I thought the max wouldn't affect speed as long as usage
doesn't reach the max; that should reduce the number of GCs
performed, shouldn't it?

Not in the case of TDB because the indexes are cached as memory mapped files, outside heap, so if you have a larger heap you have less index cache space.

And the GC pauses get longer even if less frequent. A full GC happens sometime - see lots of big data blogs about the pain felt when the GC goes off into the weeds for seconds at a time.

(why not using SSD's?  They are common these days.  Does wonders
for loading speed!)
I've seen some stores recommend them. Unfortunately, I've got no
control over this for now. Just out of curiosity, within Jena, do you
think the existing index structure, i.e. the B+Tree, needs any
changes to get the best out of SSDs?

Not as far as I know. They already work much better on SSDs. They use a large-ish block size (8K; the trees are 150- to 200-way B+Trees). The TDB B+Trees are quite specialised: they only work with fixed-size keys and fixed-size values, which makes node search fast.
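The fixed-size point can be illustrated with a toy sketch (not TDB's actual code; the 8-byte key width here is an assumption for the example): when every record in a node block is the same width, the i-th key sits at a computable byte offset, so a node can be binary-searched in place with no per-record parsing or length fields.

```python
KEY_SIZE = 8  # bytes per key; assumed width for this sketch only

def node_search(node: bytes, key: bytes) -> int:
    """Binary-search a tree node stored as packed fixed-size keys.

    Because every key is exactly KEY_SIZE bytes, the i-th key lives
    at byte offset i*KEY_SIZE - no deserialisation step, which is
    what makes search over fixed-size records fast.
    Returns the index of the first key >= `key`.
    """
    assert len(key) == KEY_SIZE
    lo, hi = 0, len(node) // KEY_SIZE
    while lo < hi:
        mid = (lo + hi) // 2
        off = mid * KEY_SIZE
        if node[off:off + KEY_SIZE] < key:  # bytewise compare, no decode
            lo = mid + 1
        else:
            hi = mid
    return lo

# A node holding the packed keys 0,10,20,...,90 as big-endian 64-bit ints
# (big-endian so that byte order matches numeric order).
node = b"".join((10 * i).to_bytes(KEY_SIZE, "big") for i in range(10))
print(node_search(node, (35).to_bytes(KEY_SIZE, "big")))  # 4 (first key >= 35 is 40)
print(node_search(node, (20).to_bytes(KEY_SIZE, "big")))  # 2
```

With variable-size records you would need a per-record offset table or a parse on every probe; fixed-size records avoid both.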

In TDB, the places to look for optimizations are all the trends in modern DBs: e.g. design for in-memory use, where the disk is just a backup and a way to move state across OS restarts. On today's servers, many uses fit in RAM, or a very high percentage of the hot data is RAM-sized, so designing for that would be good.

Multi-core execution. Multi-machine execution (Project Lizard is going that way).

A big change would be to make the indexes use an MVCC design, so that update-in-place does not happen and a transaction is a single write, not two writes (write to log, then write to the main DB sometime later), as in CouchDB or, more recently, Apache Mavibot.
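The single-write idea can be sketched with a toy copy-on-write tree (a deliberate simplification, not TDB, CouchDB, or Mavibot code): an update allocates fresh nodes only along the root-to-leaf path and commits by publishing a new root, so there is no in-place mutation to protect with a write-ahead log, and readers holding the old root keep a consistent snapshot.

```python
class Node:
    """Immutable binary-search-tree node; a stand-in for a B+Tree block."""
    __slots__ = ("key", "value", "left", "right")
    def __init__(self, key, value, left=None, right=None):
        self.key, self.value, self.left, self.right = key, value, left, right

def insert(root, key, value):
    """Copy-on-write insert: never mutates existing nodes, only copies
    the root-to-leaf path, so each commit is one write of fresh blocks
    plus swinging a root pointer - no separate write-ahead log."""
    if root is None:
        return Node(key, value)
    if key < root.key:
        return Node(root.key, root.value, insert(root.left, key, value), root.right)
    if key > root.key:
        return Node(root.key, root.value, root.left, insert(root.right, key, value))
    return Node(key, value, root.left, root.right)  # replace, in a new node

def lookup(root, key):
    while root is not None:
        if key == root.key:
            return root.value
        root = root.left if key < root.key else root.right
    return None

# A reader holding the old root still sees the old state after a commit.
v1 = insert(insert(None, 2, "b"), 1, "a")
v2 = insert(v1, 2, "B")              # new version; v1 is untouched
print(lookup(v1, 2), lookup(v2, 2))  # b B
```

Untouched subtrees are shared between versions (`v2.left is v1.left` here), which is what keeps the per-transaction write cost to the path length rather than the whole index.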





The NodeTable is the better place to look for optimizations.

        Andy


Cheers, Saud

On 21 Apr 2014, at 10:38, Andy Seaborne <[email protected]> wrote:

On 20/04/14 17:13, Saud Aljaloud wrote:
Thanks Paul and Andy,

Why are you suggesting non-public?
The idea is that, because we are benchmarking a number of triple
stores, and our choice is to ask each of them privately about the
best configuration for their own store, we want to reduce the amount
of core information about our work that is publicly available, i.e.
the queries or statistics about other stores, until we publish it
all at once later. That being said, we can discuss the general setup
of Jena here.

Unclear what anyone would do with such information ahead of
publication unless it's to copy you and publish earlier.  Just use
current releases.

No relationship to http://www.ldbc.eu/?

As it's fulltext, you have to use the custom features of each
system so comparing like-with-like is going to be hard.

Each custom extension is going to have assumptions about usage. For
Jena, you can use Solr and have other applications going to the same
index; it's not a Jena-specific structure anymore (LARQ was). The
text search languages have different capabilities.

((That's really what stopped it getting standardized in SPARQL 1.1
- it's a large piece of work (see xpath-full-text) and so it was
this or most of the other features.))

The benchmark driver in their SVN repository is somewhat ahead
of the last formal release
Thanks for pointing this out.

I run a modified version that runs TDB locally to the benchmark
driver to benchmark just TDB and not the protocol component.

There isn't much on matching parts of strings in BSBM.
BSBM has a number of literal predicates of good enough length; see
the predicates within the assembler below, or see
http://wifo5-03.informatik.uni-mannheim.de/bizer/berlinsparqlbenchmark/spec/Dataset/index.html





Here are the steps we do to perform a test on a 1 million triples:

1 million?  Isn't that rather small?  The whole thing will fit in
memory.  1m BSBM fits completely in 1G RAM and TDB isn't very space
efficient as it trades it for direct memory mapping.


=======================
Jena Configurations:
1- edit fuseki-server: JVM_ARGS=${JVM_ARGS:--Xmx20G}

3G should be enough.  20G will slow it down (a small amount given
your hardware) as much of TDB's memory usage is outside the Java
heap.

(why not using SSD's?  They are common these days.  Does wonders
for loading speed!)


=======================
Jena Configurations:
1- edit fuseki-server: JVM_ARGS=${JVM_ARGS:--Xmx20G}


2- create an Assembler for Jena Text with Lucene
"BSBM-fulltext-1.ttl" :


## Example of a TDB dataset and text index published using Fuseki
@prefix :        <http://localhost/jena_example/#> .
@prefix fuseki:  <http://jena.apache.org/fuseki#> .
@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix tdb:     <http://jena.hpl.hp.com/2008/tdb#> .
@prefix ja:      <http://jena.hpl.hp.com/2005/11/Assembler#> .
@prefix text:    <http://jena.apache.org/text#> .
@prefix rev:     <http://purl.org/stuff/rev#> .
@prefix bsbm:    <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/> .
@prefix bsbm-inst: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/> .
@prefix dc:      <http://purl.org/dc/elements/1.1/> .
@prefix foaf:    <http://xmlns.com/foaf/0.1/> .

[] rdf:type fuseki:Server ;
   # Timeout - server-wide default: milliseconds.
   # Format 1: "1000" -- 1 second timeout
   # Format 2: "10000,60000" -- 10s timeout to first result, then 60s timeout for rest of query.
   # See java doc for ARQ.queryTimeout
   # ja:context [ ja:cxtName "arq:queryTimeout" ; ja:cxtValue "10000" ] ;
   # ja:loadClass "your.code.Class" ;
   fuseki:services ( <#service_text_tdb> ) .

# TDB
[] ja:loadClass "com.hp.hpl.jena.tdb.TDB" .
tdb:DatasetTDB  rdfs:subClassOf  ja:RDFDataset .
tdb:GraphTDB    rdfs:subClassOf  ja:Model .

# Text
[] ja:loadClass "org.apache.jena.query.text.TextQuery" .
text:TextDataset      rdfs:subClassOf   ja:RDFDataset .
#text:TextIndexSolr    rdfs:subClassOf   text:TextIndex .
text:TextIndexLucene  rdfs:subClassOf   text:TextIndex .

## ---------------------------------------------------------------

<#service_text_tdb> rdf:type fuseki:Service ;
    rdfs:label                        "TDB/text service" ;
    fuseki:name                       "BSBM1M" ;
    fuseki:serviceQuery               "query" ;
    fuseki:serviceQuery               "sparql" ;
    fuseki:serviceUpdate              "update" ;
    fuseki:serviceUpload              "upload" ;
    fuseki:serviceReadGraphStore      "get" ;
    fuseki:serviceReadWriteGraphStore "data" ;
    fuseki:dataset                    :text_dataset ;
    .

:text_dataset rdf:type text:TextDataset ;
    text:dataset <#dataset> ;
    ##text:index <#indexSolr> ;
    text:index   <#indexLucene> ;
    .

<#dataset> rdf:type tdb:DatasetTDB ;
    tdb:location "/home/path/apache-jena-2.11.1/data" ;
    #tdb:unionDefaultGraph true ;
    .

<#indexSolr> a text:TextIndexSolr ;
    #text:server <http://localhost:8983/solr/COLLECTION> ;
    text:server <embedded:SolrARQ> ;
    text:entityMap <#entMap> ;
    .

<#indexLucene> a text:TextIndexLucene ;
    text:directory <file:/home/path/apache-jena-2.11.1/lucene> ;
    ##text:directory "mem" ;
    text:entityMap <#entMap> ;
    .

<#entMap> a text:EntityMap ;
    text:entityField      "uri" ;
    text:defaultField     "text" ;    ## Should be defined in the text:map.
    text:map (
         # rdfs:label
         [ text:field "text" ; text:predicate rdfs:label ]
         [ text:field "text" ; text:predicate rdfs:comment ]
         [ text:field "text" ; text:predicate foaf:name ]
         [ text:field "text" ; text:predicate dc:title ]
         [ text:field "text" ; text:predicate rev:text ]
         ) .






=======================
Jena Test procedure with statistics for BSBM1M (one million
triples), using a machine with specs [1,2]:
1- load data:
./tdbloader2 --loc ../data/ ~/bsbmtools-0.2/dataset_1M.ttl
15:25:24 -- 35 seconds
Size: 137M



2- build jena text index:
java -cp fuseki-server.jar jena.textindexer --desc=BSBM-fulltext-1.ttl
INFO 31123 properties indexed (3112 per second overall)
INFO 72657 properties indexed (5589 per second)
Size: 17M

3- Flush OS memory and swap.

4- Run Server: ./fuseki-server --config=BSBM-fulltext-1.ttl


5- Run test using BSBM driver:
./testdriver -ucf usecases/literalSearch/fulltext/jena.txt -w 1000 -o Jena_1Client_BSBM1M.xml http://localhost:3030/BSBM1M/sparql


=======================



Any comments would be appreciated.


Many thanks, Saud



[1] 2x AMD Opteron 4280 Processor (2.8GHz, 8C, 8M L2/8M L3 Cache,
95W); DDR3-1600 MHz 128GB Memory for 2 CPUs (8x16GB Quad Rank LV
RDIMMs) 1066MHz; 2x 300GB SAS 6Gbps 3.5-in 15K RPM Hard Drives
(Hot-plug); SAS 6/iR Controller for Hot Plug HDD Chassis; No Optical
Drive; Redundant Power Supply (2 PSU) 500W; 2M Rack Power Cord
C13/C14 12A; iDRAC6 Enterprise; Sliding Ready Rack Rails; C11
Hot-Swap - R0 for SAS 6iR, Min. 2 Max. 4 SAS/SATA Hot Plug Drives

[2] Red Hat Enterprise Linux Server release 6.4 (Santiago), java
version "1.7.0_51".






