On 21 Apr 2014, at 14:01, Andy Seaborne <[email protected]> wrote:

> I'd be interested in hearing in what ways the problem is different from SQL.
>
> Also - in SQL, there is LIKE. Would it be a good idea for SPARQL to have a
> separate "LIKE"
SPARQL 1.1 is good at addressing this. There are now three new functions: STRSTARTS, STRENDS and CONTAINS. These are all special cases of LIKE.

> (=> can a system do a lot better with that than analysing a regex?).

In theory, I think, yes. Instead of compiling a regex at all, you could perform simple string matching, or go even faster by building dedicated indices.

Many thanks for the details,
Saud

On 21 Apr 2014, at 14:01, Andy Seaborne <[email protected]> wrote:

> On 21/04/14 12:23, Saud Al-Jaloud wrote:
>>> Just use current releases.
>>
>> We are using current releases; we are not looking to tune systems,
>> but rather for the right configs, as this is some sort of an extension.
>> Otherwise, some might argue that we were unfair / missed features for
>> some stores over others. For example, buffer size or the way of
>> building the full-text index, etc. To some extent, we are trying to follow
>> the same rules as BSBM, see
>> http://wifo5-03.informatik.uni-mannheim.de/bizer/berlinsparqlbenchmark/spec/BenchmarkRules/
>>
>>> No relationship to http://www.ldbc.eu/?
>> Nope, this is part of my PhD, which is mainly about optimising regex
>> within SPARQL, but I also look at full-text search here.
>
> I'd be interested in hearing in what ways the problem is different from SQL.
>
> Also - in SQL, there is LIKE. Would it be a good idea for SPARQL to have a
> separate "LIKE" (=> can a system do a lot better with that than analysing a
> regex?).
>
>>> As it's fulltext, you have to use the custom features of each
>>> system so comparing like-with-like is going to be hard.
>> Indeed, there are a couple of challenges; that's why I find it an
>> interesting area.
>>
>>> ((That's really what stopped it getting standardized in SPARQL 1.1
>>> - it's a large piece of work (see xpath-full-text) and so it was
>>> this or most of the other features.))
>> I've also read your post,
>> http://mail-archives.apache.org/mod_mbox/jena-users/201306.mbox/%[email protected]%3E

Thanks for sharing such info.
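To illustrate the STRSTARTS/CONTAINS point above: a prefix test written with REGEX forces the engine to handle a general pattern, while the dedicated function is a plain string comparison that an optimiser can more easily turn into an index range scan. A hypothetical sketch (the predicate and search term are made up for illustration, not taken from the benchmark queries):

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Anchored regex: the engine may compile and run a full regex matcher.
SELECT ?s WHERE {
  ?s rdfs:label ?label .
  FILTER regex(?label, "^alum")
}

# Equivalent prefix test: a simple string comparison, no regex engine needed.
SELECT ?s WHERE {
  ?s rdfs:label ?label .
  FILTER STRSTARTS(?label, "alum")
}
```

Since regex() is case-sensitive by default, the two forms select the same solutions here.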
>> But, don't you think that even a common syntax can make a huge
>> difference, regardless of how stores internally implement it?
>
> It's an argument that came up when the SPARQL 1.1 WG was deciding what to do. I
> happen to agree that a common syntax would have been good, but others felt that if
> the text search language wasn't standardised as well, it was not a good use
> of the fixed amount of time we had. There is also an argument that what is
> really needed is a general extension mechanism (text, spatial, statistical
> analytics, ...) and again defining that is non-trivial.
>
> Standards involve compromises, as does working to a timescale with volunteers
> (not that we stuck to the timescale!).
>
>>> 1 million? Isn't that rather small? The whole thing will fit in
>>> memory. 1m BSBM fits completely in 1G RAM and TDB isn't very space
>>> efficient as it trades it for direct memory mapping.
>> That was just a test for the purpose of these emails, to make sure we
>> are doing things right. The test will target 200M, maybe more.
>>
>>> 3G should be enough. 20G will slow it down (a small amount given
>>> your hardware) as much of TDB's memory usage is outside the Java
>>> heap.
>> I'll take this for Jena.
>>
>>> 20G will slow it down
>> Generally, I thought the max won't affect the speed as long as it
>> isn't reached. It will reduce the number of GCs being performed, won't it?
>
> Not in the case of TDB, because the indexes are cached as memory-mapped files,
> outside the heap, so if you have a larger heap you have less index cache space.
>
> And the GC pauses get longer even if less frequent. A full GC happens
> sometime - see lots of big data blogs about the pain felt when the GC goes
> off into the weeds for seconds at a time.
>
>>> (Why not use SSDs? They are common these days. Does wonders
>>> for loading speed!)
>> I've seen some stores recommending them. Unfortunately, I've got no
>> control over this for now.
>> Just out of curiosity, within Jena, do you
>> think that the existing index structure, i.e. the B+tree, needs any changes
>> to get the best out of SSDs?
>
> Not as far as I know. They work much better on SSDs already. They are
> large-ish block size (8K - the trees are 150 to 200 way B+Trees). The TDB
> B+Trees are quite specialised - they only work with fixed-size keys and
> fixed-size values, making node search fast.
>
> In TDB, the places to look for optimizations are all the trends in modern
> DBs: e.g. design for in-memory use - the disk is just a backup and a way to
> move state across OS restarts. Many uses fit in RAM, or a very high %-age of
> the hot data is RAM-sized, on today's servers, so designing for that would be
> good.
>
> Multi-core execution. Multi-machine execution (Project Lizard is going that
> way).
>
> A big change would be to make the indexes use an MVCC design so that update
> in-place does not happen and transactions are single-write, not 2 writes
> (write to log, write to main DB sometime later) as in CouchDB or, recently,
> Apache Mavibot.
>
> The NodeTable is the better place to look for optimizations.
>
>     Andy
>
>>
>> Cheers,
>> Saud
>>
>> On 21 Apr 2014, at 10:38, Andy Seaborne <[email protected]> wrote:
>>
>>> On 20/04/14 17:13, Saud Aljaloud wrote:
>>>> Thanks Paul and Andy,
>>>>
>>>>> Why are you suggesting non-public?
>>>> The idea is that because we are benchmarking a number of triple
>>>> stores, and our choice is to ask each of them about the best
>>>> configurations privately for their own store, we want to reduce
>>>> the amount of core information about our work being publicly
>>>> available, i.e. the queries or statistics about other stores,
>>>> until we publish them all later at once. This being said, we can
>>>> discuss the general setup of Jena here.
>>>
>>> Unclear what anyone would do with such information ahead of
>>> publication unless it's to copy you and publish earlier. Just use
>>> current releases.
>>>
>>> No relationship to http://www.ldbc.eu/?
>>>
>>> As it's fulltext, you have to use the custom features of each
>>> system so comparing like-with-like is going to be hard.
>>>
>>> Each custom extension is going to have assumptions about usage - for
>>> Jena, you can use Solr and have other applications going to the
>>> same index; it's not a Jena-specific structure anymore (LARQ was).
>>> The text search languages have different capabilities.
>>>
>>> ((That's really what stopped it getting standardized in SPARQL 1.1
>>> - it's a large piece of work (see xpath-full-text) and so it was
>>> this or most of the other features.))
>>>
>>>>> The benchmark driver in their SVN repository is somewhat ahead
>>>>> of the last formal release
>>>> Thanks for pointing this out.
>>>
>>> I run a modified version that runs TDB locally to the benchmark
>>> driver, to benchmark just TDB and not the protocol component.
>>>
>>>>> There isn't much on matching parts of strings in BSBM.
>>>> BSBM has got a number of literal predicates of good/enough
>>>> length; see the predicates within the assembler below, or see
>>>> http://wifo5-03.informatik.uni-mannheim.de/bizer/berlinsparqlbenchmark/spec/Dataset/index.html
>>>>
>>>> Here are the steps we do to perform a test on 1 million triples:
>>>
>>> 1 million? Isn't that rather small? The whole thing will fit in
>>> memory. 1m BSBM fits completely in 1G RAM and TDB isn't very space
>>> efficient as it trades it for direct memory mapping.
>>>
>>>> =======================
>>>> Jena Configurations:
>>>> 1- edit fuseki-server: JVM_ARGS=${JVM_ARGS:--Xmx20G}
>>>
>>> 3G should be enough. 20G will slow it down (a small amount given
>>> your hardware) as much of TDB's memory usage is outside the Java
>>> heap.
>>>
>>> (Why not use SSDs? They are common these days. Does wonders
>>> for loading speed!)
>>>> =======================
>>>> Jena Configurations:
>>>> 1- edit fuseki-server: JVM_ARGS=${JVM_ARGS:--Xmx20G}
>>>>
>>>> 2- create an Assembler for Jena Text with Lucene, "BSBM-fulltext-1.ttl":
>>>>
>>>> ## Example of a TDB dataset and text index published using Fuseki
>>>> @prefix :          <http://localhost/jena_example/#> .
>>>> @prefix fuseki:    <http://jena.apache.org/fuseki#> .
>>>> @prefix rdf:       <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
>>>> @prefix rdfs:      <http://www.w3.org/2000/01/rdf-schema#> .
>>>> @prefix tdb:       <http://jena.hpl.hp.com/2008/tdb#> .
>>>> @prefix ja:        <http://jena.hpl.hp.com/2005/11/Assembler#> .
>>>> @prefix text:      <http://jena.apache.org/text#> .
>>>> @prefix rev:       <http://purl.org/stuff/rev#> .
>>>> @prefix bsbm:      <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/> .
>>>> @prefix bsbm-inst: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/> .
>>>> @prefix dc:        <http://purl.org/dc/elements/1.1/> .
>>>> @prefix foaf:      <http://xmlns.com/foaf/0.1/> .
>>>>
>>>> [] rdf:type fuseki:Server ;
>>>>    # Timeout - server-wide default: milliseconds.
>>>>    # Format 1: "1000" -- 1 second timeout
>>>>    # Format 2: "10000,60000" -- 10s timeout to first result, then 60s timeout for rest of query.
>>>>    # See javadoc for ARQ.queryTimeout
>>>>    # ja:context [ ja:cxtName "arq:queryTimeout" ; ja:cxtValue "10000" ] ;
>>>>    # ja:loadClass "your.code.Class" ;
>>>>    fuseki:services ( <#service_text_tdb> ) .
>>>>
>>>> # TDB
>>>> [] ja:loadClass "com.hp.hpl.jena.tdb.TDB" .
>>>> tdb:DatasetTDB rdfs:subClassOf ja:RDFDataset .
>>>> tdb:GraphTDB   rdfs:subClassOf ja:Model .
>>>>
>>>> # Text
>>>> [] ja:loadClass "org.apache.jena.query.text.TextQuery" .
>>>> text:TextDataset     rdfs:subClassOf ja:RDFDataset .
>>>> #text:TextIndexSolr  rdfs:subClassOf text:TextIndex .
>>>> text:TextIndexLucene rdfs:subClassOf text:TextIndex .
>>>>
>>>> ## ---------------------------------------------------------------
>>>>
>>>> <#service_text_tdb> rdf:type fuseki:Service ;
>>>>     rdfs:label                        "TDB/text service" ;
>>>>     fuseki:name                       "BSBM1M" ;
>>>>     fuseki:serviceQuery               "query" ;
>>>>     fuseki:serviceQuery               "sparql" ;
>>>>     fuseki:serviceUpdate              "update" ;
>>>>     fuseki:serviceUpload              "upload" ;
>>>>     fuseki:serviceReadGraphStore      "get" ;
>>>>     fuseki:serviceReadWriteGraphStore "data" ;
>>>>     fuseki:dataset                    :text_dataset ;
>>>>     .
>>>>
>>>> :text_dataset rdf:type text:TextDataset ;
>>>>     text:dataset <#dataset> ;
>>>>     ##text:index <#indexSolr> ;
>>>>     text:index   <#indexLucene> ;
>>>>     .
>>>>
>>>> <#dataset> rdf:type tdb:DatasetTDB ;
>>>>     tdb:location "/home/path/apache-jena-2.11.1/data" ;
>>>>     #tdb:unionDefaultGraph true ;
>>>>     .
>>>>
>>>> <#indexSolr> a text:TextIndexSolr ;
>>>>     #text:server <http://localhost:8983/solr/COLLECTION> ;
>>>>     text:server <embedded:SolrARQ> ;
>>>>     text:entityMap <#entMap> ;
>>>>     .
>>>>
>>>> <#indexLucene> a text:TextIndexLucene ;
>>>>     text:directory <file:/home/path/apache-jena-2.11.1/lucene> ;
>>>>     ##text:directory "mem" ;
>>>>     text:entityMap <#entMap> ;
>>>>     .
>>>>
>>>> <#entMap> a text:EntityMap ;
>>>>     text:entityField  "uri" ;
>>>>     text:defaultField "text" ;   ## Should be defined in the text:map.
>>>>     text:map (
>>>>          # rdfs:label
>>>>          [ text:field "text" ; text:predicate rdfs:label ]
>>>>          [ text:field "text" ; text:predicate rdfs:comment ]
>>>>          [ text:field "text" ; text:predicate foaf:name ]
>>>>          [ text:field "text" ; text:predicate dc:title ]
>>>>          [ text:field "text" ; text:predicate rev:text ]
>>>>          ) .
>>>>
>>>> =======================
>>>> Jena Test procedure with statistics for BSBM1M (one million triples),
>>>> using a machine with specs [1,2]:
>>>>
>>>> 1- Load data:
>>>>    ./tdbloader2 --loc ../data/ ~/bsbmtools-0.2/dataset_1M.ttl
>>>>    15:25:24 -- 35 seconds
>>>>    Size: 137M
>>>>
>>>> 2- Build the jena-text index:
>>>>    java -cp fuseki-server.jar jena.textindexer --desc=BSBM-fulltext-1.ttl
>>>>    INFO 31123 (3112 per second) properties indexed
>>>>    INFO 72657 (5589 per second) properties indexed
>>>>    Size: 17M
>>>>
>>>> 3- Flush OS memory and swap.
>>>>
>>>> 4- Run server:
>>>>    ./fuseki-server --config=BSBM-fulltext-1.ttl
>>>>
>>>> 5- Run test using the BSBM driver:
>>>>    ./testdriver -ucf usecases/literalSearch/fulltext/jena.txt -w 1000 -o Jena_1Client_BSBM1M.xml http://localhost:3030/BSBM1M/sparql
>>>>
>>>> =======================
>>>>
>>>> Any comments would be appreciated.
>>>>
>>>> Many thanks,
>>>> Saud
>>>>
>>>> [1] 2x AMD Opteron 4280 Processor (2.8GHz, 8C, 8M L2/8M L3 Cache, 95W),
>>>>     DDR3-1600MHz 128GB Memory for 2 CPUs (8x16GB Quad Rank LV RDIMMs) 1066MHz,
>>>>     2x 300GB SAS 6Gbps 3.5-in 15K RPM Hard Drive (Hot-plug),
>>>>     SAS 6/iR Controller for Hot Plug HDD Chassis, No Optical Drive,
>>>>     Redundant Power Supply (2 PSU) 500W, 2M Rack Power Cord C13/C14 12A,
>>>>     iDRAC6 Enterprise, Sliding Ready Rack Rails,
>>>>     C11 Hot-Swap - R0 for SAS 6iR, Min. 2 Max. 4 SAS/SATA Hot Plug Drives
>>>> [2] Red Hat Enterprise Linux Server release 6.4 (Santiago), java version "1.7.0_51"
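A note on step 1 of the configuration above: given Andy's point that 3G should be enough for this dataset and that TDB caches its indexes outside the Java heap, the fuseki-server edit would shrink the heap rather than grow it. A minimal sketch (the 3G figure is Andy's suggestion from this thread, not a tested optimum):

```shell
# In the fuseki-server script: give the JVM a modest heap and leave the
# rest of RAM to the OS page cache for TDB's memory-mapped index files.
JVM_ARGS=${JVM_ARGS:--Xmx3G}
```

Because of the `:-` default expansion, an explicitly exported JVM_ARGS still takes precedence over this setting.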

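For reference, with the entity map in the assembler above (all five predicates folded into the single default "text" field), a jena-text lookup against this service takes roughly the following shape. This is a hypothetical sketch, not one of the benchmark queries; the search term and the limit of 10 hits are invented for illustration:

```sparql
PREFIX text: <http://jena.apache.org/text#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?s ?label WHERE {
  # Lucene wildcard match on the default "text" field, at most 10 hits.
  ?s text:query ("alum*" 10) ;
     rdfs:label ?label .
}
```

The text:query property function hits the Lucene index first and then joins the resulting subjects against the TDB data, which is what the benchmark is measuring against plain FILTER regex variants.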