On 21 Apr 2014, at 14:01, Andy Seaborne <[email protected]> wrote:

> I'd be interested in hearing in what ways the problem is different from SQL.
>
> Also - in SQL, there is LIKE. Would it be a good idea for SPARQL to have a
> separate "LIKE"
SPARQL 1.1 is good at addressing this. There are now three new functions: STRSTARTS, STRENDS and CONTAINS. These are all special cases of LIKE.

> (=> can a system do a lot better with that than analysing a regex?).

In theory, I think, yes. Instead of compiling a regex at all, you could perform simple string matching, or go even faster by building dedicated indices.

Many thanks for the details,
Saud

On 21 Apr 2014, at 14:01, Andy Seaborne <[email protected]> wrote:

> On 21/04/14 12:23, Saud Al-Jaloud wrote:
>>> Just use current releases.
>>
>> We are using current releases; we are not looking to tune systems,
>> but rather for the right configs, as this is some sort of an extension.
>> Otherwise, some might argue that we were unfair / missed features for
>> some stores over others. For example, buffer size or the way of
>> building the full-text index, etc. To some extent, we are trying to follow
>> the same rules as BSBM, see
>> http://wifo5-03.informatik.uni-mannheim.de/bizer/berlinsparqlbenchmark/spec/BenchmarkRules/
>>
>>> No relationship to http://www.ldbc.eu/?
>> Nope, this is part of my PhD, which is mainly about optimising regex
>> within SPARQL, but I also look at full-text search here.
>
> I'd be interested in hearing in what ways the problem is different from SQL.
>
> Also - in SQL, there is LIKE. Would it be a good idea for SPARQL to have a
> separate "LIKE" (=> can a system do a lot better with that than analysing a
> regex?).
>
>>> As it's fulltext, you have to use the custom features of each
>>> system so comparing like-with-like is going to be hard.
>> Indeed, there are a couple of challenges; that's why I find it an
>> interesting area.
>>
>>> ((That's really what stopped it getting standardized in SPARQL 1.1
>>> - it's a large piece of work (see xpath-full-text) and so it was
>>> this or most of the other features.))
>> I've also read your post,
>> http://mail-archives.apache.org/mod_mbox/jena-users/201306.mbox/%[email protected]%3E

Thanks for sharing such info.
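To illustrate the STRSTARTS/CONTAINS point above: a prefix test written with REGEX forces the engine to handle a general pattern, while the dedicated function is a plain string comparison that an optimiser can more easily turn into an index range scan. A hypothetical sketch (the predicate and search term are made up for illustration, not taken from the benchmark queries):

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Anchored regex: the engine may compile and run a full regex matcher.
SELECT ?s WHERE {
  ?s rdfs:label ?label .
  FILTER regex(?label, "^alum")
}

# Equivalent prefix test: a simple string comparison, no regex engine needed.
SELECT ?s WHERE {
  ?s rdfs:label ?label .
  FILTER STRSTARTS(?label, "alum")
}
```

Since regex() is case-sensitive by default, the two forms select the same solutions here.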
>> But, don't you think that even a common syntax can make a huge
>> difference, regardless of how stores internally implement it?
>
> It's an argument that came up when the SPARQL 1.1 WG was deciding what to do. I
> happen to agree that a common syntax would have been good, but others felt that if
> the text search language wasn't standardised as well, it was not a good use
> of the fixed amount of time we had. There is also an argument that what is
> really needed is a general extension mechanism (text, spatial, statistical
> analytics, ...) and again defining that is non-trivial.
>
> Standards involve compromises, as does working to a timescale with volunteers
> (not that we stuck to the timescale!).
>
>>> 1 million? Isn't that rather small? The whole thing will fit in
>>> memory. 1m BSBM fits completely in 1G RAM and TDB isn't very space
>>> efficient as it trades it for direct memory mapping.
>> That was just a test for the purpose of these emails, to make sure we
>> are doing things right. The test will target 200M, maybe more.
>>
>>> 3G should be enough. 20G will slow it down (a small amount given
>>> your hardware) as much of TDB's memory usage is outside the Java
>>> heap.
>> I'll take this for Jena.
>>
>>> 20G will slow it down
>> Generally, I thought the max won't affect the speed as long as it
>> isn't reached. It will reduce the number of GCs being performed, won't it?
>
> Not in the case of TDB, because the indexes are cached as memory-mapped files,
> outside the heap, so if you have a larger heap you have less index cache space.
>
> And the GC pauses get longer even if less frequent. A full GC happens
> sometime - see lots of big data blogs about the pain felt when the GC goes
> off into the weeds for seconds at a time.
>
>>> (Why not use SSDs? They are common these days. Does wonders
>>> for loading speed!)
>> I've seen some stores recommending them. Unfortunately, I've got no
>> control over this for now.
>> Just out of curiosity, within Jena, do you
>> think that the existing index structure, i.e. the B+tree, needs any changes
>> to get the best out of SSDs?
>
> Not as far as I know. They work much better on SSDs already. They are
> large-ish block size (8K - the trees are 150 to 200 way B+Trees). The TDB
> B+Trees are quite specialised - they only work with fixed-size keys and
> fixed-size values, making node search fast.
>
> In TDB, the places to look for optimizations are all the trends in modern
> DBs: e.g. design for in-memory use - the disk is just a backup and a way to
> move state across OS restarts. Many uses fit in RAM, or a very high %-age of
> the hot data is RAM-sized, on today's servers, so designing for that would be
> good.
>
> Multi-core execution. Multi-machine execution (Project Lizard is going that
> way).
>
> A big change would be to make the indexes use an MVCC design so that update
> in-place does not happen and transactions are single-write, not 2 writes
> (write to log, write to main DB sometime later) as in CouchDB or, recently,
> Apache Mavibot.
>
> The NodeTable is the better place to look for optimizations.
>
>     Andy
>
>>
>> Cheers,
>> Saud
>>
>> On 21 Apr 2014, at 10:38, Andy Seaborne <[email protected]> wrote:
>>
>>> On 20/04/14 17:13, Saud Aljaloud wrote:
>>>> Thanks Paul and Andy,
>>>>
>>>>> Why are you suggesting non-public?
>>>> The idea is that because we are benchmarking a number of triple
>>>> stores, and our choice is to ask each of them about the best
>>>> configurations privately for their own store, we want to reduce
>>>> the amount of core information about our work being publicly
>>>> available, i.e. the queries or statistics about other stores,
>>>> until we publish them all later at once. This being said, we can
>>>> discuss the general setup of Jena here.
>>>
>>> Unclear what anyone would do with such information ahead of
>>> publication unless it's to copy you and publish earlier. Just use
>>> current releases.
>>>
>>> No relationship to http://www.ldbc.eu/?
>>>
>>> As it's fulltext, you have to use the custom features of each
>>> system so comparing like-with-like is going to be hard.
>>>
>>> Each custom extension is going to have assumptions about usage - for
>>> Jena, you can use Solr and have other applications going to the
>>> same index; it's not a Jena-specific structure anymore (LARQ was).
>>> The text search languages have different capabilities.
>>>
>>> ((That's really what stopped it getting standardized in SPARQL 1.1
>>> - it's a large piece of work (see xpath-full-text) and so it was
>>> this or most of the other features.))
>>>
>>>>> The benchmark driver in their SVN repository is somewhat ahead
>>>>> of the last formal release
>>>> Thanks for pointing this out.
>>>
>>> I run a modified version that runs TDB locally to the benchmark
>>> driver, to benchmark just TDB and not the protocol component.
>>>
>>>>> There isn't much on matching parts of strings in BSBM.
>>>> BSBM has got a number of literal predicates of good/enough
>>>> length; see the predicates within the assembler below, or see
>>>> http://wifo5-03.informatik.uni-mannheim.de/bizer/berlinsparqlbenchmark/spec/Dataset/index.html
>>>>
>>>> Here are the steps we do to perform a test on 1 million triples:
>>>
>>> 1 million? Isn't that rather small? The whole thing will fit in
>>> memory. 1m BSBM fits completely in 1G RAM and TDB isn't very space
>>> efficient as it trades it for direct memory mapping.
>>>
>>>> =======================
>>>> Jena Configurations:
>>>> 1- edit fuseki-server: JVM_ARGS=${JVM_ARGS:--Xmx20G}
>>>
>>> 3G should be enough. 20G will slow it down (a small amount given
>>> your hardware) as much of TDB's memory usage is outside the Java
>>> heap.
>>>
>>> (Why not use SSDs? They are common these days. Does wonders
>>> for loading speed!)
>>>> =======================
>>>> Jena Configurations:
>>>> 1- edit fuseki-server: JVM_ARGS=${JVM_ARGS:--Xmx20G}
>>>>
>>>> 2- create an Assembler for Jena Text with Lucene, "BSBM-fulltext-1.ttl":
>>>>
>>>> ## Example of a TDB dataset and text index published using Fuseki
>>>> @prefix :          <http://localhost/jena_example/#> .
>>>> @prefix fuseki:    <http://jena.apache.org/fuseki#> .
>>>> @prefix rdf:       <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
>>>> @prefix rdfs:      <http://www.w3.org/2000/01/rdf-schema#> .
>>>> @prefix tdb:       <http://jena.hpl.hp.com/2008/tdb#> .
>>>> @prefix ja:        <http://jena.hpl.hp.com/2005/11/Assembler#> .
>>>> @prefix text:      <http://jena.apache.org/text#> .
>>>> @prefix rev:       <http://purl.org/stuff/rev#> .
>>>> @prefix bsbm:      <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/> .
>>>> @prefix bsbm-inst: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/> .
>>>> @prefix dc:        <http://purl.org/dc/elements/1.1/> .
>>>> @prefix foaf:      <http://xmlns.com/foaf/0.1/> .
>>>>
>>>> [] rdf:type fuseki:Server ;
>>>>    # Timeout - server-wide default: milliseconds.
>>>>    # Format 1: "1000" -- 1 second timeout
>>>>    # Format 2: "10000,60000" -- 10s timeout to first result, then 60s timeout for rest of query.
>>>>    # See javadoc for ARQ.queryTimeout
>>>>    # ja:context [ ja:cxtName "arq:queryTimeout" ; ja:cxtValue "10000" ] ;
>>>>    # ja:loadClass "your.code.Class" ;
>>>>    fuseki:services ( <#service_text_tdb> ) .
>>>>
>>>> # TDB
>>>> [] ja:loadClass "com.hp.hpl.jena.tdb.TDB" .
>>>> tdb:DatasetTDB rdfs:subClassOf ja:RDFDataset .
>>>> tdb:GraphTDB   rdfs:subClassOf ja:Model .
>>>>
>>>> # Text
>>>> [] ja:loadClass "org.apache.jena.query.text.TextQuery" .
>>>> text:TextDataset     rdfs:subClassOf ja:RDFDataset .
>>>> #text:TextIndexSolr  rdfs:subClassOf text:TextIndex .
>>>> text:TextIndexLucene rdfs:subClassOf text:TextIndex .
>>>>
>>>> ## ---------------------------------------------------------------
>>>>
>>>> <#service_text_tdb> rdf:type fuseki:Service ;
>>>>     rdfs:label                        "TDB/text service" ;
>>>>     fuseki:name                       "BSBM1M" ;
>>>>     fuseki:serviceQuery               "query" ;
>>>>     fuseki:serviceQuery               "sparql" ;
>>>>     fuseki:serviceUpdate              "update" ;
>>>>     fuseki:serviceUpload              "upload" ;
>>>>     fuseki:serviceReadGraphStore      "get" ;
>>>>     fuseki:serviceReadWriteGraphStore "data" ;
>>>>     fuseki:dataset                    :text_dataset ;
>>>>     .
>>>>
>>>> :text_dataset rdf:type text:TextDataset ;
>>>>     text:dataset <#dataset> ;
>>>>     ##text:index <#indexSolr> ;
>>>>     text:index   <#indexLucene> ;
>>>>     .
>>>>
>>>> <#dataset> rdf:type tdb:DatasetTDB ;
>>>>     tdb:location "/home/path/apache-jena-2.11.1/data" ;
>>>>     #tdb:unionDefaultGraph true ;
>>>>     .
>>>>
>>>> <#indexSolr> a text:TextIndexSolr ;
>>>>     #text:server <http://localhost:8983/solr/COLLECTION> ;
>>>>     text:server <embedded:SolrARQ> ;
>>>>     text:entityMap <#entMap> ;
>>>>     .
>>>>
>>>> <#indexLucene> a text:TextIndexLucene ;
>>>>     text:directory <file:/home/path/apache-jena-2.11.1/lucene> ;
>>>>     ##text:directory "mem" ;
>>>>     text:entityMap <#entMap> ;
>>>>     .
>>>>
>>>> <#entMap> a text:EntityMap ;
>>>>     text:entityField  "uri" ;
>>>>     text:defaultField "text" ;   ## Should be defined in the text:map.
>>>>     text:map (
>>>>          # rdfs:label
>>>>          [ text:field "text" ; text:predicate rdfs:label ]
>>>>          [ text:field "text" ; text:predicate rdfs:comment ]
>>>>          [ text:field "text" ; text:predicate foaf:name ]
>>>>          [ text:field "text" ; text:predicate dc:title ]
>>>>          [ text:field "text" ; text:predicate rev:text ]
>>>>          ) .
>>>>
>>>> =======================
>>>> Jena Test procedure with statistics for BSBM1M (one million triples),
>>>> using a machine with specs [1,2]:
>>>>
>>>> 1- Load data:
>>>>    ./tdbloader2 --loc ../data/ ~/bsbmtools-0.2/dataset_1M.ttl
>>>>    15:25:24 -- 35 seconds
>>>>    Size: 137M
>>>>
>>>> 2- Build the jena-text index:
>>>>    java -cp fuseki-server.jar jena.textindexer --desc=BSBM-fulltext-1.ttl
>>>>    INFO 31123 (3112 per second) properties indexed
>>>>    INFO 72657 (5589 per second) properties indexed
>>>>    Size: 17M
>>>>
>>>> 3- Flush OS memory and swap.
>>>>
>>>> 4- Run server:
>>>>    ./fuseki-server --config=BSBM-fulltext-1.ttl
>>>>
>>>> 5- Run test using the BSBM driver:
>>>>    ./testdriver -ucf usecases/literalSearch/fulltext/jena.txt -w 1000 -o Jena_1Client_BSBM1M.xml http://localhost:3030/BSBM1M/sparql
>>>>
>>>> =======================
>>>>
>>>> Any comments would be appreciated.
>>>>
>>>> Many thanks,
>>>> Saud
>>>>
>>>> [1] 2x AMD Opteron 4280 Processor (2.8GHz, 8C, 8M L2/8M L3 Cache, 95W),
>>>>     DDR3-1600MHz 128GB Memory for 2 CPUs (8x16GB Quad Rank LV RDIMMs) 1066MHz,
>>>>     2x 300GB SAS 6Gbps 3.5-in 15K RPM Hard Drive (Hot-plug),
>>>>     SAS 6/iR Controller for Hot Plug HDD Chassis, No Optical Drive,
>>>>     Redundant Power Supply (2 PSU) 500W, 2M Rack Power Cord C13/C14 12A,
>>>>     iDRAC6 Enterprise, Sliding Ready Rack Rails,
>>>>     C11 Hot-Swap - R0 for SAS 6iR, Min. 2 Max. 4 SAS/SATA Hot Plug Drives
>>>> [2] Red Hat Enterprise Linux Server release 6.4 (Santiago), java version "1.7.0_51"
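A note on step 1 of the configuration above: given Andy's point that 3G should be enough for this dataset and that TDB caches its indexes outside the Java heap, the fuseki-server edit would shrink the heap rather than grow it. A minimal sketch (the 3G figure is Andy's suggestion from this thread, not a tested optimum):

```shell
# In the fuseki-server script: give the JVM a modest heap and leave the
# rest of RAM to the OS page cache for TDB's memory-mapped index files.
JVM_ARGS=${JVM_ARGS:--Xmx3G}
```

Because of the `:-` default expansion, an explicitly exported JVM_ARGS still takes precedence over this setting.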

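For reference, with the entity map in the assembler above (all five predicates folded into the single default "text" field), a jena-text lookup against this service takes roughly the following shape. This is a hypothetical sketch, not one of the benchmark queries; the search term and the limit of 10 hits are invented for illustration:

```sparql
PREFIX text: <http://jena.apache.org/text#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?s ?label WHERE {
  # Lucene wildcard match on the default "text" field, at most 10 hits.
  ?s text:query ("alum*" 10) ;
     rdfs:label ?label .
}
```

The text:query property function hits the Lucene index first and then joins the resulting subjects against the TDB data, which is what the benchmark is measuring against plain FILTER regex variants.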