Re: Performance with very long strings - Re: large literals best practice?

Andy Seaborne Sun, 20 Aug 2017 09:01:51 -0700

I don't have any experience running anything like this. I was hoping to learn from other people's experiences.

From a base-technology point of view, this isn't TDB's design centre, so there may be hot-spots. The only real way to know if it is acceptable is to try an experiment. It will depend on what you want to do with the store.

With 230K blobs of 17Kbytes, doing SPARQL searching of text (regex(), contains()) will be expensive. So if that is a requirement, a text index is probably necessary whether you store the page content in RDF or not.
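
To illustrate why a scan is costly and an index is not, here is a minimal Java sketch. It is not Jena or Lucene code; the class and method names are made up for illustration. The scan reads every page's full text on every query, which is roughly what a regex() or contains() filter has to do. The toy inverted index is built once and then answers a word query with a single map lookup.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class TextIndexSketch {
    // Linear scan: every query reads the full text of every page.
    // Note: this is a case-sensitive substring match.
    static List<Integer> scan(List<String> pages, String word) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < pages.size(); i++) {
            if (pages.get(i).contains(word)) {
                hits.add(i);
            }
        }
        return hits;
    }

    // Toy inverted index: lower-cased, whitespace-separated word -> page ids.
    // Built once; a real text index (e.g. Lucene) does far more analysis.
    static Map<String, List<Integer>> buildIndex(List<String> pages) {
        Map<String, List<Integer>> index = new HashMap<>();
        for (int i = 0; i < pages.size(); i++) {
            Set<String> words = new HashSet<>(
                Arrays.asList(pages.get(i).toLowerCase().split("\\s+")));
            for (String w : words) {
                if (!w.isEmpty()) {
                    index.computeIfAbsent(w, k -> new ArrayList<>()).add(i);
                }
            }
        }
        return index;
    }

    public static void main(String[] args) {
        // Tiny stand-in data; the real pages are ~17KB each.
        List<String> pages = Arrays.asList("the quick fox", "lazy dog", "Fox and dog");
        System.out.println("scan:  " + scan(pages, "dog"));
        System.out.println("index: " + buildIndex(pages).get("dog"));
    }
}
```

The two approaches are not equivalent (substring versus whole-word matching), so this only shows the difference in the amount of work per query.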

One area will be the TDB node cache, the cache of internal TDB NodeId->RDF term (Node). This is count-based and does not consider the size of the item cached. The cache is going to keep pages cached, so it's going to use heap RAM, especially as characters are 2 bytes. There again, it's only 10G or so.
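
As a rough back-of-envelope check of that figure, the following Java sketch works out the worst case where every page literal ends up in the cache. It assumes about one character per byte of stored text (which will not hold for all scripts and encodings) and 2 bytes per character on the heap, and it ignores per-object overhead. The class and method names are only for illustration.

```java
public class NodeCacheEstimate {
    // Heap taken by the character data of cached literals, ignoring object overhead.
    static long estimateHeapBytes(long cachedLiterals, long charsPerLiteral, int bytesPerChar) {
        return cachedLiterals * charsPerLiteral * bytesPerChar;
    }

    public static void main(String[] args) {
        long pages = 229_000L;        // total pages, from the original question
        long charsPerPage = 17_000L;  // assumption: ~1 character per byte of a 17KB page
        int bytesPerChar = 2;         // 2-byte characters on the heap

        long bytes = estimateHeapBytes(pages, charsPerPage, bytesPerChar);
        System.out.println("Worst-case bytes for cached page text: " + bytes);
    }
}
```

With these numbers the total is a little under 8 GB, which is in line with "10G or so" once overheads are allowed for. A count-based cache smaller than the number of pages would cap the figure proportionally.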


See the documentation for tuning caches:
https://jena.apache.org/documentation/tdb/store-parameters.html

    Andy

On 19/08/17 15:20, Chris Tomlinson wrote:

Hi again,

Is anyone aware of any issues that may arise when storing triples in TDB that 
have very large string literals (~17KB)?

The use case is illustrated below. This seems a reasonable question under the 
assumption that literals are presumed to be small - like names, titles, maybe 
summaries or abstracts and such, rather than entire pages of text.

Thanks,
Chris

On Aug 17, 2017, at 12:48 PM, Chris Tomlinson wrote:

Hello,

We have 23K texts averaging 10 pp/text (total pages: 229K) and ~17KB/page, for 
a total of 4GB of text. These texts are currently indexed via Lucene in an 
XMLdb and we’re wanting to know if there are any known issues regarding large 
literals in Jena.

In other words we are considering storing the texts like:

     :Text_08357 a :EText ;
         # ... various metadata about the EText ...
         :hasPage
           [ :pageNum 1 ;
             :content ". . . 17,000 Bytes . . ." ] ,
           [ :pageNum 2 ;
             :content ". . . 17,000 Bytes . . ." ] ,
           . . .

We know that Lucene is happy with this data, but we’re not sure whether 
Jena/TDB will be stressed with 229K triples with 17KB literals.

Jena-text offers the possibility of indexing in Lucene via a separate 
process and just using the search in Jena without actually storing the literals 
in TDB. This is a somewhat complex configuration, and we would prefer not to 
use this approach unless the size of the literals will present a problem.

Thank you,
Chris
