Re: Performance with very long strings - Re: large literals best practice?

Chris Tomlinson Mon, 21 Aug 2017 19:09:13 -0700

Andy,

Thank you for the reply. I suspected that Jena/TDB might be targeted at 
somewhat different use cases. Is there a document somewhere that describes 
the assumptions about how Jena/TDB is expected to be used?


We'll explore our use case and let you know what we find.

Thank you again,
Chris


> On Aug 20, 2017, at 11:00, Andy Seaborne <[email protected]> wrote:
> 
> I don't have any experience running anything like this. I was hoping to 
> learn from other people's experiences.
> 
> From a base-technology point of view, this isn't TDB's design centre, so there 
> may be hot-spots. The only real way to know if it is acceptable is to try an 
> experiment. It will depend on what you want to do with the store.
> 
> With 230K blobs of 17Kbytes, doing SPARQL-searching of text (regex(), 
> contains()) will be expensive. If that is a requirement, a text index is 
> probably necessary whether you store the page content in RDF or not.
> 
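To illustrate the point about regex()/contains() versus a text index, here is a minimal Python sketch. It is not Jena or Lucene code: the page ids and contents are hypothetical stand-ins, and the index is a toy in-memory inverted index. It only shows why a filter has to touch every literal on every query while an index answers with a lookup.

```python
import re
from collections import defaultdict

# Hypothetical stand-ins for page literals; real pages would be ~17 KB each.
pages = {
    "Text_08357/1": "the quick brown fox jumps over the lazy dog",
    "Text_08357/2": "a text index maps each term to the pages containing it",
    "Text_00001/1": "the lazy afternoon passed without a single fox in sight",
}

def scan_search(term):
    """Roughly what a regex() filter costs: every literal is examined per query."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b")
    return sorted(pid for pid, text in pages.items() if pattern.search(text))

def build_index(docs):
    """Build a term -> set-of-page-ids map once, up front."""
    index = defaultdict(set)
    for pid, text in docs.items():
        for token in re.findall(r"\w+", text.lower()):
            index[token].add(pid)
    return index

index = build_index(pages)

def index_search(term):
    """Answer a query with one dictionary lookup instead of a full scan."""
    return sorted(index.get(term.lower(), set()))

print(scan_search("fox"))
print(index_search("fox"))
```

Both calls return the same page ids, but the scan cost grows with the total amount of text (about 4GB here), while the lookup cost does not.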
> One area to watch will be the TDB node cache, the cache of internal TDB 
> NodeId -> RDF term (Node). It is count-based and does not consider the size 
> of the items cached. The cache is going to keep pages cached, so it's going 
> to use heap RAM, especially as Java characters are 2 bytes. Then again, it's 
> only 10G or so.
> 
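The "10G or so" figure can be checked with back-of-the-envelope arithmetic. The sketch below is an estimate under stated assumptions, not a measurement: it assumes roughly one character per stored byte, ignores per-object overhead, and uses a hypothetical cache entry count (CACHE_ENTRIES) that should be replaced with the value actually configured for the store.

```python
def cached_literal_heap_bytes(pages, bytes_per_page, cache_entries, bytes_per_char=2):
    """Rough upper bound on heap held by page literals in a count-based cache.

    Assumes one character per stored byte and ignores per-object overhead.
    """
    # A count-based cache holds at most cache_entries items, whatever their size.
    cached_pages = min(pages, cache_entries)
    chars_per_page = bytes_per_page  # assumption: 1 stored byte ~ 1 character
    return cached_pages * chars_per_page * bytes_per_char

PAGES = 229_000          # from the original post
BYTES_PER_PAGE = 17_000  # from the original post
CACHE_ENTRIES = 500_000  # hypothetical; use the configured cache size

estimate = cached_literal_heap_bytes(PAGES, BYTES_PER_PAGE, CACHE_ENTRIES)
print(f"{estimate / 1e9:.1f} GB")  # prints "7.8 GB"
```

Because the cache limit is a count, lowering it is the only lever on this number: with 100,000 entries the same formula gives 3.4 GB.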
> See the documentation for tuning caches:
> https://jena.apache.org/documentation/tdb/store-parameters.html
> 
>    Andy
> 
>> On 19/08/17 15:20, Chris Tomlinson wrote:
>> Hi again,
>> Is anyone aware of any issues that may arise when storing triples in TDB 
>> that have very large string literals (~17KB)?
>> The use case is illustrated below. This seems a reasonable question under 
>> the assumption that literals are presumed to be small - like names, titles, 
>> maybe summaries or abstracts and such, rather than entire pages of text.
>> Thanks,
>> Chris
>>> On Aug 17, 2017, at 12:48 PM, Chris Tomlinson <[email protected]> 
>>> wrote:
>>> 
>>> Hello,
>>> 
>>> We have 23K texts averaging 10 pp/text (total pages: 229K) and ~17KB/page, 
>>> for a total of 4GB of text. These texts are currently indexed via Lucene in 
>>> an XMLdb and we’d like to know if there are any known issues regarding 
>>> large literals in Jena.
>>> 
>>> In other words we are considering storing the texts like:
>>> 
>>>     :Text_08357 a :EText ;
>>>         # various metadata about the EText
>>>         :hasPage
>>>           [ :pageNum 1 ;
>>>             :content ". . . 17,000 Bytes . . ." ] ,
>>>           [ :pageNum 2 ;
>>>             :content ". . . 17,000 Bytes . . ." ] ,
>>>           . . .
>>> 
>>> We know that Lucene is happy with this data, but we’re not sure whether 
>>> Jena/TDB will be stressed with 229K triples with 17KB literals.
>>> 
>>> Jena-text offers the possibility of indexing in Lucene via a separate 
>>> process and just using the search in Jena without actually storing the 
>>> literals in TDB. This is a somewhat complex configuration and we would 
>>> prefer not to use this approach unless the size of the literals will 
>>> present a problem.
>>> 
>>> Thank you,
>>> Chris
>>> 
>>>
