Re: Performance with very long strings - Re: large literals best practice?

2017-08-21 Thread Chris Tomlinson
Andy,

Thank you for the reply. I suspected that Jena/TDB might be targeted at 
somewhat different use cases. Is there a document somewhere that characterizes 
the assumptions about how Jena/TDB are expected to be used?

We'll explore our use case and let you know what we find.

Thank you again,
Chris


Re: Performance with very long strings - Re: large literals best practice?

2017-08-20 Thread Andy Seaborne
I don't have any experience running anything like this. I was hoping 
to learn from other people's experiences.


From a base-technology point of view, this isn't TDB's design centre, so 
there may be hot-spots. The only real way to know whether it is acceptable is 
to try an experiment. It will depend on what you want to do with the store.


With 230K blobs of 17 Kbytes each, SPARQL-searching the text (regex(), 
contains()) will be expensive.  So if that is a requirement, a text index 
is probably necessary whether you store the page content in RDF or not.
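
For instance, here is a minimal, self-contained sketch of the jena-text 
route, assuming Jena 3.x with the jena-text and Lucene modules on the 
classpath; the namespace, the :content property, and the Lucene field name 
are illustrative choices, not anything fixed by Jena:

import org.apache.jena.query.*;
import org.apache.jena.query.text.*;
import org.apache.jena.rdf.model.*;
import org.apache.lucene.store.RAMDirectory;

public class TextSearchSketch {
    static final String NS = "http://example.org/etext#";  // illustrative namespace

    public static void main(String[] args) {
        // Index literals of the :content property under the Lucene field "content".
        EntityDefinition entDef = new EntityDefinition("uri", "content",
                ResourceFactory.createProperty(NS + "content"));
        Dataset ds = TextDatasetFactory.createLucene(
                DatasetFactory.createTxnMem(), new RAMDirectory(),
                new TextIndexConfig(entDef));

        ds.begin(ReadWrite.WRITE);
        try {
            Model m = ds.getDefaultModel();
            m.createResource(NS + "Page_1")
             .addProperty(m.createProperty(NS + "content"),
                          ". . . imagine ~17KB of page text . . .");
            ds.commit();
        } finally { ds.end(); }

        // Index lookup: Lucene picks candidate pages before SPARQL sees them.
        String indexed = "PREFIX text: <http://jena.apache.org/text#> "
                + "PREFIX : <" + NS + "> "
                + "SELECT ?page { ?page text:query (:content 'imagine') }";
        // The full-scan alternative fetches and regex-matches every literal:
        //   SELECT ?page { ?page :content ?c . FILTER regex(?c, 'imagine') }

        ds.begin(ReadWrite.READ);
        try (QueryExecution qe = QueryExecutionFactory.create(indexed, ds)) {
            ResultSetFormatter.out(qe.execSelect());
        } finally { ds.end(); }
    }
}

The commented-out regex form has to fetch and scan every stored literal, so 
its cost grows with the total volume of text; the text:query form asks 
Lucene for the matching pages first.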


One area to watch will be the TDB node cache, the cache of internal TDB 
NodeId -> RDF term (Node) mappings. This cache is count-based, and does not 
consider the size of the items cached. It is going to keep pages cached, so 
it's going to use heap RAM, especially as Java characters are 2 bytes each 
(4GB of text is roughly 8GB of character data in memory).  There again, 
it's only 10G or so.


See the documentation for tuning caches:
https://jena.apache.org/documentation/tdb/store-parameters.html
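
As a hedged illustration of the file-based tuning that page describes, the 
sketch below writes smaller node-cache sizes into a database's tdb.cfg. The 
two parameter names follow the JSON record shown on the page; the sizes are 
arbitrary examples rather than recommendations, and the page is the 
authority on the full file format:

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class NodeCacheConfigSketch {
    public static void main(String[] args) throws Exception {
        // Smaller node caches pin fewer 17KB literals on the heap, at the
        // cost of more node-table reads. Write this before the store is opened.
        String cfg = String.join("\n",
                "{",
                "  \"tdb.node2nodeid_cache_size\" : 50000 ,",
                "  \"tdb.nodeid2node_cache_size\" : 50000",
                "}");
        Files.write(Paths.get("DB", "tdb.cfg"),
                    cfg.getBytes(StandardCharsets.UTF_8));
    }
}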

Andy


Performance with very long strings - Re: large literals best practice?

2017-08-19 Thread Chris Tomlinson
Hi again,

Is anyone aware of any issues that may arise when storing triples in TDB that 
have very large string literals (~17KB)?

The use case is illustrated below. This seems a reasonable question, given 
the common assumption that literals are small - names, titles, maybe 
summaries or abstracts - rather than entire pages of text.

Thanks,
Chris



large literals best practice?

2017-08-17 Thread Chris Tomlinson
Hello,

We have 23K texts averaging 10 pages/text (total pages: 229K) at ~17KB/page, 
for a total of about 4GB of text. These texts are currently indexed via 
Lucene in an XML database, and we would like to know whether there are any 
known issues with large literals in Jena.

In other words, we are considering storing the texts like:

:Text_08357 a :EText ;
    # ... various metadata about the EText ...
    :hasPage
      [ :pageNum 1 ;
        :content ". . . 17,000 Bytes . . ." ] ,
      [ :pageNum 2 ;
        :content ". . . 17,000 Bytes . . ." ] .
      # ... and so on, one blank node per page
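
A minimal sketch of writing one such page into TDB through the Jena API, 
assuming Jena 3.x; the namespace and property URIs are illustrative 
stand-ins:

import org.apache.jena.query.*;
import org.apache.jena.rdf.model.*;
import org.apache.jena.tdb.TDBFactory;
import org.apache.jena.vocabulary.RDF;

public class PageLoadSketch {
    static final String NS = "http://example.org/etext#";  // illustrative namespace

    public static void main(String[] args) {
        Dataset ds = TDBFactory.createDataset("DB");
        ds.begin(ReadWrite.WRITE);
        try {
            Model m = ds.getDefaultModel();
            Resource page = m.createResource()  // one blank node per page
                    .addLiteral(m.createProperty(NS + "pageNum"), 1L)
                    .addProperty(m.createProperty(NS + "content"),
                                 ". . . ~17KB of page text . . .");
            m.createResource(NS + "Text_08357")
             .addProperty(RDF.type, m.createResource(NS + "EText"))
             .addProperty(m.createProperty(NS + "hasPage"), page);
            ds.commit();
        } finally { ds.end(); }
    }
}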

We know that Lucene is happy with this data, but we're not sure whether 
Jena/TDB will be stressed by 229K triples carrying 17KB literals.

The jena-text module offers the possibility of indexing in Lucene via a 
separate process and just using the search in Jena, without actually storing 
the literals in TDB. This is a somewhat complex configuration, and we would 
prefer not to use it unless the size of the literals will present a problem.
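
For reference, a hedged sketch of that configuration, assuming Jena 3.x: an 
externally maintained Lucene index is attached to a TDB dataset, and the 
field names ("uri", "content") are assumptions that would have to match 
whatever the separate indexing process writes:

import java.nio.file.Paths;
import org.apache.jena.query.Dataset;
import org.apache.jena.query.text.*;
import org.apache.jena.rdf.model.ResourceFactory;
import org.apache.jena.tdb.TDBFactory;
import org.apache.lucene.store.FSDirectory;

public class ExternalIndexSketch {
    public static void main(String[] args) throws Exception {
        // TDB holds only the metadata triples; no :content literals go in.
        Dataset tdb = TDBFactory.createDataset("DB");

        // A Lucene index built and refreshed by the separate process.
        EntityDefinition entDef = new EntityDefinition("uri", "content",
                ResourceFactory.createProperty("http://example.org/etext#content"));
        Dataset ds = TextDatasetFactory.createLucene(tdb,
                FSDirectory.open(Paths.get("lucene-index")),
                new TextIndexConfig(entDef));
        // text:query patterns against ds are answered from Lucene;
        // everything else is answered from TDB.
    }
}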

Thank you,
Chris