Re: Jena and Spark and Elephas

2016-12-22 Thread Dick Murray
On 22 Dec 2016 8:14 pm, "Andy Seaborne" <a...@apache.org> wrote:



On 22/12/16 14:48, Joint wrote:

>
>
> Hi Andy.
> I noticed the WIP status. How does the data get separated from the
> prefixes when it's in the same file? Unless the file gets split...
>

Exactly!

In some systems splitting is not controlled by the app but by the
distributor of data. It sounds like in yours you are writing large
files to the persistent layer, correct?


Yes, we have currently written ~1500 triple files, each containing between 1M
and 4M triples, with a few pushing 9M triples. The file name is the graph name.
We have ~100k unique properties per file; large files simply have more objects.
The equivalent TDB is at 1.5TB... Our ETL can process asynchronously, which is
helping, and the Spark side seems to be working :-)
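For anyone following along, here is a minimal sketch of how one of those per-graph
N-Triples files could be pulled into a Spark RDD with Elephas. The paths, app name
and local master are illustrative assumptions; it relies on the jena-elephas-io
NTriplesInputFormat and the Spark Java API:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.jena.graph.Triple;
    import org.apache.jena.hadoop.rdf.io.input.ntriples.NTriplesInputFormat;
    import org.apache.jena.hadoop.rdf.types.TripleWritable;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class LoadGraphFile {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("jena-elephas-load").setMaster("local[*]");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                // One file per graph; the file name doubles as the graph name.
                JavaRDD<Triple> graph = sc
                        .newAPIHadoopFile("triples/graph-0001.nt",
                                NTriplesInputFormat.class, LongWritable.class,
                                TripleWritable.class, new Configuration())
                        .values()
                        // Unwrap to plain Jena Triples straight away
                        // (Hadoop may reuse the Writable containers).
                        .map(TripleWritable::get);
                System.out.println("Triples loaded: " + graph.count());
            }
        }
    }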



> My reader requires a prefix file when the triple file is opened, otherwise
> it throws an exception. Thus the data file can be split as long as the
> prefix file is along for the ride.
> Is the Elephas compression applied in memory, on disk, or both? Our compression
> requirement is driven by the deployment environment, which charges for the
> disk writes to the SLA storage.
>

Not sure.


> The RDDs are free as they go into memory or local disk. Effectively I'm
> persisting a dataset as a group of triple files which get loaded into RDDs
> and processed as a read-only dataset. This also allows us to perform
> multiple graph writes so we can load data in parallel.
>
> Dick
>
>  Original message 
> From: Andy Seaborne <a...@apache.org>
> Date: 22/12/2016  11:30  (GMT+00:00)
> To: users@jena.apache.org
> Subject: Re: Jena and Spark and Elephas
>
> RDF-Patch update is WIP - in fact one of the potential things is
> removing the prefix stuff. (Reasons: data gets separated from its
> prefixes; we want to add prefix management of the data to RDF Patch.)
>
> Elephas has various compression options. Would any of these work for you?
>
> I find that compressing N-Triples gives 5x to 10x compression, so applied
> to RDD data I'd expect that or more.
>
> There are line based output formats (I don't know if they work with
> Elephas - no reason why not in principle).
>
> http://jena.apache.org/documentation/io/rdf-output.html#line-printed-formats
>
> See RDFFormat TURTLE_FLAT.
>
> Just don't lose the prefixes!
>
>  Andy
>
>
>
> On 21/12/16 20:46, Dick Murray wrote:
>
>> So basically I've got RDF Patch with a default A which I use to build the
>> Apache Spark RDD...
>>
>> A quick Google got me a git master last updated 4 years ago, but no code;
>> the thread says Andy is using the code...?
>>
>> Like you said probably one for Andy.
>>
>> Thanks for the pointer.
>>
>> On 21 Dec 2016 19:59, "A. Soroka" <aj...@virginia.edu> wrote:
>>
>> Andy can say more, but RDF Patch may be heading in a direction where it
>> could be used for such a purpose:
>>
>> https://lists.apache.org/thread.html/79e0fbd41126a1d8d0b2fb3b7b837d0d1d58d568a3583701b366cfcc@%3Cdev.jena.apache.org%3E
>>
>> ---
>> A. Soroka
>> The University of Virginia Library
>>
>> On Dec 21, 2016, at 2:17 PM, Dick Murray <dandh...@gmail.com> wrote:
>>>
>>> Hi, in a similar vein I have a modified NTriple reader which uses a prefix
>>> file to reduce the file size. Whilst the serialisation allows parallel
>>> processing in Spark, the file sizes were large and this has reduced them to
>>> 1/10 the original size on average.
>>>
>>> There isn't an existing line-based serialisation with some form of
>>> prefixing, is there?
>>>
>>> On 17 Dec 2016 20:03, "Andy Seaborne" <a...@apache.org> wrote:
>>>
>>> Related:
>>>>
>>>> Jena now provides "Serializable" for Triple/Quad/Node
>>>>
>>>> It did not make 3.1.1, it's in development snapshots and in the next
>>>> release.
>>>>
>>>> Use with spark was the original motivation.
>>>>
>>>> Andy
>>>>
>>>> https://issues.apache.org/jira/browse/JENA-1233
>>>>
>>>> On 17/12/16 09:14, Joint wrote:
>>>>
>>>>
>>>>>
>>>>> Hi.
>>>>> I was about to use the above to wrap some quads and spoof the RDDs as
>>>>> graphs from within a dataset but before I do, has this been done before? I
>>>>> have some code which calls the RDD methods from the graph base find. Not
>>>>> wanting to reinvent the wheel and such...
>>>>>
>>>>>
>>>>> Dick
>>>>>
>>>>>
>>>>>
>>


Re: Jena and Spark and Elephas

2016-12-22 Thread Andy Seaborne



On 22/12/16 14:48, Joint wrote:



Hi Andy.
I noticed the WIP status. How does the data get separated from the prefixes
when it's in the same file? Unless the file gets split...


Exactly!

In some systems splitting is not controlled by the app but by the 
distributor of data. It sounds like in yours you are writing large
files to the persistent layer, correct?



My reader requires a prefix file when the triple file is opened, otherwise it
throws an exception. Thus the data file can be split as long as the prefix file
is along for the ride.
Is the Elephas compression applied in memory, on disk, or both? Our compression
requirement is driven by the deployment environment, which charges for the disk
writes to the SLA storage.


Not sure.


The RDDs are free as they go into memory or local disk. Effectively I'm 
persisting a dataset as a group of triple files which get loaded into RDDs and 
processed as a read-only dataset. This also allows us to perform multiple graph 
writes so we can load data in parallel.

Dick

 Original message 
From: Andy Seaborne <a...@apache.org>
Date: 22/12/2016  11:30  (GMT+00:00)
To: users@jena.apache.org
Subject: Re: Jena and Spark and Elephas

RDF-Patch update is WIP - in fact one of the potential things is
removing the prefix stuff. (Reasons: data gets separated from its
prefixes; we want to add prefix management of the data to RDF Patch.)

Elephas has various compression options. Would any of these work for you?

I find that compressing N-Triples gives 5x to 10x compression, so applied
to RDD data I'd expect that or more.

There are line based output formats (I don't know if they work with
Elephas - no reason why not in principle).

http://jena.apache.org/documentation/io/rdf-output.html#line-printed-formats

See RDFFormat TURTLE_FLAT.

Just don't lose the prefixes!

 Andy



On 21/12/16 20:46, Dick Murray wrote:

So basically I've got RDF Patch with a default A which I use to build the
Apache Spark RDD...

A quick Google got me a git master last updated 4 years ago, but no code;
the thread says Andy is using the code...?

Like you said probably one for Andy.

Thanks for the pointer.

On 21 Dec 2016 19:59, "A. Soroka" <aj...@virginia.edu> wrote:

Andy can say more, but RDF Patch may be heading in a direction where it
could be used for such a purpose:

https://lists.apache.org/thread.html/79e0fbd41126a1d8d0b2fb3b7b837d0d1d58d568a3583701b366cfcc@%3Cdev.jena.apache.org%3E

---
A. Soroka
The University of Virginia Library


On Dec 21, 2016, at 2:17 PM, Dick Murray <dandh...@gmail.com> wrote:

Hi, in a similar vein I have a modified NTriple reader which uses a prefix
file to reduce the file size. Whilst the serialisation allows parallel
processing in Spark, the file sizes were large and this has reduced them to
1/10 the original size on average.

There isn't an existing line-based serialisation with some form of
prefixing, is there?

On 17 Dec 2016 20:03, "Andy Seaborne" <a...@apache.org> wrote:


Related:

Jena now provides "Serializable" for Triple/Quad/Node

It did not make 3.1.1, it's in development snapshots and in the next
release.

Use with spark was the original motivation.

Andy

https://issues.apache.org/jira/browse/JENA-1233

On 17/12/16 09:14, Joint wrote:




Hi.
I was about to use the above to wrap some quads and spoof the RDDs as
graphs from within a dataset but before I do, has this been done before? I
have some code which calls the RDD methods from the graph base find. Not
wanting to reinvent the wheel and such...


Dick






Re: Jena and Spark and Elephas

2016-12-22 Thread Joint


Hi Andy.
I noticed the WIP status. How does the data get separated from the prefixes
when it's in the same file? Unless the file gets split... My reader requires a
prefix file when the triple file is opened, otherwise it throws an exception.
Thus the data file can be split as long as the prefix file is along for the
ride.
Is the Elephas compression applied in memory, on disk, or both? Our compression
requirement is driven by the deployment environment, which charges for the disk
writes to the SLA storage. The RDDs are free as they go into memory or local
disk. Effectively I'm persisting a dataset as a group of triple files which get
loaded into RDDs and processed as a read-only dataset. This also allows us to
perform multiple graph writes, so we can load data in parallel.
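For anyone curious, a rough sketch of the prefix-file idea described above. It is
purely illustrative - the real reader is Dick's own code, and the prefix-file
layout, the file names and the simple regex expansion are all assumptions:

    import java.io.ByteArrayInputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.apache.jena.riot.Lang;
    import org.apache.jena.riot.RDFDataMgr;

    public class PrefixedTripleReader {

        // Loads "prefix namespace" pairs, e.g. "ex http://example.org/".
        static Map<String, String> loadPrefixes(String prefixFile) throws IOException {
            Map<String, String> prefixes = new HashMap<>();
            for (String line : Files.readAllLines(Paths.get(prefixFile))) {
                String[] parts = line.trim().split("\\s+", 2);
                if (parts.length == 2) prefixes.put(parts[0] + ":", parts[1]);
            }
            return prefixes;
        }

        // Expands "ex:thing" back to "<http://example.org/thing>" so each line is plain N-Triples.
        static String expand(String line, Map<String, String> prefixes) {
            for (Map.Entry<String, String> e : prefixes.entrySet()) {
                line = line.replaceAll("(?<!<)" + e.getKey() + "(\\S+)", "<" + e.getValue() + "$1>");
            }
            return line;
        }

        public static void main(String[] args) throws IOException {
            Map<String, String> prefixes = loadPrefixes("prefixes.txt");   // must travel with the data file
            StringBuilder expanded = new StringBuilder();
            for (String line : Files.readAllLines(Paths.get("data.pnt"))) {
                expanded.append(expand(line, prefixes)).append('\n');
            }
            Model model = ModelFactory.createDefaultModel();
            RDFDataMgr.read(model, new ByteArrayInputStream(
                    expanded.toString().getBytes(StandardCharsets.UTF_8)), Lang.NTRIPLES);
            System.out.println("Parsed " + model.size() + " triples");
        }
    }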

Dick

 Original message 
From: Andy Seaborne <a...@apache.org> 
Date: 22/12/2016  11:30  (GMT+00:00) 
To: users@jena.apache.org 
Subject: Re: Jena and Spark and Elephas 

RDF-Patch update is WIP - in fact one of the potential things is 
removing the prefix stuff. (Reasons: data gets separated from its 
prefixes; we want to add prefix management of the data to RDF Patch.)

Elephas has various compression options. Would any of these work for you?

I find that compressing N-Triples gives 5x to 10x compression, so applied
to RDD data I'd expect that or more.

There are line based output formats (I don't know if they work with 
Elephas - no reason why not in principle).

http://jena.apache.org/documentation/io/rdf-output.html#line-printed-formats

See RDFFormat TURTLE_FLAT.

Just don't lose the prefixes!

 Andy



On 21/12/16 20:46, Dick Murray wrote:
> So basically I've got RDF Patch with a default A which I use to build the
> Apache Spark RDD...
>
> A quick Google got me a git master last updated 4 years ago, but no code;
> the thread says Andy is using the code...?
>
> Like you said probably one for Andy.
>
> Thanks for the pointer.
>
> On 21 Dec 2016 19:59, "A. Soroka" <aj...@virginia.edu> wrote:
>
> Andy can say more, but RDF Patch may be heading in a direction where it
> could be used for such a purpose:
>
> https://lists.apache.org/thread.html/79e0fbd41126a1d8d0b2fb3b7b837d0d1d58d568a3583701b366cfcc@%3Cdev.jena.apache.org%3E
>
> ---
> A. Soroka
> The University of Virginia Library
>
>> On Dec 21, 2016, at 2:17 PM, Dick Murray <dandh...@gmail.com> wrote:
>>
>> Hi, in a similar vein I have a modified NTriple reader which uses a prefix
>> file to reduce the file size. Whilst the serialisation allows parallel
>> processing in Spark, the file sizes were large and this has reduced them to
>> 1/10 the original size on average.
>>
>> There isn't an existing line-based serialisation with some form of
>> prefixing, is there?
>>
>> On 17 Dec 2016 20:03, "Andy Seaborne" <a...@apache.org> wrote:
>>
>>> Related:
>>>
>>> Jena now provides "Serializable" for Triple/Quad/Node
>>>
>>> It did not make 3.1.1, it's in development snapshots and in the next
>>> release.
>>>
>>> Use with spark was the original motivation.
>>>
>>>    Andy
>>>
>>> https://issues.apache.org/jira/browse/JENA-1233
>>>
>>> On 17/12/16 09:14, Joint wrote:
>>>
>>>>
>>>>
>>>> Hi.
>>>> I was about to use the above to wrap some quads and spoof the RDDs as
>>>> graphs from within a dataset but before I do, has this been done before? I
>>>> have some code which calls the RDD methods from the graph base find. Not
>>>> wanting to reinvent the wheel and such...
>>>>
>>>>
>>>> Dick
>>>>
>>>>
>


Re: Jena and Spark and Elephas

2016-12-22 Thread Andy Seaborne
RDF-Patch update is WIP - in fact one of the potential things is 
removing the prefix stuff. (Reasons: data gets separated from its 
prefixes; we want to add prefix management of the data to RDF Patch.)


Elephas has various compression options. Would any of these work for you?
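If it helps: the Elephas output formats are, as far as I can tell, ordinary Hadoop
FileOutputFormats, so the standard Hadoop output-compression settings should apply.
A sketch only - the job wiring and the NTriplesOutputFormat class name are my
assumptions, and the mapper/reducer side is omitted:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.BZip2Codec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.jena.hadoop.rdf.io.output.ntriples.NTriplesOutputFormat;

    public class CompressedNTriplesOutput {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "ntriples-compressed");
            job.setOutputFormatClass(NTriplesOutputFormat.class);
            // Standard Hadoop output compression; bzip2 stays splittable,
            // gzip gives slightly better ratios but is not splittable.
            FileOutputFormat.setCompressOutput(job, true);
            FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);
            FileOutputFormat.setOutputPath(job, new Path("output/"));
            // ... mapper/reducer and input configuration omitted ...
        }
    }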

I find that compressing N-Triples gives 5x to 10x compression, so applied
to RDD data I'd expect that or more.
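That ratio is easy to sanity-check outside Hadoop with plain Jena by writing
through a GZIPOutputStream (the file names are just placeholders):

    import java.io.FileOutputStream;
    import java.io.OutputStream;
    import java.util.zip.GZIPOutputStream;

    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.riot.RDFDataMgr;
    import org.apache.jena.riot.RDFFormat;

    public class GzipNTriples {
        public static void main(String[] args) throws Exception {
            Model model = RDFDataMgr.loadModel("data.ttl");        // any readable RDF file
            try (OutputStream out = new GZIPOutputStream(new FileOutputStream("data.nt.gz"))) {
                RDFDataMgr.write(out, model, RDFFormat.NTRIPLES);  // long, repetitive lines compress well
            }
            // RIOT reads the .gz back transparently:
            System.out.println(RDFDataMgr.loadModel("data.nt.gz").size() + " triples round-tripped");
        }
    }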


There are line based output formats (I don't know if they work with 
Elephas - no reason why not in principle).


http://jena.apache.org/documentation/io/rdf-output.html#line-printed-formats

See RDFFormat TURTLE_FLAT.
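A quick sketch of that output path with the standard RIOT API (file names are
placeholders):

    import java.io.FileOutputStream;
    import java.io.OutputStream;

    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.riot.RDFDataMgr;
    import org.apache.jena.riot.RDFFormat;

    public class FlatTurtle {
        public static void main(String[] args) throws Exception {
            Model model = RDFDataMgr.loadModel("data.ttl");
            try (OutputStream out = new FileOutputStream("data-flat.ttl")) {
                // Prefixes are printed once at the top, then one triple per line.
                RDFDataMgr.write(out, model, RDFFormat.TURTLE_FLAT);
            }
        }
    }

The prefix block at the top of the file is exactly the part that has to survive
any later splitting.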

Just don't lose the prefixes!

Andy



On 21/12/16 20:46, Dick Murray wrote:

So basically I've got RDF Patch with a default A which I use to build the
Apache Spark RDD...

A quick Google got me a git master last updated 4 years ago, but no code;
the thread says Andy is using the code...?

Like you said probably one for Andy.

Thanks for the pointer.

On 21 Dec 2016 19:59, "A. Soroka"  wrote:

Andy can say more, but RDF Patch may be heading in a direction where it
could be used for such a purpose:

https://lists.apache.org/thread.html/79e0fbd41126a1d8d0b2fb3b7b837d0d1d58d568a3583701b366cfcc@%3Cdev.jena.apache.org%3E

---
A. Soroka
The University of Virginia Library


On Dec 21, 2016, at 2:17 PM, Dick Murray  wrote:

Hi, in a similar vein I have a modified NTriple reader which uses a prefix
file to reduce the file size. Whilst the serialisation allows parallel
processing in Spark, the file sizes were large and this has reduced them to
1/10 the original size on average.

There isn't an existing line-based serialisation with some form of
prefixing, is there?

On 17 Dec 2016 20:03, "Andy Seaborne"  wrote:


Related:

Jena now provides "Serializable" for Triple/Quad/Node

It did not make 3.1.1, it's in development snapshots and in the next
release.

Use with spark was the original motivation.

   Andy

https://issues.apache.org/jira/browse/JENA-1233

On 17/12/16 09:14, Joint wrote:




Hi.
I was about to use the above to wrap some quads and spoof the RDDs as
graphs from within a dataset but before I do, has this been done before? I
have some code which calls the RDD methods from the graph base find. Not
wanting to reinvent the wheel and such...


Dick






Re: Jena and Spark and Elephas

2016-12-21 Thread Dick Murray
So basically I've got RDF Patch with a default A which I use to build the
Apache Spark RDD...
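Roughly, that could look like the sketch below, assuming the add rows have the
form "A <s> <p> <o> ." - the patch syntax was still in flux at the time, so the
row shape, the file name and the helper are guesses rather than the real format:

    import java.io.ByteArrayInputStream;
    import java.nio.charset.StandardCharsets;

    import org.apache.jena.graph.Node;
    import org.apache.jena.graph.Triple;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.apache.jena.riot.Lang;
    import org.apache.jena.riot.RDFDataMgr;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class PatchAddsToRdd {

        // Parse a single N-Triples line into a Jena Triple.
        static Triple parseTriple(String ntLine) {
            Model m = ModelFactory.createDefaultModel();
            RDFDataMgr.read(m, new ByteArrayInputStream(
                    ntLine.getBytes(StandardCharsets.UTF_8)), Lang.NTRIPLES);
            return m.getGraph().find(Node.ANY, Node.ANY, Node.ANY).next();
        }

        // Keep only the "A ..." rows and turn the remainder of each into a Triple.
        // Shuffling or caching the result relies on Triple being Serializable
        // (JENA-1233) or on a Kryo registration.
        static JavaRDD<Triple> addsToRdd(JavaSparkContext sc, String patchFile) {
            return sc.textFile(patchFile)
                     .filter(line -> line.startsWith("A "))
                     .map(line -> parseTriple(line.substring(2)));
        }
    }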

A quick Google got me a git master last updated 4 years ago, but no code;
the thread says Andy is using the code...?

Like you said probably one for Andy.

Thanks for the pointer.

On 21 Dec 2016 19:59, "A. Soroka"  wrote:

Andy can say more, but RDF Patch may be heading in a direction where it
could be used for such a purpose:

https://lists.apache.org/thread.html/79e0fbd41126a1d8d0b2fb3b7b837d0d1d58d568a3583701b366cfcc@%3Cdev.jena.apache.org%3E

---
A. Soroka
The University of Virginia Library

> On Dec 21, 2016, at 2:17 PM, Dick Murray  wrote:
>
> Hi, in a similar vein I have a modified NTriple reader which uses a prefix
> file to reduce the file size. Whilst the serialisation allows parallel
> processing in Spark, the file sizes were large and this has reduced them to
> 1/10 the original size on average.
>
> There isn't an existing line-based serialisation with some form of
> prefixing, is there?
>
> On 17 Dec 2016 20:03, "Andy Seaborne"  wrote:
>
>> Related:
>>
>> Jena now provides "Serializable" for Triple/Quad/Node
>>
>> It did not make 3.1.1, it's in development snapshots and in the next
>> release.
>>
>> Use with spark was the original motivation.
>>
>>Andy
>>
>> https://issues.apache.org/jira/browse/JENA-1233
>>
>> On 17/12/16 09:14, Joint wrote:
>>
>>>
>>>
>>> Hi.
>>> I was about to use the above to wrap some quads and spoof the RDDs as
>>> graphs from within a dataset but before I do, has this been done before? I
>>> have some code which calls the RDD methods from the graph base find. Not
>>> wanting to reinvent the wheel and such...
>>>
>>>
>>> Dick
>>>
>>>


Re: Jena and Spark and Elephas

2016-12-21 Thread A. Soroka
Andy can say more, but RDF Patch may be heading in a direction where it could 
be used for such a purpose:

https://lists.apache.org/thread.html/79e0fbd41126a1d8d0b2fb3b7b837d0d1d58d568a3583701b366cfcc@%3Cdev.jena.apache.org%3E

---
A. Soroka
The University of Virginia Library

> On Dec 21, 2016, at 2:17 PM, Dick Murray  wrote:
> 
> Hi, in a similar vein I have a modified NTriple reader which uses a prefix
> file to reduce the file size. Whilst the serialisation allows parallel
> processing in Spark, the file sizes were large and this has reduced them to
> 1/10 the original size on average.
>
> There isn't an existing line-based serialisation with some form of
> prefixing, is there?
> 
> On 17 Dec 2016 20:03, "Andy Seaborne"  wrote:
> 
>> Related:
>> 
>> Jena now provides "Serializable" for Triple/Quad/Node
>> 
>> It did not make 3.1.1, it's in development snapshots and in the next
>> release.
>> 
>> Use with spark was the original motivation.
>> 
>>Andy
>> 
>> https://issues.apache.org/jira/browse/JENA-1233
>> 
>> On 17/12/16 09:14, Joint wrote:
>> 
>>> 
>>> 
>>> Hi.
>>> I was about to use the above to wrap some quads and spoof the RDDs as
>>> graphs from within a dataset but before I do, has this been done before? I
>>> have some code which calls the RDD methods from the graph base find. Not
>>> wanting to reinvent the wheel and such...
>>> 
>>> 
>>> Dick
>>> 
>>> 



Re: Jena and Spark and Elephas

2016-12-21 Thread Dick Murray
Hi, in a similar vein I have a modified NTriple reader which uses a prefix
file to reduce the file size. Whilst the serialisation allows parallel
processing in Spark, the file sizes were large and this has reduced them to
1/10 the original size on average.

There isn't an existing line-based serialisation with some form of
prefixing, is there?

On 17 Dec 2016 20:03, "Andy Seaborne"  wrote:

> Related:
>
> Jena now provides "Serializable" for Triple/Quad/Node
>
> It did not make 3.1.1, it's in development snapshots and in the next
> release.
>
> Use with spark was the original motivation.
>
> Andy
>
> https://issues.apache.org/jira/browse/JENA-1233
>
> On 17/12/16 09:14, Joint wrote:
>
>>
>>
>> Hi.
>> I was about to use the above to wrap some quads and spoof the RDDs as
>> graphs from within a dataset but before I do, has this been done before? I
>> have some code which calls the RDD methods from the graph base find. Not
>> wanting to reinvent the wheel and such...
>>
>>
>> Dick
>>
>>


Re: Jena and Spark and Elephas

2016-12-17 Thread Dick Murray
Excellent. I was currently wrapping and unwrapping as Strings, which fixed
another issue, along with prefixing bnodes to remove clashes between TDBs.
I'll pull and refactor my code...

On 17 Dec 2016 20:03, "Andy Seaborne"  wrote:

Related:

Jena now provides "Serializable" for Triple/Quad/Node

It did not make 3.1.1, it's in development snapshots and in the next
release.

Use with spark was the original motivation.

Andy

https://issues.apache.org/jira/browse/JENA-1233


On 17/12/16 09:14, Joint wrote:

>
>
> Hi.
> I was about to use the above to wrap some quads and spoof the RDDs as
> graphs from within a dataset but before I do, has this been done before? I
> have some code which calls the RDD methods from the graph base find. Not
> wanting to reinvent the wheel and such...
>
>
> Dick
>
>


Re: Jena and Spark and Elephas

2016-12-17 Thread Andy Seaborne

Related:

Jena now provides "Serializable" for Triple/Quad/Node

It did not make 3.1.1, it's in development snapshots and in the next 
release.


Use with spark was the original motivation.

Andy

https://issues.apache.org/jira/browse/JENA-1233
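Once that lands, Jena terms can go through Spark's Java serialization directly;
for example (a small sketch - the file name, app name and local master are
placeholders):

    import java.util.List;

    import org.apache.jena.graph.Node;
    import org.apache.jena.graph.Triple;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.riot.RDFDataMgr;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class SerializableTriples {
        public static void main(String[] args) {
            // Needs a Jena build in which Triple/Quad/Node implement Serializable
            // (JENA-1233), i.e. newer than 3.1.1.
            Model model = RDFDataMgr.loadModel("data.ttl");
            List<Triple> triples = model.getGraph().find(Node.ANY, Node.ANY, Node.ANY).toList();

            SparkConf conf = new SparkConf().setAppName("serializable-triples").setMaster("local[*]");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                JavaRDD<Triple> rdd = sc.parallelize(triples);
                long literals = rdd.filter(t -> t.getObject().isLiteral()).count();
                System.out.println("Triples with literal objects: " + literals);
            }
        }
    }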

On 17/12/16 09:14, Joint wrote:



Hi.
I was about to use the above to wrap some quads and spoof the RDDs as graphs 
from within a dataset but before I do, has this been done before? I have some
code which calls the RDD methods from the graph base find. Not wanting to
reinvent the wheel and such...


Dick



Jena and Spark and Elephas

2016-12-17 Thread Joint


Hi.
I was about to use the above to wrap some quads and spoof the RDDs as graphs 
from within a dataset but before I do, has this been done before? I have some
code which calls the RDD methods from the graph base find. Not wanting to
reinvent the wheel and such...
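For reference, a bare-bones sketch of the idea being asked about - a read-only
Graph whose find is answered from a Spark RDD. Everything here is illustrative
(it is not existing Jena or Elephas API) and it assumes the Serializable
Triple/Node work so the pattern nodes can be shipped inside the filter closure:

    import java.util.List;

    import org.apache.jena.graph.Node;
    import org.apache.jena.graph.Triple;
    import org.apache.jena.graph.impl.GraphBase;
    import org.apache.jena.util.iterator.ExtendedIterator;
    import org.apache.jena.util.iterator.WrappedIterator;
    import org.apache.spark.api.java.JavaRDD;

    public class RddBackedGraph extends GraphBase {

        private final JavaRDD<Triple> triples;

        public RddBackedGraph(JavaRDD<Triple> triples) {
            this.triples = triples;
        }

        private static boolean matches(Node pattern, Node node) {
            return pattern == null || Node.ANY.equals(pattern) || pattern.equals(node);
        }

        @Override
        protected ExtendedIterator<Triple> graphBaseFind(Triple pattern) {
            // Copy the pattern nodes into locals so the closure does not drag the graph along.
            Node s = pattern.getSubject(), p = pattern.getPredicate(), o = pattern.getObject();
            List<Triple> matched = triples
                    .filter(t -> matches(s, t.getSubject())
                              && matches(p, t.getPredicate())
                              && matches(o, t.getObject()))
                    .collect();               // fine for a sketch; a real one would stream results
            return WrappedIterator.create(matched.iterator());
        }
    }

A dataset view could then hand out one of these per named graph, which is the
read-only, spoofed-graphs arrangement described above.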


Dick