Re: Jena and Spark and Elephas
On 22 Dec 2016 8:14 pm, "Andy Seaborne" <a...@apache.org> wrote:

> On 22/12/16 14:48, Joint wrote:
>> Hi Andy.
>> I noticed the WIP status. How does the data get separated from the
>> prefixes, it's in the same file! Unless the file gets split...
>
> Exactly! In some systems splitting is not controlled by the app but by
> the distributor of data.
>
> It sounds like in yours that you are writing large files to the
> persistent layer, correct?

Yes, we have currently written ~1500 triple files, each between 1-4M triples, with a few pushing 9M triples. The file name is the graph name. We have ~100k unique properties per file; large files simply have more objects. The equivalent TDB is at 1.5TB... Our ETL can process asynchronously, which is helping, and Spark seems to be working :-)
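A minimal sketch of the "file name is the graph name" layout described above, assuming plain N-Triples input and a made-up graph base IRI. `GRAPH_BASE`, `graph_iri_for` and `triples_to_quads` are hypothetical names for illustration only; this is not the actual ETL code.

```python
from pathlib import PurePosixPath

# Hypothetical base IRI; the thread does not say how graph names map to IRIs.
GRAPH_BASE = "http://example.org/graph/"


def graph_iri_for(path):
    """Derive a graph IRI from a triple file's name (extension dropped)."""
    return GRAPH_BASE + PurePosixPath(path).stem


def triples_to_quads(path, nt_lines):
    """Turn N-Triples lines from one file into N-Quads lines in that file's graph."""
    graph = graph_iri_for(path)
    for line in nt_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if not line.endswith("."):
            raise ValueError(f"not a complete triple line: {line!r}")
        # Insert the graph term before the terminating dot.
        yield f"{line[:-1].rstrip()} <{graph}> ."


quads = list(triples_to_quads(
    "data/people.nt",
    ['<http://example.org/a> <http://example.org/p> "x" .'],
))
print(quads[0])
```

Because each file maps to exactly one graph, files can be converted or loaded independently of each other, which fits the parallel-load setup described in the message.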
Re: Jena and Spark and Elephas
On 22/12/16 14:48, Joint wrote:

> Hi Andy.
> I noticed the WIP status. How does the data get separated from the
> prefixes, it's in the same file! Unless the file gets split...

Exactly! In some systems splitting is not controlled by the app but by the distributor of data.

It sounds like in yours that you are writing large files to the persistent layer, correct?

> My reader requires a prefix file when the triple file is opened otherwise
> it throws an exception. Thus the data file can be split as long as the
> prefix file is along for the ride.
> Is the Elephas compression in memory or disk or both. Our compression
> requirement is driven by the deployment environment which charges for the
> disk writes to the SLA storage.

Not sure.

> The RDDs are free as they go into memory or local disk. Effectively I'm
> persisting a dataset as a group of triple files which get loaded into RDDs
> and processed as a read-only dataset. This also allows us to perform
> multiple graph writes so we can load data in parallel.
>
> Dick
Re: Jena and Spark and Elephas
Hi Andy.

I noticed the WIP status. How does the data get separated from the prefixes? It's in the same file! Unless the file gets split...

My reader requires a prefix file when the triple file is opened, otherwise it throws an exception. Thus the data file can be split as long as the prefix file is along for the ride.

Is the Elephas compression in memory, on disk, or both? Our compression requirement is driven by the deployment environment, which charges for the disk writes to the SLA storage. The RDDs are free as they go into memory or local disk. Effectively I'm persisting a dataset as a group of triple files which get loaded into RDDs and processed as a read-only dataset. This also allows us to perform multiple graph writes so we can load data in parallel.

Dick

Original message
From: Andy Seaborne <a...@apache.org>
Date: 22/12/2016 11:30 (GMT+00:00)
To: users@jena.apache.org
Subject: Re: Jena and Spark and Elephas

RDF-Patch update is WIP - in fact one of the potential things is removing the prefix stuff. (Reasons: data gets separated from its prefixes; want to add prefix management of the data to RDF Patch.)

Elephas has various compression options. Would any of these work for you?

I find that compressing N-Triples gives x5 to x10 compression, so applied to RDD data I'd expect that or more.

There are line-based output formats (I don't know if they work with Elephas - no reason why not in principle).

http://jena.apache.org/documentation/io/rdf-output.html#line-printed-formats

See RDFFormat TURTLE_FLAT.

Just don't lose the prefixes!

Andy
Re: Jena and Spark and Elephas
RDF-Patch update is WIP - in fact one of the potential things is removing the prefix stuff. (Reasons: data gets separated from its prefixes; want to add prefix management of the data to RDF Patch.)

Elephas has various compression options. Would any of these work for you?

I find that compressing N-Triples gives x5 to x10 compression, so applied to RDD data I'd expect that or more.

There are line-based output formats (I don't know if they work with Elephas - no reason why not in principle).

http://jena.apache.org/documentation/io/rdf-output.html#line-printed-formats

See RDFFormat TURTLE_FLAT.

Just don't lose the prefixes!

Andy

On 21/12/16 20:46, Dick Murray wrote:

> So basically I've got RDF Patch with a default A which I use to build the
> Apache Spark RDD...
>
> A quick Google got me a git master updated 4 years ago, but no code, but
> the thread says Andy is using the code..?
>
> Like you said probably one for Andy.
>
> Thanks for the pointer.
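The compression figure quoted for N-Triples can be sanity-checked with a rough measurement like the one below. This is only a sketch: it uses Python's standard `gzip` module rather than any Elephas or Hadoop codec, and the data is synthetic (the `example.org` IRIs and the `synthetic_ntriples` helper are made up), so the ratio it prints says nothing definite about real datasets.

```python
import gzip


def synthetic_ntriples(count):
    """Build a made-up N-Triples document with long, repetitive IRIs."""
    lines = [
        f'<http://example.org/resource/{i}> <http://example.org/vocab/value> "{i}" .'
        for i in range(count)
    ]
    return "\n".join(lines) + "\n"


def compression_ratio(text):
    """Return uncompressed size divided by gzip-compressed size."""
    raw = text.encode("utf-8")
    return len(raw) / len(gzip.compress(raw))


sample = synthetic_ntriples(10_000)
print(f"gzip ratio on synthetic data: {compression_ratio(sample):.1f}x")
```

Running the same function over a slice of real data would give a more honest estimate, since real IRIs and literals are less uniform than this sample.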
Re: Jena and Spark and Elephas
So basically I've got RDF Patch with a default A which I use to build the Apache Spark RDD...

A quick Google got me a git master updated 4 years ago, but no code, but the thread says Andy is using the code..?

Like you said probably one for Andy.

Thanks for the pointer.

On 21 Dec 2016 19:59, "A. Soroka" <aj...@virginia.edu> wrote:

> Andy can say more, but RDF Patch may be heading in a direction where it
> could be used for such a purpose:
>
> https://lists.apache.org/thread.html/79e0fbd41126a1d8d0b2fb3b7b837d0d1d58d568a3583701b366cfcc@%3Cdev.jena.apache.org%3E
>
> ---
> A. Soroka
> The University of Virginia Library
Re: Jena and Spark and Elephas
Andy can say more, but RDF Patch may be heading in a direction where it could be used for such a purpose:

https://lists.apache.org/thread.html/79e0fbd41126a1d8d0b2fb3b7b837d0d1d58d568a3583701b366cfcc@%3Cdev.jena.apache.org%3E

---
A. Soroka
The University of Virginia Library

> On Dec 21, 2016, at 2:17 PM, Dick Murray <dandh...@gmail.com> wrote:
>
> Hi, on a similar vein I have a modified NTriple reader which uses a prefix
> file to reduce the file size. Whilst the serialisation allows parallel
> processing in spark the file sizes were large and this has reduced them to
> 1/10 the original size on average.
>
> There is not an existing line based serialisation with some form of
> prefixing is there?
Re: Jena and Spark and Elephas
Hi, in a similar vein I have a modified NTriple reader which uses a prefix file to reduce the file size. Whilst the serialisation allows parallel processing in Spark, the file sizes were large, and this has reduced them to 1/10 of the original size on average.

There is not an existing line-based serialisation with some form of prefixing, is there?

On 17 Dec 2016 20:03, "Andy Seaborne" <a...@apache.org> wrote:

> Related:
>
> Jena now provides "Serializable" for Triple/Quad/Node
>
> It did not make 3.1.1, it's in development snapshots and in the next
> release.
>
> Use with spark was the original motivation.
>
> Andy
>
> https://issues.apache.org/jira/browse/JENA-1233
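To make the idea of a prefix file alongside a line-based triple file concrete, here is a minimal sketch. The `label:local` line format, the in-memory prefix map, and all function names are assumptions for illustration; the thread does not show the actual reader or its file format.

```python
def split_line(line):
    """Split one 'subject predicate object .' line into its three terms."""
    subject, predicate, rest = line.strip().split(" ", 2)
    rest = rest.rstrip()
    if not rest.endswith("."):
        raise ValueError(f"missing terminating '.': {line!r}")
    return subject, predicate, rest[:-1].rstrip()


def compact_term(term, prefixes):
    """Shorten <iri> to label:local when a namespace matches; leave other terms alone."""
    if term.startswith("<") and term.endswith(">"):
        iri = term[1:-1]
        # Try the longest namespace first so the most specific prefix wins.
        for label, namespace in sorted(prefixes.items(), key=lambda kv: -len(kv[1])):
            if iri.startswith(namespace):
                return f"{label}:{iri[len(namespace):]}"
    return term


def expand_term(term, prefixes):
    """Reverse compact_term; an unknown label raises KeyError."""
    if term.startswith(("<", '"', "_:")):
        return term  # full IRI, literal or blank node: nothing to expand
    label, _, local = term.partition(":")
    return f"<{prefixes[label]}{local}>"


def compact_line(line, prefixes):
    return " ".join(compact_term(t, prefixes) for t in split_line(line)) + " ."


def expand_line(line, prefixes):
    return " ".join(expand_term(t, prefixes) for t in split_line(line)) + " ."


prefixes = {"ex": "http://example.org/", "foaf": "http://xmlns.com/foaf/0.1/"}
original = '<http://example.org/alice> <http://xmlns.com/foaf/0.1/name> "Alice Smith" .'
compacted = compact_line(original, prefixes)
print(compacted)
print(expand_line(compacted, prefixes))
```

Each compacted line can still be expanded on its own given the prefix map, so the data can be split by line and processed in parallel as long as every worker also receives the prefixes.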
Re: Jena and Spark and Elephas
Excellent, I was currently wrapping and unwrapping as Strings, which fixed another issue, along with prefixing bnodes to remove clashes between TDBs. I'll pull and refactor my code...

On 17 Dec 2016 20:03, "Andy Seaborne" <a...@apache.org> wrote:

> Related:
>
> Jena now provides "Serializable" for Triple/Quad/Node
>
> It did not make 3.1.1, it's in development snapshots and in the next
> release.
>
> Use with spark was the original motivation.
>
> Andy
>
> https://issues.apache.org/jira/browse/JENA-1233
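The bnode prefixing mentioned above could look roughly like the sketch below, which rewrites blank node labels in an N-Triples line so that labels from different stores cannot collide. The `source_id` naming scheme and the `prefix_bnodes` helper are assumptions, and the label pattern is deliberately simplified compared with the full N-Triples grammar.

```python
import re

# Match IRIs and literals first so that "_:" sequences inside them are left alone;
# only the third alternative (a blank node label) captures a group.
_TOKEN = re.compile(r'<[^>]*>|"(?:[^"\\]|\\.)*"|_:([A-Za-z0-9_]+)')


def prefix_bnodes(nt_line, source_id):
    """Prefix every blank node label in one line with an identifier for its source."""
    def rewrite(match):
        label = match.group(1)
        if label is None:
            return match.group(0)  # IRI or literal: keep unchanged
        return f"_:{source_id}_{label}"
    return _TOKEN.sub(rewrite, nt_line)


line = '_:b0 <http://example.org/p> "see _:b0" .'
print(prefix_bnodes(line, "tdb1"))
print(prefix_bnodes(line, "tdb2"))
```

The same label `_:b0` coming from two stores ends up as two distinct labels, while the text inside the literal is untouched.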
Re: Jena and Spark and Elephas
Related:

Jena now provides "Serializable" for Triple/Quad/Node.

It did not make 3.1.1; it's in development snapshots and in the next release.

Use with Spark was the original motivation.

Andy

https://issues.apache.org/jira/browse/JENA-1233

On 17/12/16 09:14, Joint wrote:

> Hi.
> I was about to use the above to wrap some quads and spoof the RDDs as
> graphs from within a dataset but before I do has this been done before? I
> have some code which calls the RDD methods from the graph base find. Not
> wanting to invent the wheel and such...
>
> Dick
Jena and Spark and Elephas
Hi. I was about to use the above to wrap some quads and spoof the RDDs as graphs from within a dataset but before I do has this been done before? I have some code which calls the RDD methods from the graph base find. Not wanting to invent the wheel and such... Dick
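The idea of answering a graph-style `find` by filtering a distributed collection can be sketched in plain Python as below. A list stands in for the RDD and string tuples stand in for Jena triples; `find` and `ANY` are hypothetical names, and with Spark the same predicate would presumably be passed to the RDD's `filter` method instead.

```python
ANY = None  # wildcard marker, playing the role of "match anything"


def find(triples, s=ANY, p=ANY, o=ANY):
    """Return the triples matching the pattern; ANY in a slot matches every value."""
    pattern = (s, p, o)

    def matches(triple):
        return all(want is ANY or want == got for want, got in zip(pattern, triple))

    return filter(matches, triples)


data = [
    ("alice", "knows", "bob"),
    ("alice", "name", "Alice"),
    ("bob", "knows", "carol"),
]
print(list(find(data, p="knows")))
print(list(find(data, s="alice", p="name")))
```

Since the predicate only looks at one triple at a time, it can be evaluated independently on each partition, which is what makes a read-only graph view over an RDD plausible.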