Re: Writing to multiple outputs in Spark

2015-08-17 Thread Silas Davis
@Reynold Xin: not really: it only works for Parquet (see partitionBy: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameWriter), and it requires you to have a DataFrame in the first place (for my use case the Spark SQL interface to Avro records is more of a hindrance …
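
For reference, a minimal sketch (mine, not from the thread) of the DataFrameWriter.partitionBy path being discussed here and in Reynold Xin's message below, assuming Spark 1.4/1.5-era APIs; the sample records, column names, and output path are illustrative assumptions:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SQLContext, SaveMode}

object PartitionedWriteSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("partitionBy-sketch"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Hypothetical records; the column names are assumptions.
    val df = sc.parallelize(Seq(
      ("click", "2015-08-14", "user-1"),
      ("view",  "2015-08-14", "user-2")
    )).toDF("event_type", "date", "user_id")

    df.write
      .partitionBy("event_type")   // one sub-directory per value, e.g. .../event_type=click/
      .mode(SaveMode.Overwrite)
      .parquet("/tmp/events")      // Parquet being the case the message above says works

    sc.stop()
  }
}
```

As the message above points out, this only covers the DataFrame path; there is no equivalent hook on plain RDDs.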

Re: Writing to multiple outputs in Spark

2015-08-14 Thread Reynold Xin
This is already supported with the new partitioned data sources in DataFrame/SQL, right? On Fri, Aug 14, 2015 at 8:04 AM, Alex Angelini wrote: …

Re: Writing to multiple outputs in Spark

2015-08-14 Thread Nicholas Chammas
See: https://issues.apache.org/jira/browse/SPARK-3533. Feel free to comment there and make a case if you think the issue should be reopened. Nick. On Fri, Aug 14, 2015 at 11:11 AM, Abhishek R. Singh <abhis...@tetrationanalytics.com> wrote: …

Re: Writing to multiple outputs in Spark

2015-08-14 Thread Abhishek R. Singh
A workaround would be to have multiple passes over the RDD, with each pass writing its own output. Or, in a foreachPartition, do it in a single pass (open up multiple files per partition to write out)? -Abhishek- On Aug 14, 2015, at 7:56 AM, Silas Davis wrote: …
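
To make the single-pass suggestion concrete, here is a rough sketch (mine, not from the thread) of the foreachPartition workaround: each partition keeps one writer per key it encounters and appends records to a per-key file on the Hadoop filesystem. The (String, String) record shape and the output layout are assumptions.

```scala
import java.io.{BufferedWriter, OutputStreamWriter}
import scala.collection.mutable

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.TaskContext
import org.apache.spark.rdd.RDD

// Writes each (key, value) pair to <outputDir>/<key>/part-<partitionId>,
// opening at most one writer per key per partition.
def writeByKey(rdd: RDD[(String, String)], outputDir: String): Unit = {
  rdd.foreachPartition { records =>
    val fs = FileSystem.get(new Configuration())
    val partitionId = TaskContext.get.partitionId
    val writers = mutable.Map.empty[String, BufferedWriter]
    try {
      records.foreach { case (key, value) =>
        val writer = writers.getOrElseUpdate(key, {
          val path = new Path(s"$outputDir/$key/part-$partitionId")
          new BufferedWriter(new OutputStreamWriter(fs.create(path)))
        })
        writer.write(value)
        writer.newLine()
      }
    } finally {
      writers.values.foreach(_.close())
    }
  }
}
```

The multiple-passes alternative amounts to one filter-plus-save per key, which rescans the RDD (or requires caching it) once per output.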

Re: Writing to multiple outputs in Spark

2015-08-14 Thread Alex Angelini
Speaking about Shopify's deployment, this would be a really nice feature to have. We would like to write data to folders with the structure `//`, but have had to hold off on that because of the lack of support for MultipleOutputs. On Fri, Aug 14, 2015 at 10:56 AM, Silas Davis wrote: …

Re: Writing to multiple outputs in Spark

2015-08-14 Thread Silas Davis
Would it be right to assume that the silence on this topic implies others don't really have this issue/desire? On Sat, 18 Jul 2015 at 17:24, Silas Davis wrote: …

Writing to multiple outputs in Spark

2015-07-18 Thread Silas Davis
*tl;dr: Hadoop and Cascading provide ways of writing tuples to multiple output files based on key, but the plain RDD interface doesn't seem to, and it should.* I have been looking into ways to write to multiple outputs in Spark. It seems like a feature that is somewhat missing from Spark. The idea …
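
For comparison with the Hadoop and Cascading facilities mentioned in the tl;dr, the RDD-level route commonly used at the time was to drop down to the old Hadoop OutputFormat API: a MultipleTextOutputFormat subclass derives the output file name from the key, plugged in via saveAsHadoopFile. A sketch under those assumptions; the class name, sample data, and output path are mine, not the poster's:

```scala
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
import org.apache.spark.{SparkConf, SparkContext}

// Routes each (key, value) pair to a file under a sub-directory named after its key.
class KeyBasedOutput extends MultipleTextOutputFormat[Any, Any] {
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    s"$key/$name"      // "name" is the usual part-NNNNN file name
  override def generateActualKey(key: Any, value: Any): Any =
    NullWritable.get() // the key only selects the file; don't write it into the line
}

object MultiOutputSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("multi-output-sketch"))
    val pairs = sc.parallelize(Seq(("a", "first"), ("b", "second"), ("a", "third")))
    pairs.saveAsHadoopFile(
      "/tmp/multi-output",              // hypothetical output root
      classOf[String], classOf[String], // key and value classes
      classOf[KeyBasedOutput]
    )
    sc.stop()
  }
}
```

This is close in spirit to Hadoop's MultipleOutputs, but it only applies to the Hadoop-file save paths, which is presumably part of the argument that the plain RDD interface should support this directly.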