Re: How to use a custom partitioner in a dataframe in Spark

2016-02-18 Thread Koert Kuipers
although it is not a bad idea to write data out partitioned, and then use a
merge join when reading it back in, this currently isn't even easily doable
with rdds because when you read an rdd from disk the partitioning info is
lost. re-introducing a partitioner at that point causes a shuffle defeating
the purpose.

On Thu, Feb 18, 2016 at 1:49 PM, Rishi Mishra  wrote:

> Michael,
> Is there any specific reason why DataFrames does not have partitioners
> like RDDs ? This will be very useful if one is writing custom datasources ,
> which keeps data in partitions. While storing data one can pre-partition
> the data at Spark level rather than at the datasource.
>
> Regards,
> Rishitesh Mishra,
> SnappyData . (http://www.snappydata.io/)
>
> https://in.linkedin.com/in/rishiteshmishra
>
> On Thu, Feb 18, 2016 at 3:50 AM, swetha kasireddy <
> swethakasire...@gmail.com> wrote:
>
>> So suppose I have a bunch of userIds and I need to save them as parquet
>> in database. I also need to load them back and need to be able to do a join
>> on userId. My idea is to partition by userId hashcode first and then on
>> userId. So that I don't have to deal with any performance issues because of
>> a number of small files and also to be able to scan faster.
>>
>>
>> Something like ...df.write.format("parquet").partitionBy( "userIdHash"
>> , "userId").mode(SaveMode.Append).save("userRecords");
>>
>> On Wed, Feb 17, 2016 at 2:16 PM, swetha kasireddy <
>> swethakasire...@gmail.com> wrote:
>>
>>> So suppose I have a bunch of userIds and I need to save them as parquet
>>> in database. I also need to load them back and need to be able to do a join
>>> on userId. My idea is to partition by userId hashcode first and then on
>>> userId.
>>>
>>>
>>>
>>> On Wed, Feb 17, 2016 at 11:51 AM, Michael Armbrust <
>>> mich...@databricks.com> wrote:
>>>
 Can you describe what you are trying to accomplish?  What would the
 custom partitioner be?

 On Tue, Feb 16, 2016 at 1:21 PM, SRK  wrote:

> Hi,
>
> How do I use a custom partitioner when I do a saveAsTable in a
> dataframe.
>
>
> Thanks,
> Swetha
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-use-a-custom-partitioner-in-a-dataframe-in-Spark-tp26240.html
> Sent from the Apache Spark User List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>

>>>
>>
>


Re: How to use a custom partitioner in a dataframe in Spark

2016-02-18 Thread Rishi Mishra
Michael,
Is there any specific reason why DataFrames does not have partitioners like
RDDs ? This will be very useful if one is writing custom datasources ,
which keeps data in partitions. While storing data one can pre-partition
the data at Spark level rather than at the datasource.

Regards,
Rishitesh Mishra,
SnappyData . (http://www.snappydata.io/)

https://in.linkedin.com/in/rishiteshmishra

On Thu, Feb 18, 2016 at 3:50 AM, swetha kasireddy  wrote:

> So suppose I have a bunch of userIds and I need to save them as parquet in
> database. I also need to load them back and need to be able to do a join
> on userId. My idea is to partition by userId hashcode first and then on
> userId. So that I don't have to deal with any performance issues because of
> a number of small files and also to be able to scan faster.
>
>
> Something like ...df.write.format("parquet").partitionBy( "userIdHash"
> , "userId").mode(SaveMode.Append).save("userRecords");
>
> On Wed, Feb 17, 2016 at 2:16 PM, swetha kasireddy <
> swethakasire...@gmail.com> wrote:
>
>> So suppose I have a bunch of userIds and I need to save them as parquet
>> in database. I also need to load them back and need to be able to do a join
>> on userId. My idea is to partition by userId hashcode first and then on
>> userId.
>>
>>
>>
>> On Wed, Feb 17, 2016 at 11:51 AM, Michael Armbrust <
>> mich...@databricks.com> wrote:
>>
>>> Can you describe what you are trying to accomplish?  What would the
>>> custom partitioner be?
>>>
>>> On Tue, Feb 16, 2016 at 1:21 PM, SRK  wrote:
>>>
 Hi,

 How do I use a custom partitioner when I do a saveAsTable in a
 dataframe.


 Thanks,
 Swetha



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/How-to-use-a-custom-partitioner-in-a-dataframe-in-Spark-tp26240.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


>>>
>>
>


Re: How to use a custom partitioner in a dataframe in Spark

2016-02-17 Thread swetha kasireddy
So suppose I have a bunch of userIds and I need to save them as parquet in
database. I also need to load them back and need to be able to do a join
on userId. My idea is to partition by userId hashcode first and then on
userId. So that I don't have to deal with any performance issues because of
a number of small files and also to be able to scan faster.


Something like ...df.write.format("parquet").partitionBy( "userIdHash"
, "userId").mode(SaveMode.Append).save("userRecords");

On Wed, Feb 17, 2016 at 2:16 PM, swetha kasireddy  wrote:

> So suppose I have a bunch of userIds and I need to save them as parquet in
> database. I also need to load them back and need to be able to do a join
> on userId. My idea is to partition by userId hashcode first and then on
> userId.
>
>
>
> On Wed, Feb 17, 2016 at 11:51 AM, Michael Armbrust  > wrote:
>
>> Can you describe what you are trying to accomplish?  What would the
>> custom partitioner be?
>>
>> On Tue, Feb 16, 2016 at 1:21 PM, SRK  wrote:
>>
>>> Hi,
>>>
>>> How do I use a custom partitioner when I do a saveAsTable in a dataframe.
>>>
>>>
>>> Thanks,
>>> Swetha
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-use-a-custom-partitioner-in-a-dataframe-in-Spark-tp26240.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>


Re: How to use a custom partitioner in a dataframe in Spark

2016-02-17 Thread swetha kasireddy
So suppose I have a bunch of userIds and I need to save them as parquet in
database. I also need to load them back and need to be able to do a join
on userId. My idea is to partition by userId hashcode first and then on
userId.



On Wed, Feb 17, 2016 at 11:51 AM, Michael Armbrust 
wrote:

> Can you describe what you are trying to accomplish?  What would the custom
> partitioner be?
>
> On Tue, Feb 16, 2016 at 1:21 PM, SRK  wrote:
>
>> Hi,
>>
>> How do I use a custom partitioner when I do a saveAsTable in a dataframe.
>>
>>
>> Thanks,
>> Swetha
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-use-a-custom-partitioner-in-a-dataframe-in-Spark-tp26240.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: How to use a custom partitioner in a dataframe in Spark

2016-02-17 Thread Michael Armbrust
Can you describe what you are trying to accomplish?  What would the custom
partitioner be?

On Tue, Feb 16, 2016 at 1:21 PM, SRK  wrote:

> Hi,
>
> How do I use a custom partitioner when I do a saveAsTable in a dataframe.
>
>
> Thanks,
> Swetha
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-use-a-custom-partitioner-in-a-dataframe-in-Spark-tp26240.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: How to use a custom partitioner in a dataframe in Spark

2016-02-17 Thread Rishi Mishra
Unfortunately there is not any,  at least till 1.5.  Have not gone through
the new DataSet of 1.6.  There is some basic support for Parquet like
partitionByColumn.
If you want to partition your dataset on a certain way you have to use an
RDD to partition & convert that into a DataFrame before storing in table.

Regards,
Rishitesh Mishra,
SnappyData . (http://www.snappydata.io/)

https://in.linkedin.com/in/rishiteshmishra

On Tue, Feb 16, 2016 at 11:51 PM, SRK  wrote:

> Hi,
>
> How do I use a custom partitioner when I do a saveAsTable in a dataframe.
>
>
> Thanks,
> Swetha
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-use-a-custom-partitioner-in-a-dataframe-in-Spark-tp26240.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>