Re: SparkSQL API to insert DataFrame into a static partition?

2015-12-05 Thread Michael Armbrust
>
> A follow-up question in this case: what is the cost of registering a temp
> table? Is there a limit to the number of temp tables that can be
> registered by a Spark context?
>

It is pretty cheap: just an entry in an in-memory hash table that maps the
name to a query plan (similar to a view).


Re: SparkSQL API to insert DataFrame into a static partition?

2015-12-04 Thread Isabelle Phan
Thanks all for your replies!

I tested both approaches: registering the temp table and then executing SQL,
vs. saving directly to the HDFS filepath. The problem with the second
approach is that I am inserting data into a Hive table, so if I create a new
partition this way, the Hive metadata is not updated.
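I could presumably register the new partition with the metastore manually
after the write (a sketch I have not tested, assuming a HiveContext; the
table name and date value are illustrative):

// Hypothetical: tell the metastore about the directly-written partition
sqlContext.sql(
  "ALTER TABLE my_table ADD IF NOT EXISTS PARTITION (date='2015-12-01')")

but that is one more step to keep in sync.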

So I will be going with the first approach.
A follow-up question in this case: what is the cost of registering a temp
table? Is there a limit to the number of temp tables that can be registered
by a Spark context?


Thanks again for your input.

Isabelle


Re: SparkSQL API to insert DataFrame into a static partition?

2015-12-02 Thread Michael Armbrust
You might also coalesce to 1 (or some small number of) partitions before
writing, to avoid creating a lot of small files in that partition if you
know that there is not a ton of data.
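
For example (a sketch reusing the insertInto snippet from this thread; the
literal 1 assumes the date's data comfortably fits in one output file):

// Collapse to a single partition so the write produces one file
df.coalesce(1).write.partitionBy("date").insertInto("my_table")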



Re: SparkSQL API to insert DataFrame into a static partition?

2015-12-02 Thread Rishi Mishra
As long as all your data is being inserted by Spark, and hence uses the same
hash partitioner, what Fengdong mentioned should work.



-- 
Regards,
Rishitesh Mishra,
SnappyData (http://www.snappydata.io/)

https://in.linkedin.com/in/rishiteshmishra


Re: SparkSQL API to insert DataFrame into a static partition?

2015-12-01 Thread Fengdong Yu
Hi,
you can try this:

if your table is under the location "/test/table/" on HDFS
and has partitions:

 "/test/table/dt=2012"
 "/test/table/dt=2013"

df.write.mode(SaveMode.Append).partitionBy("dt").save("/test/table")



> On Dec 2, 2015, at 10:50 AM, Isabelle Phan  wrote:
> 
> df.write.partitionBy("date").insertInto("my_table")



Re: SparkSQL API to insert DataFrame into a static partition?

2015-12-01 Thread Jeff Zhang
I don't think there's an API for that, but I think it would be a reasonable
and helpful addition for ETL.

As a workaround you can first register your DataFrame as a temp table, and
then use SQL to insert into the static partition.
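
Something like this (a sketch assuming a HiveContext; the table, columns,
and date value are taken from your example and are illustrative):

// Expose the DataFrame to SQL, then insert into one static partition
df.registerTempTable("df_tmp")
sqlContext.sql(
  """INSERT OVERWRITE TABLE my_table PARTITION (date='2015-12-01')
    |SELECT col_a, col_b FROM df_tmp""".stripMargin)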



-- 
Best Regards

Jeff Zhang


SparkSQL API to insert DataFrame into a static partition?

2015-12-01 Thread Isabelle Phan
Hello,

Is there any API to insert data into a single partition of a table?

Let's say I have a table with 2 columns (col_a, col_b), partitioned by
date.
After doing some computation for a specific date, I have a DataFrame with 2
columns (col_a, col_b) which I would like to insert into a specific date
partition. What is the best way to achieve this?

It seems that if I add a date column to my DataFrame and turn on dynamic
partitioning, I can do:
df.write.partitionBy("date").insertInto("my_table")
But using the dynamic partitioning machinery seems like overkill for such a
case.
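
For concreteness, the full workaround I have in mind looks something like
this (a sketch; the literal date value is illustrative):

// Add the partition value as a constant column, then insert
import org.apache.spark.sql.functions.lit
df.withColumn("date", lit("2015-12-01"))
  .write.partitionBy("date")
  .insertInto("my_table")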


Thanks for any pointers!

Isabelle