Re: SparkSQL API to insert DataFrame into a static partition?
> Follow up question in this case: what is the cost of registering a temp
> table? Is there a limit to the number of temp tables that can be
> registered by Spark context?

It is pretty cheap: just an entry in an in-memory hashtable pointing to a query plan (similar to a view).
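To illustrate the point above, a minimal sketch of the temp-table lifecycle in the Spark 1.x API (the table name here is made up; `df` and `sqlContext` are assumed to exist):

```scala
// A temp table is just a name -> logical-plan entry in the SQLContext's
// in-memory catalog, so registering many of them is cheap.
df.registerTempTable("snapshot_2015_12_02")      // adds one catalog entry
sqlContext.sql("SELECT count(*) FROM snapshot_2015_12_02").show()
sqlContext.dropTempTable("snapshot_2015_12_02")  // removes the entry; no data is deleted
```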
Re: SparkSQL API to insert DataFrame into a static partition?
Thanks all for your replies! I tested both approaches: registering a temp table and then executing SQL vs. saving directly to an HDFS file path. The problem with the second approach is that I am inserting data into a Hive table, so if I create a new partition with this method, the Hive metadata is not updated. I will therefore go with the first approach.

Follow up question in this case: what is the cost of registering a temp table? Is there a limit to the number of temp tables that can be registered by a Spark context?

Thanks again for your input.

Isabelle
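For completeness: if you did write files directly under the table's location, the missing Hive metadata can usually be added afterwards. A hedged sketch, assuming a HiveContext and the table/partition names discussed in this thread (the partition value and location are made up for illustration):

```scala
// Register a partition written directly to HDFS with the Hive metastore.
sqlContext.sql(
  "ALTER TABLE my_table ADD IF NOT EXISTS PARTITION (date='2015-12-02') " +
  "LOCATION '/test/table/date=2015-12-02'")

// Or let Hive discover all unregistered partition directories at once:
sqlContext.sql("MSCK REPAIR TABLE my_table")
```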
Re: SparkSQL API to insert DataFrame into a static partition?
You might also coalesce to 1 (or some small number) before writing, to avoid creating a lot of files in that partition if you know that there is not a ton of data.
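The coalesce suggestion, sketched against the `insertInto` snippet from earlier in the thread (`df` and the table name are taken from the thread):

```scala
// Shrink to a few partitions before the write so the output partition
// directory gets a handful of files instead of one per task.
df.coalesce(1)   // or a small n if one output file would be too large
  .write
  .partitionBy("date")
  .insertInto("my_table")
```

Note that `coalesce` narrows the existing partitions without a shuffle, so it is cheap, but it also reduces write parallelism; only use a very small number when the data volume is known to be small.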
Re: SparkSQL API to insert DataFrame into a static partition?
As long as all your data is being inserted by Spark, and hence uses the same hash partitioner, what Fengdong mentioned should work.

--
Regards,
Rishitesh Mishra,
SnappyData (http://www.snappydata.io/)
https://in.linkedin.com/in/rishiteshmishra
Re: SparkSQL API to insert DataFrame into a static partition?
Hi,

You can try the following. If your table is under the location "/test/table/" on HDFS and has partitions:

"/test/table/dt=2012"
"/test/table/dt=2013"

then:

df.write.mode(SaveMode.Append).partitionBy("dt").save("/test/table")
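A slightly fuller version of that snippet, with the import it needs; it assumes the DataFrame carries a `dt` column matching the `dt=...` directory names:

```scala
import org.apache.spark.sql.SaveMode

df.write
  .mode(SaveMode.Append)   // keep the existing dt=2012, dt=2013 directories
  .partitionBy("dt")       // new rows land under /test/table/dt=<value>
  .save("/test/table")
```

As noted later in the thread, this writes the files but does not update the Hive metastore, so it fits path-based tables rather than Hive-managed ones.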
Re: SparkSQL API to insert DataFrame into a static partition?
I don't think there's an API for that, but I think it would be reasonable and helpful for ETL. As a workaround, you can first register your DataFrame as a temp table, and then use SQL to insert into the static partition.

--
Best Regards

Jeff Zhang
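The workaround above might look like this in the Spark 1.x API (a sketch: the table and column names come from the question; the temp-table name and partition value are made up, and a HiveContext is assumed):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.hive.HiveContext

def insertIntoStaticPartition(sqlContext: HiveContext, df: DataFrame, date: String): Unit = {
  // Register the DataFrame under a temporary name in the catalog.
  df.registerTempTable("df_tmp")
  // Static partition insert: the partition value is fixed in the SQL,
  // so no date column is needed on the DataFrame itself.
  sqlContext.sql(
    s"INSERT OVERWRITE TABLE my_table PARTITION (date='$date') " +
     "SELECT col_a, col_b FROM df_tmp")
  sqlContext.dropTempTable("df_tmp")
}
```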
SparkSQL API to insert DataFrame into a static partition?
Hello,

Is there any API to insert data into a single partition of a table?

Let's say I have a table with 2 columns (col_a, col_b), partitioned by date. After doing some computation for a specific date, I have a DataFrame with 2 columns (col_a, col_b) which I would like to insert into a specific date partition. What is the best way to achieve this?

It seems that if I add a date column to my DataFrame, and turn on dynamic partitioning, I can do:

df.write.partitionBy("date").insertInto("my_table")

But it seems overkill to use dynamic partitioning for such a case.

Thanks for any pointers!

Isabelle
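For reference, "turn on dynamic partitioning" typically means setting the following Hive flags before the `insertInto`; a hedged sketch, since the exact configuration depends on your Hive setup:

```scala
// Allow Hive to create partitions from the data's date column values.
sqlContext.sql("SET hive.exec.dynamic.partition = true")
sqlContext.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

df.write.partitionBy("date").insertInto("my_table")
```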