Hi Bijay,

This approach might not work for me, as I have to do partial inserts/overwrites in a given table, and data_frame.write.partitionBy will overwrite the entire table.
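One possible way to scope the overwrite to a single partition, assuming the external table layout from the snippets quoted below, is to overwrite only that partition's directory and then register the partition. This is an untested sketch; the partition values 'id1' and '20160613' are made up:

```scala
// Untested sketch, assuming the layout from the snippets below:
// partition columns idPartitioner/dtPartitioner under /user/userId/userRecords.
val partitionPath =
  "/user/userId/userRecords/idPartitioner=id1/dtPartitioner=20160613"

// Keep only the rows for the one partition being refreshed; the partition
// columns themselves are encoded in the directory path, not in the files.
val onePartition = userDF
  .where("idPartitioner = 'id1' and dtPartitioner = '20160613'")
  .select("userId", "userRecord")

// Overwrite just that directory; the rest of the table is untouched.
onePartition.write.mode("overwrite").orc(partitionPath)

// Register the partition with the metastore if it is new.
sqlContext.sql(
  s"""ALTER TABLE users ADD IF NOT EXISTS
     |PARTITION (idPartitioner='id1', dtPartitioner='20160613')
     |LOCATION '$partitionPath'""".stripMargin)
```

This trades one big INSERT OVERWRITE for one small write per changed partition, which is exactly the partial-overwrite behavior the whole-table write cannot give.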
Thanks,
Swetha

On Mon, Jun 13, 2016 at 9:25 PM, Bijay Pathak <bijay.pat...@cloudwick.com> wrote:

Hi Swetha,

One option is to use a Hive release with the above issue fixed, which is Hive 2.0 or Cloudera CDH Hive 1.2. One thing to remember is that it's not the Hive you have installed that matters but the Hive that Spark is using, which in Spark 1.6 is Hive 1.2 as of now.

The workaround I used for this issue was to write the DataFrame directly using the DataFrame write method and to create the Hive table on top of that, which brought my processing time down from 4+ hrs to just under 1 hr:

data_frame.write.partitionBy('idPartitioner', 'dtPartitioner').orc("path/to/final/location")

Note that the ORC format is supported with HiveContext only.

Thanks,
Bijay

On Mon, Jun 13, 2016 at 11:41 AM, swetha kasireddy <swethakasire...@gmail.com> wrote:

Hi Mich,

Following is a sample code snippet:

val userDF = userRecsDF.toDF("idPartitioner", "dtPartitioner", "userId", "userRecord").persist()
System.out.println("userRecsDF.partitions.size: " + userRecsDF.partitions.size)

userDF.registerTempTable("userRecordsTemp")

sqlContext.sql("SET hive.default.fileformat=Orc")
sqlContext.sql("SET hive.enforce.bucketing = true")
sqlContext.sql("SET hive.enforce.sorting = true")
sqlContext.sql("CREATE EXTERNAL TABLE IF NOT EXISTS users (userId STRING, userRecord STRING) PARTITIONED BY (idPartitioner STRING, dtPartitioner STRING) STORED AS ORC LOCATION '/user/userId/userRecords'")
sqlContext.sql(
  """FROM userRecordsTemp ps
    |INSERT OVERWRITE TABLE users PARTITION (idPartitioner, dtPartitioner)
    |SELECT ps.userId, ps.userRecord, ps.idPartitioner, ps.dtPartitioner
    |CLUSTER BY idPartitioner, dtPartitioner""".stripMargin)

On Mon, Jun 13, 2016 at 10:57 AM, swetha kasireddy <swethakasire...@gmail.com> wrote:

Hi Bijay,

If I am hitting this issue, https://issues.apache.org/jira/browse/HIVE-11940, what needs to be done? Is upgrading to a higher version of Hive the only solution?

Thanks!

On Mon, Jun 13, 2016 at 10:47 AM, swetha kasireddy <swethakasire...@gmail.com> wrote:

Hi,

Following is a sample code snippet:

val userDF = userRecsDF.toDF("idPartitioner", "dtPartitioner", "userId", "userRecord").persist()
System.out.println("userRecsDF.partitions.size: " + userRecsDF.partitions.size)

userDF.registerTempTable("userRecordsTemp")

sqlContext.sql("SET hive.default.fileformat=Orc")
sqlContext.sql("SET hive.enforce.bucketing = true")
sqlContext.sql("SET hive.enforce.sorting = true")
sqlContext.sql("CREATE EXTERNAL TABLE IF NOT EXISTS users (userId STRING, userRecord STRING) PARTITIONED BY (idPartitioner STRING, dtPartitioner STRING) STORED AS ORC LOCATION '/user/userId/userRecords'")
sqlContext.sql(
  """FROM userRecordsTemp ps
    |INSERT OVERWRITE TABLE users PARTITION (idPartitioner, dtPartitioner)
    |SELECT ps.userId, ps.userRecord, ps.idPartitioner, ps.dtPartitioner
    |CLUSTER BY idPartitioner, dtPartitioner""".stripMargin)

On Fri, Jun 10, 2016 at 12:10 AM, Bijay Pathak <bijay.pat...@cloudwick.com> wrote:

Hello,

Looks like you are hitting this: https://issues.apache.org/jira/browse/HIVE-11940.

Thanks,
Bijay

On Thu, Jun 9, 2016 at 9:25 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

Can you provide a code snippet of how you are populating the target table from the temp table.
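For example, something roughly of this shape (the table and column names here are invented, just to show what I mean):

```scala
// Hypothetical skeleton only -- "tmp", "target" and the columns are made up.
sourceDF.registerTempTable("tmp")
sqlContext.sql(
  """INSERT OVERWRITE TABLE target PARTITION (part_col)
    |SELECT col_a, col_b, part_col FROM tmp""".stripMargin)
```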
HTH

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

On 9 June 2016 at 23:43, swetha kasireddy <swethakasire...@gmail.com> wrote:

No, I am reading the data from HDFS, transforming it, registering the data in a temp table using registerTempTable, and then doing an insert overwrite using Spark SQL's HiveContext.

On Thu, Jun 9, 2016 at 3:40 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

How are you doing the insert? From an existing table?

Dr Mich Talebzadeh

On 9 June 2016 at 21:16, Stephen Boesch <java...@gmail.com> wrote:

How many workers (/cpu cores) are assigned to this job?

2016-06-09 13:01 GMT-07:00 SRK <swethakasire...@gmail.com>:

Hi,

How to insert data into 2000 partitions (directories) of ORC/Parquet at a time using Spark SQL? It does not seem to be performant when I try to insert into 2000 directories of Parquet/ORC using Spark SQL. Did anyone face this issue?

Thanks!
--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-insert-data-into-2000-partitions-directories-of-ORC-parquet-at-a-time-using-Spark-SQL-tp27132.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
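On the original question: one thing that can help when writing ~2000 partitions is to cluster the rows by the partition columns before the write, so each partition directory receives a few large ORC files instead of one small file per task per partition. A rough sketch against the column names used earlier in the thread (DataFrame.repartition with column arguments is available from Spark 1.6; untested here):

```scala
// Rough sketch: shuffle rows so that all data for a given
// (idPartitioner, dtPartitioner) pair lands in the same tasks,
// then write with Hive-style partition directories.
import org.apache.spark.sql.functions.col

val clustered = userDF.repartition(col("idPartitioner"), col("dtPartitioner"))

clustered.write
  .partitionBy("idPartitioner", "dtPartitioner")
  .orc("path/to/final/location")
```

Without the repartition, every task can end up opening a writer for every partition it sees, which is usually where the slowdown and the small-files problem come from.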