Re: How to insert data into 2000 partitions (directories) of ORC/parquet at a time using Spark SQL?

2016-06-13 Thread Bijay Pathak
ext.sql("set hive.enforce.bucketing = true; ") >>> sqlContext.sql("set hive.enforce.sorting = true; ") >>> sqlContext.sql(" CREATE EXTERNAL TABLE IF NOT EXISTS users (userId >>> STRING, userRecord STRING) PARTITIONED BY (idPartitioner STRING, >&

Re: How to insert data into 2000 partitions (directories) of ORC/parquet at a time using Spark SQL?

2016-06-10 Thread Bijay Pathak
Hello, Looks like you are hitting this: https://issues.apache.org/jira/browse/HIVE-11940. Thanks, Bijay On Thu, Jun 9, 2016 at 9:25 PM, Mich Talebzadeh wrote: > can you provide a code snippet of how you are populating the target table > from the temp table. > HTH > Dr Mich Talebzadeh

Re: Dataframe saves for a large set but throws OOM for a small dataset

2016-04-30 Thread Bijay Pathak
Sorry for the confusion, this was supposed to be the answer for another thread. Bijay On Sat, Apr 30, 2016 at 2:37 PM, Bijay Kumar Pathak wrote: > Hi, > I was facing the same issue on Spark 1.6. My data size was around 100 GB > and I was writing into a partitioned Hive table. > I was able to solve

Fwd: Connection closed Exception.

2016-04-10 Thread Bijay Pathak
Hi, I am running Spark 1.6 on EMR. I have a workflow which does the following things: 1. Read the 2 flat files, create the data frames, and join them. 2. Read the particular partition from the Hive table and join the dataframe from 1 with it. 3. Finally, insert overwrite into the Hive table wh

Connection closed Exception.

2016-04-10 Thread Bijay Pathak
Hi, I am running Spark 1.6 on EMR. I have a workflow which does the following things: 1. Read the 2 flat files, create the data frames, and join them. 2. Read the particular partition from the Hive table and join the dataframe from 1 with it. 3. Finally, insert overwrite into the Hive table wh
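A minimal sketch of the three-step workflow described above, assuming comma-delimited flat files; the file paths, table names, and column names are illustrative and do not come from the thread:

    # Hedged sketch of the 3-step workflow (Spark 1.6 APIs; all names illustrative).
    from pyspark import SparkContext
    from pyspark.sql import HiveContext, Row

    sc = SparkContext()
    sqlContext = HiveContext(sc)

    # 1. Read the two flat files (assumed comma-delimited), build DataFrames, and join them.
    def parse_a(line):
        key, value = line.split(",", 1)
        return Row(join_key=key, col_a=value)

    def parse_b(line):
        key, value = line.split(",", 1)
        return Row(join_key=key, col_b=value)

    df1 = sqlContext.createDataFrame(sc.textFile("s3://bucket/file1.txt").map(parse_a))
    df2 = sqlContext.createDataFrame(sc.textFile("s3://bucket/file2.txt").map(parse_b))
    joined = df1.join(df2, "join_key")

    # 2. Read a particular partition of a Hive table and join it with the step-1 result.
    hive_part = sqlContext.sql(
        "SELECT join_key, col_c FROM dim_table WHERE partition_col = '2016-04-10'")
    enriched = joined.join(hive_part, "join_key")

    # 3. Insert overwrite the result into the target Hive table partition.
    enriched.registerTempTable("enriched_tmp")
    sqlContext.sql("""
        INSERT OVERWRITE TABLE target_table PARTITION (partition_col = '2016-04-10')
        SELECT join_key, col_a, col_b, col_c FROM enriched_tmp
    """)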

Re: Converting a string of format 'dd/MM/yyyy' in Spark SQL

2016-03-24 Thread Bijay Pathak
Hi, I have written a UDF to do the same on a PySpark DataFrame, since some of my dates are before the standard Unix epoch of 1/1/1970. I have more than 250 columns and am applying a custom date_format UDF to more than 50 columns. I am getting OOM errors and poor performance because of the UDF. What's your
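One common way to avoid the per-row Python UDF overhead on Spark 1.5/1.6 is to use the built-in column functions instead; a hedged sketch follows (column names and sample dates are illustrative, and whether unix_timestamp behaves well for pre-1970 dates on a given build is worth verifying):

    # Sketch: reformat 'dd/MM/yyyy' strings with built-in functions instead of a Python UDF.
    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    from pyspark.sql import functions as F

    sc = SparkContext()
    sqlContext = SQLContext(sc)
    df = sqlContext.createDataFrame([("15/06/1965",), ("24/03/2016",)], ["order_date"])

    # Parse the string and re-render it in the target format; this runs in the JVM,
    # so no per-row Python serialization is involved.
    df = df.withColumn("order_date", F.date_format(
        F.unix_timestamp(F.col("order_date"), "dd/MM/yyyy").cast("timestamp"),
        "yyyy-MM-dd"))
    df.show()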

Re: multiple tables for join

2016-03-24 Thread Bijay Pathak
Hi, Can you elaborate on the issues you are facing? I am doing a similar kind of join, so I may be able to provide you with some suggestions and pointers. Thanks, Bijay On Thu, Mar 24, 2016 at 5:12 AM, pseudo oduesp wrote: > hi, I spent two months of my time to make 10 joins with folow

DataFrame Python UDF performance too slow

2016-03-24 Thread Bijay Pathak
Hi, I am running Spark 1.6.0 on EMR. The job fails with OOM. I have a DataFrame with 250 columns and I am applying a UDF on more than 50 of the columns. I am registering the DataFrame as a temp table and applying the UDF in a hive_context SQL statement. I am applying the UDF after a sort-merge join of two Dat
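For reference, a minimal sketch of the pattern described above, i.e. registering a Python UDF with a HiveContext and calling it from a SQL statement over a temp table (the UDF body and all names are illustrative):

    # Sketch: register a Python UDF and call it in SQL (Spark 1.6 API).
    from pyspark import SparkContext
    from pyspark.sql import HiveContext
    from pyspark.sql.types import StringType

    sc = SparkContext()
    hive_context = HiveContext(sc)

    # Illustrative UDF; every call crosses the JVM/Python boundary, which is the usual
    # reason Python UDFs are much slower than built-in column functions.
    def normalize(value):
        return value.strip().lower() if value is not None else None

    hive_context.registerFunction("normalize_udf", normalize, StringType())

    df = hive_context.createDataFrame([(" Foo ",), ("BAR",)], ["col1"])
    df.registerTempTable("joined_tmp")
    result = hive_context.sql("SELECT normalize_udf(col1) AS col1 FROM joined_tmp")
    result.show()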

Re: adding rows to a DataFrame

2016-03-11 Thread Bijay Pathak
Here is another way you can achieve that (in Python): base_df.withColumn("column_name", column_expression_for_new_column) # the second argument must be a Column expression, not a string # To add a new row, create a DataFrame containing the new row and do a unionAll(): base_df.unionAll(new_df) # Another approach: convert to an RDD, add the required fields, and convert back to a DataFrame
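A runnable sketch of the two approaches, assuming the Spark 1.x PySpark API and illustrative column names:

    # Sketch: add a column with withColumn(), add a row with unionAll().
    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    from pyspark.sql import functions as F

    sc = SparkContext()
    sqlContext = SQLContext(sc)

    base_df = sqlContext.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

    # New column: the second argument is a Column expression.
    with_col = base_df.withColumn("value_upper", F.upper(base_df["value"]))

    # New row: build a one-row DataFrame with the same schema and unionAll it.
    new_df = sqlContext.createDataFrame([(3, "c")], ["id", "value"])
    combined = base_df.unionAll(new_df)
    combined.show()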

Getting all field values as Null while reading Hive Table with Partition

2016-01-20 Thread Bijay Pathak
Hello, I am getting all the field values as NULL while reading a partitioned Hive table in Spark 1.5.0 running on CDH 5.5.1 with YARN (dynamic allocation). Below are the commands I used in spark-shell: import org.apache.spark.sql.hive.HiveContext val hiveContext = new HiveContext(sc) val hiveTab
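For comparison, a PySpark sketch of the same read path; the table and partition-column names are illustrative, and this only mirrors the quoted spark-shell commands rather than fixing the NULL issue:

    # Sketch: read a partitioned Hive table through HiveContext (Spark 1.5 API).
    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext()
    hive_context = HiveContext(sc)

    # Filtering on the partition column lets Spark prune partitions at read time.
    hive_table = hive_context.table("my_db.partitioned_table")
    hive_table.filter(hive_table["partition_col"] == "2016-01-20").show()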

Base metrics for Spark Benchmarking.

2015-04-16 Thread Bijay Pathak
Hello, We want to tune Spark running on a YARN cluster. The Spark History Server UI shows lots of metrics, like: GC time, task duration, shuffle read/write, shuffle spill (memory/disk), serialization time (task/result), and scheduler delay. Among the above metrics, which are th

Re: why "Shuffle Write" is not zero when everything is cached and there is enough memory?

2015-03-31 Thread Bijay Pathak
from older map tasks to memory? > On Tue, Mar 31, 2015 at 1:19 PM, Bijay Pathak wrote: >> The Spark Sort-Based Shuffle (default from 1.1) keeps the data from >> each map task in memory until it can't fit, after which it is >> sorted and spilled to

Re: why "Shuffle Write" is not zero when everything is cached and there is enough memory?

2015-03-31 Thread Bijay Pathak
The Spark Sort-Based Shuffle (the default since 1.1) keeps the data from each map task in memory until it can't fit, after which it is sorted and spilled to disk. You can reduce the shuffle write to disk by increasing spark.shuffle.memoryFraction (default 0.2). By writing the shuffle output to
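A hedged sketch of how that setting is raised under the pre-1.6 memory model; the values shown are illustrative, not a recommendation:

    # Sketch: raise spark.shuffle.memoryFraction before creating the SparkContext.
    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("shuffle-tuning-example")
            # Default is 0.2; a larger fraction leaves more heap for shuffle buffers
            # before they spill to disk.
            .set("spark.shuffle.memoryFraction", "0.4")
            # Illustrative: the right split depends on how much memory caching needs.
            .set("spark.storage.memoryFraction", "0.5"))
    sc = SparkContext(conf=conf)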

Shuffle Spill Memory and Shuffle Spill Disk

2015-03-23 Thread Bijay Pathak
Hello, I am running TeraSort on 100 GB of data. The final Shuffle Spill metrics I am getting are: Shuffle Spill (Memory): 122.5 GB, Shuffle Spill (Disk): 3.4 GB. What's the difference and relation between these two metrics? Do these mean 122.5 GB was s