Re: How to insert data into 2000 partitions (directories) of ORC/parquet at a time using Spark SQL?

2016-06-13 Thread Bijay Pathak
ext.sql("set hive.enforce.bucketing = true; ") >>> sqlContext.sql("set hive.enforce.sorting = true; ") >>> sqlContext.sql(" CREATE EXTERNAL TABLE IF NOT EXISTS users (userId >>> STRING, userRecord STRING) PARTITIONED BY (idPartitioner STRING, >&

Re: How to insert data into 2000 partitions (directories) of ORC/parquet at a time using Spark SQL?

2016-06-10 Thread Bijay Pathak
Hello, Looks like you are hitting this: https://issues.apache.org/jira/browse/HIVE-11940. Thanks, Bijay On Thu, Jun 9, 2016 at 9:25 PM, Mich Talebzadeh wrote: > can you provide a code snippet of how you are populating the target table > from the temp table. > HTH > Dr Mich Talebzadeh

Re: Dataframe saves for a large set but throws OOM for a small dataset

2016-04-30 Thread Bijay Pathak
Sorry for the confusion, this was supposed to be the answer for another thread. Bijay On Sat, Apr 30, 2016 at 2:37 PM, Bijay Kumar Pathak wrote: > Hi, > I was facing the same issue on Spark 1.6. My data size was around 100 GB > and I was writing into a partitioned Hive table. > I was able to solve

Fwd: Connection closed Exception.

2016-04-10 Thread Bijay Pathak
Hi, I am running Spark 1.6 on EMR. I have a workflow which does the following things: 1. Read the 2 flat files, create the data frames, and join them. 2. Read the particular partition from the Hive table and join the dataframe from 1 with it. 3. Finally, insert overwrite into the Hive table wh

Connection closed Exception.

2016-04-10 Thread Bijay Pathak
Hi, I am running Spark 1.6 on EMR. I have a workflow which does the following things: 1. Read the 2 flat files, create the data frames, and join them. 2. Read the particular partition from the Hive table and join the dataframe from 1 with it. 3. Finally, insert overwrite into the Hive table wh
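A minimal sketch of the three-step workflow described above, assuming comma-delimited flat files; the file paths, table names, and column names are illustrative and do not come from the thread:

    # Hedged sketch of the 3-step workflow (Spark 1.6 APIs; all names illustrative).
    from pyspark import SparkContext
    from pyspark.sql import HiveContext, Row

    sc = SparkContext()
    sqlContext = HiveContext(sc)

    # 1. Read the two flat files (assumed comma-delimited), build DataFrames, and join them.
    def parse_a(line):
        key, value = line.split(",", 1)
        return Row(join_key=key, col_a=value)

    def parse_b(line):
        key, value = line.split(",", 1)
        return Row(join_key=key, col_b=value)

    df1 = sqlContext.createDataFrame(sc.textFile("s3://bucket/file1.txt").map(parse_a))
    df2 = sqlContext.createDataFrame(sc.textFile("s3://bucket/file2.txt").map(parse_b))
    joined = df1.join(df2, "join_key")

    # 2. Read a particular partition of a Hive table and join it with the step-1 result.
    hive_part = sqlContext.sql(
        "SELECT join_key, col_c FROM dim_table WHERE partition_col = '2016-04-10'")
    enriched = joined.join(hive_part, "join_key")

    # 3. Insert overwrite the result into the target Hive table partition.
    enriched.registerTempTable("enriched_tmp")
    sqlContext.sql("""
        INSERT OVERWRITE TABLE target_table PARTITION (partition_col = '2016-04-10')
        SELECT join_key, col_a, col_b, col_c FROM enriched_tmp
    """)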

Re: Converting a string of format 'dd/MM/yyyy' in Spark SQL

2016-03-24 Thread Bijay Pathak
Hi, I have written a UDF to do the same on a PySpark DataFrame, since some of my dates are before the standard Unix epoch of 1/1/1970. I have more than 250 columns and am applying a custom date_format UDF to more than 50 columns. I am getting OOM errors and poor performance because of the UDF. What's your
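One common way to avoid the per-row Python UDF overhead on Spark 1.5/1.6 is to use the built-in column functions instead; a hedged sketch follows (column names and sample dates are illustrative, and whether unix_timestamp behaves well for pre-1970 dates on a given build is worth verifying):

    # Sketch: reformat 'dd/MM/yyyy' strings with built-in functions instead of a Python UDF.
    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    from pyspark.sql import functions as F

    sc = SparkContext()
    sqlContext = SQLContext(sc)
    df = sqlContext.createDataFrame([("15/06/1965",), ("24/03/2016",)], ["order_date"])

    # Parse the string and re-render it in the target format; this runs in the JVM,
    # so no per-row Python serialization is involved.
    df = df.withColumn("order_date", F.date_format(
        F.unix_timestamp(F.col("order_date"), "dd/MM/yyyy").cast("timestamp"),
        "yyyy-MM-dd"))
    df.show()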

Re: multiple tables for join

2016-03-24 Thread Bijay Pathak
Hi, Can you elaborate on the issues you are facing? I am doing a similar kind of join, so I may be able to provide you with some suggestions and pointers. Thanks, Bijay On Thu, Mar 24, 2016 at 5:12 AM, pseudo oduesp wrote: > hi, I spent two months of my time to make 10 joins with folow

DataFrame Python UDF performance too slow

2016-03-24 Thread Bijay Pathak
Hi, I am running Spark 1.6.0 on EMR. The job fails with OOM. I have a DataFrame with 250 columns and I am applying a UDF on more than 50 of the columns. I am registering the DataFrame as a temp table and applying the UDF in a hive_context SQL statement. I am applying the UDF after a sort-merge join of two Dat
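For reference, a minimal sketch of the pattern described above, i.e. registering a Python UDF with a HiveContext and calling it from a SQL statement over a temp table (the UDF body and all names are illustrative):

    # Sketch: register a Python UDF and call it in SQL (Spark 1.6 API).
    from pyspark import SparkContext
    from pyspark.sql import HiveContext
    from pyspark.sql.types import StringType

    sc = SparkContext()
    hive_context = HiveContext(sc)

    # Illustrative UDF; every call crosses the JVM/Python boundary, which is the usual
    # reason Python UDFs are much slower than built-in column functions.
    def normalize(value):
        return value.strip().lower() if value is not None else None

    hive_context.registerFunction("normalize_udf", normalize, StringType())

    df = hive_context.createDataFrame([(" Foo ",), ("BAR",)], ["col1"])
    df.registerTempTable("joined_tmp")
    result = hive_context.sql("SELECT normalize_udf(col1) AS col1 FROM joined_tmp")
    result.show()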

Re: adding rows to a DataFrame

2016-03-11 Thread Bijay Pathak
Here is another way you can achieve that (in Python): base_df.withColumn("column_name", column_expression_for_new_column) # the second argument must be a Column expression, not a string # To add a new row, create a DataFrame containing the new row and do a unionAll(): base_df.unionAll(new_df) # Another approach: convert to an RDD, add the required fields, and convert back to a DataFrame
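A runnable sketch of the two approaches, assuming the Spark 1.x PySpark API and illustrative column names:

    # Sketch: add a column with withColumn(), add a row with unionAll().
    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    from pyspark.sql import functions as F

    sc = SparkContext()
    sqlContext = SQLContext(sc)

    base_df = sqlContext.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

    # New column: the second argument is a Column expression.
    with_col = base_df.withColumn("value_upper", F.upper(base_df["value"]))

    # New row: build a one-row DataFrame with the same schema and unionAll it.
    new_df = sqlContext.createDataFrame([(3, "c")], ["id", "value"])
    combined = base_df.unionAll(new_df)
    combined.show()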

Getting all field values as Null while reading Hive Table with Partition

2016-01-20 Thread Bijay Pathak
Hello, I am getting all the field values as NULL while reading a partitioned Hive table in Spark 1.5.0 running on CDH 5.5.1 with YARN (dynamic allocation). Below are the commands I used in spark-shell: import org.apache.spark.sql.hive.HiveContext val hiveContext = new HiveContext(sc) val hiveTab
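For comparison, a PySpark sketch of the same read path; the table and partition-column names are illustrative, and this only mirrors the quoted spark-shell commands rather than fixing the NULL issue:

    # Sketch: read a partitioned Hive table through HiveContext (Spark 1.5 API).
    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext()
    hive_context = HiveContext(sc)

    # Filtering on the partition column lets Spark prune partitions at read time.
    hive_table = hive_context.table("my_db.partitioned_table")
    hive_table.filter(hive_table["partition_col"] == "2016-01-20").show()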

Base metrics for Spark Benchmarking.

2015-04-16 Thread Bijay Pathak
Hello, We want to tune Spark running on a YARN cluster. The Spark History Server UI shows lots of metrics, like: GC time, task duration, shuffle read/write, shuffle spill (memory/disk), serialization time (task/result), and scheduler delay. Among the above metrics, which are th

Re: why "Shuffle Write" is not zero when everything is cached and there is enough memory?

2015-03-31 Thread Bijay Pathak
from older map tasks to memory? > On Tue, Mar 31, 2015 at 1:19 PM, Bijay Pathak wrote: >> The Spark Sort-Based Shuffle (default from 1.1) keeps the data from >> each map task in memory until it can't fit, after which it is >> sorted and spilled to

Re: why "Shuffle Write" is not zero when everything is cached and there is enough memory?

2015-03-31 Thread Bijay Pathak
The Spark Sort-Based Shuffle (the default since 1.1) keeps the data from each map task in memory until it can't fit, after which it is sorted and spilled to disk. You can reduce the shuffle write to disk by increasing spark.shuffle.memoryFraction (default 0.2). By writing the shuffle output to
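A hedged sketch of how that setting is raised under the pre-1.6 memory model; the values shown are illustrative, not a recommendation:

    # Sketch: raise spark.shuffle.memoryFraction before creating the SparkContext.
    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("shuffle-tuning-example")
            # Default is 0.2; a larger fraction leaves more heap for shuffle buffers
            # before they spill to disk.
            .set("spark.shuffle.memoryFraction", "0.4")
            # Illustrative: the right split depends on how much memory caching needs.
            .set("spark.storage.memoryFraction", "0.5"))
    sc = SparkContext(conf=conf)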

Shuffle Spill Memory and Shuffle Spill Disk

2015-03-23 Thread Bijay Pathak
Hello, I am running TeraSort on 100 GB of data. The final Shuffle Spill metrics I am getting are: Shuffle Spill (Memory): 122.5 GB, Shuffle Spill (Disk): 3.4 GB. What's the difference and relation between these two metrics? Do these mean 122.5 GB was s