Hi John,

The number of rows in the input file is 30 billion. The input data is 72 GB in size, and the output is expected to have readings for each account & day combination for 50k sample accounts, i.e. a total of 50k * 365 ≈ 18.25 million output records.
On Tue, Feb 14, 2017 at 6:29 PM, Jörn Franke <jornfra...@gmail.com> wrote:
> Can you check in the UI which tasks took most of the time?
>
> Even the 45 min looks a little bit much, given that you work most of
> the time with only 50k rows.
>
> On 15 Feb 2017, at 00:03, Timur Shenkao <t...@timshenkao.su> wrote:
>
> Hello,
> I'm not sure that's your reason, but check this discussion:
>
> http://apache-spark-developers-list.1001551.n3.nabble.com/SQL-ML-Pipeline-performance-regression-between-1-6-and-2-x-td20803.html
>
> On Tue, Feb 14, 2017 at 9:25 PM, anatva <arun.na...@gmail.com> wrote:
>
>> Hi,
>> I am reading an ORC file, performing some joins and aggregations, and finally
>> generating a dense vector to perform analytics.
>>
>> The code runs in 45 minutes on Spark 1.6 on a 4-node cluster. When the same
>> code is migrated to run on Spark 2.0 on the same cluster, it takes around
>> 4-5 hours. It is surprising and frustrating.
>>
>> Can anyone take a look and help me with what I should change in order to get
>> at least the same performance in Spark 2.0?
>>
>> spark-shell --master yarn-client --driver-memory 5G --executor-memory 20G \
>>   --num-executors 15 --executor-cores 5 --queue grid_analytics \
>>   --conf spark.yarn.executor.memoryOverhead=5120 \
>>   --conf spark.executor.heartbeatInterval=1200
>>
>> import sqlContext.implicits._
>> import org.apache.spark.storage.StorageLevel
>> import org.apache.spark.sql.functions.lit
>> import org.apache.hadoop._
>> import org.apache.spark.sql.functions.udf
>> import org.apache.spark.mllib.linalg.{Vector, Vectors}
>> import org.apache.spark.ml.clustering.KMeans
>> import org.apache.spark.ml.feature.StandardScaler
>> import org.apache.spark.ml.feature.Normalizer
>>
>> Here are the steps:
>> 1. Read the ORC file
>> 2. Filter some of the records
>> 3. Persist the resulting data frame
>> 4. Get the distinct accounts from the DF and take a sample of 50k accounts from the distinct list
>> 5. Join the above data frame with the distinct 50k accounts to pull records for only those 50k accounts
>> 6. Perform a group by to get the avg, mean, sum, count of readings for the given 50k accounts
>> 7. Join the DF obtained by the GROUP BY with the original DF
>> 8. Convert the resultant DF to an RDD, do a groupByKey(), and generate a DENSE VECTOR
>> 9. Convert the RDD back to a DF and store it in a Parquet file
>>
>> The above steps worked fine in Spark 1.6, but I'm not sure why they run
>> painfully long in Spark 2.0.
>>
>> I am using Spark 1.6 & Spark 2.0 on HDP 2.5.3
>>
>>
>> --
>> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/My-spark-job-runs-faster-in-spark-1-6-and-much-slower-in-spark-2-0-tp28390.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>

--
Regards,
Arun Kumar Natva
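For reference, here is a minimal Scala sketch of the nine steps described in the quoted message, written against the Spark 2.0 DataFrame API. The column names (acct_id, reading), the input/output paths, and the use of limit() in place of the original sampling step are all assumptions for illustration; this is not the original job.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, sum, count}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("readings-pipeline").getOrCreate()
import spark.implicits._

// Steps 1-3: read the ORC file, filter some records, persist the result.
val readings = spark.read.orc("/data/readings.orc")          // hypothetical path
  .filter($"reading".isNotNull)                               // hypothetical filter
  .persist(StorageLevel.MEMORY_AND_DISK)

// Step 4: pick 50k distinct accounts (the original takes a sample;
// limit(50000) stands in for it here).
val sampledAccts = readings.select("acct_id").distinct().limit(50000)

// Step 5: keep only the records for those 50k accounts.
val sampled = readings.join(sampledAccts, Seq("acct_id"))

// Steps 6-7: per-account aggregates, joined back onto the detail records.
val aggs = sampled.groupBy("acct_id").agg(
  avg("reading").as("avg_reading"),
  sum("reading").as("sum_reading"),
  count("reading").as("cnt_reading"))
val enriched = sampled.join(aggs, Seq("acct_id"))

// Step 8: convert to an RDD, groupByKey() per account, build a dense vector.
val vectors = enriched
  .select($"acct_id", $"reading".cast("double"))
  .rdd
  .map(r => (r.getString(0), r.getDouble(1)))
  .groupByKey()
  .map { case (acct, vals) => (acct, Vectors.dense(vals.toArray)) }

// Step 9: back to a DataFrame and out to Parquet.
vectors.toDF("acct_id", "features").write.parquet("/data/readings_vectors")
```

Note that groupByKey() in step 8 shuffles every reading for the sampled accounts, so it is likely to dominate the runtime regardless of the Spark version; the Spark UI (as suggested above) should show which of these stages accounts for the 1.6 vs 2.0 difference.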