Hi John,

The number of rows in the input file is 30 billion. The input data is 72 GB in size, and the output is expected to have readings for each account & day combination for 50k sample accounts, i.e. a total of 50k * 365 ≈ 18.25 million output records.
On Tue, Feb 14, 2017 at 6:29 PM, Jörn Franke <jornfra...@gmail.com> wrote:
> Can you check in the UI which tasks took most of the time?
>
> Even the 45 min looks a little bit much, given that you work most of
> the time with only 50k rows.
>
> On 15 Feb 2017, at 00:03, Timur Shenkao <t...@timshenkao.su> wrote:
>
> Hello,
> I'm not sure that's your reason, but check this discussion:
>
> http://apache-spark-developers-list.1001551.n3.nabble.com/SQL-ML-Pipeline-performance-regression-between-1-6-and-2-x-td20803.html
>
> On Tue, Feb 14, 2017 at 9:25 PM, anatva <arun.na...@gmail.com> wrote:
>
>> Hi,
>> I am reading an ORC file, performing some joins and aggregations, and finally
>> generating a dense vector to perform analytics.
>>
>> The code runs in 45 minutes on Spark 1.6 on a 4-node cluster. When the same
>> code is migrated to run on Spark 2.0 on the same cluster, it takes around
>> 4-5 hours. It is surprising and frustrating.
>>
>> Can anyone take a look and help me with what I should change in order to get
>> at least the same performance in Spark 2.0?
>>
>> spark-shell --master yarn-client --driver-memory 5G --executor-memory 20G \
>>   --num-executors 15 --executor-cores 5 --queue grid_analytics \
>>   --conf spark.yarn.executor.memoryOverhead=5120 \
>>   --conf spark.executor.heartbeatInterval=1200
>>
>> import sqlContext.implicits._
>> import org.apache.spark.storage.StorageLevel
>> import org.apache.spark.sql.functions.lit
>> import org.apache.hadoop._
>> import org.apache.spark.sql.functions.udf
>> import org.apache.spark.mllib.linalg.{Vector, Vectors}
>> import org.apache.spark.ml.clustering.KMeans
>> import org.apache.spark.ml.feature.StandardScaler
>> import org.apache.spark.ml.feature.Normalizer
>>
>> Here are the steps:
>> 1. Read the ORC file
>> 2. Filter some of the records
>> 3. Persist the resulting data frame
>> 4. Get the distinct accounts from the DF and take a sample of 50k accounts from the distinct list
>> 5. Join the above data frame with the distinct 50k accounts to pull records for only those 50k accounts
>> 6. Perform a group by to get the avg, mean, sum, count of readings for the given 50k accounts
>> 7. Join the DF obtained by the GROUP BY with the original DF
>> 8. Convert the resultant DF to an RDD, do a groupByKey(), and generate a DENSE VECTOR
>> 9. Convert the RDD back to a DF and store it in a Parquet file
>>
>> The above steps worked fine in Spark 1.6, but I'm not sure why they run
>> painfully long in Spark 2.0.
>>
>> I am using Spark 1.6 & Spark 2.0 on HDP 2.5.3
>>
>>
>> --
>> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/My-spark-job-runs-faster-in-spark-1-6-and-much-slower-in-spark-2-0-tp28390.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>

--
Regards,
Arun Kumar Natva
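For reference, here is a minimal Scala sketch of the nine steps described in the quoted message, written against the Spark 2.0 DataFrame API. The column names (acct_id, reading), the input/output paths, and the use of limit() in place of the original sampling step are all assumptions for illustration; this is not the original job.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, sum, count}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("readings-pipeline").getOrCreate()
import spark.implicits._

// Steps 1-3: read the ORC file, filter some records, persist the result.
val readings = spark.read.orc("/data/readings.orc")          // hypothetical path
  .filter($"reading".isNotNull)                               // hypothetical filter
  .persist(StorageLevel.MEMORY_AND_DISK)

// Step 4: pick 50k distinct accounts (the original takes a sample;
// limit(50000) stands in for it here).
val sampledAccts = readings.select("acct_id").distinct().limit(50000)

// Step 5: keep only the records for those 50k accounts.
val sampled = readings.join(sampledAccts, Seq("acct_id"))

// Steps 6-7: per-account aggregates, joined back onto the detail records.
val aggs = sampled.groupBy("acct_id").agg(
  avg("reading").as("avg_reading"),
  sum("reading").as("sum_reading"),
  count("reading").as("cnt_reading"))
val enriched = sampled.join(aggs, Seq("acct_id"))

// Step 8: convert to an RDD, groupByKey() per account, build a dense vector.
val vectors = enriched
  .select($"acct_id", $"reading".cast("double"))
  .rdd
  .map(r => (r.getString(0), r.getDouble(1)))
  .groupByKey()
  .map { case (acct, vals) => (acct, Vectors.dense(vals.toArray)) }

// Step 9: back to a DataFrame and out to Parquet.
vectors.toDF("acct_id", "features").write.parquet("/data/readings_vectors")
```

Note that groupByKey() in step 8 shuffles every reading for the sampled accounts, so it is likely to dominate the runtime regardless of the Spark version; the Spark UI (as suggested above) should show which of these stages accounts for the 1.6 vs 2.0 difference.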