Spark SQL concurrent runs fails with java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]

2016-06-28 Thread Jesse F Chen
With the Spark 2.0 build from 0615, when running 4-user concurrent SQL tests against Spark SQL on 1TB TPCDS, we are consistently seeing the following exceptions: 10:35:33 AM: 16/06/27 23:40:37 INFO scheduler.TaskSetManager: Finished task 412.0 in stage 819.0 (TID 270396) in 8468 ms on
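A hedged note, not from the thread itself: in Spark 2.0, a `Futures timed out after [300 seconds]` during SQL execution frequently traces back to the broadcast-join timeout, `spark.sql.broadcastTimeout`, whose default is 300 seconds. A sketch of two mitigations, assuming that is the source of the timeout here (values illustrative):

```shell
# Either raise the broadcast-join timeout, or sidestep broadcast joins entirely.
bin/spark-sql \
  --conf spark.sql.broadcastTimeout=1200 \
  --conf spark.sql.autoBroadcastJoinThreshold=-1
```

Setting the threshold to -1 disables automatic broadcast joins; use one setting or the other depending on whether the broadcast itself is the bottleneck under concurrency.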

Re: spark 2.0 issue with yarn?

2016-05-09 Thread Jesse F Chen
From: Sean Owen <so...@cloudera.com> To: Jesse F Chen/San Francisco/IBM@IBMUS Cc: spark

spark 2.0 issue with yarn?

2016-05-09 Thread Jesse F Chen
I had been running fine until builds around 05/07/2016. If I used "--master yarn" in builds after 05/07, I got the following error... sounds like some jars are missing. I am using YARN 2.7.2 and Hive 1.2.1. Do I need to deploy something new related to YARN? bin/spark-sql
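A hedged guess at the cause: Spark 2.0 dropped the single assembly jar that earlier builds shipped, so a YARN deployment built after that change may need to be pointed at the jars directory explicitly via `spark.yarn.jars`. A sketch, with the HDFS path purely illustrative:

```shell
# Stage Spark's jars on HDFS once, then reference them at submit time
hadoop fs -mkdir -p /spark/jars
hadoop fs -put $SPARK_HOME/jars/*.jar /spark/jars/
bin/spark-sql --master yarn \
  --conf spark.yarn.jars="hdfs:///spark/jars/*.jar"
```

Without this (or `spark.yarn.archive`), Spark falls back to uploading jars from the local install, which can fail or behave differently across nightly builds.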

Re: OOM Exception in my spark streaming application

2016-03-19 Thread Jesse F Chen
Somewhat related, though this JIRA is on 1.6. https://issues.apache.org/jira/browse/SPARK-13288#

Re: Configuring/Optimizing Spark

2016-03-03 Thread Jesse F Chen
So you have 90GB total memory, and 24 total cores. Let's say you want to use 80% of all that memory (leaving memory for other components), so you have 72GB to use. You want to take advantage of all the cores and memory. So this would be close: executor size = 6g, number of executors = 12, cores
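The sizing arithmetic above, spelled out as a quick check (the 80% headroom and the 12-executor split are this reply's suggestion, not hard rules):

```python
total_mem_gb = 90
total_cores = 24

usable_mem_gb = total_mem_gb * 80 // 100           # leave ~20% for other components -> 72 GB
num_executors = 12

executor_mem_gb = usable_mem_gb // num_executors   # memory per executor
cores_per_executor = total_cores // num_executors  # cores per executor

print(usable_mem_gb, executor_mem_gb, cores_per_executor)  # → 72 6 2
```

That lands on the "executor size = 6g, 12 executors" figures in the reply, with 2 cores per executor using up all 24 cores.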

select count(*) return wrong row counts

2016-03-02 Thread Jesse F Chen
I am finding a strange issue with Spark SQL where "select count(*) " returns wrong row counts for certain tables. I am using TPCDS tables, so here are the actual counts: Row

streaming in 1.6.0 slower than 1.5.1

2016-01-28 Thread Jesse F Chen
I ran the same streaming application (compiled individually for 1.5.1 and 1.6.0) that processes 5-second tweet batches. I noticed two things: 1. 10% regression in 1.6.0 vs 1.5.1 Spark v1.6.0: 1,564 tweets/s Spark v1.5.1: 1,747 tweets/s 2. 1.6.0 streaming seems to have a memory
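A quick check on the regression figure quoted above:

```python
tweets_per_s_151 = 1747  # Spark v1.5.1 throughput
tweets_per_s_160 = 1564  # Spark v1.6.0 throughput

# fractional throughput drop going from 1.5.1 to 1.6.0
regression = (tweets_per_s_151 - tweets_per_s_160) / tweets_per_s_151
print(f"{regression:.1%}")  # → 10.5%, consistent with the ~10% reported
```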

Spark metrics not working

2015-12-08 Thread Jesse F Chen
v1.5.1. Trying to enable CsvSink for metrics collecting, but I get the following error as soon as kicking off a 'spark-submit' app: 15/12/08 11:24:02 INFO storage.BlockManagerMaster: Registered BlockManager 15/12/08 11:24:02 ERROR metrics.MetricsSystem: Sink class
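A hedged sketch of what a working CsvSink setup usually looks like in `conf/metrics.properties` (the directory path is illustrative; it must already exist and be writable, or the sink class fails to instantiate, which matches the `ERROR metrics.MetricsSystem: Sink class` message above):

```
# conf/metrics.properties — enable the CSV sink for all instances
*.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
*.sink.csv.period=10
*.sink.csv.unit=seconds
*.sink.csv.directory=/tmp/spark-metrics
```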

Re: Re:Re:RE: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ compared with spark 1.4.1 SQL

2015-09-11 Thread Jesse F Chen
igible. From: Davies Liu <dav...@databricks.com> To:

RE: Re:Re:RE: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ compared with spark 1.4.1 SQL

2015-09-11 Thread Jesse F Chen
-master/target/scala-2.10/tpcdssparksql_2.10-0.9.jar hdfs://rhel2.cisco.com:8020/user/bigsql/hadoopds100g /TestAutomation/databricks/spark-sql-perf-master/src/main/queries/jesse/query39b.sql From: "Cheng, Hao" <hao.ch...@intel.com> To: Todd <bit1...@163.com> Cc

Re: spark 1.5 SQL slows down dramatically by 50%+ compared with spark 1.4.1 SQL

2015-09-10 Thread Jesse F Chen
Could this be a build issue (i.e., sbt package)? If I ran the same jar build for 1.4.1 in 1.5, I am seeing large regression too in queries (all other things identical)... I am curious, to build 1.5 (when it isn't released yet), what do I need to do with the build.sbt file? any special
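For building an unreleased branch, the usual route is Spark's bundled build script rather than a hand-edited build.sbt; a sketch, with profile names depending on your Hadoop/Hive setup:

```shell
git clone https://github.com/apache/spark.git && cd spark
git checkout branch-1.5
# Spark 1.x still produces an assembly jar; profiles here are illustrative
build/sbt -Pyarn -Phive -Phadoop-2.6 assembly
```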

Re: Calculating Min and Max Values using Spark Transformations?

2015-08-28 Thread Jesse F Chen
If you already loaded csv data into a dataframe, why not register it as a table, and use Spark SQL to find max/min or any other aggregates? SELECT MAX(column_name) FROM dftable_name ... seems natural.
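The suggestion above, illustrated with plain SQL. This sketch uses Python's sqlite3 so it runs anywhere; in Spark the equivalent would be registering the DataFrame as a temp table (e.g. `df.registerTempTable("dftable_name")` in 1.x) and issuing the same query through the SQL context. Table and column names are placeholders from the reply:

```python
import sqlite3

# Stand-in for a loaded CSV; in Spark this would be a registered temp table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dftable_name (column_name REAL)")
conn.executemany("INSERT INTO dftable_name VALUES (?)", [(3.5,), (1.2,), (9.9,)])

# One SQL pass gets both aggregates — no hand-rolled transformations needed
lo, hi = conn.execute(
    "SELECT MIN(column_name), MAX(column_name) FROM dftable_name"
).fetchone()
print(lo, hi)  # → 1.2 9.9
```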

tweet transformation ideas

2015-08-27 Thread Jesse F Chen
This is a question on general usage/best practice/best transformation method to use for sentiment analysis on tweets... Input: Tweets (e.g., "@xyz, sorry but this movie is poorly scripted http://t.co/uyser876") - a large data set, i.e., 1 billion tweets. Sentiment dictionary (e.g.,
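One common transformation for this, sketched in plain Python (in Spark it would map naturally onto tweets flatMapped into words and looked up against a broadcast sentiment dictionary; the word scores below are made up for illustration):

```python
import re

# Hypothetical sentiment dictionary: word -> score
sentiment = {"sorry": -1, "poorly": -2, "great": 2, "love": 3}

def score_tweet(text: str) -> int:
    # Strip mentions and URLs, lowercase, then sum dictionary hits per word
    cleaned = re.sub(r"(@\w+|https?://\S+)", "", text).lower()
    words = re.findall(r"[a-z']+", cleaned)
    return sum(sentiment.get(w, 0) for w in words)

print(score_tweet("@xyz, sorry but this movie is poorly scripted http://t.co/uyser876"))
# → -3  ("sorry" -1, "poorly" -2)
```

At a billion tweets, the dictionary is small enough to broadcast to every executor, so the per-tweet scoring stays a narrow map with no shuffle.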