Re: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-24 Thread Michael Armbrust
...@databricks.com] *Sent:* Monday, August 24, 2015 2:13 PM *To:* Philip Weaver *Cc:* Jerrick Hoang; Raghavendra Pandey <raghavendra.pan...@gmail.com>; User; Cheng, Hao <hao.ch...@intel.com> *Subject:* Re: Spark Sql behaves strangely with tab

Re: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-24 Thread Michael Armbrust
Michael Armbrust [mailto:mich...@databricks.com] *Sent:* Monday, August 24, 2015 2:13 PM *To:* Philip Weaver *Cc:* Jerrick Hoang; Raghavendra Pandey <raghavendra.pan...@gmail.com>; User; Cheng, Hao <hao.ch...@intel.com>

Re: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-24 Thread Jerrick Hoang
chael Armbrust [mailto:mich...@databricks.com] *Sent:* Monday, August 24, 2015 2:13 PM *To:* Philip Weaver *Cc:* Jerrick Hoang; Raghavendra Pandey <raghavendra.pan...@gmail.com>; User; Cheng, Hao <hao.ch...@intel.com> *Subject:* Re: Spark Sql behaves strangely w

RE: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-24 Thread Sereday, Scott
if you can paste the physical plan for the simple query. From: Jerrick Hoang [mailto:jerrickho...@gmail.com] Sent: Thursday, August 20, 2015 1:46 PM To: Cheng, Hao Cc: Philip Weaver; user Subject: Re: Spark Sql behaves strangely with tables with a lot of par

Re: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-24 Thread Michael Armbrust
philip.wea...@gmail.com> wrote: I hadn't heard of spark.sql.sources.partitionDiscovery.enabled before, and I couldn't find much information about it online. What does it

Re: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-23 Thread Philip Weaver
And also, it would be great if you can paste the physical plan for the simple query. *From:* Jerrick Hoang [mailto:jerrickho...@gmail.com]

Re: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-23 Thread Michael Armbrust
m] *Sent:* Thursday, August 20, 2015 1:46 PM *To:* Cheng, Hao *Cc:* Philip Weaver; user *Subject:* Re: Spark Sql behaves strangely with tables with a lot of partitions

Re: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-23 Thread Jerrick Hoang
*Subject:* Re: Spark Sql behaves strangely with tables with a lot of partitions I cloned from TOT after 1.5.0 cut off. I noticed there were a couple of CLs trying to speed up Spark SQL

Re: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-21 Thread Jerrick Hoang
m wondering if the driver is busy with scanning the HDFS / S3. Like jstack And also, it would be great if you can paste the physical plan for the simple query

Re: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-21 Thread Raghavendra Pandey
And also, it would be great if you can paste the physical plan for the simple query. *From:* Jerrick Hoang [mailto:jerrickho...@gmail.com] *Sent:* Thursday, August 20, 2015 1:46 PM

Re: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-21 Thread Jerrick Hoang
ysical plan for the simple query. *From:* Jerrick Hoang [mailto:jerrickho...@gmail.com] *Sent:* Thursday, August 20, 2015 1:46 PM *To:* Cheng, Hao *Cc:* Philip Weaver; user *Subject:* Re: Spark Sql behaves strangely with

Re: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-20 Thread Philip Weaver
.com] *Sent:* Thursday, August 20, 2015 1:46 PM *To:* Cheng, Hao *Cc:* Philip Weaver; user *Subject:* Re: Spark Sql behaves strangely with tables with a lot of partitions I cloned from TOT after 1.5.0 cut off. I noticed there were a couple of CLs tr

RE: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-19 Thread Cheng, Hao
: Cheng, Hao Cc: Philip Weaver; user Subject: Re: Spark Sql behaves strangely with tables with a lot of partitions I cloned from TOT after 1.5.0 cut off. I noticed there were a couple of CLs trying to speed up Spark SQL with tables with a huge number of partitions; I've made sure that thos

Re: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-19 Thread Jerrick Hoang
*Cc:* user *Subject:* Re: Spark Sql behaves strangely with tables with a lot of partitions I guess the question is why does Spark have to do partition discovery with all partitions when the query only needs to look at one partition? Is there a conf f

RE: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-19 Thread Cheng, Hao
Yes, you can try setting spark.sql.sources.partitionDiscovery.enabled to false. BTW, which version are you using? Hao From: Jerrick Hoang [mailto:jerrickho...@gmail.com] Sent: Thursday, August 20, 2015 12:16 PM To: Philip Weaver Cc: user Subject: Re: Spark Sql behaves strangely with tables with

Re: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-19 Thread Jerrick Hoang
I guess the question is why does Spark have to do partition discovery with all partitions when the query only needs to look at one partition? Is there a conf flag to turn this off? On Wed, Aug 19, 2015 at 9:02 PM, Philip Weaver wrote: > I've had the same problem. It turns out that Spark (specifi

Re: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-19 Thread Philip Weaver
I've had the same problem. It turns out that Spark (specifically Parquet) is very slow at partition discovery. It got better in 1.5 (not yet released), but was still unacceptably slow. Sadly, we ended up reading Parquet files manually in Python (via C++) and had to abandon Spark SQL because of this

Spark Sql behaves strangely with tables with a lot of partitions

2015-08-19 Thread Jerrick Hoang
Hi all, I did a simple experiment with Spark SQL. I created a partitioned Parquet table with only one partition (date=20140701). A simple `select count(*) from table where date=20140701` would run very fast (0.1 seconds). However, as I added more partitions, the query took longer and longer. When
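
The scaling pattern reported in this thread can be illustrated with a small toy model that does not use Spark. This is only a sketch under stated assumptions: the `date=<value>` directory layout, the placeholder file name `part-0.parquet`, and the function names `make_table`, `discover_then_filter`, and `prune_then_list` are all hypothetical and are not Spark APIs or a description of Spark's internals. It contrasts listing every partition directory before filtering with going straight to the one partition the query needs.

```python
import os
import tempfile


def make_table(root, values):
    """Create a hypothetical layout: root/date=<value>/part-0.parquet."""
    for value in values:
        part_dir = os.path.join(root, "date=%s" % value)
        os.makedirs(part_dir)
        # Empty placeholder; a real table would hold Parquet data here.
        open(os.path.join(part_dir, "part-0.parquet"), "w").close()


def discover_then_filter(root, value):
    """List every partition directory first, then keep the matching one.

    Returns (files of the matching partition, number of partitions listed).
    """
    listed = 0
    files = []
    for name in sorted(os.listdir(root)):
        key, _, part_value = name.partition("=")
        if key != "date":
            continue
        listed += 1
        part_dir = os.path.join(root, name)
        part_files = os.listdir(part_dir)  # one listing call per partition
        if part_value == value:
            files = [os.path.join(part_dir, f) for f in part_files]
    return files, listed


def prune_then_list(root, value):
    """Build the path of the single needed partition and list only that one."""
    part_dir = os.path.join(root, "date=%s" % value)
    files = [os.path.join(part_dir, f) for f in os.listdir(part_dir)]
    return files, 1


def demo(num_partitions=30):
    """Compare both strategies on a temporary table with num_partitions partitions."""
    with tempfile.TemporaryDirectory() as root:
        # Partition values are plain labels starting at 20140701.
        values = [str(20140701 + i) for i in range(num_partitions)]
        make_table(root, values)
        full_files, full_listed = discover_then_filter(root, "20140701")
        pruned_files, pruned_listed = prune_then_list(root, "20140701")
        return full_files == pruned_files, full_listed, pruned_listed


if __name__ == "__main__":
    same, full_listed, pruned_listed = demo()
    print("same result:", same)
    print("directories listed (discover all):", full_listed)
    print("directories listed (pruned):", pruned_listed)
```

Both strategies return the same file for the queried partition, but the first one performs a listing for every partition in the table, so its cost grows with the partition count even though the query touches a single partition. On HDFS or S3, where each listing is a remote call, that difference would be expected to matter far more than on a local disk.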