I've had the same problem. It turns out that Spark (specifically its Parquet
data source) is very slow at partition discovery: as far as we could tell,
even a query touching a single partition first lists every partition
directory (and, with schema merging enabled, reads file footers across the
table) before planning. It got better in 1.5 (not yet released), but was
still unacceptably slow. Sadly, we ended up reading Parquet files manually
in Python (via C++) and had to abandon Spark SQL because of this problem.
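
For what it's worth, the workaround boils down to pointing a reader directly
at the one partition directory you need, so nothing has to enumerate the
other 10,000. A rough sketch of the idea in Python (pyarrow stands in here
for our custom C++ bindings, and the path is made up):

    # Read only the files under the single partition directory, skipping
    # Spark's partition discovery entirely.
    import pyarrow.parquet as pq

    # Pointing at the partition directory (not the table root) means no
    # listing or footer-reading of the other ~10,000 partitions.
    table = pq.read_table("/data/mytable/date=20140701")
    print(table.num_rows)  # same count as the `where date=20140701` query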

On Wed, Aug 19, 2015 at 7:51 PM, Jerrick Hoang <jerrickho...@gmail.com>
wrote:

> Hi all,
>
> I did a simple experiment with Spark SQL. I created a partitioned parquet
> table with only one partition (date=20140701). A simple `select count(*)
> from table where date=20140701` would run very fast (0.1 seconds). However,
> as I added more partitions, the query took longer and longer. When I added
> about 10,000 partitions, the query took way too long. I feel like querying
> for a single partition should not be affected by having more partitions. Is
> this a known behaviour? What does spark try to do here?
>
> Thanks,
> Jerrick
>
