Thanks for the suggestion.  The output from EXPLAIN is indeed equivalent
for sparkSQL and the Thrift server.  I did some more testing, and the
source of the performance difference turned out to be the way I was
triggering the sparkSQL query: I was using .count() instead of .collect().
When I use .collect() I get the same performance as the Thrift server.  My
table has 28 columns, so I guess that .count() only required one column to
be loaded into memory, whereas .collect() required all columns to be
loaded?  Curiously, it doesn't appear to matter how many rows are
returned; the speed is the same even if I adjust the query to return 0
rows.  Anyway, it looks like it was a poor comparison on my part, and
there is no real performance difference between the Thrift server and
sparkSQL.  Thanks for the help.
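
For reference, here is a minimal pyspark sketch of the two ways I was
triggering the query (my_table and the id value are just the ones from my
query below; the comments reflect my guess above, not anything verified):

    df = sqlContext.sql("SELECT * FROM my_table WHERE id = '12345'")
    df.count()    # presumably only needs one column (or just counts) from Parquet
    df.collect()  # reads all 28 columns for the matching rows, like the Thrift server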

-Jeff

On Sat, Oct 3, 2015 at 1:26 PM, Michael Armbrust <mich...@databricks.com>
wrote:

> Underneath the covers, the thrift server is just calling
> hiveContext.sql(...)
> <https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala#L224>,
> so this is surprising.  Maybe running EXPLAIN or EXPLAIN EXTENDED in both
> modes would be helpful in debugging?
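>
> For example, something along these lines in the pyspark shell and in
> beeline respectively (substituting your own table and predicate):
>
>     sqlContext.sql(
>         "EXPLAIN EXTENDED SELECT * FROM my_table WHERE id = '12345'"
>     ).collect()
>
>     EXPLAIN EXTENDED SELECT * FROM my_table WHERE id = '12345';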
>
>
>
> On Sat, Oct 3, 2015 at 1:08 PM, Jeff Thompson <
> jeffreykeatingthomp...@gmail.com> wrote:
>
>> Hi,
>>
>> I'm running a simple SQL query over a ~700 million row table of the form:
>>
>> SELECT * FROM my_table WHERE id = '12345';
>>
>> When I submit the query via beeline & the JDBC Thrift server, it returns
>> in 35s.  When I submit the exact same query using sparkSQL from a pyspark
>> shell (sqlContext.sql("SELECT * FROM ....")), it returns in 3s.
>>
>> Both times are obtained from the spark web UI.  The query only returns 43
>> rows, a small amount of data.
>>
>> The table was created by saving a sparkSQL dataframe as a parquet file
>> and then calling createExternalTable.
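>>
>> Roughly, the setup looked like this in pyspark (the path is just a
>> placeholder):
>>
>>     df.write.parquet("/path/to/my_table")
>>     sqlContext.createExternalTable("my_table", "/path/to/my_table")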
>>
>> I have tried to ensure that all relevant cluster parameters are
>> equivalent across the two queries:
>> spark.executor.memory = 6g
>> spark.executor.instances = 100
>> no explicit caching (storage tab in web UI is empty)
>> spark version: 1.4.1
>> Hadoop v2.5.0-cdh5.3.0, running spark on top of YARN
>> jobs run on the same physical cluster (on-site hardware)
>>
>> From the web UIs, I can see that the query plans are clearly different,
>> and I think this may be the source of the performance difference.
>>
>> Thrift server job:
>> 1 stage only, stage 1 (35s) map -> Filter -> mapPartitions
>>
>> SparkSQL job:
>> 2 stages, stage 1 (2s): map -> filter -> Project -> Aggregate ->
>> Exchange, stage 2 (0.4s): Exchange -> Aggregate -> mapPartitions
>>
>> Is this a known issue?  Is there anything I can do to get the Thrift
>> server to use the same query optimizer as the one used by sparkSQL?  I'd
>> love to pick up a ~10x performance gain for my jobs submitted via the
>> Thrift server.
>>
>> Best regards,
>>
>> Jeff
>>
>
>
