Hey Jerry,

When you ran these queries using different methods, did you see any
discrepancy in the returned results (i.e. the counts)?

On Thu, Jul 10, 2014 at 5:55 PM, Michael Armbrust
<mich...@databricks.com> wrote:
> Yeah, sorry.  I think you are seeing some weirdness with partitioned tables
> that I have also seen elsewhere. I've created a JIRA and assigned someone at
> databricks to investigate.
>
> https://issues.apache.org/jira/browse/SPARK-2443
>
>
> On Thu, Jul 10, 2014 at 5:33 PM, Jerry Lam <chiling...@gmail.com> wrote:
>>
>> Hi Michael,
>>
>> Yes, the table is partitioned on one column. There are 11 columns in the
>> table, all of type String.
>>
>> I understand that the SerDe contributes some overhead, but using pure
>> Hive we could run the query about 5 times faster than Spark SQL. Given
>> that Hive pays the same SerDe cost, there must be some additional
>> overhead in Spark SQL that Hive doesn't have.
>>
>> Best Regards,
>>
>> Jerry
>>
>>
>>
>> On Thu, Jul 10, 2014 at 7:11 PM, Michael Armbrust <mich...@databricks.com>
>> wrote:
>>>
>>> On Thu, Jul 10, 2014 at 2:08 PM, Jerry Lam <chiling...@gmail.com> wrote:
>>>>
>>>> For the curious mind, the dataset is about 200-300GB and we are using 10
>>>> machines for this benchmark. Given that the environment is identical
>>>> between the two experiments, why is pure Spark faster than Spark SQL?
>>>
>>>
>>> There is going to be some overhead to parsing data using the Hive SerDes
>>> instead of the native Spark code; however, the slowdown you are seeing here
>>> is much larger than I would expect. Can you tell me more about the table?
>>> What does the schema look like?  Is it partitioned?
>>>
>>>> By the way, I also tried hql("select * from m").count. It is terribly
>>>> slow too.
>>>
>>>
>>> FYI, this query is actually identical to the one where you write out
>>> COUNT(*).
>>
>>
>
