"select 'uid',
        max(length(uid)),
        count(distinct(uid)),
        count(uid),
        sum(case when uid is null then 0 else 1 end),
        sum(case when uid is null then 1 else 0 end),
        sum(case when uid is null then 1 else 0 end) / count(uid)
 from tb"
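For what it's worth, every function in that query (max, length, count, sum, case/when) is a Hive built-in, so no UDF should be required. A plain-Scala sketch of what the two case-when counters compute (the Seq of sample uids is made up purely for illustration):

```scala
// Sketch of the null-profiling logic from the query, on a toy sample.
val uids: Seq[Option[String]] = Seq(Some("a"), None, Some("bb"), None)

val nonNull = uids.count(_.isDefined) // sum(case when uid is null then 0 else 1 end)
val nulls   = uids.count(_.isEmpty)   // sum(case when uid is null then 1 else 0 end)

// Note: count(uid) in HiveQL skips NULLs, so the final expression in the
// query divides the null count by the NON-null count, not the total rows.
val ratio = nulls.toDouble / nonNull

println(s"$nonNull $nulls $ratio")
```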

Is this the query as-is, or did you use a UDF here?

-Sahil

On Thu, Dec 3, 2015 at 4:06 PM, Mich Talebzadeh <m...@peridale.co.uk> wrote:

> Can you try running it directly on hive to see the timing or through
> spark-sql may be.
>
>
>
> Spark does what Hive does, that is, process large sets of data, but it
> attempts to do the intermediate iterations in memory when it can (i.e. when
> there is enough memory available to keep the data set in memory); otherwise
> it has to spill to disk. So it boils down to how much memory you
> have.
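>
> As a sketch of how one might make that caching step explicit (assuming the
> Spark 1.x sqlContext API used in the snippet below; illustrative only, not
> tested against your table):
>
> ```scala
> import org.apache.spark.storage.StorageLevel
>
> val df = sqlContext.sql("select * from user")
> // cache()/persist() is lazy: nothing is materialized until the first
> // action runs, so that first action pays for the full table scan plus
> // the caching work.
> df.persist(StorageLevel.MEMORY_AND_DISK) // spill to disk if memory is short
> df.count()                               // force materialization up front
> // subsequent queries over the cached data should then be much faster
> ```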
>
>
>
> HTH
>
>
>
> Mich Talebzadeh
>
>
>
> *Sybase ASE 15 Gold Medal Award 2008*
>
> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>
>
> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>
> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase ASE
> 15", ISBN 978-0-9563693-0-7*.
>
> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
> 978-0-9759693-0-4*
>
> *Publications due shortly:*
>
> *Complex Event Processing in Heterogeneous Environments*, ISBN:
> 978-0-9563693-3-8
>
> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, volume
> one out shortly
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only; if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Peridale Technology
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is
> the responsibility of the recipient to ensure that this email is virus
> free; therefore neither Peridale Ltd, its subsidiaries nor their employees
> accept any responsibility.
>
>
>
> *From:* hxw黄祥为 [mailto:huang...@ctrip.com]
> *Sent:* 03 December 2015 10:29
> *To:* user@spark.apache.org
> *Subject:* spark1.4.1 extremely slow for take(1) or head() or first() or
> show
>
>
>
> Dear All,
>
>
>
> I have a Hive table with 100 million rows, and I just ran some very simple
> operations on this dataset, like:
>
>
>
>   val df = sqlContext.sql("select * from user ").toDF
>
>   df.cache
>
>   df.registerTempTable("tb")
>
>   val b = sqlContext.sql("""
>     select 'uid',
>            max(length(uid)),
>            count(distinct(uid)),
>            count(uid),
>            sum(case when uid is null then 0 else 1 end),
>            sum(case when uid is null then 1 else 0 end),
>            sum(case when uid is null then 1 else 0 end) / count(uid)
>     from tb""")
>
>   b.show  // the result is just one line, but this step is extremely slow
>
>
>
> Is this expected? Why is show so slow for a DataFrame? Is it a bug in the
> optimizer, or did I do something wrong?
>
>
>
>
>
> Best Regards,
>
> tylor
>
