"select 'uid',max(length(uid)),count(distinct(uid)),count(uid),sum(case when uid is null then 0 else 1 end),sum(case when uid is null then 1 else 0 end),sum(case when uid is null then 1 else 0 end)/count(uid) from tb"
Is this as is, or did you use a UDF here?

-Sahil

On Thu, Dec 3, 2015 at 4:06 PM, Mich Talebzadeh <m...@peridale.co.uk> wrote:

> Can you try running it directly on Hive to see the timing, or through
> spark-sql maybe.
>
> Spark does what Hive does, that is, process large sets of data, but it
> attempts to do the intermediate iterations in memory if it can (i.e. if
> there is enough memory available to keep the data set in memory);
> otherwise it will have to use disk space. So it boils down to how much
> memory you have.
>
> HTH
>
> Mich Talebzadeh
>
> http://talebzadehmich.wordpress.com
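[Editor's note: Mich's "it boils down to how much memory you have" can be acted on when launching the shell. An illustrative invocation; the sizes here are made-up examples to be tuned to the cluster, and `spark.storage.memoryFraction` is the Spark 1.x-era setting for the share of heap used for cached data:]

```shell
# Illustrative sizes only -- tune to your cluster and the ~100M-row table.
spark-shell \
  --driver-memory 4g \
  --executor-memory 8g \
  --conf spark.storage.memoryFraction=0.6
```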
> From: hxw黄祥为 [mailto:huang...@ctrip.com]
> Sent: 03 December 2015 10:29
> To: user@spark.apache.org
> Subject: spark1.4.1 extremely slow for take(1) or head() or first() or show
>
> Dear All,
>
> I have a Hive table with 100 million rows, and I just ran some very
> simple operations on this dataset, like:
>
>     val df = sqlContext.sql("select * from user").toDF
>     df.cache
>     df.registerTempTable("tb")
>     val b = sqlContext.sql("select 'uid', max(length(uid)), count(distinct(uid)), count(uid), sum(case when uid is null then 0 else 1 end), sum(case when uid is null then 1 else 0 end), sum(case when uid is null then 1 else 0 end)/count(uid) from tb")
>     b.show // the result is just one line, but this step is extremely slow
>
> Is this expected? Why is show so slow for a DataFrame? Is it a bug in the
> optimizer, or did I do something wrong?
>
> Best Regards,
> tylor
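[Editor's note: one plausible explanation, not spelled out in the thread, is that `cache` is lazy: it only marks the DataFrame for caching, so the first action (here `show`) pays for the full 100M-row table scan plus cache materialization in one go. A PySpark sketch of the idea, using the modern `SparkSession` API rather than the thread's 1.4-era `sqlContext`; it needs a live Spark-with-Hive setup, so treat it as a sketch rather than a standalone script:]

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

df = spark.sql("select * from user")
df.cache()   # lazy: nothing is read or cached yet
df.count()   # first action pays the full scan once and fills the cache
df.createOrReplaceTempView("tb")

# Subsequent actions, including the aggregate's show(), read cached data.
spark.sql("select count(uid) from tb").show()
```

With this ordering, the slow step moves to the explicit `count()`, and the `show()` on the aggregate reflects only the aggregation cost, not the table scan.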