Good stuff. I decided to do some boundary-value analysis by getting records where the ID (a unique value) is IN (min(), max()).
Unfortunately Hive SQL does not yet support more than one sub-query expression in a predicate. For example, this operation is perfectly valid in Oracle and comes back with two rows:

select * from dummy where id IN (select MIN(ID) from dummy) OR id IN (select MAX(ID) from dummy);

In Hive SQL the same statement fails:

select * from dummy where id IN (select MIN(ID) from dummy) OR id IN (select MAX(ID) from dummy);
FAILED: SemanticException [Error 10249]: Line 1:66 Unsupported SubQuery Expression 'id': Only 1 SubQuery expression is supported.

So the solution I found was to use UNION in Hive:

SELECT * FROM dummy WHERE id IN (SELECT MIN(ID) FROM dummy)
UNION
SELECT * FROM dummy WHERE id IN (SELECT MAX(ID) FROM dummy)
ORDER BY id

It took 2 min, 6 sec to return two rows.

In FP I decided to do this:

val d = HiveContext.table("test.dummy")
// Obtain boundary values
val minValue: Int = d.agg(min(col("id"))).collect.apply(0).getInt(0)
val maxValue: Int = d.agg(max(col("id"))).collect.apply(0).getInt(0)
d.filter(col("id") === lit(minValue) || col("id") === lit(maxValue)).orderBy(col("id")).show

It returns the rows back in 1 min. I am not sure whether the FP code is fully optimised, but it still wins. (A single-pass variant is sketched after the quoted thread below.)

HTH

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

On 26 February 2016 at 14:51, Yin Yang <yy201...@gmail.com> wrote:

> Since collect is involved, the approach would be slower compared to the
> SQL Mich gave in his first email.
>
> On Fri, Feb 26, 2016 at 1:42 AM, Michał Zieliński <
> zielinski.mich...@gmail.com> wrote:
>
>> You need to collect the value.
>>
>> val m: Int = d.agg(max($"id")).collect.apply(0).getInt(0)
>> d.filter(col("id") === lit(m))
>>
>> On 26 February 2016 at 09:41, Mich Talebzadeh <mich.talebza...@gmail.com>
>> wrote:
>>
>>> Can this be done using DFs?
>>>
>>> scala> val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>>>
>>> scala> val d = HiveContext.table("test.dummy")
>>> d: org.apache.spark.sql.DataFrame = [id: int, clustered: int, scattered:
>>> int, randomised: int, random_string: string, small_vc: string, padding:
>>> string]
>>>
>>> scala> var m = d.agg(max($"id"))
>>> m: org.apache.spark.sql.DataFrame = [max(id): int]
>>>
>>> How can I join these two? In other words, I want to get all rows with id
>>> = m here?
>>>
>>> d.filter($"id" = m) ?
>>>
>>> Thanks
>>>
>>> On 25/02/2016 22:58, Mohammad Tariq wrote:
>>>
>>> AFAIK, this isn't supported yet. A ticket
>>> <https://issues.apache.org/jira/browse/SPARK-4226> is in progress
>>> though.
>>>
>>> Tariq, Mohammad
>>> about.me/mti
>>>
>>> On Fri, Feb 26, 2016 at 4:16 AM, Mich Talebzadeh <
>>> mich.talebza...@cloudtechnologypartners.co.uk> wrote:
>>>
>>>> Hi,
>>>>
>>>> I guess the following confirms that Spark does not support sub-queries:
>>>>
>>>> val d = HiveContext.table("test.dummy")
>>>>
>>>> d.registerTempTable("tmp")
>>>>
>>>> HiveContext.sql("select * from tmp where id IN (select max(id) from
>>>> tmp)")
>>>>
>>>> It crashes.
>>>>
>>>> The SQL works OK in Hive itself on the underlying table!
>>>>
>>>> select * from dummy where id IN (select max(id) from dummy);
>>>>
>>>> Thanks
>>>
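A follow-up on the optimisation point above: both boundary values can be collected in a single aggregation pass instead of two separate jobs. This is a minimal sketch, untested, assuming the same HiveContext val and test.dummy table defined earlier in the thread:

import org.apache.spark.sql.functions.{col, max, min}

val d = HiveContext.table("test.dummy")

// One aggregation job computes both boundaries; head returns the single result Row
val row = d.agg(min(col("id")), max(col("id"))).head
val minValue = row.getInt(0)
val maxValue = row.getInt(1)

// === accepts plain literals, so lit() is not strictly required here
d.filter(col("id") === minValue || col("id") === maxValue).orderBy(col("id")).show

This halves the number of jobs needed to obtain the boundaries, which should shave some time off the two-collect version.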
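On the question in the quoted thread of joining d with m directly, a join avoids pulling the aggregate back to the driver at all, which sidesteps the collect concern Yin raised. Another sketch under the same assumptions, with the alias max_id chosen here purely for illustration:

import org.apache.spark.sql.functions.{col, max}

val d = HiveContext.table("test.dummy")

// Alias the aggregate so the join condition has a stable column name
val m = d.agg(max(col("id")).as("max_id"))

// Inner join keeps only the row(s) whose id equals the maximum
d.join(m, d("id") === m("max_id")).drop("max_id").show

Since m is a single-row DataFrame, Spark can typically broadcast it, so the join should be cheap relative to the table scan.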