Apologies, wrong shortcuts in gmail and I managed to send the mail before I finished editing the query. I edited it below.
On Tue, Jul 11, 2017 at 7:58 PM, Valentin Ursu < valentindaniel.u...@gmail.com> wrote: > Hello, > > Short description: A SQL query sent via Thrift server returns an > inexplicable response. Running the same (exact same) query inside Apache > Zeppelin or submitting a job returns the correct result. Furthermore, a > similar table returns the correct response in both cases. > > Details: > I'm using Spark 2.0.0 on a Cloudera 5.7 distribution. I did not test in on > Spark 2.1.0 but if helpful I can install it and test. > > I have 2 Hive metastore tables which are saved by a spark batch job. > orders_2years is processed and saveAsTable() > orders_30days is filtered from _2years based on a column containing date > and saveAsTable() (same batch job creates both) > > I checked and both tables contain this record and it is the only record > that satisfies all the conditions in the query: > > | customer_id | order_id | doc_id | category | vendor_id| > +-------------+----------+---------+----------+----------+ > | 916339 | 25144502 | 5596579 | 1455 | 1 | > > The exact same query is run using a PHP application connecting to Thrift > and via Zeppelin/Spark using sqlContext.sql("") > > The query on table _30days, in Zeppelin > good result > The query on table _30days, via Thrift > bad result > The query on table _2years, via Thrift > good result > > Furthermore, changing _30days in _2years in b and leaving _30days to > create a > good result. > > The query is: > > SELECT a.customer_id_custom, > collect_list(b.order_id) AS b__orderIds, > collect_list(b.doc_id) AS b__docIds, > collect_list(b.category_id) AS b__cat, > collect_list(b.vendor_id) AS b__vendor > FROM ((SELECT customer_id AS customer_id_custom > FROM orders_30days > WHERE 1 = 1 AND category_id IN (1455) AND vendor_id IN (1) AND > order_id IN (25144502) AND > (fullDate > '2017-06-08 12:07' AND > fullDate < '2017-07-07 12:07') > GROUP BY customer_id > HAVING count(1) >= 1 AND > SUM(total_price_with_vat) BETWEEN 1 AND 999) a ) > INNER JOIN orders_30days AS b > ON b.customer_id = a.customer_id_custom AND > 1 = 1 AND b.category_id IN (1455) AND > b.vendor_id IN (1) AND > (b.fullDate > '2017-06-08 12:07' AND > b.fullDate < '2017-07-07 12:07') > WHERE 1 = 1 > GROUP BY a.customer_id_custom > > If you're wondering why I'm so specific with my query, the original is a > lot more complex. For example a is actually obtained by joining 5 tables > but I tried simplifying it as much as I could while obtaining the same > effect. > > Good result is: > > | customer_id_custom | b__orderIds | b__docIds | b__cat | b__vendor | > +--------------------+---------------+-----------+--------+-----------+ > | 916339 | [25144502] | [5596579] | [1455] | [1] | > > Bad result: > > | customer_id_custom | b__orderIds | b__docIds | b__cat | b__vendor | > +--------------------+---------------+-----------+--------+-----------+ > | 916339 | [null] | [1] | [null] | [1455] | > > Notice how some columns appear to be misplaced (b__cat is actually in > b__vendor) while others just return null > > >