RE: Multiple column aggregations

2019-02-11 Thread Shiva Prashanth Vallabhaneni
Hi Sonu,

You could use a query similar to the one below, and you could further optimize 
it by adding a WHERE clause. I would suggest benchmarking both approaches 
(multiple group-by queries vs. a single query with multiple window functions) 
before choosing one. Before running the benchmark, I would also ensure that the 
underlying data is stored in a columnar format with compression enabled; for 
instance, the Parquet file format with block-level Snappy compression.

SELECT SUM(CASE WHEN accountRank = 2 THEN 1 ELSE 0 END) AS accountsWithMoreThanOneOrder,
       SUM(CASE WHEN orderRank = 2 THEN 1 ELSE 0 END) AS ordersWithMoreThanOneAccount
FROM   (
         SELECT accountNo,
                orderNo,
                rank() OVER (PARTITION BY orderNo ORDER BY accountNo) AS orderRank,
                rank() OVER (PARTITION BY accountNo ORDER BY orderNo) AS accountRank
         FROM   accountOrders
       ) ranked
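As a sanity check on the counting logic (not on Spark itself), the same two 
counts can be reproduced in plain Python on a toy dataset. The rows and column 
values below are made up for illustration, and the equivalence assumes no 
duplicate (accountNo, orderNo) pairs, since rank() would skip rank 2 on ties:

```python
from collections import Counter

# Toy accountOrders rows: (accountNo, orderNo) -- sample data, not the real table.
rows = [("A1", "O1"), ("A1", "O2"), ("A2", "O3"), ("A3", "O3"), ("A4", "O4")]

orders_per_account = Counter(acct for acct, _ in rows)
accounts_per_order = Counter(order for _, order in rows)

# Equivalent of SUM(CASE WHEN accountRank = 2 THEN 1 ELSE 0 END):
# rank 2 appears exactly once per account iff that account has more than one order.
accounts_with_more_than_one_order = sum(1 for c in orders_per_account.values() if c > 1)
orders_with_more_than_one_account = sum(1 for c in accounts_per_order.values() if c > 1)

print(accounts_with_more_than_one_order)  # 1  (A1 has two orders)
print(orders_with_more_than_one_account)  # 1  (O3 is shared by A2 and A3)
```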

P.S. – You will need to check the above query for any syntax errors.

– Shiva

From: Sonu Jyotshna 
Sent: Saturday, February 9, 2019 10:17 AM
To: user@spark.apache.org
Subject: Multiple column aggregations


Hello,

I have a requirement where I need to group by multiple columns and aggregate 
them, but not at the same time. My data contains accountId, some other columns, 
and orderId. For some scenarios, such as finding accounts with multiple orders, 
a group-by on account works. But I also need to find orderIds associated with 
multiple accounts, which would need a group-by on orderId. For better 
performance at the dataset level, can both be done in a single step, or is 
there a better approach I can follow? Can you help?


Regards,
Sonu

Any comments or statements made in this email are not necessarily those of 
Tavant Technologies. The information transmitted is intended only for the 
person or entity to which it is addressed and may contain confidential and/or 
privileged material. If you have received this in error, please contact the 
sender and delete the material from any computer. All emails sent from or to 
Tavant Technologies may be subject to our monitoring procedures.


Recall: spark sql in-clause problem

2018-05-23 Thread Shiva Prashanth Vallabhaneni
Shiva Prashanth Vallabhaneni would like to recall the message, "spark sql 
in-clause problem".






RE: spark sql in-clause problem

2018-05-22 Thread Shiva Prashanth Vallabhaneni
Assuming the list of values in the "IN" clause is small, you could try using

sparkSqlContext.sql("select * from mytable where key = 1 and ((X, Y) = (1, 2) OR (X, Y) = (3, 4))")

Another solution could be to load the possible values for X & Y into a table 
and then use that table in a sub-query:

Table coordinates (
Integer X,
Integer Y
)

sparkSqlContext.sql("select * from mytable where key = 1 and (X, Y) in (select X, Y from coordinates)")
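The second approach is essentially a semi-join on (X, Y). The same filtering 
can be sketched in plain Python; the table contents below are made-up sample 
data for illustration:

```python
# Rows of mytable as (key, x, y) tuples -- sample data for illustration.
mytable = [(1, 1, 2), (1, 3, 4), (1, 5, 6), (2, 1, 2)]

# The "coordinates" lookup table from the sub-query approach.
coordinates = {(1, 2), (3, 4)}

# WHERE key = 1 AND (X, Y) IN (SELECT X, Y FROM coordinates)
result = [row for row in mytable if row[0] == 1 and (row[1], row[2]) in coordinates]
print(result)  # [(1, 1, 2), (1, 3, 4)]
```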

From: onmstester onmstester 
Sent: Wednesday, May 23, 2018 10:33 AM
To: user 
Subject: spark sql in-clause problem

I'm reading from this table in Cassandra:
Table mytable (
Integer Key,
Integer X,
Integer Y
)

Using:
sparkSqlContext.sql("select * from mytable where key = 1 and (X, Y) in ((1, 2), (3, 4))")

Encountered error:

StructType(StructField(X,IntegerType,true), StructField(Y,IntegerType,true))
!=
StructType(StructField(X,IntegerType,false), StructField(Y,IntegerType,false))
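The two schemas in the error above differ only in the nullable flag: columns 
read from the table are nullable (true), while the literal tuples in the IN 
list are not (false). A toy sketch of the comparison Spark is making, with 
struct fields modeled as plain (name, type, nullable) tuples rather than real 
Spark classes:

```python
# Stand-ins for Spark's StructField: (name, type, nullable).
table_schema = [("X", "integer", True), ("Y", "integer", True)]      # table columns: nullable
literal_schema = [("X", "integer", False), ("Y", "integer", False)]  # IN-list literals: not

# The schemas agree on names and types but still compare unequal,
# because the nullable flags differ -- the same mismatch Spark reports.
print(table_schema == literal_schema)  # False
print([f[:2] for f in table_schema] == [f[:2] for f in literal_schema])  # True
```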





