Re: Difference between timestamp is 15 minutes
joins are on columns , what you are trying to do is where condition On Thu, Jul 19, 2012 at 5:25 AM, Raihan Jamal jamalrai...@gmail.com wrote: Something like this will work in Hive? *ON ((UNIX_TIMESTAMP(testingtable1.created_time) - (prod_and_ts.timestamps / 1000)) / 60* 1000 = 15 minutes)* *Raihan Jamal* On Wed, Jul 18, 2012 at 4:48 PM, Raihan Jamal jamalrai...@gmail.comwrote: This is the CREATED_TIME *`2012-07-17 00:00:22`* and this is the Corresponding Timestamp *`1342508427000`*. Here timestamp is *5 seconds*more than the *`CREATED_TIME`*. I need to see below scenario Currently I have a query, in which I am joining on created_time and timestamp like this- ON (UNIX_TIMESTAMP(testingtable1.created_time) = (prod_and_ts.timestamps / 1000)) So in above case, it will not match as timestamp is 5 seconds more than created_time. But I need if the difference between either of the two is within 15 minutes then I need to match it. So I need something like this- ON (UNIX_TIMESTAMP(testingtable1.created_time) - (prod_and_ts.timestamps / 1000) = 15 minutes) How I can do the above case if difference between timestamps is within 15 minutes then data will get matched by the above `ON clause` *Raihan Jamal* -- Nitin Pawar
Request write access to the Hive wiki
Please grant me write access to the Hive wiki so that I can work on improving the documentation. Thank you. – Lefty Leverenz le...@hortonworks.com
Re: Unable to start hive
Hi Prabhjot Have you set $HADOOP_HOME? If not try setting that, if already set please verify whether it is the correct one. Normally I do set $HADOOP_HOME and $PATH and have never faced any issues. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: iwannaplay games funnlearnfork...@gmail.com Date: Thu, 19 Jul 2012 15:22:39 To: user@hive.apache.org Reply-To: user@hive.apache.org Subject: Unable to start hive Hi, I installed apache hive and set the paths.Still i am getting this error. [hduser@master bin]$ hive Exception in thread main java.lang.NoClassDefFoundError: org/apache/thrift/TException at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:247) at org.apache.hadoop.util.RunJar.main(RunJar.java:149) Caused by: java.lang.ClassNotFoundException: org.apache.thrift.TException at java.net.URLClassLoader$1.run(URLClassLoader.java:202) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:190) at java.lang.ClassLoader.loadClass(ClassLoader.java:306) at java.lang.ClassLoader.loadClass(ClassLoader.java:247) ... 3 more What is wrong can you please help me out Thanks in advance Regards Prabhjot
Re: Unable to start hive
Hi, Everything is set. [hduser@master bin]$ echo $HADOOP_HOME /usr/local/hadoop [hduser@master bin]$ echo $PATH usr/local/sqoop/bin:/usr/local/hive/bin:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/home/hduser/bin:/usr/java/jdk1.6.0_33/bin:/usr/local/hadoop/bin created two directories $ bin/hadoop fs -mkdir /tmp $ bin/hadoop fs -mkdir /user/hive/warehouse $ bin/hadoop fs -chmod g+w /tmp $ bin/hadoop fs -chmod g+w /user/hive/warehouse Do i need to make any changes in files in HIVE_HOME/conf.I have some templates over there.I need to rename them? Is there any other change that we make? Thanks Regards Prabhjot On 7/19/12, Bejoy KS bejoy...@yahoo.com wrote: Hi Prabhjot Have you set $HADOOP_HOME? If not try setting that, if already set please verify whether it is the correct one. Normally I do set $HADOOP_HOME and $PATH and have never faced any issues. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: iwannaplay games funnlearnfork...@gmail.com Date: Thu, 19 Jul 2012 15:22:39 To: user@hive.apache.org Reply-To: user@hive.apache.org Subject: Unable to start hive Hi, I installed apache hive and set the paths.Still i am getting this error. [hduser@master bin]$ hive Exception in thread main java.lang.NoClassDefFoundError: org/apache/thrift/TException at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:247) at org.apache.hadoop.util.RunJar.main(RunJar.java:149) Caused by: java.lang.ClassNotFoundException: org.apache.thrift.TException at java.net.URLClassLoader$1.run(URLClassLoader.java:202) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:190) at java.lang.ClassLoader.loadClass(ClassLoader.java:306) at java.lang.ClassLoader.loadClass(ClassLoader.java:247) ... 3 more What is wrong can you please help me out Thanks in advance Regards Prabhjot
Performance tuning a hive query
Apart from partitions and buckets how to improve of hive queries Regards Abhi Sent from my iPhone
Re: Performance tuning a hive query
depends on what kind of query if yoy are doing joins then there are different kind of join queries depending on how did you layout the data and how much of data is held in what table. On Thu, Jul 19, 2012 at 6:54 PM, Abhishek abhishek.dod...@gmail.com wrote: Apart from partitions and buckets how to improve of hive queries * * *Regards * Abhi Sent from my iPhone -- Nitin Pawar
Re: Performance tuning a hive query
There are many ways, but beware that some of them may result in worse performance when used inappropriately. Some of the settings we use to achieve faster queries: hive.map.aggr=true hive.exec.parallel=true hive.exec.compress.intermediate=true mapred.job.reuse.jvm.num.tasks=-1 Structuring the queries properly can help a lot. For example if you eliminate unneeded data early in the query before further processing. E.g. if you use subquery in FROM, you should put all WHERE clauses where possible into the subquery, to eliminate the amount of data passed to the next stage. Using multi-group-by queries helps a lot when computing multiple queries on same set of data. As Nitin Pawar mentioned, the JOINs can be often optimized as well. Also, fine tuning the hadoop server itself for your specific needs might help. I am very interested in optimization of queries as well, so if anyone knows some more tricks, please share... J. Dolinar On Thu, Jul 19, 2012 at 3:24 PM, Abhishek abhishek.dod...@gmail.com wrote: Apart from partitions and buckets how to improve of hive queries * * *Regards * Abhi Sent from my iPhone
Something wrong with my query to get TOP 3?
This is the below data in my Table1 BID PID TIME --+-+ 1345653 330760137950 2012-07-09 21:42:29 1345653 330760137950 2012-07-09 21:43:29 1345653 330760137950 2012-07-09 21:40:29 1345653 330760137950 2012-07-09 21:41:29 1345653 110909316904 2012-07-09 21:29:06 1345653 221065796761 2012-07-09 19:31:48 So If I need to clarify the above scenario- I have data in above table like this- For USER *`1345653` *I have this PID `*330760137950` *four times but with different timestamps in red color. So I need the output something like this- Output that I need:- *1345653330760137950 2012-07-09 21:43:29 * *1345653330760137950 2012-07-09 21:42:29 * *1345653330760137950 2012-07-09 21:41:29* 1345653110909316904 2012-07-09 21:29:06 1345653221065796761 2012-07-09 19:31:48 So Basically If BID and PID are same but with different timestamps, then I need TOP 3 sorted with TIME in descending order And for this I created rank UDF (User Defined Function). And I wrote the below query but its not working for me. Can anyone help me on this? * * * * *SELECT buyer_id, item_id, created_time* *FROM table1* * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) as int))) = '2012-07-09' AND rank(buyer_id) 3* *DISTRIBUTE BY buyer_id* *SORT BY buyer_id, created_time desc*
Re: Something wrong with my query to get TOP 3?
Modified Query that I wrote and its not working as expected output is. * * *SELECT bid, pid, rank(bid), time, UNIX_TIMESTAMP(time)* *FROM (* *SELECT bid, pid, time* *FROM table1* * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(time) as int))) = '2012-07-09'* *DISTRIBUTE BY bid,pid,time* *SORT BY bid, time desc* *) a* *WHERE rank(bid) 3;* On Thu, Jul 19, 2012 at 1:41 PM, comptech geeky comptechge...@gmail.comwrote: I wrote this query after modifying it- *SELECT buyer_id, item_id, rank(buyer_id), created_time, UNIX_TIMESTAMP(created_time)* *FROM (* *SELECT buyer_id, item_id, created_time* *FROM testingtable1* * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) as int))) = '2012-07-09'* *DISTRIBUTE BY buyer_id,item_id* *SORT BY buyer_id, created_time desc* *) a* *WHERE rank(buyer_id) 3;* And the output I got is which is sligtly wrong as it is missing two rows- *1345653330760137950 2012-07-09 21:43:29* *1345653330760137950 2012-07-09 21:42:29* *1345653330760137950 2012-07-09 21:41:29* These two rows are missing- 1345653110909316904 2012-07-09 21:29:06 1345653221065796761 2012-07-09 19:31:48 So full output should be like this- *1345653330760137950 2012-07-09 21:43:29* *1345653330760137950 2012-07-09 21:42:29* *1345653330760137950 2012-07-09 21:41:29* 1345653110909316904 2012-07-09 21:29:06 1345653221065796761 2012-07-09 19:31:48 On Thu, Jul 19, 2012 at 1:29 PM, comptech geeky comptechge...@gmail.comwrote: Can you show me the exact query that I need to do for this particular problem consideing my scenario? It will be of great help to me. As I am new to HiveQL. I need TOP 3 for those if BID and PID gets matched but with different timestamp. On Thu, Jul 19, 2012 at 1:15 PM, Philip Tromans philip.j.trom...@gmail.com wrote: Your rank() is being evaluated map side. Put your distribute by and sort by in an inner query, and then evaluate your rank() in an outer query. Phil. On Jul 19, 2012 9:00 PM, comptech geeky comptechge...@gmail.com wrote: This is the below data in my Table1 BID PID TIME --+-+ 1345653 330760137950 2012-07-09 21:42:29 1345653 330760137950 2012-07-09 21:43:29 1345653 330760137950 2012-07-09 21:40:29 1345653 330760137950 2012-07-09 21:41:29 1345653 110909316904 2012-07-09 21:29:06 1345653 221065796761 2012-07-09 19:31:48 So If I need to clarify the above scenario- I have data in above table like this- For USER *`1345653` *I have this PID `*330760137950` *four times but with different timestamps in red color. So I need the output something like this- Output that I need:- *1345653330760137950 2012-07-09 21:43:29 * *1345653330760137950 2012-07-09 21:42:29 * *1345653330760137950 2012-07-09 21:41:29* 1345653110909316904 2012-07-09 21:29:06 1345653221065796761 2012-07-09 19:31:48 So Basically If BID and PID are same but with different timestamps, then I need TOP 3 sorted with TIME in descending order And for this I created rank UDF (User Defined Function). And I wrote the below query but its not working for me. Can anyone help me on this? * * * * *SELECT buyer_id, item_id, created_time* *FROM table1* * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) as int))) = '2012-07-09'* *DISTRIBUTE BY buyer_id* *SORT BY buyer_id, created_time desc*
Re: Something wrong with my query to get TOP 3?
Can anyone help me with this? I have tried other options by tweaking the query also. I am not able to achieve my expected output. On Thu, Jul 19, 2012 at 1:45 PM, comptech geeky comptechge...@gmail.comwrote: Modified Query that I wrote and its not working as expected output is. * * *SELECT bid, pid, rank(bid), time, UNIX_TIMESTAMP(time)* *FROM (* *SELECT bid, pid, time* *FROM table1* * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(time) as int))) = '2012-07-09'* *DISTRIBUTE BY bid,pid,time* *SORT BY bid, time desc* *) a* *WHERE rank(bid) 3;* On Thu, Jul 19, 2012 at 1:41 PM, comptech geeky comptechge...@gmail.comwrote: I wrote this query after modifying it- *SELECT buyer_id, item_id, rank(buyer_id), created_time, UNIX_TIMESTAMP(created_time)* *FROM (* *SELECT buyer_id, item_id, created_time* *FROM testingtable1* * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) as int))) = '2012-07-09'* *DISTRIBUTE BY buyer_id,item_id* *SORT BY buyer_id, created_time desc* *) a* *WHERE rank(buyer_id) 3;* And the output I got is which is sligtly wrong as it is missing two rows- *1345653330760137950 2012-07-09 21:43:29* *1345653330760137950 2012-07-09 21:42:29* *1345653330760137950 2012-07-09 21:41:29* These two rows are missing- 1345653110909316904 2012-07-09 21:29:06 1345653221065796761 2012-07-09 19:31:48 So full output should be like this- *1345653330760137950 2012-07-09 21:43:29* *1345653330760137950 2012-07-09 21:42:29* *1345653330760137950 2012-07-09 21:41:29* 1345653110909316904 2012-07-09 21:29:06 1345653221065796761 2012-07-09 19:31:48 On Thu, Jul 19, 2012 at 1:29 PM, comptech geeky comptechge...@gmail.comwrote: Can you show me the exact query that I need to do for this particular problem consideing my scenario? It will be of great help to me. As I am new to HiveQL. I need TOP 3 for those if BID and PID gets matched but with different timestamp. On Thu, Jul 19, 2012 at 1:15 PM, Philip Tromans philip.j.trom...@gmail.com wrote: Your rank() is being evaluated map side. Put your distribute by and sort by in an inner query, and then evaluate your rank() in an outer query. Phil. On Jul 19, 2012 9:00 PM, comptech geeky comptechge...@gmail.com wrote: This is the below data in my Table1 BID PID TIME --+-+ 1345653 330760137950 2012-07-09 21:42:29 1345653 330760137950 2012-07-09 21:43:29 1345653 330760137950 2012-07-09 21:40:29 1345653 330760137950 2012-07-09 21:41:29 1345653 110909316904 2012-07-09 21:29:06 1345653 221065796761 2012-07-09 19:31:48 So If I need to clarify the above scenario- I have data in above table like this- For USER *`1345653` *I have this PID `*330760137950` *four times but with different timestamps in red color. So I need the output something like this- Output that I need:- *1345653330760137950 2012-07-09 21:43:29 * *1345653330760137950 2012-07-09 21:42:29 * *1345653330760137950 2012-07-09 21:41:29* 1345653110909316904 2012-07-09 21:29:06 1345653221065796761 2012-07-09 19:31:48 So Basically If BID and PID are same but with different timestamps, then I need TOP 3 sorted with TIME in descending order And for this I created rank UDF (User Defined Function). And I wrote the below query but its not working for me. Can anyone help me on this? * * * * *SELECT buyer_id, item_id, created_time* *FROM table1* * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) as int))) = '2012-07-09'* *DISTRIBUTE BY buyer_id* *SORT BY buyer_id, created_time desc*
Re: Something wrong with my query to get TOP 3?
Hi, I more or less had the same problem and finally got it down by introducing a second subquery. This will guarantee that the rank function is invoked on the reduce phase and that the rank results are properly sorted. I guess something like this: *SELECT bid, pid, rank FROM * *(SELECT bid, pid, rank(bid) rank, time, UNIX_TIMESTAMP(time) FROM * * ( SELECT bid, pid, time FROM table1* * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(time) as int))) = '2012-07-09' * *DISTRIBUTE BY bid,pid* *SORT BY bid,pid, time desc) A * * ) B WHERE rank 3;* The A. query is carried out at the map phase and makes sure the record arrive in the proper order at the reducer The B. query takes care of applying the rank function (aliased as rank) The outer most query (a second MR job) takes care of filtering the TOP-3 Maybe this will not run at once but hope you get the idea for the proper set up. I modelled it after a query that I had to run, which finally returned the proper TOP-10 for me. Jasper 2012/7/20 comptech geeky comptechge...@gmail.com Can anyone help me with this? I have tried other options by tweaking the query also. I am not able to achieve my expected output. On Thu, Jul 19, 2012 at 1:45 PM, comptech geeky comptechge...@gmail.comwrote: Modified Query that I wrote and its not working as expected output is. * * *SELECT bid, pid, rank(bid), time, UNIX_TIMESTAMP(time)* *FROM (* *SELECT bid, pid, time* *FROM table1* * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(time) as int))) = '2012-07-09'* *DISTRIBUTE BY bid,pid,time* *SORT BY bid, time desc* *) a* *WHERE rank(bid) 3;* On Thu, Jul 19, 2012 at 1:41 PM, comptech geeky comptechge...@gmail.comwrote: I wrote this query after modifying it- *SELECT buyer_id, item_id, rank(buyer_id), created_time, UNIX_TIMESTAMP(created_time)* *FROM (* *SELECT buyer_id, item_id, created_time* *FROM testingtable1* * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) as int))) = '2012-07-09'* *DISTRIBUTE BY buyer_id,item_id* *SORT BY buyer_id, created_time desc* *) a* *WHERE rank(buyer_id) 3;* And the output I got is which is sligtly wrong as it is missing two rows- *1345653330760137950 2012-07-09 21:43:29* *1345653330760137950 2012-07-09 21:42:29* *1345653330760137950 2012-07-09 21:41:29* These two rows are missing- 1345653110909316904 2012-07-09 21:29:06 1345653221065796761 2012-07-09 19:31:48 So full output should be like this- *1345653330760137950 2012-07-09 21:43:29* *1345653330760137950 2012-07-09 21:42:29* *1345653330760137950 2012-07-09 21:41:29* 1345653110909316904 2012-07-09 21:29:06 1345653221065796761 2012-07-09 19:31:48 On Thu, Jul 19, 2012 at 1:29 PM, comptech geeky comptechge...@gmail.com wrote: Can you show me the exact query that I need to do for this particular problem consideing my scenario? It will be of great help to me. As I am new to HiveQL. I need TOP 3 for those if BID and PID gets matched but with different timestamp. On Thu, Jul 19, 2012 at 1:15 PM, Philip Tromans philip.j.trom...@gmail.com wrote: Your rank() is being evaluated map side. Put your distribute by and sort by in an inner query, and then evaluate your rank() in an outer query. Phil. On Jul 19, 2012 9:00 PM, comptech geeky comptechge...@gmail.com wrote: This is the below data in my Table1 BID PID TIME --+-+ 1345653 330760137950 2012-07-09 21:42:29 1345653 330760137950 2012-07-09 21:43:29 1345653 330760137950 2012-07-09 21:40:29 1345653 330760137950 2012-07-09 21:41:29 1345653 110909316904 2012-07-09 21:29:06 1345653 221065796761 2012-07-09 19:31:48 So If I need to clarify the above scenario- I have data in above table like this- For USER *`1345653` *I have this PID `*330760137950` *four times but with different timestamps in red color. So I need the output something like this- Output that I need:- *1345653330760137950 2012-07-09 21:43:29 * *1345653330760137950 2012-07-09 21:42:29 * *1345653330760137950 2012-07-09 21:41:29* 1345653110909316904 2012-07-09 21:29:06 1345653221065796761 2012-07-09 19:31:48 So Basically If BID and PID are same but with different timestamps, then I need TOP 3 sorted with TIME in descending order And for this I created rank UDF (User Defined Function). And I wrote the below query but its not working for me. Can anyone help me on this? * * * * *SELECT buyer_id, item_id, created_time* *FROM table1* * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) as int))) = '2012-07-09'* *DISTRIBUTE BY buyer_id* *SORT BY buyer_id,
Re: Something wrong with my query to get TOP 3?
Remove pid,time from DISTRIBUTE BY. On Thu, Jul 19, 2012 at 1:45 PM, comptech geeky comptechge...@gmail.comwrote: Modified Query that I wrote and its not working as expected output is. * * *SELECT bid, pid, rank(bid), time, UNIX_TIMESTAMP(time)* *FROM (* *SELECT bid, pid, time* *FROM table1* * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(time) as int))) = '2012-07-09'* *DISTRIBUTE BY bid,pid,time* *SORT BY bid, time desc* *) a* *WHERE rank(bid) 3;* On Thu, Jul 19, 2012 at 1:41 PM, comptech geeky comptechge...@gmail.comwrote: I wrote this query after modifying it- *SELECT buyer_id, item_id, rank(buyer_id), created_time, UNIX_TIMESTAMP(created_time)* *FROM (* *SELECT buyer_id, item_id, created_time* *FROM testingtable1* * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) as int))) = '2012-07-09'* *DISTRIBUTE BY buyer_id,item_id* *SORT BY buyer_id, created_time desc* *) a* *WHERE rank(buyer_id) 3;* And the output I got is which is sligtly wrong as it is missing two rows- *1345653330760137950 2012-07-09 21:43:29* *1345653330760137950 2012-07-09 21:42:29* *1345653330760137950 2012-07-09 21:41:29* These two rows are missing- 1345653110909316904 2012-07-09 21:29:06 1345653221065796761 2012-07-09 19:31:48 So full output should be like this- *1345653330760137950 2012-07-09 21:43:29* *1345653330760137950 2012-07-09 21:42:29* *1345653330760137950 2012-07-09 21:41:29* 1345653110909316904 2012-07-09 21:29:06 1345653221065796761 2012-07-09 19:31:48 On Thu, Jul 19, 2012 at 1:29 PM, comptech geeky comptechge...@gmail.comwrote: Can you show me the exact query that I need to do for this particular problem consideing my scenario? It will be of great help to me. As I am new to HiveQL. I need TOP 3 for those if BID and PID gets matched but with different timestamp. On Thu, Jul 19, 2012 at 1:15 PM, Philip Tromans philip.j.trom...@gmail.com wrote: Your rank() is being evaluated map side. Put your distribute by and sort by in an inner query, and then evaluate your rank() in an outer query. Phil. On Jul 19, 2012 9:00 PM, comptech geeky comptechge...@gmail.com wrote: This is the below data in my Table1 BID PID TIME --+-+ 1345653 330760137950 2012-07-09 21:42:29 1345653 330760137950 2012-07-09 21:43:29 1345653 330760137950 2012-07-09 21:40:29 1345653 330760137950 2012-07-09 21:41:29 1345653 110909316904 2012-07-09 21:29:06 1345653 221065796761 2012-07-09 19:31:48 So If I need to clarify the above scenario- I have data in above table like this- For USER *`1345653` *I have this PID `*330760137950` *four times but with different timestamps in red color. So I need the output something like this- Output that I need:- *1345653330760137950 2012-07-09 21:43:29 * *1345653330760137950 2012-07-09 21:42:29 * *1345653330760137950 2012-07-09 21:41:29* 1345653110909316904 2012-07-09 21:29:06 1345653221065796761 2012-07-09 19:31:48 So Basically If BID and PID are same but with different timestamps, then I need TOP 3 sorted with TIME in descending order And for this I created rank UDF (User Defined Function). And I wrote the below query but its not working for me. Can anyone help me on this? * * * * *SELECT buyer_id, item_id, created_time* *FROM table1* * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) as int))) = '2012-07-09'* *DISTRIBUTE BY buyer_id* *SORT BY buyer_id, created_time desc*
Re: Something wrong with my query to get TOP 3?
Hi Igor, I am not sure what I have to remove from Distribute By as in distribute by we have bid, pid and you said remove bid and time from distribute by and it doesn't have time *SELECT bid, pid, rank FROM * *(SELECT bid, pid, rank(bid) rank, time, UNIX_TIMESTAMP(time) FROM * * ( SELECT bid, pid, time FROM table1* * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(time) as int))) = '2012-07-09' * *DISTRIBUTE BY bid,pid* *SORT BY bid,pid, time desc) A * * ) B * *WHERE rank 3;* And also I tried running the above query as it is. I am not getting expected output instead of that I am getting output like this which is wrong If you compare my expected output with the below output- *1345653 1109093169040* *1345653 2210657967611* *1345653 3307601379502* On Thu, Jul 19, 2012 at 3:43 PM, Igor Tatarinov i...@decide.com wrote: Remove pid,time from DISTRIBUTE BY. On Thu, Jul 19, 2012 at 1:45 PM, comptech geeky comptechge...@gmail.comwrote: Modified Query that I wrote and its not working as expected output is. * * *SELECT bid, pid, rank(bid), time, UNIX_TIMESTAMP(time)* *FROM (* *SELECT bid, pid, time* *FROM table1* * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(time) as int))) = '2012-07-09'* *DISTRIBUTE BY bid,pid,time* *SORT BY bid, time desc* *) a* *WHERE rank(bid) 3;* On Thu, Jul 19, 2012 at 1:41 PM, comptech geeky comptechge...@gmail.comwrote: I wrote this query after modifying it- *SELECT buyer_id, item_id, rank(buyer_id), created_time, UNIX_TIMESTAMP(created_time)* *FROM (* *SELECT buyer_id, item_id, created_time* *FROM testingtable1* * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) as int))) = '2012-07-09'* *DISTRIBUTE BY buyer_id,item_id* *SORT BY buyer_id, created_time desc* *) a* *WHERE rank(buyer_id) 3;* And the output I got is which is sligtly wrong as it is missing two rows- *1345653330760137950 2012-07-09 21:43:29* *1345653330760137950 2012-07-09 21:42:29* *1345653330760137950 2012-07-09 21:41:29* These two rows are missing- 1345653110909316904 2012-07-09 21:29:06 1345653221065796761 2012-07-09 19:31:48 So full output should be like this- *1345653330760137950 2012-07-09 21:43:29* *1345653330760137950 2012-07-09 21:42:29* *1345653330760137950 2012-07-09 21:41:29* 1345653110909316904 2012-07-09 21:29:06 1345653221065796761 2012-07-09 19:31:48 On Thu, Jul 19, 2012 at 1:29 PM, comptech geeky comptechge...@gmail.com wrote: Can you show me the exact query that I need to do for this particular problem consideing my scenario? It will be of great help to me. As I am new to HiveQL. I need TOP 3 for those if BID and PID gets matched but with different timestamp. On Thu, Jul 19, 2012 at 1:15 PM, Philip Tromans philip.j.trom...@gmail.com wrote: Your rank() is being evaluated map side. Put your distribute by and sort by in an inner query, and then evaluate your rank() in an outer query. Phil. On Jul 19, 2012 9:00 PM, comptech geeky comptechge...@gmail.com wrote: This is the below data in my Table1 BID PID TIME --+-+ 1345653 330760137950 2012-07-09 21:42:29 1345653 330760137950 2012-07-09 21:43:29 1345653 330760137950 2012-07-09 21:40:29 1345653 330760137950 2012-07-09 21:41:29 1345653 110909316904 2012-07-09 21:29:06 1345653 221065796761 2012-07-09 19:31:48 So If I need to clarify the above scenario- I have data in above table like this- For USER *`1345653` *I have this PID `*330760137950` *four times but with different timestamps in red color. So I need the output something like this- Output that I need:- *1345653330760137950 2012-07-09 21:43:29 * *1345653330760137950 2012-07-09 21:42:29 * *1345653330760137950 2012-07-09 21:41:29* 1345653110909316904 2012-07-09 21:29:06 1345653221065796761 2012-07-09 19:31:48 So Basically If BID and PID are same but with different timestamps, then I need TOP 3 sorted with TIME in descending order And for this I created rank UDF (User Defined Function). And I wrote the below query but its not working for me. Can anyone help me on this? * * * * *SELECT buyer_id, item_id, created_time* *FROM table1* * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) as int))) = '2012-07-09'* *DISTRIBUTE BY buyer_id* *SORT BY buyer_id, created_time desc*
Re: Something wrong with my query to get TOP 3?
I am not really aware of your use case. Play around with it. At least the rank function is now properly applied. Maybe, remove pid from the DISTRIBUTE and de SORT clauses ?? Jasper 2012/7/20 comptech geeky comptechge...@gmail.com Hi Igor, I am not sure what I have to remove from Distribute By as in distribute by we have bid, pid and you said remove bid and time from distribute by and it doesn't have time *SELECT bid, pid, rank FROM * *(SELECT bid, pid, rank(bid) rank, time, UNIX_TIMESTAMP(time) FROM * * ( SELECT bid, pid, time FROM table1* * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(time) as int))) = '2012-07-09' * *DISTRIBUTE BY bid,pid* *SORT BY bid,pid, time desc) A * * ) B * *WHERE rank 3;* And also I tried running the above query as it is. I am not getting expected output instead of that I am getting output like this which is wrong If you compare my expected output with the below output- *1345653 1109093169040* *1345653 2210657967611* *1345653 3307601379502* On Thu, Jul 19, 2012 at 3:43 PM, Igor Tatarinov i...@decide.com wrote: Remove pid,time from DISTRIBUTE BY. On Thu, Jul 19, 2012 at 1:45 PM, comptech geeky comptechge...@gmail.comwrote: Modified Query that I wrote and its not working as expected output is. * * *SELECT bid, pid, rank(bid), time, UNIX_TIMESTAMP(time)* *FROM (* *SELECT bid, pid, time* *FROM table1* * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(time) as int))) = '2012-07-09'* *DISTRIBUTE BY bid,pid,time* *SORT BY bid, time desc* *) a* *WHERE rank(bid) 3;* On Thu, Jul 19, 2012 at 1:41 PM, comptech geeky comptechge...@gmail.com wrote: I wrote this query after modifying it- *SELECT buyer_id, item_id, rank(buyer_id), created_time, UNIX_TIMESTAMP(created_time)* *FROM (* *SELECT buyer_id, item_id, created_time* *FROM testingtable1* * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) as int))) = '2012-07-09'* *DISTRIBUTE BY buyer_id,item_id* *SORT BY buyer_id, created_time desc* *) a* *WHERE rank(buyer_id) 3;* And the output I got is which is sligtly wrong as it is missing two rows- *1345653330760137950 2012-07-09 21:43:29* *1345653330760137950 2012-07-09 21:42:29* *1345653330760137950 2012-07-09 21:41:29* These two rows are missing- 1345653110909316904 2012-07-09 21:29:06 1345653221065796761 2012-07-09 19:31:48 So full output should be like this- *1345653330760137950 2012-07-09 21:43:29* *1345653330760137950 2012-07-09 21:42:29* *1345653330760137950 2012-07-09 21:41:29* 1345653110909316904 2012-07-09 21:29:06 1345653221065796761 2012-07-09 19:31:48 On Thu, Jul 19, 2012 at 1:29 PM, comptech geeky comptechge...@gmail.com wrote: Can you show me the exact query that I need to do for this particular problem consideing my scenario? It will be of great help to me. As I am new to HiveQL. I need TOP 3 for those if BID and PID gets matched but with different timestamp. On Thu, Jul 19, 2012 at 1:15 PM, Philip Tromans philip.j.trom...@gmail.com wrote: Your rank() is being evaluated map side. Put your distribute by and sort by in an inner query, and then evaluate your rank() in an outer query. Phil. On Jul 19, 2012 9:00 PM, comptech geeky comptechge...@gmail.com wrote: This is the below data in my Table1 BID PID TIME --+-+ 1345653 330760137950 2012-07-09 21:42:29 1345653 330760137950 2012-07-09 21:43:29 1345653 330760137950 2012-07-09 21:40:29 1345653 330760137950 2012-07-09 21:41:29 1345653 110909316904 2012-07-09 21:29:06 1345653 221065796761 2012-07-09 19:31:48 So If I need to clarify the above scenario- I have data in above table like this- For USER *`1345653` *I have this PID `*330760137950` *four times but with different timestamps in red color. So I need the output something like this- Output that I need:- *1345653330760137950 2012-07-09 21:43:29 * *1345653330760137950 2012-07-09 21:42:29 * *1345653330760137950 2012-07-09 21:41:29* 1345653110909316904 2012-07-09 21:29:06 1345653221065796761 2012-07-09 19:31:48 So Basically If BID and PID are same but with different timestamps, then I need TOP 3 sorted with TIME in descending order And for this I created rank UDF (User Defined Function). And I wrote the below query but its not working for me. Can anyone help me on this? * * * * *SELECT buyer_id, item_id, created_time* *FROM table1* * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) as int))) = '2012-07-09'* *DISTRIBUTE BY buyer_id* *SORT BY buyer_id, created_time desc* -- Met
Re: Something wrong with my query to get TOP 3?
Sorry, just pid needs to be dropped from both DISTRIBUTE and SORT clauses. Your very first query was correct except for the nested subquery part. (You don't need a double-nested subquery.) On Thu, Jul 19, 2012 at 3:48 PM, comptech geeky comptechge...@gmail.comwrote: Hi Igor, I am not sure what I have to remove from Distribute By as in distribute by we have bid, pid and you said remove bid and time from distribute by and it doesn't have time *SELECT bid, pid, rank FROM * *(SELECT bid, pid, rank(bid) rank, time, UNIX_TIMESTAMP(time) FROM * * ( SELECT bid, pid, time FROM table1* * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(time) as int))) = '2012-07-09' * *DISTRIBUTE BY bid,pid* *SORT BY bid,pid, time desc) A * * ) B * *WHERE rank 3;* And also I tried running the above query as it is. I am not getting expected output instead of that I am getting output like this which is wrong If you compare my expected output with the below output- *1345653 1109093169040* *1345653 2210657967611* *1345653 3307601379502* On Thu, Jul 19, 2012 at 3:43 PM, Igor Tatarinov i...@decide.com wrote: Remove pid,time from DISTRIBUTE BY. On Thu, Jul 19, 2012 at 1:45 PM, comptech geeky comptechge...@gmail.comwrote: Modified Query that I wrote and its not working as expected output is. * * *SELECT bid, pid, rank(bid), time, UNIX_TIMESTAMP(time)* *FROM (* *SELECT bid, pid, time* *FROM table1* * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(time) as int))) = '2012-07-09'* *DISTRIBUTE BY bid,pid,time* *SORT BY bid, time desc* *) a* *WHERE rank(bid) 3;* On Thu, Jul 19, 2012 at 1:41 PM, comptech geeky comptechge...@gmail.com wrote: I wrote this query after modifying it- *SELECT buyer_id, item_id, rank(buyer_id), created_time, UNIX_TIMESTAMP(created_time)* *FROM (* *SELECT buyer_id, item_id, created_time* *FROM testingtable1* * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) as int))) = '2012-07-09'* *DISTRIBUTE BY buyer_id,item_id* *SORT BY buyer_id, created_time desc* *) a* *WHERE rank(buyer_id) 3;* And the output I got is which is sligtly wrong as it is missing two rows- *1345653330760137950 2012-07-09 21:43:29* *1345653330760137950 2012-07-09 21:42:29* *1345653330760137950 2012-07-09 21:41:29* These two rows are missing- 1345653110909316904 2012-07-09 21:29:06 1345653221065796761 2012-07-09 19:31:48 So full output should be like this- *1345653330760137950 2012-07-09 21:43:29* *1345653330760137950 2012-07-09 21:42:29* *1345653330760137950 2012-07-09 21:41:29* 1345653110909316904 2012-07-09 21:29:06 1345653221065796761 2012-07-09 19:31:48 On Thu, Jul 19, 2012 at 1:29 PM, comptech geeky comptechge...@gmail.com wrote: Can you show me the exact query that I need to do for this particular problem consideing my scenario? It will be of great help to me. As I am new to HiveQL. I need TOP 3 for those if BID and PID gets matched but with different timestamp. On Thu, Jul 19, 2012 at 1:15 PM, Philip Tromans philip.j.trom...@gmail.com wrote: Your rank() is being evaluated map side. Put your distribute by and sort by in an inner query, and then evaluate your rank() in an outer query. Phil. On Jul 19, 2012 9:00 PM, comptech geeky comptechge...@gmail.com wrote: This is the below data in my Table1 BID PID TIME --+-+ 1345653 330760137950 2012-07-09 21:42:29 1345653 330760137950 2012-07-09 21:43:29 1345653 330760137950 2012-07-09 21:40:29 1345653 330760137950 2012-07-09 21:41:29 1345653 110909316904 2012-07-09 21:29:06 1345653 221065796761 2012-07-09 19:31:48 So If I need to clarify the above scenario- I have data in above table like this- For USER *`1345653` *I have this PID `*330760137950` *four times but with different timestamps in red color. So I need the output something like this- Output that I need:- *1345653330760137950 2012-07-09 21:43:29 * *1345653330760137950 2012-07-09 21:42:29 * *1345653330760137950 2012-07-09 21:41:29* 1345653110909316904 2012-07-09 21:29:06 1345653221065796761 2012-07-09 19:31:48 So Basically If BID and PID are same but with different timestamps, then I need TOP 3 sorted with TIME in descending order And for this I created rank UDF (User Defined Function). And I wrote the below query but its not working for me. Can anyone help me on this? * * * * *SELECT buyer_id, item_id, created_time* *FROM table1* * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) as int))) = '2012-07-09'* *DISTRIBUTE BY buyer_id* *SORT BY buyer_id, created_time
Re: Something wrong with my query to get TOP 3?
Thanks Jasper for replying back. I have mentioned my use case in my first email. And also I have already wrote the HiveQL query with the rank function working but it is not giving me the exact output that I am supposed to get from the query. On Thu, Jul 19, 2012 at 3:53 PM, Jasper Knulst jasper.knu...@incentro.comwrote: I am not really aware of your use case. Play around with it. At least the rank function is now properly applied. Maybe, remove pid from the DISTRIBUTE and de SORT clauses ?? Jasper 2012/7/20 comptech geeky comptechge...@gmail.com Hi Igor, I am not sure what I have to remove from Distribute By as in distribute by we have bid, pid and you said remove bid and time from distribute by and it doesn't have time *SELECT bid, pid, rank FROM * *(SELECT bid, pid, rank(bid) rank, time, UNIX_TIMESTAMP(time) FROM * * ( SELECT bid, pid, time FROM table1* * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(time) as int))) = '2012-07-09' * *DISTRIBUTE BY bid,pid* *SORT BY bid,pid, time desc) A * * ) B * *WHERE rank 3;* And also I tried running the above query as it is. I am not getting expected output instead of that I am getting output like this which is wrong If you compare my expected output with the below output- *1345653 1109093169040* *1345653 2210657967611* *1345653 3307601379502* On Thu, Jul 19, 2012 at 3:43 PM, Igor Tatarinov i...@decide.com wrote: Remove pid,time from DISTRIBUTE BY. On Thu, Jul 19, 2012 at 1:45 PM, comptech geeky comptechge...@gmail.com wrote: Modified Query that I wrote and its not working as expected output is. * * *SELECT bid, pid, rank(bid), time, UNIX_TIMESTAMP(time)* *FROM (* *SELECT bid, pid, time* *FROM table1* * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(time) as int))) = '2012-07-09'* *DISTRIBUTE BY bid,pid,time* *SORT BY bid, time desc* *) a* *WHERE rank(bid) 3;* On Thu, Jul 19, 2012 at 1:41 PM, comptech geeky comptechge...@gmail.com wrote: I wrote this query after modifying it- *SELECT buyer_id, item_id, rank(buyer_id), created_time, UNIX_TIMESTAMP(created_time)* *FROM (* *SELECT buyer_id, item_id, created_time* *FROM testingtable1* * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) as int))) = '2012-07-09'* *DISTRIBUTE BY buyer_id,item_id* *SORT BY buyer_id, created_time desc* *) a* *WHERE rank(buyer_id) 3;* And the output I got is which is sligtly wrong as it is missing two rows- *1345653330760137950 2012-07-09 21:43:29* *1345653330760137950 2012-07-09 21:42:29* *1345653330760137950 2012-07-09 21:41:29* These two rows are missing- 1345653110909316904 2012-07-09 21:29:06 1345653221065796761 2012-07-09 19:31:48 So full output should be like this- *1345653330760137950 2012-07-09 21:43:29* *1345653330760137950 2012-07-09 21:42:29* *1345653330760137950 2012-07-09 21:41:29* 1345653110909316904 2012-07-09 21:29:06 1345653221065796761 2012-07-09 19:31:48 On Thu, Jul 19, 2012 at 1:29 PM, comptech geeky comptechge...@gmail.com wrote: Can you show me the exact query that I need to do for this particular problem consideing my scenario? It will be of great help to me. As I am new to HiveQL. I need TOP 3 for those if BID and PID gets matched but with different timestamp. On Thu, Jul 19, 2012 at 1:15 PM, Philip Tromans philip.j.trom...@gmail.com wrote: Your rank() is being evaluated map side. Put your distribute by and sort by in an inner query, and then evaluate your rank() in an outer query. Phil. On Jul 19, 2012 9:00 PM, comptech geeky comptechge...@gmail.com wrote: This is the below data in my Table1 BID PID TIME --+-+ 1345653 330760137950 2012-07-09 21:42:29 1345653 330760137950 2012-07-09 21:43:29 1345653 330760137950 2012-07-09 21:40:29 1345653 330760137950 2012-07-09 21:41:29 1345653 110909316904 2012-07-09 21:29:06 1345653 221065796761 2012-07-09 19:31:48 So If I need to clarify the above scenario- I have data in above table like this- For USER *`1345653` *I have this PID `*330760137950` *four times but with different timestamps in red color. So I need the output something like this- Output that I need:- *1345653330760137950 2012-07-09 21:43:29 * *1345653330760137950 2012-07-09 21:42:29 * *1345653330760137950 2012-07-09 21:41:29* 1345653110909316904 2012-07-09 21:29:06 1345653221065796761 2012-07-09 19:31:48 So Basically If BID and PID are same but with different timestamps, then I need TOP 3 sorted with TIME in descending order And for this I created rank UDF (User Defined Function). And I wrote
Re: Something wrong with my query to get TOP 3?
Actually, never mind. Looks like you need to partition by both bid and pid. In that case, your problem is that rank() has to handle a combined bid+pid key. So first you need to create a combined key, partition by that key and pass it to your rank() function (assuming rank() knows to reset on a new key). You can cast bid and pid to string and concatenate them with a separator (bid_pid) to get a single partitioning key. Hope this makes sense. On Thu, Jul 19, 2012 at 3:57 PM, Igor Tatarinov i...@decide.com wrote: Sorry, just pid needs to be dropped from both DISTRIBUTE and SORT clauses. Your very first query was correct except for the nested subquery part. (You don't need a double-nested subquery.) On Thu, Jul 19, 2012 at 3:48 PM, comptech geeky comptechge...@gmail.comwrote: Hi Igor, I am not sure what I have to remove from Distribute By as in distribute by we have bid, pid and you said remove bid and time from distribute by and it doesn't have time *SELECT bid, pid, rank FROM * *(SELECT bid, pid, rank(bid) rank, time, UNIX_TIMESTAMP(time) FROM * * ( SELECT bid, pid, time FROM table1* * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(time) as int))) = '2012-07-09' * *DISTRIBUTE BY bid,pid* *SORT BY bid,pid, time desc) A * * ) B * *WHERE rank 3;* And also I tried running the above query as it is. I am not getting expected output instead of that I am getting output like this which is wrong If you compare my expected output with the below output- *1345653 1109093169040* *1345653 2210657967611* *1345653 3307601379502* On Thu, Jul 19, 2012 at 3:43 PM, Igor Tatarinov i...@decide.com wrote: Remove pid,time from DISTRIBUTE BY. On Thu, Jul 19, 2012 at 1:45 PM, comptech geeky comptechge...@gmail.com wrote: Modified Query that I wrote and its not working as expected output is. * * *SELECT bid, pid, rank(bid), time, UNIX_TIMESTAMP(time)* *FROM (* *SELECT bid, pid, time* *FROM table1* * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(time) as int))) = '2012-07-09'* *DISTRIBUTE BY bid,pid,time* *SORT BY bid, time desc* *) a* *WHERE rank(bid) 3;* On Thu, Jul 19, 2012 at 1:41 PM, comptech geeky comptechge...@gmail.com wrote: I wrote this query after modifying it- *SELECT buyer_id, item_id, rank(buyer_id), created_time, UNIX_TIMESTAMP(created_time)* *FROM (* *SELECT buyer_id, item_id, created_time* *FROM testingtable1* * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) as int))) = '2012-07-09'* *DISTRIBUTE BY buyer_id,item_id* *SORT BY buyer_id, created_time desc* *) a* *WHERE rank(buyer_id) 3;* And the output I got is which is sligtly wrong as it is missing two rows- *1345653330760137950 2012-07-09 21:43:29* *1345653330760137950 2012-07-09 21:42:29* *1345653330760137950 2012-07-09 21:41:29* These two rows are missing- 1345653110909316904 2012-07-09 21:29:06 1345653221065796761 2012-07-09 19:31:48 So full output should be like this- *1345653330760137950 2012-07-09 21:43:29* *1345653330760137950 2012-07-09 21:42:29* *1345653330760137950 2012-07-09 21:41:29* 1345653110909316904 2012-07-09 21:29:06 1345653221065796761 2012-07-09 19:31:48 On Thu, Jul 19, 2012 at 1:29 PM, comptech geeky comptechge...@gmail.com wrote: Can you show me the exact query that I need to do for this particular problem consideing my scenario? It will be of great help to me. As I am new to HiveQL. I need TOP 3 for those if BID and PID gets matched but with different timestamp. On Thu, Jul 19, 2012 at 1:15 PM, Philip Tromans philip.j.trom...@gmail.com wrote: Your rank() is being evaluated map side. Put your distribute by and sort by in an inner query, and then evaluate your rank() in an outer query. Phil. On Jul 19, 2012 9:00 PM, comptech geeky comptechge...@gmail.com wrote: This is the below data in my Table1 BID PID TIME --+-+ 1345653 330760137950 2012-07-09 21:42:29 1345653 330760137950 2012-07-09 21:43:29 1345653 330760137950 2012-07-09 21:40:29 1345653 330760137950 2012-07-09 21:41:29 1345653 110909316904 2012-07-09 21:29:06 1345653 221065796761 2012-07-09 19:31:48 So If I need to clarify the above scenario- I have data in above table like this- For USER *`1345653` *I have this PID `*330760137950` *four times but with different timestamps in red color. So I need the output something like this- Output that I need:- *1345653330760137950 2012-07-09 21:43:29 * *1345653330760137950 2012-07-09 21:42:29 * *1345653330760137950 2012-07-09 21:41:29* 1345653110909316904 2012-07-09 21:29:06 1345653221065796761
Re: Something wrong with my query to get TOP 3?
Hi Igor, I am new to HiveQL world. Don't know that much basically. Currently I have my Rank UDF function like this- *public final class Rank extends UDF{* *private int counter;* *private String last_key;* *public int evaluate(final String key){* * if ( !key.equalsIgnoreCase(this.last_key) ) {* * this.counter = 0;* * this.last_key = key;* * }* * return this.counter++;* *}* *}* * * And I tried that query after removing pid from distribute by and sort by clause, but I got the below output which is wrong again- *1345653 3307601379500* *1345653 3307601379501* *1345653 3307601379502* But I need output something like this- *1345653330760137950 2012-07-09 21:43:29* *1345653330760137950 2012-07-09 21:42:29* *1345653330760137950 2012-07-09 21:41:29* 1345653110909316904 2012-07-09 21:29:06 1345653221065796761 2012-07-09 19:31:48 Any help will be appreciated. On Thu, Jul 19, 2012 at 4:00 PM, Igor Tatarinov i...@decide.com wrote: Actually, never mind. Looks like you need to partition by both bid and pid. In that case, your problem is that rank() has to handle a combined bid+pid key. So first you need to create a combined key, partition by that key and pass it to your rank() function (assuming rank() knows to reset on a new key). You can cast bid and pid to string and concatenate them with a separator (bid_pid) to get a single partitioning key. Hope this makes sense. On Thu, Jul 19, 2012 at 3:57 PM, Igor Tatarinov i...@decide.com wrote: Sorry, just pid needs to be dropped from both DISTRIBUTE and SORT clauses. Your very first query was correct except for the nested subquery part. (You don't need a double-nested subquery.) On Thu, Jul 19, 2012 at 3:48 PM, comptech geeky comptechge...@gmail.comwrote: Hi Igor, I am not sure what I have to remove from Distribute By as in distribute by we have bid, pid and you said remove bid and time from distribute by and it doesn't have time *SELECT bid, pid, rank FROM * *(SELECT bid, pid, rank(bid) rank, time, UNIX_TIMESTAMP(time) FROM * * ( SELECT bid, pid, time FROM table1* * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(time) as int))) = '2012-07-09' * *DISTRIBUTE BY bid,pid* *SORT BY bid,pid, time desc) A * * ) B * *WHERE rank 3;* And also I tried running the above query as it is. I am not getting expected output instead of that I am getting output like this which is wrong If you compare my expected output with the below output- *1345653 1109093169040* *1345653 2210657967611* *1345653 3307601379502* On Thu, Jul 19, 2012 at 3:43 PM, Igor Tatarinov i...@decide.com wrote: Remove pid,time from DISTRIBUTE BY. On Thu, Jul 19, 2012 at 1:45 PM, comptech geeky comptechge...@gmail.com wrote: Modified Query that I wrote and its not working as expected output is. * * *SELECT bid, pid, rank(bid), time, UNIX_TIMESTAMP(time)* *FROM (* *SELECT bid, pid, time* *FROM table1* * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(time) as int))) = '2012-07-09'* *DISTRIBUTE BY bid,pid,time* *SORT BY bid, time desc* *) a* *WHERE rank(bid) 3;* On Thu, Jul 19, 2012 at 1:41 PM, comptech geeky comptechge...@gmail.com wrote: I wrote this query after modifying it- *SELECT buyer_id, item_id, rank(buyer_id), created_time, UNIX_TIMESTAMP(created_time)* *FROM (* *SELECT buyer_id, item_id, created_time* *FROM testingtable1* * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) as int))) = '2012-07-09'* *DISTRIBUTE BY buyer_id,item_id* *SORT BY buyer_id, created_time desc* *) a* *WHERE rank(buyer_id) 3;* And the output I got is which is sligtly wrong as it is missing two rows- *1345653330760137950 2012-07-09 21:43:29* *1345653330760137950 2012-07-09 21:42:29* *1345653330760137950 2012-07-09 21:41:29* These two rows are missing- 1345653110909316904 2012-07-09 21:29:06 1345653221065796761 2012-07-09 19:31:48 So full output should be like this- *1345653330760137950 2012-07-09 21:43:29* *1345653330760137950 2012-07-09 21:42:29* *1345653330760137950 2012-07-09 21:41:29* 1345653110909316904 2012-07-09 21:29:06 1345653221065796761 2012-07-09 19:31:48 On Thu, Jul 19, 2012 at 1:29 PM, comptech geeky comptechge...@gmail.com wrote: Can you show me the exact query that I need to do for this particular problem consideing my scenario? It will be of great help to me. As I am new to HiveQL. I need TOP 3 for those if BID and PID gets matched but with different timestamp. On Thu, Jul 19, 2012 at 1:15 PM, Philip Tromans philip.j.trom...@gmail.com wrote: Your rank() is being evaluated map side. Put your distribute by and sort by in an inner query, and then evaluate your rank() in an outer query.
Re: Something wrong with my query to get TOP 3?
I wrote this query few minutes back- *select bid, pid, time from (* *select bid, pid, time, rank() over (partition by bid, pid order by time desc) as k * *from table1 ) as x * *where k =3* *order by bid, pid, time desc* Do you think this query will work with my Rank function that I provided below? On Thu, Jul 19, 2012 at 4:05 PM, comptech geeky comptechge...@gmail.comwrote: Hi Igor, I am new to HiveQL world. Don't know that much basically. Currently I have my Rank UDF function like this- *public final class Rank extends UDF{* *private int counter;* *private String last_key;* *public int evaluate(final String key){* * if ( !key.equalsIgnoreCase(this.last_key) ) {* * this.counter = 0;* * this.last_key = key;* * }* * return this.counter++;* *}* *}* * * And I tried that query after removing pid from distribute by and sort by clause, but I got the below output which is wrong again- *1345653 3307601379500* *1345653 3307601379501* *1345653 3307601379502* But I need output something like this- *1345653330760137950 2012-07-09 21:43:29* *1345653330760137950 2012-07-09 21:42:29* *1345653330760137950 2012-07-09 21:41:29* 1345653110909316904 2012-07-09 21:29:06 1345653221065796761 2012-07-09 19:31:48 Any help will be appreciated. On Thu, Jul 19, 2012 at 4:00 PM, Igor Tatarinov i...@decide.com wrote: Actually, never mind. Looks like you need to partition by both bid and pid. In that case, your problem is that rank() has to handle a combined bid+pid key. So first you need to create a combined key, partition by that key and pass it to your rank() function (assuming rank() knows to reset on a new key). You can cast bid and pid to string and concatenate them with a separator (bid_pid) to get a single partitioning key. Hope this makes sense. On Thu, Jul 19, 2012 at 3:57 PM, Igor Tatarinov i...@decide.com wrote: Sorry, just pid needs to be dropped from both DISTRIBUTE and SORT clauses. Your very first query was correct except for the nested subquery part. (You don't need a double-nested subquery.) On Thu, Jul 19, 2012 at 3:48 PM, comptech geeky comptechge...@gmail.com wrote: Hi Igor, I am not sure what I have to remove from Distribute By as in distribute by we have bid, pid and you said remove bid and time from distribute by and it doesn't have time *SELECT bid, pid, rank FROM * *(SELECT bid, pid, rank(bid) rank, time, UNIX_TIMESTAMP(time) FROM * * ( SELECT bid, pid, time FROM table1* * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(time) as int))) = '2012-07-09' * *DISTRIBUTE BY bid,pid* *SORT BY bid,pid, time desc) A * * ) B * *WHERE rank 3;* And also I tried running the above query as it is. I am not getting expected output instead of that I am getting output like this which is wrong If you compare my expected output with the below output- *1345653 1109093169040* *1345653 2210657967611* *1345653 3307601379502* On Thu, Jul 19, 2012 at 3:43 PM, Igor Tatarinov i...@decide.comwrote: Remove pid,time from DISTRIBUTE BY. On Thu, Jul 19, 2012 at 1:45 PM, comptech geeky comptechge...@gmail.com wrote: Modified Query that I wrote and its not working as expected output is. * * *SELECT bid, pid, rank(bid), time, UNIX_TIMESTAMP(time)* *FROM (* *SELECT bid, pid, time* *FROM table1* * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(time) as int))) = '2012-07-09'* *DISTRIBUTE BY bid,pid,time* *SORT BY bid, time desc* *) a* *WHERE rank(bid) 3;* On Thu, Jul 19, 2012 at 1:41 PM, comptech geeky comptechge...@gmail.com wrote: I wrote this query after modifying it- *SELECT buyer_id, item_id, rank(buyer_id), created_time, UNIX_TIMESTAMP(created_time)* *FROM (* *SELECT buyer_id, item_id, created_time* *FROM testingtable1* * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) as int))) = '2012-07-09'* *DISTRIBUTE BY buyer_id,item_id* *SORT BY buyer_id, created_time desc* *) a* *WHERE rank(buyer_id) 3;* And the output I got is which is sligtly wrong as it is missing two rows- *1345653330760137950 2012-07-09 21:43:29* *1345653330760137950 2012-07-09 21:42:29* *1345653330760137950 2012-07-09 21:41:29* These two rows are missing- 1345653110909316904 2012-07-09 21:29:06 1345653221065796761 2012-07-09 19:31:48 So full output should be like this- *1345653330760137950 2012-07-09 21:43:29* *1345653330760137950 2012-07-09 21:42:29* *1345653330760137950 2012-07-09 21:41:29* 1345653110909316904 2012-07-09 21:29:06 1345653221065796761 2012-07-09 19:31:48 On Thu, Jul 19, 2012 at 1:29 PM, comptech geeky comptechge...@gmail.com wrote: Can you show me the exact query that I need to do for this
Re: Request write access to the Hive wiki
Granted! :) On Thu, Jul 19, 2012 at 12:56 AM, Lefty Leverenz le...@hortonworks.comwrote: Please grant me write access to the Hive wiki so that I can work on improving the documentation. Thank you. – Lefty Leverenz le...@hortonworks.com
org/apache/thrift/TException error
Hi, I have installed hive and set the environment variable export JAVA_HOME= export HADOOP_HOME= export HIVE_HOME= $ bin/hadoop fs -mkdir /tmp $ bin/hadoop fs -mkdir /user/hive/warehouse $ bin/hadoop fs -chmod g+w /tmp $ bin/hadoop fs -chmod g+w /user/hive/warehouse After this when i execute hive it shows this error [root@slave bin]# hive Exception in thread main java.lang.NoClassDefFoundError: org/apache/thrift/TException at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:247) at org.apache.hadoop.util.RunJar.main(RunJar.java:149) Caused by: java.lang.ClassNotFoundException: org.apache.thrift.TException at java.net.URLClassLoader$1.run(URLClassLoader.java:202) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:190) at java.lang.ClassLoader.loadClass(ClassLoader.java:306) at java.lang.ClassLoader.loadClass(ClassLoader.java:247) ... 3 more Please help Thanks Regards Prabhjot