Re: Difference between timestamp is 15 minutes

2012-07-19 Thread Nitin Pawar
joins are on columns , what you are trying to do is where condition

On Thu, Jul 19, 2012 at 5:25 AM, Raihan Jamal jamalrai...@gmail.com wrote:

 Something like this will work in Hive?

 *ON ((UNIX_TIMESTAMP(testingtable1.created_time) -
 (prod_and_ts.timestamps / 1000)) / 60* 1000 = 15 minutes)*




 *Raihan Jamal*



 On Wed, Jul 18, 2012 at 4:48 PM, Raihan Jamal jamalrai...@gmail.comwrote:

 This is the CREATED_TIME *`2012-07-17 00:00:22`* and this is the
 Corresponding Timestamp *`1342508427000`*. Here timestamp is *5 seconds*more 
 than the
 *`CREATED_TIME`*. I need to see below scenario

 Currently I have a query, in which I am joining on created_time and
 timestamp like this-

 ON (UNIX_TIMESTAMP(testingtable1.created_time) = (prod_and_ts.timestamps
 / 1000))


 So in above case, it will not match as timestamp is 5 seconds more than
 created_time. But I need if the difference between either of the two is
 within 15 minutes then I need to match it.

 So I need something like this-

 ON (UNIX_TIMESTAMP(testingtable1.created_time) - (prod_and_ts.timestamps
 / 1000) = 15 minutes)


  How I can do the above case if difference between timestamps is within
 15 minutes then data will get matched by the above `ON clause`



 *Raihan Jamal*





-- 
Nitin Pawar


Request write access to the Hive wiki

2012-07-19 Thread Lefty Leverenz
Please grant me write access to the Hive wiki so that I can work on
improving the documentation.

Thank you.

– Lefty Leverenz
   le...@hortonworks.com


Re: Unable to start hive

2012-07-19 Thread Bejoy KS
Hi Prabhjot

Have you set $HADOOP_HOME? If not try setting that, if already set please 
verify whether it is the correct one.

Normally I do set $HADOOP_HOME and $PATH and have never faced any issues.


Regards
Bejoy KS

Sent from handheld, please excuse typos.

-Original Message-
From: iwannaplay games funnlearnfork...@gmail.com
Date: Thu, 19 Jul 2012 15:22:39 
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: Unable to start hive

Hi,

I installed apache hive and set the paths.Still i am getting this error.

[hduser@master bin]$ hive
Exception in thread main java.lang.NoClassDefFoundError:
org/apache/thrift/TException
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:247)
at org.apache.hadoop.util.RunJar.main(RunJar.java:149)
Caused by: java.lang.ClassNotFoundException: org.apache.thrift.TException
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
... 3 more
What is wrong can you please help me out

Thanks in advance

Regards
Prabhjot


Re: Unable to start hive

2012-07-19 Thread iwannaplay games
Hi,

Everything is set.

[hduser@master bin]$ echo $HADOOP_HOME
/usr/local/hadoop
[hduser@master bin]$ echo $PATH
usr/local/sqoop/bin:/usr/local/hive/bin:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/home/hduser/bin:/usr/java/jdk1.6.0_33/bin:/usr/local/hadoop/bin

created two directories
$ bin/hadoop fs -mkdir /tmp
$ bin/hadoop fs -mkdir /user/hive/warehouse
$ bin/hadoop fs -chmod g+w /tmp
$ bin/hadoop fs -chmod g+w /user/hive/warehouse

Do i need to make any changes in files in HIVE_HOME/conf.I have some
templates over there.I need to rename them?
Is there any other change that we make?

Thanks  Regards
Prabhjot


On 7/19/12, Bejoy KS bejoy...@yahoo.com wrote:
 Hi Prabhjot

 Have you set $HADOOP_HOME? If not try setting that, if already set please
 verify whether it is the correct one.

 Normally I do set $HADOOP_HOME and $PATH and have never faced any issues.


 Regards
 Bejoy KS

 Sent from handheld, please excuse typos.

 -Original Message-
 From: iwannaplay games funnlearnfork...@gmail.com
 Date: Thu, 19 Jul 2012 15:22:39
 To: user@hive.apache.org
 Reply-To: user@hive.apache.org
 Subject: Unable to start hive

 Hi,

 I installed apache hive and set the paths.Still i am getting this error.

 [hduser@master bin]$ hive
 Exception in thread main java.lang.NoClassDefFoundError:
 org/apache/thrift/TException
 at java.lang.Class.forName0(Native Method)
 at java.lang.Class.forName(Class.java:247)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:149)
 Caused by: java.lang.ClassNotFoundException: org.apache.thrift.TException
 at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
 ... 3 more
 What is wrong can you please help me out

 Thanks in advance

 Regards
 Prabhjot



Performance tuning a hive query

2012-07-19 Thread Abhishek

Apart from partitions and buckets how to improve of hive queries 

Regards 
Abhi
Sent from my iPhone

Re: Performance tuning a hive query

2012-07-19 Thread Nitin Pawar
depends on what kind of query

if yoy are doing joins then there are different kind of join queries
depending on how did you layout the data and how much of data is held in
what table.

On Thu, Jul 19, 2012 at 6:54 PM, Abhishek abhishek.dod...@gmail.com wrote:


 Apart from partitions and buckets how to improve of hive queries
 *
 *
 *Regards
 *
 Abhi
 Sent from my iPhone




-- 
Nitin Pawar


Re: Performance tuning a hive query

2012-07-19 Thread Jan Dolinár
There are many ways, but beware that some of them may result in worse
performance when used inappropriately.

Some of the settings we use to achieve faster queries:
hive.map.aggr=true
hive.exec.parallel=true
hive.exec.compress.intermediate=true
mapred.job.reuse.jvm.num.tasks=-1

Structuring the queries properly can help a lot. For example if you
eliminate unneeded data early in the query before further processing. E.g.
if you use subquery in FROM, you should put all WHERE clauses where
possible into the subquery, to eliminate the amount of data passed to the
next stage.

Using multi-group-by queries helps a lot when computing multiple queries on
same set of data.

As Nitin Pawar mentioned, the JOINs can be often optimized as well.

Also, fine tuning the hadoop server itself for your specific needs might
help.

I am very interested in optimization of queries as well, so if anyone knows
some more tricks, please share...

J. Dolinar



On Thu, Jul 19, 2012 at 3:24 PM, Abhishek abhishek.dod...@gmail.com wrote:


 Apart from partitions and buckets how to improve of hive queries
 *
 *
 *Regards
 *
 Abhi
 Sent from my iPhone



Something wrong with my query to get TOP 3?

2012-07-19 Thread comptech geeky
This is the below data in my Table1


BID   PID   TIME
--+-+
1345653   330760137950   2012-07-09 21:42:29
1345653   330760137950   2012-07-09 21:43:29
1345653   330760137950   2012-07-09 21:40:29
1345653   330760137950   2012-07-09 21:41:29
1345653   110909316904   2012-07-09 21:29:06
1345653   221065796761   2012-07-09 19:31:48

So If I need to clarify the above scenario- I have data in above table like
this-
For USER *`1345653` *I have this PID `*330760137950` *four times but with
different timestamps in red color. So I need the output something like this-

Output that I need:-

*1345653330760137950   2012-07-09 21:43:29 *
*1345653330760137950   2012-07-09 21:42:29 *
*1345653330760137950   2012-07-09 21:41:29*
1345653110909316904   2012-07-09 21:29:06
1345653221065796761   2012-07-09 19:31:48

So Basically If BID and PID are same but with different timestamps, then I
need TOP 3 sorted with TIME in descending order

And for this I created rank UDF (User Defined Function). And I wrote the
below query but its not working for me. Can anyone help me on this?
*
*
*
*
*SELECT buyer_id, item_id, created_time*
*FROM table1*
* where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) as int))) =
'2012-07-09' AND rank(buyer_id)  3*
*DISTRIBUTE BY buyer_id*
*SORT BY buyer_id, created_time desc*


Re: Something wrong with my query to get TOP 3?

2012-07-19 Thread comptech geeky
Modified Query that I wrote and its not working as expected output is.

*
*
*SELECT bid, pid, rank(bid), time, UNIX_TIMESTAMP(time)*
*FROM (*
*SELECT bid, pid, time*
*FROM table1*
* where to_date(from_unixtime(cast(UNIX_TIMESTAMP(time) as int))) =
'2012-07-09'*
*DISTRIBUTE BY bid,pid,time*
*SORT BY bid, time desc*
*) a*
*WHERE rank(bid)  3;*





On Thu, Jul 19, 2012 at 1:41 PM, comptech geeky comptechge...@gmail.comwrote:

 I wrote this query after modifying it-

 *SELECT buyer_id, item_id, rank(buyer_id), created_time,
 UNIX_TIMESTAMP(created_time)*
 *FROM (*
 *SELECT buyer_id, item_id, created_time*
 *FROM testingtable1*
 * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) as int)))
 = '2012-07-09'*
 *DISTRIBUTE BY buyer_id,item_id*
 *SORT BY buyer_id, created_time desc*
 *) a*
 *WHERE rank(buyer_id)  3;*

 And the output I got is which is sligtly wrong as it is missing two rows-

 *1345653330760137950   2012-07-09 21:43:29*
 *1345653330760137950   2012-07-09 21:42:29*
 *1345653330760137950   2012-07-09 21:41:29*

 These two rows are missing-

 1345653110909316904   2012-07-09 21:29:06
 1345653221065796761   2012-07-09 19:31:48

 So full output should be like this-

 *1345653330760137950   2012-07-09 21:43:29*
 *1345653330760137950   2012-07-09 21:42:29*
 *1345653330760137950   2012-07-09 21:41:29*
 1345653110909316904   2012-07-09 21:29:06
 1345653221065796761   2012-07-09 19:31:48


 On Thu, Jul 19, 2012 at 1:29 PM, comptech geeky 
 comptechge...@gmail.comwrote:

 Can you show me the exact query that I need to do for this particular
 problem consideing my scenario? It will be of great help to me. As I am new
 to HiveQL.

 I need TOP 3 for those if BID and PID gets matched but with different
 timestamp.



 On Thu, Jul 19, 2012 at 1:15 PM, Philip Tromans 
 philip.j.trom...@gmail.com wrote:

 Your rank() is being evaluated map side. Put your distribute by and sort
 by in an inner query, and then evaluate your rank() in an outer query.

 Phil.
 On Jul 19, 2012 9:00 PM, comptech geeky comptechge...@gmail.com
 wrote:

 This is the below data in my Table1


 BID   PID   TIME

 --+-+
 1345653   330760137950   2012-07-09 21:42:29
 1345653   330760137950   2012-07-09 21:43:29
 1345653   330760137950   2012-07-09 21:40:29
 1345653   330760137950   2012-07-09 21:41:29
 1345653   110909316904   2012-07-09 21:29:06
 1345653   221065796761   2012-07-09 19:31:48

 So If I need to clarify the above scenario- I have data in above table
 like this-
 For USER *`1345653` *I have this PID `*330760137950` *four times but
 with different timestamps in red color. So I need the output something like
 this-

 Output that I need:-

 *1345653330760137950   2012-07-09 21:43:29 *
 *1345653330760137950   2012-07-09 21:42:29 *
 *1345653330760137950   2012-07-09 21:41:29*
 1345653110909316904   2012-07-09 21:29:06
 1345653221065796761   2012-07-09 19:31:48

 So Basically If BID and PID are same but with different timestamps,
 then I need TOP 3 sorted with TIME in descending order

 And for this I created rank UDF (User Defined Function). And I wrote
 the below query but its not working for me. Can anyone help me on this?
 *
 *
 *
 *
 *SELECT buyer_id, item_id, created_time*
 *FROM table1*
 * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) as
 int))) = '2012-07-09'*
 *DISTRIBUTE BY buyer_id*
 *SORT BY buyer_id, created_time desc*






Re: Something wrong with my query to get TOP 3?

2012-07-19 Thread comptech geeky
Can anyone help me with this? I have tried other options by tweaking the
query also. I am not able to achieve my expected output.



On Thu, Jul 19, 2012 at 1:45 PM, comptech geeky comptechge...@gmail.comwrote:

 Modified Query that I wrote and its not working as expected output is.

 *
 *
 *SELECT bid, pid, rank(bid), time, UNIX_TIMESTAMP(time)*
 *FROM (*
 *SELECT bid, pid, time*
 *FROM table1*
 * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(time) as int))) =
 '2012-07-09'*
 *DISTRIBUTE BY bid,pid,time*
 *SORT BY bid, time desc*
 *) a*
 *WHERE rank(bid)  3;*





 On Thu, Jul 19, 2012 at 1:41 PM, comptech geeky 
 comptechge...@gmail.comwrote:

 I wrote this query after modifying it-

 *SELECT buyer_id, item_id, rank(buyer_id), created_time,
 UNIX_TIMESTAMP(created_time)*
 *FROM (*
 *SELECT buyer_id, item_id, created_time*
 *FROM testingtable1*
 * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) as
 int))) = '2012-07-09'*
 *DISTRIBUTE BY buyer_id,item_id*
 *SORT BY buyer_id, created_time desc*
 *) a*
 *WHERE rank(buyer_id)  3;*

 And the output I got is which is sligtly wrong as it is missing two rows-

 *1345653330760137950   2012-07-09 21:43:29*
 *1345653330760137950   2012-07-09 21:42:29*
 *1345653330760137950   2012-07-09 21:41:29*

 These two rows are missing-

 1345653110909316904   2012-07-09 21:29:06
 1345653221065796761   2012-07-09 19:31:48

  So full output should be like this-

 *1345653330760137950   2012-07-09 21:43:29*
 *1345653330760137950   2012-07-09 21:42:29*
 *1345653330760137950   2012-07-09 21:41:29*
 1345653110909316904   2012-07-09 21:29:06
 1345653221065796761   2012-07-09 19:31:48


 On Thu, Jul 19, 2012 at 1:29 PM, comptech geeky 
 comptechge...@gmail.comwrote:

 Can you show me the exact query that I need to do for this particular
 problem consideing my scenario? It will be of great help to me. As I am new
 to HiveQL.

 I need TOP 3 for those if BID and PID gets matched but with different
 timestamp.



 On Thu, Jul 19, 2012 at 1:15 PM, Philip Tromans 
 philip.j.trom...@gmail.com wrote:

 Your rank() is being evaluated map side. Put your distribute by and
 sort by in an inner query, and then evaluate your rank() in an outer query.

 Phil.
 On Jul 19, 2012 9:00 PM, comptech geeky comptechge...@gmail.com
 wrote:

 This is the below data in my Table1


 BID   PID   TIME

 --+-+
 1345653   330760137950   2012-07-09 21:42:29
 1345653   330760137950   2012-07-09 21:43:29
 1345653   330760137950   2012-07-09 21:40:29
 1345653   330760137950   2012-07-09 21:41:29
 1345653   110909316904   2012-07-09 21:29:06
 1345653   221065796761   2012-07-09 19:31:48

 So If I need to clarify the above scenario- I have data in above table
 like this-
 For USER *`1345653` *I have this PID `*330760137950` *four times but
 with different timestamps in red color. So I need the output something 
 like
 this-

 Output that I need:-

 *1345653330760137950   2012-07-09 21:43:29 *
 *1345653330760137950   2012-07-09 21:42:29 *
 *1345653330760137950   2012-07-09 21:41:29*
 1345653110909316904   2012-07-09 21:29:06
 1345653221065796761   2012-07-09 19:31:48

 So Basically If BID and PID are same but with different timestamps,
 then I need TOP 3 sorted with TIME in descending order

 And for this I created rank UDF (User Defined Function). And I wrote
 the below query but its not working for me. Can anyone help me on this?
 *
 *
 *
 *
 *SELECT buyer_id, item_id, created_time*
 *FROM table1*
 * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) as
 int))) = '2012-07-09'*
 *DISTRIBUTE BY buyer_id*
 *SORT BY buyer_id, created_time desc*







Re: Something wrong with my query to get TOP 3?

2012-07-19 Thread Jasper Knulst
Hi,

I more or less had the same problem and finally got it down by introducing
a second subquery. This will guarantee that the rank function is invoked
on the reduce phase and that the rank results are properly sorted.


I guess something like this:

*SELECT bid, pid, rank FROM *
  *(SELECT bid, pid, rank(bid) rank, time, UNIX_TIMESTAMP(time)  FROM
*
  *
   ( SELECT bid, pid, time FROM table1*
* where to_date(from_unixtime(cast(UNIX_TIMESTAMP(time) as int))) =
'2012-07-09' *
*DISTRIBUTE BY bid,pid*
*SORT BY bid,pid, time desc) A
*
*
  ) B
WHERE rank  3;*

The A. query is carried out at the map phase and makes sure the record
arrive in the proper order at the reducer
The B. query takes care of applying the rank function (aliased as rank)
The outer most query (a second MR job) takes care of filtering the TOP-3

Maybe this will not run at once but hope you get the idea for the proper
set up. I modelled it after a query that I had to run, which finally
returned the proper TOP-10 for me.


Jasper



2012/7/20 comptech geeky comptechge...@gmail.com

 Can anyone help me with this? I have tried other options by tweaking the
 query also. I am not able to achieve my expected output.



 On Thu, Jul 19, 2012 at 1:45 PM, comptech geeky 
 comptechge...@gmail.comwrote:

 Modified Query that I wrote and its not working as expected output is.

 *
 *
 *SELECT bid, pid, rank(bid), time, UNIX_TIMESTAMP(time)*
 *FROM (*
 *SELECT bid, pid, time*
 *FROM table1*
 * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(time) as int))) =
 '2012-07-09'*
 *DISTRIBUTE BY bid,pid,time*
 *SORT BY bid, time desc*
 *) a*
 *WHERE rank(bid)  3;*





 On Thu, Jul 19, 2012 at 1:41 PM, comptech geeky 
 comptechge...@gmail.comwrote:

 I wrote this query after modifying it-

 *SELECT buyer_id, item_id, rank(buyer_id), created_time,
 UNIX_TIMESTAMP(created_time)*
 *FROM (*
 *SELECT buyer_id, item_id, created_time*
 *FROM testingtable1*
 * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) as
 int))) = '2012-07-09'*
 *DISTRIBUTE BY buyer_id,item_id*
 *SORT BY buyer_id, created_time desc*
 *) a*
 *WHERE rank(buyer_id)  3;*

 And the output I got is which is sligtly wrong as it is missing two rows-

 *1345653330760137950   2012-07-09 21:43:29*
 *1345653330760137950   2012-07-09 21:42:29*
 *1345653330760137950   2012-07-09 21:41:29*

 These two rows are missing-

 1345653110909316904   2012-07-09 21:29:06
 1345653221065796761   2012-07-09 19:31:48

  So full output should be like this-

 *1345653330760137950   2012-07-09 21:43:29*
 *1345653330760137950   2012-07-09 21:42:29*
 *1345653330760137950   2012-07-09 21:41:29*
 1345653110909316904   2012-07-09 21:29:06
 1345653221065796761   2012-07-09 19:31:48


 On Thu, Jul 19, 2012 at 1:29 PM, comptech geeky comptechge...@gmail.com
  wrote:

 Can you show me the exact query that I need to do for this particular
 problem consideing my scenario? It will be of great help to me. As I am new
 to HiveQL.

 I need TOP 3 for those if BID and PID gets matched but with different
 timestamp.



 On Thu, Jul 19, 2012 at 1:15 PM, Philip Tromans 
 philip.j.trom...@gmail.com wrote:

 Your rank() is being evaluated map side. Put your distribute by and
 sort by in an inner query, and then evaluate your rank() in an outer 
 query.

 Phil.
 On Jul 19, 2012 9:00 PM, comptech geeky comptechge...@gmail.com
 wrote:

 This is the below data in my Table1


 BID   PID   TIME

 --+-+
 1345653   330760137950   2012-07-09 21:42:29
 1345653   330760137950   2012-07-09 21:43:29
 1345653   330760137950   2012-07-09 21:40:29
 1345653   330760137950   2012-07-09 21:41:29
 1345653   110909316904   2012-07-09 21:29:06
 1345653   221065796761   2012-07-09 19:31:48

 So If I need to clarify the above scenario- I have data in above
 table like this-
 For USER *`1345653` *I have this PID `*330760137950` *four times but
 with different timestamps in red color. So I need the output something 
 like
 this-

 Output that I need:-

 *1345653330760137950   2012-07-09 21:43:29 *
 *1345653330760137950   2012-07-09 21:42:29 *
 *1345653330760137950   2012-07-09 21:41:29*
 1345653110909316904   2012-07-09 21:29:06
 1345653221065796761   2012-07-09 19:31:48

 So Basically If BID and PID are same but with different timestamps,
 then I need TOP 3 sorted with TIME in descending order

 And for this I created rank UDF (User Defined Function). And I wrote
 the below query but its not working for me. Can anyone help me on this?
 *
 *
 *
 *
 *SELECT buyer_id, item_id, created_time*
 *FROM table1*
 * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) as
 int))) = '2012-07-09'*
 *DISTRIBUTE BY buyer_id*
 *SORT BY buyer_id, 

Re: Something wrong with my query to get TOP 3?

2012-07-19 Thread Igor Tatarinov
Remove pid,time from DISTRIBUTE BY.

On Thu, Jul 19, 2012 at 1:45 PM, comptech geeky comptechge...@gmail.comwrote:

 Modified Query that I wrote and its not working as expected output is.

 *
 *
 *SELECT bid, pid, rank(bid), time, UNIX_TIMESTAMP(time)*
 *FROM (*
 *SELECT bid, pid, time*
 *FROM table1*
 * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(time) as int))) =
 '2012-07-09'*
 *DISTRIBUTE BY bid,pid,time*
 *SORT BY bid, time desc*
 *) a*
 *WHERE rank(bid)  3;*





 On Thu, Jul 19, 2012 at 1:41 PM, comptech geeky 
 comptechge...@gmail.comwrote:

 I wrote this query after modifying it-

 *SELECT buyer_id, item_id, rank(buyer_id), created_time,
 UNIX_TIMESTAMP(created_time)*
 *FROM (*
 *SELECT buyer_id, item_id, created_time*
 *FROM testingtable1*
 * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) as
 int))) = '2012-07-09'*
 *DISTRIBUTE BY buyer_id,item_id*
 *SORT BY buyer_id, created_time desc*
 *) a*
 *WHERE rank(buyer_id)  3;*

 And the output I got is which is sligtly wrong as it is missing two rows-

 *1345653330760137950   2012-07-09 21:43:29*
 *1345653330760137950   2012-07-09 21:42:29*
 *1345653330760137950   2012-07-09 21:41:29*

 These two rows are missing-

 1345653110909316904   2012-07-09 21:29:06
 1345653221065796761   2012-07-09 19:31:48

  So full output should be like this-

 *1345653330760137950   2012-07-09 21:43:29*
 *1345653330760137950   2012-07-09 21:42:29*
 *1345653330760137950   2012-07-09 21:41:29*
 1345653110909316904   2012-07-09 21:29:06
 1345653221065796761   2012-07-09 19:31:48


 On Thu, Jul 19, 2012 at 1:29 PM, comptech geeky 
 comptechge...@gmail.comwrote:

 Can you show me the exact query that I need to do for this particular
 problem consideing my scenario? It will be of great help to me. As I am new
 to HiveQL.

 I need TOP 3 for those if BID and PID gets matched but with different
 timestamp.



 On Thu, Jul 19, 2012 at 1:15 PM, Philip Tromans 
 philip.j.trom...@gmail.com wrote:

 Your rank() is being evaluated map side. Put your distribute by and
 sort by in an inner query, and then evaluate your rank() in an outer query.

 Phil.
 On Jul 19, 2012 9:00 PM, comptech geeky comptechge...@gmail.com
 wrote:

 This is the below data in my Table1


 BID   PID   TIME

 --+-+
 1345653   330760137950   2012-07-09 21:42:29
 1345653   330760137950   2012-07-09 21:43:29
 1345653   330760137950   2012-07-09 21:40:29
 1345653   330760137950   2012-07-09 21:41:29
 1345653   110909316904   2012-07-09 21:29:06
 1345653   221065796761   2012-07-09 19:31:48

 So If I need to clarify the above scenario- I have data in above table
 like this-
 For USER *`1345653` *I have this PID `*330760137950` *four times but
 with different timestamps in red color. So I need the output something 
 like
 this-

 Output that I need:-

 *1345653330760137950   2012-07-09 21:43:29 *
 *1345653330760137950   2012-07-09 21:42:29 *
 *1345653330760137950   2012-07-09 21:41:29*
 1345653110909316904   2012-07-09 21:29:06
 1345653221065796761   2012-07-09 19:31:48

 So Basically If BID and PID are same but with different timestamps,
 then I need TOP 3 sorted with TIME in descending order

 And for this I created rank UDF (User Defined Function). And I wrote
 the below query but its not working for me. Can anyone help me on this?
 *
 *
 *
 *
 *SELECT buyer_id, item_id, created_time*
 *FROM table1*
 * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) as
 int))) = '2012-07-09'*
 *DISTRIBUTE BY buyer_id*
 *SORT BY buyer_id, created_time desc*







Re: Something wrong with my query to get TOP 3?

2012-07-19 Thread comptech geeky
Hi Igor,

I am not sure what I have to remove from Distribute By as in distribute by
we have bid, pid and you said remove bid and time from distribute by and it
doesn't have time

*SELECT bid, pid, rank FROM *
  *(SELECT bid, pid, rank(bid) rank, time, UNIX_TIMESTAMP(time)  FROM
*
  *
   ( SELECT bid, pid, time FROM table1*
* where to_date(from_unixtime(cast(UNIX_TIMESTAMP(time) as int))) =
'2012-07-09' *
*DISTRIBUTE BY bid,pid*
*SORT BY bid,pid, time desc) A
*
*
  ) B
*
*WHERE rank  3;*


And also I tried running the above query as it is. I am not getting
expected output instead of that I am getting output like this which is
wrong If you compare my expected output with the below output-

*1345653 1109093169040*
*1345653 2210657967611*
*1345653 3307601379502*


On Thu, Jul 19, 2012 at 3:43 PM, Igor Tatarinov i...@decide.com wrote:

 Remove pid,time from DISTRIBUTE BY.

 On Thu, Jul 19, 2012 at 1:45 PM, comptech geeky 
 comptechge...@gmail.comwrote:

 Modified Query that I wrote and its not working as expected output is.

 *
 *
 *SELECT bid, pid, rank(bid), time, UNIX_TIMESTAMP(time)*
 *FROM (*
 *SELECT bid, pid, time*
 *FROM table1*
 * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(time) as int))) =
 '2012-07-09'*
 *DISTRIBUTE BY bid,pid,time*
 *SORT BY bid, time desc*
 *) a*
 *WHERE rank(bid)  3;*





 On Thu, Jul 19, 2012 at 1:41 PM, comptech geeky 
 comptechge...@gmail.comwrote:

 I wrote this query after modifying it-

 *SELECT buyer_id, item_id, rank(buyer_id), created_time,
 UNIX_TIMESTAMP(created_time)*
 *FROM (*
 *SELECT buyer_id, item_id, created_time*
 *FROM testingtable1*
 * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) as
 int))) = '2012-07-09'*
 *DISTRIBUTE BY buyer_id,item_id*
 *SORT BY buyer_id, created_time desc*
 *) a*
 *WHERE rank(buyer_id)  3;*

 And the output I got is which is sligtly wrong as it is missing two rows-

 *1345653330760137950   2012-07-09 21:43:29*
 *1345653330760137950   2012-07-09 21:42:29*
 *1345653330760137950   2012-07-09 21:41:29*

 These two rows are missing-

 1345653110909316904   2012-07-09 21:29:06
 1345653221065796761   2012-07-09 19:31:48

  So full output should be like this-

 *1345653330760137950   2012-07-09 21:43:29*
 *1345653330760137950   2012-07-09 21:42:29*
 *1345653330760137950   2012-07-09 21:41:29*
 1345653110909316904   2012-07-09 21:29:06
 1345653221065796761   2012-07-09 19:31:48


 On Thu, Jul 19, 2012 at 1:29 PM, comptech geeky comptechge...@gmail.com
  wrote:

 Can you show me the exact query that I need to do for this particular
 problem consideing my scenario? It will be of great help to me. As I am new
 to HiveQL.

 I need TOP 3 for those if BID and PID gets matched but with different
 timestamp.



 On Thu, Jul 19, 2012 at 1:15 PM, Philip Tromans 
 philip.j.trom...@gmail.com wrote:

 Your rank() is being evaluated map side. Put your distribute by and
 sort by in an inner query, and then evaluate your rank() in an outer 
 query.

 Phil.
 On Jul 19, 2012 9:00 PM, comptech geeky comptechge...@gmail.com
 wrote:

 This is the below data in my Table1


 BID   PID   TIME

 --+-+
 1345653   330760137950   2012-07-09 21:42:29
 1345653   330760137950   2012-07-09 21:43:29
 1345653   330760137950   2012-07-09 21:40:29
 1345653   330760137950   2012-07-09 21:41:29
 1345653   110909316904   2012-07-09 21:29:06
 1345653   221065796761   2012-07-09 19:31:48

 So If I need to clarify the above scenario- I have data in above
 table like this-
 For USER *`1345653` *I have this PID `*330760137950` *four times but
 with different timestamps in red color. So I need the output something 
 like
 this-

 Output that I need:-

 *1345653330760137950   2012-07-09 21:43:29 *
 *1345653330760137950   2012-07-09 21:42:29 *
 *1345653330760137950   2012-07-09 21:41:29*
 1345653110909316904   2012-07-09 21:29:06
 1345653221065796761   2012-07-09 19:31:48

 So Basically If BID and PID are same but with different timestamps,
 then I need TOP 3 sorted with TIME in descending order

 And for this I created rank UDF (User Defined Function). And I wrote
 the below query but its not working for me. Can anyone help me on this?
 *
 *
 *
 *
 *SELECT buyer_id, item_id, created_time*
 *FROM table1*
 * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) as
 int))) = '2012-07-09'*
 *DISTRIBUTE BY buyer_id*
 *SORT BY buyer_id, created_time desc*








Re: Something wrong with my query to get TOP 3?

2012-07-19 Thread Jasper Knulst
I am not really aware of your use case.
Play around with it. At least the rank function is now properly applied.


Maybe, remove pid from the DISTRIBUTE and de SORT clauses ??

Jasper



2012/7/20 comptech geeky comptechge...@gmail.com

 Hi Igor,

 I am not sure what I have to remove from Distribute By as in distribute by
 we have bid, pid and you said remove bid and time from distribute by and it
 doesn't have time

 *SELECT bid, pid, rank FROM *
   *(SELECT bid, pid, rank(bid) rank, time, UNIX_TIMESTAMP(time)  FROM
 *
   *
( SELECT bid, pid, time FROM table1*
 * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(time) as int))) =
 '2012-07-09' *
 *DISTRIBUTE BY bid,pid*
 *SORT BY bid,pid, time desc) A
 *
 *
   ) B
 *
 *WHERE rank  3;*


 And also I tried running the above query as it is. I am not getting
 expected output instead of that I am getting output like this which is
 wrong If you compare my expected output with the below output-

 *1345653 1109093169040*
 *1345653 2210657967611*
 *1345653 3307601379502*


 On Thu, Jul 19, 2012 at 3:43 PM, Igor Tatarinov i...@decide.com wrote:

 Remove pid,time from DISTRIBUTE BY.

 On Thu, Jul 19, 2012 at 1:45 PM, comptech geeky 
 comptechge...@gmail.comwrote:

 Modified Query that I wrote and its not working as expected output is.

 *
 *
 *SELECT bid, pid, rank(bid), time, UNIX_TIMESTAMP(time)*
 *FROM (*
 *SELECT bid, pid, time*
 *FROM table1*
 * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(time) as int))) =
 '2012-07-09'*
 *DISTRIBUTE BY bid,pid,time*
 *SORT BY bid, time desc*
 *) a*
 *WHERE rank(bid)  3;*





 On Thu, Jul 19, 2012 at 1:41 PM, comptech geeky comptechge...@gmail.com
  wrote:

 I wrote this query after modifying it-

 *SELECT buyer_id, item_id, rank(buyer_id), created_time,
 UNIX_TIMESTAMP(created_time)*
 *FROM (*
 *SELECT buyer_id, item_id, created_time*
 *FROM testingtable1*
 * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) as
 int))) = '2012-07-09'*
 *DISTRIBUTE BY buyer_id,item_id*
 *SORT BY buyer_id, created_time desc*
 *) a*
 *WHERE rank(buyer_id)  3;*

 And the output I got is which is sligtly wrong as it is missing two
 rows-

 *1345653330760137950   2012-07-09 21:43:29*
 *1345653330760137950   2012-07-09 21:42:29*
 *1345653330760137950   2012-07-09 21:41:29*

 These two rows are missing-

 1345653110909316904   2012-07-09 21:29:06
 1345653221065796761   2012-07-09 19:31:48

  So full output should be like this-

 *1345653330760137950   2012-07-09 21:43:29*
 *1345653330760137950   2012-07-09 21:42:29*
 *1345653330760137950   2012-07-09 21:41:29*
 1345653110909316904   2012-07-09 21:29:06
 1345653221065796761   2012-07-09 19:31:48


 On Thu, Jul 19, 2012 at 1:29 PM, comptech geeky 
 comptechge...@gmail.com wrote:

 Can you show me the exact query that I need to do for this particular
 problem consideing my scenario? It will be of great help to me. As I am 
 new
 to HiveQL.

 I need TOP 3 for those if BID and PID gets matched but with different
 timestamp.



 On Thu, Jul 19, 2012 at 1:15 PM, Philip Tromans 
 philip.j.trom...@gmail.com wrote:

 Your rank() is being evaluated map side. Put your distribute by and
 sort by in an inner query, and then evaluate your rank() in an outer 
 query.

 Phil.
 On Jul 19, 2012 9:00 PM, comptech geeky comptechge...@gmail.com
 wrote:

 This is the below data in my Table1


 BID   PID   TIME

 --+-+
 1345653   330760137950   2012-07-09 21:42:29
 1345653   330760137950   2012-07-09 21:43:29
 1345653   330760137950   2012-07-09 21:40:29
 1345653   330760137950   2012-07-09 21:41:29
 1345653   110909316904   2012-07-09 21:29:06
 1345653   221065796761   2012-07-09 19:31:48

 So If I need to clarify the above scenario- I have data in above
 table like this-
 For USER *`1345653` *I have this PID `*330760137950` *four times
 but with different timestamps in red color. So I need the output 
 something
 like this-

 Output that I need:-

 *1345653330760137950   2012-07-09 21:43:29 *
 *1345653330760137950   2012-07-09 21:42:29 *
 *1345653330760137950   2012-07-09 21:41:29*
 1345653110909316904   2012-07-09 21:29:06
 1345653221065796761   2012-07-09 19:31:48

 So Basically If BID and PID are same but with different timestamps,
 then I need TOP 3 sorted with TIME in descending order

 And for this I created rank UDF (User Defined Function). And I wrote
 the below query but its not working for me. Can anyone help me on this?
 *
 *
 *
 *
 *SELECT buyer_id, item_id, created_time*
 *FROM table1*
 * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) as
 int))) = '2012-07-09'*
 *DISTRIBUTE BY buyer_id*
 *SORT BY buyer_id, created_time desc*









-- 

Met 

Re: Something wrong with my query to get TOP 3?

2012-07-19 Thread Igor Tatarinov
Sorry, just pid needs to be dropped from both DISTRIBUTE and SORT clauses.
Your very first query was correct except for the nested subquery part. (You
don't need a double-nested subquery.)

On Thu, Jul 19, 2012 at 3:48 PM, comptech geeky comptechge...@gmail.comwrote:

 Hi Igor,

 I am not sure what I have to remove from Distribute By as in distribute by
 we have bid, pid and you said remove bid and time from distribute by and it
 doesn't have time

 *SELECT bid, pid, rank FROM *
   *(SELECT bid, pid, rank(bid) rank, time, UNIX_TIMESTAMP(time)  FROM
 *
   *
( SELECT bid, pid, time FROM table1*
 * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(time) as int))) =
 '2012-07-09' *
 *DISTRIBUTE BY bid,pid*
 *SORT BY bid,pid, time desc) A
 *
 *
   ) B
 *
 *WHERE rank  3;*


 And also I tried running the above query as it is. I am not getting
 expected output instead of that I am getting output like this which is
 wrong If you compare my expected output with the below output-

 *1345653 1109093169040*
 *1345653 2210657967611*
 *1345653 3307601379502*


 On Thu, Jul 19, 2012 at 3:43 PM, Igor Tatarinov i...@decide.com wrote:

 Remove pid,time from DISTRIBUTE BY.

 On Thu, Jul 19, 2012 at 1:45 PM, comptech geeky 
 comptechge...@gmail.comwrote:

 Modified Query that I wrote and its not working as expected output is.

 *
 *
 *SELECT bid, pid, rank(bid), time, UNIX_TIMESTAMP(time)*
 *FROM (*
 *SELECT bid, pid, time*
 *FROM table1*
 * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(time) as int))) =
 '2012-07-09'*
 *DISTRIBUTE BY bid,pid,time*
 *SORT BY bid, time desc*
 *) a*
 *WHERE rank(bid)  3;*





 On Thu, Jul 19, 2012 at 1:41 PM, comptech geeky comptechge...@gmail.com
  wrote:

 I wrote this query after modifying it-

 *SELECT buyer_id, item_id, rank(buyer_id), created_time,
 UNIX_TIMESTAMP(created_time)*
 *FROM (*
 *SELECT buyer_id, item_id, created_time*
 *FROM testingtable1*
 * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) as
 int))) = '2012-07-09'*
 *DISTRIBUTE BY buyer_id,item_id*
 *SORT BY buyer_id, created_time desc*
 *) a*
 *WHERE rank(buyer_id)  3;*

 And the output I got is which is sligtly wrong as it is missing two
 rows-

 *1345653330760137950   2012-07-09 21:43:29*
 *1345653330760137950   2012-07-09 21:42:29*
 *1345653330760137950   2012-07-09 21:41:29*

 These two rows are missing-

 1345653110909316904   2012-07-09 21:29:06
 1345653221065796761   2012-07-09 19:31:48

  So full output should be like this-

 *1345653330760137950   2012-07-09 21:43:29*
 *1345653330760137950   2012-07-09 21:42:29*
 *1345653330760137950   2012-07-09 21:41:29*
 1345653110909316904   2012-07-09 21:29:06
 1345653221065796761   2012-07-09 19:31:48


 On Thu, Jul 19, 2012 at 1:29 PM, comptech geeky 
 comptechge...@gmail.com wrote:

 Can you show me the exact query that I need to do for this particular
 problem consideing my scenario? It will be of great help to me. As I am 
 new
 to HiveQL.

 I need TOP 3 for those if BID and PID gets matched but with different
 timestamp.



 On Thu, Jul 19, 2012 at 1:15 PM, Philip Tromans 
 philip.j.trom...@gmail.com wrote:

 Your rank() is being evaluated map side. Put your distribute by and
 sort by in an inner query, and then evaluate your rank() in an outer 
 query.

 Phil.
 On Jul 19, 2012 9:00 PM, comptech geeky comptechge...@gmail.com
 wrote:

 This is the below data in my Table1


 BID   PID   TIME

 --+-+
 1345653   330760137950   2012-07-09 21:42:29
 1345653   330760137950   2012-07-09 21:43:29
 1345653   330760137950   2012-07-09 21:40:29
 1345653   330760137950   2012-07-09 21:41:29
 1345653   110909316904   2012-07-09 21:29:06
 1345653   221065796761   2012-07-09 19:31:48

 So If I need to clarify the above scenario- I have data in above
 table like this-
 For USER *`1345653` *I have this PID `*330760137950` *four times
 but with different timestamps in red color. So I need the output 
 something
 like this-

 Output that I need:-

 *1345653330760137950   2012-07-09 21:43:29 *
 *1345653330760137950   2012-07-09 21:42:29 *
 *1345653330760137950   2012-07-09 21:41:29*
 1345653110909316904   2012-07-09 21:29:06
 1345653221065796761   2012-07-09 19:31:48

 So Basically If BID and PID are same but with different timestamps,
 then I need TOP 3 sorted with TIME in descending order

 And for this I created rank UDF (User Defined Function). And I wrote
 the below query but its not working for me. Can anyone help me on this?
 *
 *
 *
 *
 *SELECT buyer_id, item_id, created_time*
 *FROM table1*
 * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) as
 int))) = '2012-07-09'*
 *DISTRIBUTE BY buyer_id*
 *SORT BY buyer_id, created_time 

Re: Something wrong with my query to get TOP 3?

2012-07-19 Thread comptech geeky
Thanks Jasper for replying back. I have mentioned my use case in my first
email. And also I have already wrote the HiveQL query with the rank
function working but it is not giving me the exact output that I am
supposed to get from the query.




On Thu, Jul 19, 2012 at 3:53 PM, Jasper Knulst
jasper.knu...@incentro.comwrote:

 I am not really aware of your use case.
 Play around with it. At least the rank function is now properly applied.


 Maybe, remove pid from the DISTRIBUTE and de SORT clauses ??

 Jasper




 2012/7/20 comptech geeky comptechge...@gmail.com

 Hi Igor,

 I am not sure what I have to remove from Distribute By as in distribute
 by we have bid, pid and you said remove bid and time from distribute by and
 it doesn't have time

 *SELECT bid, pid, rank FROM *
   *(SELECT bid, pid, rank(bid) rank, time, UNIX_TIMESTAMP(time)
 FROM
 *
   *
( SELECT bid, pid, time FROM table1*
 * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(time) as int))) =
 '2012-07-09' *
 *DISTRIBUTE BY bid,pid*
 *SORT BY bid,pid, time desc) A
 *
 *
   ) B
 *
 *WHERE rank  3;*


 And also I tried running the above query as it is. I am not getting
 expected output instead of that I am getting output like this which is
 wrong If you compare my expected output with the below output-

  *1345653 1109093169040*
 *1345653 2210657967611*
 *1345653 3307601379502*


 On Thu, Jul 19, 2012 at 3:43 PM, Igor Tatarinov i...@decide.com wrote:

 Remove pid,time from DISTRIBUTE BY.

 On Thu, Jul 19, 2012 at 1:45 PM, comptech geeky comptechge...@gmail.com
  wrote:

 Modified Query that I wrote and its not working as expected output is.

 *
 *
 *SELECT bid, pid, rank(bid), time, UNIX_TIMESTAMP(time)*
 *FROM (*
 *SELECT bid, pid, time*
 *FROM table1*
 * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(time) as int))) =
 '2012-07-09'*
 *DISTRIBUTE BY bid,pid,time*
 *SORT BY bid, time desc*
 *) a*
 *WHERE rank(bid)  3;*





 On Thu, Jul 19, 2012 at 1:41 PM, comptech geeky 
 comptechge...@gmail.com wrote:

 I wrote this query after modifying it-

 *SELECT buyer_id, item_id, rank(buyer_id), created_time,
 UNIX_TIMESTAMP(created_time)*
 *FROM (*
 *SELECT buyer_id, item_id, created_time*
 *FROM testingtable1*
 * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) as
 int))) = '2012-07-09'*
 *DISTRIBUTE BY buyer_id,item_id*
 *SORT BY buyer_id, created_time desc*
 *) a*
 *WHERE rank(buyer_id)  3;*

 And the output I got is which is sligtly wrong as it is missing two
 rows-

 *1345653330760137950   2012-07-09 21:43:29*
 *1345653330760137950   2012-07-09 21:42:29*
 *1345653330760137950   2012-07-09 21:41:29*

 These two rows are missing-

 1345653110909316904   2012-07-09 21:29:06
 1345653221065796761   2012-07-09 19:31:48

  So full output should be like this-

 *1345653330760137950   2012-07-09 21:43:29*
 *1345653330760137950   2012-07-09 21:42:29*
 *1345653330760137950   2012-07-09 21:41:29*
 1345653110909316904   2012-07-09 21:29:06
 1345653221065796761   2012-07-09 19:31:48


 On Thu, Jul 19, 2012 at 1:29 PM, comptech geeky 
 comptechge...@gmail.com wrote:

 Can you show me the exact query that I need to do for this particular
 problem consideing my scenario? It will be of great help to me. As I am 
 new
 to HiveQL.

 I need TOP 3 for those if BID and PID gets matched but with different
 timestamp.



 On Thu, Jul 19, 2012 at 1:15 PM, Philip Tromans 
 philip.j.trom...@gmail.com wrote:

 Your rank() is being evaluated map side. Put your distribute by and
 sort by in an inner query, and then evaluate your rank() in an outer 
 query.

 Phil.
 On Jul 19, 2012 9:00 PM, comptech geeky comptechge...@gmail.com
 wrote:

 This is the below data in my Table1


 BID   PID   TIME

 --+-+
 1345653   330760137950   2012-07-09 21:42:29
 1345653   330760137950   2012-07-09 21:43:29
 1345653   330760137950   2012-07-09 21:40:29
 1345653   330760137950   2012-07-09 21:41:29
 1345653   110909316904   2012-07-09 21:29:06
 1345653   221065796761   2012-07-09 19:31:48

 So If I need to clarify the above scenario- I have data in above
 table like this-
 For USER *`1345653` *I have this PID `*330760137950` *four times
 but with different timestamps in red color. So I need the output 
 something
 like this-

 Output that I need:-

 *1345653330760137950   2012-07-09 21:43:29 *
 *1345653330760137950   2012-07-09 21:42:29 *
 *1345653330760137950   2012-07-09 21:41:29*
 1345653110909316904   2012-07-09 21:29:06
 1345653221065796761   2012-07-09 19:31:48

 So Basically If BID and PID are same but with different timestamps,
 then I need TOP 3 sorted with TIME in descending order

 And for this I created rank UDF (User Defined Function). And I
 wrote 

Re: Something wrong with my query to get TOP 3?

2012-07-19 Thread Igor Tatarinov
Actually, never mind. Looks like you need to partition by both bid and pid.
In that case, your problem is that rank() has to handle a combined bid+pid
key. So first you need to create a combined key, partition by that key and
pass it to your rank() function (assuming rank() knows to reset on a new
key). You can cast bid and pid to string and concatenate them with a
separator (bid_pid) to get a single partitioning key. Hope this makes sense.

On Thu, Jul 19, 2012 at 3:57 PM, Igor Tatarinov i...@decide.com wrote:

 Sorry, just pid needs to be dropped from both DISTRIBUTE and SORT clauses.
 Your very first query was correct except for the nested subquery part.
 (You don't need a double-nested subquery.)

 On Thu, Jul 19, 2012 at 3:48 PM, comptech geeky 
 comptechge...@gmail.comwrote:

 Hi Igor,

 I am not sure what I have to remove from Distribute By as in distribute
 by we have bid, pid and you said remove bid and time from distribute by and
 it doesn't have time

 *SELECT bid, pid, rank FROM *
   *(SELECT bid, pid, rank(bid) rank, time, UNIX_TIMESTAMP(time)
 FROM
 *
   *
( SELECT bid, pid, time FROM table1*
 * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(time) as int))) =
 '2012-07-09' *
 *DISTRIBUTE BY bid,pid*
 *SORT BY bid,pid, time desc) A
 *
 *
   ) B
 *
 *WHERE rank  3;*


 And also I tried running the above query as it is. I am not getting
 expected output instead of that I am getting output like this which is
 wrong If you compare my expected output with the below output-

  *1345653 1109093169040*
 *1345653 2210657967611*
 *1345653 3307601379502*


 On Thu, Jul 19, 2012 at 3:43 PM, Igor Tatarinov i...@decide.com wrote:

 Remove pid,time from DISTRIBUTE BY.

 On Thu, Jul 19, 2012 at 1:45 PM, comptech geeky comptechge...@gmail.com
  wrote:

 Modified Query that I wrote and its not working as expected output is.

 *
 *
 *SELECT bid, pid, rank(bid), time, UNIX_TIMESTAMP(time)*
 *FROM (*
 *SELECT bid, pid, time*
 *FROM table1*
 * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(time) as int))) =
 '2012-07-09'*
 *DISTRIBUTE BY bid,pid,time*
 *SORT BY bid, time desc*
 *) a*
 *WHERE rank(bid)  3;*





 On Thu, Jul 19, 2012 at 1:41 PM, comptech geeky 
 comptechge...@gmail.com wrote:

 I wrote this query after modifying it-

 *SELECT buyer_id, item_id, rank(buyer_id), created_time,
 UNIX_TIMESTAMP(created_time)*
 *FROM (*
 *SELECT buyer_id, item_id, created_time*
 *FROM testingtable1*
 * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) as
 int))) = '2012-07-09'*
 *DISTRIBUTE BY buyer_id,item_id*
 *SORT BY buyer_id, created_time desc*
 *) a*
 *WHERE rank(buyer_id)  3;*

 And the output I got is which is sligtly wrong as it is missing two
 rows-

 *1345653330760137950   2012-07-09 21:43:29*
 *1345653330760137950   2012-07-09 21:42:29*
 *1345653330760137950   2012-07-09 21:41:29*

 These two rows are missing-

 1345653110909316904   2012-07-09 21:29:06
 1345653221065796761   2012-07-09 19:31:48

  So full output should be like this-

 *1345653330760137950   2012-07-09 21:43:29*
 *1345653330760137950   2012-07-09 21:42:29*
 *1345653330760137950   2012-07-09 21:41:29*
 1345653110909316904   2012-07-09 21:29:06
 1345653221065796761   2012-07-09 19:31:48


 On Thu, Jul 19, 2012 at 1:29 PM, comptech geeky 
 comptechge...@gmail.com wrote:

 Can you show me the exact query that I need to do for this particular
 problem consideing my scenario? It will be of great help to me. As I am 
 new
 to HiveQL.

 I need TOP 3 for those if BID and PID gets matched but with different
 timestamp.



 On Thu, Jul 19, 2012 at 1:15 PM, Philip Tromans 
 philip.j.trom...@gmail.com wrote:

 Your rank() is being evaluated map side. Put your distribute by and
 sort by in an inner query, and then evaluate your rank() in an outer 
 query.

 Phil.
 On Jul 19, 2012 9:00 PM, comptech geeky comptechge...@gmail.com
 wrote:

 This is the below data in my Table1


 BID   PID   TIME

 --+-+
 1345653   330760137950   2012-07-09 21:42:29
 1345653   330760137950   2012-07-09 21:43:29
 1345653   330760137950   2012-07-09 21:40:29
 1345653   330760137950   2012-07-09 21:41:29
 1345653   110909316904   2012-07-09 21:29:06
 1345653   221065796761   2012-07-09 19:31:48

 So If I need to clarify the above scenario- I have data in above
 table like this-
 For USER *`1345653` *I have this PID `*330760137950` *four times
 but with different timestamps in red color. So I need the output 
 something
 like this-

 Output that I need:-

 *1345653330760137950   2012-07-09 21:43:29 *
 *1345653330760137950   2012-07-09 21:42:29 *
 *1345653330760137950   2012-07-09 21:41:29*
 1345653110909316904   2012-07-09 21:29:06
 1345653221065796761   

Re: Something wrong with my query to get TOP 3?

2012-07-19 Thread comptech geeky
Hi Igor,

I am new to HiveQL world. Don't know that much basically. Currently I have
my Rank UDF function like this-

*public final class Rank extends UDF{*
*private int  counter;*
*private String last_key;*
*public int evaluate(final String key){*
*  if ( !key.equalsIgnoreCase(this.last_key) ) {*
* this.counter = 0;*
* this.last_key = key;*
*  }*
*  return this.counter++;*
*}*
*}*
*
*
And I tried that query after removing pid from distribute by and sort by
clause, but I got the below output which is wrong again-

*1345653 3307601379500*
*1345653 3307601379501*
*1345653 3307601379502*

But I need output something like this-

*1345653330760137950   2012-07-09 21:43:29*
*1345653330760137950   2012-07-09 21:42:29*
*1345653330760137950   2012-07-09 21:41:29*
1345653110909316904   2012-07-09 21:29:06
1345653221065796761   2012-07-09 19:31:48


Any help will be appreciated.




On Thu, Jul 19, 2012 at 4:00 PM, Igor Tatarinov i...@decide.com wrote:

 Actually, never mind. Looks like you need to partition by both bid and
 pid. In that case, your problem is that rank() has to handle a combined
 bid+pid key. So first you need to create a combined key, partition by that
 key and pass it to your rank() function (assuming rank() knows to reset on
 a new key). You can cast bid and pid to string and concatenate them with a
 separator (bid_pid) to get a single partitioning key. Hope this makes sense.

 On Thu, Jul 19, 2012 at 3:57 PM, Igor Tatarinov i...@decide.com wrote:

 Sorry, just pid needs to be dropped from both DISTRIBUTE and SORT clauses.
 Your very first query was correct except for the nested subquery part.
 (You don't need a double-nested subquery.)

 On Thu, Jul 19, 2012 at 3:48 PM, comptech geeky 
 comptechge...@gmail.comwrote:

 Hi Igor,

 I am not sure what I have to remove from Distribute By as in distribute
 by we have bid, pid and you said remove bid and time from distribute by and
 it doesn't have time

 *SELECT bid, pid, rank FROM *
   *(SELECT bid, pid, rank(bid) rank, time, UNIX_TIMESTAMP(time)
 FROM
 *
   *
( SELECT bid, pid, time FROM table1*
 * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(time) as int))) =
 '2012-07-09' *
 *DISTRIBUTE BY bid,pid*
 *SORT BY bid,pid, time desc) A
 *
 *
   ) B
 *
 *WHERE rank  3;*


 And also I tried running the above query as it is. I am not getting
 expected output instead of that I am getting output like this which is
 wrong If you compare my expected output with the below output-

  *1345653 1109093169040*
 *1345653 2210657967611*
 *1345653 3307601379502*


 On Thu, Jul 19, 2012 at 3:43 PM, Igor Tatarinov i...@decide.com wrote:

 Remove pid,time from DISTRIBUTE BY.

 On Thu, Jul 19, 2012 at 1:45 PM, comptech geeky 
 comptechge...@gmail.com wrote:

 Modified Query that I wrote and its not working as expected output is.

 *
 *
 *SELECT bid, pid, rank(bid), time, UNIX_TIMESTAMP(time)*
 *FROM (*
 *SELECT bid, pid, time*
 *FROM table1*
 * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(time) as int))) =
 '2012-07-09'*
 *DISTRIBUTE BY bid,pid,time*
 *SORT BY bid, time desc*
 *) a*
 *WHERE rank(bid)  3;*





 On Thu, Jul 19, 2012 at 1:41 PM, comptech geeky 
 comptechge...@gmail.com wrote:

 I wrote this query after modifying it-

 *SELECT buyer_id, item_id, rank(buyer_id), created_time,
 UNIX_TIMESTAMP(created_time)*
 *FROM (*
 *SELECT buyer_id, item_id, created_time*
 *FROM testingtable1*
 * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) as
 int))) = '2012-07-09'*
 *DISTRIBUTE BY buyer_id,item_id*
 *SORT BY buyer_id, created_time desc*
 *) a*
 *WHERE rank(buyer_id)  3;*

 And the output I got is which is sligtly wrong as it is missing two
 rows-

 *1345653330760137950   2012-07-09 21:43:29*
 *1345653330760137950   2012-07-09 21:42:29*
 *1345653330760137950   2012-07-09 21:41:29*

 These two rows are missing-

 1345653110909316904   2012-07-09 21:29:06
 1345653221065796761   2012-07-09 19:31:48

  So full output should be like this-

 *1345653330760137950   2012-07-09 21:43:29*
 *1345653330760137950   2012-07-09 21:42:29*
 *1345653330760137950   2012-07-09 21:41:29*
 1345653110909316904   2012-07-09 21:29:06
 1345653221065796761   2012-07-09 19:31:48


 On Thu, Jul 19, 2012 at 1:29 PM, comptech geeky 
 comptechge...@gmail.com wrote:

 Can you show me the exact query that I need to do for this
 particular problem consideing my scenario? It will be of great help to 
 me.
 As I am new to HiveQL.

 I need TOP 3 for those if BID and PID gets matched but with
 different timestamp.



 On Thu, Jul 19, 2012 at 1:15 PM, Philip Tromans 
 philip.j.trom...@gmail.com wrote:

 Your rank() is being evaluated map side. Put your distribute by and
 sort by in an inner query, and then evaluate your rank() in an outer 
 query.

 

Re: Something wrong with my query to get TOP 3?

2012-07-19 Thread comptech geeky
I wrote this query few minutes back-

*select bid, pid, time from (*
*select bid, pid, time, rank() over (partition by bid, pid order by
time desc) as k *
*from table1 ) as x *
*where k  =3*
*order by bid, pid, time desc*


Do you think this query will work with my Rank function that I provided
below?




On Thu, Jul 19, 2012 at 4:05 PM, comptech geeky comptechge...@gmail.comwrote:

 Hi Igor,

 I am new to HiveQL world. Don't know that much basically. Currently I have
 my Rank UDF function like this-

 *public final class Rank extends UDF{*
 *private int  counter;*
 *private String last_key;*
 *public int evaluate(final String key){*
 *  if ( !key.equalsIgnoreCase(this.last_key) ) {*
 * this.counter = 0;*
 * this.last_key = key;*
 *  }*
 *  return this.counter++;*
 *}*
 *}*
 *
 *
 And I tried that query after removing pid from distribute by and sort by
 clause, but I got the below output which is wrong again-

 *1345653 3307601379500*
 *1345653 3307601379501*
 *1345653 3307601379502*

 But I need output something like this-

 *1345653330760137950   2012-07-09 21:43:29*
 *1345653330760137950   2012-07-09 21:42:29*
 *1345653330760137950   2012-07-09 21:41:29*
 1345653110909316904   2012-07-09 21:29:06
 1345653221065796761   2012-07-09 19:31:48


 Any help will be appreciated.




 On Thu, Jul 19, 2012 at 4:00 PM, Igor Tatarinov i...@decide.com wrote:

 Actually, never mind. Looks like you need to partition by both bid and
 pid. In that case, your problem is that rank() has to handle a combined
 bid+pid key. So first you need to create a combined key, partition by that
 key and pass it to your rank() function (assuming rank() knows to reset on
 a new key). You can cast bid and pid to string and concatenate them with a
 separator (bid_pid) to get a single partitioning key. Hope this makes sense.

 On Thu, Jul 19, 2012 at 3:57 PM, Igor Tatarinov i...@decide.com wrote:

 Sorry, just pid needs to be dropped from both DISTRIBUTE and SORT
 clauses.
 Your very first query was correct except for the nested subquery part.
 (You don't need a double-nested subquery.)

 On Thu, Jul 19, 2012 at 3:48 PM, comptech geeky comptechge...@gmail.com
  wrote:

 Hi Igor,

 I am not sure what I have to remove from Distribute By as in distribute
 by we have bid, pid and you said remove bid and time from distribute by and
 it doesn't have time

 *SELECT bid, pid, rank FROM *
   *(SELECT bid, pid, rank(bid) rank, time, UNIX_TIMESTAMP(time)
 FROM
 *
   *
( SELECT bid, pid, time FROM table1*
 * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(time) as int))) =
 '2012-07-09' *
 *DISTRIBUTE BY bid,pid*
 *SORT BY bid,pid, time desc) A
 *
 *
   ) B
 *
 *WHERE rank  3;*


 And also I tried running the above query as it is. I am not getting
 expected output instead of that I am getting output like this which is
 wrong If you compare my expected output with the below output-

  *1345653 1109093169040*
 *1345653 2210657967611*
 *1345653 3307601379502*


 On Thu, Jul 19, 2012 at 3:43 PM, Igor Tatarinov i...@decide.comwrote:

 Remove pid,time from DISTRIBUTE BY.

 On Thu, Jul 19, 2012 at 1:45 PM, comptech geeky 
 comptechge...@gmail.com wrote:

 Modified Query that I wrote and its not working as expected output is.

 *
 *
 *SELECT bid, pid, rank(bid), time, UNIX_TIMESTAMP(time)*
 *FROM (*
 *SELECT bid, pid, time*
 *FROM table1*
 * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(time) as int))) =
 '2012-07-09'*
 *DISTRIBUTE BY bid,pid,time*
 *SORT BY bid, time desc*
 *) a*
 *WHERE rank(bid)  3;*





 On Thu, Jul 19, 2012 at 1:41 PM, comptech geeky 
 comptechge...@gmail.com wrote:

 I wrote this query after modifying it-

 *SELECT buyer_id, item_id, rank(buyer_id), created_time,
 UNIX_TIMESTAMP(created_time)*
 *FROM (*
 *SELECT buyer_id, item_id, created_time*
 *FROM testingtable1*
 * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) as
 int))) = '2012-07-09'*
 *DISTRIBUTE BY buyer_id,item_id*
 *SORT BY buyer_id, created_time desc*
 *) a*
 *WHERE rank(buyer_id)  3;*

 And the output I got is which is sligtly wrong as it is missing two
 rows-

 *1345653330760137950   2012-07-09 21:43:29*
 *1345653330760137950   2012-07-09 21:42:29*
 *1345653330760137950   2012-07-09 21:41:29*

 These two rows are missing-

 1345653110909316904   2012-07-09 21:29:06
 1345653221065796761   2012-07-09 19:31:48

  So full output should be like this-

 *1345653330760137950   2012-07-09 21:43:29*
 *1345653330760137950   2012-07-09 21:42:29*
 *1345653330760137950   2012-07-09 21:41:29*
 1345653110909316904   2012-07-09 21:29:06
 1345653221065796761   2012-07-09 19:31:48


 On Thu, Jul 19, 2012 at 1:29 PM, comptech geeky 
 comptechge...@gmail.com wrote:

 Can you show me the exact query that I need to do for this
 

Re: Request write access to the Hive wiki

2012-07-19 Thread Carl Steinbach
Granted! :)

On Thu, Jul 19, 2012 at 12:56 AM, Lefty Leverenz le...@hortonworks.comwrote:

 Please grant me write access to the Hive wiki so that I can work on
 improving the documentation.

 Thank you.

 – Lefty Leverenz
le...@hortonworks.com





org/apache/thrift/TException error

2012-07-19 Thread iwannaplay games
Hi,


I have installed hive and set the environment variable
export JAVA_HOME=
export HADOOP_HOME=
export HIVE_HOME=

$ bin/hadoop fs -mkdir /tmp
$ bin/hadoop fs -mkdir /user/hive/warehouse
$ bin/hadoop fs -chmod g+w /tmp
$ bin/hadoop fs -chmod g+w /user/hive/warehouse

After this when i execute hive it shows this error
[root@slave bin]# hive
Exception in thread main java.lang.NoClassDefFoundError:
org/apache/thrift/TException
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:247)
at org.apache.hadoop.util.RunJar.main(RunJar.java:149)
Caused by: java.lang.ClassNotFoundException: org.apache.thrift.TException
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
... 3 more
Please help

Thanks

Regards
Prabhjot