Improving self join time

2014-03-20 Thread Jeff Storey
I have a table with 10 million rows and 2 columns - id (int) and element (string). I am trying to do a self join that finds any ids where the element values are the same, and my query looks like: select e1.id, e1.tag, e2.id as id2, e2.tag as tag2 from elements e1 JOIN elements e2 on e1.element =

Re: computing median and percentiles

2014-03-20 Thread Stephen Sprague
The short answer is that there is no native Hive UDF that solves your particular case, which means you have to solve it yourself. I searched for something like what you were looking for and found this general recipe: http://www.onlinemathlearning.com/median-frequency-table.html off the top of my head i'm not
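For readers following the linked recipe, here is a minimal Python sketch of computing a median from a frequency table, assuming the data is available as (value, count) pairs. The function name `median_from_frequencies` is hypothetical and is not a Hive UDF; it only illustrates the cumulative-count idea.

```python
def median_from_frequencies(freq):
    """Median of a data set given as (value, count) pairs."""
    items = sorted(freq)                      # order by value
    total = sum(count for _, count in items)  # number of observations
    if total == 0:
        raise ValueError("frequency table is empty")

    # 1-based positions of the middle observation(s);
    # both are the same position when total is odd.
    lo_pos = (total + 1) // 2
    hi_pos = (total + 2) // 2

    def value_at(pos):
        # Walk the cumulative counts until they reach the position.
        running = 0
        for value, count in items:
            running += count
            if running >= pos:
                return value

    return (value_at(lo_pos) + value_at(hi_pos)) / 2


example = median_from_frequencies([(1, 2), (2, 3), (3, 1)])
```

The same cumulative-count walk could in principle be expressed as a grouped query followed by a running sum, but that translation is not shown here.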

Re: Improving self join time

2014-03-20 Thread Stephen Sprague
hmm. would this not fall under the general problem of identifying duplicates? Would something like this meet your needs? (untested) select -- outer query finds the ids for the duplicates key from ( -- inner query lists duplicate values select count(*) as cnt, value

Re: Improving self join time

2014-03-20 Thread Jeff Storey
I don't think this quite fits here. I think the inner query will give me a list of duplicate elements and their counts, but it loses the information as to which id had these elements. I'm trying to find which pairs of ids have any duplicate tags. On Thu, Mar 20, 2014 at 11:57 AM, Stephen Sprague

Re: Improving self join time

2014-03-20 Thread Stephen Sprague
I agree with your assessment of the inner query. why stop there though? Doesn't the outer query fetch the ids of the tags that the inner query identified? On Thu, Mar 20, 2014 at 9:54 AM, Jeff Storey storey.j...@gmail.com wrote: I don't think this quite fits here..I think the inner query will

HiveThrift Service Issue

2014-03-20 Thread Raj Hadoop
Hello everyone, The HiveThrift Service was started successfully. netstat -nl | grep 1 tcp 0 0 0.0.0.0:1 0.0.0.0:* LISTEN I am able to read tables from Hive through Tableau. When executing queries through Tableau I am getting the

Re: Improving self join time

2014-03-20 Thread Jeff Storey
I don't think so since the inner result doesn't have the key field in it. It ends up being select key from (query result that doesn't contain the key field) ... On Thu, Mar 20, 2014 at 1:28 PM, Stephen Sprague sprag...@gmail.com wrote: I agree with your assessment of the inner query. why stop

Re: HiveThrift Service Issue

2014-03-20 Thread Szehon Ho
Hi Raj, There are map-reduce job logs generated if the MapRedTask fails; those might give some clue. Thanks, Szehon On Thu, Mar 20, 2014 at 12:29 PM, Raj Hadoop hadoop...@yahoo.com wrote: I am struggling on this one. Can anyone throw some pointers on how to troubleshoot this issue please?

Re: HiveThrift Service Issue

2014-03-20 Thread Raj Hadoop
Hi Szehon, It is not showing on http://xyzserver:50030/jobtracker.jsp. I checked this log, /tmp/root/hive.log, and it shows: exec.ExecDriver (ExecDriver.java:addInputPaths(853)) - Processing alias table_emp exec.ExecDriver (ExecDriver.java:addInputPaths(871)) - Adding input file

Indexing in Hive 0.12 on a partitioned and bucketed table

2014-03-20 Thread Sagar Mehta
Hi Guys, We have a Hive 0.12 ORC table that is partitioned on year, month, day, hour and is bucketed by one column. So far so good - We are seeing good speed-ups compared to the non-ORC format. - Now we want to add an index on another commonly used column. My question was -

Re: HiveThrift Service Issue

2014-03-20 Thread Szehon Ho
The last line seems to indicate a PrivilegedActionException, maybe you can look more at the rest of the stack if any to see why. On Thu, Mar 20, 2014 at 1:59 PM, Raj Hadoop hadoop...@yahoo.com wrote: Hi Szehon, It is not showing on the http://xyzserver:50030/jobtracker.jsp. I checked

Re: Improving self join time

2014-03-20 Thread Stephen Sprague
so that's your final assessment, eh? :) What is your comment about the outer query _joining on value_ to get the key? On Thu, Mar 20, 2014 at 12:26 PM, Jeff Storey storey.j...@gmail.com wrote: I don't think so since the inner result doesn't have the key field in it. It ends up being
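One possible reading of the suggestion in this thread (an inner query that lists duplicated values, and an outer query that joins back on the value to recover the ids) can be sketched as follows. This is an untested-on-Hive illustration: it uses Python's sqlite3 as a local stand-in, the table and column names (`elements`, `id`, `element`) follow the original post, and the function name `ids_sharing_elements` is hypothetical.

```python
import sqlite3


def ids_sharing_elements(rows):
    """Return (element, id) rows for every element that occurs more than once."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE elements (id INTEGER, element TEXT)")
    conn.executemany("INSERT INTO elements VALUES (?, ?)", rows)

    query = """
        SELECT e.element, e.id
        FROM elements e
        JOIN (
            -- inner query: values that appear on more than one row
            SELECT element
            FROM elements
            GROUP BY element
            HAVING COUNT(*) > 1
        ) dup ON e.element = dup.element   -- outer query joins on the value
        ORDER BY e.element, e.id
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


sample = [(1, "a"), (2, "b"), (3, "a"), (4, "c"), (5, "b"), (6, "a")]
shared = ids_sharing_elements(sample)
```

This returns the ids grouped by shared element rather than explicit id pairs; pairs can be formed within each group afterwards. Whether this is faster than the full self join on a 10-million-row Hive table is not established here.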