Get arguments' names in Hive's UDF

2013-07-21 Thread Felix .
Hi all,

Is there any API to retrieve a parameter's column name in a GenericUDF?
For example:

SELECT UDFTEST(columnA, columnB) FROM test;

I want to get the column names (columnA and columnB) in
UDFTEST's initialize function via the ObjectInspector, but I have not found
any viable solution.
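
(For reference, a minimal sketch against the classic GenericUDF API; the
class name UdfTest and the printed diagnostics are illustrative. It shows
why this is hard: initialize() only receives ObjectInspectors, which
describe argument types, not source column names, since an argument can be
an arbitrary expression. getDisplayString() does receive the argument
expressions as text, which for plain column arguments contains the column
names, but only for explain/display purposes.)

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

public class UdfTest extends GenericUDF {
  @Override
  public ObjectInspector initialize(ObjectInspector[] arguments)
      throws UDFArgumentException {
    for (ObjectInspector oi : arguments) {
      // getTypeName() yields the argument's type (e.g. "string"); there is
      // no getColumnName(), because the argument may not be a column at all.
      System.err.println("argument type: " + oi.getTypeName());
    }
    // Declare a string return type for this sketch.
    return PrimitiveObjectInspectorFactory.javaStringObjectInspector;
  }

  @Override
  public Object evaluate(DeferredObject[] arguments) throws HiveException {
    return null; // evaluation logic omitted in this sketch
  }

  @Override
  public String getDisplayString(String[] children) {
    // children holds the textual form of each argument expression,
    // e.g. "columna" and "columnb" for UDFTEST(columnA, columnB).
    StringBuilder sb = new StringBuilder("UDFTEST(");
    for (int i = 0; i < children.length; i++) {
      if (i > 0) sb.append(",");
      sb.append(children[i]);
    }
    return sb.append(")").toString();
  }
}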


Possible to specify reducers for each stage?

2013-07-02 Thread Felix .
Hi all,

Is it possible to specify the number of reducers for each stage? If so, how?

Thanks!
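
(For reference, the usual session settings look like this; a sketch, and
note that mapred.reduce.tasks applies to every MapReduce job the query
spawns, so this is not per-stage control:)

-- Force a fixed reducer count for all jobs of subsequent queries
SET mapred.reduce.tasks=30;

-- Or let Hive estimate the count from input size (the default, -1)
SET mapred.reduce.tasks=-1;
SET hive.exec.reducers.bytes.per.reducer=1000000000;  -- input bytes per reducer
SET hive.exec.reducers.max=999;                       -- cap on the estimate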


Re: Performance difference between tuning reducer num and partition table

2013-06-30 Thread Felix .
Hi Dean,

Thanks for your reply. If I don't set the number of reducers in the 1st run,
the number of reducers is much smaller and the performance is worse. The
total output size is about 200MB, and I see that many reduce output files
are empty; only 10 of them contain data.
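
(For what it's worth, empty reducer outputs like this are often handled by
lowering the reducer count or by letting Hive merge small files; a sketch of
the merge settings, with property names worth verifying against your Hive
version:)

SET hive.merge.mapfiles=true;                -- merge small outputs of map-only jobs
SET hive.merge.mapredfiles=true;             -- merge small outputs of map-reduce jobs
SET hive.merge.size.per.task=256000000;      -- target merged file size (bytes)
SET hive.merge.smallfiles.avgsize=16000000;  -- merge when avg output file is smaller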

Another question: is there any documentation about the job-specific
parameters of MapReduce and Hive?




2013/6/29 Dean Wampler deanwamp...@gmail.com

 What happens if you don't set the number of reducers in the 1st run? How
 many reducers are executed? If it's a much smaller number, the extra
 overhead could matter. Another clue is the size of the files the first run
 produced, i.e., do you have 30 small (much less than a block size) files?

 On Sat, Jun 29, 2013 at 12:27 AM, Felix.徐 ygnhz...@gmail.com wrote:

 Hi Stephen,

 My query is actually more complex; Hive will generate 2 MapReduce jobs.
 In the first solution, it runs 17 mappers / 30 reducers and 10 mappers /
 30 reducers (the reducer number is set manually);
 in the second solution, it runs 6 mappers / 1 reducer and 4 mappers / 1
 reducer for each partition.

 I do not know whether they could achieve the same performance if the
 number of reducers is set properly.


 2013/6/29 Stephen Sprague sprag...@gmail.com

 Great question.  Your parallelization seems to trump Hadoop's. I
 guess I'd ask: what is the _total_ number of mappers and reducers that run
 on your cluster for these two scenarios? I'd be curious whether they are
 the same.




 On Fri, Jun 28, 2013 at 8:40 AM, Felix.徐 ygnhz...@gmail.com wrote:

 Hi all,

 Here is the scenario: suppose I have 2 tables, A and B, and I would like
 to perform a simple join on them.

 We can do it like this:

 INSERT OVERWRITE TABLE C
 SELECT * FROM A JOIN B ON A.id=B.id

 In order to speed up this query, since tables A and B have lots of data,
 another approach is:

 Say I partition tables A and B into 10 partitions respectively, and
 write the query like this:

 INSERT OVERWRITE TABLE C PARTITION(pid=1)
 SELECT * FROM A JOIN B ON A.id=B.id WHERE A.pid=1 AND B.pid=1

 Then I run this query 10 times concurrently (pid ranges from 1 to 10).

 And my question is: in my observation of some more complex
 queries, the second solution is about 15% faster than the first solution.
 Is it simply because the reducer count is not optimal?
 If resources are not a limit and it is possible to set the proper
 number of reducers in the first solution, can they achieve the same
 performance? Is there any other factor that can cause a performance
 difference between them (non-partitioned vs. partitioned + concurrent)
 besides the job parameter issues?

 Thanks!






 --
 Dean Wampler, Ph.D.
 @deanwampler
 http://polyglotprogramming.com


Performance difference between tuning reducer num and partition table

2013-06-28 Thread Felix .
Hi all,

Here is the scenario: suppose I have 2 tables, A and B, and I would like to
perform a simple join on them.

We can do it like this:

INSERT OVERWRITE TABLE C
SELECT * FROM A JOIN B ON A.id=B.id

In order to speed up this query, since tables A and B have lots of data,
another approach is:

Say I partition tables A and B into 10 partitions respectively, and write
the query like this:

INSERT OVERWRITE TABLE C PARTITION(pid=1)
SELECT * FROM A JOIN B ON A.id=B.id WHERE A.pid=1 AND B.pid=1

Then I run this query 10 times concurrently (pid ranges from 1 to 10).

And my question is: in my observation of some more complex queries,
the second solution is about 15% faster than the first solution.
Is it simply because the reducer count is not optimal?
If resources are not a limit and it is possible to set the proper number
of reducers in the first solution, can they achieve the same performance?
Is there any other factor that can cause a performance difference between
them (non-partitioned vs. partitioned + concurrent) besides the job
parameter issues?

Thanks!
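
(As an aside, the ten concurrent INSERTs can often be collapsed into a
single statement with dynamic partitioning; a sketch, with a hypothetical
column list since C's schema isn't shown:)

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- One query writes all ten partitions; the partition column must come last.
INSERT OVERWRITE TABLE C PARTITION (pid)
SELECT a.id, b.val, a.pid   -- a.id and b.val are hypothetical columns
FROM A a JOIN B b ON a.id = b.id AND a.pid = b.pid;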


Re: Performance difference between tuning reducer num and partition table

2013-06-28 Thread Felix .
Hi Stephen,

My query is actually more complex; Hive will generate 2 MapReduce jobs.
In the first solution, it runs 17 mappers / 30 reducers and 10 mappers /
30 reducers (the reducer number is set manually);
in the second solution, it runs 6 mappers / 1 reducer and 4 mappers / 1
reducer for each partition.

I do not know whether they could achieve the same performance if the
number of reducers is set properly.
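
(Since the query compiles to two MapReduce jobs, the settings usually tried
here look like this; a sketch, assuming the stages are independent enough
to overlap:)

SET mapred.reduce.tasks=30;              -- applies to every job the query spawns
SET hive.exec.parallel=true;             -- run independent stages concurrently
SET hive.exec.parallel.thread.number=8;  -- how many stages may run at once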


2013/6/29 Stephen Sprague sprag...@gmail.com

 Great question.  Your parallelization seems to trump Hadoop's. I guess
 I'd ask: what is the _total_ number of mappers and reducers that run on
 your cluster for these two scenarios? I'd be curious whether they are
 the same.




 On Fri, Jun 28, 2013 at 8:40 AM, Felix.徐 ygnhz...@gmail.com wrote:

 Hi all,

 Here is the scenario: suppose I have 2 tables, A and B, and I would like
 to perform a simple join on them.

 We can do it like this:

 INSERT OVERWRITE TABLE C
 SELECT * FROM A JOIN B ON A.id=B.id

 In order to speed up this query, since tables A and B have lots of data,
 another approach is:

 Say I partition tables A and B into 10 partitions respectively, and write
 the query like this:

 INSERT OVERWRITE TABLE C PARTITION(pid=1)
 SELECT * FROM A JOIN B ON A.id=B.id WHERE A.pid=1 AND B.pid=1

 Then I run this query 10 times concurrently (pid ranges from 1 to 10).

 And my question is: in my observation of some more complex queries,
 the second solution is about 15% faster than the first solution.
 Is it simply because the reducer count is not optimal?
 If resources are not a limit and it is possible to set the proper
 number of reducers in the first solution, can they achieve the same
 performance? Is there any other factor that can cause a performance
 difference between them (non-partitioned vs. partitioned + concurrent)
 besides the job parameter issues?

 Thanks!





How to change the separator of input records in TRANSFORM of Hive

2013-05-23 Thread Felix .
Hi all,

I am trying to use TRANSFORM in Hive, but I cannot find a way to change the
separator between fields of input records in TRANSFORM.

I created a table A by specifying ROW FORMAT DELIMITED FIELDS TERMINATED BY '\0'.

However, when using
SELECT TRANSFORM(id,name) USING 'python script.py'
AS (id,name)
FROM A

I find that the input separator is TAB instead of '\0'.
Does anyone know how to change it to '\0'? Thanks.


Re: How to change the separator of input records in TRANSFORM of Hive

2013-05-23 Thread Felix .
Oh sorry, I found the solution on the wiki:
https://cwiki.apache.org/Hive/languagemanual-transform.html, by specifying
the inRowFormat and outRowFormat.
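
(Concretely, the inRowFormat/outRowFormat clauses from that page look
roughly like this; a sketch of the syntax:)

SELECT TRANSFORM(id, name)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\0'   -- inRowFormat: fed to the script
  USING 'python script.py'
  AS (id, name)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\0'   -- outRowFormat: read back from it
FROM A;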


2013/5/24 Felix.徐 ygnhz...@gmail.com

 Hi all,

 I am trying to use TRANSFORM in Hive, but I cannot find a way to change
 the separator between fields of input records in TRANSFORM.

 I created a table A by specifying ROW FORMAT DELIMITED FIELDS TERMINATED BY
 '\0'.

 However, when using
 SELECT TRANSFORM(id,name) USING 'python script.py'
 AS (id,name)
 FROM A

 I find that the input separator is TAB instead of '\0'.
 Does anyone know how to change it to '\0'? Thanks.



What is the rule of job name generation in Hive?

2012-03-22 Thread Felix .
Hi all, I find that the job names of Hive look like this: "INSERT OVERWRITE
TABLE u...userID,neighborid(Stage-4)".
What is the rule for generating such a name?
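
(From what I can tell, the generated name is the query text, truncated, plus
the stage id; a sketch of the related settings, worth verifying against your
Hive version:)

SET hive.jobname.length=50;          -- how much query text goes into the name
SET mapred.job.name=my_custom_name;  -- hypothetical value; if set, this
                                     -- overrides the generated name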


Re: How to get job names and stages of a query?

2012-03-20 Thread Felix .
I actually want to get the job names of the stages via an API.

On March 20, 2012, at 2:23 PM, Manish Bhoge manishbh...@rocketmail.com wrote:

 Whenever you submit a SQL query, a job ID gets generated. You can open the
 job tracker at localhost:50030/jobtracker.jsp;
 it shows the jobs that are running and the other details.
 Thanks,
 Manish
 Sent from my BlackBerry, pls excuse typo
 --
 From: Felix.徐 ygnhz...@gmail.com
 Date: Tue, 20 Mar 2012 12:58:53 +0800
 To: user@hive.apache.org
 Reply-To: user@hive.apache.org
 Subject: How to get job names and stages of a query?

 Hi all,
 I want to track the progress of a query. How can I get the job names,
 including the stages, of a query?



How to track query status in hive via thrift or anything else?

2012-03-14 Thread Felix .
Hi all,
I didn't find any helpful API in ThriftHive that can track the execution
status of Hive (or job progress). How can I get the execution progress of
queries from Hive? Thanks!
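
(A minimal sketch of a classic HiveServer Thrift client, assuming the
hive_service generated classes are on the classpath; execute() blocks until
the query finishes, so this old interface reports completion rather than
fine-grained progress:)

import org.apache.hadoop.hive.service.ThriftHive;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

public class HiveQueryProbe {
  public static void main(String[] args) throws Exception {
    // Default HiveServer host/port; adjust for your deployment.
    TTransport transport = new TSocket("localhost", 10000);
    transport.open();
    ThriftHive.Client client = new ThriftHive.Client(new TBinaryProtocol(transport));

    // Blocks until the underlying MapReduce jobs complete; progress has to
    // be watched elsewhere (e.g. the JobTracker web UI).
    client.execute("SELECT COUNT(*) FROM some_table"); // hypothetical table
    System.out.println(client.fetchAll());

    transport.close();
  }
}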


Is it possible to get the progress of a query through thrift server?

2012-03-12 Thread Felix .
Hi all,
I want to build a website to monitor the execution of queries sent to Hive;
is there any way to realize this?


Re: Is it possible to get the progress of a query through thrift server?

2012-03-12 Thread Felix .
Can you provide some references? Thanks very much!

On March 12, 2012, at 11:28 PM, Edward Capriolo edlinuxg...@gmail.com wrote:

 Yes. You have access to the job counters through Thrift, as well as a
 method to test whether a query is done.

 Edward

 On Mon, Mar 12, 2012 at 11:12 AM, Felix.徐 ygnhz...@gmail.com wrote:
  Hi all,
  I want to build a website to monitor the execution of queries sent to
  Hive;
  is there any way to realize this?



Re: Showing wrong count after importing table in Hive

2012-02-08 Thread Felix .
Hi, I met the same problem once; after I changed the number of imported
columns, it worked fine. Sometimes blank rows are generated by Sqoop. I do
not actually know what the root cause is.

2012/2/9 Bhavesh Shah bhavesh25s...@gmail.com


Hello All,

 I have imported about 10 tables into Hive from MS SQL Server. But when
 I tried to cross-check the records in Hive, in one of the tables I found
 more records when I ran the query (select count(*) from tblName;).

 Then I dropped that table and imported it into Hive again. I observed
 in the console logs that it said (Retrieved 203 records). Then I tried
 (select count(*) from tblName;) again and got a count of 298.

 I don't understand why this happens. Is something wrong in the query, or
 does it happen due to some incorrect sqoop-import command?

 All other tables' records are fine.

 I am stuck here and have spent much time searching for this. Please help
 me out with this.


 --
 Thanks and Regards,
 Bhavesh Shah