Get arguments' names in Hive's UDF
Hi all, Is there any API to retrieve an argument's column name in a GenericUDF? For example, given: SELECT UDFTEST(columnA, columnB) FROM test; I want to get the column names (columnA and columnB) in UDFTEST's initialize function via the ObjectInspector, but I have not found a viable solution.
Possible to specify reducers for each stage?
Hi all, Is it possible to specify the number of reducers for each stage? If so, how? Thanks!
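As far as I know, Hive does not expose a direct per-stage reducer setting: a fixed mapred.reduce.tasks applies to every MapReduce job a query launches. What can vary per stage is Hive's own estimate, used when mapred.reduce.tasks is left at -1; it is derived from each stage's input size. A sketch of the relevant session settings (values here are illustrative, not recommendations):

```sql
-- Fixed count: applies to every reduce stage of subsequent queries.
set mapred.reduce.tasks=30;

-- Or leave the count at -1 and let Hive size each stage from its own input:
-- roughly one reducer per this many input bytes, capped by the max below.
set mapred.reduce.tasks=-1;
set hive.exec.reducers.bytes.per.reducer=256000000;
set hive.exec.reducers.max=999;
```

Lowering hive.exec.reducers.bytes.per.reducer is therefore the usual way to give a small-input stage more reducers without hard-coding one number for all stages.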
Re: Performance difference between tuning reducer num and partition table
Hi Dean, Thanks for your reply. If I don't set the number of reducers in the first run, Hive chooses far fewer reducers and performance is worse. The total output is about 200 MB, and I see that many of the reduce output files are empty; only 10 of them contain data. Another question: is there any documentation on the job-specific parameters of MapReduce and Hive?

2013/6/29 Dean Wampler deanwamp...@gmail.com: What happens if you don't set the number of reducers in the 1st run? How many reducers are executed? If it's a much smaller number, the extra overhead could matter. Another clue is the size of the files the first run produced, i.e., do you have 30 small (much less than a block size) files?
-- Dean Wampler, Ph.D. @deanwampler http://polyglotprogramming.com
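The behaviour described above (few reducers, mostly empty output files) matches Hive's reducer-estimation heuristic: when mapred.reduce.tasks is -1, the reducer count is roughly the total input bytes divided by hive.exec.reducers.bytes.per.reducer, capped at hive.exec.reducers.max. A rough sketch; the defaults used here (1 GB per reducer, max 999) are assumptions from older Hive releases, not a contract:

```python
import math

def estimate_reducers(input_bytes,
                      bytes_per_reducer=1_000_000_000,  # hive.exec.reducers.bytes.per.reducer
                      max_reducers=999):                # hive.exec.reducers.max
    """Rough sketch of Hive's reducer estimate when mapred.reduce.tasks=-1."""
    return max(1, min(max_reducers, math.ceil(input_bytes / bytes_per_reducer)))

# ~200 MB of input yields a single reducer under these defaults,
# which is why forcing 30 reducers leaves most output files empty.
print(estimate_reducers(200 * 1024 * 1024))
```
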
Performance difference between tuning reducer num and partition table
Hi all, Here is the scenario. Suppose I have two tables, A and B, and I would like to perform a simple join on them:

INSERT OVERWRITE TABLE C SELECT * FROM A JOIN B ON A.id = B.id

In order to speed up this query, since tables A and B have lots of data, another approach is to partition tables A and B into 10 partitions each and write the query like this:

INSERT OVERWRITE TABLE C PARTITION(pid=1) SELECT * FROM A JOIN B ON A.id = B.id WHERE A.pid = 1 AND B.pid = 1

then run this query 10 times concurrently (pid ranging from 1 to 10). My question: in my observation of some more complex queries, the second solution is about 15% faster than the first. Is that simply because the reducer count is not optimal? If resources are not a limit and the reducer counts in the first solution are set properly, can the two achieve the same performance? Is there any other factor, besides the job parameters, that can cause a performance difference between them (non-partitioned vs. partitioned + concurrent)? Thanks!
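The fan-out in the second approach can be sketched as a small driver that submits the 10 per-partition statements concurrently. run_query below is a hypothetical stand-in for however the HiveQL is actually submitted (for example subprocess.run(["hive", "-e", sql])); here it just returns the statement so the sketch is self-contained:

```python
from concurrent.futures import ThreadPoolExecutor

# Template for the per-partition INSERT...SELECT from the question.
QUERY = (
    "INSERT OVERWRITE TABLE C PARTITION(pid={pid}) "
    "SELECT * FROM A JOIN B ON A.id = B.id "
    "WHERE A.pid={pid} AND B.pid={pid}"
)

def run_query(sql):
    # Hypothetical submission hook; in practice, e.g.
    # subprocess.run(["hive", "-e", sql], check=True)
    return sql

def run_partitions(pids):
    """Submit one statement per partition id, all in parallel."""
    with ThreadPoolExecutor(max_workers=len(pids)) as pool:
        return list(pool.map(run_query, (QUERY.format(pid=p) for p in pids)))
```

Each concurrent statement becomes its own Hive session and its own set of MapReduce jobs, which is why the two approaches end up with different total mapper/reducer counts.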
Re: Performance difference between tuning reducer num and partition table
Hi Stephen, My query is actually more complex: Hive generates two MapReduce jobs. In the first solution it runs 17 mappers / 30 reducers and then 10 mappers / 30 reducers (the reducer number is set manually); in the second solution it runs 6 mappers / 1 reducer and then 4 mappers / 1 reducer for each partition. I do not know whether they could achieve the same performance if the number of reducers were set properly.

2013/6/29 Stephen Sprague sprag...@gmail.com: Great question. Your parallelization seems to trump Hadoop's. I guess I'd ask: what are the _total_ numbers of mappers and reducers that run on your cluster for these two scenarios? I'd be curious whether they are the same.
How to change the separator of input records in TRANSFORM of Hive
Hi all, I am trying to use TRANSFORM in Hive, but I cannot find a way to change the separator between the fields of the input records passed to the transform script. I created a table A specifying ROW FORMAT DELIMITED FIELDS TERMINATED BY '\0'. However, when running SELECT TRANSFORM(id, name) USING 'python script.py' AS (id, name) FROM A, I find that the input separator is TAB instead of '\0'. Does anyone know how to change it to '\0'? Thanks.
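For context, a TRANSFORM script communicates with Hive over stdin/stdout, one record per line, with fields joined by the serialization delimiter (TAB by default, regardless of the table's own field terminator). A minimal sketch of what a script like 'python script.py' might look like; the id/name transformation itself is hypothetical:

```python
import sys

def process(line, sep="\t"):
    """Split one TRANSFORM input record and build one output record."""
    fields = line.rstrip("\r\n").split(sep)
    record_id, name = fields[0], fields[1]
    # Hypothetical transformation: pass the id through, upper-case the name.
    return sep.join([record_id, name.upper()])

if __name__ == "__main__":
    for line in sys.stdin:
        sys.stdout.write(process(line) + "\n")
```

Changing sep here only changes what the script expects; the delimiter Hive actually sends is controlled on the Hive side, as the follow-up below this thread notes.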
Re: How to change the separator of input records in TRANSFORM of Hive
Oh sorry, I found the solution on the wiki: https://cwiki.apache.org/Hive/languagemanual-transform.html, by specifying the inRowFormat and outRowFormat.
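Following the wiki page linked above, the input and output row formats can be given inline in the TRANSFORM clause: one before USING (Hive to script) and one after AS (script to Hive). A sketch using the table and script names from the question; check the wiki for the exact placement in your Hive version:

```sql
SELECT TRANSFORM(id, name)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\0'   -- inRowFormat: Hive -> script
  USING 'python script.py'
  AS (id, name)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\0'   -- outRowFormat: script -> Hive
FROM A;
```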
What is the rule of job name generation in Hive?
Hi all, I find that Hive's job names look like this: INSERT OVERWRITE TABLE u...userID,neighborid(Stage-4). What is the rule for generating such a name?
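For what it's worth, the pattern in that name appears to be the query text truncated to a configured length, followed by the stage id in parentheses; the truncation length is controlled by a session setting. This is observed behaviour plus the HiveConf property name, not a documented contract, so treat it as an assumption:

```sql
-- Job name = query string truncated to this many characters + "(Stage-N)".
set hive.jobname.length=50;
```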
Re: How to get job names and stages of a query?
I actually want to get the job names of the stages via an API.

On 20 Mar 2012 at 14:23, Manish Bhoge manishbh...@rocketmail.com wrote: Whenever you submit a SQL statement, a job id gets generated. You can open the job tracker at localhost:50030/jobtracker.jsp; it shows the jobs that are running and the rest of the details. Thanks, Manish

From: Felix.徐 ygnhz...@gmail.com, Tue, 20 Mar 2012 12:58:53 +0800, To: user@hive.apache.org, Subject: How to get job names and stages of a query? Hi all, I want to track the progress of a query. How can I get the job names, including the stages, of a query?
How to track query status in hive via thrift or anything else?
Hi all, I didn't find any helpful API in ThriftHive that can track the execution status (or job progress) of Hive queries. I want to get the execution progress of queries from Hive. How can I do that? Thanks!
Is it possible to get the progress of a query through thrift server?
Hi all, I want to build a website to monitor the execution of queries sent to Hive. Is there any way to do this?
Re: Is it possible to get the progress of a query through thrift server?
Can you provide some references? Thanks very much!

On 12 Mar 2012 at 23:28, Edward Capriolo edlinuxg...@gmail.com wrote: Yes. You have access to the job counters through Thrift, as well as a method to test whether a query is done. Edward
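Whatever the exact Thrift methods are, the monitoring pattern Edward describes is a poll loop. The client methods below (is_query_done, get_progress) are hypothetical stand-ins for the completion check and counter access he mentions; a mock client is included so the sketch is self-contained and runnable:

```python
import time

class MockHiveClient:
    """Hypothetical stand-in for a Hive Thrift client; replace with real calls."""
    def __init__(self, steps=3):
        self._remaining = steps

    def is_query_done(self):
        self._remaining -= 1
        return self._remaining <= 0

    def get_progress(self):
        return "running" if self._remaining > 0 else "done"

def wait_for_query(client, poll_interval=0.01):
    """Poll until the query finishes, collecting progress snapshots."""
    snapshots = []
    while not client.is_query_done():
        snapshots.append(client.get_progress())
        time.sleep(poll_interval)
    snapshots.append(client.get_progress())
    return snapshots
```

A website front end would run this loop server-side and push each snapshot to the page.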
Re: Showing wrong count after importing table in Hive
Hi, I ran into the same problem once; after I changed the set of imported columns it worked fine. Sometimes blank rows are generated by Sqoop... I don't actually know what the root cause is.

2012/2/9 Bhavesh Shah bhavesh25s...@gmail.com: Hello all, I have imported about 10 tables into Hive from MS SQL Server. But when I cross-checked the records in Hive, I found more records in one of the tables when running select count(*) from tblName;. I then dropped that table and imported it again. I observed in the console logs that 203 records were retrieved, and yet select count(*) from tblName; returned a count of 298. I don't understand why this happens. Is anything wrong with the query, or does it happen due to an incorrect sqoop-import command? All the other tables' records are fine. I am stuck here and have spent a lot of time searching for this. Please help me out. -- Thanks and Regards, Bhavesh Shah