Re: Performance difference between tuning reducer num and partition table
Hi Dean, Thanks for your reply. If I don't set the number of reducers in the 1st run , the number of reducers will be much smaller and the performance will be worse. The total output file size is about 200MB, I see that many reduce output files are empty, only 10 of them have data. Another question is that , is there any documentation about the job specific parameters of MapReduce and Hive? 2013/6/29 Dean Wampler deanwamp...@gmail.com What happens if you don't set the number of reducers in the 1st run? How many reducers are executed. If it's a much smaller number, the extra overhead could matter. Another clue is the size of the files the first run produced, i.e., do you have 30 small (much less than a block size) files? On Sat, Jun 29, 2013 at 12:27 AM, Felix.徐 ygnhz...@gmail.com wrote: Hi Stephen, My query is actually more complex , hive will generate 2 mapreduces, in the first solution , it runs 17 mappers / 30 reducers and 10 mappers / 30 reducers (reducer num is set manually) in the second solution , it runs 6 mappers / 1 reducer and 4 mappers / 1 reducers for each partition I do not know whether they could achieve the same performance if the reducers num is set properly. 2013/6/29 Stephen Sprague sprag...@gmail.com great question. your parallelization seems to trump hadoop's.I guess i'd ask what are the _total_ number of Mappers and Reducers that run on your cluster for these two scenarios? I'd be curious if there are the same. On Fri, Jun 28, 2013 at 8:40 AM, Felix.徐 ygnhz...@gmail.com wrote: Hi all, Here is the scenario, suppose I have 2 tables A and B, I would like to perform a simple join on them, We can do it like this: INSERT OVERWRITE TABLE C SELECT FROM A JOIN B on A.id=B.id In order to speed up this query since table A and B have lots of data, another approach is : Say I partition table A and B into 10 partitions respectively, and write the query like this INSERT OVERWRITE TABLE C PARTITION(pid=1) SELECT FROM A JOIN B on A.id=B.id WHERE A.pid=1 AND B.pid=1 then I run this query 10 times concurrently (pid ranges from 1 to 10) And my question is that , in my observation of some more complex queries, the second solution is about 15% faster than the first solution, is it simply because the setting of reducer num is not optimal? If the resource is not a limit and it is possible to set the proper reducer nums in the first solution , can they achieve the same performance? Is there any other fact that can cause performance difference between them(non-partition VS partition+concurrent) besides the job parameter issues? Thanks! -- Dean Wampler, Ph.D. @deanwampler http://polyglotprogramming.com
Re: Performance difference between tuning reducer num and partition table
What happens if you don't set the number of reducers in the 1st run? How many reducers are executed. If it's a much smaller number, the extra overhead could matter. Another clue is the size of the files the first run produced, i.e., do you have 30 small (much less than a block size) files? On Sat, Jun 29, 2013 at 12:27 AM, Felix.徐 ygnhz...@gmail.com wrote: Hi Stephen, My query is actually more complex , hive will generate 2 mapreduces, in the first solution , it runs 17 mappers / 30 reducers and 10 mappers / 30 reducers (reducer num is set manually) in the second solution , it runs 6 mappers / 1 reducer and 4 mappers / 1 reducers for each partition I do not know whether they could achieve the same performance if the reducers num is set properly. 2013/6/29 Stephen Sprague sprag...@gmail.com great question. your parallelization seems to trump hadoop's.I guess i'd ask what are the _total_ number of Mappers and Reducers that run on your cluster for these two scenarios? I'd be curious if there are the same. On Fri, Jun 28, 2013 at 8:40 AM, Felix.徐 ygnhz...@gmail.com wrote: Hi all, Here is the scenario, suppose I have 2 tables A and B, I would like to perform a simple join on them, We can do it like this: INSERT OVERWRITE TABLE C SELECT FROM A JOIN B on A.id=B.id In order to speed up this query since table A and B have lots of data, another approach is : Say I partition table A and B into 10 partitions respectively, and write the query like this INSERT OVERWRITE TABLE C PARTITION(pid=1) SELECT FROM A JOIN B on A.id=B.id WHERE A.pid=1 AND B.pid=1 then I run this query 10 times concurrently (pid ranges from 1 to 10) And my question is that , in my observation of some more complex queries, the second solution is about 15% faster than the first solution, is it simply because the setting of reducer num is not optimal? If the resource is not a limit and it is possible to set the proper reducer nums in the first solution , can they achieve the same performance? Is there any other fact that can cause performance difference between them(non-partition VS partition+concurrent) besides the job parameter issues? Thanks! -- Dean Wampler, Ph.D. @deanwampler http://polyglotprogramming.com
Performance difference between tuning reducer num and partition table
Hi all, Here is the scenario, suppose I have 2 tables A and B, I would like to perform a simple join on them, We can do it like this: INSERT OVERWRITE TABLE C SELECT FROM A JOIN B on A.id=B.id In order to speed up this query since table A and B have lots of data, another approach is : Say I partition table A and B into 10 partitions respectively, and write the query like this INSERT OVERWRITE TABLE C PARTITION(pid=1) SELECT FROM A JOIN B on A.id=B.id WHERE A.pid=1 AND B.pid=1 then I run this query 10 times concurrently (pid ranges from 1 to 10) And my question is that , in my observation of some more complex queries, the second solution is about 15% faster than the first solution, is it simply because the setting of reducer num is not optimal? If the resource is not a limit and it is possible to set the proper reducer nums in the first solution , can they achieve the same performance? Is there any other fact that can cause performance difference between them(non-partition VS partition+concurrent) besides the job parameter issues? Thanks!
Re: Performance difference between tuning reducer num and partition table
Hi Stephen, My query is actually more complex , hive will generate 2 mapreduces, in the first solution , it runs 17 mappers / 30 reducers and 10 mappers / 30 reducers (reducer num is set manually) in the second solution , it runs 6 mappers / 1 reducer and 4 mappers / 1 reducers for each partition I do not know whether they could achieve the same performance if the reducers num is set properly. 2013/6/29 Stephen Sprague sprag...@gmail.com great question. your parallelization seems to trump hadoop's.I guess i'd ask what are the _total_ number of Mappers and Reducers that run on your cluster for these two scenarios? I'd be curious if there are the same. On Fri, Jun 28, 2013 at 8:40 AM, Felix.徐 ygnhz...@gmail.com wrote: Hi all, Here is the scenario, suppose I have 2 tables A and B, I would like to perform a simple join on them, We can do it like this: INSERT OVERWRITE TABLE C SELECT FROM A JOIN B on A.id=B.id In order to speed up this query since table A and B have lots of data, another approach is : Say I partition table A and B into 10 partitions respectively, and write the query like this INSERT OVERWRITE TABLE C PARTITION(pid=1) SELECT FROM A JOIN B on A.id=B.id WHERE A.pid=1 AND B.pid=1 then I run this query 10 times concurrently (pid ranges from 1 to 10) And my question is that , in my observation of some more complex queries, the second solution is about 15% faster than the first solution, is it simply because the setting of reducer num is not optimal? If the resource is not a limit and it is possible to set the proper reducer nums in the first solution , can they achieve the same performance? Is there any other fact that can cause performance difference between them(non-partition VS partition+concurrent) besides the job parameter issues? Thanks!