Re: Performance difference between tuning reducer num and partition table

2013-06-30 Thread Felix . 徐
Hi Dean,

Thanks for your reply. If I don't set the number of reducers in the 1st run
, the number of reducers will be much smaller and the performance will be
worse. The total output file size is about 200MB, I see that many reduce
output files are empty, only 10 of them have data.

Another question is that , is there any documentation about the job
specific parameters of MapReduce and Hive?




2013/6/29 Dean Wampler deanwamp...@gmail.com

 What happens if you don't set the number of reducers in the 1st run? How
 many reducers are executed. If it's a much smaller number, the extra
 overhead could matter. Another clue is the size of the files the first run
 produced, i.e., do you have 30 small (much less than a block size) files?

 On Sat, Jun 29, 2013 at 12:27 AM, Felix.徐 ygnhz...@gmail.com wrote:

 Hi Stephen,

 My query is actually more complex , hive will generate 2 mapreduces,
 in the first solution , it runs 17 mappers / 30 reducers and 10 mappers /
 30 reducers (reducer num is set manually)
 in the second solution , it runs 6 mappers / 1 reducer and 4 mappers / 1
 reducers for each partition

 I do not know whether they could achieve the same performance if the
 reducers num is set properly.


 2013/6/29 Stephen Sprague sprag...@gmail.com

 great question.  your parallelization seems to trump hadoop's.I
 guess i'd ask what are the _total_ number of Mappers and Reducers that run
 on your cluster for these two scenarios?   I'd be curious if there are the
 same.




 On Fri, Jun 28, 2013 at 8:40 AM, Felix.徐 ygnhz...@gmail.com wrote:

 Hi all,

 Here is the scenario, suppose I have 2 tables A and B, I would like to
 perform a simple join on them,

 We can do it like this:

 INSERT OVERWRITE TABLE C
 SELECT  FROM A JOIN B on A.id=B.id

 In order to speed up this query since table A and B have lots of data,
 another approach is :

 Say I partition table A and B into 10 partitions respectively, and
 write the query like this

 INSERT OVERWRITE TABLE C PARTITION(pid=1)
 SELECT  FROM A JOIN B on A.id=B.id WHERE A.pid=1 AND B.pid=1

 then I run this query 10 times concurrently (pid ranges from 1 to 10)

 And my question is that , in my observation of some more complex
 queries, the second solution is about 15% faster than the first solution,
 is it simply because the setting of reducer num is not optimal?
 If the resource is not a limit and it is possible to set the proper
 reducer nums in the first solution , can they achieve the same performance?
 Is there any other fact that can cause performance difference between
 them(non-partition VS partition+concurrent) besides the job parameter
 issues?

 Thanks!






 --
 Dean Wampler, Ph.D.
 @deanwampler
 http://polyglotprogramming.com


Re: Performance difference between tuning reducer num and partition table

2013-06-29 Thread Dean Wampler
What happens if you don't set the number of reducers in the 1st run? How
many reducers are executed. If it's a much smaller number, the extra
overhead could matter. Another clue is the size of the files the first run
produced, i.e., do you have 30 small (much less than a block size) files?

On Sat, Jun 29, 2013 at 12:27 AM, Felix.徐 ygnhz...@gmail.com wrote:

 Hi Stephen,

 My query is actually more complex , hive will generate 2 mapreduces,
 in the first solution , it runs 17 mappers / 30 reducers and 10 mappers /
 30 reducers (reducer num is set manually)
 in the second solution , it runs 6 mappers / 1 reducer and 4 mappers / 1
 reducers for each partition

 I do not know whether they could achieve the same performance if the
 reducers num is set properly.


 2013/6/29 Stephen Sprague sprag...@gmail.com

 great question.  your parallelization seems to trump hadoop's.I guess
 i'd ask what are the _total_ number of Mappers and Reducers that run on
 your cluster for these two scenarios?   I'd be curious if there are the
 same.




 On Fri, Jun 28, 2013 at 8:40 AM, Felix.徐 ygnhz...@gmail.com wrote:

 Hi all,

 Here is the scenario, suppose I have 2 tables A and B, I would like to
 perform a simple join on them,

 We can do it like this:

 INSERT OVERWRITE TABLE C
 SELECT  FROM A JOIN B on A.id=B.id

 In order to speed up this query since table A and B have lots of data,
 another approach is :

 Say I partition table A and B into 10 partitions respectively, and write
 the query like this

 INSERT OVERWRITE TABLE C PARTITION(pid=1)
 SELECT  FROM A JOIN B on A.id=B.id WHERE A.pid=1 AND B.pid=1

 then I run this query 10 times concurrently (pid ranges from 1 to 10)

 And my question is that , in my observation of some more complex
 queries, the second solution is about 15% faster than the first solution,
 is it simply because the setting of reducer num is not optimal?
 If the resource is not a limit and it is possible to set the proper
 reducer nums in the first solution , can they achieve the same performance?
 Is there any other fact that can cause performance difference between
 them(non-partition VS partition+concurrent) besides the job parameter
 issues?

 Thanks!






-- 
Dean Wampler, Ph.D.
@deanwampler
http://polyglotprogramming.com


Performance difference between tuning reducer num and partition table

2013-06-28 Thread Felix . 徐
Hi all,

Here is the scenario, suppose I have 2 tables A and B, I would like to
perform a simple join on them,

We can do it like this:

INSERT OVERWRITE TABLE C
SELECT  FROM A JOIN B on A.id=B.id

In order to speed up this query since table A and B have lots of data,
another approach is :

Say I partition table A and B into 10 partitions respectively, and write
the query like this

INSERT OVERWRITE TABLE C PARTITION(pid=1)
SELECT  FROM A JOIN B on A.id=B.id WHERE A.pid=1 AND B.pid=1

then I run this query 10 times concurrently (pid ranges from 1 to 10)

And my question is that , in my observation of some more complex queries,
the second solution is about 15% faster than the first solution,
is it simply because the setting of reducer num is not optimal?
If the resource is not a limit and it is possible to set the proper reducer
nums in the first solution , can they achieve the same performance? Is
there any other fact that can cause performance difference between
them(non-partition VS partition+concurrent) besides the job parameter
issues?

Thanks!


Re: Performance difference between tuning reducer num and partition table

2013-06-28 Thread Felix . 徐
Hi Stephen,

My query is actually more complex , hive will generate 2 mapreduces,
in the first solution , it runs 17 mappers / 30 reducers and 10 mappers /
30 reducers (reducer num is set manually)
in the second solution , it runs 6 mappers / 1 reducer and 4 mappers / 1
reducers for each partition

I do not know whether they could achieve the same performance if the
reducers num is set properly.


2013/6/29 Stephen Sprague sprag...@gmail.com

 great question.  your parallelization seems to trump hadoop's.I guess
 i'd ask what are the _total_ number of Mappers and Reducers that run on
 your cluster for these two scenarios?   I'd be curious if there are the
 same.




 On Fri, Jun 28, 2013 at 8:40 AM, Felix.徐 ygnhz...@gmail.com wrote:

 Hi all,

 Here is the scenario, suppose I have 2 tables A and B, I would like to
 perform a simple join on them,

 We can do it like this:

 INSERT OVERWRITE TABLE C
 SELECT  FROM A JOIN B on A.id=B.id

 In order to speed up this query since table A and B have lots of data,
 another approach is :

 Say I partition table A and B into 10 partitions respectively, and write
 the query like this

 INSERT OVERWRITE TABLE C PARTITION(pid=1)
 SELECT  FROM A JOIN B on A.id=B.id WHERE A.pid=1 AND B.pid=1

 then I run this query 10 times concurrently (pid ranges from 1 to 10)

 And my question is that , in my observation of some more complex queries,
 the second solution is about 15% faster than the first solution,
 is it simply because the setting of reducer num is not optimal?
 If the resource is not a limit and it is possible to set the proper
 reducer nums in the first solution , can they achieve the same performance?
 Is there any other fact that can cause performance difference between
 them(non-partition VS partition+concurrent) besides the job parameter
 issues?

 Thanks!