[jira] Updated: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword

2010-08-20 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1249:


Release Note: 
In the previous versions of Pig, if the number of reducers was not specified 
(via PARALLEL or default_parallelism), the value of 1 was used which in many 
cases was not a good choice and caused severe performance problems.

In Pig 0.8.0, a simple heuristic is used to come up with a better number based 
on the size of the input data. There are several parameters that the user can 
control:

pig.exec.reducers.bytes.per.reducer - define number of input bytes per reduce; 
default value is 1000*1000*1000 (1GB)
pig.exec.reducers.max - defines the upper bound on the number of reducers; 
default is 999

The formula is very simple:

#reducers = MIN (pig.exec.reducers.max, total input size (in bytes) / bytes per 
reducer.

This is a very simplistic formula that we would need to improve over time. 
Note, that the computed value takes all inputs within the script into account 
and applies the computed value to all the jobs within Pig script.

 Safe-guards against misconfigured Pig scripts without PARALLEL keyword
 --

 Key: PIG-1249
 URL: https://issues.apache.org/jira/browse/PIG-1249
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Arun C Murthy
Assignee: Jeff Zhang
Priority: Critical
 Fix For: 0.8.0

 Attachments: PIG-1249-4.patch, PIG-1249.patch, PIG-1249_5.patch, 
 PIG_1249_2.patch, PIG_1249_3.patch


 It would be *very* useful for Pig to have safe-guards against naive scripts 
 which process a *lot* of data without the use of PARALLEL keyword.
 We've seen a fair number of instances where naive users process huge 
 data-sets (10TB) with badly mis-configured #reduces e.g. 1 reduce. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword

2010-08-20 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1249:


Release Note: 
In the previous versions of Pig, if the number of reducers was not specified 
(via PARALLEL or default_parallel), the value of 1 was used which in many cases 
was not a good choice and caused severe performance problems.

In Pig 0.8.0, a simple heuristic is used to come up with a better number based 
on the size of the input data. There are several parameters that the user can 
control:

pig.exec.reducers.bytes.per.reducer - define number of input bytes per reduce; 
default value is 1000*1000*1000 (1GB)
pig.exec.reducers.max - defines the upper bound on the number of reducers; 
default is 999

The formula is very simple:

#reducers = MIN (pig.exec.reducers.max, total input size (in bytes) / bytes per 
reducer.

This is a very simplistic formula that we would need to improve over time. 
Note, that the computed value takes all inputs within the script into account 
and applies the computed value to all the jobs within Pig script.

Note that this is not a backward compatible change and set default_parallel to 
restore the value to 1

  was:
In the previous versions of Pig, if the number of reducers was not specified 
(via PARALLEL or default_parallelism), the value of 1 was used which in many 
cases was not a good choice and caused severe performance problems.

In Pig 0.8.0, a simple heuristic is used to come up with a better number based 
on the size of the input data. There are several parameters that the user can 
control:

pig.exec.reducers.bytes.per.reducer - define number of input bytes per reduce; 
default value is 1000*1000*1000 (1GB)
pig.exec.reducers.max - defines the upper bound on the number of reducers; 
default is 999

The formula is very simple:

#reducers = MIN (pig.exec.reducers.max, total input size (in bytes) / bytes per 
reducer.

This is a very simplistic formula that we would need to improve over time. 
Note, that the computed value takes all inputs within the script into account 
and applies the computed value to all the jobs within Pig script.


 Safe-guards against misconfigured Pig scripts without PARALLEL keyword
 --

 Key: PIG-1249
 URL: https://issues.apache.org/jira/browse/PIG-1249
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Arun C Murthy
Assignee: Jeff Zhang
Priority: Critical
 Fix For: 0.8.0

 Attachments: PIG-1249-4.patch, PIG-1249.patch, PIG-1249_5.patch, 
 PIG_1249_2.patch, PIG_1249_3.patch


 It would be *very* useful for Pig to have safe-guards against naive scripts 
 which process a *lot* of data without the use of PARALLEL keyword.
 We've seen a fair number of instances where naive users process huge 
 data-sets (10TB) with badly mis-configured #reduces e.g. 1 reduce. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword

2010-07-28 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1249:


Status: Resolved  (was: Patch Available)
Resolution: Fixed

Patch committed. Thanks Jeff!

 Safe-guards against misconfigured Pig scripts without PARALLEL keyword
 --

 Key: PIG-1249
 URL: https://issues.apache.org/jira/browse/PIG-1249
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Arun C Murthy
Assignee: Jeff Zhang
Priority: Critical
 Fix For: 0.8.0

 Attachments: PIG-1249-4.patch, PIG-1249.patch, PIG-1249_5.patch, 
 PIG_1249_2.patch, PIG_1249_3.patch


 It would be *very* useful for Pig to have safe-guards against naive scripts 
 which process a *lot* of data without the use of PARALLEL keyword.
 We've seen a fair number of instances where naive users process huge 
 data-sets (10TB) with badly mis-configured #reduces e.g. 1 reduce. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword

2010-07-27 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated PIG-1249:


Attachment: PIG-1249_5.patch

 Safe-guards against misconfigured Pig scripts without PARALLEL keyword
 --

 Key: PIG-1249
 URL: https://issues.apache.org/jira/browse/PIG-1249
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Arun C Murthy
Assignee: Jeff Zhang
Priority: Critical
 Fix For: 0.8.0

 Attachments: PIG-1249-4.patch, PIG-1249.patch, PIG-1249_5.patch, 
 PIG_1249_2.patch, PIG_1249_3.patch


 It would be *very* useful for Pig to have safe-guards against naive scripts 
 which process a *lot* of data without the use of PARALLEL keyword.
 We've seen a fair number of instances where naive users process huge 
 data-sets (10TB) with badly mis-configured #reduces e.g. 1 reduce. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword

2010-06-02 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-1249:


Status: Open  (was: Patch Available)

The latest patch doesn't apply because of a merge conflict.  I'll attach a 
patch that addresses this.

 Safe-guards against misconfigured Pig scripts without PARALLEL keyword
 --

 Key: PIG-1249
 URL: https://issues.apache.org/jira/browse/PIG-1249
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Arun C Murthy
Assignee: Jeff Zhang
Priority: Critical
 Fix For: 0.8.0

 Attachments: PIG-1249.patch, PIG_1249_2.patch, PIG_1249_3.patch


 It would be *very* useful for Pig to have safe-guards against naive scripts 
 which process a *lot* of data without the use of PARALLEL keyword.
 We've seen a fair number of instances where naive users process huge 
 data-sets (10TB) with badly mis-configured #reduces e.g. 1 reduce. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword

2010-06-02 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-1249:


Attachment: PIG-1249-4.patch

Patch with merge conflict resolution.

 Safe-guards against misconfigured Pig scripts without PARALLEL keyword
 --

 Key: PIG-1249
 URL: https://issues.apache.org/jira/browse/PIG-1249
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Arun C Murthy
Assignee: Jeff Zhang
Priority: Critical
 Fix For: 0.8.0

 Attachments: PIG-1249-4.patch, PIG-1249.patch, PIG_1249_2.patch, 
 PIG_1249_3.patch


 It would be *very* useful for Pig to have safe-guards against naive scripts 
 which process a *lot* of data without the use of PARALLEL keyword.
 We've seen a fair number of instances where naive users process huge 
 data-sets (10TB) with badly mis-configured #reduces e.g. 1 reduce. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword

2010-06-02 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-1249:


Status: Patch Available  (was: Open)

 Safe-guards against misconfigured Pig scripts without PARALLEL keyword
 --

 Key: PIG-1249
 URL: https://issues.apache.org/jira/browse/PIG-1249
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Arun C Murthy
Assignee: Jeff Zhang
Priority: Critical
 Fix For: 0.8.0

 Attachments: PIG-1249-4.patch, PIG-1249.patch, PIG_1249_2.patch, 
 PIG_1249_3.patch


 It would be *very* useful for Pig to have safe-guards against naive scripts 
 which process a *lot* of data without the use of PARALLEL keyword.
 We've seen a fair number of instances where naive users process huge 
 data-sets (10TB) with badly mis-configured #reduces e.g. 1 reduce. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword

2010-05-26 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated PIG-1249:


Attachment: PIG_1249_3.patch

Update the patch, including testcase of non-dfs input and do path checking when 
doing estimation



 Safe-guards against misconfigured Pig scripts without PARALLEL keyword
 --

 Key: PIG-1249
 URL: https://issues.apache.org/jira/browse/PIG-1249
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Arun C Murthy
Assignee: Jeff Zhang
Priority: Critical
 Fix For: 0.8.0

 Attachments: PIG-1249.patch, PIG_1249_2.patch, PIG_1249_3.patch


 It would be *very* useful for Pig to have safe-guards against naive scripts 
 which process a *lot* of data without the use of PARALLEL keyword.
 We've seen a fair number of instances where naive users process huge 
 data-sets (10TB) with badly mis-configured #reduces e.g. 1 reduce. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword

2010-05-24 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated PIG-1249:


Status: Open  (was: Patch Available)

 Safe-guards against misconfigured Pig scripts without PARALLEL keyword
 --

 Key: PIG-1249
 URL: https://issues.apache.org/jira/browse/PIG-1249
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Arun C Murthy
Assignee: Jeff Zhang
Priority: Critical
 Fix For: 0.8.0

 Attachments: PIG-1249.patch, PIG_1249_2.patch


 It would be *very* useful for Pig to have safe-guards against naive scripts 
 which process a *lot* of data without the use of PARALLEL keyword.
 We've seen a fair number of instances where naive users process huge 
 data-sets (10TB) with badly mis-configured #reduces e.g. 1 reduce. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword

2010-05-17 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated PIG-1249:


Attachment: PIG-1249.patch

The current idea is borrowed from hive, use the input file size to estimate the 
reducer number.
Two parameters can been set for this purpose
pig.exec.reducers.bytes.per.reducer  // the number of bytes of input for each 
reducer
pig.exec.reducers.max  // the max number of reducer 
number

This only work for hdfs, won't work for other data source such as hbase or 
cassandra.



 Safe-guards against misconfigured Pig scripts without PARALLEL keyword
 --

 Key: PIG-1249
 URL: https://issues.apache.org/jira/browse/PIG-1249
 Project: Pig
  Issue Type: Improvement
Reporter: Arun C Murthy
Assignee: Jeff Zhang
Priority: Critical
 Fix For: 0.8.0

 Attachments: PIG-1249.patch


 It would be *very* useful for Pig to have safe-guards against naive scripts 
 which process a *lot* of data without the use of PARALLEL keyword.
 We've seen a fair number of instances where naive users process huge 
 data-sets (10TB) with badly mis-configured #reduces e.g. 1 reduce. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword

2010-05-17 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated PIG-1249:


   Status: Patch Available  (was: Open)
Affects Version/s: 0.8.0

 Safe-guards against misconfigured Pig scripts without PARALLEL keyword
 --

 Key: PIG-1249
 URL: https://issues.apache.org/jira/browse/PIG-1249
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Arun C Murthy
Assignee: Jeff Zhang
Priority: Critical
 Fix For: 0.8.0

 Attachments: PIG-1249.patch


 It would be *very* useful for Pig to have safe-guards against naive scripts 
 which process a *lot* of data without the use of PARALLEL keyword.
 We've seen a fair number of instances where naive users process huge 
 data-sets (10TB) with badly mis-configured #reduces e.g. 1 reduce. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword

2010-04-30 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1249:


Fix Version/s: 0.8.0

 Safe-guards against misconfigured Pig scripts without PARALLEL keyword
 --

 Key: PIG-1249
 URL: https://issues.apache.org/jira/browse/PIG-1249
 Project: Pig
  Issue Type: Improvement
Reporter: Arun C Murthy
Priority: Critical
 Fix For: 0.8.0


 It would be *very* useful for Pig to have safe-guards against naive scripts 
 which process a *lot* of data without the use of PARALLEL keyword.
 We've seen a fair number of instances where naive users process huge 
 data-sets (10TB) with badly mis-configured #reduces e.g. 1 reduce. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.