[jira] Updated: (PIG-1306) [zebra] Support of locally sorted input splits

2010-03-29 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1306:
--

Resolution: Fixed
Status: Resolved  (was: Patch Available)

 [zebra] Support of locally sorted input splits
 --

 Key: PIG-1306
 URL: https://issues.apache.org/jira/browse/PIG-1306
 Project: Pig
  Issue Type: Improvement
Reporter: Yan Zhou
Assignee: Yan Zhou
 Fix For: 0.7.0

 Attachments: PIG-1306.patch, PIG-1306.patch, PIG-1306.patch, 
 PIG-1306.patch, PIG-1306.patch


 Currently, Zebra supports sorted or unsorted input splits on sorted tables or 
 unions of sorted tables. Sorted input splits are based on non-overlapping key 
 ranges, so the splits are in effect globally sorted: each split is locally 
 sorted, and the key ranges of different splits do not overlap.
 The biggest problem with key-range splits is the performance hit suffered 
 when data skew is present, particularly when a key range consists solely of 
 one duplicated key. The chunk of data holding the duplicate keys is then 
 virtually unsplittable regardless of how many mappers are available: it has 
 to be processed by a single mapper.
 On the other hand, there are scenarios in which globally sorted splits are 
 overkill and locally sorted splits are good enough. One example is the use 
 of a Zebra sorted table as the probe table in a map-side merge inner join.
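
 The skew problem described above can be illustrated with a small sketch. 
 This is not Zebra code: the function names key_range_splits and 
 locally_sorted_splits are hypothetical, and the greedy cut at key boundaries 
 is only an assumed stand-in for how key-range splits might be formed. It 
 shows that a run of one duplicated key pins most rows to a single split when 
 key ranges may not overlap, while cutting by row count keeps each split 
 sorted and evenly sized.

```python
from itertools import groupby
from math import ceil


def key_range_splits(sorted_keys, num_splits):
    """Cut only at key boundaries, so key ranges never overlap.

    Hypothetical greedy scheme (not Zebra's actual logic): a run of
    equal keys is never divided between two splits.
    """
    target = max(1, ceil(len(sorted_keys) / num_splits))
    splits, current = [], []
    for _, run in groupby(sorted_keys):
        current.extend(run)           # the whole run stays together
        if len(current) >= target:
            splits.append(current)
            current = []
    if current:
        splits.append(current)
    return splits


def locally_sorted_splits(sorted_keys, num_splits):
    """Cut by row count: each split is sorted, but adjacent splits
    may share a key, so the splits are not globally sorted."""
    size = max(1, ceil(len(sorted_keys) / num_splits))
    return [sorted_keys[i:i + size] for i in range(0, len(sorted_keys), size)]


if __name__ == "__main__":
    # 100 sorted rows, 90 of which carry the same key (heavy skew).
    keys = [1] * 2 + [5] * 90 + [7] * 4 + [9] * 4
    print([len(s) for s in key_range_splits(keys, 4)])       # [92, 8]
    print([len(s) for s in locally_sorted_splits(keys, 4)])  # [25, 25, 25, 25]
```

 With four mappers requested, the key-range variant yields only two splits 
 and one of them holds 92 of the 100 rows, whereas the row-count variant 
 yields four splits of 25 rows each, at the cost of the key 5 appearing in 
 more than one split.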

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1306) [zebra] Support of locally sorted input splits

2010-03-26 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1306:
--

Status: Open  (was: Patch Available)

 [zebra] Support of locally sorted input splits
 --

 Key: PIG-1306
 URL: https://issues.apache.org/jira/browse/PIG-1306
 Project: Pig
  Issue Type: Improvement
Reporter: Yan Zhou
Assignee: Yan Zhou
 Fix For: 0.7.0

 Attachments: PIG-1306.patch, PIG-1306.patch, PIG-1306.patch





[jira] Updated: (PIG-1306) [zebra] Support of locally sorted input splits

2010-03-26 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1306:
--

Attachment: PIG-1306.patch

 [zebra] Support of locally sorted input splits
 --

 Key: PIG-1306
 URL: https://issues.apache.org/jira/browse/PIG-1306
 Project: Pig
  Issue Type: Improvement
Reporter: Yan Zhou
Assignee: Yan Zhou
 Fix For: 0.7.0

 Attachments: PIG-1306.patch, PIG-1306.patch, PIG-1306.patch, 
 PIG-1306.patch





[jira] Updated: (PIG-1306) [zebra] Support of locally sorted input splits

2010-03-26 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1306:
--

Status: Patch Available  (was: Open)

A bit of code cleanup: a source of whitespace-only changes has been removed 
from the patch, and one piece of dead code has been removed too.

 [zebra] Support of locally sorted input splits
 --

 Key: PIG-1306
 URL: https://issues.apache.org/jira/browse/PIG-1306
 Project: Pig
  Issue Type: Improvement
Reporter: Yan Zhou
Assignee: Yan Zhou
 Fix For: 0.7.0

 Attachments: PIG-1306.patch, PIG-1306.patch, PIG-1306.patch, 
 PIG-1306.patch





[jira] Updated: (PIG-1306) [zebra] Support of locally sorted input splits

2010-03-26 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1306:
--

Status: Open  (was: Patch Available)

 [zebra] Support of locally sorted input splits
 --

 Key: PIG-1306
 URL: https://issues.apache.org/jira/browse/PIG-1306
 Project: Pig
  Issue Type: Improvement
Reporter: Yan Zhou
Assignee: Yan Zhou
 Fix For: 0.7.0

 Attachments: PIG-1306.patch, PIG-1306.patch, PIG-1306.patch, 
 PIG-1306.patch





[jira] Updated: (PIG-1306) [zebra] Support of locally sorted input splits

2010-03-26 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1306:
--

Attachment: PIG-1306.patch

Fix a failure in a new test case.

 [zebra] Support of locally sorted input splits
 --

 Key: PIG-1306
 URL: https://issues.apache.org/jira/browse/PIG-1306
 Project: Pig
  Issue Type: Improvement
Reporter: Yan Zhou
Assignee: Yan Zhou
 Fix For: 0.7.0

 Attachments: PIG-1306.patch, PIG-1306.patch, PIG-1306.patch, 
 PIG-1306.patch, PIG-1306.patch





[jira] Updated: (PIG-1306) [zebra] Support of locally sorted input splits

2010-03-26 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1306:
--

Status: Patch Available  (was: Open)

 [zebra] Support of locally sorted input splits
 --

 Key: PIG-1306
 URL: https://issues.apache.org/jira/browse/PIG-1306
 Project: Pig
  Issue Type: Improvement
Reporter: Yan Zhou
Assignee: Yan Zhou
 Fix For: 0.7.0

 Attachments: PIG-1306.patch, PIG-1306.patch, PIG-1306.patch, 
 PIG-1306.patch, PIG-1306.patch





[jira] Updated: (PIG-1306) [zebra] Support of locally sorted input splits

2010-03-25 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1306:
--

Status: Open  (was: Patch Available)

 [zebra] Support of locally sorted input splits
 --

 Key: PIG-1306
 URL: https://issues.apache.org/jira/browse/PIG-1306
 Project: Pig
  Issue Type: Improvement
Reporter: Yan Zhou
Assignee: Yan Zhou
 Fix For: 0.7.0

 Attachments: PIG-1306.patch, PIG-1306.patch





[jira] Updated: (PIG-1306) [zebra] Support of locally sorted input splits

2010-03-25 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1306:
--

Attachment: PIG-1306.patch

 [zebra] Support of locally sorted input splits
 --

 Key: PIG-1306
 URL: https://issues.apache.org/jira/browse/PIG-1306
 Project: Pig
  Issue Type: Improvement
Reporter: Yan Zhou
Assignee: Yan Zhou
 Fix For: 0.7.0

 Attachments: PIG-1306.patch, PIG-1306.patch





[jira] Updated: (PIG-1306) [zebra] Support of locally sorted input splits

2010-03-25 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1306:
--

Status: Patch Available  (was: Open)

The previous patch had a test verification problem: it did not create a 
single split correctly for sorted-rows verification. Resubmitting now.

 [zebra] Support of locally sorted input splits
 --

 Key: PIG-1306
 URL: https://issues.apache.org/jira/browse/PIG-1306
 Project: Pig
  Issue Type: Improvement
Reporter: Yan Zhou
Assignee: Yan Zhou
 Fix For: 0.7.0

 Attachments: PIG-1306.patch, PIG-1306.patch





[jira] Updated: (PIG-1306) [zebra] Support of locally sorted input splits

2010-03-25 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1306:
--

Status: Open  (was: Patch Available)

 [zebra] Support of locally sorted input splits
 --

 Key: PIG-1306
 URL: https://issues.apache.org/jira/browse/PIG-1306
 Project: Pig
  Issue Type: Improvement
Reporter: Yan Zhou
Assignee: Yan Zhou
 Fix For: 0.7.0

 Attachments: PIG-1306.patch, PIG-1306.patch





[jira] Updated: (PIG-1306) [zebra] Support of locally sorted input splits

2010-03-24 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1306:
--

Attachment: PIG-1306.patch

 [zebra] Support of locally sorted input splits
 --

 Key: PIG-1306
 URL: https://issues.apache.org/jira/browse/PIG-1306
 Project: Pig
  Issue Type: Improvement
Reporter: Yan Zhou
Assignee: Yan Zhou
 Fix For: 0.7.0

 Attachments: PIG-1306.patch





[jira] Updated: (PIG-1306) [zebra] Support of locally sorted input splits

2010-03-24 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1306:
--

Status: Patch Available  (was: Open)

 [zebra] Support of locally sorted input splits
 --

 Key: PIG-1306
 URL: https://issues.apache.org/jira/browse/PIG-1306
 Project: Pig
  Issue Type: Improvement
Reporter: Yan Zhou
Assignee: Yan Zhou
 Fix For: 0.7.0

 Attachments: PIG-1306.patch





[jira] Updated: (PIG-1306) [zebra] Support of locally sorted input splits

2010-03-22 Thread Jay Tang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jay Tang updated PIG-1306:
--

Fix Version/s: 0.7.0

 [zebra] Support of locally sorted input splits
 --

 Key: PIG-1306
 URL: https://issues.apache.org/jira/browse/PIG-1306
 Project: Pig
  Issue Type: Improvement
Reporter: Yan Zhou
Assignee: Yan Zhou
 Fix For: 0.7.0

