from:"Yan Zhou"

[jira] Updated: (PIG-1658) ORDER BY does not work properly on integer/short keys that are -1

2010-10-01 Thread Yan Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-1658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1658:
--

Status: Resolved  (was: Patch Available)
Resolution: Fixed

Committed to both trunk and the 0.8 branch.

> ORDER BY does not work properly on integer/short keys that are -1
> -
>
> Key: PIG-1658
> URL: https://issues.apache.org/jira/browse/PIG-1658
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>    Reporter: Yan Zhou
>Assignee: Yan Zhou
> Fix For: 0.8.0
>
> Attachments: PIG-1658.patch, PIG-1658.patch
>
>
> In fact, keys of these types with any negative value within the byte or 
> short range have the problem.
> Basically, a byte value of -1 masked with 0xff (-1 & 0xff) yields 255, not -1.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1659) sortinfo is not set for store if there is a filter after ORDER BY

2010-10-01 Thread Yan Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-1659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12917012#action_12917012
 ] 

Yan Zhou commented on PIG-1659:
---

Need to make sure it is invoked after optimization in both old and new logical 
plans.

> sortinfo is not set for store if there is a filter after ORDER BY
> -
>
> Key: PIG-1659
> URL: https://issues.apache.org/jira/browse/PIG-1659
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Yan Zhou
>Assignee: Daniel Dai
> Fix For: 0.8.0
>
>
> This has caused 6 (of 7) failures in the Zebra test 
> TestOrderPreserveVariableTable.


[jira] Updated: (PIG-1658) ORDER BY does not work properly on integer/short keys that are -1

2010-10-01 Thread Yan Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-1658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1658:
--

Attachment: PIG-1658.patch

Add Zebra test TestMergeJoinPartial to the "pigtest" target.

> ORDER BY does not work properly on integer/short keys that are -1
> -
>
> Key: PIG-1658
> URL: https://issues.apache.org/jira/browse/PIG-1658
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Yan Zhou
>Assignee: Yan Zhou
> Fix For: 0.8.0
>
> Attachments: PIG-1658.patch, PIG-1658.patch
>
>
> In fact, keys of these types with any negative value within the byte or 
> short range have the problem.
> Basically, a byte value of -1 masked with 0xff (-1 & 0xff) yields 255, not -1.


[jira] Updated: (PIG-1658) ORDER BY does not work properly on integer/short keys that are -1

2010-09-30 Thread Yan Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-1658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1658:
--

Attachment: PIG-1658.patch

This problem is caused by the PIG-1295 patch.

test-core passes. Zebra's nightly passes too.

test-patch output:

 [exec] -1 overall.
 [exec]
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec]
 [exec] -1 tests included.  The patch doesn't appear to include any new or modified tests.
 [exec] Please justify why no tests are needed for this patch.
 [exec]
 [exec] +1 javadoc.  The javadoc tool did not generate any warning messages.
 [exec]
 [exec] +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
 [exec]
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs warnings.
 [exec]
 [exec] +1 release audit.  The applied patch does not increase the total number of release audit warnings.

Zebra's TestMergeJoinPartial is used to verify the fix.

> ORDER BY does not work properly on integer/short keys that are -1
> -
>
> Key: PIG-1658
> URL: https://issues.apache.org/jira/browse/PIG-1658
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Yan Zhou
>Assignee: Yan Zhou
> Fix For: 0.8.0
>
> Attachments: PIG-1658.patch
>
>
> In fact, keys of these types with any negative value within the byte or 
> short range have the problem.
> Basically, a byte value of -1 masked with 0xff (-1 & 0xff) yields 255, not -1.


[jira] Updated: (PIG-1658) ORDER BY does not work properly on integer/short keys that are -1

2010-09-30 Thread Yan Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-1658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1658:
--

Status: Patch Available  (was: Open)

> ORDER BY does not work properly on integer/short keys that are -1
> -
>
> Key: PIG-1658
> URL: https://issues.apache.org/jira/browse/PIG-1658
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>    Reporter: Yan Zhou
>Assignee: Yan Zhou
> Fix For: 0.8.0
>
> Attachments: PIG-1658.patch
>
>
> In fact, keys of these types with any negative value within the byte or 
> short range have the problem.
> Basically, a byte value of -1 masked with 0xff (-1 & 0xff) yields 255, not -1.


[jira] Created: (PIG-1659) sortinfo is not set for store if there is a filter after ORDER BY

2010-09-30 Thread Yan Zhou (JIRA)

sortinfo is not set for store if there is a filter after ORDER BY
-

 Key: PIG-1659
 URL: https://issues.apache.org/jira/browse/PIG-1659
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Yan Zhou
Assignee: Daniel Dai
 Fix For: 0.8.0


This has caused 6 (of 7) failures in the Zebra test 
TestOrderPreserveVariableTable.


[jira] Assigned: (PIG-1658) ORDER BY does not work properly on integer/short keys that are -1

2010-09-30 Thread Yan Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-1658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou reassigned PIG-1658:
-

Assignee: Yan Zhou

> ORDER BY does not work properly on integer/short keys that are -1
> -
>
> Key: PIG-1658
> URL: https://issues.apache.org/jira/browse/PIG-1658
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>    Reporter: Yan Zhou
>Assignee: Yan Zhou
> Fix For: 0.8.0
>
>
> In fact, keys of these types with any negative value within the byte or 
> short range have the problem.
> Basically, a byte value of -1 masked with 0xff (-1 & 0xff) yields 255, not -1.


[jira] Updated: (PIG-1658) ORDER BY does not work properly on integer/short keys that are -1

2010-09-30 Thread Yan Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-1658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1658:
--

Fix Version/s: 0.8.0
Affects Version/s: 0.8.0

> ORDER BY does not work properly on integer/short keys that are -1
> -
>
> Key: PIG-1658
> URL: https://issues.apache.org/jira/browse/PIG-1658
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>    Reporter: Yan Zhou
>Assignee: Yan Zhou
> Fix For: 0.8.0
>
>
> In fact, keys of these types with any negative value within the byte or 
> short range have the problem.
> Basically, a byte value of -1 masked with 0xff (-1 & 0xff) yields 255, not -1.


[jira] Created: (PIG-1658) ORDER BY does not work properly on integer/short keys that are -1

2010-09-30 Thread Yan Zhou (JIRA)

ORDER BY does not work properly on integer/short keys that are -1
-

 Key: PIG-1658
 URL: https://issues.apache.org/jira/browse/PIG-1658
 Project: Pig
  Issue Type: Bug
Reporter: Yan Zhou


In fact, keys of these types with any negative value within the byte or 
short range have the problem.

Basically, a byte value of -1 masked with 0xff (-1 & 0xff) yields 255, not -1.
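
The arithmetic behind this report can be reproduced in plain Java. The sketch below is not taken from the PIG-1658 patch; the class and method names are hypothetical and only illustrate why masking a negative byte with 0xff loses the sign, and therefore the sort position, of the key.

```java
// Hypothetical demo class; not part of Pig.
final class SignExtensionDemo {
    // Masking promotes the byte to int and clears the upper 24 bits,
    // so the sign is lost: -1 becomes 255.
    static int maskedByte(byte b) {
        return b & 0xff;
    }

    // A plain widening conversion sign-extends, so -1 stays -1.
    static int signExtendedByte(byte b) {
        return b;
    }

    // If only the masked value (0..255) is available, a narrowing cast
    // back to byte recovers the signed value.
    static int restoreSign(int unsignedByte) {
        return (byte) unsignedByte;
    }

    public static void main(String[] args) {
        byte key = -1;
        System.out.println(maskedByte(key));           // 255
        System.out.println(signExtendedByte(key));     // -1
        System.out.println(restoreSign(255));          // -1
        // A key of -1 should sort before 0; the masked value sorts after it.
        System.out.println(signExtendedByte(key) < 0); // true
        System.out.println(maskedByte(key) < 0);       // false
    }
}
```

The same effect applies to a short masked with 0xffff, which is consistent with the report that negative values in the byte or short range are affected.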


[jira] Updated: (PIG-1648) Split combination may return too many block locations to map/reduce framework

2010-09-28 Thread Yan Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-1648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1648:
--

Status: Patch Available  (was: Open)

> Split combination may return too many block locations to map/reduce framework
> -
>
> Key: PIG-1648
> URL: https://issues.apache.org/jira/browse/PIG-1648
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>    Reporter: Yan Zhou
>Assignee: Yan Zhou
> Fix For: 0.8.0
>
> Attachments: PIG-1648.patch
>
>
> For instance, suppose one small split has block locations h1, h2 and h3, and 
> another has h1, h3 and h4. After combination, the composite split contains 4 
> block locations. If the number of component splits is large, the number of 
> block locations can be large too. The block locations serve only as a hint to 
> M/R about the best hosts to run this composite split on, so the list should 
> be short, say 5 entries, naming the hosts that hold the most data in this 
> composite split.


[jira] Updated: (PIG-1648) Split combination may return too many block locations to map/reduce framework

2010-09-28 Thread Yan Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-1648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1648:
--

Status: Resolved  (was: Patch Available)
Resolution: Fixed

Patch committed to both trunk and the 0.8 branch.

> Split combination may return too many block locations to map/reduce framework
> -
>
> Key: PIG-1648
> URL: https://issues.apache.org/jira/browse/PIG-1648
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>    Reporter: Yan Zhou
>Assignee: Yan Zhou
> Fix For: 0.8.0
>
> Attachments: PIG-1648.patch
>
>
> For instance, suppose one small split has block locations h1, h2 and h3, and 
> another has h1, h3 and h4. After combination, the composite split contains 4 
> block locations. If the number of component splits is large, the number of 
> block locations can be large too. The block locations serve only as a hint to 
> M/R about the best hosts to run this composite split on, so the list should 
> be short, say 5 entries, naming the hosts that hold the most data in this 
> composite split.


[jira] Commented: (PIG-1648) Split combination may return too many block locations to map/reduce framework

2010-09-28 Thread Yan Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-1648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915852#action_12915852
 ] 

Yan Zhou commented on PIG-1648:
---

test-patch results:

 [exec] +1 overall.
 [exec]
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec]
 [exec] +1 tests included.  The patch appears to include 3 new or modified tests.
 [exec]
 [exec] +1 javadoc.  The javadoc tool did not generate any warning messages.
 [exec]
 [exec] +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
 [exec]
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs warnings.
 [exec]
 [exec] +1 release audit.  The applied patch does not increase the total number of release audit warnings.

test-core tests pass too.


> Split combination may return too many block locations to map/reduce framework
> -
>
> Key: PIG-1648
> URL: https://issues.apache.org/jira/browse/PIG-1648
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Yan Zhou
>Assignee: Yan Zhou
> Fix For: 0.8.0
>
> Attachments: PIG-1648.patch
>
>
> For instance, suppose one small split has block locations h1, h2 and h3, and 
> another has h1, h3 and h4. After combination, the composite split contains 4 
> block locations. If the number of component splits is large, the number of 
> block locations can be large too. The block locations serve only as a hint to 
> M/R about the best hosts to run this composite split on, so the list should 
> be short, say 5 entries, naming the hosts that hold the most data in this 
> composite split.


[jira] Updated: (PIG-1648) Split combination may return too many block locations to map/reduce framework

2010-09-28 Thread Yan Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-1648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1648:
--

Attachment: PIG-1648.patch

> Split combination may return too many block locations to map/reduce framework
> -
>
> Key: PIG-1648
> URL: https://issues.apache.org/jira/browse/PIG-1648
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>    Reporter: Yan Zhou
>Assignee: Yan Zhou
> Fix For: 0.8.0
>
> Attachments: PIG-1648.patch
>
>
> For instance, suppose one small split has block locations h1, h2 and h3, and 
> another has h1, h3 and h4. After combination, the composite split contains 4 
> block locations. If the number of component splits is large, the number of 
> block locations can be large too. The block locations serve only as a hint to 
> M/R about the best hosts to run this composite split on, so the list should 
> be short, say 5 entries, naming the hosts that hold the most data in this 
> composite split.


[jira] Commented: (PIG-1648) Split combination may return too many block locations to map/reduce framework

2010-09-28 Thread Yan Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-1648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915815#action_12915815
 ] 

Yan Zhou commented on PIG-1648:
---

Top 5 locations with most data will be used. This has been agreed upon by the 
M/R dev.
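
A minimal sketch of the "top 5 locations with most data" selection follows. It is not the code from PIG-1648.patch: the SmallSplit type, the topHosts method, and the rule that every host of a component split is credited with that split's full length are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical demo class; not part of Pig.
final class TopHostsDemo {
    /** Hypothetical stand-in for a component split: its length and its block hosts. */
    static final class SmallSplit {
        final long length;
        final String[] hosts;

        SmallSplit(long length, String... hosts) {
            this.length = length;
            this.hosts = hosts;
        }
    }

    /** Returns at most 'limit' hosts, ordered by the amount of data they hold. */
    static List<String> topHosts(List<SmallSplit> splits, int limit) {
        // Assumption: each listed host holds the whole component split.
        Map<String, Long> bytesPerHost = new HashMap<>();
        for (SmallSplit split : splits) {
            for (String host : split.hosts) {
                bytesPerHost.merge(host, split.length, Long::sum);
            }
        }
        List<Map.Entry<String, Long>> entries = new ArrayList<>(bytesPerHost.entrySet());
        // Most data first; ties are broken by host name to keep the result deterministic.
        entries.sort((a, b) -> {
            int byBytes = Long.compare(b.getValue(), a.getValue());
            return byBytes != 0 ? byBytes : a.getKey().compareTo(b.getKey());
        });
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Long> entry : entries) {
            if (result.size() == limit) {
                break;
            }
            result.add(entry.getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        List<SmallSplit> splits = Arrays.asList(
                new SmallSplit(100L, "h1", "h2", "h3"),
                new SmallSplit(50L, "h1", "h3", "h4"));
        System.out.println(topHosts(splits, 5)); // [h1, h3, h2, h4]
        System.out.println(topHosts(splits, 2)); // [h1, h3]
    }
}
```

With the two splits from the issue description, the union has 4 hosts, so a limit of 5 returns all of them; the limit only matters once many component splits are combined.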

> Split combination may return too many block locations to map/reduce framework
> -
>
> Key: PIG-1648
> URL: https://issues.apache.org/jira/browse/PIG-1648
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Yan Zhou
>Assignee: Yan Zhou
> Fix For: 0.8.0
>
>
> For instance, suppose one small split has block locations h1, h2 and h3, and 
> another has h1, h3 and h4. After combination, the composite split contains 4 
> block locations. If the number of component splits is large, the number of 
> block locations can be large too. The block locations serve only as a hint to 
> M/R about the best hosts to run this composite split on, so the list should 
> be short, say 5 entries, naming the hosts that hold the most data in this 
> composite split.


[jira] Created: (PIG-1651) PIG class loading mishandled

2010-09-27 Thread Yan Zhou (JIRA)

PIG class loading mishandled


 Key: PIG-1651
 URL: https://issues.apache.org/jira/browse/PIG-1651
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Yan Zhou
Assignee: Richard Ding
 Fix For: 0.8.0


If zebra.jar is only registered in a Pig script and is not on the CLASSPATH, 
a query using Zebra fails. The same class appears to be loaded into the JVM 
more than once, so a static variable set earlier is not visible after an 
instance of the class is created through reflection. (With zebra.jar on the 
CLASSPATH, the query works fine.) The exception stack is as follows:

Backend error message during job submission
---
org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to create input splits for: hdfs://hostname/pathto/zebra_dir :: null
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:284)
    at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:907)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:801)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:752)
    at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:378)
    at org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247)
    at org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279)
    at java.lang.Thread.run(Thread.java:619)
Caused by: java.lang.NullPointerException
    at org.apache.hadoop.zebra.io.ColumnGroup.getNonDataFilePrefix(ColumnGroup.java:123)
    at org.apache.hadoop.zebra.io.ColumnGroup$CGPathFilter.accept(ColumnGroup.java:2413)
    at org.apache.hadoop.zebra.mapreduce.TableInputFormat$DummyFileInputFormat$MultiPathFilter.accept(TableInputFormat.java:718)
    at org.apache.hadoop.fs.FileSystem$GlobFilter.accept(FileSystem.java:1084)
    at org.apache.hadoop.fs.FileSystem.globStatusInternal(FileSystem.java:919)
    at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:866)
    at org.apache.hadoop.zebra.mapreduce.TableInputFormat$DummyFileInputFormat.listStatus(TableInputFormat.java:780)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:246)
    at org.apache.hadoop.zebra.mapreduce.TableInputFormat.getRowSplits(TableInputFormat.java:863)
    at org.apache.hadoop.zebra.mapreduce.TableInputFormat.getSplits(TableInputFormat.java:1017)
    at org.apache.hadoop.zebra.mapreduce.TableInputFormat.getSplits(TableInputFormat.java:961)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:269)
    ... 7 more




[jira] Updated: (PIG-1647) Logical simplifier throws a NPE

2010-09-27 Thread Yan Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-1647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1647:
--

Status: Resolved  (was: Patch Available)
Resolution: Fixed

Patch committed to both trunk and the 0.8 branch.

> Logical simplifier throws a NPE
> ---
>
> Key: PIG-1647
> URL: https://issues.apache.org/jira/browse/PIG-1647
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>    Reporter: Yan Zhou
>Assignee: Yan Zhou
> Fix For: 0.8.0
>
> Attachments: PIG-1647.patch, PIG-1647.patch
>
>
> A query like:
> A = load 'd.txt' as (a:chararray, b:long, c:map[], d:chararray, e:chararray);
> B = filter A by a == 'v' and b == 117L and c#'p1' == 'h' and c#'p2' == 'to' 
> and ((d is not null and d != '') or (e is not null and e != ''));
> will cause the logical expression simplifier to throw a NPE.


[jira] Updated: (PIG-1647) Logical simplifier throws a NPE

2010-09-26 Thread Yan Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-1647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1647:
--

Status: Patch Available  (was: Open)

> Logical simplifier throws a NPE
> ---
>
> Key: PIG-1647
> URL: https://issues.apache.org/jira/browse/PIG-1647
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>    Reporter: Yan Zhou
>Assignee: Yan Zhou
> Fix For: 0.8.0
>
> Attachments: PIG-1647.patch, PIG-1647.patch
>
>
> A query like:
> A = load 'd.txt' as (a:chararray, b:long, c:map[], d:chararray, e:chararray);
> B = filter A by a == 'v' and b == 117L and c#'p1' == 'h' and c#'p2' == 'to' 
> and ((d is not null and d != '') or (e is not null and e != ''));
> will cause the logical expression simplifier to throw a NPE.


[jira] Updated: (PIG-1647) Logical simplifier throws a NPE

2010-09-26 Thread Yan Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-1647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1647:
--

Attachment: PIG-1647.patch

passes test-core.

test-patch results:

 [exec] +1 overall.
 [exec]
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec]
 [exec] +1 tests included.  The patch appears to include 3 new or modified tests.
 [exec]
 [exec] +1 javadoc.  The javadoc tool did not generate any warning messages.
 [exec]
 [exec] +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
 [exec]
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs warnings.
 [exec]
 [exec] +1 release audit.  The applied patch does not increase the total number of release audit warnings.


> Logical simplifier throws a NPE
> ---
>
> Key: PIG-1647
> URL: https://issues.apache.org/jira/browse/PIG-1647
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>    Reporter: Yan Zhou
>Assignee: Yan Zhou
> Fix For: 0.8.0
>
> Attachments: PIG-1647.patch, PIG-1647.patch
>
>
> A query like:
> A = load 'd.txt' as (a:chararray, b:long, c:map[], d:chararray, e:chararray);
> B = filter A by a == 'v' and b == 117L and c#'p1' == 'h' and c#'p2' == 'to' 
> and ((d is not null and d != '') or (e is not null and e != ''));
> will cause the logical expression simplifier to throw a NPE.


[jira] Updated: (PIG-1645) Using both small split combination and temporary file compression on a query of ORDER BY may cause crash

2010-09-24 Thread Yan Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1645:
--

Status: Resolved  (was: Patch Available)
Resolution: Fixed

Patch committed to both trunk and the 0.8 branch.

> Using both small split combination and temporary file compression on a query 
> of ORDER BY may cause crash
> 
>
> Key: PIG-1645
> URL: https://issues.apache.org/jira/browse/PIG-1645
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Yan Zhou
>Assignee: Yan Zhou
> Fix For: 0.8.0
>
> Attachments: PIG-1645.patch
>
>
> The stack looks like the following:
> java.lang.NullPointerException
>   at java.util.Arrays.binarySearch(Arrays.java:2043)
>   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.getPartition(WeightedRangePartitioner.java:72)
>   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.getPartition(WeightedRangePartitioner.java:52)
>   at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:565)
>   at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
>   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:116)
>   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:238)
>   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231)
>   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53)
>   at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:638)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:314)
>   at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:396)
>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1062)
>   at org.apache.hadoop.mapred.Child.main(Child.java:211)


[jira] Updated: (PIG-1647) Logical simplifier throws a NPE

2010-09-24 Thread Yan Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-1647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1647:
--

Attachment: PIG-1647.patch

> Logical simplifier throws a NPE
> ---
>
> Key: PIG-1647
> URL: https://issues.apache.org/jira/browse/PIG-1647
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>    Reporter: Yan Zhou
>Assignee: Yan Zhou
> Fix For: 0.8.0
>
> Attachments: PIG-1647.patch
>
>
> A query like:
> A = load 'd.txt' as (a:chararray, b:long, c:map[], d:chararray, e:chararray);
> B = filter A by a == 'v' and b == 117L and c#'p1' == 'h' and c#'p2' == 'to' 
> and ((d is not null and d != '') or (e is not null and e != ''));
> will cause the logical expression simplifier to throw a NPE.


[jira] Updated: (PIG-1635) Logical simplifier does not simplify away constants under AND and OR; after simplification the ordering of operands of AND and OR may get changed

2010-09-24 Thread Yan Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1635:
--

Status: Resolved  (was: Patch Available)
Resolution: Fixed

Patch committed to both trunk and the 0.8 branch.

> Logical simplifier does not simplify away constants under AND and OR; after 
> simplification the ordering of operands of AND and OR may get changed
> 
>
> Key: PIG-1635
> URL: https://issues.apache.org/jira/browse/PIG-1635
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.8.0
>Reporter: Yan Zhou
>Assignee: Yan Zhou
>Priority: Minor
> Fix For: 0.8.0
>
> Attachments: PIG-1635.patch
>
>
> b = FILTER a by ((f1 > 1) AND (1 == 1));
> or 
> b = FILTER a by ((f1 > 1) OR (1 == 0));
> should be simplified to
> b = FILTER a by f1 > 1;
> Regarding the ordering change, an example is
> b = filter a by ((f1 is not null) AND (f2 is not null));
> Even though no simplification is possible, the expression is changed to
> b = filter a by ((f2 is not null) AND (f1 is not null));
> The ordering change in this case, and probably in most other cases, makes 
> no difference. Still, some users might care about the ordering for two 
> reasons: stateful UDFs may be used as operands of AND or OR, and the 
> application designer may have chosen the ordering to maximize the chance of 
> short-circuiting the composite boolean evaluation.
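
The behavior requested in this issue can be sketched on a toy expression tree. The code below is not Pig's logical expression simplifier; every type and method name is hypothetical, constant comparisons such as (1 == 1) are assumed to be already folded into a boolean constant, and only the two identity rules from the description (x AND true, x OR false) are applied. Operands that survive keep their original left-to-right order.

```java
// Hypothetical demo class; not part of Pig.
final class SimplifierDemo {
    interface Expr {}

    /** A boolean constant, e.g. the folded result of (1 == 1). */
    static final class Const implements Expr {
        final boolean value;
        Const(boolean value) { this.value = value; }
        @Override public String toString() { return Boolean.toString(value); }
    }

    /** An opaque predicate such as "f1 > 1". */
    static final class Leaf implements Expr {
        final String text;
        Leaf(String text) { this.text = text; }
        @Override public String toString() { return text; }
    }

    static final class And implements Expr {
        final Expr left;
        final Expr right;
        And(Expr left, Expr right) { this.left = left; this.right = right; }
        @Override public String toString() { return "(" + left + " AND " + right + ")"; }
    }

    static final class Or implements Expr {
        final Expr left;
        final Expr right;
        Or(Expr left, Expr right) { this.left = left; this.right = right; }
        @Override public String toString() { return "(" + left + " OR " + right + ")"; }
    }

    static boolean isConst(Expr e, boolean value) {
        return e instanceof Const && ((Const) e).value == value;
    }

    /** Removes 'true' under AND and 'false' under OR; never reorders operands. */
    static Expr simplify(Expr e) {
        if (e instanceof And) {
            Expr l = simplify(((And) e).left);
            Expr r = simplify(((And) e).right);
            if (isConst(l, true)) return r;
            if (isConst(r, true)) return l;
            return new And(l, r); // left operand stays on the left
        }
        if (e instanceof Or) {
            Expr l = simplify(((Or) e).left);
            Expr r = simplify(((Or) e).right);
            if (isConst(l, false)) return r;
            if (isConst(r, false)) return l;
            return new Or(l, r); // left operand stays on the left
        }
        return e;
    }

    public static void main(String[] args) {
        Expr f1 = new Leaf("f1 > 1");
        System.out.println(simplify(new And(f1, new Const(true)))); // f1 > 1
        System.out.println(simplify(new Or(f1, new Const(false)))); // f1 > 1
        System.out.println(simplify(
                new And(new Leaf("f1 is not null"), new Leaf("f2 is not null"))));
        // (f1 is not null AND f2 is not null)
    }
}
```

Cases such as x AND false are deliberately left untouched here, since dropping x would also skip any stateful UDF inside it.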


[jira] Created: (PIG-1648) Split combination may return too many block locations to map/reduce framework

2010-09-24 Thread Yan Zhou (JIRA)

Split combination may return too many block locations to map/reduce framework
-

 Key: PIG-1648
 URL: https://issues.apache.org/jira/browse/PIG-1648
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Yan Zhou
Assignee: Yan Zhou
 Fix For: 0.8.0


For instance, suppose one small split has block locations h1, h2 and h3, and 
another has h1, h3 and h4. After combination, the composite split contains 4 
block locations. If the number of component splits is large, the number of 
block locations can be large too. The block locations serve only as a hint to 
M/R about the best hosts to run this composite split on, so the list should 
be short, say 5 entries, naming the hosts that hold the most data in this 
composite split.


[jira] Created: (PIG-1647) Logical simplifier throws a NPE

2010-09-24 Thread Yan Zhou (JIRA)

Logical simplifier throws a NPE
---

 Key: PIG-1647
 URL: https://issues.apache.org/jira/browse/PIG-1647
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Yan Zhou
Assignee: Yan Zhou
 Fix For: 0.8.0


A query like:

A = load 'd.txt' as (a:chararray, b:long, c:map[], d:chararray, e:chararray);
B = filter A by a == 'v' and b == 117L and c#'p1' == 'h' and c#'p2' == 'to' and 
((d is not null and d != '') or (e is not null and e != ''));

will cause the logical expression simplifier to throw a NPE.


[jira] Commented: (PIG-1635) Logical simplifier does not simplify away constants under AND and OR; after simplification the ordering of operands of AND and OR may get changed

2010-09-24 Thread Yan Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914672#action_12914672
 ] 

Yan Zhou commented on PIG-1635:
---

I did a thorough check for this patch. Actually some of the ordering changes 
were caused by the mentioned misuse. Thanks.

> Logical simplifier does not simplify away constants under AND and OR; after 
> simplification the ordering of operands of AND and OR may get changed
> 
>
> Key: PIG-1635
> URL: https://issues.apache.org/jira/browse/PIG-1635
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.8.0
>Reporter: Yan Zhou
>Assignee: Yan Zhou
>Priority: Minor
> Fix For: 0.8.0
>
> Attachments: PIG-1635.patch
>
>
> b = FILTER a by ((f1 > 1) AND (1 == 1));
> or 
> b = FILTER a by ((f1 > 1) OR (1 == 0));
> should be simplified to
> b = FILTER a by f1 > 1;
> Regarding the ordering change, an example is
> b = filter a by ((f1 is not null) AND (f2 is not null));
> Even though no simplification is possible, the expression is changed to
> b = filter a by ((f2 is not null) AND (f1 is not null));
> The ordering change in this case, and probably in most other cases, makes 
> no difference. Still, some users might care about the ordering for two 
> reasons: stateful UDFs may be used as operands of AND or OR, and the 
> application designer may have chosen the ordering to maximize the chance of 
> short-circuiting the composite boolean evaluation.


[jira] Commented: (PIG-1645) Using both small split combination and temporary file compression on a query of ORDER BY may cause crash

2010-09-24 Thread Yan Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914541#action_12914541
 ] 

Yan Zhou commented on PIG-1645:
---

The possibility of failure also depends upon the block distribution since the 
split combination makes use of that info.

> Using both small split combination and temporary file compression on a query 
> of ORDER BY may cause crash
> 
>
> Key: PIG-1645
> URL: https://issues.apache.org/jira/browse/PIG-1645
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Yan Zhou
>Assignee: Yan Zhou
> Fix For: 0.8.0
>
> Attachments: PIG-1645.patch
>
>
> The stack looks like the following:
> java.lang.NullPointerException
>   at java.util.Arrays.binarySearch(Arrays.java:2043)
>   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.getPartition(WeightedRangePartitioner.java:72)
>   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.getPartition(WeightedRangePartitioner.java:52)
>   at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:565)
>   at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
>   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:116)
>   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:238)
>   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231)
>   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53)
>   at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:638)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:314)
>   at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:396)
>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1062)
>   at org.apache.hadoop.mapred.Child.main(Child.java:211)


[jira] Updated: (PIG-1645) Using both small split combination and temporary file compression on a query of ORDER BY may cause crash

2010-09-24 Thread Yan Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1645:
--

Status: Patch Available  (was: Open)

> Using both small split combination and temporary file compression on a query 
> of ORDER BY may cause crash
> 
>
> Key: PIG-1645
> URL: https://issues.apache.org/jira/browse/PIG-1645
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Yan Zhou
>Assignee: Yan Zhou
> Fix For: 0.8.0
>
> Attachments: PIG-1645.patch
>
>
> The stack looks like the following:
> java.lang.NullPointerException at 
> java.util.Arrays.binarySearch(Arrays.java:2043) at
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.getPartition(WeightedRangePartitioner.java:72)
>  at
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.getPartition(WeightedRangePartitioner.java:52)
>  at 
> org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:565) at
> org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
>  at
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:116)
>  at
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:238)
>  at
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231)
>  at
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53)
>  at
> org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) at
> org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:638) at
> org.apache.hadoop.mapred.MapTask.run(MapTask.java:314) at 
> org.apache.hadoop.mapred.Child$4.run(Child.java:217) at
> java.security.AccessController.doPrivileged(Native Method) at 
> javax.security.auth.Subject.doAs(Subject.java:396) at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1062)
>  at
> org.apache.hadoop.mapred.Child.main(Child.java:211) 


[jira] Updated: (PIG-1645) Using both small split combination and temporary file compression on a query of ORDER BY may cause crash

2010-09-24 Thread Yan Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1645:
--

Attachment: PIG-1645.patch

test-core passed.

test-patch results:

 [exec] -1 overall.
 [exec]
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec]
 [exec] -1 tests included.  The patch doesn't appear to include any new 
or modified tests.
 [exec] Please justify why no tests are needed for 
this patch.
 [exec]
 [exec] +1 javadoc.  The javadoc tool did not generate any warning 
messages.
 [exec]
 [exec] +1 javac.  The applied patch does not increase the total number 
of javac compiler warnings.
 [exec]
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs 
warnings.
 [exec]
 [exec] -1 release audit.  The applied patch generated 459 release 
audit warnings (more than the trunk's current 457 warnings).

The scenario is truly a corner case. The following query *might* have caused 
the problem:

A = load '/tmp/test/jsTst2.txt' as (fn, age:int);
B = load '/tmp/test/sample.txt' as (fn, age:int);
C = join A by fn, B by fn USING 'replicated';
D = ORDER C BY B::age;
dump D;

where sample.txt contains a single record whose join key matches exactly one 
record in jsTst2.txt, and jsTst2.txt should be several HDFS blocks in size. 
Even so, a failure shows up only at random, as it depends upon whether any of 
the logically empty files is placed in the first underlying split of the list 
of combined splits. Compute nodes' host names seem to play a role too.  
No failure has been seen when running in local mode.

The 2 release audit warnings are due to jdiff. No new file added.

> Using both small split combination and temporary file compression on a query 
> of ORDER BY may cause crash
> 
>
> Key: PIG-1645
> URL: https://issues.apache.org/jira/browse/PIG-1645
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Yan Zhou
>Assignee: Yan Zhou
> Fix For: 0.8.0
>
> Attachments: PIG-1645.patch
>
>
> The stack looks like the following:
> java.lang.NullPointerException at 
> java.util.Arrays.binarySearch(Arrays.java:2043) at
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.getPartition(WeightedRangePartitioner.java:72)
>  at
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.getPartition(WeightedRangePartitioner.java:52)
>  at 
> org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:565) at
> org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
>  at
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:116)
>  at
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:238)
>  at
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231)
>  at
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53)
>  at
> org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) at
> org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:638) at
> org.apache.hadoop.mapred.MapTask.run(MapTask.java:314) at 
> org.apache.hadoop.mapred.Child$4.run(Child.java:217) at
> java.security.AccessController.doPrivileged(Native Method) at 
> javax.security.auth.Subject.doAs(Subject.java:396) at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1062)
>  at
> org.apache.hadoop.mapred.Child.main(Child.java:211) 


[jira] Commented: (PIG-1635) Logical simplifier does not simplify away constants under AND and OR; after simplification the ordering of operands of AND and OR may get changed

2010-09-23 Thread Yan Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914150#action_12914150
 ] 

Yan Zhou commented on PIG-1635:
---

All test-core tests also run clean.

> Logical simplifier does not simplify away constants under AND and OR; after 
> simplification the ordering of operands of AND and OR may get changed
> 
>
> Key: PIG-1635
> URL: https://issues.apache.org/jira/browse/PIG-1635
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.8.0
>Reporter: Yan Zhou
>Assignee: Yan Zhou
>Priority: Minor
> Fix For: 0.8.0
>
> Attachments: PIG-1635.patch
>
>
> b = FILTER a by (( f1 > 1) AND (1 == 1))
> or 
> b = FILTER a by ((f1 > 1) OR ( 1==0))
> should be simplified to
> b = FILTER a by f1 > 1;
> Regarding the ordering change, an example is 
> b = filter a by ((f1 is not null) AND (f2 is not null));
> Even though no simplification is possible here, the expression is changed to
> b = filter a by ((f2 is not null) AND (f1 is not null));
> Although the ordering change in this case, and probably in most other 
> cases, makes no difference, some users might care about the ordering for 
> two reasons: stateful UDFs may be used as operands of AND or OR, and the 
> ordering may be intended by the application designer to maximize the 
> chances of short-circuiting the composite boolean evaluation. 


[jira] Commented: (PIG-1635) Logical simplifier does not simplify away constants under AND and OR; after simplification the ordering of operands of AND and OR may get changed

2010-09-23 Thread Yan Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914145#action_12914145
 ] 

Yan Zhou commented on PIG-1635:
---

test-patch results:

 [exec] +1 overall.
 [exec]
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec]
 [exec] +1 tests included.  The patch appears to include 3 new or 
modified tests.
 [exec]
 [exec] +1 javadoc.  The javadoc tool did not generate any warning 
messages.
 [exec]
 [exec] +1 javac.  The applied patch does not increase the total number 
of javac compiler warnings.
 [exec]
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs 
warnings.
 [exec]
 [exec] +1 release audit.  The applied patch does not increase the 
total number of release audit warnings.

> Logical simplifier does not simplify away constants under AND and OR; after 
> simplification the ordering of operands of AND and OR may get changed
> 
>
> Key: PIG-1635
> URL: https://issues.apache.org/jira/browse/PIG-1635
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.8.0
>Reporter: Yan Zhou
>Assignee: Yan Zhou
>Priority: Minor
> Fix For: 0.8.0
>
> Attachments: PIG-1635.patch
>
>
> b = FILTER a by (( f1 > 1) AND (1 == 1))
> or 
> b = FILTER a by ((f1 > 1) OR ( 1==0))
> should be simplified to
> b = FILTER a by f1 > 1;
> Regarding the ordering change, an example is 
> b = filter a by ((f1 is not null) AND (f2 is not null));
> Even though no simplification is possible here, the expression is changed to
> b = filter a by ((f2 is not null) AND (f1 is not null));
> Although the ordering change in this case, and probably in most other 
> cases, makes no difference, some users might care about the ordering for 
> two reasons: stateful UDFs may be used as operands of AND or OR, and the 
> ordering may be intended by the application designer to maximize the 
> chances of short-circuiting the composite boolean evaluation. 


[jira] Commented: (PIG-1645) Using both small split combination and temporary file compression on a query of ORDER BY may cause crash

2010-09-23 Thread Yan Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914128#action_12914128
 ] 

Yan Zhou commented on PIG-1645:
---

The problem is that both RandomSampleLoader and PoissonSampleLoader keep 
internal state from previous invocations. When split combination (PIG-1518) is 
on, that state should be reset each time a different underlying split is worked 
on under the same umbrella split.

When temporary file compression is disabled, Pig internal storage creates 
empty files, which are discarded by the split combiner. The single non-empty 
split is then the only split to be worked on, so there is no problem in this 
case.
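
The reset requirement can be illustrated with a minimal Java sketch. It does 
not use Pig's API: SamplingLoaderSketch, its prepareToRead and getNext methods, 
and the everyNth sampling rule are hypothetical stand-ins, and an Iterator plays 
the role of the RecordReader. It only shows why per-split state left over from 
an empty first split has to be cleared before the next underlying split is read.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a sampling loader; this is not Pig's RandomSampleLoader.
final class SamplingLoaderSketch {
    private final int everyNth;       // keep every n-th record (illustrative sampling rule)
    private Iterator<String> reader;  // plays the role of the RecordReader argument
    private long seen;                // per-split state: records read so far
    private boolean exhausted;        // per-split state: end of input reached

    SamplingLoaderSketch(int everyNth) {
        this.everyNth = everyNth;
    }

    // Called once per underlying split, so several times for one combined split.
    void prepareToRead(Iterator<String> reader) {
        this.reader = reader;
        this.seen = 0;            // without these two resets, an empty first split
        this.exhausted = false;   // would silence every split that follows it
    }

    // Returns the next sampled record, or null when the current split is done.
    String getNext() {
        if (exhausted) {
            return null;
        }
        while (reader.hasNext()) {
            String record = reader.next();
            if (seen++ % everyNth == 0) {
                return record;
            }
        }
        exhausted = true;
        return null;
    }

    // Reads all underlying splits of one combined split with a single loader instance.
    static List<String> sample(SamplingLoaderSketch loader, List<List<String>> splits) {
        List<String> out = new ArrayList<>();
        for (List<String> split : splits) {
            loader.prepareToRead(split.iterator());
            for (String r = loader.getNext(); r != null; r = loader.getNext()) {
                out.add(r);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> splits = Arrays.asList(
                Collections.<String>emptyList(),      // logically empty file comes first
                Arrays.asList("a", "b", "c", "d"));
        System.out.println(sample(new SamplingLoaderSketch(2), splits));
    }
}
```

If the two assignments that reset seen and exhausted are removed from 
prepareToRead, the empty first split leaves exhausted set and the second split 
yields nothing, which mirrors the dependence on the position of the empty file.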

> Using both small split combination and temporary file compression on a query 
> of ORDER BY may cause crash
> 
>
> Key: PIG-1645
> URL: https://issues.apache.org/jira/browse/PIG-1645
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Yan Zhou
>Assignee: Yan Zhou
> Fix For: 0.8.0
>
>
> The stack looks like the following:
> java.lang.NullPointerException at 
> java.util.Arrays.binarySearch(Arrays.java:2043) at
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.getPartition(WeightedRangePartitioner.java:72)
>  at
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.getPartition(WeightedRangePartitioner.java:52)
>  at 
> org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:565) at
> org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
>  at
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:116)
>  at
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:238)
>  at
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231)
>  at
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53)
>  at
> org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) at
> org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:638) at
> org.apache.hadoop.mapred.MapTask.run(MapTask.java:314) at 
> org.apache.hadoop.mapred.Child$4.run(Child.java:217) at
> java.security.AccessController.doPrivileged(Native Method) at 
> javax.security.auth.Subject.doAs(Subject.java:396) at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1062)
>  at
> org.apache.hadoop.mapred.Child.main(Child.java:211) 


[jira] Updated: (PIG-1518) multi file input format for loaders

2010-09-23 Thread Yan Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1518:
--

Release Note: 
Feature: combine splits of sizes smaller than the value of property 
"pig.maxCombinedSplitSize" or, if the property of "pig.maxCombinedSplitSize" is 
not set, the file system default block size of the load's location. This 
feature can be turned off through setting the property "pig.splitCombination" 
to "false". When such a combination is performed, a log message like "Total 
input paths (combined) to process : 7" will be logged. 

This feature will be applicable if a user input, or an intermediate input, has 
many small files to be loaded that would otherwise cause many more "under-fed" 
mappers to be launched and potentially slow down the execution.

This change will not cause any backward compatibility issue except if a loader 
implementation makes use of the PigSplit object passed through the 
prepareToRead method where a rebuild of the loader might be necessary as 
PigSplit's definition has been modified. However, currently we know of no 
external use of the object.

This change also requires the loader to be stateless across invocations of 
the prepareToRead method. That is, the method should reset any internal state 
that is not affected by the RecordReader argument. Otherwise, this feature 
should be disabled.

In addition, if a loader implements IndexableLoadFunc, or implements 
OrderedLoadFunc and CollectableLoadFunc, its input splits won't be subject to 
possible combinations.

  was:
Feature: combine splits of sizes smaller than the value of property 
"pig.maxCombinedSplitSize" or, if the property of "pig.maxCombinedSplitSize" is 
not set, the file system default block size of the load's location. This 
feature can be turned off through setting the property "pig.noSplitCombination" 
to true. When such a combination is performed, a log message like "Total input 
paths (combined) to process : 7" will be logged. 

This feature will be applicable if a user input, or an intermediate input, has 
many small files to be loaded that would otherwise cause many more "under-fed" 
mappers to be launched and potentially slow down the execution.

This change will not cause any backward compatibility issue except if a loader 
implementation makes use of the PigSplit object passed through the 
prepareToRead method where a rebuild of the loader might be necessary as 
PigSplit's definition has been modified. However, currently we know of no 
external use of the object.

In addition, if a loader implements IndexableLoadFunc, or implements 
OrderedLoadFunc and CollectableLoadFunc, its input splits won't be subject to 
possible combinations.
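
As a rough illustration of the two properties named in the current release 
note, the Java sketch below reads "pig.splitCombination" and 
"pig.maxCombinedSplitSize" from a java.util.Properties object. The class 
SplitCombinationConfigSketch and its two helper methods are hypothetical, and 
how Pig itself parses these properties is an assumption; only the property 
names and the fallback to the file system block size come from the note.

```java
import java.util.Properties;

// Illustrative helper only; Pig's own property handling may differ.
final class SplitCombinationConfigSketch {
    static final String ENABLED_KEY = "pig.splitCombination";
    static final String MAX_SIZE_KEY = "pig.maxCombinedSplitSize";

    // Combination is treated as on unless the property is explicitly "false".
    static boolean isCombinationEnabled(Properties props) {
        return !"false".equalsIgnoreCase(props.getProperty(ENABLED_KEY));
    }

    // Size threshold: the property value if set, else the file system block size.
    static long combinedSizeLimit(Properties props, long defaultBlockSize) {
        String value = props.getProperty(MAX_SIZE_KEY);
        return value == null ? defaultBlockSize : Long.parseLong(value.trim());
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty(MAX_SIZE_KEY, "134217728");  // 128 MB, an arbitrary example value
        long blockSize = 64L * 1024 * 1024;             // assumed default block size
        System.out.println(isCombinationEnabled(props));
        System.out.println(combinedSizeLimit(props, blockSize));
    }
}
```

Setting "pig.splitCombination" to "false" in the same Properties object makes 
isCombinationEnabled return false, matching the way the note says the feature 
is turned off.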


> multi file input format for loaders
> ---
>
> Key: PIG-1518
> URL: https://issues.apache.org/jira/browse/PIG-1518
> Project: Pig
>  Issue Type: Improvement
>Reporter: Olga Natkovich
>Assignee: Yan Zhou
> Fix For: 0.8.0
>
> Attachments: PIG-1518-0.7.0.patch, PIG-1518.patch, PIG-1518.patch, 
> PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, 
> PIG-1518.patch, PIG-1518.patch
>
>
> We frequently run into the situation where Pig needs to deal with small files 
> in the input. In this case a separate map is created for each file, which 
> could be very inefficient. 
> It would be great to have an umbrella input format that can take multiple 
> files and use them in a single split. We would like to see this working with 
> different data formats if possible.
> There are already a couple of input formats doing a similar thing: 
> MultifileInputFormat as well as CombinedInputFormat; however, neither works 
> with the new Hadoop 20 API. 
> We at least want to do a feasibility study for Pig 0.8.0.


[jira] Created: (PIG-1645) Using both small split combination and temporary file compression on a query of ORDER BY may cause crash

2010-09-22 Thread Yan Zhou (JIRA)

Using both small split combination and temporary file compression on a query of 
ORDER BY may cause crash


 Key: PIG-1645
 URL: https://issues.apache.org/jira/browse/PIG-1645
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Yan Zhou
Assignee: Yan Zhou
 Fix For: 0.8.0


The stack looks like the following:

java.lang.NullPointerException at 
java.util.Arrays.binarySearch(Arrays.java:2043) at
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.getPartition(WeightedRangePartitioner.java:72)
 at
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.getPartition(WeightedRangePartitioner.java:52)
 at 
org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:565) at
org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
 at
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:116)
 at
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:238)
 at
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231)
 at
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53)
 at
org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) at
org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:638) at
org.apache.hadoop.mapred.MapTask.run(MapTask.java:314) at 
org.apache.hadoop.mapred.Child$4.run(Child.java:217) at
java.security.AccessController.doPrivileged(Native Method) at 
javax.security.auth.Subject.doAs(Subject.java:396) at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1062)
 at
org.apache.hadoop.mapred.Child.main(Child.java:211) 


[jira] Updated: (PIG-1635) Logical simplifier does not simplify away constants under AND and OR; after simplification the ordering of operands of AND and OR may get changed

2010-09-21 Thread Yan Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1635:
--

Status: Patch Available  (was: Open)

> Logical simplifier does not simplify away constants under AND and OR; after 
> simplification the ordering of operands of AND and OR may get changed
> 
>
> Key: PIG-1635
> URL: https://issues.apache.org/jira/browse/PIG-1635
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.8.0
>Reporter: Yan Zhou
>Assignee: Yan Zhou
>Priority: Minor
> Fix For: 0.8.0
>
> Attachments: PIG-1635.patch
>
>
> b = FILTER a by (( f1 > 1) AND (1 == 1))
> or 
> b = FILTER a by ((f1 > 1) OR ( 1==0))
> should be simplified to
> b = FILTER a by f1 > 1;
> Regarding the ordering change, an example is 
> b = filter a by ((f1 is not null) AND (f2 is not null));
> Even though no simplification is possible here, the expression is changed to
> b = filter a by ((f2 is not null) AND (f1 is not null));
> Although the ordering change in this case, and probably in most other 
> cases, makes no difference, some users might care about the ordering for 
> two reasons: stateful UDFs may be used as operands of AND or OR, and the 
> ordering may be intended by the application designer to maximize the 
> chances of short-circuiting the composite boolean evaluation. 


[jira] Updated: (PIG-1635) Logical simplifier does not simplify away constants under AND and OR; after simplification the ordering of operands of AND and OR may get changed

2010-09-21 Thread Yan Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1635:
--

Attachment: PIG-1635.patch

> Logical simplifier does not simplify away constants under AND and OR; after 
> simplification the ordering of operands of AND and OR may get changed
> 
>
> Key: PIG-1635
> URL: https://issues.apache.org/jira/browse/PIG-1635
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.8.0
>Reporter: Yan Zhou
>Assignee: Yan Zhou
>Priority: Minor
> Fix For: 0.8.0
>
> Attachments: PIG-1635.patch
>
>
> b = FILTER a by (( f1 > 1) AND (1 == 1))
> or 
> b = FILTER a by ((f1 > 1) OR ( 1==0))
> should be simplified to
> b = FILTER a by f1 > 1;
> Regarding the ordering change, an example is 
> b = filter a by ((f1 is not null) AND (f2 is not null));
> Even though no simplification is possible here, the expression is changed to
> b = filter a by ((f2 is not null) AND (f1 is not null));
> Although the ordering change in this case, and probably in most other 
> cases, makes no difference, some users might care about the ordering for 
> two reasons: stateful UDFs may be used as operands of AND or OR, and the 
> ordering may be intended by the application designer to maximize the 
> chances of short-circuiting the composite boolean evaluation. 


[jira] Commented: (PIG-1635) Logical simplifier does not simplify away constants under AND and OR; after simplification the ordering of operands of AND and OR may get changed

2010-09-21 Thread Yan Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913036#action_12913036
 ] 

Yan Zhou commented on PIG-1635:
---

This is regarding a new feature (PIG-1399) added for 0.8.

> Logical simplifier does not simplify away constants under AND and OR; after 
> simplification the ordering of operands of AND and OR may get changed
> 
>
> Key: PIG-1635
> URL: https://issues.apache.org/jira/browse/PIG-1635
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.8.0
>Reporter: Yan Zhou
>Assignee: Yan Zhou
>Priority: Minor
>
> b = FILTER a by (( f1 > 1) AND (1 == 1))
> or 
> b = FILTER a by ((f1 > 1) OR ( 1==0))
> should be simplified to
> b = FILTER a by f1 > 1;
> Regarding the ordering change, an example is 
> b = filter a by ((f1 is not null) AND (f2 is not null));
> Even though no simplification is possible here, the expression is changed to
> b = filter a by ((f2 is not null) AND (f1 is not null));
> Although the ordering change in this case, and probably in most other 
> cases, makes no difference, some users might care about the ordering for 
> two reasons: stateful UDFs may be used as operands of AND or OR, and the 
> ordering may be intended by the application designer to maximize the 
> chances of short-circuiting the composite boolean evaluation. 


[jira] Updated: (PIG-1635) Logical simplifier does not simplify away constants under AND and OR; after simplification the ordering of operands of AND and OR may get changed

2010-09-21 Thread Yan Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1635:
--

Affects Version/s: 0.8.0

> Logical simplifier does not simplify away constants under AND and OR; after 
> simplification the ordering of operands of AND and OR may get changed
> 
>
> Key: PIG-1635
> URL: https://issues.apache.org/jira/browse/PIG-1635
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.8.0
>Reporter: Yan Zhou
>Assignee: Yan Zhou
>Priority: Minor
>
> b = FILTER a by (( f1 > 1) AND (1 == 1))
> or 
> b = FILTER a by ((f1 > 1) OR ( 1==0))
> should be simplified to
> b = FILTER a by f1 > 1;
> Regarding the ordering change, an example is 
> b = filter a by ((f1 is not null) AND (f2 is not null));
> Even though no simplification is possible here, the expression is changed to
> b = filter a by ((f2 is not null) AND (f1 is not null));
> Although the ordering change in this case, and probably in most other 
> cases, makes no difference, some users might care about the ordering for 
> two reasons: stateful UDFs may be used as operands of AND or OR, and the 
> ordering may be intended by the application designer to maximize the 
> chances of short-circuiting the composite boolean evaluation. 


[jira] Created: (PIG-1635) Logical simplifier does not simplify away constants under AND and OR; after simplification the ordering of operands of AND and OR may get changed

2010-09-21 Thread Yan Zhou (JIRA)

Logical simplifier does not simplify away constants under AND and OR; after 
simplification the ordering of operands of AND and OR may get changed


 Key: PIG-1635
 URL: https://issues.apache.org/jira/browse/PIG-1635
 Project: Pig
  Issue Type: Bug
  Components: impl
Reporter: Yan Zhou
Assignee: Yan Zhou
Priority: Minor


b = FILTER a by (( f1 > 1) AND (1 == 1))

or 

b = FILTER a by ((f1 > 1) OR ( 1==0))

should be simplified to

b = FILTER a by f1 > 1;

Regarding the ordering change, an example is 

b = filter a by ((f1 is not null) AND (f2 is not null));

Even though no simplification is possible here, the expression is changed to

b = filter a by ((f2 is not null) AND (f1 is not null));

Although the ordering change in this case, and probably in most other cases, 
makes no difference, some users might care about the ordering for two reasons: 
stateful UDFs may be used as operands of AND or OR, and the ordering may be 
intended by the application designer to maximize the chances of 
short-circuiting the composite boolean evaluation. 
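
The requested behavior can be sketched in Java as follows. This is not the 
code of Pig's logical simplifier: the Expr, Const, Pred and BinOp classes are 
hypothetical, and the sketch assumes that constant comparisons such as 1 == 1 
have already been folded into a boolean constant. It applies only the two 
identities from the examples above (x AND true is x, x OR false is x) and 
otherwise keeps the operands in their original left-to-right order.

```java
// Hypothetical expression model; none of these classes exist in Pig.
final class SimplifierSketch {
    interface Expr {}

    static final class Const implements Expr {   // folded constant, e.g. (1 == 1)
        final boolean value;
        Const(boolean value) { this.value = value; }
        @Override public String toString() { return Boolean.toString(value); }
    }

    static final class Pred implements Expr {    // opaque predicate such as "f1 > 1"
        final String text;
        Pred(String text) { this.text = text; }
        @Override public String toString() { return text; }
    }

    static final class BinOp implements Expr {   // op is assumed to be "AND" or "OR"
        final String op;
        final Expr left;
        final Expr right;
        BinOp(String op, Expr left, Expr right) {
            this.op = op;
            this.left = left;
            this.right = right;
        }
        @Override public String toString() {
            return "(" + left + " " + op + " " + right + ")";
        }
    }

    // Drops the neutral constant of AND/OR and never reorders the operands.
    static Expr simplify(Expr e) {
        if (!(e instanceof BinOp)) {
            return e;
        }
        BinOp b = (BinOp) e;
        Expr l = simplify(b.left);
        Expr r = simplify(b.right);
        boolean neutral = b.op.equals("AND");    // true for AND, false for OR
        if (isConst(l, neutral)) {
            return r;
        }
        if (isConst(r, neutral)) {
            return l;
        }
        return new BinOp(b.op, l, r);            // original order is preserved
    }

    static boolean isConst(Expr e, boolean value) {
        return e instanceof Const && ((Const) e).value == value;
    }

    public static void main(String[] args) {
        System.out.println(simplify(new BinOp("AND", new Pred("f1 > 1"), new Const(true))));
        System.out.println(simplify(new BinOp("OR", new Pred("f1 > 1"), new Const(false))));
        System.out.println(simplify(new BinOp("AND",
                new Pred("f1 is not null"), new Pred("f2 is not null"))));
    }
}
```

Keeping the rebuilt BinOp in left-to-right order is what matters for the two 
concerns raised here: a stateful operand is evaluated in the position the 
script author chose, and a cheap or selective first operand can still 
short-circuit the second.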


[jira] Commented: (PIG-1628) log this message at debug level : 'Pig Internal storage in use'

2010-09-21 Thread Yan Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-1628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913029#action_12913029
 ] 

Yan Zhou commented on PIG-1628:
---

+1. Patch looks good.

> log this message at debug level : 'Pig Internal storage in use'
> ---
>
> Key: PIG-1628
> URL: https://issues.apache.org/jira/browse/PIG-1628
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Thejas M Nair
>Assignee: Thejas M Nair
> Fix For: 0.8.0
>
> Attachments: PIG-1628.1.patch
>
>
> The temporary storage functions used are logging at the INFO level. This 
> should change to debug level, as they are reducing the visibility of more 
> useful INFO messages. The messages include 'Pig Internal storage in use' from 
> InterStorage and 'TFile storage in use' from TFileStorage.


[jira] Commented: (PIG-366) PigPen - Eclipse plugin for a graphical PigLatin editor

2010-09-14 Thread Yan Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12909330#action_12909330
 ] 

Yan Zhou commented on PIG-366:
--

Robert,

Could you write up step-by-step instructions on how to use this jar as an 
Eclipse plug-in?  Thanks.

> PigPen - Eclipse plugin for a graphical PigLatin editor
> ---
>
> Key: PIG-366
> URL: https://issues.apache.org/jira/browse/PIG-366
> Project: Pig
>  Issue Type: New Feature
>Reporter: Shubham Chopra
>Assignee: Robert Gibbon
>Priority: Minor
> Attachments: org.apache.pig.pigpen-0.7.0.tar.gz, 
> org.apache.pig.pigpen-0.7.2.tar.gz, org.apache.pig.pigpen_0.0.1.jar, 
> org.apache.pig.pigpen_0.0.1.tgz, org.apache.pig.pigpen_0.0.4.jar, 
> org.apache.pig.pigpen_0.7.2.jar, pigpen.patch, pigPen.patch, PigPen.tgz
>
>
> This is an Eclipse plugin that provides a GUI that can help users create 
> PigLatin scripts and see the example generator outputs on the fly and submit 
> the jobs to hadoop clusters.


[jira] Resolved: (PIG-239) illustrate followed by dump gives a runtime exception

2010-09-13 Thread Yan Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou resolved PIG-239.
--

Fix Version/s: 0.8.0
   (was: 0.9.0)
   Resolution: Cannot Reproduce

Cannot reproduce using 0.8.

> illustrate followed by dump gives a runtime exception
> -
>
> Key: PIG-239
> URL: https://issues.apache.org/jira/browse/PIG-239
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Pradeep Kamath
>Assignee: Yan Zhou
> Fix For: 0.8.0
>
>
> Here is a session which outlines the issue:
> grunt> a = load '/user/pig/tests/data/singlefile/studenttab10k' as (name, 
> age,gpa);
> grunt> b = filter a by name lt 'b';
> grunt> c = foreach b generate TOKENIZE(name);
> grunt> illustrate c;
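> ----------------------------------
> | a | name          | age | gpa  |
> ----------------------------------
> |   | tom xylophone | 69  | 0.04 |
> |   | alice ovid    | 75  | 3.89 |
> ----------------------------------
> -------------------------------
> | b | name       | age | gpa  |
> -------------------------------
> |   | alice ovid | 75  | 3.89 |
> -------------------------------
> -------------------------
> | c | (token )          |
> -------------------------
> |   | {(alice), (ovid)} |
> -------------------------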
> grunt> dump c;
> 2008-05-15 14:35:54,476 [main] ERROR org.apache.pig.tools.grunt.GruntParser - 
> java.lang.RuntimeException: java.io.IOException: Serialization error: 
> org.apache.pig.impl.util.LineageTracer
> at 
> org.apache.pig.backend.hadoop.executionengine.POMapreduce.copy(POMapreduce.java:242)
> at 
> org.apache.pig.backend.hadoop.executionengine.MapreducePlanCompiler.compile(MapreducePlanCompiler.java:115)
> at 
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:232)
> at 
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:209)
> at org.apache.pig.PigServer.optimizeAndRunQuery(PigServer.java:410)
> at org.apache.pig.PigServer.openIterator(PigServer.java:332)
> at 
> org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:265)
> at 
> org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:162)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseContOnError(GruntParser.java:73)
> at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:54)
> at org.apache.pig.Main.main(Main.java:270)
> Caused by: java.io.IOException: Serialization error: 
> org.apache.pig.impl.util.LineageTracer
> at 
> org.apache.pig.impl.util.WrappedIOException.wrap(WrappedIOException.java:16)
> at 
> org.apache.pig.impl.util.ObjectSerializer.serialize(ObjectSerializer.java:44)
> at 
> org.apache.pig.backend.hadoop.executionengine.POMapreduce.copy(POMapreduce.java:233)
> ... 10 more
> Caused by: java.io.NotSerializableException: 
> org.apache.pig.impl.util.LineageTracer
> at 
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1081)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1375)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1347)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1290)
> at 
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1079)
> at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:302)
> at java.util.ArrayList.writeObject(ArrayList.java:569)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:585)
> at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:917)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1339)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1290)
> at 
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1079)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java

[jira] Commented: (PIG-366) PigPen - Eclipse plugin for a graphical PigLatin editor

2010-09-13 Thread Yan Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12908971#action_12908971
 ] 

Yan Zhou commented on PIG-366:
--

One more clarification: by design, the example generator does not submit any jobs 
to Hadoop; it just runs on the client as a local application.

> PigPen - Eclipse plugin for a graphical PigLatin editor
> ---
>
> Key: PIG-366
> URL: https://issues.apache.org/jira/browse/PIG-366
> Project: Pig
>  Issue Type: New Feature
>Reporter: Shubham Chopra
>Assignee: Robert Gibbon
>Priority: Minor
> Attachments: org.apache.pig.pigpen-0.7.0.tar.gz, 
> org.apache.pig.pigpen-0.7.2.tar.gz, org.apache.pig.pigpen_0.0.1.jar, 
> org.apache.pig.pigpen_0.0.1.tgz, org.apache.pig.pigpen_0.0.4.jar, 
> org.apache.pig.pigpen_0.7.2.jar, pigpen.patch, pigPen.patch, PigPen.tgz
>
>
> This is an Eclipse plugin that provides a GUI that can help users create 
> PigLatin scripts and see the example generator outputs on the fly and submit 
> the jobs to hadoop clusters.


[jira] Commented: (PIG-366) PigPen - Eclipse plugin for a graphical PigLatin editor

2010-09-13 Thread Yan Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12908962#action_12908962
 ] 

Yan Zhou commented on PIG-366:
--

Yes. But the original patch by Shubham already hooked the plugin into the example 
generator interface, unless you have found something funky in that patch. I 
have no intention of changing the interface.

> PigPen - Eclipse plugin for a graphical PigLatin editor
> ---
>
> Key: PIG-366
> URL: https://issues.apache.org/jira/browse/PIG-366
> Project: Pig
>  Issue Type: New Feature
>Reporter: Shubham Chopra
>Assignee: Robert Gibbon
>Priority: Minor
> Attachments: org.apache.pig.pigpen-0.7.0.tar.gz, 
> org.apache.pig.pigpen-0.7.2.tar.gz, org.apache.pig.pigpen_0.0.1.jar, 
> org.apache.pig.pigpen_0.0.1.tgz, org.apache.pig.pigpen_0.0.4.jar, 
> org.apache.pig.pigpen_0.7.2.jar, pigpen.patch, pigPen.patch, PigPen.tgz
>
>
> This is an Eclipse plugin that provides a GUI that can help users create 
> PigLatin scripts and see the example generator outputs on the fly and submit 
> the jobs to hadoop clusters.


[jira] Commented: (PIG-366) PigPen - Eclipse plugin for a graphical PigLatin editor

2010-09-13 Thread Yan Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12908926#action_12908926
 ] 

Yan Zhou commented on PIG-366:
--

Robert, first, thanks for your effort to pick up this feature.

You mentioned in your 09/08 comment that you "stripped back" a lot of 
functionality and focused on the script editor.  I'm wondering if it is 
possible to add your fixes/improvements on top of Shubham's patch. 
Specifically, I'm interested in the use of the example generator in PigPen, which 
seems to be absent from your patches. FYI, I'm currently working on improving and 
enhancing the example generator that Shubham left behind about 2 years ago.

> PigPen - Eclipse plugin for a graphical PigLatin editor
> ---
>
> Key: PIG-366
> URL: https://issues.apache.org/jira/browse/PIG-366
> Project: Pig
>  Issue Type: New Feature
>Reporter: Shubham Chopra
>Assignee: Robert Gibbon
>Priority: Minor
> Attachments: org.apache.pig.pigpen-0.7.0.tar.gz, 
> org.apache.pig.pigpen-0.7.2.tar.gz, org.apache.pig.pigpen_0.0.1.jar, 
> org.apache.pig.pigpen_0.0.1.tgz, org.apache.pig.pigpen_0.0.4.jar, 
> org.apache.pig.pigpen_0.7.2.jar, pigpen.patch, pigPen.patch, PigPen.tgz
>
>
> This is an Eclipse plugin that provides a GUI that can help users create 
> PigLatin scripts and see the example generator outputs on the fly and submit 
> the jobs to hadoop clusters.


[jira] Commented: (PIG-1501) need to investigate the impact of compression on pig performance

2010-08-31 Thread Yan Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904868#action_12904868
 ] 

Yan Zhou commented on PIG-1501:
---

To be more accurate, the default compression would be gzip if compression 
were turned on by default.  Currently, the compression codec has to be specified and 
takes no default value. This is to make the user weigh the pros 
and cons of each compression method.
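
The rule above can be modelled in a few lines. This is only a sketch in Python, 
not Pig's actual code: the property names and the accepted values "gz" and 
"lzo" come from the release note for this issue, while the function name and 
the choice to raise an error are assumptions made for illustration.

```python
# Codec names the release note lists as accepted values.
ACCEPTED_CODECS = {"gz", "lzo"}


def resolve_tmpfile_codec(props):
    """Return the codec for temporary files, or None if compression is off.

    Hypothetical helper: it models the described behaviour, not Pig's code.
    """
    enabled = props.get("pig.tmpfilecompression", "false").strip().lower() == "true"
    if not enabled:
        return None
    codec = props.get("pig.tmpfilecompression.codec")
    if codec is None:
        # No default codec: the user has to choose one explicitly.
        raise ValueError("pig.tmpfilecompression.codec must be set explicitly")
    if codec not in ACCEPTED_CODECS:
        raise ValueError("unsupported codec: " + codec)
    return codec


print(resolve_tmpfile_codec({}))
print(resolve_tmpfile_codec({
    "pig.tmpfilecompression": "true",
    "pig.tmpfilecompression.codec": "lzo",
}))
```

Failing when the codec is missing mirrors the intent of making the user choose 
between the codecs deliberately.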

> need to investigate the impact of compression on pig performance
> 
>
> Key: PIG-1501
> URL: https://issues.apache.org/jira/browse/PIG-1501
> Project: Pig
>  Issue Type: Test
>Reporter: Olga Natkovich
>Assignee: Yan Zhou
> Fix For: 0.8.0
>
> Attachments: compress_perf_data.txt, compress_perf_data_2.txt, 
> PIG-1501.patch, PIG-1501.patch, PIG-1501.patch
>
>
> We would like to understand how compressing map results as well as 
> reducer output in a chain of MR jobs impacts performance. We can use PigMix 
> queries for this investigation.


[jira] Updated: (PIG-1501) need to investigate the impact of compression on pig performance

2010-08-31 Thread Yan Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1501:
--

Release Note: 
This feature saves HDFS space used to store the intermediate data used by 
Pig and can potentially improve query execution speed. In general, the more 
intermediate data is generated, the greater the storage and speedup benefits.

There are no backward compatibility issues as a result of this feature.

Two Java properties control the behavior:

pig.tmpfilecompression, which defaults to false, specifies whether the temporary 
files should be compressed.  If it is true, then

pig.tmpfilecompression.codec specifies which compression codec to use. 
Currently, Pig only accepts "gz" and "lzo" as possible values. Since LZO is 
under the GPL license, Hadoop may need to be configured to use the LZO codec. Please 
refer to http://code.google.com/p/hadoop-gpl-compression/wiki/FAQ for details.


An example is the following "test.pig" script:

register pigperf.jar;
A = load '/user/pig/tests/data/pigmix/page_views' using 
org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
as (user, action, timespent:long, query_term, ip_addr, timestamp, 
estimated_revenue, page_info, page_links);
B1 = filter A by timespent == 4;
B = load '/user/pig/tests/data/pigmix/queryterm' as (query_term);
C = join B1 by query_term, B by query_term using 'skewed' parallel 300;
D = distinct C parallel 300;
store D into 'output.lzo';

which is launched as follows:

java -cp /grid/0/gs/conf/current:/grid/0/jars/pig.jar 
-Djava.library.path=/grid/0/gs/hadoop/current/lib/native/Linux-i386-32 
-Dpig.tmpfilecompression=true -Dpig.tmpfilecompression.codec=lzo 
org.apache.pig.Main ./test.pig

  was:
This feature saves HDFS space used to store the intermediate data used by 
Pig and can potentially improve query execution speed. In general, the more 
intermediate data is generated, the greater the storage and speedup benefits.

There are no backward compatibility issues as a result of this feature.

Two Java properties control the behavior:

pig.tmpfilecompression, which defaults to false, specifies whether the temporary 
files should be compressed.  If it is true, then

pig.tmpfilecompression.codec specifies which compression codec to use. 
Currently, Pig only accepts "gz" and "lzo" as possible values. Since LZO is 
under the GPL license, Hadoop may need to be configured to use the LZO codec. Please 
refer to http://code.google.com/p/hadoop-gpl-compression/wiki/FAQ for details.


An example is the following "test.pig" script:

register pigperf.jar;
A = load '/user/pig/tests/data/pigmix/page_views' using 
org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
as (user, action, timespent:long, query_term, ip_addr, timestamp, 
estimated_revenue, page_info, page_links);
B1 = filter A by timespent == 4;
B = load '/user/pig/tests/data/pigmix/queryterm' as (query_term);
C = join B1 by query_term, B by query_term using 'skewed' parallel 300;
D = distinct C parallel 300;
store D into 'output.lzo';

which is launched as follows:

java -cp /grid/0/gs/conf/current:/grid/0/jars/pig.jar 
-Djava.library.path=/grid/0/gs/hadoop/current/lib/native/Linux-i386-32 
-Dpig.tmpfilecompression=true -Dpig.tmpfilecompression.codec=lzo 
org.apache.pig.Main ./test.pig




> need to investigate the impact of compression on pig performance
> 
>
>     Key: PIG-1501
> URL: https://issues.apache.org/jira/browse/PIG-1501
> Project: Pig
>  Issue Type: Test
>Reporter: Olga Natkovich
>

[jira] Updated: (PIG-1501) need to investigate the impact of compression on pig performance

2010-08-31 Thread Yan Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1501:
--

Release Note: 
This feature saves HDFS space used to store the intermediate data used by 
Pig and can potentially improve query execution speed. In general, the more 
intermediate data is generated, the greater the storage and speedup benefits.

There are no backward compatibility issues as a result of this feature.

Two Java properties control the behavior:

pig.tmpfilecompression, which defaults to false, specifies whether the temporary 
files should be compressed.  If it is true, then

pig.tmpfilecompression.codec specifies which compression codec to use. 
Currently, Pig only accepts "gz" and "lzo" as possible values. Since LZO is 
under the GPL license, Hadoop may need to be configured to use the LZO codec. Please 
refer to http://code.google.com/p/hadoop-gpl-compression/wiki/FAQ for details.


An example is the following "test.pig" script:

register pigperf.jar;
A = load '/user/pig/tests/data/pigmix/page_views' using 
org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
as (user, action, timespent:long, query_term, ip_addr, timestamp, 
estimated_revenue, page_info, page_links);
B1 = filter A by timespent == 4;
B = load '/user/pig/tests/data/pigmix/queryterm' as (query_term);
C = join B1 by query_term, B by query_term using 'skewed' parallel 300;
D = distinct C parallel 300;
store D into 'output.lzo';

which is launched as follows:

java -cp /grid/0/gs/conf/current:/grid/0/jars/pig.jar 
-Djava.library.path=/grid/0/gs/hadoop/current/lib/native/Linux-i386-32 
-Dpig.tmpfilecompression=true -Dpig.tmpfilecompression.codec=lzo 
org.apache.pig.Main ./test.pig



> need to investigate the impact of compression on pig performance
> 
>
> Key: PIG-1501
> URL: https://issues.apache.org/jira/browse/PIG-1501
> Project: Pig
>  Issue Type: Test
>Reporter: Olga Natkovich
>Assignee: Yan Zhou
> Fix For: 0.8.0
>
> Attachments: compress_perf_data.txt, compress_perf_data_2.txt, 
> PIG-1501.patch, PIG-1501.patch, PIG-1501.patch
>
>
> We would like to understand how compressing map results as well as 
> reducer output in a chain of MR jobs impacts performance. We can use PigMix 
> queries for this investigation.


[jira] Updated: (PIG-1399) Logical Optimizer: Expression optimizor rule

2010-08-30 Thread Yan Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-1399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1399:
--

  Status: Patch Available  (was: Open)
Release Note: 
The logical expression simplification covers the following types of simplification:

1) Constant pre-calculation
Example:
B = filter A by a0 > 5+7;

is simplified to

B = filter A by a0 > 12;


2) Elimination of negations
Example:
B = filter A by not (not(a0>5) or a>10);

is simplified to

B = filter A by a0>5 and a<=10;


3) Elimination of logically implied expressions in AND
Example:
B = filter A by (a0 > 5 and a0 > 7);


is simplified to

B = filter A by a0 > 7;


4) Elimination of logically implied expressions in OR
Example:
B = filter A by ((a0 > 5) or (a0 > 6 and a1 > 15));

is simplified to

B = filter A by a0 > 5;


5) Equivalence elimination
Example:
B = filter A by (a0 > 5 and a0 > 5);

is simplified to

B = filter A by a0 > 5;


6) Elimination of complementary expressions in OR
Example:
B = filter A by (a0 > 5 OR a0 <= 5);

is simplified to non-filtering


7) Elimination of naive TRUE expression
Example:

B = filter A by 1==1;

is simplified to non-filtering
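
Three of the simplifications above (1, 2 and 3) can be sketched outside Pig. The 
Python below is an illustration only, not the optimizer's implementation: the 
tuple encoding of expressions and the names fold, push_not and simplify_and 
are assumptions made for the example.

```python
# Complement of each comparison operator, e.g. not (x > c) is (x <= c).
COMPLEMENT = {">": "<=", "<=": ">", "<": ">=", ">=": "<", "==": "!=", "!=": "=="}


def fold(value):
    """Constant pre-calculation: ("+", 5, 7) becomes 12."""
    if isinstance(value, tuple) and value[0] == "+":
        return fold(value[1]) + fold(value[2])
    return value


def push_not(expr, negate=False):
    """Eliminate negations by pushing NOT down to the comparisons."""
    op = expr[0]
    if op == "not":
        return push_not(expr[1], not negate)
    if op in ("and", "or"):
        if negate:  # De Morgan: not (p or q) is (not p) and (not q)
            op = "and" if op == "or" else "or"
        return (op, push_not(expr[1], negate), push_not(expr[2], negate))
    # Comparison leaf of the form (operator, column, constant expression).
    column, constant = expr[1], fold(expr[2])
    return (COMPLEMENT[op] if negate else op, column, constant)


def simplify_and(left, right):
    """Keep only the stronger of two '>' tests on the same column."""
    if left[0] == ">" and right[0] == ">" and left[1] == right[1]:
        return (">", left[1], max(left[2], right[2]))
    return ("and", left, right)


# not (not (a0 > 5) or a > 10)
negated = ("not", ("or", ("not", (">", "a0", 5)), (">", "a", 10)))
print(push_not(negated))                             # ('and', ('>', 'a0', 5), ('<=', 'a', 10))
print(push_not((">", "a0", ("+", 5, 7))))            # ('>', 'a0', 12)
print(simplify_and((">", "a0", 5), (">", "a0", 7)))  # ('>', 'a0', 7)
```

Note that simplify_and also covers the equivalence case (5), since two identical 
'>' tests collapse into one.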

> Logical Optimizer: Expression optimizor rule
> 
>
> Key: PIG-1399
> URL: https://issues.apache.org/jira/browse/PIG-1399
> Project: Pig
>  Issue Type: Sub-task
>  Components: impl
>Affects Versions: 0.7.0
>Reporter: Daniel Dai
>Assignee: Yan Zhou
> Fix For: 0.8.0
>
> Attachments: newPatchFindbugsWarnings.html, PIG-1399.patch, 
> PIG-1399.patch, PIG-1399.patch, PIG-1399.patch, PIG-1399.patch, 
> PIG-1399.patch, PIG-1399.patch
>
>
> We can optimize expression in several ways:
> 1. Constant pre-calculation
> Example:
> B = filter A by a0 > 5+7;
> => B = filter A by a0 > 12;
> 2. Boolean expression optimization
> Example:
> B = filter A by not (not(a0>5) or a>10);
> => B = filter A by a0>5 and a<=10;


[jira] Updated: (PIG-1399) Logical Optimizer: Expression optimizor rule

2010-08-30 Thread Yan Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-1399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1399:
--

Attachment: PIG-1399.patch

I used findbugs 1.3.9, and it finds the patch clean. The attached findbugs 
results were generated using 1.3.8, which might explain the difference. Anyway, I made 
a minor modification that should fix the warnings reported by 1.3.8.

> Logical Optimizer: Expression optimizor rule
> 
>
> Key: PIG-1399
> URL: https://issues.apache.org/jira/browse/PIG-1399
> Project: Pig
>  Issue Type: Sub-task
>  Components: impl
>Affects Versions: 0.7.0
>Reporter: Daniel Dai
>Assignee: Yan Zhou
> Fix For: 0.8.0
>
> Attachments: newPatchFindbugsWarnings.html, PIG-1399.patch, 
> PIG-1399.patch, PIG-1399.patch, PIG-1399.patch, PIG-1399.patch, 
> PIG-1399.patch, PIG-1399.patch
>
>
> We can optimize expression in several ways:
> 1. Constant pre-calculation
> Example:
> B = filter A by a0 > 5+7;
> => B = filter A by a0 > 12;
> 2. Boolean expression optimization
> Example:
> B = filter A by not (not(a0>5) or a>10);
> => B = filter A by a0>5 and a<=10;


[jira] Updated: (PIG-1399) Logical Optimizer: Expression optimizor rule

2010-08-27 Thread Yan Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-1399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1399:
--

Attachment: PIG-1399.patch

rebased on the latest trunk.

> Logical Optimizer: Expression optimizor rule
> 
>
> Key: PIG-1399
> URL: https://issues.apache.org/jira/browse/PIG-1399
> Project: Pig
>  Issue Type: Sub-task
>  Components: impl
>Affects Versions: 0.7.0
>Reporter: Daniel Dai
>Assignee: Yan Zhou
> Fix For: 0.8.0
>
> Attachments: PIG-1399.patch, PIG-1399.patch, PIG-1399.patch, 
> PIG-1399.patch, PIG-1399.patch, PIG-1399.patch
>
>
> We can optimize expression in several ways:
> 1. Constant pre-calculation
> Example:
> B = filter A by a0 > 5+7;
> => B = filter A by a0 > 12;
> 2. Boolean expression optimization
> Example:
> B = filter A by not (not(a0>5) or a>10);
> => B = filter A by a0>5 and a<=10;


[jira] Updated: (PIG-1399) Logical Optimizer: Expression optimizor rule

2010-08-27 Thread Yan Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-1399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1399:
--

Attachment: PIG-1399.patch

Addressed the review comments, except that I did not split the logic into several 
optimization rules, because the order in which the rules are applied is significant.

> Logical Optimizer: Expression optimizor rule
> 
>
> Key: PIG-1399
> URL: https://issues.apache.org/jira/browse/PIG-1399
> Project: Pig
>  Issue Type: Sub-task
>  Components: impl
>Affects Versions: 0.7.0
>Reporter: Daniel Dai
>Assignee: Yan Zhou
> Fix For: 0.8.0
>
> Attachments: PIG-1399.patch, PIG-1399.patch, PIG-1399.patch, 
> PIG-1399.patch, PIG-1399.patch
>
>
> We can optimize expression in several ways:
> 1. Constant pre-calculation
> Example:
> B = filter A by a0 > 5+7;
> => B = filter A by a0 > 12;
> 2. Boolean expression optimization
> Example:
> B = filter A by not (not(a0>5) or a>10);
> => B = filter A by a0>5 and a<=10;


[jira] Commented: (PIG-1518) multi file input format for loaders

2010-08-27 Thread Yan Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903528#action_12903528
 ] 

Yan Zhou commented on PIG-1518:
---

All other functionalities except for the two mentioned in the previous comment 
will see splits combined by default, if necessary.

> multi file input format for loaders
> ---
>
> Key: PIG-1518
> URL: https://issues.apache.org/jira/browse/PIG-1518
> Project: Pig
>  Issue Type: Improvement
>Reporter: Olga Natkovich
>Assignee: Yan Zhou
> Fix For: 0.8.0
>
> Attachments: PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, 
> PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch
>
>
> We frequently run into the situation where Pig needs to deal with small files 
> in the input. In this case a separate map is created for each file, which 
> could be very inefficient. 
> It would be great to have an umbrella input format that can take multiple 
> files and use them in a single split. We would like to see this working with 
> different data formats if possible.
> There are already a couple of input formats doing a similar thing: 
> MultifileInputFormat as well as CombinedInputFormat; however, neither works 
> with the new Hadoop 20 API. 
> We at least want to do a feasibility study for Pig 0.8.0.


[jira] Commented: (PIG-1518) multi file input format for loaders

2010-08-27 Thread Yan Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903525#action_12903525
 ] 

Yan Zhou commented on PIG-1518:
---

In summary, the following functionalities won't see splits combined on loads:

1) map-side cogroup;
2) merge join;


> multi file input format for loaders
> ---
>
> Key: PIG-1518
> URL: https://issues.apache.org/jira/browse/PIG-1518
> Project: Pig
>  Issue Type: Improvement
>Reporter: Olga Natkovich
>Assignee: Yan Zhou
> Fix For: 0.8.0
>
> Attachments: PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, 
> PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch
>
>
> We frequently run into the situation where Pig needs to deal with small files 
> in the input. In this case a separate map is created for each file, which 
> could be very inefficient. 
> It would be great to have an umbrella input format that can take multiple 
> files and use them in a single split. We would like to see this working with 
> different data formats if possible.
> There are already a couple of input formats doing a similar thing: 
> MultifileInputFormat as well as CombinedInputFormat; however, neither works 
> with the new Hadoop 20 API. 
> We at least want to do a feasibility study for Pig 0.8.0.


[jira] Commented: (PIG-1518) multi file input format for loaders

2010-08-27 Thread Yan Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903423#action_12903423
 ] 

Yan Zhou commented on PIG-1518:
---

MergeJoinIndexer and IndexableLoadFunc are both not combinable.

Regarding OrderedLoadFunc, the story is a bit more complex. First of all, its 
only non-overridden method, getSplitComparable, is used only in MergeJoinIndexer, 
which is already not combinable. 

The big issue is FileInputLoadFunc, which is extended by BinStorage, PigStorage 
and InterStorage. Semantically, I agree that an OrderedLoadFunc should not be 
combinable. However, FileInputLoadFunc's implementation of OrderedLoadFunc makes 
little sense in that its ordering is based on the (path, offset) pair. This is 
an ordering, but just an arbitrary one. Mathematically, one can establish 
an arbitrary ordering over any discrete set of data; the point is how the 
ordering is used. For our purpose, the ordering should be related to some keys 
used in data manipulation, and (path, offset) does not serve that purpose. 
Otherwise, a FileInputLoadFunc implicitly still requires that the storage give out 
splits in some key ordering. If that storage ordering does not actually exist, 
FileInputLoadFunc as an OrderedLoadFunc can make no use of its "sortedness", 
because the ordering is just, well, arbitrary. The three extensions of 
FileInputLoadFunc work on generic data storage. Unless they work on sorted data 
in general, they should not be OrderedLoadFuncs.

The other use of OrderedLoadFunc, as a type rather than through its 
getSplitComparable method, is by map-side cogroup. But map-side cogroup does not 
check whether the sort key is the join key, which is critical for correctness.  It 
also requires the loader to be a CollectableLoadFunc to work properly.

Since we do not want to break backward compatibility, and the only use of 
OrderedLoadFunc in Pig, except for MergeJoinIndexer, which is already excluded from 
combining, is in map-side cogroup with CollectableLoadFunc, I mark a loader that is 
"a CollectableLoadFunc AND an OrderedLoadFunc" as non-combinable.

In the future, we should really remove OrderedLoadFunc from 
FileInputLoadFunc and let the getSplitComparable method provide key-related 
info and not the (path, offset) pair. Backward compatibility may need to be 
addressed too. Only then will the waters become clearer, and I will be OK with 
adjusting the non-combinable setting accordingly.
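
To illustrate why a (path, offset) ordering says nothing about key order, here 
is a small Python sketch. It is not Pig code: the Split tuple, the file names 
and the first_key field (the smallest key stored in a split) are hypothetical 
and exist only for this example.

```python
from typing import NamedTuple


class Split(NamedTuple):
    path: str
    offset: int
    first_key: int  # hypothetical: smallest key stored in this split


# Unsorted storage: where a split lives is unrelated to the keys it holds.
splits = [
    Split("part-00001", 0, 50),
    Split("part-00000", 128, 10),
    Split("part-00000", 0, 90),
]

# Ordering by (path, offset) is a valid total order over the splits ...
by_location = sorted(splits, key=lambda s: (s.path, s.offset))
# ... but it differs from the order a key-based operation would need.
by_key = sorted(splits, key=lambda s: s.first_key)

print([s.first_key for s in by_location])  # [90, 10, 50]
print([s.first_key for s in by_key])       # [10, 50, 90]
```

Only when the data is written out in key order do the two orderings coincide, 
which is the implicit requirement discussed above.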

> multi file input format for loaders
> ---
>
> Key: PIG-1518
> URL: https://issues.apache.org/jira/browse/PIG-1518
> Project: Pig
>  Issue Type: Improvement
>        Reporter: Olga Natkovich
>Assignee: Yan Zhou
> Fix For: 0.8.0
>
> Attachments: PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, 
> PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch
>
>
> We frequently run into the situation where Pig needs to deal with small files 
> in the input. In this case a separate map is created for each file, which 
> could be very inefficient. 
> It would be great to have an umbrella input format that can take multiple 
> files and use them in a single split. We would like to see this working with 
> different data formats if possible.
> There are already a couple of input formats doing a similar thing: 
> MultifileInputFormat as well as CombinedInputFormat; however, neither works 
> with the new Hadoop 20 API. 
> We at least want to do a feasibility study for Pig 0.8.0.


[jira] Commented: (PIG-1518) multi file input format for loaders

2010-08-26 Thread Yan Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903102#action_12903102
 ] 

Yan Zhou commented on PIG-1518:
---

A load is not combinable if the loader is a CollectableLoadFunc AND an 
OrderedLoadFunc. Since PigStorage is a CollectableLoadFunc but not an 
OrderedLoadFunc, it is combinable.
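
The rule can be written as a one-line predicate. The Python below is a sketch 
under assumptions, not the patch itself: the marker classes only borrow the 
names of the Pig interfaces, and the example loaders and the is_combinable 
function are hypothetical.

```python
class LoadFunc:
    """Stand-in for a loader base class."""


class CollectableLoadFunc:
    """Stand-in for the marker interface of the same name."""


class OrderedLoadFunc:
    """Stand-in for the marker interface of the same name."""


# Hypothetical loaders covering the interesting combinations.
class PlainLoader(LoadFunc):
    pass


class CollectableOnlyLoader(LoadFunc, CollectableLoadFunc):
    pass


class OrderedOnlyLoader(LoadFunc, OrderedLoadFunc):
    pass


class CollectableOrderedLoader(LoadFunc, CollectableLoadFunc, OrderedLoadFunc):
    pass


def is_combinable(loader):
    """Splits may be combined unless the loader is both collectable and ordered."""
    return not (isinstance(loader, CollectableLoadFunc)
                and isinstance(loader, OrderedLoadFunc))


for loader in (PlainLoader(), CollectableOnlyLoader(),
               OrderedOnlyLoader(), CollectableOrderedLoader()):
    print(type(loader).__name__, is_combinable(loader))
```

Only the loader that implements both interfaces is excluded, so loaders that 
implement just one of them keep getting combined splits.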

> multi file input format for loaders
> ---
>
> Key: PIG-1518
> URL: https://issues.apache.org/jira/browse/PIG-1518
> Project: Pig
>  Issue Type: Improvement
>Reporter: Olga Natkovich
>Assignee: Yan Zhou
> Fix For: 0.8.0
>
> Attachments: PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, 
> PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch
>
>
> We frequently run into the situation where Pig needs to deal with small files 
> in the input. In this case a separate map is created for each file, which 
> could be very inefficient. 
> It would be great to have an umbrella input format that can take multiple 
> files and use them in a single split. We would like to see this working with 
> different data formats if possible.
> There are already a couple of input formats doing a similar thing: 
> MultifileInputFormat as well as CombinedInputFormat; however, neither works 
> with the new Hadoop 20 API. 
> We at least want to do a feasibility study for Pig 0.8.0.


[jira] Updated: (PIG-1518) multi file input format for loaders

2010-08-26 Thread Yan Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1518:
--

Attachment: PIG-1518.patch

rebased on the latest trunk

> multi file input format for loaders
> ---
>
> Key: PIG-1518
> URL: https://issues.apache.org/jira/browse/PIG-1518
> Project: Pig
>  Issue Type: Improvement
>Reporter: Olga Natkovich
>    Assignee: Yan Zhou
> Fix For: 0.8.0
>
> Attachments: PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, 
> PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch
>
>
> We frequently run into the situation where Pig needs to deal with small files 
> in the input. In this case a separate map is created for each file, which 
> could be very inefficient. 
> It would be great to have an umbrella input format that can take multiple 
> files and use them in a single split. We would like to see this working with 
> different data formats if possible.
> There are already a couple of input formats doing a similar thing: 
> MultifileInputFormat as well as CombinedInputFormat; however, neither works 
> with the new Hadoop 20 API. 
> We at least want to do a feasibility study for Pig 0.8.0.


[jira] Updated: (PIG-1518) multi file input format for loaders

2010-08-26 Thread Yan Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1518:
--

Attachment: PIG-1518.patch

> multi file input format for loaders
> ---
>
> Key: PIG-1518
> URL: https://issues.apache.org/jira/browse/PIG-1518
> Project: Pig
>  Issue Type: Improvement
>Reporter: Olga Natkovich
>    Assignee: Yan Zhou
> Fix For: 0.8.0
>
> Attachments: PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, 
> PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch
>
>
> We frequently run into the situation where Pig needs to deal with small files
> in the input. In this case a separate map is created for each file, which
> could be very inefficient.
> It would be great to have an umbrella input format that can take multiple
> files and use them in a single split. We would like to see this working with
> different data formats if possible.
> There are already a couple of input formats doing a similar thing:
> MultifileInputFormat as well as CombinedInputFormat; however, neither works
> with the new Hadoop 20 API.
> We at least want to do a feasibility study for Pig 0.8.0.


[jira] Updated: (PIG-1501) need to investigate the impact of compression on pig performance

2010-08-26 Thread Yan Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1501:
--

Status: Patch Available  (was: Open)

This feature saves the HDFS space used to store Pig's intermediate data and can
potentially improve query execution speed. In general, the more intermediate
data a query generates, the greater the storage and speed benefits.

There are no backward compatibility issues as a result of this feature.

An example is the following "test.pig" script:

register pigperf.jar;
A = load '/user/pig/tests/data/pigmix/page_views' using 
org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
as (user, action, timespent:long, query_term, ip_addr, timestamp, 
estimated_revenue, page_info, page_links);
B1 = filter A by timespent == 4;
B = load '/user/pig/tests/data/pigmix/queryterm' as (query_term);
C = join B1 by query_term, B by query_term using 'skewed' parallel 300;
D = distinct C parallel 300;
store D into 'output.lzo';

which is launched as follows:

java -cp /grid/0/gs/conf/current:/grid/0/jars/pig.jar 
-Djava.library.path=/grid/0/gs/hadoop/current/lib/native/Linux-i386-32 
-Dpig.tmpfilecompression=true -Dpig.tmpfilecompression.codec=lzo 
org.apache.pig.Main ./test.pig

> need to investigate the impact of compression on pig performance
> 
>
> Key: PIG-1501
> URL: https://issues.apache.org/jira/browse/PIG-1501
> Project: Pig
>  Issue Type: Test
>Reporter: Olga Natkovich
>Assignee: Yan Zhou
> Fix For: 0.8.0
>
> Attachments: compress_perf_data.txt, compress_perf_data_2.txt, 
> PIG-1501.patch, PIG-1501.patch, PIG-1501.patch
>
>
> We would like to understand how compressing map results as well as
> reducer output in a chain of MR jobs impacts performance. We can use PigMix 
> queries for this investigation.


[jira] Updated: (PIG-1518) multi file input format for loaders

2010-08-25 Thread Yan Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1518:
--

Status: Open  (was: Patch Available)

> multi file input format for loaders
> ---
>
> Key: PIG-1518
> URL: https://issues.apache.org/jira/browse/PIG-1518
> Project: Pig
>  Issue Type: Improvement
>Reporter: Olga Natkovich
>    Assignee: Yan Zhou
> Fix For: 0.8.0
>
> Attachments: PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, 
> PIG-1518.patch, PIG-1518.patch, PIG-1518.patch
>
>
> We frequently run into the situation where Pig needs to deal with small files
> in the input. In this case a separate map is created for each file, which
> could be very inefficient.
> It would be great to have an umbrella input format that can take multiple
> files and use them in a single split. We would like to see this working with
> different data formats if possible.
> There are already a couple of input formats doing a similar thing:
> MultifileInputFormat as well as CombinedInputFormat; however, neither works
> with the new Hadoop 20 API.
> We at least want to do a feasibility study for Pig 0.8.0.


[jira] Updated: (PIG-1518) multi file input format for loaders

2010-08-25 Thread Yan Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1518:
--

Attachment: PIG-1518.patch

Improvement on logging info.

> multi file input format for loaders
> ---
>
> Key: PIG-1518
> URL: https://issues.apache.org/jira/browse/PIG-1518
> Project: Pig
>  Issue Type: Improvement
>Reporter: Olga Natkovich
>    Assignee: Yan Zhou
> Fix For: 0.8.0
>
> Attachments: PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, 
> PIG-1518.patch, PIG-1518.patch, PIG-1518.patch
>
>
> We frequently run into the situation where Pig needs to deal with small files
> in the input. In this case a separate map is created for each file, which
> could be very inefficient.
> It would be great to have an umbrella input format that can take multiple
> files and use them in a single split. We would like to see this working with
> different data formats if possible.
> There are already a couple of input formats doing a similar thing:
> MultifileInputFormat as well as CombinedInputFormat; however, neither works
> with the new Hadoop 20 API.
> We at least want to do a feasibility study for Pig 0.8.0.


[jira] Updated: (PIG-1399) Logical Optimizer: Expression optimizor rule

2010-08-25 Thread Yan Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-1399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1399:
--

Attachment: PIG-1399.patch

rebasing on the latest trunk

> Logical Optimizer: Expression optimizor rule
> 
>
> Key: PIG-1399
> URL: https://issues.apache.org/jira/browse/PIG-1399
> Project: Pig
>  Issue Type: Sub-task
>  Components: impl
>Affects Versions: 0.7.0
>Reporter: Daniel Dai
>Assignee: Yan Zhou
> Fix For: 0.8.0
>
> Attachments: PIG-1399.patch, PIG-1399.patch, PIG-1399.patch, 
> PIG-1399.patch
>
>
> We can optimize expression in several ways:
> 1. Constant pre-calculation
> Example:
> B = filter A by a0 > 5+7;
> => B = filter A by a0 > 12;
> 2. Boolean expression optimization
> Example:
> B = filter A by not (not(a0>5) or a>10);
> => B = filter A by a0>5 and a<=10;
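
As a rough illustration of the boolean rewrite in item 2, the sketch below pushes NOT down through AND/OR using De Morgan's laws and flips comparison operators at the leaves. The class names (ExprRewriteSketch, Cmp, Bool, Not) are hypothetical and are not taken from the PIG-1399 patch; the sketch also ignores null handling and constant pre-calculation.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical mini expression tree; not Pig's logical plan classes.
final class ExprRewriteSketch {
    interface Expr {}

    // Leaf: column <op> integer constant, e.g. a0 > 5.
    static final class Cmp implements Expr {
        final String column;
        final String op;
        final int constant;
        Cmp(String column, String op, int constant) {
            this.column = column;
            this.op = op;
            this.constant = constant;
        }
    }

    // Binary boolean node; op is "and" or "or".
    static final class Bool implements Expr {
        final String op;
        final Expr left;
        final Expr right;
        Bool(String op, Expr left, Expr right) {
            this.op = op;
            this.left = left;
            this.right = right;
        }
    }

    static final class Not implements Expr {
        final Expr child;
        Not(Expr child) { this.child = child; }
    }

    // Each comparison operator mapped to its logical negation.
    private static final Map<String, String> NEGATED = new HashMap<>();
    static {
        NEGATED.put(">", "<=");
        NEGATED.put("<=", ">");
        NEGATED.put("<", ">=");
        NEGATED.put(">=", "<");
        NEGATED.put("==", "!=");
        NEGATED.put("!=", "==");
    }

    // Removes every NOT by pushing it down to the comparisons.
    static Expr simplify(Expr e) {
        return push(e, false);
    }

    private static Expr push(Expr e, boolean negate) {
        if (e instanceof Not) {
            // A NOT toggles the pending negation and disappears.
            return push(((Not) e).child, !negate);
        }
        if (e instanceof Bool) {
            Bool b = (Bool) e;
            // De Morgan: a pending negation swaps "and" and "or".
            String op = negate ? (b.op.equals("and") ? "or" : "and") : b.op;
            return new Bool(op, push(b.left, negate), push(b.right, negate));
        }
        Cmp c = (Cmp) e;
        return negate ? new Cmp(c.column, NEGATED.get(c.op), c.constant) : c;
    }

    static String show(Expr e) {
        if (e instanceof Not) {
            return "not (" + show(((Not) e).child) + ")";
        }
        if (e instanceof Bool) {
            Bool b = (Bool) e;
            return wrap(b.left) + " " + b.op + " " + wrap(b.right);
        }
        Cmp c = (Cmp) e;
        return c.column + c.op + c.constant;
    }

    // Parenthesizes nested boolean nodes so the printed form is unambiguous.
    private static String wrap(Expr e) {
        return (e instanceof Bool) ? "(" + show(e) + ")" : show(e);
    }

    public static void main(String[] args) {
        Expr filter = new Not(new Bool("or",
                new Not(new Cmp("a0", ">", 5)),
                new Cmp("a", ">", 10)));
        System.out.println(show(simplify(filter)));
    }
}
```

Building the tree for not (not(a0>5) or a>10) and calling simplify yields the form given in the example above, a0>5 and a<=10.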


RE: [jira] Commented: (PIG-1501) need to investigate the impact of compression on pig performance

2010-08-25 Thread Yan Zhou

Thanks for the quick turnaround, Thejas.

Yan

-Original Message-
From: Thejas M Nair (JIRA) [mailto:j...@apache.org] 
Sent: Wednesday, August 25, 2010 8:54 AM
To: pig-dev@hadoop.apache.org
Subject: [jira] Commented: (PIG-1501) need to investigate the impact of 
compression on pig performance


[ 
https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902484#action_12902484
 ] 

Thejas M Nair commented on PIG-1501:


+1

> need to investigate the impact of compression on pig performance
> 
>
> Key: PIG-1501
> URL: https://issues.apache.org/jira/browse/PIG-1501
> Project: Pig
>  Issue Type: Test
>Reporter: Olga Natkovich
>Assignee: Yan Zhou
> Fix For: 0.8.0
>
> Attachments: compress_perf_data.txt, compress_perf_data_2.txt, 
> PIG-1501.patch, PIG-1501.patch, PIG-1501.patch
>
>
> We would like to understand how compressing map results as well as
> reducer output in a chain of MR jobs impacts performance. We can use PigMix 
> queries for this investigation.


[jira] Updated: (PIG-1501) need to investigate the impact of compression on pig performance

2010-08-25 Thread Yan Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1501:
--

Attachment: PIG-1501.patch

Address the review comments, code rebasing on the latest trunk.

> need to investigate the impact of compression on pig performance
> 
>
> Key: PIG-1501
> URL: https://issues.apache.org/jira/browse/PIG-1501
> Project: Pig
>  Issue Type: Test
>Reporter: Olga Natkovich
>    Assignee: Yan Zhou
> Fix For: 0.8.0
>
> Attachments: compress_perf_data.txt, compress_perf_data_2.txt, 
> PIG-1501.patch, PIG-1501.patch, PIG-1501.patch
>
>
> We would like to understand how compressing map results as well as
> reducer output in a chain of MR jobs impacts performance. We can use PigMix 
> queries for this investigation.


[jira] Updated: (PIG-1518) multi file input format for loaders

2010-08-24 Thread Yan Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1518:
--

Attachment: PIG-1518.patch

Minor polish of debugging code inside comments.

> multi file input format for loaders
> ---
>
> Key: PIG-1518
> URL: https://issues.apache.org/jira/browse/PIG-1518
> Project: Pig
>  Issue Type: Improvement
>Reporter: Olga Natkovich
>    Assignee: Yan Zhou
> Fix For: 0.8.0
>
> Attachments: PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, 
> PIG-1518.patch, PIG-1518.patch
>
>
> We frequently run into the situation where Pig needs to deal with small files
> in the input. In this case a separate map is created for each file, which
> could be very inefficient.
> It would be great to have an umbrella input format that can take multiple
> files and use them in a single split. We would like to see this working with
> different data formats if possible.
> There are already a couple of input formats doing a similar thing:
> MultifileInputFormat as well as CombinedInputFormat; however, neither works
> with the new Hadoop 20 API.
> We at least want to do a feasibility study for Pig 0.8.0.


[jira] Updated: (PIG-1518) multi file input format for loaders

2010-08-24 Thread Yan Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1518:
--

Status: Patch Available  (was: Open)

> multi file input format for loaders
> ---
>
> Key: PIG-1518
> URL: https://issues.apache.org/jira/browse/PIG-1518
> Project: Pig
>  Issue Type: Improvement
>Reporter: Olga Natkovich
>    Assignee: Yan Zhou
> Fix For: 0.8.0
>
> Attachments: PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, 
> PIG-1518.patch, PIG-1518.patch
>
>
> We frequently run into the situation where Pig needs to deal with small files
> in the input. In this case a separate map is created for each file, which
> could be very inefficient.
> It would be great to have an umbrella input format that can take multiple
> files and use them in a single split. We would like to see this working with
> different data formats if possible.
> There are already a couple of input formats doing a similar thing:
> MultifileInputFormat as well as CombinedInputFormat; however, neither works
> with the new Hadoop 20 API.
> We at least want to do a feasibility study for Pig 0.8.0.


[jira] Updated: (PIG-1518) multi file input format for loaders

2010-08-24 Thread Yan Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1518:
--

Status: Open  (was: Patch Available)

> multi file input format for loaders
> ---
>
> Key: PIG-1518
> URL: https://issues.apache.org/jira/browse/PIG-1518
> Project: Pig
>  Issue Type: Improvement
>Reporter: Olga Natkovich
>    Assignee: Yan Zhou
> Fix For: 0.8.0
>
> Attachments: PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, 
> PIG-1518.patch
>
>
> We frequently run into the situation where Pig needs to deal with small files
> in the input. In this case a separate map is created for each file, which
> could be very inefficient.
> It would be great to have an umbrella input format that can take multiple
> files and use them in a single split. We would like to see this working with
> different data formats if possible.
> There are already a couple of input formats doing a similar thing:
> MultifileInputFormat as well as CombinedInputFormat; however, neither works
> with the new Hadoop 20 API.
> We at least want to do a feasibility study for Pig 0.8.0.


[jira] Updated: (PIG-1518) multi file input format for loaders

2010-08-23 Thread Yan Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1518:
--

  Status: Patch Available  (was: Open)
Release Note: 
Feature: combine splits whose sizes are smaller than the value of the property
"pig.maxCombinedSplitSize" or, if "pig.maxCombinedSplitSize" is not set, the
file system default block size of the load's location. This feature can be
turned off by setting the property "pig.noSplitCombination" to true. When such
a combination is performed, a message like "Total input paths (combined) to
process : 7" will be logged.

This feature applies when a user input, or an intermediate input, consists of
many small files that would otherwise cause many more "under-fed" mappers to be
launched and could slow down execution.

This change does not cause any backward compatibility issues, with one
exception: a loader implementation that uses the PigSplit object passed through
the prepareToRead method might need to be rebuilt, because PigSplit's
definition has been modified. However, we currently know of no external use of
the object.

In addition, if a loader implements IndexableLoadFunc, or implements 
OrderedLoadFunc and CollectableLoadFunc, its input splits won't be subject to 
possible combinations.
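
To make the size rule in the release note concrete, here is a minimal sketch of one way small splits could be grouped under a size cap. The greedy, in-order packing and the class name SplitCombinerSketch are assumptions made for illustration; this is not the algorithm in the PIG-1518 patch, and it works on plain sizes in bytes rather than on PigSplit objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper; only split sizes (in bytes) are modeled.
final class SplitCombinerSketch {

    // Groups split sizes so that each group of small splits totals at most
    // maxCombinedSize. Splits that are not smaller than the cap stay alone.
    static List<List<Long>> combine(List<Long> splitSizes, long maxCombinedSize) {
        List<List<Long>> combined = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentTotal = 0L;

        for (long size : splitSizes) {
            if (size >= maxCombinedSize) {
                // Already at or above the cap: keep it as its own split.
                combined.add(Collections.singletonList(size));
                continue;
            }
            if (!current.isEmpty() && currentTotal + size > maxCombinedSize) {
                // Adding this split would exceed the cap: close the group.
                combined.add(current);
                current = new ArrayList<>();
                currentTotal = 0L;
            }
            current.add(size);
            currentTotal += size;
        }
        if (!current.isEmpty()) {
            combined.add(current);
        }
        return combined;
    }

    public static void main(String[] args) {
        List<Long> sizes = Arrays.asList(10L, 20L, 200L, 30L, 90L, 5L);
        List<List<Long>> groups = combine(sizes, 128L);
        System.out.println(sizes.size() + " splits combined into " + groups.size());
    }
}
```

In this sketch the cap plays the role of "pig.maxCombinedSplitSize" (or the default block size when the property is unset). A real implementation would likely also consider data locality, which is left out here.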

> multi file input format for loaders
> ---
>
> Key: PIG-1518
> URL: https://issues.apache.org/jira/browse/PIG-1518
> Project: Pig
>  Issue Type: Improvement
>Reporter: Olga Natkovich
>Assignee: Yan Zhou
> Fix For: 0.8.0
>
> Attachments: PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, 
> PIG-1518.patch
>
>
> We frequently run into the situation where Pig needs to deal with small files
> in the input. In this case a separate map is created for each file, which
> could be very inefficient.
> It would be great to have an umbrella input format that can take multiple
> files and use them in a single split. We would like to see this working with
> different data formats if possible.
> There are already a couple of input formats doing a similar thing:
> MultifileInputFormat as well as CombinedInputFormat; however, neither works
> with the new Hadoop 20 API.
> We at least want to do a feasibility study for Pig 0.8.0.


[jira] Updated: (PIG-1518) multi file input format for loaders

2010-08-23 Thread Yan Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1518:
--

Attachment: PIG-1518.patch

Fix a typo; rebase on the latest trunk.

> multi file input format for loaders
> ---
>
> Key: PIG-1518
> URL: https://issues.apache.org/jira/browse/PIG-1518
> Project: Pig
>  Issue Type: Improvement
>Reporter: Olga Natkovich
>    Assignee: Yan Zhou
> Fix For: 0.8.0
>
> Attachments: PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, 
> PIG-1518.patch
>
>
> We frequently run into the situation where Pig needs to deal with small files
> in the input. In this case a separate map is created for each file, which
> could be very inefficient.
> It would be great to have an umbrella input format that can take multiple
> files and use them in a single split. We would like to see this working with
> different data formats if possible.
> There are already a couple of input formats doing a similar thing:
> MultifileInputFormat as well as CombinedInputFormat; however, neither works
> with the new Hadoop 20 API.
> We at least want to do a feasibility study for Pig 0.8.0.


[jira] Updated: (PIG-1518) multi file input format for loaders

2010-08-23 Thread Yan Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1518:
--

Attachment: PIG-1518.patch

The add method of PigSplit is removed. The debug code is left in to facilitate
future debugging work. The use of initNextRecordReader is largely cloned from
org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader, and I'll leave it
as is too.

> multi file input format for loaders
> ---
>
> Key: PIG-1518
> URL: https://issues.apache.org/jira/browse/PIG-1518
> Project: Pig
>  Issue Type: Improvement
>Reporter: Olga Natkovich
>Assignee: Yan Zhou
> Fix For: 0.8.0
>
> Attachments: PIG-1518.patch, PIG-1518.patch, PIG-1518.patch
>
>
> We frequently run into the situation where Pig needs to deal with small files
> in the input. In this case a separate map is created for each file, which
> could be very inefficient.
> It would be great to have an umbrella input format that can take multiple
> files and use them in a single split. We would like to see this working with
> different data formats if possible.
> There are already a couple of input formats doing a similar thing:
> MultifileInputFormat as well as CombinedInputFormat; however, neither works
> with the new Hadoop 20 API.
> We at least want to do a feasibility study for Pig 0.8.0.


[jira] Updated: (PIG-1399) Logical Optimizer: Expression optimizor rule

2010-08-23 Thread Yan Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-1399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1399:
--

Attachment: PIG-1399.patch

Internal Hudson results:

 [exec] +1 overall.
 [exec]
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec]
 [exec] +1 tests included.  The patch appears to include 3 new or 
modified tests.
 [exec]
 [exec] +1 javadoc.  The javadoc tool did not generate any warning 
messages.
 [exec]
 [exec] +1 javac.  The applied patch does not increase the total number 
of javac compiler warnings.
 [exec]
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs 
warnings.
 [exec]
 [exec] +1 release audit.  The applied patch does not increase the 
total number of release audit warnings.

All core tests also pass.

> Logical Optimizer: Expression optimizor rule
> 
>
> Key: PIG-1399
> URL: https://issues.apache.org/jira/browse/PIG-1399
> Project: Pig
>  Issue Type: Sub-task
>  Components: impl
>Affects Versions: 0.7.0
>Reporter: Daniel Dai
>Assignee: Yan Zhou
> Fix For: 0.8.0
>
> Attachments: PIG-1399.patch, PIG-1399.patch, PIG-1399.patch
>
>
> We can optimize expression in several ways:
> 1. Constant pre-calculation
> Example:
> B = filter A by a0 > 5+7;
> => B = filter A by a0 > 12;
> 2. Boolean expression optimization
> Example:
> B = filter A by not (not(a0>5) or a>10);
> => B = filter A by a0>5 and a<=10;


[jira] Commented: (PIG-1501) need to investigate the impact of compression on pig performance

2010-08-20 Thread Yan Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900950#action_12900950
 ] 

Yan Zhou commented on PIG-1501:
---

The internal Hudson results are as follows:

 [exec] -1 overall.
 [exec]
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec]
 [exec] +1 tests included.  The patch appears to include 9 new or 
modified tests.
 [exec]
 [exec] +1 javadoc.  The javadoc tool did not generate any warning 
messages.
 [exec]
 [exec] -1 javac.  The applied patch generated 162 javac compiler 
warnings (more than the trunk's current 156 warnings).
 [exec]
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs 
warnings.
 [exec]
 [exec] -1 release audit.  The applied patch generated 427 release 
audit warnings (more than the trunk's current 425 warnings).

The 6 javac warnings are from the use of the deprecated PigMapReduce.sJobConf
field. But that deprecation is intended for external use only, and internal
use should be OK.

The 2 release audit warnings are on two html files, SampleOptimizer.html and 
org.apache.pig.impl.util.Utils.html.

> need to investigate the impact of compression on pig performance
> 
>
> Key: PIG-1501
> URL: https://issues.apache.org/jira/browse/PIG-1501
> Project: Pig
>  Issue Type: Test
>Reporter: Olga Natkovich
>Assignee: Yan Zhou
> Fix For: 0.8.0
>
> Attachments: compress_perf_data.txt, compress_perf_data_2.txt, 
> PIG-1501.patch, PIG-1501.patch
>
>
> We would like to understand how compressing map results as well as
> reducer output in a chain of MR jobs impacts performance. We can use PigMix 
> queries for this investigation.


[jira] Updated: (PIG-1501) need to investigate the impact of compression on pig performance

2010-08-20 Thread Yan Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1501:
--

Attachment: PIG-1501.patch

The compression codec is now configurable as gzip or lzo, plus some minor changes.

> need to investigate the impact of compression on pig performance
> 
>
> Key: PIG-1501
> URL: https://issues.apache.org/jira/browse/PIG-1501
> Project: Pig
>  Issue Type: Test
>Reporter: Olga Natkovich
>    Assignee: Yan Zhou
> Fix For: 0.8.0
>
> Attachments: compress_perf_data.txt, compress_perf_data_2.txt, 
> PIG-1501.patch, PIG-1501.patch
>
>
> We would like to understand how compressing map results as well as
> reducer output in a chain of MR jobs impacts performance. We can use PigMix 
> queries for this investigation.


[jira] Updated: (PIG-1399) Logical Optimizer: Expression optimizor rule

2010-08-20 Thread Yan Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-1399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1399:
--

Attachment: PIG-1399.patch

rebased on the latest trunk.

> Logical Optimizer: Expression optimizor rule
> 
>
> Key: PIG-1399
> URL: https://issues.apache.org/jira/browse/PIG-1399
> Project: Pig
>  Issue Type: Sub-task
>  Components: impl
>Affects Versions: 0.7.0
>Reporter: Daniel Dai
>Assignee: Yan Zhou
> Fix For: 0.8.0
>
> Attachments: PIG-1399.patch, PIG-1399.patch
>
>
> We can optimize expression in several ways:
> 1. Constant pre-calculation
> Example:
> B = filter A by a0 > 5+7;
> => B = filter A by a0 > 12;
> 2. Boolean expression optimization
> Example:
> B = filter A by not (not(a0>5) or a>10);
> => B = filter A by a0>5 and a<=10;


[jira] Updated: (PIG-1518) multi file input format for loaders

2010-08-20 Thread Yan Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1518:
--

Attachment: PIG-1518.patch

Style changes, Hudson pass, plus other minor changes. Internal Hudson results:

[exec] -1 overall.
 [exec]
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec]
 [exec] +1 tests included.  The patch appears to include 3 new or 
modified tests.
 [exec]
 [exec] +1 javadoc.  The javadoc tool did not generate any warning 
messages.
 [exec]
 [exec] +1 javac.  The applied patch does not increase the total number 
of javac compiler warnings.
 [exec]
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs 
warnings.
 [exec]
 [exec] -1 release audit.  The applied patch generated 427 release 
audit warnings (more than the trunk's current 425 warnings).


The release audit warnings are on two html files: PigInputFormat.html and
PigRecordReader.html.

> multi file input format for loaders
> ---
>
> Key: PIG-1518
> URL: https://issues.apache.org/jira/browse/PIG-1518
> Project: Pig
>  Issue Type: Improvement
>Reporter: Olga Natkovich
>Assignee: Yan Zhou
> Fix For: 0.8.0
>
> Attachments: PIG-1518.patch, PIG-1518.patch
>
>
> We frequently run into the situation where Pig needs to deal with small files
> in the input. In this case a separate map is created for each file, which
> could be very inefficient.
> It would be great to have an umbrella input format that can take multiple
> files and use them in a single split. We would like to see this working with
> different data formats if possible.
> There are already a couple of input formats doing a similar thing:
> MultifileInputFormat as well as CombinedInputFormat; however, neither works
> with the new Hadoop 20 API.
> We at least want to do a feasibility study for Pig 0.8.0.


[jira] Commented: (PIG-1518) multi file input format for loaders

2010-08-18 Thread Yan Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900123#action_12900123
 ] 

Yan Zhou commented on PIG-1518:
---

No. It does not work inside an optimizer, since it does not change the
logical or physical plans the way the other optimizers do.

> multi file input format for loaders
> ---
>
> Key: PIG-1518
> URL: https://issues.apache.org/jira/browse/PIG-1518
> Project: Pig
>  Issue Type: Improvement
>Reporter: Olga Natkovich
>Assignee: Yan Zhou
> Fix For: 0.8.0
>
> Attachments: PIG-1518.patch
>
>
> We frequently run into the situation where Pig needs to deal with small files
> in the input. In this case a separate map is created for each file, which
> could be very inefficient.
> It would be great to have an umbrella input format that can take multiple
> files and use them in a single split. We would like to see this working with
> different data formats if possible.
> There are already a couple of input formats doing a similar thing:
> MultifileInputFormat as well as CombinedInputFormat; however, neither works
> with the new Hadoop 20 API.
> We at least want to do a feasibility study for Pig 0.8.0.


[jira] Commented: (PIG-1518) multi file input format for loaders

2010-08-18 Thread Yan Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899888#action_12899888
 ] 

Yan Zhou commented on PIG-1518:
---

In summary, split combination is controlled through the following JVM
properties:

pig.maxCombinedSplitSize: specifies the maximum combined split size in bytes.
By default, it is the load filesystem's default block size;

pig.splitCombination: takes values of "false" and "true". The default is 
"true". "false" will disable the split combination.
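
A minimal sketch of how these two properties could be resolved into one effective limit is shown below. Only the property names and defaults come from the summary above; the class name, the method, and the -1 return convention for "disabled" are hypothetical, and the default block size is passed in by the caller rather than read from a file system.

```java
import java.util.Properties;

// Hypothetical helper; not part of Pig.
final class SplitCombinationSettingsSketch {

    // Returns the maximum combined split size in bytes,
    // or -1 when split combination is disabled.
    static long maxCombinedSplitSize(Properties props, long defaultBlockSize) {
        // "pig.splitCombination" defaults to "true"; "false" disables the feature.
        boolean enabled =
                Boolean.parseBoolean(props.getProperty("pig.splitCombination", "true"));
        if (!enabled) {
            return -1L;
        }
        // Fall back to the file system default block size when no size is set.
        String configured = props.getProperty("pig.maxCombinedSplitSize");
        return configured == null ? defaultBlockSize : Long.parseLong(configured.trim());
    }

    public static void main(String[] args) {
        // Picks up values passed on the command line as -Dname=value.
        long limit = maxCombinedSplitSize(System.getProperties(), 64L * 1024 * 1024);
        System.out.println("effective limit: " + limit);
    }
}
```

Passing System.getProperties() matches the description of these settings as JVM properties; the 64 MB fallback in main is only a placeholder value.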

> multi file input format for loaders
> ---
>
> Key: PIG-1518
> URL: https://issues.apache.org/jira/browse/PIG-1518
> Project: Pig
>  Issue Type: Improvement
>Reporter: Olga Natkovich
>Assignee: Yan Zhou
> Fix For: 0.8.0
>
> Attachments: PIG-1518.patch
>
>
> We frequently run into the situation where Pig needs to deal with small files
> in the input. In this case a separate map is created for each file, which
> could be very inefficient.
> It would be great to have an umbrella input format that can take multiple
> files and use them in a single split. We would like to see this working with
> different data formats if possible.
> There are already a couple of input formats doing a similar thing:
> MultifileInputFormat as well as CombinedInputFormat; however, neither works
> with the new Hadoop 20 API.
> We at least want to do a feasibility study for Pig 0.8.0.


[jira] Updated: (PIG-1518) multi file input format for loaders

2010-08-18 Thread Yan Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1518:
--

Attachment: PIG-1518.patch

> multi file input format for loaders
> ---
>
> Key: PIG-1518
> URL: https://issues.apache.org/jira/browse/PIG-1518
> Project: Pig
>  Issue Type: Improvement
>Reporter: Olga Natkovich
>    Assignee: Yan Zhou
> Fix For: 0.8.0
>
> Attachments: PIG-1518.patch
>
>
> We frequently run into the situation where Pig needs to deal with small files
> in the input. In this case a separate map is created for each file, which
> could be very inefficient.
> It would be great to have an umbrella input format that can take multiple
> files and use them in a single split. We would like to see this working with
> different data formats if possible.
> There are already a couple of input formats doing a similar thing:
> MultifileInputFormat as well as CombinedInputFormat; however, neither works
> with the new Hadoop 20 API.
> We at least want to do a feasibility study for Pig 0.8.0.


[jira] Commented: (PIG-1518) multi file input format for loaders

2010-08-17 Thread Yan Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899609#action_12899609
 ] 

Yan Zhou commented on PIG-1518:
---

The formatting of the table in the last comment is a bit off: both headers 
should be right-shifted by one column.


[jira] Commented: (PIG-1518) multi file input format for loaders

2010-08-17 Thread Yan Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899605#action_12899605
 ] 

Yan Zhou commented on PIG-1518:
---

One experimental result on a 15-node cluster of 2 x Xeon L5420 2.50GHz/16G RAM 
boxes is as follows:

Query:

register pigperf.jar;
A = load '/user/pig/tests/data/pigmix/page_views'
    using org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
    as (user, action, timespent, query_term, ip_addr, timestamp,
        estimated_revenue, page_info, page_links);
B = foreach A generate user, (double)estimated_revenue;
B1 = distinct B;
alpha = load '/user/pig/tests/data/pigmix/users' using PigStorage('\u0001')
    as (name, phone, address, city, state, zip);
beta = foreach alpha generate name;
C = join beta by name, B1 by user parallel 300;
D = group C by $0 parallel 40;
E = foreach D generate group, SUM(C.estimated_revenue);
store E into 'spliCombo2.out';

It creates 3 map/reduce jobs.

No Split Combination:

|| Job || Metric       || Mappers || Reducers ||
| 1    | number        | 120      | 300       |
| 1    | elapsed time  | 24s      | 2m43s     |
| 2    | number        | 301      | 300       |
| 2    | elapsed time  | 46s      | 3m11s     |
| 3    | number        | 300      | 40        |
| 3    | elapsed time  | 38s      | 53s       |

Total elapsed time: 7m36s


With Split Combination:

|| Job || Metric       || Mappers || Reducers ||
| 1    | number        | 120      | 300       |
| 1    | elapsed time  | 22s      | 2m49s     |
| 2    | number        | 3        | 300       |
| 2    | elapsed time  | 27s      | 2m46s     |
| 3    | number        | 1        | 40        |
| 3    | elapsed time  | 17s      | 24s       |

Total elapsed time: 7m5s
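
As a rough check of the figures above, the following sketch totals the mapper 
counts and compares the elapsed times. The numbers are copied from the two 
tables; the helper and variable names are illustrative only.

```python
def to_seconds(text):
    """Convert a duration such as '7m36s' or '24s' to seconds."""
    minutes, _, rest = text.rpartition("m")
    return int(minutes or 0) * 60 + int(rest.rstrip("s"))

# Mapper counts of the three map/reduce jobs, as reported in the tables.
mappers_before = [120, 301, 300]
mappers_after = [120, 3, 1]

total_before = to_seconds("7m36s")
total_after = to_seconds("7m5s")
saved = total_before - total_after

print(sum(mappers_before), sum(mappers_after))       # 721 124
print(saved, round(100 * saved / total_before, 1))   # 31 6.8
```

In this run the number of map tasks drops from 721 to 124, while the total 
elapsed time improves by about 31 seconds, or roughly 7%.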


[jira] Commented: (PIG-1518) multi file input format for loaders

2010-08-17 Thread Yan Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899445#action_12899445
 ] 

Yan Zhou commented on PIG-1518:
---

Another approach is to mark splits as uncombinable only when necessary. 
Specifically, MergeJoinIndexer and the base load in mapside cogroup need to be 
excluded from the split combination. 

Breaking backward compatibility is probably too much of a risk to take. 
Meanwhile, OrderedLoadFunc is marked as "evolving", which leaves some headroom 
for future semantic refinements.


[jira] Commented: (PIG-1518) multi file input format for loaders

2010-08-13 Thread Yan Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898490#action_12898490
 ] 

Yan Zhou commented on PIG-1518:
---

There is a bigger question at hand. The semantics of OrderedLoadFunc are that 
the splits are totally ordered, and BinStorage, InterStorage and PigStorage all 
implement that interface through FileInputLoadFunc. Since the combination of 
splits as conceived here will definitely destroy the split ordering, disabling 
the combination for these storages would make the feature virtually useless 
for a majority of use cases.

On the other hand, I see no use of the comparison capability except in 
MergeJoinIndexer's getNext() method, which makes me wonder whether 
OrderedLoadFunc can be removed from FileInputLoadFunc. Semantically, 
FileInputLoadFunc should not support the ordering of splits, as Hadoop's 
FileInputFormat doesn't. When a need arises, as in MergeJoinIndexer, we can 
add that extension on. But the change may incur some backward compatibility 
issues.
I'm now soliciting comments in this area.


[jira] Commented: (PIG-1518) multi file input format for loaders

2010-08-12 Thread Yan Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897887#action_12897887
 ] 

Yan Zhou commented on PIG-1518:
---

During the merge process, any empty splits will be skipped. Currently empty 
splits are generated for empty files, which is unnecessary in the first 
place.


[jira] Commented: (PIG-1501) need to investigate the impact of compression on pig performance

2010-08-11 Thread Yan Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897496#action_12897496
 ] 

Yan Zhou commented on PIG-1501:
---

Please refer to HADOOP-3315 for the overall SequenceFile vs. TFile comparison. 
It appears that, for compressed data, TFile performs better than SequenceFile.

> need to investigate the impact of compression on pig performance
> 
>
> Key: PIG-1501
> URL: https://issues.apache.org/jira/browse/PIG-1501
> Project: Pig
>  Issue Type: Test
>Reporter: Olga Natkovich
>Assignee: Yan Zhou
> Fix For: 0.8.0
>
> Attachments: compress_perf_data.txt, compress_perf_data_2.txt, 
> PIG-1501.patch
>
>
> We would like to understand how compressing map results as well as 
> reducer output in a chain of MR jobs impacts performance. We can use PigMix 
> queries for this investigation.


[jira] Commented: (PIG-1518) multi file input format for loaders

2010-08-11 Thread Yan Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897493#action_12897493
 ] 

Yan Zhou commented on PIG-1518:
---

Right, map-side cogroup needs its input to be sorted, but only the "side 
inputs" need to be able to seek on a key; the "base input" only needs all 
duplicates of a key to be present in the same mapper. I'll mark the "side 
inputs" as non-combinable.
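
To illustrate why the base input tolerates combination, here is a minimal 
sketch. It assumes each split is simply a list of keys and that every key 
lives in exactly one original split; combine() and keys_confined() are 
hypothetical helpers, not Pig APIs.

```python
def combine(splits, groups):
    """Concatenate whole splits according to `groups` (lists of split indexes)."""
    return [[key for i in group for key in splits[i]] for group in groups]

def keys_confined(splits):
    """Return True if no key appears in more than one split."""
    seen = {}
    for index, split in enumerate(splits):
        for key in split:
            if seen.setdefault(key, index) != index:
                return False
    return True

original = [[1, 1, 2], [3, 3], [4], [5, 5, 5]]
merged = combine(original, [[2, 0], [3, 1]])

print(merged)                           # [[4, 1, 1, 2], [5, 5, 5, 3, 3]]
print(keys_confined(merged))            # True
print(merged[0] == sorted(merged[0]))   # False
```

Because whole splits are concatenated, all duplicates of a key stay together, 
which is all the base input needs. The combined split is no longer sorted, 
though, which is why an input that must seek on a key cannot be combined.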


[jira] Commented: (PIG-1518) multi file input format for loaders

2010-08-10 Thread Yan Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897085#action_12897085
 ] 

Yan Zhou commented on PIG-1518:
---

The pseudo code of the combination op is as follows:

for each node of the nodes (sorted in the order of ascending sizes) {
  while the node's split list (sorted in the order of descending sizes) is
      not empty {
    find the biggest splits that can be combined with the first split of the
        list of the splits;
    if the accumulated split size is >= half of the limit {
      generate a combined split;
      remove the accumulated splits from the node's split list;
      clear the accumulated split list;
    } else {
      break;
    }
  }
}

// leftover combination
for each node of the nodes {
  for each split of the node's split list {
    add the split to a leftover list;
  }
}

for each split in the leftover list {
  if accumulated split size is >= limit {
    generate a combined split;
    remove the accumulated splits from the node's split list;
    clear the accumulated split list;
  }
  if it is the last split in the leftover list {
    try to see if it can be added with an existing combined split;
    if not, generate a combined split on the accumulated splits;
  }
}

The complexity is n*log(n) with n being the number of original splits that are 
smaller than the limit.
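
The pseudo code above can be sketched in runnable form as follows. This is a 
simplified illustration and not the code in the patch: it assumes each split 
is a (node, size) pair with a single location, orders nodes by their total 
split size, and in the leftover pass simply closes a combined split once the 
next split would exceed the limit. The name combine_splits is hypothetical.

```python
from collections import defaultdict

def combine_splits(splits, limit):
    """Group (node, size) splits into combined splits of at most `limit`."""
    by_node = defaultdict(list)
    for node, size in splits:
        by_node[node].append((node, size))

    combined, leftovers = [], []
    # Pass 1: node-local combination, nodes with the least data first.
    for node in sorted(by_node, key=lambda n: sum(s for _, s in by_node[n])):
        pending = sorted(by_node[node], key=lambda s: s[1], reverse=True)
        while pending:
            # Greedily add the biggest splits that still fit with the first one.
            picked, total = [0], pending[0][1]
            for i in range(1, len(pending)):
                if total + pending[i][1] <= limit:
                    picked.append(i)
                    total += pending[i][1]
            if 2 * total < limit:
                break  # too small on its own; defer to the leftover pass
            combined.append([pending[i] for i in picked])
            chosen = set(picked)
            pending = [s for i, s in enumerate(pending) if i not in chosen]
        leftovers.extend(pending)

    # Pass 2: combine leftovers regardless of data locality.
    group, total = [], 0
    for split in leftovers:
        if group and total + split[1] > limit:
            combined.append(group)
            group, total = [], 0
        group.append(split)
        total += split[1]
    if group:
        combined.append(group)
    return combined

example = [("a", 40), ("a", 30), ("a", 20), ("b", 10), ("c", 10)]
for group in combine_splits(example, 64):
    print(group)
# [('a', 40), ('a', 20)]
# [('b', 10), ('c', 10), ('a', 30)]
```

With a limit of 64, the two splits on node "a" that fit together form a 
node-local combined split, and everything too small to stand alone ends up in 
one leftover split.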


[jira] Updated: (PIG-1501) need to investigate the impact of compression on pig performance

2010-08-10 Thread Yan Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1501:
--

Attachment: PIG-1501.patch


[jira] Commented: (PIG-1501) need to investigate the impact of compression on pig performance

2010-08-10 Thread Yan Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897005#action_12897005
 ] 

Yan Zhou commented on PIG-1501:
---

The default is *not* to compress the intermediate data, which is the existing 
behavior.

RCFile is just a bit better than TFile in terms of compression ratio. In terms 
of performance, the difference is within background noise. Stitching costs 
should be minimal. Actually, the full "projection" is the biggest advantage of 
RCFile over other columnar storage like Zebra. I was surprised to see that the 
compression improvement over TFile is marginal. The only cause I can think of 
is that the compression ratio is too sensitive to the data to pre-determine or 
even pre-estimate.

LZO is under the GPL, but it appears that the Hadoop installation has it, at 
least in my test cluster.


[jira] Commented: (PIG-1501) need to investigate the impact of compression on pig performance

2010-08-09 Thread Yan Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12896620#action_12896620
 ] 

Yan Zhou commented on PIG-1501:
---

Unless any objection is raised in the coming week, I'll go with LZO 
compression on TFile, with compression disabled by default, which preserves 
the old behavior.


[jira] Updated: (PIG-1501) need to investigate the impact of compression on pig performance

2010-08-09 Thread Yan Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1501:
--

Attachment: compress_perf_data_2.txt

The data set in the last tests was so small that the performance difference 
was lost in background noise. This test case generates more temporary data.

In summary, lzo yields about a 3% compression ratio and a 4x speed 
improvement over uncompressed; gzip yields a compression ratio of less than 1% 
but is 1%-2% slower than uncompressed. This is in line with the general 
observation that gzip compresses better but performs worse.


[jira] Updated: (PIG-1496) Mandatory rule ImplicitSplitInserter

2010-08-04 Thread Yan Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-1496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1496:
--

Attachment: PIG-1496.patch

> Mandatory rule ImplicitSplitInserter
> 
>
> Key: PIG-1496
> URL: https://issues.apache.org/jira/browse/PIG-1496
> Project: Pig
>  Issue Type: Sub-task
>  Components: impl
>Affects Versions: 0.8.0
>Reporter: Daniel Dai
>Assignee: Yan Zhou
> Fix For: 0.8.0
>
> Attachments: PIG-1496.patch, PIG-1496.patch
>
>
> Need to migrate ImplicitSplitInserter to new logical optimizer.


[jira] Updated: (PIG-1496) Mandatory rule ImplicitSplitInserter

2010-08-04 Thread Yan Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-1496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1496:
--

Status: Patch Available  (was: Open)


[jira] Updated: (PIG-1496) Mandatory rule ImplicitSplitInserter

2010-08-04 Thread Yan Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-1496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1496:
--

Attachment: PIG-1496.patch

More comments in code per the reviewer's comment.


[jira] Updated: (PIG-1496) Mandatory rule ImplicitSplitInserter

2010-08-04 Thread Yan Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-1496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1496:
--

Attachment: (was: PIG-1496.patch)


[jira] Commented: (PIG-1518) multi file input format for loaders

2010-08-04 Thread Yan Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12895338#action_12895338
 ] 

Yan Zhou commented on PIG-1518:
---

To provide a safety valve for any input formats that might dislike the 
combination of their splits, a boolean property, pig.splitcombinaton, will be 
provided to allow for disabling this feature. The default value will be true.


[jira] Commented: (PIG-1518) multi file input format for loaders

2010-08-04 Thread Yan Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12895335#action_12895335
 ] 

Yan Zhou commented on PIG-1518:
---

The combination algorithm currently does not consider rack-locality, as the 
generic underlying input splits do not carry the rack info. For more specific 
input splits like FileSplit, the rack info is available, thus allowing for 
generation of combined splits with consideration of rack-locality. But this 
might be out of scope for 0.8, and a separate JIRA, PIG-1535, has been filed 
for that purpose.


[jira] Created: (PIG-1535) Combined input splits need to consider rack-locality for the underlying splits of rack info.

2010-08-04 Thread Yan Zhou (JIRA)

Combined input splits need to consider rack-locality for the underlying splits 
of rack info.


 Key: PIG-1535
 URL: https://issues.apache.org/jira/browse/PIG-1535
 Project: Pig
  Issue Type: Improvement
Reporter: Yan Zhou


PIG-1518 will add support for combining multiple small splits into bigger but 
fewer splits. In doing so, the underlying generic input split's node-locality 
is consulted to maximize the data node-locality of the "big" splits. The 
rack-locality info is unavailable because the generic input splits do not 
currently carry it. MAPREDUCE-1698 has been filed to address the lack of rack 
info in InputSplit. On the other hand, for many other types of input splits 
the rack info is available. FileSplit is an example. Future Howl input splits 
will also contain the rack-locality info. 

In summary, until MAPREDUCE-1698 is resolved, if ever, the small splits of 
some specific input split types could be combined with awareness of 
rack-locality, probably using the same or a similar algorithm as 
CombineFileInputFormat.

But that would mean non-trivial extra work on top of PIG-1518 and may be out 
of reach for 0.8, hence a separate JIRA.
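
As a purely illustrative sketch of what rack awareness could add, the 
following packs leftover splits rack by rack before they would otherwise be 
mixed freely. It assumes a node-to-rack mapping is available (rack_of); that 
mapping and the function name are hypothetical, and this is not the 
CombineFileInputFormat algorithm itself.

```python
from collections import defaultdict

def combine_by_rack(leftovers, rack_of, limit):
    """Pack leftover (node, size) splits rack by rack, each pack <= limit."""
    by_rack = defaultdict(list)
    for node, size in leftovers:
        by_rack[rack_of[node]].append((node, size))

    packs = []
    for rack in sorted(by_rack):
        pack, total = [], 0
        for split in by_rack[rack]:
            # Close the current pack when the next split would overflow it.
            if pack and total + split[1] > limit:
                packs.append(pack)
                pack, total = [], 0
            pack.append(split)
            total += split[1]
        if pack:
            packs.append(pack)
    return packs

rack_of = {"n1": "r1", "n2": "r1", "n3": "r2"}
packs = combine_by_rack([("n1", 10), ("n3", 20), ("n2", 15)], rack_of, 64)
print(packs)  # [[('n1', 10), ('n2', 15)], [('n3', 20)]]
```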


[jira] Commented: (PIG-1518) multi file input format for loaders

2010-08-02 Thread Yan Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12894778#action_12894778
 ] 

Yan Zhou commented on PIG-1518:
---

In contrast with Hive, where CombineFileInputFormat is used to generate input 
splits on the underlying storage formats, Pig's combined splits work on top of 
the splits generated by the underlying loaders. In other words, Hive's input 
splits are CombineFileSplits that create record readers of the underlying 
storage formats, while Pig's combined input splits contain the underlying 
storage's splits.

CombineFileRecordReader would have been reusable if not for its support only 
in 0.18 and its constructor requiring a CombineFileSplit argument instead of 
an InputSplit (MAPREDUCE-955).


[jira] Commented: (PIG-1518) multi file input format for loaders

2010-07-30 Thread Yan Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12894205#action_12894205
 ] 

Yan Zhou commented on PIG-1518:
---

CombineFileInputFormat, which replaces the deprecated MultiFileInputFormat, 
batches small files on the basis of block locality. For Pig, this umbrella 
input format will have to work with generic input formats, for which the block 
info is unavailable but the data node and size info are present to let M/R 
make scheduling decisions. In other words, Pig cannot break up the original 
splits to "work inside" them but can only use the original splits as building 
blocks for the combined input splits.

Consequently, this combine input format will be holding multiple generic input 
splits so that each combined split's size is bound by a configured limit of, 
say, pig.maxsplitsize, with the default value of the HDFS block size of the 
file system the load source sits in.

However, due to the sortedness constraints on the tables in merge join, split 
combination will not be used for any loads that feed a merge join. For mapside 
cogroup or mapside group by, though, the splits can be combined because each 
split is only required to contain all duplicates of a key, and combining 
splits preserves that invariant.

During combination, the splits on the same data node will be merged as much as 
possible. Leftovers will be merged without regard to data locality. Of all the 
data nodes used, those with fewer splits will be merged before those with more 
splits, so as to minimize the leftovers on the data nodes with fewer splits. 
On each data node, a greedy approach is adopted so that the largest splits are 
merged before smaller ones, because smaller splits are easier to merge among 
themselves later. 
As a result, the implementation will maintain a list of data hosts, sorted by 
number of splits, each holding a list of its original splits sorted by size, 
to perform the above operations efficiently. The complexity should be linear 
in the number of original splits.

Note that for data locality, we just honor whatever the generic input split's 
getLocations() method produces. Any particular input split implementation may 
or may not actually hold that property. For instance, CombineFileInputFormat 
will combine node-local or rack-local blocks into a split. Essentially, this 
Pig container input split works on whatever perception of data locality the 
underlying loader provides.

On the implementation side, PigSplit will no longer hold a single wrapped 
InputSplit instance but a new CombinedInputSplit instance. Accordingly, 
PigRecordReader will hold a list of wrapped record readers rather than a 
single one, and its nextKeyValue() will use the wrapped record readers to 
fetch the next values.
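
A record reader that delegates to a list of wrapped readers might look 
roughly like the sketch below. It does not use the Hadoop API: SimpleReader, 
ChainedReader and ListReader are hypothetical names, and the sketch assumes 
the wrapped readers are drained one after another, in list order.

```java
import java.util.*;

/** Hypothetical, simplified reader interface (not Hadoop's RecordReader). */
interface SimpleReader<T> {
    boolean nextKeyValue();
    T getCurrentValue();
}

/** Reads from a list of wrapped readers, moving on when one is exhausted. */
final class ChainedReader<T> implements SimpleReader<T> {
    private final List<SimpleReader<T>> readers;
    private int current = 0;

    ChainedReader(List<SimpleReader<T>> readers) {
        this.readers = readers;
    }

    @Override
    public boolean nextKeyValue() {
        while (current < readers.size()) {
            if (readers.get(current).nextKeyValue()) {
                return true;
            }
            current++; // this wrapped reader is done; try the next one
        }
        return false;
    }

    @Override
    public T getCurrentValue() {
        if (current >= readers.size()) {
            throw new NoSuchElementException("all wrapped readers are exhausted");
        }
        return readers.get(current).getCurrentValue();
    }
}

/** Trivial in-memory reader, used only to exercise the sketch. */
final class ListReader<T> implements SimpleReader<T> {
    private final Iterator<T> it;
    private T value;

    ListReader(List<T> values) {
        this.it = values.iterator();
    }

    @Override
    public boolean nextKeyValue() {
        if (!it.hasNext()) {
            return false;
        }
        value = it.next();
        return true;
    }

    @Override
    public T getCurrentValue() {
        return value;
    }
}
```

Empty wrapped readers are skipped transparently, so the caller sees one 
continuous stream of values regardless of how many original splits were 
combined.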

Risks include: 1) the test verifications may need major changes, since this 
optimization may cause major ordering changes in results; 2) since 
LoadFunc.prepareRead() takes a PigSplit argument, there might be a backward 
compatibility issue as PigSplit changes its wrapped input split to the 
combined input split. This should be very unlikely, though, as the only known 
use of the PigSplit argument is the internal "index loader" for the right 
table in merge join.

> multi file input format for loaders
> ---
>
> Key: PIG-1518
> URL: https://issues.apache.org/jira/browse/PIG-1518
> Project: Pig
>  Issue Type: Improvement
>Reporter: Olga Natkovich
>Assignee: Yan Zhou
> Fix For: 0.8.0
>
>
> We frequently run into the situation where Pig needs to deal with small 
> files in the input. In this case a separate map is created for each file, 
> which could be very inefficient. 
> It would be great to have an umbrella input format that can take multiple 
> files and use them in a single split. We would like to see this working with 
> different data formats if possible.
> There are already a couple of input formats doing a similar thing: 
> MultifileInputFormat as well as CombinedInputFormat; however, neither works 
> with the new Hadoop 20 API. 
> We at least want to do a feasibility study for Pig 0.8.0.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-1501) need to investigate the impact of compression on pig performance

2010-07-29 Thread Yan Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1501:
--

Attachment: compress_perf_data.txt

The formatting in the JIRA comment seems to be off, so I'm attaching the test 
results as a file.

> need to investigate the impact of compression on pig performance
> 
>
> Key: PIG-1501
> URL: https://issues.apache.org/jira/browse/PIG-1501
> Project: Pig
>  Issue Type: Test
>Reporter: Olga Natkovich
>Assignee: Yan Zhou
> Fix For: 0.8.0
>
> Attachments: compress_perf_data.txt
>
>
> We would like to understand how compressing map results as well as 
> reducer output in a chain of MR jobs impacts performance. We can use PigMix 
> queries for this investigation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-1501) need to investigate the impact of compression on pig performance

2010-07-29 Thread Yan Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893746#action_12893746
 ] 

Yan Zhou commented on PIG-1501:
---

gzip and lzo2 were tried as the compression codecs; TFile and RCFile were 
used as storage formats. The tests are PigMix's L3 and L11, plus a variation 
of L3 with full projection, hereafter referred to as L3_1, in order to expand 
the temporary data size. (In some cases multiple runs were executed, 
particularly where system fluctuations were suspected.) End-to-end elapsed 
times were recorded.

The results are on a 15-node cluster of 2 x Xeon L5420 2.50GHz/16G RAM boxes:

               uncompressed    TFile(lzo)   TFile(gzip)  RCFile(lzo2)
L3    run 1       133684504      19674398      11513958      18092681
                      1'40"         1'45"         1'40"         1'56"
      run 2                                                  18094161
                                                                1'46"

L3_1  run 1      3889095541    3697681875    2637742581    3675818160
                      3'10"          4'4"         3'25"         3'58"
      run 2                    3697666122                  3675816707
                                    3'10"                       3'22"
      run 3                    3697674414
                                     3'5"

L11   run 1        25878480      21368784      15233146      21112892
                      1'52"         1'52"         1'57"         1'59"
      run 2                                                  21112892
                                                                1'59"

A few observations are in order:

1) L3 has the highest compression ratio, while L3_1 and L11 have much lower 
compression ratios;
2) gzip compresses better than LZO2, at a small performance cost;
3) RCFile should have seen much better compression, as it is a columnar 
store, but the actual difference is marginal. This is probably because of 
L11's unique values and many of L3_1's random values such as timestamps, plus 
the presence of map-typed columns. The conclusion from this observation is 
that compression of temporary intermediate data is not guaranteed to save 
disk space to the desired degree; it depends on the values being compressed. 
As a result, this feature should be made configurable;
4) The performance implications in these tests seem to be negligible: within 
background noise, or within a few percent of the overall run times. But this 
is not conclusive yet. Larger and more realistic queries would be more 
suitable for the comparison;
5) RCFile, as noted above, has not shown a clear advantage in terms of 
columnar compression ratio. But this observation could be data-sensitive.
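
To make observations 1) and 2) concrete, the ratios can be recomputed from 
the first-run figures reported above. The sketch below assumes those figures 
are temporary data sizes in one common unit, and defines the ratio as 
compressed size over uncompressed size, so smaller means better compression. 
CompressionRatios and ratio() are hypothetical names.

```java
final class CompressionRatios {
    /** Compressed size as a fraction of the uncompressed size. */
    static double ratio(long compressed, long uncompressed) {
        return (double) compressed / (double) uncompressed;
    }

    public static void main(String[] args) {
        // First-run sizes for each test, as reported in the results above.
        String[] tests = {"L3", "L3_1", "L11"};
        long[] uncompressed = {133684504L, 3889095541L, 25878480L};
        long[] tfileLzo = {19674398L, 3697681875L, 21368784L};
        long[] tfileGzip = {11513958L, 2637742581L, 15233146L};

        for (int i = 0; i < tests.length; i++) {
            System.out.printf("%-5s TFile(lzo)=%.3f  TFile(gzip)=%.3f%n",
                tests[i],
                ratio(tfileLzo[i], uncompressed[i]),
                ratio(tfileGzip[i], uncompressed[i]));
        }
    }
}
```

Under that definition, L3 shrinks to well under a fifth of its original size 
with either codec, while L3_1 with lzo stays above 95% of its uncompressed 
size, which is the gap observation 1) describes.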

> need to investigate the impact of compression on pig performance
> 
>
> Key: PIG-1501
> URL: https://issues.apache.org/jira/browse/PIG-1501
> Project: Pig
>  Issue Type: Test
>Reporter: Olga Natkovich
>Assignee: Yan Zhou
> Fix For: 0.8.0
>
>
> We would like to understand how compressing map results as well as 
> reducer output in a chain of MR jobs impacts performance. We can use PigMix 
> queries for this investigation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-1453) [zebra] Intermittent failure for TestOrderPreserveUnionHDFS

2010-07-23 Thread Yan Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-1453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1453:
--

Status: Resolved  (was: Patch Available)
Resolution: Fixed

Committed to the trunk.

> [zebra] Intermittent failure for TestOrderPreserveUnionHDFS
> ---
>
> Key: PIG-1453
> URL: https://issues.apache.org/jira/browse/PIG-1453
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.8.0
>Reporter: Daniel Dai
>Assignee: Yan Zhou
> Fix For: 0.8.0
>
> Attachments: PIG-1453.patch, PIG-1453.patch
>
>


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
