[jira] Updated: (PIG-1658) ORDER BY does not work properly on integer/short keys that are -1
[ https://issues.apache.org/jira/browse/PIG-1658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yan Zhou updated PIG-1658:
--------------------------
    Status: Resolved  (was: Patch Available)
    Resolution: Fixed

Committed to both trunk and the 0.8 branch.

> ORDER BY does not work properly on integer/short keys that are -1
> -----------------------------------------------------------------
>
>                 Key: PIG-1658
>                 URL: https://issues.apache.org/jira/browse/PIG-1658
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.8.0
>            Reporter: Yan Zhou
>            Assignee: Yan Zhou
>             Fix For: 0.8.0
>
>         Attachments: PIG-1658.patch, PIG-1658.patch
>
>
> In fact, any key of these types whose value is negative but within the
> byte or short range has the problem.
> Basically, for a byte value of -1, (value & 0xff) returns 255, not -1.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1659) sortinfo is not set for store if there is a filter after ORDER BY
[ https://issues.apache.org/jira/browse/PIG-1659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12917012#action_12917012 ]

Yan Zhou commented on PIG-1659:
-------------------------------

Need to make sure it is invoked after optimization in both old and new logical plans.

> sortinfo is not set for store if there is a filter after ORDER BY
> -----------------------------------------------------------------
>
>                 Key: PIG-1659
>                 URL: https://issues.apache.org/jira/browse/PIG-1659
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.8.0
>            Reporter: Yan Zhou
>            Assignee: Daniel Dai
>             Fix For: 0.8.0
>
>
> This has caused 6 (of 7) failures in the Zebra test
> TestOrderPreserveVariableTable.
[jira] Updated: (PIG-1658) ORDER BY does not work properly on integer/short keys that are -1
[ https://issues.apache.org/jira/browse/PIG-1658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yan Zhou updated PIG-1658:
--------------------------
    Attachment: PIG-1658.patch

Add Zebra test TestMergeJoinPartial to the "pigtest" target.

> ORDER BY does not work properly on integer/short keys that are -1
> -----------------------------------------------------------------
>
>                 Key: PIG-1658
>                 URL: https://issues.apache.org/jira/browse/PIG-1658
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.8.0
>            Reporter: Yan Zhou
>            Assignee: Yan Zhou
>             Fix For: 0.8.0
>
>         Attachments: PIG-1658.patch, PIG-1658.patch
>
>
> In fact, any key of these types whose value is negative but within the
> byte or short range has the problem.
> Basically, for a byte value of -1, (value & 0xff) returns 255, not -1.
[jira] Updated: (PIG-1658) ORDER BY does not work properly on integer/short keys that are -1
[ https://issues.apache.org/jira/browse/PIG-1658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yan Zhou updated PIG-1658:
--------------------------
    Attachment: PIG-1658.patch

This problem is caused by the PIG-1295 patch.

test-core passes. Zebra's nightly passes too.

test-patch output:

     [exec] -1 overall.
     [exec]
     [exec]     +1 @author.  The patch does not contain any @author tags.
     [exec]
     [exec]     -1 tests included.  The patch doesn't appear to include any new or modified tests.
     [exec]                         Please justify why no tests are needed for this patch.
     [exec]
     [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
     [exec]
     [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
     [exec]
     [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
     [exec]
     [exec]     +1 release audit.  The applied patch does not increase the total number of release audit warnings.

Zebra's TestMergeJoinPartial is used to verify the fix.

> ORDER BY does not work properly on integer/short keys that are -1
> -----------------------------------------------------------------
>
>                 Key: PIG-1658
>                 URL: https://issues.apache.org/jira/browse/PIG-1658
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.8.0
>            Reporter: Yan Zhou
>            Assignee: Yan Zhou
>             Fix For: 0.8.0
>
>         Attachments: PIG-1658.patch
>
>
> In fact, any key of these types whose value is negative but within the
> byte or short range has the problem.
> Basically, for a byte value of -1, (value & 0xff) returns 255, not -1.
[jira] Updated: (PIG-1658) ORDER BY does not work properly on integer/short keys that are -1
[ https://issues.apache.org/jira/browse/PIG-1658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yan Zhou updated PIG-1658:
--------------------------
    Status: Patch Available  (was: Open)

> ORDER BY does not work properly on integer/short keys that are -1
> -----------------------------------------------------------------
>
>                 Key: PIG-1658
>                 URL: https://issues.apache.org/jira/browse/PIG-1658
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.8.0
>            Reporter: Yan Zhou
>            Assignee: Yan Zhou
>             Fix For: 0.8.0
>
>         Attachments: PIG-1658.patch
>
>
> In fact, any key of these types whose value is negative but within the
> byte or short range has the problem.
> Basically, for a byte value of -1, (value & 0xff) returns 255, not -1.
[jira] Created: (PIG-1659) sortinfo is not set for store if there is a filter after ORDER BY
sortinfo is not set for store if there is a filter after ORDER BY
-----------------------------------------------------------------

                 Key: PIG-1659
                 URL: https://issues.apache.org/jira/browse/PIG-1659
             Project: Pig
          Issue Type: Bug
    Affects Versions: 0.8.0
            Reporter: Yan Zhou
            Assignee: Daniel Dai
             Fix For: 0.8.0


This has caused 6 (of 7) failures in the Zebra test TestOrderPreserveVariableTable.
[jira] Assigned: (PIG-1658) ORDER BY does not work properly on integer/short keys that are -1
[ https://issues.apache.org/jira/browse/PIG-1658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yan Zhou reassigned PIG-1658:
-----------------------------
    Assignee: Yan Zhou

> ORDER BY does not work properly on integer/short keys that are -1
> -----------------------------------------------------------------
>
>                 Key: PIG-1658
>                 URL: https://issues.apache.org/jira/browse/PIG-1658
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.8.0
>            Reporter: Yan Zhou
>            Assignee: Yan Zhou
>             Fix For: 0.8.0
>
>
> In fact, any key of these types whose value is negative but within the
> byte or short range has the problem.
> Basically, for a byte value of -1, (value & 0xff) returns 255, not -1.
[jira] Updated: (PIG-1658) ORDER BY does not work properly on integer/short keys that are -1
[ https://issues.apache.org/jira/browse/PIG-1658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yan Zhou updated PIG-1658:
--------------------------
    Fix Version/s: 0.8.0
    Affects Version/s: 0.8.0

> ORDER BY does not work properly on integer/short keys that are -1
> -----------------------------------------------------------------
>
>                 Key: PIG-1658
>                 URL: https://issues.apache.org/jira/browse/PIG-1658
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.8.0
>            Reporter: Yan Zhou
>            Assignee: Yan Zhou
>             Fix For: 0.8.0
>
>
> In fact, any key of these types whose value is negative but within the
> byte or short range has the problem.
> Basically, for a byte value of -1, (value & 0xff) returns 255, not -1.
[jira] Created: (PIG-1658) ORDER BY does not work properly on integer/short keys that are -1
ORDER BY does not work properly on integer/short keys that are -1
-----------------------------------------------------------------

                 Key: PIG-1658
                 URL: https://issues.apache.org/jira/browse/PIG-1658
             Project: Pig
          Issue Type: Bug
            Reporter: Yan Zhou


In fact, any key of these types whose value is negative but within the byte or short range has the problem.
Basically, for a byte value of -1, (value & 0xff) returns 255, not -1.
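The masking behaviour described in this report can be reproduced in isolation. The sketch below assumes, as the description suggests, that a small key is held in a single byte (or short) and read back through a mask. The class and method names are hypothetical and are not taken from the Pig source or the attached patch.

```java
// Hypothetical demo class; the names are illustrative and not taken from the Pig source.
class SignExtensionDemo {

    // Masking with 0xff promotes the byte to int and keeps only the low 8 bits,
    // so the stored value is interpreted as unsigned: -1 (0xFF) becomes 255.
    static int readUnsignedByte(byte stored) {
        return stored & 0xff;
    }

    // A plain widening conversion sign-extends, so -1 stays -1.
    static int readSignedByte(byte stored) {
        return stored;
    }

    // The same pair for a two-byte encoding.
    static int readUnsignedShort(short stored) {
        return stored & 0xffff;
    }

    static int readSignedShort(short stored) {
        return stored;
    }

    public static void main(String[] args) {
        byte b = (byte) -1;
        short s = (short) -1;
        System.out.println(readUnsignedByte(b));  // prints 255
        System.out.println(readSignedByte(b));    // prints -1
        System.out.println(readUnsignedShort(s)); // prints 65535
        System.out.println(readSignedShort(s));   // prints -1
    }
}
```

With the unsigned read, a key of -1 compares greater than every small non-negative key, which is consistent with the wrong ORDER BY result reported here.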
[jira] Updated: (PIG-1648) Split combination may return too many block locations to map/reduce framework
[ https://issues.apache.org/jira/browse/PIG-1648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yan Zhou updated PIG-1648:
--------------------------
    Status: Patch Available  (was: Open)

> Split combination may return too many block locations to map/reduce framework
> -----------------------------------------------------------------------------
>
>                 Key: PIG-1648
>                 URL: https://issues.apache.org/jira/browse/PIG-1648
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.8.0
>            Reporter: Yan Zhou
>            Assignee: Yan Zhou
>             Fix For: 0.8.0
>
>         Attachments: PIG-1648.patch
>
>
> For instance, suppose one small split has block locations h1, h2 and h3, and
> another small split has h1, h3 and h4. After combination, the composite split
> contains 4 block locations. If the number of component splits is large, the
> number of block locations can be large too. In fact, the block locations serve
> as a hint to M/R about the best hosts to run this composite split on, so the
> list should be short, say 5 entries, and contain the hosts that hold the most
> data in this composite split.
[jira] Updated: (PIG-1648) Split combination may return too many block locations to map/reduce framework
[ https://issues.apache.org/jira/browse/PIG-1648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yan Zhou updated PIG-1648:
--------------------------
    Status: Resolved  (was: Patch Available)
    Resolution: Fixed

Patch committed to both trunk and the 0.8 branch.

> Split combination may return too many block locations to map/reduce framework
> -----------------------------------------------------------------------------
>
>                 Key: PIG-1648
>                 URL: https://issues.apache.org/jira/browse/PIG-1648
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.8.0
>            Reporter: Yan Zhou
>            Assignee: Yan Zhou
>             Fix For: 0.8.0
>
>         Attachments: PIG-1648.patch
>
>
> For instance, suppose one small split has block locations h1, h2 and h3, and
> another small split has h1, h3 and h4. After combination, the composite split
> contains 4 block locations. If the number of component splits is large, the
> number of block locations can be large too. In fact, the block locations serve
> as a hint to M/R about the best hosts to run this composite split on, so the
> list should be short, say 5 entries, and contain the hosts that hold the most
> data in this composite split.
[jira] Commented: (PIG-1648) Split combination may return too many block locations to map/reduce framework
[ https://issues.apache.org/jira/browse/PIG-1648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915852#action_12915852 ]

Yan Zhou commented on PIG-1648:
-------------------------------

test-patch results:

     [exec] +1 overall.
     [exec]
     [exec]     +1 @author.  The patch does not contain any @author tags.
     [exec]
     [exec]     +1 tests included.  The patch appears to include 3 new or modified tests.
     [exec]
     [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
     [exec]
     [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
     [exec]
     [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
     [exec]
     [exec]     +1 release audit.  The applied patch does not increase the total number of release audit warnings.

test-core tests pass too.

> Split combination may return too many block locations to map/reduce framework
> -----------------------------------------------------------------------------
>
>                 Key: PIG-1648
>                 URL: https://issues.apache.org/jira/browse/PIG-1648
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.8.0
>            Reporter: Yan Zhou
>            Assignee: Yan Zhou
>             Fix For: 0.8.0
>
>         Attachments: PIG-1648.patch
>
>
> For instance, suppose one small split has block locations h1, h2 and h3, and
> another small split has h1, h3 and h4. After combination, the composite split
> contains 4 block locations. If the number of component splits is large, the
> number of block locations can be large too. In fact, the block locations serve
> as a hint to M/R about the best hosts to run this composite split on, so the
> list should be short, say 5 entries, and contain the hosts that hold the most
> data in this composite split.
[jira] Updated: (PIG-1648) Split combination may return too many block locations to map/reduce framework
[ https://issues.apache.org/jira/browse/PIG-1648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yan Zhou updated PIG-1648:
--------------------------
    Attachment: PIG-1648.patch

> Split combination may return too many block locations to map/reduce framework
> -----------------------------------------------------------------------------
>
>                 Key: PIG-1648
>                 URL: https://issues.apache.org/jira/browse/PIG-1648
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.8.0
>            Reporter: Yan Zhou
>            Assignee: Yan Zhou
>             Fix For: 0.8.0
>
>         Attachments: PIG-1648.patch
>
>
> For instance, suppose one small split has block locations h1, h2 and h3, and
> another small split has h1, h3 and h4. After combination, the composite split
> contains 4 block locations. If the number of component splits is large, the
> number of block locations can be large too. In fact, the block locations serve
> as a hint to M/R about the best hosts to run this composite split on, so the
> list should be short, say 5 entries, and contain the hosts that hold the most
> data in this composite split.
[jira] Commented: (PIG-1648) Split combination may return too many block locations to map/reduce framework
[ https://issues.apache.org/jira/browse/PIG-1648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915815#action_12915815 ]

Yan Zhou commented on PIG-1648:
-------------------------------

The top 5 locations with the most data will be used. This has been agreed upon by the M/R dev.

> Split combination may return too many block locations to map/reduce framework
> -----------------------------------------------------------------------------
>
>                 Key: PIG-1648
>                 URL: https://issues.apache.org/jira/browse/PIG-1648
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.8.0
>            Reporter: Yan Zhou
>            Assignee: Yan Zhou
>             Fix For: 0.8.0
>
>
> For instance, suppose one small split has block locations h1, h2 and h3, and
> another small split has h1, h3 and h4. After combination, the composite split
> contains 4 block locations. If the number of component splits is large, the
> number of block locations can be large too. In fact, the block locations serve
> as a hint to M/R about the best hosts to run this composite split on, so the
> list should be short, say 5 entries, and contain the hosts that hold the most
> data in this composite split.
[jira] Created: (PIG-1651) PIG class loading mishandled
PIG class loading mishandled
----------------------------

                 Key: PIG-1651
                 URL: https://issues.apache.org/jira/browse/PIG-1651
             Project: Pig
          Issue Type: Bug
    Affects Versions: 0.8.0
            Reporter: Yan Zhou
            Assignee: Richard Ding
             Fix For: 0.8.0


If zebra.jar is only registered in a Pig script but is not on the CLASSPATH, a query using Zebra fails because multiple copies of a class appear to be loaded into the JVM, so a static variable set previously is not seen after an instance of the class is created through reflection. (Once zebra.jar is specified in the CLASSPATH, it works fine.)

The exception stack is as follows:

Backend error message during job submission
-------------------------------------------
org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to create input splits for: hdfs://hostname/pathto/zebra_dir :: null
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:284)
        at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:907)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:801)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:752)
        at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:378)
        at org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247)
        at org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279)
        at java.lang.Thread.run(Thread.java:619)
Caused by: java.lang.NullPointerException
        at org.apache.hadoop.zebra.io.ColumnGroup.getNonDataFilePrefix(ColumnGroup.java:123)
        at org.apache.hadoop.zebra.io.ColumnGroup$CGPathFilter.accept(ColumnGroup.java:2413)
        at org.apache.hadoop.zebra.mapreduce.TableInputFormat$DummyFileInputFormat$MultiPathFilter.accept(TableInputFormat.java:718)
        at org.apache.hadoop.fs.FileSystem$GlobFilter.accept(FileSystem.java:1084)
        at org.apache.hadoop.fs.FileSystem.globStatusInternal(FileSystem.java:919)
        at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:866)
        at org.apache.hadoop.zebra.mapreduce.TableInputFormat$DummyFileInputFormat.listStatus(TableInputFormat.java:780)
        at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:246)
        at org.apache.hadoop.zebra.mapreduce.TableInputFormat.getRowSplits(TableInputFormat.java:863)
        at org.apache.hadoop.zebra.mapreduce.TableInputFormat.getSplits(TableInputFormat.java:1017)
        at org.apache.hadoop.zebra.mapreduce.TableInputFormat.getSplits(TableInputFormat.java:961)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:269)
        ... 7 more
[jira] Updated: (PIG-1647) Logical simplifier throws a NPE
[ https://issues.apache.org/jira/browse/PIG-1647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yan Zhou updated PIG-1647:
--------------------------
    Status: Resolved  (was: Patch Available)
    Resolution: Fixed

Patch committed to both trunk and the 0.8 branch.

> Logical simplifier throws a NPE
> -------------------------------
>
>                 Key: PIG-1647
>                 URL: https://issues.apache.org/jira/browse/PIG-1647
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.8.0
>            Reporter: Yan Zhou
>            Assignee: Yan Zhou
>             Fix For: 0.8.0
>
>         Attachments: PIG-1647.patch, PIG-1647.patch
>
>
> A query like:
> A = load 'd.txt' as (a:chararray, b:long, c:map[], d:chararray, e:chararray);
> B = filter A by a == 'v' and b == 117L and c#'p1' == 'h' and c#'p2' == 'to'
> and ((d is not null and d != '') or (e is not null and e != ''));
> will cause the logical expression simplifier to throw a NPE.
[jira] Updated: (PIG-1647) Logical simplifier throws a NPE
[ https://issues.apache.org/jira/browse/PIG-1647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yan Zhou updated PIG-1647:
--------------------------
    Status: Patch Available  (was: Open)

> Logical simplifier throws a NPE
> -------------------------------
>
>                 Key: PIG-1647
>                 URL: https://issues.apache.org/jira/browse/PIG-1647
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.8.0
>            Reporter: Yan Zhou
>            Assignee: Yan Zhou
>             Fix For: 0.8.0
>
>         Attachments: PIG-1647.patch, PIG-1647.patch
>
>
> A query like:
> A = load 'd.txt' as (a:chararray, b:long, c:map[], d:chararray, e:chararray);
> B = filter A by a == 'v' and b == 117L and c#'p1' == 'h' and c#'p2' == 'to'
> and ((d is not null and d != '') or (e is not null and e != ''));
> will cause the logical expression simplifier to throw a NPE.
[jira] Updated: (PIG-1647) Logical simplifier throws a NPE
[ https://issues.apache.org/jira/browse/PIG-1647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yan Zhou updated PIG-1647:
--------------------------
    Attachment: PIG-1647.patch

passes test-core.

test-patch results:

     [exec] +1 overall.
     [exec]
     [exec]     +1 @author.  The patch does not contain any @author tags.
     [exec]
     [exec]     +1 tests included.  The patch appears to include 3 new or modified tests.
     [exec]
     [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
     [exec]
     [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
     [exec]
     [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
     [exec]
     [exec]     +1 release audit.  The applied patch does not increase the total number of release audit warnings.

> Logical simplifier throws a NPE
> -------------------------------
>
>                 Key: PIG-1647
>                 URL: https://issues.apache.org/jira/browse/PIG-1647
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.8.0
>            Reporter: Yan Zhou
>            Assignee: Yan Zhou
>             Fix For: 0.8.0
>
>         Attachments: PIG-1647.patch, PIG-1647.patch
>
>
> A query like:
> A = load 'd.txt' as (a:chararray, b:long, c:map[], d:chararray, e:chararray);
> B = filter A by a == 'v' and b == 117L and c#'p1' == 'h' and c#'p2' == 'to'
> and ((d is not null and d != '') or (e is not null and e != ''));
> will cause the logical expression simplifier to throw a NPE.
[jira] Updated: (PIG-1645) Using both small split combination and temporary file compression on a query of ORDER BY may cause crash
[ https://issues.apache.org/jira/browse/PIG-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yan Zhou updated PIG-1645:
--------------------------
    Status: Resolved  (was: Patch Available)
    Resolution: Fixed

Patch committed to both trunk and the 0.8 branch.

> Using both small split combination and temporary file compression on a query of ORDER BY may cause crash
> --------------------------------------------------------------------------------------------------------
>
>                 Key: PIG-1645
>                 URL: https://issues.apache.org/jira/browse/PIG-1645
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.8.0
>            Reporter: Yan Zhou
>            Assignee: Yan Zhou
>             Fix For: 0.8.0
>
>         Attachments: PIG-1645.patch
>
>
> The stack looks like the following:
> java.lang.NullPointerException
>     at java.util.Arrays.binarySearch(Arrays.java:2043)
>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.getPartition(WeightedRangePartitioner.java:72)
>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.getPartition(WeightedRangePartitioner.java:52)
>     at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:565)
>     at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:116)
>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:238)
>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231)
>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53)
>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:638)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:314)
>     at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:396)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1062)
>     at org.apache.hadoop.mapred.Child.main(Child.java:211)
[jira] Updated: (PIG-1647) Logical simplifier throws a NPE
[ https://issues.apache.org/jira/browse/PIG-1647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yan Zhou updated PIG-1647:
--------------------------
    Attachment: PIG-1647.patch

> Logical simplifier throws a NPE
> -------------------------------
>
>                 Key: PIG-1647
>                 URL: https://issues.apache.org/jira/browse/PIG-1647
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.8.0
>            Reporter: Yan Zhou
>            Assignee: Yan Zhou
>             Fix For: 0.8.0
>
>         Attachments: PIG-1647.patch
>
>
> A query like:
> A = load 'd.txt' as (a:chararray, b:long, c:map[], d:chararray, e:chararray);
> B = filter A by a == 'v' and b == 117L and c#'p1' == 'h' and c#'p2' == 'to'
> and ((d is not null and d != '') or (e is not null and e != ''));
> will cause the logical expression simplifier to throw a NPE.
[jira] Updated: (PIG-1635) Logical simplifier does not simplify away constants under AND and OR; after simplification the ordering of operands of AND and OR may get changed
[ https://issues.apache.org/jira/browse/PIG-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yan Zhou updated PIG-1635:
--------------------------
    Status: Resolved  (was: Patch Available)
    Resolution: Fixed

Patch committed to both trunk and the 0.8 branch.

> Logical simplifier does not simplify away constants under AND and OR; after
> simplification the ordering of operands of AND and OR may get changed
> ----------------------------------------------------------------------------
>
>                 Key: PIG-1635
>                 URL: https://issues.apache.org/jira/browse/PIG-1635
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.8.0
>            Reporter: Yan Zhou
>            Assignee: Yan Zhou
>            Priority: Minor
>             Fix For: 0.8.0
>
>         Attachments: PIG-1635.patch
>
>
> b = FILTER a by (( f1 > 1) AND (1 == 1))
> or
> b = FILTER a by ((f1 > 1) OR ( 1==0))
> should be simplified to
> b = FILTER a by f1 > 1;
> Regarding the ordering change, an example is that
> b = filter a by ((f1 is not null) AND (f2 is not null));
> Even without possible simplification, the expression is changed to
> b = filter a by ((f2 is not null) AND (f1 is not null));
> Although the ordering change in this case, and probably in most other
> cases, makes no difference, some users might care about the ordering for
> two reasons: if stateful UDFs are used as operands of AND or OR; and if
> the ordering is intended by the application designer to maximize the
> chances of short-circuiting the composite boolean evaluation.
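The two requirements in this report (drop redundant constants, keep operand order) can be sketched on a toy expression tree. This is only an illustration under stated assumptions: the classes below are hypothetical and are not Pig's logical expression classes, constant comparisons such as (1 == 1) are assumed to have been evaluated to a boolean constant already, and only the identity cases from the report (TRUE under AND, FALSE under OR) are handled.

```java
// Hypothetical sketch; these types are not Pig's logical expression classes.
class BoolSimplifier {

    interface Expr {}

    // A boolean constant, e.g. the already-evaluated result of (1 == 1).
    static final class Const implements Expr {
        final boolean value;
        Const(boolean value) { this.value = value; }
    }

    // An opaque predicate such as "f1 > 1".
    static final class Leaf implements Expr {
        final String text;
        Leaf(String text) { this.text = text; }
    }

    static final class And implements Expr {
        final Expr left, right;
        And(Expr left, Expr right) { this.left = left; this.right = right; }
    }

    static final class Or implements Expr {
        final Expr left, right;
        Or(Expr left, Expr right) { this.left = left; this.right = right; }
    }

    static boolean isConst(Expr e, boolean value) {
        return e instanceof Const && ((Const) e).value == value;
    }

    // Drops TRUE under AND and FALSE under OR; never swaps left and right.
    static Expr simplify(Expr e) {
        if (e instanceof And) {
            Expr l = simplify(((And) e).left);
            Expr r = simplify(((And) e).right);
            if (isConst(l, true)) return r;
            if (isConst(r, true)) return l;
            return new And(l, r);
        }
        if (e instanceof Or) {
            Expr l = simplify(((Or) e).left);
            Expr r = simplify(((Or) e).right);
            if (isConst(l, false)) return r;
            if (isConst(r, false)) return l;
            return new Or(l, r);
        }
        return e;
    }

    static String show(Expr e) {
        if (e instanceof Const) return String.valueOf(((Const) e).value);
        if (e instanceof Leaf) return ((Leaf) e).text;
        if (e instanceof And) {
            return "(" + show(((And) e).left) + " AND " + show(((And) e).right) + ")";
        }
        Or or = (Or) e;
        return "(" + show(or.left) + " OR " + show(or.right) + ")";
    }

    public static void main(String[] args) {
        Expr f1 = new Leaf("f1 > 1");
        System.out.println(show(simplify(new And(f1, new Const(true)))));  // prints f1 > 1
        System.out.println(show(simplify(new Or(f1, new Const(false)))));  // prints f1 > 1
        Expr both = new And(new Leaf("f1 is not null"), new Leaf("f2 is not null"));
        System.out.println(show(simplify(both))); // prints (f1 is not null AND f2 is not null)
    }
}
```

The absorbing cases (FALSE under AND, TRUE under OR) are deliberately left alone in this sketch, since folding them would discard the other operand, which matters if that operand is a stateful UDF.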
[jira] Created: (PIG-1648) Split combination may return too many block locations to map/reduce framework
Split combination may return too many block locations to map/reduce framework
-----------------------------------------------------------------------------

                 Key: PIG-1648
                 URL: https://issues.apache.org/jira/browse/PIG-1648
             Project: Pig
          Issue Type: Bug
    Affects Versions: 0.8.0
            Reporter: Yan Zhou
            Assignee: Yan Zhou
             Fix For: 0.8.0


For instance, suppose one small split has block locations h1, h2 and h3, and another small split has h1, h3 and h4. After combination, the composite split contains 4 block locations. If the number of component splits is large, the number of block locations can be large too. In fact, the block locations serve as a hint to M/R about the best hosts to run this composite split on, so the list should be short, say 5 entries, and contain the hosts that hold the most data in this composite split.
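One way to pick such a short list is to total the bytes each host holds across the component splits and keep the largest few. The sketch below is an assumption-laden illustration, not the attached patch: the class, field and method names are hypothetical, and the split sizes (100 and 200 bytes) are made up for the h1..h4 example above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; class, field and method names are not taken from the Pig source.
class TopLocations {

    // One component split: its size in bytes and the hosts that store its block.
    static final class Split {
        final long length;
        final String[] hosts;

        Split(long length, String... hosts) {
            this.length = length;
            this.hosts = hosts;
        }
    }

    // Returns at most `limit` hosts, ordered by the number of bytes of the
    // composite split they hold (largest first, ties broken by host name).
    static String[] topHosts(List<Split> splits, int limit) {
        Map<String, Long> bytesPerHost = new HashMap<>();
        for (Split split : splits) {
            for (String host : split.hosts) {
                bytesPerHost.merge(host, split.length, Long::sum);
            }
        }
        List<Map.Entry<String, Long>> entries = new ArrayList<>(bytesPerHost.entrySet());
        entries.sort((a, b) -> {
            int byBytes = Long.compare(b.getValue(), a.getValue());
            return byBytes != 0 ? byBytes : a.getKey().compareTo(b.getKey());
        });
        int n = Math.min(limit, entries.size());
        String[] result = new String[n];
        for (int i = 0; i < n; i++) {
            result[i] = entries.get(i).getKey();
        }
        return result;
    }

    public static void main(String[] args) {
        List<Split> splits = new ArrayList<>();
        splits.add(new Split(100L, "h1", "h2", "h3"));
        splits.add(new Split(200L, "h1", "h3", "h4"));
        // h1 and h3 hold 300 bytes each, h4 holds 200, h2 holds 100.
        System.out.println(String.join(", ", topHosts(splits, 2))); // prints h1, h3
    }
}
```

With a limit of 5 the example returns all 4 hosts, since the composite split only has 4 distinct locations.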
[jira] Created: (PIG-1647) Logical simplifier throws a NPE
Logical simplifier throws a NPE
-------------------------------

                 Key: PIG-1647
                 URL: https://issues.apache.org/jira/browse/PIG-1647
             Project: Pig
          Issue Type: Bug
    Affects Versions: 0.8.0
            Reporter: Yan Zhou
            Assignee: Yan Zhou
             Fix For: 0.8.0


A query like:
A = load 'd.txt' as (a:chararray, b:long, c:map[], d:chararray, e:chararray);
B = filter A by a == 'v' and b == 117L and c#'p1' == 'h' and c#'p2' == 'to'
    and ((d is not null and d != '') or (e is not null and e != ''));
will cause the logical expression simplifier to throw a NPE.
[jira] Commented: (PIG-1635) Logical simplifier does not simplify away constants under AND and OR; after simplification the ordering of operands of AND and OR may get changed
[ https://issues.apache.org/jira/browse/PIG-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914672#action_12914672 ]

Yan Zhou commented on PIG-1635:
-------------------------------

I did a thorough check for this patch. Actually some of the ordering changes were caused by the mentioned misuse. Thanks.

> Logical simplifier does not simplify away constants under AND and OR; after
> simplification the ordering of operands of AND and OR may get changed
> ----------------------------------------------------------------------------
>
>                 Key: PIG-1635
>                 URL: https://issues.apache.org/jira/browse/PIG-1635
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.8.0
>            Reporter: Yan Zhou
>            Assignee: Yan Zhou
>            Priority: Minor
>             Fix For: 0.8.0
>
>         Attachments: PIG-1635.patch
>
>
> b = FILTER a by (( f1 > 1) AND (1 == 1))
> or
> b = FILTER a by ((f1 > 1) OR ( 1==0))
> should be simplified to
> b = FILTER a by f1 > 1;
> Regarding the ordering change, an example is that
> b = filter a by ((f1 is not null) AND (f2 is not null));
> Even without possible simplification, the expression is changed to
> b = filter a by ((f2 is not null) AND (f1 is not null));
> Although the ordering change in this case, and probably in most other
> cases, makes no difference, some users might care about the ordering for
> two reasons: if stateful UDFs are used as operands of AND or OR; and if
> the ordering is intended by the application designer to maximize the
> chances of short-circuiting the composite boolean evaluation.
[jira] Commented: (PIG-1645) Using both small split combination and temporary file compression on a query of ORDER BY may cause crash
[ https://issues.apache.org/jira/browse/PIG-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914541#action_12914541 ]

Yan Zhou commented on PIG-1645:
-------------------------------

The possibility of failure also depends upon the block distribution since the split combination makes use of that info.

> Using both small split combination and temporary file compression on a query of ORDER BY may cause crash
> --------------------------------------------------------------------------------------------------------
>
>                 Key: PIG-1645
>                 URL: https://issues.apache.org/jira/browse/PIG-1645
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.8.0
>            Reporter: Yan Zhou
>            Assignee: Yan Zhou
>             Fix For: 0.8.0
>
>         Attachments: PIG-1645.patch
>
>
> The stack looks like the following:
> java.lang.NullPointerException
>     at java.util.Arrays.binarySearch(Arrays.java:2043)
>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.getPartition(WeightedRangePartitioner.java:72)
>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.getPartition(WeightedRangePartitioner.java:52)
>     at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:565)
>     at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:116)
>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:238)
>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231)
>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53)
>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:638)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:314)
>     at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:396)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1062)
>     at org.apache.hadoop.mapred.Child.main(Child.java:211)
[jira] Updated: (PIG-1645) Using both small split combination and temporary file compression on a query of ORDER BY may cause crash
[ https://issues.apache.org/jira/browse/PIG-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yan Zhou updated PIG-1645:

Status: Patch Available (was: Open)
[jira] Updated: (PIG-1645) Using both small split combination and temporary file compression on a query of ORDER BY may cause crash
[ https://issues.apache.org/jira/browse/PIG-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yan Zhou updated PIG-1645:

Attachment: PIG-1645.patch

test-core passed. test-patch results:

[exec] -1 overall.
[exec]
[exec] +1 @author. The patch does not contain any @author tags.
[exec]
[exec] -1 tests included. The patch doesn't appear to include any new or modified tests.
[exec] Please justify why no tests are needed for this patch.
[exec]
[exec] +1 javadoc. The javadoc tool did not generate any warning messages.
[exec]
[exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
[exec]
[exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
[exec]
[exec] -1 release audit. The applied patch generated 459 release audit warnings (more than the trunk's current 457 warnings).

The scenario is truly a corner case. The following query *might* have caused the problem:

    A = load '/tmp/test/jsTst2.txt' as (fn, age:int);
    B = load '/tmp/test/sample.txt' as (fn, age:int);
    C = join A by fn, B by fn USING 'replicated';
    D = ORDER C BY B::age;
    dump D;

where sample.txt contains a single record whose join key matches exactly one record in jsTst2.txt, and jsTst2.txt should be several HDFS blocks in size. Even so, the failure is intermittent, as it depends upon whether any of the logically empty files is placed in the first underlying split of the list of splits combined. Compute nodes' host names seem to play a role too. Running in local mode does not seem to produce the failure.

The 2 release audit warnings are due to jdiff. No new file added.
[jira] Commented: (PIG-1635) Logical simplifier does not simplify away constants under AND and OR; after simplification the ordering of operands of AND and OR may get changed
[ https://issues.apache.org/jira/browse/PIG-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914150#action_12914150 ]

Yan Zhou commented on PIG-1635:

All test-core tests also run clean.

> Logical simplifier does not simplify away constants under AND and OR; after simplification the ordering of operands of AND and OR may get changed
>
> Key: PIG-1635
> URL: https://issues.apache.org/jira/browse/PIG-1635
> Project: Pig
> Issue Type: Bug
> Components: impl
> Affects Versions: 0.8.0
> Reporter: Yan Zhou
> Assignee: Yan Zhou
> Priority: Minor
> Fix For: 0.8.0
> Attachments: PIG-1635.patch
>
> b = FILTER a by (( f1 > 1) AND (1 == 1))
> or
> b = FILTER a by ((f1 > 1) OR ( 1==0))
> should be simplified to
> b = FILTER a by f1 > 1;
> Regarding ordering change, an example is that
> b = filter a by ((f1 is not null) AND (f2 is not null));
> Even without possible simplification, the expression is changed to
> b = filter a by ((f2 is not null) AND (f1 is not null));
> The ordering change in this case, and probably in most other cases, does not create any difference. Still, some users might care about the ordering for two reasons: if stateful UDFs are used as operands of AND or OR; and if the ordering is intended by the application designer to maximize the chances to shortcut the composite boolean evaluation.
[jira] Commented: (PIG-1635) Logical simplifier does not simplify away constants under AND and OR; after simplification the ordering of operands of AND and OR may get changed
[ https://issues.apache.org/jira/browse/PIG-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914145#action_12914145 ]

Yan Zhou commented on PIG-1635:

test-patch results:

[exec] +1 overall.
[exec]
[exec] +1 @author. The patch does not contain any @author tags.
[exec]
[exec] +1 tests included. The patch appears to include 3 new or modified tests.
[exec]
[exec] +1 javadoc. The javadoc tool did not generate any warning messages.
[exec]
[exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
[exec]
[exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
[exec]
[exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.
[jira] Commented: (PIG-1645) Using both small split combination and temporary file compression on a query of ORDER BY may cause crash
[ https://issues.apache.org/jira/browse/PIG-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914128#action_12914128 ]

Yan Zhou commented on PIG-1645:

The problem is that both RandomSampleLoader and PoissonSampleLoader keep internal state from previous invocations. That state should be reset whenever a different underlying split is worked on under the same umbrella split, which happens when split combination (PIG-1518) is on.

When temporary file compression is disabled, Pig internal storage creates empty files, which the split combiner discards. The only non-empty split is then the only split to be worked on, so the problem does not show up in that case.
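The kind of stale-state problem described in this comment can be sketched as follows. This is only an illustration under stated assumptions, not the PIG-1645 patch: FirstNSamplerSketch is a hypothetical class, and its prepareToRead and getNext methods are simplified stand-ins for the corresponding loader methods (they take and return plain Java types instead of Pig's RecordReader, PigSplit and Tuple).

```java
import java.util.Iterator;

// Hypothetical sampler that illustrates the idea only; it is NOT Pig's
// RandomSampleLoader or PoissonSampleLoader.
final class FirstNSamplerSketch {
    private final int samplesPerSplit;
    private Iterator<String> reader;   // rows of the current underlying split
    private int emitted;               // per-split state

    FirstNSamplerSketch(int samplesPerSplit) {
        this.samplesPerSplit = samplesPerSplit;
    }

    // Invoked once for every underlying split inside a combined split.
    void prepareToRead(Iterator<String> reader) {
        this.reader = reader;
        // Without this reset, a later split would inherit the count left
        // over from the previous split and might yield no samples at all.
        this.emitted = 0;
    }

    // Returns the next sampled row, or null when this split has no more samples.
    String getNext() {
        if (emitted >= samplesPerSplit || !reader.hasNext()) {
            return null;
        }
        emitted++;
        return reader.next();
    }
}
```

With the reset in place, a sampler that has exhausted its quota on one split, and then sees an empty split, still returns rows from the split after that.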
[jira] Updated: (PIG-1518) multi file input format for loaders
[ https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yan Zhou updated PIG-1518:

Release Note:

Feature: combine splits of sizes smaller than the value of property "pig.maxCombinedSplitSize" or, if the property "pig.maxCombinedSplitSize" is not set, the file system default block size of the load's location. This feature can be turned off by setting the property "pig.splitCombination" to "false". When such a combination is performed, a log message like "Total input paths (combined) to process : 7" will be logged.

This feature is applicable if a user input, or an intermediate input, has many small files to be loaded that would otherwise cause many more "under-fed" mappers to be launched and potentially slow down the execution.

This change will not cause any backward compatibility issue, except if a loader implementation makes use of the PigSplit object passed through the prepareToRead method, in which case a rebuild of the loader might be necessary as PigSplit's definition has been modified. However, currently we know of no external use of the object.

This change also requires the loader to be stateless across the invocations of the prepareToRead method. That is, the method should reset any internal states that are not affected by the RecordReader argument. Otherwise, this feature should be disabled.

In addition, if a loader implements IndexableLoadFunc, or implements OrderedLoadFunc and CollectableLoadFunc, its input splits won't be subject to possible combinations.

was:

Feature: combine splits of sizes smaller than the value of property "pig.maxCombinedSplitSize" or, if the property of "pig.maxCombinedSplitSize" is not set, the file system default block size of the load's location. This feature can be turned off through setting the property "pig.noSplitCombination" to true. When such a combination is performed, a log message like "Total input paths (combined) to process : 7" will be logged.

This feature will be applicable if a user input, or an intermediate input, has many small files to be loaded that would otherwise cause many more "under-fed" mappers to be launched and potentially slowdown of the execution.

This change will not cause any backward compatibility issue except if a loader implementation makes use of the PigSplit object passed through the prepareToRead method where a rebuild of the loader might be necessary as PigSplit's definition has been modified. However, currently we know of no external use of the object.

In addition, if a loader implements IndexableLoadFunc, or implements OrderedLoadFunc and CollectableLoadFunc, its input splits won't be subject to possible combinations.

> multi file input format for loaders
>
> Key: PIG-1518
> URL: https://issues.apache.org/jira/browse/PIG-1518
> Project: Pig
> Issue Type: Improvement
> Reporter: Olga Natkovich
> Assignee: Yan Zhou
> Fix For: 0.8.0
> Attachments: PIG-1518-0.7.0.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch
>
> We frequently run into the situation where Pig needs to deal with small files in the input. In this case a separate map is created for each file, which could be very inefficient.
> It would be great to have an umbrella input format that can take multiple files and use them in a single split. We would like to see this working with different data formats if possible.
> There are already a couple of input formats doing a similar thing: MultifileInputFormat as well as CombinedInputFormat; however, neither works with the new Hadoop 20 API.
> We at least want to do a feasibility study for Pig 0.8.0.
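To make the size rule in the release note concrete, here is a minimal sketch of packing small splits into combined splits. It is an assumption-laden illustration, not Pig's implementation: SplitCombinerSketch and combine are hypothetical names, maxCombinedSize stands in for the value of "pig.maxCombinedSplitSize", and the greedy in-order packing ignores the block location information that the real split combination is said to use.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not Pig's actual split-combination code.
final class SplitCombinerSketch {
    /**
     * Greedily packs split sizes (in bytes) into groups whose total stays
     * at or below maxCombinedSize. A split that is already at least
     * maxCombinedSize is left alone, in a group of its own.
     */
    static List<List<Long>> combine(List<Long> splitSizes, long maxCombinedSize) {
        List<List<Long>> groups = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentTotal = 0;
        for (long size : splitSizes) {
            if (size >= maxCombinedSize) {
                // Large enough already: not subject to combination.
                List<Long> alone = new ArrayList<>();
                alone.add(size);
                groups.add(alone);
                continue;
            }
            if (!current.isEmpty() && currentTotal + size > maxCombinedSize) {
                // The open group is full; close it and start a new one.
                groups.add(current);
                current = new ArrayList<>();
                currentTotal = 0;
            }
            current.add(size);
            currentTotal += size;
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }
}
```

Each resulting group would be read by a single mapper, with the loader's prepareToRead method invoked once per underlying split in the group, which is why the note asks loaders to reset their state there.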
[jira] Created: (PIG-1645) Using both small split combination and temporary file compression on a query of ORDER BY may cause crash
Using both small split combination and temporary file compression on a query of ORDER BY may cause crash

Key: PIG-1645
URL: https://issues.apache.org/jira/browse/PIG-1645
Project: Pig
Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Yan Zhou
Assignee: Yan Zhou
Fix For: 0.8.0

The stack looks like the following:
java.lang.NullPointerException
    at java.util.Arrays.binarySearch(Arrays.java:2043)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.getPartition(WeightedRangePartitioner.java:72)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.getPartition(WeightedRangePartitioner.java:52)
    at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:565)
    at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:116)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:238)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:638)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:314)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1062)
    at org.apache.hadoop.mapred.Child.main(Child.java:211)
[jira] Updated: (PIG-1635) Logical simplifier does not simplify away constants under AND and OR; after simplification the ordering of operands of AND and OR may get changed
[ https://issues.apache.org/jira/browse/PIG-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yan Zhou updated PIG-1635:

Status: Patch Available (was: Open)
[jira] Updated: (PIG-1635) Logical simplifier does not simplify away constants under AND and OR; after simplification the ordering of operands of AND and OR may get changed
[ https://issues.apache.org/jira/browse/PIG-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yan Zhou updated PIG-1635:

Attachment: PIG-1635.patch
[jira] Commented: (PIG-1635) Logical simplifier does not simplify away constants under AND and OR; after simplification the ordering of operands of AND and OR may get changed
[ https://issues.apache.org/jira/browse/PIG-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913036#action_12913036 ]

Yan Zhou commented on PIG-1635:

This is regarding a new feature (PIG-1399) added for 0.8.
[jira] Updated: (PIG-1635) Logical simplifier does not simplify away constants under AND and OR; after simplification the ordering of operands of AND and OR may get changed
[ https://issues.apache.org/jira/browse/PIG-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yan Zhou updated PIG-1635:

Affects Version/s: 0.8.0
[jira] Created: (PIG-1635) Logical simplifier does not simplify away constants under AND and OR; after simplification the ordering of operands of AND and OR may get changed
Logical simplifier does not simplify away constants under AND and OR; after simplification the ordering of operands of AND and OR may get changed

Key: PIG-1635
URL: https://issues.apache.org/jira/browse/PIG-1635
Project: Pig
Issue Type: Bug
Components: impl
Reporter: Yan Zhou
Assignee: Yan Zhou
Priority: Minor

b = FILTER a by (( f1 > 1) AND (1 == 1))
or
b = FILTER a by ((f1 > 1) OR ( 1==0))
should be simplified to
b = FILTER a by f1 > 1;

Regarding ordering change, an example is that
b = filter a by ((f1 is not null) AND (f2 is not null));
Even without possible simplification, the expression is changed to
b = filter a by ((f2 is not null) AND (f1 is not null));

The ordering change in this case, and probably in most other cases, does not create any difference. Still, some users might care about the ordering for two reasons: if stateful UDFs are used as operands of AND or OR; and if the ordering is intended by the application designer to maximize the chances to shortcut the composite boolean evaluation.
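The behavior requested in this issue can be sketched on a toy expression tree. This is a hedged illustration, not Pig's logical simplifier: Expr, Const, Pred, BinOp and Simplifier are hypothetical classes, and the sketch assumes constant comparisons such as (1 == 1) have already been folded into a boolean constant.

```java
// Hypothetical minimal expression tree; NOT Pig's logical expression classes.
abstract class Expr {}

final class Const extends Expr {       // an already-folded boolean constant
    final boolean value;
    Const(boolean value) { this.value = value; }
}

final class Pred extends Expr {        // opaque predicate such as "f1 > 1"
    final String text;
    Pred(String text) { this.text = text; }
}

final class BinOp extends Expr {
    final boolean isAnd;               // true for AND, false for OR
    final Expr left;
    final Expr right;
    BinOp(boolean isAnd, Expr left, Expr right) {
        this.isAnd = isAnd;
        this.left = left;
        this.right = right;
    }
}

final class Simplifier {
    // Removes constants under AND/OR and keeps the remaining operands
    // in their original left-to-right order.
    static Expr simplify(Expr e) {
        if (!(e instanceof BinOp)) {
            return e;
        }
        BinOp op = (BinOp) e;
        Expr l = simplify(op.left);
        Expr r = simplify(op.right);
        if (l instanceof Const) {
            return fold(op.isAnd, ((Const) l).value, r);
        }
        if (r instanceof Const) {
            return fold(op.isAnd, ((Const) r).value, l);
        }
        return new BinOp(op.isAnd, l, r);
    }

    // true is the identity of AND and false is the identity of OR; the
    // opposite constant absorbs the other operand entirely.
    private static Expr fold(boolean isAnd, boolean constant, Expr other) {
        boolean identity = isAnd;
        return constant == identity ? other : new Const(constant);
    }

    static String show(Expr e) {
        if (e instanceof Const) {
            return String.valueOf(((Const) e).value);
        }
        if (e instanceof Pred) {
            return ((Pred) e).text;
        }
        BinOp op = (BinOp) e;
        return "(" + show(op.left) + (op.isAnd ? " AND " : " OR ") + show(op.right) + ")";
    }
}
```

Note that the absorbing case (for example, X AND false) drops X without evaluating it, so a real simplifier would need to decide whether that is acceptable when X is a stateful UDF, which is one of the concerns raised above.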
[jira] Commented: (PIG-1628) log this message at debug level : 'Pig Internal storage in use'
[ https://issues.apache.org/jira/browse/PIG-1628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913029#action_12913029 ]

Yan Zhou commented on PIG-1628:

+1. Patch looks good.

> log this message at debug level : 'Pig Internal storage in use'
>
> Key: PIG-1628
> URL: https://issues.apache.org/jira/browse/PIG-1628
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.8.0
> Reporter: Thejas M Nair
> Assignee: Thejas M Nair
> Fix For: 0.8.0
> Attachments: PIG-1628.1.patch
>
> The temporary storage functions used are logging at the INFO level. This should change to debug level, since these messages reduce the visibility of more useful INFO messages. The messages include 'Pig Internal storage in use' from InterStorage and 'TFile storage in use' from TFileStorage.
[jira] Commented: (PIG-366) PigPen - Eclipse plugin for a graphical PigLatin editor
[ https://issues.apache.org/jira/browse/PIG-366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12909330#action_12909330 ]

Yan Zhou commented on PIG-366:

Robert, could you write up step-by-step instructions on how to use this jar as an Eclipse plug-in? Thanks.

> PigPen - Eclipse plugin for a graphical PigLatin editor
>
> Key: PIG-366
> URL: https://issues.apache.org/jira/browse/PIG-366
> Project: Pig
> Issue Type: New Feature
> Reporter: Shubham Chopra
> Assignee: Robert Gibbon
> Priority: Minor
> Attachments: org.apache.pig.pigpen-0.7.0.tar.gz, org.apache.pig.pigpen-0.7.2.tar.gz, org.apache.pig.pigpen_0.0.1.jar, org.apache.pig.pigpen_0.0.1.tgz, org.apache.pig.pigpen_0.0.4.jar, org.apache.pig.pigpen_0.7.2.jar, pigpen.patch, pigPen.patch, PigPen.tgz
>
> This is an Eclipse plugin that provides a GUI that can help users create PigLatin scripts and see the example generator outputs on the fly and submit the jobs to hadoop clusters.
[jira] Resolved: (PIG-239) illustrate followed by dump gives a runtime exception
[ https://issues.apache.org/jira/browse/PIG-239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yan Zhou resolved PIG-239.

Fix Version/s: 0.8.0 (was: 0.9.0)
Resolution: Cannot Reproduce

Can not reproduce using 0.8.

> illustrate followed by dump gives a runtime exception
>
> Key: PIG-239
> URL: https://issues.apache.org/jira/browse/PIG-239
> Project: Pig
> Issue Type: Bug
> Components: impl
> Reporter: Pradeep Kamath
> Assignee: Yan Zhou
> Fix For: 0.8.0
>
> Here is a session which outlines the issue:
> grunt> a = load '/user/pig/tests/data/singlefile/studenttab10k' as (name, age,gpa);
> grunt> b = filter a by name lt 'b';
> grunt> c = foreach b generate TOKENIZE(name);
> grunt> illustrate c;
> -------------------------------------
> | a | name          | age | gpa  |
> -------------------------------------
> |   | tom xylophone | 69  | 0.04 |
> |   | alice ovid    | 75  | 3.89 |
> -------------------------------------
> -------------------------------------
> | b | name          | age | gpa  |
> -------------------------------------
> |   | alice ovid    | 75  | 3.89 |
> -------------------------------------
> -----------------------------
> | c | (token )          |
> -----------------------------
> |   | {(alice), (ovid)} |
> -----------------------------
> grunt> dump c;
> 2008-05-15 14:35:54,476 [main] ERROR org.apache.pig.tools.grunt.GruntParser -
> java.lang.RuntimeException: java.io.IOException: Serialization error: org.apache.pig.impl.util.LineageTracer
>     at org.apache.pig.backend.hadoop.executionengine.POMapreduce.copy(POMapreduce.java:242)
>     at org.apache.pig.backend.hadoop.executionengine.MapreducePlanCompiler.compile(MapreducePlanCompiler.java:115)
>     at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:232)
>     at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:209)
>     at org.apache.pig.PigServer.optimizeAndRunQuery(PigServer.java:410)
>     at org.apache.pig.PigServer.openIterator(PigServer.java:332)
>     at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:265)
>     at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:162)
>     at org.apache.pig.tools.grunt.GruntParser.parseContOnError(GruntParser.java:73)
>     at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:54)
>     at org.apache.pig.Main.main(Main.java:270)
> Caused by: java.io.IOException: Serialization error: org.apache.pig.impl.util.LineageTracer
>     at org.apache.pig.impl.util.WrappedIOException.wrap(WrappedIOException.java:16)
>     at org.apache.pig.impl.util.ObjectSerializer.serialize(ObjectSerializer.java:44)
>     at org.apache.pig.backend.hadoop.executionengine.POMapreduce.copy(POMapreduce.java:233)
>     ... 10 more
> Caused by: java.io.NotSerializableException: org.apache.pig.impl.util.LineageTracer
>     at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1081)
>     at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1375)
>     at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1347)
>     at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1290)
>     at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1079)
>     at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:302)
>     at java.util.ArrayList.writeObject(ArrayList.java:569)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:585)
>     at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:917)
>     at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1339)
>     at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1290)
>     at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1079)
>     at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java
[jira] Commented: (PIG-366) PigPen - Eclipse plugin for a graphical PigLatin editor
[ https://issues.apache.org/jira/browse/PIG-366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12908971#action_12908971 ] Yan Zhou commented on PIG-366: -- One more clarification: by design, the example generator does not submit any jobs to Hadoop; it just runs on the client as a local application. > PigPen - Eclipse plugin for a graphical PigLatin editor > --- > > Key: PIG-366 > URL: https://issues.apache.org/jira/browse/PIG-366 > Project: Pig > Issue Type: New Feature >Reporter: Shubham Chopra >Assignee: Robert Gibbon >Priority: Minor > Attachments: org.apache.pig.pigpen-0.7.0.tar.gz, > org.apache.pig.pigpen-0.7.2.tar.gz, org.apache.pig.pigpen_0.0.1.jar, > org.apache.pig.pigpen_0.0.1.tgz, org.apache.pig.pigpen_0.0.4.jar, > org.apache.pig.pigpen_0.7.2.jar, pigpen.patch, pigPen.patch, PigPen.tgz > > > This is an Eclipse plugin that provides a GUI that can help users create > PigLatin scripts and see the example generator outputs on the fly and submit > the jobs to hadoop clusters. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-366) PigPen - Eclipse plugin for a graphical PigLatin editor
[ https://issues.apache.org/jira/browse/PIG-366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12908962#action_12908962 ] Yan Zhou commented on PIG-366: -- Yes. But the original patch by Shubham had hooked the plugin to the example generator interface, unless you have found something funky in that patch. I have no intention of changing the interface. > PigPen - Eclipse plugin for a graphical PigLatin editor > --- > > Key: PIG-366 > URL: https://issues.apache.org/jira/browse/PIG-366 > Project: Pig > Issue Type: New Feature >Reporter: Shubham Chopra >Assignee: Robert Gibbon >Priority: Minor > Attachments: org.apache.pig.pigpen-0.7.0.tar.gz, > org.apache.pig.pigpen-0.7.2.tar.gz, org.apache.pig.pigpen_0.0.1.jar, > org.apache.pig.pigpen_0.0.1.tgz, org.apache.pig.pigpen_0.0.4.jar, > org.apache.pig.pigpen_0.7.2.jar, pigpen.patch, pigPen.patch, PigPen.tgz > > > This is an Eclipse plugin that provides a GUI that can help users create > PigLatin scripts and see the example generator outputs on the fly and submit > the jobs to hadoop clusters. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-366) PigPen - Eclipse plugin for a graphical PigLatin editor
[ https://issues.apache.org/jira/browse/PIG-366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12908926#action_12908926 ] Yan Zhou commented on PIG-366: -- Robert, first, thanks for your effort in picking up this feature. You mentioned in your 09/08 comment that you "stripped back" a lot of functionality and focused on the script editor. I'm wondering if it is possible to add your fixes/improvements on top of Shubham's patch. Specifically, I'm interested in the use of the example generator in PigPen, which seems to be absent from your patches. FYI, I'm currently working on improving and enhancing the example generator left over by Shubham about 2 years ago. > PigPen - Eclipse plugin for a graphical PigLatin editor > --- > > Key: PIG-366 > URL: https://issues.apache.org/jira/browse/PIG-366 > Project: Pig > Issue Type: New Feature >Reporter: Shubham Chopra >Assignee: Robert Gibbon >Priority: Minor > Attachments: org.apache.pig.pigpen-0.7.0.tar.gz, > org.apache.pig.pigpen-0.7.2.tar.gz, org.apache.pig.pigpen_0.0.1.jar, > org.apache.pig.pigpen_0.0.1.tgz, org.apache.pig.pigpen_0.0.4.jar, > org.apache.pig.pigpen_0.7.2.jar, pigpen.patch, pigPen.patch, PigPen.tgz > > > This is an Eclipse plugin that provides a GUI that can help users create > PigLatin scripts and see the example generator outputs on the fly and submit > the jobs to hadoop clusters. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1501) need to investigate the impact of compression on pig performance
[ https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904868#action_12904868 ] Yan Zhou commented on PIG-1501: --- To be more accurate, the default compression would be gzip if compression were turned on by default. Currently, the compression method has to be specified and takes no default value. This is to ask users to fully appreciate the pros and cons of either compression method. > need to investigate the impact of compression on pig performance > > > Key: PIG-1501 > URL: https://issues.apache.org/jira/browse/PIG-1501 > Project: Pig > Issue Type: Test >Reporter: Olga Natkovich >Assignee: Yan Zhou > Fix For: 0.8.0 > > Attachments: compress_perf_data.txt, compress_perf_data_2.txt, > PIG-1501.patch, PIG-1501.patch, PIG-1501.patch > > > We would like to understand how compressing map results as well as > reducer output in a chain of MR jobs impacts performance. We can use PigMix > queries for this investigation. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1501) need to investigate the impact of compression on pig performance
[ https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Zhou updated PIG-1501: -- Release Note: This feature will save HDFS space used to store the intermediate data used by Pig and potentially improve query execution speed. In general, the more intermediate data generated, the greater the storage and speedup benefits. There are no backward compatibility issues as a result of this feature. Two Java properties are used to control the behavior: pig.tmpfilecompression, which defaults to false, tells whether the temporary files should be compressed or not. If true, then pig.tmpfilecompression.codec specifies which compression codec to use. Currently, Pig only accepts "gz" and "lzo" as possible values. Since LZO is under the GPL license, Hadoop may need to be configured to use the LZO codec. Please refer to http://code.google.com/p/hadoop-gpl-compression/wiki/FAQ for details. An example is the following "test.pig" script: register pigperf.jar; A = load '/user/pig/tests/data/pigmix/page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader() as (user, action, timespent:long, query_term, ip_addr, timestamp, estimated_revenue, page_info, page_links); B1 = filter A by timespent == 4; B = load '/user/pig/tests/data/pigmix/queryterm' as (query_term); C = join B1 by query_term, B by query_term using 'skewed' parallel 300; D = distinct C parallel 300; store D into 'output.lzo'; which is launched as follows: java -cp /grid/0/gs/conf/current:/grid/0/jars/pig.jar -Djava.library.path=/grid/0/gs/hadoop/current/lib/native/Linux-i386-32 -Dpig.tmpfilecompression=true -Dpig.tmpfilecompression.codec=lzo org.apache.pig.Main ./test.pig was: This feature will save HDFS space used to store the intermediate data used by Pig and potentially improve query execution speed. In general, the more intermediate data generated, the greater the storage and speedup benefits. There are no backward compatibility issues as a result of this feature. 
Two Java properties are used to control the behavior: pig.tmpfilecompression, which defaults to false, tells whether the temporary files should be compressed or not. If true, then pig.tmpfilecompression.codec specifies which compression codec to use. Currently, Pig only accepts "gz" and "lzo" as possible values. Since LZO is under the GPL license, Hadoop may need to be configured to use the LZO codec. Please refer to http://code.google.com/p/hadoop-gpl-compression/wiki/FAQ for details. An example is the following "test.pig" script: register pigperf.jar; A = load '/user/pig/tests/data/pigmix/page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader() as (user, action, timespent:long, query_term, ip_addr, timestamp, estimated_revenue, page_info, page_links); B1 = filter A by timespent == 4; B = load '/user/pig/tests/data/pigmix/queryterm' as (query_term); C = join B1 by query_term, B by query_term using 'skewed' parallel 300; D = distinct C parallel 300; store D into 'output.lzo'; which is launched as follows: java -cp /grid/0/gs/conf/current:/grid/0/jars/pig.jar -Djava.library.path=/grid/0/gs/hadoop/current/lib/native/Linux-i386-32 -Dpig.tmpfilecompression=true -Dpig.tmpfilecompression.codec=lzo org.apache.pig.Main ./test.pig [ Show » ] Yan Zhou added a comment - 26/Aug/10 11:14 AM This feature will save HDFS space used to store the intermediate data used by Pig and potentially improve query execution speed. In general, the more intermediate data generated, the greater the storage and speedup benefits. There are no backward compatibility issues as a result of this feature. 
An example is the following "test.pig" script: register pigperf.jar; A = load '/user/pig/tests/data/pigmix/page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader() as (user, action, timespent:long, query_term, ip_addr, timestamp, estimated_revenue, page_info, page_links); B1 = filter A by timespent == 4; B = load '/user/pig/tests/data/pigmix/queryterm' as (query_term); C = join B1 by query_term, B by query_term using 'skewed' parallel 300; D = distinct C parallel 300; store D into 'output.lzo'; which is launched as follows: java -cp /grid/0/gs/conf/current:/grid/0/jars/pig.jar -Djava.library.path=/grid/0/gs/hadoop/current/lib/native/Linux-i386-32 -Dpig.tmpfilecompression=true -Dpig.tmpfilecompression.codec=lzo org.apache.pig.Main ./test.pig > need to investigate the impact of compression on pig performance > > > Key: PIG-1501 > URL: https://issues.apache.org/jira/browse/PIG-1501 > Project: Pig > Issue Type: Test >Reporter: Olga Natkovich >
[jira] Updated: (PIG-1501) need to investigate the impact of compression on pig performance
[ https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Zhou updated PIG-1501: -- Release Note: This feature will save HDFS space used to store the intermediate data used by Pig and potentially improve query execution speed. In general, the more intermediate data generated, the greater the storage and speedup benefits. There are no backward compatibility issues as a result of this feature. Two Java properties are used to control the behavior: pig.tmpfilecompression, which defaults to false, tells whether the temporary files should be compressed or not. If true, then pig.tmpfilecompression.codec specifies which compression codec to use. Currently, Pig only accepts "gz" and "lzo" as possible values. Since LZO is under the GPL license, Hadoop may need to be configured to use the LZO codec. Please refer to http://code.google.com/p/hadoop-gpl-compression/wiki/FAQ for details. An example is the following "test.pig" script: register pigperf.jar; A = load '/user/pig/tests/data/pigmix/page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader() as (user, action, timespent:long, query_term, ip_addr, timestamp, estimated_revenue, page_info, page_links); B1 = filter A by timespent == 4; B = load '/user/pig/tests/data/pigmix/queryterm' as (query_term); C = join B1 by query_term, B by query_term using 'skewed' parallel 300; D = distinct C parallel 300; store D into 'output.lzo'; which is launched as follows: java -cp /grid/0/gs/conf/current:/grid/0/jars/pig.jar -Djava.library.path=/grid/0/gs/hadoop/current/lib/native/Linux-i386-32 -Dpig.tmpfilecompression=true -Dpig.tmpfilecompression.codec=lzo org.apache.pig.Main ./test.pig [ Show » ] Yan Zhou added a comment - 26/Aug/10 11:14 AM This feature will save HDFS space used to store the intermediate data used by Pig and potentially improve query execution speed. In general, the more intermediate data generated, the greater the storage and speedup benefits. 
There are no backward compatibility issues as result of this feature. An example is the following "test.pig" script: register pigperf.jar; A = load '/user/pig/tests/data/pigmix/page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader() as (user, action, timespent:long, query_term, ip_addr, timestamp, estimated_revenue, page_info, page_links); B1 = filter A by timespent == 4; B = load '/user/pig/tests/data/pigmix/queryterm' as (query_term); C = join B1 by query_term, B by query_term using 'skewed' parallel 300; D = distinct C parallel 300; store D into 'output.lzo'; which is launched as follows: java -cp /grid/0/gs/conf/current:/grid/0/jars/pig.jar -Djava.library.path=/grid/0/gs/hadoop/current/lib/native/Linux-i386-32 -Dpig.tmpfilecompression=true -Dpig.tmpfilecompression.codec=lzo org.apache.pig.Main ./test.pig > need to investigate the impact of compression on pig performance > > > Key: PIG-1501 > URL: https://issues.apache.org/jira/browse/PIG-1501 > Project: Pig > Issue Type: Test >Reporter: Olga Natkovich >Assignee: Yan Zhou > Fix For: 0.8.0 > > Attachments: compress_perf_data.txt, compress_perf_data_2.txt, > PIG-1501.patch, PIG-1501.patch, PIG-1501.patch > > > We would like to understand how compressing map results as well as well as > reducer output in a chain of MR jobs impacts performance. We can use PigMix > queries for this investigation. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
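The release note above names two properties and two accepted codec values. The following is a minimal sketch of how a client might validate those settings before launching a script. The class and method names (TmpFileCompressionSettings, resolveCodec) are hypothetical and are not part of Pig; only the property names and the "gz"/"lzo" values come from the release note.

```java
import java.util.Properties;

public class TmpFileCompressionSettings {
    // Property names as given in the PIG-1501 release note.
    static final String ENABLED_KEY = "pig.tmpfilecompression";
    static final String CODEC_KEY = "pig.tmpfilecompression.codec";

    /**
     * Hypothetical helper (not Pig's own code): returns the codec to use for
     * temporary files, or null when compression is switched off.
     */
    static String resolveCodec(Properties props) {
        // Compression is off unless the property is explicitly "true".
        boolean enabled = Boolean.parseBoolean(props.getProperty(ENABLED_KEY, "false"));
        if (!enabled) {
            return null;
        }
        // The codec has no default value, so it must be given explicitly.
        String codec = props.getProperty(CODEC_KEY);
        if (codec == null) {
            throw new IllegalArgumentException(
                    CODEC_KEY + " must be set when " + ENABLED_KEY + " is true");
        }
        // Only the two values listed in the release note are accepted.
        if (!codec.equals("gz") && !codec.equals("lzo")) {
            throw new IllegalArgumentException("Unsupported codec: " + codec);
        }
        return codec;
    }

    public static void main(String[] args) {
        // Reads the -D options passed on the java command line.
        String codec = resolveCodec(System.getProperties());
        System.out.println(codec == null
                ? "temporary files are not compressed"
                : "temporary files use codec " + codec);
    }
}
```

Run with -Dpig.tmpfilecompression=true -Dpig.tmpfilecompression.codec=lzo, as in the launch command above, the sketch reports the lzo codec; with neither option it reports that temporary files are not compressed.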
[jira] Updated: (PIG-1399) Logical Optimizer: Expression optimizor rule
[ https://issues.apache.org/jira/browse/PIG-1399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Zhou updated PIG-1399: -- Status: Patch Available (was: Open) Release Note: This logical expression simplification covers the following types of simplifications: 1) Constant pre-calculation Example: B = filter A by a0 > 5+7; is simplified to B = filter A by a0 > 12; 2) Elimination of negations Example: B = filter A by not (not(a0>5) or a>10); is simplified to B = filter A by a0>5 and a<=10; 3) Elimination of logically implied expressions in AND Example: B = filter A by (a0 > 5 and a0 > 7); is simplified to B = filter A by a0 > 7; 4) Elimination of logically implied expressions in OR Example: B = filter A by ((a0 > 5) or (a0 > 6 and a1 > 15)); is simplified to B = filter A by a0 > 5; 5) Equivalence elimination Example: B = filter A by (a0 > 5 and a0 > 5); is simplified to B = filter A by a0 > 5; 6) Elimination of complementary expressions in OR Example: B = filter A by (a0 > 5 OR a0 <= 5); is simplified to no filtering 7) Elimination of trivially TRUE expressions Example: B = filter A by 1==1; is simplified to no filtering > Logical Optimizer: Expression optimizor rule > > > Key: PIG-1399 > URL: https://issues.apache.org/jira/browse/PIG-1399 > Project: Pig > Issue Type: Sub-task > Components: impl >Affects Versions: 0.7.0 >Reporter: Daniel Dai >Assignee: Yan Zhou > Fix For: 0.8.0 > > Attachments: newPatchFindbugsWarnings.html, PIG-1399.patch, > PIG-1399.patch, PIG-1399.patch, PIG-1399.patch, PIG-1399.patch, > PIG-1399.patch, PIG-1399.patch > > > We can optimize expression in several ways: > 1. Constant pre-calculation > Example: > B = filter A by a0 > 5+7; > => B = filter A by a0 > 12; > 2. Boolean expression optimization > Example: > B = filter A by not (not(a0>5) or a>10); > => B = filter A by a0>5 and a<=10; -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
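The negation-elimination example in the release note above can be reproduced with De Morgan's laws: push each NOT down to the comparison leaves and flip the comparison operator there. The sketch below shows that idea on a tiny expression tree. It is an illustration only: the classes (Expr, Cmp, Not, Bin) and the method pushNegation are hypothetical, are not taken from the PIG-1399 patch, and ignore null handling.

```java
public class NegationElimination {
    // Hypothetical minimal expression tree, for illustration only.
    static abstract class Expr {}

    // Comparison leaf such as a0>5.
    static final class Cmp extends Expr {
        final String field; final String op; final int value;
        Cmp(String field, String op, int value) {
            this.field = field; this.op = op; this.value = value;
        }
        @Override public String toString() { return field + op + value; }
    }

    // Logical negation of a sub-expression.
    static final class Not extends Expr {
        final Expr inner;
        Not(Expr inner) { this.inner = inner; }
        @Override public String toString() { return "not(" + inner + ")"; }
    }

    // Binary "and" / "or" node.
    static final class Bin extends Expr {
        final String op; final Expr left; final Expr right;
        Bin(String op, Expr left, Expr right) {
            this.op = op; this.left = left; this.right = right;
        }
        @Override public String toString() { return left + " " + op + " " + right; }
    }

    // Returns the comparison operator that expresses the negated comparison.
    static String flip(String op) {
        switch (op) {
            case ">":  return "<=";
            case "<":  return ">=";
            case ">=": return "<";
            case "<=": return ">";
            case "==": return "!=";
            case "!=": return "==";
            default: throw new IllegalArgumentException("Unknown operator: " + op);
        }
    }

    // Pushes negations down to the leaves; 'negate' says whether an odd
    // number of NOTs has been seen on the path from the root.
    static Expr pushNegation(Expr e, boolean negate) {
        if (e instanceof Not) {
            return pushNegation(((Not) e).inner, !negate);
        }
        if (e instanceof Bin) {
            Bin b = (Bin) e;
            // De Morgan: a negated "and" becomes "or" and vice versa.
            String op = negate ? (b.op.equals("and") ? "or" : "and") : b.op;
            return new Bin(op, pushNegation(b.left, negate), pushNegation(b.right, negate));
        }
        Cmp c = (Cmp) e;
        return negate ? new Cmp(c.field, flip(c.op), c.value) : c;
    }

    // Builds the release-note example: not (not(a0>5) or a>10).
    static String demo() {
        Expr filter = new Not(new Bin("or",
                new Not(new Cmp("a0", ">", 5)),
                new Cmp("a", ">", 10)));
        return pushNegation(filter, false).toString();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "a0>5 and a<=10"
    }
}
```

Carrying a single boolean flag down the tree avoids building intermediate NOT nodes, and a double negation simply toggles the flag twice.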
[jira] Updated: (PIG-1399) Logical Optimizer: Expression optimizor rule
[ https://issues.apache.org/jira/browse/PIG-1399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Zhou updated PIG-1399: -- Attachment: PIG-1399.patch I used FindBugs 1.3.9 and it found the patch clean. The attached FindBugs results were generated using 1.3.8, which might explain the difference. Anyway, I made a minor modification that should fix the warnings reported by 1.3.8. > Logical Optimizer: Expression optimizor rule > > > Key: PIG-1399 > URL: https://issues.apache.org/jira/browse/PIG-1399 > Project: Pig > Issue Type: Sub-task > Components: impl >Affects Versions: 0.7.0 >Reporter: Daniel Dai >Assignee: Yan Zhou > Fix For: 0.8.0 > > Attachments: newPatchFindbugsWarnings.html, PIG-1399.patch, > PIG-1399.patch, PIG-1399.patch, PIG-1399.patch, PIG-1399.patch, > PIG-1399.patch, PIG-1399.patch > > > We can optimize expression in several ways: > 1. Constant pre-calculation > Example: > B = filter A by a0 > 5+7; > => B = filter A by a0 > 12; > 2. Boolean expression optimization > Example: > B = filter A by not (not(a0>5) or a>10); > => B = filter A by a0>5 and a<=10; -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1399) Logical Optimizer: Expression optimizor rule
[ https://issues.apache.org/jira/browse/PIG-1399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Zhou updated PIG-1399: -- Attachment: PIG-1399.patch rebased on the latest trunk. > Logical Optimizer: Expression optimizor rule > > > Key: PIG-1399 > URL: https://issues.apache.org/jira/browse/PIG-1399 > Project: Pig > Issue Type: Sub-task > Components: impl >Affects Versions: 0.7.0 >Reporter: Daniel Dai >Assignee: Yan Zhou > Fix For: 0.8.0 > > Attachments: PIG-1399.patch, PIG-1399.patch, PIG-1399.patch, > PIG-1399.patch, PIG-1399.patch, PIG-1399.patch > > > We can optimize expression in several ways: > 1. Constant pre-calculation > Example: > B = filter A by a0 > 5+7; > => B = filter A by a0 > 12; > 2. Boolean expression optimization > Example: > B = filter A by not (not(a0>5) or a>10); > => B = filter A by a0>5 and a<=10; -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1399) Logical Optimizer: Expression optimizor rule
[ https://issues.apache.org/jira/browse/PIG-1399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Zhou updated PIG-1399: -- Attachment: PIG-1399.patch Addressed the review comments, except that I did not split the logic into several optimization rules, since the order in which the rules are applied is significant. > Logical Optimizer: Expression optimizor rule > > > Key: PIG-1399 > URL: https://issues.apache.org/jira/browse/PIG-1399 > Project: Pig > Issue Type: Sub-task > Components: impl >Affects Versions: 0.7.0 >Reporter: Daniel Dai >Assignee: Yan Zhou > Fix For: 0.8.0 > > Attachments: PIG-1399.patch, PIG-1399.patch, PIG-1399.patch, > PIG-1399.patch, PIG-1399.patch > > > We can optimize expression in several ways: > 1. Constant pre-calculation > Example: > B = filter A by a0 > 5+7; > => B = filter A by a0 > 12; > 2. Boolean expression optimization > Example: > B = filter A by not (not(a0>5) or a>10); > => B = filter A by a0>5 and a<=10; -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1518) multi file input format for loaders
[ https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903528#action_12903528 ] Yan Zhou commented on PIG-1518: --- All other functionalities, except for the two mentioned in the previous comment, will see splits combined by default, if necessary. > multi file input format for loaders > --- > > Key: PIG-1518 > URL: https://issues.apache.org/jira/browse/PIG-1518 > Project: Pig > Issue Type: Improvement >Reporter: Olga Natkovich >Assignee: Yan Zhou > Fix For: 0.8.0 > > Attachments: PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, > PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch > > > We frequently run into the situation where Pig needs to deal with small files > in the input. In this case a separate map is created for each file, which > could be very inefficient. > It would be great to have an umbrella input format that can take multiple > files and use them in a single split. We would like to see this working with > different data formats if possible. > There are already a couple of input formats doing a similar thing: > MultifileInputFormat as well as CombinedInputFormat; however, neither works > with the new Hadoop 20 API. > We at least want to do a feasibility study for Pig 0.8.0. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1518) multi file input format for loaders
[ https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903525#action_12903525 ] Yan Zhou commented on PIG-1518: --- In summary, the following functionalities won't see splits combined on loads: 1) map-side cogroup; 2) merge join. > multi file input format for loaders > --- > > Key: PIG-1518 > URL: https://issues.apache.org/jira/browse/PIG-1518 > Project: Pig > Issue Type: Improvement >Reporter: Olga Natkovich >Assignee: Yan Zhou > Fix For: 0.8.0 > > Attachments: PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, > PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch > > > We frequently run into the situation where Pig needs to deal with small files > in the input. In this case a separate map is created for each file, which > could be very inefficient. > It would be great to have an umbrella input format that can take multiple > files and use them in a single split. We would like to see this working with > different data formats if possible. > There are already a couple of input formats doing a similar thing: > MultifileInputFormat as well as CombinedInputFormat; however, neither works > with the new Hadoop 20 API. > We at least want to do a feasibility study for Pig 0.8.0. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1518) multi file input format for loaders
[ https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903423#action_12903423 ] Yan Zhou commented on PIG-1518: --- Neither MergeJoinIndexer nor IndexableLoadFunc is combinable. Regarding OrderedLoadFunc, the story is a bit more complex. First of all, its only non-overridden method, getSplitComparable, is used only in MergeJoinIndexer, which is already not combinable. The big issue is FileInputLoadFunc, which is extended by BinStorage, PigStorage and InterStorage. Semantically, I agree that an OrderedLoadFunc should not be combinable. However, FileInputLoadFunc's implementation of OrderedLoadFunc makes little sense in that its ordering is based on the (path, offset) pair. This is an ordering, but just an arbitrary one. Mathematically, one can establish an arbitrary ordering over any discrete set of data. But the point is how the ordering is used. For our purpose, the ordering should be related to some keys used in data manipulation, for which (path, offset) does not serve the purpose. Or, implicitly, a FileInputLoadFunc still requires that the storage give out splits in some key ordering. If that storage ordering does not actually exist, FileInputLoadFunc as an OrderedLoadFunc will have no use for its "sortedness" because the ordering is just, well, arbitrary. The three extensions of FileInputLoadFunc work on generic data storage. Unless they work on sorted data in general, they should not be OrderedLoadFuncs. The other use of OrderedLoadFunc, not of its non-overridden method getSplitComparable, is by map-side cogroup. But it does not check whether the sort key is the join key, which is critical for correctness. It also requires the loader to be a CollectableLoadFunc to work properly. 
Since we do not want to break backward compatibility, and the only use of OrderedLoadFunc in Pig, except for MergeJoinIndexer which is already excluded from combining, is in map-side cogroup with CollectableLoadFunc, I mark "CollectableLoadFunc AND OrderedLoadFunc" as non-combinable. In the future, we should really clean up the OrderedLoadFunc from FileInputLoadFunc and let the getSplitComparable method provide key-related info and not the (path, offset) pair. Backward compatibility may need to be addressed too. Only then will the water become clearer, and I will be OK with adjusting the non-combinable setting accordingly. > multi file input format for loaders > --- > > Key: PIG-1518 > URL: https://issues.apache.org/jira/browse/PIG-1518 > Project: Pig > Issue Type: Improvement > Reporter: Olga Natkovich >Assignee: Yan Zhou > Fix For: 0.8.0 > > Attachments: PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, > PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch > > > We frequently run into the situation where Pig needs to deal with small files > in the input. In this case a separate map is created for each file, which > could be very inefficient. > It would be great to have an umbrella input format that can take multiple > files and use them in a single split. We would like to see this working with > different data formats if possible. > There are already a couple of input formats doing a similar thing: > MultifileInputFormat as well as CombinedInputFormat; however, neither works > with the new Hadoop 20 API. > We at least want to do a feasibility study for Pig 0.8.0. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1518) multi file input format for loaders
[ https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903102#action_12903102 ] Yan Zhou commented on PIG-1518: --- A loader is not combinable if it is a CollectableLoadFunc AND an OrderedLoadFunc. Since PigStorage is a CollectableLoadFunc but not an OrderedLoadFunc, it is combinable. > multi file input format for loaders > --- > > Key: PIG-1518 > URL: https://issues.apache.org/jira/browse/PIG-1518 > Project: Pig > Issue Type: Improvement >Reporter: Olga Natkovich >Assignee: Yan Zhou > Fix For: 0.8.0 > > Attachments: PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, > PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch > > > We frequently run into the situation where Pig needs to deal with small files > in the input. In this case a separate map is created for each file, which > could be very inefficient. > It would be great to have an umbrella input format that can take multiple > files and use them in a single split. We would like to see this working with > different data formats if possible. > There are already a couple of input formats doing a similar thing: > MultifileInputFormat as well as CombinedInputFormat; however, neither works > with the new Hadoop 20 API. > We at least want to do a feasibility study for Pig 0.8.0. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
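The rule stated in the comment above (a loader is non-combinable only when it is both a CollectableLoadFunc and an OrderedLoadFunc) amounts to a two-interface check. The sketch below illustrates it under stated assumptions: the two nested interfaces are empty stand-ins that merely borrow the names of Pig's interfaces, and isCombinable and the three loader classes are hypothetical. It does not cover the other exclusions discussed in this thread, such as MergeJoinIndexer and IndexableLoadFunc.

```java
public class SplitCombinability {
    // Empty stand-ins named after Pig's interfaces; the real ones declare methods.
    interface CollectableLoadFunc {}
    interface OrderedLoadFunc {}

    // Hypothetical check: splits may be combined unless the loader
    // implements BOTH interfaces.
    static boolean isCombinable(Object loader) {
        boolean collectable = loader instanceof CollectableLoadFunc;
        boolean ordered = loader instanceof OrderedLoadFunc;
        return !(collectable && ordered);
    }

    // Hypothetical loaders used only to exercise the rule.
    static class PlainLoader {}
    static class CollectableOnlyLoader implements CollectableLoadFunc {}
    static class SortedCollectableLoader implements CollectableLoadFunc, OrderedLoadFunc {}

    public static void main(String[] args) {
        System.out.println(isCombinable(new PlainLoader()));             // true
        System.out.println(isCombinable(new CollectableOnlyLoader()));   // true
        System.out.println(isCombinable(new SortedCollectableLoader())); // false
    }
}
```

Keying the decision on the conjunction of the two interfaces, rather than on OrderedLoadFunc alone, is what leaves loaders that implement only one of them eligible for split combination.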
[jira] Updated: (PIG-1518) multi file input format for loaders
[ https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Zhou updated PIG-1518: -- Attachment: PIG-1518.patch rebased on the latest trunk > multi file input format for loaders > --- > > Key: PIG-1518 > URL: https://issues.apache.org/jira/browse/PIG-1518 > Project: Pig > Issue Type: Improvement >Reporter: Olga Natkovich > Assignee: Yan Zhou > Fix For: 0.8.0 > > Attachments: PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, > PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch > > > We frequently run into the situation where Pig needs to deal with small files > in the input. In this case a separate map is created for each file, which > could be very inefficient. > It would be great to have an umbrella input format that can take multiple > files and use them in a single split. We would like to see this working with > different data formats if possible. > There are already a couple of input formats doing a similar thing: > MultifileInputFormat as well as CombinedInputFormat; however, neither works > with the new Hadoop 20 API. > We at least want to do a feasibility study for Pig 0.8.0. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1518) multi file input format for loaders
[ https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Zhou updated PIG-1518: -- Attachment: PIG-1518.patch > multi file input format for loaders > --- > > Key: PIG-1518 > URL: https://issues.apache.org/jira/browse/PIG-1518 > Project: Pig > Issue Type: Improvement >Reporter: Olga Natkovich > Assignee: Yan Zhou > Fix For: 0.8.0 > > Attachments: PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, > PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch > > > We frequently run into the situation where Pig needs to deal with small files > in the input. In this case a separate map is created for each file, which > could be very inefficient. > It would be great to have an umbrella input format that can take multiple > files and use them in a single split. We would like to see this working with > different data formats if possible. > There are already a couple of input formats doing a similar thing: > MultifileInputFormat as well as CombinedInputFormat; however, neither works > with the new Hadoop 20 API. > We at least want to do a feasibility study for Pig 0.8.0. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1501) need to investigate the impact of compression on pig performance
[ https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Zhou updated PIG-1501: -- Status: Patch Available (was: Open) This feature will save HDFS space used to store the intermediate data used by PIG and potentially improve query execution speed. In general, the more intermediate data generated, the more storage and speedup benefits. There are no backward compatibility issues as result of this feature. An example is the following "test.pig" script: register pigperf.jar; A = load '/user/pig/tests/data/pigmix/page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader() as (user, action, timespent:long, query_term, ip_addr, timestamp, estimated_revenue, page_info, page_links); B1 = filter A by timespent == 4; B = load '/user/pig/tests/data/pigmix/queryterm' as (query_term); C = join B1 by query_term, B by query_term using 'skewed' parallel 300; D = distinct C parallel 300; store D into 'output.lzo'; which is launched as follows: java -cp /grid/0/gs/conf/current:/grid/0/jars/pig.jar -Djava.library.path=/grid/0/gs/hadoop/current/lib/native/Linux-i386-32 -Dpig.tmpfilecompression=true -Dpig.tmpfilecompression.codec=lzo org.apache.pig.Main ./test.pig > need to investigate the impact of compression on pig performance > > > Key: PIG-1501 > URL: https://issues.apache.org/jira/browse/PIG-1501 > Project: Pig > Issue Type: Test >Reporter: Olga Natkovich >Assignee: Yan Zhou > Fix For: 0.8.0 > > Attachments: compress_perf_data.txt, compress_perf_data_2.txt, > PIG-1501.patch, PIG-1501.patch, PIG-1501.patch > > > We would like to understand how compressing map results as well as well as > reducer output in a chain of MR jobs impacts performance. We can use PigMix > queries for this investigation. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1518) multi file input format for loaders
[ https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Zhou updated PIG-1518: -- Status: Open (was: Patch Available) > multi file input format for loaders > --- > > Key: PIG-1518 > URL: https://issues.apache.org/jira/browse/PIG-1518 > Project: Pig > Issue Type: Improvement >Reporter: Olga Natkovich > Assignee: Yan Zhou > Fix For: 0.8.0 > > Attachments: PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, > PIG-1518.patch, PIG-1518.patch, PIG-1518.patch > > > We frequently run into the situation where Pig needs to deal with small files > in the input. In this case a separate map is created for each file, which > could be very inefficient. > It would be great to have an umbrella input format that can take multiple > files and use them in a single split. We would like to see this working with > different data formats if possible. > There are already a couple of input formats doing a similar thing: > MultifileInputFormat as well as CombinedInputFormat; however, neither works > with the new Hadoop 20 API. > We at least want to do a feasibility study for Pig 0.8.0. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1518) multi file input format for loaders
[ https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yan Zhou updated PIG-1518:
--------------------------

    Attachment: PIG-1518.patch

Improvement on logging info.

> multi file input format for loaders
> -----------------------------------
>
>                 Key: PIG-1518
>                 URL: https://issues.apache.org/jira/browse/PIG-1518
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Yan Zhou
>             Fix For: 0.8.0
>
>         Attachments: PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch
>
> We frequently run into the situation where Pig needs to deal with small files
> in the input. In this case a separate map is created for each file, which
> could be very inefficient.
> It would be great to have an umbrella input format that can take multiple
> files and use them in a single split. We would like to see this working with
> different data formats if possible.
> There are already a couple of input formats doing a similar thing:
> MultifileInputFormat as well as CombinedInputFormat; however, neither works
> with the new Hadoop 20 API.
> We at least want to do a feasibility study for Pig 0.8.0.
[jira] Updated: (PIG-1399) Logical Optimizer: Expression optimizor rule
[ https://issues.apache.org/jira/browse/PIG-1399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yan Zhou updated PIG-1399:
--------------------------

    Attachment: PIG-1399.patch

Rebasing on the latest trunk.

> Logical Optimizer: Expression optimizor rule
> --------------------------------------------
>
>                 Key: PIG-1399
>                 URL: https://issues.apache.org/jira/browse/PIG-1399
>             Project: Pig
>          Issue Type: Sub-task
>          Components: impl
>    Affects Versions: 0.7.0
>            Reporter: Daniel Dai
>            Assignee: Yan Zhou
>             Fix For: 0.8.0
>
>         Attachments: PIG-1399.patch, PIG-1399.patch, PIG-1399.patch, PIG-1399.patch
>
> We can optimize expressions in several ways:
> 1. Constant pre-calculation
> Example:
> B = filter A by a0 > 5+7;
> => B = filter A by a0 > 12;
> 2. Boolean expression optimization
> Example:
> B = filter A by not (not(a0>5) or a>10);
> => B = filter A by a0>5 and a<=10;
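The two rewrites listed in the PIG-1399 description can be illustrated with a small sketch. This is not the code from PIG-1399.patch, which is not shown in this thread; it is a toy Python model whose class and function names (Const, Col, BinOp, Not, fold, push_not) are all hypothetical. It only demonstrates constant folding and pushing "not" inward with De Morgan's laws on the two examples quoted above.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Const:
    value: int


@dataclass(frozen=True)
class Col:
    name: str


@dataclass(frozen=True)
class BinOp:
    op: str  # '+', a comparison such as '>', or 'and' / 'or'
    left: "Expr"
    right: "Expr"


@dataclass(frozen=True)
class Not:
    arg: "Expr"


Expr = Union[Const, Col, BinOp, Not]

# Comparison operators and their logical negations.
NEGATED = {">": "<=", "<=": ">", "<": ">=", ">=": "<", "==": "!=", "!=": "=="}
# De Morgan duals: not (x or y) == (not x) and (not y), and vice versa.
DUAL = {"and": "or", "or": "and"}


def fold(e):
    """Constant pre-calculation: replace '+' over two constants by its value."""
    if isinstance(e, BinOp):
        left, right = fold(e.left), fold(e.right)
        if e.op == "+" and isinstance(left, Const) and isinstance(right, Const):
            return Const(left.value + right.value)
        return BinOp(e.op, left, right)
    if isinstance(e, Not):
        return Not(fold(e.arg))
    return e


def push_not(e, negate=False):
    """Boolean simplification: push negations down to the comparisons."""
    if isinstance(e, Not):
        return push_not(e.arg, not negate)
    if isinstance(e, BinOp) and e.op in DUAL:
        op = DUAL[e.op] if negate else e.op
        return BinOp(op, push_not(e.left, negate), push_not(e.right, negate))
    if isinstance(e, BinOp) and e.op in NEGATED:
        return BinOp(NEGATED[e.op], e.left, e.right) if negate else e
    return Not(e) if negate else e


# Example 1 from the issue: a0 > 5+7  =>  a0 > 12
example1 = fold(BinOp(">", Col("a0"), BinOp("+", Const(5), Const(7))))

# Example 2 from the issue: not (not(a0>5) or a>10)  =>  a0>5 and a<=10
example2 = push_not(
    Not(BinOp("or",
              Not(BinOp(">", Col("a0"), Const(5))),
              BinOp(">", Col("a"), Const(10))))
)
```

Both functions are pure tree rewrites, so they can be applied in either order; a real optimizer rule would also have to respect Pig's type and null semantics, which this sketch ignores.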
RE: [jira] Commented: (PIG-1501) need to investigate the impact of compression on pig performance
Thanks for the quick turnaround, Thejas.

Yan

-----Original Message-----
From: Thejas M Nair (JIRA) [mailto:j...@apache.org]
Sent: Wednesday, August 25, 2010 8:54 AM
To: pig-dev@hadoop.apache.org
Subject: [jira] Commented: (PIG-1501) need to investigate the impact of compression on pig performance

[ https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902484#action_12902484 ]

Thejas M Nair commented on PIG-1501:
------------------------------------

+1

> need to investigate the impact of compression on pig performance
> ----------------------------------------------------------------
>
>                 Key: PIG-1501
>                 URL: https://issues.apache.org/jira/browse/PIG-1501
>             Project: Pig
>          Issue Type: Test
>            Reporter: Olga Natkovich
>            Assignee: Yan Zhou
>             Fix For: 0.8.0
>
>         Attachments: compress_perf_data.txt, compress_perf_data_2.txt, PIG-1501.patch, PIG-1501.patch, PIG-1501.patch
>
> We would like to understand how compressing map results as well as
> reducer output in a chain of MR jobs impacts performance. We can use PigMix
> queries for this investigation.
[jira] Updated: (PIG-1501) need to investigate the impact of compression on pig performance
[ https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yan Zhou updated PIG-1501:
--------------------------

    Attachment: PIG-1501.patch

Address the review comments; rebase the code on the latest trunk.

> need to investigate the impact of compression on pig performance
> ----------------------------------------------------------------
>
>                 Key: PIG-1501
>                 URL: https://issues.apache.org/jira/browse/PIG-1501
>             Project: Pig
>          Issue Type: Test
>            Reporter: Olga Natkovich
>            Assignee: Yan Zhou
>             Fix For: 0.8.0
>
>         Attachments: compress_perf_data.txt, compress_perf_data_2.txt, PIG-1501.patch, PIG-1501.patch, PIG-1501.patch
>
> We would like to understand how compressing map results as well as
> reducer output in a chain of MR jobs impacts performance. We can use PigMix
> queries for this investigation.
[jira] Updated: (PIG-1518) multi file input format for loaders
[ https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yan Zhou updated PIG-1518:
--------------------------

    Attachment: PIG-1518.patch

Minor polish of debugging code inside comments.

> multi file input format for loaders
> -----------------------------------
>
>                 Key: PIG-1518
>                 URL: https://issues.apache.org/jira/browse/PIG-1518
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Yan Zhou
>             Fix For: 0.8.0
>
>         Attachments: PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch
>
> We frequently run into the situation where Pig needs to deal with small files
> in the input. In this case a separate map is created for each file, which
> could be very inefficient.
> It would be great to have an umbrella input format that can take multiple
> files and use them in a single split. We would like to see this working with
> different data formats if possible.
> There are already a couple of input formats doing a similar thing:
> MultifileInputFormat as well as CombinedInputFormat; however, neither works
> with the new Hadoop 20 API.
> We at least want to do a feasibility study for Pig 0.8.0.
[jira] Updated: (PIG-1518) multi file input format for loaders
[ https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yan Zhou updated PIG-1518:
--------------------------

    Status: Patch Available  (was: Open)

> multi file input format for loaders
> -----------------------------------
>
>                 Key: PIG-1518
>                 URL: https://issues.apache.org/jira/browse/PIG-1518
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Yan Zhou
>             Fix For: 0.8.0
>
>         Attachments: PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch
>
> We frequently run into the situation where Pig needs to deal with small files
> in the input. In this case a separate map is created for each file, which
> could be very inefficient.
> It would be great to have an umbrella input format that can take multiple
> files and use them in a single split. We would like to see this working with
> different data formats if possible.
> There are already a couple of input formats doing a similar thing:
> MultifileInputFormat as well as CombinedInputFormat; however, neither works
> with the new Hadoop 20 API.
> We at least want to do a feasibility study for Pig 0.8.0.
[jira] Updated: (PIG-1518) multi file input format for loaders
[ https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yan Zhou updated PIG-1518:
--------------------------

    Status: Open  (was: Patch Available)

> multi file input format for loaders
> -----------------------------------
>
>                 Key: PIG-1518
>                 URL: https://issues.apache.org/jira/browse/PIG-1518
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Yan Zhou
>             Fix For: 0.8.0
>
>         Attachments: PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch
>
> We frequently run into the situation where Pig needs to deal with small files
> in the input. In this case a separate map is created for each file, which
> could be very inefficient.
> It would be great to have an umbrella input format that can take multiple
> files and use them in a single split. We would like to see this working with
> different data formats if possible.
> There are already a couple of input formats doing a similar thing:
> MultifileInputFormat as well as CombinedInputFormat; however, neither works
> with the new Hadoop 20 API.
> We at least want to do a feasibility study for Pig 0.8.0.
[jira] Updated: (PIG-1518) multi file input format for loaders
[ https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yan Zhou updated PIG-1518:
--------------------------

          Status: Patch Available  (was: Open)
    Release Note: Feature: combine splits of sizes smaller than the value of the property "pig.maxCombinedSplitSize" or, if "pig.maxCombinedSplitSize" is not set, the file system default block size of the load's location. This feature can be turned off by setting the property "pig.noSplitCombination" to true. When such a combination is performed, a log message like "Total input paths (combined) to process : 7" will be logged.

This feature applies when a user input, or an intermediate input, has many small files to be loaded that would otherwise cause many more "under-fed" mappers to be launched and potentially slow down the execution.

This change will not cause any backward compatibility issue unless a loader implementation makes use of the PigSplit object passed through the prepareToRead method, in which case a rebuild of the loader might be necessary because PigSplit's definition has been modified. However, we currently know of no external use of the object.

In addition, if a loader implements IndexableLoadFunc, or implements OrderedLoadFunc and CollectableLoadFunc, its input splits won't be subject to possible combinations.

> multi file input format for loaders
> -----------------------------------
>
>                 Key: PIG-1518
>                 URL: https://issues.apache.org/jira/browse/PIG-1518
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Yan Zhou
>             Fix For: 0.8.0
>
>         Attachments: PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch
>
> We frequently run into the situation where Pig needs to deal with small files
> in the input. In this case a separate map is created for each file, which
> could be very inefficient.
> It would be great to have an umbrella input format that can take multiple
> files and use them in a single split. We would like to see this working with
> different data formats if possible.
> There are already a couple of input formats doing a similar thing:
> MultifileInputFormat as well as CombinedInputFormat; however, neither works
> with the new Hadoop 20 API.
> We at least want to do a feasibility study for Pig 0.8.0.
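To make the release note above concrete, here is a rough sketch of size-based split combination. It is not the logic of PIG-1518.patch, which is not reproduced in this thread and may group splits differently (for example, by data locality). The function name combine_splits and the plain list of byte sizes are assumptions made for illustration; the only behavior taken from the thread is that small splits are packed together up to a maximum combined size and that empty splits are skipped during merging.

```python
def combine_splits(split_sizes, max_combined_size):
    """Greedily pack consecutive splits into groups of at most max_combined_size bytes.

    Returns a list of groups, each group being a list of indexes into split_sizes.
    A split that is already as large as the limit ends up in a group of its own.
    """
    combined = []
    current, current_size = [], 0
    for index, size in enumerate(split_sizes):
        if size == 0:
            # Empty splits carry no data, so they are dropped while merging.
            continue
        if current and current_size + size > max_combined_size:
            # Adding this split would exceed the limit: close the current group.
            combined.append(current)
            current, current_size = [], 0
        current.append(index)
        current_size += size
    if current:
        combined.append(current)
    return combined


# Six input splits (sizes in bytes, purely illustrative) with a 64-byte limit.
groups = combine_splits([10, 20, 0, 30, 100, 5], 64)
# groups == [[0, 1, 3], [4], [5]]
```

Each resulting group would be handed to a single mapper, which is how many small files can be served by far fewer map tasks.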
[jira] Updated: (PIG-1518) multi file input format for loaders
[ https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yan Zhou updated PIG-1518:
--------------------------

    Attachment: PIG-1518.patch

Fix a typo; rebase on the latest trunk.

> multi file input format for loaders
> -----------------------------------
>
>                 Key: PIG-1518
>                 URL: https://issues.apache.org/jira/browse/PIG-1518
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Yan Zhou
>             Fix For: 0.8.0
>
>         Attachments: PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch
>
> We frequently run into the situation where Pig needs to deal with small files
> in the input. In this case a separate map is created for each file, which
> could be very inefficient.
> It would be great to have an umbrella input format that can take multiple
> files and use them in a single split. We would like to see this working with
> different data formats if possible.
> There are already a couple of input formats doing a similar thing:
> MultifileInputFormat as well as CombinedInputFormat; however, neither works
> with the new Hadoop 20 API.
> We at least want to do a feasibility study for Pig 0.8.0.
[jira] Updated: (PIG-1518) multi file input format for loaders
[ https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yan Zhou updated PIG-1518:
--------------------------

    Attachment: PIG-1518.patch

The add method of PigSplit is removed. The debug code is left in to facilitate future debugging work. The use of initNextRecordReader is pretty much cloned from org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader, and I'll leave it as is too.

> multi file input format for loaders
> -----------------------------------
>
>                 Key: PIG-1518
>                 URL: https://issues.apache.org/jira/browse/PIG-1518
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Yan Zhou
>             Fix For: 0.8.0
>
>         Attachments: PIG-1518.patch, PIG-1518.patch, PIG-1518.patch
>
> We frequently run into the situation where Pig needs to deal with small files
> in the input. In this case a separate map is created for each file, which
> could be very inefficient.
> It would be great to have an umbrella input format that can take multiple
> files and use them in a single split. We would like to see this working with
> different data formats if possible.
> There are already a couple of input formats doing a similar thing:
> MultifileInputFormat as well as CombinedInputFormat; however, neither works
> with the new Hadoop 20 API.
> We at least want to do a feasibility study for Pig 0.8.0.
[jira] Updated: (PIG-1399) Logical Optimizer: Expression optimizor rule
[ https://issues.apache.org/jira/browse/PIG-1399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yan Zhou updated PIG-1399:
--------------------------

    Attachment: PIG-1399.patch

Internal Hudson results:

     [exec] +1 overall.
     [exec]
     [exec]     +1 @author.  The patch does not contain any @author tags.
     [exec]
     [exec]     +1 tests included.  The patch appears to include 3 new or modified tests.
     [exec]
     [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
     [exec]
     [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
     [exec]
     [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
     [exec]
     [exec]     +1 release audit.  The applied patch does not increase the total number of release audit warnings.

All core tests also pass.

> Logical Optimizer: Expression optimizor rule
> --------------------------------------------
>
>                 Key: PIG-1399
>                 URL: https://issues.apache.org/jira/browse/PIG-1399
>             Project: Pig
>          Issue Type: Sub-task
>          Components: impl
>    Affects Versions: 0.7.0
>            Reporter: Daniel Dai
>            Assignee: Yan Zhou
>             Fix For: 0.8.0
>
>         Attachments: PIG-1399.patch, PIG-1399.patch, PIG-1399.patch
>
> We can optimize expressions in several ways:
> 1. Constant pre-calculation
> Example:
> B = filter A by a0 > 5+7;
> => B = filter A by a0 > 12;
> 2. Boolean expression optimization
> Example:
> B = filter A by not (not(a0>5) or a>10);
> => B = filter A by a0>5 and a<=10;
[jira] Commented: (PIG-1501) need to investigate the impact of compression on pig performance
[ https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900950#action_12900950 ]

Yan Zhou commented on PIG-1501:
-------------------------------

The internal Hudson results are as follows:

     [exec] -1 overall.
     [exec]
     [exec]     +1 @author.  The patch does not contain any @author tags.
     [exec]
     [exec]     +1 tests included.  The patch appears to include 9 new or modified tests.
     [exec]
     [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
     [exec]
     [exec]     -1 javac.  The applied patch generated 162 javac compiler warnings (more than the trunk's current 156 warnings).
     [exec]
     [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
     [exec]
     [exec]     -1 release audit.  The applied patch generated 427 release audit warnings (more than the trunk's current 425 warnings).

The 6 javac warnings are from the use of a deprecated PigMapReduce.sJobConf field. But that deprecation is intended for external use only, and internal use should be OK.

The 2 release audit warnings are on two html files, SampleOptimizer.html and org.apache.pig.impl.util.Utils.html.

> need to investigate the impact of compression on pig performance
> ----------------------------------------------------------------
>
>                 Key: PIG-1501
>                 URL: https://issues.apache.org/jira/browse/PIG-1501
>             Project: Pig
>          Issue Type: Test
>            Reporter: Olga Natkovich
>            Assignee: Yan Zhou
>             Fix For: 0.8.0
>
>         Attachments: compress_perf_data.txt, compress_perf_data_2.txt, PIG-1501.patch, PIG-1501.patch
>
> We would like to understand how compressing map results as well as
> reducer output in a chain of MR jobs impacts performance. We can use PigMix
> queries for this investigation.
[jira] Updated: (PIG-1501) need to investigate the impact of compression on pig performance
[ https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yan Zhou updated PIG-1501:
--------------------------

    Attachment: PIG-1501.patch

The compression codec is now configurable as gzip or lzo; plus some minor changes.

> need to investigate the impact of compression on pig performance
> ----------------------------------------------------------------
>
>                 Key: PIG-1501
>                 URL: https://issues.apache.org/jira/browse/PIG-1501
>             Project: Pig
>          Issue Type: Test
>            Reporter: Olga Natkovich
>            Assignee: Yan Zhou
>             Fix For: 0.8.0
>
>         Attachments: compress_perf_data.txt, compress_perf_data_2.txt, PIG-1501.patch, PIG-1501.patch
>
> We would like to understand how compressing map results as well as
> reducer output in a chain of MR jobs impacts performance. We can use PigMix
> queries for this investigation.
[jira] Updated: (PIG-1399) Logical Optimizer: Expression optimizor rule
[ https://issues.apache.org/jira/browse/PIG-1399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yan Zhou updated PIG-1399:
--------------------------

    Attachment: PIG-1399.patch

Rebased on the latest trunk.

> Logical Optimizer: Expression optimizor rule
> --------------------------------------------
>
>                 Key: PIG-1399
>                 URL: https://issues.apache.org/jira/browse/PIG-1399
>             Project: Pig
>          Issue Type: Sub-task
>          Components: impl
>    Affects Versions: 0.7.0
>            Reporter: Daniel Dai
>            Assignee: Yan Zhou
>             Fix For: 0.8.0
>
>         Attachments: PIG-1399.patch, PIG-1399.patch
>
> We can optimize expressions in several ways:
> 1. Constant pre-calculation
> Example:
> B = filter A by a0 > 5+7;
> => B = filter A by a0 > 12;
> 2. Boolean expression optimization
> Example:
> B = filter A by not (not(a0>5) or a>10);
> => B = filter A by a0>5 and a<=10;
[jira] Updated: (PIG-1518) multi file input format for loaders
[ https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yan Zhou updated PIG-1518:
--------------------------

    Attachment: PIG-1518.patch

Style changes, Hudson pass, plus other minor changes.

Internal Hudson results:

     [exec] -1 overall.
     [exec]
     [exec]     +1 @author.  The patch does not contain any @author tags.
     [exec]
     [exec]     +1 tests included.  The patch appears to include 3 new or modified tests.
     [exec]
     [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
     [exec]
     [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
     [exec]
     [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
     [exec]
     [exec]     -1 release audit.  The applied patch generated 427 release audit warnings (more than the trunk's current 425 warnings).

The release audit warnings are on two html files: PigInputFormat.html and PigRecordReader.html.

> multi file input format for loaders
> -----------------------------------
>
>                 Key: PIG-1518
>                 URL: https://issues.apache.org/jira/browse/PIG-1518
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Yan Zhou
>             Fix For: 0.8.0
>
>         Attachments: PIG-1518.patch, PIG-1518.patch
>
> We frequently run into the situation where Pig needs to deal with small files
> in the input. In this case a separate map is created for each file, which
> could be very inefficient.
> It would be great to have an umbrella input format that can take multiple
> files and use them in a single split. We would like to see this working with
> different data formats if possible.
> There are already a couple of input formats doing a similar thing:
> MultifileInputFormat as well as CombinedInputFormat; however, neither works
> with the new Hadoop 20 API.
> We at least want to do a feasibility study for Pig 0.8.0.
[jira] Commented: (PIG-1518) multi file input format for loaders
[ https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900123#action_12900123 ]

Yan Zhou commented on PIG-1518:
-------------------------------

No. It does not work inside an optimizer, because it does not change the logical/physical plans the way the other optimizers do.

> multi file input format for loaders
> -----------------------------------
>
>                 Key: PIG-1518
>                 URL: https://issues.apache.org/jira/browse/PIG-1518
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Yan Zhou
>             Fix For: 0.8.0
>
>         Attachments: PIG-1518.patch
>
> We frequently run into the situation where Pig needs to deal with small files
> in the input. In this case a separate map is created for each file, which
> could be very inefficient.
> It would be great to have an umbrella input format that can take multiple
> files and use them in a single split. We would like to see this working with
> different data formats if possible.
> There are already a couple of input formats doing a similar thing:
> MultifileInputFormat as well as CombinedInputFormat; however, neither works
> with the new Hadoop 20 API.
> We at least want to do a feasibility study for Pig 0.8.0.
[jira] Commented: (PIG-1518) multi file input format for loaders
[ https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899888#action_12899888 ]

Yan Zhou commented on PIG-1518:
-------------------------------

In summary, split combination is controlled through the following JVM properties:

pig.maxCombinedSplitSize: specifies the maximum combined split size in bytes. By default, it is the load filesystem's default block size.

pig.splitCombination: takes the values "false" and "true". The default is "true"; "false" disables split combination.

> multi file input format for loaders
> -----------------------------------
>
>                 Key: PIG-1518
>                 URL: https://issues.apache.org/jira/browse/PIG-1518
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Yan Zhou
>             Fix For: 0.8.0
>
>         Attachments: PIG-1518.patch
>
> We frequently run into the situation where Pig needs to deal with small files
> in the input. In this case a separate map is created for each file, which
> could be very inefficient.
> It would be great to have an umbrella input format that can take multiple
> files and use them in a single split. We would like to see this working with
> different data formats if possible.
> There are already a couple of input formats doing a similar thing:
> MultifileInputFormat as well as CombinedInputFormat; however, neither works
> with the new Hadoop 20 API.
> We at least want to do a feasibility study for Pig 0.8.0.
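The defaulting rules in the comment above can be written down as a short sketch. The two property names come from the comment; everything else (the function name resolve_split_settings, passing properties as a plain dict of strings, and supplying the block size as an argument) is an assumption for illustration and does not reflect how Pig actually reads its configuration.

```python
def resolve_split_settings(props, default_block_size):
    """Return (combination_enabled, max_combined_split_size_in_bytes).

    props: mapping of property name to string value, e.g. parsed from -D flags.
    default_block_size: block size of the load location's file system, in bytes.
    """
    # Split combination is on unless the property is explicitly set to "false".
    flag = props.get("pig.splitCombination", "true")
    enabled = flag.strip().lower() != "false"

    # The size limit falls back to the file system's default block size.
    raw_size = props.get("pig.maxCombinedSplitSize")
    max_size = int(raw_size) if raw_size is not None else default_block_size
    return enabled, max_size


block_size = 64 * 1024 * 1024  # illustrative 64 MB block size

defaults = resolve_split_settings({}, block_size)
disabled = resolve_split_settings({"pig.splitCombination": "false"}, block_size)
custom = resolve_split_settings({"pig.maxCombinedSplitSize": "1048576"}, block_size)
```

With no properties set, combination stays enabled and the limit equals the block size; either property can be overridden independently of the other.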
[jira] Updated: (PIG-1518) multi file input format for loaders
[ https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yan Zhou updated PIG-1518:
--------------------------

    Attachment: PIG-1518.patch

> multi file input format for loaders
> -----------------------------------
>
>                 Key: PIG-1518
>                 URL: https://issues.apache.org/jira/browse/PIG-1518
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Yan Zhou
>             Fix For: 0.8.0
>
>         Attachments: PIG-1518.patch
>
> We frequently run into the situation where Pig needs to deal with small files
> in the input. In this case a separate map is created for each file, which
> could be very inefficient.
> It would be great to have an umbrella input format that can take multiple
> files and use them in a single split. We would like to see this working with
> different data formats if possible.
> There are already a couple of input formats doing a similar thing:
> MultifileInputFormat as well as CombinedInputFormat; however, neither works
> with the new Hadoop 20 API.
> We at least want to do a feasibility study for Pig 0.8.0.
[jira] Commented: (PIG-1518) multi file input format for loaders
[ https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899609#action_12899609 ]

Yan Zhou commented on PIG-1518:
-------------------------------

The formatting of the tables in the last comment is a bit off: both headers should be right-shifted by one column.

> multi file input format for loaders
> -----------------------------------
>
>                 Key: PIG-1518
>                 URL: https://issues.apache.org/jira/browse/PIG-1518
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Yan Zhou
>             Fix For: 0.8.0
>
>
> We frequently run into the situation where Pig needs to deal with small files
> in the input. In this case a separate map is created for each file, which
> could be very inefficient.
> It would be great to have an umbrella input format that can take multiple
> files and use them in a single split. We would like to see this working with
> different data formats if possible.
> There are already a couple of input formats doing a similar thing:
> MultifileInputFormat as well as CombinedInputFormat; however, neither works
> with the new Hadoop 20 API.
> We at least want to do a feasibility study for Pig 0.8.0.
[jira] Commented: (PIG-1518) multi file input format for loaders
[ https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899605#action_12899605 ]

Yan Zhou commented on PIG-1518:
-------------------------------

One experimental result on a 15-node cluster of 2 x Xeon L5420 2.50GHz/16G RAM boxes is as follows:

Query:

    register pigperf.jar;
    A = load '/user/pig/tests/data/pigmix/page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader() as (user, action, timespent, query_term, ip_addr, timestamp, estimated_revenue, page_info, page_links);
    B = foreach A generate user, (double)estimated_revenue;
    B1 = distinct B;
    alpha = load '/user/pig/tests/data/pigmix/users' using PigStorage('\u0001') as (name, phone, address, city, state, zip);
    beta = foreach alpha generate name;
    C = join beta by name, B1 by user parallel 300;
    D = group C by $0 parallel 40;
    E = foreach D generate group, SUM(C.estimated_revenue);
    store E into 'spliCombo2.out';

It creates 3 map/reduce jobs.

No Split Combination:

    |                    | Mappers | Reducers |
    | number             | 120     | 300      |
    | elapsed time       | 24s     | 2m43s    |
    | number             | 301     | 300      |
    | elapsed time       | 46s     | 3m11s    |
    | number             | 300     | 40       |
    | elapsed time       | 38s     | 53s      |
    | Total elapsed time | 7m36s   |          |

With Split Combination:

    |                    | Mappers | Reducers |
    | number             | 120     | 300      |
    | elapsed time       | 22s     | 2m49s    |
    | number             | 3       | 300      |
    | elapsed time       | 27s     | 2m46s    |
    | number             | 1       | 40       |
    | elapsed time       | 17s     | 24s      |
    | Total elapsed time | 7m5s    |          |

> multi file input format for loaders
> -----------------------------------
>
>                 Key: PIG-1518
>                 URL: https://issues.apache.org/jira/browse/PIG-1518
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Yan Zhou
>             Fix For: 0.8.0
>
>
> We frequently run into the situation where Pig needs to deal with small files
> in the input. In this case a separate map is created for each file, which
> could be very inefficient.
> It would be great to have an umbrella input format that can take multiple
> files and use them in a single split. We would like to see this working with
> different data formats if possible.
> There are already a couple of input formats doing a similar thing:
> MultifileInputFormat as well as CombinedInputFormat; however, neither works
> with the new Hadoop 20 API.
> We at least want to do a feasibility study for Pig 0.8.0.
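For readers who want the figures from the experiment above side by side, the following sketch does the arithmetic on the reported numbers only. The helper name to_seconds is hypothetical, and nothing here is new measurement data: the inputs are the total elapsed times and mapper counts exactly as quoted in the comment.

```python
def to_seconds(text):
    """Convert a duration written like '7m36s' or '24s' into seconds."""
    minutes, sep, seconds = text.partition("m")
    if not sep:
        # No minutes part, e.g. '24s'.
        minutes, seconds = "0", text
    return int(minutes) * 60 + int(seconds.rstrip("s"))


# Reported total elapsed times.
without_combination = to_seconds("7m36s")
with_combination = to_seconds("7m5s")
saved = without_combination - with_combination
saved_percent = round(100.0 * saved / without_combination, 1)

# Reported mapper counts for the three map/reduce jobs.
mappers_without = [120, 301, 300]
mappers_with = [120, 3, 1]

print(f"elapsed: {without_combination}s -> {with_combination}s "
      f"({saved}s saved, {saved_percent}%)")
print(f"mappers: {sum(mappers_without)} -> {sum(mappers_with)}")
```

On the reported totals this works out to 31 seconds saved (about 6.8%), while the number of mappers launched across the three jobs drops from 721 to 124; the first job is unaffected because its input is not made of small files.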
[jira] Commented: (PIG-1518) multi file input format for loaders
[ https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899445#action_12899445 ]

Yan Zhou commented on PIG-1518:
-------------------------------

Another approach is to mark splits as uncombinable only when necessary. Specifically, MergeJoinIndexer and the base load in map-side cogroup need to be excluded from split combination. Breaking backward compatibility is probably too much of a risk to take. In the meantime, OrderedLoadFunc has a notion of "being evolving" that will leave some headroom for future semantic polishing.

> multi file input format for loaders
> -----------------------------------
>
>                 Key: PIG-1518
>                 URL: https://issues.apache.org/jira/browse/PIG-1518
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Yan Zhou
>             Fix For: 0.8.0
>
>
> We frequently run into the situation where Pig needs to deal with small files
> in the input. In this case a separate map is created for each file, which
> could be very inefficient.
> It would be great to have an umbrella input format that can take multiple
> files and use them in a single split. We would like to see this working with
> different data formats if possible.
> There are already a couple of input formats doing a similar thing:
> MultifileInputFormat as well as CombinedInputFormat; however, neither works
> with the new Hadoop 20 API.
> We at least want to do a feasibility study for Pig 0.8.0.
[jira] Commented: (PIG-1518) multi file input format for loaders
[ https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898490#action_12898490 ] Yan Zhou commented on PIG-1518: --- There is a bigger question at hand. The semantics of OrderedLoadFunc is that the splits are totally ordered. And BinStorage, InterStorage and PigStorage all implement that interface through FileInputLoadFunc. Since the combination of splits as conceived here will definitely destroy the split ordering, if the combination is disabled for these storages, the feature would be virtually useless for a majority of use cases. On the other hand, I'm seeing no use of the comparison capability except for MergeJoinIndexer's getNext() method, which makes me wonder if the OrderedLoadFunc can be removed from the FileInputLoadFunc. Semantically, FileInputLoadFunc should not support the ordering of splits, as Hadoop's FileInputFormat doesn't. When a need arises like in MergeJoinIndexer, we can add that extension on. But the change may incur some backward compatibility issues. I'm now soliciting comments in this area. > multi file input format for loaders > --- > > Key: PIG-1518 > URL: https://issues.apache.org/jira/browse/PIG-1518 > Project: Pig > Issue Type: Improvement > Reporter: Olga Natkovich >Assignee: Yan Zhou > Fix For: 0.8.0 > > > We frequently run in the situation where Pig needs to deal with small files > in the input. In this case a separate map is created for each file which > could be very inefficient. > It would be greate to have an umbrella input format that can take multiple > files and use them in a single split. We would like to see this working with > different data formats if possible. > There are already a couple of input formats doing similar thing: > MultifileInputFormat as well as CombinedInputFormat; howevere, neither works > with ne Hadoop 20 API. > We at least want to do a feasibility study for Pig 0.8.0. -- This message is automatically generated by JIRA. 
- You can reply to this email to add a comment to the issue online.
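To illustrate the ordering concern raised in the comment above, here is a minimal sketch. It assumes each original split can be identified by its position in a sorted input; the helper name `is_totally_ordered` and the sample groupings are hypothetical and are not part of Pig.

```python
def is_totally_ordered(groups):
    """Return True if the groups can be lined up so that every split
    position in one group precedes every position in the next group."""
    spans = sorted((min(g), max(g)) for g in groups)
    return all(prev_hi < lo for (_, prev_hi), (lo, _) in zip(spans, spans[1:]))


# One group per original split: positions follow the order of the sorted input.
original = [[0], [1], [2], [3]]

# Hypothetical node-local combination: splits 0 and 2 live on one node,
# splits 1 and 3 on another, so the combined splits interleave.
combined = [[0, 2], [1, 3]]

print(is_totally_ordered(original))
print(is_totally_ordered(combined))
```

The interleaved grouping no longer admits a total order among the combined splits, which is why loaders relied upon for split ordering cannot simply have their splits combined by locality.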
[jira] Commented: (PIG-1518) multi file input format for loaders
[ https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897887#action_12897887 ] Yan Zhou commented on PIG-1518: --- During the merge process, any empty splits will be skipped. Currently, empty splits are generated for empty files, which is unnecessary in the first place. > multi file input format for loaders > --- > > Key: PIG-1518 > URL: https://issues.apache.org/jira/browse/PIG-1518 > Project: Pig > Issue Type: Improvement > Reporter: Olga Natkovich > Assignee: Yan Zhou > Fix For: 0.8.0 > > > We frequently run into the situation where Pig needs to deal with small files > in the input. In this case a separate map is created for each file, which > could be very inefficient. > It would be great to have an umbrella input format that can take multiple > files and use them in a single split. We would like to see this working with > different data formats if possible. > There are already a couple of input formats doing a similar thing: > MultifileInputFormat as well as CombinedInputFormat; however, neither works > with the new Hadoop 20 API. > We at least want to do a feasibility study for Pig 0.8.0. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1501) need to investigate the impact of compression on pig performance
[ https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897496#action_12897496 ] Yan Zhou commented on PIG-1501: --- Please refer to HADOOP-3315 for an overall SequenceFile vs. TFile comparison. It appears that, for compressed data, TFile performs better than SequenceFile. > need to investigate the impact of compression on pig performance > > > Key: PIG-1501 > URL: https://issues.apache.org/jira/browse/PIG-1501 > Project: Pig > Issue Type: Test > Reporter: Olga Natkovich > Assignee: Yan Zhou > Fix For: 0.8.0 > > Attachments: compress_perf_data.txt, compress_perf_data_2.txt, > PIG-1501.patch > > > We would like to understand how compressing map results as well as > reducer output in a chain of MR jobs impacts performance. We can use PigMix > queries for this investigation. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1518) multi file input format for loaders
[ https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897493#action_12897493 ] Yan Zhou commented on PIG-1518: --- Right, map-side cogroup needs the input to be sorted, but only the "side inputs" need to be able to seek on a key; the "base input" only needs all duplicates of a key to be present in one mapper. I'll mark the "side inputs" as non-combinable. > multi file input format for loaders > --- > > Key: PIG-1518 > URL: https://issues.apache.org/jira/browse/PIG-1518 > Project: Pig > Issue Type: Improvement > Reporter: Olga Natkovich > Assignee: Yan Zhou > Fix For: 0.8.0 > > > We frequently run into the situation where Pig needs to deal with small files > in the input. In this case a separate map is created for each file, which > could be very inefficient. > It would be great to have an umbrella input format that can take multiple > files and use them in a single split. We would like to see this working with > different data formats if possible. > There are already a couple of input formats doing a similar thing: > MultifileInputFormat as well as CombinedInputFormat; however, neither works > with the new Hadoop 20 API. > We at least want to do a feasibility study for Pig 0.8.0. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1518) multi file input format for loaders
[ https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897085#action_12897085 ] Yan Zhou commented on PIG-1518: --- The pseudo code of the combination op is as follows:

    for each node of the nodes (sorted in the order of ascending sizes) {
        while the node's split list (sorted in the order of descending sizes) is not empty {
            find the biggest splits that can be combined with the first split of the list of the splits;
            if the accumulated split size is >= half of the limit {
                generate a combined split;
                remove the accumulated splits from the node's split list;
                clear the accumulated split list;
            } else {
                break;
            }
        }
    }
    // leftover combination
    for each node of the nodes {
        for each split of the node's split list {
            add the split to a leftover list;
        }
    }
    for each split in the leftover list {
        if accumulated split size is >= limit {
            generate a combined split;
            remove the accumulated splits from the node's split list;
            clear the accumulated split list;
        }
        if it is the last split in the leftover list {
            try to see if it can be added with an existing combined split;
            if not, generate a combined split on the accumulated splits;
        }
    }

The complexity is n*log(n) with n being the number of original splits that are smaller than the limit.
> multi file input format for loaders > --- > > Key: PIG-1518 > URL: https://issues.apache.org/jira/browse/PIG-1518 > Project: Pig > Issue Type: Improvement > Reporter: Olga Natkovich > Assignee: Yan Zhou > Fix For: 0.8.0 > > > We frequently run into the situation where Pig needs to deal with small files > in the input. In this case a separate map is created for each file, which > could be very inefficient. > It would be great to have an umbrella input format that can take multiple > files and use them in a single split. We would like to see this working with > different data formats if possible. > There are already a couple of input formats doing a similar thing: > MultifileInputFormat as well as CombinedInputFormat; however, neither works > with the new Hadoop 20 API. > We at least want to do a feasibility study for Pig 0.8.0. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
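The pseudo code in the comment above can be sketched in runnable form as follows. This is only an illustration under stated assumptions, not the PIG-1518 patch: splits are modelled as hypothetical `(name, size, hosts)` tuples, nodes are ordered by their number of splits, and the leftover pass is simplified so that it closes a group before the limit would be exceeded and emits the final partial group on its own rather than trying to attach it to an existing combined split.

```python
from collections import defaultdict


def combine_splits(splits, limit):
    """Group (name, size, hosts) splits into combined splits of at most `limit` bytes."""
    by_node = defaultdict(list)
    for split in splits:
        for host in split[2]:
            by_node[host].append(split)

    used = set()       # names of splits already placed in a combined split
    combined = []      # each entry is a list of original splits

    # Node-local pass: nodes holding fewer splits are handled first.
    for node in sorted(by_node, key=lambda n: (len(by_node[n]), n)):
        pending = sorted((s for s in by_node[node] if s[0] not in used),
                         key=lambda s: -s[1])
        while pending:
            group, total = [], 0
            for s in pending:              # greedy: biggest splits first
                if total + s[1] <= limit:
                    group.append(s)
                    total += s[1]
            if total * 2 >= limit:         # at least half of the limit
                combined.append(group)
                used.update(s[0] for s in group)
                pending = [s for s in pending if s[0] not in used]
            else:
                break

    # Leftover pass: locality is ignored (simplified, see the note above).
    group, total = [], 0
    for s in (s for s in splits if s[0] not in used):
        if group and total + s[1] > limit:
            combined.append(group)
            group, total = [], 0
        group.append(s)
        total += s[1]
    if group:
        combined.append(group)

    return [[s[0] for s in g] for g in combined]


example = [
    ("a", 60, ["n1"]), ("b", 50, ["n1"]), ("c", 40, ["n1"]),
    ("d", 30, ["n2"]), ("e", 20, ["n2"]),
    ("f", 10, ["n3"]),
]
groups = combine_splits(example, limit=100)
print(groups)
```

In this example the lone split on `n3` is too small to form a node-local group and is left for the leftover pass, while `n2` and `n1` each yield node-local combined splits.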
[jira] Updated: (PIG-1501) need to investigate the impact of compression on pig performance
[ https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Zhou updated PIG-1501: -- Attachment: PIG-1501.patch > need to investigate the impact of compression on pig performance > > > Key: PIG-1501 > URL: https://issues.apache.org/jira/browse/PIG-1501 > Project: Pig > Issue Type: Test > Reporter: Olga Natkovich > Assignee: Yan Zhou > Fix For: 0.8.0 > > Attachments: compress_perf_data.txt, compress_perf_data_2.txt, > PIG-1501.patch > > > We would like to understand how compressing map results as well as > reducer output in a chain of MR jobs impacts performance. We can use PigMix > queries for this investigation. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1501) need to investigate the impact of compression on pig performance
[ https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897005#action_12897005 ] Yan Zhou commented on PIG-1501: --- The default is *not* to compress the intermediate data, which is the existing behavior. RCFile is just a bit better than TFile in terms of compression ratio. In terms of performance, the difference is within background noise. Stitching costs should be minimal. Actually, the full "projection" is the biggest advantage of RCFile over other columnar storage like zebra. I was surprised to see that the compression improvement over TFile is marginal. The only cause I can think of is that the compression ratio is too sensitive to the data to pre-determine or even pre-estimate. LZO is under GPL, but it appears that the Hadoop installation has it, at least in my test cluster. > need to investigate the impact of compression on pig performance > > > Key: PIG-1501 > URL: https://issues.apache.org/jira/browse/PIG-1501 > Project: Pig > Issue Type: Test > Reporter: Olga Natkovich > Assignee: Yan Zhou > Fix For: 0.8.0 > > Attachments: compress_perf_data.txt, compress_perf_data_2.txt > > > We would like to understand how compressing map results as well as > reducer output in a chain of MR jobs impacts performance. We can use PigMix > queries for this investigation. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1501) need to investigate the impact of compression on pig performance
[ https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12896620#action_12896620 ] Yan Zhou commented on PIG-1501: --- Unless an objection is raised in the coming week, I'll go with LZO compression on TFile, with compression disabled by default, which preserves the old behavior. > need to investigate the impact of compression on pig performance > > > Key: PIG-1501 > URL: https://issues.apache.org/jira/browse/PIG-1501 > Project: Pig > Issue Type: Test > Reporter: Olga Natkovich > Assignee: Yan Zhou > Fix For: 0.8.0 > > Attachments: compress_perf_data.txt, compress_perf_data_2.txt > > > We would like to understand how compressing map results as well as > reducer output in a chain of MR jobs impacts performance. We can use PigMix > queries for this investigation. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1501) need to investigate the impact of compression on pig performance
[ https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Zhou updated PIG-1501: -- Attachment: compress_perf_data_2.txt The data sets in the last tests were so small that the performance difference was lost in background noise. This test case generates more temporary data. In summary, lzo yields a compression ratio of about 3% and a 4x speed improvement over uncompressed; gzip yields a compression ratio of less than 1%, but is 1%-2% slower than uncompressed. This is in line with the general observation that gzip compresses better but performs worse. > need to investigate the impact of compression on pig performance > > > Key: PIG-1501 > URL: https://issues.apache.org/jira/browse/PIG-1501 > Project: Pig > Issue Type: Test > Reporter: Olga Natkovich > Assignee: Yan Zhou > Fix For: 0.8.0 > > Attachments: compress_perf_data.txt, compress_perf_data_2.txt > > > We would like to understand how compressing map results as well as > reducer output in a chain of MR jobs impacts performance. We can use PigMix > queries for this investigation. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1496) Mandatory rule ImplicitSplitInserter
[ https://issues.apache.org/jira/browse/PIG-1496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Zhou updated PIG-1496: -- Attachment: PIG-1496.patch > Mandatory rule ImplicitSplitInserter > > > Key: PIG-1496 > URL: https://issues.apache.org/jira/browse/PIG-1496 > Project: Pig > Issue Type: Sub-task > Components: impl >Affects Versions: 0.8.0 >Reporter: Daniel Dai >Assignee: Yan Zhou > Fix For: 0.8.0 > > Attachments: PIG-1496.patch, PIG-1496.patch > > > Need to migrate ImplicitSplitInserter to new logical optimizer. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1496) Mandatory rule ImplicitSplitInserter
[ https://issues.apache.org/jira/browse/PIG-1496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Zhou updated PIG-1496: -- Status: Patch Available (was: Open) > Mandatory rule ImplicitSplitInserter > > > Key: PIG-1496 > URL: https://issues.apache.org/jira/browse/PIG-1496 > Project: Pig > Issue Type: Sub-task > Components: impl >Affects Versions: 0.8.0 >Reporter: Daniel Dai >Assignee: Yan Zhou > Fix For: 0.8.0 > > Attachments: PIG-1496.patch, PIG-1496.patch > > > Need to migrate ImplicitSplitInserter to new logical optimizer. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1496) Mandatory rule ImplicitSplitInserter
[ https://issues.apache.org/jira/browse/PIG-1496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Zhou updated PIG-1496: -- Attachment: PIG-1496.patch More comments in code per the reviewer's comment. > Mandatory rule ImplicitSplitInserter > > > Key: PIG-1496 > URL: https://issues.apache.org/jira/browse/PIG-1496 > Project: Pig > Issue Type: Sub-task > Components: impl >Affects Versions: 0.8.0 >Reporter: Daniel Dai >Assignee: Yan Zhou > Fix For: 0.8.0 > > Attachments: PIG-1496.patch > > > Need to migrate ImplicitSplitInserter to new logical optimizer. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1496) Mandatory rule ImplicitSplitInserter
[ https://issues.apache.org/jira/browse/PIG-1496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Zhou updated PIG-1496: -- Attachment: (was: PIG-1496.patch) > Mandatory rule ImplicitSplitInserter > > > Key: PIG-1496 > URL: https://issues.apache.org/jira/browse/PIG-1496 > Project: Pig > Issue Type: Sub-task > Components: impl >Affects Versions: 0.8.0 >Reporter: Daniel Dai >Assignee: Yan Zhou > Fix For: 0.8.0 > > Attachments: PIG-1496.patch > > > Need to migrate ImplicitSplitInserter to new logical optimizer. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1518) multi file input format for loaders
[ https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12895338#action_12895338 ] Yan Zhou commented on PIG-1518: --- To provide a safety valve for any input formats that might not tolerate the combination of their splits, a boolean property, pig.splitcombination, will be provided to allow this feature to be disabled. The default value will be true. > multi file input format for loaders > --- > > Key: PIG-1518 > URL: https://issues.apache.org/jira/browse/PIG-1518 > Project: Pig > Issue Type: Improvement > Reporter: Olga Natkovich > Assignee: Yan Zhou > Fix For: 0.8.0 > > > We frequently run into the situation where Pig needs to deal with small files > in the input. In this case a separate map is created for each file, which > could be very inefficient. > It would be great to have an umbrella input format that can take multiple > files and use them in a single split. We would like to see this working with > different data formats if possible. > There are already a couple of input formats doing a similar thing: > MultifileInputFormat as well as CombinedInputFormat; however, neither works > with the new Hadoop 20 API. > We at least want to do a feasibility study for Pig 0.8.0. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1518) multi file input format for loaders
[ https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12895335#action_12895335 ] Yan Zhou commented on PIG-1518: --- The combination algorithm currently does not consider rack-locality, as the generic underlying input splits do not carry the rack info. For more specific input splits like FileSplit, the rack info is available, thus allowing for generation of combined splits with consideration of rack-locality. But this might be out of scope for 0.8, and a separate JIRA, PIG-1535, has been filed for that purpose. > multi file input format for loaders > --- > > Key: PIG-1518 > URL: https://issues.apache.org/jira/browse/PIG-1518 > Project: Pig > Issue Type: Improvement > Reporter: Olga Natkovich > Assignee: Yan Zhou > Fix For: 0.8.0 > > > We frequently run into the situation where Pig needs to deal with small files > in the input. In this case a separate map is created for each file, which > could be very inefficient. > It would be great to have an umbrella input format that can take multiple > files and use them in a single split. We would like to see this working with > different data formats if possible. > There are already a couple of input formats doing a similar thing: > MultifileInputFormat as well as CombinedInputFormat; however, neither works > with the new Hadoop 20 API. > We at least want to do a feasibility study for Pig 0.8.0. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1535) Combined input splits need to consider rack-locality for the underlying splits of rack info.
Combined input splits need to consider rack-locality for the underlying splits of rack info. Key: PIG-1535 URL: https://issues.apache.org/jira/browse/PIG-1535 Project: Pig Issue Type: Improvement Reporter: Yan Zhou PIG-1518 will add support for combining multiple small splits into fewer, bigger splits. In doing so, the underlying generic input split's node-locality is consulted to maximize the data node-locality for the "big" splits. The rack-locality info is unavailable because the generic input splits do not currently have it. MAPREDUCE-1698 is filed to address the lack of rack info in InputSplit. On the other hand, for many other types of input splits the rack info is available. FileSplit is an example. Future Howl input splits will also contain the rack-locality info. In summary, before MAPREDUCE-1698 is resolved, if ever, the small splits of some specific input split types could be combined with awareness of rack-locality, probably by the same or similar algorithms as those used by CombineFileInputFormat. But it would mean non-trivial extra work on top of PIG-1518 and may be out of reach for 0.8, hence a separate JIRA. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1518) multi file input format for loaders
[ https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12894778#action_12894778 ] Yan Zhou commented on PIG-1518: --- In contrast with Hive, where CombineFileInputFormat is used to generate input splits on the underlying storage formats, Pig's combined splits work on top of the splits generated by the underlying loaders. In other words, Hive's input splits are CombineFileSplits that create record readers of the underlying storage formats, while Pig's combined input splits contain the underlying storage's splits. CombineFileRecordReader would have been reusable if not for its support only in 0.18 and its need for a CombineFileSplit, instead of an InputSplit, as an argument to its constructor (MAPREDUCE-955). > multi file input format for loaders > --- > > Key: PIG-1518 > URL: https://issues.apache.org/jira/browse/PIG-1518 > Project: Pig > Issue Type: Improvement > Reporter: Olga Natkovich > Assignee: Yan Zhou > Fix For: 0.8.0 > > > We frequently run into the situation where Pig needs to deal with small files > in the input. In this case a separate map is created for each file, which > could be very inefficient. > It would be great to have an umbrella input format that can take multiple > files and use them in a single split. We would like to see this working with > different data formats if possible. > There are already a couple of input formats doing a similar thing: > MultifileInputFormat as well as CombinedInputFormat; however, neither works > with the new Hadoop 20 API. > We at least want to do a feasibility study for Pig 0.8.0. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
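The container relationship described in the comment above, where a combined split holds the loaders' own splits rather than raw file blocks, might look roughly like the sketch below. The class names `Split` and `CombinedSplit`, the `top` parameter, and the rule of ranking hosts by the bytes they hold are all assumptions made for illustration; they are not taken from the Pig source.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List


@dataclass
class Split:
    """Stand-in for a loader-generated input split (hypothetical)."""
    path: str
    length: int
    hosts: List[str]


@dataclass
class CombinedSplit:
    """Holds the underlying splits untouched; it never looks inside them."""
    parts: List[Split] = field(default_factory=list)

    def length(self) -> int:
        # The combined size is simply the sum of the wrapped splits' sizes.
        return sum(p.length for p in self.parts)

    def locations(self, top: int = 3) -> List[str]:
        # Assumed policy: prefer the hosts that hold the most bytes.
        weight = Counter()
        for p in self.parts:
            for h in p.hosts:
                weight[h] += p.length
        ranked = sorted(weight.items(), key=lambda kv: (-kv[1], kv[0]))
        return [h for h, _ in ranked[:top]]


cs = CombinedSplit([
    Split("part-0", 10, ["n1", "n2"]),
    Split("part-1", 30, ["n2", "n3"]),
])
print(cs.length(), cs.locations())
```

The point of the design is that any loader's split type can be wrapped, since only its size and reported locations are consulted.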
[jira] Commented: (PIG-1518) multi file input format for loaders
[ https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12894205#action_12894205 ] Yan Zhou commented on PIG-1518: ---

CombinedInputFormat, in lieu of the deprecated MultiFileInputFormat, batches small files on the basis of block locality. For PIG, this umbrella input format will have to work with the generic input formats for which the block info is unavailable but the data node and size info are present to let the M/R make scheduling decisions.

In other words, PIG cannot break up the original splits to "work inside" them but can only use the original splits as building blocks for the combined input splits. Consequently, this combined input format will hold multiple generic input splits so that each combined split's size is bounded by a configured limit of, say, pig.maxsplitsize, with the default value being the HDFS block size of the file system the load source sits in.

However, due to the sortedness constraints on the tables in merge join, the split combination will not be used for any loads that will be used in merge join. For map-side cogroup or map-side group by, though, the splits can be combined because the splits are only required to contain all the duplicate keys per instance, and combination of splits will still preserve that invariant.

During combination, the splits on the same data nodes will be merged as much as possible. Leftovers will be merged without regard to data locality. Of all the used data nodes, those with fewer splits will be merged before those with more splits, so as to minimize the leftovers on the data nodes with fewer splits. On each data node, a greedy approach is adopted so that the largest splits are merged before smaller ones. This is because smaller splits are easier to merge among themselves later. As a result, the implementation will maintain a sorted list of data hosts (by number of splits), each holding a sorted list (by split size) of the original splits, to perform the above operations efficiently. The complexity should be linear in the number of the original splits.

Note that for data locality, we just honor whatever the generic input split's getLocations() method produces. Any particular input split's implementation actually may or may not hold that property. For instance, CombinedInputFormat will combine node-local or rack-local blocks into a split. Essentially, this PIG container input split works on whatever data locality perception the underlying loader provides.

On the implementation side, PigSplit will not hold a single wrapped InputSplit instance but a new CombinedInputSplit instance. Accordingly, PigRecordReader will hold a list of wrapped record readers and not just a single one. Correspondingly, PigRecordReader's nextKeyValue() will use the wrapped record readers in order to fetch the next values.

Risks include 1) the test verifications may need major changes since this optimization may cause major ordering changes in results; 2) since LoadFunc.prepareRead() takes a PigSplit argument, there might be a backward compatibility issue as PigSplit changes its wrapped input split to the combined input split. But this should be very unlikely, as the only known use of the PigSplit argument is the internal "index loader" for the right table in merge join.

> multi file input format for loaders > --- > > Key: PIG-1518 > URL: https://issues.apache.org/jira/browse/PIG-1518 > Project: Pig > Issue Type: Improvement > Reporter: Olga Natkovich > Assignee: Yan Zhou > Fix For: 0.8.0 > > > We frequently run into the situation where Pig needs to deal with small files > in the input. In this case a separate map is created for each file, which > could be very inefficient. > It would be great to have an umbrella input format that can take multiple > files and use them in a single split. We would like to see this working with > different data formats if possible. > There are already a couple of input formats doing a similar thing: > MultifileInputFormat as well as CombinedInputFormat; however, neither works > with the new Hadoop 20 API. > We at least want to do a feasibility study for Pig 0.8.0. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
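The record-reader arrangement described in the comment above, in which one reader walks a list of wrapped readers in order, can be sketched as follows. Plain Python iterators stand in for the wrapped record readers, and the class and method names (`CombinedRecordReader`, `next_key_value`, `current_value`, `read_all`) are hypothetical; this is not Pig's PigRecordReader.

```python
class CombinedRecordReader:
    """Chains several wrapped readers and exposes them as a single reader."""

    def __init__(self, readers):
        self._readers = iter(readers)
        self._current = next(self._readers, None)
        self._value = None

    def next_key_value(self):
        # Advance within the current reader; when it is exhausted,
        # move on to the next wrapped reader (empty ones are skipped).
        while self._current is not None:
            try:
                self._value = next(self._current)
                return True
            except StopIteration:
                self._current = next(self._readers, None)
        return False

    def current_value(self):
        return self._value


def read_all(reader):
    """Drain a reader into a list, mimicking a mapper's read loop."""
    out = []
    while reader.next_key_value():
        out.append(reader.current_value())
    return out


# Three wrapped "splits"; the middle one is empty.
records = read_all(CombinedRecordReader([iter(["a", "b"]), iter([]), iter(["c"])]))
print(records)
```

Because the wrapped readers are consumed strictly one after another, records from a single underlying split stay contiguous, although the order across splits depends on how the splits were combined.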
[jira] Updated: (PIG-1501) need to investigate the impact of compression on pig performance
[ https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Zhou updated PIG-1501: -- Attachment: compress_perf_data.txt The formatting in the JIRA comment seems to be off. I'm attaching the test results as a file. > need to investigate the impact of compression on pig performance > > > Key: PIG-1501 > URL: https://issues.apache.org/jira/browse/PIG-1501 > Project: Pig > Issue Type: Test > Reporter: Olga Natkovich > Assignee: Yan Zhou > Fix For: 0.8.0 > > Attachments: compress_perf_data.txt > > > We would like to understand how compressing map results as well as > reducer output in a chain of MR jobs impacts performance. We can use PigMix > queries for this investigation. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1501) need to investigate the impact of compression on pig performance
[ https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893746#action_12893746 ] Yan Zhou commented on PIG-1501: ---

gzip and lzo2 were tried as the compression codecs; TFile and RCFile were used as storage formats. The tests are PigMix's L3 and L11, and a variation of L3 with full projection, hereafter referred to as L3_1, in order to expand the temporary data size. (In some cases, multiple runs were executed, particularly where system fluctuations were suspected.) End-to-end elapsed times are recorded. The results are on a 15-node cluster of 2 x Xeon L5420 2.50GHz/16G RAM boxes:

Query  Run  Metric  uncompressed  TFile(lzo)   TFile(gzip)  RCFile(lzo2)
L3     1    size    133684504     19674398     11513958     18092681
L3     1    time    1'40"         1'45"        1'40"        1'56"
L3     2    size                                            18094161
L3     2    time                                            1'46"
L3_1   1    size    3889095541    3697681875   2637742581   3675818160
L3_1   1    time    3'10"         4'4"         3'25"        3'58"
L3_1   2    size                  3697666122                3675816707
L3_1   2    time                  3'10"                     3'22"
L3_1   3    size                  3697674414
L3_1   3    time                  3'5"
L11    1    size    25878480      21368784     15233146     21112892
L11    1    time    1'52"         1'52"        1'57"        1'59"
L11    2    size                                            21112892
L11    2    time                                            1'59"

A few observations are in order:

1) L3 has the highest compression ratio, while L3_1 and L11 have much lower compression ratios;
2) gzip compresses better than LZO2, at a small performance cost;
3) RCFile should have seen much better compression as it's a columnar store, but the actual difference is marginal. It is probably because of L11's unique values, and many of L3_1's random values like time stamps, plus the presence of map-typed columns. The conclusion from this observation is that compression of temporary intermediate data is not guaranteed to save disk space to a desired degree; it depends on the values of the temporary data being compressed. As a result, this feature should be made configurable;
4) The performance implications from these tests seem to be negligible, within background noise or within a few percent of the overall run times. But this is not conclusive yet. Larger and more real-life queries would be more suitable for the comparison;
5) RCFile, as noted above, has not shown a clear advantage in terms of better columnar compression ratio. But this observation could be data-sensitive.

> need to investigate the impact of compression on pig performance > > > Key: PIG-1501 > URL: https://issues.apache.org/jira/browse/PIG-1501 > Project: Pig > Issue Type: Test > Reporter: Olga Natkovich > Assignee: Yan Zhou > Fix For: 0.8.0 > > > We would like to understand how compressing map results as well as > reducer output in a chain of MR jobs impacts performance. We can use PigMix > queries for this investigation. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
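Observations 1) and 2) in the comment above can be checked with a little arithmetic on the first-run figures it reports. The sketch below assumes the large integers are intermediate data sizes in bytes and that they belong to the columns in the order given (uncompressed, TFile(lzo), TFile(gzip), RCFile(lzo2)); the `ratios` helper is hypothetical.

```python
# First-run sizes as reported in the comment (assumed to be bytes).
SIZES = {
    "L3":   {"uncompressed": 133684504,  "TFile(lzo)": 19674398,
             "TFile(gzip)": 11513958,    "RCFile(lzo2)": 18092681},
    "L3_1": {"uncompressed": 3889095541, "TFile(lzo)": 3697681875,
             "TFile(gzip)": 2637742581,  "RCFile(lzo2)": 3675818160},
    "L11":  {"uncompressed": 25878480,   "TFile(lzo)": 21368784,
             "TFile(gzip)": 15233146,    "RCFile(lzo2)": 21112892},
}


def ratios(row):
    """Compressed size as a fraction of the uncompressed size (smaller is better)."""
    base = row["uncompressed"]
    return {name: size / base for name, size in row.items() if name != "uncompressed"}


for query, row in SIZES.items():
    summary = ", ".join(f"{name}: {r:.1%}" for name, r in ratios(row).items())
    print(f"{query}: {summary}")
```

Under these assumptions, L3 shrinks to a small fraction of its uncompressed size with every codec, L3_1 barely shrinks under lzo, and gzip produces the smallest output for all three queries.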
[jira] Updated: (PIG-1453) [zebra] Intermittent failure for TestOrderPreserveUnionHDFS
[ https://issues.apache.org/jira/browse/PIG-1453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Zhou updated PIG-1453: -- Status: Resolved (was: Patch Available) Resolution: Fixed Committed to the trunk. > [zebra] Intermittent failure for TestOrderPreserveUnionHDFS > --- > > Key: PIG-1453 > URL: https://issues.apache.org/jira/browse/PIG-1453 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.8.0 >Reporter: Daniel Dai >Assignee: Yan Zhou > Fix For: 0.8.0 > > Attachments: PIG-1453.patch, PIG-1453.patch > > -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.