[jira] Created: (PIG-876) limit changes order of order-by to ascending
limit changes order of order-by to ascending

Key: PIG-876
URL: https://issues.apache.org/jira/browse/PIG-876
Project: Pig
Issue Type: Bug
Reporter: Thejas M Nair
Assignee: Daniel Dai

grunt> a = load 'students.txt' as (c1,c2,c3,c4);
grunt> s = order a by c1 desc;
grunt> dump s;
(zxldf,M,21,12.56)
(uhsdf,M,34,12.11)
(qwer,F,21,14.44)
(qwer,F,23,145.5)
(oiue,M,54,23.33)
(asdfxc,M,23,12.44)
grunt> l = limit s 3;
grunt> dump l;
(asdfxc,M,23,12.44)
(oiue,M,54,23.33)
(qwer,F,21,14.44)

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-792) PERFORMANCE: Support skewed join in pig
[ https://issues.apache.org/jira/browse/PIG-792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sriranjan Manjunath updated PIG-792:

Attachment: skewedjoin.patch

Merged from trunk and cleared all unit tests

PERFORMANCE: Support skewed join in pig
---

Key: PIG-792
URL: https://issues.apache.org/jira/browse/PIG-792
Project: Pig
Issue Type: Improvement
Reporter: Sriranjan Manjunath
Attachments: skewedjoin.patch

Fragmented replicated join has a few limitations:
- One of the tables needs to be loaded into memory
- Join is limited to two tables

Skewed join partitions the table and joins the records in the reduce phase. It computes a histogram of the key space to account for skewing in the input records. Further, it adjusts the number of reducers depending on the key distribution.

We need to implement the skewed join in pig.
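The histogram idea in the description above can be sketched roughly as follows. This is an illustration in Python, not the code in skewedjoin.patch: the function names (`build_histogram`, `allocate_reducers`, `make_partitioner`) are hypothetical, and the proportional allocation rule is an assumption chosen for simplicity. It only shows how a sampled key histogram could drive how many reducers each key is spread over.

```python
from collections import Counter
from itertools import count


def build_histogram(sampled_keys):
    """Count how often each join key appears in a sample of the input."""
    return Counter(sampled_keys)


def allocate_reducers(histogram, total_reducers):
    """Map each key to a list of reducer ids sized by its share of the sample.

    Assumption for this sketch: a key gets round(total_reducers * share)
    reducers, with a minimum of one. Ids wrap around once they run out.
    """
    total = sum(histogram.values())
    allocation = {}
    next_id = 0
    # Most frequent keys first; ties broken by key so the result is stable.
    for key, freq in sorted(histogram.items(), key=lambda kv: (-kv[1], str(kv[0]))):
        n = max(1, round(total_reducers * freq / total))
        allocation[key] = [(next_id + i) % total_reducers for i in range(n)]
        next_id += n
    return allocation


def make_partitioner(allocation):
    """Return a function that spreads records of a key round-robin
    over the reducers allocated to that key."""
    counters = {key: count() for key in allocation}

    def partition(key):
        reducers = allocation[key]
        return reducers[next(counters[key]) % len(reducers)]

    return partition


# Hypothetical sample: key "a" dominates the input.
sample_keys = ["a"] * 8 + ["b", "c"]
allocation = allocate_reducers(build_histogram(sample_keys), total_reducers=4)
partition = make_partitioner(allocation)
print(allocation)
print([partition("a") for _ in range(4)])
```

In a real join, records from the other table that match a spread-out key would also have to reach every reducer handling that key; the sketch leaves that part out.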
[jira] Updated: (PIG-792) PERFORMANCE: Support skewed join in pig
[ https://issues.apache.org/jira/browse/PIG-792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sriranjan Manjunath updated PIG-792:

Attachment: (was: skewedjoin.patch)
[jira] Updated: (PIG-792) PERFORMANCE: Support skewed join in pig
[ https://issues.apache.org/jira/browse/PIG-792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sriranjan Manjunath updated PIG-792:

Status: Open (was: Patch Available)
[jira] Updated: (PIG-792) PERFORMANCE: Support skewed join in pig
[ https://issues.apache.org/jira/browse/PIG-792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sriranjan Manjunath updated PIG-792:

Status: Patch Available (was: Open)
[jira] Commented: (PIG-792) PERFORMANCE: Support skewed join in pig
[ https://issues.apache.org/jira/browse/PIG-792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12728944#action_12728944 ]

Hadoop QA commented on PIG-792:
---

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12412911/skewedjoin.patch
against trunk revision 791916.

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 4 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
-1 javac. The applied patch generated 263 javac compiler warnings (more than the trunk's current 250 warnings).
-1 findbugs. The patch appears to introduce 2 new Findbugs warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/115/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/115/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/115/console

This message is automatically generated.
Build failed in Hudson: Pig-Patch-minerva.apache.org #115
See http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/115/changes

Changes:

[pradeepkth] PIG-820: additional fixes to original patch (Ashutosh Chauhan via pradeepkth)

--
[...truncated 97511 lines...]
[exec] [junit] 09/07/08 22:44:47 INFO mapReduceLayer.MultiQueryOptimizer: MR plan size before optimization: 1
[exec] [junit] 09/07/08 22:44:47 INFO mapReduceLayer.MultiQueryOptimizer: MR plan size after optimization: 1
[exec] [junit] 09/07/08 22:44:48 INFO mapReduceLayer.JobControlCompiler: Setting up single store job
[exec] [junit] 09/07/08 22:44:48 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
[exec] [junit] 09/07/08 22:44:48 INFO dfs.StateChange: BLOCK* NameSystem.allocateBlock: /tmp/hadoop-hudson/mapred/system/job_200907082244_0002/job.jar. blk_-7262515892829386070_1012
[exec] [junit] 09/07/08 22:44:48 INFO dfs.DataNode: Receiving block blk_-7262515892829386070_1012 src: /127.0.0.1:60664 dest: /127.0.0.1:47416
[exec] [junit] 09/07/08 22:44:48 INFO dfs.DataNode: Receiving block blk_-7262515892829386070_1012 src: /127.0.0.1:35053 dest: /127.0.0.1:60347
[exec] [junit] 09/07/08 22:44:48 INFO dfs.DataNode: Receiving block blk_-7262515892829386070_1012 src: /127.0.0.1:43688 dest: /127.0.0.1:43227
[exec] [junit] 09/07/08 22:44:48 INFO dfs.DataNode: Received block blk_-7262515892829386070_1012 of size 1479061 from /127.0.0.1
[exec] [junit] 09/07/08 22:44:48 INFO dfs.DataNode: PacketResponder 0 for block blk_-7262515892829386070_1012 terminating
[exec] [junit] 09/07/08 22:44:48 INFO dfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 127.0.0.1:43227 is added to blk_-7262515892829386070_1012 size 1479061
[exec] [junit] 09/07/08 22:44:48 INFO dfs.DataNode: Received block blk_-7262515892829386070_1012 of size 1479061 from /127.0.0.1
[exec] [junit] 09/07/08 22:44:48 INFO dfs.DataNode: PacketResponder 1 for block blk_-7262515892829386070_1012 terminating
[exec] [junit] 09/07/08 22:44:48 INFO dfs.DataNode: Received block blk_-7262515892829386070_1012 of size 1479061 from /127.0.0.1
[exec] [junit] 09/07/08 22:44:48 INFO dfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 127.0.0.1:60347 is added to blk_-7262515892829386070_1012 size 1479061
[exec] [junit] 09/07/08 22:44:48 INFO dfs.DataNode: PacketResponder 2 for block blk_-7262515892829386070_1012 terminating
[exec] [junit] 09/07/08 22:44:48 INFO dfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 127.0.0.1:47416 is added to blk_-7262515892829386070_1012 size 1479061
[exec] [junit] 09/07/08 22:44:48 INFO fs.FSNamesystem: Increasing replication for file /tmp/hadoop-hudson/mapred/system/job_200907082244_0002/job.jar. New replication is 2
[exec] [junit] 09/07/08 22:44:48 INFO fs.FSNamesystem: Reducing replication for file /tmp/hadoop-hudson/mapred/system/job_200907082244_0002/job.jar. New replication is 2
[exec] [junit] 09/07/08 22:44:48 INFO dfs.StateChange: BLOCK* NameSystem.allocateBlock: /tmp/hadoop-hudson/mapred/system/job_200907082244_0002/job.split. blk_-211607849087949865_1013
[exec] [junit] 09/07/08 22:44:48 INFO dfs.DataNode: Receiving block blk_-211607849087949865_1013 src: /127.0.0.1:60667 dest: /127.0.0.1:47416
[exec] [junit] 09/07/08 22:44:48 INFO dfs.DataNode: Receiving block blk_-211607849087949865_1013 src: /127.0.0.1:35056 dest: /127.0.0.1:60347
[exec] [junit] 09/07/08 22:44:48 INFO dfs.DataNode: Receiving block blk_-211607849087949865_1013 src: /127.0.0.1:41453 dest: /127.0.0.1:45877
[exec] [junit] 09/07/08 22:44:48 INFO dfs.DataNode: Received block blk_-211607849087949865_1013 of size 14547 from /127.0.0.1
[exec] [junit] 09/07/08 22:44:48 INFO dfs.DataNode: PacketResponder 0 for block blk_-211607849087949865_1013 terminating
[exec] [junit] 09/07/08 22:44:48 INFO dfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 127.0.0.1:45877 is added to blk_-211607849087949865_1013 size 14547
[exec] [junit] 09/07/08 22:44:48 INFO dfs.DataNode: Received block blk_-211607849087949865_1013 of size 14547 from /127.0.0.1
[exec] [junit] 09/07/08 22:44:48 INFO dfs.DataNode: PacketResponder 1 for block blk_-211607849087949865_1013 terminating
[exec] [junit] 09/07/08 22:44:48 INFO dfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 127.0.0.1:60347 is added to blk_-211607849087949865_1013 size 14547
[exec] [junit] 09/07/08 22:44:48 INFO dfs.DataNode: Received block blk_-211607849087949865_1013 of size 14547 from /127.0.0.1
[exec] [junit] 09/07/08 22:44:48 INFO dfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated:
[jira] Updated: (PIG-792) PERFORMANCE: Support skewed join in pig
[ https://issues.apache.org/jira/browse/PIG-792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sriranjan Manjunath updated PIG-792:

Attachment: (was: skewedjoin.patch)
Re: Is there any document about the JobControlCompiler
Jeff,

Chris Olston answered this a while back:
http://markmail.org/thread/xnwutstlftnyycxs
(By the way, MarkMail is awesome for searching mailing list archives. Highly recommended.)

There are some changes that have to do with sampling and multi-store, but that email will give you the general idea. Also, remember you can always get the MR plan by running explain on a relation.

Hope this helps
-Dmitriy

On Wed, Jul 8, 2009 at 6:24 PM, zhang jianfeng zjf...@gmail.com wrote:

> Hi all,
>
> I found that the following script will be converted into 3 mapreduce jobs:
>
> A = LOAD '/user/zjffdu/input.txt' USING PigStorage();
> B = GROUP A BY $0;
> B = FOREACH B GENERATE group,COUNT($1);
> B = ORDER B BY $1;
> DUMP B;
>
> I am very interested to know how Pig compiles the script into jobs. Reading the source code is one way, but if there is any documentation, that would be better.
>
> Does anyone know where I can find the related documents? Or is there any JIRA item related to this?
>
> Thank you in advance.
>
> Jeff Zhang
Re: Is there any document about the JobControlCompiler
Dmitriy,

Thank you for your help.

On Thu, Jul 9, 2009 at 9:34 AM, Dmitriy Ryaboy dvrya...@cloudera.com wrote:

> Jeff,
>
> Chris Olston answered this a while back:
> http://markmail.org/thread/xnwutstlftnyycxs
> (By the way, MarkMail is awesome for searching mailing list archives. Highly recommended.)
>
> There are some changes that have to do with sampling and multi-store, but that email will give you the general idea. Also, remember you can always get the MR plan by running explain on a relation.
>
> Hope this helps
> -Dmitriy
>
> On Wed, Jul 8, 2009 at 6:24 PM, zhang jianfeng zjf...@gmail.com wrote:
>
> > Hi all,
> >
> > I found that the following script will be converted into 3 mapreduce jobs:
> >
> > A = LOAD '/user/zjffdu/input.txt' USING PigStorage();
> > B = GROUP A BY $0;
> > B = FOREACH B GENERATE group,COUNT($1);
> > B = ORDER B BY $1;
> > DUMP B;
> >
> > I am very interested to know how Pig compiles the script into jobs. Reading the source code is one way, but if there is any documentation, that would be better.
> >
> > Does anyone know where I can find the related documents? Or is there any JIRA item related to this?
> >
> > Thank you in advance.
> >
> > Jeff Zhang
[jira] Assigned: (PIG-822) Flatten semantics are unknown
[ https://issues.apache.org/jira/browse/PIG-822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Reed reassigned PIG-822:
---

Assignee: Benjamin Reed

Flatten semantics are unknown
---

Key: PIG-822
URL: https://issues.apache.org/jira/browse/PIG-822
Project: Pig
Issue Type: Bug
Components: documentation
Reporter: George Mavromatis
Assignee: Benjamin Reed
Priority: Critical

There is no formal specification of the flatten keyword in
http://hadoop.apache.org/pig/docs/r0.2.0/piglatin.html
There are only some examples. I have found flatten to be very fragile and unpredictable with the data types it reads and creates.

Please document:
Flatten to be explained formally in its own dedicated section: What are the valid input types, the output types it creates, what transformation it does from input to output and how the resulting data are named.
[jira] Updated: (PIG-876) limit changes order of order-by to ascending
[ https://issues.apache.org/jira/browse/PIG-876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-876:
---

Fix Version/s: 0.4.0
Affects Version/s: 0.3.0
Status: Patch Available (was: Open)

limit changes order of order-by to ascending

Key: PIG-876
URL: https://issues.apache.org/jira/browse/PIG-876
Project: Pig
Issue Type: Bug
Affects Versions: 0.3.0
Reporter: Thejas M Nair
Assignee: Daniel Dai
Fix For: 0.4.0
Attachments: PIG-876-1.patch
[jira] Updated: (PIG-876) limit changes order of order-by to ascending
[ https://issues.apache.org/jira/browse/PIG-876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-876:
---

Attachment: PIG-876-1.patch

The limitAfterSort operator needs to use the same sort order as the sort operator.
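The comment above says the limit that follows a sort has to use the same sort order as the sort itself. A minimal sketch of why, written in Python rather than Pig's Java, and assuming (the thread does not say this) that the limit step merges partitions that are each already sorted and keeps the first n rows. `limit_after_sort` and the two-partition split of the student rows are hypothetical.

```python
import heapq
from itertools import islice


def limit_after_sort(partitions, n, descending=False):
    """Merge partitions that are each already sorted on column 0 and keep
    the first n rows.

    `descending` must match the direction of the preceding sort, because
    heapq.merge trusts that its inputs are ordered the way `reverse` says.
    """
    merged = heapq.merge(*partitions, key=lambda row: row[0], reverse=descending)
    return list(islice(merged, n))


# The rows from the issue, sorted by c1 descending and split into two
# hypothetical partitions.
part1 = [("zxldf", "M", 21, 12.56), ("qwer", "F", 21, 14.44), ("oiue", "M", 54, 23.33)]
part2 = [("uhsdf", "M", 34, 12.11), ("qwer", "F", 23, 145.5), ("asdfxc", "M", 23, 12.44)]

# Direction passed through from the sort: the first three rows of the
# descending order.
correct = limit_after_sort([part1, part2], 3, descending=True)

# Direction dropped (defaults to ascending): the merge compares the wrong
# way round and returns different rows.
mismatched = limit_after_sort([part1, part2], 3)

print([row[0] for row in correct])
print([row[0] for row in mismatched])
```

The mismatched call does not reproduce the exact rows shown in the issue; it only shows that a limit step which ignores the sort direction cannot return the top of a descending sort.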
[jira] Updated: (PIG-812) COUNT(*) does not work
[ https://issues.apache.org/jira/browse/PIG-812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Reed updated PIG-812:
---

Priority: Major (was: Critical)

COUNT(*) does not work
---

Key: PIG-812
URL: https://issues.apache.org/jira/browse/PIG-812
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.2.0
Reporter: Viraj Bhat
Assignee: Benjamin Reed
Fix For: 0.2.0
Attachments: studenttab10k

Pig script to count the number of rows in a studenttab10k file which contains 10k records.

{code}
studenttab = LOAD 'studenttab10k' AS (name:chararray, age:int,gpa:float);
X2 = GROUP studenttab ALL;
describe X2;
Y2 = FOREACH X2 GENERATE COUNT(*);
explain Y2;
DUMP Y2;
{code}

returns the following error

ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias Y2
Details at logfile: /homes/viraj/pig-svn/trunk/pig_1242783700970.log

If you look at the log file:

Caused by: java.lang.ClassCastException
    at org.apache.pig.builtin.COUNT$Initial.exec(COUNT.java:76)
    at org.apache.pig.builtin.COUNT$Initial.exec(COUNT.java:68)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:201)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:235)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:254)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:204)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:231)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:223)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:245)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:236)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:88)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)