[jira] Commented: (PIG-1389) Implement Pig counter to track number of rows for each input files
[ https://issues.apache.org/jira/browse/PIG-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883424#action_12883424 ]

Hadoop QA commented on PIG-1389:

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12448259/PIG-1389_1.patch
against trunk revision 958666.

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 6 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed core unit tests.
-1 contrib tests. The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/335/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/335/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/335/console

This message is automatically generated.

Implement Pig counter to track number of rows for each input files
---
Key: PIG-1389
URL: https://issues.apache.org/jira/browse/PIG-1389
Project: Pig
Issue Type: Improvement
Affects Versions: 0.7.0
Reporter: Richard Ding
Assignee: Richard Ding
Fix For: 0.8.0
Attachments: PIG-1389.patch, PIG-1389.patch, PIG-1389_1.patch

An MR job generated by Pig can have not only multiple outputs (in the case of multiquery) but also multiple inputs (in the case of join or cogroup). In both cases, the existing Hadoop counters (e.g. MAP_INPUT_RECORDS, REDUCE_OUTPUT_RECORDS) cannot be used to count the number of records in a given input or output.

PIG-1299 addressed the case of multiple outputs. We need to add new counters for jobs with multiple inputs.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1468) DataByteArray.compareTo() does not compare in lexicographic order
[ https://issues.apache.org/jira/browse/PIG-1468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883464#action_12883464 ]

Gianmarco De Francisci Morales commented on PIG-1468:

1) I will write a simple program to measure the performance impact.
2) I think this has no correlation to other places, but I will check.

Furthermore, this patch makes the ordering consistent with Hadoop's WritableComparator.compareBytes() (lexicographic order of binary data).

DataByteArray.compareTo() does not compare in lexicographic order
-
Key: PIG-1468
URL: https://issues.apache.org/jira/browse/PIG-1468
Project: Pig
Issue Type: Bug
Reporter: Gianmarco De Francisci Morales
Assignee: Gianmarco De Francisci Morales
Attachments: PIG-1468.patch

The compareTo() method of org.apache.pig.data.DataByteArray does not compare items in lexicographic order. Instead, it takes into account the sign of the bytes that compose the DataByteArray. So, for example, 0xff compares as less than 0x00.
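To make the reported behavior concrete, here is a minimal sketch contrasting a signed byte-wise comparison with an unsigned (lexicographic) one. It is not the code from PIG-1468.patch or from DataByteArray; the class name ByteOrderDemo and both method names are assumptions made up for illustration.

```java
final class ByteOrderDemo {
    // Signed comparison: a Java byte ranges from -128 to 127, so 0xff is -1
    // and therefore sorts before 0x00.
    static int compareSigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            if (a[i] != b[i]) {
                return a[i] < b[i] ? -1 : 1;
            }
        }
        return Integer.compare(a.length, b.length);
    }

    // Unsigned comparison: masking with 0xff maps each byte to 0..255,
    // which gives lexicographic order of the raw binary data.
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x < y ? -1 : 1;
            }
        }
        return Integer.compare(a.length, b.length);
    }

    public static void main(String[] args) {
        byte[] high = {(byte) 0xff};
        byte[] low = {(byte) 0x00};
        System.out.println("signed:   " + compareSigned(high, low));   // prints -1
        System.out.println("unsigned: " + compareUnsigned(high, low)); // prints 1
    }
}
```

With the signed variant, 0xff sorts before 0x00, which is the ordering described in the report; the unsigned variant gives the ordering the comment says matches WritableComparator.compareBytes().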
[jira] Commented: (PIG-1295) Binary comparator for secondary sort
[ https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883486#action_12883486 ]

Hadoop QA commented on PIG-1295:

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12448251/PIG-1295_0.6.patch
against trunk revision 958666.

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 6 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
-1 javac. The applied patch generated 150 javac compiler warnings (more than the trunk's current 145 warnings).
+1 findbugs. The patch does not introduce any new Findbugs warnings.
-1 release audit. The applied patch generated 402 release audit warnings (more than the trunk's current 399 warnings).
-1 core tests. The patch failed core unit tests.
-1 contrib tests. The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/355/testReport/
Release audit warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/355/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/355/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/355/console

This message is automatically generated.

Binary comparator for secondary sort
Key: PIG-1295
URL: https://issues.apache.org/jira/browse/PIG-1295
Project: Pig
Issue Type: Improvement
Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Gianmarco De Francisci Morales
Fix For: 0.8.0
Attachments: PIG-1295_0.1.patch, PIG-1295_0.2.patch, PIG-1295_0.3.patch, PIG-1295_0.4.patch, PIG-1295_0.5.patch, PIG-1295_0.6.patch

When the Hadoop framework does the sorting, it tries to use a binary version of the comparator if one is available. The benefit of a binary comparator is that we do not need to instantiate the object before comparing. We see a ~30% speedup after switching to a binary comparator. Currently, Pig uses a binary comparator in the following cases:

1. When the semantics of the order do not matter. For example, in distinct, we need to sort in order to filter out duplicate values; however, we do not care how the comparator sorts the keys. Group-by shares this characteristic. In this case, we rely on Hadoop's default binary comparator.
2. When the semantics of the order matter, but the key is of a simple type. In this case, we have implementations for simple types, such as integer, long, float, chararray, databytearray, string.

However, if the key is a tuple and the sort semantics matter, we do not have a binary comparator implementation. This especially matters when we switch to secondary sort. In secondary sort, we convert the inner sort of a nested foreach into the secondary key and rely on Hadoop to sort on both the main key and the secondary key. The sorting key becomes a two-item tuple. Since the secondary key is the sorting key of the nested foreach, the sorting semantics matter. It turns out we do not have a binary comparator once we use secondary sort, and we see a significant slowdown.

A binary comparator for tuples should be doable once we understand the binary structure of the serialized tuple. We can focus on the most common use case first, which is group by followed by a nested sort. In this case, we will use secondary sort. The semantics of the first key do not matter, but the semantics of the secondary key do. We need to identify the boundary between the main key and the secondary key in the binary tuple buffer without instantiating the tuple itself. Then, if the first keys are equal, we use a binary comparator to compare the secondary keys. The secondary key can also be a complex data type, but as a first step, we focus on a simple secondary key, which is the most common use case.

We mark this issue as a candidate project for the Google Summer of Code 2010 program.
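The boundary-finding idea described above can be sketched as follows. This is only an illustration under a made-up serialization, not Pig's actual tuple format and not the code in the attached patches: each record is assumed to be laid out as a 4-byte big-endian length, the main key bytes, another 4-byte length, and the secondary key bytes. The class name SecondaryKeyRawComparator and all of its methods are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class SecondaryKeyRawComparator {
    // Hypothetical layout: [int len1][main key bytes][int len2][secondary key bytes].
    static byte[] serialize(String mainKey, String secondaryKey) {
        byte[] k1 = mainKey.getBytes(StandardCharsets.UTF_8);
        byte[] k2 = secondaryKey.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(8 + k1.length + k2.length);
        buf.putInt(k1.length).put(k1).putInt(k2.length).put(k2);
        return buf.array();
    }

    // Reads a big-endian int directly from the buffer.
    static int readInt(byte[] b, int off) {
        return ((b[off] & 0xff) << 24) | ((b[off + 1] & 0xff) << 16)
                | ((b[off + 2] & 0xff) << 8) | (b[off + 3] & 0xff);
    }

    // Unsigned lexicographic comparison of two byte ranges.
    static int compareBytes(byte[] a, int aOff, int aLen, byte[] b, int bOff, int bLen) {
        int n = Math.min(aLen, bLen);
        for (int i = 0; i < n; i++) {
            int x = a[aOff + i] & 0xff;
            int y = b[bOff + i] & 0xff;
            if (x != y) {
                return x < y ? -1 : 1;
            }
        }
        return Integer.compare(aLen, bLen);
    }

    // Compares two serialized records without deserializing either key.
    static int compare(byte[] a, byte[] b) {
        int aLen1 = readInt(a, 0);
        int bLen1 = readInt(b, 0);
        int c = compareBytes(a, 4, aLen1, b, 4, bLen1);
        if (c != 0) {
            return c; // main keys differ, so the secondary key is irrelevant
        }
        // The length prefix marks the boundary between main and secondary key.
        int aOff2 = 4 + aLen1;
        int bOff2 = 4 + bLen1;
        return compareBytes(a, aOff2 + 4, readInt(a, aOff2), b, bOff2 + 4, readInt(b, bOff2));
    }

    public static void main(String[] args) {
        byte[] r1 = serialize("user1", "2010-06-29");
        byte[] r2 = serialize("user1", "2010-06-30");
        byte[] r3 = serialize("user2", "2010-01-01");
        System.out.println(compare(r1, r2)); // prints -1: same main key, smaller secondary key
        System.out.println(compare(r3, r1)); // prints 1: larger main key
    }
}
```

The point of the sketch is that both comparisons run on the raw buffers only; a real implementation would have to follow the type tags and field encodings that Pig actually writes.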
[jira] Commented: (PIG-1468) DataByteArray.compareTo() does not compare in lexicographic order
[ https://issues.apache.org/jira/browse/PIG-1468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883557#action_12883557 ]

Gianmarco De Francisci Morales commented on PIG-1468:

I ran some tests. I see a ~1% decrease in performance overall. I looked around the codebase for references to the method, and it does not seem that any place relies on the specific ordering.

Here is the code I used:

{code}
import java.util.Random;

public class TestSpeed {
    private static final int TIMES = (int) 10e6;
    private static final int NUM_ARRAYS = (int) 10e5;
    private static final int ARRAY_LENGTH = 50;

    // Compares byte by byte, treating each byte as a signed value.
    private static int compareSigned(byte[] b1, byte[] b2) {
        int i;
        for (i = 0; i < b1.length; i++) {
            if (i >= b2.length)
                return 1;
            int a = b1[i];
            int b = b2[i];
            if (a < b)
                return -1;
            else if (a > b)
                return 1;
        }
        if (i < b2.length)
            return -1;
        return 0;
    }

    // Compares byte by byte, masking each byte to its unsigned value (0..255).
    private static int compareUnsigned(byte[] b1, byte[] b2) {
        int i;
        for (i = 0; i < b1.length; i++) {
            if (i >= b2.length)
                return 1;
            int a = b1[i] & 0xff;
            int b = b2[i] & 0xff;
            if (a < b)
                return -1;
            else if (a > b)
                return 1;
        }
        if (i < b2.length)
            return -1;
        return 0;
    }

    public static void main(String[] args) {
        long before, after;
        Random rand = new Random(123456789);
        byte[][] batch1 = new byte[NUM_ARRAYS][];
        byte[][] batch2 = new byte[NUM_ARRAYS][];
        for (int i = 0; i < NUM_ARRAYS; i++) {
            batch1[i] = new byte[ARRAY_LENGTH];
            batch2[i] = new byte[ARRAY_LENGTH];
            rand.nextBytes(batch1[i]);
            rand.nextBytes(batch2[i]);
        }

        // Note: the inner loops index the batches with j < ARRAY_LENGTH,
        // so only the first ARRAY_LENGTH arrays of each batch are compared.
        before = System.currentTimeMillis();
        for (int i = 0; i < TIMES; i++)
            for (int j = 0; j < ARRAY_LENGTH; j++)
                compareSigned(batch1[j], batch2[j]);
        after = System.currentTimeMillis();
        System.out.println("Time for signed comparison (ms): " + (after - before));

        before = System.currentTimeMillis();
        for (int i = 0; i < TIMES; i++)
            for (int j = 0; j < ARRAY_LENGTH; j++)
                compareUnsigned(batch1[j], batch2[j]);
        after = System.currentTimeMillis();
        System.out.println("Time for UNsigned comparison (ms): " + (after - before));
    }
}
{code}
[jira] Updated: (PIG-1309) Map-side Cogroup
[ https://issues.apache.org/jira/browse/PIG-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-1309:

Fix Version/s: 0.8.0

Map-side Cogroup
Key: PIG-1309
URL: https://issues.apache.org/jira/browse/PIG-1309
Project: Pig
Issue Type: Bug
Components: impl
Reporter: Ashutosh Chauhan
Assignee: Ashutosh Chauhan
Fix For: 0.8.0
Attachments: mapsideCogrp.patch, pig-1309_1.patch, pig-1309_2.patch

In the never-ending quest to make Pig go faster, we want to parallelize as many relational operations as possible. It's already possible to do Group-by (PIG-984) and Joins (PIG-845, PIG-554) purely map-side in Pig. This jira is to add a map-side implementation of Cogroup in Pig. Details to follow.
[jira] Created: (PIG-1475) Large number of FILTER matching clauses causes internal error 2998
Large number of FILTER matching clauses causes internal error 2998 -- Key: PIG-1475 URL: https://issues.apache.org/jira/browse/PIG-1475 Project: Pig Issue Type: Bug Components: grunt Affects Versions: 0.6.0 Environment: Amazon Elastic MapReduce with Pig 0.6 and Hadoop 0.20.2 Reporter: Mike Subelsky Priority: Minor I'm generating reports using Pig where I only want to report on rows matching a set of regular expressions, but those regular expressions are pretty numerous. Pig fails with this internal error when I run FILTER with 500 terms: 2010-06-29 15:46:04,579 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2998: Unhandled internal error. null This only occurred when I ran my Pig script against Hadoop with the full dataset. When I ran Pig in local mode, with a smaller sample file, Pig handled the FILTER command just fine. The workaround has been to split my list into two separate lists of 250 then UNION the results, but I assume this is something that could be addressed in the code. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1434) Allow casting relations to scalars
[ https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aniket Mokashi updated PIG-1434:

Status: Open (was: Patch Available)

Allow casting relations to scalars
--
Key: PIG-1434
URL: https://issues.apache.org/jira/browse/PIG-1434
Project: Pig
Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Aniket Mokashi
Fix For: 0.8.0
Attachments: scalarImpl.patch

This jira is to implement a simplified version of the functionality described in https://issues.apache.org/jira/browse/PIG-801. The proposal is to allow casting relations to scalar types in foreach.

Example:

A = load 'data' as (x, y, z);
B = group A all;
C = foreach B generate COUNT(A);
. X =
Y = foreach X generate $1/(long) C;

A couple of additional comments:
(1) You can only cast relations that contain a single value; otherwise an error will be reported.
(2) Name resolution is needed, since relation X might have a field named C, in which case that field takes precedence.
(3) Y will look for the C closest to it.

Implementation thoughts: The idea is to store C into a file and then convert it into a scalar via a UDF. I believe we already have a UDF that Ben Reed contributed for this purpose. Most of the work would be to update the logical plan to
(1) store C
(2) convert the cast to the UDF
[jira] Commented: (PIG-1475) Large number of FILTER matching clauses causes internal error 2998
[ https://issues.apache.org/jira/browse/PIG-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883768#action_12883768 ]

Jeff Zhang commented on PIG-1475:

Mike, could you paste your Pig script and the full exception call stack? Attaching the sample data would be better.

Large number of FILTER matching clauses causes internal error 2998
--
Key: PIG-1475
URL: https://issues.apache.org/jira/browse/PIG-1475
Project: Pig
Issue Type: Bug
Components: grunt
Affects Versions: 0.6.0
Environment: Amazon Elastic MapReduce with Pig 0.6 and Hadoop 0.20.2
Reporter: Mike Subelsky
Priority: Minor
[jira] Created: (PIG-1476) Add trailing flag to commands to prevent retention of relation name in field names: STRIP ?
Add trailing flag to commands to prevent retention of relation name in field names: STRIP ? --- Key: PIG-1476 URL: https://issues.apache.org/jira/browse/PIG-1476 Project: Pig Issue Type: New Feature Affects Versions: 0.7.0 Environment: sunny, 60% humidity with a chance of rain. Reporter: Russell Jurney Fix For: 0.8.0 After doing a JOIN or a GROUP/FOREACH, one often ends up with data looking like: DESCRIBE foo; foo: {other_thing::f1:int, other_thing::f2:chararray, other_thing::f3: int} What wunn usually wants is: foo: {f1:int, f2:chararray, f3: int} At this point, won is left with two choices, neither of which is very good. Choice wan: foo = FOREACH foo GENERATE $0 AS f1, $1 AS f2, $3 AS f3; This is a poor choice because later when wahn edits this file, it is confusing to remember what order is what field when wun manipulates something up stream in the script. So instead whun does this: foo = FOREACH foo GENERATE old_thing::f1 AS f1, old_thing::f2 AS f2, old_thing::f3 AS f3; This is a poor choice because it is verbose and cumbersome. Whan is unsure what to do, pauses and reflects that the Pig is perplexing, and hopes for a better tomorrow. Here's what wuhn should do to avoid this situation: foo = JOIN old_thing by f1, other_thing BY f1 STRIP; DESCRIBE foo foo: {f1:int, f2:chararray, f3: int}; I think so, anyway. I leave the behavior of duplicate fields to more enlightened beings, but I think this would be a big improvement to Pig Latin. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
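As a rough illustration of the renaming the proposed STRIP flag would perform, the sketch below drops everything up to the last "::" in a field alias. STRIP is only a proposal in this issue, not an existing Pig Latin keyword, and the class and method names here (AliasStripper, strip, stripAll) are hypothetical. How duplicate field names should be handled is left open, just as in the proposal.

```java
import java.util.ArrayList;
import java.util.List;

final class AliasStripper {
    // Returns the part of a field alias after the last "::" separator,
    // or the alias unchanged if it carries no relation prefix.
    static String strip(String alias) {
        int idx = alias.lastIndexOf("::");
        return idx < 0 ? alias : alias.substring(idx + 2);
    }

    // Applies strip() to every alias in a schema, preserving field order.
    static List<String> stripAll(List<String> aliases) {
        List<String> out = new ArrayList<>();
        for (String alias : aliases) {
            out.add(strip(alias));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> schema = new ArrayList<>();
        schema.add("other_thing::f1");
        schema.add("other_thing::f2");
        schema.add("other_thing::f3");
        System.out.println(stripAll(schema)); // prints [f1, f2, f3]
    }
}
```

Because only the text after the last separator is kept, a chained alias is reduced to its bare field name in one step.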
[jira] Updated: (PIG-1476) Add trailing flag to commands to prevent retention of relation name in field names: STRIP ?
[ https://issues.apache.org/jira/browse/PIG-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Russell Jurney updated PIG-1476: Description: After doing a JOIN or a GROUP/FOREACH, one often ends up with data looking like: DESCRIBE foo; foo: {other_thing::f1:int, other_thing::f2:chararray, other_thing::f3: int} If oun was to let this chain, ouin can end up with: first_thing::second_thing::third_thing::fourth_thing::f1 which is pretty hairy. What wunn usually wants is: foo: {f1:int, f2:chararray, f3: int} At this point, won is left with two choices, neither of which is very good. Choice wan: foo = FOREACH foo GENERATE $0 AS f1, $1 AS f2, $3 AS f3; This is a poor choice because later when wahn edits this file, it is confusing to remember what order is what field when wun manipulates something up stream in the script. So instead whun does this: foo = FOREACH foo GENERATE old_thing::f1 AS f1, old_thing::f2 AS f2, old_thing::f3 AS f3; or foo = FOREACH foo GENERATE f1 AS f1, f2 AS f2, f3 AS f3; This is a poor choice because it is verbose and cumbersome. With no good choices available, whan is unsure what to do, pauses and reflects that the Pig is perplexing, and hopes for a better tomorrow. Here's what wuhn should do to avoid this situation: foo = JOIN old_thing by f1, other_thing BY f1 STRIP; DESCRIBE foo foo: {f1:int, f2:chararray, f3: int}; I think so, anyway. I leave the behavior of duplicate fields to more enlightened beings, but I think this would be a big improvement to Pig Latin. was: After doing a JOIN or a GROUP/FOREACH, one often ends up with data looking like: DESCRIBE foo; foo: {other_thing::f1:int, other_thing::f2:chararray, other_thing::f3: int} What wunn usually wants is: foo: {f1:int, f2:chararray, f3: int} At this point, won is left with two choices, neither of which is very good. 
Choice wan: foo = FOREACH foo GENERATE $0 AS f1, $1 AS f2, $3 AS f3; This is a poor choice because later when wahn edits this file, it is confusing to remember what order is what field when wun manipulates something up stream in the script. So instead whun does this: foo = FOREACH foo GENERATE old_thing::f1 AS f1, old_thing::f2 AS f2, old_thing::f3 AS f3; or foo = FOREACH foo GENERATE f1 AS f1, f2 AS f2, f3 AS f3; This is a poor choice because it is verbose and cumbersome. With no good choices available, whan is unsure what to do, pauses and reflects that the Pig is perplexing, and hopes for a better tomorrow. Here's what wuhn should do to avoid this situation: foo = JOIN old_thing by f1, other_thing BY f1 STRIP; DESCRIBE foo foo: {f1:int, f2:chararray, f3: int}; I think so, anyway. I leave the behavior of duplicate fields to more enlightened beings, but I think this would be a big improvement to Pig Latin. Add trailing flag to commands to prevent retention of relation name in field names: STRIP ? --- Key: PIG-1476 URL: https://issues.apache.org/jira/browse/PIG-1476 Project: Pig Issue Type: New Feature Affects Versions: 0.7.0 Environment: sunny, 60% humidity with a chance of rain. Reporter: Russell Jurney Fix For: 0.8.0