[jira] Commented: (PIG-1389) Implement Pig counter to track number of rows for each input files

2010-06-29 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883424#action_12883424
 ] 

Hadoop QA commented on PIG-1389:


-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12448259/PIG-1389_1.patch
  against trunk revision 958666.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 6 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

-1 contrib tests.  The patch failed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/335/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/335/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/335/console

This message is automatically generated.

 Implement Pig counter to track number of rows for each input files 
 ---

 Key: PIG-1389
 URL: https://issues.apache.org/jira/browse/PIG-1389
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.7.0
Reporter: Richard Ding
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1389.patch, PIG-1389.patch, PIG-1389_1.patch


 An MR job generated by Pig can have not only multiple outputs (in the case of 
 multiquery) but also multiple inputs (in the case of join or cogroup). In 
 both cases, the existing Hadoop counters (e.g. MAP_INPUT_RECORDS, 
 REDUCE_OUTPUT_RECORDS) cannot be used to count the number of records in a 
 given input or output.  PIG-1299 addressed the case of multiple outputs.  We 
 need to add new counters for jobs with multiple inputs.
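
To make the problem concrete, here is a minimal sketch in plain Java. It does not use Pig's or Hadoop's counter API; the class `InputRecordCounters`, its methods, and the input names are hypothetical and exist only for illustration. It shows why a single aggregate counter cannot answer "how many rows came from each input", and what a per-input counter would record instead.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration: one record counter per input, plus the aggregate total. */
final class InputRecordCounters {
    private final Map<String, Long> perInput = new LinkedHashMap<>();
    private long total; // what a single aggregate counter would report

    /** Records that one row was read from the given input. */
    void increment(String inputName) {
        perInput.merge(inputName, 1L, Long::sum);
        total++;
    }

    /** Number of rows read from one input (0 if the input was never seen). */
    long get(String inputName) {
        return perInput.getOrDefault(inputName, 0L);
    }

    /** Number of rows read from all inputs together. */
    long total() {
        return total;
    }

    public static void main(String[] args) {
        InputRecordCounters counters = new InputRecordCounters();
        // Simulate a join that reads 3 rows from one input and 2 from another.
        for (int i = 0; i < 3; i++) counters.increment("input_a");
        for (int i = 0; i < 2; i++) counters.increment("input_b");
        System.out.println("total   = " + counters.total());
        System.out.println("input_a = " + counters.get("input_a"));
        System.out.println("input_b = " + counters.get("input_b"));
    }
}
```

The aggregate alone cannot be split back into the two per-input numbers, which is the gap this issue asks to close for multi-input jobs.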

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1468) DataByteArray.compareTo() does not compare in lexicographic order

2010-06-29 Thread Gianmarco De Francisci Morales (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883464#action_12883464
 ] 

Gianmarco De Francisci Morales commented on PIG-1468:
-

1) I will write a simple program to measure the performance impact.

2) I think this has no correlation to other places, but I will check.
Furthermore, this patch makes the ordering consistent with Hadoop's 
WritableComparator.compareBytes() (lexicographic order of binary data).

 DataByteArray.compareTo() does not compare in lexicographic order
 -

 Key: PIG-1468
 URL: https://issues.apache.org/jira/browse/PIG-1468
 Project: Pig
  Issue Type: Bug
Reporter: Gianmarco De Francisci Morales
Assignee: Gianmarco De Francisci Morales
 Attachments: PIG-1468.patch


 The compareTo() method of org.apache.pig.data.DataByteArray does not compare 
 items in lexicographic order.
 Instead, it takes into account the sign of the bytes that compose the 
 DataByteArray.
 So, for example, 0xff compares as less than 0x00.
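
The effect can be reproduced in isolation. The sketch below is not the DataByteArray code; `ByteOrderDemo` and its two methods are hypothetical names. It relies only on the fact that a Java `byte` is signed, so `(byte) 0xff` is -1, while masking with `0xff` yields the unsigned value 255.

```java
/** Hypothetical illustration of signed versus unsigned byte comparison. */
final class ByteOrderDemo {
    /** Compares two bytes as signed Java bytes (range -128..127). */
    static int compareSigned(byte a, byte b) {
        return Integer.compare(a, b);
    }

    /** Compares two bytes as unsigned values (range 0..255) by masking with 0xff. */
    static int compareUnsigned(byte a, byte b) {
        return Integer.compare(a & 0xff, b & 0xff);
    }

    public static void main(String[] args) {
        byte high = (byte) 0xff; // -1 as a signed byte, 255 when masked
        byte low = 0x00;
        // Signed comparison puts 0xff before 0x00; unsigned puts it after.
        System.out.println("signed:   " + compareSigned(high, low));
        System.out.println("unsigned: " + compareUnsigned(high, low));
    }
}
```

Only the unsigned variant orders 0xff after 0x00, which is what lexicographic order of binary data requires.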




[jira] Commented: (PIG-1295) Binary comparator for secondary sort

2010-06-29 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883486#action_12883486
 ] 

Hadoop QA commented on PIG-1295:


-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12448251/PIG-1295_0.6.patch
  against trunk revision 958666.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 6 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

-1 javac.  The applied patch generated 150 javac compiler warnings (more 
than the trunk's current 145 warnings).

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

-1 release audit.  The applied patch generated 402 release audit warnings 
(more than the trunk's current 399 warnings).

-1 core tests.  The patch failed core unit tests.

-1 contrib tests.  The patch failed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/355/testReport/
Release audit warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/355/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/355/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/355/console

This message is automatically generated.

 Binary comparator for secondary sort
 

 Key: PIG-1295
 URL: https://issues.apache.org/jira/browse/PIG-1295
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Gianmarco De Francisci Morales
 Fix For: 0.8.0

 Attachments: PIG-1295_0.1.patch, PIG-1295_0.2.patch, 
 PIG-1295_0.3.patch, PIG-1295_0.4.patch, PIG-1295_0.5.patch, PIG-1295_0.6.patch


 When the Hadoop framework sorts, it uses the binary version of a comparator 
 if one is available. The benefit of a binary comparator is that we do not 
 need to instantiate the object before comparing. We saw a ~30% speedup after 
 switching to a binary comparator. Currently, Pig uses a binary comparator in 
 the following cases:
 1. When the semantics of the order do not matter. For example, in distinct, 
 we need to sort in order to filter out duplicate values; however, we do not 
 care how the comparator orders the keys. Group-by shares this characteristic. 
 In this case, we rely on Hadoop's default binary comparator.
 2. When the semantics of the order matter, but the key is of a simple type. 
 In this case, we have implementations for simple types, such as integer, 
 long, float, chararray, databytearray, string.
 However, if the key is a tuple and the sort semantics matter, we do not have 
 a binary comparator implementation. This especially matters when we switch to 
 secondary sort. In secondary sort, we convert the inner sort of a nested 
 foreach into the secondary key and rely on Hadoop to sort on both the main 
 key and the secondary key. The sorting key becomes a two-item tuple. Since 
 the secondary key is the sorting key of the nested foreach, the sorting 
 semantics matter. It turns out we do not have a binary comparator once we use 
 secondary sort, and we see a significant slowdown.
 A binary comparator for tuples should be doable once we understand the binary 
 structure of the serialized tuple. We can focus on the most common use case 
 first, which is a group-by followed by a nested sort. In this case, we will 
 use secondary sort. The semantics of the first key do not matter, but the 
 semantics of the secondary key do. We need to identify the boundary between 
 the main key and the secondary key in the binary tuple buffer without 
 instantiating the tuple itself. Then, if the first keys are equal, we use a 
 binary comparator to compare the secondary keys. The secondary key can also 
 be a complex data type, but as a first step we focus on a simple secondary 
 key, which is the most common use case.
 We mark this issue as a candidate project for the Google Summer of Code 2010 
 program.
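
The approach described above can be sketched as follows. This is not Pig's serialization format: the layout (a 4-byte length, the main key bytes, then a 4-byte int secondary key), the class `TwoPartKeyComparator`, and the `encode` helper are all assumptions made up for this illustration. The point is only that the boundary between the two keys can be found from the buffer itself, so neither tuple has to be instantiated.

```java
import java.nio.ByteBuffer;

/**
 * Hypothetical sketch of a binary comparator for a (main key, secondary key) pair.
 * Assumed layout: [int mainKeyLength][mainKey bytes][int secondaryKey].
 */
final class TwoPartKeyComparator {
    /** Serializes a pair in the assumed layout (big-endian, ByteBuffer's default). */
    static byte[] encode(byte[] mainKey, int secondaryKey) {
        return ByteBuffer.allocate(4 + mainKey.length + 4)
                .putInt(mainKey.length)
                .put(mainKey)
                .putInt(secondaryKey)
                .array();
    }

    /** Compares two serialized pairs directly on their bytes. */
    static int compare(byte[] k1, byte[] k2) {
        ByteBuffer b1 = ByteBuffer.wrap(k1);
        ByteBuffer b2 = ByteBuffer.wrap(k2);
        int len1 = b1.getInt(0);
        int len2 = b2.getInt(0);
        // Main key: any consistent order will do, so compare raw unsigned bytes.
        int common = Math.min(len1, len2);
        for (int i = 0; i < common; i++) {
            int c = Integer.compare(k1[4 + i] & 0xff, k2[4 + i] & 0xff);
            if (c != 0) return c;
        }
        if (len1 != len2) return Integer.compare(len1, len2);
        // Main keys are equal: the secondary key starts right after them,
        // and its order matters, so compare it as a signed int.
        return Integer.compare(b1.getInt(4 + len1), b2.getInt(4 + len2));
    }

    public static void main(String[] args) {
        byte[] a = encode(new byte[] {1, 2}, -3);
        byte[] b = encode(new byte[] {1, 2}, 2);
        System.out.println(compare(a, b) < 0 ? "a sorts before b" : "a does not sort before b");
    }
}
```

A complex secondary key would need its own type-aware comparison in place of the final `Integer.compare`, which is why the description defers it.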




[jira] Commented: (PIG-1468) DataByteArray.compareTo() does not compare in lexicographic order

2010-06-29 Thread Gianmarco De Francisci Morales (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883557#action_12883557
 ] 

Gianmarco De Francisci Morales commented on PIG-1468:
-

I ran some tests. I see a ~1% decrease in performance overall.

I looked around the codebase for references to the method, and it does not seem 
there is any place that relies on the specific ordering.

Here is the code I used:

{code}
import java.util.Random;

public class TestSpeed {
    private static final int TIMES = (int) 10e6;
    private static final int NUM_ARRAYS = (int) 10e5;
    private static final int ARRAY_LENGTH = 50;

    // Compares byte by byte using Java's signed byte values (-128..127).
    private static int compareSigned(byte[] b1, byte[] b2) {
        int i;
        for (i = 0; i < b1.length; i++) {
            if (i >= b2.length)
                return 1;
            int a = b1[i];
            int b = b2[i];
            if (a < b)
                return -1;
            else if (a > b)
                return 1;
        }
        if (i < b2.length)
            return -1;
        return 0;
    }

    // Compares byte by byte after masking with 0xff, i.e. as unsigned values (0..255).
    private static int compareUnsigned(byte[] b1, byte[] b2) {
        int i;
        for (i = 0; i < b1.length; i++) {
            if (i >= b2.length)
                return 1;
            int a = b1[i] & 0xff;
            int b = b2[i] & 0xff;
            if (a < b)
                return -1;
            else if (a > b)
                return 1;
        }
        if (i < b2.length)
            return -1;
        return 0;
    }

    public static void main(String[] args) {
        long before, after;
        // Fixed seed so both runs see the same random data.
        Random rand = new Random(123456789);
        byte[][] batch1 = new byte[NUM_ARRAYS][];
        byte[][] batch2 = new byte[NUM_ARRAYS][];
        for (int i = 0; i < NUM_ARRAYS; i++) {
            batch1[i] = new byte[ARRAY_LENGTH];
            batch2[i] = new byte[ARRAY_LENGTH];
            rand.nextBytes(batch1[i]);
            rand.nextBytes(batch2[i]);
        }

        // Note: the inner loop only visits the first ARRAY_LENGTH array pairs.
        before = System.currentTimeMillis();
        for (int i = 0; i < TIMES; i++)
            for (int j = 0; j < ARRAY_LENGTH; j++)
                compareSigned(batch1[j], batch2[j]);
        after = System.currentTimeMillis();
        System.out.println("Time for signed comparison (ms): " + (after - before));

        before = System.currentTimeMillis();
        for (int i = 0; i < TIMES; i++)
            for (int j = 0; j < ARRAY_LENGTH; j++)
                compareUnsigned(batch1[j], batch2[j]);
        after = System.currentTimeMillis();
        System.out.println("Time for UNsigned comparison (ms): " + (after - before));
    }
}
{code}

 DataByteArray.compareTo() does not compare in lexicographic order
 -

 Key: PIG-1468
 URL: https://issues.apache.org/jira/browse/PIG-1468
 Project: Pig
  Issue Type: Bug
Reporter: Gianmarco De Francisci Morales
Assignee: Gianmarco De Francisci Morales
 Attachments: PIG-1468.patch


 The compareTo() method of org.apache.pig.data.DataByteArray does not compare 
 items in lexicographic order.
 Instead, it takes into account the sign of the bytes that compose the 
 DataByteArray.
 So, for example, 0xff compares as less than 0x00.




[jira] Updated: (PIG-1309) Map-side Cogroup

2010-06-29 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1309:


Fix Version/s: 0.8.0

 Map-side Cogroup
 

 Key: PIG-1309
 URL: https://issues.apache.org/jira/browse/PIG-1309
 Project: Pig
  Issue Type: Bug
  Components: impl
Reporter: Ashutosh Chauhan
Assignee: Ashutosh Chauhan
 Fix For: 0.8.0

 Attachments: mapsideCogrp.patch, pig-1309_1.patch, pig-1309_2.patch


 In the never-ending quest to make Pig go faster, we want to parallelize as 
 many relational operations as possible. It's already possible to do Group-by 
 (PIG-984) and Joins (PIG-845, PIG-554) purely map-side in Pig. This jira is 
 to add a map-side implementation of Cogroup in Pig. Details to follow.




[jira] Created: (PIG-1475) Large number of FILTER matching clauses causes internal error 2998

2010-06-29 Thread Mike Subelsky (JIRA)
Large number of FILTER matching clauses causes internal error 2998
--

 Key: PIG-1475
 URL: https://issues.apache.org/jira/browse/PIG-1475
 Project: Pig
  Issue Type: Bug
  Components: grunt
Affects Versions: 0.6.0
 Environment: Amazon Elastic MapReduce with Pig 0.6 and Hadoop 0.20.2
Reporter: Mike Subelsky
Priority: Minor


I'm generating reports using Pig where I only want to report on rows matching a 
set of regular expressions, but those regular expressions are pretty numerous.

Pig fails with this internal error when I run FILTER with 500 terms:

2010-06-29 15:46:04,579 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
2998: Unhandled internal error. null

This only occurred when I ran my Pig script against Hadoop with the full 
dataset. When I ran Pig in local mode, with a smaller sample file, Pig handled 
the FILTER command just fine.

The workaround has been to split my list into two separate lists of 250 then 
UNION the results, but I assume this is something that could be addressed in 
the code.




[jira] Updated: (PIG-1434) Allow casting relations to scalars

2010-06-29 Thread Aniket Mokashi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aniket Mokashi updated PIG-1434:


Status: Open  (was: Patch Available)

 Allow casting relations to scalars
 --

 Key: PIG-1434
 URL: https://issues.apache.org/jira/browse/PIG-1434
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Aniket Mokashi
 Fix For: 0.8.0

 Attachments: scalarImpl.patch


 This jira is to implement a simplified version of the functionality described 
 in https://issues.apache.org/jira/browse/PIG-801.
 The proposal is to allow casting relations to scalar types in foreach.
 Example:
 A = load 'data' as (x, y, z);
 B = group A all;
 C = foreach B generate COUNT(A);
 .
 X = 
 Y = foreach X generate $1/(long) C;
 Couple of additional comments:
 (1) You can only cast relations including a single value or an error will be 
 reported
 (2) Name resolution is needed since relation X might have field named C in 
 which case that field takes precedence.
 (3) Y will look for C closest to it.
 Implementation thoughts:
 The idea is to store C into a file and then convert it into scalar via a UDF. 
 I believe we already have a UDF that Ben Reed contributed for this purpose. 
 Most of the work would be to update the logical plan to
 (1) Store C
 (2) convert the cast to the UDF




[jira] Commented: (PIG-1475) Large number of FILTER matching clauses causes internal error 2998

2010-06-29 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883768#action_12883768
 ] 

Jeff Zhang commented on PIG-1475:
-

Mike, could you paste your Pig script and the full exception call stack? 
Attaching the sample data would be even better.

 Large number of FILTER matching clauses causes internal error 2998
 --

 Key: PIG-1475
 URL: https://issues.apache.org/jira/browse/PIG-1475
 Project: Pig
  Issue Type: Bug
  Components: grunt
Affects Versions: 0.6.0
 Environment: Amazon Elastic MapReduce with Pig 0.6 and Hadoop 0.20.2
Reporter: Mike Subelsky
Priority: Minor

 I'm generating reports using Pig where I only want to report on rows matching 
 a set of regular expressions, but those regular expressions are pretty 
 numerous.
 Pig fails with this internal error when I run FILTER with 500 terms:
 2010-06-29 15:46:04,579 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 2998: Unhandled internal error. null
 This only occurred when I ran my Pig script against Hadoop with the full 
 dataset. When I ran Pig in local mode, with a smaller sample file, Pig 
 handled the FILTER command just fine.
 The workaround has been to split my list into two separate lists of 250 then 
 UNION the results, but I assume this is something that could be addressed in 
 the code.




[jira] Created: (PIG-1476) Add trailing flag to commands to prevent retention of relation name in field names: STRIP ?

2010-06-29 Thread Russell Jurney (JIRA)
Add trailing flag to commands to prevent retention of relation name in field 
names: STRIP ?
---

 Key: PIG-1476
 URL: https://issues.apache.org/jira/browse/PIG-1476
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.7.0
 Environment: sunny, 60% humidity with a chance of rain.
Reporter: Russell Jurney
 Fix For: 0.8.0


After doing a JOIN or a GROUP/FOREACH, one often ends up with data looking like:

 DESCRIBE foo;

   foo: {other_thing::f1:int, other_thing::f2:chararray, other_thing::f3: int}

What wunn usually wants is:

   foo: {f1:int, f2:chararray, f3: int}

At this point, won is left with two choices, neither of which is very good.  
Choice wan:

 foo = FOREACH foo GENERATE $0 AS f1, $1 AS f2, $2 AS f3;

This is a poor choice because later when wahn edits this file, it is confusing 
to remember what order is what field when wun manipulates something up stream 
in the script.  So instead whun does this:

 foo = FOREACH foo GENERATE old_thing::f1 AS f1, old_thing::f2 AS f2, 
 old_thing::f3 AS f3;

This is a poor choice because it is verbose and cumbersome.

Whan is unsure what to do, pauses and reflects that the Pig is perplexing, and 
hopes for a better tomorrow.  Here's what wuhn should do to avoid this 
situation:

foo = JOIN old_thing BY f1, other_thing BY f1 STRIP;

DESCRIBE foo;

   foo: {f1:int, f2:chararray, f3: int}

I think so, anyway.  I leave the behavior of duplicate fields to more 
enlightened beings, but I think this would be a big improvement to Pig Latin.
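
As a rough sketch of what STRIP would do to a schema, here is the renaming step in plain Java. Nothing here is Pig code: `StripPrefixes` and its methods are hypothetical, and rejecting duplicate names is only one possible choice, since the proposal deliberately leaves that behavior open.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of the proposed STRIP behavior on field names. */
final class StripPrefixes {
    /** Drops everything up to and including the last "::" in a field name. */
    static String strip(String qualifiedName) {
        int idx = qualifiedName.lastIndexOf("::");
        return idx < 0 ? qualifiedName : qualifiedName.substring(idx + 2);
    }

    /**
     * Strips every field name, keeping the original order.
     * Assumption: duplicates after stripping are rejected; the proposal does not decide this.
     */
    static List<String> stripAll(List<String> fields) {
        Set<String> seen = new LinkedHashSet<>();
        for (String field : fields) {
            String shortName = strip(field);
            if (!seen.add(shortName)) {
                throw new IllegalArgumentException("duplicate field after STRIP: " + shortName);
            }
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        List<String> schema =
                Arrays.asList("other_thing::f1", "other_thing::f2", "other_thing::f3");
        System.out.println(stripAll(schema));
    }
}
```

Note that a join on `f1` from both relations would produce two fields that both strip to `f1`, so whatever rule is chosen for duplicates will be exercised by the most common case.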





[jira] Updated: (PIG-1476) Add trailing flag to commands to prevent retention of relation name in field names: STRIP ?

2010-06-29 Thread Russell Jurney (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Russell Jurney updated PIG-1476:


Description: 
After doing a JOIN or a GROUP/FOREACH, one often ends up with data looking like:

 DESCRIBE foo;

   foo: {other_thing::f1:int, other_thing::f2:chararray, other_thing::f3: int}

If oun was to let this chain, ouin can end up with: 
first_thing::second_thing::third_thing::fourth_thing::f1 which is pretty hairy.

What wunn usually wants is:

   foo: {f1:int, f2:chararray, f3: int}

At this point, won is left with two choices, neither of which is very good.  
Choice wan:

 foo = FOREACH foo GENERATE $0 AS f1, $1 AS f2, $2 AS f3;

This is a poor choice because later when wahn edits this file, it is confusing 
to remember what order is what field when wun manipulates something up stream 
in the script.  So instead whun does this:

 foo = FOREACH foo GENERATE old_thing::f1 AS f1, old_thing::f2 AS f2, 
 old_thing::f3 AS f3;

or

 foo = FOREACH foo GENERATE f1 AS f1, f2 AS f2, f3 AS f3;

This is a poor choice because it is verbose and cumbersome.

With no good choices available, whan is unsure what to do, pauses and reflects 
that the Pig is perplexing, and hopes for a better tomorrow.  Here's what wuhn 
should do to avoid this situation:

foo = JOIN old_thing BY f1, other_thing BY f1 STRIP;

DESCRIBE foo;

   foo: {f1:int, f2:chararray, f3: int}

I think so, anyway.  I leave the behavior of duplicate fields to more 
enlightened beings, but I think this would be a big improvement to Pig Latin.


  was:
After doing a JOIN or a GROUP/FOREACH, one often ends up with data looking like:

 DESCRIBE foo;

   foo: {other_thing::f1:int, other_thing::f2:chararray, other_thing::f3: int}

What wunn usually wants is:

   foo: {f1:int, f2:chararray, f3: int}

At this point, won is left with two choices, neither of which is very good.  
Choice wan:

 foo = FOREACH foo GENERATE $0 AS f1, $1 AS f2, $3 AS f3;

This is a poor choice because later when wahn edits this file, it is confusing 
to remember what order is what field when wun manipulates something up stream 
in the script.  So instead whun does this:

 foo = FOREACH foo GENERATE old_thing::f1 AS f1, old_thing::f2 AS f2, 
 old_thing::f3 AS f3;

or

 foo = FOREACH foo GENERATE f1 AS f1, f2 AS f2, f3 AS f3;

This is a poor choice because it is verbose and cumbersome.

With no good choices available, whan is unsure what to do, pauses and reflects 
that the Pig is perplexing, and hopes for a better tomorrow.  Here's what wuhn 
should do to avoid this situation:

foo = JOIN old_thing by f1, other_thing BY f1 STRIP;

DESCRIBE foo foo: {f1:int, f2:chararray, f3: int};

I think so, anyway.  I leave the behavior of duplicate fields to more 
enlightened beings, but I think this would be a big improvement to Pig Latin.



 Add trailing flag to commands to prevent retention of relation name in field 
 names: STRIP ?
 ---

 Key: PIG-1476
 URL: https://issues.apache.org/jira/browse/PIG-1476
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.7.0
 Environment: sunny, 60% humidity with a chance of rain.
Reporter: Russell Jurney
 Fix For: 0.8.0


 After doing a JOIN or a GROUP/FOREACH, one often ends up with data looking 
 like:
  DESCRIBE foo;
foo: {other_thing::f1:int, other_thing::f2:chararray, other_thing::f3: int}
 If oun was to let this chain, ouin can end up with: 
 first_thing::second_thing::third_thing::fourth_thing::f1 which is pretty 
 hairy.
 What wunn usually wants is:
foo: {f1:int, f2:chararray, f3: int}
 At this point, won is left with two choices, neither of which is very good.  
 Choice wan:
  foo = FOREACH foo GENERATE $0 AS f1, $1 AS f2, $2 AS f3;
 This is a poor choice because later when wahn edits this file, it is 
 confusing to remember what order is what field when wun manipulates something 
 up stream in the script.  So instead whun does this:
  foo = FOREACH foo GENERATE old_thing::f1 AS f1, old_thing::f2 AS f2, 
  old_thing::f3 AS f3;
 or
  foo = FOREACH foo GENERATE f1 AS f1, f2 AS f2, f3 AS f3;
 This is a poor choice because it is verbose and cumbersome.
 With no good choices available, whan is unsure what to do, pauses and 
 reflects that the Pig is perplexing, and hopes for a better tomorrow.  Here's 
 what wuhn should do to avoid this situation:
 foo = JOIN old_thing BY f1, other_thing BY f1 STRIP;
 DESCRIBE foo;
 foo: {f1:int, f2:chararray, f3: int}
 I think so, anyway.  I leave the behavior of duplicate fields to more 
 enlightened beings, but I think this would be a big improvement to Pig Latin.
