[jira] Updated: (PIG-1658) ORDER BY does not work properly on integer/short keys that are -1

2010-10-01 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1658:
--

Attachment: PIG-1658.patch

Add Zebra test TestMergeJoinPartial to the pigtest target.

 ORDER BY does not work properly on integer/short keys that are -1
 -

 Key: PIG-1658
 URL: https://issues.apache.org/jira/browse/PIG-1658
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Yan Zhou
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: PIG-1658.patch, PIG-1658.patch


 In fact, keys of all these types whose values are negative but within the 
 byte or short range have the problem.
 Basically, a byte value of -1 & 0xff will return 255, not -1.
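
The arithmetic in the description can be checked directly. The sketch below is illustrative only and is not taken from the PIG-1658 patch; the class and method names are hypothetical. It assumes the description refers to masking a byte with 0xff, which keeps the low 8 bits and drops the sign, whereas a plain widening conversion preserves the value.

```java
// Illustrative sketch; not code from Pig or from the PIG-1658 patch.
class SignExtensionDemo {
    // Masking promotes the byte to int and keeps only the low 8 bits,
    // so a negative byte comes back as a value in 128..255.
    static int maskedByte(byte b) {
        return b & 0xff;
    }

    // A plain widening conversion sign-extends and preserves the value.
    static int widenedByte(byte b) {
        return b;
    }

    // The same effect for a short: the mask keeps the low 16 bits.
    static int maskedShort(short s) {
        return s & 0xffff;
    }

    public static void main(String[] args) {
        byte key = -1;
        System.out.println(maskedByte(key));   // 255
        System.out.println(widenedByte(key));  // -1
        // A key decoded as 255 sorts after 0, although -1 should sort before it.
        System.out.println(Integer.compare(maskedByte(key), 0) > 0);  // true
    }
}
```

If a sort key is decoded with the masked form, every negative value in the byte or short range would compare as a large positive number, which is consistent with the misordering reported here.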

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1659) sortinfo is not set for store if there is a filter after ORDER BY

2010-10-01 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12917012#action_12917012
 ] 

Yan Zhou commented on PIG-1659:
---

Need to make sure it is invoked after optimization in both old and new logical 
plans.

 sortinfo is not set for store if there is a filter after ORDER BY
 -

 Key: PIG-1659
 URL: https://issues.apache.org/jira/browse/PIG-1659
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Yan Zhou
Assignee: Daniel Dai
 Fix For: 0.8.0


 This has caused 6 (of 7) failures in the Zebra test 
 TestOrderPreserveVariableTable.




[jira] Updated: (PIG-1658) ORDER BY does not work properly on integer/short keys that are -1

2010-10-01 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1658:
--

Status: Resolved  (was: Patch Available)
Resolution: Fixed

Committed to both trunk and the 0.8 branch.

 ORDER BY does not work properly on integer/short keys that are -1
 -

 Key: PIG-1658
 URL: https://issues.apache.org/jira/browse/PIG-1658
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Yan Zhou
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: PIG-1658.patch, PIG-1658.patch


 In fact, keys of all these types whose values are negative but within the 
 byte or short range have the problem.
 Basically, a byte value of -1 & 0xff will return 255, not -1.




[jira] Created: (PIG-1658) ORDER BY does not work properly on integer/short keys that are -1

2010-09-30 Thread Yan Zhou (JIRA)
ORDER BY does not work properly on integer/short keys that are -1
-

 Key: PIG-1658
 URL: https://issues.apache.org/jira/browse/PIG-1658
 Project: Pig
  Issue Type: Bug
Reporter: Yan Zhou


In fact, keys of all these types whose values are negative but within the 
byte or short range have the problem.

Basically, a byte value of -1 & 0xff will return 255, not -1.




[jira] Updated: (PIG-1658) ORDER BY does not work properly on integer/short keys that are -1

2010-09-30 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1658:
--

Fix Version/s: 0.8.0
Affects Version/s: 0.8.0

 ORDER BY does not work properly on integer/short keys that are -1
 -

 Key: PIG-1658
 URL: https://issues.apache.org/jira/browse/PIG-1658
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Yan Zhou
Assignee: Yan Zhou
 Fix For: 0.8.0


 In fact, keys of all these types whose values are negative but within the 
 byte or short range have the problem.
 Basically, a byte value of -1 & 0xff will return 255, not -1.




[jira] Assigned: (PIG-1658) ORDER BY does not work properly on integer/short keys that are -1

2010-09-30 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou reassigned PIG-1658:
-

Assignee: Yan Zhou

 ORDER BY does not work properly on integer/short keys that are -1
 -

 Key: PIG-1658
 URL: https://issues.apache.org/jira/browse/PIG-1658
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Yan Zhou
Assignee: Yan Zhou
 Fix For: 0.8.0


 In fact, keys of all these types whose values are negative but within the 
 byte or short range have the problem.
 Basically, a byte value of -1 & 0xff will return 255, not -1.




[jira] Created: (PIG-1659) sortinfo is not set for store if there is a filter after ORDER BY

2010-09-30 Thread Yan Zhou (JIRA)
sortinfo is not set for store if there is a filter after ORDER BY
-

 Key: PIG-1659
 URL: https://issues.apache.org/jira/browse/PIG-1659
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Yan Zhou
Assignee: Daniel Dai
 Fix For: 0.8.0


This has caused 6 (of 7) failures in the Zebra test 
TestOrderPreserveVariableTable.




[jira] Updated: (PIG-1658) ORDER BY does not work properly on integer/short keys that are -1

2010-09-30 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1658:
--

Status: Patch Available  (was: Open)

 ORDER BY does not work properly on integer/short keys that are -1
 -

 Key: PIG-1658
 URL: https://issues.apache.org/jira/browse/PIG-1658
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Yan Zhou
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: PIG-1658.patch


 In fact, keys of all these types whose values are negative but within the 
 byte or short range have the problem.
 Basically, a byte value of -1 & 0xff will return 255, not -1.




[jira] Updated: (PIG-1658) ORDER BY does not work properly on integer/short keys that are -1

2010-09-30 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1658:
--

Attachment: PIG-1658.patch

This problem is caused by the PIG-1295 patch.

test-core passes. Zebra's nightly tests pass too.

test-patch output:

 [exec] -1 overall.
 [exec]
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec]
 [exec] -1 tests included.  The patch doesn't appear to include any new 
or modified tests.
 [exec] Please justify why no tests are needed for 
this patch.
 [exec]
 [exec] +1 javadoc.  The javadoc tool did not generate any warning 
messages.
 [exec]
 [exec] +1 javac.  The applied patch does not increase the total number 
of javac compiler warnings.
 [exec]
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs 
warnings.
 [exec]
 [exec] +1 release audit.  The applied patch does not increase the 
total number of release audit warnings.

Zebra's TestMergeJoinPartial is used to verify the fix.

 ORDER BY does not work properly on integer/short keys that are -1
 -

 Key: PIG-1658
 URL: https://issues.apache.org/jira/browse/PIG-1658
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Yan Zhou
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: PIG-1658.patch


 In fact, keys of all these types whose values are negative but within the 
 byte or short range have the problem.
 Basically, a byte value of -1 & 0xff will return 255, not -1.




[jira] Commented: (PIG-1648) Split combination may return too many block locations to map/reduce framework

2010-09-28 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915815#action_12915815
 ] 

Yan Zhou commented on PIG-1648:
---

The top 5 locations with the most data will be used. This has been agreed upon 
by the M/R dev.

 Split combination may return too many block locations to map/reduce framework
 -

 Key: PIG-1648
 URL: https://issues.apache.org/jira/browse/PIG-1648
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Yan Zhou
Assignee: Yan Zhou
 Fix For: 0.8.0


 For instance, suppose one small split has block locations h1, h2 and h3, and 
 another small split has h1, h3 and h4. After combination, the composite split 
 contains 4 block locations. If the number of component splits is large, the 
 number of block locations can be large too. In fact, the list of block 
 locations serves as a hint to M/R about the best hosts to run this composite 
 split on, so it should be kept short, say 5 entries, naming the hosts that 
 hold the most data in this composite split.
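
The selection described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in PIG-1648.patch: the Split class, the topHosts method and the tie-breaking rule are all hypothetical, and each component split is assumed to contribute its full length to every host that holds it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch; not Pig's or Hadoop's actual split classes.
class TopLocations {
    // Hypothetical stand-in for a component split: length in bytes plus host names.
    static final class Split {
        final long length;
        final String[] hosts;

        Split(long length, String... hosts) {
            this.length = length;
            this.hosts = hosts;
        }
    }

    // Returns at most `limit` host names, ordered by the amount of data they hold.
    static List<String> topHosts(List<Split> splits, int limit) {
        // Sum the bytes each host holds across all component splits.
        Map<String, Long> bytesPerHost = new HashMap<>();
        for (Split s : splits) {
            for (String host : s.hosts) {
                bytesPerHost.merge(host, s.length, Long::sum);
            }
        }
        List<Map.Entry<String, Long>> entries = new ArrayList<>(bytesPerHost.entrySet());
        // Most data first; ties are broken by host name so the result is deterministic.
        entries.sort((a, b) -> {
            int byBytes = Long.compare(b.getValue(), a.getValue());
            return byBytes != 0 ? byBytes : a.getKey().compareTo(b.getKey());
        });
        List<String> result = new ArrayList<>();
        for (int i = 0; i < entries.size() && i < limit; i++) {
            result.add(entries.get(i).getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        List<Split> splits = new ArrayList<>();
        splits.add(new Split(100, "h1", "h2", "h3"));
        splits.add(new Split(50, "h1", "h3", "h4"));
        System.out.println(topHosts(splits, 2));  // [h1, h3]
    }
}
```

With the two splits from the example, h1 and h3 each hold 150 bytes, h2 holds 100 and h4 holds 50, so a limit of 2 keeps h1 and h3 and drops the rest.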




[jira] Commented: (PIG-1648) Split combination may return too many block locations to map/reduce framework

2010-09-28 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915852#action_12915852
 ] 

Yan Zhou commented on PIG-1648:
---

test-patch results:

 [exec] +1 overall.
 [exec]
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec]
 [exec] +1 tests included.  The patch appears to include 3 new or 
modified tests.
 [exec]
 [exec] +1 javadoc.  The javadoc tool did not generate any warning 
messages.
 [exec]
 [exec] +1 javac.  The applied patch does not increase the total number 
of javac compiler warnings.
 [exec]
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs 
warnings.
 [exec]
 [exec] +1 release audit.  The applied patch does not increase the 
total number of release audit warnings.

test-core tests pass too.


 Split combination may return too many block locations to map/reduce framework
 -

 Key: PIG-1648
 URL: https://issues.apache.org/jira/browse/PIG-1648
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Yan Zhou
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: PIG-1648.patch


 For instance, suppose one small split has block locations h1, h2 and h3, and 
 another small split has h1, h3 and h4. After combination, the composite split 
 contains 4 block locations. If the number of component splits is large, the 
 number of block locations can be large too. In fact, the list of block 
 locations serves as a hint to M/R about the best hosts to run this composite 
 split on, so it should be kept short, say 5 entries, naming the hosts that 
 hold the most data in this composite split.




[jira] Updated: (PIG-1648) Split combination may return too many block locations to map/reduce framework

2010-09-28 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1648:
--

Status: Resolved  (was: Patch Available)
Resolution: Fixed

Patch committed to both trunk and the 0.8 branch.

 Split combination may return too many block locations to map/reduce framework
 -

 Key: PIG-1648
 URL: https://issues.apache.org/jira/browse/PIG-1648
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Yan Zhou
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: PIG-1648.patch


 For instance, suppose one small split has block locations h1, h2 and h3, and 
 another small split has h1, h3 and h4. After combination, the composite split 
 contains 4 block locations. If the number of component splits is large, the 
 number of block locations can be large too. In fact, the list of block 
 locations serves as a hint to M/R about the best hosts to run this composite 
 split on, so it should be kept short, say 5 entries, naming the hosts that 
 hold the most data in this composite split.




[jira] Updated: (PIG-1648) Split combination may return too many block locations to map/reduce framework

2010-09-28 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1648:
--

Status: Patch Available  (was: Open)

 Split combination may return too many block locations to map/reduce framework
 -

 Key: PIG-1648
 URL: https://issues.apache.org/jira/browse/PIG-1648
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Yan Zhou
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: PIG-1648.patch


 For instance, suppose one small split has block locations h1, h2 and h3, and 
 another small split has h1, h3 and h4. After combination, the composite split 
 contains 4 block locations. If the number of component splits is large, the 
 number of block locations can be large too. In fact, the list of block 
 locations serves as a hint to M/R about the best hosts to run this composite 
 split on, so it should be kept short, say 5 entries, naming the hosts that 
 hold the most data in this composite split.




[jira] Updated: (PIG-1647) Logical simplifier throws a NPE

2010-09-27 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1647:
--

Status: Resolved  (was: Patch Available)
Resolution: Fixed

Patch committed to both trunk and the 0.8 branch.

 Logical simplifier throws a NPE
 ---

 Key: PIG-1647
 URL: https://issues.apache.org/jira/browse/PIG-1647
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Yan Zhou
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: PIG-1647.patch, PIG-1647.patch


 A query like:
 A = load 'd.txt' as (a:chararray, b:long, c:map[], d:chararray, e:chararray);
 B = filter A by a == 'v' and b == 117L and c#'p1' == 'h' and c#'p2' == 'to' 
 and ((d is not null and d != '') or (e is not null and e != ''));
 will cause the logical expression simplifier to throw a NPE.




[jira] Created: (PIG-1651) PIG class loading mishandled

2010-09-27 Thread Yan Zhou (JIRA)
PIG class loading mishandled


 Key: PIG-1651
 URL: https://issues.apache.org/jira/browse/PIG-1651
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Yan Zhou
Assignee: Richard Ding
 Fix For: 0.8.0


If zebra.jar is only registered in a PIG script but is not in the CLASSPATH, 
a query using zebra fails, apparently because multiple copies of a class are 
loaded into the JVM, so that a static variable set earlier is not seen after 
an instance of the class is created through reflection. (Once zebra.jar is 
specified in the CLASSPATH, it works fine.) The exception stack is as follows:

Backend error message during job submission
---
org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to create input splits for: hdfs://hostname/pathto/zebra_dir :: null
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:284)
    at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:907)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:801)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:752)
    at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:378)
    at org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247)
    at org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279)
    at java.lang.Thread.run(Thread.java:619)
Caused by: java.lang.NullPointerException
    at org.apache.hadoop.zebra.io.ColumnGroup.getNonDataFilePrefix(ColumnGroup.java:123)
    at org.apache.hadoop.zebra.io.ColumnGroup$CGPathFilter.accept(ColumnGroup.java:2413)
    at org.apache.hadoop.zebra.mapreduce.TableInputFormat$DummyFileInputFormat$MultiPathFilter.accept(TableInputFormat.java:718)
    at org.apache.hadoop.fs.FileSystem$GlobFilter.accept(FileSystem.java:1084)
    at org.apache.hadoop.fs.FileSystem.globStatusInternal(FileSystem.java:919)
    at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:866)
    at org.apache.hadoop.zebra.mapreduce.TableInputFormat$DummyFileInputFormat.listStatus(TableInputFormat.java:780)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:246)
    at org.apache.hadoop.zebra.mapreduce.TableInputFormat.getRowSplits(TableInputFormat.java:863)
    at org.apache.hadoop.zebra.mapreduce.TableInputFormat.getSplits(TableInputFormat.java:1017)
    at org.apache.hadoop.zebra.mapreduce.TableInputFormat.getSplits(TableInputFormat.java:961)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:269)
    ... 7 more






[jira] Updated: (PIG-1647) Logical simplifier throws a NPE

2010-09-26 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1647:
--

Attachment: PIG-1647.patch

passes test-core.

test-patch results:

 [exec] +1 overall.
 [exec]
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec]
 [exec] +1 tests included.  The patch appears to include 3 new or 
modified tests.
 [exec]
 [exec] +1 javadoc.  The javadoc tool did not generate any warning 
messages.
 [exec]
 [exec] +1 javac.  The applied patch does not increase the total number 
of javac compiler warnings.
 [exec]
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs 
warnings.
 [exec]
 [exec] +1 release audit.  The applied patch does not increase the 
total number of release audit warnings.


 Logical simplifier throws a NPE
 ---

 Key: PIG-1647
 URL: https://issues.apache.org/jira/browse/PIG-1647
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Yan Zhou
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: PIG-1647.patch, PIG-1647.patch


 A query like:
 A = load 'd.txt' as (a:chararray, b:long, c:map[], d:chararray, e:chararray);
 B = filter A by a == 'v' and b == 117L and c#'p1' == 'h' and c#'p2' == 'to' 
 and ((d is not null and d != '') or (e is not null and e != ''));
 will cause the logical expression simplifier to throw a NPE.




[jira] Updated: (PIG-1647) Logical simplifier throws a NPE

2010-09-26 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1647:
--

Status: Patch Available  (was: Open)

 Logical simplifier throws a NPE
 ---

 Key: PIG-1647
 URL: https://issues.apache.org/jira/browse/PIG-1647
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Yan Zhou
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: PIG-1647.patch, PIG-1647.patch


 A query like:
 A = load 'd.txt' as (a:chararray, b:long, c:map[], d:chararray, e:chararray);
 B = filter A by a == 'v' and b == 117L and c#'p1' == 'h' and c#'p2' == 'to' 
 and ((d is not null and d != '') or (e is not null and e != ''));
 will cause the logical expression simplifier to throw a NPE.




[jira] Updated: (PIG-1645) Using both small split combination and temporary file compression on a query of ORDER BY may cause crash

2010-09-24 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1645:
--

Attachment: PIG-1645.patch

test-core passed.

test-patch results:

 [exec] -1 overall.
 [exec]
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec]
 [exec] -1 tests included.  The patch doesn't appear to include any new 
or modified tests.
 [exec] Please justify why no tests are needed for 
this patch.
 [exec]
 [exec] +1 javadoc.  The javadoc tool did not generate any warning 
messages.
 [exec]
 [exec] +1 javac.  The applied patch does not increase the total number 
of javac compiler warnings.
 [exec]
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs 
warnings.
 [exec]
 [exec] -1 release audit.  The applied patch generated 459 release 
audit warnings (more than the trunk's current 457 warnings).

The scenario is truly a corner case. The following query *might* have caused 
the problem:

A = load '/tmp/test/jsTst2.txt' as (fn, age:int);
B = load '/tmp/test/sample.txt' as (fn, age:int);
C = join A by fn, B by fn USING 'replicated';
D = ORDER C BY B::age;
dump D;

where sample.txt has only one row, whose join key matches a single record in 
jsTst2.txt, and jsTst2.txt should be several HDFS blocks in size. Even so, a 
failure is not deterministic: it depends on whether any of the logically empty 
files is placed in the first underlying split of the list of combined splits. 
Compute nodes' host names seem to play a role too. No failure has been seen 
when running in local mode.

The 2 release audit warnings are due to jdiff. No new file added.

 Using both small split combination and temporary file compression on a query 
 of ORDER BY may cause crash
 

 Key: PIG-1645
 URL: https://issues.apache.org/jira/browse/PIG-1645
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Yan Zhou
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: PIG-1645.patch


 The stack looks like the following:
 java.lang.NullPointerException
     at java.util.Arrays.binarySearch(Arrays.java:2043)
     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.getPartition(WeightedRangePartitioner.java:72)
     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.getPartition(WeightedRangePartitioner.java:52)
     at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:565)
     at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:116)
     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:238)
     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231)
     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53)
     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:638)
     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:314)
     at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
     at java.security.AccessController.doPrivileged(Native Method)
     at javax.security.auth.Subject.doAs(Subject.java:396)
     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1062)
     at org.apache.hadoop.mapred.Child.main(Child.java:211)




[jira] Updated: (PIG-1645) Using both small split combination and temporary file compression on a query of ORDER BY may cause crash

2010-09-24 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1645:
--

Status: Patch Available  (was: Open)

 Using both small split combination and temporary file compression on a query 
 of ORDER BY may cause crash
 

 Key: PIG-1645
 URL: https://issues.apache.org/jira/browse/PIG-1645
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Yan Zhou
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: PIG-1645.patch


 The stack looks like the following:
 java.lang.NullPointerException
     at java.util.Arrays.binarySearch(Arrays.java:2043)
     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.getPartition(WeightedRangePartitioner.java:72)
     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.getPartition(WeightedRangePartitioner.java:52)
     at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:565)
     at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:116)
     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:238)
     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231)
     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53)
     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:638)
     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:314)
     at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
     at java.security.AccessController.doPrivileged(Native Method)
     at javax.security.auth.Subject.doAs(Subject.java:396)
     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1062)
     at org.apache.hadoop.mapred.Child.main(Child.java:211)




[jira] Commented: (PIG-1645) Using both small split combination and temporary file compression on a query of ORDER BY may cause crash

2010-09-24 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914541#action_12914541
 ] 

Yan Zhou commented on PIG-1645:
---

The possibility of failure also depends upon the block distribution since the 
split combination makes use of that info.

 Using both small split combination and temporary file compression on a query 
 of ORDER BY may cause crash
 

 Key: PIG-1645
 URL: https://issues.apache.org/jira/browse/PIG-1645
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Yan Zhou
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: PIG-1645.patch


 The stack looks like the following:
 java.lang.NullPointerException
     at java.util.Arrays.binarySearch(Arrays.java:2043)
     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.getPartition(WeightedRangePartitioner.java:72)
     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.getPartition(WeightedRangePartitioner.java:52)
     at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:565)
     at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:116)
     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:238)
     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231)
     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53)
     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:638)
     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:314)
     at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
     at java.security.AccessController.doPrivileged(Native Method)
     at javax.security.auth.Subject.doAs(Subject.java:396)
     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1062)
     at org.apache.hadoop.mapred.Child.main(Child.java:211)




[jira] Commented: (PIG-1635) Logical simplifier does not simplify away constants under AND and OR; after simplification the ordering of operands of AND and OR may get changed

2010-09-24 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914672#action_12914672
 ] 

Yan Zhou commented on PIG-1635:
---

I did a thorough check for this patch. Actually some of the ordering changes 
were caused by the mentioned misuse. Thanks.

 Logical simplifier does not simplify away constants under AND and OR; after 
 simplification the ordering of operands of AND and OR may get changed
 

 Key: PIG-1635
 URL: https://issues.apache.org/jira/browse/PIG-1635
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.8.0
Reporter: Yan Zhou
Assignee: Yan Zhou
Priority: Minor
 Fix For: 0.8.0

 Attachments: PIG-1635.patch


 b = FILTER a by (( f1  1) AND (1 == 1))
 or 
 b = FILTER a by ((f1  1) OR ( 1==0))
 should be simplified to
 b = FILTER a by f1  1;
 Regarding the ordering change, an example is
 b = filter a by ((f1 is not null) AND (f2 is not null));
 Even without any possible simplification, the expression is changed to
 b = filter a by ((f2 is not null) AND (f1 is not null));
 The ordering change in this case, and probably in most other cases, makes no 
 difference, but some users might care about the ordering for two reasons: 
 stateful UDFs may be used as operands of AND or OR, and the ordering may be 
 intended by the application designer to maximize the chances of 
 short-circuiting the composite boolean evaluation.
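
The two requirements in this report, dropping constant operands of AND and OR and keeping the remaining operands in their written order, can be sketched on a miniature expression tree. This is a hedged illustration, not Pig's logical expression simplifier: the Expr, Const, Leaf and BinOp types and the simplify method are hypothetical, and a comparison such as (1 == 1) is assumed to have already been evaluated to a boolean constant.

```java
// Illustrative sketch; Pig's real logical plan classes are not used here.
class ConstantFolding {
    interface Expr {}

    // A boolean constant, e.g. the already-evaluated result of (1 == 1).
    static final class Const implements Expr {
        final boolean value;
        Const(boolean value) { this.value = value; }
        @Override public String toString() { return Boolean.toString(value); }
    }

    // An opaque predicate such as "f1 is not null".
    static final class Leaf implements Expr {
        final String text;
        Leaf(String text) { this.text = text; }
        @Override public String toString() { return text; }
    }

    // A binary AND (isAnd == true) or OR (isAnd == false).
    static final class BinOp implements Expr {
        final boolean isAnd;
        final Expr left;
        final Expr right;
        BinOp(boolean isAnd, Expr left, Expr right) {
            this.isAnd = isAnd;
            this.left = left;
            this.right = right;
        }
        @Override public String toString() {
            return "(" + left + (isAnd ? " AND " : " OR ") + right + ")";
        }
    }

    static Expr simplify(Expr e) {
        if (!(e instanceof BinOp)) {
            return e;
        }
        BinOp op = (BinOp) e;
        Expr left = simplify(op.left);
        Expr right = simplify(op.right);
        if (left instanceof Const) {
            return fold(op.isAnd, (Const) left, right);
        }
        if (right instanceof Const) {
            return fold(op.isAnd, (Const) right, left);
        }
        // No constant involved: rebuild with the operands in their original order.
        return new BinOp(op.isAnd, left, right);
    }

    // x AND true = x, x OR false = x; x AND false = false, x OR true = true.
    private static Expr fold(boolean isAnd, Const c, Expr other) {
        boolean isIdentity = (c.value == isAnd);
        return isIdentity ? other : c;
    }

    public static void main(String[] args) {
        Expr f1 = new Leaf("f1 is not null");
        Expr f2 = new Leaf("f2 is not null");
        System.out.println(simplify(new BinOp(true, f1, new Const(true))));   // f1 is not null
        System.out.println(simplify(new BinOp(false, f1, new Const(false)))); // f1 is not null
        System.out.println(simplify(new BinOp(true, f1, f2)));  // (f1 is not null AND f2 is not null)
    }
}
```

Note that folding "x AND false" to false discards x entirely, which is a separate question from reordering when x is a stateful UDF; the sketch only shows that operands which survive keep their left-to-right positions.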




[jira] Created: (PIG-1647) Logical simplifier throws a NPE

2010-09-24 Thread Yan Zhou (JIRA)
Logical simplifier throws a NPE
---

 Key: PIG-1647
 URL: https://issues.apache.org/jira/browse/PIG-1647
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Yan Zhou
Assignee: Yan Zhou
 Fix For: 0.8.0


A query like:

A = load 'd.txt' as (a:chararray, b:long, c:map[], d:chararray, e:chararray);
B = filter A by a == 'v' and b == 117L and c#'p1' == 'h' and c#'p2' == 'to' and 
((d is not null and d != '') or (e is not null and e != ''));

will cause the logical expression simplifier to throw a NPE.




[jira] Created: (PIG-1648) Split combination may return too many block locations to map/reduce framework

2010-09-24 Thread Yan Zhou (JIRA)
Split combination may return too many block locations to map/reduce framework
-

 Key: PIG-1648
 URL: https://issues.apache.org/jira/browse/PIG-1648
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Yan Zhou
Assignee: Yan Zhou
 Fix For: 0.8.0


For instance, suppose one small split has block locations h1, h2 and h3, and 
another small split has h1, h3 and h4. After combination, the composite split 
contains 4 block locations. If the number of component splits is big, the 
number of block locations could be big too. In fact, the list of block 
locations serves as a hint to M/R about the best hosts to run this composite 
split on, so it should be a short list, say 5, of the hosts that contain the 
most data in this composite split.
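
The selection described above could be sketched roughly as follows. This is only an illustration, not the code in any Pig patch: the function name top_hosts, the (length, hosts) input shape, and the split lengths of 10 and 20 are assumptions made for the example; the issue text gives only the host names.

```python
from collections import Counter

def top_hosts(component_splits, limit=5):
    """Return up to `limit` hosts, ranked by the bytes they hold in the composite split.

    component_splits: iterable of (length_in_bytes, [host, ...]) pairs (assumed shape).
    """
    bytes_per_host = Counter()
    for length, hosts in component_splits:
        for host in set(hosts):  # count each host once per component split
            bytes_per_host[host] += length
    # Descending byte count, then host name, so that ties are ordered deterministically.
    ranked = sorted(bytes_per_host.items(), key=lambda kv: (-kv[1], kv[0]))
    return [host for host, _ in ranked[:limit]]

# The two small splits from the description, with made-up lengths.
splits = [(10, ["h1", "h2", "h3"]), (20, ["h1", "h3", "h4"])]
print(top_hosts(splits, limit=2))  # ['h1', 'h3']
```

With the default limit of 5, all four hosts in this example would still be returned; the cap only matters once many component splits are combined.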




[jira] Updated: (PIG-1635) Logical simplifier does not simplify away constants under AND and OR; after simplification the ordering of operands of AND and OR may get changed

2010-09-24 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1635:
--

Status: Resolved  (was: Patch Available)
Resolution: Fixed

Patch committed to both trunk and the 0.8 branch.

 Logical simplifier does not simplify away constants under AND and OR; after 
 simplification the ordering of operands of AND and OR may get changed
 

 Key: PIG-1635
 URL: https://issues.apache.org/jira/browse/PIG-1635
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.8.0
Reporter: Yan Zhou
Assignee: Yan Zhou
Priority: Minor
 Fix For: 0.8.0

 Attachments: PIG-1635.patch


 b = FILTER a by (( f1 > 1) AND (1 == 1))
 or 
 b = FILTER a by ((f1 > 1) OR ( 1==0))
 should be simplified to
 b = FILTER a by f1 > 1;
 Regarding the ordering change, an example is 
 b = filter a by ((f1 is not null) AND (f2 is not null));
 Even without possible simplification, the expression is changed to
 b = filter a by ((f2 is not null) AND (f1 is not null));
 The ordering change in this case, and probably in most other cases, makes 
 no difference. Still, some users might care about the ordering for two 
 reasons: stateful UDFs may be used as operands of AND or OR; and the 
 ordering may be intended by the application designer to maximize the 
 chances of short-circuiting the composite boolean evaluation. 




[jira] Updated: (PIG-1647) Logical simplifier throws a NPE

2010-09-24 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1647:
--

Attachment: PIG-1647.patch

 Logical simplifier throws a NPE
 ---

 Key: PIG-1647
 URL: https://issues.apache.org/jira/browse/PIG-1647
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Yan Zhou
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: PIG-1647.patch


 A query like:
 A = load 'd.txt' as (a:chararray, b:long, c:map[], d:chararray, e:chararray);
 B = filter A by a == 'v' and b == 117L and c#'p1' == 'h' and c#'p2' == 'to' 
 and ((d is not null and d != '') or (e is not null and e != ''));
 will cause the logical expression simplifier to throw an NPE.




[jira] Updated: (PIG-1645) Using both small split combination and temporary file compression on a query of ORDER BY may cause crash

2010-09-24 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1645:
--

Status: Resolved  (was: Patch Available)
Resolution: Fixed

Patch committed to both trunk and the 0.8 branch.

 Using both small split combination and temporary file compression on a query 
 of ORDER BY may cause crash
 

 Key: PIG-1645
 URL: https://issues.apache.org/jira/browse/PIG-1645
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Yan Zhou
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: PIG-1645.patch


 The stack looks like the following:
 java.lang.NullPointerException
   at java.util.Arrays.binarySearch(Arrays.java:2043)
   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.getPartition(WeightedRangePartitioner.java:72)
   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.getPartition(WeightedRangePartitioner.java:52)
   at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:565)
   at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:116)
   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:238)
   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231)
   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53)
   at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:638)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:314)
   at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:396)
   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1062)
   at org.apache.hadoop.mapred.Child.main(Child.java:211)




[jira] Updated: (PIG-1518) multi file input format for loaders

2010-09-23 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1518:
--

Release Note: 
Feature: combine splits of sizes smaller than the value of the property 
pig.maxCombinedSplitSize or, if pig.maxCombinedSplitSize is not set, the 
file system default block size of the load's location. This feature can be 
turned off by setting the property pig.splitCombination to false. When such 
a combination is performed, a message like "Total input paths (combined) to 
process : 7" will be logged. 

This feature applies when a user input, or an intermediate input, has 
many small files to be loaded that would otherwise cause many more under-fed 
mappers to be launched and potentially slow down the execution.

This change will not cause any backward compatibility issue unless a loader 
implementation makes use of the PigSplit object passed through the 
prepareToRead method, in which case a rebuild of the loader might be 
necessary because PigSplit's definition has been modified. However, we 
currently know of no external use of the object.

This change also requires the loader to be stateless across invocations of 
the prepareToRead method. That is, the method should reset any internal state 
that is not affected by the RecordReader argument.
Otherwise, this feature should be disabled.

In addition, if a loader implements IndexableLoadFunc, or implements 
OrderedLoadFunc and CollectableLoadFunc, its input splits won't be subject to 
possible combinations.
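
As a rough illustration of the size rule in the release note, a greedy combination could look like the sketch below. The release note does not describe how Pig actually groups splits, so the greedy packing, the function name combine_splits, and the parameter max_combined_size (standing in for pig.maxCombinedSplitSize) are all assumptions.

```python
def combine_splits(split_sizes, max_combined_size):
    """Greedily pack splits smaller than max_combined_size into combined groups.

    Returns a list of groups; each group is a list of the original split sizes.
    This is a hypothetical sketch, not Pig's implementation.
    """
    combined, current, current_size = [], [], 0
    for size in split_sizes:
        if size >= max_combined_size:
            combined.append([size])  # large splits are left on their own
            continue
        if current and current_size + size > max_combined_size:
            combined.append(current)  # close the group before it overflows
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        combined.append(current)
    return combined

# Five input splits become three, so three mappers are launched instead of five.
print(combine_splits([10, 20, 100, 30, 50], max_combined_size=64))
# [[100], [10, 20, 30], [50]]
```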

  was:
Feature: combine splits of sizes smaller than the value of the property 
pig.maxCombinedSplitSize or, if pig.maxCombinedSplitSize is not set, the 
file system default block size of the load's location. This feature can be 
turned off by setting the property pig.noSplitCombination to true. When such 
a combination is performed, a message like "Total input paths (combined) to 
process : 7" will be logged. 

This feature applies when a user input, or an intermediate input, has 
many small files to be loaded that would otherwise cause many more under-fed 
mappers to be launched and potentially slow down the execution.

This change will not cause any backward compatibility issue unless a loader 
implementation makes use of the PigSplit object passed through the 
prepareToRead method, in which case a rebuild of the loader might be 
necessary because PigSplit's definition has been modified. However, we 
currently know of no external use of the object.

In addition, if a loader implements IndexableLoadFunc, or implements 
OrderedLoadFunc and CollectableLoadFunc, its input splits won't be subject to 
possible combinations.


 multi file input format for loaders
 ---

 Key: PIG-1518
 URL: https://issues.apache.org/jira/browse/PIG-1518
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: PIG-1518-0.7.0.patch, PIG-1518.patch, PIG-1518.patch, 
 PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, 
 PIG-1518.patch, PIG-1518.patch


 We frequently run into the situation where Pig needs to deal with small files 
 in the input. In this case a separate map is created for each file, which 
 could be very inefficient. 
 It would be great to have an umbrella input format that can take multiple 
 files and use them in a single split. We would like to see this working with 
 different data formats if possible.
 There are already a couple of input formats doing a similar thing: 
 MultifileInputFormat as well as CombinedInputFormat; however, neither works 
 with the new Hadoop 20 API. 
 We at least want to do a feasibility study for Pig 0.8.0.




[jira] Commented: (PIG-1645) Using both small split combination and temporary file compression on a query of ORDER BY may cause crash

2010-09-23 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914128#action_12914128
 ] 

Yan Zhou commented on PIG-1645:
---

The problem is that both RandomSampleLoader and PoissonSampleLoader keep 
internal state from previous invocations. When split combination (PIG-1518) 
is on, that state should be reset whenever a different underlying split is 
worked on under the same umbrella split.

When temporary file compression is disabled, Pig internal storage will create 
empty files, which are discarded by the split combiner. The one non-empty 
split is then the only split to be worked on, so this case is OK.
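
To make the failure mode concrete, here is a minimal sketch of a loader that samples the first n records of each split. It is not the code of RandomSampleLoader or PoissonSampleLoader; the class SamplingLoader and its methods are hypothetical and only mimic the shape of a prepareToRead-style call that must reset per-split state.

```python
class SamplingLoader:
    """Hypothetical loader that keeps the first `n` records of every split."""

    def __init__(self, n):
        self.n = n
        self.taken = 0           # per-split state
        self.records = iter(())

    def prepare_to_read(self, records):
        # Reset per-split state. Without this line the count from the previous
        # component split carries over and later splits yield no samples.
        self.taken = 0
        self.records = iter(records)

    def samples(self):
        for record in self.records:
            if self.taken >= self.n:
                break
            self.taken += 1
            yield record

def sample_combined(component_splits, n):
    """Run one loader instance over all component splits of a combined split."""
    loader = SamplingLoader(n)
    out = []
    for split in component_splits:
        loader.prepare_to_read(split)
        out.extend(loader.samples())
    return out

print(sample_combined([["a", "b", "c"], ["d", "e", "f"]], n=2))
# ['a', 'b', 'd', 'e']
```

If the reset in prepare_to_read were removed, the second split in this example would contribute nothing, which is the kind of missing sample data that can surface later as a failure in the partitioner.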

 Using both small split combination and temporary file compression on a query 
 of ORDER BY may cause crash
 

 Key: PIG-1645
 URL: https://issues.apache.org/jira/browse/PIG-1645
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Yan Zhou
Assignee: Yan Zhou
 Fix For: 0.8.0


 The stack looks like the following:
 java.lang.NullPointerException
   at java.util.Arrays.binarySearch(Arrays.java:2043)
   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.getPartition(WeightedRangePartitioner.java:72)
   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.getPartition(WeightedRangePartitioner.java:52)
   at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:565)
   at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:116)
   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:238)
   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231)
   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53)
   at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:638)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:314)
   at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:396)
   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1062)
   at org.apache.hadoop.mapred.Child.main(Child.java:211)




[jira] Commented: (PIG-1635) Logical simplifier does not simplify away constants under AND and OR; after simplification the ordering of operands of AND and OR may get changed

2010-09-23 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914145#action_12914145
 ] 

Yan Zhou commented on PIG-1635:
---

test-patch results:

 [exec] +1 overall.
 [exec]
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec]
 [exec] +1 tests included.  The patch appears to include 3 new or modified tests.
 [exec]
 [exec] +1 javadoc.  The javadoc tool did not generate any warning messages.
 [exec]
 [exec] +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
 [exec]
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs warnings.
 [exec]
 [exec] +1 release audit.  The applied patch does not increase the total number of release audit warnings.

 Logical simplifier does not simplify away constants under AND and OR; after 
 simplification the ordering of operands of AND and OR may get changed
 

 Key: PIG-1635
 URL: https://issues.apache.org/jira/browse/PIG-1635
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.8.0
Reporter: Yan Zhou
Assignee: Yan Zhou
Priority: Minor
 Fix For: 0.8.0

 Attachments: PIG-1635.patch


 b = FILTER a by (( f1 > 1) AND (1 == 1))
 or 
 b = FILTER a by ((f1 > 1) OR ( 1==0))
 should be simplified to
 b = FILTER a by f1 > 1;
 Regarding the ordering change, an example is 
 b = filter a by ((f1 is not null) AND (f2 is not null));
 Even without possible simplification, the expression is changed to
 b = filter a by ((f2 is not null) AND (f1 is not null));
 The ordering change in this case, and probably in most other cases, makes 
 no difference. Still, some users might care about the ordering for two 
 reasons: stateful UDFs may be used as operands of AND or OR; and the 
 ordering may be intended by the application designer to maximize the 
 chances of short-circuiting the composite boolean evaluation. 




[jira] Commented: (PIG-1635) Logical simplifier does not simplify away constants under AND and OR; after simplification the ordering of operands of AND and OR may get changed

2010-09-23 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914150#action_12914150
 ] 

Yan Zhou commented on PIG-1635:
---

All test-core tests also run clean.

 Logical simplifier does not simplify away constants under AND and OR; after 
 simplification the ordering of operands of AND and OR may get changed
 

 Key: PIG-1635
 URL: https://issues.apache.org/jira/browse/PIG-1635
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.8.0
Reporter: Yan Zhou
Assignee: Yan Zhou
Priority: Minor
 Fix For: 0.8.0

 Attachments: PIG-1635.patch


 b = FILTER a by (( f1 > 1) AND (1 == 1))
 or 
 b = FILTER a by ((f1 > 1) OR ( 1==0))
 should be simplified to
 b = FILTER a by f1 > 1;
 Regarding the ordering change, an example is 
 b = filter a by ((f1 is not null) AND (f2 is not null));
 Even without possible simplification, the expression is changed to
 b = filter a by ((f2 is not null) AND (f1 is not null));
 The ordering change in this case, and probably in most other cases, makes 
 no difference. Still, some users might care about the ordering for two 
 reasons: stateful UDFs may be used as operands of AND or OR; and the 
 ordering may be intended by the application designer to maximize the 
 chances of short-circuiting the composite boolean evaluation. 




[jira] Created: (PIG-1645) Using both small split combination and temporary file compression on a query of ORDER BY may cause crash

2010-09-22 Thread Yan Zhou (JIRA)
Using both small split combination and temporary file compression on a query of 
ORDER BY may cause crash


 Key: PIG-1645
 URL: https://issues.apache.org/jira/browse/PIG-1645
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Yan Zhou
Assignee: Yan Zhou
 Fix For: 0.8.0


The stack looks like the following:

java.lang.NullPointerException
  at java.util.Arrays.binarySearch(Arrays.java:2043)
  at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.getPartition(WeightedRangePartitioner.java:72)
  at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.getPartition(WeightedRangePartitioner.java:52)
  at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:565)
  at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
  at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:116)
  at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:238)
  at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231)
  at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53)
  at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
  at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:638)
  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:314)
  at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:396)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1062)
  at org.apache.hadoop.mapred.Child.main(Child.java:211)




[jira] Commented: (PIG-1628) log this message at debug level : 'Pig Internal storage in use'

2010-09-21 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913029#action_12913029
 ] 

Yan Zhou commented on PIG-1628:
---

+1. Patch looks good.

 log this message at debug level : 'Pig Internal storage in use'
 ---

 Key: PIG-1628
 URL: https://issues.apache.org/jira/browse/PIG-1628
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Fix For: 0.8.0

 Attachments: PIG-1628.1.patch


 The temporary storage functions used are logging at the INFO level. This 
 should change to the debug level, because these messages reduce the 
 visibility of more useful INFO messages. The messages include 'Pig Internal 
 storage in use' from InterStorage and 'TFile storage in use' from TFileStorage.
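
The effect of the proposed change can be shown with a small sketch. It uses Python's standard logging module rather than the logging API Pig itself uses, and the logger name storage_demo, the helper messages_seen, and the "job submitted" message are all made up for illustration.

```python
import logging

class ListHandler(logging.Handler):
    """Collects messages in a list so the effect of the level is visible."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())

def messages_seen(level):
    """Return the messages that get through when the logger runs at `level`."""
    log = logging.getLogger("storage_demo")  # hypothetical logger name
    log.propagate = False                    # keep the demo out of the root logger
    handler = ListHandler()
    log.addHandler(handler)
    log.setLevel(level)
    try:
        log.debug("Pig Internal storage in use")  # bookkeeping, demoted to DEBUG
        log.info("job submitted")                 # hypothetical, more useful INFO message
    finally:
        log.removeHandler(handler)
    return handler.messages

print(messages_seen(logging.INFO))   # ['job submitted']
print(messages_seen(logging.DEBUG))  # ['Pig Internal storage in use', 'job submitted']
```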




[jira] Created: (PIG-1635) Logical simplifier does not simplify away constants under AND and OR; after simplification the ordering of operands of AND and OR may get changed

2010-09-21 Thread Yan Zhou (JIRA)
Logical simplifier does not simplify away constants under AND and OR; after 
simplification the ordering of operands of AND and OR may get changed


 Key: PIG-1635
 URL: https://issues.apache.org/jira/browse/PIG-1635
 Project: Pig
  Issue Type: Bug
  Components: impl
Reporter: Yan Zhou
Assignee: Yan Zhou
Priority: Minor


b = FILTER a by (( f1 > 1) AND (1 == 1))

or 

b = FILTER a by ((f1 > 1) OR ( 1==0))

should be simplified to

b = FILTER a by f1 > 1;

Regarding the ordering change, an example is 

b = filter a by ((f1 is not null) AND (f2 is not null));

Even without possible simplification, the expression is changed to

b = filter a by ((f2 is not null) AND (f1 is not null));

The ordering change in this case, and probably in most other cases, makes no 
difference. Still, some users might care about the ordering for two reasons: 
stateful UDFs may be used as operands of AND or OR; and the ordering may be 
intended by the application designer to maximize the chances of 
short-circuiting the composite boolean evaluation. 
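
The behavior requested here can be sketched on a toy expression tree. This is not the Pig logical expression simplifier: the classes Const, Pred, And and Or and the function simplify are hypothetical, and constant subexpressions such as (1 == 1) are assumed to have been folded to Const(True) beforehand.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Const:
    value: bool

@dataclass(frozen=True)
class Pred:
    text: str  # opaque predicate such as "f1 > 1"

@dataclass(frozen=True)
class And:
    left: object
    right: object

@dataclass(frozen=True)
class Or:
    left: object
    right: object

def simplify(expr):
    """Fold boolean constants under AND/OR while keeping the operand order."""
    if isinstance(expr, (And, Or)):
        left, right = simplify(expr.left), simplify(expr.right)
        identity = isinstance(expr, And)  # True is the identity of AND, False of OR
        for kept, other in ((right, left), (left, right)):
            if isinstance(other, Const):
                # Identity constant: drop it. Absorbing constant: it is the result.
                return kept if other.value == identity else other
        return type(expr)(left, right)    # left stays left, right stays right
    return expr

f1 = Pred("f1 > 1")
print(simplify(And(f1, Const(True))) == f1)   # True
print(simplify(Or(f1, Const(False))) == f1)   # True

a, b = Pred("f1 is not null"), Pred("f2 is not null")
print(simplify(And(a, b)) == And(a, b))       # True: operands are not swapped
```

Rebuilding the node as type(expr)(left, right) is what keeps the original order, which matters for the stateful-UDF and short-circuit cases mentioned above.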




[jira] Commented: (PIG-1635) Logical simplifier does not simplify away constants under AND and OR; after simplification the ordering of operands of AND and OR may get changed

2010-09-21 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913036#action_12913036
 ] 

Yan Zhou commented on PIG-1635:
---

This is regarding a new feature (PIG-1399) added for 0.8.

 Logical simplifier does not simplify away constants under AND and OR; after 
 simplification the ordering of operands of AND and OR may get changed
 

 Key: PIG-1635
 URL: https://issues.apache.org/jira/browse/PIG-1635
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.8.0
Reporter: Yan Zhou
Assignee: Yan Zhou
Priority: Minor

 b = FILTER a by (( f1 > 1) AND (1 == 1))
 or 
 b = FILTER a by ((f1 > 1) OR ( 1==0))
 should be simplified to
 b = FILTER a by f1 > 1;
 Regarding the ordering change, an example is 
 b = filter a by ((f1 is not null) AND (f2 is not null));
 Even without possible simplification, the expression is changed to
 b = filter a by ((f2 is not null) AND (f1 is not null));
 The ordering change in this case, and probably in most other cases, makes 
 no difference. Still, some users might care about the ordering for two 
 reasons: stateful UDFs may be used as operands of AND or OR; and the 
 ordering may be intended by the application designer to maximize the 
 chances of short-circuiting the composite boolean evaluation. 




[jira] Updated: (PIG-1635) Logical simplifier does not simplify away constants under AND and OR; after simplification the ordering of operands of AND and OR may get changed

2010-09-21 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1635:
--

Affects Version/s: 0.8.0

 Logical simplifier does not simplify away constants under AND and OR; after 
 simplification the ordering of operands of AND and OR may get changed
 

 Key: PIG-1635
 URL: https://issues.apache.org/jira/browse/PIG-1635
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.8.0
Reporter: Yan Zhou
Assignee: Yan Zhou
Priority: Minor

 b = FILTER a by (( f1 > 1) AND (1 == 1))
 or 
 b = FILTER a by ((f1 > 1) OR ( 1==0))
 should be simplified to
 b = FILTER a by f1 > 1;
 Regarding the ordering change, an example is 
 b = filter a by ((f1 is not null) AND (f2 is not null));
 Even without possible simplification, the expression is changed to
 b = filter a by ((f2 is not null) AND (f1 is not null));
 The ordering change in this case, and probably in most other cases, makes 
 no difference. Still, some users might care about the ordering for two 
 reasons: stateful UDFs may be used as operands of AND or OR; and the 
 ordering may be intended by the application designer to maximize the 
 chances of short-circuiting the composite boolean evaluation. 




[jira] Updated: (PIG-1635) Logical simplifier does not simplify away constants under AND and OR; after simplification the ordering of operands of AND and OR may get changed

2010-09-21 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1635:
--

Attachment: PIG-1635.patch

 Logical simplifier does not simplify away constants under AND and OR; after 
 simplification the ordering of operands of AND and OR may get changed
 

 Key: PIG-1635
 URL: https://issues.apache.org/jira/browse/PIG-1635
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.8.0
Reporter: Yan Zhou
Assignee: Yan Zhou
Priority: Minor
 Fix For: 0.8.0

 Attachments: PIG-1635.patch


 b = FILTER a by (( f1 > 1) AND (1 == 1))
 or 
 b = FILTER a by ((f1 > 1) OR ( 1==0))
 should be simplified to
 b = FILTER a by f1 > 1;
 Regarding the ordering change, an example is 
 b = filter a by ((f1 is not null) AND (f2 is not null));
 Even without possible simplification, the expression is changed to
 b = filter a by ((f2 is not null) AND (f1 is not null));
 The ordering change in this case, and probably in most other cases, makes 
 no difference. Still, some users might care about the ordering for two 
 reasons: stateful UDFs may be used as operands of AND or OR; and the 
 ordering may be intended by the application designer to maximize the 
 chances of short-circuiting the composite boolean evaluation. 




[jira] Updated: (PIG-1635) Logical simplifier does not simplify away constants under AND and OR; after simplification the ordering of operands of AND and OR may get changed

2010-09-21 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1635:
--

Status: Patch Available  (was: Open)

 Logical simplifier does not simplify away constants under AND and OR; after 
 simplification the ordering of operands of AND and OR may get changed
 

 Key: PIG-1635
 URL: https://issues.apache.org/jira/browse/PIG-1635
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.8.0
Reporter: Yan Zhou
Assignee: Yan Zhou
Priority: Minor
 Fix For: 0.8.0

 Attachments: PIG-1635.patch


 b = FILTER a by (( f1 > 1) AND (1 == 1))
 or 
 b = FILTER a by ((f1 > 1) OR ( 1==0))
 should be simplified to
 b = FILTER a by f1 > 1;
 Regarding the ordering change, an example is 
 b = filter a by ((f1 is not null) AND (f2 is not null));
 Even without possible simplification, the expression is changed to
 b = filter a by ((f2 is not null) AND (f1 is not null));
 The ordering change in this case, and probably in most other cases, makes 
 no difference. Still, some users might care about the ordering for two 
 reasons: stateful UDFs may be used as operands of AND or OR; and the 
 ordering may be intended by the application designer to maximize the 
 chances of short-circuiting the composite boolean evaluation. 




[jira] Commented: (PIG-366) PigPen - Eclipse plugin for a graphical PigLatin editor

2010-09-14 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12909330#action_12909330
 ] 

Yan Zhou commented on PIG-366:
--

Robert,

Could you write up step-by-step instructions on how to use this jar as an 
Eclipse plug-in? Thanks.

 PigPen - Eclipse plugin for a graphical PigLatin editor
 ---

 Key: PIG-366
 URL: https://issues.apache.org/jira/browse/PIG-366
 Project: Pig
  Issue Type: New Feature
Reporter: Shubham Chopra
Assignee: Robert Gibbon
Priority: Minor
 Attachments: org.apache.pig.pigpen-0.7.0.tar.gz, 
 org.apache.pig.pigpen-0.7.2.tar.gz, org.apache.pig.pigpen_0.0.1.jar, 
 org.apache.pig.pigpen_0.0.1.tgz, org.apache.pig.pigpen_0.0.4.jar, 
 org.apache.pig.pigpen_0.7.2.jar, pigpen.patch, pigPen.patch, PigPen.tgz


 This is an Eclipse plugin that provides a GUI that can help users create 
 PigLatin scripts and see the example generator outputs on the fly and submit 
 the jobs to hadoop clusters.




[jira] Commented: (PIG-366) PigPen - Eclipse plugin for a graphical PigLatin editor

2010-09-13 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12908926#action_12908926
 ] 

Yan Zhou commented on PIG-366:
--

Robert, first, thanks for your effort to pick up this feature.

You mentioned in your 09/08 comment that you stripped back a lot of 
functionality and focused on the script editor. I'm wondering if it is 
possible to add your fixes/improvements on top of Shubham's patch. 
Specifically, I'm interested in the example generator use in PigPen, which 
seems to be absent from your patches. FYI, I'm currently working on improving 
and enhancing the example generator left over by Shubham about 2 years ago.

 PigPen - Eclipse plugin for a graphical PigLatin editor
 ---

 Key: PIG-366
 URL: https://issues.apache.org/jira/browse/PIG-366
 Project: Pig
  Issue Type: New Feature
Reporter: Shubham Chopra
Assignee: Robert Gibbon
Priority: Minor
 Attachments: org.apache.pig.pigpen-0.7.0.tar.gz, 
 org.apache.pig.pigpen-0.7.2.tar.gz, org.apache.pig.pigpen_0.0.1.jar, 
 org.apache.pig.pigpen_0.0.1.tgz, org.apache.pig.pigpen_0.0.4.jar, 
 org.apache.pig.pigpen_0.7.2.jar, pigpen.patch, pigPen.patch, PigPen.tgz


 This is an Eclipse plugin that provides a GUI that can help users create 
 PigLatin scripts and see the example generator outputs on the fly and submit 
 the jobs to hadoop clusters.




[jira] Commented: (PIG-366) PigPen - Eclipse plugin for a graphical PigLatin editor

2010-09-13 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12908962#action_12908962
 ] 

Yan Zhou commented on PIG-366:
--

Yes. But the original patch by Shubham had hooked the plugin to the example 
generator interface, unless you have found something funky in that patch. I 
have no intention of changing the interface.

 PigPen - Eclipse plugin for a graphical PigLatin editor
 ---

 Key: PIG-366
 URL: https://issues.apache.org/jira/browse/PIG-366
 Project: Pig
  Issue Type: New Feature
Reporter: Shubham Chopra
Assignee: Robert Gibbon
Priority: Minor
 Attachments: org.apache.pig.pigpen-0.7.0.tar.gz, 
 org.apache.pig.pigpen-0.7.2.tar.gz, org.apache.pig.pigpen_0.0.1.jar, 
 org.apache.pig.pigpen_0.0.1.tgz, org.apache.pig.pigpen_0.0.4.jar, 
 org.apache.pig.pigpen_0.7.2.jar, pigpen.patch, pigPen.patch, PigPen.tgz


 This is an Eclipse plugin that provides a GUI that can help users create 
 PigLatin scripts and see the example generator outputs on the fly and submit 
 the jobs to hadoop clusters.




[jira] Commented: (PIG-366) PigPen - Eclipse plugin for a graphical PigLatin editor

2010-09-13 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12908971#action_12908971
 ] 

Yan Zhou commented on PIG-366:
--

One more clarification: by design, the example generator does not submit any 
jobs to Hadoop; it just runs on the client as a local application.

 PigPen - Eclipse plugin for a graphical PigLatin editor
 ---

 Key: PIG-366
 URL: https://issues.apache.org/jira/browse/PIG-366
 Project: Pig
  Issue Type: New Feature
Reporter: Shubham Chopra
Assignee: Robert Gibbon
Priority: Minor
 Attachments: org.apache.pig.pigpen-0.7.0.tar.gz, 
 org.apache.pig.pigpen-0.7.2.tar.gz, org.apache.pig.pigpen_0.0.1.jar, 
 org.apache.pig.pigpen_0.0.1.tgz, org.apache.pig.pigpen_0.0.4.jar, 
 org.apache.pig.pigpen_0.7.2.jar, pigpen.patch, pigPen.patch, PigPen.tgz


 This is an Eclipse plugin that provides a GUI that can help users create 
 PigLatin scripts and see the example generator outputs on the fly and submit 
 the jobs to hadoop clusters.




[jira] Resolved: (PIG-239) illustrate followed by dump gives a runtime exception

2010-09-13 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou resolved PIG-239.
--

Fix Version/s: 0.8.0
   (was: 0.9.0)
   Resolution: Cannot Reproduce

Cannot reproduce using 0.8.

 illustrate followed by dump gives a runtime exception
 -

 Key: PIG-239
 URL: https://issues.apache.org/jira/browse/PIG-239
 Project: Pig
  Issue Type: Bug
  Components: impl
Reporter: Pradeep Kamath
Assignee: Yan Zhou
 Fix For: 0.8.0


 Here is a session which outlines the issue:
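 grunt> a = load '/user/pig/tests/data/singlefile/studenttab10k' as (name, age, gpa);
 grunt> b = filter a by name lt 'b';
 grunt> c = foreach b generate TOKENIZE(name);
 grunt> illustrate c;
 -------------------------------------
 | a | name          | age   | gpa   |
 -------------------------------------
 |   | tom xylophone | 69    | 0.04  |
 |   | alice ovid    | 75    | 3.89  |
 -------------------------------------
 ----------------------------------
 | b | name       | age   | gpa   |
 ----------------------------------
 |   | alice ovid | 75    | 3.89  |
 ----------------------------------
 -------------------------
 | c | (token )          |
 -------------------------
 |   | {(alice), (ovid)} |
 -------------------------
 grunt> dump c;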
 2008-05-15 14:35:54,476 [main] ERROR org.apache.pig.tools.grunt.GruntParser - java.lang.RuntimeException: java.io.IOException: Serialization error: org.apache.pig.impl.util.LineageTracer
     at org.apache.pig.backend.hadoop.executionengine.POMapreduce.copy(POMapreduce.java:242)
     at org.apache.pig.backend.hadoop.executionengine.MapreducePlanCompiler.compile(MapreducePlanCompiler.java:115)
     at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:232)
     at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:209)
     at org.apache.pig.PigServer.optimizeAndRunQuery(PigServer.java:410)
     at org.apache.pig.PigServer.openIterator(PigServer.java:332)
     at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:265)
     at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:162)
     at org.apache.pig.tools.grunt.GruntParser.parseContOnError(GruntParser.java:73)
     at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:54)
     at org.apache.pig.Main.main(Main.java:270)
 Caused by: java.io.IOException: Serialization error: org.apache.pig.impl.util.LineageTracer
     at org.apache.pig.impl.util.WrappedIOException.wrap(WrappedIOException.java:16)
     at org.apache.pig.impl.util.ObjectSerializer.serialize(ObjectSerializer.java:44)
     at org.apache.pig.backend.hadoop.executionengine.POMapreduce.copy(POMapreduce.java:233)
     ... 10 more
 Caused by: java.io.NotSerializableException: org.apache.pig.impl.util.LineageTracer
     at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1081)
     at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1375)
     at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1347)
     at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1290)
     at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1079)
     at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:302)
     at java.util.ArrayList.writeObject(ArrayList.java:569)
     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
     at java.lang.reflect.Method.invoke(Method.java:585)
     at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:917)
     at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1339)
     at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1290)
     at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1079)
     at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1375)
     at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1347)
     at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1290)
     at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1079)
     at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:302)
     at java.util.ArrayList.writeObject(ArrayList.java:569)
     at 

[jira] Updated: (PIG-1501) need to investigate the impact of compression on pig performance

2010-08-31 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1501:
--

Release Note: 
This feature saves the HDFS space used to store Pig's intermediate data and 
can potentially improve query execution speed. In general, the more 
intermediate data a query generates, the greater the storage and speed benefits.

There are no backward compatibility issues as a result of this feature.

Two Java properties control the behavior:

pig.tmpfilecompression, which defaults to false, determines whether the 
temporary files are compressed. If it is true, then

pig.tmpfilecompression.codec specifies which compression codec to use. 
Currently, Pig accepts only gz and lzo as possible values. Since LZO is 
under the GPL license, Hadoop may need to be configured to use the LZO codec. 
Please refer to http://code.google.com/p/hadoop-gpl-compression/wiki/FAQ for details.


An example is the following test.pig script:

register pigperf.jar;
A = load '/user/pig/tests/data/pigmix/page_views' using 
org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
as (user, action, timespent:long, query_term, ip_addr, timestamp, 
estimated_revenue, page_info, page_links);
B1 = filter A by timespent == 4;
B = load '/user/pig/tests/data/pigmix/queryterm' as (query_term);
C = join B1 by query_term, B by query_term using 'skewed' parallel 300;
D = distinct C parallel 300;
store D into 'output.lzo';

which is launched as follows:

java -cp /grid/0/gs/conf/current:/grid/0/jars/pig.jar 
-Djava.library.path=/grid/0/gs/hadoop/current/lib/native/Linux-i386-32 
-Dpig.tmpfilecompression=true -Dpig.tmpfilecompression.codec=lzo 
org.apache.pig.Main ./test.pig
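
As a rough illustration of how the two properties fit together, the following 
Python sketch assembles the -D arguments used in the launch command above. The 
helper name tmpfile_compression_args is hypothetical and not part of Pig; only 
the property names, the default of false, and the accepted codec values (gz 
and lzo) are taken from the release note.

```python
# Hypothetical helper, not part of Pig: builds the -D flags described above.
ACCEPTED_CODECS = ("gz", "lzo")  # the values the release note says Pig accepts


def tmpfile_compression_args(enabled=False, codec=None):
    """Return the JVM -D arguments for temporary-file compression."""
    if not enabled:
        # pig.tmpfilecompression defaults to false, so nothing needs to be passed.
        return []
    if codec not in ACCEPTED_CODECS:
        raise ValueError("codec must be one of %s, got %r" % (ACCEPTED_CODECS, codec))
    return [
        "-Dpig.tmpfilecompression=true",
        "-Dpig.tmpfilecompression.codec=%s" % codec,
    ]


if __name__ == "__main__":
    # Prints the two flags that appear in the launch command above.
    print(" ".join(tmpfile_compression_args(enabled=True, codec="lzo")))
```

The sketch rejects an unknown codec up front; whether Pig itself fails or falls 
back in that case is not stated in the release note.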




 need to investigate the impact of compression on pig performance
 

 Key: PIG-1501
 URL: https://issues.apache.org/jira/browse/PIG-1501
 Project: Pig
  Issue Type: Test
Reporter: Olga Natkovich
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: compress_perf_data.txt, compress_perf_data_2.txt, 
 PIG-1501.patch, PIG-1501.patch, PIG-1501.patch


 We would like to understand how compressing map results as well 

[jira] Updated: (PIG-1399) Logical Optimizer: Expression optimizor rule

2010-08-30 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1399:
--

Attachment: PIG-1399.patch

I used findbugs 1.3.9 and it found the patch clean. The attached findbugs 
results were generated using 1.3.8, which might explain the difference. Anyway, 
I made a minor modification that should fix the warnings reported by 1.3.8.

 Logical Optimizer: Expression optimizor rule
 

 Key: PIG-1399
 URL: https://issues.apache.org/jira/browse/PIG-1399
 Project: Pig
  Issue Type: Sub-task
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: newPatchFindbugsWarnings.html, PIG-1399.patch, 
 PIG-1399.patch, PIG-1399.patch, PIG-1399.patch, PIG-1399.patch, 
 PIG-1399.patch, PIG-1399.patch


 We can optimize expressions in several ways:
 1. Constant pre-calculation
 Example:
 B = filter A by a0 > 5+7;
 => B = filter A by a0 > 12;
 2. Boolean expression optimization
 Example:
 B = filter A by not (not(a0>5) or a>10);
 => B = filter A by a0>5 and a<=10;




[jira] Updated: (PIG-1399) Logical Optimizer: Expression optimizor rule

2010-08-30 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1399:
--

  Status: Patch Available  (was: Open)
Release Note: 
The logical expression simplification covers the following types of simplification:

1) Constant pre-calculation
Example:
B = filter A by a0 > 5+7;

is simplified to

B = filter A by a0 > 12;


2) Elimination of negations
Example:
B = filter A by not (not(a0>5) or a>10);

is simplified to

B = filter A by a0>5 and a<=10;


3) Elimination of logically implied expressions in AND
Example:
B = filter A by (a0 > 5 and a0 > 7);

is simplified to

B = filter A by a0 > 7;


4) Elimination of logically implied expressions in OR
Example:
B = filter A by ((a0 > 5) or (a0 > 6 and a1 > 15));

is simplified to

B = filter A by a0 > 5;


5) Equivalence elimination
Example:
B = filter A by (a0 > 5 and a0 > 5);

is simplified to

B = filter A by a0 > 5;


6) Elimination of complementary expressions in OR
Example:
B = filter A by (a0 > 5 OR a0 <= 5);

is simplified to non-filtering


7) Elimination of naive TRUE expression
Example:

B = filter A by 1==1;

is simplified to non-filtering
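
To make the first two simplification types concrete (constant pre-calculation 
and elimination of negations), here is a small Python sketch over toy 
expression trees. It is only an illustration under stated assumptions: the 
tuple encoding and the function names fold_constants and push_not are invented 
for this example, the code is not taken from the patch, and it ignores null 
handling, which a real optimizer would have to consider.

```python
# Toy expression trees as nested tuples (an assumed encoding, not Pig's classes):
#   ("const", 5), ("col", "a0"), ("+", l, r), (">", l, r),
#   ("not", e), ("and", l, r), ("or", l, r)

# Each comparison operator mapped to its logical complement.
COMPLEMENT = {">": "<=", "<=": ">", "<": ">=", ">=": "<", "==": "!=", "!=": "=="}


def fold_constants(expr):
    """Type 1: replace an addition of two constants by a single constant."""
    op = expr[0]
    if op in ("const", "col"):
        return expr
    args = [fold_constants(a) for a in expr[1:]]
    if op == "+" and all(a[0] == "const" for a in args):
        return ("const", args[0][1] + args[1][1])
    return (op,) + tuple(args)


def push_not(expr, negate=False):
    """Type 2: remove NOT by pushing it down to the comparisons (De Morgan)."""
    op = expr[0]
    if op == "not":
        return push_not(expr[1], not negate)
    if op in ("and", "or"):
        if negate:
            op = "or" if op == "and" else "and"
        return (op, push_not(expr[1], negate), push_not(expr[2], negate))
    if op in COMPLEMENT:
        return (COMPLEMENT[op] if negate else op, expr[1], expr[2])
    return expr


if __name__ == "__main__":
    a0, a = ("col", "a0"), ("col", "a")
    print(fold_constants((">", a0, ("+", ("const", 5), ("const", 7)))))
    print(push_not(("not", ("or", ("not", (">", a0, ("const", 5))),
                            (">", a, ("const", 10))))))
```

Keeping the two passes separate mirrors the way the list above presents them; 
the remaining types (implication, equivalence, complementary expressions) 
would need range reasoning that this sketch does not attempt.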

 Logical Optimizer: Expression optimizor rule
 

 Key: PIG-1399
 URL: https://issues.apache.org/jira/browse/PIG-1399
 Project: Pig
  Issue Type: Sub-task
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: newPatchFindbugsWarnings.html, PIG-1399.patch, 
 PIG-1399.patch, PIG-1399.patch, PIG-1399.patch, PIG-1399.patch, 
 PIG-1399.patch, PIG-1399.patch


 We can optimize expressions in several ways:
 1. Constant pre-calculation
 Example:
 B = filter A by a0 > 5+7;
 => B = filter A by a0 > 12;
 2. Boolean expression optimization
 Example:
 B = filter A by not (not(a0>5) or a>10);
 => B = filter A by a0>5 and a<=10;




[jira] Updated: (PIG-1399) Logical Optimizer: Expression optimizor rule

2010-08-27 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1399:
--

Attachment: PIG-1399.patch

Addressed the review comments, except that the simplification was not split 
into several optimization rules, since the order in which the rules are 
applied is significant.

 Logical Optimizer: Expression optimizor rule
 

 Key: PIG-1399
 URL: https://issues.apache.org/jira/browse/PIG-1399
 Project: Pig
  Issue Type: Sub-task
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: PIG-1399.patch, PIG-1399.patch, PIG-1399.patch, 
 PIG-1399.patch, PIG-1399.patch


 We can optimize expressions in several ways:
 1. Constant pre-calculation
 Example:
 B = filter A by a0 > 5+7;
 => B = filter A by a0 > 12;
 2. Boolean expression optimization
 Example:
 B = filter A by not (not(a0>5) or a>10);
 => B = filter A by a0>5 and a<=10;




[jira] Updated: (PIG-1399) Logical Optimizer: Expression optimizor rule

2010-08-27 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1399:
--

Attachment: PIG-1399.patch

rebased on the latest trunk.

 Logical Optimizer: Expression optimizor rule
 

 Key: PIG-1399
 URL: https://issues.apache.org/jira/browse/PIG-1399
 Project: Pig
  Issue Type: Sub-task
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: PIG-1399.patch, PIG-1399.patch, PIG-1399.patch, 
 PIG-1399.patch, PIG-1399.patch, PIG-1399.patch


 We can optimize expressions in several ways:
 1. Constant pre-calculation
 Example:
 B = filter A by a0 > 5+7;
 => B = filter A by a0 > 12;
 2. Boolean expression optimization
 Example:
 B = filter A by not (not(a0>5) or a>10);
 => B = filter A by a0>5 and a<=10;




[jira] Updated: (PIG-1501) need to investigate the impact of compression on pig performance

2010-08-26 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1501:
--

Status: Patch Available  (was: Open)

This feature saves the HDFS space used to store Pig's intermediate data and 
can potentially improve query execution speed. In general, the more 
intermediate data a query generates, the greater the storage and speed benefits.

There are no backward compatibility issues as a result of this feature.

An example is the following test.pig script:

register pigperf.jar;
A = load '/user/pig/tests/data/pigmix/page_views' using 
org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
as (user, action, timespent:long, query_term, ip_addr, timestamp, 
estimated_revenue, page_info, page_links);
B1 = filter A by timespent == 4;
B = load '/user/pig/tests/data/pigmix/queryterm' as (query_term);
C = join B1 by query_term, B by query_term using 'skewed' parallel 300;
D = distinct C parallel 300;
store D into 'output.lzo';

which is launched as follows:

java -cp /grid/0/gs/conf/current:/grid/0/jars/pig.jar 
-Djava.library.path=/grid/0/gs/hadoop/current/lib/native/Linux-i386-32 
-Dpig.tmpfilecompression=true -Dpig.tmpfilecompression.codec=lzo 
org.apache.pig.Main ./test.pig

 need to investigate the impact of compression on pig performance
 

 Key: PIG-1501
 URL: https://issues.apache.org/jira/browse/PIG-1501
 Project: Pig
  Issue Type: Test
Reporter: Olga Natkovich
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: compress_perf_data.txt, compress_perf_data_2.txt, 
 PIG-1501.patch, PIG-1501.patch, PIG-1501.patch


 We would like to understand how compressing map results as well as 
 reducer output in a chain of MR jobs impacts performance. We can use PigMix 
 queries for this investigation.




[jira] Updated: (PIG-1518) multi file input format for loaders

2010-08-26 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1518:
--

Attachment: PIG-1518.patch

 multi file input format for loaders
 ---

 Key: PIG-1518
 URL: https://issues.apache.org/jira/browse/PIG-1518
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, 
 PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch


 We frequently run into situations where Pig needs to deal with small files 
 in the input. In this case a separate map is created for each file, which 
 can be very inefficient. 
 It would be great to have an umbrella input format that can take multiple 
 files and use them in a single split. We would like to see this working with 
 different data formats if possible.
 There are already a couple of input formats doing a similar thing: 
 MultifileInputFormat as well as CombinedInputFormat; however, neither works 
 with the new Hadoop 20 API. 
 We at least want to do a feasibility study for Pig 0.8.0.




[jira] Updated: (PIG-1518) multi file input format for loaders

2010-08-26 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1518:
--

Attachment: PIG-1518.patch

rebased on the latest trunk

 multi file input format for loaders
 ---

 Key: PIG-1518
 URL: https://issues.apache.org/jira/browse/PIG-1518
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, 
 PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch


 We frequently run into situations where Pig needs to deal with small files 
 in the input. In this case a separate map is created for each file, which 
 can be very inefficient. 
 It would be great to have an umbrella input format that can take multiple 
 files and use them in a single split. We would like to see this working with 
 different data formats if possible.
 There are already a couple of input formats doing a similar thing: 
 MultifileInputFormat as well as CombinedInputFormat; however, neither works 
 with the new Hadoop 20 API. 
 We at least want to do a feasibility study for Pig 0.8.0.




[jira] Commented: (PIG-1518) multi file input format for loaders

2010-08-26 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12903102#action_12903102
 ] 

Yan Zhou commented on PIG-1518:
---

It is not combinable if the loader is a CollectableLoadFunc AND an 
OrderedLoadFunc. Since PigStorage is a CollectableLoadFunc but not an 
OrderedLoadFunc, it is combinable.
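
The rule in this comment can be written down as a small predicate. The Python 
sketch below uses stand-in marker classes in place of the Java interfaces; the 
names PigStorageLike, SortedLoaderLike and is_combinable are hypothetical and 
only model the condition stated above, not Pig's actual code.

```python
# Stand-ins for the two Java interfaces a Pig loader may implement.
class CollectableLoadFunc(object):
    pass


class OrderedLoadFunc(object):
    pass


class PigStorageLike(CollectableLoadFunc):
    """Models a loader that is collectable but not ordered."""


class SortedLoaderLike(CollectableLoadFunc, OrderedLoadFunc):
    """Models a hypothetical loader that implements both interfaces."""


def is_combinable(loader):
    """Splits are not combinable only when the loader is both kinds at once."""
    both = isinstance(loader, CollectableLoadFunc) and isinstance(loader, OrderedLoadFunc)
    return not both


if __name__ == "__main__":
    print(is_combinable(PigStorageLike()))    # collectable only
    print(is_combinable(SortedLoaderLike()))  # collectable and ordered
```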

 multi file input format for loaders
 ---

 Key: PIG-1518
 URL: https://issues.apache.org/jira/browse/PIG-1518
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, 
 PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch


 We frequently run into situations where Pig needs to deal with small files 
 in the input. In this case a separate map is created for each file, which 
 can be very inefficient. 
 It would be great to have an umbrella input format that can take multiple 
 files and use them in a single split. We would like to see this working with 
 different data formats if possible.
 There are already a couple of input formats doing a similar thing: 
 MultifileInputFormat as well as CombinedInputFormat; however, neither works 
 with the new Hadoop 20 API. 
 We at least want to do a feasibility study for Pig 0.8.0.




[jira] Updated: (PIG-1501) need to investigate the impact of compression on pig performance

2010-08-25 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1501:
--

Attachment: PIG-1501.patch

Addressed the review comments and rebased the code on the latest trunk.

 need to investigate the impact of compression on pig performance
 

 Key: PIG-1501
 URL: https://issues.apache.org/jira/browse/PIG-1501
 Project: Pig
  Issue Type: Test
Reporter: Olga Natkovich
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: compress_perf_data.txt, compress_perf_data_2.txt, 
 PIG-1501.patch, PIG-1501.patch, PIG-1501.patch


 We would like to understand how compressing map results as well as 
 reducer output in a chain of MR jobs impacts performance. We can use PigMix 
 queries for this investigation.




[jira] Updated: (PIG-1518) multi file input format for loaders

2010-08-25 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1518:
--

Attachment: PIG-1518.patch

Improvement on logging info.

 multi file input format for loaders
 ---

 Key: PIG-1518
 URL: https://issues.apache.org/jira/browse/PIG-1518
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, 
 PIG-1518.patch, PIG-1518.patch, PIG-1518.patch


 We frequently run into situations where Pig needs to deal with small files 
 in the input. In this case a separate map is created for each file, which 
 can be very inefficient. 
 It would be great to have an umbrella input format that can take multiple 
 files and use them in a single split. We would like to see this working with 
 different data formats if possible.
 There are already a couple of input formats doing a similar thing: 
 MultifileInputFormat as well as CombinedInputFormat; however, neither works 
 with the new Hadoop 20 API. 
 We at least want to do a feasibility study for Pig 0.8.0.




[jira] Updated: (PIG-1518) multi file input format for loaders

2010-08-25 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1518:
--

Status: Open  (was: Patch Available)

 multi file input format for loaders
 ---

 Key: PIG-1518
 URL: https://issues.apache.org/jira/browse/PIG-1518
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, 
 PIG-1518.patch, PIG-1518.patch, PIG-1518.patch


 We frequently run into situations where Pig needs to deal with small files 
 in the input. In this case a separate map is created for each file, which 
 can be very inefficient. 
 It would be great to have an umbrella input format that can take multiple 
 files and use them in a single split. We would like to see this working with 
 different data formats if possible.
 There are already a couple of input formats doing a similar thing: 
 MultifileInputFormat as well as CombinedInputFormat; however, neither works 
 with the new Hadoop 20 API. 
 We at least want to do a feasibility study for Pig 0.8.0.




[jira] Updated: (PIG-1518) multi file input format for loaders

2010-08-24 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1518:
--

Status: Open  (was: Patch Available)

 multi file input format for loaders
 ---

 Key: PIG-1518
 URL: https://issues.apache.org/jira/browse/PIG-1518
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, 
 PIG-1518.patch


 We frequently run into situations where Pig needs to deal with small files 
 in the input. In this case a separate map is created for each file, which 
 can be very inefficient. 
 It would be great to have an umbrella input format that can take multiple 
 files and use them in a single split. We would like to see this working with 
 different data formats if possible.
 There are already a couple of input formats doing a similar thing: 
 MultifileInputFormat as well as CombinedInputFormat; however, neither works 
 with the new Hadoop 20 API. 
 We at least want to do a feasibility study for Pig 0.8.0.




[jira] Updated: (PIG-1518) multi file input format for loaders

2010-08-24 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1518:
--

Status: Patch Available  (was: Open)

 multi file input format for loaders
 ---

 Key: PIG-1518
 URL: https://issues.apache.org/jira/browse/PIG-1518
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, 
 PIG-1518.patch, PIG-1518.patch


 We frequently run into situations where Pig needs to deal with small files 
 in the input. In this case a separate map is created for each file, which 
 can be very inefficient. 
 It would be great to have an umbrella input format that can take multiple 
 files and use them in a single split. We would like to see this working with 
 different data formats if possible.
 There are already a couple of input formats doing a similar thing: 
 MultifileInputFormat as well as CombinedInputFormat; however, neither works 
 with the new Hadoop 20 API. 
 We at least want to do a feasibility study for Pig 0.8.0.




[jira] Updated: (PIG-1518) multi file input format for loaders

2010-08-24 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1518:
--

Attachment: PIG-1518.patch

Minor polish of debugging code inside comments.

 multi file input format for loaders
 ---

 Key: PIG-1518
 URL: https://issues.apache.org/jira/browse/PIG-1518
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, 
 PIG-1518.patch, PIG-1518.patch


 We frequently run into situations where Pig needs to deal with small files 
 in the input. In this case a separate map is created for each file, which 
 can be very inefficient. 
 It would be great to have an umbrella input format that can take multiple 
 files and use them in a single split. We would like to see this working with 
 different data formats if possible.
 There are already a couple of input formats doing a similar thing: 
 MultifileInputFormat as well as CombinedInputFormat; however, neither works 
 with the new Hadoop 20 API. 
 We at least want to do a feasibility study for Pig 0.8.0.




[jira] Updated: (PIG-1518) multi file input format for loaders

2010-08-23 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1518:
--

Attachment: PIG-1518.patch

The add method of PigSplit is removed. The debug code is left in to facilitate 
future debugging work. The use of initNextRecordReader is largely cloned from 
org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader, and I'll leave it 
as is too.

 multi file input format for loaders
 ---

 Key: PIG-1518
 URL: https://issues.apache.org/jira/browse/PIG-1518
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: PIG-1518.patch, PIG-1518.patch, PIG-1518.patch


 We frequently run into situations where Pig needs to deal with small files 
 in the input. In this case a separate map is created for each file, which 
 can be very inefficient. 
 It would be great to have an umbrella input format that can take multiple 
 files and use them in a single split. We would like to see this working with 
 different data formats if possible.
 There are already a couple of input formats doing a similar thing: 
 MultifileInputFormat as well as CombinedInputFormat; however, neither works 
 with the new Hadoop 20 API. 
 We at least want to do a feasibility study for Pig 0.8.0.




[jira] Updated: (PIG-1518) multi file input format for loaders

2010-08-23 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1518:
--

Attachment: PIG-1518.patch

Fix a typo; rebase on the latest trunk.

 multi file input format for loaders
 ---

 Key: PIG-1518
 URL: https://issues.apache.org/jira/browse/PIG-1518
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, 
 PIG-1518.patch





[jira] Updated: (PIG-1518) multi file input format for loaders

2010-08-23 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1518:
--

  Status: Patch Available  (was: Open)
Release Note: 
Feature: combine splits of sizes smaller than the value of property
pig.maxCombinedSplitSize or, if the property pig.maxCombinedSplitSize is
not set, the file system default block size of the load's location. This
feature can be turned off by setting the property pig.noSplitCombination
to true. When such a combination is performed, a log message like "Total input
paths (combined) to process : 7" will be logged.

This feature applies when a user input, or an intermediate input, has
many small files to be loaded that would otherwise cause many under-fed
mappers to be launched, potentially slowing down execution.

This change will not cause any backward compatibility issue unless a loader
implementation makes use of the PigSplit object passed through the
prepareToRead method, in which case a rebuild of the loader might be necessary
because PigSplit's definition has been modified. However, we currently know of
no external use of the object.

In addition, if a loader implements IndexableLoadFunc, or implements both
OrderedLoadFunc and CollectableLoadFunc, its input splits won't be subject to
combination.
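
The rules in the release note can be read as a small decision procedure. The
following Python sketch is only an illustration of that reading, not Pig's
actual code: the helper names, the dict of properties and the use of interface
names as plain strings are all assumptions made for the example.

```python
def max_combined_split_size(props, default_block_size):
    """Return the size limit (bytes) for a combined split.

    props: dict of property name -> string value (hypothetical stand-in
    for the job configuration). Falls back to the file system default
    block size when pig.maxCombinedSplitSize is not set.
    """
    value = props.get("pig.maxCombinedSplitSize")
    return int(value) if value is not None else default_block_size


def is_combinable(props, loader_interfaces):
    """Decide whether a loader's splits may be combined.

    loader_interfaces: names of the interfaces the loader implements,
    given as strings (an assumption of this sketch).
    """
    # The feature can be switched off globally.
    if props.get("pig.noSplitCombination", "false").lower() == "true":
        return False
    ifaces = set(loader_interfaces)
    # Loaders that rely on split identity or ordering are left alone.
    if "IndexableLoadFunc" in ifaces:
        return False
    if {"OrderedLoadFunc", "CollectableLoadFunc"} <= ifaces:
        return False
    return True
```

For instance, with no properties set and a 64 MB block size, the limit is the
block size and a loader implementing only OrderedLoadFunc remains combinable.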

 multi file input format for loaders
 ---

 Key: PIG-1518
 URL: https://issues.apache.org/jira/browse/PIG-1518
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, 
 PIG-1518.patch





[jira] Updated: (PIG-1518) multi file input format for loaders

2010-08-20 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1518:
--

Attachment: PIG-1518.patch

Style changes, Hudson pass, plus other minor changes. Internal Hudson results:

 [exec] -1 overall.
 [exec]
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec]
 [exec] +1 tests included.  The patch appears to include 3 new or modified tests.
 [exec]
 [exec] +1 javadoc.  The javadoc tool did not generate any warning messages.
 [exec]
 [exec] +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
 [exec]
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs warnings.
 [exec]
 [exec] -1 release audit.  The applied patch generated 427 release audit warnings (more than the trunk's current 425 warnings).


The release audit warnings are on two html files: PigInputFormat.html and
PigRecordReader.html.

 multi file input format for loaders
 ---

 Key: PIG-1518
 URL: https://issues.apache.org/jira/browse/PIG-1518
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: PIG-1518.patch, PIG-1518.patch





[jira] Updated: (PIG-1399) Logical Optimizer: Expression optimizor rule

2010-08-20 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1399:
--

Attachment: PIG-1399.patch

rebased on the latest trunk.

 Logical Optimizer: Expression optimizor rule
 

 Key: PIG-1399
 URL: https://issues.apache.org/jira/browse/PIG-1399
 Project: Pig
  Issue Type: Sub-task
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: PIG-1399.patch, PIG-1399.patch


 We can optimize expressions in several ways:
 1. Constant pre-calculation
 Example:
 B = filter A by a0 > 5+7;
 => B = filter A by a0 > 12;
 2. Boolean expression optimization
 Example:
 B = filter A by not (not(a0>5) or a>10);
 => B = filter A by a0>5 and a<=10;
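
The two rewrites above can be sketched on a toy expression tree. The Python
below is a minimal illustration and not the Pig optimizer rule itself: the
tuple encoding of expressions and the function names are assumptions, and the
comparison operators in the examples are taken to be ">" and "<=".

```python
import operator

# Arithmetic operators that can be folded when both operands are literals.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}
# Comparison operators and their logical negations.
NEGATE = {">": "<=", "<=": ">", "<": ">=", ">=": "<", "==": "!=", "!=": "=="}


def fold(expr):
    """Constant pre-calculation: evaluate arithmetic on literal operands."""
    if isinstance(expr, tuple):
        op, *args = expr
        args = [fold(a) for a in args]
        if op in OPS and all(isinstance(a, (int, float)) for a in args):
            return OPS[op](*args)
        return (op, *args)
    return expr  # a column name or a literal


def push_not(expr, negate=False):
    """Boolean optimization: push NOT down to the comparisons (De Morgan)."""
    op, *args = expr
    if op == "not":
        return push_not(args[0], not negate)
    if op in ("and", "or"):
        if negate:
            op = "or" if op == "and" else "and"
        return (op, *(push_not(a, negate) for a in args))
    # Comparison leaf: flip the operator if an odd number of NOTs applies.
    return (NEGATE[op] if negate else op, *args)
```

Here `fold((">", "a0", ("+", 5, 7)))` gives `(">", "a0", 12)`, and applying
`push_not` to the tree for `not (not(a0>5) or a>10)` gives the tree for
`a0>5 and a<=10`.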




[jira] Updated: (PIG-1501) need to investigate the impact of compression on pig performance

2010-08-20 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1501:
--

Attachment: PIG-1501.patch

the compression codec is configurable on gzip or lzo; plus some minor changes

 need to investigate the impact of compression on pig performance
 

 Key: PIG-1501
 URL: https://issues.apache.org/jira/browse/PIG-1501
 Project: Pig
  Issue Type: Test
Reporter: Olga Natkovich
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: compress_perf_data.txt, compress_perf_data_2.txt, 
 PIG-1501.patch, PIG-1501.patch


 We would like to understand how compressing map results as well as
 reducer output in a chain of MR jobs impacts performance. We can use PigMix
 queries for this investigation.




[jira] Commented: (PIG-1501) need to investigate the impact of compression on pig performance

2010-08-20 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12900950#action_12900950
 ] 

Yan Zhou commented on PIG-1501:
---

The internal Hudson results are as follows:

 [exec] -1 overall.
 [exec]
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec]
 [exec] +1 tests included.  The patch appears to include 9 new or modified tests.
 [exec]
 [exec] +1 javadoc.  The javadoc tool did not generate any warning messages.
 [exec]
 [exec] -1 javac.  The applied patch generated 162 javac compiler warnings (more than the trunk's current 156 warnings).
 [exec]
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs warnings.
 [exec]
 [exec] -1 release audit.  The applied patch generated 427 release audit warnings (more than the trunk's current 425 warnings).

The 6 javac warnings are from the use of the deprecated PigMapReduce.sJobConf
field. But that deprecation is intended for external use only, so internal
use should be OK.

The 2 release audit warnings are on two html files, SampleOptimizer.html and 
org.apache.pig.impl.util.Utils.html.

 need to investigate the impact of compression on pig performance
 

 Key: PIG-1501
 URL: https://issues.apache.org/jira/browse/PIG-1501
 Project: Pig
  Issue Type: Test
Reporter: Olga Natkovich
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: compress_perf_data.txt, compress_perf_data_2.txt, 
 PIG-1501.patch, PIG-1501.patch





[jira] Updated: (PIG-1518) multi file input format for loaders

2010-08-18 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1518:
--

Attachment: PIG-1518.patch

 multi file input format for loaders
 ---

 Key: PIG-1518
 URL: https://issues.apache.org/jira/browse/PIG-1518
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: PIG-1518.patch





[jira] Commented: (PIG-1518) multi file input format for loaders

2010-08-18 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12899888#action_12899888
 ] 

Yan Zhou commented on PIG-1518:
---

In summary, split combination is controlled through the following JVM
properties:

pig.maxCombinedSplitSize: the maximum combined split size, in bytes. By
default, it is the load filesystem's default block size;

pig.splitCombination: takes the values "false" and "true". The default is
"true"; "false" disables split combination.

 multi file input format for loaders
 ---

 Key: PIG-1518
 URL: https://issues.apache.org/jira/browse/PIG-1518
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: PIG-1518.patch





[jira] Commented: (PIG-1518) multi file input format for loaders

2010-08-18 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12900123#action_12900123
 ] 

Yan Zhou commented on PIG-1518:
---

No. It does not work inside an optimizer, because it does not change the
logical/physical plans the way the other optimizers do.

 multi file input format for loaders
 ---

 Key: PIG-1518
 URL: https://issues.apache.org/jira/browse/PIG-1518
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: PIG-1518.patch





[jira] Commented: (PIG-1518) multi file input format for loaders

2010-08-17 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12899445#action_12899445
 ] 

Yan Zhou commented on PIG-1518:
---

Another approach is to mark splits as uncombinable only when necessary.
Specifically, MergeJoinIndexer and the base load in map-side cogroup need to be
excluded from the split combination.

Breaking backward compatibility is probably too much of a risk to take. In the
meantime, OrderedLoadFunc is marked as evolving, which leaves some headroom
for future semantic polishing.

 multi file input format for loaders
 ---

 Key: PIG-1518
 URL: https://issues.apache.org/jira/browse/PIG-1518
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Yan Zhou
 Fix For: 0.8.0





[jira] Commented: (PIG-1518) multi file input format for loaders

2010-08-17 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12899605#action_12899605
 ] 

Yan Zhou commented on PIG-1518:
---

One experimental result on a 15-node cluster of 2 x Xeon L5420 2.50GHz/16G RAM 
boxes is as follows:

Query:

register pigperf.jar;
A = load '/user/pig/tests/data/pigmix/page_views' using 
org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
as (user, action, timespent, query_term, ip_addr, timestamp,
estimated_revenue, page_info, page_links);
B = foreach A generate user, (double)estimated_revenue;
B1 = distinct B;
alpha = load '/user/pig/tests/data/pigmix/users' using PigStorage('\u0001') as 
(name, phone, address,
city, state, zip);
beta = foreach alpha generate name;
C = join beta by name, B1 by user parallel 300;
D = group C by $0 parallel 40;
E = foreach D generate group, SUM(C.estimated_revenue);
store E into 'spliCombo2.out';

It creates 3 map/reduce jobs.

No Split Combination:

||Mappers|Reducers|
|number|120|300|
|elapsed time|24s|2m43s|
|number|301|300|
|elapsed time|46s|3m11s|
|number|300|40|
|elapsed time|38s|53s|
|Total elapsed time|7m36s|


With Split Combination:

||mappers|Reducers|
|number|120|300|
|elapsed time|22s|2m49s|
|number|3|300|
|elapsed time|27s|2m46s|
|number|1|40|
|elapsed time|17s|24s|
|Total elapsed time|7m5s|
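
As a quick check on the totals above, the elapsed times can be converted to
seconds and compared. This is just a worked calculation on the reported
figures; the helper name and the variable names are made up for the example.

```python
import re


def to_seconds(text):
    """Convert a duration such as '7m36s' or '24s' to seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?(\d+)s", text)
    minutes, seconds = match.groups()
    return int(minutes or 0) * 60 + int(seconds)


without_combination = to_seconds("7m36s")   # total without split combination
with_combination = to_seconds("7m5s")       # total with split combination
saved = without_combination - with_combination
percent_saved = round(100 * saved / without_combination, 1)
```

On these numbers the run with split combination is 31 seconds shorter, roughly
6.8% of the total, while the mapper counts of the second and third jobs drop
from 301 and 300 to 3 and 1.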

 multi file input format for loaders
 ---

 Key: PIG-1518
 URL: https://issues.apache.org/jira/browse/PIG-1518
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Yan Zhou
 Fix For: 0.8.0





[jira] Commented: (PIG-1518) multi file input format for loaders

2010-08-17 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12899609#action_12899609
 ] 

Yan Zhou commented on PIG-1518:
---

The formatting of the table of the last comment is a bit off: both headers
should be right-shifted by one column.

 multi file input format for loaders
 ---

 Key: PIG-1518
 URL: https://issues.apache.org/jira/browse/PIG-1518
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Yan Zhou
 Fix For: 0.8.0





[jira] Commented: (PIG-1518) multi file input format for loaders

2010-08-13 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12898490#action_12898490
 ] 

Yan Zhou commented on PIG-1518:
---

There is a bigger question at hand. The semantics of OrderedLoadFunc is that 
the splits are totally ordered. And BinStorage, InterStorage and PigStorage all 
implement that interface through FileInputLoadFunc. Since the combination of 
splits as conceived here will definitely destroy the split ordering, if the 
combination is disabled for these storages, the feature would be virtually 
useless for a majority of use cases.

On the other hand, I'm seeing no use of the comparison capability except for 
MergeJoinIndexer's getNext() method, which makes me wonder if the 
OrderedLoadFunc can be removed from the FileInputLoadFunc.  Semantically, 
FileInputLoadFunc should not support the ordering of splits, as Hadoop's 
FileInputFormat doesn't. When a need arises like in MergeJoinIndexer, we can 
add that extension on. But the change may incur some backward compatibility 
issues.
I'm now soliciting comments in this area.

 multi file input format for loaders
 ---

 Key: PIG-1518
 URL: https://issues.apache.org/jira/browse/PIG-1518
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Yan Zhou
 Fix For: 0.8.0





[jira] Commented: (PIG-1518) multi file input format for loaders

2010-08-12 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12897887#action_12897887
 ] 

Yan Zhou commented on PIG-1518:
---

During the merge process, any empty splits will be skipped. Currently empty
splits are generated for empty files, which is unnecessary in the first
place.

 multi file input format for loaders
 ---

 Key: PIG-1518
 URL: https://issues.apache.org/jira/browse/PIG-1518
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Yan Zhou
 Fix For: 0.8.0





[jira] Commented: (PIG-1518) multi file input format for loaders

2010-08-11 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12897493#action_12897493
 ] 

Yan Zhou commented on PIG-1518:
---

Right, map-side cogroup needs the input to be sorted, but only the side
inputs need the ability to seek on a key; the base input only needs all
duplicate keys to be present in one mapper. I'll mark the side inputs as
non-combinable.

 multi file input format for loaders
 ---

 Key: PIG-1518
 URL: https://issues.apache.org/jira/browse/PIG-1518
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Yan Zhou
 Fix For: 0.8.0





[jira] Commented: (PIG-1501) need to investigate the impact of compression on pig performance

2010-08-11 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12897496#action_12897496
 ] 

Yan Zhou commented on PIG-1501:
---

Please refer to HADOOP-3315 for an overall SequenceFile vs. TFile comparison.
It appears that for compressed data, TFile performs better than SequenceFile.

 need to investigate the impact of compression on pig performance
 

 Key: PIG-1501
 URL: https://issues.apache.org/jira/browse/PIG-1501
 Project: Pig
  Issue Type: Test
Reporter: Olga Natkovich
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: compress_perf_data.txt, compress_perf_data_2.txt, 
 PIG-1501.patch





[jira] Commented: (PIG-1501) need to investigate the impact of compression on pig performance

2010-08-10 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12897005#action_12897005
 ] 

Yan Zhou commented on PIG-1501:
---

The default is *not* to compress the intermediate data, which is the existing
behavior.

RCFile is just a bit better than TFile in terms of compression ratio. In terms
of performance, the difference is within background noise. Stitching costs
should be minimal. Actually, the full projection is the biggest advantage of
RCFile over other columnar storage like Zebra. I was surprised to see that the
compression improvement over TFile is marginal. The only cause I can think of
is that the compression ratio is too sensitive to the data to pre-determine or
even pre-estimate.

lzo is under the GPL, but it appears that Hadoop installations have it, at
least in my test cluster.

 need to investigate the impact of compression on pig performance
 

 Key: PIG-1501
 URL: https://issues.apache.org/jira/browse/PIG-1501
 Project: Pig
  Issue Type: Test
Reporter: Olga Natkovich
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: compress_perf_data.txt, compress_perf_data_2.txt





[jira] Updated: (PIG-1501) need to investigate the impact of compression on pig performance

2010-08-10 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1501:
--

Attachment: PIG-1501.patch

 need to investigate the impact of compression on pig performance
 

 Key: PIG-1501
 URL: https://issues.apache.org/jira/browse/PIG-1501
 Project: Pig
  Issue Type: Test
Reporter: Olga Natkovich
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: compress_perf_data.txt, compress_perf_data_2.txt, 
 PIG-1501.patch





[jira] Commented: (PIG-1518) multi file input format for loaders

2010-08-10 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12897085#action_12897085
 ] 

Yan Zhou commented on PIG-1518:
---

The pseudo code of the combination op is as follows:

for each node of the nodes (sorted in the order of ascending sizes) {
  while the node's split list (sorted in the order of descending sizes) is not empty {
    find the biggest splits that can be combined with the first split of the
    list of the splits;
    if the accumulated split size is >= half of the limit {
      generate a combined split;
      remove the accumulated splits from the node's split list;
      clear the accumulated split list;
    } else {
      break;
    }
  }
}

// leftover combination
for each node of the nodes {
  for each split of the node's split list {
    add the split to a leftover list;
  }
}

for each split in the leftover list {
  if accumulated split size is >= limit {
    generate a combined split;
    remove the accumulated splits from the node's split list;
    clear the accumulated split list;
  }
  if it is the last split in the leftover list {
    try to see if it can be added with an existing combined split;
    if not, generate a combined split on the accumulated splits;
  }
}

The complexity is n*log(n) with n being the number of original splits that are
smaller than the limit.
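
A runnable approximation of the pseudo code may help when following the two
passes. The Python below is a simplified sketch under stated assumptions, not
the code in the patch: each split is a (node, size) pair with a single
location, nodes are ordered by their total split size, and the final step
simply flushes the remaining accumulated splits instead of trying to merge
them into an existing combined split.

```python
from collections import defaultdict


def combine_splits(splits, limit):
    """Pack (node, size) splits into combined splits of at most `limit` bytes.

    Returns a list of combined splits, each a list of (node, size) pairs.
    """
    by_node = defaultdict(list)
    for node, size in splits:
        if size > 0:                          # empty splits are skipped
            by_node[node].append(size)

    combined = []
    # Pass 1: per node, in ascending order of total size (an assumption).
    for node in sorted(by_node, key=lambda n: sum(by_node[n])):
        sizes = sorted(by_node[node], reverse=True)   # descending sizes
        while sizes:
            group, total, rest = [sizes[0]], sizes[0], []
            for size in sizes[1:]:
                if total + size <= limit:     # biggest splits that still fit
                    group.append(size)
                    total += size
                else:
                    rest.append(size)
            if total * 2 >= limit:            # at least half of the limit
                combined.append([(node, s) for s in group])
                sizes = rest
            else:
                break                         # keep the rest as leftovers
        by_node[node] = sizes

    # Pass 2: pack the leftovers across nodes, ignoring locality.
    group, total = [], 0
    for node in sorted(by_node):
        for size in by_node[node]:
            if group and total + size > limit:
                combined.append(group)
                group, total = [], 0
            group.append((node, size))
            total += size
    if group:
        combined.append(group)
    return combined
```

With a limit of 100 and splits of sizes 60, 50 and 30 on one node, 10 and 5 on
a second and 20 on a third, the first pass produces two combined splits on the
first node (60+30 and 50) and the leftover pass merges the three small splits
from the other two nodes into one.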

 multi file input format for loaders
 ---

 Key: PIG-1518
 URL: https://issues.apache.org/jira/browse/PIG-1518
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Yan Zhou
 Fix For: 0.8.0





[jira] Updated: (PIG-1501) need to investigate the impact of compression on pig performance

2010-08-09 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1501:
--

Attachment: compress_perf_data_2.txt

The data set in the last tests was so small that the performance difference
was lost in background noise.  This test case generates more temporary data.

In summary, lzo yields a compression ratio of about 3% and a 4x speed
improvement over uncompressed; gzip yields a compression ratio of less than 1%,
but its speed is 1%-2% slower than uncompressed. This is in line with
the general observation that gzip compresses better but performs worse.

 need to investigate the impact of compression on pig performance
 

 Key: PIG-1501
 URL: https://issues.apache.org/jira/browse/PIG-1501
 Project: Pig
  Issue Type: Test
Reporter: Olga Natkovich
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: compress_perf_data.txt, compress_perf_data_2.txt





[jira] Commented: (PIG-1501) need to investigate the impact of compression on pig performance

2010-08-09 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12896620#action_12896620
 ] 

Yan Zhou commented on PIG-1501:
---

Unless any objection is raised in the coming week, I'll go with LZO 
compression on TFile, with compression disabled by default, which keeps 
the old behavior.

 need to investigate the impact of compression on pig performance
 

 Key: PIG-1501
 URL: https://issues.apache.org/jira/browse/PIG-1501
 Project: Pig
  Issue Type: Test
Reporter: Olga Natkovich
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: compress_perf_data.txt, compress_perf_data_2.txt


 We would like to understand how compressing map results as well as 
 reducer output in a chain of MR jobs impacts performance. We can use PigMix 
 queries for this investigation.




[jira] Updated: (PIG-1496) Mandatory rule ImplicitSplitInserter

2010-08-04 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1496:
--

Attachment: PIG-1496.patch

More comments in code per the reviewer's comment.

 Mandatory rule ImplicitSplitInserter
 

 Key: PIG-1496
 URL: https://issues.apache.org/jira/browse/PIG-1496
 Project: Pig
  Issue Type: Sub-task
  Components: impl
Affects Versions: 0.8.0
Reporter: Daniel Dai
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: PIG-1496.patch


 Need to migrate ImplicitSplitInserter to new logical optimizer.




[jira] Updated: (PIG-1496) Mandatory rule ImplicitSplitInserter

2010-08-04 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1496:
--

Attachment: (was: PIG-1496.patch)

 Mandatory rule ImplicitSplitInserter
 

 Key: PIG-1496
 URL: https://issues.apache.org/jira/browse/PIG-1496
 Project: Pig
  Issue Type: Sub-task
  Components: impl
Affects Versions: 0.8.0
Reporter: Daniel Dai
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: PIG-1496.patch


 Need to migrate ImplicitSplitInserter to new logical optimizer.




[jira] Updated: (PIG-1496) Mandatory rule ImplicitSplitInserter

2010-08-04 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1496:
--

Status: Patch Available  (was: Open)

 Mandatory rule ImplicitSplitInserter
 

 Key: PIG-1496
 URL: https://issues.apache.org/jira/browse/PIG-1496
 Project: Pig
  Issue Type: Sub-task
  Components: impl
Affects Versions: 0.8.0
Reporter: Daniel Dai
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: PIG-1496.patch, PIG-1496.patch


 Need to migrate ImplicitSplitInserter to new logical optimizer.




[jira] Updated: (PIG-1496) Mandatory rule ImplicitSplitInserter

2010-08-04 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1496:
--

Attachment: PIG-1496.patch

 Mandatory rule ImplicitSplitInserter
 

 Key: PIG-1496
 URL: https://issues.apache.org/jira/browse/PIG-1496
 Project: Pig
  Issue Type: Sub-task
  Components: impl
Affects Versions: 0.8.0
Reporter: Daniel Dai
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: PIG-1496.patch, PIG-1496.patch


 Need to migrate ImplicitSplitInserter to new logical optimizer.




[jira] Commented: (PIG-1518) multi file input format for loaders

2010-08-02 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12894778#action_12894778
 ] 

Yan Zhou commented on PIG-1518:
---

In contrast with Hive, where CombineFileInputFormat is used to generate 
input splits on the underlying storage formats, PIG's combined splits work 
on top of the splits generated by the underlying loaders. In other words, 
Hive's input splits are CombineFileSplits that create record readers of the 
underlying storage formats, while Pig's combined input splits contain the 
underlying storage's splits.

CombineFileRecordReader would have been reusable if it were not supported only 
in 0.18 and if its constructor did not require a CombineFileSplit argument 
instead of an InputSplit (MAPREDUCE-955).

 multi file input format for loaders
 ---

 Key: PIG-1518
 URL: https://issues.apache.org/jira/browse/PIG-1518
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Yan Zhou
 Fix For: 0.8.0


 We frequently run into the situation where Pig needs to deal with small files 
 in the input. In this case a separate map is created for each file, which 
 could be very inefficient. 
 It would be great to have an umbrella input format that can take multiple 
 files and use them in a single split. We would like to see this working with 
 different data formats if possible.
 There are already a couple of input formats doing a similar thing: 
 MultifileInputFormat as well as CombinedInputFormat; however, neither works 
 with the new Hadoop 20 API. 
 We at least want to do a feasibility study for Pig 0.8.0.




[jira] Commented: (PIG-1518) multi file input format for loaders

2010-07-30 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12894205#action_12894205
 ] 

Yan Zhou commented on PIG-1518:
---

CombinedInputFormat, in lieu of the deprecated MultiFileInputFormat, batches 
small files on the basis of block locality. For PIG, this umbrella input format 
will have to work with generic input formats for which the block info is 
unavailable but the data node and size info are present to let M/R make 
scheduling decisions. In other words, PIG cannot break up the original splits 
to work inside them, but can only use the original splits as building blocks 
for the combined input splits.

Consequently, this combined input format will hold multiple generic input 
splits so that each combined split's size is bounded by a configured limit, 
say pig.maxsplitsize, whose default value is the HDFS block size of the 
file system the load source sits in.

However, due to the sortedness constraints on the tables in merge join, 
split combination will not be used for any loads that feed a merge 
join. For map-side cogroup or map-side group by, though, the splits can be 
combined, because each split is only required to contain all duplicates of a 
key, and combining splits still preserves that invariant.

During combination, the splits on the same data node will be merged as much as 
possible. Leftovers will be merged without regard to data locality. Of 
all the data nodes used, those with fewer splits will be merged before 
those with more splits, so as to minimize the leftovers on the data nodes with 
fewer splits. On each data node, a greedy approach is adopted so that the 
largest splits are merged before the smaller ones, because smaller splits are 
easier to merge among themselves later. 
As a result, the implementation will maintain a list of data hosts, sorted by 
number of splits, each holding a list of its original splits sorted by size, 
to perform the above operations efficiently. The complexity should 
be linear in the number of original splits.
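
The combination strategy described above can be sketched roughly as follows. 
This is only an illustrative Python sketch, not the PIG implementation: Split, 
pack and combine_splits are hypothetical names, each split is assumed to 
report a single host, and the simple first-fit packing used here is quadratic 
in the worst case rather than linear.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in for a loader's input split (single host assumed).
Split = namedtuple("Split", "name size host")

def pack(splits, max_size):
    """First-fit packing of splits that are already sorted largest first."""
    bins = []  # each bin is [total_size, [splits]]
    for s in splits:
        for b in bins:
            if b[0] + s.size <= max_size:
                b[0] += s.size
                b[1].append(s)
                break
        else:
            bins.append([s.size, [s]])
    return [b[1] for b in bins]

def combine_splits(splits, max_size):
    by_host = defaultdict(list)
    for s in splits:
        by_host[s.host].append(s)
    combined, leftovers = [], []
    # Hosts with fewer splits are handled first.
    for host in sorted(by_host, key=lambda h: (len(by_host[h]), h)):
        largest_first = sorted(by_host[host], key=lambda s: s.size, reverse=True)
        for group in pack(largest_first, max_size):
            if len(group) == 1 and group[0].size < max_size:
                leftovers.append(group[0])  # could not be merged on its node
            else:
                combined.append(group)
    # Leftovers are merged without regard to locality.
    leftovers.sort(key=lambda s: s.size, reverse=True)
    combined.extend(pack(leftovers, max_size))
    return combined

splits = [Split("a", 60, "n1"), Split("b", 30, "n1"), Split("c", 50, "n2"),
          Split("d", 20, "n3"), Split("e", 40, "n3"), Split("f", 45, "n4")]
result = [[s.name for s in group] for group in combine_splits(splits, 100)]
print(result)
```

Here max_size plays the role of the configured limit (pig.maxsplitsize in the 
text); the two single-split hosts end up combined as leftovers.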

Note that for data locality, we just honor whatever the generic input split's 
getLocations() method produces. Any particular input split implementation 
may or may not actually hold that property. For instance, CombinedInputFormat 
will combine node-local or rack-local blocks into a split. Essentially, this 
PIG container input split works on whatever perception of data locality the 
underlying loader provides.

On the implementation side, PigSplit will no longer hold a single wrapped 
InputSplit instance but a new CombinedInputSplit instance. Accordingly, 
PigRecordReader will hold a list of wrapped record readers and not just a 
single one. Correspondingly, PigRecordReader's nextKeyValue() will use the 
wrapped record readers in order to fetch the next values.
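
The idea of walking through a list of wrapped record readers in order can be 
sketched as below. This is a hedged, language-neutral illustration in Python, 
not PigRecordReader itself: ChainedRecordReader and its method names are 
hypothetical, and each wrapped reader is modelled as a plain iterable.

```python
class ChainedRecordReader:
    """Reads from a list of wrapped readers, one after another."""

    def __init__(self, readers):
        self._readers = [iter(r) for r in readers]
        self._index = 0
        self._current = None

    def next_key_value(self):
        # Advance to the next record, moving on when a reader is exhausted.
        while self._index < len(self._readers):
            try:
                self._current = next(self._readers[self._index])
                return True
            except StopIteration:
                self._index += 1
        return False

    def get_current_value(self):
        return self._current

reader = ChainedRecordReader([["r1", "r2"], [], ["r3"]])
records = []
while reader.next_key_value():
    records.append(reader.get_current_value())
print(records)
```

An empty wrapped reader is simply skipped, so callers see one continuous 
stream of records.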

Risks include: 1) the test verifications may need major changes, since this 
optimization may cause major ordering changes in results; 2) since 
LoadFunc.prepareRead() takes a PigSplit argument, there might be a backward 
compatibility issue as PigSplit changes its wrapped input split to the combined 
input split. But this should be very unlikely, as the only known 
use of the PigSplit argument is the internal index loader for the right 
table in merge join.

 multi file input format for loaders
 ---

 Key: PIG-1518
 URL: https://issues.apache.org/jira/browse/PIG-1518
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Yan Zhou
 Fix For: 0.8.0


 We frequently run into the situation where Pig needs to deal with small files 
 in the input. In this case a separate map is created for each file, which 
 could be very inefficient. 
 It would be great to have an umbrella input format that can take multiple 
 files and use them in a single split. We would like to see this working with 
 different data formats if possible.
 There are already a couple of input formats doing a similar thing: 
 MultifileInputFormat as well as CombinedInputFormat; however, neither works 
 with the new Hadoop 20 API. 
 We at least want to do a feasibility study for Pig 0.8.0.




[jira] Commented: (PIG-1501) need to investigate the impact of compression on pig performance

2010-07-29 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12893746#action_12893746
 ] 

Yan Zhou commented on PIG-1501:
---

gzip and lzo2 are tried as the compression codecs; TFile and RCFile are used 
as storage formats. The tests are PigMix's L3 and L11, and a variation of L3 
with full projection, hereafter referred to as L3_1, in order to expand the 
temporary data size. (In some cases multiple runs are executed, particularly 
when system fluctuations are suspected.)  End-to-end elapsed times are 
recorded.

The results are on a 15-node cluster of 2 x Xeon L5420 2.50GHz/16G RAM boxes. 
Each entry is a size followed by the elapsed time; repeated runs are listed 
on additional rows:

Query  Run  uncompressed        TFile(lzo)          TFile(gzip)         RCFile(lzo2)
L3     1    133684504   1'40    19674398    1'45    11513958    1'40    18092681    1'56
L3     2                                                                18094161    1'46

L3_1   1    3889095541  3'10    3697681875  4'4     2637742581  3'25    3675818160  3'58
L3_1   2                        3697666122  3'10                        3675816707  3'22
L3_1   3                        3697674414  3'5

L11    1    25878480    1'52    21368784    1'52    15233146    1'57    21112892    1'59
L11    2                                                                21112892    1'59

A few observations are in order:

1) L3 has the highest compression ratio, while L3_1 and L11 have much lower 
compression ratios;
2) gzip compresses better than LZO2, at a small performance cost;
3) RCFile should have seen much better compression as it is a columnar store, 
but the actual difference is marginal. This is probably because of L11's unique 
values and many of L3_1's random values such as time stamps, plus the presence 
of map-typed columns. The conclusion from this observation is that compression 
of temporary intermediate data is not guaranteed to save disk space to a desired 
degree; it depends on the temporary data values being compressed. As a result, 
this feature should be made configurable;
4) The performance implications from these tests seem to be negligible, within 
background noise or within a few percent of the overall run times. But this 
is not conclusive yet; larger and more real-life queries would be more suitable 
for the comparison;
5) RCFile, as noted above, has not shown a clear advantage in terms of a better 
columnar compression ratio. But this observation could be data-sensitive.
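
To make observations 1) and 2) concrete, the ratio of compressed to 
uncompressed size can be computed from the first-run figures reported above. 
This is just a small Python sketch of the arithmetic; the dictionary layout 
and the variable names are illustrative assumptions, and only the numbers come 
from the results.

```python
# First-run sizes per query, as listed in the results above.
sizes = {
    "L3": {"uncompressed": 133684504, "TFile(lzo)": 19674398,
           "TFile(gzip)": 11513958, "RCFile(lzo2)": 18092681},
    "L3_1": {"uncompressed": 3889095541, "TFile(lzo)": 3697681875,
             "TFile(gzip)": 2637742581, "RCFile(lzo2)": 3675818160},
    "L11": {"uncompressed": 25878480, "TFile(lzo)": 21368784,
            "TFile(gzip)": 15233146, "RCFile(lzo2)": 21112892},
}

# Ratio of compressed size to uncompressed size (smaller means better compression).
ratios = {
    query: {fmt: size / row["uncompressed"]
            for fmt, size in row.items() if fmt != "uncompressed"}
    for query, row in sizes.items()
}

for query, row in ratios.items():
    print(query, {fmt: round(r, 3) for fmt, r in row.items()})
```

With these figures L3 shrinks to well under a fifth of its uncompressed size, 
while L3_1 with lzo stays above nine tenths of it.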

 need to investigate the impact of compression on pig performance
 

 Key: PIG-1501
 URL: https://issues.apache.org/jira/browse/PIG-1501
 Project: Pig
  Issue Type: Test
Reporter: Olga Natkovich
Assignee: Yan Zhou
 Fix For: 0.8.0


 We would like to understand how compressing map results as well as 
 reducer output in a chain of MR jobs impacts performance. We can use PigMix 
 queries for this investigation.




[jira] Updated: (PIG-1453) [zebra] Intermittent failure for TestOrderPreserveUnionHDFS

2010-07-23 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1453:
--

Status: Resolved  (was: Patch Available)
Resolution: Fixed

Committed to the trunk.

 [zebra] Intermittent failure for TestOrderPreserveUnionHDFS
 ---

 Key: PIG-1453
 URL: https://issues.apache.org/jira/browse/PIG-1453
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.8.0
Reporter: Daniel Dai
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: PIG-1453.patch, PIG-1453.patch







[jira] Updated: (PIG-1399) Logical Optimizer: Expression optimizor rule

2010-07-15 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1399:
--

Attachment: (was: PIG-1399.patch)

 Logical Optimizer: Expression optimizor rule
 

 Key: PIG-1399
 URL: https://issues.apache.org/jira/browse/PIG-1399
 Project: Pig
  Issue Type: Sub-task
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: PIG-1399.patch


 We can optimize expression in several ways:
 1. Constant pre-calculation
 Example:
 B = filter A by a0 > 5+7;
 => B = filter A by a0 > 12;
 2. Boolean expression optimization
 Example:
 B = filter A by not (not(a0>5) or a>10);
 => B = filter A by a0>5 and a<=10;




[jira] Updated: (PIG-1399) Logical Optimizer: Expression optimizor rule

2010-07-15 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1399:
--

Attachment: PIG-1399.patch

 Logical Optimizer: Expression optimizor rule
 

 Key: PIG-1399
 URL: https://issues.apache.org/jira/browse/PIG-1399
 Project: Pig
  Issue Type: Sub-task
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: PIG-1399.patch


 We can optimize expression in several ways:
 1. Constant pre-calculation
 Example:
 B = filter A by a0 > 5+7;
 => B = filter A by a0 > 12;
 2. Boolean expression optimization
 Example:
 B = filter A by not (not(a0>5) or a>10);
 => B = filter A by a0>5 and a<=10;




[jira] Updated: (PIG-1399) Logical Optimizer: Expression optimizor rule

2010-07-14 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1399:
--

Attachment: PIG-1399.patch

Might not be applicable to trunk yet as it depends upon an uncommitted patch.

 Logical Optimizer: Expression optimizor rule
 

 Key: PIG-1399
 URL: https://issues.apache.org/jira/browse/PIG-1399
 Project: Pig
  Issue Type: Sub-task
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: PIG-1399.patch


 We can optimize expression in several ways:
 1. Constant pre-calculation
 Example:
 B = filter A by a0 > 5+7;
 => B = filter A by a0 > 12;
 2. Boolean expression optimization
 Example:
 B = filter A by not (not(a0>5) or a>10);
 => B = filter A by a0>5 and a<=10;




[jira] Updated: (PIG-1399) Logical Optimizer: Expression optimizor rule

2010-07-14 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1399:
--

Attachment: PIG-1399.patch

 Logical Optimizer: Expression optimizor rule
 

 Key: PIG-1399
 URL: https://issues.apache.org/jira/browse/PIG-1399
 Project: Pig
  Issue Type: Sub-task
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: PIG-1399.patch


 We can optimize expression in several ways:
 1. Constant pre-calculation
 Example:
 B = filter A by a0 > 5+7;
 => B = filter A by a0 > 12;
 2. Boolean expression optimization
 Example:
 B = filter A by not (not(a0>5) or a>10);
 => B = filter A by a0>5 and a<=10;




[jira] Updated: (PIG-1399) Logical Optimizer: Expression optimizor rule

2010-07-14 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1399:
--

Attachment: (was: PIG-1399.patch)

 Logical Optimizer: Expression optimizor rule
 

 Key: PIG-1399
 URL: https://issues.apache.org/jira/browse/PIG-1399
 Project: Pig
  Issue Type: Sub-task
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Yan Zhou
 Fix For: 0.8.0


 We can optimize expression in several ways:
 1. Constant pre-calculation
 Example:
 B = filter A by a0 > 5+7;
 => B = filter A by a0 > 12;
 2. Boolean expression optimization
 Example:
 B = filter A by not (not(a0>5) or a>10);
 => B = filter A by a0>5 and a<=10;




[jira] Commented: (PIG-1399) Logical Optimizer: Expression optimizor rule

2010-07-13 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12887912#action_12887912
 ] 

Yan Zhou commented on PIG-1399:
---

A couple of additional scenarios to be simplified:

5.  equality
Example:
B = filter A by (a0  5 and a0  5);
=> B = filter A by a0  5;


6. complementary OR
Example:
B = filter A by (a0  5 OR a0 = 5);
=> the filtering is removed

Note that by themselves they both look straightforward and may have little 
value. But applied after other simplification rules, they could simplify the 
end results further, even where they are not obviously applicable at first.

 Logical Optimizer: Expression optimizor rule
 

 Key: PIG-1399
 URL: https://issues.apache.org/jira/browse/PIG-1399
 Project: Pig
  Issue Type: Sub-task
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Yan Zhou
 Fix For: 0.8.0


 We can optimize expression in several ways:
 1. Constant pre-calculation
 Example:
 B = filter A by a0 > 5+7;
 => B = filter A by a0 > 12;
 2. Boolean expression optimization
 Example:
 B = filter A by not (not(a0>5) or a>10);
 => B = filter A by a0>5 and a<=10;




[jira] Commented: (PIG-1399) Logical Optimizer: Expression optimizor rule

2010-07-13 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12887923#action_12887923
 ] 

Yan Zhou commented on PIG-1399:
---

This work is not to optimize generic boolean logic, but rather to 
simplify logical expressions based upon the constant values used as their 
operands. The former, e.g., would change a boolean expression of 
((A AND B) OR (A AND C)) to (A AND (B OR C)); while the latter will change, 
say, (a0 > 5 and a0 > 7) to (a0 > 7).  It is, therefore, up to the query 
writer to optimize his/her boolean logic, probably through use of some other 
tools.

The algorithm works in a series of steps in order:

1) A constant expression evaluation visitor evaluates the constant 
expressions. It works by traversing the expression tree in a bottom-up manner 
and evaluating every subexpression whose children are all constant. All 
results from constant children are pushed to a stack for the parent to digest 
for its own evaluation. Any non-constant expression will push a null to the 
stack and consequently will cause all of its ancestors not to be evaluated.
For simplicity, only constant binary expressions and constant unary expressions 
are evaluated. More complex expressions are not evaluated at this moment. For 
UDFs, this evaluation is not planned at all for fear of possible stateful 
consequences resulting from the original evaluations;

2) A NOT conversion visitor traverses the expression tree in a 
depth-first manner with post-order handling. A negativity status for a NOT 
expression is recorded in the depth-first traversal before the subtree 
traversal and reversed after traversing the subtree. Every reversible 
expression is replaced by its negated counterpart when the negativity status 
is negative. Currently the equality op and its non-equality counterpart, all 
range comparisons, and logical AND and OR are reversible.
   Notably missing is "is null", for lack of an "is not null" base expression;

3) A DNF plan is generated, through a helper DNFPlanGenerator visitor class, 
whose disjunctions are either OrExpressions or a new DNFExpression of type 
OR, and whose conjunctions are either AndExpressions or the new 
DNFExpression of type AND. 
   The new DNFExpression, which extends LogicalExpression, is introduced to 
support multiple children (vs. the two children in a BinaryExpression), which 
facilitates the processing of multiple children of an OR or AND operator 
thanks to the commutative property of the two operators. The leaves of the DNF 
are of a new LogicalExpressionProxy type that extends LogicalExpression.
   This new type is used as a proxy toward the original leaf expression 
in the original filter plan. The purpose is to track how often an original 
expression has been put in the DNF plan as a result of the normalizing process. 
Consequently, a DNFSplitCounter member is added to LogicalExpression, which is 
incremented once a new proxy is created on the original expression. Due to the 
potentially exponential growth of the DNF plan, and the nonlinear complexity of 
trimming the DNF plan (see 4 below), the size of the DNF plan is limited to 
100 nodes, beyond which the simplifications after step 2) are just skipped;

4) Then the DNF plan is trimmed according to the inference rules, first between 
the operands of the conjunctions, and then between the operands of the 
disjunction in the DNF plan.
   If a leaf is trimmed, the counter, DNFSplitCounter, of the source of the 
proxy will be decremented. Basically, the DNF plan is used as a utility to 
determine whether an original leaf expression can be trimmed from the original 
filter plan or not. If all proxies of the original leaf expression have been 
trimmed from the DNF plan, the original leaf expression can then be trimmed 
from the original plan.
   The point is that the DNF plan is not intended to replace the original filter 
plan, since the DNF plan in general tends to be more expensive to evaluate than 
the original filter plan.

5) The original filter plan is traversed in a bottom-up manner so that if a 
leaf's DNFSplitCounter is zero, which means all of its proxies in the DNF plan 
have been trimmed, the leaf will be trimmed.
   For nonleafs of AND or OR expressions, if one child survives, the child 
will be relinked to the predecessor(s). If either or both children are trimmed, 
the nonleaf will be trimmed too. If the whole new filter plan is empty, the 
filter operator will be removed from the logical plan too.
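
Step 1) can be illustrated with a small bottom-up folding sketch. This is a 
hedged Python illustration, not the visitor in the patch: expressions are 
modelled as nested tuples (operator, left, right), integers are constants, 
strings are column names, and the recursion's return value plays the role of 
the stack described above. The names fold and ARITH, and the ">" comparison in 
the example, are assumptions made for the illustration; only binary arithmetic 
is folded.

```python
import operator

# Binary arithmetic operators that can be folded when both sides are constant.
ARITH = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def fold(expr):
    """Fold constant subexpressions bottom-up; leaves everything else as is."""
    if not isinstance(expr, tuple):
        return expr  # a constant (int) or a column name (str)
    op, left, right = expr[0], fold(expr[1]), fold(expr[2])
    if op in ARITH and isinstance(left, int) and isinstance(right, int):
        return ARITH[op](left, right)
    # A non-constant child keeps this node, and its ancestors, unevaluated.
    return (op, left, right)

folded = fold((">", "a0", ("+", 3, 2)))
print(folded)
```

A subtree containing a column name is never evaluated, but its constant 
siblings still are.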

Using an example of B = filter A by NOT((a0 > 1) or (a1 < 3 and a0 > 3+2)):

After 1), the filter plan becomes NOT((a0 > 1) or (a1 < 3 and a0 > 5));

After 2), the filter plan becomes (a0 <= 1) AND ((a1 >= 3) OR (a0 <= 5));

After 3), the DNF plan is ((a0 <= 1) AND (a1 >= 3)) OR ((a0 <= 1) AND (a0 <= 
5));

After 4), the DNF plan becomes a0 <= 1;

After 5), the filter plan becomes a0 <= 1.
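
The NOT conversion of step 2) and the end result of this example can be 
checked with a short sketch. It is written in Python purely for illustration 
and is not the code in the patch: the tuple encoding and the names push_not, 
evaluate, NEGATED and COMPARE are hypothetical, the comparisons are read as 
a0 > 1, a1 < 3 and a0 > 5 (which is an assumption), and null handling is 
ignored.

```python
import operator

NEGATED = {">": "<=", "<=": ">", "<": ">=", ">=": "<", "==": "!=", "!=": "=="}
COMPARE = {">": operator.gt, "<=": operator.le, "<": operator.lt,
           ">=": operator.ge, "==": operator.eq, "!=": operator.ne}

def push_not(expr, negate=False):
    """Remove NOT nodes by pushing negation down to the comparison leaves."""
    op = expr[0]
    if op == "not":
        return push_not(expr[1], not negate)
    if op in ("and", "or"):
        if negate:  # De Morgan: swap AND and OR under negation
            op = "or" if op == "and" else "and"
        return (op, push_not(expr[1], negate), push_not(expr[2], negate))
    # Comparison leaf: (operator, column, constant)
    return (NEGATED[op], expr[1], expr[2]) if negate else expr

def evaluate(expr, row):
    op = expr[0]
    if op == "not":
        return not evaluate(expr[1], row)
    if op == "and":
        return evaluate(expr[1], row) and evaluate(expr[2], row)
    if op == "or":
        return evaluate(expr[1], row) or evaluate(expr[2], row)
    return COMPARE[op](row[expr[1]], expr[2])

original = ("not", ("or", (">", "a0", 1),
                    ("and", ("<", "a1", 3), (">", "a0", 5))))
pushed = push_not(original)
print(pushed)

# Brute-force check that both forms agree with the final plan a0 <= 1.
rows = [{"a0": x, "a1": y} for x in range(-3, 10) for y in range(-3, 10)]
agree = all(evaluate(original, r) == evaluate(pushed, r) == (r["a0"] <= 1)
            for r in rows)
print(agree)
```

The brute-force loop only samples a small integer range; it is a sanity check 
on the example, not a proof of the general rule.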


 

[jira] Updated: (PIG-1367) [zebra] Map-side Cogroup Test case is needed on 0.7 if the feature is supported in 0.7

2010-07-09 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1367:
--

   Status: Resolved  (was: Patch Available)
 Assignee: Yan Zhou
Fix Version/s: 0.7.0
   (was: 0.8.0)
   Resolution: Fixed

Committed to the 0.7 branch.

 [zebra] Map-side Cogroup Test case is needed on 0.7 if the feature is 
 supported in 0.7
 --

 Key: PIG-1367
 URL: https://issues.apache.org/jira/browse/PIG-1367
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.7.0
Reporter: Yan Zhou
Assignee: Yan Zhou
 Fix For: 0.7.0

 Attachments: PIG-1367.patch


 PIG-1315 has the Zebra support for this feature and the map-side group-by. It 
 also has the test case for map-side COGROUP; while the test case for map-side 
 GROUP-BY is in PIG-1357.
 However PIG-1315 is committed to the trunk as a whole; but only committed to 
 the 0.7 branch without the map-side group-by test case because PIG has yet to 
 decide if the feature will be in the 0.7 release.
 This JIRA is created for tracking purposes should PIG decide to support 
 map-side COGROUP in 0.7. If not, this should be made invalid 
 eventually.




[jira] Updated: (PIG-1367) [zebra] Map-side Cogroup Test case is needed on 0.7 if the feature is supported in 0.7

2010-06-30 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1367:
--

Attachment: PIG-1367.patch

 [zebra] Map-side Cogroup Test case is needed on 0.7 if the feature is 
 supported in 0.7
 --

 Key: PIG-1367
 URL: https://issues.apache.org/jira/browse/PIG-1367
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.7.0
Reporter: Yan Zhou
 Fix For: 0.8.0

 Attachments: PIG-1367.patch


 PIG-1315 has the Zebra support for this feature and the map-side group-by. It 
 also has the test case for map-side COGROUP; while the test case for map-side 
 GROUP-BY is in PIG-1357.
 However PIG-1315 is committed to the trunk as a whole; but only committed to 
 the 0.7 branch without the map-side group-by test case because PIG has yet to 
 decide if the feature will be in the 0.7 release.
 This JIRA is created for tracking purposes should PIG decide to support 
 map-side COGROUP in 0.7. If not, this should be made invalid 
 eventually.




[jira] Commented: (PIG-1399) Logical Optimizer: Expression optimizor rule

2010-06-28 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12883348#action_12883348
 ] 

Yan Zhou commented on PIG-1399:
---

Other expression optimizations include:

3.  Optimization of erasure of logically implied expression in AND
Example:
B = filter A by (a0 > 5 and a0 > 7);
=> B = filter A by a0 > 7;

4. Optimization of erasure of logically implied expression in OR
Example:
B = filter A by ((a0 > 5) or (a0 > 6 and a1 > 15));
=> B = filter A by a0 > 5;

A comprehensive example of optimizations 2, 3 and 4 is:
B = filter A by NOT((a0 > 1 and a0 > 0) or (a1 < 3 and a0 > 5));
=> B = filter A by a0 <= 1;

 Logical Optimizer: Expression optimizor rule
 

 Key: PIG-1399
 URL: https://issues.apache.org/jira/browse/PIG-1399
 Project: Pig
  Issue Type: Sub-task
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Yan Zhou

 We can optimize expression in several ways:
 1. Constant pre-calculation
 Example:
 B = filter A by a0 > 5+7;
 => B = filter A by a0 > 12;
 2. Boolean expression optimization
 Example:
 B = filter A by not (not(a0>5) or a>10);
 => B = filter A by a0>5 and a<=10;




[jira] Updated: (PIG-1453) [zebra] Intermittent failure for TestOrderPreserveUnionHDFS

2010-06-23 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1453:
--

Status: Open  (was: Patch Available)

 [zebra] Intermittent failure for TestOrderPreserveUnionHDFS
 ---

 Key: PIG-1453
 URL: https://issues.apache.org/jira/browse/PIG-1453
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.8.0
Reporter: Daniel Dai
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: PIG-1453.patch, PIG-1453.patch







[jira] Updated: (PIG-1453) [zebra] Intermittent failure for TestOrderPreserveUnionHDFS

2010-06-23 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1453:
--

Status: Patch Available  (was: Open)

 [zebra] Intermittent failure for TestOrderPreserveUnionHDFS
 ---

 Key: PIG-1453
 URL: https://issues.apache.org/jira/browse/PIG-1453
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.8.0
Reporter: Daniel Dai
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: PIG-1453.patch, PIG-1453.patch







[jira] Updated: (PIG-1451) [zebra] change the build.test property in build to test.build.dir to be in consistent with PIG

2010-06-21 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1451:
--

Status: Resolved  (was: Patch Available)
Resolution: Fixed

The contrib test failure is due to the commit of PIG-1302 and will be 
addressed separately.

 [zebra] change the build.test property in build to test.build.dir to be in 
 consistent with PIG
 --

 Key: PIG-1451
 URL: https://issues.apache.org/jira/browse/PIG-1451
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.6.0, 0.7.0, 0.8.0
Reporter: Yan Zhou
Assignee: Yan Zhou
Priority: Minor
 Fix For: 0.8.0, 0.7.0, 0.6.0

 Attachments: PIG-1451.patch


 Because the build process handles PIG and Zebra builds with the same settings, 
 the property should be the same so that the build process has consistent controls.



