[jira] Updated: (PIG-1518) multi file input format for loaders

2010-09-23 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1518:
--

Release Note: 
Feature: combine splits whose sizes are smaller than the value of the property 
pig.maxCombinedSplitSize or, if that property is not set, smaller than the 
file system default block size of the load's location. This feature can be 
turned off by setting the property pig.splitCombination to false. When such 
a combination is performed, a message like "Total input paths (combined) to 
process : 7" is logged. 

This feature applies when a user input, or an intermediate input, consists 
of many small files that would otherwise cause many under-fed mappers to be 
launched, potentially slowing down execution.

This change causes no backward compatibility issue unless a loader 
implementation makes use of the PigSplit object passed to the prepareToRead 
method, in which case a rebuild of the loader might be necessary because 
PigSplit's definition has been modified. Currently we know of no external 
use of the object.

This change also requires the loader to be stateless across invocations of 
the prepareToRead method. That is, the method should reset any internal state 
that is not affected by the RecordReader argument. Otherwise, this feature 
should be disabled.

In addition, if a loader implements IndexableLoadFunc, or implements 
OrderedLoadFunc and CollectableLoadFunc, its input splits are not subject to 
combination.

  was:
Feature: combine splits whose sizes are smaller than the value of the property 
pig.maxCombinedSplitSize or, if that property is not set, smaller than the 
file system default block size of the load's location. This feature can be 
turned off by setting the property pig.noSplitCombination to true. When such 
a combination is performed, a message like "Total input paths (combined) to 
process : 7" is logged. 

This feature applies when a user input, or an intermediate input, consists 
of many small files that would otherwise cause many under-fed mappers to be 
launched, potentially slowing down execution.

This change causes no backward compatibility issue unless a loader 
implementation makes use of the PigSplit object passed to the prepareToRead 
method, in which case a rebuild of the loader might be necessary because 
PigSplit's definition has been modified. Currently we know of no external 
use of the object.

In addition, if a loader implements IndexableLoadFunc, or implements 
OrderedLoadFunc and CollectableLoadFunc, its input splits are not subject to 
combination.


 multi file input format for loaders
 ---

 Key: PIG-1518
 URL: https://issues.apache.org/jira/browse/PIG-1518
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: PIG-1518-0.7.0.patch, PIG-1518.patch, PIG-1518.patch, 
 PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, 
 PIG-1518.patch, PIG-1518.patch


 We frequently run into the situation where Pig needs to deal with small 
 files in the input. In this case a separate map is created for each file, 
 which can be very inefficient. 
 It would be great to have an umbrella input format that can take multiple 
 files and use them in a single split. We would like to see this working with 
 different data formats if possible.
 There are already a couple of input formats doing something similar, 
 MultiFileInputFormat and CombineFileInputFormat; however, neither works 
 with the new Hadoop 20 API. 
 We at least want to do a feasibility study for Pig 0.8.0.
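
As a rough illustration of the split combination described in the release 
note above, the sketch below greedily groups small split sizes so that each 
group stays within a size limit. It is not Pig's implementation: the class 
SplitCombinerSketch, its combine method, and the rule that a combined group 
must not exceed the limit are assumptions made for the example. Only the idea 
that splits smaller than the limit are candidates for combination comes from 
the release note.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: greedy grouping of small split sizes, not Pig's combiner. */
final class SplitCombinerSketch {

    /**
     * Groups split sizes (in bytes). Splits at or above maxCombinedSize stay alone;
     * smaller splits are packed in order while the group total stays within the limit.
     */
    static List<List<Long>> combine(List<Long> splitSizes, long maxCombinedSize) {
        List<List<Long>> groups = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentTotal = 0;
        for (long size : splitSizes) {
            if (size >= maxCombinedSize) {
                // Not a "small" split: it is never combined with others.
                List<Long> alone = new ArrayList<>();
                alone.add(size);
                groups.add(alone);
                continue;
            }
            if (!current.isEmpty() && currentTotal + size > maxCombinedSize) {
                // The pending group is full; close it and start a new one.
                groups.add(current);
                current = new ArrayList<>();
                currentTotal = 0;
            }
            current.add(size);
            currentTotal += size;
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Hypothetical input: six files, with a 128 MB limit standing in for
        // pig.maxCombinedSplitSize or the file system block size.
        List<Long> sizes = List.of(10 * mb, 20 * mb, 200 * mb, 30 * mb, 70 * mb, 5 * mb);
        List<List<Long>> groups = combine(sizes, 128 * mb);
        System.out.println("Total input paths (combined) to process : " + groups.size());
    }
}
```

With the hypothetical sizes in main, six input splits end up in three groups, 
so three mappers would be launched instead of six.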

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-1642) Order by doesn't use estimation to determine the parallelism

2010-09-23 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding reassigned PIG-1642:
-

Assignee: Richard Ding

 Order by doesn't use estimation to determine the parallelism
 

 Key: PIG-1642
 URL: https://issues.apache.org/jira/browse/PIG-1642
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Richard Ding
Assignee: Richard Ding
 Fix For: 0.8.0


 With PIG-1249, a simple heuristic is used to determine the number of reducers 
 if it isn't specified (via PARALLEL or default_parallel). For an ORDER BY 
 statement, however, the number of reducers still defaults to 1.
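
For readers unfamiliar with this kind of heuristic, here is a minimal sketch 
of a size-based estimate. It assumes the estimate has the form 
ceil(inputBytes / bytesPerReducer), capped by a maximum; that form, the class 
ReducerEstimateSketch, and the numbers in main are illustrative assumptions 
and are not taken from the PIG-1249 patch.

```java
/** Illustrative only: a size-based reducer estimate, not the code from PIG-1249. */
final class ReducerEstimateSketch {

    /** Returns ceil(inputBytes / bytesPerReducer), clamped to the range [1, maxReducers]. */
    static int estimate(long inputBytes, long bytesPerReducer, int maxReducers) {
        if (inputBytes < 0 || bytesPerReducer <= 0 || maxReducers <= 0) {
            throw new IllegalArgumentException("invalid size or limit");
        }
        // Ceiling division without risking overflow on very large inputs.
        long needed = inputBytes / bytesPerReducer
                + (inputBytes % bytesPerReducer == 0 ? 0 : 1);
        return (int) Math.max(1L, Math.min((long) maxReducers, needed));
    }

    public static void main(String[] args) {
        long gb = 1_000_000_000L;
        // Hypothetical numbers: 20 GB of input, 1 GB per reducer, at most 999 reducers.
        System.out.println(estimate(20 * gb, gb, 999));
    }
}
```

The bug reported here is that ORDER BY bypasses any such estimate and runs 
with a single reducer unless parallelism is set explicitly.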




[jira] Updated: (PIG-1641) Incorrect counters in local mode

2010-09-23 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1641:
--

Fix Version/s: 0.8.0

 Incorrect counters in local mode
 

 Key: PIG-1641
 URL: https://issues.apache.org/jira/browse/PIG-1641
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Ashutosh Chauhan
Assignee: Richard Ding
 Fix For: 0.8.0


 User report, not verified.
 <email>
 HadoopVersion  PigVersion      UserId  StartedAt            FinishedAt           Features
 0.20.2         0.8.0-SNAPSHOT  user    2010-09-21 19:25:58  2010-09-21 21:58:42  ORDER_BY
 Success!
 Job Stats (time in seconds):
 JobId           Maps  Reduces  MaxMapTime  MinMapTIme  AvgMapTime  MaxReduceTime  MinReduceTime  AvgReduceTime  Alias      Feature   Outputs
 job_local_0001  0     0        0           0           0           0              0              0              raw        MAP_ONLY
 job_local_0002  0     0        0           0           0           0              0              0              rank_sort  SAMPLER
 job_local_0003  0     0        0           0           0           0              0              0              rank_sort  ORDER_BY  Processed/user_visits_table,
 Input(s):
 Successfully read 0 records from: Data/Raw/UserVisits.dat
 Output(s):
 Successfully stored 0 records in: Processed/user_visits_table
 However, when I look in the output:
 $ ls -lh Processed/user_visits_table/CG0/
 total 15250760
 -rwxrwxrwx  1 user  _lpoperator   7.3G Sep 21 21:58 part-0*
 It read a 20G input file and generated some output...
 </email>
 Is it that counters are not available in local mode? If so, instead of 
 printing zeros we should print "Information Unavailable" or some such.




[jira] Updated: (PIG-1641) Incorrect counters in local mode

2010-09-23 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1641:
--

Status: Patch Available  (was: Open)

 Incorrect counters in local mode
 

 Key: PIG-1641
 URL: https://issues.apache.org/jira/browse/PIG-1641
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Ashutosh Chauhan
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1641.patch





[jira] Updated: (PIG-1641) Incorrect counters in local mode

2010-09-23 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1641:
--

Attachment: PIG-1641.patch

 Incorrect counters in local mode
 

 Key: PIG-1641
 URL: https://issues.apache.org/jira/browse/PIG-1641
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Ashutosh Chauhan
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1641.patch





[jira] Updated: (PIG-1643) join fails for a query with input having 'load using pigstorage without schema' + 'foreach'

2010-09-23 Thread Thejas M Nair (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thejas M Nair updated PIG-1643:
---

Attachment: PIG-1643.1.patch

PIG-1643.1.patch
There was a code path that led to fields having the NULL datatype instead of 
the default datatype of BYTEARRAY. That was causing these failures. 
test-patch has succeeded; unit tests are running.


 join fails for a query with input having 'load using pigstorage without 
 schema' + 'foreach'
 ---

 Key: PIG-1643
 URL: https://issues.apache.org/jira/browse/PIG-1643
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Fix For: 0.8.0

 Attachments: PIG-1643.1.patch


 {code}
 l1 = load 'std.txt';
 l2 = load 'std.txt'; 
 f1 = foreach l1 generate $0 as abc, $1 as  def;
 -- j =  join f1 by $0, l2 by $0 using 'replicated';
 -- j =  join l2 by $0, f1 by $0 using 'replicated';
 j =  join l2 by $0, f1 by $0 ;
 dump j;
 {code}
 the error -
 {code}
 2010-09-22 16:24:48,584 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 2044: The type null cannot be collected as a Key type
 {code}
 The MR plan from explain  -
 {code}
 #--
 # Map Reduce Plan  
 #--
 MapReduce node scope-21
 Map Plan
 Union[tuple] - scope-22
 |
 |---j: Local Rearrange[tuple]{bytearray}(false) - scope-11
 |   |   |
 |   |   Project[bytearray][0] - scope-12
 |   |
 |   |---l2: Load(file:///Users/tejas/pig_obyfail/trunk/std.txt:org.apache.pig.builtin.PigStorage) - scope-0
 |
 |---j: Local Rearrange[tuple]{NULL}(false) - scope-13
 |   |
 |   Project[NULL][0] - scope-14
 |
 |---f1: New For Each(false,false)[bag] - scope-6
 |   |
 |   Project[bytearray][0] - scope-2
 |   |
 |   Project[bytearray][1] - scope-4
 |
 |---l1: Load(file:///Users/tejas/pig_obyfail/trunk/std.txt:org.apache.pig.builtin.PigStorage) - scope-1
 Reduce Plan
 j: Store(/tmp/x:org.apache.pig.builtin.PigStorage) - scope-18
 |
 |---POJoinPackage(true,true)[tuple] - scope-23
 Global sort: false
 
 {code}




[jira] Updated: (PIG-1643) join fails for a query with input having 'load using pigstorage without schema' + 'foreach'

2010-09-23 Thread Thejas M Nair (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thejas M Nair updated PIG-1643:
---

Status: Patch Available  (was: Open)

 join fails for a query with input having 'load using pigstorage without 
 schema' + 'foreach'
 ---

 Key: PIG-1643
 URL: https://issues.apache.org/jira/browse/PIG-1643
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Fix For: 0.8.0

 Attachments: PIG-1643.1.patch





[jira] Commented: (PIG-1643) join fails for a query with input having 'load using pigstorage without schema' + 'foreach'

2010-09-23 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12914126#action_12914126
 ] 

Daniel Dai commented on PIG-1643:
-

+1 if tests pass.

 join fails for a query with input having 'load using pigstorage without 
 schema' + 'foreach'
 ---

 Key: PIG-1643
 URL: https://issues.apache.org/jira/browse/PIG-1643
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Fix For: 0.8.0

 Attachments: PIG-1643.1.patch





[jira] Commented: (PIG-1645) Using both small split combination and temporary file compression on a query of ORDER BY may cause crash

2010-09-23 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12914128#action_12914128
 ] 

Yan Zhou commented on PIG-1645:
---

The problem is that both RandomSampleLoader and PoissonSampleLoader keep 
internal state from previous invocations. When split combination (PIG-1518) 
is on, that state should be reset each time a different underlying split is 
worked on under the same umbrella split.

When temporary file compression is disabled, Pig internal storage creates 
empty files, which are discarded by the split combiner. That leaves the one 
non-empty split as the only split to be worked on, so this case is fine.
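
To make the state problem concrete, here is a minimal sketch of a loader that 
keeps a per-split sample quota and resets it whenever it is handed a new 
underlying split. Everything in it is hypothetical: SamplingLoaderSketch is 
not Pig's RandomSampleLoader, its prepareToRead takes a plain list rather 
than a RecordReader and PigSplit, and it simply takes the first records of 
each split as its sample.

```java
import java.util.List;

/** Hypothetical stand-in for a sampling loader; not Pig's RandomSampleLoader. */
final class SamplingLoaderSketch {
    private final int samplesPerSplit;
    private List<String> records = List.of();
    private int nextIndex;  // read position within the current underlying split
    private int emitted;    // samples already returned for the current underlying split

    SamplingLoaderSketch(int samplesPerSplit) {
        this.samplesPerSplit = samplesPerSplit;
    }

    /** Called once per underlying split; resets all per-split state. */
    void prepareToRead(List<String> recordsOfSplit) {
        records = recordsOfSplit;
        nextIndex = 0;
        emitted = 0; // without this reset, later splits would yield no samples
    }

    /** Returns the next sample, or null when the quota or the split is exhausted. */
    String getNext() {
        if (emitted >= samplesPerSplit || nextIndex >= records.size()) {
            return null;
        }
        emitted++;
        return records.get(nextIndex++);
    }

    /** Reads every underlying split of one combined split and counts the samples. */
    static int countSamples(int samplesPerSplit, List<List<String>> underlyingSplits) {
        SamplingLoaderSketch loader = new SamplingLoaderSketch(samplesPerSplit);
        int total = 0;
        for (List<String> split : underlyingSplits) {
            loader.prepareToRead(split);
            while (loader.getNext() != null) {
                total++;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        List<List<String>> combined =
                List.of(List.of("a", "b", "c"), List.of("d", "e", "f"));
        System.out.println(countSamples(2, combined));
    }
}
```

If the emitted counter were not reset in prepareToRead, the second underlying 
split would contribute no samples at all, which is the kind of stale state 
described above.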

 Using both small split combination and temporary file compression on a query 
 of ORDER BY may cause crash
 

 Key: PIG-1645
 URL: https://issues.apache.org/jira/browse/PIG-1645
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Yan Zhou
Assignee: Yan Zhou
 Fix For: 0.8.0


 The stack looks like the following:
 java.lang.NullPointerException
     at java.util.Arrays.binarySearch(Arrays.java:2043)
     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.getPartition(WeightedRangePartitioner.java:72)
     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.getPartition(WeightedRangePartitioner.java:52)
     at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:565)
     at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:116)
     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:238)
     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231)
     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53)
     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:638)
     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:314)
     at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
     at java.security.AccessController.doPrivileged(Native Method)
     at javax.security.auth.Subject.doAs(Subject.java:396)
     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1062)
     at org.apache.hadoop.mapred.Child.main(Child.java:211)




[jira] Commented: (PIG-1644) New logical plan: Plan.connect with position is misused in some places

2010-09-23 Thread Thejas M Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12914130#action_12914130
 ] 

Thejas M Nair commented on PIG-1644:


These operations will be fairly common in the optimizer. I think it would be 
good to have functions in OperatorPlan that support them; that would reduce 
the chances of bugs and also make the code more readable.


 New logical plan: Plan.connect with position is misused in some places
 --

 Key: PIG-1644
 URL: https://issues.apache.org/jira/browse/PIG-1644
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.8.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.8.0

 Attachments: PIG-1644-1.patch


 When we replace/remove/insert a node, we use the disconnect/connect methods 
 of OperatorPlan. When we disconnect an edge, we should save the position of 
 the edge at its origin and destination, and use these positions when 
 connecting to the new predecessor/successor. Some of the patterns are:
 Insert a new node:
 {code}
 Pair<Integer, Integer> pos = plan.disconnect(pred, succ);
 plan.connect(pred, pos.first, newnode, 0);
 plan.connect(newnode, 0, succ, pos.second);
 {code}
 Remove a node:
 {code}
 Pair<Integer, Integer> pos1 = plan.disconnect(pred, nodeToRemove);
 Pair<Integer, Integer> pos2 = plan.disconnect(nodeToRemove, succ);
 plan.connect(pred, pos1.first, succ, pos2.second);
 {code}
 Replace a node:
 {code}
 Pair<Integer, Integer> pos1 = plan.disconnect(pred, nodeToReplace);
 Pair<Integer, Integer> pos2 = plan.disconnect(nodeToReplace, succ);
 plan.connect(pred, pos1.first, newNode, pos1.second);
 plan.connect(newNode, pos2.first, succ, pos2.second);
 {code}
 There are a couple of places where we do not follow this pattern, which 
 results in errors. For example, the following script fails:
 {code}
 a = load '1.txt' as (a0, a1, a2, a3);
 b = foreach a generate a0, a1, a2;
 store b into 'aaa';
 c = order b by a2;
 d = foreach c generate a2;
 store d into 'bbb';
 {code}
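
To show why the saved positions matter, here is a small self-contained sketch 
of the "insert a new node" pattern on a toy plan with ordered edge lists. The 
class PositionalPlanSketch, its string-named nodes, and the int[] return 
value are stand-ins invented for this example; they only mimic the shape of 
the disconnect/connect calls quoted above and are not Pig's OperatorPlan.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy plan with ordered edges; a stand-in for illustration, not Pig's OperatorPlan. */
final class PositionalPlanSketch {
    private final Map<String, List<String>> successors = new HashMap<>();
    private final Map<String, List<String>> predecessors = new HashMap<>();

    List<String> successorsOf(String node) {
        return successors.computeIfAbsent(node, k -> new ArrayList<>());
    }

    List<String> predecessorsOf(String node) {
        return predecessors.computeIfAbsent(node, k -> new ArrayList<>());
    }

    /** Adds the edge from -> to at the given slot in each endpoint's edge list. */
    void connect(String from, int fromPos, String to, int toPos) {
        successorsOf(from).add(fromPos, to);
        predecessorsOf(to).add(toPos, from);
    }

    /** Appends the edge from -> to at the end of both edge lists. */
    void connect(String from, String to) {
        connect(from, successorsOf(from).size(), to, predecessorsOf(to).size());
    }

    /** Removes from -> to; returns {slot in from's successors, slot in to's predecessors}. */
    int[] disconnect(String from, String to) {
        int fromPos = successorsOf(from).indexOf(to);
        int toPos = predecessorsOf(to).indexOf(from);
        if (fromPos < 0 || toPos < 0) {
            throw new IllegalArgumentException("no edge " + from + " -> " + to);
        }
        successorsOf(from).remove(fromPos);
        predecessorsOf(to).remove(toPos);
        return new int[] {fromPos, toPos};
    }

    /** Inserts newNode on the edge pred -> succ, reusing the original slots. */
    void insertBetween(String pred, String newNode, String succ) {
        int[] pos = disconnect(pred, succ);
        connect(pred, pos[0], newNode, 0);
        connect(newNode, 0, succ, pos[1]);
    }

    /** Builds a two-input join and inserts a filter on its first input. */
    static PositionalPlanSketch demo() {
        PositionalPlanSketch plan = new PositionalPlanSketch();
        plan.connect("a", "join");
        plan.connect("b", "join");
        plan.insertBetween("a", "filter", "join");
        return plan;
    }

    public static void main(String[] args) {
        // The join's inputs keep their order: the filter takes a's old slot.
        System.out.println(demo().predecessorsOf("join"));
    }
}
```

Had the new edge simply been appended, the join's inputs would have ended up 
in the order b, filter, silently swapping the two sides of the join.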




[jira] Commented: (PIG-1635) Logical simplifier does not simplify away constants under AND and OR; after simplification the ordering of operands of AND and OR may get changed

2010-09-23 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12914145#action_12914145
 ] 

Yan Zhou commented on PIG-1635:
---

test-patch results:

 [exec] +1 overall.
 [exec]
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec]
 [exec] +1 tests included.  The patch appears to include 3 new or modified tests.
 [exec]
 [exec] +1 javadoc.  The javadoc tool did not generate any warning messages.
 [exec]
 [exec] +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
 [exec]
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs warnings.
 [exec]
 [exec] +1 release audit.  The applied patch does not increase the total number of release audit warnings.

 Logical simplifier does not simplify away constants under AND and OR; after 
 simplification the ordering of operands of AND and OR may get changed
 

 Key: PIG-1635
 URL: https://issues.apache.org/jira/browse/PIG-1635
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.8.0
Reporter: Yan Zhou
Assignee: Yan Zhou
Priority: Minor
 Fix For: 0.8.0

 Attachments: PIG-1635.patch


 b = FILTER a by (( f1  1) AND (1 == 1))
 or 
 b = FILTER a by ((f1  1) OR ( 1==0))
 should be simplified to
 b = FILTER a by f1  1;
 Regarding the ordering change, an example is 
 b = filter a by ((f1 is not null) AND (f2 is not null));
 Even without any possible simplification, the expression is changed to
 b = filter a by ((f2 is not null) AND (f1 is not null));
 The ordering change in this case, and probably in most other cases, makes 
 no difference. Still, some users might care about the ordering for two 
 reasons: stateful UDFs may be used as operands of AND or OR, and the 
 ordering may be intended by the application designer to maximize the 
 chances of short-circuiting the composite boolean evaluation. 
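
The behaviour asked for above can be sketched as a small rewrite over a 
boolean expression tree: drop a constant that is the identity of its 
operator (true under AND, false under OR) and otherwise leave the operands 
in their original order. This is only an illustration under assumptions: the 
class BoolSimplifierSketch and its Expr type are invented here, comparisons 
such as 1 == 1 are assumed to have been folded to constants beforehand, and 
none of this is Pig's actual simplifier code.

```java
/** Illustrative only: removes identity constants under AND/OR, preserving operand order. */
final class BoolSimplifierSketch {
    enum Kind { TRUE, FALSE, LEAF, AND, OR }

    static final class Expr {
        final Kind kind;
        final String name;  // set only for LEAF
        final Expr left;    // set only for AND/OR
        final Expr right;   // set only for AND/OR

        private Expr(Kind kind, String name, Expr left, Expr right) {
            this.kind = kind;
            this.name = name;
            this.left = left;
            this.right = right;
        }

        static Expr constant(boolean value) {
            return new Expr(value ? Kind.TRUE : Kind.FALSE, null, null, null);
        }

        static Expr leaf(String name) {
            return new Expr(Kind.LEAF, name, null, null);
        }

        static Expr and(Expr l, Expr r) {
            return new Expr(Kind.AND, null, l, r);
        }

        static Expr or(Expr l, Expr r) {
            return new Expr(Kind.OR, null, l, r);
        }

        @Override
        public String toString() {
            switch (kind) {
                case TRUE: return "true";
                case FALSE: return "false";
                case LEAF: return name;
                case AND: return "(" + left + " AND " + right + ")";
                default: return "(" + left + " OR " + right + ")";
            }
        }
    }

    /** Simplifies bottom-up; never reorders the operands it keeps. */
    static Expr simplify(Expr e) {
        if (e.kind != Kind.AND && e.kind != Kind.OR) {
            return e;
        }
        Expr l = simplify(e.left);
        Expr r = simplify(e.right);
        // true is the identity of AND, false is the identity of OR.
        Kind identity = (e.kind == Kind.AND) ? Kind.TRUE : Kind.FALSE;
        if (l.kind == identity) {
            return r;
        }
        if (r.kind == identity) {
            return l;
        }
        return (e.kind == Kind.AND) ? Expr.and(l, r) : Expr.or(l, r);
    }

    public static void main(String[] args) {
        Expr p = Expr.leaf("p");
        Expr q = Expr.leaf("q");
        System.out.println(simplify(Expr.and(p, Expr.constant(true))));
        System.out.println(simplify(Expr.or(p, Expr.constant(false))));
        System.out.println(simplify(Expr.and(p, q)));
    }
}
```

The sketch deliberately does not rewrite "x AND false" to false, since 
skipping the evaluation of x is exactly the kind of change that matters when 
x is a stateful UDF.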




[jira] Commented: (PIG-1644) New logical plan: Plan.connect with position is misused in some places

2010-09-23 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12914147#action_12914147
 ] 

Daniel Dai commented on PIG-1644:
-

Yes, I think we can do replace/remove/insert. They should be simple and clear 
enough to use. Here are the new methods to add to OperatorPlan:
{code}
replace(Operator oldOperator, Operator newOperator)
// Connect all its successors to its predecessor, or all its predecessors to its successor
remove(Operator operatorToRemove)
// Insert operatorToInsert before pos; connect all of pos's predecessors to operatorToInsert
insertBefore(Operator operatorToInsert, Operator pos)
// Insert operatorToInsert after pos; connect operatorToInsert to all of pos's successors
insertAfter(Operator operatorToInsert, Operator pos)
{code}

How does that sound?
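
As a rough idea of what two of these helpers could look like, here is a 
sketch of replace and remove on a toy plan that stores ordered predecessor 
and successor lists. It is written under assumptions: PlanHelpersSketch and 
its string-named nodes are invented for the example, remove only handles a 
node with exactly one predecessor and one successor, and the real 
OperatorPlan methods may be shaped differently.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy plan with replace/remove helpers; a sketch, not Pig's OperatorPlan. */
final class PlanHelpersSketch {
    final Map<String, List<String>> succ = new HashMap<>();
    final Map<String, List<String>> pred = new HashMap<>();

    private static List<String> edges(Map<String, List<String>> map, String node) {
        return map.computeIfAbsent(node, k -> new ArrayList<>());
    }

    /** Appends the edge from -> to. */
    void connect(String from, String to) {
        edges(succ, from).add(to);
        edges(pred, to).add(from);
    }

    /** Swaps oldOp for newOp in place, so every neighbour keeps oldOp's slot. */
    void replace(String oldOp, String newOp) {
        for (String p : edges(pred, oldOp)) {
            List<String> out = edges(succ, p);
            out.set(out.indexOf(oldOp), newOp);
        }
        for (String s : edges(succ, oldOp)) {
            List<String> in = edges(pred, s);
            in.set(in.indexOf(oldOp), newOp);
        }
        // newOp is assumed to be a fresh node; it inherits oldOp's edge lists.
        pred.put(newOp, pred.remove(oldOp));
        succ.put(newOp, succ.remove(oldOp));
    }

    /** Removes op, wiring its single predecessor to its single successor in op's slots. */
    void remove(String op) {
        if (edges(pred, op).size() != 1 || edges(succ, op).size() != 1) {
            throw new IllegalArgumentException("sketch handles one predecessor and one successor");
        }
        String p = edges(pred, op).get(0);
        String s = edges(succ, op).get(0);
        List<String> out = edges(succ, p);
        out.set(out.indexOf(op), s);
        List<String> in = edges(pred, s);
        in.set(in.indexOf(op), p);
        pred.remove(op);
        succ.remove(op);
    }

    /** a -> f -> join and b -> join; then remove f and replace b with b2. */
    static PlanHelpersSketch demo() {
        PlanHelpersSketch plan = new PlanHelpersSketch();
        plan.connect("a", "f");
        plan.connect("f", "join");
        plan.connect("b", "join");
        plan.remove("f");
        plan.replace("b", "b2");
        return plan;
    }

    public static void main(String[] args) {
        System.out.println(demo().pred.get("join"));
    }
}
```

Because both helpers overwrite the old node's slot instead of appending a 
new edge, callers never have to carry edge positions around themselves.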

 New logical plan: Plan.connect with position is misused in some places
 --

 Key: PIG-1644
 URL: https://issues.apache.org/jira/browse/PIG-1644
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.8.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.8.0

 Attachments: PIG-1644-1.patch





[jira] Commented: (PIG-1635) Logical simplifier does not simplify away constants under AND and OR; after simplification the ordering of operands of AND and OR may get changed

2010-09-23 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12914150#action_12914150
 ] 

Yan Zhou commented on PIG-1635:
---

All test-core tests also run clean.

 Logical simplifier does not simplify away constants under AND and OR; after 
 simplification the ordering of operands of AND and OR may get changed
 

 Key: PIG-1635
 URL: https://issues.apache.org/jira/browse/PIG-1635
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.8.0
Reporter: Yan Zhou
Assignee: Yan Zhou
Priority: Minor
 Fix For: 0.8.0

 Attachments: PIG-1635.patch





[jira] Updated: (PIG-1639) New logical plan: PushUpFilter should not push before group/cogroup if filter condition contains UDF

2010-09-23 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1639:


Summary: New logical plan: PushUpFilter should not push before 
group/cogroup if filter condition contains UDF  (was: New logical plan: 
PushUpFilter should not optimize if filter condition contains UDF)

 New logical plan: PushUpFilter should not push before group/cogroup if filter 
 condition contains UDF
 

 Key: PIG-1639
 URL: https://issues.apache.org/jira/browse/PIG-1639
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.8.0
Reporter: Daniel Dai
Assignee: Xuefu Zhang
 Fix For: 0.8.0

 Attachments: jira-1639-1.patch


 The following script fails:
 {code}
 a = load 'file' AS (f1, f2, f3);
 b = group a by f1;
 c = filter b by COUNT(a)  1;
 dump c;
 {code}




[jira] Commented: (PIG-1639) New logical plan: PushUpFilter should not push before group/cogroup if filter condition contains UDF

2010-09-23 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12914154#action_12914154
 ] 

Daniel Dai commented on PIG-1639:
-

+1 if all tests pass.

 New logical plan: PushUpFilter should not push before group/cogroup if filter 
 condition contains UDF
 

 Key: PIG-1639
 URL: https://issues.apache.org/jira/browse/PIG-1639
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.8.0
Reporter: Daniel Dai
Assignee: Xuefu Zhang
 Fix For: 0.8.0

 Attachments: jira-1639-1.patch





[jira] Commented: (PIG-1644) New logical plan: Plan.connect with position is misused in some places

2010-09-23 Thread Thejas M Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12914167#action_12914167
 ] 

Thejas M Nair commented on PIG-1644:


I think insertAsPredecessor and insertAsSuccessor (instead of insertBefore and 
insertAfter) would convey what the methods do a little better. 


 New logical plan: Plan.connect with position is misused in some places
 --

 Key: PIG-1644
 URL: https://issues.apache.org/jira/browse/PIG-1644
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.8.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.8.0

 Attachments: PIG-1644-1.patch


 When we replace/remove/insert a node, we will use disconnect/connect methods 
 of OperatorPlan. When we disconnect an edge, we shall save the position of 
 the edge in origination and destination, and use this position when connect 
 to the new predecessor/successor. Some of the pattens are:
 Insert a new node:
 {code}
 PairInteger, Integer pos = plan.disconnect(pred, succ);
 plan.connect(pred, pos.first, newnode, 0);
 plan.connect(newnode, 0, succ, pos.second);
 {code}
 Remove a node:
 {code}
 PairInteger, Integer pos1 = plan.disconnect(pred, nodeToRemove);
 PairInteger, Integer pos2 = plan.disconnect(nodeToRemove, succ);
 plan.connect(pred, pos1.first, succ, pos2.second);
 {code}
 Replace a node:
 {code}
 PairInteger, Integer pos1 = plan.disconnect(pred, nodeToReplace);
 PairInteger, Integer pos2 = plan.disconnect(nodeToReplace, succ);
 plan.connect(pred, pos1.first, newNode, pos1.second);
 plan.connect(newNode, pos2.first, succ, pos2.second);
 {code}
 There are a couple of places where we do not follow this pattern, which results 
 in errors. For example, the following script fails:
 {code}
 a = load '1.txt' as (a0, a1, a2, a3);
 b = foreach a generate a0, a1, a2;
 store b into 'aaa';
 c = order b by a2;
 d = foreach c generate a2;
 store d into 'bbb';
 {code}
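
 The remove and replace patterns above can be sketched with a toy plan. This 
 is only an illustration under stated assumptions: ToyPlan, Pos, remove and 
 replace are hypothetical names, nodes are plain strings, and the class is a 
 stand-in for, not a copy of, Pig's OperatorPlan. It assumes the node being 
 removed or replaced has exactly one predecessor and one successor.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy plan with ordered edge lists; a stand-in for illustration, not Pig's OperatorPlan. */
final class ToyPlan {
    /** Where a removed edge sat: index among from's successors, index among to's predecessors. */
    static final class Pos {
        final int first;
        final int second;
        Pos(int first, int second) { this.first = first; this.second = second; }
    }

    private final Map<String, List<String>> succs = new HashMap<>();
    private final Map<String, List<String>> preds = new HashMap<>();

    List<String> successors(String node) {
        return succs.computeIfAbsent(node, k -> new ArrayList<>());
    }

    List<String> predecessors(String node) {
        return preds.computeIfAbsent(node, k -> new ArrayList<>());
    }

    /** Adds an edge at the end of both edge lists. */
    void connect(String from, String to) {
        successors(from).add(to);
        predecessors(to).add(from);
    }

    /** Adds an edge at explicit positions in both edge lists. */
    void connect(String from, int fromPos, String to, int toPos) {
        successors(from).add(fromPos, to);
        predecessors(to).add(toPos, from);
    }

    /** Removes an edge and reports the position it occupied on each side. */
    Pos disconnect(String from, String to) {
        int fromPos = successors(from).indexOf(to);
        int toPos = predecessors(to).indexOf(from);
        if (fromPos < 0 || toPos < 0) {
            throw new IllegalArgumentException("no edge " + from + " -> " + to);
        }
        successors(from).remove(fromPos);
        predecessors(to).remove(toPos);
        return new Pos(fromPos, toPos);
    }

    /** Removes a node with one predecessor and one successor, keeping edge positions. */
    void remove(String pred, String node, String succ) {
        Pos pos1 = disconnect(pred, node);
        Pos pos2 = disconnect(node, succ);
        connect(pred, pos1.first, succ, pos2.second);
    }

    /** Replaces a node with one predecessor and one successor, keeping edge positions. */
    void replace(String pred, String oldNode, String newNode, String succ) {
        Pos pos1 = disconnect(pred, oldNode);
        Pos pos2 = disconnect(oldNode, succ);
        connect(pred, pos1.first, newNode, pos1.second);
        connect(newNode, pos2.first, succ, pos2.second);
    }

    public static void main(String[] args) {
        ToyPlan plan = new ToyPlan();
        plan.connect("load", "filter");
        plan.connect("filter", "join"); // join input 0
        plan.connect("other", "join");  // join input 1
        plan.replace("load", "filter", "limit", "join");
        System.out.println(plan.predecessors("join")); // prints [limit, other]
    }
}
```

 Reconnecting with the plain two-argument connect instead would append the 
 new edge, leaving the join's inputs as [other, limit]; that kind of silent 
 reordering is what the saved positions are meant to prevent.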




[jira] Updated: (PIG-1632) The core jar in the tarball contains the kitchen sink

2010-09-23 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1632:


Status: Resolved  (was: Patch Available)
Resolution: Fixed

 The core jar in the tarball contains the kitchen sink 
 --

 Key: PIG-1632
 URL: https://issues.apache.org/jira/browse/PIG-1632
 Project: Pig
  Issue Type: Bug
  Components: build
Affects Versions: 0.8.0, 0.9.0
Reporter: Eli Collins
Assignee: Eli Collins
 Fix For: site, 0.9.0

 Attachments: pig-1632-1.patch, pig-1632-2.patch


 The core jar in the tarball contains the kitchen sink; it's not the same core 
 jar built by ant jar. This is problematic since other projects that want to 
 depend on the pig core jar just want pig core, but 
 pig-0.8.0-SNAPSHOT-core.jar in the tarball contains a bunch of other stuff 
 (hadoop, com.google, commons, etc) that may conflict with the packages also 
 on a user's classpath.
 {noformat}
 pig1 (trunk)$ jar tvf build/pig-0.8.0-SNAPSHOT-core.jar |grep -v pig|wc -l
 12
 pig1 (trunk)$ tar xvzf build/pig-0.8.0-SNAPSHOT.tar.gz
 ...
 pig1 (trunk)$ jar tvf pig-0.8.0-SNAPSHOT/pig-0.8.0-SNAPSHOT-core.jar |grep -v pig|wc -l
 4819
 {noformat}
 How about restricting the core jar to just Pig classes?
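
 The same check can be approximated in Java, which may be handy in a build 
 test. This is a rough sketch, not part of the attached patches: the class 
 and method names are hypothetical, and it matches on entry names only, 
 whereas the shell pipeline above greps the whole jar tvf output line.

```java
import java.io.IOException;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;
import java.util.stream.Collectors;

/** Counts jar entries whose names do not mention "pig" (hypothetical helper). */
final class JarContentCheck {
    /** Rough equivalent of "grep -v pig | wc -l" applied to entry names. */
    static long countNonPig(List<String> entryNames) {
        return entryNames.stream()
                .filter(name -> !name.contains("pig"))
                .count();
    }

    /** Lists all entry names in the jar at the given path. */
    static List<String> entryNames(String jarPath) throws IOException {
        try (JarFile jar = new JarFile(jarPath)) {
            return jar.stream().map(JarEntry::getName).collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: JarContentCheck <path-to-jar>");
            return;
        }
        System.out.println(countNonPig(entryNames(args[0])));
    }
}
```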




[jira] Updated: (PIG-1643) join fails for a query with input having 'load using pigstorage without schema' + 'foreach'

2010-09-23 Thread Thejas M Nair (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thejas M Nair updated PIG-1643:
---

  Status: Resolved  (was: Patch Available)
Hadoop Flags: [Reviewed]
  Resolution: Fixed

Tests passed.
Patch committed to 0.8 branch and trunk.


 join fails for a query with input having 'load using pigstorage without 
 schema' + 'foreach'
 ---

 Key: PIG-1643
 URL: https://issues.apache.org/jira/browse/PIG-1643
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Fix For: 0.8.0

 Attachments: PIG-1643.1.patch


 {code}
 l1 = load 'std.txt';
 l2 = load 'std.txt'; 
 f1 = foreach l1 generate $0 as abc, $1 as  def;
 -- j =  join f1 by $0, l2 by $0 using 'replicated';
 -- j =  join l2 by $0, f1 by $0 using 'replicated';
 j =  join l2 by $0, f1 by $0 ;
 dump j;
 {code}
 the error -
 {code}
 2010-09-22 16:24:48,584 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2044: The type null cannot be collected as a Key type
 {code}
 The MR plan from explain  -
 {code}
 #--
 # Map Reduce Plan  
 #--
 MapReduce node scope-21
 Map Plan
 Union[tuple] - scope-22
 |
 |---j: Local Rearrange[tuple]{bytearray}(false) - scope-11
 |   |   |
 |   |   Project[bytearray][0] - scope-12
 |   |
 |   |---l2: Load(file:///Users/tejas/pig_obyfail/trunk/std.txt:org.apache.pig.builtin.PigStorage) - scope-0
 |
 |---j: Local Rearrange[tuple]{NULL}(false) - scope-13
 |   |
 |   Project[NULL][0] - scope-14
 |
 |---f1: New For Each(false,false)[bag] - scope-6
 |   |
 |   Project[bytearray][0] - scope-2
 |   |
 |   Project[bytearray][1] - scope-4
 |
 |---l1: Load(file:///Users/tejas/pig_obyfail/trunk/std.txt:org.apache.pig.builtin.PigStorage) - scope-1
 Reduce Plan
 j: Store(/tmp/x:org.apache.pig.builtin.PigStorage) - scope-18
 |
 |---POJoinPackage(true,true)[tuple] - scope-23
 Global sort: false
 
 {code}




[jira] Commented: (PIG-1644) New logical plan: Plan.connect with position is misused in some places

2010-09-23 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914317#action_12914317
 ] 

Daniel Dai commented on PIG-1644:
-

After looking into the existing code, it seems insertBetween is a more useful 
method, so I want to drop insertBefore/insertAfter and add insertBetween:
{code}
insertBetween(Operator pred, Operator operatorToInsert, Operator succ)
{code}
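
One way such a method could be built on disconnect/connect is sketched below. 
This is a guess at the shape, not the code in the patch: InsertBetweenSketch 
is a hypothetical toy class, nodes are plain strings instead of Operator 
objects, and the body simply follows the "insert a new node" pattern from the 
issue description.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy plan illustrating a position-preserving insertBetween; not Pig's OperatorPlan. */
final class InsertBetweenSketch {
    private final Map<String, List<String>> succs = new HashMap<>();
    private final Map<String, List<String>> preds = new HashMap<>();

    List<String> successors(String node) {
        return succs.computeIfAbsent(node, k -> new ArrayList<>());
    }

    List<String> predecessors(String node) {
        return preds.computeIfAbsent(node, k -> new ArrayList<>());
    }

    /** Adds an edge at the end of both edge lists. */
    void connect(String from, String to) {
        successors(from).add(to);
        predecessors(to).add(from);
    }

    /** Adds an edge at explicit positions in both edge lists. */
    void connect(String from, int fromPos, String to, int toPos) {
        successors(from).add(fromPos, to);
        predecessors(to).add(toPos, from);
    }

    /** Removes an edge; returns {position in from's successors, position in to's predecessors}. */
    int[] disconnect(String from, String to) {
        int fromPos = successors(from).indexOf(to);
        int toPos = predecessors(to).indexOf(from);
        if (fromPos < 0 || toPos < 0) {
            throw new IllegalArgumentException("no edge " + from + " -> " + to);
        }
        successors(from).remove(fromPos);
        predecessors(to).remove(toPos);
        return new int[] {fromPos, toPos};
    }

    /** Splices a fresh node into the edge pred -> succ without reordering either side. */
    void insertBetween(String pred, String operatorToInsert, String succ) {
        int[] pos = disconnect(pred, succ);
        connect(pred, pos[0], operatorToInsert, 0);
        connect(operatorToInsert, 0, succ, pos[1]);
    }

    public static void main(String[] args) {
        InsertBetweenSketch plan = new InsertBetweenSketch();
        plan.connect("load", "a");
        plan.connect("load", "b");
        plan.insertBetween("load", "x", "a");
        System.out.println(plan.successors("load")); // prints [x, b]
    }
}
```

Taking both neighbours as arguments means the caller names the exact edge to 
split, which stays unambiguous when pred has several successors or succ has 
several predecessors.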

 New logical plan: Plan.connect with position is misused in some places
 --

 Key: PIG-1644
 URL: https://issues.apache.org/jira/browse/PIG-1644
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.8.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.8.0

 Attachments: PIG-1644-1.patch






[jira] Updated: (PIG-1644) New logical plan: Plan.connect with position is misused in some places

2010-09-23 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1644:


Attachment: PIG-1644-2.patch

Attached the patch with the new methods and a refactoring of the existing code.

 New logical plan: Plan.connect with position is misused in some places
 --

 Key: PIG-1644
 URL: https://issues.apache.org/jira/browse/PIG-1644
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.8.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.8.0

 Attachments: PIG-1644-1.patch, PIG-1644-2.patch


