date:20100923

[
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Yan Zhou updated PIG-1518:
--

Release Note:
Feature: combine splits of sizes smaller than the value of the property
pig.maxCombinedSplitSize or, if pig.maxCombinedSplitSize is not set, the file
system default block size of the load's location. This feature can be turned
off by setting the property pig.splitCombination to false. When such a
combination is performed, a message like "Total input paths (combined) to
process : 7" will be logged.

This feature applies when a user input, or an intermediate input, has many
small files to be loaded that would otherwise cause many more under-fed
mappers to be launched and potentially slow down the execution.
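
To illustrate the idea, here is a minimal, self-contained Java sketch of greedy split combination. It is not Pig's implementation: the class name SplitCombinerSketch, the method combine, and the first-fit packing strategy are assumptions made for illustration. Only the size threshold corresponds to the pig.maxCombinedSplitSize property described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical illustration only; not Pig's actual split combiner. */
public class SplitCombinerSketch {

    /**
     * Greedily packs split sizes (in bytes) into groups whose total does not
     * exceed maxCombinedSize. Splits at or above the threshold stay alone.
     */
    public static List<List<Long>> combine(List<Long> splitSizes, long maxCombinedSize) {
        List<List<Long>> combined = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentSize = 0;
        for (Long size : splitSizes) {
            if (size >= maxCombinedSize) {
                // Large splits are never combined with anything else.
                combined.add(Collections.singletonList(size));
                continue;
            }
            if (!current.isEmpty() && currentSize + size > maxCombinedSize) {
                // The current group is full; close it and start a new one.
                combined.add(current);
                current = new ArrayList<>();
                currentSize = 0;
            }
            current.add(size);
            currentSize += size;
        }
        if (!current.isEmpty()) {
            combined.add(current);
        }
        return combined;
    }

    public static void main(String[] args) {
        // Six input splits; the threshold of 100 bytes is an arbitrary example value.
        List<List<Long>> result = combine(Arrays.asList(10L, 20L, 30L, 200L, 50L, 60L), 100L);
        System.out.println("combined splits: " + result.size());
    }
}
```

With fewer, larger combined splits, fewer mappers are launched for the same input.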

This change will not cause any backward compatibility issue unless a loader
implementation makes use of the PigSplit object passed through the
prepareToRead method, in which case a rebuild of the loader might be
necessary because PigSplit's definition has been modified. However, we
currently know of no external use of the object.

This change also requires the loader to be stateless across the invocations to
the prepareToRead method. That is, the method should reset any internal states
that are not affected by the RecordReader argument.
Otherwise, this feature should be disabled.
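
The statelessness requirement can be sketched as follows. This is a hypothetical stand-in, not a real Pig loader: the class StatelessLoaderSketch and its simplified prepareToRead(Iterator) signature are assumptions, with the Iterator playing the role of the RecordReader argument. The point is only that any field not overwritten by the new reader is reset on every call.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical stand-in for a loader; it does not use Pig's LoadFunc API. */
public class StatelessLoaderSketch {
    private Iterator<String> reader;   // plays the role of the RecordReader argument
    private long recordsSeen;          // internal state NOT derived from the reader

    /** Called once per underlying split, possibly several times per combined split. */
    public void prepareToRead(Iterator<String> newReader) {
        this.reader = newReader;
        // Reset everything that the new reader does not overwrite by itself.
        this.recordsSeen = 0;
    }

    /** Returns the next record tagged with its position in the current split, or null. */
    public String getNext() {
        if (reader == null || !reader.hasNext()) {
            return null;
        }
        recordsSeen++;
        return recordsSeen + ":" + reader.next();
    }

    public static void main(String[] args) {
        StatelessLoaderSketch loader = new StatelessLoaderSketch();
        // One combined split made of two underlying splits.
        List<List<String>> combinedSplit = Arrays.asList(
                Arrays.asList("a", "b"), Arrays.asList("c"));
        for (List<String> split : combinedSplit) {
            loader.prepareToRead(split.iterator());
            for (String rec = loader.getNext(); rec != null; rec = loader.getNext()) {
                System.out.println(rec);
            }
        }
    }
}
```

Without the reset in prepareToRead, the counter would carry over from the first underlying split into the second.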

In addition, if a loader implements IndexableLoadFunc, or implements
OrderedLoadFunc and CollectableLoadFunc, its input splits won't be subject to
possible combinations.

was:
Feature: combine splits of sizes smaller than the value of property
pig.maxCombinedSplitSize or, if the property of pig.maxCombinedSplitSize is
not set, the file system default block size of the load's location. This
feature can be turned off through setting the property pig.noSplitCombination
to true. When such a combination is performed, a log message like Total input
paths (combined) to process : 7 will be logged.

In addition, if a loader implements IndexableLoadFunc, or implements
OrderedLoadFunc and CollectableLoadFunc, its input splits won't be subject to
possible combinations.

multi file input format for loaders
---

Key: PIG-1518
URL: https://issues.apache.org/jira/browse/PIG-1518
Project: Pig
Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Yan Zhou
Fix For: 0.8.0

Attachments: PIG-1518-0.7.0.patch, PIG-1518.patch, PIG-1518.patch,
PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch,
PIG-1518.patch, PIG-1518.patch

We frequently run into the situation where Pig needs to deal with small files
in the input. In this case a separate map is created for each file, which
can be very inefficient.
It would be great to have an umbrella input format that can take multiple
files and use them in a single split. We would like to see this working with
different data formats if possible.
There are already a couple of input formats doing something similar:
MultifileInputFormat as well as CombinedInputFormat; however, neither works
with the new Hadoop 20 API.
We at least want to do a feasibility study for Pig 0.8.0.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Assigned: (PIG-1642) Order by doesn't use estimation to determine the parallelism


 [ 
https://issues.apache.org/jira/browse/PIG-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding reassigned PIG-1642:
-

Assignee: Richard Ding

 Order by doesn't use estimation to determine the parallelism
 

 Key: PIG-1642
 URL: https://issues.apache.org/jira/browse/PIG-1642
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Richard Ding
Assignee: Richard Ding
 Fix For: 0.8.0


 With PIG-1249, a simple heuristic is used to determine the number of reducers 
 if it isn't specified (via PARALLEL or default_parallel). For an ORDER BY 
 statement, however, the number of reducers still defaults to 1.
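
 As a rough illustration of what a size-based heuristic of this kind can look like, here is a small Java sketch. The message above does not spell out the PIG-1249 formula, so the formula, the class name ReducerEstimateSketch, and all parameter values below are assumptions for illustration, not Pig's code.

```java
/** Hypothetical sketch of a size-based reducer estimate; not Pig's code. */
public class ReducerEstimateSketch {

    /**
     * Estimates reducers as ceil(totalInputBytes / bytesPerReducer),
     * clamped to the range [1, maxReducers].
     */
    public static int estimateReducers(long totalInputBytes, long bytesPerReducer, int maxReducers) {
        if (bytesPerReducer <= 0 || maxReducers <= 0) {
            throw new IllegalArgumentException("bytesPerReducer and maxReducers must be positive");
        }
        // Integer ceiling division.
        long needed = (totalInputBytes + bytesPerReducer - 1) / bytesPerReducer;
        return (int) Math.max(1, Math.min(maxReducers, needed));
    }

    public static void main(String[] args) {
        // Example values only: 20 GB of input, 1 GB per reducer, at most 999 reducers.
        System.out.println(estimateReducers(20_000_000_000L, 1_000_000_000L, 999));
    }
}
```

 The bug described here is that ORDER BY bypasses any such estimate and always uses a single reducer.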


[jira] Updated: (PIG-1641) Incorrect counters in local mode


 [ 
https://issues.apache.org/jira/browse/PIG-1641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1641:
--

Fix Version/s: 0.8.0

 Incorrect counters in local mode
 

 Key: PIG-1641
 URL: https://issues.apache.org/jira/browse/PIG-1641
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Ashutosh Chauhan
Assignee: Richard Ding
 Fix For: 0.8.0


 User report, not verified.
 <email>
 HadoopVersion  PigVersion      UserId  StartedAt            FinishedAt           Features
 0.20.2         0.8.0-SNAPSHOT  user    2010-09-21 19:25:58  2010-09-21 21:58:42  ORDER_BY
 Success!
 Job Stats (time in seconds):
 JobId           Maps  Reduces  MaxMapTime  MinMapTIme  AvgMapTime  MaxReduceTime  MinReduceTime  AvgReduceTime  Alias      Feature   Outputs
 job_local_0001  0     0        0           0           0           0              0              0              raw        MAP_ONLY
 job_local_0002  0     0        0           0           0           0              0              0              rank_sort  SAMPLER
 job_local_0003  0     0        0           0           0           0              0              0              rank_sort  ORDER_BY  Processed/user_visits_table,
 Input(s):
 Successfully read 0 records from: Data/Raw/UserVisits.dat
 Output(s):
 Successfully stored 0 records in: Processed/user_visits_table
 However, when I look in the output:
 $ ls -lh Processed/user_visits_table/CG0/
 total 15250760
 -rwxrwxrwx  1 user  _lpoperator   7.3G Sep 21 21:58 part-0*
 It read a 20G input file and generated some output...
 </email>
 Is it that in local mode counters are not available? If so, instead of 
 printing zeros we should print "Information Unavailable" or some such.


[jira] Updated: (PIG-1641) Incorrect counters in local mode


 [ 
https://issues.apache.org/jira/browse/PIG-1641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1641:
--

Status: Patch Available  (was: Open)

 Incorrect counters in local mode
 

 Key: PIG-1641
 URL: https://issues.apache.org/jira/browse/PIG-1641
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Ashutosh Chauhan
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1641.patch



[jira] Updated: (PIG-1641) Incorrect counters in local mode


 [ 
https://issues.apache.org/jira/browse/PIG-1641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1641:
--

Attachment: PIG-1641.patch

 Incorrect counters in local mode
 

 Key: PIG-1641
 URL: https://issues.apache.org/jira/browse/PIG-1641
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Ashutosh Chauhan
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1641.patch



[jira] Updated: (PIG-1643) join fails for a query with input having 'load using pigstorage without schema' + 'foreach'


 [ 
https://issues.apache.org/jira/browse/PIG-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thejas M Nair updated PIG-1643:
---

Attachment: PIG-1643.1.patch

PIG-1643.1.patch
There was a code path that led to fields having the NULL datatype instead of 
the default datatype of BYTEARRAY. That was causing these failures. 
Test-patch has succeeded; unit tests are running.


 join fails for a query with input having 'load using pigstorage without 
 schema' + 'foreach'
 ---

 Key: PIG-1643
 URL: https://issues.apache.org/jira/browse/PIG-1643
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Fix For: 0.8.0

 Attachments: PIG-1643.1.patch


 {code}
 l1 = load 'std.txt';
 l2 = load 'std.txt'; 
 f1 = foreach l1 generate $0 as abc, $1 as  def;
 -- j =  join f1 by $0, l2 by $0 using 'replicated';
 -- j =  join l2 by $0, f1 by $0 using 'replicated';
 j =  join l2 by $0, f1 by $0 ;
 dump j;
 {code}
 the error -
 {code}
 2010-09-22 16:24:48,584 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2044: The type null cannot be collected as a Key type
 {code}
 The MR plan from explain  -
 {code}
 #--
 # Map Reduce Plan  
 #--
 MapReduce node scope-21
 Map Plan
 Union[tuple] - scope-22
 |
 |---j: Local Rearrange[tuple]{bytearray}(false) - scope-11
 |   |   |
 |   |   Project[bytearray][0] - scope-12
 |   |
 |   |---l2: Load(file:///Users/tejas/pig_obyfail/trunk/std.txt:org.apache.pig.builtin.PigStorage) - scope-0
 |
 |---j: Local Rearrange[tuple]{NULL}(false) - scope-13
 |   |
 |   Project[NULL][0] - scope-14
 |
 |---f1: New For Each(false,false)[bag] - scope-6
 |   |
 |   Project[bytearray][0] - scope-2
 |   |
 |   Project[bytearray][1] - scope-4
 |
 |---l1: Load(file:///Users/tejas/pig_obyfail/trunk/std.txt:org.apache.pig.builtin.PigStorage) - scope-1
 Reduce Plan
 j: Store(/tmp/x:org.apache.pig.builtin.PigStorage) - scope-18
 |
 |---POJoinPackage(true,true)[tuple] - scope-23
 Global sort: false
 
 {code}


[jira] Updated: (PIG-1643) join fails for a query with input having 'load using pigstorage without schema' + 'foreach'


 [ 
https://issues.apache.org/jira/browse/PIG-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thejas M Nair updated PIG-1643:
---

Status: Patch Available  (was: Open)

 join fails for a query with input having 'load using pigstorage without 
 schema' + 'foreach'
 ---

 Key: PIG-1643
 URL: https://issues.apache.org/jira/browse/PIG-1643
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Fix For: 0.8.0

 Attachments: PIG-1643.1.patch



[jira] Commented: (PIG-1643) join fails for a query with input having 'load using pigstorage without schema' + 'foreach'


[ 
https://issues.apache.org/jira/browse/PIG-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914126#action_12914126
 ] 

Daniel Dai commented on PIG-1643:
-

+1 if tests pass.

 join fails for a query with input having 'load using pigstorage without 
 schema' + 'foreach'
 ---

 Key: PIG-1643
 URL: https://issues.apache.org/jira/browse/PIG-1643
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Fix For: 0.8.0

 Attachments: PIG-1643.1.patch



[jira] Commented: (PIG-1645) Using both small split combination and temporary file compression on a query of ORDER BY may cause crash


[ 
https://issues.apache.org/jira/browse/PIG-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914128#action_12914128
 ] 

Yan Zhou commented on PIG-1645:
---

The problem is that both RandomSampleLoader and PoissonSampleLoader keep 
internal state from previous invocations. That state should be reset whenever 
a different underlying split is processed under the same umbrella split, which 
happens when split combination (PIG-1518) is on.

When temporary file compression is disabled, Pig's internal storage creates 
empty files, which the split combiner discards. The single non-empty split is 
then the only split to be worked on, so the problem does not occur in that case.

 Using both small split combination and temporary file compression on a query 
 of ORDER BY may cause crash
 

 Key: PIG-1645
 URL: https://issues.apache.org/jira/browse/PIG-1645
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Yan Zhou
Assignee: Yan Zhou
 Fix For: 0.8.0


 The stack looks like the following:
 java.lang.NullPointerException
   at java.util.Arrays.binarySearch(Arrays.java:2043)
   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.getPartition(WeightedRangePartitioner.java:72)
   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.getPartition(WeightedRangePartitioner.java:52)
   at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:565)
   at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:116)
   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:238)
   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231)
   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53)
   at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:638)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:314)
   at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:396)
   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1062)
   at org.apache.hadoop.mapred.Child.main(Child.java:211)


[jira] Commented: (PIG-1644) New logical plan: Plan.connect with position is misused in some places


[ 
https://issues.apache.org/jira/browse/PIG-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914130#action_12914130
 ] 

Thejas M Nair commented on PIG-1644:


These operations will be fairly common in the optimizer. I think it would be 
good to have functions in OperatorPlan that support these operations; that 
would reduce the chance of bugs and also make the code more readable.


 New logical plan: Plan.connect with position is misused in some places
 --

 Key: PIG-1644
 URL: https://issues.apache.org/jira/browse/PIG-1644
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.8.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.8.0

 Attachments: PIG-1644-1.patch


 When we replace/remove/insert a node, we use the disconnect/connect methods 
 of OperatorPlan. When we disconnect an edge, we should save the position of 
 the edge at its origin and destination, and use those positions when 
 connecting to the new predecessor/successor. Some of the patterns are:
 Insert a new node:
 {code}
 Pair<Integer, Integer> pos = plan.disconnect(pred, succ);
 plan.connect(pred, pos.first, newnode, 0);
 plan.connect(newnode, 0, succ, pos.second);
 {code}
 Remove a node:
 {code}
 Pair<Integer, Integer> pos1 = plan.disconnect(pred, nodeToRemove);
 Pair<Integer, Integer> pos2 = plan.disconnect(nodeToRemove, succ);
 plan.connect(pred, pos1.first, succ, pos2.second);
 {code}
 Replace a node:
 {code}
 Pair<Integer, Integer> pos1 = plan.disconnect(pred, nodeToReplace);
 Pair<Integer, Integer> pos2 = plan.disconnect(nodeToReplace, succ);
 plan.connect(pred, pos1.first, newNode, pos1.second);
 plan.connect(newNode, pos2.first, succ, pos2.second);
 {code}
 There are a couple of places where we do not follow this pattern, which 
 results in errors. For example, the following script fails:
 {code}
 a = load '1.txt' as (a0, a1, a2, a3);
 b = foreach a generate a0, a1, a2;
 store b into 'aaa';
 c = order b by a2;
 d = foreach c generate a2;
 store d into 'bbb';
 {code}


[jira] Commented: (PIG-1635) Logical simplifier does not simplify away constants under AND and OR; after simplificaion the ordering of operands of AND and OR may get changed


[ 
https://issues.apache.org/jira/browse/PIG-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914145#action_12914145
 ] 

Yan Zhou commented on PIG-1635:
---

test-patch results:

 [exec] +1 overall.
 [exec]
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec]
 [exec] +1 tests included.  The patch appears to include 3 new or modified tests.
 [exec]
 [exec] +1 javadoc.  The javadoc tool did not generate any warning messages.
 [exec]
 [exec] +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
 [exec]
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs warnings.
 [exec]
 [exec] +1 release audit.  The applied patch does not increase the total number of release audit warnings.

 Logical simplifier does not simplify away constants under AND and OR; after 
 simplificaion the ordering of operands of AND and OR may get changed
 

 Key: PIG-1635
 URL: https://issues.apache.org/jira/browse/PIG-1635
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.8.0
Reporter: Yan Zhou
Assignee: Yan Zhou
Priority: Minor
 Fix For: 0.8.0

 Attachments: PIG-1635.patch


 b = FILTER a by (( f1  1) AND (1 == 1))
 or 
 b = FILTER a by ((f1  1) OR ( 1==0))
 should be simplified to
 b = FILTER a by f1  1;
 Regarding the ordering change, an example is 
 b = filter a by ((f1 is not null) AND (f2 is not null));
 Even without any possible simplification, the expression is changed to
 b = filter a by ((f2 is not null) AND (f1 is not null));
 The ordering change in this case, and probably in most other cases, makes no 
 difference. Still, some users might care about the ordering for two reasons: 
 stateful UDFs may be used as operands of AND or OR, and the ordering may be 
 intended by the application designer to maximize the chances of 
 short-circuiting the composite boolean evaluation.
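
 The desired behavior can be sketched with a tiny, self-contained Java expression tree. This is a hypothetical illustration and does not use Pig's logical plan classes: the names BooleanSimplifierSketch, Expr and simplify are assumptions, and leaves are opaque named predicates. Constants under AND and OR are removed, and the operand order is never changed.

```java
/** Hypothetical mini expression tree; it does not use Pig's logical plan classes. */
public class BooleanSimplifierSketch {

    public enum Kind { AND, OR, CONST, LEAF }

    public static final class Expr {
        final Kind kind;
        final Expr left, right;   // set for AND / OR
        final boolean value;      // set for CONST
        final String name;        // set for LEAF, an opaque predicate

        private Expr(Kind kind, Expr left, Expr right, boolean value, String name) {
            this.kind = kind; this.left = left; this.right = right;
            this.value = value; this.name = name;
        }
        public static Expr and(Expr l, Expr r) { return new Expr(Kind.AND, l, r, false, null); }
        public static Expr or(Expr l, Expr r) { return new Expr(Kind.OR, l, r, false, null); }
        public static Expr constant(boolean v) { return new Expr(Kind.CONST, null, null, v, null); }
        public static Expr leaf(String name) { return new Expr(Kind.LEAF, null, null, false, name); }

        @Override public String toString() {
            switch (kind) {
                case CONST: return String.valueOf(value);
                case LEAF:  return name;
                default:    return "(" + left + " " + kind + " " + right + ")";
            }
        }
    }

    /** Removes constants under AND / OR while keeping the operand order. */
    public static Expr simplify(Expr e) {
        if (e.kind != Kind.AND && e.kind != Kind.OR) {
            return e;
        }
        Expr l = simplify(e.left);
        Expr r = simplify(e.right);
        // Identity element: x AND true = x, x OR false = x.
        boolean identity = (e.kind == Kind.AND);
        if (l.kind == Kind.CONST) {
            // Otherwise the constant absorbs: false AND x = false, true OR x = true.
            return l.value == identity ? r : l;
        }
        if (r.kind == Kind.CONST) {
            return r.value == identity ? l : r;
        }
        // No constant involved: rebuild with left and right in their original order.
        return e.kind == Kind.AND ? Expr.and(l, r) : Expr.or(l, r);
    }

    public static void main(String[] args) {
        Expr p = Expr.leaf("p");
        System.out.println(simplify(Expr.and(p, Expr.constant(true))));
        System.out.println(simplify(Expr.or(p, Expr.constant(false))));
        System.out.println(simplify(Expr.and(Expr.leaf("f1 is not null"), Expr.leaf("f2 is not null"))));
    }
}
```

 Note that absorbing an operand (for example x AND false) skips evaluating x, which matters if x is a stateful UDF; a real simplifier may want to be more conservative there.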


[jira] Commented: (PIG-1644) New logical plan: Plan.connect with position is misused in some places


[ 
https://issues.apache.org/jira/browse/PIG-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914147#action_12914147
 ] 

Daniel Dai commented on PIG-1644:
-

Yes, I think we can do replace/remove/insert. They should be simple and clear 
enough to use. Here are the new methods to be added to OperatorPlan:
{code}
replace(Operator oldOperator, Operator newOperator)
// Connect all its successors to its predecessor / connect all its predecessors to its successor
remove(Operator operatorToRemove)
// Insert operatorToInsert before pos; connect all of pos's predecessors to operatorToInsert
insertBefore(Operator operatorToInsert, Operator pos)
// Insert operatorToInsert after pos; connect operatorToInsert to all of pos's successors
insertAfter(Operator operatorToInsert, Operator pos)
{code}

How does it sound?
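
To make the position-preserving idea concrete, here is a minimal Java sketch over String nodes. It is not Pig's OperatorPlan: the class PlanSketch, the int[] return value standing in for a Pair of positions, and the helper insertBetween are assumptions for illustration. The replace helper hides the disconnect/connect bookkeeping so that callers cannot get the positions wrong.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical mini plan over String nodes; it does not use Pig's OperatorPlan. */
public class PlanSketch {
    private final Map<String, List<String>> succs = new HashMap<>();
    private final Map<String, List<String>> preds = new HashMap<>();

    public List<String> successors(String n) { return succs.computeIfAbsent(n, k -> new ArrayList<>()); }
    public List<String> predecessors(String n) { return preds.computeIfAbsent(n, k -> new ArrayList<>()); }

    /** Appends an edge from -> to at the end of both edge lists. */
    public void connect(String from, String to) {
        connect(from, successors(from).size(), to, predecessors(to).size());
    }

    /** Adds an edge at explicit positions in from's successors and to's predecessors. */
    public void connect(String from, int fromPos, String to, int toPos) {
        successors(from).add(fromPos, to);
        predecessors(to).add(toPos, from);
    }

    /** Removes the edge; returns {position in from's successors, position in to's predecessors}. */
    public int[] disconnect(String from, String to) {
        int fromPos = successors(from).indexOf(to);
        int toPos = predecessors(to).indexOf(from);
        if (fromPos < 0 || toPos < 0) {
            throw new IllegalArgumentException("no edge " + from + " -> " + to);
        }
        successors(from).remove(fromPos);
        predecessors(to).remove(toPos);
        return new int[] {fromPos, toPos};
    }

    /** Inserts newNode on the edge pred -> succ, reusing the saved positions. */
    public void insertBetween(String pred, String newNode, String succ) {
        int[] pos = disconnect(pred, succ);
        connect(pred, pos[0], newNode, 0);
        connect(newNode, 0, succ, pos[1]);
    }

    /** Replaces oldNode by newNode, keeping every edge at its original position. */
    public void replace(String oldNode, String newNode) {
        List<String> oldPreds = new ArrayList<>(predecessors(oldNode));
        for (int i = 0; i < oldPreds.size(); i++) {
            int[] pos = disconnect(oldPreds.get(i), oldNode);
            // i is the edge's original index among oldNode's predecessors.
            connect(oldPreds.get(i), pos[0], newNode, i);
        }
        List<String> oldSuccs = new ArrayList<>(successors(oldNode));
        for (int i = 0; i < oldSuccs.size(); i++) {
            int[] pos = disconnect(oldNode, oldSuccs.get(i));
            connect(newNode, i, oldSuccs.get(i), pos[1]);
        }
    }

    public static void main(String[] args) {
        PlanSketch plan = new PlanSketch();
        plan.connect("a", "b");
        plan.connect("a", "c");
        plan.connect("b", "d");
        plan.replace("b", "x");
        System.out.println(plan.successors("a"));
    }
}
```

In the usage above, "x" takes over the first successor slot of "a" that "b" occupied, rather than being appended after "c".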

 New logical plan: Plan.connect with position is misused in some places
 --

 Key: PIG-1644
 URL: https://issues.apache.org/jira/browse/PIG-1644
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.8.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.8.0

 Attachments: PIG-1644-1.patch



[jira] Commented: (PIG-1635) Logical simplifier does not simplify away constants under AND and OR; after simplificaion the ordering of operands of AND and OR may get changed


[ 
https://issues.apache.org/jira/browse/PIG-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914150#action_12914150
 ] 

Yan Zhou commented on PIG-1635:
---

All test-core tests also run clean.

 Logical simplifier does not simplify away constants under AND and OR; after 
 simplificaion the ordering of operands of AND and OR may get changed
 

 Key: PIG-1635
 URL: https://issues.apache.org/jira/browse/PIG-1635
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.8.0
Reporter: Yan Zhou
Assignee: Yan Zhou
Priority: Minor
 Fix For: 0.8.0

 Attachments: PIG-1635.patch



[jira] Updated: (PIG-1639) New logical plan: PushUpFilter should not push before group/cogroup if filter condition contains UDF


 [ 
https://issues.apache.org/jira/browse/PIG-1639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1639:


Summary: New logical plan: PushUpFilter should not push before 
group/cogroup if filter condition contains UDF  (was: New logical plan: 
PushUpFilter should not optimize if filter condition contains UDF)

 New logical plan: PushUpFilter should not push before group/cogroup if filter 
 condition contains UDF
 

 Key: PIG-1639
 URL: https://issues.apache.org/jira/browse/PIG-1639
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.8.0
Reporter: Daniel Dai
Assignee: Xuefu Zhang
 Fix For: 0.8.0

 Attachments: jira-1639-1.patch


 The following script fails:
 {code}
 a = load 'file' AS (f1, f2, f3);
 b = group a by f1;
 c = filter b by COUNT(a)  1;
 dump c;
 {code}


[jira] Commented: (PIG-1639) New logical plan: PushUpFilter should not push before group/cogroup if filter condition contains UDF


[ 
https://issues.apache.org/jira/browse/PIG-1639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914154#action_12914154
 ] 

Daniel Dai commented on PIG-1639:
-

+1 if all tests pass.

 New logical plan: PushUpFilter should not push before group/cogroup if filter 
 condition contains UDF
 

 Key: PIG-1639
 URL: https://issues.apache.org/jira/browse/PIG-1639
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.8.0
Reporter: Daniel Dai
Assignee: Xuefu Zhang
 Fix For: 0.8.0

 Attachments: jira-1639-1.patch



[jira] Commented: (PIG-1644) New logical plan: Plan.connect with position is misused in some places


[ 
https://issues.apache.org/jira/browse/PIG-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914167#action_12914167
 ] 

Thejas M Nair commented on PIG-1644:


I think insertAsPredecessor and insertAsSuccessor (instead of  insertBefore and 
insertAfter) will convey the idea of what it does a little better. 


 New logical plan: Plan.connect with position is misused in some places
 --

 Key: PIG-1644
 URL: https://issues.apache.org/jira/browse/PIG-1644
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.8.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.8.0

 Attachments: PIG-1644-1.patch



[jira] Updated: (PIG-1632) The core jar in the tarball contains the kitchen sink

2010-09-23 Thread Olga Natkovich (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1632:


Status: Resolved  (was: Patch Available)
Resolution: Fixed

 The core jar in the tarball contains the kitchen sink 
 --

 Key: PIG-1632
 URL: https://issues.apache.org/jira/browse/PIG-1632
 Project: Pig
  Issue Type: Bug
  Components: build
Affects Versions: 0.8.0, 0.9.0
Reporter: Eli Collins
Assignee: Eli Collins
 Fix For: site, 0.9.0

 Attachments: pig-1632-1.patch, pig-1632-2.patch


 The core jar in the tarball contains the kitchen sink, it's not the same core 
 jar built by ant jar. This is problematic since other projects that want to 
 depend on the pig core jar just want pig core, but 
 pig-0.8.0-SNAPSHOT-core.jar in the tarball contains a bunch of other stuff 
 (hadoop, com.google, commons, etc) that may conflict with the packages also 
 on a user's classpath.
 {noformat}
 pig1 (trunk)$ jar tvf build/pig-0.8.0-SNAPSHOT-core.jar |grep -v pig|wc -l
 12
 pig1 (trunk)$ tar xvzf build/pig-0.8.0-SNAPSHOT.tar.gz
 ...
 pig1 (trunk)$ jar tvf pig-0.8.0-SNAPSHOT/pig-0.8.0-SNAPSHOT-core.jar |grep -v pig|wc -l
 4819
 {noformat}
 How about restricting the core jar to just Pig classes?


[jira] Updated: (PIG-1643) join fails for a query with input having 'load using pigstorage without schema' + 'foreach'


 [ 
https://issues.apache.org/jira/browse/PIG-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thejas M Nair updated PIG-1643:
---

  Status: Resolved  (was: Patch Available)
Hadoop Flags: [Reviewed]
  Resolution: Fixed

Tests passed.
Patch committed to 0.8 branch and trunk.


 join fails for a query with input having 'load using pigstorage without 
 schema' + 'foreach'
 ---

 Key: PIG-1643
 URL: https://issues.apache.org/jira/browse/PIG-1643
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Fix For: 0.8.0

 Attachments: PIG-1643.1.patch


 {code}
 l1 = load 'std.txt';
 l2 = load 'std.txt'; 
 f1 = foreach l1 generate $0 as abc, $1 as  def;
 -- j =  join f1 by $0, l2 by $0 using 'replicated';
 -- j =  join l2 by $0, f1 by $0 using 'replicated';
 j =  join l2 by $0, f1 by $0 ;
 dump j;
 {code}
 the error -
 {code}
 2010-09-22 16:24:48,584 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 2044: The type null cannot be collected as a Key type
 {code}
 The MR plan from explain  -
 {code}
 #--
 # Map Reduce Plan  
 #--
 MapReduce node scope-21
 Map Plan
 Union[tuple] - scope-22
 |
 |---j: Local Rearrange[tuple]{bytearray}(false) - scope-11
 |   |   |
 |   |   Project[bytearray][0] - scope-12
 |   |
 |   |---l2: Load(file:///Users/tejas/pig_obyfail/trunk/std.txt:org.apache.pig.builtin.PigStorage) - scope-0
 |
 |---j: Local Rearrange[tuple]{NULL}(false) - scope-13
 |   |
 |   Project[NULL][0] - scope-14
 |
 |---f1: New For Each(false,false)[bag] - scope-6
 |   |
 |   Project[bytearray][0] - scope-2
 |   |
 |   Project[bytearray][1] - scope-4
 |
 |---l1: Load(file:///Users/tejas/pig_obyfail/trunk/std.txt:org.apache.pig.builtin.PigStorage) - scope-1
 Reduce Plan
 j: Store(/tmp/x:org.apache.pig.builtin.PigStorage) - scope-18
 |
 |---POJoinPackage(true,true)[tuple] - scope-23
 Global sort: false
 
 {code}


[jira] Commented: (PIG-1644) New logical plan: Plan.connect with position is misused in some places


[ 
https://issues.apache.org/jira/browse/PIG-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914317#action_12914317
 ] 

Daniel Dai commented on PIG-1644:
-

After looking into the existing code, it seems insertBetween is the more 
useful method. So I want to drop insertBefore/insertAfter and add insertBetween:
{code}
insertBetween(Operator pred, Operator operatorToInsert, Operator succ)
{code}
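
For illustration only, here is a minimal sketch of how insertBetween could be expressed with the disconnect/connect pattern from the issue description. It is not the attached patch: ToyPlan, the use of String operators, and the int[] return value (standing in for Pair<Integer, Integer>) are all assumptions made to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy stand-in for Pig's OperatorPlan; operators are plain strings.
class ToyPlan {
    // Ordered edge lists: the index of an entry is the "position" of that edge.
    private final Map<String, List<String>> succs = new HashMap<>();
    private final Map<String, List<String>> preds = new HashMap<>();

    List<String> successors(String op) {
        return succs.computeIfAbsent(op, k -> new ArrayList<>());
    }

    List<String> predecessors(String op) {
        return preds.computeIfAbsent(op, k -> new ArrayList<>());
    }

    // Append a new edge after all existing edges on both ends.
    void connect(String from, String to) {
        successors(from).add(to);
        predecessors(to).add(from);
    }

    // Insert an edge at explicit positions on both ends.
    void connect(String from, int fromPos, String to, int toPos) {
        successors(from).add(fromPos, to);
        predecessors(to).add(toPos, from);
    }

    // Remove an edge and return {position in from's successors, position in to's predecessors}.
    int[] disconnect(String from, String to) {
        int fromPos = successors(from).indexOf(to);
        int toPos = predecessors(to).indexOf(from);
        if (fromPos < 0 || toPos < 0) {
            throw new IllegalArgumentException("no edge " + from + " -> " + to);
        }
        successors(from).remove(fromPos);
        predecessors(to).remove(toPos);
        return new int[] {fromPos, toPos};
    }

    // The "insert a new node" pattern, reusing the saved positions.
    void insertBetween(String pred, String operatorToInsert, String succ) {
        int[] pos = disconnect(pred, succ);
        connect(pred, pos[0], operatorToInsert, 0);
        connect(operatorToInsert, 0, succ, pos[1]);
    }

    public static void main(String[] args) {
        ToyPlan plan = new ToyPlan();
        plan.connect("b", "store_b");
        plan.connect("b", "order_c");
        plan.insertBetween("b", "newOp", "store_b");
        System.out.println(plan.successors("b"));
    }
}
```

The point of the sketch is that the new operator takes over the slot of the edge it replaces, so the order of the other successors of pred and the other predecessors of succ is left unchanged.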

 New logical plan: Plan.connect with position is misused in some places
 --

 Key: PIG-1644
 URL: https://issues.apache.org/jira/browse/PIG-1644
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.8.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.8.0

 Attachments: PIG-1644-1.patch


 When we replace/remove/insert a node, we use the disconnect/connect methods 
 of OperatorPlan. When we disconnect an edge, we should save the position of 
 the edge at its origin and destination, and use these positions when 
 connecting to the new predecessor/successor. Some of the patterns are:
 Insert a new node:
 {code}
 Pair<Integer, Integer> pos = plan.disconnect(pred, succ);
 plan.connect(pred, pos.first, newnode, 0);
 plan.connect(newnode, 0, succ, pos.second);
 {code}
 Remove a node:
 {code}
 Pair<Integer, Integer> pos1 = plan.disconnect(pred, nodeToRemove);
 Pair<Integer, Integer> pos2 = plan.disconnect(nodeToRemove, succ);
 plan.connect(pred, pos1.first, succ, pos2.second);
 {code}
 Replace a node:
 {code}
 Pair<Integer, Integer> pos1 = plan.disconnect(pred, nodeToReplace);
 Pair<Integer, Integer> pos2 = plan.disconnect(nodeToReplace, succ);
 plan.connect(pred, pos1.first, newNode, pos1.second);
 plan.connect(newNode, pos2.first, succ, pos2.second);
 {code}
 There are a couple of places where we do not follow this pattern, which 
 results in errors. For example, the following script fails:
 {code}
 a = load '1.txt' as (a0, a1, a2, a3);
 b = foreach a generate a0, a1, a2;
 store b into 'aaa';
 c = order b by a2;
 d = foreach c generate a2;
 store d into 'bbb';
 {code}


[jira] Updated: (PIG-1644) New logical plan: Plan.connect with position is misused in some places