[jira] Commented: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword

2010-07-12 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12887283#action_12887283
 ] 

Ashutosh Chauhan commented on PIG-1249:
---

The Map-reduce framework has a jira related to this issue: 
https://issues.apache.org/jira/browse/MAPREDUCE-1521. It has two implications 
for Pig:

1) We need to reconsider whether we still want Pig to set the number of 
reducers on the user's behalf. We can choose not to guess the number of 
reducers and instead let the framework fail any job that doesn't specify the 
number of reducers correctly. Pig would then be out of this guessing game, and 
users would be forced by the framework to specify the number of reducers 
correctly. 

2) Now that the MR framework will fail jobs based on configured limits, 
operators where Pig does compute and set the number of reducers (such as 
skewed join) should now be aware of those limits, so that the number of 
reducers they compute falls within them.
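A minimal sketch of the second point, with hypothetical names (this is not Pig's actual API): any operator that computes its own parallelism would clamp the estimate to the framework's configured limit before submitting the job.

```python
# Hypothetical sketch: if the MR framework enforces a reducer limit
# (MAPREDUCE-1521), operators that compute their own parallelism
# (e.g. skewed join) would clamp their estimate to that limit.
# Function and parameter names here are illustrative, not Pig's.

def clamp_reducers(estimated, framework_max):
    """Keep a computed reducer count within the framework's limit."""
    if framework_max is not None and estimated > framework_max:
        return framework_max
    return max(1, estimated)  # never request fewer than one reducer
```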

 Safe-guards against misconfigured Pig scripts without PARALLEL keyword
 --

 Key: PIG-1249
 URL: https://issues.apache.org/jira/browse/PIG-1249
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Arun C Murthy
Assignee: Jeff Zhang
Priority: Critical
 Fix For: 0.8.0

 Attachments: PIG-1249-4.patch, PIG-1249.patch, PIG_1249_2.patch, 
 PIG_1249_3.patch


 It would be *very* useful for Pig to have safe-guards against naive scripts 
 which process a *lot* of data without the use of the PARALLEL keyword.
 We've seen a fair number of instances where naive users process huge 
 data-sets (10TB) with a badly mis-configured number of reduces, e.g. 1 reduce. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-928) UDFs in scripting languages

2010-07-12 Thread Aniket Mokashi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aniket Mokashi updated PIG-928:
---

Status: Patch Available  (was: Open)

 UDFs in scripting languages
 ---

 Key: PIG-928
 URL: https://issues.apache.org/jira/browse/PIG-928
 Project: Pig
  Issue Type: New Feature
Reporter: Alan Gates
Assignee: Aniket Mokashi
 Fix For: 0.8.0

 Attachments: calltrace.png, package.zip, PIG-928.patch, 
 pig-greek.tgz, pig.scripting.patch.arnab, pyg.tgz, RegisterPythonUDF3.patch, 
 RegisterPythonUDF4.patch, RegisterPythonUDF_Final.patch, 
 RegisterPythonUDFFinale.patch, RegisterPythonUDFFinale3.patch, 
 RegisterPythonUDFFinale4.patch, RegisterPythonUDFFinale5.patch, 
 RegisterScriptUDFDefineParse.patch, scripting.tgz, scripting.tgz, test.zip


 It should be possible to write UDFs in scripting languages such as python, 
 ruby, etc.  This frees users from needing to compile Java, generate a jar, 
 etc.  It also opens Pig to programmers who prefer scripting languages over 
 Java.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-928) UDFs in scripting languages

2010-07-12 Thread Aniket Mokashi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aniket Mokashi updated PIG-928:
---

Attachment: (was: RegisterPythonUDF2.patch)

 UDFs in scripting languages
 ---

 Key: PIG-928
 URL: https://issues.apache.org/jira/browse/PIG-928
 Project: Pig
  Issue Type: New Feature
Reporter: Alan Gates
Assignee: Aniket Mokashi
 Fix For: 0.8.0

 Attachments: calltrace.png, package.zip, PIG-928.patch, 
 pig-greek.tgz, pig.scripting.patch.arnab, pyg.tgz, RegisterPythonUDF3.patch, 
 RegisterPythonUDF4.patch, RegisterPythonUDF_Final.patch, 
 RegisterPythonUDFFinale.patch, RegisterPythonUDFFinale3.patch, 
 RegisterPythonUDFFinale4.patch, RegisterPythonUDFFinale5.patch, 
 RegisterScriptUDFDefineParse.patch, scripting.tgz, scripting.tgz, test.zip


 It should be possible to write UDFs in scripting languages such as python, 
 ruby, etc.  This frees users from needing to compile Java, generate a jar, 
 etc.  It also opens Pig to programmers who prefer scripting languages over 
 Java.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1490) Make Pig storers work with remote HDFS in secure mode

2010-07-12 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12887424#action_12887424
 ] 

Daniel Dai commented on PIG-1490:
-

+1

 Make Pig storers work with remote HDFS in secure mode
 -

 Key: PIG-1490
 URL: https://issues.apache.org/jira/browse/PIG-1490
 Project: Pig
  Issue Type: Bug
Reporter: Richard Ding
Assignee: Richard Ding
 Fix For: 0.7.0, 0.8.0

 Attachments: PIG-1490.patch


 PIG-1403 fixed the problem for Pig loaders. We need to do the same for Pig 
 storers. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1494) PIG Logical Optimization: Use CNF in PushUpFilter

2010-07-12 Thread Swati Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12887425#action_12887425
 ] 

Swati Jain commented on PIG-1494:
-

Reply from Yan Zhou:

The filter logic split problem can be divided into 2 parts:
1) the filtering logic that can be applied to individual input sources;
and 2) the filtering logic that has to be applied when merged(or joined)
inputs are processed.

The benefits for 1) are any benefits from the underlying storage supporting
predicate pushdown, plus the memory/CPU savings in Pig from not
processing the unqualified rows.

For 2), the purpose is to avoid paying higher evaluation costs than
necessary.

For 1), no normal form is necessary. The original logical expression
tree can be trimmed of any sub-expressions that are neither constants
nor drawn from a single input source. The complexity is linear in the
tree size, while the use of a normal form could potentially lead to
exponential complexity. The difficulty with this approach is how to
generate the filtering logic for 2), while CNF can be used to easily
figure out the logic for 2). However, the exact logic in 2) might not be
cheaper to evaluate than the original logical expression. An example is
Filter J2 by ((C1 > 10) AND (a3+b3 > 10)) OR ((C2 == 5) AND (a2+b2 > 5)).
In 2) the filtering logic after CNF will be ((C1 > 10) OR (a2+b2 > 5)) AND
((a3+b3 > 10) OR (C2 == 5)) AND ((a3+b3 > 10) OR (a2+b2 > 5)). The cost
will be 5 logical evaluations (3 ORs plus 2 ANDs), which could be reduced
to 4, compared with 3 logical evaluations in the original form.

In summary, if only 1) is desired, the tree trimming is enough. If 2) is
desired too, then CNF could be used but its complexity should be
controlled and the cost of the filtering logic evaluation in 2) should
be computed and compared with the original expression evaluation cost.
Further optimization is possible in this direction.

Another potential optimization to consider is to support logical
expression trees with multiple children, as opposed to the binary tree,
by taking into consideration the commutative property of the OR and AND
operations. The advantages are lower tree-traversal costs and easier
reordering of the evaluation within the same sub-tree, in order to
maximize the opportunities to short-circuit the evaluation. Although
this is general for all logical expressions, it tends to be more
suitable for normal-form handling, since normal forms group the
sub-expressions by the operators that act on them.
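The multi-child idea can be sketched as follows; this is an illustrative Python model (not Pig's LogicalExpression classes), representing expressions as nested tuples and absorbing same-operator children using associativity:

```python
# Illustrative sketch (not Pig code): flatten a binary AND/OR expression
# tree into multi-child nodes so that fewer nodes are traversed and
# children can later be reordered to maximize short-circuiting.
# Expressions are tuples: ('and', a, b), ('or', a, b), or a leaf string.

def flatten(expr):
    if not isinstance(expr, tuple):
        return expr  # leaf predicate
    op = expr[0]
    children = []
    for child in expr[1:]:
        c = flatten(child)
        if isinstance(c, tuple) and c[0] == op:
            children.extend(c[1:])  # absorb a same-operator child
        else:
            children.append(c)
    return (op,) + tuple(children)
```

For example, `('and', ('and', 'p', 'q'), 'r')` flattens to the three-child node `('and', 'p', 'q', 'r')`.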

 PIG Logical Optimization: Use CNF in PushUpFilter
 -

 Key: PIG-1494
 URL: https://issues.apache.org/jira/browse/PIG-1494
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.7.0
Reporter: Swati Jain
Priority: Minor
 Fix For: 0.8.0


 The PushUpFilter rule is not able to handle complicated boolean expressions.
 For example, the SplitFilter rule splits one LOFilter into two by AND. 
 However, it will not be able to split an LOFilter if the top-level operator 
 is OR. For example:
 *ex script:*
 A = load 'file_a' USING PigStorage(',') as (a1:int,a2:int,a3:int);
 B = load 'file_b' USING PigStorage(',') as (b1:int,b2:int,b3:int);
 C = load 'file_c' USING PigStorage(',') as (c1:int,c2:int,c3:int);
 J1 = JOIN B by b1, C by c1;
 J2 = JOIN J1 by $0, A by a1;
 D = *Filter J2 by ( (c1 > 10) AND (a3+b3 > 10) ) OR (c2 == 5);*
 explain D;
 In the above example, the PushUpFilter is not able to push any filter 
 condition across any join, as the condition contains columns from all 
 branches (inputs). But if we convert this expression into Conjunctive Normal 
 Form (CNF), then we would be able to push the filter conditions c1 > 10 and 
 c2 == 5 below both join conditions. Here is the CNF expression for the 
 highlighted line:
 ( (c1 > 10) OR (c2 == 5) ) AND ( (a3+b3 > 10) OR (c2 == 5) )
 *Suggestion:* It would be a good idea to convert LOFilter's boolean 
 expression into CNF; it would then be easy to push parts (conjuncts) of the 
 LOFilter boolean expression selectively. We would also no longer need the 
 SplitFilter rule if we were to add this utility to the PushUpFilter rule 
 itself.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1295) Binary comparator for secondary sort

2010-07-12 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12887420#action_12887420
 ] 

Daniel Dai commented on PIG-1295:
-

More clarification on custom Tuples. There are two cases for a custom tuple:
1. The user creates a custom tuple inside a UDF. In this case, we do not have 
a special serialized format for the custom tuple. After serialization, we 
cannot tell whether it is a custom tuple. That is to say, we lose track of the 
tuple implementation after se/des. Since the serialized format is the same, we 
can still use the same raw comparator.
2. If the user uses a custom tuple factory (by overriding 
pig.data.tuple.factory.name), then the serialized format may change. If we 
detect that the tuple factory is not BinSedesTupleFactory, we shall not use 
this raw comparator.

 Binary comparator for secondary sort
 

 Key: PIG-1295
 URL: https://issues.apache.org/jira/browse/PIG-1295
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Gianmarco De Francisci Morales
 Fix For: 0.8.0

 Attachments: PIG-1295_0.1.patch, PIG-1295_0.2.patch, 
 PIG-1295_0.3.patch, PIG-1295_0.4.patch, PIG-1295_0.5.patch, 
 PIG-1295_0.6.patch, PIG-1295_0.7.patch, PIG-1295_0.8.patch


 When the hadoop framework does the sorting, it will try to use the binary 
 version of the comparator if available. The benefit of a binary comparator is 
 that we do not need to instantiate the object before we compare. We saw a 
 ~30% speedup after switching to a binary comparator. Currently, Pig uses a 
 binary comparator in the following cases:
 1. When the semantics of the order don't matter. For example, in distinct, we 
 need to sort in order to filter out duplicate values; however, we do not care 
 how the comparator sorts keys. Group-by also shares this characteristic. In 
 this case, we rely on hadoop's default binary comparator.
 2. When the semantics of the order matter, but the key is of a simple type. 
 In this case, we have implementations for simple types such as integer, long, 
 float, chararray, databytearray, and string.
 However, if the key is a tuple and the sort semantics matter, we do not have 
 a binary comparator implementation. This especially matters when we switch to 
 secondary sort. In secondary sort, we convert the inner sort of a nested 
 foreach into the secondary key and rely on hadoop to sort on both the main 
 key and the secondary key. The sorting key becomes a two-item tuple. Since 
 the secondary key is the sorting key of the nested foreach, the sorting 
 semantics matter. It turns out we do not have a binary comparator once we use 
 secondary sort, and we see a significant slowdown.
 A binary comparator for tuples should be doable once we understand the binary 
 structure of the serialized tuple. We can focus on the most common use case 
 first, which is a group-by followed by a nested sort. In this case, we will 
 use secondary sort. The semantics of the first key do not matter, but the 
 semantics of the secondary key do. We need to identify the boundary between 
 the main key and the secondary key in the binary tuple buffer without 
 instantiating the tuple itself. Then, if the first keys are equal, we use a 
 binary comparator to compare the secondary keys. The secondary key can also 
 be a complex data type, but for the first step, we focus on simple secondary 
 keys, which are the most common use case.
 We mark this issue as a candidate project for the Google Summer of Code 2010 
 program. 
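A toy model of the idea, under the assumption of a simplified serialization format (two big-endian 32-bit integer keys with the sign bit flipped at write time; this is not Pig's actual BinSedes layout): the comparator orders records by main key, then secondary key, directly on the raw bytes, without instantiating any tuple.

```python
# Illustrative sketch (not Pig's implementation): compare two serialized
# (main_key, secondary_key) records without deserializing them. Flipping
# the sign bit at write time makes unsigned byte-wise order match signed
# numeric order, so slices of the buffer can be compared directly.
import struct

def serialize(main_key, secondary_key):
    # Flip the sign bit so that byte-wise order == numeric order.
    return struct.pack(">II",
                       (main_key ^ 0x80000000) & 0xFFFFFFFF,
                       (secondary_key ^ 0x80000000) & 0xFFFFFFFF)

def raw_compare(a, b):
    # Main key occupies bytes 0..3; compare it first, then the secondary key.
    if a[:4] != b[:4]:
        return -1 if a[:4] < b[:4] else 1
    if a[4:8] != b[4:8]:
        return -1 if a[4:8] < b[4:8] else 1
    return 0
```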

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-928) UDFs in scripting languages

2010-07-12 Thread Aniket Mokashi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aniket Mokashi updated PIG-928:
---

Status: Open  (was: Patch Available)

 UDFs in scripting languages
 ---

 Key: PIG-928
 URL: https://issues.apache.org/jira/browse/PIG-928
 Project: Pig
  Issue Type: New Feature
Reporter: Alan Gates
Assignee: Aniket Mokashi
 Fix For: 0.8.0

 Attachments: calltrace.png, package.zip, PIG-928.patch, 
 pig-greek.tgz, pig.scripting.patch.arnab, pyg.tgz, RegisterPythonUDF3.patch, 
 RegisterPythonUDF4.patch, RegisterPythonUDF_Final.patch, 
 RegisterPythonUDFFinale.patch, RegisterPythonUDFFinale3.patch, 
 RegisterPythonUDFFinale4.patch, RegisterPythonUDFFinale5.patch, 
 RegisterScriptUDFDefineParse.patch, scripting.tgz, scripting.tgz, test.zip


 It should be possible to write UDFs in scripting languages such as python, 
 ruby, etc.  This frees users from needing to compile Java, generate a jar, 
 etc.  It also opens Pig to programmers who prefer scripting languages over 
 Java.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1494) PIG Logical Optimization: Use CNF in PushUpFilter

2010-07-12 Thread Swati Jain (JIRA)
PIG Logical Optimization: Use CNF in PushUpFilter
-

 Key: PIG-1494
 URL: https://issues.apache.org/jira/browse/PIG-1494
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.7.0
Reporter: Swati Jain
Priority: Minor
 Fix For: 0.8.0


The PushUpFilter rule is not able to handle complicated boolean expressions.

For example, the SplitFilter rule splits one LOFilter into two by AND. 
However, it will not be able to split an LOFilter if the top-level operator 
is OR. For example:

*ex script:*
A = load 'file_a' USING PigStorage(',') as (a1:int,a2:int,a3:int);
B = load 'file_b' USING PigStorage(',') as (b1:int,b2:int,b3:int);
C = load 'file_c' USING PigStorage(',') as (c1:int,c2:int,c3:int);
J1 = JOIN B by b1, C by c1;
J2 = JOIN J1 by $0, A by a1;
D = *Filter J2 by ( (c1 > 10) AND (a3+b3 > 10) ) OR (c2 == 5);*
explain D;
In the above example, the PushUpFilter is not able to push any filter 
condition across any join, as the condition contains columns from all branches 
(inputs). But if we convert this expression into Conjunctive Normal Form (CNF), 
then we would be able to push the filter conditions c1 > 10 and c2 == 5 below 
both join conditions. Here is the CNF expression for the highlighted line:

( (c1 > 10) OR (c2 == 5) ) AND ( (a3+b3 > 10) OR (c2 == 5) )

*Suggestion:* It would be a good idea to convert LOFilter's boolean expression 
into CNF; it would then be easy to push parts (conjuncts) of the LOFilter 
boolean expression selectively. We would also no longer need the SplitFilter 
rule if we were to add this utility to the PushUpFilter rule itself.
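The proposed CNF rewrite can be sketched in miniature; this is an illustrative Python model (not the Pig implementation), representing predicates as strings and distributing OR over AND:

```python
# Illustrative sketch (not Pig code): convert a boolean expression into
# CNF by distributing OR over AND. Expressions are nested tuples:
# ('and', l, r), ('or', l, r), or a leaf predicate string.

def to_cnf(expr):
    if not isinstance(expr, tuple):
        return expr
    op, left, right = expr[0], to_cnf(expr[1]), to_cnf(expr[2])
    if op == 'and':
        return ('and', left, right)
    # op == 'or': distribute OR over any AND child.
    if isinstance(left, tuple) and left[0] == 'and':
        return ('and', to_cnf(('or', left[1], right)),
                       to_cnf(('or', left[2], right)))
    if isinstance(right, tuple) and right[0] == 'and':
        return ('and', to_cnf(('or', left, right[1])),
                       to_cnf(('or', left, right[2])))
    return ('or', left, right)

# The filter condition from the example script:
# ((c1 > 10) AND (a3+b3 > 10)) OR (c2 == 5)
expr = ('or', ('and', 'c1 > 10', 'a3+b3 > 10'), 'c2 == 5')
# to_cnf(expr) yields ((c1 > 10) OR (c2 == 5)) AND ((a3+b3 > 10) OR (c2 == 5))
```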

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1490) Make Pig storers work with remote HDFS in secure mode

2010-07-12 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1490:
--

  Status: Resolved  (was: Patch Available)
Hadoop Flags: [Reviewed]
Release Note: Committed to both trunk and 0.7 branch
  Resolution: Fixed

 Make Pig storers work with remote HDFS in secure mode
 -

 Key: PIG-1490
 URL: https://issues.apache.org/jira/browse/PIG-1490
 Project: Pig
  Issue Type: Bug
Reporter: Richard Ding
Assignee: Richard Ding
 Fix For: 0.8.0, 0.7.0

 Attachments: PIG-1490.patch


 PIG-1403 fixed the problem for Pig loaders. We need to do the same for Pig 
 storers. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



RE: PIG Logical Optimization: Use CNF in SplitFilter

2010-07-12 Thread Yan Zhou
Yes, I already implemented the NOT push down upfront, so you do not
need to do that.

 

The support of CNF will probably be the most difficult part. But as I
mentioned last time, you should compare the costs after trimming the CNF
to get the post-split filtering logic. Given the complexity of
manipulating CNF and the undetermined benefits, I am not sure whether it
should be in scope at this moment.

 

To handle CNF, I think it's a good idea to create a new plan and connect
the nodes in the new plan to the base plan as you envisioned. In my
changes, which use DNF instead of CNF but are otherwise similar in
processing, I use a LogicalExpressionProxy, which contains a source
member that is just the node in the original plan, to link the nodes in
the new plan and the old plan. The original LogicalExpression is
enhanced with a counter to track the number of proxies of the original
node, since normal-form creation will spread the nodes of the original
tree across many normalized nodes. The benefit, aside from not setting
the plan, is that the original expression is trimmed according to the
processing results from the DNF, while the DNF is created separately, as
a kind of utility, so that complex features can be used. In my changes,
I used a multiple-child tree in the DNF while not changing the original
binary expression tree structure. Another benefit is that the original
tree is kept as close as possible to what it was at the start, i.e., I
do not attempt to optimize its overall structure beyond trimming based
on the simplification logic. (I also limit the size of the DNF to 100
nodes.) The downside of this is added complexity.

 

But in your case, for scenario 2, which is the whole point of using CNF,
you would need to change the original expression tree structurally,
beyond trimming, to obtain the post-split filtering logic. The other
benefit of using a multiple-child expression depends on whether you plan
to support such expressions as a replacement for the current binary tree
in the final plan. Even though I think it's a good idea to support that,
it is not in my scope now.

 

I'll add my algorithm details soon to my jira. Please take a look and
comment as you see appropriate.

 

Thanks,

 

Yan

 

 



From: Swati Jain [mailto:swat...@aggiemail.usu.edu] 
Sent: Friday, July 09, 2010 11:00 PM
To: Yan Zhou
Cc: pig-dev@hadoop.apache.org
Subject: Re: PIG Logical Optimization: Use CNF in SplitFilter

 

Hi Yan,

I agree that the first scenario (filter logic applied to individual
input sources) doesn't need conversion to CNF and that it will be a good
idea to add CNF functionality for the second scenario. I was also
planning to provide a configurable threshold value to control the
complexity of CNF conversion.

As part of the above, I wrote a utility to push the NOT operator in
predicates below the AND and OR operators (Scenario 2 in PIG-1399).
I am considering making this NOT push-down utility a separate rule in
itself. Let me know if you have already implemented this.

While implementing this utility I am facing some trouble keeping the
OperatorPlan consistent as I rewrite the expression. This is because
each operator references the main filter logical plan. Here is my
current implementation approach:

1. I am creating a new LogicalExpressionPlan for the converted boolean
expression.
2. I am creating new logical expressions while pushing the NOT operation
down, converting AND into OR and OR into AND, and eliminating NOT NOT
pairs.
3. However, I am having trouble updating the LogicalExpressionPlan when
it reaches the base case (i.e. the root operator is not NOT, AND, or OR).

D = Filter J2 by ( (c2 == 5) OR ( NOT( (c1 > 10) AND (c3+b3 > 10) ) ) );

In the above, for example, I am not sure how to integrate the base
expression (c2 == 5) into the new LogicalExpressionPlan. There is no
routine to set the plan for a given operator and its children. Also,
there is currently no way to deepCopy an expression into a new
OperatorPlan. It would be great if you could give me some suggestions on
what approach to take for this.

One approach I thought of is to visit the base expression and create and
connect its nodes to the new LogicalExpressionPlan as I visit them.

Thoughts?
Swati

ps: About your other point regarding binary vs. multi-way trees: the way
I am creating the normal form is as a list of conjuncts, where each
conjunct is a list of disjuncts. This is logically similar to a
multi-way tree. However, the current modeling of boolean expressions (as
binary expressions) requires a conversion back to the binary tree model
when adding back to the main plan.
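For reference, the NOT push-down discussed above can be sketched on a toy tuple-based expression model (illustrative only; this is not the actual Pig utility or its LogicalExpressionPlan API):

```python
# Illustrative sketch: push NOT below AND/OR using De Morgan's laws and
# eliminate double negation. Expressions are nested tuples:
# ('not', e), ('and', l, r), ('or', l, r), or a leaf predicate string.

def push_not(expr, negate=False):
    if not isinstance(expr, tuple):
        return ('not', expr) if negate else expr  # base case: a predicate
    op = expr[0]
    if op == 'not':
        return push_not(expr[1], not negate)      # NOT NOT cancels out
    if negate:
        op = 'and' if op == 'or' else 'or'        # De Morgan: swap AND/OR
    return (op, push_not(expr[1], negate), push_not(expr[2], negate))
```

For example, `NOT(p AND q)` becomes `(NOT p) OR (NOT q)`, with NOT appearing only directly above leaf predicates.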

On Tue, Jul 6, 2010 at 12:46 PM, Yan Zhou y...@yahoo-inc.com wrote:

Swati,

I happen to be working on the logical expression simplification effort
(https://issues.apache.org/jira/browse/PIG-1399), but not on the filter
split front. So I guess our interests will have some overlaps.

I think the filter logic split problem can be divided into 2 parts:
1) the filtering logic that can be applied to individual input sources;

[jira] Commented: (PIG-1472) Optimize serialization/deserialization between Map and Reduce and between MR jobs

2010-07-12 Thread Thejas M Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12887441#action_12887441
 ] 

Thejas M Nair commented on PIG-1472:


bq. 1. The following code are never used in BinStorage and InterStorage, should 
be removed. 
I will remove that.

bq. 3. Seems InterStorage is a replacement for BinStorage, why do we make it 
private? Shall we encourage user use InterStorage in the place of BinStorage, 
and make BinStorage deprecate?
In the future, we are likely to find better ways to serialize data between the 
MR jobs of a pig query, i.e. the InterSedes serialization format is likely to 
change, and the change is not likely to be compatible with its old format. So 
it will not be suitable for storing persistent data. 
This replaces BinStorage only for its use within pig. Since BinStorage is used 
in pig queries and the code should be easy to maintain, I think we don't have 
to deprecate BinStorage.



 Optimize serialization/deserialization between Map and Reduce and between MR 
 jobs
 -

 Key: PIG-1472
 URL: https://issues.apache.org/jira/browse/PIG-1472
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Fix For: 0.8.0

 Attachments: PIG-1472.2.patch, PIG-1472.3.patch, PIG-1472.patch


 In certain types of pig queries, most of the execution time is spent 
 serializing/deserializing (sedes) records between Map and Reduce and between 
 MR jobs. 
 For example, if PigMix queries are modified to specify types for all the 
 fields in the load statement schema, some of the queries (L2, L3, L9, L10 in 
 pigmix v1) that have records with bags and maps transmitted across map or 
 reduce boundaries run a lot longer (a runtime increase of a few times has 
 been seen).
 There are a few optimizations that have been shown to improve the performance 
 of sedes in my tests:
 1. Use a smaller number of bytes to store the length of a column. For 
 example, if a bytearray is smaller than 255 bytes, a single byte can be used 
 to store the length instead of the integer that is currently used.
 2. Instead of custom code to do sedes on Strings, use DataOutput.writeUTF and 
 DataInput.readUTF. This reduces the cost of serialization by more than half. 
 Zebra and BinStorage are known to use the DefaultTuple sedes functionality. 
 The serialization format that these loaders use cannot change, so after the 
 optimization their format is going to be different from the format used 
 between M/R boundaries.
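Optimization 1 can be illustrated with a toy encoder (the type codes and layout here are made up, not the real InterSedes format): a one-byte length is used whenever the array is short enough, falling back to a 4-byte int otherwise.

```python
# Illustrative sketch (not Pig's InterSedes format): store the length of
# a byte array in one byte when it fits, instead of always spending four.
import struct

TINY = 0   # made-up type code: length fits in one unsigned byte
LONG = 1   # made-up type code: length needs a 4-byte int

def write_bytearray(data):
    if len(data) < 255:
        return struct.pack(">BB", TINY, len(data)) + data
    return struct.pack(">BI", LONG, len(data)) + data

def read_bytearray(buf):
    if buf[0] == TINY:
        n = buf[1]
        return buf[2:2 + n]
    n = struct.unpack(">I", buf[1:5])[0]
    return buf[5:5 + n]
```

A 3-byte array costs 5 bytes on the wire instead of 8; the saving compounds for records with many small columns.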

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1472) Optimize serialization/deserialization between Map and Reduce and between MR jobs

2010-07-12 Thread Thejas M Nair (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thejas M Nair updated PIG-1472:
---

Attachment: PIG-1472.4.patch

Removed unused static constants from InterStorage and BinStorage , addressing 
comment#1 from Daniel. 


 Optimize serialization/deserialization between Map and Reduce and between MR 
 jobs
 -

 Key: PIG-1472
 URL: https://issues.apache.org/jira/browse/PIG-1472
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Fix For: 0.8.0

 Attachments: PIG-1472.2.patch, PIG-1472.3.patch, PIG-1472.4.patch, 
 PIG-1472.patch


 In certain types of pig queries, most of the execution time is spent 
 serializing/deserializing (sedes) records between Map and Reduce and between 
 MR jobs. 
 For example, if PigMix queries are modified to specify types for all the 
 fields in the load statement schema, some of the queries (L2, L3, L9, L10 in 
 pigmix v1) that have records with bags and maps transmitted across map or 
 reduce boundaries run a lot longer (a runtime increase of a few times has 
 been seen).
 There are a few optimizations that have been shown to improve the performance 
 of sedes in my tests:
 1. Use a smaller number of bytes to store the length of a column. For 
 example, if a bytearray is smaller than 255 bytes, a single byte can be used 
 to store the length instead of the integer that is currently used.
 2. Instead of custom code to do sedes on Strings, use DataOutput.writeUTF and 
 DataInput.readUTF. This reduces the cost of serialization by more than half. 
 Zebra and BinStorage are known to use the DefaultTuple sedes functionality. 
 The serialization format that these loaders use cannot change, so after the 
 optimization their format is going to be different from the format used 
 between M/R boundaries.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1436) Print number of records outputted at each step of a Pig script

2010-07-12 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12887446#action_12887446
 ] 

Richard Ding commented on PIG-1436:
---

Russell,

PIG-1478 implemented a callback mechanism that allows users to retrieve stats 
after each job. Will this meet your needs? 

 Print number of records outputted at each step of a Pig script
 --

 Key: PIG-1436
 URL: https://issues.apache.org/jira/browse/PIG-1436
 Project: Pig
  Issue Type: New Feature
  Components: grunt
Affects Versions: 0.7.0
Reporter: Russell Jurney
Assignee: Richard Ding
Priority: Minor
 Fix For: 0.8.0


 I often run a script multiple times, or have to go and look through Hadoop 
 task logs, to figure out where I broke a long script in such a way that I get 
 0 records out of it.  I think this is a common problem.
 If someone can point me in the right direction, I can make a pass at this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (PIG-884) Have a way to export RulePlan and other kinds of OperatorPlan to common representation (dot?) and import from dot to RulePlan

2010-07-12 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich resolved PIG-884.


Resolution: Fixed

The dot notation for explain was implemented as part of the Pig 0.3.0 work.

 Have a way to export RulePlan and other kinds of OperatorPlan to common 
 representation (dot?) and import from dot to RulePlan
 -

 Key: PIG-884
 URL: https://issues.apache.org/jira/browse/PIG-884
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.3.0
Reporter: Pradeep Kamath

 Have a way to export RulePlan and other kinds of OperatorPlan to common 
 representation (dot?) and import from dot to RulePlan

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (PIG-886) clone should be updated in LogicalOperators to include cloning of projection map information and any other information used by LogicalOptimizer

2010-07-12 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich resolved PIG-886.


Resolution: Fixed

This is no longer relevant after the optimizer re-work.

 clone should be updated in LogicalOperators to include cloning of projection 
 map information and any other information used by LogicalOptimizer
 ---

 Key: PIG-886
 URL: https://issues.apache.org/jira/browse/PIG-886
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.3.0
Reporter: Pradeep Kamath

 clone should be updated in LogicalOperators to include cloning of projection 
 map information and any other information used by LogicalOptimizer

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-900) ORDER BY syntax wrt parentheses is somewhat different than GROUP BY and FILTER BY

2010-07-12 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-900:
---

Fix Version/s: 0.9.0

 ORDER BY syntax wrt parentheses is somewhat different than GROUP BY and 
 FILTER BY
 -

 Key: PIG-900
 URL: https://issues.apache.org/jira/browse/PIG-900
 Project: Pig
  Issue Type: Bug
Reporter: David Ciemiewicz
 Fix For: 0.9.0


 With GROUP BY, you must put parentheses around the aliases in the BY clause:
 {code}
 B = group A by ( a, b, c );
 {code}
 With FILTER BY, you can optionally put parentheses around the aliases in the 
 BY clause:
 {code}
 B = filter A by ( a is not null and b is not null and c is not null );
 {code}
 However, with ORDER BY, if you put parentheses around the BY clause, you get 
 a syntax error:
 {code}
  A = order A by ( a, b, c );
 {code}
 Produces the error:
 {code}
 2009-08-03 18:26:29,544 [main] ERROR org.apache.pig.tools.grunt.Grunt -
 ERROR 1000: Error during parsing. Encountered  , ,  at line 3, column 
 19.
 Was expecting:
 ) ...
 {code}
 This is an annoyance really.
 Here's my full code example ...
 {code}
 A = load 'data.txt' using PigStorage as (a: chararray, b: chararray, c: 
 chararray );
 A = order A by ( a, b, c );
 dump A;
 {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-902) Allow schema matching for UDF with variable length arguments

2010-07-12 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-902:
---

Fix Version/s: 0.9.0

 Allow schema matching for UDF with variable length arguments
 

 Key: PIG-902
 URL: https://issues.apache.org/jira/browse/PIG-902
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.3.0
Reporter: Daniel Dai
 Fix For: 0.9.0


 Pig picks the right version of a UDF using a similarity measurement. This 
 mechanism picks the UDF whose declared input schema matches the actual input. 
 However, some UDFs take a variable number of inputs, and currently there is no 
 way to declare such an input schema in a UDF; the similarity measurement does 
 not match against a variable number of inputs. We can still write 
 variable-input UDFs, but we cannot rely on schema matching to pick the right 
 UDF version and do the automatic data type conversion.
 Eg:
 If we have:
 Integer udf1(Integer, ..);
 Integer udf1(String, ..);
 Currently we cannot do this:
 a: {chararray, chararray}
 b = foreach a generate udf1(a.$0, a.$1);  // Pig cannot pick the udf1(String, 
 ..) version automatically; currently, this statement fails
 Eg:
 If we have:
 Integer udf2(Integer, ..);
 Currently, this script fails:
 a: {chararray, chararray}
 b = foreach a generate udf2(a.$0, a.$1);  // Currently, Pig cannot convert 
 a.$0 into Integer automatically
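The gap can be illustrated with a small sketch. This is not Pig's actual matching code — the schema representation and the "..." varargs marker are made up for illustration — but it shows how a similarity score could handle a variable-length declaration:

```java
import java.util.List;

// Hypothetical sketch (not Pig's actual matcher): score each candidate input
// schema against the actual argument types, treating a trailing "..." entry
// as a variable-length tail that accepts any remaining arguments.
public class VarargMatcher {
    /** Index of the best-matching candidate schema for the given argument types, or -1. */
    public static int bestMatch(List<List<String>> candidates, List<String> args) {
        int best = -1, bestScore = -1;
        for (int i = 0; i < candidates.size(); i++) {
            int score = score(candidates.get(i), args);
            if (score > bestScore) { bestScore = score; best = i; }
        }
        return best;
    }

    private static int score(List<String> decl, List<String> args) {
        boolean varargs = !decl.isEmpty() && decl.get(decl.size() - 1).equals("...");
        int fixed = varargs ? decl.size() - 1 : decl.size();
        if (varargs ? args.size() < fixed : args.size() != fixed) return -1;
        int score = 0;
        for (int i = 0; i < fixed; i++) {
            if (decl.get(i).equals(args.get(i))) score++;          // exact type match
            else if (!decl.get(i).equals("bytearray")) return -1;  // only bytearray coerces
        }
        return score;
    }
}
```

With candidates declared as ("int", "...") and ("chararray", "..."), two chararray arguments would select the chararray version.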




[jira] Updated: (PIG-903) ILLUSTRATE fails on 'Distinct' operator

2010-07-12 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-903:
---

Fix Version/s: 0.9.0

 ILLUSTRATE fails on 'Distinct' operator
 ---

 Key: PIG-903
 URL: https://issues.apache.org/jira/browse/PIG-903
 Project: Pig
  Issue Type: Bug
Reporter: Dmitriy V. Ryaboy
 Fix For: 0.9.0


 Using the latest Pig from trunk (0.3+) in mapreduce mode, running through the 
 tutorial script script1-hadoop.pig works fine.
 However, executing the following illustrate command throws an exception:
 illustrate ngramed2
 Pig Stack Trace
 ---
 ERROR 2999: Unexpected internal error. Unrecognized logical operator.
 java.lang.RuntimeException: Unrecognized logical operator.
 at 
 org.apache.pig.pen.EquivalenceClasses.GetEquivalenceClasses(EquivalenceClasses.java:60)
 at 
 org.apache.pig.pen.DerivedDataVisitor.evaluateOperator(DerivedDataVisitor.java:368)
 at 
 org.apache.pig.pen.DerivedDataVisitor.visit(DerivedDataVisitor.java:226)
 at 
 org.apache.pig.impl.logicalLayer.LODistinct.visit(LODistinct.java:104)
 at 
 org.apache.pig.impl.logicalLayer.LODistinct.visit(LODistinct.java:37)
 at 
 org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:68)
 at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
 at 
 org.apache.pig.pen.LineageTrimmingVisitor.init(LineageTrimmingVisitor.java:98)
 at 
 org.apache.pig.pen.LineageTrimmingVisitor.init(LineageTrimmingVisitor.java:90)
 at 
 org.apache.pig.pen.ExampleGenerator.getExamples(ExampleGenerator.java:106)
 at org.apache.pig.PigServer.getExamples(PigServer.java:724)
 at 
 org.apache.pig.tools.grunt.GruntParser.processIllustrate(GruntParser.java:541)
 at 
 org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:195)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:141)
 at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:75)
 at org.apache.pig.Main.main(Main.java:361)
 
 This works:
 illustrate ngramed1;
 Although it does throw a few NPEs:
 java.lang.NullPointerException
   at 
 org.apache.pig.pen.util.DisplayExamples.ShortenField(DisplayExamples.java:205)
   at 
 org.apache.pig.pen.util.DisplayExamples.MakeArray(DisplayExamples.java:190)
   at 
 org.apache.pig.pen.util.DisplayExamples.PrintTabular(DisplayExamples.java:86)
 [...]
 (illustrate also doesn't work on bzipped input, but that's a separate issue)




[jira] Resolved: (PIG-898) TextDataParser does not handle delimiters from one complex type in another

2010-07-12 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich resolved PIG-898.


Fix Version/s: 0.7.0
   Resolution: Fixed

This has been addressed as part of 613

 TextDataParser does not handle delimiters from one complex type in another
 --

 Key: PIG-898
 URL: https://issues.apache.org/jira/browse/PIG-898
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.4.0
Reporter: Santhosh Srinivasan
Priority: Minor
 Fix For: 0.7.0


 Currently, TextDataParser does not handle delimiters of one complex type inside 
 another. For example, a value such as key1(#value1} will not be parsed 
 correctly. The production for strings matches any sequence of characters that 
 does not contain any delimiters for the complex types.




Re: PIG Logical Optimization: Use CNF in SplitFilter

2010-07-12 Thread Swati Jain
If you are not going to check in your patch soon, it would be great if you
could share it with me. I believe I might be able to reuse some of your
(utility) functionality directly, or get some ideas from it.

About your cost-benefit question:
1) I will control the complexity of CNF conversion by providing a
configurable threshold value which will limit the OR-nesting.
2) One benefit of this conversion is that it will allow pushing parts of a
filter (conjuncts) across joins, which does not happen in the current
PushUpFilter optimization. Moreover, it may have a cascading effect, as other
rules fired as a result may push the conjuncts below other operators. The
benefit from this is really data dependent, but in big-data workloads, any
kind of predicate pushdown may eventually lead to big savings in the amount of
data read or transferred/shuffled across the network (I need to understand the
LogicalPlan to PhysicalPlan conversion better to give concrete examples).

Thanks!
Swati
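
The threshold idea in point 1 can be sketched with a clause-count estimate (illustrative only, not part of any patch): computed bottom-up over the expression tree, an OR node multiplies the clause counts of its children's CNFs, while an AND node adds them, and the rewrite is skipped when the estimate exceeds the configured limit.

```java
// Sketch of a configurable CNF threshold (illustrative, not a Pig patch):
// estimate the clause count of the CNF bottom-up.  For an OR node the clause
// counts of the children's CNFs multiply ((A1 AND A2) OR (B1 AND B2)
// distributes into 4 clauses); for an AND node they add.  Any leaf counts as
// one clause.  Skip the conversion when the estimate exceeds the limit.
public class CnfCost {
    public static long cnfClauses(String op, long leftClauses, long rightClauses) {
        return op.equals("OR") ? leftClauses * rightClauses : leftClauses + rightClauses;
    }

    public static boolean withinThreshold(long clauses, long limit) {
        return clauses <= limit;
    }
}
```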

On Mon, Jul 12, 2010 at 10:36 AM, Yan Zhou y...@yahoo-inc.com wrote:

  Yes, I already implemented the “NOT push down” upfront, so you do not
 need to do that.



 The support of CNF will probably be the most difficult part. But as I
 mentioned last time, you should compare the cost after trimming the CNF to
 get the post-split filtering logic. Given the complexity of manipulating CNF
 and the undetermined benefits, I am not sure whether it should be in scope at
 this moment.



 To handle CNF, I think it’s a good idea to create a new plan and connect
 the nodes in the new plan to the base plan as you envisioned. In my changes,
 which use DNF instead of CNF but are otherwise similar in processing, I use a
 LogicalExpressionProxy, which contains a “source” member that is just the
 node in the original plan, to link the nodes in the new plan and the old plan.
 The original LogicalExpression is enhanced with a counter to track the number
 of proxies of the original nodes, since normal-form creation will “spread” the
 nodes of the original tree across many normalized nodes. The benefit, aside
 from not setting the plan, is that the original expression is trimmed
 according to the processing results from the DNF, while the DNF is created
 separately, as a kind of utility, so that complex features can be used. In
 my changes, I used a multiple-child tree in the DNF while leaving the
 original binary expression tree structure unchanged. Another benefit is that
 the original tree is kept as close as possible to its initial form, i.e., I do
 not attempt to optimize its overall structure beyond trimming based upon the
 simplification logic. (I also cap the size of the DNF at 100 nodes.) The
 downside of this is added complexity.



 But in your case, for scenario 2, which is the whole point of using CNF, you
 would need to change the original expression tree structurally, beyond
 trimming, to obtain the post-split filtering logic. The other benefit of using
 a multiple-child expression depends on whether you plan to support such
 expressions as a replacement for the current binary tree
 in the final plan. Even though I think it’s a good idea to support that,
 it is not in my scope now.



 I’ll add my algorithm details soon to my jira. Please take a look and
 comment as you see appropriate.



 Thanks,



 Yan




  --

 *From:* Swati Jain [mailto:swat...@aggiemail.usu.edu]
 *Sent:* Friday, July 09, 2010 11:00 PM
 *To:* Yan Zhou
 *Cc:* pig-dev@hadoop.apache.org
 *Subject:* Re: PIG Logical Optimization: Use CNF in SplitFilter



 Hi Yan,

 I agree that the first scenario (filter logic applied to individual input
 sources) doesn't need conversion to CNF and that it will be a good idea to
 add CNF functionality for the second scenario. I was also planning to
 provide a configurable threshold value to control the complexity of CNF
 conversion.

 As part of the above, I wrote a utility to push the NOT operator in
 predicates below the AND and OR operators (Scenario 2 in PIG-1399). I am
 considering making this NOT push-down utility a separate rule in itself. Let
 me know if you have already implemented this.

 While implementing this utility I am having some trouble keeping the
 OperatorPlan consistent as I rewrite the expression, because each
 operator references the main filter logical plan. Here is my current
 implementation approach:

 1. I am creating a new LogicalExpressionPlan for the converted boolean
 expression.
 2. I am creating new logical expressions while pushing the NOT operation:
 converting AND into OR and OR into AND, and eliminating NOT NOT pairs.
 3. However, I am having trouble updating the LogicalExpressionPlan if it
 reaches the base case ( i.e. root operator is not NOT,AND,OR).

 D = Filter J2 by ( (c2 == 5) OR ( NOT( (c1 > 10) AND (c3+b3 > 10) ) ) );

 In the above, for example, I am not sure how to integrate base expression
 (c2 == 5) into the new LogicalExpressionPlan. There is no routine to set the
 plan for a given 
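
The NOT push-down utility being discussed (De Morgan conversion plus NOT NOT elimination) can be sketched on a toy expression tree; the Expr class below is illustrative, not Pig's LogicalExpression API:

```java
// Illustrative sketch of the NOT push-down rewrite discussed in this thread.
// The Expr class is hypothetical, not Pig's LogicalExpression hierarchy.
public class NotPushdown {
    public static final class Expr {
        public final String op;        // "AND", "OR", "NOT", or a leaf predicate string
        public final Expr left, right; // NOT uses only 'left'
        public Expr(String op, Expr left, Expr right) {
            this.op = op; this.left = left; this.right = right;
        }
        public static Expr leaf(String predicate) { return new Expr(predicate, null, null); }
        @Override public String toString() {
            if (left == null) return op;                        // leaf predicate
            if (op.equals("NOT")) return "NOT(" + left + ")";
            return "(" + left + " " + op + " " + right + ")";
        }
    }

    /** Push NOT below AND/OR via De Morgan and eliminate NOT(NOT(x)) pairs. */
    public static Expr pushNot(Expr e, boolean negated) {
        if (e.op.equals("NOT")) return pushNot(e.left, !negated);  // NOT NOT cancels
        if (e.op.equals("AND") || e.op.equals("OR")) {
            String op = negated ? (e.op.equals("AND") ? "OR" : "AND") : e.op;
            return new Expr(op, pushNot(e.left, negated), pushNot(e.right, negated));
        }
        // Base case: a comparison leaf; keep a NOT wrapper if still negated.
        return negated ? new Expr("NOT", e, null) : e;
    }
}
```

For the filter above, pushNot turns NOT((c1 > 10) AND (c3+b3 > 10)) into NOT(c1 > 10) OR NOT(c3+b3 > 10).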

[jira] Commented: (PIG-914) Change the PIG hbase interface to use bytes along with strings

2010-07-12 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12887479#action_12887479
 ] 

Olga Natkovich commented on PIG-914:


Alex, are you still planning to work on this?

 Change the PIG hbase interface to use bytes along with strings
 --

 Key: PIG-914
 URL: https://issues.apache.org/jira/browse/PIG-914
 Project: Pig
  Issue Type: Improvement
Reporter: Alex Newman
Priority: Minor

 Currently start rows, table names, and column names are all strings. Since 
 HBase supports bytes, we might want to change the Pig interface to support 
 bytes along with strings.




[jira] Commented: (PIG-916) Change the pig hbase interface to get more than one row at a time when scanning

2010-07-12 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12887480#action_12887480
 ] 

Olga Natkovich commented on PIG-916:


Alex, are you still planning to work on this?

 Change the pig hbase interface to get more than one row at a time when 
 scanning
 ---

 Key: PIG-916
 URL: https://issues.apache.org/jira/browse/PIG-916
 Project: Pig
  Issue Type: Improvement
Reporter: Alex Newman
Priority: Trivial

 It should be significantly faster to get numerous rows at the same time 
 rather than one row at a time for large table extraction processes.




[jira] Commented: (PIG-909) Allow Pig executable to use hadoop jars not bundled with pig

2010-07-12 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12887476#action_12887476
 ] 

Olga Natkovich commented on PIG-909:


Did this actually get checked in? Should this be resurrected for Pig 0.8.0 or 
closed?

 Allow Pig executable to use hadoop jars not bundled with pig
 

 Key: PIG-909
 URL: https://issues.apache.org/jira/browse/PIG-909
 Project: Pig
  Issue Type: Improvement
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
Priority: Minor
 Attachments: pig_909.patch


 The current pig executable (bin/pig) looks for a file named 
 hadoop${PIG_HADOOP_VERSION}.jar that comes bundled with Pig.
 The proposed change will allow Pig to look in $HADOOP_HOME for the hadoop 
 jars, if that variable is set.




[jira] Updated: (PIG-932) Required fields projection in Loader: nested fields in bag/tuple, map key lookup more than two levels

2010-07-12 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-932:
---

Fix Version/s: 0.8.0

 Required fields projection in Loader: nested fields in bag/tuple, map key 
 lookup more than two levels
 -

 Key: PIG-932
 URL: https://issues.apache.org/jira/browse/PIG-932
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.3.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.8.0


 To leverage the performance features provided by Zebra, Pig should be able to 
 figure out which input fields are actually used in a Pig script, and prune 
 unnecessary inputs. This feature is being implemented in 
 [PIG-922|https://issues.apache.org/jira/browse/PIG-922]. However, there are 
 two limitations currently:
 1. Pruning nested fields applies only to maps. We do not prune sub-fields 
 inside a bag or tuple.
 2. For maps, we currently only go one level deep. E.g., if in a Pig script the 
 user uses a#'key0'#'key1', only a#'key0' will be requested.
 These two limitations are in line with the current limitations of the Zebra 
 loader. Once the Zebra loader can handle these cases, we need to lift them.
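
The second limitation amounts to truncating a lookup path after the first map key; a minimal sketch follows (the '#'-separated string form of the path is an assumption for illustration, not Pig's RequiredFieldList API):

```java
// Sketch of the one-level map-key limitation described above: a required
// projection like a#'key0'#'key1' is truncated so only a#'key0' is requested
// from the loader.  The '#'-separated string form of the path is illustrative.
public class MapKeyPruning {
    /** Keep only the field name plus the first map key of a lookup path. */
    public static String requiredProjection(String path) {
        String[] parts = path.split("#");
        return parts.length <= 2 ? path : parts[0] + "#" + parts[1];
    }
}
```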




[jira] Updated: (PIG-931) Samples Syntax Error in Pig UDF Manual

2010-07-12 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-931:
---

 Assignee: Corinne Chandel
Fix Version/s: 0.8.0

 Samples Syntax Error in Pig UDF Manual
 --

 Key: PIG-931
 URL: https://issues.apache.org/jira/browse/PIG-931
 Project: Pig
  Issue Type: Improvement
  Components: documentation
Affects Versions: 0.2.0, 0.3.0
 Environment: Windows XP, firefox 3.5.2
Reporter: Yiwei Chen
Assignee: Corinne Chandel
Priority: Trivial
 Fix For: 0.8.0


 All samples with 'extends EvalFunc' have syntax errors in 
 http://hadoop.apache.org/pig/docs/r0.3.0/udf.html .
 There shouldn't be parentheses; they should be angle brackets.
 For example, in the How to Write a Simple Eval Function section:
   public class UPPER extends EvalFunc (String)
 should be 
   public class UPPER extends EvalFunc<String>




[jira] Updated: (PIG-930) merge join should handle compressed bz2 sorted files

2010-07-12 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-930:
---

Fix Version/s: 0.8.0

Likely, this is no longer an issue in 0.7.0. Need to verify and add unit tests

 merge join should handle compressed bz2 sorted files
 

 Key: PIG-930
 URL: https://issues.apache.org/jira/browse/PIG-930
 Project: Pig
  Issue Type: Bug
Reporter: Pradeep Kamath
 Fix For: 0.8.0


 There are two issues. First, POLoad, which is used to read the right-side 
 input, does not handle bz2 files right now; this needs to be fixed.
 Second, in the index map job we bindTo(startOfBlockOffSet) (this will 
 internally discard the first tuple if offset > 0). Then we do the following:
 {noformat}
 While(tuple survives pipeline) {
   Pos =  getPosition()
   getNext() 
   run the tuple  through pipeline in the right side which could have filter
 }
 Emit(key, pos, filename).
 {noformat}
  
 Then, in the map job which does the join, we bindTo(pos > 0 ? pos - 1 : pos) 
 (we do pos - 1 because bindTo will discard the first tuple for pos > 0). Then 
 we do getNext().
 Now, in bz2-compressed files, getPosition() returns a position which is not 
 really accurate: it could be a position in the middle of a compressed bz2 
 block. When we use that position to bindTo() in the final map job, the code 
 first hunts for a bz2 block header, thus skipping the whole current bz2 
 block. 
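
The offset adjustment described above can be sketched as a small helper (illustrative, not Pig's actual code): back up one byte for any positive recorded position, so that bindTo()'s discard-first-tuple behavior lands exactly on the indexed tuple.

```java
// Sketch of the merge-join rebinding adjustment (illustrative helper, not
// Pig's code): bindTo() discards the first tuple whenever the offset is > 0,
// so the join map job rebinds one byte earlier to keep the indexed tuple.
public class MergeJoinSeek {
    public static long adjustedBindOffset(long recordedPos) {
        return recordedPos > 0 ? recordedPos - 1 : recordedPos;
    }
}
```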




[jira] Updated: (PIG-932) Required fields projection in Loader: nested fields in bag/tuple, map key lookup more than two levels

2010-07-12 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-932:
---

Assignee: Daniel Dai

Possible work for 0.8.0. Need to see if we have time

 Required fields projection in Loader: nested fields in bag/tuple, map key 
 lookup more than two levels
 -

 Key: PIG-932
 URL: https://issues.apache.org/jira/browse/PIG-932
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.3.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.8.0


 To leverage the performance features provided by Zebra, Pig should be able to 
 figure out which input fields are actually used in a Pig script, and prune 
 unnecessary inputs. This feature is being implemented in 
 [PIG-922|https://issues.apache.org/jira/browse/PIG-922]. However, there are 
 two limitations currently:
 1. Pruning nested fields applies only to maps. We do not prune sub-fields 
 inside a bag or tuple.
 2. For maps, we currently only go one level deep. E.g., if in a Pig script the 
 user uses a#'key0'#'key1', only a#'key0' will be requested.
 These two limitations are in line with the current limitations of the Zebra 
 loader. Once the Zebra loader can handle these cases, we need to lift them.




[jira] Updated: (PIG-947) Parsing Bags by PigStorage is not handled correctly if whitespace before start of tuple.

2010-07-12 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-947:
---

Fix Version/s: 0.8.0

 Parsing Bags by PigStorage is not handled correctly if whitespace before 
 start of tuple.
 

 Key: PIG-947
 URL: https://issues.apache.org/jira/browse/PIG-947
 Project: Pig
  Issue Type: Bug
  Components: data
 Environment: Pig on Hadoop 18
Reporter: Gandul Azul
 Fix For: 0.8.0


 The PigStorage parser for bags does not work correctly when a tuple in a bag 
 is preceded by a space. For example, the following is parsed correctly:
 {(-5.243084,3.142401,0.000138,2.071200,0),(-6.021349,0.992683,0.44,0.992683,0),(-10.426160,20.251774,0.000892,5.691086,0)}
 while this is not: (Note the space before the second tuple)
 {(-5.243084,3.142401,0.000138,2.071200,0), 
 (-6.021349,0.992683,0.44,0.992683,0),(-10.426160,20.251774,0.000892,5.691086,0)}
 It seems that when the parser encounters the space, it treats the rest of the 
 line as a string. With a schema, this results in a typecast of string to 
 databag, which results in an exception. 
 |WARN builtin.PigStorage: Unable to interpret value [...@2c9b42e6 in field 
 being converted to type bag, caught ParseException Encountered " STRING " 
 at line 1, column 43.
 |Was expecting:
 |"(" ...
 | field discarded
 Below is the parser debug output for the parsing of the error sequence 
 "2.071200,0), (" from above:
 ** FOUND A DOUBLENUMBER MATCH (2.071200) **
   Call:   AtomDatum
 Consumed token: DOUBLENUMBER: 2.071200 at line 1 column 31
   Return: AtomDatum
 Return: Datum
Matched the empty string as STRING token.
 Current character : , (44) at line 1 column 39
No more string literal token matches are possible.
Currently matched the first 1 characters as a , token.
 ** FOUND A , MATCH (,) **
 Consumed token: , at line 1 column 39
 Call:   Datum
Matched the empty string as STRING token.
 Current character : 0 (48) at line 1 column 40
No string literal matches possible.
Starting NFA to match one of : { STRING, SIGNEDINTEGER, DOUBLENUMBER 
 }
 Current character : 0 (48) at line 1 column 40
Currently matched the first 1 characters as a SIGNEDINTEGER token.
Possible kinds of longer matches : { STRING, SIGNEDINTEGER, 
 DOUBLENUMBER, LONGINTEGER, 
  FLOATNUMBER }
 Current character : ) (41) at line 1 column 41
Currently matched the first 1 characters as a SIGNEDINTEGER token.
Putting back 1 characters into the input stream.
 ** FOUND A SIGNEDINTEGER MATCH (0) **
   Call:   AtomDatum
 Consumed token: SIGNEDINTEGER: 0 at line 1 column 40
   Return: AtomDatum
 Return: Datum
Matched the empty string as STRING token.
 Current character : ) (41) at line 1 column 41
No more string literal token matches are possible.
Currently matched the first 1 characters as a ) token.
 ** FOUND A ) MATCH ()) **
   Return: Tuple
   Consumed token: ) at line 1 column 41
Matched the empty string as STRING token.
 Current character : , (44) at line 1 column 42
No more string literal token matches are possible.
Currently matched the first 1 characters as a , token.
 ** FOUND A , MATCH (,) **
   Consumed token: , at line 1 column 42
Matched the empty string as STRING token.
 Current character :   (32) at line 1 column 43
No string literal matches possible.
Starting NFA to match one of : { STRING, SIGNEDINTEGER, DOUBLENUMBER 
 }
 Current character :   (32) at line 1 column 43
Currently matched the first 1 characters as a STRING token.
Possible kinds of longer matches : { STRING, SIGNEDINTEGER, 
 DOUBLENUMBER }
 Current character : ( (40) at line 1 column 44
Currently matched the first 1 characters as a STRING token.
Putting back 1 characters into the input stream.
 ** FOUND A STRING MATCH ( ) **
 Return: Bag
   Return: Datum
 Return: Parse




[jira] Updated: (PIG-969) Default constructor of UDF gets called for UDF with parameterised constructor , if the udf has a getArgToFuncMapping function defined

2010-07-12 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-969:
---

Fix Version/s: 0.9.0
  Description: 
This issue is discussed in 
http://www.mail-archive.com/pig-u...@hadoop.apache.org/msg00524.html . I am 
able to reproduce it. While it is easy to fix the udf, it can take a lot of 
time to figure out the problem (until users find this email conversation!).

The root cause is that when getArgToFuncMapping is defined in the udf, the 
FuncSpec returned by the method replaces the one set by the define statement, 
and the constructor arguments get lost. We can handle this in the following 
ways:

1. Preserve the constructor arguments, and use them with the class name of the 
matching FuncSpec from getArgToFuncMapping.
2. Give an error if constructor parameters are given for a udf which has 
FuncSpecs returned from getArgToFuncMapping.

The problem with approach 1 is that we are letting the user define the 
FuncSpec, so the user could have defined a FuncSpec with constructor arguments 
(though they don't have a valid reason to do so). It is also possible that the 
constructor of the different class that matched might not support the same 
constructor parameters. The use of this function outside builtin udfs is also 
probably not common.

With option 2, we are telling the user that this is not a supported use case; 
the user can easily change the udf to fix the issue, or use the udf which would 
have matched the given parameters (which is unlikely to have the 
getArgToFuncMapping method defined).

I am proposing that we go with option 2.




 Default constructor of UDF gets called for UDF with parameterised constructor 
 , if the udf has a getArgToFuncMapping function defined
 -

 Key: PIG-969
 URL: https://issues.apache.org/jira/browse/PIG-969
 Project: Pig
  Issue Type: Bug
  Components: impl
Reporter: Thejas M Nair
 Fix For: 0.9.0


 This issue is discussed in  
 http://www.mail-archive.com/pig-u...@hadoop.apache.org/msg00524.html . I am 
 able to reproduce the issue. While it is easy to fix the udf, it can take a 
 lot of time to figure out the problem (until they find this email 
 conversation!).
 The root cause is that when getArgToFuncMapping is defined in the udf , the 
 FuncSpec returned by the method replaces one set by define statement . The 
 constructor arguments get lost.  We can handle this in following ways -
 1. Preserve the constructor arguments, and use it with the class name of the 
 matching FuncSpec from getArgToFuncMapping . 
 2. Give an error if constructor parameters are given for a udf which has 
 FuncSpecs returned from getArgToFuncMapping .
 The problem with  approach 1 is that we are letting the user define the 
 FuncSpec , so user could have defined a FuncSpec with constructor (though 
 they don't have a valid reason to do so). It is also possible that the 
 constructor of the different class that matched might not support same 
 constructor parameters. The use of this function outside builtin udfs are 
 also probably not common.
 With option 2, we are telling the user that this is not a supported use case, 
 and user can easily change the udf to fix the issue, or use the udf which 
 would have matched given parameters (which unlikely to have the 
 

[jira] Resolved: (PIG-1182) Pig reference manual does not mention syntax for comments

2010-07-12 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich resolved PIG-1182.
-

Resolution: Fixed

Closing. If we do want to create a comprehensive index, please create a 
separate JIRA.

 Pig reference manual does not mention syntax for comments
 -

 Key: PIG-1182
 URL: https://issues.apache.org/jira/browse/PIG-1182
 Project: Pig
  Issue Type: Bug
  Components: documentation
Affects Versions: 0.5.0
Reporter: David Ciemiewicz

 The Pig 0.5.0 reference manual does not mention how to write comments in your 
 pig code using -- (two dashes).
 http://hadoop.apache.org/pig/docs/r0.5.0/piglatin_reference.html
 Also, does /* */ also work?




[jira] Updated: (PIG-999) sorting on map-value fails if map-value is not of bytearray type

2010-07-12 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-999:
---

Fix Version/s: 0.9.0

 sorting on map-value fails if map-value is not of bytearray type
 

 Key: PIG-999
 URL: https://issues.apache.org/jira/browse/PIG-999
 Project: Pig
  Issue Type: Sub-task
Reporter: Thejas M Nair
 Fix For: 0.9.0


 When the query execution plan is created by Pig, it assumes the type to be 
 bytearray because there is no schema information associated with map fields.
 But at run time, the loader might return the actual type. This results in a 
 ClassCastException.
 This points to the larger issue of the way Pig handles types for map values. 
 It should be fixed in the context of revisiting the frontend logic and 
 Pig Latin semantics.
 This is related to PIG-880. The patch in PIG-880 changed PigStorage to always 
 return bytearray for map values to work around this, but other loaders like 
 BinStorage can return the actual type, causing this issue.




[jira] Updated: (PIG-998) revisit frontend logic and pig-latin semantics

2010-07-12 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-998:
---

Fix Version/s: 0.9.0

 revisit frontend logic and pig-latin semantics
 --

 Key: PIG-998
 URL: https://issues.apache.org/jira/browse/PIG-998
 Project: Pig
  Issue Type: Bug
Reporter: Thejas M Nair
 Fix For: 0.9.0


 This jira has been created to keep track of issues with the current frontend 
 logic and Pig Latin semantics.
 One example is the handling of type information for map values. At query plan 
 generation time, Pig does not know the type of map values and assumes it is 
 bytearray. This leads to problems when the loader returns map values of other 
 types.




[jira] Resolved: (PIG-967) Proposal for adding a metadata interface to Pig

2010-07-12 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich resolved PIG-967.


Resolution: Won't Fix

This is an obsolete proposal

 Proposal for adding a metadata interface to Pig
 ---

 Key: PIG-967
 URL: https://issues.apache.org/jira/browse/PIG-967
 Project: Pig
  Issue Type: Improvement
  Components: impl
Reporter: Alan Gates
Assignee: Alan Gates

 Pig needs to have an interface to connect to metadata systems.  
 http://wiki.apache.org/pig/MetadataInterfaceProposal proposes an interface 
 for this.




[jira] Updated: (PIG-1065) In-determinate behaviour of Union when there are 2 non-matching schema's

2010-07-12 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1065:


Fix Version/s: 0.9.0

 In-determinate behaviour of Union when there are 2 non-matching schema's
 

 Key: PIG-1065
 URL: https://issues.apache.org/jira/browse/PIG-1065
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Viraj Bhat
 Fix For: 0.9.0


 I have a script which first does a union of these schemas and then does a 
 ORDER BY of this result.
 {code}
 f1 = LOAD '1.txt' as (key:chararray, v:chararray);
 f2 = LOAD '2.txt' as (key:chararray);
 u0 = UNION f1, f2;
 describe u0;
 dump u0;
 u1 = ORDER u0 BY $0;
 dump u1;
 {code}
 When I run in Map Reduce mode I get the following result:
 $java -cp pig.jar:$HADOOP_HOME/conf org.apache.pig.Main broken.pig
 
 Schema for u0 unknown.
 
 (1,2)
 (2,3)
 (1)
 (2)
 
 org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to 
 open iterator for alias u1
 at org.apache.pig.PigServer.openIterator(PigServer.java:475)
 at 
 org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:532)
 at 
 org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:190)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:142)
 at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
 at org.apache.pig.Main.main(Main.java:397)
 
 Caused by: java.io.IOException: Type mismatch in key from map: expected 
 org.apache.pig.impl.io.NullableBytesWritable, recieved 
 org.apache.pig.impl.io.NullableText
 at 
 org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:415)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:108)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:251)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:240)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:93)
 at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
 
 When I run the same script in local mode I get a different result, since 
 local mode does not use any Hadoop classes.
 $java -cp pig.jar org.apache.pig.Main -x local broken.pig
 
 Schema for u0 unknown
 
 (1,2)
 (1)
 (2,3)
 (2)
 
 (1,2)
 (1)
 (2,3)
 (2)
 
 Here are some questions:
 1) Why do we allow UNION if the schemas do not match?
 2) Should we not print an error message or warning so that the user knows 
 this is not allowed, or that unexpected results may occur?
 Viraj




[jira] Updated: (PIG-1066) ILLUSTRATE called after DESCRIBE results in Grunt: ERROR 2999: Unexpected internal error. null

2010-07-12 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1066:


Fix Version/s: 0.9.0

 ILLUSTRATE called after DESCRIBE results in Grunt: ERROR 2999: Unexpected 
 internal error. null
 

 Key: PIG-1066
 URL: https://issues.apache.org/jira/browse/PIG-1066
 Project: Pig
  Issue Type: Bug
  Components: grunt
Affects Versions: 0.4.0
Reporter: Bogdan Dorohonceanu
 Fix For: 0.9.0


 -- load the QID_CT_QP20 data
 x = LOAD '$FS_TD/$QID_IN_FILES' USING PigStorage('\t') AS 
 (unstem_qid:chararray, jid_score_pairs:chararray);
 DESCRIBE x;
 --ILLUSTRATE x;
 -- load the ID_RQ data
 y0 = LOAD '$FS_USER/$ID_RQ_IN_FILE' USING PigStorage('\t') AS (sid:chararray, 
 query:chararray);
 -- force parallelization
 -- y1 = ORDER y0 BY sid PARALLEL $NUM;
 -- compute unstem_qid
 DEFINE f `text_streamer_query j3_unicode.dat prop.dat normal.txt TAB TAB 
 1:yes:UNSTEM_ID:%llx` INPUT(stdin USING PigStorage('\t')) 
 OUTPUT(stdout USING PigStorage('\t')) SHIP('$USER/text_streamer_query', 
 '$USER/j3_unicode.dat', '$USER/prop.dat', '$USER/normal.txt');
 y = STREAM y0 THROUGH f AS (sid:chararray, query:chararray, 
 unstem_qid:chararray);
 DESCRIBE y;
 --ILLUSTRATE y;
 rmf /user/vega/zoom/y_debug
 STORE y INTO '/user/vega/zoom/y_debug' USING PigStorage('\t');
 2009-10-30 13:36:48,437 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting 
 to hadoop file system at: 
 hdfs://dd-9c32d03:8887/,/teoma/dd-9c34d04/middleware/hadoop.test.data/dfs/name
 09/10/30 13:36:48 INFO executionengine.HExecutionEngine: Connecting to hadoop 
 file system at: 
 hdfs://dd-9c32d03:8887/,/teoma/dd-9c34d04/middleware/hadoop.test.data/dfs/name
 2009-10-30 13:36:48,495 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting 
 to map-reduce job tracker at: dd-9c32d04:8889
 09/10/30 13:36:48 INFO executionengine.HExecutionEngine: Connecting to 
 map-reduce job tracker at: dd-9c32d04:8889
 2009-10-30 13:36:49,242 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 2999: Unexpected internal error. null
 09/10/30 13:36:49 ERROR grunt.Grunt: ERROR 2999: Unexpected internal error. 
 null
 Details at logfile: /disk1/vega/zoom/pig_1256909801304.log




[jira] Resolved: (PIG-1056) table can not be loaded after store

2010-07-12 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich resolved PIG-1056.
-

Resolution: Invalid

The script is invalid and that's why you see the error

 table can not be loaded after store
 ---

 Key: PIG-1056
 URL: https://issues.apache.org/jira/browse/PIG-1056
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Jing Huang

 Pig Stack Trace
 ---
 ERROR 1018: Problem determining schema during load
 org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error during 
 parsing. Problem determining schema during load
 at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1023)
 at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:967)
 at org.apache.pig.PigServer.registerQuery(PigServer.java:383)
 at 
 org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:716)
 at 
 org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:324)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144)
 at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
 at org.apache.pig.Main.main(Main.java:397)
 Caused by: org.apache.pig.impl.logicalLayer.parser.ParseException: Problem 
 determining schema during load
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:734)
 at 
 org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63)
 at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1017)
 ... 8 more
 Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1018: 
 Problem determining schema during load
 at org.apache.pig.impl.logicalLayer.LOLoad.getSchema(LOLoad.java:155)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:732)
 ... 10 more
 Caused by: java.io.IOException: No table specified for input
 at 
 org.apache.hadoop.zebra.pig.TableLoader.checkConf(TableLoader.java:238)
 at 
 org.apache.hadoop.zebra.pig.TableLoader.determineSchema(TableLoader.java:258)
 at org.apache.pig.impl.logicalLayer.LOLoad.getSchema(LOLoad.java:148)
 ... 11 more
 
 ~ 
 
 script:
 register /grid/0/dev/hadoopqa/hadoop/lib/zebra.jar;
 A = load 'filter.txt' as (name:chararray, age:int);
 B = filter A by age > 20;
 --dump B;
 store B into 'filter1' using 
 org.apache.hadoop.zebra.pig.TableStorer('[name];[age]');
 rec1 = load 'B' using org.apache.hadoop.zebra.pig.TableLoader();
 dump rec1;




[jira] Updated: (PIG-1092) Pig Latin Parser fails to recognize \n as a whitespace

2010-07-12 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1092:


Fix Version/s: 0.9.0

 Pig Latin Parser fails to recognize \n as a whitespace
 

 Key: PIG-1092
 URL: https://issues.apache.org/jira/browse/PIG-1092
 Project: Pig
  Issue Type: Bug
  Components: grunt
 Environment: RHEL linux
Reporter: Yang Yang
Priority: Minor
 Fix For: 0.9.0


 The following Pig Latin script fails to parse:
 a = load 'input_file' as
 ( field1 : int );
 Note that there is no character after the 'as', so there is only one '\n' 
 character between the 'as' and the '(' on the next line.
 Adding a whitespace after 'as' solves it.




[jira] Updated: (PIG-1112) FLATTEN eliminates the alias

2010-07-12 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1112:


Fix Version/s: 0.9.0

 FLATTEN eliminates the alias
 

 Key: PIG-1112
 URL: https://issues.apache.org/jira/browse/PIG-1112
 Project: Pig
  Issue Type: Bug
Reporter: Ankur
Assignee: Daniel Dai
 Fix For: 0.9.0


 If the schema for a field of type 'bag' is only partially defined, then 
 FLATTEN() incorrectly eliminates the field and throws an error. 
 Consider the following example:
 A = LOAD 'sample' using PigStorage() as (first:chararray, second:chararray, 
 ladder:bag{});  
 B = FOREACH A GENERATE first,FLATTEN(ladder) as third,second; 
   
 C = GROUP B by (first,third);
 This throws the error
  ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. 
 Invalid alias: third in {first: chararray,second: chararray}




[jira] Updated: (PIG-1017) Converts strings to text in Pig

2010-07-12 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1017:


Fix Version/s: 0.9.0

 Converts strings to text in Pig
 ---

 Key: PIG-1017
 URL: https://issues.apache.org/jira/browse/PIG-1017
 Project: Pig
  Issue Type: Improvement
Reporter: Sriranjan Manjunath
Assignee: Sriranjan Manjunath
 Fix For: 0.9.0

 Attachments: stotext.patch


 Strings in Java are UTF-16 and take 2 bytes per character. Text 
 (org.apache.hadoop.io.Text) stores the data in UTF-8 and could show 
 significant reductions in memory usage.
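 As a rough illustration of the memory argument (a standalone sketch, not Pig
 code), an ASCII string's UTF-8 encoding is half the size of its in-memory
 UTF-16 representation:

 ```java
 import java.nio.charset.StandardCharsets;

 public class EncodingSizes {
     // Returns {utf16Bytes, utf8Bytes}: Java's char[] backing is UTF-16
     // (2 bytes per char for BMP text), while Text stores UTF-8.
     static int[] sizes(String s) {
         return new int[] {
             s.length() * 2,
             s.getBytes(StandardCharsets.UTF_8).length
         };
     }

     public static void main(String[] args) {
         int[] r = sizes("hello");
         System.out.println(r[0] + " vs " + r[1]); // 10 vs 5 for ASCII text
     }
 }
 ```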




[jira] Updated: (PIG-1152) bincond operator throws parser error

2010-07-12 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1152:


Fix Version/s: 0.9.0

 bincond operator throws parser error
 

 Key: PIG-1152
 URL: https://issues.apache.org/jira/browse/PIG-1152
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Ankur
 Fix For: 0.9.0


 The bincond operator throws a parser error when the true branch contains a 
 constant bag with one tuple containing a single int field with a negative 
 value. 
 Here is the script to reproduce the issue:
 A = load 'A' as (s: chararray, x: int, y: int);
 B = group A by s;
 C = foreach B generate group, flatten(((COUNT(A) > 1L) ? {(-1)} : A.x));
 dump C;




[jira] Updated: (PIG-1178) LogicalPlan and Optimizer are too complex and hard to work with

2010-07-12 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1178:


Fix Version/s: 0.8.0

 LogicalPlan and Optimizer are too complex and hard to work with
 ---

 Key: PIG-1178
 URL: https://issues.apache.org/jira/browse/PIG-1178
 Project: Pig
  Issue Type: Improvement
Reporter: Alan Gates
Assignee: Daniel Dai
 Fix For: 0.8.0

 Attachments: expressions-2.patch, expressions.patch, lp.patch, 
 lp.patch, pig_1178.patch, pig_1178.patch, PIG_1178.patch, pig_1178_2.patch, 
 pig_1178_3.2.patch, pig_1178_3.3.patch, pig_1178_3.4.patch, pig_1178_3.patch


 The current implementation of the logical plan and the logical optimizer in 
 Pig has proven to not be easily extensible. Developer feedback has indicated 
 that adding new rules to the optimizer is quite burdensome. In addition, the 
 logical plan has been an area of numerous bugs, many of which have been 
 difficult to fix. Developers also feel that the logical plan is difficult to 
 understand and maintain. The root cause of these issues is that a number of 
 design decisions made as part of the 0.2 rewrite of the front end have since 
 proven sub-optimal. The heart of this proposal is to revisit a number of 
 those decisions and rebuild the logical plan with a simpler design that will 
 make it much easier to maintain the logical plan as well as extend the 
 logical optimizer. 
 See http://wiki.apache.org/pig/PigLogicalPlanOptimizerRewrite for full 
 details.




[jira] Resolved: (PIG-1235) OptimizerException: Problem while rebuilding projection map or schema in logical optimizer

2010-07-12 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich resolved PIG-1235.
-

Resolution: Won't Fix

This is not relevant with new optimizer

 OptimizerException: Problem while rebuilding projection map or schema in 
 logical optimizer
 --

 Key: PIG-1235
 URL: https://issues.apache.org/jira/browse/PIG-1235
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Richard Ding

 Here is the script that throws this exception:
 {code}
 A = load '1.txt' as (x, y, z);
 B = group A by (x > 0 ? x : 0);
 C = filter B by group > 10;
 explain C;
 {code}
 Pig Stack Trace
 ---
 ERROR 2157: Error while fixing projections. No mapping available in old 
 predecessor to replace column.
 org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1067: Unable to 
 explain alias C
 at org.apache.pig.PigServer.explain(PigServer.java:593)
 at 
 org.apache.pig.tools.grunt.GruntParser.explainCurrentBatch(GruntParser.java:315)
 at 
 org.apache.pig.tools.grunt.GruntParser.processExplain(GruntParser.java:268)
 at 
 org.apache.pig.tools.pigscript.parser.PigScriptParser.Explain(PigScriptParser.java:517)
 at 
 org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:265)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144)
 at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:75)
 at org.apache.pig.Main.main(Main.java:352)
 Caused by: org.apache.pig.impl.plan.optimizer.OptimizerException: ERROR 2145: 
 Problem while rebuilding projection map or schema in logical optimizer.
 at 
 org.apache.pig.impl.logicalLayer.optimizer.LogicalOptimizer.optimize(LogicalOptimizer.java:215)
 at org.apache.pig.PigServer.compileLp(PigServer.java:856)
 at org.apache.pig.PigServer.compileLp(PigServer.java:792)
 at org.apache.pig.PigServer.getStorePlan(PigServer.java:734)
 at org.apache.pig.PigServer.explain(PigServer.java:576)
 ... 8 more




[jira] Updated: (PIG-1247) Error Number makes it hard to debug: ERROR 2999: Unexpected internal error. org.apache.pig.backend.datastorage.DataStorageException cannot be cast to java.lang.Error

2010-07-12 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1247:


Fix Version/s: 0.9.0

 Error Number makes it hard to debug: ERROR 2999: Unexpected internal error. 
 org.apache.pig.backend.datastorage.DataStorageException cannot be cast to 
 java.lang.Error
 -

 Key: PIG-1247
 URL: https://issues.apache.org/jira/browse/PIG-1247
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat
 Fix For: 0.9.0


 I have a large script in which there are intermediate store statements; one 
 of them writes to a directory I do not have permission to write to. 
 The stack trace I get from Pig is this:
 2010-02-20 02:16:32,055 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 2999: Unexpected internal error. 
 org.apache.pig.backend.datastorage.DataStorageException cannot be cast to 
 java.lang.Error
 Details at logfile: /home/viraj/pig_1266632145355.log
 Pig Stack Trace
 ---
 ERROR 2999: Unexpected internal error. 
 org.apache.pig.backend.datastorage.DataStorageException cannot be cast to 
 java.lang.Error
 java.lang.ClassCastException: 
 org.apache.pig.backend.datastorage.DataStorageException cannot be cast to 
 java.lang.Error
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.StoreClause(QueryParser.java:3583)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:1407)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:949)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:762)
 at 
 org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63)
 at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1036)
 at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:986)
 at org.apache.pig.PigServer.registerQuery(PigServer.java:386)
 at 
 org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:720)
 at 
 org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:324)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144)
 at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
 at org.apache.pig.Main.main(Main.java:386)
 
 The only way to find the error was to look at the javacc-generated 
 QueryParser.java code and add a System.out.println().
 Here is a script to reproduce the problem:
 {code}
 A = load '/user/viraj/three.txt' using PigStorage();
 B = foreach A generate ['a'#'12'] as b:map[] ;
 store B into '/user/secure/pigtest' using PigStorage();
 {code}
 three.txt has 3 lines which contain nothing but the number 1.
 {code}
 $ hadoop fs -ls /user/secure/
 ls: could not get listing for 'hdfs://mynamenode/user/secure' : 
 org.apache.hadoop.security.AccessControlException: Permission denied: 
 user=viraj, access=READ_EXECUTE, inode=secure:secure:users:rwx--
 {code}
 Viraj




[jira] Updated: (PIG-1277) Pig should give error message when cogroup on tuple keys of different inner type

2010-07-12 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1277:


Fix Version/s: 0.9.0

 Pig should give error message when cogroup on tuple keys of different inner 
 type
 

 Key: PIG-1277
 URL: https://issues.apache.org/jira/browse/PIG-1277
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Daniel Dai
 Fix For: 0.9.0


 When we cogroup on a tuple key, if the inner types of the tuple do not 
 match, we treat them as different keys. This is confusing; it is desirable 
 to give an error or warning when this happens.
 Here is one example:
 UDF:
 {code}
 public class MapGenerate extends EvalFunc<Map> {
 @Override
 public Map exec(Tuple input) throws IOException {
 Map m = new HashMap();
 m.put("key", new Integer(input.size()));
 return m;
 }
 
 @Override
 public Schema outputSchema(Schema input) {
 return new Schema(new Schema.FieldSchema(null, DataType.MAP));
 }
 }
 {code}
 Pig script: 
 {code}
 a = load '1.txt' as (a0);
 b = foreach a generate a0, MapGenerate(*) as m:map[];
 c = foreach b generate a0, m#'key' as key;
 d = load '2.txt' as (c0, c1);
 e = cogroup c by (a0, key), d by (c0, c1);
 dump e;
 {code}
 1.txt
 {code}
 1
 {code}
 2.txt
 {code}
 1 1
 {code}
 User expected result (which is not right):
 {code}
 ((1,1),{(1,1)},{(1,1)})
 {code}
 Real result:
 {code}
 ((1,1),{(1,1)},{})
 ((1,1),{},{(1,1)})
 {code}
 We should give the user a message that the keys cannot be merged due to the 
 type mismatch.
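 The behavior can be illustrated with plain Java equality (a toy sketch, not 
 Pig internals): grouping relies on key equality, and values of different 
 types never compare equal even when they print the same.

 ```java
 public class KeyTypeMismatch {
     // Type-sensitive key equality, as used when grouping rows by key.
     static boolean sameKey(Object left, Object right) {
         return left.equals(right);
     }

     public static void main(String[] args) {
         System.out.println(sameKey("1", "1"));                // true: same type
         System.out.println(sameKey(Integer.valueOf(1), "1")); // false: int vs chararray
     }
 }
 ```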




[jira] Updated: (PIG-1319) New logical optimization rules

2010-07-12 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1319:


Fix Version/s: 0.8.0

 New logical optimization rules
 --

 Key: PIG-1319
 URL: https://issues.apache.org/jira/browse/PIG-1319
 Project: Pig
  Issue Type: New Feature
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.8.0


 In [PIG-1178|https://issues.apache.org/jira/browse/PIG-1178], we build a new 
 logical optimization framework. One design goal for the new logical optimizer 
 is to make it easier to add new logical optimization rules. In this Jira, we 
 keep track of the development of these new logical optimization rules.




[jira] Resolved: (PIG-1328) pigtest ant target fails pigtrunk builds

2010-07-12 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich resolved PIG-1328.
-

Resolution: Fixed

I believe all tests are running now. Please re-open and clarify if this is 
still an issue.

 pigtest ant target fails pigtrunk builds
 

 Key: PIG-1328
 URL: https://issues.apache.org/jira/browse/PIG-1328
 Project: Pig
  Issue Type: Bug
  Components: build
Reporter: Giridharan Kesavan

 java.lang.NoClassDefFoundError:com_cenqua_clover/CloverVersionInfo)
 [junit] Tests run: 0, Failures: 0, Errors: 2, Time elapsed: 0.154 sec
 [junit] Test org.apache.hadoop.zebra.pig.TestTableSortStorer FAILED
 [junit] Running org.apache.hadoop.zebra.pig.TestTableSortStorerDesc
 [junit] log4j:WARN No appenders could be found for logger 
 (org.apache.hadoop.conf.Configuration).
 [junit] log4j:WARN Please initialize the log4j system properly.
 [junit] [CLOVER] FATAL ERROR: Clover could not be initialised. Are you 
 sure you have Clover in the runtime classpath? (class 
 java.lang.NoClassDefFoundError:com_cenqua_clover/CloverVersionInfo)
 [junit] Tests run: 0, Failures: 0, Errors: 2, Time elapsed: 0.164 sec
 [junit] Test org.apache.hadoop.zebra.pig.TestTableSortStorerDesc FAILED




[jira] Updated: (PIG-1188) Padding nulls to the input tuple according to input schema

2010-07-12 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1188:


Fix Version/s: 0.9.0

 Padding nulls to the input tuple according to input schema
 --

 Key: PIG-1188
 URL: https://issues.apache.org/jira/browse/PIG-1188
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Daniel Dai
Assignee: Richard Ding
 Fix For: 0.9.0


 Currently, the number of fields in the input tuple is determined by the data. 
 When we have a schema, we should generate the input data according to the 
 schema, padding nulls if necessary. Here is one example:
 Pig script:
 {code}
 a = load '1.txt' as (a0, a1);
 dump a;
 {code}
 Input file:
 {code}
 1   2
 1   2   3
 1
 {code}
 Current result:
 {code}
 (1,2)
 (1,2,3)
 (1)
 {code}
 Desired result:
 {code}
 (1,2)
 (1,2)
 (1, null)
 {code}
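 The desired padding behavior can be sketched in plain Java (an illustration 
 under the assumption of a simple list-of-fields tuple, not Pig's actual 
 Tuple API):

 ```java
 import java.util.ArrayList;
 import java.util.Arrays;
 import java.util.List;

 public class PadToSchema {
     // Truncate or null-pad a row so it matches the declared schema width.
     static List<Object> pad(List<Object> row, int schemaWidth) {
         List<Object> out =
             new ArrayList<>(row.subList(0, Math.min(row.size(), schemaWidth)));
         while (out.size() < schemaWidth) {
             out.add(null);
         }
         return out;
     }

     public static void main(String[] args) {
         System.out.println(pad(Arrays.asList((Object) 1, 2, 3), 2)); // [1, 2]
         System.out.println(pad(Arrays.asList((Object) 1), 2));       // [1, null]
     }
 }
 ```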




[jira] Updated: (PIG-1452) to remove hadoop20.jar from lib and use hadoop from the apache maven repo.

2010-07-12 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1452:


Fix Version/s: 0.8.0

 to remove hadoop20.jar from lib and use hadoop from the apache maven repo.
 --

 Key: PIG-1452
 URL: https://issues.apache.org/jira/browse/PIG-1452
 Project: Pig
  Issue Type: Improvement
  Components: build
Affects Versions: 0.8.0
Reporter: Giridharan Kesavan
Assignee: Giridharan Kesavan
 Fix For: 0.8.0

 Attachments: PIG-1452.PATCH


 Pig uses Ivy for dependency management, but it still uses hadoop20.jar from 
 the lib folder. 
 Now that the hadoop-0.20.2 artifacts are available in the Maven repo, Pig 
 should leverage Ivy for resolving/retrieving the Hadoop artifacts.




[jira] Updated: (PIG-1387) Syntactical Sugar for PIG-1385

2010-07-12 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1387:


Fix Version/s: 0.9.0

 Syntactical Sugar for PIG-1385
 --

 Key: PIG-1387
 URL: https://issues.apache.org/jira/browse/PIG-1387
 Project: Pig
  Issue Type: Wish
  Components: grunt
Affects Versions: 0.6.0
Reporter: hc busy
 Fix For: 0.9.0


 From this conversation: extend PIG-1385 so that, instead of calling a UDF, 
 built-in behavior is used when the (), {}, [] groupings are encountered.
   What about making them part of the language using symbols?
  
   instead of
  
   foreach T generate Tuple($0, $1, $2), Bag($3, $4, $5), $6, $7;
  
   have language support
  
   foreach T generate ($0, $1, $2), {$3, $4, $5}, $6, $7;
  
   or even:
  
   foreach T generate ($0, $1, $2), {$3, $4, $5}, [$6#$7, $8#$9], $10, $11;
  
  
   Is there reason not to do the second or third other than being more
   complicated?
  
   Certainly I'd volunteer to put the top implementation in to the util
   package and submit them for builtin's, but the latter syntactic candies
   seems more natural..
  
  
  
   On Tue, Apr 20, 2010 at 5:24 PM, Alan Gates ga...@yahoo-inc.com wrote:
  
   The grouping package in piggybank is left over from back when Pig
  allowed
   users to define grouping functions (0.1).  Functions like these should
  go in
   evaluation.util.
  
   However, I'd consider putting these in builtin (in main Pig) instead.
These are things everyone asks for and they seem like a reasonable
  addition
   to the core engine.  This will be more of a burden to write (as we'll
  hold
   them to a higher standard) but of more use to people as well.
  
   Alan.
  
  
   On Apr 19, 2010, at 12:53 PM, hc busy wrote:
  
Some times I wonder... I mean, somebody went to the trouble of making a
   path
   called
  
   org.apache.pig.piggybank.grouping
  
   (where it seems like this code belong), but didn't check in any java
  code
   into that package.
  
  
   Any comment about where to put this kind of utility classes?
  
  
  
   On Mon, Apr 19, 2010 at 12:07 PM, Andrey S oct...@gmail.com wrote:
  
2010/4/19 hc busy hc.b...@gmail.com
  
That's just the way it is right now, you can't make bags or tuples
   directly... Maybe we should have some UDF's in piggybank for these:
  
   toBag()
   toTuple(); --which is kinda like exec(Tuple in){return in;}
   TupleToBag(); --some times you need it this way for some reason.
  
  
Ok. I place my current code here, may be later I make a patch (if
  such
   implementation is acceptable of course).
  
   import org.apache.pig.EvalFunc;
   import org.apache.pig.data.BagFactory;
   import org.apache.pig.data.DataBag;
   import org.apache.pig.data.Tuple;
   import org.apache.pig.data.TupleFactory;
  
   import java.io.IOException;
  
   /**
    * Convert any sequence of fields to a bag with the specified count of
    * fields<br>
   * Schema: count:int, fld1 [, fld2, fld3, fld4... ].
   * Output: count=2, then { (fld1, fld2) , (fld3, fld4) ... }
   *
   * @author astepachev
   */
    public class ToBag extends EvalFunc<DataBag> {
public BagFactory bagFactory;
public TupleFactory tupleFactory;
  
public ToBag() {
bagFactory = BagFactory.getInstance();
tupleFactory = TupleFactory.getInstance();
}
  
@Override
public DataBag exec(Tuple input) throws IOException {
if (input.isNull())
return null;
final DataBag bag = bagFactory.newDefaultBag();
final Integer couter = (Integer) input.get(0);
if (couter == null)
return null;
Tuple tuple = tupleFactory.newTuple();
 for (int i = 0; i < input.size() - 1; i++) {
if (i % couter == 0) {
tuple = tupleFactory.newTuple();
bag.add(tuple);
}
tuple.append(input.get(i + 1));
}
return bag;
}
   }
  
   import org.apache.pig.ExecType;
   import org.apache.pig.PigServer;
   import org.junit.Before;
   import org.junit.Test;
  
   import java.io.IOException;
   import java.net.URISyntaxException;
   import java.net.URL;
  
   import static org.junit.Assert.assertTrue;
  
   /**
   * @author astepachev
   */
   public class ToBagTest {
PigServer pigServer;
URL inputTxt;
  
@Before
public void init() throws IOException, URISyntaxException {
pigServer = new PigServer(ExecType.LOCAL);
 inputTxt =
    this.getClass().getResource("bagTest.txt").toURI().toURL();
}
  
@Test
public void testSimple() throws IOException {
 pigServer.registerQuery("a = load '" + inputTxt.toExternalForm() +
    "' using PigStorage(',') " +
    "as (id:int, a:chararray, b:chararray, c:chararray, d:chararray);");
 pigServer.registerQuery("last = 

[jira] Updated: (PIG-1341) BinStorage cannot convert DataByteArray to Chararray and results in FIELD_DISCARDED_TYPE_CONVERSION_FAILED

2010-07-12 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1341:


Fix Version/s: 0.9.0

 BinStorage cannot convert DataByteArray to Chararray and results in 
 FIELD_DISCARDED_TYPE_CONVERSION_FAILED
 --

 Key: PIG-1341
 URL: https://issues.apache.org/jira/browse/PIG-1341
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat
Assignee: Richard Ding
 Fix For: 0.9.0

 Attachments: PIG-1341.patch


 The script reads in BinStorage data and tries to convert a column that is a 
 DataByteArray to chararray. 
 {code}
 raw = load 'sampledata' using BinStorage() as (col1,col2, col3);
 --filter out null columns
 A = filter raw by col1#'bcookie' is not null;
 B = foreach A generate col1#'bcookie'  as reqcolumn;
 describe B;
 --B: {reqcolumn: bytearray}
 X = limit B 5;
 dump X;
 B = foreach A generate (chararray)col1#'bcookie'  as convertedcol;
 describe B;
 --B: {convertedcol: chararray}
 X = limit B 5;
 dump X;
 {code}
 The first dump produces:
 (36co9b55onr8s)
 (36co9b55onr8s)
 (36hilul5oo1q1)
 (36hilul5oo1q1)
 (36l4cj15ooa8a)
 The second dump produces:
 ()
 ()
 ()
 ()
 ()
 It also throws an error message: FIELD_DISCARDED_TYPE_CONVERSION_FAILED 5 
 time(s).
 Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1358) [piggybank] String functions should handle exceptions in a consistent manner

2010-07-12 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1358:


Fix Version/s: 0.9.0

 [piggybank] String functions should handle exceptions in a consistent manner 
 -

 Key: PIG-1358
 URL: https://issues.apache.org/jira/browse/PIG-1358
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Richard Ding
 Fix For: 0.9.0


 The String functions in piggybank handle exceptions differently. Some 
 catch all exceptions, some catch only ClassCastException, while others 
 catch only ExecException. The exception handling code in these functions 
 should be consistent.
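One consistent policy (assumed here for illustration: treat null input, bad casts, and any runtime failure alike by returning null) can be sketched in plain Java, independent of the actual piggybank classes. The class and helper names are hypothetical:

```java
import java.util.function.Function;

// Hypothetical helper: every string function routes its work through one
// wrapper, so null input, a bad cast, and any runtime failure all behave
// the same way -- the result is null rather than a propagated exception.
public class SafeStringOp {
    static String apply(Object input, Function<String, String> op) {
        if (input == null) {
            return null; // null propagates as null
        }
        try {
            return op.apply((String) input);
        } catch (RuntimeException e) {
            // ClassCastException and all other runtime errors handled alike
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(apply("hello", String::toUpperCase)); // HELLO
        System.out.println(apply(42, String::toUpperCase));      // null (not a String)
    }
}
```

Whatever policy is chosen (return null, or rethrow as a single exception type), funneling all functions through one wrapper is what makes the behavior uniform.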

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1399) Logical Optimizer: Expression optimizor rule

2010-07-12 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1399:


Fix Version/s: 0.8.0

 Logical Optimizer: Expression optimizor rule
 

 Key: PIG-1399
 URL: https://issues.apache.org/jira/browse/PIG-1399
 Project: Pig
  Issue Type: Sub-task
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Yan Zhou
 Fix For: 0.8.0


 We can optimize expression in several ways:
 1. Constant pre-calculation
 Example:
 B = filter A by a0 > 5+7;
 => B = filter A by a0 > 12;
 2. Boolean expression optimization
 Example:
 B = filter A by not (not(a0>5) or a1>0);
 => B = filter A by a0>5 and a1<=0;
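The boolean rewrite is an application of De Morgan's laws: not(not(X) or Y) = X and not(Y). A minimal plain-Java check of the equivalence (illustrative class name, not Pig code):

```java
// Verify that not(not(a0 > 5) or a1 > 0) is the same predicate as
// (a0 > 5 and a1 <= 0) for every input, per De Morgan's laws.
public class BoolRewriteCheck {
    static boolean original(int a0, int a1)  { return !(!(a0 > 5) || a1 > 0); }
    static boolean rewritten(int a0, int a1) { return (a0 > 5) && (a1 <= 0); }

    public static void main(String[] args) {
        for (int a0 = 0; a0 <= 10; a0++) {
            for (int a1 = -3; a1 <= 3; a1++) {
                if (original(a0, a1) != rewritten(a0, a1)) {
                    throw new AssertionError("mismatch at a0=" + a0 + ", a1=" + a1);
                }
            }
        }
        System.out.println("rewrite is equivalent");
    }
}
```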

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1459) Need a standard way to communicate the requested fields between front and back end for loaders

2010-07-12 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1459:


Fix Version/s: 0.9.0

 Need a standard way to communicate the requested fields between front and 
 back end for loaders
 --

 Key: PIG-1459
 URL: https://issues.apache.org/jira/browse/PIG-1459
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
Reporter: Alan Gates
 Fix For: 0.9.0


 Pig currently provides no mechanism for loader writers to communicate which 
 fields have been requested between the front and back end.  Since any loader 
 that accepts pushed projections has to deal with this issue it would make 
 sense for Pig to provide a standard mechanism for it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1477) Syntax error in tutorial Pig Script 1: Query Phrase Popularity (ORDER operator)

2010-07-12 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1477:


Assignee: Corinne Chandel

 Syntax error in tutorial Pig Script 1: Query Phrase Popularity (ORDER 
 operator)
 ---

 Key: PIG-1477
 URL: https://issues.apache.org/jira/browse/PIG-1477
 Project: Pig
  Issue Type: Bug
  Components: documentation
Affects Versions: 0.7.0
Reporter: Brian Mansell
Assignee: Corinne Chandel
Priority: Trivial
 Fix For: 0.8.0


 Documentation syntax should reflect the correct code indicated in the 
 tutorial script.
 Documentation syntax 
 {code}
 ordered_uniq_frequency = ORDER filtered_uniq_frequency BY (hour, score);
 {code}
 Above syntax results in this error:
 {code}
 2010-06-30 22:12:16,412 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 1000: Error during parsing. Encountered "," at line 1, column 64.
 Was expecting:
 ")" ...
 {code}
 (Correct) Tutorial script syntax
 {code}
 ordered_uniq_frequency = ORDER filtered_uniq_frequency BY hour, score;
 {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (PIG-1436) Print number of records outputted at each step of a Pig script

2010-07-12 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich resolved PIG-1436.
-

Resolution: Duplicate

This looks like a duplicate of PIG-1478. Please re-open if this is not the case.

 Print number of records outputted at each step of a Pig script
 --

 Key: PIG-1436
 URL: https://issues.apache.org/jira/browse/PIG-1436
 Project: Pig
  Issue Type: New Feature
  Components: grunt
Affects Versions: 0.7.0
Reporter: Russell Jurney
Assignee: Richard Ding
Priority: Minor
 Fix For: 0.8.0


 I often run a script multiple times, or have to go and look through Hadoop 
 task logs, to figure out where I broke a long script in such a way that I get 
 0 records out of it.  I think this is a common problem.
 If someone can point me in the right direction, I can make a pass at this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1465) Filter inside foreach is broken

2010-07-12 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1465:


Fix Version/s: 0.8.0

 Filter inside foreach is broken
 ---

 Key: PIG-1465
 URL: https://issues.apache.org/jira/browse/PIG-1465
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: hc busy
 Fix For: 0.8.0


 {quote}
 % cat data.txt
 x,a,1,a
 x,a,2,a
 x,a,3,b
 x,a,4,b
 y,a,1,a
 y,a,2,a
 y,a,3,b
 y,a,4,b
 % cat script.pig
 a = load 'data' as (ind:chararray, f1:chararray, num:int, f2:chararray);
 b = group a by ind;
 describe b;
 f = foreach b {
 all_total = SUM(a.num);
 fed = filter a by (f1==f2);
 some_total = (int)SUM(fed.num);
 generate group as ind, all_total, some_total;
 }
 describe f;
 dump f;
 % pig -f script.pig
 (x,a,1,a,,)
 (x,a,2,a,,)
 (x,a,3,b,,)
 (x,a,4,b,,)
 (y,a,1,a,,)
 (y,a,2,a,,)
 (y,a,3,b,,)
 (y,a,4,b,,)
 % cat what_I_expected
 (x,10,3)
 (y,10,3)
 {quote}
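The expected totals per group can be checked by hand: all_total = 1+2+3+4 = 10, and some_total sums only the rows where f1 == f2 (num 1 and 2), giving 3. A plain-Java sketch of that arithmetic (illustrative only, not the Pig execution path):

```java
public class NestedFilterExpectation {
    // rows: {num, 1 if f1 == f2 else 0}; returns "(all_total,some_total)".
    static String totals(int[][] rows) {
        int allTotal = 0, someTotal = 0;
        for (int[] r : rows) {
            allTotal += r[0];                  // SUM(a.num)
            if (r[1] == 1) someTotal += r[0];  // SUM over the filtered rows
        }
        return "(" + allTotal + "," + someTotal + ")";
    }

    public static void main(String[] args) {
        // The four rows of group x (group y is identical in the example data).
        int[][] groupX = { {1, 1}, {2, 1}, {3, 0}, {4, 0} };
        System.out.println(totals(groupX)); // (10,3)
    }
}
```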

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (PIG-1470) map/red jobs fail using G1 GC (Couldn't find heap)

2010-07-12 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich resolved PIG-1470.
-

Resolution: Won't Fix

Closing since there is no fix in Pig required. Feel free to continue the 
discussion on the mailing lists.

 map/red jobs fail using G1 GC (Couldn't find heap)
 --

 Key: PIG-1470
 URL: https://issues.apache.org/jira/browse/PIG-1470
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
 Environment: OS: 2.6.27.19-5-default #1 SMP 2009-02-28 04:40:21 +0100 
 x86_64 x86_64 x86_64 GNU/Linux
 Java: Java(TM) SE Runtime Environment (build 1.6.0_18-b07)
 Hadoop: 0.20.1
Reporter: Randy Prager

 Here is the hadoop map/red configuration (conf/mapred-site.xml) that fails
 {noformat}
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx300m -XX:+DoEscapeAnalysis -XX:+UseCompressedOops
    -XX:+UnlockExperimentalVMOptions -XX:+UseG1GC</value>
  </property>
 {noformat}
 Here is the hadoop map/red configuration that succeeds
 {noformat}
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx300m -XX:+DoEscapeAnalysis -XX:+UseCompressedOops</value>
  </property>
 {noformat}
 Here is the exception from the pig script.
 {noformat}
 Backend error message
 -
 org.apache.pig.backend.executionengine.ExecException: ERROR 2081: Unable to 
 set up the load function.
 at 
 org.apache.pig.backend.executionengine.PigSlice.init(PigSlice.java:89)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.SliceWrapper.makeReader(SliceWrapper.java:144)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getRecordReader(PigInputFormat.java:282)
 at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:338)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
 at org.apache.hadoop.mapred.Child.main(Child.java:170)
 Caused by: java.lang.RuntimeException: could not instantiate 'PigStorage' 
 with arguments '[,]'
 at 
 org.apache.pig.impl.PigContext.instantiateFuncFromSpec(PigContext.java:519)
 at 
 org.apache.pig.backend.executionengine.PigSlice.init(PigSlice.java:85)
 ... 5 more
 Caused by: java.lang.reflect.InvocationTargetException
 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
 Method)
 at 
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
 at 
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
 at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
 at 
 org.apache.pig.impl.PigContext.instantiateFuncFromSpec(PigContext.java:487)
 ... 6 more
 Caused by: java.lang.RuntimeException: Couldn't find heap
 at 
 org.apache.pig.impl.util.SpillableMemoryManager.init(SpillableMemoryManager.java:95)
 at org.apache.pig.data.BagFactory.init(BagFactory.java:106)
 at 
 org.apache.pig.data.DefaultBagFactory.init(DefaultBagFactory.java:71)
 at org.apache.pig.data.BagFactory.getInstance(BagFactory.java:76)
 at 
 org.apache.pig.builtin.Utf8StorageConverter.init(Utf8StorageConverter.java:49)
 at org.apache.pig.builtin.PigStorage.init(PigStorage.java:69)
 at org.apache.pig.builtin.PigStorage.init(PigStorage.java:79)
 ... 11 more
 {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1492) DefaultTuple and DefaultMemory understimate their memory footprint

2010-07-12 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1492:


 Assignee: Thejas M Nair
Fix Version/s: 0.8.0

 DefaultTuple and DefaultMemory understimate their memory footprint
 --

 Key: PIG-1492
 URL: https://issues.apache.org/jira/browse/PIG-1492
 Project: Pig
  Issue Type: Bug
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Fix For: 0.8.0


 There are several places where we highly underestimate the memory footprint. 
 For example, for map datatypes, we don't account for the per-entry cost of 
 the map container data structures. The estimated size of a tuple having a map 
 with 100 integer key-value entries, as per the current version of the code, is 
 3260 bytes, while what is observed is around 6775 bytes. To verify the memory 
 footprint, I checked free memory before and after creating multiple instances 
 of the object, using code along the lines of 
 http://www.javaspecialists.eu/archive/Issue029.html . 
 In PIG-1443 a similar change was done to fix this for CHARARRAY.
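The free-memory-differencing technique referenced above can be sketched as follows. This is a rough approximation only: exact numbers vary by JVM and heap state, and the HashMap payload here is just a stand-in for Pig's map type.

```java
import java.util.HashMap;
import java.util.Map;

public class FootprintProbe {
    static long used() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) throws InterruptedException {
        final int count = 50_000;
        Object[] hold = new Object[count]; // keep instances reachable

        System.gc();
        Thread.sleep(200);
        long before = used();

        for (int i = 0; i < count; i++) {
            Map<Integer, Integer> m = new HashMap<>();
            m.put(i, i);                   // one entry, like one map field
            hold[i] = m;
        }

        System.gc();
        Thread.sleep(200);
        long after = used();

        // The per-instance figure includes HashMap internals that a naive
        // "payload only" estimate would miss -- the point of the bug report.
        System.out.println("approx bytes/instance: " + (after - before) / count);
    }
}
```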

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-523) help in grunt should show all commands

2010-07-12 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich reassigned PIG-523:
--

Assignee: Olga Natkovich

 help in grunt should show all commands
 --

 Key: PIG-523
 URL: https://issues.apache.org/jira/browse/PIG-523
 Project: Pig
  Issue Type: Bug
Reporter: Olga Natkovich
Assignee: Olga Natkovich
Priority: Minor
 Fix For: 0.8.0


 currently, it only shows commands directly supported by the grunt parser and not 
 commands supported by the pig parser.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-347) Pig (help) Commands

2010-07-12 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich reassigned PIG-347:
--

Assignee: Olga Natkovich

 Pig (help) Commands
 ---

 Key: PIG-347
 URL: https://issues.apache.org/jira/browse/PIG-347
 Project: Pig
  Issue Type: Bug
Reporter: Corinne Chandel
Assignee: Olga Natkovich
Priority: Minor
 Fix For: 0.8.0


 Pig help can be specified 2 ways: $pig -help and $pig -h
 I. $pig -help (seen by external/internal users)
 (1) fix
 -c, -cluster clustername, kryptonite is default 
  remove kryptonite is default
 (2) change 
 -x, -exectype local|mapreduce, mapreduce is default 
  change mapreduce to hadoop (maintain backward compatibility)
 II. $pig -h (seen by internal users users only)
 (1) fix typos
 -l, --latest   use latest, untested, unsupported version of pig.jar instaed 
 of relased, tested, supported version.
instead of released 
 (2) fix
 -c, -cluster clustername, kryptonite is default 
  remove kryptonite is default 
 (same as above)
 (3) change:  -x, -exectype local|mapreduce, mapreduce is default ... 
  change mapreduce to hadoop (maintain backward compatibility)
 (same as above)
  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1494) PIG Logical Optimization: Use CNF in PushUpFilter

2010-07-12 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12887538#action_12887538
 ] 

Olga Natkovich commented on PIG-1494:
-

Swati, I am assigning it to you since I am assuming you plan to work on it for 
0.8. Otherwise, it is unlikely to happen in the 0.8 timeframe. Feel free to 
unassign and unlink from this release if this is not the case.

 PIG Logical Optimization: Use CNF in PushUpFilter
 -

 Key: PIG-1494
 URL: https://issues.apache.org/jira/browse/PIG-1494
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.7.0
Reporter: Swati Jain
Priority: Minor
 Fix For: 0.8.0


 The PushUpFilter rule is not able to handle complicated boolean expressions.
 For example, SplitFilter rule is splitting one LOFilter into two by AND. 
 However it will not be able to split LOFilter if the top level operator is 
 OR. For example:
 *ex script:*
 A = load 'file_a' USING PigStorage(',') as (a1:int,a2:int,a3:int);
 B = load 'file_b' USING PigStorage(',') as (b1:int,b2:int,b3:int);
 C = load 'file_c' USING PigStorage(',') as (c1:int,c2:int,c3:int);
 J1 = JOIN B by b1, C by c1;
 J2 = JOIN J1 by $0, A by a1;
 D = *Filter J2 by ( (c1 > 10) AND (a3+b3 > 10) ) OR (c2 == 5);*
 explain D;
 In the above example, the PushUpFilter is not able to push any filter 
 condition across any join as it contains columns from all branches (inputs). 
 But if we convert this expression into Conjunctive Normal Form (CNF) then 
 we would be able to push filter condition c1 > 10 and c2 == 5 below both join 
 conditions. Here is the CNF expression for the highlighted line:
 ( (c1 > 10) OR (c2 == 5) ) AND ( (a3+b3 > 10) OR (c2 == 5) )
 *Suggestion:* It would be a good idea to convert LOFilter's boolean 
 expression into CNF, it would then be easy to push parts (conjuncts) of the 
 LOFilter boolean expression selectively. We would also not require rule 
 SplitFilter anymore if we were to add this utility to rule PushUpFilter 
 itself.
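The CNF rewrite above relies on distributing OR over AND: with p = (c1 > 10), q = (a3+b3 > 10), r = (c2 == 5), it is the identity (p AND q) OR r = (p OR r) AND (q OR r). A plain-Java truth-table check of that identity (illustrative naming):

```java
public class CnfDistributivityCheck {
    public static void main(String[] args) {
        // Enumerate all 8 assignments of (p, q, r).
        for (int bits = 0; bits < 8; bits++) {
            boolean p = (bits & 1) != 0; // c1 > 10
            boolean q = (bits & 2) != 0; // a3 + b3 > 10
            boolean r = (bits & 4) != 0; // c2 == 5
            boolean originalForm = (p && q) || r;
            boolean cnfForm = (p || r) && (q || r);
            if (originalForm != cnfForm) {
                throw new AssertionError("not equivalent at bits=" + bits);
            }
        }
        System.out.println("CNF form is equivalent");
    }
}
```

Because each conjunct of the CNF can be pushed independently, any conjunct mentioning only one input's columns can move below the joins.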

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-1494) PIG Logical Optimization: Use CNF in PushUpFilter

2010-07-12 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich reassigned PIG-1494:
---

Assignee: Swati Jain

 PIG Logical Optimization: Use CNF in PushUpFilter
 -

 Key: PIG-1494
 URL: https://issues.apache.org/jira/browse/PIG-1494
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.7.0
Reporter: Swati Jain
Assignee: Swati Jain
Priority: Minor
 Fix For: 0.8.0


 The PushUpFilter rule is not able to handle complicated boolean expressions.
 For example, SplitFilter rule is splitting one LOFilter into two by AND. 
 However it will not be able to split LOFilter if the top level operator is 
 OR. For example:
 *ex script:*
 A = load 'file_a' USING PigStorage(',') as (a1:int,a2:int,a3:int);
 B = load 'file_b' USING PigStorage(',') as (b1:int,b2:int,b3:int);
 C = load 'file_c' USING PigStorage(',') as (c1:int,c2:int,c3:int);
 J1 = JOIN B by b1, C by c1;
 J2 = JOIN J1 by $0, A by a1;
 D = *Filter J2 by ( (c1 > 10) AND (a3+b3 > 10) ) OR (c2 == 5);*
 explain D;
 In the above example, the PushUpFilter is not able to push any filter 
 condition across any join as it contains columns from all branches (inputs). 
 But if we convert this expression into Conjunctive Normal Form (CNF) then 
 we would be able to push filter condition c1 > 10 and c2 == 5 below both join 
 conditions. Here is the CNF expression for the highlighted line:
 ( (c1 > 10) OR (c2 == 5) ) AND ( (a3+b3 > 10) OR (c2 == 5) )
 *Suggestion:* It would be a good idea to convert LOFilter's boolean 
 expression into CNF, it would then be easy to push parts (conjuncts) of the 
 LOFilter boolean expression selectively. We would also not require rule 
 SplitFilter anymore if we were to add this utility to rule PushUpFilter 
 itself.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1472) Optimize serialization/deserialization between Map and Reduce and between MR jobs

2010-07-12 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12887545#action_12887545
 ] 

Daniel Dai commented on PIG-1472:
-

+1 for commit.

 Optimize serialization/deserialization between Map and Reduce and between MR 
 jobs
 -

 Key: PIG-1472
 URL: https://issues.apache.org/jira/browse/PIG-1472
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Fix For: 0.8.0

 Attachments: PIG-1472.2.patch, PIG-1472.3.patch, PIG-1472.4.patch, 
 PIG-1472.patch


 In certain types of pig queries most of the execution time is spent in 
 serializing/deserializing (sedes) records between Map and Reduce and between 
 MR jobs. 
 For example, if PigMix queries are modified to specify types for all the 
 fields in the load statement schema, some of the queries (L2,L3,L9, L10 in 
 pigmix v1) that have records with bags and maps being transmitted across map 
 or reduce boundaries run a lot longer (a runtime increase of a few times has 
 been seen).
 There are a few optimizations that have shown to improve the performance of 
 sedes in my tests -
 1. Use a smaller number of bytes to store the length of the column. For example, 
 if a bytearray is smaller than 255 bytes, a single byte can be used to store the 
 length instead of the integer that is currently used.
 2. Instead of custom code to do sedes on Strings, use DataOutput.writeUTF and 
 DataInput.readUTF.  This reduces the cost of serialization by more than 1/2. 
 Zebra and BinStorage are known to use DefaultTuple sedes functionality. The 
 serialization format that these loaders use cannot change, so after the 
 optimization their format is going to be different from the format used 
 between M/R boundaries.
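Optimization 1 amounts to a variable-length length prefix. The sketch below illustrates the idea only; it is not Pig's actual wire format, and the 255 escape-marker value is an assumption:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class VarLengthPrefix {
    // Encode payload with a 1-byte length when it fits, else a marker
    // byte (255) followed by a 4-byte int length.
    static byte[] encode(byte[] payload) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        if (payload.length < 255) {
            out.writeByte(payload.length); // 1 byte of overhead
        } else {
            out.writeByte(255);            // escape marker
            out.writeInt(payload.length);  // 4 more bytes for the real length
        }
        out.write(payload);
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(encode(new byte[100]).length);     // 101 = 1 + 100
        System.out.println(encode(new byte[100_000]).length); // 100005 = 1 + 4 + 100000
    }
}
```

For the common case of short bytearrays this saves 3 of the 4 length bytes per field, which adds up across billions of records.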

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1430) ISODateTime - DateTime: DateTime UDFs Should Also Support int/second Unix Times in All Operations

2010-07-12 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12887546#action_12887546
 ] 

Alan Gates commented on PIG-1430:
-

I think it's fine to start with just putting conversion functions into Pig 
Latin.  What I'd like to clarify though is what is the desired end state?  Does 
Pig eventually have a datetime type that does all the datetime stuff you can 
dream of (timezones, etc.)?  Or does Pig only ever have longs or strings to 
represent times and a set of functions to work with those?  Are you proposing 
that latter, or delaying the former in interest of getting something into 0.8?  

 ISODateTime - DateTime: DateTime UDFs Should Also Support int/second Unix 
 Times in All Operations
 --

 Key: PIG-1430
 URL: https://issues.apache.org/jira/browse/PIG-1430
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.7.0
Reporter: Russell Jurney
 Fix For: 0.8.0


 All functions in 
 contrib.piggybank.java.src.main.java.org.apache.pig.piggybank.evaluation.datetime
  should seamlessly accept integer Unix/POSIX times, and return Unix time 
 output when given an int, and ISO output when given a chararray.
 Note: Unix/POSIX times are the number of seconds elapsed since midnight 
 proleptic Coordinated Universal Time (UTC) of January 1, 1970, not counting 
 leap seconds.  See http://en.wikipedia.org/wiki/Unix_time
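A conversion between the two representations can be sketched with java.time; the helper names are illustrative, and the actual piggybank UDF wrappers would delegate to something like this:

```java
import java.time.Instant;

public class UnixIsoConversion {
    // Unix seconds -> ISO-8601 instant string (UTC).
    static String unixToIso(long seconds) {
        return Instant.ofEpochSecond(seconds).toString();
    }

    // ISO-8601 instant string -> Unix seconds.
    static long isoToUnix(String iso) {
        return Instant.parse(iso).getEpochSecond();
    }

    public static void main(String[] args) {
        System.out.println(unixToIso(0L)); // 1970-01-01T00:00:00Z
        // Round trip preserves the value.
        long t = 1278892800L;
        System.out.println(isoToUnix(unixToIso(t)) == t); // true
    }
}
```

The type-dispatch the issue asks for would then be a matter of checking whether the UDF input arrived as an int/long (Unix time) or a chararray (ISO string) and picking the direction accordingly.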

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1495) Add -q command line option to set queue name for Pig jobs from command line

2010-07-12 Thread Russell Jurney (JIRA)
Add -q command line option to set queue name for Pig jobs from command line
---

 Key: PIG-1495
 URL: https://issues.apache.org/jira/browse/PIG-1495
 Project: Pig
  Issue Type: New Feature
  Components: impl
Affects Versions: 0.7.0
Reporter: Russell Jurney
 Fix For: 0.8.0


rjurney$ pig -q default

This sets the mapred.job.queue.name property in the execution engine from the 
pig properties for MAPRED type jobs.  

Patch attached.
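The effect of the proposed flag amounts to a single property write before job submission. A minimal sketch (the helper name is hypothetical, not the patch's actual code):

```java
import java.util.Properties;

public class QueueOption {
    // Copy the queue name given on the command line into the job
    // properties, where the execution engine picks it up for MAPRED jobs.
    static Properties withQueue(Properties props, String queueName) {
        props.setProperty("mapred.job.queue.name", queueName);
        return props;
    }

    public static void main(String[] args) {
        Properties p = withQueue(new Properties(), "default");
        System.out.println(p.getProperty("mapred.job.queue.name")); // default
    }
}
```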

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1321) Logical Optimizer: Merge cascading foreach

2010-07-12 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1321:


Fix Version/s: 0.8.0

 Logical Optimizer: Merge cascading foreach
 --

 Key: PIG-1321
 URL: https://issues.apache.org/jira/browse/PIG-1321
 Project: Pig
  Issue Type: Sub-task
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Xuefu Zhang
 Fix For: 0.8.0


 We can merge consecutive foreach statement.
 Eg:
 b = foreach a generate a0#'key1' as b0, a0#'key2' as b1, a1;
 c = foreach b generate b0#'kk1', b0#'kk2', b1, a1;
 => c = foreach a generate a0#'key1'#'kk1', a0#'key1'#'kk2', a0#'key2', a1;
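The merge is sound because map dereferences compose. Checking the claim with plain Java maps standing in for Pig's map type (illustrative only):

```java
import java.util.HashMap;
import java.util.Map;

public class ForeachMergeCheck {
    // Build the nested map a0 = ['key1'#['kk1'#7, 'kk2'#8]].
    static Map<String, Map<String, Integer>> sampleA0() {
        Map<String, Integer> inner = new HashMap<>();
        inner.put("kk1", 7);
        inner.put("kk2", 8);
        Map<String, Map<String, Integer>> a0 = new HashMap<>();
        a0.put("key1", inner);
        return a0;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Integer>> a0 = sampleA0();

        // Two foreach steps: first b0 = a0#'key1', then b0#'kk1'.
        Map<String, Integer> b0 = a0.get("key1");
        int twoStep = b0.get("kk1");

        // Merged single step: a0#'key1'#'kk1'.
        int merged = a0.get("key1").get("kk1");

        System.out.println(twoStep == merged); // true
    }
}
```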

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



RE: PIG Logical Optimization: Use CNF in SplitFilter

2010-07-12 Thread Yan Zhou
Hopefully by this week. I'm still in the debugging phase of the work.
While you are welcome to reuse some of my algorithms, I doubt you can
reuse the code as much as you want. It's basically for my DNF use. You
might need to factor out some general code that you can find reusable.

 

I fully understand the I/O benefits as I put in my first message. And it
is classified as Scenario 1. There is no doubt that it should be
considered as part of your work. However, for this, CNF is not
necessary.

 

For scenario 2, the benefit will come from lower in-core logical
expression evaluation costs, with no I/O benefit as far as I can see. The
use of CNF may or may not lead to cheaper evaluation, as the example in my
first message shows. In other words, after converting to CNF, you should
compare the eval cost with that of the original expression before
deciding whether the CNF or the original form should be evaluated.

 

Please let me know if I miss any of your points.

 

Thanks,

 

Yan



From: Swati Jain [mailto:swat...@aggiemail.usu.edu] 
Sent: Monday, July 12, 2010 11:52 AM
To: Yan Zhou
Cc: pig-dev@hadoop.apache.org
Subject: Re: PIG Logical Optimization: Use CNF in SplitFilter

 

I was wondering if you are not going to check in your patch soon then it
would be great if you could share it with me. I believe I might be able
to reuse some of your (utility) functionality directly or get some
ideas. 

About your cost-benefit question:
1) I will control the complexity of CNF conversion by providing a
configurable threshold value which will limit the OR-nesting.
2) One benefit of this conversion is that it will allow pushing parts of
a filter (conjuncts) across the joins which is not happening in the
current PushUpFilter optimization. Moreover, it may result in a
cascading effect to push the conjuncts below other operators by other
rules that may be fired as a result. The benefit from this is really
data dependent, but in big-data workloads, any kind of predicate
pushdown may eventually lead to big savings in amount of data read or
amount of data transferred/shuffled across the network (I need to
understand the LogicalPlan to PhysicalPlan conversion better to give
concrete examples).

Thanks!
Swati

On Mon, Jul 12, 2010 at 10:36 AM, Yan Zhou y...@yahoo-inc.com wrote:

Yes, I already implemented the NOT push down upfront, so you do not
need to do that.

 

The support of CNF will probably be the most difficult part. But as I
mentioned last time, you should compare the cost after trimming the CNF
to get the post-split filtering logic. Given the complexity of
manipulating CNF and the undetermined benefits, I am not sure whether it
should be in scope at this moment.

 

To handle CNF, I think it's a good idea to create a new plan and connect
the nodes in the new plan to the base plan as you envisioned. In my
changes, which uses DNF instead of CNF but processing is similar
otherwise, I use a LogicalExpressionProxy, which contains a source
member that is just the node in the original plan, to link the nodes in
the new plan and old plan.  The original LogicalExpression is enhanced
with a counter to trace the # of proxies of the original nodes since
normal form creation will spread the nodes in the original tree across
many normalized nodes. The benefit, aside from not setting the plan, is
that the original expression is trimmed according to the processing
results from DNF; while DNF is created separately and as a kinda utility
so that complex features can be used. In my changes, I used
multiple-child tree in DNF while not changing the original binary
expression tree structure. Another benefit is that the original tree is
kept as much as it is at the start, i.e., I do not attempt to optimize
its overall structure beyond trimming based upon the simplification
logics. (I also control the size of DNF to 100 nodes.) The down side of
this is added complexity.

 

But in your case, for scenario 2 which is the whole point to use CNF,
you would need to change the original expression tree structurally
beyond trimming for post-split filtering logic. The other benefit of
using a multiple-child expression depends on whether you plan to support
such expressions, replacing the current binary tree,
in the final plan. I think it's a good idea to support that,
but it is not in my scope now.

 

I'll add my algorithm details soon to my jira. Please take a look and
comment as you see appropriate.

 

Thanks,

 

Yan

 

 



From: Swati Jain [mailto:swat...@aggiemail.usu.edu] 
Sent: Friday, July 09, 2010 11:00 PM
To: Yan Zhou
Cc: pig-dev@hadoop.apache.org
Subject: Re: PIG Logical Optimization: Use CNF in SplitFilter

 

Hi Yan,

I agree that the first scenario (filter logic applied to individual
input sources) doesn't need conversion to CNF and that it will be a good
idea to add CNF functionality for the second scenario. I was also
planning to provide a configurable threshold value to 

[jira] Updated: (PIG-1472) Optimize serialization/deserialization between Map and Reduce and between MR jobs

2010-07-12 Thread Thejas M Nair (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thejas M Nair updated PIG-1472:
---

Status: Resolved  (was: Patch Available)
Resolution: Fixed

Patch committed to trunk.

 Optimize serialization/deserialization between Map and Reduce and between MR 
 jobs
 -

 Key: PIG-1472
 URL: https://issues.apache.org/jira/browse/PIG-1472
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Fix For: 0.8.0

 Attachments: PIG-1472.2.patch, PIG-1472.3.patch, PIG-1472.4.patch, 
 PIG-1472.patch


 In certain types of pig queries most of the execution time is spent in 
 serializing/deserializing (sedes) records between Map and Reduce and between 
 MR jobs. 
 For example, if PigMix queries are modified to specify types for all the 
 fields in the load statement schema, some of the queries (L2,L3,L9, L10 in 
 pigmix v1) that have records with bags and maps being transmitted across map 
 or reduce boundaries run a lot longer (a runtime increase of a few times has 
 been seen).
 There are a few optimizations that have shown to improve the performance of 
 sedes in my tests -
 1. Use a smaller number of bytes to store the length of the column. For example, 
 if a bytearray is smaller than 255 bytes, a single byte can be used to store the 
 length instead of the integer that is currently used.
 2. Instead of custom code to do sedes on Strings, use DataOutput.writeUTF and 
 DataInput.readUTF.  This reduces the cost of serialization by more than 1/2. 
 Zebra and BinStorage are known to use DefaultTuple sedes functionality. The 
 serialization format that these loaders use cannot change, so after the 
 optimization their format is going to be different from the format used 
 between M/R boundaries.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1495) Add -q command line option to set queue name for Pig jobs from command line

2010-07-12 Thread Russell Jurney (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Russell Jurney updated PIG-1495:


Status: Patch Available  (was: Open)

 Add -q command line option to set queue name for Pig jobs from command line
 ---

 Key: PIG-1495
 URL: https://issues.apache.org/jira/browse/PIG-1495
 Project: Pig
  Issue Type: New Feature
  Components: impl
Affects Versions: 0.7.0
Reporter: Russell Jurney
 Fix For: 0.8.0

 Attachments: set_queue.patch


 rjurney$ pig -q default
 This sets the mapred.job.queue.name property in the execution engine from the 
 pig properties for MAPRED type jobs.  
 Patch attached.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (PIG-1368) Utf8StorageConvertor's bytesToTuple and bytesToBag methods need to be tightened for corner cases

2010-07-12 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich resolved PIG-1368.
-

Resolution: Duplicate

This will be addressed as part of PIG-1271

 Utf8StorageConvertor's bytesToTuple and bytesToBag methods need to be 
 tightened for corner cases
 

 Key: PIG-1368
 URL: https://issues.apache.org/jira/browse/PIG-1368
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.7.0
Reporter: Pradeep Kamath

 Consider the following data:
 1\t ( hello , bye ) \n
 1\t( hello , bye )a\n
 2 \t (good , bye)\n
 The following script gives the results below:
 a = load 'junk' as (i:int, t:tuple(s:chararray, r:chararray)); dump a;
 (1,( hello , bye ))
 (1,( hello , bye ))
 (2,(good , bye))
 The current bytesToTuple implementation discards leading and trailing 
 characters before the tuple delimiters and parses the tuple out - I think 
 instead it should treat any leading and trailing characters (including space) 
 near the delimiters as an indication of a malformed tuple and return null.
 Also in the code, consumeBag() should handle the special case of {} and not 
 delegate the handling to consumeTuple(). 
 In consumeBag() null tuples should not be skipped.
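The stricter behavior proposed above can be sketched as follows. This is a hedged illustration of the parsing policy, not Pig's actual Utf8StorageConverter code; the class and method names are invented for the example.

```java
public class StrictTupleParse {
    // Returns the inner text of the tuple, or null if malformed.
    // Deliberately no trim(): any leading or trailing characters
    // (including spaces) around the parentheses mean a malformed tuple.
    static String parseTuple(String field) {
        if (field.isEmpty()
                || field.charAt(0) != '('
                || field.charAt(field.length() - 1) != ')')
            return null;
        return field.substring(1, field.length() - 1);
    }

    public static void main(String[] args) {
        System.out.println(parseTuple("(hello,bye)"));   // prints "hello,bye"
        System.out.println(parseTuple(" (hello,bye) ")); // prints "null": stray spaces
        System.out.println(parseTuple("(hello,bye)a"));  // prints "null": trailing char
    }
}
```

Under this policy, rows 1 and 3 of the sample data above would yield a null tuple rather than a silently "repaired" one.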

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-1466) Improve log messages for memory usage

2010-07-12 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich reassigned PIG-1466:
---

Assignee: Thejas M Nair

Thejas, can you update the messages since you are already looking at the memory 
stuff, thanks

 Improve log messages for memory usage
 -

 Key: PIG-1466
 URL: https://issues.apache.org/jira/browse/PIG-1466
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
Assignee: Thejas M Nair
Priority: Minor
 Fix For: 0.8.0


 For anything more than a moderately sized dataset, Pig usually prints the following 
 messages:
 {code}
 2010-05-27 18:28:31,659 INFO org.apache.pig.impl.util.SpillableMemoryManager: 
 low memory handler called (Usage
 threshold exceeded) init = 4194304(4096K) used = 672012960(656262K) committed 
 = 954466304(932096K) max =
 954466304(932096K)
 2010-05-27 18:10:52,653 INFO org.apache.pig.impl.util.SpillableMemoryManager: 
 low memory handler called (Collection
 threshold exceeded) init = 4194304(4096K) used = 954466304(932096K) committed 
 = 954466304(932096K) max =
 954466304(932096K)
 {code}
 This seems to confuse users a lot. Once these messages are printed, users 
 tend to believe that Pig is having hard time with memory, is spilling to disk 
 etc., but in fact Pig might be cruising along at ease. We should be a little 
 more careful about what we print in logs. Currently these are printed when a 
 notification is sent by JVM and some other conditions are met which may not 
 necessarily indicate low memory condition. Furthermore, with 
 {{InternalCachedBag}} embraced everywhere in favor of {{DefaultBag}}, these 
 messages have lost their usefulness. At the very least, we should lower the 
 log level at which these are printed. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1495) Add -q command line option to set queue name for Pig jobs from command line

2010-07-12 Thread Russell Jurney (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Russell Jurney updated PIG-1495:


Status: Open  (was: Patch Available)

 Add -q command line option to set queue name for Pig jobs from command line
 ---

 Key: PIG-1495
 URL: https://issues.apache.org/jira/browse/PIG-1495
 Project: Pig
  Issue Type: New Feature
  Components: impl
Affects Versions: 0.7.0
Reporter: Russell Jurney
 Fix For: 0.8.0

 Attachments: set_queue.patch


 rjurney$ pig -q default
 This sets the mapred.job.queue.name property in the execution engine from the 
 pig properties for MAPRED type jobs.  
 Patch attached.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



PIG Logical Optimization: Use CNF in SplitFilter

2010-07-12 Thread Swati Jain
Yan,

What I meant in my last email was that scenario 2 optimizations would lead
to more opportunities for scenario 1 kind of optimizations.

Consider the conjunct list [C1;C2;C3] as the source of a JOIN.

(a)  Suppose none of these are computable on a join input, in this case we
retain the original expression and discard the CNF.

(b)  Suppose C1 is computable on join input J1 and C2 is computable on join
input J2 but C3 requires a combination of both join inputs. In this case, we
push C1 above J1, C2 above J2 and leave C3 as is below the JOIN. Note that
C1 and C2 may be further pushed up (with additional iterations of the
optimizer). If they are now the source of single input operators, it is
similar to scenario 1.

Thanks,
Swati


On Mon, Jul 12, 2010 at 3:14 PM, Yan Zhou y...@yahoo-inc.com wrote:

  Hopefully by this week. I’m still in the debugging phase of the work.
 While you are welcome to reuse some of my algorithms, I doubt you can reuse
 the code as much as you want. It’s basically for my DNF use. You might need
 to factor out some general code which you can find reusable.



 I fully understand the I/O benefits as I put in my first message. And it is
 classified as “Scenario 1”. There is no doubt that it should be considered
 as part of your work. However, for this, CNF is not necessary.



 For scenario 2, the benefits will be from less in-core logical expression
 evaluation costs and no I/O benefits as I can see. And use of CNF may or may
 not lead to cheaper evaluations as the example in my first message shows. In
 other words, after use of CNF, you should

 compare the eval cost with that in the original expression eval before
 deciding either the CNF or the original form should be evaluated.



 Please let me know if I miss any of your points.



 Thanks,



 Yan
  --

 *From:* Swati Jain [mailto:swat...@aggiemail.usu.edu]
 *Sent:* Monday, July 12, 2010 11:52 AM

 *To:* Yan Zhou
 *Cc:* pig-dev@hadoop.apache.org
 *Subject:* Re: PIG Logical Optimization: Use CNF in SplitFilter



 I was wondering if you are not going to check in your patch soon then it
 would be great if you could share it with me. I believe I might be able to
 reuse some of your (utility) functionality directly or get some ideas.

 About your cost-benefit question:
 1) I will control the complexity of CNF conversion by providing a
 configurable threshold value which will limit the OR-nesting.
 2) One benefit of this conversion is that it will allow pushing parts of a
 filter (conjuncts) across the joins which is not happening in the current
 PushUpFilter optimization. Moreover, it may result in a cascading effect to
 push the conjuncts below other operators by other rules that may be fired as
 a result. The benefit from this is really data dependent, but in big-data
 workloads, any kind of predicate pushdown may eventually lead to big savings
 in amount of data read or amount of data transfered/shuffled across the
 network (I need to understand the LogicalPlan to PhysicalPlan conversion
 better to give concrete examples).

 Thanks!
 Swati

 On Mon, Jul 12, 2010 at 10:36 AM, Yan Zhou y...@yahoo-inc.com wrote:

 Yes, I already implemented the “NOT push down” upfront, so you do not need
 to do that.



 The support of CNF will probably be the most difficult part. But as I
 mentioned last time, you should compare the cost after trimming the CNF to
 get the post-split filtering logic. Given the complexity of manipulating CNF
 and undetermined benefits, I am not sure it should be in scope at this
 moment or not.



 To handle CNF, I think it’s a good idea to create a new plan and connect
 the nodes in the new plan to the base plan as you envisioned. In my changes,
 which uses DNF instead of CNF but processing is similar otherwise, I use a
 LogicalExpressionProxy, which contains a “source” member that is just the
 node in the original plan, to link the nodes in the new plan and old plan.
  The original LogicalExpression is enhanced with a counter to trace the # of
 proxies of the original nodes since normal form creation will “spread” the
 nodes in the original tree across many normalized nodes. The benefit, aside
 from not setting the plan, is that the original expression is trimmed
 according to the processing results from DNF; while DNF is created
 separately and as a kind of utility so that complex features can be used. In
 my changes, I used multiple-child tree in DNF while not changing the
 original binary expression tree structure. Another benefit is that the
 original tree is kept as much as it is at the start, i.e., I do not attempt
 to optimize its overall structure beyond trimming based upon the
 simplification logics. (I also control the size of DNF to 100 nodes.) The
 down side of this is added complexity.



 But in your case, for scenario 2 which is the whole point to use CNF, you
 would need to change the original expression tree structurally beyond
 trimming for post-split filtering 

[jira] Commented: (PIG-1478) Add progress notification listener to PigRunner API

2010-07-12 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12887578#action_12887578
 ] 

Alan Gates commented on PIG-1478:
-

I don't understand the difference between launchStartedNotification() and 
jobsSubmittedNotification().

When will outputCompletedNotification() be called?  Only after the job is 
completely done?  What, if any, guarantees are we making on the order of this 
relative to when PigRunner.run returns?

It isn't clear to me that launchCompleteNotification() is useful.  Once the 
launch has completed the user will start getting jobStartedNotification() calls.


 Add progress notification listener to PigRunner API
 ---

 Key: PIG-1478
 URL: https://issues.apache.org/jira/browse/PIG-1478
 Project: Pig
  Issue Type: Improvement
Reporter: Richard Ding
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1478.patch


 PIG-1333 added PigRunner API to allow Pig users and tools to get a 
 status/stats object back after executing a Pig script. The new API, however, 
  is synchronous (blocking). It's known that a Pig script can spawn tens (even 
  hundreds) of MR jobs and take hours to complete. Therefore it'll be nice to give 
 progress feedback to the callers during the execution.
 The proposal is to add an optional parameter to the API:
 {code}
 public abstract class PigRunner {
 public static PigStats run(String[] args, PigProgressNotificationListener 
 listener) {...}
 }
 {code} 
  The new listener is defined as follows:
 {code}
 package org.apache.pig.tools.pigstats;
 public interface PigProgressNotificationListener extends 
 java.util.EventListener {
 // just before the launch of MR jobs for the script
  public void launchStartedNotification(int numJobsToLaunch);
 // number of jobs submitted in a batch
 public void jobsSubmittedNotification(int numJobsSubmitted);
 // a job is started
 public void jobStartedNotification(String assignedJobId);
 // a job is completed successfully
 public void jobFinishedNotification(JobStats jobStats);
 // a job is failed
 public void jobFailedNotification(JobStats jobStats);
 // a user output is completed successfully
 public void outputCompletedNotification(OutputStats outputStats);
 // updates the progress as percentage
 public void progressUpdatedNotification(int progress);
 // the script execution is done
 public void launchCompletedNotification(int numJobsSucceeded);
 }
 {code}
 Any thoughts?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-1460) UDF manual and javadocs should make clear how to use RequiredFieldList

2010-07-12 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich reassigned PIG-1460:
---

Assignee: Pradeep Kamath

Pradeep, could you provide the information needed and also update the javadoc. 
Then, please, re-assign to Corinne so that she can update the UDF manual, 
thanks.

 UDF manual and javadocs should make clear how to use RequiredFieldList
 --

 Key: PIG-1460
 URL: https://issues.apache.org/jira/browse/PIG-1460
 Project: Pig
  Issue Type: Bug
  Components: documentation
Affects Versions: 0.7.0
Reporter: Alan Gates
Assignee: Pradeep Kamath
Priority: Minor
 Fix For: 0.8.0


 The UDF manual mentions that load function writers need to handle 
 RequiredFieldList passed to LoadPushDown.pushProjection, but it does not 
 specify how the writer should interpret the contents of that list.  The 
 javadoc is similarly vague. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1373) We need to add jdiff output to docs on the website

2010-07-12 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12887582#action_12887582
 ] 

Daniel Dai commented on PIG-1373:
-

All the changes are made; we need to verify the API changes link when 0.8 is released.

 We need to add jdiff output to docs on the website
 --

 Key: PIG-1373
 URL: https://issues.apache.org/jira/browse/PIG-1373
 Project: Pig
  Issue Type: Bug
Reporter: Alan Gates
Assignee: Daniel Dai
Priority: Minor
 Fix For: 0.8.0

 Attachments: PIG-1373-1.patch, PIG-1373-2.patch


 Our build process constructs a jdiff between APIs for different versions.  
 But we don't post the results of that to the website when we deploy the docs. 
  We should, in order to help users understand changes across versions of pig.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1496) Mandatory rule ImplicitSplitInserter

2010-07-12 Thread Daniel Dai (JIRA)
Mandatory rule ImplicitSplitInserter


 Key: PIG-1496
 URL: https://issues.apache.org/jira/browse/PIG-1496
 Project: Pig
  Issue Type: Sub-task
  Components: impl
Affects Versions: 0.8.0
Reporter: Daniel Dai
 Fix For: 0.8.0


Need to migrate ImplicitSplitInserter to new logical optimizer.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1497) Mandatory rule PartitionFilterOptimizer

2010-07-12 Thread Daniel Dai (JIRA)
Mandatory rule PartitionFilterOptimizer
---

 Key: PIG-1497
 URL: https://issues.apache.org/jira/browse/PIG-1497
 Project: Pig
  Issue Type: Sub-task
  Components: impl
Affects Versions: 0.8.0
Reporter: Daniel Dai
 Fix For: 0.8.0


Need to migrate PartitionFilterOptimizer to new logical optimizer.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1495) Add -q command line option to set queue name for Pig jobs from command line

2010-07-12 Thread Russell Jurney (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12887585#action_12887585
 ] 

Russell Jurney commented on PIG-1495:
-

This doesn't work yet.  Doh!

 Add -q command line option to set queue name for Pig jobs from command line
 ---

 Key: PIG-1495
 URL: https://issues.apache.org/jira/browse/PIG-1495
 Project: Pig
  Issue Type: New Feature
  Components: impl
Affects Versions: 0.7.0
Reporter: Russell Jurney
 Fix For: 0.8.0

 Attachments: set_queue.patch


 rjurney$ pig -q default
 This sets the mapred.job.queue.name property in the execution engine from the 
 pig properties for MAPRED type jobs.  
 Patch attached.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



RE: PIG Logical Optimization: Use CNF in SplitFilter

2010-07-12 Thread Yan Zhou
I see. There looks like some disconnect about Scenario 1. To me, all
filtering logics that can be pushed above JOIN can be figured out
without use of CNF, which is scenario 1; while CNF helps to derive the
filtering logic after (or, in your example, below) JOIN, which is
Scenario 2.

 

In your example, C1 and C2, or their equivalent, above JOIN can be
easily figured out without resorting to CNF; C3 may have to be figured
out with CNF, but evaluation cost of the post-Join filtering logic thus
generated may not be cheaper than the original one before pushing up.

 

In summary, if we want to support scenario 2(and 1), we should use CNF;
if we JUST want to support scenario 1, which will push up all possible
filters closer to source and have all benefits on pruned I/O, we should
not use CNF.

 

Thanks,

 

Yan

 

-Original Message-
From: Swati Jain [mailto:swat...@aggiemail.usu.edu] 
Sent: Monday, July 12, 2010 4:04 PM
To: pig-dev@hadoop.apache.org
Subject: PIG Logical Optimization: Use CNF in SplitFilter

 

Yan,

 

What I meant in my last email was that scenario 2 optimizations would
lead

to more opportunities for scenario 1 kind of optimizations.

 

Consider the conjunct list [C1;C2;C3] as the source of a JOIN.

 

(a)  Suppose none of these are computable on a join input, in this case
we

retain the original expression and discard the CNF.

 

(b)  Suppose C1 is computable on join input J1 and C2 is computable on
join

input J2 but C3 requires a combination of both join inputs. In this
case, we

push C1 above J1, C2 above J2 and leave C3 as is below the JOIN. Note
that

C1 and C2 may be further pushed up (with additional iterations of the

optimizer). If they are now the source of single input operators, it is

similar to scenario 1.

 

Thanks,

Swati

 

 

On Mon, Jul 12, 2010 at 3:14 PM, Yan Zhou y...@yahoo-inc.com wrote:

 

  Hopefully by this week. I'm still in the debugging phase of the work.

 While you are welcome to reuse some of my algorithms, I doubt you can
reuse

 the code as much as you want. It's basically for my DNF use. You might
need

 to factor out some general codes which you can find

 

 reusable.

 

 

 

 I fully understand the I/O benefits as I put in my first message. And
it is

 classified as "Scenario 1". There is no doubt that it should be
considered

 as part of your work. However, for this, CNF is not necessary.

 

 

 

 For scenario 2, the benefits will be from less in-core logical
expression

 evaluation costs and no I/O benefits as I can see. And use of CNF may
or may

 not lead to cheaper evaluations as the example in my first message
shows. In

 other words, after use of CNF, you should

 

 compare the eval cost with that in the original expression eval before

 deciding either the CNF or the original form should be evaluated.

 

 

 

 Please let me know if I miss any of your points.

 

 

 

 Thanks,

 

 

 

 Yan

  --

 

 *From:* Swati Jain [mailto:swat...@aggiemail.usu.edu]

 *Sent:* Monday, July 12, 2010 11:52 AM

 

 *To:* Yan Zhou

 *Cc:* pig-dev@hadoop.apache.org

 *Subject:* Re: PIG Logical Optimization: Use CNF in SplitFilter

 

 

 

 I was wondering if you are not going to check in your patch soon then
it

 would be great if you could share it with me. I believe I might be
able to

 reuse some of your (utility) functionality directly or get some ideas.

 

 About your cost-benefit question:

 1) I will control the complexity of CNF conversion by providing a

 configurable threshold value which will limit the OR-nesting.

 2) One benefit of this conversion is that it will allow pushing parts
of a

 filter (conjuncts) across the joins which is not happening in the
current

 PushUpFilter optimization. Moreover, it may result in a cascading
effect to

 push the conjuncts below other operators by other rules that may be
fired as

 a result. The benefit from this is really data dependent, but in
big-data

 workloads, any kind of predicate pushdown may eventually lead to big
savings

 in amount of data read or amount of data transfered/shuffled across
the

 network (I need to understand the LogicalPlan to PhysicalPlan
conversion

 better to give concrete examples).

 

 Thanks!

 Swati

 

 On Mon, Jul 12, 2010 at 10:36 AM, Yan Zhou y...@yahoo-inc.com wrote:

 

 Yes, I already implemented the "NOT push down" upfront, so you do not need
 to do that.

 

 

 

 The support of CNF will probably be the most difficult part. But as I
 mentioned last time, you should compare the cost after trimming the CNF to

 get the post-split filtering logic. Given the complexity of
manipulating CNF

 and undetermined benefits, I am not sure it should be in scope at this

 moment or not.

 

 

 

 To handle CNF, I think it's a good idea to create a new plan and
connect

 the nodes in the new plan to the base plan as you envisioned. In my
changes,

 which uses DNF instead of CNF but processing is similar otherwise, I
use a

 

[jira] Commented: (PIG-1478) Add progress notification listener to PigRunner API

2010-07-12 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12887602#action_12887602
 ] 

Richard Ding commented on PIG-1478:
---

bq. I don't understand the difference between launchStartedNotification() and 
jobsSubmittedNotification().

launchStartedNotification() tells the listeners the total number of jobs ready 
to submit for the script. jobsSubmittedNotification() tells the listeners the 
number of jobs submitted in a batch. Because of the dependency between jobs, 
Pig may not be able to submit all the jobs together. So the numJobsToLaunch 
passed to launchStartedNotification() should equal the sum of the 
numJobsSubmitted values of all jobsSubmittedNotification() calls.

bq. When will outputCompletedNotification() be called? Only after the job is 
completely done? What, if any, guarantees are we making on the order of this 
relative to when PigRunner.run returns?

outputCompletedNotification() is called after the job that writes this output 
is done. This is only called for user outputs. As a script can have multiple 
user outputs, some outputs may be written before all jobs are done. 

bq. It isn't clear to me that launchCompleteNotification() is useful. Once the 
launch has completed the user will start getting jobStartedNotification() calls.

Just trying to be complete. launchCompletedNotification() is called when all jobs 
are done. If a script is executed successfully, numJobsSucceeded should 
equal the numJobsToLaunch from launchStartedNotification().

An example log trace looks like this:

{code}
 numJobsToLaunch: 3
 jobs submitted: 1
 progress: 0%
 job started: job_20100702195434153_0002
 progress: 16%
 progress: 33%
 job finished: job_20100702195434153_0002
 jobs submitted: 1
 job started: job_20100702195434153_0003
 progress: 50%
 progress: 66%
 job finished: job_20100702195434153_0003
 jobs submitted: 1
 job started: job_20100702195434153_0004
 progress: 83%
 output done: hdfs://localhost.localdomain:52083/user/pig/myoutput
 job finished: job_20100702195434153_0004
 progress: 100%
 numJobsSucceeded: 3
{code}
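A listener producing a console trace like the one above could look roughly like this. This is a hedged sketch: it shows only the String- and int-typed callbacks, omitting the JobStats/OutputStats-typed ones so the example stays self-contained, and it does not implement the real PigProgressNotificationListener interface.

```java
public class ConsoleListener {
    private int jobsToLaunch;

    // Called once, just before the first batch of MR jobs is launched.
    public void launchStartedNotification(int numJobsToLaunch) {
        jobsToLaunch = numJobsToLaunch;
        System.out.println("numJobsToLaunch: " + numJobsToLaunch);
    }

    // Called per batch; the batch sizes sum to numJobsToLaunch.
    public void jobsSubmittedNotification(int numJobsSubmitted) {
        System.out.println("jobs submitted: " + numJobsSubmitted);
    }

    public void jobStartedNotification(String assignedJobId) {
        System.out.println("job started: " + assignedJobId);
    }

    public void progressUpdatedNotification(int progress) {
        System.out.println("progress: " + progress + "%");
    }

    // On full success, numJobsSucceeded == jobsToLaunch.
    public void launchCompletedNotification(int numJobsSucceeded) {
        System.out.println("numJobsSucceeded: " + numJobsSucceeded
                + " of " + jobsToLaunch);
    }

    public static void main(String[] args) {
        ConsoleListener l = new ConsoleListener();
        l.launchStartedNotification(3);
        l.jobsSubmittedNotification(1);
        l.jobStartedNotification("job_20100702195434153_0002");
        l.progressUpdatedNotification(33);
        l.launchCompletedNotification(3);
    }
}
```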

 Add progress notification listener to PigRunner API
 ---

 Key: PIG-1478
 URL: https://issues.apache.org/jira/browse/PIG-1478
 Project: Pig
  Issue Type: Improvement
Reporter: Richard Ding
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1478.patch


 PIG-1333 added PigRunner API to allow Pig users and tools to get a 
 status/stats object back after executing a Pig script. The new API, however, 
  is synchronous (blocking). It's known that a Pig script can spawn tens (even 
  hundreds) of MR jobs and take hours to complete. Therefore it'll be nice to give 
 progress feedback to the callers during the execution.
 The proposal is to add an optional parameter to the API:
 {code}
 public abstract class PigRunner {
 public static PigStats run(String[] args, PigProgressNotificationListener 
 listener) {...}
 }
 {code} 
  The new listener is defined as follows:
 {code}
 package org.apache.pig.tools.pigstats;
 public interface PigProgressNotificationListener extends 
 java.util.EventListener {
 // just before the launch of MR jobs for the script
  public void launchStartedNotification(int numJobsToLaunch);
 // number of jobs submitted in a batch
 public void jobsSubmittedNotification(int numJobsSubmitted);
 // a job is started
 public void jobStartedNotification(String assignedJobId);
 // a job is completed successfully
 public void jobFinishedNotification(JobStats jobStats);
 // a job is failed
 public void jobFailedNotification(JobStats jobStats);
 // a user output is completed successfully
 public void outputCompletedNotification(OutputStats outputStats);
 // updates the progress as percentage
 public void progressUpdatedNotification(int progress);
 // the script execution is done
 public void launchCompletedNotification(int numJobsSucceeded);
 }
 {code}
 Any thoughts?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: PIG Logical Optimization: Use CNF in SplitFilter

2010-07-12 Thread Swati Jain
Hi Yan

Thanks for your prompt reply. I did not understand your statement “C1 and
C2, or their equivalent, above JOIN can be easily figured out without
resorting to CNF”.

Consider a LOFilter above a LOJoin. The predicate of LOFilter: ( (c1 > 10)
AND (a3+b3 > 10) ) OR (c2 == 5)

The schema for LOJoin:

A = (a1:int,a2:int,a3:int);
B = (b1:int,b2:int,b3:int);
C = (c1:int,c2:int,c3:int);

After CNF: ( (c1 > 10) OR (c2 == 5) ) AND ( (a3+b3 > 10) OR (c2 == 5) )

Now we can push ( (c1 > 10) OR (c2 == 5) ) above the JOIN (in the branch
leading up to the source C) while ( (a3+b3 > 10) OR (c2 == 5) ) stays put
below the JOIN.
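The equivalence behind this rewrite (distributing OR over AND) can be checked mechanically. The sketch below uses placeholder boolean atoms p = (c1 > 10), q = (a3+b3 > 10), r = (c2 == 5); it is an illustration of the logic, not Pig code.

```java
public class CnfCheck {
    // Original filter predicate: (p AND q) OR r
    static boolean original(boolean p, boolean q, boolean r) {
        return (p && q) || r;
    }

    // CNF form: (p OR r) AND (q OR r), obtained by distributing OR over AND
    static boolean cnf(boolean p, boolean q, boolean r) {
        return (p || r) && (q || r);
    }

    public static void main(String[] args) {
        // Brute-force all 8 assignments of the three atoms.
        for (int i = 0; i < 8; i++) {
            boolean p = (i & 1) != 0, q = (i & 2) != 0, r = (i & 4) != 0;
            if (original(p, q, r) != cnf(p, q, r))
                throw new AssertionError("mismatch at assignment " + i);
        }
        System.out.println("original and CNF agree on all 8 assignments");
    }
}
```

The conjunct (p OR r) mentions only columns of C, which is what makes it pushable above the JOIN, while (q OR r) spans two inputs and must stay below.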

Please let me know if there is a way of doing the above optimization without
converting the original expression to CNF.

Thanks,

Swati


On Mon, Jul 12, 2010 at 4:26 PM, Yan Zhou y...@yahoo-inc.com wrote:

 I see. There looks like some disconnect about Scenario 1. To me, all
 filtering logics that can be pushed above JOIN can be figured out
 without use of CNF, which is scenario 1; while CNF helps to derive the
 filtering logic after (or, in your example, below) JOIN, which is
 Scenario 2.



 In your example, C1 and C2, or their equivalent, above JOIN can be
 easily figured out without resorting to CNF; C3 may have to be figured
 out with CNF, but evaluation cost of the post-Join filtering logic thus
 generated may not be cheaper than the original one before pushing up.



 In summary, if we want to support scenario 2(and 1), we should use CNF;
 if we JUST want to support scenario 1, which will push up all possible
 filters closer to source and have all benefits on pruned I/O, we should
 not use CNF.



 Thanks,



 Yan



 -Original Message-
 From: Swati Jain [mailto:swat...@aggiemail.usu.edu]
 Sent: Monday, July 12, 2010 4:04 PM
 To: pig-dev@hadoop.apache.org
 Subject: PIG Logical Optimization: Use CNF in SplitFilter



 Yan,



 What I meant in my last email was that scenario 2 optimizations would
 lead

 to more opportunities for scenario 1 kind of optimizations.



 Consider the conjunct list [C1;C2;C3] as the source of a JOIN.



 (a)  Suppose none of these are computable on a join input, in this case
 we

 retain the original expression and discard the CNF.



 (b)  Suppose C1 is computable on join input J1 and C2 is computable on
 join

 input J2 but C3 requires a combination of both join inputs. In this
 case, we

 push C1 above J1, C2 above J2 and leave C3 as is below the JOIN. Note
 that

 C1 and C2 may be further pushed up (with additional iterations of the

 optimizer). If they are now the source of single input operators, it is

 similar to scenario 1.



 Thanks,

 Swati





 On Mon, Jul 12, 2010 at 3:14 PM, Yan Zhou y...@yahoo-inc.com wrote:



   Hopefully by this week. I'm still in the debugging phase of the work.

  While you are welcome to reuse some of my algorithms, I doubt you can
 reuse

  the code as much as you want. It's basically for my DNF use. You might
 need

  to factor out some general codes which you can find

 

  reusable.

 

 

 

  I fully understand the I/O benefits as I put in my first message. And
 it is

  classified as "Scenario 1". There is no doubt that it should be
 considered

  as part of your work. However, for this, CNF is not necessary.

 

 

 

  For scenario 2, the benefits will be from less in-core logical
 expression

  evaluation costs and no I/O benefits as I can see. And use of CNF may
 or may

  not lead to cheaper evaluations as the example in my first message
 shows. In

  other words, after use of CNF, you should

 

  compare the eval cost with that in the original expression eval before

  deciding either the CNF or the original form should be evaluated.

 

 

 

  Please let me know if I miss any of your points.

 

 

 

  Thanks,

 

 

 

  Yan

   --

 

  *From:* Swati Jain [mailto:swat...@aggiemail.usu.edu]

  *Sent:* Monday, July 12, 2010 11:52 AM

 

  *To:* Yan Zhou

  *Cc:* pig-dev@hadoop.apache.org

  *Subject:* Re: PIG Logical Optimization: Use CNF in SplitFilter

 

 

 

  I was wondering if you are not going to check in your patch soon then
 it

  would be great if you could share it with me. I believe I might be
 able to

  reuse some of your (utility) functionality directly or get some ideas.

 

  About your cost-benefit question:

  1) I will control the complexity of CNF conversion by providing a

  configurable threshold value which will limit the OR-nesting.

  2) One benefit of this conversion is that it will allow pushing parts
 of a

  filter (conjuncts) across the joins which is not happening in the
 current

  PushUpFilter optimization. Moreover, it may result in a cascading
 effect to

  push the conjuncts below other operators by other rules that may be
 fired as

  a result. The benefit from this is really data dependent, but in
 big-data

  workloads, any kind of predicate pushdown may eventually lead to big
 savings

  in amount of data read or amount of data 

RE: PIG Logical Optimization: Use CNF in SplitFilter

2010-07-12 Thread Yan Zhou
In the original expression, let (a3+b3 > 10) be true; the expression then
transforms to (c1 > 10) OR (c2 == 5), since TRUE OR anything is
still TRUE, and TRUE AND anything is that anything. You can write a
visitor to easily do this type of partial evaluation. (a3+b3 > 10) is
chosen because it cannot be determined from alias 'C'.
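The partial-evaluation trick can be sketched as follows. Assuming the atom undecidable from C is TRUE yields a relaxed filter that can be pushed above the JOIN; the key property is that the relaxed filter never drops a row the full predicate would keep. Names below are illustrative, not Pig's visitor API.

```java
public class PushablePart {
    // Full predicate: ((c1 > 10) AND (a3+b3 > 10)) OR (c2 == 5)
    static boolean full(int c1, int c2, int a3, int b3) {
        return ((c1 > 10) && (a3 + b3 > 10)) || (c2 == 5);
    }

    // Relaxation pushable above the JOIN on input C: obtained by setting
    // (a3+b3 > 10) to TRUE and simplifying, giving (c1 > 10) OR (c2 == 5).
    static boolean pushableForC(int c1, int c2) {
        return (c1 > 10) || (c2 == 5);
    }

    public static void main(String[] args) {
        // Soundness check: whenever the full predicate holds, the pushed
        // filter also holds, so pushing it up never filters out valid rows.
        for (int c1 : new int[]{5, 15})
            for (int c2 : new int[]{5, 6})
                for (int sum : new int[]{0, 20})
                    if (full(c1, c2, sum, 0) && !pushableForC(c1, c2))
                        throw new AssertionError("pushed filter dropped a row");
        System.out.println("pushed filter is a sound relaxation");
    }
}
```

Note the relaxation is one-directional: rows passing the pushed filter must still be re-checked by the residual predicate below the JOIN.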

Thanks,

Yan

-Original Message-
From: Swati Jain [mailto:swat...@aggiemail.usu.edu] 
Sent: Monday, July 12, 2010 5:40 PM
To: pig-dev@hadoop.apache.org
Subject: Re: PIG Logical Optimization: Use CNF in SplitFilter

Hi Yan

Thanks for your prompt reply. I did not understand your statement "C1 and
C2, or their equivalent, above JOIN can be easily figured out without
resorting to CNF".

Consider a LOFilter above a LOJoin. The predicate of LOFilter: ( (c1 > 10)
AND (a3+b3 > 10) ) OR (c2 == 5)

The schema for LOJoin:

A = (a1:int,a2:int,a3:int);
B = (b1:int,b2:int,b3:int);
C = (c1:int,c2:int,c3:int);

After CNF: ( (c1 > 10) OR (c2 == 5) ) AND ( (a3+b3 > 10) OR (c2 == 5) )

Now we can push ( (c1 > 10) OR (c2 == 5) ) above the JOIN (in the branch
leading up to the source C) while ( (a3+b3 > 10) OR (c2 == 5) ) stays put
below the JOIN.

Please let me know if there is a way of doing the above optimization
without
converting the original expression to CNF.

Thanks,

Swati


On Mon, Jul 12, 2010 at 4:26 PM, Yan Zhou y...@yahoo-inc.com wrote:

 I see. There looks like some disconnect about Scenario 1. To me, all
 filtering logics that can be pushed above JOIN can be figured out without
 use of CNF, which is scenario 1; while CNF helps to derive the filtering
 logic after (or, in your example, below) JOIN, which is Scenario 2.

 In your example, C1 and C2, or their equivalent, above JOIN can be easily
 figured out without resorting to CNF; C3 may have to be figured out with
 CNF, but the evaluation cost of the post-Join filtering logic thus
 generated may not be cheaper than the original one before pushing up.

 In summary, if we want to support scenario 2 (and 1), we should use CNF;
 if we JUST want to support scenario 1, which will push up all possible
 filters closer to the source and have all the benefits of pruned I/O, we
 should not use CNF.



 Thanks,



 Yan



 -Original Message-
 From: Swati Jain [mailto:swat...@aggiemail.usu.edu]
 Sent: Monday, July 12, 2010 4:04 PM
 To: pig-dev@hadoop.apache.org
 Subject: PIG Logical Optimization: Use CNF in SplitFilter



 Yan,



  What I meant in my last email was that scenario 2 optimizations would
  lead to more opportunities for scenario 1 kind of optimizations.

  Consider the conjunct list [C1;C2;C3] as the source of a JOIN.

  (a) Suppose none of these are computable on a join input; in this case
  we retain the original expression and discard the CNF.

  (b) Suppose C1 is computable on join input J1 and C2 is computable on
  join input J2, but C3 requires a combination of both join inputs. In
  this case, we push C1 above J1, C2 above J2, and leave C3 as is below
  the JOIN. Note that C1 and C2 may be further pushed up (with additional
  iterations of the optimizer). If they are now the source of single-input
  operators, it is similar to scenario 1.
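Case (b) amounts to partitioning the conjuncts by which join input's columns they reference. A hypothetical sketch (input names, schemas, and the regex-based column extraction are all made up for illustration):

```python
import re

# Columns provided by each (hypothetical) join input.
SCHEMAS = {"J1": {"a1", "a2", "a3"}, "J2": {"b1", "b2", "b3"}}

def columns(conjunct):
    """Extract column names (letter + digit) referenced by a conjunct string."""
    return set(re.findall(r"[abc]\d", conjunct))

def partition(conjuncts):
    """Map each conjunct to the single join input that can evaluate it,
    or to 'JOIN' if it needs columns from more than one input."""
    placement = {}
    for c in conjuncts:
        homes = [inp for inp, cols in SCHEMAS.items() if columns(c) <= cols]
        placement[c] = homes[0] if homes else "JOIN"
    return placement

print(partition(["a1 > 10", "b2 == 5", "a3 + b3 > 10"]))
# → {'a1 > 10': 'J1', 'b2 == 5': 'J2', 'a3 + b3 > 10': 'JOIN'}
```

Conjuncts mapped to a single input are the ones SplitFilter-style rules can push above that input; the rest stay below the JOIN.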



 Thanks,

 Swati





 On Mon, Jul 12, 2010 at 3:14 PM, Yan Zhou y...@yahoo-inc.com wrote:



  Hopefully by this week. I'm still in the debugging phase of the work.
  While you are welcome to reuse some of my algorithms, I doubt you can
  reuse the code as much as you want. It's basically for my DNF use. You
  might need to factor out some general code which you can find reusable.

  I fully understand the I/O benefits, as I put in my first message. And
  it is classified as Scenario 1. There is no doubt that it should be
  considered as part of your work. However, for this, CNF is not necessary.

  For scenario 2, the benefits will come from lower in-core logical
  expression evaluation costs, with no I/O benefits as far as I can see.
  And use of CNF may or may not lead to cheaper evaluations, as the example
  in my first message shows. In other words, after use of CNF, you should
  compare the eval cost with that of the original expression before
  deciding whether the CNF or the original form should be evaluated.

  Please let me know if I miss any of your points.

  Thanks,

  Yan
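The cost comparison suggested above can be approximated by counting leaf predicates in each form (an illustrative heuristic on the same tuple-based expression tree, not Pig's cost model):

```python
def eval_cost(expr):
    """Count leaf predicates in an expression tree as a rough eval cost."""
    if isinstance(expr, str):
        return 1
    # Internal node: ("and"/"or", left, right)
    return eval_cost(expr[1]) + eval_cost(expr[2])

original = ("or", ("and", "c1 > 10", "a3+b3 > 10"), "c2 == 5")
cnf = ("and", ("or", "c1 > 10", "c2 == 5"), ("or", "a3+b3 > 10", "c2 == 5"))

# CNF duplicates (c2 == 5), so it costs more to evaluate here:
print(eval_cost(original), eval_cost(cnf))   # → 3 4
```

When no conjunct ends up pushable, keeping the original form (as in case (a) earlier in the thread) avoids paying the duplicated-predicate cost for nothing.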

   --

 

  *From:* Swati Jain [mailto:swat...@aggiemail.usu.edu]
  *Sent:* Monday, July 12, 2010 11:52 AM
  *To:* Yan Zhou
  *Cc:* pig-dev@hadoop.apache.org
  *Subject:* Re: PIG Logical Optimization: Use CNF in SplitFilter

 

 

 

  If you are not going to check in your patch soon, it would be great if
  you could share it with me. I believe I might be able to reuse some of
  your (utility) functionality directly or get some ideas.
 

  About your cost-benefit question:

  1) I will control the complexity of CNF conversion by providing a
  configurable threshold value which will limit the