[jira] Updated: (PIG-893) support cast of chararray to other simple types

2009-08-03 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated PIG-893:
---

Fix Version/s: 0.4.0
Affects Version/s: 0.4.0
   Status: Patch Available  (was: Open)

 support cast of chararray to other simple types
 ---

 Key: PIG-893
 URL: https://issues.apache.org/jira/browse/PIG-893
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.4.0
Reporter: Thejas M Nair
 Fix For: 0.4.0

 Attachments: Pig_893_Patch.txt


 Pig should support casting of chararray to integer, long, float, double, and 
 bytearray. If the conversion fails, for reasons such as overflow, the cast 
 should return null and log a warning.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-893) support cast of chararray to other simple types

2009-08-03 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated PIG-893:
---

Attachment: Pig_893_Patch.txt

Attached the patch, including the test case.

I extracted the bytesTo* methods from Utf8StorageConverter into CastUtil, so 
these methods can be reused by other classes.
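The extracted helpers themselves are only in the attached Pig_893_Patch.txt; as a rough sketch (class and method names hypothetical, not the actual patch contents), a CastUtil-style conversion that returns null on failure as the issue requests might look like:

```java
// Hypothetical sketch only: the real helpers live in Pig_893_Patch.txt.
// A CastUtil-style conversion returns null on any parse failure (bad
// format, overflow) instead of throwing, per the PIG-893 description.
public class CastUtilSketch {

    // chararray -> Integer; null on failure (Pig would also log a warning)
    public static Integer toInteger(String s) {
        if (s == null) return null;
        try {
            return Integer.valueOf(s.trim());
        } catch (NumberFormatException e) {
            return null;
        }
    }

    // chararray -> Double; null on failure
    public static Double toDouble(String s) {
        if (s == null) return null;
        try {
            return Double.valueOf(s.trim());
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(toInteger("42"));                   // 42
        System.out.println(toInteger("99999999999999999999")); // null (overflow)
        System.out.println(toDouble("3.14"));                  // 3.14
    }
}
```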



 support cast of chararray to other simple types
 ---

 Key: PIG-893
 URL: https://issues.apache.org/jira/browse/PIG-893
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.4.0
Reporter: Thejas M Nair
 Fix For: 0.4.0

 Attachments: Pig_893_Patch.txt


 Pig should support casting of chararray to integer, long, float, double, and 
 bytearray. If the conversion fails, for reasons such as overflow, the cast 
 should return null and log a warning.




[jira] Commented: (PIG-592) schema inferred incorrectly

2009-08-03 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12738472#action_12738472
 ] 

Daniel Dai commented on PIG-592:


Also, the following script produces the wrong schema:

a = load 'a';
b = load 'b';
c = join a by $0, b by $0;
describe c;

c: {bytearray,bytearray}

The correct behavior should be: if any input schema is unknown, the output 
schema is also unknown.

 schema inferred incorrectly
 ---

 Key: PIG-592
 URL: https://issues.apache.org/jira/browse/PIG-592
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.2.0
Reporter: Christopher Olston

 A simple Pig script that never introduces any schema information:
 A = load 'foo';
 B = foreach (group A by $8) generate group, COUNT($1);
 C = load 'bar';   // ('bar' has two columns)
 D = join B by $0, C by $0;
 E = foreach D generate $0, $1, $3;
 Fails, complaining that $3 does not exist:
 java.io.IOException: Out of bound access. Trying to access non-existent 
 column: 3. Schema {B::group: bytearray,long,bytearray} has 3 column(s).
 Apparently Pig gets confused, and thinks it knows the schema for C (a single 
 bytearray column).




[jira] Created: (PIG-900) ORDER BY syntax wrt parentheses is somewhat different than GROUP BY and FILTER BY

2009-08-03 Thread David Ciemiewicz (JIRA)
ORDER BY syntax wrt parentheses is somewhat different than GROUP BY and FILTER 
BY
-

 Key: PIG-900
 URL: https://issues.apache.org/jira/browse/PIG-900
 Project: Pig
  Issue Type: Bug
Reporter: David Ciemiewicz


With GROUP BY, you must put parentheses around the aliases in the BY clause:

{code}
B = group A by ( a, b, c );
{code}

With FILTER BY, you can optionally put parentheses around the aliases in the BY 
clause:

{code}
B = filter A by ( a is not null and b is not null and c is not null );
{code}

However, with ORDER BY, if you put parentheses around the BY clause, you get a 
syntax error:

{code}
 A = order A by ( a, b, c);
{code}

Produces the error:

{code}
2009-08-03 18:26:29,544 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
1000: Error during parsing. Encountered  , ,  at line 3, column 19.
Was expecting:
) ...
{code}

This is an annoyance really.

{code}
A = load 'data.txt' using PigStorage as (a: chararray, b: chararray, c: 
chararray );

A = order A by ( a, b, c );

dump A;
{code}





[jira] Updated: (PIG-900) ORDER BY syntax wrt parentheses is somewhat different than GROUP BY and FILTER BY

2009-08-03 Thread David Ciemiewicz (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Ciemiewicz updated PIG-900:
-

Description: 
With GROUP BY, you must put parentheses around the aliases in the BY clause:

{code}
B = group A by ( a, b, c );
{code}

With FILTER BY, you can optionally put parentheses around the aliases in the BY 
clause:

{code}
B = filter A by ( a is not null and b is not null and c is not null );
{code}

However, with ORDER BY, if you put parentheses around the BY clause, you get a 
syntax error:

{code}
 A = order A by ( a, b, c);
{code}

Produces the error:

{code}
2009-08-03 18:26:29,544 [main] ERROR org.apache.pig.tools.grunt.Grunt -
ERROR 1000: Error during parsing. Encountered  , ,  at line 3, column 19.
Was expecting:
) ...
{code}

This is an annoyance really.

Here's my full code example ...

{code}
A = load 'data.txt' using PigStorage as (a: chararray, b: chararray, c: 
chararray );

A = order A by ( a, b, c );

dump A;
{code}


  was:
With GROUP BY, you must put parentheses around the aliases in the BY clause:

{code}
B = group A by ( a, b, c );
{code}

With FILTER BY, you can optionally put parentheses around the aliases in the BY 
clause:

{code}
B = filter A by ( a is not null and b is not null and c is not null );
{code}

However, with ORDER BY, if you put parentheses around the BY clause, you get a 
syntax error:

{code}
 A = order A by ( a, b, c);
{code}

Produces the error:

{code}
2009-08-03 18:26:29,544 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
1000: Error during parsing. Encountered  , ,  at line 3, column 19.
Was expecting:
) ...
{code}

This is an annoyance really.

{code}
A = load 'data.txt' using PigStorage as (a: chararray, b: chararray, c: 
chararray );

A = order A by ( a, b, c );

dump A;
{code}



 ORDER BY syntax wrt parentheses is somewhat different than GROUP BY and 
 FILTER BY
 -

 Key: PIG-900
 URL: https://issues.apache.org/jira/browse/PIG-900
 Project: Pig
  Issue Type: Bug
Reporter: David Ciemiewicz

 With GROUP BY, you must put parentheses around the aliases in the BY clause:
 {code}
 B = group A by ( a, b, c );
 {code}
 With FILTER BY, you can optionally put parentheses around the aliases in the 
 BY clause:
 {code}
 B = filter A by ( a is not null and b is not null and c is not null );
 {code}
 However, with ORDER BY, if you put parentheses around the BY clause, you get 
 a syntax error:
 {code}
  A = order A by ( a, b, c);
 {code}
 Produces the error:
 {code}
 2009-08-03 18:26:29,544 [main] ERROR org.apache.pig.tools.grunt.Grunt -
 ERROR 1000: Error during parsing. Encountered  , ,  at line 3, column 
 19.
 Was expecting:
 ) ...
 {code}
 This is an annoyance really.
 Here's my full code example ...
 {code}
 A = load 'data.txt' using PigStorage as (a: chararray, b: chararray, c: 
 chararray );
 A = order A by ( a, b, c );
 dump A;
 {code}




[jira] Updated: (PIG-900) ORDER BY syntax wrt parentheses is somewhat different than GROUP BY and FILTER BY

2009-08-03 Thread David Ciemiewicz (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Ciemiewicz updated PIG-900:
-

Description: 
With GROUP BY, you must put parentheses around the aliases in the BY clause:

{code}
B = group A by ( a, b, c );
{code}

With FILTER BY, you can optionally put parentheses around the aliases in the BY 
clause:

{code}
B = filter A by ( a is not null and b is not null and c is not null );
{code}

However, with ORDER BY, if you put parentheses around the BY clause, you get a 
syntax error:

{code}
 A = order A by ( a, b, c );
{code}

Produces the error:

{code}
2009-08-03 18:26:29,544 [main] ERROR org.apache.pig.tools.grunt.Grunt -
ERROR 1000: Error during parsing. Encountered  , ,  at line 3, column 19.
Was expecting:
) ...
{code}

This is an annoyance really.

Here's my full code example ...

{code}
A = load 'data.txt' using PigStorage as (a: chararray, b: chararray, c: 
chararray );

A = order A by ( a, b, c );

dump A;
{code}


  was:
With GROUP BY, you must put parentheses around the aliases in the BY clause:

{code}
B = group A by ( a, b, c );
{code}

With FILTER BY, you can optionally put parentheses around the aliases in the BY 
clause:

{code}
B = filter A by ( a is not null and b is not null and c is not null );
{code}

However, with ORDER BY, if you put parentheses around the BY clause, you get a 
syntax error:

{code}
 A = order A by ( a, b, c);
{code}

Produces the error:

{code}
2009-08-03 18:26:29,544 [main] ERROR org.apache.pig.tools.grunt.Grunt -
ERROR 1000: Error during parsing. Encountered  , ,  at line 3, column 19.
Was expecting:
) ...
{code}

This is an annoyance really.

Here's my full code example ...

{code}
A = load 'data.txt' using PigStorage as (a: chararray, b: chararray, c: 
chararray );

A = order A by ( a, b, c );

dump A;
{code}



 ORDER BY syntax wrt parentheses is somewhat different than GROUP BY and 
 FILTER BY
 -

 Key: PIG-900
 URL: https://issues.apache.org/jira/browse/PIG-900
 Project: Pig
  Issue Type: Bug
Reporter: David Ciemiewicz

 With GROUP BY, you must put parentheses around the aliases in the BY clause:
 {code}
 B = group A by ( a, b, c );
 {code}
 With FILTER BY, you can optionally put parentheses around the aliases in the 
 BY clause:
 {code}
 B = filter A by ( a is not null and b is not null and c is not null );
 {code}
 However, with ORDER BY, if you put parentheses around the BY clause, you get 
 a syntax error:
 {code}
  A = order A by ( a, b, c );
 {code}
 Produces the error:
 {code}
 2009-08-03 18:26:29,544 [main] ERROR org.apache.pig.tools.grunt.Grunt -
 ERROR 1000: Error during parsing. Encountered  , ,  at line 3, column 
 19.
 Was expecting:
 ) ...
 {code}
 This is an annoyance really.
 Here's my full code example ...
 {code}
 A = load 'data.txt' using PigStorage as (a: chararray, b: chararray, c: 
 chararray );
 A = order A by ( a, b, c );
 dump A;
 {code}




[jira] Created: (PIG-901) InputSplit (SliceWrapper) created by Pig is big in size due to serialized PigContext

2009-08-03 Thread Pradeep Kamath (JIRA)
InputSplit (SliceWrapper) created by Pig is big in size due to serialized 
PigContext


 Key: PIG-901
 URL: https://issues.apache.org/jira/browse/PIG-901
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.3.1
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
 Fix For: 0.4.0


InputSplit (SliceWrapper) created by Pig is big in size due to serialized 
PigContext. SliceWrapper only needs ExecType - so the entire PigContext should 
not be serialized and only the ExecType should be serialized.
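The essence of the proposed fix can be sketched as follows (illustrative only, not the actual SliceWrapper code; names here are hypothetical): serialize just the small ExecType enum the split needs rather than the whole context object.

```java
import java.io.*;

// Sketch of the idea behind PIG-901, not the actual SliceWrapper code:
// write only the small ExecType enum into the split's serialized form,
// instead of a whole serialized PigContext. Names are illustrative.
public class SliceSketch {
    enum ExecType { LOCAL, MAPREDUCE }

    static byte[] writeExecType(ExecType t) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bos);
            out.writeUTF(t.name());  // a handful of bytes vs. a full context
            out.flush();
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static ExecType readExecType(byte[] bytes) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
            return ExecType.valueOf(in.readUTF());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] b = writeExecType(ExecType.MAPREDUCE);
        System.out.println(b.length + " bytes");  // tiny payload
        System.out.println(readExecType(b));      // MAPREDUCE
    }
}
```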




[jira] Created: (PIG-902) Allow schema matching for UDF with variable length arguments

2009-08-03 Thread Daniel Dai (JIRA)
Allow schema matching for UDF with variable length arguments


 Key: PIG-902
 URL: https://issues.apache.org/jira/browse/PIG-902
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.3.0
Reporter: Daniel Dai


Pig picks the right version of a UDF using a similarity measurement. This 
mechanism picks the UDF with the right input schema to use. However, some UDFs 
take a variable number of inputs, and currently there is no way to declare such 
an input schema in a UDF; the similarity measurement does not match against a 
variable number of inputs. We can still write variable-input UDFs, but we 
cannot rely on schema matching to pick the right UDF version and do the 
automatic data type conversion.

E.g., if we have:
Integer udf1(Integer, ..);
Integer udf1(String, ..);

currently we cannot do this:
a: {chararray, chararray}
b = foreach a generate udf1(a.$0, a.$1);  // Pig cannot pick udf1(String, ..) 
automatically; currently this statement fails

E.g., if we have:
Integer udf2(Integer, ..);

currently this script fails:
a: {chararray, chararray}
b = foreach a generate udf2(a.$0, a.$1);  // currently Pig cannot convert a.$0 
into Integer automatically
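One way to express the requested matching rule (purely illustrative; this is not Pig's actual FuncSpec similarity logic, and all names are hypothetical): treat a trailing ".." as "the last declared type repeats", so a signature matches any input list at least as long as its declared prefix.

```java
import java.util.List;

// Illustrative sketch only, not Pig's actual schema-matching code: a
// signature whose last type is marked as repeating ("..") matches any
// input list that is at least as long, with the extra inputs checked
// against the repeated last type.
public class VarArgMatchSketch {

    static boolean matches(List<String> signature, boolean lastRepeats,
                           List<String> inputs) {
        if (!lastRepeats) return signature.equals(inputs);
        if (inputs.size() < signature.size()) return false;
        for (int i = 0; i < inputs.size(); i++) {
            // past the declared prefix, compare against the repeated last type
            String expected = signature.get(Math.min(i, signature.size() - 1));
            if (!expected.equals(inputs.get(i))) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // udf1(String, ..) against two chararray inputs
        System.out.println(matches(List.of("chararray"), true,
                                   List.of("chararray", "chararray"))); // true
        System.out.println(matches(List.of("chararray"), true,
                                   List.of("chararray", "int")));       // false
    }
}
```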





[jira] Commented: (PIG-901) InputSplit (SliceWrapper) created by Pig is big in size due to serialized PigContext

2009-08-03 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12738496#action_12738496
 ] 

Daniel Dai commented on PIG-901:


PigContext.packageImportList needs to be serialized as well. Otherwise the 
InputSplit cannot instantiate the loader function.

 InputSplit (SliceWrapper) created by Pig is big in size due to serialized 
 PigContext
 

 Key: PIG-901
 URL: https://issues.apache.org/jira/browse/PIG-901
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.3.1
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
 Fix For: 0.4.0


 InputSplit (SliceWrapper) created by Pig is big in size due to serialized 
 PigContext. SliceWrapper only needs ExecType - so the entire PigContext 
 should not be serialized and only the ExecType should be serialized.




[jira] Updated: (PIG-200) Pig Performance Benchmarks

2009-08-03 Thread Ying He (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ying He updated PIG-200:


Attachment: perf.hadoop.patch

perf.hadoop.patch is used to support running DataGenerator in hadoop mode. It 
should be installed on top of perf.patch. 

The design doc is here.
http://twiki.corp.yahoo.com/view/Tiger/DataGeneratorHadoop

 Pig Performance Benchmarks
 --

 Key: PIG-200
 URL: https://issues.apache.org/jira/browse/PIG-200
 Project: Pig
  Issue Type: Task
Reporter: Amir Youssefi
 Attachments: generate_data.pl, perf.hadoop.patch, perf.patch


 To benchmark Pig performance, we need to have a TPC-H like Large Data Set 
 plus Script Collection. This is used in comparison of different Pig releases, 
 Pig vs. other systems (e.g. Pig + Hadoop vs. Hadoop Only).
 Here is Wiki for small tests: http://wiki.apache.org/pig/PigPerformance
 I am currently running long-running Pig scripts over data-sets in the order 
 of tens of TBs. Next step is hundreds of TBs.
 We need to have an open large-data set (open source scripts which generate 
 data-set) and detailed scripts for important operations such as ORDER, 
 AGGREGATION etc.
 We can call those the Pig Workouts: Cardio (short processing), Marathon (long 
 running scripts) and Triathlon (Mix). 
 I will update this JIRA with more details of current activities soon.




[jira] Created: (PIG-903) ILLUSTRATE fails on 'Distinct' operator

2009-08-03 Thread Dmitriy V. Ryaboy (JIRA)
ILLUSTRATE fails on 'Distinct' operator
---

 Key: PIG-903
 URL: https://issues.apache.org/jira/browse/PIG-903
 Project: Pig
  Issue Type: Bug
Reporter: Dmitriy V. Ryaboy


Using the latest Pig from trunk (0.3+) in mapreduce mode, running through the 
tutorial script script1-hadoop.pig works fine.

However, executing the following illustrate command throws an exception:

illustrate ngramed2

Pig Stack Trace
---
ERROR 2999: Unexpected internal error. Unrecognized logical operator.

java.lang.RuntimeException: Unrecognized logical operator.
at 
org.apache.pig.pen.EquivalenceClasses.GetEquivalenceClasses(EquivalenceClasses.java:60)
at 
org.apache.pig.pen.DerivedDataVisitor.evaluateOperator(DerivedDataVisitor.java:368)
at 
org.apache.pig.pen.DerivedDataVisitor.visit(DerivedDataVisitor.java:226)
at 
org.apache.pig.impl.logicalLayer.LODistinct.visit(LODistinct.java:104)
at org.apache.pig.impl.logicalLayer.LODistinct.visit(LODistinct.java:37)
at 
org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:68)
at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
at 
org.apache.pig.pen.LineageTrimmingVisitor.init(LineageTrimmingVisitor.java:98)
at 
org.apache.pig.pen.LineageTrimmingVisitor.init(LineageTrimmingVisitor.java:90)
at 
org.apache.pig.pen.ExampleGenerator.getExamples(ExampleGenerator.java:106)
at org.apache.pig.PigServer.getExamples(PigServer.java:724)
at 
org.apache.pig.tools.grunt.GruntParser.processIllustrate(GruntParser.java:541)
at 
org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:195)
at 
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165)
at 
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:141)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:75)
at org.apache.pig.Main.main(Main.java:361)


This works:
illustrate ngramed1;

Although it does throw a few NPEs:

java.lang.NullPointerException
at 
org.apache.pig.pen.util.DisplayExamples.ShortenField(DisplayExamples.java:205)
at 
org.apache.pig.pen.util.DisplayExamples.MakeArray(DisplayExamples.java:190)
at 
org.apache.pig.pen.util.DisplayExamples.PrintTabular(DisplayExamples.java:86)
[...]

(illustrate also doesn't work on bzipped input, but that's a separate issue)




[jira] Issue Comment Edited: (PIG-200) Pig Performance Benchmarks

2009-08-03 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12738556#action_12738556
 ] 

Olga Natkovich edited comment on PIG-200 at 8/3/09 2:01 PM:


perf.hadoop.patch is used to support running DataGenerator in hadoop mode. It 
should be installed on top of perf.patch. 

  was (Author: yinghe):
perf.hadoop.patch is used to support running DataGenerator in hadoop mode. 
It should be installed on top of perf.patch. 

The design doc is here.
http://twiki.corp.yahoo.com/view/Tiger/DataGeneratorHadoop
  
 Pig Performance Benchmarks
 --

 Key: PIG-200
 URL: https://issues.apache.org/jira/browse/PIG-200
 Project: Pig
  Issue Type: Task
Reporter: Amir Youssefi
 Attachments: generate_data.pl, perf.hadoop.patch, perf.patch


 To benchmark Pig performance, we need to have a TPC-H like Large Data Set 
 plus Script Collection. This is used in comparison of different Pig releases, 
 Pig vs. other systems (e.g. Pig + Hadoop vs. Hadoop Only).
 Here is Wiki for small tests: http://wiki.apache.org/pig/PigPerformance
 I am currently running long-running Pig scripts over data-sets in the order 
 of tens of TBs. Next step is hundreds of TBs.
 We need to have an open large-data set (open source scripts which generate 
 data-set) and detailed scripts for important operations such as ORDER, 
 AGGREGATION etc.
 We can call those the Pig Workouts: Cardio (short processing), Marathon (long 
 running scripts) and Triathlon (Mix). 
 I will update this JIRA with more details of current activities soon.




[jira] Commented: (PIG-200) Pig Performance Benchmarks

2009-08-03 Thread Ying He (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12738609#action_12738609
 ] 

Ying He commented on PIG-200:
-

doc for DataGenerator in hadoop mode is here: 
http://wiki.apache.org/pig/DataGeneratorHadoop

 Pig Performance Benchmarks
 --

 Key: PIG-200
 URL: https://issues.apache.org/jira/browse/PIG-200
 Project: Pig
  Issue Type: Task
Reporter: Amir Youssefi
 Attachments: generate_data.pl, perf.hadoop.patch, perf.patch


 To benchmark Pig performance, we need to have a TPC-H like Large Data Set 
 plus Script Collection. This is used in comparison of different Pig releases, 
 Pig vs. other systems (e.g. Pig + Hadoop vs. Hadoop Only).
 Here is Wiki for small tests: http://wiki.apache.org/pig/PigPerformance
 I am currently running long-running Pig scripts over data-sets in the order 
 of tens of TBs. Next step is hundreds of TBs.
 We need to have an open large-data set (open source scripts which generate 
 data-set) and detailed scripts for important operations such as ORDER, 
 AGGREGATION etc.
 We can call those the Pig Workouts: Cardio (short processing), Marathon (long 
 running scripts) and Triathlon (Mix). 
 I will update this JIRA with more details of current activities soon.




[jira] Updated: (PIG-660) Integration with Hadoop 0.20

2009-08-03 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-660:
---

Attachment: PIG-660-for-branch-0.3.patch

Attached a patch for branch-0.3 based on PIG-660_5.patch. The only difference 
is that a couple of files (HConfiguration.java and HDataStorage.java) need 
CTRL-M (carriage return) line endings for the patch to apply correctly to 
branch-0.3.

 Integration with Hadoop 0.20
 

 Key: PIG-660
 URL: https://issues.apache.org/jira/browse/PIG-660
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.2.0
 Environment: Hadoop 0.20
Reporter: Santhosh Srinivasan
Assignee: Santhosh Srinivasan
 Fix For: 0.4.0

 Attachments: PIG-660-for-branch-0.3.patch, PIG-660.patch, 
 PIG-660_1.patch, PIG-660_2.patch, PIG-660_3.patch, PIG-660_4.patch, 
 PIG-660_5.patch


 With Hadoop 0.20, it will be possible to query the status of each map and 
 reduce in a map reduce job. This will allow better error reporting. Some of 
 the other items that could be on Hadoop's feature requests/bugs are 
 documented here for tracking.
 1. Hadoop should return objects instead of strings when exceptions are thrown
 2. The JobControl should handle all exceptions and report them appropriately. 
 For example, when the JobControl fails to launch jobs, it should handle 
 exceptions appropriately and should support APIs that query this state, i.e., 
 failure to launch jobs.




[jira] Updated: (PIG-901) InputSplit (SliceWrapper) created by Pig is big in size due to serialized PigContext

2009-08-03 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-901:
---

Attachment: PIG-901-1.patch

Added a unit test to make sure this change will not affect udf.import.list.

 InputSplit (SliceWrapper) created by Pig is big in size due to serialized 
 PigContext
 

 Key: PIG-901
 URL: https://issues.apache.org/jira/browse/PIG-901
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.3.1
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
 Fix For: 0.4.0

 Attachments: PIG-901-1.patch


 InputSplit (SliceWrapper) created by Pig is big in size due to serialized 
 PigContext. SliceWrapper only needs ExecType - so the entire PigContext 
 should not be serialized and only the ExecType should be serialized.




[jira] Created: (PIG-904) Conversion from double to chararray for udf input arguments does not occur

2009-08-03 Thread Pradeep Kamath (JIRA)
Conversion from double to chararray for udf input arguments does not occur
--

 Key: PIG-904
 URL: https://issues.apache.org/jira/browse/PIG-904
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.3.1
Reporter: Pradeep Kamath


Script showing the problem:
{noformat}
a = load '/user/pig/tests/data/singlefile/studenttab10k' as (name, age, gpa:double);
b = foreach a generate CONCAT(gpa, 'dummy');
dump b;

Error shown:
2009-08-03 17:04:27,573 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1045: Could not infer the matching function for org.apache.pig.builtin.CONCAT as multiple or none of them fit. Please use an explicit cast.
{noformat}

The error goes away if gpa is cast to chararray.




[jira] Updated: (PIG-901) InputSplit (SliceWrapper) created by Pig is big in size due to serialized PigContext

2009-08-03 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-901:
---

Attachment: PIG-901-branch-0.3.patch

Patch for 0.3 branch

 InputSplit (SliceWrapper) created by Pig is big in size due to serialized 
 PigContext
 

 Key: PIG-901
 URL: https://issues.apache.org/jira/browse/PIG-901
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.3.1
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
 Fix For: 0.4.0

 Attachments: PIG-901-1.patch, PIG-901-branch-0.3.patch


 InputSplit (SliceWrapper) created by Pig is big in size due to serialized 
 PigContext. SliceWrapper only needs ExecType - so the entire PigContext 
 should not be serialized and only the ExecType should be serialized.




[jira] Commented: (PIG-901) InputSplit (SliceWrapper) created by Pig is big in size due to serialized PigContext

2009-08-03 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12738734#action_12738734
 ] 

Olga Natkovich commented on PIG-901:


+1 on the patch for the 0.3 branch. Please commit.

 InputSplit (SliceWrapper) created by Pig is big in size due to serialized 
 PigContext
 

 Key: PIG-901
 URL: https://issues.apache.org/jira/browse/PIG-901
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.3.1
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
 Fix For: 0.4.0

 Attachments: PIG-901-1.patch, PIG-901-branch-0.3.patch


 InputSplit (SliceWrapper) created by Pig is big in size due to serialized 
 PigContext. SliceWrapper only needs ExecType - so the entire PigContext 
 should not be serialized and only the ExecType should be serialized.




[jira] Commented: (PIG-901) InputSplit (SliceWrapper) created by Pig is big in size due to serialized PigContext

2009-08-03 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12738740#action_12738740
 ] 

Arun C Murthy commented on PIG-901:
---

It would be nice to add a test case which (for now) checks to ensure that the 
size of a serialized 'slice' is less than 500KB or so...
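The suggested check could look roughly like this (a generic sketch using plain Java serialization; a real test would serialize an actual SliceWrapper, and the stand-in object and class name here are hypothetical):

```java
import java.io.*;

// Sketch of the suggested regression test: measure an object's
// Java-serialized size and fail if it exceeds a threshold. A real test
// would serialize an actual SliceWrapper; the String is a stand-in.
public class SliceSizeTestSketch {

    static int serializedSize(Serializable obj) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
                oos.writeObject(obj);
            }
            return bos.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int size = serializedSize("pretend this is a slice");
        if (size >= 500 * 1024) {
            throw new AssertionError("serialized slice too big: " + size + " bytes");
        }
        System.out.println("serialized size: " + size + " bytes");
    }
}
```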

 InputSplit (SliceWrapper) created by Pig is big in size due to serialized 
 PigContext
 

 Key: PIG-901
 URL: https://issues.apache.org/jira/browse/PIG-901
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.3.1
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
 Fix For: 0.4.0

 Attachments: PIG-901-1.patch, PIG-901-branch-0.3.patch


 InputSplit (SliceWrapper) created by Pig is big in size due to serialized 
 PigContext. SliceWrapper only needs ExecType - so the entire PigContext 
 should not be serialized and only the ExecType should be serialized.




Re: Is it possible to access Configuration in UDF ?

2009-08-03 Thread Daniel Dai

Hi Jeff,

This is not an API at all; it is a hack to make things work. We do lack a 
couple of features for UDFs:

1. reporter and counter (PIG-889)
2. access to global properties
3. the ability to maintain state across different UDF invocations
4. input schema
5. variable-length arguments (PIG-902)

Your suggestion sounds reasonable. We need to provide a well-designed 
interface for these features.


- Original Message - 
From: zhang jianfeng zjf...@gmail.com

To: pig-u...@hadoop.apache.org; pig-dev@hadoop.apache.org
Sent: Monday, August 03, 2009 8:03 PM
Subject: Re: Is it possible to access Configuration in UDF ?



Dmitriy,

Thank you for your help.

I find this way of using the API not very intuitive. I recommend that the base 
class of UDFs implement the Configurable interface; then each UDF can use 
getConf() to get the Configuration object. Because a UDF is part of MapReduce, 
it makes sense to make it Configurable.

The following is how I recommend changing EvalFunc:

public abstract class EvalFunc<T> implements Configurable {
    ..
    protected Configuration conf;
    ..
    public EvalFunc() {
        conf = PigMapReduce.sJobConf;
    }
    ..
    @Override
    public void setConf(Configuration conf) {
        this.conf = conf;
    }

    @Override
    public Configuration getConf() {
        return this.conf;
    }
}




Jeff Zhang





On Mon, Aug 3, 2009 at 8:52 PM, Dmitriy Ryaboy dvrya...@cloudera.com wrote:



You can access the JobConf with the following call:

ConfigurationUtil.toProperties(PigMapReduce.sJobConf)

On Mon, Aug 3, 2009 at 12:40 AM, zhang jianfeng zjf...@gmail.com wrote:
 Hi all,

 I'd like to set property in Configuration to customize my UDF. But  it
looks
 like I can not access the Configuration object in UDF.

 Does pig have a plan to support this feature ?


 Thank you.

 Jeff Zhang








[jira] Assigned: (PIG-891) Fixing dfs statement for Pig

2009-08-03 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang reassigned PIG-891:
--

Assignee: Jeff Zhang

 Fixing dfs statement for Pig
 

 Key: PIG-891
 URL: https://issues.apache.org/jira/browse/PIG-891
 Project: Pig
  Issue Type: Bug
Reporter: Daniel Dai
Assignee: Jeff Zhang
Priority: Minor

 Several hadoop dfs commands are not supported, or are restrictive, in current 
 Pig. We need to fix that. These include:
 1. Several commands are not supported: lsr, dus, count, rmr, expunge, put, 
 moveFromLocal, get, getmerge, text, moveToLocal, mkdir, touchz, test, stat, 
 tail, chmod, chown, chgrp. A reference for these commands can be found at 
 http://hadoop.apache.org/common/docs/current/hdfs_shell.html
 2. None of the existing dfs commands support globbing.
 3. Pig should provide a programmatic way to perform dfs commands. Several of 
 them exist in PigServer, but not all of them.
