[jira] Updated: (PIG-893) support cast of chararray to other simple types
[ https://issues.apache.org/jira/browse/PIG-893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated PIG-893: --- Fix Version/s: 0.4.0 Affects Version/s: 0.4.0 Status: Patch Available (was: Open) support cast of chararray to other simple types --- Key: PIG-893 URL: https://issues.apache.org/jira/browse/PIG-893 Project: Pig Issue Type: New Feature Affects Versions: 0.4.0 Reporter: Thejas M Nair Fix For: 0.4.0 Attachments: Pig_893_Patch.txt Pig should support casting of chararray to integer,long,float,double,bytearray. If the conversion fails for reasons such as overflow, cast should return null and log a warning. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
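The requested semantics — parse failures and overflow yield null plus a logged warning — can be sketched in Python. This is a toy illustration of the behavior described above; the function and type names are assumptions, not Pig's API (the attached patch implements the real conversions in Utf8StorageConverter/CastUtil):

```python
import logging

def cast_chararray(value, target):
    """Cast a chararray (string) to a simple type, as requested in PIG-893:
    a failed conversion (malformed input, overflow) returns null (None here)
    and logs a warning instead of raising an error."""
    converters = {
        "int": int,
        "long": int,
        "float": float,
        "double": float,
        "bytearray": lambda s: s.encode("utf-8"),
    }
    try:
        result = converters[target](value)
        # Treat values outside the 32-bit range as a failed int cast (overflow).
        if target == "int" and not -2**31 <= result < 2**31:
            raise OverflowError(value)
        return result
    except (ValueError, OverflowError):
        logging.warning("could not cast %r to %s; returning null", value, target)
        return None
```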
[jira] Updated: (PIG-893) support cast of chararray to other simple types
[ https://issues.apache.org/jira/browse/PIG-893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated PIG-893: --- Attachment: Pig_893_Patch.txt Attached the patch, including the test case. I extracted the bytesTo* methods from Utf8StorageConverter into CastUtil, so these methods can be reused by other objects.
[jira] Commented: (PIG-592) schema inferred incorrectly
[ https://issues.apache.org/jira/browse/PIG-592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12738472#action_12738472 ] Daniel Dai commented on PIG-592: Also the following script produces the wrong schema: a = load 'a'; b = load 'b'; c = join a by $0, b by $0; describe c; c: {bytearray,bytearray} The correct behavior should be: if any of the input schemas is unknown, the output schema is also unknown. schema inferred incorrectly --- Key: PIG-592 URL: https://issues.apache.org/jira/browse/PIG-592 Project: Pig Issue Type: Bug Affects Versions: 0.2.0 Reporter: Christopher Olston A simple pig script that never introduces any schema information: A = load 'foo'; B = foreach (group A by $8) generate group, COUNT($1); C = load 'bar'; // ('bar' has two columns) D = join B by $0, C by $0; E = foreach D generate $0, $1, $3; fails, complaining that $3 does not exist: java.io.IOException: Out of bound access. Trying to access non-existent column: 3. Schema {B::group: bytearray,long,bytearray} has 3 column(s). Apparently Pig gets confused and thinks it knows the schema for C (a single bytearray column).
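The proposed rule — an unknown input schema makes the join's output schema unknown — can be stated as a tiny toy model (plain Python lists stand in for schemas and None marks "unknown"; this is an illustration, not Pig's schema code):

```python
def join_schema(left, right):
    """Schema of 'c = join a by .., b by ..' in a toy model: a schema is a
    list of field types, None means unknown. PIG-592's proposed rule: if
    either input schema is unknown, the joined schema is unknown too."""
    if left is None or right is None:
        return None
    return left + right
```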
[jira] Created: (PIG-900) ORDER BY syntax wrt parentheses is somewhat different than GROUP BY and FILTER BY
ORDER BY syntax wrt parentheses is somewhat different than GROUP BY and FILTER BY - Key: PIG-900 URL: https://issues.apache.org/jira/browse/PIG-900 Project: Pig Issue Type: Bug Reporter: David Ciemiewicz With GROUP BY, you must put parentheses around the aliases in the BY clause: {code} B = group A by ( a, b, c ); {code} With FILTER BY, you can optionally put parentheses around the aliases in the BY clause: {code} B = filter A by ( a is not null and b is not null and c is not null ); {code} However, with ORDER BY, if you put parentheses around the BY clause, you get a syntax error: {code} A = order A by ( a, b, c); {code} This produces the error: {code} 2009-08-03 18:26:29,544 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Encountered "," at line 3, column 19. Was expecting: ")" ... {code} This is an annoyance, really. {code} A = load 'data.txt' using PigStorage as (a: chararray, b: chararray, c: chararray ); A = order A by ( a, b, c ); dump A; {code}
[jira] Updated: (PIG-900) ORDER BY syntax wrt parentheses is somewhat different than GROUP BY and FILTER BY
[ https://issues.apache.org/jira/browse/PIG-900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Ciemiewicz updated PIG-900: - Description: With GROUP BY, you must put parentheses around the aliases in the BY clause: {code} B = group A by ( a, b, c ); {code} With FILTER BY, you can optionally put parentheses around the aliases in the BY clause: {code} B = filter A by ( a is not null and b is not null and c is not null ); {code} However, with ORDER BY, if you put parentheses around the BY clause, you get a syntax error: {code} A = order A by ( a, b, c); {code} This produces the error: {code} 2009-08-03 18:26:29,544 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Encountered "," at line 3, column 19. Was expecting: ")" ... {code} This is an annoyance, really. Here's my full code example: {code} A = load 'data.txt' using PigStorage as (a: chararray, b: chararray, c: chararray ); A = order A by ( a, b, c ); dump A; {code}
[jira] Created: (PIG-901) InputSplit (SliceWrapper) created by Pig is big in size due to serialized PigContext
InputSplit (SliceWrapper) created by Pig is big in size due to serialized PigContext Key: PIG-901 URL: https://issues.apache.org/jira/browse/PIG-901 Project: Pig Issue Type: Bug Affects Versions: 0.3.1 Reporter: Pradeep Kamath Assignee: Pradeep Kamath Fix For: 0.4.0 The InputSplit (SliceWrapper) created by Pig is large because it serializes the entire PigContext. SliceWrapper only needs the ExecType, so only the ExecType, not the whole PigContext, should be serialized.
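The fix amounts to serializing only the field the split actually needs instead of the whole context. A Python analogy (the class name echoes the issue; the logic is an illustrative sketch, not Pig's Java code):

```python
import pickle

class SliceWrapper:
    """Toy analogue of Pig's SliceWrapper: only the exec type is needed after
    deserialization, so __getstate__ drops the heavyweight context."""

    def __init__(self, exec_type, context):
        self.exec_type = exec_type
        self.context = context  # large object, e.g. a full PigContext

    def __getstate__(self):
        # Serialize only what the split actually needs.
        return {"exec_type": self.exec_type}

    def __setstate__(self, state):
        self.exec_type = state["exec_type"]
        self.context = None

split = SliceWrapper("mapreduce", {"props": "x" * 100_000})
small_bytes = pickle.dumps(split)  # stays tiny despite the large context
```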
[jira] Created: (PIG-902) Allow schema matching for UDF with variable length arguments
Allow schema matching for UDF with variable length arguments Key: PIG-902 URL: https://issues.apache.org/jira/browse/PIG-902 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.3.0 Reporter: Daniel Dai Pig picks the right version of a UDF using a similarity measurement: this mechanism picks the UDF whose declared input schema best matches the actual input. However, some UDFs take a variable number of inputs, and currently there is no way to declare such an input schema in a UDF; the similarity measurement does not match against a variable number of inputs. We can still write variable-input UDFs, but we cannot rely on schema matching to pick the right UDF version and do the automatic data type conversion. E.g., if we have: Integer udf1(Integer, ..); Integer udf1(String, ..); currently we cannot do this: a: {chararray, chararray} b = foreach a generate udf1(a.$0, a.$1); // Pig cannot pick udf1(String, ..) automatically; currently, this statement fails. E.g., if we have: Integer udf2(Integer, ..); currently, this script fails: a: {chararray, chararray} b = foreach a generate udf2(a.$0, a.$1); // Currently, Pig cannot convert a.$0 into Integer automatically
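The requested behavior can be sketched as similarity matching that also understands a trailing variadic marker. This is a toy model — the scoring and the '...' notation are assumptions for illustration, not Pig's actual matching code:

```python
def match(signature, arg_types):
    """Return a similarity score for a call with arg_types, or None if the
    signature cannot apply. A trailing '...' marks a variadic signature:
    the last named type may repeat any number of times (the feature PIG-902
    asks for). The score counts exact type matches."""
    variadic = bool(signature) and signature[-1] == "..."
    fixed = list(signature[:-1]) if variadic else list(signature)
    if variadic:
        if len(arg_types) < len(fixed):
            return None
        expected = fixed + [fixed[-1]] * (len(arg_types) - len(fixed))
    else:
        if len(arg_types) != len(fixed):
            return None
        expected = fixed
    return sum(e == a for e, a in zip(expected, arg_types))

def pick_udf(candidates, arg_types):
    """Pick the candidate signature with the highest similarity score."""
    scored = [(match(sig, arg_types), sig) for sig in candidates]
    scored = [(s, sig) for s, sig in scored if s is not None]
    return max(scored)[1] if scored else None
```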
[jira] Commented: (PIG-901) InputSplit (SliceWrapper) created by Pig is big in size due to serialized PigContext
[ https://issues.apache.org/jira/browse/PIG-901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12738496#action_12738496 ] Daniel Dai commented on PIG-901: PigContext.packageImportList needs to be serialized as well; otherwise InputSplit cannot instantiate the loader function.
[jira] Updated: (PIG-200) Pig Performance Benchmarks
[ https://issues.apache.org/jira/browse/PIG-200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ying He updated PIG-200: Attachment: perf.hadoop.patch perf.hadoop.patch is used to support running DataGenerator in hadoop mode. It should be installed on top of perf.patch. The design doc is here. http://twiki.corp.yahoo.com/view/Tiger/DataGeneratorHadoop Pig Performance Benchmarks -- Key: PIG-200 URL: https://issues.apache.org/jira/browse/PIG-200 Project: Pig Issue Type: Task Reporter: Amir Youssefi Attachments: generate_data.pl, perf.hadoop.patch, perf.patch To benchmark Pig performance, we need to have a TPC-H like Large Data Set plus Script Collection. This is used in comparison of different Pig releases, Pig vs. other systems (e.g. Pig + Hadoop vs. Hadoop Only). Here is Wiki for small tests: http://wiki.apache.org/pig/PigPerformance I am currently running long-running Pig scripts over data-sets in the order of tens of TBs. Next step is hundreds of TBs. We need to have an open large-data set (open source scripts which generate data-set) and detailed scripts for important operations such as ORDER, AGGREGATION etc. We can call those the Pig Workouts: Cardio (short processing), Marathon (long running scripts) and Triathlon (Mix). I will update this JIRA with more details of current activities soon.
[jira] Created: (PIG-903) ILLUSTRATE fails on 'Distinct' operator
ILLUSTRATE fails on 'Distinct' operator --- Key: PIG-903 URL: https://issues.apache.org/jira/browse/PIG-903 Project: Pig Issue Type: Bug Reporter: Dmitriy V. Ryaboy Using the latest Pig from trunk (0.3+) in mapreduce mode, running through the tutorial script script1-hadoop.pig works fine. However, executing the following illustrate command throws an exception: illustrate ngramed2 Pig Stack Trace --- ERROR 2999: Unexpected internal error. Unrecognized logical operator. java.lang.RuntimeException: Unrecognized logical operator. at org.apache.pig.pen.EquivalenceClasses.GetEquivalenceClasses(EquivalenceClasses.java:60) at org.apache.pig.pen.DerivedDataVisitor.evaluateOperator(DerivedDataVisitor.java:368) at org.apache.pig.pen.DerivedDataVisitor.visit(DerivedDataVisitor.java:226) at org.apache.pig.impl.logicalLayer.LODistinct.visit(LODistinct.java:104) at org.apache.pig.impl.logicalLayer.LODistinct.visit(LODistinct.java:37) at org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:68) at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51) at org.apache.pig.pen.LineageTrimmingVisitor.init(LineageTrimmingVisitor.java:98) at org.apache.pig.pen.LineageTrimmingVisitor.init(LineageTrimmingVisitor.java:90) at org.apache.pig.pen.ExampleGenerator.getExamples(ExampleGenerator.java:106) at org.apache.pig.PigServer.getExamples(PigServer.java:724) at org.apache.pig.tools.grunt.GruntParser.processIllustrate(GruntParser.java:541) at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:195) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:141) at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:75) at org.apache.pig.Main.main(Main.java:361) This works: illustrate ngramed1; Although it does throw a few NPEs : java.lang.NullPointerException at 
org.apache.pig.pen.util.DisplayExamples.ShortenField(DisplayExamples.java:205) at org.apache.pig.pen.util.DisplayExamples.MakeArray(DisplayExamples.java:190) at org.apache.pig.pen.util.DisplayExamples.PrintTabular(DisplayExamples.java:86) [...] (illustrate also doesn't work on bzipped input, but that's a separate issue)
[jira] Issue Comment Edited: (PIG-200) Pig Performance Benchmarks
[ https://issues.apache.org/jira/browse/PIG-200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12738556#action_12738556 ] Olga Natkovich edited comment on PIG-200 at 8/3/09 2:01 PM: perf.hadoop.patch is used to support running DataGenerator in hadoop mode. It should be installed on top of perf.patch. was (Author: yinghe): perf.hadoop.patch is used to support running DataGenerator in hadoop mode. It should be installed on top of perf.patch. The design doc is here. http://twiki.corp.yahoo.com/view/Tiger/DataGeneratorHadoop
[jira] Commented: (PIG-200) Pig Performance Benchmarks
[ https://issues.apache.org/jira/browse/PIG-200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12738609#action_12738609 ] Ying He commented on PIG-200: - The doc for DataGenerator in hadoop mode is here: http://wiki.apache.org/pig/DataGeneratorHadoop
[jira] Updated: (PIG-660) Integration with Hadoop 0.20
[ https://issues.apache.org/jira/browse/PIG-660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath updated PIG-660: --- Attachment: PIG-660-for-branch-0.3.patch Attached a patch for branch-0.3 based on PIG-660_5.patch. The only difference is that a couple of files (HConfiguration.java and HDataStorage.java) need ctrl-M (carriage return) line endings for the patch to apply correctly to branch-0.3. Integration with Hadoop 0.20 Key: PIG-660 URL: https://issues.apache.org/jira/browse/PIG-660 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.2.0 Environment: Hadoop 0.20 Reporter: Santhosh Srinivasan Assignee: Santhosh Srinivasan Fix For: 0.4.0 Attachments: PIG-660-for-branch-0.3.patch, PIG-660.patch, PIG-660_1.patch, PIG-660_2.patch, PIG-660_3.patch, PIG-660_4.patch, PIG-660_5.patch With Hadoop 0.20, it will be possible to query the status of each map and reduce in a map reduce job. This will allow better error reporting. Some of the other items that could be on Hadoop's feature requests/bugs are documented here for tracking: 1. Hadoop should return objects instead of strings when exceptions are thrown. 2. The JobControl should handle all exceptions and report them appropriately. For example, when the JobControl fails to launch jobs, it should handle exceptions appropriately and should support APIs that query this state, i.e., failure to launch jobs.
[jira] Updated: (PIG-901) InputSplit (SliceWrapper) created by Pig is big in size due to serialized PigContext
[ https://issues.apache.org/jira/browse/PIG-901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-901: --- Attachment: PIG-901-1.patch Added a unit test to make sure this change will not affect udf.import.list.
[jira] Created: (PIG-904) Conversion from double to chararray for udf input arguments does not occur
Conversion from double to chararray for udf input arguments does not occur -- Key: PIG-904 URL: https://issues.apache.org/jira/browse/PIG-904 Project: Pig Issue Type: Bug Affects Versions: 0.3.1 Reporter: Pradeep Kamath Script showing the problem: {noformat} a = load '/user/pig/tests/data/singlefile/studenttab10k' as (name, age, gpa:double); b = foreach a generate CONCAT(gpa, 'dummy'); dump b; Error shown: 2009-08-03 17:04:27,573 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1045: Could not infer the matching function for org.apache.pig.builtin.CONCAT as multiple or none of them fit. Please use an explicit cast. {noformat} The error goes away if gpa is cast to chararray.
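The resolution failure can be modeled as overload matching against a table of implicit coercions that lacks double → chararray. The coercion entries below are made up for illustration, not Pig's actual type rules; the point is that with no applicable coercion, no CONCAT signature fits ("multiple or none of them fit"), which is why the explicit cast works around it:

```python
# Toy overload resolution for PIG-904. The allowed implicit casts are
# illustrative; double -> chararray is deliberately absent.
COERCIONS = {("int", "long"), ("float", "double")}

def arg_fits(actual, expected):
    return actual == expected or (actual, expected) in COERCIONS

def resolvable(signatures, arg_types):
    """A call resolves only if exactly one signature fits."""
    fits = [sig for sig in signatures
            if len(sig) == len(arg_types)
            and all(arg_fits(a, e) for a, e in zip(arg_types, sig))]
    return len(fits) == 1

CONCAT = [("chararray", "chararray"), ("bytearray", "bytearray")]
```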
[jira] Updated: (PIG-901) InputSplit (SliceWrapper) created by Pig is big in size due to serialized PigContext
[ https://issues.apache.org/jira/browse/PIG-901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath updated PIG-901: --- Attachment: PIG-901-branch-0.3.patch Patch for 0.3 branch
[jira] Commented: (PIG-901) InputSplit (SliceWrapper) created by Pig is big in size due to serialized PigContext
[ https://issues.apache.org/jira/browse/PIG-901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12738734#action_12738734 ] Olga Natkovich commented on PIG-901: +1 on the patch for the 0.3 branch. Please, commit
[jira] Commented: (PIG-901) InputSplit (SliceWrapper) created by Pig is big in size due to serialized PigContext
[ https://issues.apache.org/jira/browse/PIG-901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12738740#action_12738740 ] Arun C Murthy commented on PIG-901: --- It would be nice to add a test case which (for now) checks to ensure that the size of a serialized 'slice' is less than 500KB or so...
Re: Is it possible to access Configuration in UDF ?
Hi Jeff, This is not an API at all; it is a hack to make things work. We do lack a couple of features for UDFs: 1. reporter and counter (PIG-889) 2. access to global properties 3. the ability to maintain state across different UDF invocations 4. input schema 5. variable-length arguments (PIG-902) Your suggestion sounds reasonable. We need to provide a well-designed interface for these features. - Original Message - From: zhang jianfeng zjf...@gmail.com To: pig-u...@hadoop.apache.org; pig-dev@hadoop.apache.org Sent: Monday, August 03, 2009 8:03 PM Subject: Re: Is it possible to access Configuration in UDF ? Dmitriy, Thank you for your help. I find this way of using the API not so intuitive; I recommend that the base class of UDFs implement the Configurable interface. Then each UDF can use getConf() to get the Configuration object. Because a UDF is part of MapReduce, it makes sense to make it Configurable. The following is what I recommend changing in EvalFunc:

public abstract class EvalFunc<T> implements Configurable {
    ..
    protected Configuration conf;
    ..
    public EvalFunc() {
        conf = PigMapReduce.sJobConf;
    }
    ..
    @Override
    public void setConf(Configuration conf) {
        this.conf = conf;
    }

    @Override
    public Configuration getConf() {
        return this.conf;
    }
}

Jeff Zhang On Mon, Aug 3, 2009 at 8:52 PM, Dmitriy Ryaboy dvrya...@cloudera.com wrote: You can access the JobConf with the following call: ConfigurationUtil.toProperties(PigMapReduce.sJobConf) On Mon, Aug 3, 2009 at 12:40 AM, zhang jianfeng zjf...@gmail.com wrote: Hi all, I'd like to set a property in Configuration to customize my UDF. But it looks like I cannot access the Configuration object in a UDF. Does pig have a plan to support this feature ? Thank you. Jeff Zhang
[jira] Assigned: (PIG-891) Fixing dfs statement for Pig
[ https://issues.apache.org/jira/browse/PIG-891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang reassigned PIG-891: -- Assignee: Jeff Zhang Fixing dfs statement for Pig Key: PIG-891 URL: https://issues.apache.org/jira/browse/PIG-891 Project: Pig Issue Type: Bug Reporter: Daniel Dai Assignee: Jeff Zhang Priority: Minor Several hadoop dfs commands are not supported, or are restrictive, in current Pig. We need to fix that. These include: 1. Several commands are not supported: lsr, dus, count, rmr, expunge, put, moveFromLocal, get, getmerge, text, moveToLocal, mkdir, touchz, test, stat, tail, chmod, chown, chgrp. A reference for these commands can be found at http://hadoop.apache.org/common/docs/current/hdfs_shell.html 2. None of the existing dfs commands support globbing. 3. Pig should provide a programmatic way to perform dfs commands. Several of them exist in PigServer, but not all of them.
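Item 2 (globbing) essentially calls for expanding a glob pattern before dispatching the dfs command. A hypothetical sketch using Python's fnmatch, with a plain path list standing in for a real HDFS listing — real support would query the filesystem instead:

```python
import fnmatch

def expand_globs(pattern, paths):
    """Expand a dfs-style glob against a listing of known paths, so a command
    like 'rmr /logs/2009-*' operates on every match (toy model for PIG-891
    item 2). An unmatched pattern passes through as a literal path."""
    matches = [p for p in paths if fnmatch.fnmatch(p, pattern)]
    return matches or [pattern]
```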