[jira] Commented: (PIG-1449) RegExLoader hangs on lines that don't match the regular expression
[ https://issues.apache.org/jira/browse/PIG-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884551#action_12884551 ]

Ashutosh Chauhan commented on PIG-1449:
---------------------------------------

Reran the contrib tests. All passed. Patch committed. Thanks, Christian and Justin, for working on this!

RegExLoader hangs on lines that don't match the regular expression
------------------------------------------------------------------

Key: PIG-1449
URL: https://issues.apache.org/jira/browse/PIG-1449
Project: Pig
Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Justin Sanders
Priority: Minor
Attachments: PIG-1449-RegExLoaderInfiniteLoopFix.patch, RegExLoader.patch

The 0.7.0 changes to RegExLoader introduced a bug: the code stays in the while loop if a line does not match the regular expression. Before 0.7.0, such lines were skipped. As a result, the mapper stops responding and times out with "Task attempt_X failed to report status for 600 seconds. Killing!".

Here are the steps to recreate the bug:

Create a text file in HDFS with the following lines:

test1
testA
test2

Run the following Pig script:

REGISTER /usr/local/pig/contrib/piggybank/java/piggybank.jar;
test = LOAD '/path/to/test.txt' using org.apache.pig.piggybank.storage.MyRegExLoader('(test\\d)') AS (line);
dump test;

Expected result:

(test1)
(test2)

Actual result: the job fails to complete after the 600-second timeout waiting on the mapper. The mapper hangs at 33% because it can process the first line but gets stuck in the while loop on the second line.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
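The failure mode described in this report can be illustrated with a minimal sketch. This is not the RegExLoader source and not the committed patch: the class and method names below are hypothetical, and a BufferedReader stands in for the Hadoop record reader. The only point it makes is that the loop has to read a fresh line on every iteration, so that a non-matching line is skipped instead of being tested forever.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the piggybank RegExLoader code.
class RegexLineSkipSketch {

    // Returns the first capture group of the next matching line,
    // or null once the input is exhausted.
    static String nextMatch(BufferedReader reader, Pattern pattern) throws IOException {
        String line;
        // Reading inside the loop condition guarantees progress:
        // a line that does not match is dropped and the next one is read.
        while ((line = reader.readLine()) != null) {
            Matcher m = pattern.matcher(line);
            if (m.find()) {
                return m.group(1);
            }
        }
        return null;
    }

    // Collects the captured value from every matching line of the text.
    static List<String> loadAll(String text, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> out = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            String value;
            while ((value = nextMatch(reader, pattern)) != null) {
                out.add(value);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        // Same three input lines as in the report.
        System.out.println(loadAll("test1\ntestA\ntest2\n", "(test\\d)"));
    }
}
```

With the three input lines from the report, the sketch returns the values from the first and third lines and silently drops testA, which is the pre-0.7.0 behavior the reporter describes.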
[jira] Updated: (PIG-1449) RegExLoader hangs on lines that don't match the regular expression
[ https://issues.apache.org/jira/browse/PIG-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashutosh Chauhan updated PIG-1449:
----------------------------------

Status: Resolved (was: Patch Available)
Fix Version/s: 0.8.0
Resolution: Fixed
[jira] Created: (PIG-1480) An object oriented Java API for Pig statements
An object oriented Java API for Pig statements
----------------------------------------------

Key: PIG-1480
URL: https://issues.apache.org/jira/browse/PIG-1480
Project: Pig
Issue Type: New Feature
Reporter: Julien Le Dem

A Java API for Pig statements would make it much easier for third-party libraries to generate Pig scripts than it currently is.
[jira] Commented: (PIG-1478) Add progress notification listener to PigRunner API
[ https://issues.apache.org/jira/browse/PIG-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884554#action_12884554 ]

Hadoop QA commented on PIG-1478:
--------------------------------

-1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12448532/PIG-1478.patch against trunk revision 958666.

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 3 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed core unit tests.
-1 contrib tests. The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/336/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/336/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/336/console

This message is automatically generated.

Add progress notification listener to PigRunner API
---------------------------------------------------

Key: PIG-1478
URL: https://issues.apache.org/jira/browse/PIG-1478
Project: Pig
Issue Type: Improvement
Reporter: Richard Ding
Assignee: Richard Ding
Fix For: 0.8.0
Attachments: PIG-1478.patch

PIG-1333 added the PigRunner API to allow Pig users and tools to get a status/stats object back after executing a Pig script. The new API, however, is synchronous (blocking). A Pig script can spawn tens (even hundreds) of MR jobs and take hours to complete, so it would be useful to give callers progress feedback during execution.
The proposal is to add an optional parameter to the API:

{code}
public abstract class PigRunner {
    public static PigStats run(String[] args, PigProgressNotificationListener listener) {...}
}
{code}

The new listener is defined as follows:

{code}
package org.apache.pig.tools.pigstats;

public interface PigProgressNotificationListener extends java.util.EventListener {
    // just before the launch of MR jobs for the script
    public void launchStartedNotification(int numJobsToLaunch);
    // number of jobs submitted in a batch
    public void jobsSubmittedNotification(int numJobsSubmitted);
    // a job has started
    public void jobStartedNotification(String assignedJobId);
    // a job has completed successfully
    public void jobFinishedNotification(JobStats jobStats);
    // a job has failed
    public void jobFailedNotification(JobStats jobStats);
    // a user output has completed successfully
    public void outputCompletedNotification(OutputStats outputStats);
    // updates the progress as a percentage
    public void progressUpdatedNotification(int progress);
    // the script execution is done
    public void launchCompletedNotification(int numJobsSucceeded);
}
{code}

Any thoughts?
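To make the proposal concrete, here is a rough sketch of what a caller-side listener might look like. It is only an illustration under several assumptions: the interface is copied locally (with lower camel case method names throughout) so the example compiles without Pig on the classpath, JobStats and OutputStats are reduced to one-field placeholders, and RecordingListener, ListenerSketch and the callback order in simulate() are hypothetical rather than taken from the patch.

```java
import java.util.ArrayList;
import java.util.EventListener;
import java.util.List;

// Placeholder stand-ins so the sketch compiles on its own. The real
// JobStats and OutputStats types belong to Pig and are not reproduced here.
class JobStats {
    final String jobId;
    JobStats(String jobId) { this.jobId = jobId; }
}

class OutputStats {
    final String location;
    OutputStats(String location) { this.location = location; }
}

// Local copy of the proposed interface, using lower camel case names.
interface PigProgressNotificationListener extends EventListener {
    void launchStartedNotification(int numJobsToLaunch);
    void jobsSubmittedNotification(int numJobsSubmitted);
    void jobStartedNotification(String assignedJobId);
    void jobFinishedNotification(JobStats jobStats);
    void jobFailedNotification(JobStats jobStats);
    void outputCompletedNotification(OutputStats outputStats);
    void progressUpdatedNotification(int progress);
    void launchCompletedNotification(int numJobsSucceeded);
}

// Hypothetical listener that records each callback as a line of text.
class RecordingListener implements PigProgressNotificationListener {
    final List<String> events = new ArrayList<>();

    public void launchStartedNotification(int numJobsToLaunch) {
        events.add("launch started: " + numJobsToLaunch + " job(s)");
    }
    public void jobsSubmittedNotification(int numJobsSubmitted) {
        events.add("submitted: " + numJobsSubmitted);
    }
    public void jobStartedNotification(String assignedJobId) {
        events.add("started: " + assignedJobId);
    }
    public void jobFinishedNotification(JobStats jobStats) {
        events.add("finished: " + jobStats.jobId);
    }
    public void jobFailedNotification(JobStats jobStats) {
        events.add("failed: " + jobStats.jobId);
    }
    public void outputCompletedNotification(OutputStats outputStats) {
        events.add("output: " + outputStats.location);
    }
    public void progressUpdatedNotification(int progress) {
        events.add("progress: " + progress + "%");
    }
    public void launchCompletedNotification(int numJobsSucceeded) {
        events.add("launch completed: " + numJobsSucceeded + " succeeded");
    }
}

class ListenerSketch {
    // Simulates one plausible callback order for a single successful job.
    static List<String> simulate() {
        RecordingListener listener = new RecordingListener();
        listener.launchStartedNotification(1);
        listener.jobsSubmittedNotification(1);
        listener.jobStartedNotification("job_0001");
        listener.progressUpdatedNotification(50);
        listener.jobFinishedNotification(new JobStats("job_0001"));
        listener.outputCompletedNotification(new OutputStats("/tmp/out"));
        listener.progressUpdatedNotification(100);
        listener.launchCompletedNotification(1);
        return listener.events;
    }

    public static void main(String[] args) {
        for (String event : simulate()) {
            System.out.println(event);
        }
    }
}
```

In a real caller, an instance of such a listener would presumably be passed as the second argument of the proposed PigRunner.run(String[], PigProgressNotificationListener) overload.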
[jira] Commented: (PIG-1430) ISODateTime - DateTime: DateTime UDFs Should Also Support int/second Unix Times in All Operations
[ https://issues.apache.org/jira/browse/PIG-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884557#action_12884557 ]

Russell Jurney commented on PIG-1430:
-------------------------------------

I've been thinking about the feedback at the contributors meeting Monday. I propose that we postpone the addition of a full datetime type (PIG-1314) in favor of the builtins described below. This change is easy; I can do it immediately and get it into 0.8. The original proposal is quite hard, and I can't really estimate when I could have it completed. I'm not sure we need it, and there are many other more important things I would rather do.

I'd like to remove the piggybank classes org.apache.pig.piggybank.evaluation.datetime.*, or at least deprecate them.

I'd like to add the following builtins, which act on both ISO8601 datetime strings and long Unix times. These could be made into many functions each, but I'd prefer to keep the list as short as possible. I suggest we mirror the Oracle date/time functions when possible: http://psoug.org/reference/date_func.html

* Units
  When listed below, units are defined as one of: YEAR MONTH WEEK DAY HOUR MINUTE SECOND

* Truncations
  TRUNC(date, unit) or TRUNC_DATE(date, unit)
  long/epoch input returns long/epoch output. ISO8601 string input returns ISO8601 datetime output.

* Dates to durations
  DURATION(date, unit)
  long/epoch input returns long output in the unit specified. ISO8601 input returns an ISO8601 duration.

* Adding/subtracting durations and dates: use longs.

* Utilities
  CURRENT_ISOTIME
  CURRENT_UNIXTIME
  ISOTOUNIX
  UNIXTOISO

The only ugly part of this is that ISO times are second-class citizens in that they cannot be added/subtracted.
I'm prepared to live with that :)

ISODateTime - DateTime: DateTime UDFs Should Also Support int/second Unix Times in All Operations
--------------------------------------------------------------------------------------------------

Key: PIG-1430
URL: https://issues.apache.org/jira/browse/PIG-1430
Project: Pig
Issue Type: Improvement
Components: impl
Affects Versions: 0.7.0
Reporter: Russell Jurney
Fix For: 0.8.0

All functions in contrib.piggybank.java.src.main.java.org.apache.pig.piggybank.evaluation.datetime should seamlessly accept integer Unix/POSIX times, and return Unix time output when given an int, and ISO output when given a chararray.

Note: Unix/POSIX times are the number of seconds elapsed since midnight proleptic Coordinated Universal Time (UTC) of January 1, 1970, not counting leap seconds. See http://en.wikipedia.org/wiki/Unix_time
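As a rough illustration of the TRUNC behavior proposed in the comment (long in, long out; ISO8601 string in, ISO8601 string out), here is a small sketch built on java.time. It is not a Pig builtin or UDF. The class and method names are hypothetical, epoch values are assumed to be seconds, everything is evaluated in UTC, and WEEK is assumed to start on Monday; none of these choices are specified in the thread.

```java
import java.time.DayOfWeek;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.temporal.ChronoUnit;
import java.time.temporal.TemporalAdjusters;

// Hypothetical sketch of TRUNC(date, unit); not part of Pig.
class DateTruncSketch {

    enum Unit { YEAR, MONTH, WEEK, DAY, HOUR, MINUTE, SECOND }

    // Shared truncation logic on a UTC date-time.
    static ZonedDateTime trunc(ZonedDateTime t, Unit unit) {
        switch (unit) {
            case YEAR:
                return t.truncatedTo(ChronoUnit.DAYS).withDayOfYear(1);
            case MONTH:
                return t.truncatedTo(ChronoUnit.DAYS).withDayOfMonth(1);
            case WEEK:
                // Assumption: weeks start on Monday.
                return t.truncatedTo(ChronoUnit.DAYS)
                        .with(TemporalAdjusters.previousOrSame(DayOfWeek.MONDAY));
            case DAY:
                return t.truncatedTo(ChronoUnit.DAYS);
            case HOUR:
                return t.truncatedTo(ChronoUnit.HOURS);
            case MINUTE:
                return t.truncatedTo(ChronoUnit.MINUTES);
            default:
                return t.truncatedTo(ChronoUnit.SECONDS);
        }
    }

    // long/epoch input returns long/epoch output (seconds since 1970-01-01 UTC).
    static long truncEpoch(long epochSeconds, Unit unit) {
        ZonedDateTime t = Instant.ofEpochSecond(epochSeconds).atZone(ZoneOffset.UTC);
        return trunc(t, unit).toEpochSecond();
    }

    // ISO8601 string input returns ISO8601 string output, normalized to UTC.
    static String truncIso(String iso, Unit unit) {
        ZonedDateTime t = ZonedDateTime.parse(iso).withZoneSameInstant(ZoneOffset.UTC);
        return trunc(t, unit).toInstant().toString();
    }

    public static void main(String[] args) {
        System.out.println(truncEpoch(1278078330L, Unit.HOUR));
        System.out.println(truncIso("2010-07-02T13:45:30Z", Unit.MONTH));
    }
}
```

Keeping one internal representation and converting at the edges is only one possible design; it does at least make the long and ISO8601 variants agree with each other by construction.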
[jira] Commented: (PIG-1449) RegExLoader hangs on lines that don't match the regular expression
[ https://issues.apache.org/jira/browse/PIG-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884552#action_12884552 ]

Ashutosh Chauhan commented on PIG-1449:
---------------------------------------

@Christian, It would definitely be useful to get the execution time for running the tests down. It takes a while currently to run all Pig tests.
[jira] Commented: (PIG-1314) Add DateTime Support to Pig
[ https://issues.apache.org/jira/browse/PIG-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884558#action_12884558 ]

Russell Jurney commented on PIG-1314:
-------------------------------------

Been thinking about this... I don't think we should add a full datetime type at this time. See comments in PIG-1314 on alternative approach using builtins.

Add DateTime Support to Pig
---------------------------

Key: PIG-1314
URL: https://issues.apache.org/jira/browse/PIG-1314
Project: Pig
Issue Type: Bug
Components: data
Affects Versions: 0.7.0
Reporter: Russell Jurney
Fix For: 0.8.0
Original Estimate: 672h
Remaining Estimate: 672h

Hadoop/Pig are primarily used to parse log data, and most logs have a timestamp component. Therefore Pig should support dates as a primitive.

Can someone familiar with adding types to Pig comment on how hard this is? We're looking at doing this, rather than use UDFs. Is this a patch that would be accepted?
[jira] Updated: (PIG-1480) An object oriented Java API for Pig statements
[ https://issues.apache.org/jira/browse/PIG-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Le Dem updated PIG-1480:
-------------------------------

Attachment: java-api.zip

Here is an initial prototype of what the API looks like. To compile and run it, add pig-0.6.0-core.jar to the classpath. See the usage example in org.apache.pig.api.example.TransitiveClosure. Generated Pig scripts are dumped to stdout while they are executed.

The prototype is not complete and runs only in local mode; it is meant only as an illustration. The real implementation would create Pig operator objects instead of Pig Latin statements.
[jira] Commented: (PIG-1314) Add DateTime Support to Pig
[ https://issues.apache.org/jira/browse/PIG-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884562#action_12884562 ]

Russell Jurney commented on PIG-1314:
-------------------------------------

I suck at JIRA. See proposal in PIG-1430.
[jira] Commented: (PIG-794) Use Avro serialization in Pig
[ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884577#action_12884577 ]

Jeff Zhang commented on PIG-794:
--------------------------------

We can leverage AvroInputFormat and AvroOutputFormat in Avro trunk (see AVRO-493).

Use Avro serialization in Pig
-----------------------------

Key: PIG-794
URL: https://issues.apache.org/jira/browse/PIG-794
Project: Pig
Issue Type: Improvement
Components: impl
Affects Versions: 0.2.0
Reporter: Rakesh Setty
Assignee: Dmitriy V. Ryaboy
Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, jackson-asl-0.9.4.jar, PIG-794.patch

We would like to use Avro serialization in Pig to pass data between MR jobs instead of the current BinStorage. Attached is an implementation of AvroBinStorage which performs significantly better compared to BinStorage on our benchmarks.
[jira] Commented: (PIG-1478) Add progress notification listener to PigRunner API
[ https://issues.apache.org/jira/browse/PIG-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884677#action_12884677 ]

Hadoop QA commented on PIG-1478:
--------------------------------

-1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12448532/PIG-1478.patch against trunk revision 959865.

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 3 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed core unit tests.
-1 contrib tests. The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/358/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/358/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/358/console

This message is automatically generated.
[jira] Commented: (PIG-1478) Add progress notification listener to PigRunner API
[ https://issues.apache.org/jira/browse/PIG-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884732#action_12884732 ]

Hadoop QA commented on PIG-1478:
--------------------------------

-1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12448532/PIG-1478.patch against trunk revision 959865.

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 3 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed core unit tests.
-1 contrib tests. The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/337/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/337/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/337/console

This message is automatically generated.
[jira] Closed: (PIG-1449) RegExLoader hangs on lines that don't match the regular expression
[ https://issues.apache.org/jira/browse/PIG-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy V. Ryaboy closed PIG-1449.
----------------------------------
[jira] Updated: (PIG-1469) DefaultDataBag assumes ArrayList as default List type
[ https://issues.apache.org/jira/browse/PIG-1469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy V. Ryaboy updated PIG-1469:
-----------------------------------

Status: Resolved (was: Patch Available)
Resolution: Fixed

I committed this.

DefaultDataBag assumes ArrayList as default List type
-----------------------------------------------------

Key: PIG-1469
URL: https://issues.apache.org/jira/browse/PIG-1469
Project: Pig
Issue Type: Bug
Components: data
Affects Versions: 0.8.0
Reporter: Gianmarco De Francisci Morales
Assignee: Gianmarco De Francisci Morales
Fix For: 0.8.0
Attachments: PIG-1469.patch

In org.apache.pig.data.DefaultDataBag, the field mContents is assumed to be of type ArrayList but the user can actually pass a different List to the constructor.
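The class of bug described here can be shown with a small stand-alone sketch. It is not the DefaultDataBag code and does not reproduce the committed patch: BagSketch and its methods are hypothetical, and ensureCapacity is used only because it is a convenient ArrayList-only call. The sketch contrasts an unconditional downcast with a version that checks the runtime type first.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

// Hypothetical container that accepts any List from the caller.
class BagSketch<T> {
    private final List<T> contents;

    BagSketch(List<T> contents) {
        this.contents = contents;
    }

    // Fragile: assumes the caller always passed an ArrayList and throws
    // ClassCastException for any other List implementation.
    void reserveAssumingArrayList(int extra) {
        ((ArrayList<T>) contents).ensureCapacity(contents.size() + extra);
    }

    // Safer: only uses the ArrayList-specific call when it applies.
    void reserve(int extra) {
        if (contents instanceof ArrayList) {
            ((ArrayList<T>) contents).ensureCapacity(contents.size() + extra);
        }
    }

    // Reports whether the fragile variant fails for the given list.
    static boolean fragileFailsOn(List<String> list) {
        try {
            new BagSketch<String>(list).reserveAssumingArrayList(4);
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(fragileFailsOn(new ArrayList<String>()));
        System.out.println(fragileFailsOn(new LinkedList<String>()));
        new BagSketch<String>(new LinkedList<String>()).reserve(4); // no exception
    }
}
```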
[jira] Commented: (PIG-1434) Allow casting relations to scalars
[ https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884759#action_12884759 ]

Richard Ding commented on PIG-1434:
-----------------------------------

I agree that we should use the right syntax. What I meant was that it can be implemented as a 'replicated' cross, which seems to solve the problems of implicit dependency and of using the distributed cache.

Allow casting relations to scalars
----------------------------------

Key: PIG-1434
URL: https://issues.apache.org/jira/browse/PIG-1434
Project: Pig
Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Aniket Mokashi
Fix For: 0.8.0
Attachments: scalarImpl.patch

This jira is to implement a simplified version of the functionality described in https://issues.apache.org/jira/browse/PIG-801.

The proposal is to allow casting relations to scalar types in foreach.

Example:

A = load 'data' as (x, y, z);
B = group A all;
C = foreach B generate COUNT(A);
...
X = ...
Y = foreach X generate $1/(long) C;

A couple of additional comments:

(1) You can only cast relations that contain a single value; otherwise an error will be reported.
(2) Name resolution is needed, since relation X might have a field named C, in which case that field takes precedence.
(3) Y will look for the C closest to it.

Implementation thoughts:

The idea is to store C into a file and then convert it into a scalar via a UDF. I believe we already have a UDF that Ben Reed contributed for this purpose. Most of the work would be to update the logical plan to

(1) Store C
(2) Convert the cast to the UDF
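The semantics sketched in the issue description (a relation is usable as a scalar only when it holds exactly one value, and that value is then applied to every row of another relation) can be mimicked in a few lines of plain Java. This is an illustration of the intended behavior only: the class and method names are hypothetical, relations are modeled as in-memory lists, and nothing here reflects how scalarImpl.patch actually implements the feature.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration; names are not taken from the Pig code base.
class ScalarCastSketch {

    // Mirrors comment (1): a relation can be used as a scalar only if it
    // holds exactly one value; anything else is reported as an error.
    static long toScalar(List<Long> relation) {
        if (relation.size() != 1) {
            throw new IllegalStateException(
                "expected exactly one value, found " + relation.size());
        }
        return relation.get(0);
    }

    // Plays the role of: Y = foreach X generate $1/(long) C;
    static List<Long> divideEach(List<Long> column, List<Long> c) {
        long scalar = toScalar(c);
        List<Long> out = new ArrayList<>();
        for (long value : column) {
            out.add(value / scalar);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Long> x = Arrays.asList(10L, 20L, 35L);
        List<Long> c = Collections.singletonList(5L);
        System.out.println(divideEach(x, c));
    }
}
```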
[jira] Commented: (PIG-928) UDFs in scripting languages
[ https://issues.apache.org/jira/browse/PIG-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884763#action_12884763 ]

Dmitriy V. Ryaboy commented on PIG-928:
---------------------------------------

Aniket, the patch does not apply cleanly to trunk, can you rebase it?

UDFs in scripting languages
---------------------------

Key: PIG-928
URL: https://issues.apache.org/jira/browse/PIG-928
Project: Pig
Issue Type: New Feature
Reporter: Alan Gates
Assignee: Aniket Mokashi
Fix For: 0.8.0
Attachments: calltrace.png, package.zip, pig-greek.tgz, pig.scripting.patch.arnab, pyg.tgz, RegisterPythonUDF2.patch, RegisterPythonUDF3.patch, RegisterPythonUDF4.patch, RegisterScriptUDFDefineParse.patch, scripting.tgz, scripting.tgz, test.zip

It should be possible to write UDFs in scripting languages such as python, ruby, etc. This frees users from needing to compile Java, generate a jar, etc. It also opens Pig to programmers who prefer scripting languages over Java.
[jira] Updated: (PIG-928) UDFs in scripting languages
[ https://issues.apache.org/jira/browse/PIG-928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy V. Ryaboy updated PIG-928:
----------------------------------

Attachment: PIG-928.patch

I rebased the patch and made it pull jython down via maven. 2.5.1 doesn't appear to be available right now, so this pulls down 2.5.0. Hope that's ok.

Looks like the tabulation is wrong in most of this patch.. someone please hit ctrl-a, ctrl-i next time :).

Needless to say, this thing needs tests, desperately.

Also imho in order for it to make it into trunk, it should be a compile-time option to support (and pull down) jython or jruby or whatnot, not a default option. Otherwise we are well on our way to making people pull down the internet in order to compile pig.
[jira] Updated: (PIG-1309) Map-side Cogroup
[ https://issues.apache.org/jira/browse/PIG-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashutosh Chauhan updated PIG-1309:
----------------------------------

Attachment: PIG_1309_7.patch

Backport of merge cogroup for the 0.7 branch. Since Hudson can only test against trunk, I ran all the tests manually; all passed.

Map-side Cogroup
----------------

Key: PIG-1309
URL: https://issues.apache.org/jira/browse/PIG-1309
Project: Pig
Issue Type: Bug
Components: impl
Reporter: Ashutosh Chauhan
Assignee: Ashutosh Chauhan
Fix For: 0.8.0
Attachments: mapsideCogrp.patch, pig-1309_1.patch, pig-1309_2.patch, PIG_1309_7.patch

In the never-ending quest to make Pig go faster, we want to parallelize as many relational operations as possible. It's already possible to do Group-by (PIG-984) and Joins (PIG-845, PIG-554) purely map-side in Pig. This jira is to add a map-side implementation of Cogroup in Pig. Details to follow.
[jira] Commented: (PIG-1434) Allow casting relations to scalars
[ https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884791#action_12884791 ]

Thejas M Nair commented on PIG-1434:
------------------------------------

I think the replicated cross is a good alternative to this feature, though this feature is probably more friendly for a beginner pig user. But if this feature makes the pig code very complicated/hacky (the dependency order and stuff), I think it might not be a bad idea to encourage the use of replicated-join instead.

As a side note, we can actually get 'replicated cross' working using replicated join, e.g.:

{code}
j = join l1 by 1, l2 by 1 using 'replicated';
{code}
[jira] Commented: (PIG-1434) Allow casting relations to scalars
[ https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884794#action_12884794 ]

Richard Ding commented on PIG-1434:

So all one needs to do is internally replace the line:
{code}
Y = foreach X generate $1/(long) C.count, $2-(long) C.max;
{code}
with
{code}
Z = join X by 1, C by 1 using 'replicated';
Y = foreach Z generate X::$1/(long) C.count, X::$2-(long) C.max;
{code}
[jira] Commented: (PIG-1321) Logical Optimizer: Merge cascading foreach
[ https://issues.apache.org/jira/browse/PIG-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884811#action_12884811 ]

Xuefu Zhang commented on PIG-1321:

Add one more pre-condition:
3. The first foreach statement cannot contain flatten, due to its complexity.

Logical Optimizer: Merge cascading foreach
    Key: PIG-1321
    URL: https://issues.apache.org/jira/browse/PIG-1321
    Project: Pig
    Issue Type: Sub-task
    Components: impl
    Affects Versions: 0.7.0
    Reporter: Daniel Dai
    Assignee: Xuefu Zhang

We can merge consecutive foreach statements. E.g.:
b = foreach a generate a0#'key1' as b0, a0#'key2' as b1, a1;
c = foreach b generate b0#'kk1', b0#'kk2', b1, a1;
=>
c = foreach a generate a0#'key1'#'kk1', a0#'key1'#'kk2', a0#'key2', a1;
[jira] Updated: (PIG-1404) PigUnit - Pig script testing simplified.
[ https://issues.apache.org/jira/browse/PIG-1404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Romain Rigaux updated PIG-1404:
    Status: Open (was: Patch Available)

PigUnit - Pig script testing simplified.
    Key: PIG-1404
    URL: https://issues.apache.org/jira/browse/PIG-1404
    Project: Pig
    Issue Type: New Feature
    Reporter: Romain Rigaux
    Assignee: Romain Rigaux
    Fix For: 0.8.0
    Attachments: commons-lang-2.4.jar, PIG-1404-2.patch, PIG-1404-3-doc.patch, PIG-1404-3.patch, PIG-1404.patch

The goal is to provide a simple xUnit framework that enables our Pig scripts to be easily:
- unit tested
- regression tested
- quickly prototyped

No cluster setup is required.

For example:

TestCase
{code}
@Test
public void testTop3Queries() {
  // Script parameter: keep the top 3 queries.
  String[] args = {
    "n=3",
  };
  test = new PigTest("top_queries.pig", args);

  // Tab-separated (query, count) rows that replace the 'data' alias.
  String[] input = {
    "yahoo\t10",
    "twitter\t7",
    "facebook\t10",
    "yahoo\t15",
    "facebook\t5",
  };

  // Expected content of the 'queries_limit' alias.
  String[] output = {
    "(yahoo,25L)",
    "(facebook,15L)",
    "(twitter,7L)",
  };

  test.assertOutput("data", input, "queries_limit", output);
}
{code}

top_queries.pig
{code}
data = LOAD '$input' AS (query:CHARARRAY, count:INT);
...
queries_sum = FOREACH queries_group GENERATE group AS query, SUM(queries.count) AS count;
...
queries_limit = LIMIT queries_ordered $n;
STORE queries_limit INTO '$output';
{code}

There are 3 modes:
* LOCAL (if the pigunit.exectype.local property is present)
* MAPREDUCE (uses the cluster specified in the classpath, same as HADOOP_CONF_DIR)
** automatic mini cluster (the default; the HADOOP_CONF_DIR to have in the classpath will be ~/pigtest/conf)
** pointing to an existing cluster (if the pigunit.exectype.cluster property is present)

For now, it would be nice to see how this idea could be integrated into Piggybank, and whether PigParser/PigServer could improve their interfaces in order to make PigUnit simple.

Other components based on PigUnit could be built later:
- standalone MiniCluster
- notion of workspaces for each test
- standalone utility that reads test configuration and generates a test report
...

This is a first prototype, open to suggestions, and it would definitely benefit from feedback.

How to test, in pig_trunk:
{code}
Apply patch
$pig_trunk ant compile-test
$pig_trunk ant
$pig_trunk/contrib/piggybank/java ant test -Dtest.timeout=99
{code}
(It takes 15 min in the MAPREDUCE minicluster; tests will need to be split in the future between 'unit' and 'integration'.)

Many examples are in:
{code}
contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/pigunit/TestPigTest.java
{code}

When used standalone, do not forget to put commons-lang-2.4.jar and the HADOOP_CONF_DIR of your cluster on your CLASSPATH.
[jira] Updated: (PIG-1404) PigUnit - Pig script testing simplified.
[ https://issues.apache.org/jira/browse/PIG-1404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Romain Rigaux updated PIG-1404:
    Status: Patch Available (was: Open)
[jira] Created: (PIG-1481) PigServer throws exception if it cannot find hadoop-site.xml or core-site.xml
PigServer throws exception if it cannot find hadoop-site.xml or core-site.xml
    Key: PIG-1481
    URL: https://issues.apache.org/jira/browse/PIG-1481
    Project: Pig
    Issue Type: Bug
    Affects Versions: 0.8.0
    Reporter: Sameer M

Hi,

We've been using the Hadoop MiniCluster to unit test our Pig scripts in the following way:

    MiniCluster minicluster = MiniCluster.buildCluster(2,2);
    pigServer = new PigServer(ExecType.MAPREDUCE, minicluster.getProperties());

This has been working fine for 0.6 and 0.7. However, in trunk (0.8) there appears to be a change that causes an exception to be thrown if neither hadoop-site.xml nor core-site.xml is found in the classpath:

    org.apache.pig.backend.executionengine.ExecException: ERROR 4010: Cannot find hadoop configurations in classpath (neither hadoop-site.xml nor core-site.xml was found in the classpath).If you plan to use local mode, please put -x local option in command line
        at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:149)
        at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:114)
        at org.apache.pig.impl.PigContext.connect(PigContext.java:177)
        at org.apache.pig.PigServer.init(PigServer.java:215)
        at org.apache.pig.PigServer.init(PigServer.java:204)
        at org.apache.pig.PigServer.init(PigServer.java:200)

The problem seems to be in org.apache.pig.backend.hadoop.executionengine.HExecutionEngine, line 148:

    if( hadoop_site == null && core_site == null ) {
        throw new ExecException("Cannot find hadoop configurations in classpath (neither hadoop-site.xml nor core-site.xml was found in the classpath)." +
            "If you plan to use local mode, please put -x local option in command line", 4010);
    }

We would like to use mapreduce mode with the minicluster, and we have a lot of unit tests with that setup. Can this check be removed from this level?

Thanks,
Sameer
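To make the condition in the report concrete, here is a minimal sketch of a "neither file is on the classpath" test. It assumes the check amounts to a class-loader resource lookup, which the report does not confirm; the class and method names are hypothetical and this is not the HExecutionEngine code.

```java
import java.net.URL;

/** Hypothetical helper, not part of Pig: reproduces the shape of the reported check. */
class HadoopConfCheck {

    /** Returns true only when neither resource is visible on the classpath. */
    static boolean bothMissing(String first, String second) {
        ClassLoader cl = Thread.currentThread().getContextClassLoader();
        if (cl == null) {
            cl = HadoopConfCheck.class.getClassLoader();
        }
        URL a = cl.getResource(first);   // null when the resource cannot be found
        URL b = cl.getResource(second);
        return a == null && b == null;
    }

    public static void main(String[] args) {
        // A test harness could run this before constructing PigServer in mapreduce mode.
        System.out.println(bothMissing("hadoop-site.xml", "core-site.xml"));
    }
}
```

A test setup that passes its configuration through Properties, as the MiniCluster example does, would hit a check of this shape whenever no site file is on the classpath, even though the properties themselves are complete.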
[jira] Commented: (PIG-1389) Implement Pig counter to track number of rows for each input files
[ https://issues.apache.org/jira/browse/PIG-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884834#action_12884834 ]

Richard Ding commented on PIG-1389:

We use Hadoop Path as a parser for the input/output locations in two use cases:
* determining the scheme of the location (e.g., hdfs, hbase, file, har, ...), and
* getting the short file name of the location.

Ashutosh is right that this approach doesn't work with a location string such as the one in PIG-1229:
{code}
jdbc:hsqldb:file:/tmp/batchtest;hsqldb.default_table_type=cached;hsqldb.cache_rows=100
{code}
A RuntimeException is thrown when trying to parse it. The proposal is to use Java URI as the parser for these use cases instead (Java URI throws a checked exception for invalid syntax).

Implement Pig counter to track number of rows for each input files
    Key: PIG-1389
    URL: https://issues.apache.org/jira/browse/PIG-1389
    Project: Pig
    Issue Type: Improvement
    Affects Versions: 0.7.0
    Reporter: Richard Ding
    Assignee: Richard Ding
    Fix For: 0.8.0
    Attachments: PIG-1389.patch, PIG-1389.patch, PIG-1389_1.patch

An MR job generated by Pig can have not only multiple outputs (in the case of multiquery) but also multiple inputs (in the case of join or cogroup). In both cases, the existing Hadoop counters (e.g. MAP_INPUT_RECORDS, REDUCE_OUTPUT_RECORDS) cannot be used to count the number of records in a given input or output. PIG-1299 addressed the case of multiple outputs. We need to add new counters for jobs with multiple inputs.
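The proposal in the comment above, using Java URI for the two use cases, might look roughly like the sketch below. The class name, the method names, and the fallback behaviour (returning null or the whole location string) are assumptions made for illustration, not what the patch does. Only java.net.URI from the standard library is used.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Hypothetical helper illustrating the two use cases; not code from the patch. */
class LocationParser {

    /** Scheme of the location (e.g. "hdfs", "jdbc"), or null if absent or unparsable. */
    static String schemeOf(String location) {
        try {
            return new URI(location).getScheme();
        } catch (URISyntaxException e) {
            // Checked exception: the caller is forced to decide what to do.
            return null;
        }
    }

    /** Last path segment, or the whole location when no hierarchical path exists. */
    static String shortNameOf(String location) {
        try {
            // getPath() is null for opaque URIs such as jdbc:hsqldb:...
            String path = new URI(location).getPath();
            if (path == null || path.isEmpty()) {
                return location;
            }
            int slash = path.lastIndexOf('/');
            boolean hasSegment = slash >= 0 && slash < path.length() - 1;
            return hasSegment ? path.substring(slash + 1) : path;
        } catch (URISyntaxException e) {
            return location;
        }
    }

    public static void main(String[] args) {
        String jdbc = "jdbc:hsqldb:file:/tmp/batchtest;hsqldb.default_table_type=cached;hsqldb.cache_rows=100";
        System.out.println(schemeOf(jdbc));
        System.out.println(shortNameOf("hdfs://nn:8020/user/foo/part-00000"));
    }
}
```

The JDBC string from PIG-1229 parses as an opaque URI, so the scheme is available while the path is not; the sketch falls back to the full string in that case. The host name and port in the hdfs example are made up.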
[jira] Commented: (PIG-928) UDFs in scripting languages
[ https://issues.apache.org/jira/browse/PIG-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884841#action_12884841 ]

Aniket Mokashi commented on PIG-928:

The fix needed some changes in queryparser to support namespaces; I found this through the test cases I added. The current EvalFuncSpec logic is convoluted, so I replaced it with a cleaner one. I have attached the updated patch with the changes mentioned above.
I am not sure what needs to be done for jython.jar; my guess was to check it in under /lib. Thoughts?

UDFs in scripting languages
    Key: PIG-928
    URL: https://issues.apache.org/jira/browse/PIG-928
    Project: Pig
    Issue Type: New Feature
    Reporter: Alan Gates
    Assignee: Aniket Mokashi
    Fix For: 0.8.0
    Attachments: calltrace.png, package.zip, PIG-928.patch, pig-greek.tgz, pig.scripting.patch.arnab, pyg.tgz, RegisterPythonUDF2.patch, RegisterPythonUDF3.patch, RegisterPythonUDF4.patch, RegisterScriptUDFDefineParse.patch, scripting.tgz, scripting.tgz, test.zip

It should be possible to write UDFs in scripting languages such as python, ruby, etc. This frees users from needing to compile Java, generate a jar, etc. It also opens Pig to programmers who prefer scripting languages over Java.
[jira] Commented: (PIG-928) UDFs in scripting languages
[ https://issues.apache.org/jira/browse/PIG-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884845#action_12884845 ]

Dmitriy V. Ryaboy commented on PIG-928:

Aniket, I already made the changes you need to pull down jython -- take a look at the patch I attached.
One more general note -- let's say jython instead of python (in the grammar, the keywords, everywhere), as there may be slight incompatibilities between the two and we want to be clear on what we are using.
[jira] Updated: (PIG-928) UDFs in scripting languages
[ https://issues.apache.org/jira/browse/PIG-928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aniket Mokashi updated PIG-928:
    Attachment: RegisterPythonUDFFinale.patch

Changes needed for script UDF.
TODO - jython.jar related changes
[jira] Commented: (PIG-928) UDFs in scripting languages
[ https://issues.apache.org/jira/browse/PIG-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884863#action_12884863 ]

Julien Le Dem commented on PIG-928:

Aniket, this is assuming the ScriptEngine requires only one jar. I would suggest instead having a method ScriptEngine.init(PigContext) that would be called after the ScriptEngine instance has been retrieved from the factory. That would let the script engine add whatever is needed to the job.
{code}
if(scriptingLang != null) {
    ScriptEngine se = ScriptEngine.getInstance(scriptingLang);
    //pigContext.scriptJars.add(se.getStandardScriptJarPath());
    se.init(pigContext);
    se.registerFunctions(path, namespace, pigContext);
}
{code}
Have a good weekend,
Julien
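The suggestion above is that the engine, rather than the caller, decides what to add to the job. A self-contained sketch of that shape follows. All types here (JobContext, Engine, JythonLikeEngine) are hypothetical stand-ins invented for illustration; they are not Pig's PigContext or ScriptEngine, and the jar names are made up.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a job context: only tracks jars to ship with the job. */
class JobContext {
    final List<String> scriptJars = new ArrayList<>();
}

/** Hypothetical stand-in for a script engine obtained from a factory method. */
abstract class Engine {

    /** Each engine registers whatever it needs; it is not limited to a single jar. */
    abstract void init(JobContext ctx);

    /** Factory lookup by language name. */
    static Engine getInstance(String lang) {
        if ("jython".equals(lang)) {
            return new JythonLikeEngine();
        }
        throw new IllegalArgumentException("unsupported scripting language: " + lang);
    }
}

/** Example engine that happens to need two jars (names are illustrative only). */
class JythonLikeEngine extends Engine {
    @Override
    void init(JobContext ctx) {
        ctx.scriptJars.add("jython.jar");
        ctx.scriptJars.add("extra-runtime.jar");
    }
}

class EngineDemo {
    public static void main(String[] args) {
        JobContext ctx = new JobContext();
        Engine engine = Engine.getInstance("jython");
        engine.init(ctx); // the caller no longer needs to know which jars are required
        System.out.println(ctx.scriptJars);
    }
}
```

With this shape the calling code stays the same when an engine needs several jars or none, which is the limitation of the single getStandardScriptJarPath() call that the comment points out.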