[jira] Commented: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader
[ https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12725602#action_12725602 ]

Hudson commented on PIG-820:

Integrated in Pig-trunk #490 (See [http://hudson.zones.apache.org/hudson/job/Pig-trunk/490/]) : Change RandomSampleLoader to take a LoadFunc instead of extending BinStorage. Added new Samplable interface for loaders to implement allowing them to be used by RandomSampleLoader.

PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

Key: PIG-820
URL: https://issues.apache.org/jira/browse/PIG-820
Project: Pig
Issue Type: Improvement
Components: impl
Affects Versions: 0.3.0, 0.4.0
Reporter: Alan Gates
Assignee: Ashutosh Chauhan
Fix For: 0.4.0
Attachments: pig-820.patch, pig-820_v2.patch, pig-820_v3.patch, pig-820_v4.patch, pig-820_v5.patch, pig-820_v6.patch

Currently a sampling job requires that data already be stored in BinaryStorage format, since RandomSampleLoader extends BinaryStorage. For order by this has mostly been acceptable, because users tend to use order by at the end of their script where other MR jobs have already operated on the data and thus it is already being stored in BinaryStorage. For pig scripts that just did an order by, an entire MR job is required to read the data and write it out in BinaryStorage format.

As we begin work on join algorithms that will require sampling, this requirement to read the entire input and write it back out will not be acceptable. Join is often the first operation of a script, and thus is much more likely to trigger this useless up front translation job.

Instead RandomSampleLoader can be changed to subsume an existing loader, using the user specified loader to read the tuples while handling the skipping between tuples itself.
This will require the subsumed loader to implement a Samplable interface, which will look something like:

{code}
import java.io.IOException;
import org.apache.pig.LoadFunc;

public interface SamplableLoader extends LoadFunc {
    /**
     * Skip ahead in the input stream.
     * @param n number of bytes to skip
     * @return number of bytes actually skipped. The return semantics are
     * exactly the same as {@link java.io.InputStream#skip(long)}
     */
    public long skip(long n) throws IOException;

    /**
     * Get the current position in the stream.
     * @return position in the stream.
     */
    public long getPosition() throws IOException;
}
{code}

The MRCompiler would then check whether the loader being used to load data implements the SamplableLoader interface. If so, rather than creating an initial MR job to do the translation, it would create the sampling job, having RandomSampleLoader use the user-specified loader.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
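The skip/getPosition contract proposed in PIG-820 can be illustrated without any Pig classes. The following is only a minimal sketch: `PositionTrackingStream` is a hypothetical helper name, not part of Pig and not taken from the attached patches. It shows how a loader might delegate `skip` to the underlying `java.io.InputStream` while keeping its own byte offset, so that a sampler can alternate between reading and skipping.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper (not part of Pig): wraps a stream and tracks the byte offset. */
public class PositionTrackingStream {
    private final InputStream in;
    private long position = 0;

    public PositionTrackingStream(InputStream in) {
        this.in = in;
    }

    /** Same return semantics as InputStream#skip(long): may skip fewer bytes than requested. */
    public long skip(long n) throws IOException {
        long skipped = in.skip(n);
        position += skipped;
        return skipped;
    }

    /** Current byte offset from the start of the wrapped stream. */
    public long getPosition() {
        return position;
    }

    /** Reads one byte; returns -1 at end of stream. */
    public int read() throws IOException {
        int b = in.read();
        if (b >= 0) {
            position++;
        }
        return b;
    }

    public static void main(String[] args) throws IOException {
        PositionTrackingStream s =
            new PositionTrackingStream(new ByteArrayInputStream(new byte[10]));
        s.read();                 // consume one byte
        long skipped = s.skip(4); // then skip ahead
        System.out.println("skipped=" + skipped + " position=" + s.getPosition());
    }
}
```

Because `skip` may return less than the requested count, a caller that needs to reach a specific offset should compare against `getPosition()` rather than assume the full skip happened.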
[jira] Created: (PIG-868) indexof / lastindexof / lower / replace / substring udf's
indexof / lastindexof / lower / replace / substring udf's

Key: PIG-868
URL: https://issues.apache.org/jira/browse/PIG-868
Project: Pig
Issue Type: New Feature
Reporter: Bennie Schut
Priority: Trivial

We parse some Apache logs using Pig and use some pretty simple UDFs like this:

B = FOREACH A GENERATE substring(uri, lastindexof(uri, '/')+1, indexof(uri, '.txt')) as lang;

It's pretty simple stuff, but I figured someone else might find it useful.
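For readers unfamiliar with the intent of the Pig expression in PIG-868, here is a rough plain-Java equivalent. It is a sketch only: `UriLang` and `extractLang` are hypothetical names, and the null handling is an assumption of this example, not necessarily what the UDFs in the attached patch do. It assumes the Pig UDFs mirror the semantics of `String.lastIndexOf`, `String.indexOf`, and `String.substring`.

```java
public class UriLang {
    /**
     * Returns the text between the last '/' and the first ".txt" in the URI,
     * or null when the URI is null or does not have that shape.
     */
    public static String extractLang(String uri) {
        if (uri == null) {
            return null;
        }
        int start = uri.lastIndexOf('/') + 1; // 0 when there is no '/'
        int end = uri.indexOf(".txt");        // -1 when ".txt" is absent
        if (end < start) {
            return null;                      // missing or misplaced ".txt"
        }
        return uri.substring(start, end);
    }

    public static void main(String[] args) {
        System.out.println(extractLang("/docs/en.txt"));
    }
}
```

The explicit bounds check matters because `substring` throws when the end index precedes the start index, which would otherwise happen for any log line without a `.txt` suffix.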
[jira] Updated: (PIG-868) indexof / lastindexof / lower / replace / substring udf's
[ https://issues.apache.org/jira/browse/PIG-868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bennie Schut updated PIG-868:

Attachment: addSomeUDFsPatch.patch

Some udf's including unittests.
[jira] Updated: (PIG-868) indexof / lastindexof / lower / replace / substring udf's
[ https://issues.apache.org/jira/browse/PIG-868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bennie Schut updated PIG-868:

Attachment: dateExtractorPatch.patch

dateExtractor unittest gave a failure on my machine on the -0600 time zone returning the wrong date. A bit unrelated to the other patch but I fixed it to get piggybank unittests to work.
[jira] Updated: (PIG-868) indexof / lastindexof / lower / replace / substring udf's
[ https://issues.apache.org/jira/browse/PIG-868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bennie Schut updated PIG-868:

Status: Patch Available (was: Open)
[jira] Updated: (PIG-749) No attempt to check if 'flatten(group) as' has the same cardinality as 'group alias by'
[ https://issues.apache.org/jira/browse/PIG-749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-749:

Priority: Minor (was: Major)

No attempt to check if 'flatten(group) as' has the same cardinality as 'group alias by'

Key: PIG-749
URL: https://issues.apache.org/jira/browse/PIG-749
Project: Pig
Issue Type: Bug
Components: grunt
Affects Versions: 0.3.0
Reporter: Viraj Bhat
Priority: Minor

A Pig script that groups by 3 columns and flattens the group as 4 columns runs successfully, when in principle it should not; it should probably fail with a front-end error.

{code}
A = load 'groupcardinalitycheck.txt' using PigStorage() as (col1:chararray, col2:chararray, col3:int, col4:chararray);
B = group A by (col1, col2, col3);
C = foreach B generate flatten(group) as (col1, col2, col3, col4), SIZE(A) as frequency;
dump C;
{code}

== Data ==
hello CC 1 there
hello YSO 2 out
ouch CC 2 hey

== Result of the preceding script ==
(ouch,CC,2,1L)
(hello,CC,1,1L)
(hello,YSO,2,1L)
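The missing check described in PIG-749 amounts to comparing two list sizes. The sketch below is purely illustrative: `FlattenAliasCheck` is a hypothetical class, not Pig's front-end code, and it assumes the group keys and the `as` aliases are already available as plain lists of names.

```java
import java.util.Arrays;
import java.util.List;

public class FlattenAliasCheck {
    /**
     * Throws if the number of aliases in the 'as' clause differs from the
     * number of fields that flatten(group) produces (one per group key).
     */
    public static void check(List<String> groupKeys, List<String> asAliases) {
        if (groupKeys.size() != asAliases.size()) {
            throw new IllegalArgumentException(
                "flatten(group) yields " + groupKeys.size()
                    + " fields but the 'as' clause names " + asAliases.size());
        }
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("col1", "col2", "col3");
        List<String> aliases = Arrays.asList("col1", "col2", "col3", "col4");
        try {
            check(keys, aliases); // the mismatch from the script in the report
        } catch (IllegalArgumentException e) {
            System.out.println("front-end error: " + e.getMessage());
        }
    }
}
```

Failing at this point would surface the mismatch before any MR job is launched, instead of silently producing tuples whose fourth field is actually the `SIZE(A)` value.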
[jira] Updated: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader
[ https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-820:

Resolution: Fixed
Status: Resolved (was: Patch Available)

v6 of the patch checked in. Thanks Ashutosh.
[jira] Resolved: (PIG-788) Proposal to remove float from Pig data types
[ https://issues.apache.org/jira/browse/PIG-788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates resolved PIG-788.

Resolution: Won't Fix

Avro has decided to keep float as a type.

Proposal to remove float from Pig data types

Key: PIG-788
URL: https://issues.apache.org/jira/browse/PIG-788
Project: Pig
Issue Type: Improvement
Components: impl
Affects Versions: 0.2.0
Reporter: Alan Gates
Assignee: Alan Gates

Pig would like to use the new Hadoop Avro serialization package to pass data between MR jobs, and eventually between Pig and UDFs that are not written in Java. Avro will not be supporting the float data type, but only double (see AVRO-17).

Pig currently supports both float and double. Double is the default floating point type (so if the user says x + 1.0, 1.0 is taken to be a double, not a float). Float was initially included in the list of Pig types because Hadoop supported it as one of the Writable types, and we were trying to make sure all of Hadoop's Writable types could be represented in Pig. In practice we do not see anyone using the float type.

In order to be able to easily use Avro, I propose dropping the float type. Please speak up if you are using the float type and you have a compelling reason not to use double.
[jira] Updated: (PIG-861) POJoinPackage lose tuple in large dataset
[ https://issues.apache.org/jira/browse/PIG-861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-861:

Affects Version/s: 0.3.0 (was: 0.2.0)
Status: Patch Available (was: Open)

POJoinPackage lose tuple in large dataset

Key: PIG-861
URL: https://issues.apache.org/jira/browse/PIG-861
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.3.0
Reporter: Daniel Dai
Assignee: Daniel Dai
Fix For: 0.4.0
Attachments: PIG-861-1.patch

Some scripts using POJoinPackage lose records when processing a large amount of input data. We do not see this problem with smaller inputs. We can reproduce the problem; however, the dataset for the test case is too big to be included here. We suspect that POJoinPackage causes the problem.
[jira] Commented: (PIG-861) POJoinPackage lose tuple in large dataset
[ https://issues.apache.org/jira/browse/PIG-861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12725880#action_12725880 ]

Olga Natkovich commented on PIG-861:

+1, changes look good. Great catch! Need to make sure all tests pass before committing