[jira] Commented: (PIG-1248) [piggybank] useful String functions
[ https://issues.apache.org/jira/browse/PIG-1248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841789#action_12841789 ]

Dmitriy V. Ryaboy commented on PIG-1248:

Gerrit, as far as Pig the language is concerned, all bags are unordered.

Alan, re capitalization -- I see your point, and agree (even made LENGTH all-caps originally). Can we put off renaming these until we copy them into builtins? Users will change the paths to take advantage of them anyway (from piggybank to builtin), so changing the capitalization shouldn't be a big deal.

I think it should be a copy, not a move, by the way, so as not to break existing scripts. Perhaps we can deprecate the piggybank classes when we copy them to builtin.

[piggybank] useful String functions
---
Key: PIG-1248
URL: https://issues.apache.org/jira/browse/PIG-1248
Project: Pig
Issue Type: New Feature
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
Fix For: 0.7.0
Attachments: PIG_1248.diff, PIG_1248.diff, PIG_1248.diff

Pig ships with very few evalFuncs for working with strings. This jira is for adding a few more.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1248) [piggybank] useful String functions
[ https://issues.apache.org/jira/browse/PIG-1248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841930#action_12841930 ]

Alan Gates commented on PIG-1248:

bq. Can we put off renaming these until we copy them into builtins? Users will change the paths to take advantage of them anyway (from piggybank to builtin), so changing the capitalization shouldn't be a big deal.

I think that's fine. It may even help reduce confusion for users, given the next point.

bq. I think it should be a copy, not a move, by the way, so as not to break existing scripts. Perhaps we can deprecate the piggybank classes when we copy them to builtin.

+1, we want to make life as easy as possible for users.

So at this point, I'm +1 on committing this patch.
[jira] Commented: (PIG-1248) [piggybank] useful String functions
[ https://issues.apache.org/jira/browse/PIG-1248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840279#action_12840279 ]

Bill Graham commented on PIG-1248:

How exactly would Split differ from the TOKENIZE function if Split returned a bag? TOKENIZE returns an unordered bag of words. Having a function that returns an ordered tuple of words is very useful IMO. I had to write my own version of a tokenize UDF to do this.
[jira] Commented: (PIG-1248) [piggybank] useful String functions
[ https://issues.apache.org/jira/browse/PIG-1248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12839891#action_12839891 ]

Alan Gates commented on PIG-1248:

I agree camel case is easier on the eyes (and on the fingers). But we seem to have chosen all caps for built-in functions. Some of these we'll eventually want to move into builtins for Pig, and I really don't want to bring in builtins that aren't all caps. For functions we think we might want to bring into builtin someday, it seems better to start them out in all caps now rather than change them later.

One other thought on tuples vs. bags. If you have a script like:

{code}
A = load 'bla' using TextLoader();
B = foreach A generate flatten(Split($0));
{code}

Assume a file that contains: Mary had a little lamb.

If Split returns a tuple, then B will contain 1 record, (Mary, had, a, little, lamb). If Split returns a bag, then B will contain 5 records (Mary, had, a, little, lamb). I don't have any guess as to which of those users will want more.
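The record counts in this comment can be modeled outside Pig. The following is a minimal sketch in plain Python, not Pig code: the helper names flatten_tuple and flatten_bag are hypothetical, a Python list stands in for a bag (so the order shown should not be relied on), and the input is assumed to be split on single spaces.

```python
def flatten_tuple(fields):
    """Model FLATTEN on a tuple: its fields become the fields of ONE output record."""
    return [tuple(fields)]


def flatten_bag(bag_of_tuples):
    """Model FLATTEN on a bag: each tuple in the bag becomes its OWN output record.

    A real Pig bag is unordered; the list here is only a stand-in.
    """
    return [tuple(t) for t in bag_of_tuples]


line = "Mary had a little lamb"
words = line.split(" ")  # assumption: split on single spaces

# Split returning a tuple of chararrays: one record with five fields.
records_from_tuple = flatten_tuple(words)

# Split returning a bag of single-field tuples: five records with one field each.
records_from_bag = flatten_bag([(w,) for w in words])

print(len(records_from_tuple), records_from_tuple)
print(len(records_from_bag), records_from_bag)
```

In the bag case every output record has the same one-field shape no matter how many words the input line holds, whereas in the tuple case the width of the single record depends on the data.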
[jira] Commented: (PIG-1248) [piggybank] useful String functions
[ https://issues.apache.org/jira/browse/PIG-1248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836584#action_12836584 ]

Hadoop QA commented on PIG-1248:

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12436549/PIG_1248.diff
against trunk revision 912064.

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 15 new or modified tests.
-1 javadoc. The javadoc tool appears to have generated 1 warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs warnings.
-1 release audit. The applied patch generated 526 release audit warnings (more than the trunk's current 517 warnings).
-1 core tests. The patch failed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/210/testReport/
Release audit warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/210/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/210/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/210/console

This message is automatically generated.

[piggybank] useful String functions
---
Key: PIG-1248
URL: https://issues.apache.org/jira/browse/PIG-1248
Project: Pig
Issue Type: New Feature
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
Fix For: 0.7.0
Attachments: PIG_1248.diff

Pig ships with very few evalFuncs for working with strings. This jira is for adding a few more.
[jira] Commented: (PIG-1248) [piggybank] useful String functions
[ https://issues.apache.org/jira/browse/PIG-1248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836838#action_12836838 ]

Alan Gates commented on PIG-1248:

Some of these methods use all caps, some camel case. Is there some reasoning here? We should pick one convention. The convention in Pig seems to be all caps (COUNT, etc.), so we should use that.

I would like to see SUBSTRING match SQL semantics, where the 3rd argument is the length of the substring instead of the terminal index. Since this is a change in interface, I'm open to delaying that until we pull SUBSTRING in as a Pig builtin.

The javadocs for the Split class appear to have been copied from SUBSTRING.

Should Split return a tuple of chararrays or a bag of tuples, each containing one chararray? The advantage I can see for the latter is that it's possible to determine the full schema of Split's output a priori, while with the former you cannot. (You can say that it returns a tuple, but you don't know what's in the tuple, so when that tuple is flattened the schema becomes unknown.) Since lack of schema is contagious, this could cause problems for Pig features (such as optimization, metadata, and certain storage mechanisms like Zebra) that do much better with schemas. The same question applies to RegexExtractAll.

[piggybank] useful String functions
---
Key: PIG-1248
URL: https://issues.apache.org/jira/browse/PIG-1248
Project: Pig
Issue Type: New Feature
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
Fix For: 0.7.0
Attachments: PIG_1248.diff, PIG_1248.diff

Pig ships with very few evalFuncs for working with strings. This jira is for adding a few more.
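To make the two SUBSTRING contracts in this comment concrete, here is a small sketch in plain Python rather than Pig. The function names are hypothetical, and both variants are assumed to use a 0-based start so that only the meaning of the third argument differs; that is a simplification for illustration, not a statement about what the patch or any SQL dialect does with the start position.

```python
def substring_end_index(s, start, end):
    """Third argument is the terminal index (exclusive), as in the patch under review."""
    return s[start:end]


def substring_with_length(s, start, length):
    """Third argument is the length of the result, as in the SQL-style proposal.

    Assumption for illustration: start is 0-based here, same as above.
    """
    return s[start:start + length]


word = "piggybank"

# Same three arguments, two different results:
by_end_index = substring_end_index(word, 3, 5)   # characters at positions 3 and 4
by_length = substring_with_length(word, 3, 5)    # five characters starting at position 3

print(by_end_index)
print(by_length)
```

Because the same call yields different results under the two contracts, switching between them silently changes the output of existing scripts, which is why the thread treats it as an interface change.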
[jira] Commented: (PIG-1248) [piggybank] useful String functions
[ https://issues.apache.org/jira/browse/PIG-1248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837064#action_12837064 ]

Dmitriy V. Ryaboy commented on PIG-1248:

I'll fix up the javadocs.

I didn't modify substring behavior except to make it nice to short strings, which I think is what people tend to expect. The contract should probably stay the same for backwards compatibility? We can change it on migration, but that'd be confusing.

I am returning tuples instead of bags because they are ordered, and bags are not. You are right about the flattening issue. I guess a user could call the hypothetical tupleToBag UDF if that was the use case, though.

-D
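One possible reading of "nice to short strings" is that an end index past the end of the input is clamped instead of raising an error. The sketch below, in plain Python, illustrates that reading only; the thread does not spell out the patch's exact behavior, so the clamping rules and both function names are assumptions. The strict variant mirrors Java's String.substring(begin, end), which throws when the end index exceeds the string length.

```python
def strict_substring(s, start, end):
    """Model of a strict contract: out-of-range indexes are an error."""
    if start < 0 or end > len(s) or start > end:
        raise IndexError(f"range [{start}, {end}) is invalid for length {len(s)}")
    return s[start:end]


def lenient_substring(s, start, end):
    """Assumed lenient contract: clamp the range to the string instead of failing."""
    start = max(start, 0)
    end = min(end, len(s))
    if start >= end:
        return ""  # assumption: an empty range yields an empty string
    return s[start:end]


short = "pig"

print(lenient_substring(short, 1, 10))  # end index clamped to len(short)
print(lenient_substring(short, 5, 10))  # start is past the end of the string

try:
    strict_substring(short, 1, 10)
except IndexError as err:
    print("strict variant failed:", err)
```

The lenient form keeps the (start, end) contract intact for inputs that were already valid, so it stays backwards compatible while no longer failing on rows whose strings are shorter than expected.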