[jira] Created: (PIG-1400) add option for null field JOIN semantics
add option for null field JOIN semantics
----------------------------------------

Key: PIG-1400
URL: https://issues.apache.org/jira/browse/PIG-1400
Project: Pig
Issue Type: Bug
Affects Versions: 0.6.0
Reporter: David Ciemiewicz

Currently JOIN supports SQL semantics for joining null values in fields - they aren't matched. However, GROUP ... and COGROUP ... semantics DO match on null values in fields.

This violated the principle of least astonishment for me - I expected JOIN on null value fields to work.

As a workaround, I must now go through ALL of my code to convert chararray null values to empty strings to get the JOIN to work appropriately.

{code}
A = foreach A generate
        ((a is not null) ? a : '') as a,
        ((b is not null) ? b : '') as b,
        ...
{code}

This is not really a satisfactory workaround.

My preference is that JOIN support an option (a la FULL, LEFT, RIGHT, OUTER) that directs JOIN to use null-match join semantics just like COGROUP does. Something like:

{code}
AB = JOIN A by ( key, subkey ) FULL OUTER MATCHNULLS, B by ( key, subkey );
{code}

I don't know if it should be called JOIN_NULLS, MATCHNULLS, NULLS, NULLSEMANTICS, what have you. I just think it would be much cleaner for the end user to be able to get these semantics.

We might also consider being explicit about the SQL null semantics by adding the option SQLNULLS or NONULLMATCH.
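[Editorial note: until such an option exists, a null-matching inner join can be emulated with COGROUP, since COGROUP already matches null keys. A minimal sketch, assuming relations A and B shaped as in the example above:]

{code}
-- COGROUP matches on null key fields, so flattening the cogrouped bags
-- yields inner-join results with null-match semantics; IsEmpty is the
-- builtin org.apache.pig.builtin.IsEmpty.
AB = cogroup A by ( key, subkey ), B by ( key, subkey );
AB = filter AB by not IsEmpty(A) and not IsEmpty(B);
AB = foreach AB generate flatten(A), flatten(B);
{code}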
[jira] Commented: (PIG-42) Pig should be able to split Gzip files like it can split Bzip files
[ https://issues.apache.org/jira/browse/PIG-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854208#action_12854208 ]

David Ciemiewicz commented on PIG-42:
--------------------------------------

Hadoop Archives are not really the solution here. I want my code to work with exactly the same file name references whether I have 100 gzip compressed (or bzip2 compressed) part files or a single concatenation of the individually compressed part files. With a har, I have to change all my filename references. What we really want are simple concatenations of gzip files and bzip2 files that work with map reduce.

Pig should be able to split Gzip files like it can split Bzip files
-------------------------------------------------------------------

Key: PIG-42
URL: https://issues.apache.org/jira/browse/PIG-42
Project: Pig
Issue Type: Improvement
Components: impl
Reporter: Benjamin Reed
Assignee: Benjamin Reed
Attachments: gzip.patch

It would be nice to be able to split gzip files like we can split bzip files. Unfortunately, we don't have a sync point for the split in the gzip format. The gzip file format supports the notion of concatenated gzipped files: when gzipped files are concatenated together they are treated as a single file. So to make a gzipped file splittable we can use an empty compressed file with some salt in the headers as a sync signature. Then we can make the gzip file splittable by using this sync signature between compressed segments of the file.
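[Editorial note: the file-name stability the comment asks for amounts to the following sketch (paths hypothetical). Because concatenated gzip members form a single valid gzip stream per RFC 1952, the same script logic should work before and after concatenation, with no har or reference rewriting:]

{code}
-- today: the script must reference the directory of part files
A = load 'events/20090101/part-*.gz' using PigStorage();

-- desired: identical logic once the parts are concatenated into one file
A = load 'events/20090101.gz' using PigStorage();
{code}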
[jira] Commented: (PIG-282) Custom Partitioner
[ https://issues.apache.org/jira/browse/PIG-282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849279#action_12849279 ]

David Ciemiewicz commented on PIG-282:
---------------------------------------

How will the custom partitioner be used in Pig? Is this for map partitioning and/or output partitioning?

For instance, I'd love to have something that created separate directories based on the value of some key.

Custom Partitioner
------------------

Key: PIG-282
URL: https://issues.apache.org/jira/browse/PIG-282
Project: Pig
Issue Type: New Feature
Reporter: Amir Youssefi
Priority: Minor

By adding a custom partitioner we can give control over which output partition a key (/value) goes to. We can add keywords to the language, e.g. PARTITION BY UDF(...) or a similar syntax. The UDF returns a number between 0 and n-1 where n is the number of output partitions.
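[Editorial note: whatever surface syntax is chosen, the underlying contract is presumably Hadoop's Partitioner - map a key to an integer in [0, n-1]. A minimal sketch, assuming Pig would accept a user-supplied subclass; the class name and hash scheme here are illustrative, not a committed API:]

{code}
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Partitioner;

public class KeyHashPartitioner extends Partitioner<Writable, Writable> {
    @Override
    public int getPartition(Writable key, Writable value, int numPartitions) {
        // Route each key to one of numPartitions reduce outputs (0 .. n-1);
        // masking keeps the result non-negative for any hashCode.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
{code}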
[jira] Reopened: (PIG-1182) Pig reference manual does not mention syntax for comments
[ https://issues.apache.org/jira/browse/PIG-1182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Ciemiewicz reopened PIG-1182:
-----------------------------------

Corinne, I'm not sure why you are so resistant to following the basic principle of documenting ALL syntax, including comments, in the reference manual.

If the document is open to the community to edit, I'm more than willing to do the work myself, since I have contributed as a technical writer for programming language reference manuals in my past, as well as having been a developer of compilers and software development tools.

Also, I think the passage you cited could use a little work on the English:

Using Comments in Scripts

If you place Pig Latin statements in a script, the script can include comments. For multi-line comments use /* */. For single-line comments use --.

{code}
/* myscript.pig
My script includes three simple Pig Latin Statements.
*/

A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float); -- load statement
B = FOREACH A GENERATE name; -- foreach statement
DUMP B; -- dump statement
{code}

Pig reference manual does not mention syntax for comments
---------------------------------------------------------

Key: PIG-1182
URL: https://issues.apache.org/jira/browse/PIG-1182
Project: Pig
Issue Type: Bug
Components: documentation
Affects Versions: 0.5.0
Reporter: David Ciemiewicz
Assignee: Corinne Chandel
Fix For: 0.7.0

The Pig 0.5.0 reference manual does not mention how to write comments in your pig code using -- (two dashes).

http://hadoop.apache.org/pig/docs/r0.5.0/piglatin_reference.html

Also, does /* */ also work?
[jira] Commented: (PIG-752) local mode doesn't read bzip2 and gzip compressed data files
[ https://issues.apache.org/jira/browse/PIG-752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803795#action_12803795 ]

David Ciemiewicz commented on PIG-752:
---------------------------------------

Jeff,

What do you mean when you say local mode has been removed?

Does this mean that the option -exectype local has been removed? Or does this mean that the local mode execution code has been replaced, or will be replaced, by a M/R execution engine that operates on the user's local computer without the need for an HDFS grid?

If the former (no local execution), this is nuts. If the latter (M/R execution for local execution), and this will supply the means of reading and writing bzip compression, then this isn't a WON'T FIX, this is FIXED by a change in execution engine?

So which is it?

local mode doesn't read bzip2 and gzip compressed data files
------------------------------------------------------------

Key: PIG-752
URL: https://issues.apache.org/jira/browse/PIG-752
Project: Pig
Issue Type: Bug
Affects Versions: 0.4.0
Reporter: David Ciemiewicz
Assignee: Jeff Zhang
Attachments: Pig_752.Patch

Problem 1) use of .bz2 file extension does not store results bzip2 compressed in local mode (-exectype local)

If I use the .bz2 filename extension in a STORE statement on HDFS, the results are stored with bzip2 compression. If I use the .bz2 filename extension in a STORE statement on the local file system, the results are NOT stored with bzip2 compression.

compact.bz2.pig:

{code}
A = load 'events.test' using PigStorage();
store A into 'events.test.bz2' using PigStorage();
C = load 'events.test.bz2' using PigStorage();
C = limit C 10;
dump C;
{code}

{code}
-bash-3.00$ pig -exectype local compact.bz2.pig
-bash-3.00$ file events.test
events.test: ASCII English text, with very long lines
-bash-3.00$ file events.test.bz2
events.test.bz2: ASCII English text, with very long lines
-bash-3.00$ cat events.test | bzip2 > events.test.bz2
-bash-3.00$ file events.test.bz2
events.test.bz2: bzip2 compressed data, block size = 900k
{code}

The output format in local mode is definitely not bzip2, but it should be.

Problem 2) pig in local mode does not decompress bzip2 compressed files, but should, to be consistent with HDFS

read.bz2.pig:

{code}
A = load 'events.test.bz2' using PigStorage();
A = limit A 10;
dump A;
{code}

The output should be human readable but is instead garbage, indicating no decompression took place during the load:

{code}
-bash-3.00$ pig -exectype local read.bz2.pig
USING: /grid/0/gs/pig/current
2009-04-03 18:26:30,455 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - 100% complete!
2009-04-03 18:26:30,456 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Success!!
(BZh91AYsyoz?u?...@{x_?d?|u-??mK???;??4?C??)
((R? 6?*mg, ?6?Zj?k,???0?QT?d???hY?#mJ?[j???z?m?t?u?K)??K5+??)?m?E7j?X?8a?? ??U?p@@MT?$?B?P??N??=???(z}gk...@c$\??i]?g:?J)
a(R?,?u?v???...@?i@??J??!D?)???A?PP?IY??m?
(mP(i?4,#F[?I)@?...@??|7^?}U??wwg,?u?$?T???((Q!D?=`*?}hP??_|??=?(??2???m=?xG?(?rC?B?(33??:4?N???t|??T?*??k??NT?x???=?fyv?wf??4z???4t?)
(?oou?t???Kwl?3?nCM?WS?;l???P?s?x a???e)B??9? ?44
((?...@4?)
(f)
(?...@+?d?0@?U)
(Q?SR)
-bash-3.00$
{code}
[jira] Created: (PIG-1182) Pig reference manual does not mention syntax for comments
Pig reference manual does not mention syntax for comments
---------------------------------------------------------

Key: PIG-1182
URL: https://issues.apache.org/jira/browse/PIG-1182
Project: Pig
Issue Type: Bug
Components: documentation
Affects Versions: 0.5.0
Reporter: David Ciemiewicz

The Pig 0.5.0 reference manual does not mention how to write comments in your pig code using -- (two dashes).

http://hadoop.apache.org/pig/docs/r0.5.0/piglatin_reference.html

Also, does /* */ also work?
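[Editorial note: for reference, both comment forms are accepted by the Pig Latin parser, as a quick sketch shows:]

{code}
-- a single-line comment: two dashes to end of line
/* a multi-line comment:
   C-style delimiters, so yes, this form works too */
A = LOAD 'student'; -- trailing comments are also fine
{code}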
[jira] Commented: (PIG-1182) Pig reference manual does not mention syntax for comments
[ https://issues.apache.org/jira/browse/PIG-1182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798128#action_12798128 ]

David Ciemiewicz commented on PIG-1182:
----------------------------------------

Corinne,

I made no changes. I'm pointing out that it is an omission to not have the comment syntax documented in the reference manual.

Reference manuals for programming languages SHOULD ALWAYS have information on ALL syntax, including comment syntax.

Once you are done with learning things in the User's Guide, most of the time programmers just go back to the Reference Manual for quick lookup of information and syntax. So the documentation on comment syntax should be in BOTH the User's Guide AND the Reference Manual.

Pig reference manual does not mention syntax for comments
---------------------------------------------------------

Key: PIG-1182
URL: https://issues.apache.org/jira/browse/PIG-1182
Project: Pig
Issue Type: Bug
Components: documentation
Affects Versions: 0.5.0
Reporter: David Ciemiewicz
Assignee: Corinne Chandel
Fix For: 0.7.0

The Pig 0.5.0 reference manual does not mention how to write comments in your pig code using -- (two dashes).

http://hadoop.apache.org/pig/docs/r0.5.0/piglatin_reference.html

Also, does /* */ also work?
[jira] Commented: (PIG-1097) Pig do not support group by boolean type
[ https://issues.apache.org/jira/browse/PIG-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780223#action_12780223 ]

David Ciemiewicz commented on PIG-1097:
----------------------------------------

I think one could argue that Filter functions are REALLY just EvalBoolean functions in disguise - that Filter functions were a way of adding a Boolean return type to Pig back when Pig had no types.

Further, I'd argue that now that Pig does have data types, Filter should be deprecated and all Filter functions should become EvalBoolean. In other words, I believe it was an oversight in the types migration to not migrate Filter to EvalBoolean.

Pig do not support group by boolean type
----------------------------------------

Key: PIG-1097
URL: https://issues.apache.org/jira/browse/PIG-1097
Project: Pig
Issue Type: Improvement
Components: impl
Reporter: Jeff Zhang
Assignee: Jeff Zhang
Priority: Minor
Fix For: 0.6.0

My script is as follows; the TestUDF returns boolean type.

{color:blue}
DEFINE testUDF org.apache.pig.piggybank.util.TestUDF();
raw = LOAD 'data/input';
raw = FOREACH raw GENERATE testUDF();
raw = GROUP raw BY $0;
DUMP raw;
{color}

*The above script will throw an exception:*

{code}
Exception in thread "main" org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias raw
	at org.apache.pig.PigServer.openIterator(PigServer.java:481)
	at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:539)
	at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:241)
	at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
	at org.apache.pig.PigServer.registerScript(PigServer.java:409)
	at PigExample.main(PigExample.java:13)
Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1002: Unable to store alias raw
	at org.apache.pig.PigServer.store(PigServer.java:536)
	at org.apache.pig.PigServer.openIterator(PigServer.java:464)
	... 5 more
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2043: Unexpected error during execution.
	at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:269)
	at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:780)
	at org.apache.pig.PigServer.store(PigServer.java:528)
	... 6 more
Caused by: org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobCreationException: ERROR 2036: Unhandled key type boolean
	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.selectComparator(JobControlCompiler.java:856)
	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.getJob(JobControlCompiler.java:561)
	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.compile(JobControlCompiler.java:251)
	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:128)
	at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:249)
	... 8 more
{code}
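[Editorial note: until boolean is a supported key type, the usual dodge is to fold the boolean into a type the grouping comparator does handle. A sketch, assuming the parser accepts the boolean-returning UDF as a bincond condition:]

{code}
DEFINE testUDF org.apache.pig.piggybank.util.TestUDF();
raw = LOAD 'data/input';
-- map the boolean to an int so GROUP BY sees a supported key type
raw = FOREACH raw GENERATE (testUDF() ? 1 : 0) AS flag;
raw = GROUP raw BY flag;
DUMP raw;
{code}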
[jira] Created: (PIG-1034) Pig does not support ORDER ... BY group alias
Pig does not support ORDER ... BY group alias
---------------------------------------------

Key: PIG-1034
URL: https://issues.apache.org/jira/browse/PIG-1034
Project: Pig
Issue Type: Bug
Reporter: David Ciemiewicz

GROUP ... ALL and GROUP ... BY produce an alias group. Pig produces a syntax error if you attempt to ORDER ... BY group. Ordering by group seems like a perfectly reasonable thing to do.

The workaround is to create an alias for group using an AS clause. But I think this workaround should be unnecessary.

Here's sample code which elicits the syntax error:

{code}
A = load 'one.txt' using PigStorage as (one: int);
B = group A all;
C = foreach B generate group, COUNT(A) as count;
D = order C by group parallel 1; -- group is one of the aliases in C, why does this throw a syntax error?
dump D;
{code}
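[Editorial note: spelled out, the AS-clause workaround the report mentions looks like this; the alias name grp is just an example:]

{code}
-- rename 'group' in the projection so ORDER BY has a plain alias to use
C = foreach B generate group as grp, COUNT(A) as count;
D = order C by grp parallel 1;
{code}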
[jira] Commented: (PIG-979) Acummulator Interface for UDFs
[ https://issues.apache.org/jira/browse/PIG-979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12759813#action_12759813 ]

David Ciemiewicz commented on PIG-979:
---------------------------------------

This JIRA doesn't quite get the gist of why I believe the Accumulator interface is of interest. It isn't just about performance and avoiding retreading over the same data over and over again. It is also about providing an interface to support CUMULATIVE_SUM, RANK, and other functions of that ilk. A better code example for justifying this would be:

{code}
A = load 'data' using PigStorage() as ( query: chararray, count: int );
B = order A by count desc parallel 1;
C = foreach B generate
        query,
        count,
        CUMULATIVE_SUM(count) as cumulative_count,
        RANK(count) as rank;
{code}

These functions RANK and CUMULATIVE_SUM would have persistent state and yet would emit a value per value or tuple passed. Bags would not be appropriate as coded.

Additionally, the reason for the Accumulator interface is to avoid multiple passes over the same data. For instance, consider the example:

{code}
A = load 'data' using PigStorage() as ( query: chararray, count: int );
B = group A all;
C = foreach B generate
        group,
        SUM(A.count), AVG(A.count), VAR(A.count), STDEV(A.count),
        MIN(A.count), MAX(A.count), MEDIAN(A.count);
{code}

Repeatedly shuffling the same values just isn't an optimal way to process data.

Acummulator Interface for UDFs
------------------------------

Key: PIG-979
URL: https://issues.apache.org/jira/browse/PIG-979
Project: Pig
Issue Type: New Feature
Reporter: Alan Gates
Assignee: Ying He

Add an accumulator interface for UDFs that would allow them to take a set number of records at a time instead of the entire bag.
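[Editorial note: for the aggregate case, the interface that came out of this issue lets a UDF consume a group in batches instead of as one materialized bag. A minimal SUM sketch against the org.apache.pig.Accumulator interface; the exec fallback shown is one possible wiring, not the committed design:]

{code}
import java.io.IOException;
import org.apache.pig.Accumulator;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

public class AccSum extends EvalFunc<Double> implements Accumulator<Double> {
    private double sum = 0.0;

    @Override
    public void accumulate(Tuple b) throws IOException {
        // b wraps a bag holding the next batch of tuples for the current group.
        for (Tuple t : (DataBag) b.get(0)) {
            sum += ((Number) t.get(0)).doubleValue();
        }
    }

    @Override
    public Double getValue() { return sum; }

    @Override
    public void cleanup() { sum = 0.0; }

    @Override
    public Double exec(Tuple input) throws IOException {
        // Fallback for contexts where Pig does not use the accumulator path.
        accumulate(input);
        Double v = getValue();
        cleanup();
        return v;
    }
}
{code}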
[jira] Created: (PIG-900) ORDER BY syntax wrt parentheses is somewhat different than GROUP BY and FILTER BY
ORDER BY syntax wrt parentheses is somewhat different than GROUP BY and FILTER BY
----------------------------------------------------------------------------------

Key: PIG-900
URL: https://issues.apache.org/jira/browse/PIG-900
Project: Pig
Issue Type: Bug
Reporter: David Ciemiewicz

With GROUP BY, you must put parentheses around the aliases in the BY clause:

{code}
B = group A by ( a, b, c );
{code}

With FILTER BY, you can optionally put parentheses around the aliases in the BY clause:

{code}
B = filter A by ( a is not null and b is not null and c is not null );
{code}

However, with ORDER BY, if you put parentheses around the BY clause, you get a syntax error:

{code}
A = order A by ( a, b, c );
{code}

Produces the error:

{code}
2009-08-03 18:26:29,544 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Encountered "," at line 3, column 19. Was expecting: ")" ...
{code}

This is an annoyance really.

{code}
A = load 'data.txt' using PigStorage as (a: chararray, b: chararray, c: chararray );
A = order A by ( a, b, c );
dump A;
{code}
[jira] Updated: (PIG-900) ORDER BY syntax wrt parentheses is somewhat different than GROUP BY and FILTER BY
[ https://issues.apache.org/jira/browse/PIG-900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Ciemiewicz updated PIG-900:
---------------------------------

Description:

With GROUP BY, you must put parentheses around the aliases in the BY clause:

{code}
B = group A by ( a, b, c );
{code}

With FILTER BY, you can optionally put parentheses around the aliases in the BY clause:

{code}
B = filter A by ( a is not null and b is not null and c is not null );
{code}

However, with ORDER BY, if you put parentheses around the BY clause, you get a syntax error:

{code}
A = order A by ( a, b, c );
{code}

Produces the error:

{code}
2009-08-03 18:26:29,544 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Encountered "," at line 3, column 19. Was expecting: ")" ...
{code}

This is an annoyance really. Here's my full code example ...

{code}
A = load 'data.txt' using PigStorage as (a: chararray, b: chararray, c: chararray );
A = order A by ( a, b, c );
dump A;
{code}

was: (the same description, without the "Here's my full code example ..." sentence)

ORDER BY syntax wrt parentheses is somewhat different than GROUP BY and FILTER BY
----------------------------------------------------------------------------------

Key: PIG-900
URL: https://issues.apache.org/jira/browse/PIG-900
Project: Pig
Issue Type: Bug
Reporter: David Ciemiewicz

(The issue description now reads as the updated Description above.)
[jira] Commented: (PIG-875) Making COUNT and AVG semantics SQL compliant
[ https://issues.apache.org/jira/browse/PIG-875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12733400#action_12733400 ]

David Ciemiewicz commented on PIG-875:
---------------------------------------

Can I suggest that the default behavior be to not count nulls, but that we might want a way for nulls to be counted, with AVG_WITH_NULLS and COUNT_WITH_NULLS, or a DEFINE statement that sets an option to turn null-counting behavior on and off.

Making COUNT and AVG semantics SQL compliant
--------------------------------------------

Key: PIG-875
URL: https://issues.apache.org/jira/browse/PIG-875
Project: Pig
Issue Type: Bug
Reporter: Olga Natkovich
Assignee: Olga Natkovich
Fix For: 0.4.0

Currently both AVG and COUNT count NULLs
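[Editorial note: in Pig Latin terms, the suggestion would read something like this sketch; COUNT_WITH_NULLS is hypothetical, named here only to illustrate the proposal:]

{code}
B = group A all;
C = foreach B generate
        COUNT(A.x),            -- proposed default: SQL semantics, nulls skipped
        COUNT_WITH_NULLS(A.x); -- hypothetical opt-in: nulls included
{code}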
[jira] Updated: (PIG-752) local mode doesn't read bzip2 and gzip compressed data files
[ https://issues.apache.org/jira/browse/PIG-752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Ciemiewicz updated PIG-752:
---------------------------------

Summary: local mode doesn't read bzip2 and gzip compressed data files (was: bzip2 compression and local mode bugs)

local mode doesn't read bzip2 and gzip compressed data files
------------------------------------------------------------

Key: PIG-752
URL: https://issues.apache.org/jira/browse/PIG-752
Project: Pig
Issue Type: Bug
Reporter: David Ciemiewicz

Problem 1) use of .bz2 file extension does not store results bzip2 compressed in local mode (-exectype local)

If I use the .bz2 filename extension in a STORE statement on HDFS, the results are stored with bzip2 compression. If I use the .bz2 filename extension in a STORE statement on the local file system, the results are NOT stored with bzip2 compression.

compact.bz2.pig:

{code}
A = load 'events.test' using PigStorage();
store A into 'events.test.bz2' using PigStorage();
C = load 'events.test.bz2' using PigStorage();
C = limit C 10;
dump C;
{code}

{code}
-bash-3.00$ pig -exectype local compact.bz2.pig
-bash-3.00$ file events.test
events.test: ASCII English text, with very long lines
-bash-3.00$ file events.test.bz2
events.test.bz2: ASCII English text, with very long lines
-bash-3.00$ cat events.test | bzip2 > events.test.bz2
-bash-3.00$ file events.test.bz2
events.test.bz2: bzip2 compressed data, block size = 900k
{code}

The output format in local mode is definitely not bzip2, but it should be.

Problem 2) pig in local mode does not decompress bzip2 compressed files, but should, to be consistent with HDFS

read.bz2.pig:

{code}
A = load 'events.test.bz2' using PigStorage();
A = limit A 10;
dump A;
{code}

The output should be human readable but is instead garbage, indicating no decompression took place during the load:

{code}
-bash-3.00$ pig -exectype local read.bz2.pig
USING: /grid/0/gs/pig/current
2009-04-03 18:26:30,455 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - 100% complete!
2009-04-03 18:26:30,456 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Success!!
(BZh91AYsyoz?u?...@{x_?d?|u-??mK???;??4?C??)
((R? 6?*mg, ?6?Zj?k,???0?QT?d???hY?#mJ?[j???z?m?t?u?K)??K5+??)?m?E7j?X?8a?? ??U?p@@MT?$?B?P??N??=???(z}gk...@c$\??i]?g:?J)
a(R?,?u?v???...@?i@??J??!D?)???A?PP?IY??m?
(mP(i?4,#F[?I)@?...@??|7^?}U??wwg,?u?$?T???((Q!D?=`*?}hP??_|??=?(??2???m=?xG?(?rC?B?(33??:4?N???t|??T?*??k??NT?x???=?fyv?wf??4z???4t?)
(?oou?t???Kwl?3?nCM?WS?;l???P?s?x a???e)B??9? ?44
((?...@4?)
(f)
(?...@+?d?0@?U)
(Q?SR)
-bash-3.00$
{code}
[jira] Commented: (PIG-793) Improving memory efficiency of Tuple implementation
[ https://issues.apache.org/jira/browse/PIG-793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724703#action_12724703 ]

David Ciemiewicz commented on PIG-793:
---------------------------------------

Alan,

This sounds good, but it sounds like it is only 12 out of 174 bytes that you are saving, or less than 10%. Amdahl's law says this isn't sufficient in the grand scheme of things, so I won't expect a huge payback.

It seems like an optimal encoding of the same tuple would be something like:

* 1 or 2 bytes for an index to the structure describing the contents of the tuple (keep a list of these tuple structures)
* 4 bytes for the int
* 8 bytes for the double
* 1 or 2 bytes for string length in fixed positions
* 20 bytes for the string

Total is 36 bytes, or an 80% reduction in memory versus 174 bytes.

If memory and not CPU is what is slowing down Pig processing, then Hong Tang's LazyTuple or something like it is ultimately going to be what is needed.

Improving memory efficiency of Tuple implementation
---------------------------------------------------

Key: PIG-793
URL: https://issues.apache.org/jira/browse/PIG-793
Project: Pig
Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Alan Gates

Currently, our tuple is a real pig and uses a lot of extra memory. There are several places where we can improve memory efficiency:

(1) Laying out memory for the fields rather than using java objects, since each object for a numeric field takes 16 bytes
(2) For the cases where we know the schema, using Java arrays rather than ArrayList

There might be more.
[jira] Created: (PIG-863) Function (UDF) automatic namespace resolution is really needed
Function (UDF) automatic namespace resolution is really needed
--------------------------------------------------------------

Key: PIG-863
URL: https://issues.apache.org/jira/browse/PIG-863
Project: Pig
Issue Type: Improvement
Reporter: David Ciemiewicz

The Apache PiggyBank documentation says that to reference a function, I need to specify the function as:

org.apache.pig.piggybank.evaluation.string.UPPER(text)

As in the example:

{code}
REGISTER /public/share/pig/contrib/piggybank/java/piggybank.jar;

TweetsInaug = FILTER Tweets BY org.apache.pig.piggybank.evaluation.string.UPPER(text) MATCHES '.*(INAUG|OBAMA|BIDEN|CHENEY|BUSH).*';
{code}

Why can't we implement automatic namespace resolution so we can just reference UPPER without namespace qualifiers?

{code}
REGISTER /public/share/pig/contrib/piggybank/java/piggybank.jar;

TweetsInaug = FILTER Tweets BY UPPER(text) MATCHES '.*(INAUG|OBAMA|BIDEN|CHENEY|BUSH).*';
{code}

I know about the workaround:

{code}
define UPPER org.apache.pig.piggybank.evaluation.string.UPPER();
{code}

But this is really a pain to do if I have lots of functions. Just warn if there is a collision and suggest the define workaround in the warning messages.
[jira] Commented: (PIG-826) DISTINCT as Function/Operator rather than statement/operator - High Level Pig
[ https://issues.apache.org/jira/browse/PIG-826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715726#action_12715726 ]

David Ciemiewicz commented on PIG-826:
---------------------------------------

Alan, thanks! But what if I want to do the following:

{code}
foreach Grouped {
    dcountryurl = distinct Logs.(country,url);
    generate COUNT(dcountryurl);
};
{code}

Projecting multiple aliases doesn't seem to work. I also tried the following and it doesn't work either.

{code}
foreach Grouped {
    dcountryurl = distinct Logs.country, Logs.url;
    generate COUNT(dcountryurl);
};
{code}

DISTINCT as Function/Operator rather than statement/operator - High Level Pig
------------------------------------------------------------------------------

Key: PIG-826
URL: https://issues.apache.org/jira/browse/PIG-826
Project: Pig
Issue Type: New Feature
Reporter: David Ciemiewicz

In SQL, a user would think nothing of doing something like:

{code}
select
    COUNT(DISTINCT(user)) as user_count,
    COUNT(DISTINCT(country)) as country_count,
    COUNT(DISTINCT(url)) as url_count
from server_logs;
{code}

But in Pig, we'd need to do something like the following. And this is about the most compact version I could come up with.

{code}
Logs = load 'log' using PigStorage() as ( user: chararray, country: chararray, url: chararray );

DistinctUsers = distinct (foreach Logs generate user);
DistinctCountries = distinct (foreach Logs generate country);
DistinctUrls = distinct (foreach Logs generate url);

DistinctUsersCount = foreach (group DistinctUsers all) generate group, COUNT(DistinctUsers) as user_count;
DistinctCountriesCount = foreach (group DistinctCountries all) generate group, COUNT(DistinctCountries) as country_count;
DistinctUrlCount = foreach (group DistinctUrls all) generate group, COUNT(DistinctUrls) as url_count;

AllDistinctCounts = cross DistinctUsersCount, DistinctCountriesCount, DistinctUrlCount;

Report = foreach AllDistinctCounts generate
    DistinctUsersCount::user_count,
    DistinctCountriesCount::country_count,
    DistinctUrlCount::url_count;

store Report into 'log_report' using PigStorage();
{code}

It would be good if there were a higher level version of Pig that permitted code to be written as:

{code}
Logs = load 'log' using PigStorage() as ( user: chararray, country: chararray, url: chararray );

Report = overall Logs generate
    COUNT(DISTINCT(user)) as user_count,
    COUNT(DISTINCT(country)) as country_count,
    COUNT(DISTINCT(url)) as url_count;

store Report into 'log_report' using PigStorage();
{code}

I do want this in Pig and not as SQL. I'd expect High Level Pig to generate Lower Level Pig.
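[Editorial note: one rewrite that does work without multi-column bag projection is to project the column pair before grouping, so the nested DISTINCT operates on (country, url) tuples directly. A sketch, assuming the goal is a global distinct-pair count as in the issue description:]

{code}
Pairs = foreach Logs generate country, url;
Grouped = group Pairs all;
C = foreach Grouped {
    dcountryurl = distinct Pairs; -- distinct over the projected pair
    generate COUNT(dcountryurl);
};
{code}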
[jira] Created: (PIG-826) DISTINCT as Function rather than statement - High Level Pig
DISTINCT as Function rather than statement - High Level Pig
-----------------------------------------------------------

Key: PIG-826
URL: https://issues.apache.org/jira/browse/PIG-826
Project: Pig
Issue Type: New Feature
Reporter: David Ciemiewicz

In SQL, a user would think nothing of doing something like:

{code}
select
    COUNT(DISTINCT(user)) as user_count,
    COUNT(DISTINCT(country)) as country_count,
    COUNT(DISTINCT(url)) as url_count
from server_logs;
{code}

But in Pig, we'd need to do something like the following. And this is about the most compact version I could come up with.

{code}
Logs = load 'log' using PigStorage() as ( user: chararray, country: chararray, url: chararray );

DistinctUsers = distinct (foreach Logs generate user);
DistinctCountries = distinct (foreach Logs generate country);
DistinctUrls = distinct (foreach Logs generate url);

DistinctUsersCount = foreach (group DistinctUsers all) generate group, COUNT(DistinctUsers) as user_count;
DistinctCountriesCount = foreach (group DistinctCountries all) generate group, COUNT(DistinctCountries) as country_count;
DistinctUrlCount = foreach (group DistinctUrls all) generate group, COUNT(DistinctUrls) as url_count;

AllDistinctCounts = cross DistinctUsersCount, DistinctCountriesCount, DistinctUrlCount;

Report = foreach AllDistinctCounts generate
    DistinctUsersCount::user_count,
    DistinctCountriesCount::country_count,
    DistinctUrlCount::url_count;

store Report into 'log_report' using PigStorage();
{code}

It would be good if there were a higher level version of Pig that permitted code to be written as:

{code}
Logs = load 'log' using PigStorage() as ( user: chararray, country: chararray, url: chararray );

Report = overall Logs generate
    COUNT(DISTINCT(user)) as user_count,
    COUNT(DISTINCT(country)) as country_count,
    COUNT(DISTINCT(url)) as url_count;

store Report into 'log_report' using PigStorage();
{code}

I do want this in Pig and not as SQL. I'd expect High Level Pig to generate Lower Level Pig.
[jira] Commented: (PIG-801) Pig needs to handle scalar aliases to improve programmer and code execution efficiency
[ https://issues.apache.org/jira/browse/PIG-801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714972#action_12714972 ]

David Ciemiewicz commented on PIG-801:
---------------------------------------

I'm very much beginning to like the idea of introducing some syntactic sugar in Pig for a forall or overall statement that would allow one to write the high level pig for this case as:

{code}
Total = forall CountryPopulations generate SUM(CountryPopulations.population) as population;
{code}

or as:

{code}
Total = overall CountryPopulations generate SUM(CountryPopulations.population) as population;
{code}

Yeah, I know I could use the construct:

{code}
Total = foreach (group CountryPopulations all) generate SUM(CountryPopulations.population) as population;
{code}

But I like syntactic sugar.

Then again, it would be really good if Pig just supported:

{code}
CountryPopulations = load 'country.dat' using PigStorage() as ( country: chararray, population: long );

PopulationProportions = foreach CountryPopulations generate
        country,
        population,
        (double)population / (double)SUM(population) as global_proportion;
{code}

Since this would need to be done for SQL, it could be done for Pig as well.

Pig needs to handle scalar aliases to improve programmer and code execution efficiency
---------------------------------------------------------------------------------------

Key: PIG-801
URL: https://issues.apache.org/jira/browse/PIG-801
Project: Pig
Issue Type: New Feature
Reporter: David Ciemiewicz

In Pig, it is often the case that the result of an operation is a scalar value that needs to be applied to the next step of processing. For example:

* FILTER by MAX of group -- See: PIG-772
* Compute proportions by dividing by total (SUM) of grouped alias

Today Pig programmers need to go through distasteful and slow contortions of using FLATTEN or CROSS to propagate the scalar computation to EVERY row of data to perform these operations, creating needless copies of data. Or, the user must write the global sum to a file, then read it back in to gain the efficiency.

If the language were simply extended to have the notion of scalar aliases, then coding would be simplified without contortions for the programmer and, I believe, execution of the code would be faster too.

For instance, to compute global proportions, I want to do the following:

{code}
CountryPopulations = load 'country.dat' using PigStorage() as ( country: chararray, population: long );

AllCountryPopulations = group CountryPopulations all;

Total = foreach AllCountryPopulations generate SUM(CountryPopulations.population) as population;

PopulationProportions = foreach CountryPopulations generate
        country,
        population,
        (double)population / (double)Total.population as global_proportion;
{code}

One of the very distasteful workarounds for this is to do something like:

{code}
CountryPopulations = load 'country.dat' using PigStorage() as ( country: chararray, population: long );

AllCountryPopulations = group CountryPopulations all;

Total = foreach AllCountryPopulations generate SUM(CountryPopulations.population) as population;

CountryPopulationsTotal = cross CountryPopulations, Total;

PopulationProportions = foreach CountryPopulationsTotal generate
        CountryPopulations::country,
        CountryPopulations::population,
        (double)CountryPopulations::population / (double)Total::population as global_proportion;
{code}

This just makes me cringe every time I have to do it.

Constructing new rows of data simply to apply the same scalar value row after row after row for potentially billions of rows of data just feels horribly wrong and inefficient, both from the coding standpoint and from the execution standpoint.

In SQL, I'd just code this as:

{code}
select country, population, population / SUM(population) from CountryPopulations;
{code}

In writing a SQL to Pig translator, it would seem that this construct or idiom would need to be supported, so why not create a higher level of Pig which would support the notion of scalars efficiently.
[jira] Updated: (PIG-753) Provide support for UDFs without parameters
[ https://issues.apache.org/jira/browse/PIG-753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Ciemiewicz updated PIG-753:
---------------------------------

Summary: Provide support for UDFs without parameters (was: Do not support UDF not providing parameter)

Provide support for UDFs without parameters
-------------------------------------------

Key: PIG-753
URL: https://issues.apache.org/jira/browse/PIG-753
Project: Pig
Issue Type: Improvement
Reporter: Jeff Zhang

Pig does not support UDFs without parameters; it forces me to provide a parameter, as in the following statement:

{code}
B = FOREACH A GENERATE bagGenerator();
{code}

This will generate an error. I have to provide a parameter like the following:

{code}
B = FOREACH A GENERATE bagGenerator($0);
{code}
[jira] Commented: (PIG-807) PERFORMANCE: Provide a way for UDFs to use read-once bags (backed by the Hadoop values iterator)
[ https://issues.apache.org/jira/browse/PIG-807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714227#action_12714227 ]

David Ciemiewicz commented on PIG-807:
---------------------------------------

@Yiping

I see what you mean. Maybe we should have FOREACH and FORALL, as in:

B = FORALL A GENERATE SUM(m);

Another version of this may be:

B = OVER A GENERATE SUM(m);

or

B = OVERALL A GENERATE SUM(m);

There was a hallway conversation about the situation of:

{code}
B = GROUP A BY key;
C = FOREACH B {
    SORTED = ORDER A BY value;
    GENERATE
        COUNT(SORTED) as count,
        QUANTILES(SORTED.value, 0.0, 0.5, 0.75, 0.9, 1.0) as quantiles: (p00, p50, p75, p90, p100);
};
{code}

I was told that a ReadOnce bag would not solve this problem because we'd need to pass through SORTED twice because there were two UDFs.

I disagree. It is possible to pass over this data once and only once if we create a class of Accumulating or Running functions that differs from the current DataBag and AlgebraicDataBag functions.

First, functions like SUM, COUNT, AVG, VAR, MIN, MAX, STDEV, ReservoirSampling, statistics.SUMMARY can all be computed on a ReadOnce / Streaming DataBag of unknown length or size. For each of these functions, we simply add or accumulate the values one row at a time, we can invoke a combiner for intermediate results across partitions, and produce a final result, all without materializing a DataBag as implemented today.

QUANTILES is a different beast. To compute quantiles, the data must be sorted, which I prefer to do outside the UDF at this time. Also, the COUNT of the data is needed a priori. Fortunately, sorting COULD produce a ReadOnce / Streaming DataBag of KNOWN as opposed to unknown length or size, so only two scans through the data (sorting and quantiles) are needed, without needing three scans (sort, count, quantiles).

So, if Pig could understand two additional data types:

ReadOnceSizeUnknown -- COUNT() counts all individual rows
ReadOnceSizeKnown -- COUNT() just returns the size attribute of the ReadOnce data reference

And if Pig had RunningEval and RunningAlgebraicEval classes of functions which accumulate values a row at a time, many computations in Pig could be much much more efficient.

In case anyone doesn't get what I mean by having running functions, here's some Perl code that implements what I'm suggesting. I'll leave it as an exercise for the Pig development team to figure out the RunningAlgebraicEval versions of these functions/classes. :^)

runningsums.pl
{code}
#!/usr/bin/perl

use RunningSum;
use RunningCount;

$a_count = RunningCount->new();
$a_sum   = RunningSum->new();
$b_sum   = RunningSum->new();
$c_sum   = RunningSum->new();

while (<>) {
    s/\r*\n*//g;
    ($a, $b, $c) = split(/\t/);
    $a_count->accumulate($a);
    $a_sum->accumulate($a);
    $b_sum->accumulate($b);
    $c_sum->accumulate($c);
}

print join("\t",
    $a_count->final(),
    $a_sum->final(),
    $b_sum->final(),
    $c_sum->final()
), "\n";
{code}

RunningCount.pm
{code}
package RunningCount;

sub new {
    my $class = shift;
    my $self = {};
    bless $self, $class;
    return $self;
}

sub accumulate {
    my $self = shift;
    my $value = shift;
    $self->{'count'}++;
}

sub final {
    my $self = shift;
    return $self->{'count'};
}

1;
{code}

RunningSum.pm
{code}
package RunningSum;

sub new {
    my $class = shift;
    my $self = {};
    bless $self, $class;
    return $self;
}

sub accumulate {
    my $self = shift;
    my $value = shift;
    $self->{'sum'} += $value;
}

sub final {
    my $self = shift;
    return $self->{'sum'};
}

1;
{code}

PERFORMANCE: Provide a way for UDFs to use read-once bags (backed by the Hadoop values iterator)
------------------------------------------------------------------------------------------------

Key: PIG-807
URL: https://issues.apache.org/jira/browse/PIG-807
Project: Pig
Issue Type: Improvement
Affects Versions: 0.2.1
Reporter: Pradeep Kamath
Fix For: 0.3.0

Currently all bags resulting from a group or cogroup are materialized as bags containing all of the contents. The issue with this is that if a particular key has many corresponding values, all these values get stuffed in a bag which may run out of memory and hence spill, causing slowdown in performance and sometimes memory exceptions. In many cases, the udfs which use these bags coming out of a group and cogroup only need to iterate over the bag in a unidirectional read-once manner. This can be implemented by having the bag implement its iterator by simply iterating over the underlying hadoop iterator provided in the reduce. This kind of a bag is also needed in http://issues.apache.org/jira/browse/PIG-802, so the code can be reused for this issue too.
[jira] Commented: (PIG-807) PERFORMANCE: Provide a way for UDFs to use read-once bags (backed by the Hadoop values iterator)
[ https://issues.apache.org/jira/browse/PIG-807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709007#action_12709007 ]

David Ciemiewicz commented on PIG-807:
---------------------------------------

Certainly SUM, COUNT, AVG could all use this. In fact, technically speaking, SUM, COUNT, and AVG shouldn't even necessarily need a prior GROUP ... ALL statement. How would this factor into the thinking on this?

While you're thinking about this, we might also consider another optimization as well ... what if I have 10 to 100 SUM operations in the same FOREACH ... GENERATE statement? Materializing a DataBag or even a ReadOnce Bag for each column of data is REALLY slow. In working through this, would providing access to the underlying hadoop iterators permit a single scan through the data rather than multiple scans, one for each column?

Example:

{code}
A = load ...
B = group A all;
C = foreach B generate
        COUNT(A),
        SUM(A.m),
        SUM(A.n),
        SUM(A.o),
        SUM(A.p),
        SUM(A.q),
        SUM(A.r),
        ...
{code}

PERFORMANCE: Provide a way for UDFs to use read-once bags (backed by the Hadoop values iterator)
------------------------------------------------------------------------------------------------

Key: PIG-807
URL: https://issues.apache.org/jira/browse/PIG-807
Project: Pig
Issue Type: Improvement
Affects Versions: 0.2.1
Reporter: Pradeep Kamath
Fix For: 0.3.0

Currently all bags resulting from a group or cogroup are materialized as bags containing all of the contents. The issue with this is that if a particular key has many corresponding values, all these values get stuffed in a bag which may run out of memory and hence spill, causing slowdown in performance and sometimes memory exceptions. In many cases, the udfs which use these bags coming out of a group and cogroup only need to iterate over the bag in a unidirectional read-once manner. This can be implemented by having the bag implement its iterator by simply iterating over the underlying hadoop iterator provided in the reduce. This kind of a bag is also needed in http://issues.apache.org/jira/browse/PIG-802, so the code can be reused for this issue too.

The other part of this issue is to have some way for the udfs to communicate to Pig that any input bags they need are read-once bags. This can be achieved by having an interface - say UsesReadOnceBags - which serves as a tag to indicate the intent to Pig. Pig can then rewire its execution plan to use ReadOnceBags where feasible.
[jira] Commented: (PIG-734) Non-string keys in maps
[ https://issues.apache.org/jira/browse/PIG-734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12707377#action_12707377 ]

David Ciemiewicz commented on PIG-734:
---------------------------------------

Alan, I don't think this is going to be that problematic. Even if I try to pass in a map dereference with an integer, such as mymap#1, would Pig automagically convert the 1 to the string equivalent, mymap#'1'? If so, I think this would be quite acceptable.

Non-string keys in maps
-----------------------

Key: PIG-734
URL: https://issues.apache.org/jira/browse/PIG-734
Project: Pig
Issue Type: Bug
Affects Versions: 0.2.0
Reporter: Alan Gates
Assignee: Alan Gates
Priority: Minor
Fix For: 0.3.0
Attachments: PIG-734.patch

With the addition of types to pig, maps were changed to allow any atomic type to be a key. However, in practice we do not see people using keys other than strings. And allowing multiple types is causing us issues in serializing data (we have to check what every key type is) and in the design for non-java UDFs (since many scripting languages include associative arrays such as Perl's hash). So I propose we scope back maps to only have string keys. This would be a non-compatible change. But I am not aware of anyone using non-string keys, so hopefully it would have little or no impact.
[jira] Updated: (PIG-801) Pig needs to handle scalar aliases to improve programmer and code execution efficiency
[ https://issues.apache.org/jira/browse/PIG-801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Ciemiewicz updated PIG-801:
---------------------------------

Summary: Pig needs to handle scalar aliases to improve programmer and code execution efficiency (was: Pig needs to handle scalar aliases to programmer and code execution efficiency)

Pig needs to handle scalar aliases to improve programmer and code execution efficiency
---------------------------------------------------------------------------------------

Key: PIG-801
URL: https://issues.apache.org/jira/browse/PIG-801
Project: Pig
Issue Type: New Feature
Reporter: David Ciemiewicz

In Pig, it is often the case that the result of an operation is a scalar value that needs to be applied to the next step of processing. For example:

* FILTER by MAX of group -- See: PIG-772
* Compute proportions by dividing by total (SUM) of grouped alias

Today Pig programmers need to go through distasteful and slow contortions of using FLATTEN or CROSS to propagate the scalar computation to EVERY row of data to perform these operations, creating needless copies of data. Or, the user must write the global sum to a file, then read it back in to gain the efficiency.

If the language were simply extended to have the notion of scalar aliases, then coding would be simplified without contortions for the programmer and, I believe, execution of the code would be faster too.

For instance, to compute global proportions, I want to do the following:

{code}
CountryPopulations = load 'country.dat' using PigStorage() as ( country: chararray, population: long );

AllCountryPopulations = group CountryPopulations all;

Total = foreach AllCountryPopulations generate SUM(CountryPopulations.population) as population;

PopulationProportions = foreach CountryPopulations generate
        country,
        population,
        (double)population / (double)Total.population as global_proportion;
{code}

One of the very distasteful workarounds for this is to do something like:

{code}
CountryPopulations = load 'country.dat' using PigStorage() as ( country: chararray, population: long );

AllCountryPopulations = group CountryPopulations all;

Total = foreach AllCountryPopulations generate SUM(CountryPopulations.population) as population;

CountryPopulationsTotal = cross CountryPopulations, Total;

PopulationProportions = foreach CountryPopulationsTotal generate
        CountryPopulations::country,
        CountryPopulations::population,
        (double)CountryPopulations::population / (double)Total::population as global_proportion;
{code}

This just makes me cringe every time I have to do it. Constructing new rows of data simply to apply the same scalar value row after row after row for potentially billions of rows of data just feels horribly wrong and inefficient, both from the coding standpoint and from the execution standpoint.

In SQL, I'd just code this as:

{code}
select country, population, population / SUM(population) from CountryPopulations;
{code}

In writing a SQL to Pig translator, it would seem that this construct or idiom would need to be supported, so why not create a higher level of Pig which would support the notion of scalars efficiently.
[jira] Commented: (PIG-602) Pass global configurations to UDF
[ https://issues.apache.org/jira/browse/PIG-602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12705812#action_12705812 ]

David Ciemiewicz commented on PIG-602:
---------------------------------------

JIRA PIG-477 is related to this, I think.

Pass global configurations to UDF
---------------------------------

Key: PIG-602
URL: https://issues.apache.org/jira/browse/PIG-602
Project: Pig
Issue Type: New Feature
Components: impl
Reporter: Yiping Han
Assignee: Alan Gates

We are seeking an easy way to pass a large number of global configurations to UDFs. Our application contains many pig jobs and has a large number of configurations, so passing configurations through the command line is not ideal (modifying a single parameter requires changing multiple command lines), and putting everything into the hadoop conf is not ideal either. We would like to see if Pig can provide a facility that allows us to pass a configuration file in some format (XML?) and then makes it available throughout all the UDFs.
[jira] Commented: (PIG-477) passing properties from command line to the backend
[ https://issues.apache.org/jira/browse/PIG-477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12705814#action_12705814 ]

David Ciemiewicz commented on PIG-477:
---------------------------------------

PIG-602 is related to this, I think.

passing properties from command line to the backend
----------------------------------------------------

Key: PIG-477
URL: https://issues.apache.org/jira/browse/PIG-477
Project: Pig
Issue Type: Improvement
Affects Versions: 0.2.0
Reporter: Olga Natkovich

We have users that would like to be able to pass parameters from the command line to their UDFs. A natural way to do that would be to pass them as properties from the client to the compute node and make them available through System.getProperties on the backend.
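[Editorial note: from a UDF author's point of view, the requested facility would look roughly like this sketch. The property name myapp.cutoff is hypothetical; the assumption, per the description, is that client-side properties surface via System.getProperty on the backend:]

{code}
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class Threshold extends EvalFunc<Boolean> {
    // Read a client-supplied property, with a default if it was not passed.
    private final int cutoff =
        Integer.parseInt(System.getProperty("myapp.cutoff", "100"));

    @Override
    public Boolean exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0) return null;
        return ((Number) input.get(0)).intValue() >= cutoff;
    }
}
{code}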
[jira] Commented: (PIG-741) Add LIMIT as a statement that works in nested FOREACH
[ https://issues.apache.org/jira/browse/PIG-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12704452#action_12704452 ]

David Ciemiewicz commented on PIG-741:
---------------------------------------

Thanks Alan!

The fact that LIMIT in this case doesn't use the combiner is probably not an issue. In most of the instances I have, I usually don't have more than a million things in the grouped databag; most of the time I only have under 1000 to 1 things, so the combiner won't have much value.

Add LIMIT as a statement that works in nested FOREACH
-----------------------------------------------------

Key: PIG-741
URL: https://issues.apache.org/jira/browse/PIG-741
Project: Pig
Issue Type: New Feature
Reporter: David Ciemiewicz
Assignee: Alan Gates
Fix For: 0.3.0
Attachments: PIG-741.patch

I'd like to compute the top 10 results in each group. The natural way to express this in Pig would be:

{code}
A = load '...' using PigStorage() as ( date: int, count: int, url: chararray );

B = group A by ( date );

C = foreach B {
    D = order A by count desc;
    E = limit D 10;
    generate FLATTEN(E);
};

dump C;
{code}

Yeah, I could write a UDF / PiggyBank function to take the top n results. But since LIMIT already exists as a statement, it seems like it should also work in the nested foreach context.

Example workaround code:

{code}
C = foreach B {
    D = order A by count desc;
    E = util.TOP(D, 10);
    generate FLATTEN(E);
};

dump C;
{code}
[jira] Commented: (PIG-777) Code refactoring: Create optimization out of store/load post processing code
[ https://issues.apache.org/jira/browse/PIG-777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703764#action_12703764 ]

David Ciemiewicz commented on PIG-777:
---------------------------------------

Another thing ... if you eliminate the D = load statement, could you provide some information to the user that this optimization is taking place?

It would help me immensely with code maintenance if I could eliminate the D = load steps, which often require recoding the AS clause schema.

Code refactoring: Create optimization out of store/load post processing code
-----------------------------------------------------------------------------

Key: PIG-777
URL: https://issues.apache.org/jira/browse/PIG-777
Project: Pig
Issue Type: Improvement
Reporter: Gunther Hagleitner

The postProcessing method in the pig server checks whether a logical graph contains stores to and loads from the same location. If so, it will either connect the store and load, or optimize by throwing out the load and connecting the store's predecessor with the successor of the load. Ideally the introduction of the store and load connection should happen in the query compiler, while the optimization should happen in a separate optimizer step as part of the optimizer framework.
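[Editorial note: concretely, the pattern being detected is a store followed by a load of the same location. A sketch, with path and schema invented for illustration:]

{code}
store C into 'tmp/daily_summary' using PigStorage();
-- the load below hits the same location; the optimization would splice
-- D's successors directly onto C and drop the load (and its AS clause)
D = load 'tmp/daily_summary' using PigStorage() as ( date: int, count: long );
{code}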
[jira] Commented: (PIG-771) PigDump does not properly output Chinese UTF8 characters - they are displayed as question marks ??
[ https://issues.apache.org/jira/browse/PIG-771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703219#action_12703219 ] David Ciemiewicz commented on PIG-771: -- I'm just using Mac OS terminal to connect to a RHEL-4 gateway server to a RHEL-4 grid. I changed the code to use the PigDump() storage format for the STORE statement and reran the code, to eliminate the terminal aspect. Pig itself is writing the question marks ('?', 0x3f). {code}
-bash-3.00$ cat ch2.pig
A = load 'ch.txt' using PigStorage() as (str: chararray);
store A into 'ch.dmp' using PigDump();
-bash-3.00$ hadoop fs -cat ch.dmp/*
(????)
-bash-3.00$ hadoop fs -cat ch.dmp/* | od -xc
0000000    3f28    3f3f    293f    000a
           (   ?   ?   ?   ?   )  \n  \0
0000007
{code} PigDump does not properly output Chinese UTF8 characters - they are displayed as question marks ?? -- Key: PIG-771 URL: https://issues.apache.org/jira/browse/PIG-771 Project: Pig Issue Type: Bug Reporter: David Ciemiewicz PigDump does not properly output Chinese UTF8 characters. The reason for this is that the function Tuple.toString() is called. DefaultTuple implements Tuple.toString() and it calls Object.toString() on the opaque object d. Instead, I think that the code should be changed to call the new DataType.toString() function. {code}
@Override
public String toString() {
    StringBuilder sb = new StringBuilder();
    sb.append('(');
    for (Iterator<Object> it = mFields.iterator(); it.hasNext();) {
        Object d = it.next();
        if (d != null) {
            if (d instanceof Map) {
                sb.append(DataType.mapToString((Map<Object, Object>)d));
            } else {
                sb.append(DataType.toString(d)); // Change this one line
                if (d instanceof Long) {
                    sb.append("L");
                } else if (d instanceof Float) {
                    sb.append("F");
                }
            }
        } else {
            sb.append("");
        }
        if (it.hasNext()) sb.append(",");
    }
    sb.append(')');
    return sb.toString();
}
{code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-786) Default job.names - script name, load file pattern, store file pattern, sub task type
Default job.names - script name, load file pattern, store file pattern, sub task type - Key: PIG-786 URL: https://issues.apache.org/jira/browse/PIG-786 Project: Pig Issue Type: Improvement Reporter: David Ciemiewicz Priority: Trivial I have very complex Pig scripts which are often concatenations and iterations of a large number of map reduce tasks. I've gotten into the habit of using the following construct in my code: {code}
set job.name '$DIR/$DATE/summary.bz';
A = load ...
...
store Z into '$DIR/$DATE/summary.bz' using PigStorage();
{code} But it would be really useful if Pig script parsing automagically set these job.name values. Ideally I'd like to have Pig just automagically construct job names for me so I can trace execution of multihour jobs in the HOD progress pages. Something like: {code}
process-dates.pig
A = LOAD /data/logs/daily/20090408 ...
STORE Z into mysummary/20090408/summary.bz
map-group-combiner-sort
{code} Okay you say, I could construct this kind of job.name myself if this is what I want. Well:
1) I'd really like to have a default constructed by Pig so I don't have to
2) Pig has information about what is happening that I don't have, such as:
* The name of the script passed to Pig
* The glob expansion of the file pathname in the LOAD statement
* The execution plan of pig that would tell me what the map-group-combine-sort-reduce group looks like
* The name of intermediate STORE operations that are being performed
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-786) Default job.names - script name, load file pattern, store file pattern, sub task type
[ https://issues.apache.org/jira/browse/PIG-786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12702931#action_12702931 ] David Ciemiewicz commented on PIG-786: -- I think the desire for global configuration information and the notion of environment variables might be related: PIG-602 Default job.names - script name, load file pattern, store file pattern, sub task type - Key: PIG-786 URL: https://issues.apache.org/jira/browse/PIG-786 Project: Pig Issue Type: Improvement Reporter: David Ciemiewicz Priority: Trivial I have very complex Pig scripts which are often concatenations and iterations of a large number of map reduce tasks. I've gotten into the habit of using the following construct in my code: {code}
set job.name '$DIR/$DATE/summary.bz';
A = load ...
...
store Z into '$DIR/$DATE/summary.bz' using PigStorage();
{code} But it would be really useful if Pig script parsing automagically set these job.name values. Ideally I'd like to have Pig just automagically construct job names for me so I can trace execution of multihour jobs in the HOD progress pages. Something like: {code}
process-dates.pig
A = LOAD /data/logs/daily/20090408 ...
STORE Z into mysummary/20090408/summary.bz
map-group-combiner-sort
{code} Okay you say, I could construct this kind of job.name myself if this is what I want. Well:
1) I'd really like to have a default constructed by Pig so I don't have to
2) Pig has information about what is happening that I don't have, such as:
* The name of the script passed to Pig
* The glob expansion of the file pathname in the LOAD statement
* The execution plan of pig that would tell me what the map-group-combine-sort-reduce group looks like
* The name of intermediate STORE operations that are being performed
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-755) Difficult to debug parameter substitution problems based on the error messages when running in local mode
[ https://issues.apache.org/jira/browse/PIG-755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12702703#action_12702703 ] David Ciemiewicz commented on PIG-755: -- Thanks. Didn't know about the dry-run option. Hopefully it will someday produce UTF-8 text given some of the parameters will be in Chinese or Japanese characters. :^) Difficult to debug parameter substitution problems based on the error messages when running in local mode - Key: PIG-755 URL: https://issues.apache.org/jira/browse/PIG-755 Project: Pig Issue Type: Bug Components: grunt Affects Versions: 0.3.0 Reporter: Viraj Bhat Fix For: 0.3.0 Attachments: inputfile.txt, localparamsub.pig I have a script in which I do a parameter substitution for the input file. I have a use case where I find it difficult to debug based on the error messages in local mode. {code} A = load '$infile' using PigStorage() as ( date: chararray, count : long, gmean : double ); dump A; {code} 1) I run it in local mode with the input file in the current working directory {code} prompt $ java -cp pig.jar:/path/to/hadoop/conf/ org.apache.pig.Main -exectype local -param infile='inputfile.txt' localparamsub.pig {code} 2009-04-07 00:03:51,967 [main] ERROR org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POStore - Received error from storer function: org.apache.pig.backend.executionengine.ExecException: ERROR 2081: Unable to setup the load function. 2009-04-07 00:03:51,970 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Failed jobs!! 2009-04-07 00:03:51,971 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - 1 out of 1 failed! 2009-04-07 00:03:51,974 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias A Details at logfile: /home/viraj/pig-svn/trunk/pig_1239062631414.log ERROR 1066: Unable to open iterator for alias A org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias A at org.apache.pig.PigServer.openIterator(PigServer.java:439) at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:359) at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:193) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:99) at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:88) at org.apache.pig.Main.main(Main.java:352) Caused by: java.io.IOException: Job terminated with anomalous status FAILED at org.apache.pig.PigServer.openIterator(PigServer.java:433) ... 5 more 2) I run it in map reduce mode {code} prompt $ java -cp pig.jar:/path/to/hadoop/conf/ org.apache.pig.Main -param infile='inputfile.txt' localparamsub.pig {code} 2009-04-07 00:07:31,660 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost:9000 2009-04-07 00:07:32,074 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: localhost:9001 2009-04-07 00:07:34,543 [Thread-7] WARN org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same. 
2009-04-07 00:07:39,540 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete 2009-04-07 00:07:39,540 [main] ERROR org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Map reduce job failed 2009-04-07 00:07:39,563 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2100: inputfile does not exist. Details at logfile: /home/viraj/pig-svn/trunk/pig_1239062851400.log ERROR 2100: inputfile does not exist. org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias A at org.apache.pig.PigServer.openIterator(PigServer.java:439) at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:359) at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:193) at
[jira] Commented: (PIG-506) Does pig need a NATIVE keyword?
[ https://issues.apache.org/jira/browse/PIG-506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12702704#action_12702704 ] David Ciemiewicz commented on PIG-506: -- Alan, This seems a much cleaner way to set up native Hadoop map-reduce jobs than the command line interfaces people use today. Might be worth it just for that alone. I think you'd need to gather some examples from non-Pig users and prototype them as Pig/NATIVE scripts to demonstrate what the advantages would be. For me, as a primary Pig user, there is some appeal because I could benefit from borrowing others' code. Does pig need a NATIVE keyword? --- Key: PIG-506 URL: https://issues.apache.org/jira/browse/PIG-506 Project: Pig Issue Type: New Feature Components: impl Reporter: Alan Gates Assignee: Alan Gates Priority: Minor Assume a user had a job that broke easily into three pieces. Further assume that pieces one and three were easily expressible in pig, but that piece two needed to be written in map reduce for whatever reason (performance, something that pig could not easily express, legacy job that was too important to change, etc.). Today the user would either have to use map reduce for the entire job or manually handle the stitching together of pig and map reduce jobs. What if instead pig provided a NATIVE keyword that would allow the script to pass off the data stream to the underlying system (in this case map reduce). The semantics of NATIVE would vary by underlying system. In the map reduce case, we would assume that this indicated a collection of one or more fully contained map reduce jobs, so that pig would store the data, invoke the map reduce jobs, and then read the resulting data to continue. It might look something like this: {code}
A = load 'myfile';
X = load 'myotherfile';
B = group A by $0;
C = foreach B generate group, myudf(B);
D = native (jar=mymr.jar, infile=frompig outfile=topig);
E = join D by $0, X by $0;
...
{code} This differs from streaming in that it allows the user to insert an arbitrary amount of native processing, whereas streaming allows the insertion of one binary. It also differs in that, for streaming, data is piped directly into and out of the binary as part of the pig pipeline. Here the pipeline would be broken, data written to disk, and the native block invoked, then data read back from disk. Another alternative is to say this is unnecessary because the user can do the coordination from java, using the PigServer interface to run pig and calling the map reduce job explicitly. The advantages of the native keyword are that the user need not be worried about coordination between the jobs, pig will take care of it. Also the user can make use of existing java applications without being a java programmer. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-602) Pass global configurations to UDF
[ https://issues.apache.org/jira/browse/PIG-602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12702707#action_12702707 ] David Ciemiewicz commented on PIG-602: -- This sounds a lot like shell script environment variables. As such, maybe it should follow the same rich level of operations and semantics that you get with environment variables. How is PigConf different from set properties in Pig? Why can't both use the same mechanism? Should they use the same mechanism? Can / should this same mechanism let my UDFs know when Pig is in local mode versus hdfs mode? (See JIRA PIG-756 - or should something different be used?) When in grunt, how can I inspect what the current PigConf values are? (Useful for logging and debugging.) By what mechanism can I set or override these values from within my Pig script? Can I set the values to be one thing at one point in the Pig script and change it later to a new value in the Pig script? Pass global configurations to UDF - Key: PIG-602 URL: https://issues.apache.org/jira/browse/PIG-602 Project: Pig Issue Type: New Feature Components: impl Reporter: Yiping Han Assignee: Alan Gates We are seeking an easy way to pass a large number of global configurations to UDFs. Our application contains many pig jobs and has a large number of configurations. Passing configurations through the command line is not an ideal way (i.e. modifying a single parameter requires changing multiple command lines), and putting everything into the hadoop conf is not ideal either. We would like to see if Pig can provide a facility that allows us to pass a configuration file in some format (XML?) and then make it available throughout all the UDFs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
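A sketch of the script-level semantics the questions above imply, assuming a hypothetical set-style property and a hypothetical UDF normalize() that reads the value on the backend; none of this is an existing Pig API:
{code}
set myapp.locale 'zh_TW';              -- visible to UDFs from this point on
A = load 'queries' using PigStorage() as ( q: chararray );
B = foreach A generate normalize(q);   -- normalize() would read myapp.locale
set myapp.locale 'ja_JP';              -- later override within the same script
C = foreach A generate normalize(q);
store B into 'zh_out';
store C into 'ja_out';
{code}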
[jira] Commented: (PIG-784) PigStorage() - need ability to turn off Attempt to access field warnings
[ https://issues.apache.org/jira/browse/PIG-784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12702528#action_12702528 ] David Ciemiewicz commented on PIG-784: -- @Santhosh Hmmm. I'm running Pig in local mode with the latest published build and I get lots of warnings and they are not aggregated: -bash-3.00$ pig -exectype local -latest cat.pig USING: /grid/0/gs/pig/current 2009-04-24 20:02:55,666 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject: Attempt to access field which was not found in the input 2009-04-24 20:02:55,667 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject: Attempt to access field which was not found in the input 2009-04-24 20:02:55,668 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject: Attempt to access field which was not found in the input 2009-04-24 20:02:55,668 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject: Attempt to access field which was not found in the input 2009-04-24 20:02:55,668 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject: Attempt to access field which was not found in the input 2009-04-24 20:02:55,668 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject: Attempt to access field which was not found in the input 2009-04-24 20:02:55,668 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject: Attempt to access field which was not found in the input 2009-04-24 20:02:55,668 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject: Attempt to access field which was not found in the input 2009-04-24 20:02:55,668 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject: Attempt to access field which was not found in the input 2009-04-24 20:02:55,669 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject: Attempt to access field which was not found in the input 2009-04-24 20:02:55,669 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject: Attempt to access field which was not found in the input 2009-04-24 20:02:55,669 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject: Attempt to access field which was not 
found in the input 2009-04-24 20:02:55,669 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject: Attempt to access field which was not found in the input 2009-04-24 20:02:55,669 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject: Attempt to access field which was not found in the input 2009-04-24 20:02:55,669 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject: Attempt to access field which was not found in the input 2009-04-24 20:02:55,669 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject: Attempt to access field which was not found in the input 2009-04-24 20:02:55,672 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - 100% complete! 2009-04-24 20:02:55,672 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Success!! (a,1,42.0F) (,,) (,,) (,,) (,,)
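The cat.pig script itself is not shown in the report above; a minimal reconstruction that reproduces this class of warning (the file name and schema are assumptions) is a load whose AS clause declares more fields than some input rows actually contain:
{code}
-- rows of cat.txt with fewer than three columns trigger one POProject
-- warning per missing field, consistent with the (a,1,42.0F) and (,,)
-- tuples in the output above:
A = load 'cat.txt' using PigStorage() as ( a: chararray, b: int, c: float );
dump A;
{code}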
[jira] Commented: (PIG-774) Pig does not handle Chinese characters (in both the parameter substitution using -param_file or embedded in the Pig script) correctly
[ https://issues.apache.org/jira/browse/PIG-774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12702533#action_12702533 ] David Ciemiewicz commented on PIG-774: -- A somewhat related bug is JIRA PIG-755 - the difficulty of debugging issues related to passed parameters. If Pig produced an output file of the code with parameter substitutions made, we could have more rapidly isolated the problem. Pig does not handle Chinese characters (in both the parameter substitution using -param_file or embedded in the Pig script) correctly Key: PIG-774 URL: https://issues.apache.org/jira/browse/PIG-774 Project: Pig Issue Type: Bug Components: grunt, impl Affects Versions: 0.0.0 Reporter: Viraj Bhat Priority: Critical Fix For: 0.0.0 Attachments: chinese.txt, chinese_data.pig, nextgen_paramfile I created a very small test case in which I did the following. 1) Created a UTF-8 file which contained a query string in Chinese and wrote it to HDFS. I used this dfs file as an input for the tests. 2) Created a parameter file which also contained the same query string as in Step 1. 3) Created a Pig script which takes in the parametrized query string and a hard-coded Chinese character string. Pig script: chinese_data.pig {code}
rmf chineseoutput;
I = load '/user/viraj/chinese.txt' using PigStorage('\u0001');
J = filter I by $0 == '$querystring';
--J = filter I by $0 == ' 歌手香港情牽女人心演唱會';
store J into 'chineseoutput';
dump J;
{code}
====
Parameter file: nextgen_paramfile
====
queryid=20090311
querystring=' 歌手香港情牽女人心演唱會'
====
Input file: /user/viraj/chinese.txt
====
shell$ hadoop fs -cat /user/viraj/chinese.txt
歌手香港情牽女人心演唱會
====
I ran the above set of inputs in the following ways: Run 1:
====
java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' org.apache.pig.Main -param_file nextgen_paramfile chinese_data.pig
====
2009-04-22 01:31:35,703 [Thread-7] WARN org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2009-04-22 01:31:40,700 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2009-04-22 01:31:50,720 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2009-04-22 01:31:50,720 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
====
Run 2: removed the parameter substitution in the Pig script and instead used the following statement:
====
J = filter I by $0 == ' 歌手香港情牽女人心演唱會';
====
java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' org.apache.pig.Main chinese_data_withoutparam.pig
====
2009-04-22 01:35:22,402 [Thread-7] WARN org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2009-04-22 01:35:27,399 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2009-04-22 01:35:32,415 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2009-04-22 01:35:32,415 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
====
In both cases:
====
ucdev6 01:39:22 ~/pig-svn/trunk $ hadoop fs -ls /user/viraj/chineseoutput
Found 2 items
drwxr-xr-x - viraj supergroup 0 2009-04-22 01:37 /user/viraj/chineseoutput/_logs
-rw-r--r-- 3 viraj supergroup 0 2009-04-22 01:37 /user/viraj/chineseoutput/part-0
[jira] Commented: (PIG-759) HBaseStorage scheme for Load/Slice function
[ https://issues.apache.org/jira/browse/PIG-759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12702535#action_12702535 ] David Ciemiewicz commented on PIG-759: -- If hbase has named columns in its schema, why wouldn't it be appropriate to say something like: table = load '$tablename/$subsection' using HBaseStorage() as (a, b); Since HBaseStorage() is specified: 1) Isn't hbase:// implicit? 2) Shouldn't I be able to just specify the names in the AS clause? HBaseStorage scheme for Load/Slice function --- Key: PIG-759 URL: https://issues.apache.org/jira/browse/PIG-759 Project: Pig Issue Type: Bug Reporter: Gunther Hagleitner We would like to change the HBaseStorage function to use a scheme when loading a table in pig. The scheme we are thinking of is: hbase. So in order to load an hbase table in a pig script the statement should read: {noformat} table = load 'hbase://tablename' using HBaseStorage(); {noformat} If the scheme is omitted pig would assume the tablename to be an hdfs path and the storage function would use the last component of the path as a table name and output a warning. For details on why see jira issue: PIG-758 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
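The two spellings under discussion, side by side; the table and column names here are illustrative only:
{code}
-- proposal in the issue: explicit scheme, table name in the URI
T1 = load 'hbase://users' using HBaseStorage();
-- suggestion in the comment above: scheme implied by the storage
-- function, columns picked out by the AS clause
T2 = load 'users' using HBaseStorage() as (name, age);
{code}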
[jira] Created: (PIG-771) PigDump does not properly output Chinese UTF8 characters - they are displayed as question marks ??
PigDump does not properly output Chinese UTF8 characters - they are displayed as question marks ?? -- Key: PIG-771 URL: https://issues.apache.org/jira/browse/PIG-771 Project: Pig Issue Type: Bug Reporter: David Ciemiewicz PigDump does not properly output Chinese UTF8 characters. The reason for this is that the function Tuple.toString() is called. DefaultTuple implements Tuple.toString() and it calls Object.toString() on the opaque object d. Instead, I think that the code should be changed to call the new DataType.toString() function. {code}
@Override
public String toString() {
    StringBuilder sb = new StringBuilder();
    sb.append('(');
    for (Iterator<Object> it = mFields.iterator(); it.hasNext();) {
        Object d = it.next();
        if (d != null) {
            if (d instanceof Map) {
                sb.append(DataType.mapToString((Map<Object, Object>)d));
            } else {
                sb.append(DataType.toString(d)); // Change this one line
                if (d instanceof Long) {
                    sb.append("L");
                } else if (d instanceof Float) {
                    sb.append("F");
                }
            }
        } else {
            sb.append("");
        }
        if (it.hasNext()) sb.append(",");
    }
    sb.append(')');
    return sb.toString();
}
{code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-771) PigDump does not properly output Chinese UTF8 characters - they are displayed as question marks ??
[ https://issues.apache.org/jira/browse/PIG-771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12700845#action_12700845 ] David Ciemiewicz commented on PIG-771: -- I was going to submit a patch for this one line change, but I discovered in compiling the code that DataType.toString(d) throws an ExecException. Oddly, DataType.mapToString DOES NOT throw any Exceptions, which is inconsistent with the other DataType.to... functions. I am not sure how to best implement the try / catch / throw for this particular case. Also, in doing the code review of DataType.mapToString(...) I discovered that it will also have problems with correctly dumping the data contained within it, because it too uses Object.toString() on opaque data handles. So the code for DataType.mapToString(...) should also use DataType.toString(Object). But now I hit a recursion problem. DataType.toString(Object) does not work for complex types, so maps of maps will not be recursed properly. So DataType.toString(Object) should probably be enhanced to work on Maps as well. But now we have another problem ... PigDump wants to append L and F for Long values and Float values. But this won't work for nested structures. PigDump does not properly output Chinese UTF8 characters - they are displayed as question marks ?? -- Key: PIG-771 URL: https://issues.apache.org/jira/browse/PIG-771 Project: Pig Issue Type: Bug Reporter: David Ciemiewicz PigDump does not properly output Chinese UTF8 characters. The reason for this is that the function Tuple.toString() is called. DefaultTuple implements Tuple.toString() and it calls Object.toString() on the opaque object d. Instead, I think that the code should be changed to call the new DataType.toString() function. {code}
@Override
public String toString() {
    StringBuilder sb = new StringBuilder();
    sb.append('(');
    for (Iterator<Object> it = mFields.iterator(); it.hasNext();) {
        Object d = it.next();
        if (d != null) {
            if (d instanceof Map) {
                sb.append(DataType.mapToString((Map<Object, Object>)d));
            } else {
                sb.append(DataType.toString(d)); // Change this one line
                if (d instanceof Long) {
                    sb.append("L");
                } else if (d instanceof Float) {
                    sb.append("F");
                }
            }
        } else {
            sb.append("");
        }
        if (it.hasNext()) sb.append(",");
    }
    sb.append(')');
    return sb.toString();
}
{code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-750) Use combiner when a mix of algebraic and non-algebraic functions are used
[ https://issues.apache.org/jira/browse/PIG-750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12700195#action_12700195 ] David Ciemiewicz commented on PIG-750: -- Also consider the application of a scalar function to the result of an aggregation function: 3) foreach X generate EXP(AVG(b)) Use combiner when a mix of algebraic and non-algebraic functions are used - Key: PIG-750 URL: https://issues.apache.org/jira/browse/PIG-750 Project: Pig Issue Type: Improvement Reporter: Amir Youssefi Priority: Minor Currently Pig uses the combiner only when all of a, b, c, ... are algebraic (e.g. SUM, AVG, etc.) in a foreach: foreach X generate a, b, c, ... It would be a performance improvement to use the combiner when a mix of algebraic and non-algebraic functions is used as well. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
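A sketch of the mixed case in question; MEDIAN here stands in for any non-algebraic aggregate and is a hypothetical UDF, not a built-in:
{code}
B = group A by user;
-- SUM is algebraic and could be partially computed in the combiner;
-- today the presence of the non-algebraic MEDIAN disables the combiner
-- for the whole statement, including the SUM:
C = foreach B generate group, SUM(A.clicks), MEDIAN(A.latency);
{code}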
[jira] Commented: (PIG-760) Serialize schemas for PigStorage() and other storage types.
[ https://issues.apache.org/jira/browse/PIG-760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12697631#action_12697631 ] David Ciemiewicz commented on PIG-760: -- Sure, you could do that, create PigStorageSchema. The thing is, I don't think it is necessary, and it is possible to do this in a backward compatible way. First, if the user specifies a LOAD ... AS clause schema, then PigStorage could simply use that casting to override what is in the .schema. Of course, PigStorage might want to warn that there is an override at run time, or do a smart warning only if there are incompatible differences between the serialized schema and the explicit AS clause schema. Next, is there really any harm in creating the serialized schema file on each and every STORE? Finally, why subclass when we could parameterize? In other words, instead of writing: store A into 'file' using PigStorageSchema(); Why not do: store A into 'file' using PigStorage('schema=yes'); -- redundant, schema=yes is the default I think it would be more useful to have single classes with parameterized options than a proliferation of classes. Or, better yet, why can't I just define the behavior of PigStorage() for all of the instances in my script: define PigStorage PigStorage( 'sep=\t', 'schema=yes', 'erroronmissingcolumn=no' ); I have recently done similar things for other functions and it turns out to be a nice way of capturing global parameterizations for cleaner Pig code. Serialize schemas for PigStorage() and other storage types. --- Key: PIG-760 URL: https://issues.apache.org/jira/browse/PIG-760 Project: Pig Issue Type: New Feature Reporter: David Ciemiewicz I'm finding PigStorage() really convenient for storage and data interchange because it compresses well and imports into Excel and other analysis environments well. However, it is a pain when it comes to maintenance because the columns are in fixed locations and I'd like to add columns in some cases. It would be great if load PigStorage() could read a default schema from a .schema file stored with the data and if store PigStorage() could store a .schema file with the data. I have tested this out and both Hadoop HDFS and Pig in -exectype local mode will ignore a file called .schema in a directory of part files. So, for example, if I have a chain of Pig scripts I execute such as: A = load 'data-1' using PigStorage() as ( a: int, b: int ); store A into 'data-2' using PigStorage(); B = load 'data-2' using PigStorage(); describe B; describe B should output something like { a: int, b: int } -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
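A sketch of the backward compatible behavior argued for above, assuming the .schema mechanism existed; the override-with-warning behavior is the proposal, not current PigStorage:
{code}
store A into 'data-2' using PigStorage();     -- would also write data-2/.schema
B = load 'data-2' using PigStorage();         -- B would pick up (a: int, b: int) from .schema
C = load 'data-2' using PigStorage() as ( a: long, b: int );
-- the explicit AS clause overrides the serialized schema; PigStorage
-- could warn only when the two are incompatible
{code}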
[jira] Commented: (PIG-729) Use of default parallelism
[ https://issues.apache.org/jira/browse/PIG-729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12697641#action_12697641 ] David Ciemiewicz commented on PIG-729: -- Ah wait, I just read what Olga wrote again. I think there might be a hybrid solution that handles both cases without having to do -param. We should add to Pig a -set option that lets us set values for things that we would set in our scripts. pig -set parallelism=5 is equivalent to the following idiom in my pig script: set parallelism 5; Command line -set options should override explicit set statements in the pig script, with a warning of the override. I think this generalized mechanism would satisfy both my desires as a developer and Olga's desire to reduce pig development team code maintenance headaches. Use of default parallelism -- Key: PIG-729 URL: https://issues.apache.org/jira/browse/PIG-729 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.2.1 Environment: Hadoop 0.20 Reporter: Santhosh Srinivasan Fix For: 0.2.1 Currently, if the user does not specify the number of reduce slots using the parallel keyword, Pig lets Hadoop decide on the default number of reducers. This model worked well with dynamically allocated clusters using HOD and for static clusters where the default number of reduce slots was explicitly set. With Hadoop 0.20, a single static cluster will be shared amongst a number of queues. As a result, a common scenario is to end up with the default number of reducers set to one (1). When users migrate to Hadoop 0.20, they might see a dramatic change in the performance of their queries if they had not used the parallel keyword to specify the number of reducers. In order to mitigate such circumstances, Pig can support one of the following:
1. Specify a default parallelism for the entire script. This option will allow users to use the same parallelism for all operators that do not have the explicit parallel keyword. This will ensure that the scripts utilize more reducers than the default of one reducer. On the down side, due to data transformations, usually operations that are performed towards the end of the script will need a smaller number of reducers compared to the operators that appear at the beginning of the script.
2. Display a warning message for each reduce side operator that does not have the explicit use of the parallel keyword. Proceed with the execution.
3. Display an error message indicating the operator that does not have the explicit use of the parallel keyword. Stop the execution.
Other suggestions/thoughts/solutions are welcome. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
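The proposed equivalence, sketched; the -set command line flag is hypothetical, and the property name parallelism simply echoes the comment above:
{code}
-- hypothetical command line:  pig -set parallelism=5 myscript.pig
-- would behave as if the script began with:
set parallelism 5;
-- and, per the comment above, a command-line -set would override a
-- conflicting set statement inside the script, with a warning
{code}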
[jira] Created: (PIG-756) UDFs should have API for transparently opening and reading files from HDFS or from local file system with only relative path
UDFs should have API for transparently opening and reading files from HDFS or from local file system with only relative path Key: PIG-756 URL: https://issues.apache.org/jira/browse/PIG-756 Project: Pig Issue Type: Bug Reporter: David Ciemiewicz I have a utility function util.INSETFROMFILE() that I pass a file name during initialization. {code}
define inQuerySet util.INSETFROMFILE('analysis/queries');
A = load 'logs' using PigStorage() as ( date: int, query: chararray );
B = filter A by inQuerySet(query);
{code} This provides a computationally inexpensive way to effect map-side joins for small sets, plus functions of this style provide the ability to encapsulate more complex matching rules. For rapid development and debugging purposes, I want this code to run without modification on both my local file system when I do pig -exectype local and on HDFS. Pig needs to provide an API for UDFs which allows them to either: 1) know when they are in local or HDFS mode and open and read from files as appropriate, or 2) just provide a file name and read statements and have pig transparently manage local or HDFS opens and reads for the UDF. UDFs need to read configuration information off the filesystem, and it simplifies the process if one can just flip the switch of -exectype local. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-756) UDFs should have API for transparently opening and reading files from HDFS or from local file system with only relative path
[ https://issues.apache.org/jira/browse/PIG-756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12697043#action_12697043 ] David Ciemiewicz commented on PIG-756: -- BTW, there used to be a mechanism to do this in early versions of Pig that was lost in the transition to the new execution system. UDFs should have API for transparently opening and reading files from HDFS or from local file system with only relative path Key: PIG-756 URL: https://issues.apache.org/jira/browse/PIG-756 Project: Pig Issue Type: Bug Reporter: David Ciemiewicz I have a utility function util.INSETFROMFILE() that I pass a file name during initialization. {code}
define inQuerySet util.INSETFROMFILE('analysis/queries');
A = load 'logs' using PigStorage() as ( date: int, query: chararray );
B = filter A by inQuerySet(query);
{code} This provides a computationally inexpensive way to effect map-side joins for small sets, plus functions of this style provide the ability to encapsulate more complex matching rules. For rapid development and debugging purposes, I want this code to run without modification on both my local file system when I do pig -exectype local and on HDFS. Pig needs to provide an API for UDFs which allows them to either: 1) know when they are in local or HDFS mode and open and read from files as appropriate, or 2) just provide a file name and read statements and have pig transparently manage local or HDFS opens and reads for the UDF. UDFs need to read configuration information off the filesystem, and it simplifies the process if one can just flip the switch of -exectype local. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-745) Please add DataTypes.toString() conversion function
[ https://issues.apache.org/jira/browse/PIG-745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12697094#action_12697094 ] David Ciemiewicz commented on PIG-745: -- Alan, I realized several things.
1) The question of what to do about the BOOLEAN case. My original suggestion was to convert the BOOLEAN case to 1 and 0, but in the patch I just used the Boolean.toString() function. Not sure if that matters or not.
2) I didn't see test cases for the other DataType.toInteger(), ... conversions, so I didn't create one for DataType.toString().
3) We are just using the default conversion of Float.toString() and Double.toString(). I don't know if this is actually best, since I don't know if these operations present the floating-point values in full precision or not.
At this point, it may not really matter so much, as the primary reason for creating DataType.toString() is to allow string functions to operate on any data type (like in Perl) without generating cast errors. Please add DataTypes.toString() conversion function --- Key: PIG-745 URL: https://issues.apache.org/jira/browse/PIG-745 Project: Pig Issue Type: Improvement Reporter: David Ciemiewicz Attachments: PIG-745.patch I'm doing some work in string manipulation UDFs and I've found that it would be very convenient if I could always convert the argument to a chararray (internally a Java String). For example TOLOWERCASE(arg) shouldn't really care whether arg is a bytearray, chararray, int, long, double, or float; it should be treated as a string and operated on. The simplest and most foolproof method would be if DataTypes added a static function DataTypes.toString which did all of the argument type checking and provided consistent translation. I believe that this function might be coded as: {code}
public static String toString(Object o) throws ExecException {
    try {
        switch (findType(o)) {
        case BOOLEAN:
            if (((Boolean)o) == true) return new String("1");
            else return new String("0");
        case BYTE:      return ((Byte)o).toString();
        case INTEGER:   return ((Integer)o).toString();
        case LONG:      return ((Long)o).toString();
        case FLOAT:     return ((Float)o).toString();
        case DOUBLE:    return ((Double)o).toString();
        case BYTEARRAY: return ((DataByteArray)o).toString();
        case CHARARRAY: return (String)o;
        case NULL:      return null;
        case MAP:
        case TUPLE:
        case BAG:
        case UNKNOWN:
        default:
            int errCode = 1071;
            String msg = "Cannot convert a " + findTypeName(o) + " to a String";
            throw new ExecException(msg, errCode, PigException.INPUT);
        }
    } catch (ExecException ee) {
        throw ee;
    } catch (Exception e) {
        int errCode = 2054;
        String msg = "Internal error. Could not convert " + o + " to String.";
        throw new ExecException(msg, errCode, PigException.BUG);
    }
}
{code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-753) Do not support UDF not providing parameter
[ https://issues.apache.org/jira/browse/PIG-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12697108#action_12697108 ] David Ciemiewicz commented on PIG-753: -- I think Jeff means that Pig does not support UDFs without parameters, but should. I agree. Do not support UDF not providing parameter -- Key: PIG-753 URL: https://issues.apache.org/jira/browse/PIG-753 Project: Pig Issue Type: Improvement Reporter: Jeff Zhang Pig does not support UDFs without parameters; it forces me to provide a parameter, as in the following statement: B = FOREACH A GENERATE bagGenerator(); This will generate an error. I have to provide a parameter like the following: B = FOREACH A GENERATE bagGenerator($0); -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-697) Proposed improvements to pig's optimizer
[ https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12697146#action_12697146 ] David Ciemiewicz commented on PIG-697: -- Some thoughts on optimization problems and patterns from SQL and coding Pig, and my desire for a higher level version of Pig than we have today. I know this may come off as a distraction but hopefully you'll have some time to hear me out. After:
* a conversation with Santhosh about the SQL to Pig translation work
* multiple issues I have encountered with nested foreach statements, including redundant function execution
* nested FOREACH statement assignment computation bugs
* hand coding chains of foreach statements so I can get the Algebraic combiner to kick in
* hand coding chains of foreach statements and grouping statements rather than using a single statement
I think I might have stumbled on a potentially improved model for Pig to Pig execution plan generation: {code} High Level Pig to Low Level Pig translation {code} I think this would potentially benefit the SQL to Pig efforts and provide for programmer coding efficiency in Pig as well. This will be a bit protracted, but I hope you have some time to consider it. Take the following SQL idiom that the SQL to Pig translator will need to support: {code}
select EXP(AVG(LN(time+0.1))) as geomean_time
from events
where time is not null and time >= 0;
{code} In high level pig, I have wanted to code this as: {code}
A = load 'events' using PigStorage() as ( time: int );
B = filter A by time is not null and time >= 0;
C = group B all;
D = foreach C generate EXP(AVG(LN(B.time+0.1))) as geomean_time;
{code} In fact, this would seem to provide a nice translation path from SQL to low level pig via high level pig. Unfortunately, this won't work. We developers must write Pig scripts at a lower level and break all of this apart into various steps. An additional issue is that, because of some, um, workarounds in the execution plan optimizations, the combiner won't kick in if we don't do further steps. So the most performant version of the desired pig script is the following really low level pig, where D is broken into 3 steps, merging one with B and leaving the remaining 2 steps as separate D statements: {code}
A = load 'events' using PigStorage() as ( time: int );
B = filter A by time is not null and time >= 0;
B = foreach B generate LN(time+0.1) as log_time;
C = group B all;
D = foreach C generate group, AVG(B.log_time) as mean_log_time; -- note that the group alias is required for the Algebraic combiner to kick in
D = foreach D generate EXP(mean_log_time) as geomean_time;
{code} If we can figure out how to translate SQL into this last low-level set of statements, why couldn't we or shouldn't we have high level pig as well, and permit more efficient code writing and optimization?
Next example: I do a bunch of nested intermediate computations in a nested FOREACH statement: {code}
C = foreach C {
    curr_mean_log_timetonextevent = curr_sum_log_timetonextevent / (double)count;
    curr_meansq_log_timetonextevent = curr_sumsq_log_timetonextevent / (double)count;
    curr_var_log_timetonextevent = curr_meansq_log_timetonextevent - (curr_mean_log_timetonextevent * curr_mean_log_timetonextevent);
    curr_sterr_log_timetonextevent = math.SQRT(curr_var_log_timetonextevent / (double)count);
    curr_geomean_timetonextevent = math.EXP(curr_mean_log_timetonextevent);
    curr_geosterr_timetonextevent = math.EXP(curr_sterr_log_timetonextevent);
    curr_mean_timetonextevent = curr_sum_timetonextevent / (double)count;
    curr_meansq_timetonextevent = curr_sumsq_timetonextevent / (double)count;
    curr_var_timetonextevent = curr_meansq_timetonextevent - (curr_mean_timetonextevent * curr_mean_timetonextevent);
    curr_sterr_timetonextevent = math.SQRT(curr_var_timetonextevent / (double)count);
    generate
    ...
{code} The code for nested statements in Pig has been particularly problematic and buggy, including problems such as:
* redundant execution of functions such as SUM, AVG
* nested function problems
* mathematical operator problems (illustrated in this bug)
* no type propagation
* the need to use AS clauses to name nested alias assignments projected in the GENERATE clauses
What if, instead of trying to do all of these operations in some specialized execution code, this was treated as high level pig that translated all of these intermediate statements into two or more low level foreach expansions?
[jira] Commented: (PIG-564) Parameter Substitution using -param option does not seem to work when parameters contain special characters such as +,=,-,?,'
[ https://issues.apache.org/jira/browse/PIG-564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12696266#action_12696266 ] David Ciemiewicz commented on PIG-564: -- Period (.) is also a special character that seems to cause problems. See related JIRA PIG-754 Parameter Substitution using -param option does not seem to work when parameters contain special characters such as +,=,-,?,' --- Key: PIG-564 URL: https://issues.apache.org/jira/browse/PIG-564 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.2.0 Reporter: Viraj Bhat Consider the following Pig script which uses parameter substitution {code} %default qual '/user/viraj' %default mydir 'mydir_myextraqual' VISIT_LOGS = load '$qual/$mydir' as (a,b,c); dump VISIT_LOGS; {code} If you run the script as: == java -cp pig.jar:${HADOOP_HOME}/conf/ -Dhod.server='' org.apache.pig.Main -param mydir=mydir-myextraqual mypigparamsub.pig == You get the following error: == 2008-12-15 19:49:43,964 [main] ERROR org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - java.io.IOException: /user/viraj/mydir does not exist at org.apache.pig.backend.executionengine.PigSlicer.validate(PigSlicer.java:109) at org.apache.pig.impl.io.ValidatingInputFileSpec.validate(ValidatingInputFileSpec.java:59) at org.apache.pig.impl.io.ValidatingInputFileSpec.init(ValidatingInputFileSpec.java:44) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:200) at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:742) at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:370) at org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247) at org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279) at java.lang.Thread.run(Thread.java:619) java.io.IOException: Unable to open iterator for alias: VISIT_LOGS [Job terminated with anomalous status FAILED] at org.apache.pig.PigServer.openIterator(PigServer.java:389) at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:269) at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:178) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:84) at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:64) at org.apache.pig.Main.main(Main.java:306) Caused by: java.io.IOException: Job terminated with anomalous status FAILED ... 6 more == Also tried using: -param mydir='mydir\-myextraqual' This behavior occurs if the parameter value contains characters such as +,=, ?. A workaround for this behavior is using a param_file which contains param_name=param_value on each line, with the param_value enclosed by quotes. For example: mydir='mydir-myextraqual' and then running the pig script as: java -cp pig.jar:${HADOOP_HOME}/conf/ -Dhod.server='' org.apache.pig.Main -param_file myparamfile mypigparamsub.pig The following issues need to be fixed: 1) In -param option if parameter value contains special characters, it is truncated 2) In param_file, if param_value contains a special characters, it should be enclosed in quotes 3) If 2 is a known issue then it should be documented in http://wiki.apache.org/pig/ParameterSubstitution -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-745) Please add DataTypes.toString() conversion function
[ https://issues.apache.org/jira/browse/PIG-745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695921#action_12695921 ] David Ciemiewicz commented on PIG-745: -- The more I think about this one, the more I realize that not having DataType.toString() is an oversight for the DataType package. Please add DataTypes.toString() conversion function --- Key: PIG-745 URL: https://issues.apache.org/jira/browse/PIG-745 Project: Pig Issue Type: Improvement Reporter: David Ciemiewicz I'm doing some work in string manipulation UDFs and I've found that it would be very convenient if I could always convert the argument to a chararray (internally a Java String). For example TOLOWERCASE(arg) shouldn't really care whether arg is a bytearray, chararray, int, long, double, or float; it should be treated as a string and operated on. The simplest and most foolproof method would be if DataTypes added a static function DataTypes.toString which did all of the argument type checking and provided consistent translation. I believe that this function might be coded as: {code}
public static String toString(Object o) throws ExecException {
    try {
        switch (findType(o)) {
        case BOOLEAN:
            if (((Boolean)o) == true) return new String("1");
            else return new String("0");
        case BYTE:      return ((Byte)o).toString();
        case INTEGER:   return ((Integer)o).toString();
        case LONG:      return ((Long)o).toString();
        case FLOAT:     return ((Float)o).toString();
        case DOUBLE:    return ((Double)o).toString();
        case BYTEARRAY: return ((DataByteArray)o).toString();
        case CHARARRAY: return (String)o;
        case NULL:      return null;
        case MAP:
        case TUPLE:
        case BAG:
        case UNKNOWN:
        default:
            int errCode = 1071;
            String msg = "Cannot convert a " + findTypeName(o) + " to a String";
            throw new ExecException(msg, errCode, PigException.INPUT);
        }
    } catch (ExecException ee) {
        throw ee;
    } catch (Exception e) {
        int errCode = 2054;
        String msg = "Internal error. Could not convert " + o + " to String.";
        throw new ExecException(msg, errCode, PigException.BUG);
    }
}
{code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-745) Please add DataTypes.toString() conversion function
[ https://issues.apache.org/jira/browse/PIG-745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Ciemiewicz updated PIG-745: - Attachment: PIG-745.patch PIG-745.patch attached. Patch for consideration to add DataTypes.toString() function. Please add DataTypes.toString() conversion function --- Key: PIG-745 URL: https://issues.apache.org/jira/browse/PIG-745 Project: Pig Issue Type: Improvement Reporter: David Ciemiewicz Attachments: PIG-745.patch I'm doing some work in string manipulation UDFs and I've found that it would be very convenient if I could always convert the argument to a chararray (internally a Java String). For example TOLOWERCASE(arg) shouldn't really care whether arg is a bytearray, chararray, int, long, double, or float; it should be treated as a string and operated on. The simplest and most foolproof method would be if DataTypes added a static function DataTypes.toString which did all of the argument type checking and provided consistent translation. I believe that this function might be coded as: {code}
public static String toString(Object o) throws ExecException {
    try {
        switch (findType(o)) {
        case BOOLEAN:
            if (((Boolean)o) == true) return new String("1");
            else return new String("0");
        case BYTE:      return ((Byte)o).toString();
        case INTEGER:   return ((Integer)o).toString();
        case LONG:      return ((Long)o).toString();
        case FLOAT:     return ((Float)o).toString();
        case DOUBLE:    return ((Double)o).toString();
        case BYTEARRAY: return ((DataByteArray)o).toString();
        case CHARARRAY: return (String)o;
        case NULL:      return null;
        case MAP:
        case TUPLE:
        case BAG:
        case UNKNOWN:
        default:
            int errCode = 1071;
            String msg = "Cannot convert a " + findTypeName(o) + " to a String";
            throw new ExecException(msg, errCode, PigException.INPUT);
        }
    } catch (ExecException ee) {
        throw ee;
    } catch (Exception e) {
        int errCode = 2054;
        String msg = "Internal error. Could not convert " + o + " to String.";
        throw new ExecException(msg, errCode, PigException.BUG);
    }
}
{code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-745) Please add DataTypes.toString() conversion function
[ https://issues.apache.org/jira/browse/PIG-745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Ciemiewicz updated PIG-745: - Status: Patch Available (was: Open) PIG-745.patch adds the DataType.toString() function to the DataType package. Please add DataTypes.toString() conversion function --- Key: PIG-745 URL: https://issues.apache.org/jira/browse/PIG-745 Project: Pig Issue Type: Improvement Reporter: David Ciemiewicz Attachments: PIG-745.patch I'm doing some work in string manipulation UDFs and I've found that it would be very convenient if I could always convert the argument to a chararray (internally a Java String). For example TOLOWERCASE(arg) shouldn't really care whether arg is a bytearray, chararray, int, long, double, or float; it should be treated as a string and operated on. The simplest and most foolproof method would be if DataTypes added a static function DataTypes.toString which did all of the argument type checking and provided consistent translation. I believe that this function might be coded as: {code}
public static String toString(Object o) throws ExecException {
    try {
        switch (findType(o)) {
        case BOOLEAN:
            if (((Boolean)o) == true) return new String("1");
            else return new String("0");
        case BYTE:      return ((Byte)o).toString();
        case INTEGER:   return ((Integer)o).toString();
        case LONG:      return ((Long)o).toString();
        case FLOAT:     return ((Float)o).toString();
        case DOUBLE:    return ((Double)o).toString();
        case BYTEARRAY: return ((DataByteArray)o).toString();
        case CHARARRAY: return (String)o;
        case NULL:      return null;
        case MAP:
        case TUPLE:
        case BAG:
        case UNKNOWN:
        default:
            int errCode = 1071;
            String msg = "Cannot convert a " + findTypeName(o) + " to a String";
            throw new ExecException(msg, errCode, PigException.INPUT);
        }
    } catch (ExecException ee) {
        throw ee;
    } catch (Exception e) {
        int errCode = 2054;
        String msg = "Internal error. Could not convert " + o + " to String.";
        throw new ExecException(msg, errCode, PigException.BUG);
    }
}
{code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-754) Bugs with load and store and filenames passed with -param containing periods
Bugs with load and store and filenames passed with -param containing periods Key: PIG-754 URL: https://issues.apache.org/jira/browse/PIG-754 Project: Pig Issue Type: Bug Reporter: David Ciemiewicz

This one drove me batty. I have two files: file and file.right.

file:
{code}
WRONG
This is file, not file.right.
{code}
file.right:
{code}
RIGHT
This is file.right..
{code}
infile.pig:
{code}
A = load '$infile' using PigStorage();
dump A;
{code}
When I pass in file.right as the infile parameter value, the wrong file is read:
{code}
-bash-3.00$ pig -exectype local -param infile=file.right infile.pig
USING: /grid/0/gs/pig/current
2009-04-05 23:18:36,291 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - 100% complete!
2009-04-05 23:18:36,292 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Success!!
(WRONG )
(This is file, not file.right.)
{code}
However, if I pass in infile as ./file.right, the script magically works:
{code}
-bash-3.00$ pig -exectype local -param infile=./file.right infile.pig
USING: /grid/0/gs/pig/current
2009-04-05 23:20:46,735 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - 100% complete!
2009-04-05 23:20:46,736 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Success!!
(RIGHT)
(This is file.right.)
{code}
I do not have this problem if I use the file name with a period in the script itself:

infile2.pig:
{code}
A = load 'file.right' using PigStorage();
dump A;
{code}
{code}
-bash-3.00$ pig -exectype local infile2.pig
USING: /grid/0/gs/pig/current
2009-04-05 23:22:47,022 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - 100% complete!
2009-04-05 23:22:47,023 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Success!!
(RIGHT)
(This is file.right.)
{code}
I also experience similar problems when I try to pass in a param for outfile in a store statement.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-752) bzip2 compression and local mode bugs
bzip2 compression and local mode bugs - Key: PIG-752 URL: https://issues.apache.org/jira/browse/PIG-752 Project: Pig Issue Type: Bug Reporter: David Ciemiewicz

Problem 1) Use of the .bz2 file extension does not store results bzip2 compressed in local mode (-exectype local). If I use the .bz2 filename extension in a STORE statement on HDFS, the results are stored with bzip2 compression. If I use the .bz2 filename extension in a STORE statement on the local file system, the results are NOT stored with bzip2 compression.

compact.bz2.pig:
{code}
A = load 'events.test' using PigStorage();
store A into 'events.test.bz2' using PigStorage();
C = load 'events.test.bz2' using PigStorage();
C = limit C 10;
dump C;
{code}
{code}
-bash-3.00$ pig -exectype local compact.bz2.pig
-bash-3.00$ file events.test
events.test: ASCII English text, with very long lines
-bash-3.00$ file events.test.bz2
events.test.bz2: ASCII English text, with very long lines
-bash-3.00$ cat events.test | bzip2 > events.test.bz2
-bash-3.00$ file events.test.bz2
events.test.bz2: bzip2 compressed data, block size = 900k
{code}
The output format in local mode is definitely not bzip2, but it should be.

Problem 2) Pig in local mode does not decompress bzip2 compressed files, but should, to be consistent with HDFS.

read.bz2.pig:
{code}
A = load 'events.test.bz2' using PigStorage();
A = limit A 10;
dump A;
{code}
The output should be human readable but is instead garbage, indicating no decompression took place during the load:
{code}
-bash-3.00$ pig -exectype local read.bz2.pig
USING: /grid/0/gs/pig/current
2009-04-03 18:26:30,455 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - 100% complete!
2009-04-03 18:26:30,456 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Success!!
(BZh91AYsyoz?u?...@{x_?d?|u-??mK???;??4?C??)
((R? 6?*mg, ?6?Zj?k,???0?QT?d???hY?#mJ?[j???z?m?t?u?K)??K5+??)?m?E7j?X?8a?? ??U?p@@MT?$?B?P??N??=???(z}gk...@c$\??i]?g:?J) a(R?,?u?v???...@?i@??J??!D?)???A?PP?IY??m? (mP(i?4,#F[?I)@?...@??|7^?}U??wwg,?u?$?T???((Q!D?=`*?}hP??_|??=?(??2???m=?xG?(?rC?B?(33??:4?N???t|??T?*??k??NT?x???=?fyv?wf??4z???4t?) (?oou?t???Kwl?3?nCM?WS?;l???P?s?x a???e)B??9? ?44 ((?...@4?) (f) (?...@+?d?0@?U) (Q?SR)
{code}

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
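Problem 1 suggests the local backend skips the extension check that the HDFS store path performs. A rough sketch of that dispatch, under stated assumptions: openByExtension is a hypothetical helper, GZIPOutputStream is the JDK class, and the bzip2 wrapper is left as a commented placeholder because the concrete stream class depends on the codec Pig bundles.
{code}
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.GZIPOutputStream;

public class ExtensionStreams {
    // Hypothetical helper: choose the output wrapper from the file suffix,
    // which is the behavior STORE already exhibits on HDFS but not locally.
    public static OutputStream openByExtension(String path) throws IOException {
        OutputStream raw = new FileOutputStream(path);
        if (path.endsWith(".gz")) {
            return new GZIPOutputStream(raw); // JDK gzip stream
        }
        // if (path.endsWith(".bz2")) return new <bzip2-output-stream>(raw);
        return raw;
    }
}
{code}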
[jira] Commented: (PIG-729) Use of default parallelism
[ https://issues.apache.org/jira/browse/PIG-729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695599#action_12695599 ] David Ciemiewicz commented on PIG-729: -- I've been through this battle before. And I write LOTS of Pig scripts. Here's what I want:

1) Use a default parallelism of 1 reducer, BUT WARN ME that I've got a default parallelism of 1 reducer. (I'd actually prefer whatever works on a single node.)
2) Allow me a command line option such as -parallel # or -mappers # -reducers #.
3) Allow me a set parameter inside my Pig scripts such as: set parallel #, set mappers #, set reducers #.
4) DO NOT require me to add a PARALLEL clause to each and every one of my reducer statements. PARALLEL clauses are a code maintenance nightmare.

Sometimes the grid is fat on available nodes and so I want to take advantage of this and run my job across as many nodes as possible. Sometimes the grid is scarce on available nodes and so I want to back off on the parallelism. I DO NOT WANT to change EVERY PARALLEL clause in my code each time I run my script. I DO NOT WANT to change parameter values for the PARALLEL clause each time I run my script. I really, really, really want to make this a run-time decision on the execution of the script at the time that I invoke the script, and I want this to be the default behavior in Pig.

Use of default parallelism -- Key: PIG-729 URL: https://issues.apache.org/jira/browse/PIG-729 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.2.1 Environment: Hadoop 0.20 Reporter: Santhosh Srinivasan Fix For: 0.2.1

Currently, if the user does not specify the number of reduce slots using the parallel keyword, Pig lets Hadoop decide on the default number of reducers. This model worked well with dynamically allocated clusters using HOD and for static clusters where the default number of reduce slots was explicitly set. With Hadoop 0.20, a single static cluster will be shared amongst a number of queues. As a result, a common scenario is to end up with the default number of reducers set to one (1). When users migrate to Hadoop 0.20, they might see a dramatic change in the performance of their queries if they had not used the parallel keyword to specify the number of reducers. In order to mitigate such circumstances, Pig can support one of the following:

1. Specify a default parallelism for the entire script. This option will allow users to use the same parallelism for all operators that do not have the explicit parallel keyword. This will ensure that the scripts utilize more reducers than the default of one reducer. On the down side, due to data transformations, usually operations that are performed towards the end of the script will need a smaller number of reducers compared to the operators that appear at the beginning of the script.
2. Display a warning message for each reduce side operator that does not have the explicit parallel keyword. Proceed with the execution.
3. Display an error message indicating the operator that does not have the explicit use of the parallel keyword. Stop the execution.

Other suggestions/thoughts/solutions are welcome.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-745) Please add DataTypes.toString() conversion function
[ https://issues.apache.org/jira/browse/PIG-745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12694893#action_12694893 ] David Ciemiewicz commented on PIG-745: -- Actually, the proposed function DataTypes.toString() is the following:
{code}
public static String toString(Object o) throws ExecException {
    try {
        switch (findType(o)) {
        case BOOLEAN:
            if (((Boolean)o) == true) return "1";
            else return "0";
        case BYTE:
            return ((Byte)o).toString();
        case INTEGER:
            return ((Integer)o).toString();
        case LONG:
            return ((Long)o).toString();
        case FLOAT:
            return ((Float)o).toString();
        case DOUBLE:
            return ((Double)o).toString();
        case BYTEARRAY:
            return ((DataByteArray)o).toString();
        case CHARARRAY:
            return (String)o;
        case NULL:
            return null;
        case MAP:
        case TUPLE:
        case BAG:
        case UNKNOWN:
        default:
            int errCode = 1071;
            String msg = "Cannot convert a " + findTypeName(o) + " to a String";
            throw new ExecException(msg, errCode, PigException.INPUT);
        }
    } catch (ExecException ee) {
        throw ee;
    } catch (Exception e) {
        int errCode = 2054;
        String msg = "Internal error. Could not convert " + o + " to String.";
        throw new ExecException(msg, errCode, PigException.BUG);
    }
}
{code}

Please add DataTypes.toString() conversion function --- Key: PIG-745 URL: https://issues.apache.org/jira/browse/PIG-745 Project: Pig Issue Type: Improvement Reporter: David Ciemiewicz

I'm doing some work in string manipulation UDFs and I've found that it would be very convenient if I could always convert the argument to a chararray (internally a Java String). For example, TOLOWERCASE(arg) shouldn't really care whether arg is a bytearray, chararray, int, long, double, or float; it should be treated as a string and operated on. The simplest and most foolproof method would be for DataTypes to add a static function DataTypes.toString() that did all of the argument type checking and provided consistent translation. I believe that this function might be coded as:

public static String toString(Object o) throws ExecException {
    try {
        switch (findType(o)) {
        case BOOLEAN:
            if (((Boolean)o) == true) return "1";
            else return "0";
        case BYTE:
            return ((Byte)o).toString();
        case INTEGER:
            return ((Integer)o).toString();
        case LONG:
            return ((Long)o).toString();
        case FLOAT:
            return ((Float)o).toString();
        case DOUBLE:
            return ((Double)o).toString();
        case BYTEARRAY:
            return ((DataByteArray)o).toString();
        case CHARARRAY:
            return (String)o;
        case NULL:
            return null;
        case MAP:
        case TUPLE:
        case BAG:
        case UNKNOWN:
        default:
            int errCode = 1071;
            String msg = "Cannot convert a " + findTypeName(o) + " to a String";
            throw new ExecException(msg, errCode, PigException.INPUT);
        }
    } catch (ExecException ee) {
        throw ee;
    } catch (Exception e) {
        int errCode = 2054;
        String msg = "Internal error. Could not convert " + o + " to String.";
        throw new ExecException(msg, errCode, PigException.BUG);
    }
}

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-746) Works in --exectype local, fails on grid - ERROR 2113: SingleTupleBag should never be serialized
[ https://issues.apache.org/jira/browse/PIG-746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695198#action_12695198 ] David Ciemiewicz commented on PIG-746: -- I'd still like to use the combiner in other instances in my combined Pig scripts (I concatenate several Pig scripts together to create compound Pig scripts). It would be nice if Pig had a per-statement option to turn off or force on the combiner. In the meantime, I discovered a feature (flaw?) in Pig that turns off the combiner: perform a scalar operation (such as +0L) on the Algebraic aggregation function.

D = foreach B generate
        group,
        SUM(A.matched) + 0L as matchedcount, -- the +0L flaw turns off the combiner
        A;
describe D;

I have tried this workaround and it works, at least in the current version of Pig, until someone figures out how to permit use of the combiner for combined Algebraic and scalar operations.

Works in --exectype local, fails on grid - ERROR 2113: SingleTupleBag should never be serialized Key: PIG-746 URL: https://issues.apache.org/jira/browse/PIG-746 Project: Pig Issue Type: Bug Reporter: David Ciemiewicz

The script below works in Pig 2.0 local mode but fails when I run the same program on the grid. I was attempting to create a workaround for PIG-710. Here's the error:
{code}
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2113: SingleTupleBag should never be serialized or serialized.
at org.apache.pig.data.SingleTupleBag.write(SingleTupleBag.java:129)
at org.apache.pig.data.DataReaderWriter.writeDatum(DataReaderWriter.java:147)
at org.apache.pig.data.DefaultTuple.write(DefaultTuple.java:291)
at org.apache.pig.impl.io.PigNullableWritable.write(PigNullableWritable.java:83)
at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90)
at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:439)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:101)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:219)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:208)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:86)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
{code}
Here's the program:
{code}
A = load 'filterbug.data' using PigStorage() as ( id, str );
A = foreach A generate id, str, ( str matches 'hello' or str matches 'hello' ? 1 : 0 ) as matched;
describe A;
B = group A by ( id );
describe B;
D = foreach B generate group, SUM(A.matched) as matchedcount, A;
describe D;
E = filter D by matchedcount > 0;
describe E;
F = foreach E generate FLATTEN(A);
describe F;
dump F;
{code}
Here's the data, filterbug.data:
{code}
a hello
a goodbye
b goodbye
c hello
c hello
c hello
e what
{code}

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-747) Logical to Physical Plan Translation fails when temporary aliases are created within foreach
[ https://issues.apache.org/jira/browse/PIG-747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695201#action_12695201 ] David Ciemiewicz commented on PIG-747: -- Another workaround is to split this into a chain of foreach statements:
{code}
B = foreach A generate *, (double)col1 / (double)col2 as d, (double)col3 / (double)col2 as e;
B = foreach B generate e - d * d as newcol;
dump B;
{code}
Logical to Physical Plan Translation fails when temporary aliases are created within foreach -- Key: PIG-747 URL: https://issues.apache.org/jira/browse/PIG-747 Project: Pig Issue Type: Bug Affects Versions: 0.3.0 Reporter: Viraj Bhat Fix For: 0.3.0 Attachments: physicalplan.txt, physicalplanprob.pig

Consider the Pig script, which calculates a new column F inside the foreach:
{code}
A = load 'physicalplan.txt' as (col1,col2,col3);
B = foreach A {
    D = col1/col2;
    E = col3/col2;
    F = E - (D*D);
    generate F as newcol;
};
dump B;
{code}
This gives the following error:
===
Caused by: org.apache.pig.backend.hadoop.executionengine.physicalLayer.LogicalToPhysicalTranslatorException: ERROR 2015: Invalid physical operators in the physical plan
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.LogToPhyTranslationVisitor.visit(LogToPhyTranslationVisitor.java:377)
at org.apache.pig.impl.logicalLayer.LOMultiply.visit(LOMultiply.java:63)
at org.apache.pig.impl.logicalLayer.LOMultiply.visit(LOMultiply.java:29)
at org.apache.pig.impl.plan.DependencyOrderWalkerWOSeenChk.walk(DependencyOrderWalkerWOSeenChk.java:68)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.LogToPhyTranslationVisitor.visit(LogToPhyTranslationVisitor.java:908)
at org.apache.pig.impl.logicalLayer.LOForEach.visit(LOForEach.java:122)
at org.apache.pig.impl.logicalLayer.LOForEach.visit(LOForEach.java:41)
at org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:68)
at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:246)
... 10 more
Caused by: org.apache.pig.impl.plan.PlanException: ERROR 0: Attempt to give operator of type org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.Divide multiple outputs. This operator does not support multiple outputs.
at org.apache.pig.impl.plan.OperatorPlan.connect(OperatorPlan.java:158)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.plans.PhysicalPlan.connect(PhysicalPlan.java:89)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.LogToPhyTranslationVisitor.visit(LogToPhyTranslationVisitor.java:373)
... 19 more
===

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-745) Please add DataTypes.toString() conversion function
Please add DataTypes.toString() conversion function --- Key: PIG-745 URL: https://issues.apache.org/jira/browse/PIG-745 Project: Pig Issue Type: Improvement Reporter: David Ciemiewicz

I'm doing some work in string manipulation UDFs and I've found that it would be very convenient if I could always convert the argument to a chararray (internally a Java String). For example, TOLOWERCASE(arg) shouldn't really care whether arg is a bytearray, chararray, int, long, double, or float; it should be treated as a string and operated on. The simplest and most foolproof method would be for DataTypes to add a static function DataTypes.toString() that did all of the argument type checking and provided consistent translation. I believe that this function might be coded as:

public static String toString(Object o) throws ExecException {
    try {
        switch (findType(o)) {
        case BOOLEAN:
            if (((Boolean)o) == true) return "1";
            else return "0";
        case BYTE:
            return ((Byte)o).toString();
        case INTEGER:
            return ((Integer)o).toString();
        case LONG:
            return ((Long)o).toString();
        case FLOAT:
            return ((Float)o).toString();
        case DOUBLE:
            return ((Double)o).toString();
        case BYTEARRAY:
            return ((DataByteArray)o).toString();
        case CHARARRAY:
            return (String)o;
        case NULL:
            return null;
        case MAP:
        case TUPLE:
        case BAG:
        case UNKNOWN:
        default:
            int errCode = 1071;
            String msg = "Cannot convert a " + findTypeName(o) + " to a String";
            throw new ExecException(msg, errCode, PigException.INPUT);
        }
    } catch (ExecException ee) {
        throw ee;
    } catch (Exception e) {
        int errCode = 2054;
        String msg = "Internal error. Could not convert " + o + " to String.";
        throw new ExecException(msg, errCode, PigException.BUG);
    }
}

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-741) Add LIMIT as a statement that works in nested FOREACH
Add LIMIT as a statement that works in nested FOREACH - Key: PIG-741 URL: https://issues.apache.org/jira/browse/PIG-741 Project: Pig Issue Type: New Feature Reporter: David Ciemiewicz I'd like to compute the top 10 results in each group. The natural way to express this in Pig would be: {code} A = load '...' using PigStorage() as ( date: int, count: int, url: chararray ); B = group A by ( date ); C = foreach B { D = order A by count desc; E = limit D 10; generate FLATTEN(E); }; dump C; {code} Yeah, I could write a UDF / PiggyBank function to take the top n results. But since LIMIT already exists as a statement, it seems like it should also work in the nested foreach context. Example workaround code. {code} C = foreach B { D = order A by count desc; E = util.TOP(D, 10); generate FLATTEN(E); }; dump C; {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
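The util.TOP in the workaround is hypothetical; a minimal sketch of such an EvalFunc, assuming the input bag has already been ordered so the function only needs to copy the first n tuples:
{code}
import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

// Hypothetical TOP(bag, n): returns the first n tuples of an already-ordered
// bag, approximating the nested LIMIT requested above.
public class TOP extends EvalFunc<DataBag> {
    public DataBag exec(Tuple input) throws IOException {
        if (input == null || input.size() < 2) return null;
        DataBag in = (DataBag) input.get(0);
        long n = ((Number) input.get(1)).longValue();
        DataBag out = BagFactory.getInstance().newDefaultBag();
        long taken = 0;
        for (Tuple t : in) {
            if (taken++ >= n) break;
            out.add(t);
        }
        return out;
    }
}
{code}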
[jira] Created: (PIG-710) Filtering bag in nested foreach does not produce expected results
Filtering bag in nested foreach does not produce expected results - Key: PIG-710 URL: https://issues.apache.org/jira/browse/PIG-710 Project: Pig Issue Type: Bug Reporter: David Ciemiewicz I have an idiom I used to use in older versions of pig (prior to types branch) which would group into a collection and then filter the output if any of the collection contained a particular string. This relies on FILTER statements within a FOREACH ... { ... GENERATE ... } statement. ORDER ... BY in the FOREACH ... { ... GENERATE ... } statement does not seem to have a problem so it seems to be something isolated to the FILTER. {code} A = load 'filterbug.data' using PigStorage() as ( id, str ); B = group A by ( id ); describe B; dump B; D = foreach B generate group, COUNT(A), A.str; describe D; dump D; C = foreach B { D = order A by str; matchedcount = COUNT(D); generate group, matchedcount as matchedcount, D.str; }; describe C; dump C; Cfiltered = foreach B { D = filter A by ( str matches 'hello' ); matchedcount = COUNT(D); generate group, matchedcount as matchedcount, A.str; }; describe Cfiltered; dump Cfiltered; {code} Here's the output: {code} -bash-3.00$ pig -exectype local -latest filterbug.pig USING: /grid/0/gs/pig/current B: {group: bytearray,A: {id: bytearray,str: bytearray}} 2009-03-10 03:14:14,838 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - 100% complete! 2009-03-10 03:14:14,839 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Success!! (a,{(a,hello),(a,goodbye)}) (b,{(b,goodbye)}) (c,{(c,hello),(c,hello),(c,hello)}) (d,{(d,what)}) D: {group: bytearray,long,str: {str: bytearray}} 2009-03-10 03:14:14,920 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - 100% complete! 2009-03-10 03:14:14,920 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Success!! (a,2L,{(hello),(goodbye)}) (b,1L,{(goodbye)}) (c,3L,{(hello),(hello),(hello)}) (d,1L,{(what)}) C: {group: bytearray,matchedcount: long,str: {str: bytearray}} 2009-03-10 03:14:14,985 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - 100% complete! 2009-03-10 03:14:14,985 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Success!! (a,2L,{(goodbye),(hello)}) (b,1L,{(goodbye)}) (c,3L,{(hello),(hello),(hello)}) (d,1L,{(what)}) 2009-03-10 03:14:15,018 [main] WARN org.apache.pig.PigServer - Encountered Warning IMPLICIT_CAST_TO_CHARARRAY 1 time(s). Cfiltered: {group: bytearray,matchedcount: long,str: {str: bytearray}} 2009-03-10 03:14:15,044 [main] WARN org.apache.pig.PigServer - Encountered Warning IMPLICIT_CAST_TO_CHARARRAY 1 time(s). 2009-03-10 03:14:15,057 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - 100% complete! 2009-03-10 03:14:15,057 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Success!! (a,1L,{(hello),(goodbye)}) {code} What I expect for the output of Cfiltered is actually: (a,1L,{(hello),(goodbye)}) (b,0L,{(goodbye)}) (c,3L,{(hello),(hello),(hello)}) (d,0L,{(what)}) The data file is: {code} a hello a goodbye b goodbye c hello c hello c hello d what {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-574) run command for grunt
[ https://issues.apache.org/jira/browse/PIG-574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12672655#action_12672655 ] David Ciemiewicz commented on PIG-574: -- Thanks! This will make iterative development so much faster and less painful than preallocating a HOD subcluster and then forgetting to delete it.

run command for grunt - Key: PIG-574 URL: https://issues.apache.org/jira/browse/PIG-574 Project: Pig Issue Type: New Feature Components: grunt Reporter: David Ciemiewicz Priority: Minor Attachments: run_command.patch, run_command_params.patch

This is a request for a run file command in grunt which will read a script from the local file system and execute the script interactively while in the grunt shell. One of the things that slows down iterative development of large, complicated Pig scripts that must operate on hadoop fs data is that the edit, run, debug cycle is slow because I must wait to allocate a Hadoop-on-Demand (hod) cluster for each iteration. I would prefer not to preallocate a cluster of nodes (though I could). Instead, I'd like to have one window open and edit my Pig script using vim or emacs, write it, and then type run myscript.pig at the grunt shell until I get things right. I'm used to doing similar things with Oracle, MySQL, and R.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-600) PiggyBank compilation instructions don't work
PiggyBank compilation instructions don't work - Key: PIG-600 URL: https://issues.apache.org/jira/browse/PIG-600 Project: Pig Issue Type: Bug Components: impl Affects Versions: types_branch Reporter: David Ciemiewicz

I know that PiggyBank is as-is, but the instructions are incomplete; they should include all of the steps required to compile PiggyBank. http://wiki.apache.org/pig/PiggyBank

I checked out the types branch version of PiggyBank by modifying the instructions to check out: svn co http://svn.apache.org/repos/asf/hadoop/pig/branches/types/contrib/piggybank/

At step 2 it says: To build a jar file that contains all available user defined functions (UDFs), please follow the steps: 1. Checkout UDF code: svn co http://svn.apache.org/repos/asf/hadoop/pig/trunk/contrib/piggybank 2. Build the jar file: from trunk/contrib/piggybank/java directory run ant. This will generate piggybank.jar in the same directory.

So I went into the piggybank/java directory and ran ant and got the following errors:
{code}
-bash-3.00$ ant
Buildfile: build.xml
init:
compile:
[echo] *** Compiling Pig UDFs ***
[javac] Compiling 70 source files to /homes/ciemo/piggybank/java/build/classes
[javac] /homes/ciemo/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/math/ABS.java:25: cannot find symbol
[javac] symbol : class EvalFunc
[javac] location: package org.apache.pig
[javac] import org.apache.pig.EvalFunc;
[javac] ^
[javac] /homes/ciemo/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/math/ABS.java:26: cannot find symbol
[javac] symbol : class FuncSpec
[javac] location: package org.apache.pig
[javac] import org.apache.pig.FuncSpec;
[javac] ^
[javac] /homes/ciemo/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/math/ABS.java:27: package org.apache.pig.data does not exist
[javac] import org.apache.pig.data.Tuple;
[javac] ^
[javac] /homes/ciemo/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/math/ABS.java:28: package org.apache.pig.impl.logicalLayer.schema does not exist
[javac] import org.apache.pig.impl.logicalLayer.schema.Schema;
[javac] ^
[javac] /homes/ciemo/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/math/ABS.java:29: package org.apache.pig.data does not exist
[javac] import org.apache.pig.data.DataType;
[javac] ^
[javac] /homes/ciemo/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/math/ABS.java:30: package org.apache.pig.impl.logicalLayer does not exist
[javac] import org.apache.pig.impl.logicalLayer.FrontendException;
[javac] ^
[javac] /homes/ciemo/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/math/ABS.java:31: package org.apache.pig.impl.util does not exist
[javac] import org.apache.pig.impl.util.WrappedIOException;
[javac] ^
[javac] /homes/ciemo/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/math/ABS.java:61: cannot find symbol
[javac] symbol: class EvalFunc
[javac] public class ABS extends EvalFunc<Double>{
[javac] ^
[javac] /homes/ciemo/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/math/ABS.java:67: cannot find symbol
[javac] symbol : class Tuple
[javac] location: class org.apache.pig.piggybank.evaluation.math.ABS
[javac] public Double exec(Tuple input) throws IOException {
[javac] ^
[javac] /homes/ciemo/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/math/ABS.java:85: cannot find symbol
[javac] symbol : class Schema
[javac] location: class org.apache.pig.piggybank.evaluation.math.ABS
[javac] public Schema outputSchema(Schema input) {
[javac] ^
[javac] /homes/ciemo/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/math/ABS.java:85: cannot find symbol
[javac] symbol : class Schema
[javac] location: class org.apache.pig.piggybank.evaluation.math.ABS
[javac] public Schema outputSchema(Schema input) {
[javac] ^
[javac] /homes/ciemo/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/math/ABS.java:93: cannot find symbol
[javac] symbol : class FuncSpec
[javac] location: class org.apache.pig.piggybank.evaluation.math.ABS
[javac] public List<FuncSpec> getArgToFuncMapping() throws FrontendException {
[javac] ^
[javac] /homes/ciemo/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/math/ABS.java:93: cannot
{code}
[jira] Commented: (PIG-600) PiggyBank compilation instructions don't work
[ https://issues.apache.org/jira/browse/PIG-600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12661641#action_12661641 ] David Ciemiewicz commented on PIG-600: -- I think the problem is this ... the build.xml file is looking for pig.jar several directories up:

<property name="pigjar" value="../../../pig.jar" />

The thing is, I'm relying on another build of pig and not the whole pig directory. If you take the instructions for PiggyBank literally (as I did) you will not get a successful build.

PiggyBank compilation instructions don't work - Key: PIG-600 URL: https://issues.apache.org/jira/browse/PIG-600 Project: Pig Issue Type: Bug Components: impl Affects Versions: types_branch Reporter: David Ciemiewicz

I know that PiggyBank is as-is, but the instructions are incomplete; they should include all of the steps required to compile PiggyBank. http://wiki.apache.org/pig/PiggyBank

I checked out the types branch version of PiggyBank by modifying the instructions to check out: svn co http://svn.apache.org/repos/asf/hadoop/pig/branches/types/contrib/piggybank/

At step 2 it says: To build a jar file that contains all available user defined functions (UDFs), please follow the steps: 1. Checkout UDF code: svn co http://svn.apache.org/repos/asf/hadoop/pig/trunk/contrib/piggybank 2. Build the jar file: from trunk/contrib/piggybank/java directory run ant. This will generate piggybank.jar in the same directory.

So I went into the piggybank/java directory and ran ant and got the following errors:
{code}
-bash-3.00$ ant
Buildfile: build.xml
init:
compile:
[echo] *** Compiling Pig UDFs ***
[javac] Compiling 70 source files to /homes/ciemo/piggybank/java/build/classes
[javac] /homes/ciemo/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/math/ABS.java:25: cannot find symbol
[javac] symbol : class EvalFunc
[javac] location: package org.apache.pig
[javac] import org.apache.pig.EvalFunc;
[javac] ^
[javac] /homes/ciemo/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/math/ABS.java:26: cannot find symbol
[javac] symbol : class FuncSpec
[javac] location: package org.apache.pig
[javac] import org.apache.pig.FuncSpec;
[javac] ^
[javac] /homes/ciemo/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/math/ABS.java:27: package org.apache.pig.data does not exist
[javac] import org.apache.pig.data.Tuple;
[javac] ^
[javac] /homes/ciemo/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/math/ABS.java:28: package org.apache.pig.impl.logicalLayer.schema does not exist
[javac] import org.apache.pig.impl.logicalLayer.schema.Schema;
[javac] ^
[javac] /homes/ciemo/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/math/ABS.java:29: package org.apache.pig.data does not exist
[javac] import org.apache.pig.data.DataType;
[javac] ^
[javac] /homes/ciemo/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/math/ABS.java:30: package org.apache.pig.impl.logicalLayer does not exist
[javac] import org.apache.pig.impl.logicalLayer.FrontendException;
[javac] ^
[javac] /homes/ciemo/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/math/ABS.java:31: package org.apache.pig.impl.util does not exist
[javac] import org.apache.pig.impl.util.WrappedIOException;
[javac] ^
[javac] /homes/ciemo/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/math/ABS.java:61: cannot find symbol
[javac] symbol: class EvalFunc
[javac] public class ABS extends EvalFunc<Double>{
[javac] ^
[javac] /homes/ciemo/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/math/ABS.java:67: cannot find symbol
[javac] symbol : class Tuple
[javac] location: class org.apache.pig.piggybank.evaluation.math.ABS
[javac] public Double exec(Tuple input) throws IOException {
[javac] ^
[javac] /homes/ciemo/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/math/ABS.java:85: cannot find symbol
[javac] symbol : class Schema
[javac] location: class org.apache.pig.piggybank.evaluation.math.ABS
[javac] public Schema outputSchema(Schema input) {
[javac] ^
[javac] /homes/ciemo/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/math/ABS.java:85: cannot find symbol
{code}
[jira] Created: (PIG-598) Parameter substitution ($PARAMETER) should not be performed in comments
Parameter substitution ($PARAMETER) should not be performed in comments --- Key: PIG-598 URL: https://issues.apache.org/jira/browse/PIG-598 Project: Pig Issue Type: Bug Components: impl Affects Versions: types_branch Reporter: David Ciemiewicz Priority: Minor

Compiling the following code example will generate an error that $NOT_A_PARAMETER is an Undefined Parameter. This is problematic, as sometimes you want to comment out parts of your code, including parameters, so that you don't have to define them. Thus I think it would be really good if parameter substitution were not performed in comments.
{code}
-- $NOT_A_PARAMETER
{code}
{code}
-bash-3.00$ pig -exectype local -latest comment.pig
USING: /grid/0/gs/pig/current
java.lang.RuntimeException: Undefined parameter : NOT_A_PARAMETER
at org.apache.pig.tools.parameters.PreprocessorContext.substitute(PreprocessorContext.java:221)
at org.apache.pig.tools.parameters.ParameterSubstitutionPreprocessor.parsePigFile(ParameterSubstitutionPreprocessor.java:106)
at org.apache.pig.tools.parameters.ParameterSubstitutionPreprocessor.genSubstitutedFile(ParameterSubstitutionPreprocessor.java:86)
at org.apache.pig.Main.runParamPreprocessor(Main.java:394)
at org.apache.pig.Main.main(Main.java:296)
{code}

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
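One possible fix, sketched under the assumption that Pig Latin comments are the line-oriented -- form: have the preprocessor drop comment text before scanning for $PARAMETER tokens. This is an illustration, not the actual PreprocessorContext logic, and it ignores the corner case of -- appearing inside a quoted string:
{code}
// Sketch: strip a "--" line comment before parameter substitution so that
// $TOKENS inside comments are never treated as parameters. Simplified: it
// does not handle "--" appearing inside quoted string literals.
public class CommentStripper {
    public static String stripLineComment(String line) {
        int i = line.indexOf("--");
        return (i >= 0) ? line.substring(0, i) : line;
    }
}
{code}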
[jira] Commented: (PIG-596) Anonymous tuples in bags create ParseExceptions
[ https://issues.apache.org/jira/browse/PIG-596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12660201#action_12660201 ] David Ciemiewicz commented on PIG-596: -- Note that specifying the tuple without the tuple designator doesn't work either. {code} One = load 'one.txt' using PigStorage() as ( one: int ); LabelledTupleInBag = foreach One generate { ( 1, 2 ) } as mybag { tuplelabel: tuple ( a, b ) }; AnonymousTupleInBag = foreach One generate { ( 2, 3 ) } as mybag { ( a, b ) }; Tuples = union LabelledTupleInBag, AnonymousTupleInBag; dump Tuples; {code} Anonymous tuples in bags create ParseExceptions --- Key: PIG-596 URL: https://issues.apache.org/jira/browse/PIG-596 Project: Pig Issue Type: Bug Affects Versions: types_branch Reporter: David Ciemiewicz {code} One = load 'one.txt' using PigStorage() as ( one: int ); LabelledTupleInBag = foreach One generate { ( 1, 2 ) } as mybag { tuplelabel: tuple ( a, b ) }; AnonymousTupleInBag = foreach One generate { ( 2, 3 ) } as mybag { tuple ( a, b ) }; -- Anonymous tuple creates bug Tuples = union LabelledTupleInBag, AnonymousTupleInBag; dump Tuples; {code} java.io.IOException: Encountered { tuple at line 6, column 66. Was expecting one of: parallel ... ; ... , ... : ... ( ... { IDENTIFIER ... { } ... [ ... at org.apache.pig.PigServer.parseQuery(PigServer.java:298) at org.apache.pig.PigServer.registerQuery(PigServer.java:263) at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:439) at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:249) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:84) at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:64) at org.apache.pig.Main.main(Main.java:306) Caused by: org.apache.pig.impl.logicalLayer.parser.ParseException: Encountered { tuple at line 6, column 66. Why can't there be an anonymous tuple at the top level of a bag? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-596) Anonymous tuples in bags create ParseExceptions
[ https://issues.apache.org/jira/browse/PIG-596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12660202#action_12660202 ] David Ciemiewicz commented on PIG-596: -- The reason I think it is important to be able to create anonymous tuples is that the tuples are anonymous in the LOAD statements. Moreover, if you FLATTEN a bag such as mybag, any intermediate tuple label is immediately lost and the results of the flatten are mybag::a and mybag::b, not mybag::tuplelabel::a and mybag::tuplelabel::b.

Anonymous tuples in bags create ParseExceptions --- Key: PIG-596 URL: https://issues.apache.org/jira/browse/PIG-596 Project: Pig Issue Type: Bug Affects Versions: types_branch Reporter: David Ciemiewicz
{code}
One = load 'one.txt' using PigStorage() as ( one: int );
LabelledTupleInBag = foreach One generate { ( 1, 2 ) } as mybag { tuplelabel: tuple ( a, b ) };
AnonymousTupleInBag = foreach One generate { ( 2, 3 ) } as mybag { tuple ( a, b ) }; -- Anonymous tuple creates bug
Tuples = union LabelledTupleInBag, AnonymousTupleInBag;
dump Tuples;
{code}
java.io.IOException: Encountered { tuple at line 6, column 66. Was expecting one of: parallel ... ; ... , ... : ... ( ... { IDENTIFIER ... { } ... [ ...
at org.apache.pig.PigServer.parseQuery(PigServer.java:298)
at org.apache.pig.PigServer.registerQuery(PigServer.java:263)
at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:439)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:249)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:84)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:64)
at org.apache.pig.Main.main(Main.java:306)
Caused by: org.apache.pig.impl.logicalLayer.parser.ParseException: Encountered { tuple at line 6, column 66.

Why can't there be an anonymous tuple at the top level of a bag?

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-596) Anonymous tuples in bags create ParseExceptions
Anonymous tuples in bags create ParseExceptions --- Key: PIG-596 URL: https://issues.apache.org/jira/browse/PIG-596 Project: Pig Issue Type: Bug Affects Versions: types_branch Reporter: David Ciemiewicz {code} One = load 'one.txt' using PigStorage() as ( one: int ); LabelledTupleInBag = foreach One generate { ( 1, 2 ) } as mybag { tuplelabel: tuple ( a, b ) }; AnonymousTupleInBag = foreach One generate { ( 2, 3 ) } as mybag { tuple ( a, b ) }; -- Anonymous tuple creates bug Tuples = union LabelledTupleInBag, AnonymousTupleInBag; dump Tuples; {code} java.io.IOException: Encountered { tuple at line 6, column 66. Was expecting one of: parallel ... ; ... , ... : ... ( ... { IDENTIFIER ... { } ... [ ... at org.apache.pig.PigServer.parseQuery(PigServer.java:298) at org.apache.pig.PigServer.registerQuery(PigServer.java:263) at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:439) at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:249) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:84) at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:64) at org.apache.pig.Main.main(Main.java:306) Caused by: org.apache.pig.impl.logicalLayer.parser.ParseException: Encountered { tuple at line 6, column 66. Why can't there be an anonymous tuple at the top level of a bag? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-579) Adding newlines to format foreach statement with constants causes parse errors
Adding newlines to format foreach statement with constants causes parse errors -- Key: PIG-579 URL: https://issues.apache.org/jira/browse/PIG-579 Project: Pig Issue Type: Bug Components: impl Affects Versions: types_branch Reporter: David Ciemiewicz

The following code example fails with parse errors on statement D:
{code}
A = LOAD 'student_data' AS (name: chararray, age: int, gpa: float);
B = LOAD 'voter_data' AS (name: chararray, age: int, registration: chararray, contributions: float);
C = COGROUP A BY name, B BY name;
D = FOREACH C GENERATE group,
    flatten((not IsEmpty(A) ? A : (bag{tuple(chararray, int, float)}){(null, null, null)})),
    flatten((not IsEmpty(B) ? B : (bag{tuple(chararray, int, chararray, float)}){(null,null,null, null)}));
dump D;
{code}
I get the parse error:

Caused by: org.apache.pig.impl.logicalLayer.parser.ParseException: Encountered not IsEmpty ( A ) ? A : ( bag { tuple ( chararray , int , float ) } ; at line 9, column 18. Was expecting one of: ( ... - ... tuple ... bag ... map ... int ... long ... ...

However, if I simply remove the newlines from statement D and make it:
{code}
D = FOREACH C GENERATE group, flatten((not IsEmpty(A) ? A : (bag{tuple(chararray, int, float)}){(null, null, null)})), flatten((not IsEmpty(B) ? B : (bag{tuple(chararray, int, chararray, float)}){(null,null,null, null)}));
{code}
then the statement parses without error.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-574) run command for grunt
run command for grunt - Key: PIG-574 URL: https://issues.apache.org/jira/browse/PIG-574 Project: Pig Issue Type: New Feature Components: grunt Reporter: David Ciemiewicz Priority: Minor This is a request for a run file command in grunt which will read a script from the local file system and execute the script interactively while in the grunt shell. One of the things that slows down iterative development of large, complicated Pig scripts that must operate on hadoop fs data is that the edit, run, debug cycle is slow because I must wait to allocate a Hadoop-on-Demand (hod) cluster for each iteration. I would prefer not to preallocate a cluster of nodes (though I could). Instead, I'd like to have one window open and edit my Pig script using vim or emacs, write it, and then type run myscript.pig at the grunt shell until I get things right. I'm used to doing similar things with Oracle, MySQL, and R. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-575) Please extend FieldSchema class with getSchema() member function for iterating over complex Schemas in Pig UDF outputSchema
Please extend FieldSchema class with getSchema() member function for iterating over complex Schemas in Pig UDF outputSchema --- Key: PIG-575 URL: https://issues.apache.org/jira/browse/PIG-575 Project: Pig Issue Type: Improvement Reporter: David Ciemiewicz

I have discovered that it is not possible to recurse through parts of the input Schema in the UDF outputSchema function. I have a function that operates on an input bag of tuples and then creates sequential pairings of the rows.

A = foreach One generate { ( 1, 'a' ), ( 2, 'b' ) } as bag { tuple ( seq: int, value: chararray ) };

The output of PAIRS(A) should be:

{ ( ( 1, 'a' ), ( 2, 'b' ) ), ( ( 2, 'b' ), ( null, null ) ) }

The default output schema for the function should be:

bag { tuple ( tuple ( order: int, value: chararray ), tuple ( order: int, value: chararray ) ) }

The problem I have is that I'm not able to recurse into the internal Schema of the FieldSchema in my outputSchema function to get at the tuple within the input bag. Here's my sample outputSchema for PAIRS:

public Schema outputSchema(Schema input) {
    try {
        System.out.println("input: " + input.toString());
        Schema databagSchema = new Schema();
        Schema tupleSchema = new Schema();
        Schema inputDataBag = new Schema(input.getFields().get(0));
        System.out.println("inputDataBag: " + input.getFields().get(0).toString());
        //
        // RIGHT HERE IS WHERE I WANT TO DO inputDataBag.getFields().get(0).getSchema()
        //
        Schema.FieldSchema inputTuple = inputDataBag.getFields().get(0);
        // Here's where I want to say ...
        System.out.println("inputTuple: " + inputTuple.toString());
        databagSchema.add(new Schema.FieldSchema(null, DataType.TUPLE));
        System.out.println("databagSchema: " + databagSchema.toString());
        return new Schema(
            new Schema.FieldSchema(
                getSchemaName(this.getClass().getName().toLowerCase(), input),
                databagSchema,
                DataType.BAG));
    } catch (Exception e) {
        return null;
    }
}

Here's the execution output from outputSchema:

input: {A: {seq: int,value: chararray},int,int}
inputDataBag: A: bag({seq: int,value: chararray})
inputTuple: A: bag({seq: int,value: chararray}) <== what I want to see is ( seq: int, value: chararray )
rowSchema: A: bag({seq: int,value: chararray})
rowSchema: A: bag({seq: int,value: chararray})

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-575) Please extend FieldSchema class with getSchema() member function for iterating over complex Schemas in Pig UDF outputSchema
[ https://issues.apache.org/jira/browse/PIG-575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Ciemiewicz updated PIG-575: - Component/s: impl Priority: Minor (was: Major)

Please extend FieldSchema class with getSchema() member function for iterating over complex Schemas in Pig UDF outputSchema --- Key: PIG-575 URL: https://issues.apache.org/jira/browse/PIG-575 Project: Pig Issue Type: Improvement Components: impl Reporter: David Ciemiewicz Priority: Minor

I have discovered that it is not possible to recurse through parts of the input Schema in the UDF outputSchema function. I have a function that operates on an input bag of tuples and then creates sequential pairings of the rows.

A = foreach One generate { ( 1, 'a' ), ( 2, 'b' ) } as bag { tuple ( seq: int, value: chararray ) };

The output of PAIRS(A) should be:

{ ( ( 1, 'a' ), ( 2, 'b' ) ), ( ( 2, 'b' ), ( null, null ) ) }

The default output schema for the function should be:

bag { tuple ( tuple ( order: int, value: chararray ), tuple ( order: int, value: chararray ) ) }

The problem I have is that I'm not able to recurse into the internal Schema of the FieldSchema in my outputSchema function to get at the tuple within the input bag. Here's my sample outputSchema for PAIRS:

public Schema outputSchema(Schema input) {
    try {
        System.out.println("input: " + input.toString());
        Schema databagSchema = new Schema();
        Schema tupleSchema = new Schema();
        Schema inputDataBag = new Schema(input.getFields().get(0));
        System.out.println("inputDataBag: " + input.getFields().get(0).toString());
        //
        // RIGHT HERE IS WHERE I WANT TO DO inputDataBag.getFields().get(0).getSchema()
        //
        Schema.FieldSchema inputTuple = inputDataBag.getFields().get(0);
        // Here's where I want to say ...
        System.out.println("inputTuple: " + inputTuple.toString());
        databagSchema.add(new Schema.FieldSchema(null, DataType.TUPLE));
        System.out.println("databagSchema: " + databagSchema.toString());
        return new Schema(
            new Schema.FieldSchema(
                getSchemaName(this.getClass().getName().toLowerCase(), input),
                databagSchema,
                DataType.BAG));
    } catch (Exception e) {
        return null;
    }
}

Here's the execution output from outputSchema:

input: {A: {seq: int,value: chararray},int,int}
inputDataBag: A: bag({seq: int,value: chararray})
inputTuple: A: bag({seq: int,value: chararray}) <== what I want to see is ( seq: int, value: chararray )
rowSchema: A: bag({seq: int,value: chararray})
rowSchema: A: bag({seq: int,value: chararray})

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
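A sketch of the recursion being requested: in the current API the inner schema appears to be reachable through FieldSchema's public schema field, which is exactly what a getSchema() accessor would formalize. This assumes the first input field is a bag of tuples:
{code}
import org.apache.pig.impl.logicalLayer.schema.Schema;

public class SchemaWalk {
    // Dig from the UDF's input schema down to the tuple inside the bag:
    // input -> bag FieldSchema -> bag Schema -> tuple FieldSchema -> tuple Schema.
    public static Schema tupleSchemaOfBag(Schema input) throws Exception {
        Schema.FieldSchema bagField = input.getFields().get(0);        // the bag argument
        Schema bagSchema = bagField.schema;                            // the bag's contents
        Schema.FieldSchema tupleField = bagSchema.getFields().get(0);  // the tuple in the bag
        return tupleField.schema;                                      // e.g. ( seq: int, value: chararray )
    }
}
{code}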