Review Request: PIG-1508: Make 'docs' target (forrest) work with Java 1.6
--- This is an automatically generated e-mail. To reply, visit: http://review.cloudera.org/r/725/ --- Review request for Pig Developers. Summary --- Remove Pig's dependency on Java5. This addresses bug PIG-1508. http://issues.apache.org/jira/browse/PIG-1508 Diffs - build.xml b0a2ada src/docs/forrest.properties 51f1af7 test/bin/test-patch.sh 55c449e Diff: http://review.cloudera.org/r/725/diff Testing --- Thanks, Carl
[jira] Commented: (PIG-1508) Make 'docs' target (forrest) work with Java 1.6
[ https://issues.apache.org/jira/browse/PIG-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12902767#action_12902767 ] HBase Review Board commented on PIG-1508: - Message from: Carl Steinbach c...@cloudera.com --- This is an automatically generated e-mail. To reply, visit: http://review.cloudera.org/r/725/ --- Review request for Pig Developers. Summary --- Remove Pig's dependency on Java5. This addresses bug PIG-1508. http://issues.apache.org/jira/browse/PIG-1508 Diffs - build.xml b0a2ada src/docs/forrest.properties 51f1af7 test/bin/test-patch.sh 55c449e Diff: http://review.cloudera.org/r/725/diff Testing --- Thanks, Carl Make 'docs' target (forrest) work with Java 1.6 --- Key: PIG-1508 URL: https://issues.apache.org/jira/browse/PIG-1508 Project: Pig Issue Type: Bug Components: documentation Affects Versions: 0.7.0 Reporter: Carl Steinbach Assignee: Carl Steinbach Attachments: PIG-1508.patch.txt FOR-984 covers the very inconvenient fact that Forrest 0.8 does not work with Java 1.6 The same ticket also suggests a workaround: disabling sitemap and stylesheet validation by setting the forrest.validate.sitemap and forrest.validate.stylesheets properties to false. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: Pig Contributor meeting notes
Wonderful, Dmitriy. It's a pity I missed the contributor meeting. Were any slides shared? On Wed, Aug 25, 2010 at 8:32 PM, Dmitriy Ryaboy dvrya...@gmail.com wrote: Twitter hosted this month's Pig contributor meeting. Developers from Yahoo, Twitter, LinkedIn, RichRelevance, and Cloudera were present. 1. Howl First, Alan Gates demoed Howl, a project whose goal is to provide a table management service for all of Hadoop. The vision is that ultimately you will be able to read/write data using regular MR, or Pig, or Hive, and read it using any of those three, with full support of a partition-aware metadata store that will tell you what data is available, what its schema is, etc., reusing a single table abstraction. Currently, tables are created using (a restricted subset of) Hive DDL statements; a Howl CLI for this will be created, which will enforce the restricted subset. Writing to the table using Pig or MapReduce is supported. Reading can already be done using all three. At the moment, a single Pig store statement can only store into a single partition; adding the ability to spray across partitions is on the roadmap. This, and a good API for interacting with the metastore, are the two areas that were identified as good opportunities for the wider developer community to get involved with the project. The source code is on GitHub, and is at the moment synchronized with the development trunk manually; Yahoo folks will look into changing this. Security is a concern, and Yahoo will be working on it. Making it possible for Hive to write to the tables is at the moment not as high a priority as the others listed; it would basically involve just writing a Hive SerDe (an equivalent of Pig's StoreFunc). 2. Azkaban presentation Russel Jurney and Richard Park from LinkedIn presented the workflow management tool open-sourced by LinkedIn, called Azkaban. It allows you to declare job dependencies, has a web interface for launching and monitoring jobs, etc. 
It has a special exec mode for Pig that lets you set some Pig-specific options on a per-job basis. It does not currently have triggering or job-instance parameter substitution (it does have job-level parameter substitution). When asked what Pig could do to make life easier for Azkaban, the two things Richard identified were registering jars through the grunt command line and a way to monitor the running job -- both of these are already in trunk, so we're in pretty good shape for 0.8. 3. Piggybank discussion Kevin Weil led a discussion of the piggybank. There are a few problems with it -- it's released on the Pig schedule, and has quite a few barriers to submission that are, anecdotally at least, preventing people from contributing. Several options were discussed, with the group finally settling on starting a community-curated GitHub project for piggybank. It will have a number of committers from different companies, and will aim to make it easy for folks to contribute (all contribs will still have to have tests, and be Apache 2.0-licensed). More details will be forthcoming as we figure them out. Initially this project will be seeded with the current Piggybank functions some time after 0.8 is branched. The initial list of committers is Kevin Weil (Twitter), Dmitriy Ryaboy (Twitter), Carl Steinbach (Cloudera), and Russel Jurney (LinkedIn). Yahoo will also nominate someone. Please send us any thoughts you might have on this subject. It was suggested that a lot of common code might be shared with Hive UDFs, which have the same problems as Piggybank does, and that perhaps the project can be another collaboration point between the projects. It's not clear how that would work; Carl will talk to other Hive people. Pig 0.9 So far the items on the list for 0.9 are: better type propagation / resolution story and documentation, perhaps a different parser (ANTLR?), some performance tweaks, and map types with fixed-type values. Much still to be decided. 
The next contributor meeting will be hosted by LinkedIn in October. -Dmitriy -- Best Regards Jeff Zhang
Added Pig to the list of projects on Cloudera's public ReviewBoard instance
Hi, I added Pig to the list of projects that can be reviewed on Cloudera's public ReviewBoard instance, located at http://review.cloudera.org (AKA review.hbase.org). Review requests and comments are automatically forwarded to the pig-dev mailing list, and they also get posted back to the original JIRA ticket. Please refer to the Review Process section of HBase's HowToContribute page for more information on using ReviewBoard: http://wiki.apache.org/hadoop/Hbase/HowToContribute Thanks. Carl
[jira] Created: (PIG-1569) java properties not honored in case of properties such as stop.on.failure
java properties not honored in case of properties such as stop.on.failure - Key: PIG-1569 URL: https://issues.apache.org/jira/browse/PIG-1569 Project: Pig Issue Type: Bug Reporter: Thejas M Nair Fix For: 0.8.0 In org.apache.pig.Main, properties are being set to default values without checking whether the corresponding java system properties have been set to something else. stop.on.failure, opt.multiquery, aggregate.warning are some properties that have this problem. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
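The behavior the report asks for can be sketched in plain Java. This is a hypothetical illustration, not Pig's actual code; the class and helper names are invented:

```java
// Hypothetical sketch of the fix PIG-1569 implies: a hard-coded default
// should only apply when no java system property (-Dkey=value) already
// set the key. Names here are illustrative, not from org.apache.pig.Main.
import java.util.Properties;

public class PropertyDefaults {
    // Return the explicitly set value if present, otherwise the default.
    static String resolve(Properties sysProps, String key, String defaultValue) {
        String explicit = sysProps.getProperty(key);
        return explicit != null ? explicit : defaultValue;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("stop.on.failure", "true"); // as if -Dstop.on.failure=true was passed
        // The user's explicit setting must win over the default:
        System.out.println(resolve(props, "stop.on.failure", "false"));
        // An unset property falls back to its default:
        System.out.println(resolve(props, "opt.multiquery", "true"));
    }
}
```

The bug described above is the opposite pattern: assigning the default unconditionally, which silently overwrites whatever the user passed with `-D`.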
[jira] Updated: (PIG-1501) need to investigate the impact of compression on pig performance
[ https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Zhou updated PIG-1501: -- Status: Patch Available (was: Open) This feature will save HDFS space used to store the intermediate data used by PIG and potentially improve query execution speed. In general, the more intermediate data generated, the more storage and speedup benefits. There are no backward compatibility issues as a result of this feature. An example is the following test.pig script:

register pigperf.jar;
A = load '/user/pig/tests/data/pigmix/page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader() as (user, action, timespent:long, query_term, ip_addr, timestamp, estimated_revenue, page_info, page_links);
B1 = filter A by timespent == 4;
B = load '/user/pig/tests/data/pigmix/queryterm' as (query_term);
C = join B1 by query_term, B by query_term using 'skewed' parallel 300;
D = distinct C parallel 300;
store D into 'output.lzo';

which is launched as follows:

java -cp /grid/0/gs/conf/current:/grid/0/jars/pig.jar -Djava.library.path=/grid/0/gs/hadoop/current/lib/native/Linux-i386-32 -Dpig.tmpfilecompression=true -Dpig.tmpfilecompression.codec=lzo org.apache.pig.Main ./test.pig

need to investigate the impact of compression on pig performance Key: PIG-1501 URL: https://issues.apache.org/jira/browse/PIG-1501 Project: Pig Issue Type: Test Reporter: Olga Natkovich Assignee: Yan Zhou Fix For: 0.8.0 Attachments: compress_perf_data.txt, compress_perf_data_2.txt, PIG-1501.patch, PIG-1501.patch, PIG-1501.patch We would like to understand how compressing map results as well as reducer output in a chain of MR jobs impacts performance. We can use PigMix queries for this investigation. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1518) multi file input format for loaders
[ https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Zhou updated PIG-1518: -- Attachment: PIG-1518.patch multi file input format for loaders --- Key: PIG-1518 URL: https://issues.apache.org/jira/browse/PIG-1518 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Yan Zhou Fix For: 0.8.0 Attachments: PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch We frequently run into the situation where Pig needs to deal with small files in the input. In this case a separate map is created for each file, which can be very inefficient. It would be great to have an umbrella input format that can take multiple files and use them in a single split. We would like to see this working with different data formats if possible. There are already a couple of input formats doing a similar thing: MultifileInputFormat as well as CombinedInputFormat; however, neither works with the new Hadoop 20 API. We at least want to do a feasibility study for Pig 0.8.0. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: Pig optimizer
Anyone, please? Renato M. 2010/8/24 Renato Marroquín Mogrovejo renatoj.marroq...@gmail.com Hi Daniel, Thanks, but that was not what I was actually looking for. What I want to know is, for example, how the optimizer works when the bags' logical plans are combined, or, if all commands are reduced at the end to CO-GROUP commands, how is this handled? I know from Pig's paper that the ORDER and LOAD commands generate new MapReduce jobs; are there any optimizations for the physical plans? Thanks in advance. Renato M. 2010/8/23 Daniel Dai jiany...@yahoo-inc.com Hi, Renato, There is a description of optimization rules in the Pig Latin reference manual: http://hadoop.apache.org/pig/docs/r0.7.0/piglatin_ref1.html#Optimization+Rules. Is that enough? Daniel Renato Marroquín Mogrovejo wrote: Hey everyone, I was wondering if anybody has any references or suggestions on how to learn about Pig's optimizer besides the source code or Pig's paper. Thanks in advance. Renato M.
[jira] Created: (PIG-1570) native mapreduce operator MR job does not follow same failure handling logic as other pig MR jobs
native mapreduce operator MR job does not follow same failure handling logic as other pig MR jobs - Key: PIG-1570 URL: https://issues.apache.org/jira/browse/PIG-1570 Project: Pig Issue Type: Bug Reporter: Thejas M Nair Fix For: 0.8.0 The code path for handling failure in MR job corresponding to native MR is different and does not have the same behavior. For example, even if the MR job for mapreduce operator fails, the number of jobs that failed is being reported as 0 in PigStats log. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1571) add a compile time check to see if the output file of native mapreduce operator exists
add a compile time check to see if the output file of native mapreduce operator exists -- Key: PIG-1571 URL: https://issues.apache.org/jira/browse/PIG-1571 Project: Pig Issue Type: Bug Reporter: Thejas M Nair If the output file for the native MR operator exists, the query does not fail at compile time; it fails only at runtime. Since this file is loaded in the nested load of the native MR operator, it should be possible to check for this file. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-506) Does pig need a NATIVE keyword?
[ https://issues.apache.org/jira/browse/PIG-506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thejas M Nair updated PIG-506: -- Attachment: PIG-506.2.patch PIG-506.2.patch has - Changes to get mapreduce operator working with new logical plan - Changes to LO/PO Native operators - The store and load for the operator are no longer within it, they are part of the plan. As a result, several changes in visitors made for handling the load/store within LONative have been reverted. - Fix for reporting failure when MR job corresponding to native operator fails. - Removed TestTestNativeMapReduce from exclude list in ant target. Some issues still to be fixed, which I will address as part of new jiras: - PIG-1570 The code path for handling failure in MR job corresponding to native MR is different and does not have the same behavior. - PIG-1571 If the output file for native MR exists, the query does not fail at compile time; it fails only at runtime. Since this file is loaded in the nested load of the native MR operator, it should be possible to check for this file. Does pig need a NATIVE keyword? --- Key: PIG-506 URL: https://issues.apache.org/jira/browse/PIG-506 Project: Pig Issue Type: New Feature Components: impl Reporter: Alan Gates Assignee: Aniket Mokashi Priority: Minor Fix For: 0.8.0 Attachments: NativeImplInitial.patch, NativeMapReduceFinale1.patch, NativeMapReduceFinale2.patch, NativeMapReduceFinale3.patch, PIG-506.2.patch, PIG-506.patch, TestWordCount.jar Assume a user had a job that broke easily into three pieces. Further assume that pieces one and three were easily expressible in pig, but that piece two needed to be written in map reduce for whatever reason (performance, something that pig could not easily express, legacy job that was too important to change, etc.). Today the user would either have to use map reduce for the entire job or manually handle the stitching together of pig and map reduce jobs. 
What if instead pig provided a NATIVE keyword that would allow the script to pass off the data stream to the underlying system (in this case map reduce). The semantics of NATIVE would vary by underlying system. In the map reduce case, we would assume that this indicated a collection of one or more fully contained map reduce jobs, so that pig would store the data, invoke the map reduce jobs, and then read the resulting data to continue. It might look something like this: {code} A = load 'myfile'; X = load 'myotherfile'; B = group A by $0; C = foreach B generate group, myudf(B); D = native (jar=mymr.jar, infile=frompig outfile=topig); E = join D by $0, X by $0; ... {code} This differs from streaming in that it allows the user to insert an arbitrary amount of native processing, whereas streaming allows the insertion of one binary. It also differs in that, for streaming, data is piped directly into and out of the binary as part of the pig pipeline. Here the pipeline would be broken, data written to disk, and the native block invoked, then data read back from disk. Another alternative is to say this is unnecessary because the user can do the coordination from java, using the PIgServer interface to run pig and calling the map reduce job explicitly. The advantages of the native keyword are that the user need not be worried about coordination between the jobs, pig will take care of it. Also the user can make use of existing java applications without being a java programmer. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1570) native mapreduce operator MR job does not follow same failure handling logic as other pig MR jobs
[ https://issues.apache.org/jira/browse/PIG-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12902919#action_12902919 ] Thejas M Nair commented on PIG-1570: Another thing to investigate (somewhat related) - there seems to be a problem when PigServer is used to execute a query having the native mr operator - I was unable to run the tests in local mode, but I am able to run the query in local mode from the command line. native mapreduce operator MR job does not follow same failure handling logic as other pig MR jobs - Key: PIG-1570 URL: https://issues.apache.org/jira/browse/PIG-1570 Project: Pig Issue Type: Bug Reporter: Thejas M Nair Fix For: 0.8.0 The code path for handling failure in MR job corresponding to native MR is different and does not have the same behavior. For example, even if the MR job for the mapreduce operator fails, the number of jobs that failed is being reported as 0 in the PigStats log. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: Pig Contributor meeting notes
Slides about Azkaban and Pig: http://www.slideshare.net/rjurney/azkaban-pig-5057793 On Thu, Aug 26, 2010 at 12:55 AM, Jeff Zhang zjf...@gmail.com wrote: ... On Wed, Aug 25, 2010 at 8:32 PM, Dmitriy Ryaboy dvrya...@gmail.com wrote: ...
[jira] Updated: (PIG-1501) need to investigate the impact of compression on pig performance
[ https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thejas M Nair updated PIG-1501: --- Status: Resolved (was: Patch Available) Hadoop Flags: [Reviewed] Resolution: Fixed Patch committed to trunk. Thanks Yan! need to investigate the impact of compression on pig performance Key: PIG-1501 URL: https://issues.apache.org/jira/browse/PIG-1501 Project: Pig Issue Type: Test Reporter: Olga Natkovich Assignee: Yan Zhou Fix For: 0.8.0 Attachments: compress_perf_data.txt, compress_perf_data_2.txt, PIG-1501.patch, PIG-1501.patch, PIG-1501.patch We would like to understand how compressing map results as well as reducer output in a chain of MR jobs impacts performance. We can use PigMix queries for this investigation. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1572) change default datatype when relations are used as scalar to bytearray
change default datatype when relations are used as scalar to bytearray -- Key: PIG-1572 URL: https://issues.apache.org/jira/browse/PIG-1572 Project: Pig Issue Type: Bug Reporter: Thejas M Nair Fix For: 0.8.0 When relations are cast to scalar, the current default type is chararray. This is inconsistent with the behavior in the rest of pig-latin. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1518) multi file input format for loaders
[ https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Zhou updated PIG-1518: -- Attachment: PIG-1518.patch rebased on the latest trunk multi file input format for loaders --- Key: PIG-1518 URL: https://issues.apache.org/jira/browse/PIG-1518 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Yan Zhou Fix For: 0.8.0 Attachments: PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch We frequently run into the situation where Pig needs to deal with small files in the input. In this case a separate map is created for each file, which can be very inefficient. It would be great to have an umbrella input format that can take multiple files and use them in a single split. We would like to see this working with different data formats if possible. There are already a couple of input formats doing a similar thing: MultifileInputFormat as well as CombinedInputFormat; however, neither works with the new Hadoop 20 API. We at least want to do a feasibility study for Pig 0.8.0. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1568) Optimization rule FilterAboveForeach is too restrictive and doesn't handle project * correctly
[ https://issues.apache.org/jira/browse/PIG-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuefu Zhang updated PIG-1568: - Status: Open (was: Patch Available) Optimization rule FilterAboveForeach is too restrictive and doesn't handle project * correctly -- Key: PIG-1568 URL: https://issues.apache.org/jira/browse/PIG-1568 Project: Pig Issue Type: Bug Reporter: Xuefu Zhang Assignee: Xuefu Zhang Fix For: 0.8.0 Attachments: jira-1568-1.patch, jira-1568-1.patch The FilterAboveForeach rule optimizes the plan by pushing a filter up above the previous foreach operator. However, during code review, two major problems were found: 1. The current implementation assumes that if no projection is found in the filter condition then all columns from foreach are projected. This issue prevents the following optimization: A = LOAD 'file.txt' AS (a(u,v), b, c); B = FOREACH A GENERATE $0, b; C = FILTER B BY 8 > 5; STORE C INTO 'empty'; 2. The current implementation doesn't handle the * projection, which means project all columns. As a result, it wasn't able to optimize the following: A = LOAD 'file.txt' AS (a(u,v), b, c); B = FOREACH A GENERATE $0, b; C = FILTER B BY Identity.class.getName(*) > 5; STORE C INTO 'empty'; -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1568) Optimization rule FilterAboveForeach is too restrictive and doesn't handle project * correctly
[ https://issues.apache.org/jira/browse/PIG-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuefu Zhang updated PIG-1568: - Attachment: jira-1568-1.patch Optimization rule FilterAboveForeach is too restrictive and doesn't handle project * correctly -- Key: PIG-1568 URL: https://issues.apache.org/jira/browse/PIG-1568 Project: Pig Issue Type: Bug Reporter: Xuefu Zhang Assignee: Xuefu Zhang Fix For: 0.8.0 Attachments: jira-1568-1.patch, jira-1568-1.patch The FilterAboveForeach rule optimizes the plan by pushing a filter up above the previous foreach operator. However, during code review, two major problems were found: 1. The current implementation assumes that if no projection is found in the filter condition then all columns from foreach are projected. This issue prevents the following optimization: A = LOAD 'file.txt' AS (a(u,v), b, c); B = FOREACH A GENERATE $0, b; C = FILTER B BY 8 > 5; STORE C INTO 'empty'; 2. The current implementation doesn't handle the * projection, which means project all columns. As a result, it wasn't able to optimize the following: A = LOAD 'file.txt' AS (a(u,v), b, c); B = FOREACH A GENERATE $0, b; C = FILTER B BY Identity.class.getName(*) > 5; STORE C INTO 'empty'; -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1564) add support for multiple filesystems
[ https://issues.apache.org/jira/browse/PIG-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12902952#action_12902952 ] Richard Ding commented on PIG-1564: --- Hi Andrew, HDataStorage is a thin layer on top of the Hadoop FileSystem. Since moving its local mode to Hadoop local mode, Pig no longer needs this layer. We intend to remove it in the future. As for Pig reading data from one file system and writing it to another, this feature has been supported since Pig 0.7. -Richard add support for multiple filesystems Key: PIG-1564 URL: https://issues.apache.org/jira/browse/PIG-1564 Project: Pig Issue Type: Improvement Reporter: Andrew Hitchcock Attachments: PIG-1564-1.patch Currently you can't run Pig scripts that read data from one file system and write it to another. Also, Grunt doesn't support CDing from one directory to another on different file systems. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1568) Optimization rule FilterAboveForeach is too restrictive and doesn't handle project * correctly
[ https://issues.apache.org/jira/browse/PIG-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuefu Zhang updated PIG-1568: - Status: Patch Available (was: Open) Regenerated the patch after fixing the failed test case. The test case itself was changed as it relied on an internal bug. When a UDF takes no arguments, the PIG backend passes the whole input to the UDF. This needs to be corrected. In other words, if a UDF doesn't specify any arguments, we assume that it doesn't need any input. If a UDF needs all input, it can specify a star (*); it can also list whatever it requires in the argument list. A Jira tracking Pig backend changes will be created. Optimization rule FilterAboveForeach is too restrictive and doesn't handle project * correctly -- Key: PIG-1568 URL: https://issues.apache.org/jira/browse/PIG-1568 Project: Pig Issue Type: Bug Reporter: Xuefu Zhang Assignee: Xuefu Zhang Fix For: 0.8.0 Attachments: jira-1568-1.patch, jira-1568-1.patch The FilterAboveForeach rule optimizes the plan by pushing a filter up above the previous foreach operator. However, during code review, two major problems were found: 1. The current implementation assumes that if no projection is found in the filter condition then all columns from foreach are projected. This issue prevents the following optimization: A = LOAD 'file.txt' AS (a(u,v), b, c); B = FOREACH A GENERATE $0, b; C = FILTER B BY 8 > 5; STORE C INTO 'empty'; 2. The current implementation doesn't handle the * projection, which means project all columns. As a result, it wasn't able to optimize the following: A = LOAD 'file.txt' AS (a(u,v), b, c); B = FOREACH A GENERATE $0, b; C = FILTER B BY Identity.class.getName(*) > 5; STORE C INTO 'empty'; -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1573) PIG shouldn't pass all input to a UDF if the UDF specify no argument
PIG shouldn't pass all input to a UDF if the UDF specify no argument Key: PIG-1573 URL: https://issues.apache.org/jira/browse/PIG-1573 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Xuefu Zhang Fix For: 0.9.0 Currently, if a user uses a UDF with no arguments in a pig script, the PIG backend assumes that the UDF takes all input, so at run time it passes all input as a tuple to the UDF. This assumption is incorrect, causing conceptual confusion. If a UDF takes all input, it can specify a star (*) as its argument. If it specifies no arguments at all, then we assume that it requires no input data. We need to differentiate no input and all input for a UDF. Thus, in the case that a UDF specifies no arguments, the backend should pass the UDF an empty tuple. See notes in PIG-1586 for more information. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
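The proposed semantics can be illustrated with a small standalone sketch. This is not Pig's actual backend code; the helper, its argument encoding ("$0", "*"), and the class name are all hypothetical:

```java
// Illustration of the input-dispatch rule PIG-1573 proposes:
// no arguments -> empty tuple, star -> whole input, otherwise project
// only the requested fields. All names and encodings are hypothetical.
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class UdfInputDispatch {
    // Decide which fields of the input tuple a UDF should receive.
    static List<Object> inputFor(List<String> udfArgs, List<Object> tuple) {
        if (udfArgs.isEmpty()) {
            // Proposed behavior: no arguments means no input, not the
            // whole input as the current backend assumes.
            return Collections.emptyList();
        }
        if (udfArgs.contains("*")) {
            return tuple; // star explicitly requests the whole input tuple
        }
        List<Object> projected = new ArrayList<>();
        for (String arg : udfArgs) {
            // "$1" -> field at position 1 of the input tuple
            projected.add(tuple.get(Integer.parseInt(arg.substring(1))));
        }
        return projected;
    }
}
```

Under the current (buggy) behavior described above, the empty-argument case would instead return the whole tuple, making "no input" inexpressible.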
[jira] Assigned: (PIG-1573) PIG shouldn't pass all input to a UDF if the UDF specify no argument
[ https://issues.apache.org/jira/browse/PIG-1573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuefu Zhang reassigned PIG-1573: Assignee: Xuefu Zhang PIG shouldn't pass all input to a UDF if the UDF specify no argument Key: PIG-1573 URL: https://issues.apache.org/jira/browse/PIG-1573 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Xuefu Zhang Assignee: Xuefu Zhang Fix For: 0.9.0 Currently, if a user uses a UDF with no arguments in a pig script, the PIG backend assumes that the UDF takes all input, so at run time it passes all input as a tuple to the UDF. This assumption is incorrect, causing conceptual confusion. If a UDF takes all input, it can specify a star (*) as its argument. If it specifies no arguments at all, then we assume that it requires no input data. We need to differentiate no input and all input for a UDF. Thus, in the case that a UDF specifies no arguments, the backend should pass the UDF an empty tuple. See notes in PIG-1586 for more information. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (PIG-1518) multi file input format for loaders
[ https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding resolved PIG-1518. --- Hadoop Flags: [Reviewed] Resolution: Fixed Patch is committed to trunk. Thanks Yan. multi file input format for loaders --- Key: PIG-1518 URL: https://issues.apache.org/jira/browse/PIG-1518 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Yan Zhou Fix For: 0.8.0 Attachments: PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch We frequently run into situations where Pig needs to deal with many small files in the input. In this case a separate map is created for each file, which can be very inefficient. It would be great to have an umbrella input format that can take multiple files and use them in a single split. We would like to see this working with different data formats if possible. There are already a couple of input formats doing a similar thing: MultifileInputFormat as well as CombinedInputFormat; however, neither works with the new Hadoop 20 API. We at least want to do a feasibility study for Pig 0.8.0. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: Pig optimizer
Hi, Renato, I think you are talking about how we organize different operators into map-reduce jobs. Unfortunately there is no document on this currently. Basically, we put as many operators into one map-reduce job as possible. Co-group/Group, Join, Order, Distinct, Cross, and Stream create a map-reduce boundary; most others we put into existing jobs. The main logic is inside MRCompiler.java. Daniel Renato Marroquín Mogrovejo wrote: Anyone, please? Renato M. 2010/8/24 Renato Marroquín Mogrovejo renatoj.marroq...@gmail.com Hi Daniel, Thanks, but that was not what I was actually looking for. What I want to know is, for example, how the optimizer works when the bags' logical plans are combined, or, if all commands are reduced at the end to CO-GROUP commands, how is this handled? I know from Pig's paper that the ORDER and LOAD commands generate new MapReduce jobs; are there any optimizations for the physical plans? Thanks in advance. Renato M. 2010/8/23 Daniel Dai jiany...@yahoo-inc.com Hi, Renato, There is a description of the optimization rules in the Pig Latin reference manual: http://hadoop.apache.org/pig/docs/r0.7.0/piglatin_ref1.html#Optimization+Rules. Is that enough? Daniel Renato Marroquín Mogrovejo wrote: Hey everyone, I was wondering if anybody has any references or suggestions on how to learn about Pig's optimizer besides the source code or Pig's paper. Thanks in advance. Renato M.
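As a rough illustration of the boundaries Daniel lists (relation and file names below are made up), each marked statement forces a new map-reduce job, while the filter and foreach are folded into existing jobs:

{code}
A = load 'input';
B = filter A by $1 > 0;                  -- folded into the map of the first job
C = group B by $0;                       -- map-reduce boundary
D = foreach C generate group, COUNT(B);  -- folded into the reduce of that job
E = order D by $1;                       -- map-reduce boundary: a new job
store E into 'output';
{code}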
Re: Pig Contributor meeting notes
On Aug 26, 2010, at 12:55 AM, Jeff Zhang wrote: Wonderful, Dmitriy. It's a pity I missed the contributor meeting. Were any slides shared? Jeff, We don't want to exclude our contributors who don't happen to live in the San Francisco Bay Area. If we could include you via Skype or some other technology we'd be happy to set it up on our end. Do you think something like that would work for you? Alan.
Re: Added Pig to the list of projects on Cloudera's public ReviewBoard instance
Thanks Carl! On Thu, Aug 26, 2010 at 1:08 AM, Carl Steinbach c...@cloudera.com wrote: Hi, I added Pig to the list of projects that can be reviewed on Cloudera's public ReviewBoard instance, located at http://review.cloudera.org (AKA review.hbase.org). Review requests and comments are automatically forwarded to the pig-dev mailing list, and they also get posted back to the original JIRA ticket. Please refer to the Review Process section of HBase's HowToContribute page for more information on using ReviewBoard: http://wiki.apache.org/hadoop/Hbase/HowToContribute Thanks. Carl
[jira] Updated: (PIG-1343) pig_log file missing even though Main tells it is creating one and an M/R job fails
[ https://issues.apache.org/jira/browse/PIG-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] niraj rai updated PIG-1343: --- Status: Patch Available (was: Open) pig_log file missing even though Main tells it is creating one and an M/R job fails Key: PIG-1343 URL: https://issues.apache.org/jira/browse/PIG-1343 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.6.0 Reporter: Viraj Bhat Assignee: niraj rai Fix For: 0.8.0 Attachments: 1343.patch, PIG-1343-1.patch, pig_1343_2.patch There is a particular case where I was running with the latest trunk of Pig. {code} $java -cp pig.jar:/home/path/hadoop20cluster org.apache.pig.Main testcase.pig [main] INFO org.apache.pig.Main - Logging error messages to: /homes/viraj/pig_1263420012601.log $ls -l pig_1263420012601.log ls: pig_1263420012601.log: No such file or directory {code} The job failed and the log file did not contain anything, the only way to debug was to look into the Jobtracker logs. Here are some reasons which would have caused this behavior: 1) The underlying filer/NFS had some issues. In that case do we not error on stdout? 2) There are some errors from the backend which are not being captured Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1343) pig_log file missing even though Main tells it is creating one and an M/R job fails
[ https://issues.apache.org/jira/browse/PIG-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] niraj rai updated PIG-1343: --- Attachment: pig_1343_2.patch Implemented the interactive mode logging as well.
[jira] Updated: (PIG-1555) [piggybank] add CSV Loader
[ https://issues.apache.org/jira/browse/PIG-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-1555: --- Status: Resolved (was: Patch Available) Release Note: CSVLoader can be used to load comma-separated value files. It properly handles commas included inside quoted fields, and quotes escaped by preceding them with another quote character (Excel-style). CSVLoader only handles single-line entries; quoting a multi-line value will *not* work. Resolution: Fixed [piggybank] add CSV Loader -- Key: PIG-1555 URL: https://issues.apache.org/jira/browse/PIG-1555 Project: Pig Issue Type: New Feature Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Priority: Minor Fix For: 0.8.0 Attachments: PIG_1555.patch Users often ask for a CSV loader that can handle quoted commas. Let's get 'er done. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
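A minimal usage sketch for the new loader (the input file and its contents are hypothetical; the register line reflects that CSVLoader ships in piggybank, not in Pig proper):

{code}
register piggybank.jar;
-- an input line like: 1,"Doe, John","a ""quoted"" word"
-- loads as three fields: 1 | Doe, John | a "quoted" word
A = load 'data.csv' using org.apache.pig.piggybank.storage.CSVLoader();
dump A;
{code}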
[jira] Commented: (PIG-1518) multi file input format for loaders
[ https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12903031#action_12903031 ] Dmitriy V. Ryaboy commented on PIG-1518: This is a great feature, thanks Yan. Could you comment on what the final solution was as far as PigStorage and OrderedLoadFunc? I see two ideas (yours and Ashutosh's) in the discussion, but not which direction you ultimately took.
[jira] Assigned: (PIG-1569) java properties not honored in case of properties such as stop.on.failure
[ https://issues.apache.org/jira/browse/PIG-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding reassigned PIG-1569: - Assignee: Richard Ding java properties not honored in case of properties such as stop.on.failure - Key: PIG-1569 URL: https://issues.apache.org/jira/browse/PIG-1569 Project: Pig Issue Type: Bug Reporter: Thejas M Nair Assignee: Richard Ding Fix For: 0.8.0 In org.apache.pig.Main, properties are being set to default values without checking whether the java system properties have been set to something else. stop.on.failure, opt.multiquery, and aggregate.warning are some properties that have this problem. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1205) Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc
[ https://issues.apache.org/jira/browse/PIG-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12903043#action_12903043 ] Alan Gates commented on PIG-1205: - Comments # As discussed previously, LoadStoreCaster should be changed so that there is a StoreCaster interface that has the toByte methods, and LoadStoreCaster is a convenience interface that extends LoadCaster and StoreCaster. # It looks like with HBASE-1933 Hbase is now available via Maven. Can we pull it from Maven rather than check in the jar to our lib directory? Since I know little about Hbase I focussed my review on the Pig side. Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc -- Key: PIG-1205 URL: https://issues.apache.org/jira/browse/PIG-1205 Project: Pig Issue Type: Sub-task Affects Versions: 0.7.0 Reporter: Jeff Zhang Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: PIG_1205.patch, PIG_1205_2.patch, PIG_1205_3.patch, PIG_1205_4.patch, PIG_1205_5.path, PIG_1205_6.patch, PIG_1205_7.patch, PIG_1205_8.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1343) pig_log file missing even though Main tells it is creating one and an M/R job fails
[ https://issues.apache.org/jira/browse/PIG-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12903072#action_12903072 ] Richard Ding commented on PIG-1343: --- The new patch logs NPE instead of the intended message: {code} [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2999: Unexpected internal error. null {code}
[jira] Updated: (PIG-1343) pig_log file missing even though Main tells it is creating one and an M/R job fails
[ https://issues.apache.org/jira/browse/PIG-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] niraj rai updated PIG-1343: --- Attachment: pig_1343_3.patch
[jira] Updated: (PIG-1343) pig_log file missing even though Main tells it is creating one and an M/R job fails
[ https://issues.apache.org/jira/browse/PIG-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] niraj rai updated PIG-1343: --- Status: Open (was: Patch Available)
[jira] Updated: (PIG-1458) aggregate files for replicated join
[ https://issues.apache.org/jira/browse/PIG-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1458: -- Attachment: PIG-1458.patch This patch uses the new multi-file-combiner (PIG-1518) to concatenate many small files for replicated join. This is based on the assumption that the total size of the replicated files should be small enough to fit into main memory. aggregate files for replicated join --- Key: PIG-1458 URL: https://issues.apache.org/jira/browse/PIG-1458 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Richard Ding Fix For: 0.8.0 Attachments: PIG-1458.patch We have noticed that if the smaller data in replicated join has many files, this puts unneeded burden on the name node. pre-aggregating the files can improve the situation -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
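For context, a replicated join is requested in Pig Latin as below (relation and file names are illustrative); the relation on the right-hand side is loaded into memory on every map task, which is why the patch's assumption that the replicated files fit in main memory already holds:

{code}
big   = load 'big_input';
small = load 'small_input';
-- 'replicated' builds an in-memory hash table from the small relation
J = join big by $0, small by $0 using 'replicated';
store J into 'joined';
{code}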
[jira] Commented: (PIG-1518) multi file input format for loaders
[ https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12903102#action_12903102 ] Yan Zhou commented on PIG-1518: --- It is not combinable if the loader is a CollectableLoadFunc AND an OrderedLoadFunc. Since PigStorage is a CollectableLoadFunc but not an OrderedLoadFunc, it is combinable.
[jira] Commented: (PIG-1565) additional piggybank datetime and string UDFs
[ https://issues.apache.org/jira/browse/PIG-1565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12903123#action_12903123 ] Alan Gates commented on PIG-1565: - Comments # ErrorCatchingBase swallows any non-ExecExceptions. It should print their messages as warnings. Warnings are collated and the count reported at the end of the job; details are only printed if the user asks for them. That way the user will still be informed that something unexpected happened and can investigate further if he wants to. # On the duplication, it looks to me like INDEX_OF and LAST_INDEX_OF are supersets of the functions already in Pig. You could submit a patch for those two functions (which are now builtins) to extend them to take the optional third argument. SPLIT_ON_REGEX looks like a subset of the existing SPLIT function that is built into Pig, so other than keeping it as an alias for Amazon users who are used to calling SPLIT_ON_REGEX, I'm not clear what the value is. Thanks for contributing all of these, this is great. I'll run test-patch and the unit tests and post the results. additional piggybank datetime and string UDFs - Key: PIG-1565 URL: https://issues.apache.org/jira/browse/PIG-1565 Project: Pig Issue Type: Improvement Reporter: Andrew Hitchcock Attachments: PIG-1565-1.patch Pig is missing a variety of UDFs that might be helpful for users implementing Pig scripts. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (PIG-1565) additional piggybank datetime and string UDFs
[ https://issues.apache.org/jira/browse/PIG-1565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates reassigned PIG-1565: --- Assignee: Andrew Hitchcock additional piggybank datetime and string UDFs - Key: PIG-1565 URL: https://issues.apache.org/jira/browse/PIG-1565 Project: Pig Issue Type: Improvement Reporter: Andrew Hitchcock Assignee: Andrew Hitchcock Fix For: 0.8.0 Attachments: PIG-1565-1.patch Pig is missing a variety of UDFs that might be helpful for users implementing Pig scripts. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1343) pig_log file missing even though Main tells it is creating one and an M/R job fails
[ https://issues.apache.org/jira/browse/PIG-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] niraj rai updated PIG-1343: --- Attachment: pig_1343_4.patch
[jira] Commented: (PIG-1564) add support for multiple filesystems
[ https://issues.apache.org/jira/browse/PIG-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12903128#action_12903128 ] Alan Gates commented on PIG-1564: - We do intend to remove it, though at the moment there is no other way to access HDFS for UDFs. So before we can officially deprecate it we need to come up with a replacement. Andrew, as Richard points out, as of Pig 0.7 load and store functions no longer use HDataStorage. Do you still see this patch as being useful just for UDFs? Or are load and store functions the only use cases for it? add support for multiple filesystems Key: PIG-1564 URL: https://issues.apache.org/jira/browse/PIG-1564 Project: Pig Issue Type: Improvement Reporter: Andrew Hitchcock Attachments: PIG-1564-1.patch Currently you can't run Pig scripts that read data from one file system and write it to another. Also, Grunt doesn't support CDing from one directory to another on different file systems. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-506) Does pig need a NATIVE keyword?
[ https://issues.apache.org/jira/browse/PIG-506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thejas M Nair updated PIG-506: -- Attachment: PIG-506.3.patch Updated patch, earlier patch was missing src/org/apache/pig/newplan/logical/relational/LONative.java. test-patch and core tests are successful. [exec] +1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 8 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. Does pig need a NATIVE keyword? --- Key: PIG-506 URL: https://issues.apache.org/jira/browse/PIG-506 Project: Pig Issue Type: New Feature Components: impl Reporter: Alan Gates Assignee: Aniket Mokashi Priority: Minor Fix For: 0.8.0 Attachments: NativeImplInitial.patch, NativeMapReduceFinale1.patch, NativeMapReduceFinale2.patch, NativeMapReduceFinale3.patch, PIG-506.2.patch, PIG-506.3.patch, PIG-506.patch, TestWordCount.jar Assume a user had a job that broke easily into three pieces. Further assume that pieces one and three were easily expressible in pig, but that piece two needed to be written in map reduce for whatever reason (performance, something that pig could not easily express, legacy job that was too important to change, etc.). Today the user would either have to use map reduce for the entire job or manually handle the stitching together of pig and map reduce jobs. What if instead pig provided a NATIVE keyword that would allow the script to pass off the data stream to the underlying system (in this case map reduce). The semantics of NATIVE would vary by underlying system. 
In the map reduce case, we would assume that this indicated a collection of one or more fully contained map reduce jobs, so that pig would store the data, invoke the map reduce jobs, and then read the resulting data to continue. It might look something like this: {code} A = load 'myfile'; X = load 'myotherfile'; B = group A by $0; C = foreach B generate group, myudf(B); D = native (jar=mymr.jar, infile=frompig outfile=topig); E = join D by $0, X by $0; ... {code} This differs from streaming in that it allows the user to insert an arbitrary amount of native processing, whereas streaming allows the insertion of one binary. It also differs in that, for streaming, data is piped directly into and out of the binary as part of the pig pipeline. Here the pipeline would be broken, data written to disk, the native block invoked, and then data read back from disk. Another alternative is to say this is unnecessary because the user can do the coordination from java, using the PigServer interface to run pig and calling the map reduce job explicitly. The advantages of the native keyword are that the user need not worry about coordination between the jobs; pig will take care of it. Also, the user can make use of existing java applications without being a java programmer. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1565) additional piggybank datetime and string UDFs
[ https://issues.apache.org/jira/browse/PIG-1565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12903174#action_12903174 ] Alan Gates commented on PIG-1565: - [exec] +1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 5 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. [exec] [exec] additional piggybank datetime and string UDFs - Key: PIG-1565 URL: https://issues.apache.org/jira/browse/PIG-1565 Project: Pig Issue Type: Improvement Reporter: Andrew Hitchcock Assignee: Andrew Hitchcock Fix For: 0.8.0 Attachments: PIG-1565-1.patch Pig is missing a variety of UDFs that might be helpful for users implementing Pig scripts. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: Pig Contributor meeting notes
Alan, That's great; next time I will try to join the contributor meeting. On Thu, Aug 26, 2010 at 11:35 AM, Alan Gates ga...@yahoo-inc.com wrote: On Aug 26, 2010, at 12:55 AM, Jeff Zhang wrote: Wonderful, Dmitriy. It's a pity I missed the contributor meeting. Were any slides shared? Jeff, We don't want to exclude our contributors who don't happen to live in the San Francisco Bay Area. If we could include you via Skype or some other technology we'd be happy to set it up on our end. Do you think something like that would work for you? Alan. -- Best Regards Jeff Zhang
Re: Pig Contributor meeting notes
BTW, Dmitriy actually invited me to join this meeting through Skype, but it's a pity that I had no time to join it this time. On Thu, Aug 26, 2010 at 6:15 PM, Jeff Zhang zjf...@gmail.com wrote: Alan, That's great; next time I will try to join the contributor meeting. -- Best Regards Jeff Zhang
[jira] Updated: (PIG-1343) pig_log file missing even though Main tells it is creating one and an M/R job fails
[ https://issues.apache.org/jira/browse/PIG-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] niraj rai updated PIG-1343: --- Attachment: PIG_1343_5.patch
[jira] Updated: (PIG-1562) Fix the version for the dependent packages for the maven
[ https://issues.apache.org/jira/browse/PIG-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] niraj rai updated PIG-1562: --- Attachment: PIG_1562_0.patch This patch fixes the version issue for the required packages. Fix the version for the dependent packages for the maven - Key: PIG-1562 URL: https://issues.apache.org/jira/browse/PIG-1562 Project: Pig Issue Type: Bug Reporter: niraj rai Assignee: niraj rai Fix For: 0.8.0 Attachments: PIG_1562_0.patch We need to fix the set-version step so that the version is properly set for the dependent packages in the maven repository. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.