Problem running Pig 0.60
Hi pig team, I'm testing zebra v2 and trying to run the pig 0.60 jar that I got from Yan. However, I got the following error: Caused by: java.lang.ClassNotFoundException: jline.ConsoleReaderInputStream at java.net.URLClassLoader$1.run(URLClassLoader.java:200) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:188) at java.lang.ClassLoader.loadClass(ClassLoader.java:307) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301) at java.lang.ClassLoader.loadClass(ClassLoader.java:252) at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320) Is there any additional jar file that I need to include with Hadoop or pig? Thanks~ -- Yiping Han y...@yahoo-inc.com US phone: +1(408)349-4403 Beijing phone: +86(10)8215-9357
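The usual cause of this error is that the jline jar shipped in Pig's lib/ directory is missing from the classpath. A quick way to check (the class name comes from the stack trace; the probe class itself is just an illustration, not part of Pig):

```java
// Minimal probe: report whether a class is visible on the current classpath.
public class ClasspathProbe {
    public static boolean isOnClasspath(String className) {
        try {
            // load without initializing; we only care about visibility
            Class.forName(className, false, ClasspathProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Pig's grunt shell needs jline; if this prints false, add the jline
        // jar bundled with Pig to the classpath before launching.
        System.out.println(isOnClasspath("jline.ConsoleReaderInputStream"));
    }
}
```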
[jira] Created: (PIG-941) [zebra] Loading non-existing column generates error
[zebra] Loading non-existing column generates error --- Key: PIG-941 URL: https://issues.apache.org/jira/browse/PIG-941 Project: Pig Issue Type: Bug Components: data Reporter: Yiping Han Loading a column that does not exist generates the following error: 2009-09-01 21:29:15,161 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2999: Unexpected internal error. null Example is like this: STORE urls2 into '$output' using org.apache.pig.table.pig.TableStorer('md5:string, url:string'); and then in another pig script, I load the table: input = LOAD '$output' USING org.apache.pig.table.pig.TableLoader('md5,url, domain'); where domain is a column that does not exist. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: Proposal to create a branch for contrib project Zebra
+1 On 8/18/09 7:11 AM, "Olga Natkovich" wrote: > +1 > > -Original Message- > From: Raghu Angadi [mailto:rang...@yahoo-inc.com] > Sent: Monday, August 17, 2009 4:06 PM > To: pig-dev@hadoop.apache.org > Subject: Proposal to create a branch for contrib project Zebra > > > Thanks to the PIG team, The first version of contrib project Zebra > (PIG-833) is committed to PIG trunk. > > In short, Zebra is a table storage layer built for use in PIG and other > Hadoop applications. > > While we are stabilizing current version V1 in the trunk, we plan to add > > more new features to it. We would like to create an svn branch for the > new features. We will be responsible for managing zebra in PIG trunk and > > in the new branch. We will merge the branch when it is ready. We expect > the changes to affect only 'contrib/zebra' directory. > > As a regular contributor to Hadoop, I will be the initial committer for > Zebra. As more patches are contributed by other Zebra developers, there > might be more commiters added through normal Hadoop/Apache procedure. > > I would like to create a branch called 'zebra-v2' with approval from PIG > > team. > > Thanks, > Raghu. -- Yiping Han F-3140 (408)349-4403 y...@yahoo-inc.com
Re: COUNT, AVG and nulls
+1. --Yiping On 7/6/09 10:58 AM, "Dmitriy Ryaboy" wrote: > +1 for standard semantics. > > We need a COALESCE function to go along with this. > > -D > > On Mon, Jul 6, 2009 at 10:46 AM, Olga Natkovich wrote: > >> Hi, >> >> >> >> The current implementation of COUNT and AVG in Pig counts null values. >> This is inconsistent with SQL semantics and also with semantics of other >> aggregated functions such as SUM, MIN, and MAX. Originally we chose this >> implementation for performance reasons; however, we re-implemented both >> functions to support multi-step combiner and now the cost of checking >> for null for the case where combiner is invoked is trivial. (I ran some >> tests with COUNT and they showed no performance difference.) We will pay >> penalty for the non-combinable case including local mode but I think it >> is worth the price to have consistent semantics. Also as we are working >> on SQL support, having SQL compliant semantics becomes very desirable. >> >> >> >> Please, let us know if you have any concerns. I am planning to make the >> change later this week. >> >> >> >> Olga >> >> -- Yiping Han F-3140 (408)349-4403 y...@yahoo-inc.com
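The proposed SQL-compliant semantics amount to skipping nulls in both the count and the average. A standalone sketch, with hypothetical helper names rather than Pig's real builtin classes:

```java
import java.util.List;

// Sketch of SQL-style aggregate semantics: COUNT and AVG ignore nulls,
// consistent with SUM, MIN, and MAX. (Helper names are illustrative only.)
public class NullSkippingAggs {
    public static long count(List<?> bag) {
        long n = 0;
        for (Object o : bag) {
            if (o != null) n++;      // nulls are not counted
        }
        return n;
    }

    public static Double avg(List<Double> bag) {
        double sum = 0;
        long n = 0;
        for (Double d : bag) {
            if (d == null) continue; // nulls excluded from both sum and count
            sum += d;
            n++;
        }
        return n == 0 ? null : sum / n; // an all-null bag yields null, as in SQL
    }
}
```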
[jira] Commented: (PIG-796) support conversion from numeric types to chararray
[ https://issues.apache.org/jira/browse/PIG-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714526#action_12714526 ] Yiping Han commented on PIG-796: I have the same idea as Alan proposed. I agree the common case is that most values are of the same type. Caching the type, and changing the cached type only when catching a ClassCastException, would be the most efficient way. > support conversion from numeric types to chararray > --- > > Key: PIG-796 > URL: https://issues.apache.org/jira/browse/PIG-796 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.2.0 >Reporter: Olga Natkovich > -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
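The caching scheme discussed in the comment can be sketched as follows. This is a minimal illustration; the class, the method names, and the set of special-cased types are mine, not Pig's actual conversion code:

```java
// Optimistic type caching: cast straight to the last-seen type, and only pay
// for re-detection when a ClassCastException signals the type has changed.
public class CachingToChararray {
    private Class<?> cachedType = Integer.class; // initial guess

    public String convert(Object value) {
        try {
            if (cachedType == Integer.class) return Integer.toString((Integer) value);
            if (cachedType == Long.class)    return Long.toString((Long) value);
            if (cachedType == Double.class)  return Double.toString((Double) value);
            return String.valueOf(value);        // a type we don't special-case
        } catch (ClassCastException e) {
            cachedType = value.getClass();       // rare path: re-cache the new type
            return String.valueOf(value);        // convert generically this once
        }
    }
}
```

After a cache miss the next call with the same type takes the fast path again, so a homogeneous column pays the exception cost at most once.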
[jira] Commented: (PIG-807) PERFORMANCE: Provide a way for UDFs to use read-once bags (backed by the Hadoop values iterator)
[ https://issues.apache.org/jira/browse/PIG-807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12710818#action_12710818 ] Yiping Han commented on PIG-807: David, the syntax B = foreach A generate SUM(m); is confusing for both developers and the parser. I like the idea of removing the explicit GROUP ALL, but would rather use a different keyword for that, e.g., B = FOR A GENERATE SUM(m); Adding a new keyword for this purpose would also work as a hint for the parser to treat this as direct hadoop iterator access. > PERFORMANCE: Provide a way for UDFs to use read-once bags (backed by the > Hadoop values iterator) > > > Key: PIG-807 > URL: https://issues.apache.org/jira/browse/PIG-807 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.2.1 >Reporter: Pradeep Kamath > Fix For: 0.3.0 > > > Currently all bags resulting from a group or cogroup are materialized as bags > containing all of the contents. The issue with this is that if a particular > key has many corresponding values, all these values get stuffed in a bag > which may run out of memory and hence spill causing slow down in performance > and sometime memory exceptions. In many cases, the udfs which use these bags > coming out a group and cogroup only need to iterate over the bag in a > unidirectional read-once manner. This can be implemented by having the bag > implement its iterator by simply iterating over the underlying hadoop > iterator provided in the reduce. This kind of a bag is also needed in > http://issues.apache.org/jira/browse/PIG-802. So the code can be reused for > this issue too. The other part of this issue is to have some way for the udfs > to communicate to Pig that any input bags that they need are "read once" bags > . This can be achieved by having an Interface - say "UsesReadOnceBags " which > is serves as a tag to indicate the intent to Pig. Pig can then rewire its > execution plan to use ReadOnceBags is feasible. 
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
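The bag described in the issue can be sketched outside Pig as a thin wrapper over the reduce-side values iterator. This is a simplified sketch assuming a generic Iterable, not Pig's real DataBag interface:

```java
import java.util.Iterator;

// A bag whose iterator() hands out the underlying reduce-side iterator exactly
// once: nothing is materialized or spilled, and a second traversal is an
// error by design (the unidirectional, read-once contract).
public class ReadOnceBag<T> implements Iterable<T> {
    private Iterator<T> source;

    public ReadOnceBag(Iterator<T> source) {
        this.source = source; // e.g. the Hadoop values iterator in reduce()
    }

    @Override
    public Iterator<T> iterator() {
        if (source == null) {
            throw new IllegalStateException("read-once bag already consumed");
        }
        Iterator<T> it = source;
        source = null; // invalidate: single pass only
        return it;
    }
}
```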
[jira] Commented: (PIG-807) PERFORMANCE: Provide a way for UDFs to use read-once bags (backed by the Hadoop values iterator)
[ https://issues.apache.org/jira/browse/PIG-807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12708575#action_12708575 ] Yiping Han commented on PIG-807: I would say instead of annotating the UDF to indicate "read once" bags, it would be easier to do that in the co-group command. We would skip bag materialization only if it is accessed by UDFs that ALL read it in the "read once" manner. Thus we only need to specify that once. > PERFORMANCE: Provide a way for UDFs to use read-once bags (backed by the > Hadoop values iterator) > > > Key: PIG-807 > URL: https://issues.apache.org/jira/browse/PIG-807 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.2.1 >Reporter: Pradeep Kamath > Fix For: 0.3.0 > > > Currently all bags resulting from a group or cogroup are materialized as bags > containing all of the contents. The issue with this is that if a particular > key has many corresponding values, all these values get stuffed in a bag > which may run out of memory and hence spill causing slow down in performance > and sometime memory exceptions. In many cases, the udfs which use these bags > coming out a group and cogroup only need to iterate over the bag in a > unidirectional read-once manner. This can be implemented by having the bag > implement its iterator by simply iterating over the underlying hadoop > iterator provided in the reduce. This kind of a bag is also needed in > http://issues.apache.org/jira/browse/PIG-802. So the code can be reused for > this issue too. The other part of this issue is to have some way for the udfs > to communicate to Pig that any input bags that they need are "read once" bags > . This can be achieved by having an Interface - say "UsesReadOnceBags " which > is serves as a tag to indicate the intent to Pig. Pig can then rewire its > execution plan to use ReadOnceBags is feasible. -- This message is automatically generated by JIRA. 
- You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-734) Non-string keys in maps
[ https://issues.apache.org/jira/browse/PIG-734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12707178#action_12707178 ] Yiping Han commented on PIG-734: Then why not just restrict all keys to be of the same type? I don't see the point that different records should have different key types. But I do see the point that people may want to use non-string types of keys. > Non-string keys in maps > --- > > Key: PIG-734 > URL: https://issues.apache.org/jira/browse/PIG-734 > Project: Pig > Issue Type: Bug >Affects Versions: 0.2.0 >Reporter: Alan Gates >Assignee: Alan Gates >Priority: Minor > Fix For: 0.3.0 > > Attachments: PIG-734.patch > > > With the addition of types to pig, maps were changed to allow any atomic type > to be a key. However, in practice we do not see people using keys other than > strings. And allowing multiple types is causing us issues in serializing > data (we have to check what every key type is) and in the design for non-java > UDFs (since many scripting languages include associative arrays such as > Perl's hash). > So I propose we scope back maps to only have string keys. This would be a > non-compatible change. But I am not aware of anyone using non-string keys, > so hopefully it would have little or no impact. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-734) Non-string keys in maps
[ https://issues.apache.org/jira/browse/PIG-734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12707110#action_12707110 ] Yiping Han commented on PIG-734: I don't get the serializing part. I would expect the type-checking to happen just once; would that be a performance problem? Actually we are thinking of switching to integer keys to save space. I wouldn't argue strongly against this rollback, but I don't see a significant reason for doing it. > Non-string keys in maps > --- > > Key: PIG-734 > URL: https://issues.apache.org/jira/browse/PIG-734 > Project: Pig > Issue Type: Bug >Affects Versions: 0.2.0 >Reporter: Alan Gates >Assignee: Alan Gates >Priority: Minor > Fix For: 0.3.0 > > > With the addition of types to pig, maps were changed to allow any atomic type > to be a key. However, in practice we do not see people using keys other than > strings. And allowing multiple types is causing us issues in serializing > data (we have to check what every key type is) and in the design for non-java > UDFs (since many scripting languages include associative arrays such as > Perl's hash). > So I propose we scope back maps to only have string keys. This would be a > non-compatible change. But I am not aware of anyone using non-string keys, > so hopefully it would have little or no impact. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Issue Comment Edited: (PIG-734) Non-string keys in maps
[ https://issues.apache.org/jira/browse/PIG-734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12707110#action_12707110 ] Yiping Han edited comment on PIG-734 at 5/7/09 2:10 PM: I don't get the serializing part. I would expect the type-checking to happen just once; would that be a performance problem? Actually we are thinking of switching to integer keys to save space. I wouldn't argue strongly against this rollback, but I don't see a significant reason for doing it. was (Author: yhan): I don't get the serializing part. I would expect the type-checking just happen once, would that be a performance problem. Actually we are thinking if we sould switch to integer key for saving space. I wouldn't post strong against to this rollback, but I don't see a significant reason for dong that. > Non-string keys in maps > --- > > Key: PIG-734 > URL: https://issues.apache.org/jira/browse/PIG-734 > Project: Pig > Issue Type: Bug >Affects Versions: 0.2.0 >Reporter: Alan Gates >Assignee: Alan Gates >Priority: Minor > Fix For: 0.3.0 > > > With the addition of types to pig, maps were changed to allow any atomic type > to be a key. However, in practice we do not see people using keys other than > strings. And allowing multiple types is causing us issues in serializing > data (we have to check what every key type is) and in the design for non-java > UDFs (since many scripting languages include associative arrays such as > Perl's hash). > So I propose we scope back maps to only have string keys. This would be a > non-compatible change. But I am not aware of anyone using non-string keys, > so hopefully it would have little or no impact. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-282) Custom Partitioner
[ https://issues.apache.org/jira/browse/PIG-282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12672467#action_12672467 ] Yiping Han commented on PIG-282: Any concerns on this issue? > Custom Partitioner > -- > > Key: PIG-282 > URL: https://issues.apache.org/jira/browse/PIG-282 > Project: Pig > Issue Type: New Feature >Reporter: Amir Youssefi >Priority: Minor > > By adding custom partitioner we can give control over which output partition > a key (/value) goes to. We can add keywords to language e.g. > PARTITION BY UDF(...) > or a similar syntax. UDF returns a number between 0 and n-1 where n is number > of output partitions. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-602) Pass global configurations to UDF
[ https://issues.apache.org/jira/browse/PIG-602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12672376#action_12672376 ] Yiping Han commented on PIG-602: Alan, this plan looks good for our requirements. > Pass global configurations to UDF > - > > Key: PIG-602 > URL: https://issues.apache.org/jira/browse/PIG-602 > Project: Pig > Issue Type: New Feature > Components: impl > Reporter: Yiping Han >Assignee: Alan Gates > > We are seeking an easy way to pass a large number of global configurations to > UDFs. > Since our application contains many pig jobs, and has a large number of > configurations. Passing configurations through command line is not an ideal > way (i.e. modifying single parameter needs to change multiple command lines). > And to put everything into the hadoop conf is not an ideal way either. > We would like to see if Pig can provide such a facility that allows us to > pass a configuration file in some format(XML?) and then make it available > through out all the UDFs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-664) Semantics of * is not consistent
[ https://issues.apache.org/jira/browse/PIG-664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12672372#action_12672372 ] Yiping Han commented on PIG-664: I would second Santhosh. In PIG 1.x, * in a UDF parameter list does expand into a flattened list of columns. When converting to PIG 2.0, this creates a lot of inconvenience. * should always generate flattened columns. > Semantics of * is not consistent > > > Key: PIG-664 > URL: https://issues.apache.org/jira/browse/PIG-664 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: types_branch >Reporter: Santhosh Srinivasan >Assignee: Santhosh Srinivasan > Fix For: types_branch > > > The semantics of * is not consistent in PIG. The use of * with generate > results in the all the columns of the record being flattened. However, the > use of * as an input to a UDF results in a tuple (wrapped in another tuple). > For consistency, * should always result in all the columns of the record > (i.e., flattened). The use of * occurs in: > 1. Foreach generate: E.g.: foreach input generate *; > 2. Input to UDFs: E.g. foreach input generate myUDF(*); > 3. Order by: E.g.: order input by *; > 4. (Co)Group: E.g.: group a by *; cogroup a by *, b by *; > In terms of implementation, this involves rolling back the fix introduced in > PIG-597 and fixing the following builtin UDFs: > 1. ARITY - Should return the size of the input tuple instead of extracting > the first column of the input tuple > 2. SIZE - Should return the size of the input tuple instead of extracting the > first column of the input tuple -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-625) Add global -explain, -illustrate, -describe mode to PIG
Add global -explain, -illustrate, -describe mode to PIG --- Key: PIG-625 URL: https://issues.apache.org/jira/browse/PIG-625 Project: Pig Issue Type: New Feature Reporter: Yiping Han Currently PIG has the commands EXPLAIN, ILLUSTRATE and DESCRIBE. But users need to manually add/remove these lines in the script when they want to debug or see details of the job. I think there should be a way to enable these globally. What I suggest is to add -explain, -illustrate and -describe options to the PIG command line. When any of these is present, all the DUMP and STORE commands in the script are converted into EXPLAIN, ILLUSTRATE or DESCRIBE correspondingly. This makes debugging easier. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-610) Pig appears to continue when an underlying mapred job fails
[ https://issues.apache.org/jira/browse/PIG-610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662084#action_12662084 ] Yiping Han commented on PIG-610: We are on hadoop 0.18.2 and latest pig_types branch. We tried to do "hadoop job -kill x" through a different terminal. I believe this happens every time since Ralf gave me instruction yesterday and I can easily reproduce it. > Pig appears to continue when an underlying mapred job fails > > > Key: PIG-610 > URL: https://issues.apache.org/jira/browse/PIG-610 > Project: Pig > Issue Type: Bug >Reporter: Yiping Han >Priority: Critical > > We observed sometimes, pig appears to continue when an underlying mapred job > fails. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-610) Pig appears to continue when an underlying mapred job fails
[ https://issues.apache.org/jira/browse/PIG-610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661823#action_12661823 ] Yiping Han commented on PIG-610: Create a pig job with multiple mapred jobs. Let the script run and then manually kill the running mapred job. Pig reports the failure of this mapred job but does not terminate itself. The next mapred job will be launched. Pig should fail immediately. > Pig appears to continue when an underlying mapred job fails > > > Key: PIG-610 > URL: https://issues.apache.org/jira/browse/PIG-610 > Project: Pig > Issue Type: Bug > Reporter: Yiping Han >Priority: Critical > > We observed sometimes, pig appears to continue when an underlying mapred job > fails. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-610) Pig appears to continue when an underlying mapred job fails
[ https://issues.apache.org/jira/browse/PIG-610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiping Han updated PIG-610: --- Priority: Critical (was: Major) > Pig appears to continue when an underlying mapred job fails > > > Key: PIG-610 > URL: https://issues.apache.org/jira/browse/PIG-610 > Project: Pig > Issue Type: Bug > Reporter: Yiping Han >Priority: Critical > > We observed sometimes, pig appears to continue when an underlying mapred job > fails. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-609) PIG does not return the correct error code
[ https://issues.apache.org/jira/browse/PIG-609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiping Han updated PIG-609: --- Priority: Critical (was: Major) > PIG does not return the correct error code > -- > > Key: PIG-609 > URL: https://issues.apache.org/jira/browse/PIG-609 > Project: Pig > Issue Type: Bug > Reporter: Yiping Han >Priority: Critical > > Pig still does not always return a correct error code. When the hadoop job > fails, sometimes pig still return 0. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-604) Kill the Pig job should kill all associated Hadoop Jobs
[ https://issues.apache.org/jira/browse/PIG-604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiping Han updated PIG-604: --- Priority: Minor (was: Major) > Kill the Pig job should kill all associated Hadoop Jobs > --- > > Key: PIG-604 > URL: https://issues.apache.org/jira/browse/PIG-604 > Project: Pig > Issue Type: Improvement > Components: grunt > Reporter: Yiping Han >Priority: Minor > > Current if we kill the pig job on the client machine, those hadoop jobs > already launched still keep running. We have to kill these jobs manually. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-611) Better logging support
Better logging support -- Key: PIG-611 URL: https://issues.apache.org/jira/browse/PIG-611 Project: Pig Issue Type: Improvement Components: tools Reporter: Yiping Han I started this ticket to discuss future improvements to logging. The first thing I would like to suggest is that pig needs more comprehensive logs. A debug mode in which pig could print extensive, detailed logs would be very helpful. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-610) Pig appears to continue when an underlying mapred job fails
Pig appears to continue when an underlying mapred job fails Key: PIG-610 URL: https://issues.apache.org/jira/browse/PIG-610 Project: Pig Issue Type: Bug Reporter: Yiping Han We observed that sometimes pig appears to continue when an underlying mapred job fails. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-608) Compile or validate the whole script before execution
[ https://issues.apache.org/jira/browse/PIG-608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661714#action_12661714 ] Yiping Han commented on PIG-608: Alan, I think that's the problem. Everything up to a store or dump is checked, but if there is an error after that, it will not be reported until the previous store or dump has finished. I don't think this is a duplicate of PIG-607, but I agree the fix to PIG-607 might be able to fix this problem (depending on the actual solution). > Compile or validate the whole script before execution > - > > Key: PIG-608 > URL: https://issues.apache.org/jira/browse/PIG-608 > Project: Pig > Issue Type: Improvement > Components: grunt >Reporter: Yiping Han > > This is a very usual scenario: > We are running a big pig job that contains several hadoop jobs. It has been > running for long times and the first hadoop job sucess, then suddenly pig > report it found a syntax error in the script after the first hadoop job...we > have to repeat from the beginning. > It would be nice if pig can compile to the end of the script, find all the > syntax error, type mismatch, etc., before it really starts execution. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-609) PIG does not return the correct error code
PIG does not return the correct error code -- Key: PIG-609 URL: https://issues.apache.org/jira/browse/PIG-609 Project: Pig Issue Type: Bug Reporter: Yiping Han Pig still does not always return a correct error code. When the hadoop job fails, sometimes pig still returns 0. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-608) Compile or validate the whole script before execution
Compile or validate the whole script before execution - Key: PIG-608 URL: https://issues.apache.org/jira/browse/PIG-608 Project: Pig Issue Type: Improvement Components: grunt Reporter: Yiping Han This is a very common scenario: We are running a big pig job that contains several hadoop jobs. It has been running for a long time and the first hadoop job succeeds; then suddenly pig reports that it found a syntax error in the script after the first hadoop job...we have to repeat from the beginning. It would be nice if pig could compile to the end of the script and find all the syntax errors, type mismatches, etc., before it really starts execution. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-607) Utilize intermediate results instead of re-execution
Utilize intermediate results instead of re-execution Key: PIG-607 URL: https://issues.apache.org/jira/browse/PIG-607 Project: Pig Issue Type: New Feature Reporter: Yiping Han Priority: Critical This is a long-standing problem: intermediate results are not reused. Every STORE or DUMP is executed in a separate plan, and thus everything it needs is re-executed. This is really a terrible issue that should be fixed ASAP. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-606) Setting replication factor in Pig
Setting replication factor in Pig - Key: PIG-606 URL: https://issues.apache.org/jira/browse/PIG-606 Project: Pig Issue Type: New Feature Reporter: Yiping Han We would like the STORE clause to be able to set the replication factor. This is particularly useful for certain small files, e.g. for a replicated join. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-605) Better explain and console output
Better explain and console output - Key: PIG-605 URL: https://issues.apache.org/jira/browse/PIG-605 Project: Pig Issue Type: Improvement Components: grunt Reporter: Yiping Han It would be nice if, when we explain the script, the corresponding mapred jobs could be explicitly marked out in a neat way. While we execute the script, the console output could print the name and URL of the corresponding hadoop jobs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-604) Kill the Pig job should kill all associated Hadoop Jobs
Kill the Pig job should kill all associated Hadoop Jobs --- Key: PIG-604 URL: https://issues.apache.org/jira/browse/PIG-604 Project: Pig Issue Type: Improvement Components: grunt Reporter: Yiping Han Currently, if we kill the pig job on the client machine, those hadoop jobs already launched keep running. We have to kill these jobs manually. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-603) Pig Server
Pig Server -- Key: PIG-603 URL: https://issues.apache.org/jira/browse/PIG-603 Project: Pig Issue Type: New Feature Components: grunt Reporter: Yiping Han With a real Pig Server, when we lose the client, the pig job will not be killed. Also, a more important reason for a Pig Server is that we can talk with the Pig Server through APIs to query status, failures, etc. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-602) Pass global configurations to UDF
Pass global configurations to UDF - Key: PIG-602 URL: https://issues.apache.org/jira/browse/PIG-602 Project: Pig Issue Type: New Feature Components: impl Reporter: Yiping Han We are seeking an easy way to pass a large number of global configurations to UDFs. Since our application contains many pig jobs and has a large number of configurations, passing configurations through the command line is not ideal (i.e. modifying a single parameter requires changing multiple command lines). And putting everything into the hadoop conf is not an ideal way either. We would like to see if Pig can provide a facility that allows us to pass a configuration file in some format (XML?) and then make it available throughout all the UDFs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
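The requested facility could look roughly like this on the UDF side. This is a sketch under stated assumptions: the class and method names are hypothetical, and a real implementation would read the file from HDFS or the distributed cache rather than an in-memory Reader:

```java
import java.io.IOException;
import java.io.Reader;
import java.util.Properties;

// Hypothetical UDF-side helper: load one shared configuration file once,
// then serve all lookups from memory, instead of passing many -param flags.
public class GlobalConf {
    private static Properties props;

    // In a real deployment the Reader would wrap a file from HDFS or the
    // distributed cache; taking a Reader keeps this sketch self-contained.
    public static synchronized void load(Reader source) {
        try {
            Properties p = new Properties();
            p.load(source);
            props = p;
        } catch (IOException e) {
            throw new RuntimeException("failed to load global configuration", e);
        }
    }

    public static String get(String key, String defaultValue) {
        return props == null ? defaultValue : props.getProperty(key, defaultValue);
    }
}
```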
[jira] Created: (PIG-601) Add finalize() interface to UDF
Add finalize() interface to UDF --- Key: PIG-601 URL: https://issues.apache.org/jira/browse/PIG-601 Project: Pig Issue Type: New Feature Components: impl Reporter: Yiping Han I would like to add a finalize() method to UDFs, which would be called when there are no more inputs and the UDF is about to be destroyed. The finalize() method should be allowed to generate extra output, which in many cases could benefit aggregations. There are a couple of applications that can benefit from this feature. One example is: in some UDFs, I need to open a resource (e.g. a local file), and when the task finishes, I need to close the resource. Another example is: in one of my applications, I do statistics for a list of categories and I need to generate a summary category and attach it to the end of the table. With the finalize method, I could achieve this in an efficient and neat way. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
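The second use case (a per-category summary row) shows what the proposed hook would enable. A sketch with hypothetical names, since no such finalize() interface exists in Pig's actual UDF API:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical UDF shape: exec() is called per input tuple, and the proposed
// hook (here onFinalize()) is called once when no more input will arrive,
// letting the UDF append a summary record to its output.
public class CategoryStats {
    private final Map<String, Long> counts = new LinkedHashMap<>();

    public void exec(String category) {         // per-tuple accumulation
        counts.merge(category, 1L, Long::sum);
    }

    public List<String> onFinalize() {          // called after the last tuple
        long total = 0;
        List<String> rows = new ArrayList<>();
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            rows.add(e.getKey() + "\t" + e.getValue());
            total += e.getValue();
        }
        rows.add("ALL\t" + total);              // summary attached at the end
        return rows;
    }
}
```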