Problem running Pig 0.60

2009-11-03 Thread Yiping Han
Hi pig team,

I'm testing zebra v2 and trying to run the pig 0.60 jar that I got from Yan.
However, I got the following error:

Caused by: java.lang.ClassNotFoundException: jline.ConsoleReaderInputStream
at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)

Is there any additional jar file that I need to include with Hadoop or pig?


Thanks~
--
Yiping Han
y...@yahoo-inc.com
US phone: +1(408)349-4403
Beijing phone: +86(10)8215-9357 



[jira] Created: (PIG-941) [zebra] Loading non-existing column generates error

2009-09-01 Thread Yiping Han (JIRA)
[zebra] Loading non-existing column generates error
---

 Key: PIG-941
 URL: https://issues.apache.org/jira/browse/PIG-941
 Project: Pig
  Issue Type: Bug
  Components: data
Reporter: Yiping Han


Loading a column that does not exist generates the following error:

2009-09-01 21:29:15,161 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
2999: Unexpected internal error. null

An example looks like this:

STORE urls2 into '$output' using 
org.apache.pig.table.pig.TableStorer('md5:string, url:string');

and then in another pig script, I load the table:

input = LOAD '$output' USING org.apache.pig.table.pig.TableLoader('md5,url, 
domain');

where domain is a column that does not exist.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Proposal to create a branch for contrib project Zebra

2009-08-17 Thread Yiping Han
+1


On 8/18/09 7:11 AM, "Olga Natkovich"  wrote:

> +1
> 
> -Original Message-
> From: Raghu Angadi [mailto:rang...@yahoo-inc.com]
> Sent: Monday, August 17, 2009 4:06 PM
> To: pig-dev@hadoop.apache.org
> Subject: Proposal to create a branch for contrib project Zebra
> 
> 
> Thanks to the Pig team, the first version of the contrib project Zebra
> (PIG-833) has been committed to the Pig trunk.
> 
> In short, Zebra is a table storage layer built for use in Pig and other
> Hadoop applications.
> 
> While we are stabilizing the current version (V1) in the trunk, we plan to
> add more new features to it. We would like to create an svn branch for the
> new features. We will be responsible for managing Zebra in the Pig trunk
> and in the new branch. We will merge the branch back when it is ready. We
> expect the changes to affect only the 'contrib/zebra' directory.
> 
> As a regular contributor to Hadoop, I will be the initial committer for
> Zebra. As more patches are contributed by other Zebra developers, more
> committers may be added through the normal Hadoop/Apache procedure.
> 
> I would like to create a branch called 'zebra-v2' with approval from the
> Pig team.
> 
> Thanks,
> Raghu.

--
Yiping Han
F-3140 
(408)349-4403
y...@yahoo-inc.com



Re: COUNT, AVG and nulls

2009-07-06 Thread Yiping Han
+1.

--Yiping


On 7/6/09 10:58 AM, "Dmitriy Ryaboy"  wrote:

> +1 for standard semantics.
> 
> We need a COALESCE function to go along with this.
> 
> -D
> 
> On Mon, Jul 6, 2009 at 10:46 AM, Olga Natkovich  wrote:
> 
>> Hi,
>> 
>> 
>> 
>> The current implementation of COUNT and AVG in Pig counts null values.
>> This is inconsistent with SQL semantics and also with the semantics of other
>> aggregate functions such as SUM, MIN, and MAX. Originally we chose this
>> implementation for performance reasons; however, we re-implemented both
>> functions to support the multi-step combiner, and now the cost of checking
>> for null in the case where the combiner is invoked is trivial. (I ran some
>> tests with COUNT and they showed no performance difference.) We will pay a
>> penalty in the non-combinable case, including local mode, but I think it is
>> worth the price to have consistent semantics. Also, as we are working on SQL
>> support, having SQL-compliant semantics becomes very desirable.
>> 
>> 
>> 
>> Please, let us know if you have any concerns. I am planning to make the
>> change later this week.
>> 
>> 
>> 
>> Olga
>> 
>> 
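A minimal, self-contained sketch of the null-skipping check being discussed
here (an illustration only, not Pig's actual combiner-aware COUNT; the class
and method names are made up for the example):

    import java.util.Arrays;
    import java.util.List;

    public class NullSkippingCount {

        // Counts only non-null values, matching SQL COUNT(col) semantics.
        public static long count(Iterable<?> values) {
            long n = 0;
            for (Object v : values) {
                if (v != null) {      // the extra per-value check the thread is discussing
                    n++;
                }
            }
            return n;
        }

        public static void main(String[] args) {
            List<Integer> values = Arrays.asList(1, null, 3, null, 5);
            System.out.println(count(values));   // prints 3 under the proposed semantics, not 5
        }
    }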

--
Yiping Han
F-3140 
(408)349-4403
y...@yahoo-inc.com



[jira] Commented: (PIG-796) support conversion from numeric types to chararray

2009-05-29 Thread Yiping Han (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714526#action_12714526
 ] 

Yiping Han commented on PIG-796:


I had the same idea that Alan proposed. I agree that in the common case most
values are of the same type. Caching the type, and changing the cached type
only when a ClassCastException is caught, would be the most efficient way.
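A minimal, self-contained sketch of that caching approach (an illustration of
the idea, not the actual Pig cast code; the class and method names are made
up): the cached type is trusted on the fast path and only re-detected when a
ClassCastException shows the data has changed type.

    public class CachingToCharArray {

        private enum NumType { INTEGER, LONG, FLOAT, DOUBLE }

        private NumType cached = null;   // last type successfully converted

        public String toCharArray(Object value) {
            if (value == null) {
                return null;
            }
            if (cached != null) {
                try {
                    return convertAs(cached, value);   // fast path: assume the cached type
                } catch (ClassCastException e) {
                    cached = null;                     // mixed data; re-detect below
                }
            }
            cached = detect(value);                    // slow path, taken rarely
            return convertAs(cached, value);
        }

        private String convertAs(NumType t, Object value) {
            switch (t) {
                case INTEGER: return Integer.toString((Integer) value);
                case LONG:    return Long.toString((Long) value);
                case FLOAT:   return Float.toString((Float) value);
                case DOUBLE:  return Double.toString((Double) value);
                default:      throw new IllegalStateException("unknown type " + t);
            }
        }

        private NumType detect(Object value) {
            if (value instanceof Integer) return NumType.INTEGER;
            if (value instanceof Long)    return NumType.LONG;
            if (value instanceof Float)   return NumType.FLOAT;
            if (value instanceof Double)  return NumType.DOUBLE;
            throw new IllegalArgumentException("unsupported type: " + value.getClass());
        }
    }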

> support  conversion from numeric types to chararray
> ---
>
> Key: PIG-796
> URL: https://issues.apache.org/jira/browse/PIG-796
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.2.0
>Reporter: Olga Natkovich
>


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-807) PERFORMANCE: Provide a way for UDFs to use read-once bags (backed by the Hadoop values iterator)

2009-05-19 Thread Yiping Han (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12710818#action_12710818
 ] 

Yiping Han commented on PIG-807:


David, the syntax B = foreach A generate SUM(m); is confusing for both
developers and the parser.

I like the idea of removing the explicit GROUP ALL, but would rather use a
different keyword for it, e.g. B = FOR A GENERATE SUM(m);

Adding a new keyword for this purpose would also work as a hint for the parser
to treat this as direct access to the hadoop iterator.

> PERFORMANCE: Provide a way for UDFs to use read-once bags (backed by the 
> Hadoop values iterator)
> 
>
> Key: PIG-807
> URL: https://issues.apache.org/jira/browse/PIG-807
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.2.1
>Reporter: Pradeep Kamath
> Fix For: 0.3.0
>
>
> Currently all bags resulting from a group or cogroup are materialized as bags
> containing all of their contents. The issue with this is that if a particular
> key has many corresponding values, all these values get stuffed into a bag,
> which may run out of memory and hence spill, causing a slowdown in performance
> and sometimes memory exceptions. In many cases, the udfs which use the bags
> coming out of a group or cogroup only need to iterate over the bag in a
> unidirectional, read-once manner. This can be implemented by having the bag
> implement its iterator by simply iterating over the underlying hadoop
> iterator provided in the reduce. This kind of bag is also needed in
> http://issues.apache.org/jira/browse/PIG-802, so the code can be reused for
> this issue too. The other part of this issue is to have some way for the udfs
> to communicate to Pig that any input bags that they need are "read once" bags.
> This can be achieved by having an interface - say "UsesReadOnceBags" - which
> serves as a tag to indicate the intent to Pig. Pig can then rewire its
> execution plan to use ReadOnceBags where feasible.
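A minimal sketch of the read-once idea described above: a bag-like view whose
iterator simply delegates to the reduce-side values iterator, so nothing is
materialized or spilled. The real DataBag interface has many more methods;
this only illustrates the core delegation, and the class name is made up.

    import java.util.Iterator;

    public class ReadOnceBag<T> implements Iterable<T> {

        private final Iterator<T> values;   // the iterator Hadoop hands to reduce()
        private boolean consumed = false;

        public ReadOnceBag(Iterator<T> reduceValues) {
            this.values = reduceValues;
        }

        @Override
        public Iterator<T> iterator() {
            if (consumed) {
                // A read-once bag can only be traversed a single time.
                throw new IllegalStateException("read-once bag already consumed");
            }
            consumed = true;
            return values;
        }
    }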

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-807) PERFORMANCE: Provide a way for UDFs to use read-once bags (backed by the Hadoop values iterator)

2009-05-12 Thread Yiping Han (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12708575#action_12708575
 ] 

Yiping Han commented on PIG-807:


I would say that instead of annotating the UDF to indicate "read once" bags,
it would be easier to do that in the cogroup command. We would skip bag
materialization only if the bag is accessed by UDFs that ALL read it in the
"read once" manner. That way we only need to specify it once.


> PERFORMANCE: Provide a way for UDFs to use read-once bags (backed by the 
> Hadoop values iterator)
> 
>
> Key: PIG-807
> URL: https://issues.apache.org/jira/browse/PIG-807
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.2.1
>Reporter: Pradeep Kamath
> Fix For: 0.3.0
>
>
> Currently all bags resulting from a group or cogroup are materialized as bags
> containing all of their contents. The issue with this is that if a particular
> key has many corresponding values, all these values get stuffed into a bag,
> which may run out of memory and hence spill, causing a slowdown in performance
> and sometimes memory exceptions. In many cases, the udfs which use the bags
> coming out of a group or cogroup only need to iterate over the bag in a
> unidirectional, read-once manner. This can be implemented by having the bag
> implement its iterator by simply iterating over the underlying hadoop
> iterator provided in the reduce. This kind of bag is also needed in
> http://issues.apache.org/jira/browse/PIG-802, so the code can be reused for
> this issue too. The other part of this issue is to have some way for the udfs
> to communicate to Pig that any input bags that they need are "read once" bags.
> This can be achieved by having an interface - say "UsesReadOnceBags" - which
> serves as a tag to indicate the intent to Pig. Pig can then rewire its
> execution plan to use ReadOnceBags where feasible.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-734) Non-string keys in maps

2009-05-07 Thread Yiping Han (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12707178#action_12707178
 ] 

Yiping Han commented on PIG-734:


Then why not just restrict all the keys to be of the same type? I don't see
the point of different records having different key types, but I do see the
point that people may want to use non-string keys.

> Non-string keys in maps
> ---
>
> Key: PIG-734
> URL: https://issues.apache.org/jira/browse/PIG-734
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.2.0
>Reporter: Alan Gates
>Assignee: Alan Gates
>Priority: Minor
> Fix For: 0.3.0
>
> Attachments: PIG-734.patch
>
>
> With the addition of types to pig, maps were changed to allow any atomic type 
> to be a key.  However, in practice we do not see people using keys other than 
> strings.  And allowing multiple types is causing us issues in serializing 
> data (we have to check what every key type is) and in the design for non-java 
> UDFs (since many scripting languages include associative arrays such as 
> Perl's hash).
> So I propose we scope back maps to only have string keys.  This would be a 
> non-compatible change.  But I am not aware of anyone using non-string keys, 
> so hopefully it would have little or no impact.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-734) Non-string keys in maps

2009-05-07 Thread Yiping Han (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12707110#action_12707110
 ] 

Yiping Han commented on PIG-734:


I don't get the serializing part. I would expect the type-checking to happen
just once; would that be a performance problem?

Actually, we are considering whether we should switch to integer keys to save
space.

I wouldn't push strongly against this rollback, but I don't see a significant
reason for doing it.

> Non-string keys in maps
> ---
>
> Key: PIG-734
> URL: https://issues.apache.org/jira/browse/PIG-734
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.2.0
>Reporter: Alan Gates
>Assignee: Alan Gates
>Priority: Minor
> Fix For: 0.3.0
>
>
> With the addition of types to pig, maps were changed to allow any atomic type 
> to be a key.  However, in practice we do not see people using keys other than 
> strings.  And allowing multiple types is causing us issues in serializing 
> data (we have to check what every key type is) and in the design for non-java 
> UDFs (since many scripting languages include associative arrays such as 
> Perl's hash).
> So I propose we scope back maps to only have string keys.  This would be a 
> non-compatible change.  But I am not aware of anyone using non-string keys, 
> so hopefully it would have little or no impact.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Issue Comment Edited: (PIG-734) Non-string keys in maps

2009-05-07 Thread Yiping Han (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12707110#action_12707110
 ] 

Yiping Han edited comment on PIG-734 at 5/7/09 2:10 PM:


I don't get the serializing part. I would expect the type-checking to happen
just once; would that be a performance problem?

Actually, we are considering whether we should switch to integer keys to save
space.

I wouldn't push strongly against this rollback, but I don't see a significant
reason for doing it.

  was (Author: yhan):
I don't get the serializing part. I would expect the type-checking just 
happen once, would that be a performance problem.

Actually we are thinking if we sould switch to integer key for saving space.

I wouldn't post strong against to this rollback, but I don't see a significant 
reason for dong that.
  
> Non-string keys in maps
> ---
>
> Key: PIG-734
> URL: https://issues.apache.org/jira/browse/PIG-734
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.2.0
>Reporter: Alan Gates
>Assignee: Alan Gates
>Priority: Minor
> Fix For: 0.3.0
>
>
> With the addition of types to pig, maps were changed to allow any atomic type 
> to be a key.  However, in practice we do not see people using keys other than 
> strings.  And allowing multiple types is causing us issues in serializing 
> data (we have to check what every key type is) and in the design for non-java 
> UDFs (since many scripting languages include associative arrays such as 
> Perl's hash).
> So I propose we scope back maps to only have string keys.  This would be a 
> non-compatible change.  But I am not aware of anyone using non-string keys, 
> so hopefully it would have little or no impact.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-282) Custom Partitioner

2009-02-10 Thread Yiping Han (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12672467#action_12672467
 ] 

Yiping Han commented on PIG-282:


Any concerns on this issue?

> Custom Partitioner
> --
>
> Key: PIG-282
> URL: https://issues.apache.org/jira/browse/PIG-282
> Project: Pig
>  Issue Type: New Feature
>Reporter: Amir Youssefi
>Priority: Minor
>
> By adding a custom partitioner we can give control over which output partition
> a key (/value) goes to. We can add keywords to the language, e.g.
> PARTITION BY UDF(...)
> or similar syntax. The UDF returns a number between 0 and n-1, where n is the
> number of output partitions.
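For illustration, a sketch of what a partitioning UDF might look like under
the proposed syntax, written against Pig's EvalFunc API. The class name and
the hard-coded partition count are assumptions; the proposal does not say how
n would actually be supplied to the UDF.

    import java.io.IOException;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    public class HashPartition extends EvalFunc<Integer> {

        private static final int NUM_PARTITIONS = 16;   // illustrative assumption

        @Override
        public Integer exec(Tuple input) throws IOException {
            if (input == null || input.size() == 0) {
                return 0;
            }
            Object key = input.get(0);
            int hash = (key == null) ? 0 : key.hashCode();
            return (hash & Integer.MAX_VALUE) % NUM_PARTITIONS;   // a value in 0..n-1
        }
    }

Under the proposed syntax it might then be used as something like
PARTITION BY HashPartition(key).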

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-602) Pass global configurations to UDF

2009-02-10 Thread Yiping Han (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12672376#action_12672376
 ] 

Yiping Han commented on PIG-602:


Alan, this plan looks good for our requirements.

> Pass global configurations to UDF
> -
>
> Key: PIG-602
> URL: https://issues.apache.org/jira/browse/PIG-602
> Project: Pig
>  Issue Type: New Feature
>  Components: impl
>    Reporter: Yiping Han
>Assignee: Alan Gates
>
> We are seeking an easy way to pass a large number of global configurations to
> UDFs.
> Our application contains many pig jobs and has a large number of
> configurations. Passing configurations on the command line is not ideal
> (modifying a single parameter means changing multiple command lines), and
> putting everything into the hadoop conf is not ideal either.
> We would like Pig to provide a facility that allows us to pass a configuration
> file in some format (XML?) and then makes it available throughout all the
> UDFs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-664) Semantics of * is not consistent

2009-02-10 Thread Yiping Han (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12672372#action_12672372
 ] 

Yiping Han commented on PIG-664:


I would second Santhosh. In PIG 1.x, * in a UDF parameter list does expand to
the flattened list of columns; while converting to PIG 2.0, this inconsistency
creates a lot of inconvenience. * should always generate flattened columns.

> Semantics of * is not consistent
> 
>
> Key: PIG-664
> URL: https://issues.apache.org/jira/browse/PIG-664
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: types_branch
>Reporter: Santhosh Srinivasan
>Assignee: Santhosh Srinivasan
> Fix For: types_branch
>
>
> The semantics of * is not consistent in Pig. The use of * with generate
> results in all the columns of the record being flattened. However, the
> use of * as an input to a UDF results in a tuple (wrapped in another tuple).
> For consistency, * should always result in all the columns of the record
> (i.e., flattened). The use of * occurs in:
> 1. Foreach generate: E.g.: foreach input generate *;
> 2. Input to UDFs: E.g. foreach input generate myUDF(*);
> 3. Order by: E.g.: order input by *;
> 4. (Co)Group: E.g.: group a by *; cogroup a by *, b by *;
> In terms of implementation, this involves rolling back the fix introduced in 
> PIG-597 and fixing the following builtin UDFs:
> 1. ARITY - Should return the size of the input tuple instead of extracting 
> the first column of the input tuple
> 2. SIZE - Should return the size of the input tuple instead of extracting the 
> first column of the input tuple
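For illustration, a minimal sketch (not the actual builtin) of the behaviour
asked for in items 1 and 2: return the arity of whatever tuple the UDF
receives rather than the size of its first field. Under the proposed
semantics, myUDF(*) would hand such a UDF all columns of the record flattened
into one tuple, so this would report the column count. The class name is made
up.

    import java.io.IOException;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    public class TupleArity extends EvalFunc<Long> {

        @Override
        public Long exec(Tuple input) throws IOException {
            if (input == null) {
                return 0L;
            }
            return (long) input.size();   // size of the whole tuple, not of its first field
        }
    }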

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-625) Add global -explain, -illustrate, -describe mode to PIG

2009-01-16 Thread Yiping Han (JIRA)
Add global -explain, -illustrate, -describe mode to PIG
---

 Key: PIG-625
 URL: https://issues.apache.org/jira/browse/PIG-625
 Project: Pig
  Issue Type: New Feature
Reporter: Yiping Han


Currently Pig has the commands EXPLAIN, ILLUSTRATE and DESCRIBE, but users
need to manually add/remove these lines in the script when they want to debug
or see details of the job. I think there should be a way to enable these
globally.

What I suggest is to add -explain, -illustrate and -describe options to the
Pig command line. When any of these is present, all the DUMP and STORE
commands in the script are converted into EXPLAIN, ILLUSTRATE or DESCRIBE,
respectively. This would make debugging easier.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-610) Pig appears to continue when an underlying mapred job fails

2009-01-08 Thread Yiping Han (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662084#action_12662084
 ] 

Yiping Han commented on PIG-610:


We are on hadoop 0.18.2 and the latest pig_types branch. We tried "hadoop job
-kill x" from a different terminal. I believe this has happened every time
since Ralf gave me instructions yesterday, and I can easily reproduce it.

> Pig appears to continue when an underlying mapred job fails 
> 
>
> Key: PIG-610
> URL: https://issues.apache.org/jira/browse/PIG-610
> Project: Pig
>  Issue Type: Bug
>Reporter: Yiping Han
>Priority: Critical
>
> We observed sometimes, pig appears to continue when an underlying mapred job 
> fails.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-610) Pig appears to continue when an underlying mapred job fails

2009-01-07 Thread Yiping Han (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661823#action_12661823
 ] 

Yiping Han commented on PIG-610:


Create a pig job with multiple mapred jobs. Let the script run and then 
manually kill the running mapred job. Pig reports the failure of this mapred 
job but does not terminate itself. The next mapred job will be launched.

Pig should fail immediately.

> Pig appears to continue when an underlying mapred job fails 
> 
>
> Key: PIG-610
> URL: https://issues.apache.org/jira/browse/PIG-610
> Project: Pig
>  Issue Type: Bug
>    Reporter: Yiping Han
>Priority: Critical
>
> We observed sometimes, pig appears to continue when an underlying mapred job 
> fails.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-610) Pig appears to continue when an underlying mapred job fails

2009-01-07 Thread Yiping Han (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiping Han updated PIG-610:
---

Priority: Critical  (was: Major)

> Pig appears to continue when an underlying mapred job fails 
> 
>
> Key: PIG-610
> URL: https://issues.apache.org/jira/browse/PIG-610
> Project: Pig
>  Issue Type: Bug
>    Reporter: Yiping Han
>Priority: Critical
>
> We observed sometimes, pig appears to continue when an underlying mapred job 
> fails.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-609) PIG does not return the correct error code

2009-01-07 Thread Yiping Han (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiping Han updated PIG-609:
---

Priority: Critical  (was: Major)

> PIG does not return the correct error code
> --
>
> Key: PIG-609
> URL: https://issues.apache.org/jira/browse/PIG-609
> Project: Pig
>  Issue Type: Bug
>    Reporter: Yiping Han
>Priority: Critical
>
> Pig still does not always return a correct error code. When the hadoop job 
> fails, sometimes pig still return 0.  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-604) Kill the Pig job should kill all associated Hadoop Jobs

2009-01-07 Thread Yiping Han (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiping Han updated PIG-604:
---

Priority: Minor  (was: Major)

> Kill the Pig job should kill all associated Hadoop Jobs
> ---
>
> Key: PIG-604
> URL: https://issues.apache.org/jira/browse/PIG-604
> Project: Pig
>  Issue Type: Improvement
>  Components: grunt
>        Reporter: Yiping Han
>Priority: Minor
>
> Currently, if we kill the pig job on the client machine, the hadoop jobs
> already launched keep running. We have to kill these jobs manually.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-611) Better logging support

2009-01-07 Thread Yiping Han (JIRA)
Better logging support
--

 Key: PIG-611
 URL: https://issues.apache.org/jira/browse/PIG-611
 Project: Pig
  Issue Type: Improvement
  Components: tools
Reporter: Yiping Han


I started this ticket to discuss future improvements to logging.

The first thing I would like to suggest is that pig needs more comprehensive
logs. A debug mode in which pig could print extensive, detailed logs would be
very helpful.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-610) Pig appears to continue when an underlying mapred job fails

2009-01-07 Thread Yiping Han (JIRA)
Pig appears to continue when an underlying mapred job fails 


 Key: PIG-610
 URL: https://issues.apache.org/jira/browse/PIG-610
 Project: Pig
  Issue Type: Bug
Reporter: Yiping Han


We have observed that sometimes pig appears to continue even when an
underlying mapred job fails.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-608) Compile or validate the whole script before execution

2009-01-07 Thread Yiping Han (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661714#action_12661714
 ] 

Yiping Han commented on PIG-608:


Alan, I think that's the problem. Everything up to a store or dump is checked,
but if there is an error after that, it will not be reported until the previous
store or dump has finished. I don't think this is a duplicate of PIG-607, but I
agree the fix for PIG-607 might also fix this problem (depending on the actual
solution).

> Compile or validate the whole script before execution
> -
>
> Key: PIG-608
> URL: https://issues.apache.org/jira/browse/PIG-608
> Project: Pig
>  Issue Type: Improvement
>  Components: grunt
>Reporter: Yiping Han
>
> This is a very common scenario:
> We are running a big pig job that contains several hadoop jobs. It has been
> running for a long time, the first hadoop job succeeds, and then pig suddenly
> reports that it found a syntax error in the script after the first hadoop
> job... we have to start over from the beginning.
> It would be nice if pig could compile to the end of the script and find all
> the syntax errors, type mismatches, etc., before it really starts execution.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-609) PIG does not return the correct error code

2009-01-07 Thread Yiping Han (JIRA)
PIG does not return the correct error code
--

 Key: PIG-609
 URL: https://issues.apache.org/jira/browse/PIG-609
 Project: Pig
  Issue Type: Bug
Reporter: Yiping Han


Pig still does not always return a correct error code. When the hadoop job
fails, pig sometimes still returns 0.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-608) Compile or validate the whole script before execution

2009-01-07 Thread Yiping Han (JIRA)
Compile or validate the whole script before execution
-

 Key: PIG-608
 URL: https://issues.apache.org/jira/browse/PIG-608
 Project: Pig
  Issue Type: Improvement
  Components: grunt
Reporter: Yiping Han


This is a very common scenario:

We are running a big pig job that contains several hadoop jobs. It has been
running for a long time, the first hadoop job succeeds, and then pig suddenly
reports that it found a syntax error in the script after the first hadoop
job... we have to start over from the beginning.

It would be nice if pig could compile to the end of the script and find all the
syntax errors, type mismatches, etc., before it really starts execution.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-607) Utilize intermediate results instead of re-execution

2009-01-07 Thread Yiping Han (JIRA)
Utilize intermediate results instead of re-execution


 Key: PIG-607
 URL: https://issues.apache.org/jira/browse/PIG-607
 Project: Pig
  Issue Type: New Feature
Reporter: Yiping Han
Priority: Critical


This is a long-standing problem: intermediate results are not reused. Every
STORE or DUMP is executed in a separate plan, and thus everything it depends on
is re-executed. This is really a terrible issue that should be fixed asap.




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-606) Setting replication factor in Pig

2009-01-07 Thread Yiping Han (JIRA)
Setting replication factor in Pig
-

 Key: PIG-606
 URL: https://issues.apache.org/jira/browse/PIG-606
 Project: Pig
  Issue Type: New Feature
Reporter: Yiping Han


We would like the STORE clause to be able to set the replication factor. This
is particularly useful for certain small files, e.g. files used in a replicated
join.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-605) Better explain and console output

2009-01-07 Thread Yiping Han (JIRA)
Better explain and console output
-

 Key: PIG-605
 URL: https://issues.apache.org/jira/browse/PIG-605
 Project: Pig
  Issue Type: Improvement
  Components: grunt
Reporter: Yiping Han


It would be nice if, when we explain the script, the corresponding mapred jobs
could be explicitly marked out in a neat way. And while we execute the script,
the console output could print the name and URL of the corresponding hadoop
jobs.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-604) Kill the Pig job should kill all associated Hadoop Jobs

2009-01-07 Thread Yiping Han (JIRA)
Kill the Pig job should kill all associated Hadoop Jobs
---

 Key: PIG-604
 URL: https://issues.apache.org/jira/browse/PIG-604
 Project: Pig
  Issue Type: Improvement
  Components: grunt
Reporter: Yiping Han


Currently, if we kill the pig job on the client machine, the hadoop jobs
already launched keep running. We have to kill these jobs manually.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-603) Pig Server

2009-01-07 Thread Yiping Han (JIRA)
Pig Server
--

 Key: PIG-603
 URL: https://issues.apache.org/jira/browse/PIG-603
 Project: Pig
  Issue Type: New Feature
  Components: grunt
Reporter: Yiping Han


With a real Pig Server, the pig job will not be killed when we lose the
client. An even more important reason for a Pig Server is that we could talk to
it through APIs to query status, failures, etc.




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-602) Pass global configurations to UDF

2009-01-07 Thread Yiping Han (JIRA)
Pass global configurations to UDF
-

 Key: PIG-602
 URL: https://issues.apache.org/jira/browse/PIG-602
 Project: Pig
  Issue Type: New Feature
  Components: impl
Reporter: Yiping Han


We are seeking an easy way to pass a large number of global configurations to
UDFs.

Our application contains many pig jobs and has a large number of
configurations. Passing configurations on the command line is not ideal
(modifying a single parameter means changing multiple command lines), and
putting everything into the hadoop conf is not ideal either.

We would like Pig to provide a facility that allows us to pass a configuration
file in some format (XML?) and then makes it available throughout all the UDFs.
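A minimal sketch of one interim workaround, assuming the configuration is a
plain properties file that is already available on the task nodes (for
example, shipped alongside the job); the class name, file name and lookup key
are hypothetical, and this is not the facility being requested here. The file
name is passed as a UDF constructor argument, e.g.
DEFINE Lookup com.example.ConfiguredLookup('app.properties');

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.Properties;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    public class ConfiguredLookup extends EvalFunc<String> {

        private final Properties conf = new Properties();

        // Pig passes the DEFINE argument to this constructor.
        public ConfiguredLookup(String propertiesFile) throws IOException {
            FileInputStream in = new FileInputStream(propertiesFile);
            try {
                conf.load(in);
            } finally {
                in.close();
            }
        }

        @Override
        public String exec(Tuple input) throws IOException {
            if (input == null || input.size() == 0) {
                return null;
            }
            String key = (String) input.get(0);
            return conf.getProperty(key, "unknown");   // look the value up in the shared config
        }
    }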



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-601) Add finalize() interface to UDF

2009-01-07 Thread Yiping Han (JIRA)
Add finalize() interface to UDF
---

 Key: PIG-601
 URL: https://issues.apache.org/jira/browse/PIG-601
 Project: Pig
  Issue Type: New Feature
  Components: impl
Reporter: Yiping Han


I would like UDFs to have a finalize() method that is called when there is no
more input and the UDF is about to be torn down. The finalize() method should
be allowed to generate extra output, which in many cases could benefit
aggregations.

There are a couple of applications that can benefit from this feature.

One example: in some UDFs I need to open a resource (e.g. a local file), and
when the task finishes I need to close it.

Another example: in one of my applications I compute statistics for a list of
categories, and I need to generate a summary category and attach it to the end
of the table. With a finalize method I could achieve this in an efficient and
neat way.
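A hypothetical sketch of what the requested hook could look like; neither this
interface nor the callback exists in Pig, and the method is named finish()
here only to avoid clashing with java.lang.Object.finalize(). The class names
are made up.

    import java.io.IOException;
    import org.apache.pig.data.Tuple;
    import org.apache.pig.data.TupleFactory;

    // Hypothetical hook: called once after the last input tuple; may release
    // resources and may return one extra output tuple (or null for none).
    interface FinalizableUdf {
        Tuple finish() throws IOException;
    }

    // Illustrative implementer: counts records and emits a trailing summary row,
    // as in the "summary category" example above.
    class CategoryStats implements FinalizableUdf {

        private long total = 0;

        // Would be called from the UDF's normal per-tuple processing.
        public void observe(Tuple t) {
            total++;
        }

        @Override
        public Tuple finish() throws IOException {
            Tuple summary = TupleFactory.getInstance().newTuple(2);
            summary.set(0, "ALL");      // the summary category appended at the end
            summary.set(1, total);
            return summary;
        }
    }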

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.