[jira] Updated: (PIG-1533) Compression codec should be a per-store property

2010-08-05 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1533:
--

  Status: Resolved  (was: Patch Available)
Hadoop Flags: [Reviewed]
  Resolution: Fixed

> Compression codec should be a per-store property
> 
>
> Key: PIG-1533
> URL: https://issues.apache.org/jira/browse/PIG-1533
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Richard Ding
>Assignee: Richard Ding
> Fix For: 0.8.0
>
> Attachments: PIG-1533.patch
>
>
> The following script with multi-query optimization
> {code}
> a = load 'input';
> store a into 'outout.bz2';
> store a into 'outout2';
> {code}
> generates two .bz files, while only one of them should be compressed.
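
To illustrate the intended behavior, here is a minimal Java sketch; it is not the code from PIG-1533.patch. It assumes the codec is inferred from the suffix of each store location. The `.bz2` suffix comes from the script above, while the `.gz` mapping and all class and method names are hypothetical. The point is only that the decision is made per store rather than once per job.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: picks a codec for each store location independently,
// instead of applying one job-wide compression setting to every output.
final class PerStoreCodecSketch {

    /** Returns a codec name for the location, or "none" when uncompressed. */
    static String codecFor(String storeLocation) {
        if (storeLocation.endsWith(".bz2")) {
            return "bzip2";
        }
        if (storeLocation.endsWith(".gz")) { // assumed mapping, for illustration
            return "gzip";
        }
        return "none";
    }

    /** Resolves every store of a multi-query script on its own. */
    static Map<String, String> codecsFor(List<String> storeLocations) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String location : storeLocations) {
            result.put(location, codecFor(location));
        }
        return result;
    }

    public static void main(String[] args) {
        // The two store locations from the script above.
        System.out.println(codecsFor(Arrays.asList("outout.bz2", "outout2")));
    }
}
```

With the two locations from the script, only the first one resolves to a compression codec in this sketch.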

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1525) Incorrect data generated by diff of SUM

2010-08-05 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1525:
--

Attachment: PIG-1525.patch

> Incorrect data generated by diff of SUM
> ---
>
> Key: PIG-1525
> URL: https://issues.apache.org/jira/browse/PIG-1525
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Richard Ding
>Assignee: Richard Ding
> Fix For: 0.8.0
>
> Attachments: PIG-1525.patch
>
>
> Given data:
> input1:
> {code}
> id9 0
> {code}
> input2:
> {code}
> id8 1
> id9 1
> {code}
> Pig script
> {code}
> A = LOAD 'input1' AS (id:chararray, val:long);
> B = LOAD 'input2' AS (id:chararray, val:long);
> C = COGROUP A BY id, B BY id;
> D = FOREACH C GENERATE group, SUM(B.val), SUM(A.val), (SUM(A.val) - 
> SUM(B.val));
> dump D;
> {code}
> generates incorrect data:
> {code}
> (id8,1L,,)
> (id9,1L,0L,-2L)
> {code}
> The workaround is to replace the FOREACH statement with
> {code}
> D = FOREACH C GENERATE group, SUM(B.val) as b, SUM(A.val) as a;
> E = FOREACH D GENERATE $0, b, a, (a-b);
> {code}




[jira] Updated: (PIG-1525) Incorrect data generated by diff of SUM

2010-08-05 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1525:
--

Status: Patch Available  (was: Open)

> Incorrect data generated by diff of SUM
> ---
>
> Key: PIG-1525
> URL: https://issues.apache.org/jira/browse/PIG-1525
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Richard Ding
>Assignee: Richard Ding
> Fix For: 0.8.0
>
> Attachments: PIG-1525.patch
>
>



[jira] Commented: (PIG-1334) Make pig artifacts available through maven

2010-08-05 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12895802#action_12895802
 ] 

Richard Ding commented on PIG-1334:
---


I ran the mvn-deploy target. It succeeded, and the Pig jar and other artifacts 
were deployed to 

{code}
https://repository.apache.org/content/repositories/snapshots/org/apache/hadoop/pig/0.8.0-SNAPSHOT/
{code}

Giri, can you review the new patch?

> Make pig artifacts available through maven
> --
>
> Key: PIG-1334
> URL: https://issues.apache.org/jira/browse/PIG-1334
> Project: Pig
>  Issue Type: Improvement
>Reporter: Olga Natkovich
>Assignee: niraj rai
> Fix For: 0.8.0
>
> Attachments: mvn-pig.patch, mvn_pig_2.patch, mvn_pig_3.patch, 
> mvn_pig_4.patch
>
>





[jira] Commented: (PIG-1525) Incorrect data generated by diff of SUM

2010-08-06 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12896094#action_12896094
 ] 

Richard Ding commented on PIG-1525:
---



Results of running test-patch:

{code}
 [exec] +1 overall.  
 [exec] 
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec] 
 [exec] +1 tests included.  The patch appears to include 3 new or 
modified tests.
 [exec] 
 [exec] +1 javadoc.  The javadoc tool did not generate any warning 
messages.
 [exec] 
 [exec] +1 javac.  The applied patch does not increase the total number 
of javac compiler warnings.
 [exec] 
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs 
warnings.
 [exec] 
 [exec] +1 release audit.  The applied patch does not increase the 
total number of release audit warnings.
{code}

> Incorrect data generated by diff of SUM
> ---
>
> Key: PIG-1525
> URL: https://issues.apache.org/jira/browse/PIG-1525
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Richard Ding
>Assignee: Richard Ding
> Fix For: 0.8.0
>
> Attachments: PIG-1525.patch
>
>



[jira] Commented: (PIG-103) Shared Job /tmp location should be configurable

2010-08-06 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12896102#action_12896102
 ] 

Richard Ding commented on PIG-103:
--

The patch looks good.  A couple of comments:

* In FileLocalizer, it's better to call the getProperty

{code}
String tdir= pigContext.getProperties().getProperty("pig.temp.loc", "/tmp");
{code}

from inside of the if-block so it only gets called when needed.

* In the unit test, it would be good to verify that the method

{code}
FileLocalizer.getTemporaryPath(PigContext pigContext)
{code}

returns the correct temp directory.
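
Both comments can be sketched in Java as follows. This is a minimal, hypothetical stand-in and does not reproduce Pig's FileLocalizer; only the property name "pig.temp.loc" and the "/tmp" default come from the snippet above, and the class, field, and method names are assumptions made for illustration.

```java
import java.util.Properties;

// Hypothetical stand-in for the temp-directory logic discussed above; it does
// not reproduce Pig's FileLocalizer.
final class TempDirSketch {

    private String cachedTempDir; // null until first resolved

    /** Reads "pig.temp.loc" only when no directory has been resolved yet. */
    String tempDir(Properties props) {
        if (cachedTempDir == null) {
            // The lookup lives inside the if-block, so it runs at most once.
            cachedTempDir = props.getProperty("pig.temp.loc", "/tmp");
        }
        return cachedTempDir;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("pig.temp.loc", "/shared/tmp"); // example value
        System.out.println(new TempDirSketch().tempDir(props));
    }
}
```

A unit test along the lines of the second comment would set the property, call the method, and compare the result against the configured directory, plus one case where the property is absent and the default applies.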

> Shared Job /tmp location should be configurable
> ---
>
> Key: PIG-103
> URL: https://issues.apache.org/jira/browse/PIG-103
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
> Environment: Partially shared file:// filesystem (eg NFS)
>Reporter: Craig Macdonald
>Assignee: niraj rai
> Fix For: 0.8.0
>
> Attachments: conf_tmp_dir.patch
>
>
> Hello,
> I'm investigating running pig in an environment where various parts of the 
> file:// filesystem are available on all nodes. I can tell hadoop to use a 
> file:// file system location for its default, by setting 
> fs.default.name=file://path/to/shared/folder
> However, this creates issues for Pig, as Pig writes its job information in a 
> folder that it assumes is a shared FS (eg DFS). However, in this scenario 
> /tmp is not shared on each machine.
> So /tmp should either be configurable, or Hadoop should tell you the actual 
> full location set in fs.default.name?
> Straightforward solution is to make "/tmp/" a property in 
> src/org/apache/pig/impl/io/FileLocalizer.java init(PigContext)
> Any suggestions of property names?




[jira] Commented: (PIG-1525) Incorrect data generated by diff of SUM

2010-08-06 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12896121#action_12896121
 ] 

Richard Ding commented on PIG-1525:
---

It turns out that the problem also affects the conditional operator (BinCond). 

> Incorrect data generated by diff of SUM
> ---
>
> Key: PIG-1525
> URL: https://issues.apache.org/jira/browse/PIG-1525
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Richard Ding
>Assignee: Richard Ding
> Fix For: 0.8.0
>
> Attachments: PIG-1525.patch
>
>



[jira] Commented: (PIG-1525) Incorrect data generated by diff of SUM

2010-08-06 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12896123#action_12896123
 ] 

Richard Ding commented on PIG-1525:
---

The cause is the interaction between Accumulator UDF and binary operators. In 
the failure cases, the state kept by Accumulator is not reset across record 
boundaries. 
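
The effect of a missing reset can be illustrated with a small Java sketch. It is a simplification under stated assumptions: the class below is hypothetical and works on plain longs, whereas Pig's real Accumulator interface operates on tuples and bags, and the sketch does not claim to reproduce the exact code path fixed by the patch. It uses the data from the issue description (group id8: A empty, B holds 1; group id9: A holds 0, B holds 1).

```java
// Hypothetical, simplified accumulator; Pig's real Accumulator interface works
// on tuples and bags and is not reproduced here.
final class SumAccumulatorSketch {

    private Long sum; // null until a value has been accumulated

    void accumulate(long value) {
        sum = (sum == null) ? Long.valueOf(value) : Long.valueOf(sum + value);
    }

    Long getValue() {
        return sum;
    }

    /** Clears the state; must run at every record (group) boundary. */
    void cleanup() {
        sum = null;
    }

    /** Feeds one group into both accumulators and returns SUM(a) - SUM(b). */
    static Long diffOfSums(long[] aVals, long[] bVals,
                           SumAccumulatorSketch sumA, SumAccumulatorSketch sumB) {
        for (long v : aVals) {
            sumA.accumulate(v);
        }
        for (long v : bVals) {
            sumB.accumulate(v);
        }
        Long a = sumA.getValue();
        Long b = sumB.getValue();
        if (a == null || b == null) {
            return null; // a null operand makes the difference null
        }
        return a - b;
    }

    public static void main(String[] args) {
        SumAccumulatorSketch sumA = new SumAccumulatorSketch();
        SumAccumulatorSketch sumB = new SumAccumulatorSketch();

        // Group id8: A is empty, B holds 1.
        System.out.println(diffOfSums(new long[] {}, new long[] {1}, sumA, sumB));

        // Reset at the record boundary; dropping these two calls lets the
        // sum from id8 leak into id9.
        sumA.cleanup();
        sumB.cleanup();

        // Group id9: A holds 0, B holds 1.
        System.out.println(diffOfSums(new long[] {0}, new long[] {1}, sumA, sumB));
    }
}
```

Without the two cleanup() calls, the second group in this sketch evaluates 0 - 2 = -2, the same value as in the reported output; with them it evaluates 0 - 1 = -1.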

> Incorrect data generated by diff of SUM
> ---
>
> Key: PIG-1525
> URL: https://issues.apache.org/jira/browse/PIG-1525
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Richard Ding
>Assignee: Richard Ding
> Fix For: 0.8.0
>
> Attachments: PIG-1525.patch
>
>



[jira] Updated: (PIG-1525) Incorrect data generated by diff of SUM

2010-08-09 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1525:
--

Attachment: PIG-1525_1.patch

Thanks Thejas for suggesting a simple fix. The new patch passed core tests.

> Incorrect data generated by diff of SUM
> ---
>
> Key: PIG-1525
> URL: https://issues.apache.org/jira/browse/PIG-1525
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Richard Ding
>Assignee: Richard Ding
> Fix For: 0.8.0
>
> Attachments: PIG-1525.patch, PIG-1525_1.patch
>
>



[jira] Updated: (PIG-1525) Incorrect data generated by diff of SUM

2010-08-09 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1525:
--

  Status: Resolved  (was: Patch Available)
Hadoop Flags: [Reviewed]
  Resolution: Fixed

> Incorrect data generated by diff of SUM
> ---
>
> Key: PIG-1525
> URL: https://issues.apache.org/jira/browse/PIG-1525
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Richard Ding
>Assignee: Richard Ding
> Fix For: 0.8.0
>
> Attachments: PIG-1525.patch, PIG-1525_1.patch
>
>



[jira] Updated: (PIG-103) Shared Job /tmp location should be configurable

2010-08-09 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-103:
-

  Status: Resolved  (was: Patch Available)
Hadoop Flags: [Reviewed]
  Resolution: Fixed

The patch is committed to the trunk. Thanks, Niraj.

> Shared Job /tmp location should be configurable
> ---
>
> Key: PIG-103
> URL: https://issues.apache.org/jira/browse/PIG-103
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
> Environment: Partially shared file:// filesystem (eg NFS)
>Reporter: Craig Macdonald
>Assignee: niraj rai
> Fix For: 0.8.0
>
> Attachments: conf_tmp_dir.patch, conf_tmp_dir_2.patch
>
>



[jira] Created: (PIG-1541) FR Join shouldn't match null values

2010-08-10 Thread Richard Ding (JIRA)
FR Join shouldn't match null values
---

 Key: PIG-1541
 URL: https://issues.apache.org/jira/browse/PIG-1541
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Richard Ding
Assignee: Richard Ding
 Fix For: 0.8.0



Here is an example:

Data input:

{code}
1   1
2
{code}

the script 

{code}
a = load 'input';
b = load 'input';
c = join a by $0, b by $0 using 'repl';
dump c; 
{code}

generates results that match null values:

{code}
(1,1,1,1)
(,2,,2)
{code}

The regular join, on the other hand, gives the correct results.
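
The expected semantics can be sketched with a small in-memory hash join in Java. This is a hypothetical illustration, not Pig's replicated-join operator and not the patch: the class and method names are made up, and the input rows ("1", "1") and (null, "2") are inferred from the output shown above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory hash join; it only illustrates null-key handling and
// is not Pig's replicated-join operator.
final class NullSafeHashJoinSketch {

    /** Joins left and right on column 0; rows whose key is null never match. */
    static List<List<String>> join(List<List<String>> left, List<List<String>> right) {
        // Build side: index the (small) right input by key, skipping null keys.
        Map<String, List<List<String>>> index = new HashMap<>();
        for (List<String> row : right) {
            String key = row.get(0);
            if (key != null) {
                index.computeIfAbsent(key, k -> new ArrayList<>()).add(row);
            }
        }
        // Probe side: a row with a null key is dropped before the lookup.
        List<List<String>> out = new ArrayList<>();
        for (List<String> row : left) {
            String key = row.get(0);
            if (key == null) {
                continue;
            }
            List<List<String>> matches = index.get(key);
            if (matches == null) {
                continue;
            }
            for (List<String> match : matches) {
                List<String> joined = new ArrayList<>(row);
                joined.addAll(match);
                out.add(joined);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Rows assumed from the report's output: ("1", "1") and (null, "2").
        List<List<String>> input = Arrays.asList(
                Arrays.asList("1", "1"),
                Arrays.asList(null, "2"));
        System.out.println(join(input, input));
    }
}
```

In this sketch the self-join of the sample input keeps only the row pair with key 1; the rows with a null key do not appear in the result.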




[jira] Commented: (PIG-1458) aggregate files for replicated join

2010-08-11 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897451#action_12897451
 ] 

Richard Ding commented on PIG-1458:
---

The proposal is to run another map-reduce job to merge the small files before 
the replicated join. This additional job will be added to the MR plan at 
compile time.

We consider three cases of a replicated join: 

# The right input is a map-only job and the input files exist at compile time.
# The right input is a map-only job and the input files do not exist at compile 
time.
# The right input is a map-reduce job.

For 1., if the number of files exceeds the threshold specified in the property 
file (_pig.frjoin.merge.files.threshold_), a merge job is added between the 
right input job and the FR join job.

For 3., if the number of reducers exceeds the threshold specified in the 
property file (_pig.frjoin.merge.files.threshold_), a merge job is added 
between the right input job and the FR join job.

For 2., if the flag specified in the property file 
(_pig.frjoin.merge.files.optimistic_) is false, a merge job is added between 
the right input job and the FR join job. The default value of this flag is false. 
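
The three cases can be summarized as a small decision function. The Java below is a sketch under stated assumptions: the enum, method, and parameter names are hypothetical and not taken from any patch, and only the two property names come from the proposal.

```java
// Hypothetical decision helper for the three cases above; the enum, method,
// and parameter names are illustrative and not taken from the patch.
final class FrJoinMergeSketch {

    enum RightInput {
        MAP_ONLY_FILES_EXIST,    // case 1
        MAP_ONLY_FILES_UNKNOWN,  // case 2
        MAP_REDUCE               // case 3
    }

    /**
     * @param count      number of input files (case 1) or reducers (case 3);
     *                   ignored in case 2
     * @param threshold  value of pig.frjoin.merge.files.threshold
     * @param optimistic value of pig.frjoin.merge.files.optimistic
     */
    static boolean needsMergeJob(RightInput input, int count, int threshold, boolean optimistic) {
        switch (input) {
            case MAP_ONLY_FILES_EXIST:
            case MAP_REDUCE:
                return count > threshold;
            case MAP_ONLY_FILES_UNKNOWN:
                return !optimistic;
            default:
                throw new IllegalArgumentException("Unknown input kind: " + input);
        }
    }

    public static void main(String[] args) {
        // 100 is an arbitrary example threshold, not a Pig default.
        System.out.println(needsMergeJob(RightInput.MAP_ONLY_FILES_EXIST, 250, 100, false));
    }
}
```

Keeping the decision in one place makes it easy to see that cases 1 and 3 share the same threshold, while case 2 depends only on the optimistic flag.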



> aggregate files for replicated join
> ---
>
> Key: PIG-1458
> URL: https://issues.apache.org/jira/browse/PIG-1458
> Project: Pig
>  Issue Type: Improvement
>Reporter: Olga Natkovich
>Assignee: Richard Ding
> Fix For: 0.8.0
>
>
> We have noticed that if the smaller data in a replicated join has many files, 
> this puts an unneeded burden on the name node. Pre-aggregating the files can 
> improve the situation.




[jira] Updated: (PIG-103) Shared Job /tmp location should be configurable

2010-08-11 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-103:
-

Tags: documentation

> Shared Job /tmp location should be configurable
> ---
>
> Key: PIG-103
> URL: https://issues.apache.org/jira/browse/PIG-103
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
> Environment: Partially shared file:// filesystem (eg NFS)
>Reporter: Craig Macdonald
>Assignee: niraj rai
> Fix For: 0.8.0
>
> Attachments: conf_tmp_dir.patch, conf_tmp_dir_2.patch
>
>



[jira] Commented: (PIG-1458) aggregate files for replicated join

2010-08-11 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897484#action_12897484
 ] 

Richard Ding commented on PIG-1458:
---

For 1. and 2. above, another approach is to do nothing and rely on 
MultiFileInputFormat (PIG-1518) to merge small files. 

> aggregate files for replicated join
> ---
>
> Key: PIG-1458
> URL: https://issues.apache.org/jira/browse/PIG-1458
> Project: Pig
>  Issue Type: Improvement
>Reporter: Olga Natkovich
>Assignee: Richard Ding
> Fix For: 0.8.0
>
>



[jira] Updated: (PIG-1541) FR Join shouldn't match null values

2010-08-11 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1541:
--

Status: Patch Available  (was: Open)

> FR Join shouldn't match null values
> ---
>
> Key: PIG-1541
> URL: https://issues.apache.org/jira/browse/PIG-1541
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Richard Ding
>Assignee: Richard Ding
> Fix For: 0.8.0
>
> Attachments: PIG-1541.patch
>
>



[jira] Updated: (PIG-1541) FR Join shouldn't match null values

2010-08-11 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1541:
--

Attachment: PIG-1541.patch

> FR Join shouldn't match null values
> ---
>
> Key: PIG-1541
> URL: https://issues.apache.org/jira/browse/PIG-1541
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Richard Ding
>Assignee: Richard Ding
> Fix For: 0.8.0
>
> Attachments: PIG-1541.patch
>
>



[jira] Commented: (PIG-1541) FR Join shouldn't match null values

2010-08-12 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897866#action_12897866
 ] 

Richard Ding commented on PIG-1541:
---


Results of test-patch:

{code}
 [exec] +1 overall.  
 [exec] 
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec] 
 [exec] +1 tests included.  The patch appears to include 6 new or 
modified tests.
 [exec] 
 [exec] +1 javadoc.  The javadoc tool did not generate any warning 
messages.
 [exec] 
 [exec] +1 javac.  The applied patch does not increase the total number 
of javac compiler warnings.
 [exec] 
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs 
warnings.
 [exec] 
 [exec] +1 release audit.  The applied patch does not increase the 
total number of release audit warnings.
{code}

> FR Join shouldn't match null values
> ---
>
> Key: PIG-1541
> URL: https://issues.apache.org/jira/browse/PIG-1541
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Richard Ding
>Assignee: Richard Ding
> Fix For: 0.8.0
>
> Attachments: PIG-1541.patch
>
>



[jira] Updated: (PIG-1541) FR Join shouldn't match null values

2010-08-13 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1541:
--

Attachment: PIG-1541_1.patch

New patch to address the general case where the join key is a tuple.

> FR Join shouldn't match null values
> ---
>
> Key: PIG-1541
> URL: https://issues.apache.org/jira/browse/PIG-1541
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Richard Ding
>Assignee: Richard Ding
> Fix For: 0.8.0
>
> Attachments: PIG-1541.patch, PIG-1541_1.patch
>
>



[jira] Commented: (PIG-1448) Detach tuple from inner plans of physical operator

2010-08-13 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898450#action_12898450
 ] 

Richard Ding commented on PIG-1448:
---

+1. Looks good.

> Detach tuple from inner plans of physical operator 
> ---
>
> Key: PIG-1448
> URL: https://issues.apache.org/jira/browse/PIG-1448
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.1.0, 0.2.0, 0.3.0, 0.4.0, 0.5.0, 0.6.0, 0.7.0
>Reporter: Ashutosh Chauhan
>Assignee: Thejas M Nair
> Fix For: 0.8.0
>
> Attachments: multi_oom_filt.pig, PIG-1448.1.patch
>
>
> This is a follow-up on PIG-1446 which only addresses this general problem for 
> a specific instance of For Each. In general, all the physical operators which 
> can have inner plans are vulnerable to this. A few of them are 
> POLocalRearrange, POFilter, POCollectedGroup, etc. All of these need to be fixed.




[jira] Updated: (PIG-1541) FR Join shouldn't match null values

2010-08-16 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1541:
--

  Status: Resolved  (was: Patch Available)
Hadoop Flags: [Reviewed]
  Resolution: Fixed

Tests are successful. The patch is committed to the trunk. 

> FR Join shouldn't match null values
> ---
>
> Key: PIG-1541
> URL: https://issues.apache.org/jira/browse/PIG-1541
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Richard Ding
>Assignee: Richard Ding
> Fix For: 0.8.0
>
> Attachments: PIG-1541.patch, PIG-1541_1.patch
>
>



[jira] Updated: (PIG-1392) Parser fails to recognize valid field

2010-08-16 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1392:
--

  Status: Resolved  (was: Patch Available)
Hadoop Flags: [Reviewed]
  Resolution: Fixed

The parser bug is fixed, but another problem was encountered, which is tracked 
by PIG-1545. The workaround is to disable the secondary key optimization.

The patch is committed to the trunk.

> Parser fails to recognize valid field
> -
>
> Key: PIG-1392
> URL: https://issues.apache.org/jira/browse/PIG-1392
> Project: Pig
>  Issue Type: Bug
>Reporter: Ankur
>Assignee: niraj rai
> Fix For: 0.8.0
>
> Attachments: nested_parser.patch
>
>
> Using the script below, the parser fails to recognize a valid field in the 
> relation and throws an error:
> A = LOAD '/tmp' as (a:int, b:chararray, c:int);
> B = GROUP A BY (a, b);
> C = FOREACH B { bg = A.(b,c); GENERATE group, bg; } ;
> The error thrown is
> 2010-04-23 10:16:20,610 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
> 1000: Error during parsing. Invalid alias: c in {group: (a: int,b: 
> chararray),A: {a: int,b: chararray,c: int}}




[jira] Commented: (PIG-1392) Parser fails to recognize valid field

2010-08-16 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899003#action_12899003
 ] 

Richard Ding commented on PIG-1392:
---

Thanks Niraj for fixing this issue.

> Parser fails to recognize valid field
> -
>
> Key: PIG-1392
> URL: https://issues.apache.org/jira/browse/PIG-1392
> Project: Pig
>  Issue Type: Bug
>Reporter: Ankur
>Assignee: niraj rai
> Fix For: 0.8.0
>
> Attachments: nested_parser.patch
>
>



[jira] Commented: (PIG-1334) Make pig artifacts available through maven

2010-08-16 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899053#action_12899053
 ] 

Richard Ding commented on PIG-1334:
---

bq. 2. This jar is 11MB and includes a bunch of dependencies, many of which are 
optional:

We should deploy _pig-0.8.0-SNAPSHOT-core.jar_ (which contains only Pig classes) 
instead of _pig-0.8.0-SNAPSHOT.jar_ (which also contains dependent jars).

> Make pig artifacts available through maven
> --
>
> Key: PIG-1334
> URL: https://issues.apache.org/jira/browse/PIG-1334
> Project: Pig
>  Issue Type: Improvement
>Reporter: Olga Natkovich
>Assignee: niraj rai
> Fix For: 0.8.0
>
> Attachments: mvn-pig.patch, mvn_pig_2.patch, mvn_pig_3.patch, 
> mvn_pig_4.patch
>
>





[jira] Updated: (PIG-1452) to remove hadoop20.jar from lib and use hadoop from the apache maven repo.

2010-08-16 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1452:
--

Attachment: PIG-1452_3.patch

I resynced the patch with the trunk; pig.jar is now about 8 MB.

> to remove hadoop20.jar from lib and use hadoop from the apache maven repo.
> --
>
> Key: PIG-1452
> URL: https://issues.apache.org/jira/browse/PIG-1452
> Project: Pig
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 0.8.0
>Reporter: Giridharan Kesavan
>Assignee: Giridharan Kesavan
> Fix For: 0.8.0
>
> Attachments: PIG-1452.PATCH, PIG-1452_3.patch, PIG-1452V2.PATCH
>
>
> Pig uses ivy for dependency management, but it still uses hadoop20.jar from 
> the lib folder. 
> Now that we have the hadoop-0.20.2 artifacts available in the maven repo, pig 
> should leverage ivy for resolving/retrieving hadoop artifacts.




[jira] Updated: (PIG-1452) to remove hadoop20.jar from lib and use hadoop from the apache maven repo.

2010-08-17 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1452:
--

Attachment: PIG-1452V4.PATCH

New patch fixing the contrib projects. 

> to remove hadoop20.jar from lib and use hadoop from the apache maven repo.
> --
>
> Key: PIG-1452
> URL: https://issues.apache.org/jira/browse/PIG-1452
> Project: Pig
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 0.8.0
>Reporter: Giridharan Kesavan
>Assignee: Giridharan Kesavan
> Fix For: 0.8.0
>
> Attachments: PIG-1452.PATCH, PIG-1452_3.patch, PIG-1452V2.PATCH, 
> PIG-1452V4.PATCH
>
>
> Pig uses ivy for dependency management, but it still uses hadoop20.jar from 
> the lib folder. 
> Now that we have the hadoop-0.20.2 artifacts available in the maven repo, pig 
> should leverage ivy for resolving/retrieving hadoop artifacts.




[jira] Updated: (PIG-1452) to remove hadoop20.jar from lib and use hadoop from the apache maven repo.

2010-08-17 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1452:
--

Status: Open  (was: Patch Available)

> to remove hadoop20.jar from lib and use hadoop from the apache maven repo.
> --
>
> Key: PIG-1452
> URL: https://issues.apache.org/jira/browse/PIG-1452
> Project: Pig
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 0.8.0
>Reporter: Giridharan Kesavan
>Assignee: Giridharan Kesavan
> Fix For: 0.8.0
>
> Attachments: PIG-1452.PATCH, PIG-1452_3.patch, PIG-1452V2.PATCH, 
> PIG-1452V4.PATCH
>
>
> Pig uses ivy for dependency management, but it still uses hadoop20.jar from 
> the lib folder. 
> Now that we have the hadoop-0.20.2 artifacts available in the maven repo, pig 
> should leverage ivy for resolving/retrieving hadoop artifacts.




[jira] Updated: (PIG-1452) to remove hadoop20.jar from lib and use hadoop from the apache maven repo.

2010-08-17 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1452:
--

Status: Patch Available  (was: Open)

> to remove hadoop20.jar from lib and use hadoop from the apache maven repo.
> --
>
> Key: PIG-1452
> URL: https://issues.apache.org/jira/browse/PIG-1452
> Project: Pig
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 0.8.0
>Reporter: Giridharan Kesavan
>Assignee: Giridharan Kesavan
> Fix For: 0.8.0
>
> Attachments: PIG-1452.PATCH, PIG-1452_3.patch, PIG-1452V2.PATCH, 
> PIG-1452V4.PATCH
>
>
> Pig uses ivy for dependency management, but it still uses hadoop20.jar from 
> the lib folder. 
> Now that we have the hadoop-0.20.2 artifacts available in the maven repo, pig 
> should leverage ivy for resolving/retrieving hadoop artifacts.




[jira] Commented: (PIG-1452) to remove hadoop20.jar from lib and use hadoop from the apache maven repo.

2010-08-17 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899631#action_12899631
 ] 

Richard Ding commented on PIG-1452:
---

The target "buildJar-withouthadoop" doesn't depend on hadoop20.jar so this 
change doesn't affect this target.

> to remove hadoop20.jar from lib and use hadoop from the apache maven repo.
> --
>
> Key: PIG-1452
> URL: https://issues.apache.org/jira/browse/PIG-1452
> Project: Pig
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 0.8.0
>Reporter: Giridharan Kesavan
>Assignee: Giridharan Kesavan
> Fix For: 0.8.0
>
> Attachments: PIG-1452.PATCH, PIG-1452_3.patch, PIG-1452V2.PATCH, 
> PIG-1452V4.PATCH
>
>
> Pig uses ivy for dependency management, but it still uses hadoop20.jar from 
> the lib folder. 
> Now that we have the hadoop-0.20.2 artifacts available in the maven repo, pig 
> should leverage ivy for resolving/retrieving hadoop artifacts.




[jira] Updated: (PIG-1452) to remove hadoop20.jar from lib and use hadoop from the apache maven repo.

2010-08-18 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1452:
--

Status: Resolved  (was: Patch Available)
Resolution: Fixed

> to remove hadoop20.jar from lib and use hadoop from the apache maven repo.
> --
>
> Key: PIG-1452
> URL: https://issues.apache.org/jira/browse/PIG-1452
> Project: Pig
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 0.8.0
>Reporter: Giridharan Kesavan
>Assignee: Giridharan Kesavan
> Fix For: 0.8.0
>
> Attachments: PIG-1452.PATCH, PIG-1452_3.patch, PIG-1452V2.PATCH, 
> PIG-1452V4.PATCH
>
>
> Pig uses ivy for dependency management, but it still uses hadoop20.jar from 
> the lib folder. 
> Now that we have the hadoop-0.20.2 artifacts available in the maven repo, pig 
> should leverage ivy for resolving/retrieving hadoop artifacts.




[jira] Commented: (PIG-1497) Mandatory rule PartitionFilterOptimizer

2010-08-18 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900100#action_12900100
 ] 

Richard Ding commented on PIG-1497:
---

Looks good. A few comments:

In _PartitionFilterPushDown_:

* In the _check_ method, why was the condition changed from

{code}
if(... || sucs.size() != 1 || ...) {
{code}

 to 

{code}
if(... || succeds.size() == 0 || ...)
{code}

* In the _transform_ method, the original code

{code}
// remove this filter from the plan  
mPlan.removeAndReconnect(loFilter);
{code}

is replaced by its own implementation. It seems better to also migrate the 
_removeAndReconnect_ to the new _OperatorPlan_ since the logic there is more 
complicated (keeping the order of connections). 

* The javadoc for the class isn't migrated.

* Several variables (e.g. loadFunc, loLoad, loFilter, ...) are now used only 
within the _PartitionFilterPushDownTransformer_ class, so it would be better to 
declare them inside the transformer class.

In addition,

* Need to remove all the tabs from the files and replace them with 4 spaces.
* Several unit tests now fail due to the dependency on other jiras.
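
To illustrate the point about keeping the order of connections when a filter is removed, here is a minimal sketch. It is not Pig's _OperatorPlan_ code: the class ToyPlan, its string node names, and its method signatures are all hypothetical, and it assumes the removed node has exactly one predecessor and one successor.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy directed graph for illustration only; not Pig's OperatorPlan API. */
final class ToyPlan {
    private final Map<String, List<String>> succs = new LinkedHashMap<>();
    private final Map<String, List<String>> preds = new LinkedHashMap<>();

    void add(String node) {
        succs.putIfAbsent(node, new ArrayList<>());
        preds.putIfAbsent(node, new ArrayList<>());
    }

    /** Adds an edge; edges are kept in insertion order on both endpoints. */
    void connect(String from, String to) {
        add(from);
        add(to);
        succs.get(from).add(to);
        preds.get(to).add(from);
    }

    List<String> successors(String node) { return succs.get(node); }

    List<String> predecessors(String node) { return preds.get(node); }

    /**
     * Removes a node that has exactly one predecessor and one successor and
     * splices the two neighbours together. The neighbour lists are updated
     * in place, so the position of the edge among its siblings is preserved.
     */
    void removeAndReconnect(String node) {
        List<String> p = preds.get(node);
        List<String> s = succs.get(node);
        if (p == null || s == null || p.size() != 1 || s.size() != 1) {
            throw new IllegalArgumentException(
                "node must have exactly one predecessor and one successor");
        }
        String pred = p.get(0);
        String succ = s.get(0);

        // Overwrite the slot the removed node occupied instead of
        // removing it and appending, which would reorder the inputs.
        List<String> predSuccs = succs.get(pred);
        predSuccs.set(predSuccs.indexOf(node), succ);
        List<String> succPreds = preds.get(succ);
        succPreds.set(succPreds.indexOf(node), pred);

        succs.remove(node);
        preds.remove(node);
    }
}
```

With a join whose inputs are loadA, filter (fed by loadB) and loadC, removing the filter this way leaves the join's inputs as loadA, loadB, loadC. A naive remove-then-append would move loadB to the end, which matters for operators whose inputs are positional.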

> Mandatory rule PartitionFilterOptimizer
> ---
>
> Key: PIG-1497
> URL: https://issues.apache.org/jira/browse/PIG-1497
> Project: Pig
>  Issue Type: Sub-task
>  Components: impl
>Affects Versions: 0.8.0
>Reporter: Daniel Dai
>Assignee: Xuefu Zhang
> Fix For: 0.8.0
>
> Attachments: jira-1497-0.patch
>
>
> Need to migrate PartitionFilterOptimizer to new logical optimizer.




[jira] Commented: (PIG-1514) Migrate logical optimization rule: OpLimitOptimizer

2010-08-19 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900376#action_12900376
 ] 

Richard Ding commented on PIG-1514:
---

Patch looks good. A couple of comments:

* It would be better to refactor the graph manipulation code into a helper 
class so that the graph transformation routines (such as swap, insert, remove, 
replace, ...) can be shared by all rules.
* Please remove tabs from the file. 

> Migrate logical optimization rule: OpLimitOptimizer
> ---
>
> Key: PIG-1514
> URL: https://issues.apache.org/jira/browse/PIG-1514
> Project: Pig
>  Issue Type: Sub-task
>  Components: impl
>Affects Versions: 0.7.0
>Reporter: Daniel Dai
>Assignee: Xuefu Zhang
> Fix For: 0.8.0
>
> Attachments: jira-1514-0.patch
>
>





[jira] Commented: (PIG-1334) Make pig artifacts available through maven

2010-08-19 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900518#action_12900518
 ] 

Richard Ding commented on PIG-1334:
---

The new output is at 
https://repository.apache.org/content/repositories/snapshots/org/apache/hadoop/pig/0.8.0-SNAPSHOT/

> Make pig artifacts available through maven
> --
>
> Key: PIG-1334
> URL: https://issues.apache.org/jira/browse/PIG-1334
> Project: Pig
>  Issue Type: Improvement
>Reporter: Olga Natkovich
>Assignee: niraj rai
> Fix For: 0.8.0
>
> Attachments: mvn-pig.patch, mvn_pig_2.patch, mvn_pig_3.patch, 
> mvn_pig_4.patch, mvn_pig_5.patch, mvn_pig_6.patch
>
>





[jira] Commented: (PIG-1505) support jars and scripts in dfs

2010-08-20 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900811#action_12900811
 ] 

Richard Ding commented on PIG-1505:
---


The results of test-patch:

{code}
 [exec] +1 overall.  
 [exec] 
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec] 
 [exec] +1 tests included.  The patch appears to include 3 new or 
modified tests.
 [exec] 
 [exec] +1 javadoc.  The javadoc tool did not generate any warning 
messages.
 [exec] 
 [exec] +1 javac.  The applied patch does not increase the total number 
of javac compiler warnings.
 [exec] 
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs 
warnings.
 [exec] 
 [exec] +1 release audit.  The applied patch does not increase the 
total number of release audit warnings.
{code}

I'll commit the patch after running unit tests.




> support jars and scripts in dfs
> ---
>
> Key: PIG-1505
> URL: https://issues.apache.org/jira/browse/PIG-1505
> Project: Pig
>  Issue Type: Improvement
>Reporter: Andrew Hitchcock
>Assignee: Andrew Hitchcock
> Attachments: PIG-1505-4.patch, pig-jars-and-scripts-from-dfs-3.patch, 
> pig-jars-and-scripts-from-dfs-trunk-1.patch, 
> pig-jars-and-scripts-from-dfs-trunk-2.patch, 
> pig-jars-and-scripts-from-dfs-trunk.patch
>
>
> Pig can't operate on files stored in Amazon S3.




[jira] Updated: (PIG-1505) support jars and scripts in dfs

2010-08-20 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1505:
--

Fix Version/s: 0.8.0
Affects Version/s: 0.7.0

> support jars and scripts in dfs
> ---
>
> Key: PIG-1505
> URL: https://issues.apache.org/jira/browse/PIG-1505
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.7.0
>Reporter: Andrew Hitchcock
>Assignee: Andrew Hitchcock
> Fix For: 0.8.0
>
> Attachments: PIG-1505-4.patch, pig-jars-and-scripts-from-dfs-3.patch, 
> pig-jars-and-scripts-from-dfs-trunk-1.patch, 
> pig-jars-and-scripts-from-dfs-trunk-2.patch, 
> pig-jars-and-scripts-from-dfs-trunk.patch
>
>
> Pig can't operate on files stored in Amazon S3.




[jira] Updated: (PIG-1334) Make pig artifacts available through maven

2010-08-20 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1334:
--

Hadoop Flags: [Reviewed]
Release Note: 
ant mvn-install : installs the artifact to the local filesystem.
ant mvn-deploy : deploys snapshots to the Apache Nexus repo (looks for 
authentication in ~/.m2/settings.xml).
ant mvn-deploy -Drepo=staging : deploys artifacts for voting before a release; 
this also requires authentication configured in ~/.m2/settings.xml.
Deploying artifacts to the staging repository requires signing them with gpg 
keys; the mvn-deploy target takes care of the signing. When run with 
-Drepo=staging, mvn-deploy asks for the gpg passphrase, which needs to be typed 
in. Once the deployment succeeds, make the artifact available in the staging 
repository by logging in to the staging repository and closing the staging 
(right-click the staged artifact) at 
http://repository.apache.org


  was:
ant mvn-install   :To install artifact to the local filesystem
ant mvn-deploy  : To deploy snapshots to the apache nexus repo (looks for 
authentication in the ~/.m2/settings.xml)
ant mvn-deploy -Drepo=staging  :To deploy artifacts for voting before release , 
this also requires authentication configured in ~/.m2/settings.xml
Deploying artifacts to the staging repository requires signing the artifacts 
with gpg keys, mvn-deploy target takes care of signing the artifacts. While 
executing mvn-deploy target with -Drepo=staging it would ask for gpg passphrase 
which need to be keyed in. Once the deployment is successful, to make the 
artifact available in the staging repository , login into the staging 
repository and close the staging by right clicking on the staged artifact at 
http://repository.apache.org
With this patch I have already uploaded artifacts to the staging repository 
(only people with committer access can view them, as the repository 
is not closed yet)


> Make pig artifacts available through maven
> --
>
> Key: PIG-1334
> URL: https://issues.apache.org/jira/browse/PIG-1334
> Project: Pig
>  Issue Type: Improvement
>Reporter: Olga Natkovich
>Assignee: niraj rai
> Fix For: 0.8.0
>
> Attachments: mvn-pig.patch, mvn_pig_2.patch, mvn_pig_3.patch, 
> mvn_pig_4.patch, mvn_pig_5.patch, mvn_pig_6.patch
>
>





[jira] Updated: (PIG-1334) Make pig artifacts available through maven

2010-08-20 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1334:
--

Status: Resolved  (was: Patch Available)
Resolution: Fixed

The patch is committed to the trunk. Thanks Niraj for making this feature 
available.

> Make pig artifacts available through maven
> --
>
> Key: PIG-1334
> URL: https://issues.apache.org/jira/browse/PIG-1334
> Project: Pig
>  Issue Type: Improvement
>Reporter: Olga Natkovich
>Assignee: niraj rai
> Fix For: 0.8.0
>
> Attachments: mvn-pig.patch, mvn_pig_2.patch, mvn_pig_3.patch, 
> mvn_pig_4.patch, mvn_pig_5.patch, mvn_pig_6.patch
>
>





[jira] Updated: (PIG-1505) support jars and scripts in dfs

2010-08-20 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1505:
--

  Status: Resolved  (was: Patch Available)
Hadoop Flags: [Reviewed]
  Resolution: Fixed

All core tests passed. The patch is committed to the trunk. 

Thanks Andrew for contributing this feature!

> support jars and scripts in dfs
> ---
>
> Key: PIG-1505
> URL: https://issues.apache.org/jira/browse/PIG-1505
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.7.0
>Reporter: Andrew Hitchcock
>Assignee: Andrew Hitchcock
> Fix For: 0.8.0
>
> Attachments: PIG-1505-4.patch, pig-jars-and-scripts-from-dfs-3.patch, 
> pig-jars-and-scripts-from-dfs-trunk-1.patch, 
> pig-jars-and-scripts-from-dfs-trunk-2.patch, 
> pig-jars-and-scripts-from-dfs-trunk.patch
>
>
> Pig can't operate on files stored in Amazon S3.




[jira] Updated: (PIG-1505) support jars and scripts in dfs

2010-08-23 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1505:
--

Release Note: Pig now supports running scripts and registering jars that 
are stored in HDFS, Amazon S3, or other distributed file systems.   (was: Pig 
now supports running scripts and registering jars that are stored in HDFS, 
Amazon S3, or other distributed file systems. Also added a -R parameter which 
allows users to specify properties in key=value form on the command line.)

Remove -R option. In 0.8 Pig supports generic parameters such as -Dkey=value. 

> support jars and scripts in dfs
> ---
>
> Key: PIG-1505
> URL: https://issues.apache.org/jira/browse/PIG-1505
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.7.0
>Reporter: Andrew Hitchcock
>Assignee: Andrew Hitchcock
> Fix For: 0.8.0
>
> Attachments: PIG-1505-4.patch, pig-jars-and-scripts-from-dfs-3.patch, 
> pig-jars-and-scripts-from-dfs-trunk-1.patch, 
> pig-jars-and-scripts-from-dfs-trunk-2.patch, 
> pig-jars-and-scripts-from-dfs-trunk.patch
>
>
> Pig can't operate on files stored in Amazon S3.




[jira] Commented: (PIG-1518) multi file input format for loaders

2010-08-23 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901600#action_12901600
 ] 

Richard Ding commented on PIG-1518:
---

+1. The patch looks good.

A few minor points:

* In PigSplit, the method add(InputSplit split) is not used and can be removed.
* In MapRedUtil, it would be better not to leave the debug verification code in 
the source code.
* In PigRecordReader, the code can be simplified if the initNextRecordReader() 
call is moved from the constructor to the initialize() method.

> multi file input format for loaders
> ---
>
> Key: PIG-1518
> URL: https://issues.apache.org/jira/browse/PIG-1518
> Project: Pig
>  Issue Type: Improvement
>Reporter: Olga Natkovich
>Assignee: Yan Zhou
> Fix For: 0.8.0
>
> Attachments: PIG-1518.patch, PIG-1518.patch
>
>
> We frequently run into the situation where Pig needs to deal with small files 
> in the input. In this case a separate map is created for each file, which 
> could be very inefficient. 
> It would be great to have an umbrella input format that can take multiple 
> files and use them in a single split. We would like to see this working with 
> different data formats if possible.
> There are already a couple of input formats doing a similar thing: 
> MultifileInputFormat as well as CombinedInputFormat; however, neither works 
> with the new Hadoop 20 API. 
> We at least want to do a feasibility study for Pig 0.8.0.




[jira] Commented: (PIG-1551) Improve dynamic invokers to deal with no-arg methods and array parameters

2010-08-23 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901656#action_12901656
 ] 

Richard Ding commented on PIG-1551:
---


In Invoker.java, there is a typo:

{code}
private static final Class LONG_ARRAY_CLASS = new String[0].getClass();
{code}

Also, in the unPrimitivize method, this code seems unnecessary:

{code}
} else if (klass.equals(DOUBLE_ARRAY_CLASS)) {
return DOUBLE_ARRAY_CLASS;
{code}

Otherwise the patch looks good.

> Improve dynamic invokers to deal with no-arg methods and array parameters
> -
>
> Key: PIG-1551
> URL: https://issues.apache.org/jira/browse/PIG-1551
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.8.0
>Reporter: Dmitriy V. Ryaboy
>Assignee: Dmitriy V. Ryaboy
> Fix For: 0.8.0
>
> Attachments: PIG-1551.patch
>
>
> PIG-1354 introduced a set of UDFs that can be used to dynamically wrap simple 
> Java methods in a UDF, so that users don't need to create trivial wrappers if 
> they are ok sacrificing some speed.
> This issue is to extend the set of methods that can be wrapped this way to 
> include methods that do not take any arguments, and methods that take arrays 
> of {int,long,float,double,string} as arguments. 
> Arrays are expected to be represented by bags in Pig. Notably, this allows 
> users to wrap statistical functions in o.a.commons.math.stat.StatUtils . 




[jira] Created: (PIG-1560) Build target 'checkstyle' fails

2010-08-23 Thread Richard Ding (JIRA)
Build target 'checkstyle' fails
---

 Key: PIG-1560
 URL: https://issues.apache.org/jira/browse/PIG-1560
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Richard Ding
Assignee: Giridharan Kesavan
 Fix For: 0.8.0



Stack trace:

{code}
/homes/rding/apache-pig/trunk/build.xml:894: java.lang.NoClassDefFoundError: 
org/apache/commons/logging/LogFactory
at 
org.apache.commons.beanutils.ConvertUtilsBean.<init>(ConvertUtilsBean.java:130)
at 
com.puppycrawl.tools.checkstyle.api.AutomaticBean.createBeanUtilsBean(AutomaticBean.java:73)
at 
com.puppycrawl.tools.checkstyle.api.AutomaticBean.contextualize(AutomaticBean.java:222)
at 
com.puppycrawl.tools.checkstyle.CheckStyleTask.createChecker(CheckStyleTask.java:372)
at 
com.puppycrawl.tools.checkstyle.CheckStyleTask.realExecute(CheckStyleTask.java:304)
at 
com.puppycrawl.tools.checkstyle.CheckStyleTask.execute(CheckStyleTask.java:265)
at org.apache.tools.ant.UnknownElement.execute(UnknownElement.java:291)
at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at 
org.apache.tools.ant.dispatch.DispatchUtils.execute(DispatchUtils.java:106)
at org.apache.tools.ant.Task.perform(Task.java:348)
at org.apache.tools.ant.Target.execute(Target.java:390)
at org.apache.tools.ant.Target.performTasks(Target.java:411)
at org.apache.tools.ant.Project.executeSortedTargets(Project.java:1360)
at org.apache.tools.ant.Project.executeTarget(Project.java:1329)
at 
org.apache.tools.ant.helper.DefaultExecutor.executeTargets(DefaultExecutor.java:41)
at org.apache.tools.ant.Project.executeTargets(Project.java:1212)
at org.apache.tools.ant.Main.runBuild(Main.java:801)
at org.apache.tools.ant.Main.startAnt(Main.java:218)
at org.apache.tools.ant.launch.Launcher.run(Launcher.java:280)
at org.apache.tools.ant.launch.Launcher.main(Launcher.java:109)
Caused by: java.lang.ClassNotFoundException: 
org.apache.commons.logging.LogFactory
at 
org.apache.tools.ant.AntClassLoader.findClassInComponents(AntClassLoader.java:1386)
at 
org.apache.tools.ant.AntClassLoader.findClass(AntClassLoader.java:1336)
at 
org.apache.tools.ant.AntClassLoader.loadClass(AntClassLoader.java:1074)
at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
... 22 more
{code}




[jira] Updated: (PIG-1560) Build target 'checkstyle' fails

2010-08-23 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1560:
--

Description: 
Stack trace:

{code}
/trunk/build.xml:894: java.lang.NoClassDefFoundError: 
org/apache/commons/logging/LogFactory
at 
org.apache.commons.beanutils.ConvertUtilsBean.<init>(ConvertUtilsBean.java:130)
at 
com.puppycrawl.tools.checkstyle.api.AutomaticBean.createBeanUtilsBean(AutomaticBean.java:73)
at 
com.puppycrawl.tools.checkstyle.api.AutomaticBean.contextualize(AutomaticBean.java:222)
at 
com.puppycrawl.tools.checkstyle.CheckStyleTask.createChecker(CheckStyleTask.java:372)
at 
com.puppycrawl.tools.checkstyle.CheckStyleTask.realExecute(CheckStyleTask.java:304)
at 
com.puppycrawl.tools.checkstyle.CheckStyleTask.execute(CheckStyleTask.java:265)
at org.apache.tools.ant.UnknownElement.execute(UnknownElement.java:291)
at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at 
org.apache.tools.ant.dispatch.DispatchUtils.execute(DispatchUtils.java:106)
at org.apache.tools.ant.Task.perform(Task.java:348)
at org.apache.tools.ant.Target.execute(Target.java:390)
at org.apache.tools.ant.Target.performTasks(Target.java:411)
at org.apache.tools.ant.Project.executeSortedTargets(Project.java:1360)
at org.apache.tools.ant.Project.executeTarget(Project.java:1329)
at 
org.apache.tools.ant.helper.DefaultExecutor.executeTargets(DefaultExecutor.java:41)
at org.apache.tools.ant.Project.executeTargets(Project.java:1212)
at org.apache.tools.ant.Main.runBuild(Main.java:801)
at org.apache.tools.ant.Main.startAnt(Main.java:218)
at org.apache.tools.ant.launch.Launcher.run(Launcher.java:280)
at org.apache.tools.ant.launch.Launcher.main(Launcher.java:109)
Caused by: java.lang.ClassNotFoundException: 
org.apache.commons.logging.LogFactory
at 
org.apache.tools.ant.AntClassLoader.findClassInComponents(AntClassLoader.java:1386)
at 
org.apache.tools.ant.AntClassLoader.findClass(AntClassLoader.java:1336)
at 
org.apache.tools.ant.AntClassLoader.loadClass(AntClassLoader.java:1074)
at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
... 22 more
{code}

  was:

Stack trace:

{code}
/homes/rding/apache-pig/trunk/build.xml:894: java.lang.NoClassDefFoundError: 
org/apache/commons/logging/LogFactory
at 
org.apache.commons.beanutils.ConvertUtilsBean.<init>(ConvertUtilsBean.java:130)
at 
com.puppycrawl.tools.checkstyle.api.AutomaticBean.createBeanUtilsBean(AutomaticBean.java:73)
at 
com.puppycrawl.tools.checkstyle.api.AutomaticBean.contextualize(AutomaticBean.java:222)
at 
com.puppycrawl.tools.checkstyle.CheckStyleTask.createChecker(CheckStyleTask.java:372)
at 
com.puppycrawl.tools.checkstyle.CheckStyleTask.realExecute(CheckStyleTask.java:304)
at 
com.puppycrawl.tools.checkstyle.CheckStyleTask.execute(CheckStyleTask.java:265)
at org.apache.tools.ant.UnknownElement.execute(UnknownElement.java:291)
at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at 
org.apache.tools.ant.dispatch.DispatchUtils.execute(DispatchUtils.java:106)
at org.apache.tools.ant.Task.perform(Task.java:348)
at org.apache.tools.ant.Target.execute(Target.java:390)
at org.apache.tools.ant.Target.performTasks(Target.java:411)
at org.apache.tools.ant.Project.executeSortedTargets(Project.java:1360)
at org.apache.tools.ant.Project.executeTarget(Project.java:1329)
at 
org.apache.tools.ant.helper.DefaultExecutor.executeTargets(DefaultExecutor.java:41)
at org.apache.tools.ant.Project.executeTargets(Project.java:1212)
at org.apache.tools.ant.Main.runBuild(Main.java:801)
at org.apache.tools.ant.Main.startAnt(Main.java:218)
at org.apache.tools.ant.launch.Launcher.run(Launcher.java:280)
at org.apache.tools.ant.launch.Launcher.main(Launcher.java:109)
Caused by: java.lang.ClassNotFoundException: 
org.apache.commons.logging.LogFactory
at 
org.apache.tools.ant.AntClassLoader.findClassInComponents(AntClassLoader.java:1386)
at 
org.apache.tools.ant.AntClassLoader.findClass(AntClassLoader.java:1336)
at 
org.apache.tools.ant.AntClassLoader.loadClass(AntClassLoader.java:1074)
at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
... 22 more
{code}


> Build target 'checkstyle' fails
> ---
>
> Key: PIG-1560
> URL: https://issues.

[jira] Updated: (PIG-1557) couple of issue mapping aliases to jobs

2010-08-23 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1557:
--

Attachment: PIG-1557.patch

The alias for load statement is missing. Add load alias to the alias list.

> couple of issue mapping aliases to jobs
> ---
>
> Key: PIG-1557
> URL: https://issues.apache.org/jira/browse/PIG-1557
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Olga Natkovich
>Assignee: Richard Ding
> Fix For: 0.8.0
>
> Attachments: PIG-1557.patch
>
>
> I have a simple script:
> A = load '/user/pig/tests/data/singlefile/studenttab10k' as (name, age, gpa);
> B = group A by name;
> C = foreach B generate group, COUNT(A);
> D = order C by $1;
> E = limit D 10;
> dump E;
> I noticed a couple of issues with alias to job mapping: neither load(A) nor 
> limit(E) shows in the output




[jira] Updated: (PIG-1557) couple of issue mapping aliases to jobs

2010-08-23 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1557:
--

Fix Version/s: 0.8.0

> couple of issue mapping aliases to jobs
> ---
>
> Key: PIG-1557
> URL: https://issues.apache.org/jira/browse/PIG-1557
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Olga Natkovich
>Assignee: Richard Ding
> Fix For: 0.8.0
>
> Attachments: PIG-1557.patch
>
>
> I have a simple script:
> A = load '/user/pig/tests/data/singlefile/studenttab10k' as (name, age, gpa);
> B = group A by name;
> C = foreach B generate group, COUNT(A);
> D = order C by $1;
> E = limit D 10;
> dump E;
> I noticed a couple of issues with alias to job mapping: neither load(A) nor 
> limit(E) shows in the output




[jira] Commented: (PIG-1551) Improve dynamic invokers to deal with no-arg methods and array parameters

2010-08-24 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901992#action_12901992
 ] 

Richard Ding commented on PIG-1551:
---


The typo is still there:

{code}
private static final Class LONG_ARRAY_CLASS = new Long[0].getClass();
{code}

It seems what you want is 

{code}
private static final Class LONG_ARRAY_CLASS = new long[0].getClass();
{code}

so it's consistent with other array classes.

This does raise a question about array parameters: the first form applies to 
methods like _amethod(Long[] nums)_, while the second supports methods like 
_amethod(long[] nums)_. The two are not interchangeable. 
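
To make the distinction concrete, here is a minimal sketch of how reflection treats the two array classes. The class and method names (ArrayParamLookup, sumPrimitive, sumBoxed, declares) are made up for this illustration and are not taken from the patch; the point is only that a lookup keyed on Long[] cannot find a method declared with long[], and vice versa.

```java
import java.lang.reflect.Method;

class ArrayParamLookup {
    // Hypothetical target methods, named only for this illustration.
    static long sumPrimitive(long[] nums) {
        long total = 0;
        for (long n : nums) total += n;
        return total;
    }

    static long sumBoxed(Long[] nums) {
        long total = 0;
        for (Long n : nums) total += n;
        return total;
    }

    // Returns true if this class declares a method with exactly this parameter type.
    static boolean declares(String name, Class<?> paramType) {
        try {
            Method m = ArrayParamLookup.class.getDeclaredMethod(name, paramType);
            return m != null;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        Class<?> boxed = new Long[0].getClass();     // same class object as Long[].class
        Class<?> primitive = new long[0].getClass(); // same class object as long[].class

        System.out.println(boxed == primitive);                  // the two classes differ
        System.out.println(declares("sumPrimitive", primitive)); // exact match is found
        System.out.println(declares("sumPrimitive", boxed));     // no widening or unboxing of arrays
        System.out.println(declares("sumBoxed", boxed));         // exact match is found
    }
}
```

Reflection matches parameter types exactly, so an invoker that wants to support both forms would need a separate entry for each array class.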

> Improve dynamic invokers to deal with no-arg methods and array parameters
> -
>
> Key: PIG-1551
> URL: https://issues.apache.org/jira/browse/PIG-1551
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.8.0
>Reporter: Dmitriy V. Ryaboy
>Assignee: Dmitriy V. Ryaboy
> Fix For: 0.8.0
>
> Attachments: PIG-1551.patch, PIG_1551.2.patch
>
>
> PIG-1354 introduced a set of UDFs that can be used to dynamically wrap simple 
> Java methods in a UDF, so that users don't need to create trivial wrappers if 
> they are ok sacrificing some speed.
> This issue is to extend the set of methods that can be wrapped this way to 
> include methods that do not take any arguments, and methods that take arrays 
> of {int,long,float,double,string} as arguments. 
> Arrays are expected to be represented by bags in Pig. Notably, this allows 
> users to wrap statistical functions in o.a.commons.math.stat.StatUtils . 




[jira] Commented: (PIG-1343) pig_log file missing even though Main tells it is creating one and an M/R job fails

2010-08-24 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902030#action_12902030
 ] 

Richard Ding commented on PIG-1343:
---

The log file is created when running in batch mode, but not in interactive mode.

> pig_log file missing even though Main tells it is creating one and an M/R job 
> fails 
> 
>
> Key: PIG-1343
> URL: https://issues.apache.org/jira/browse/PIG-1343
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.6.0
>Reporter: Viraj Bhat
>Assignee: niraj rai
> Fix For: 0.8.0
>
> Attachments: 1343.patch, PIG-1343-1.patch
>
>
> There is a particular case where I was running with the latest trunk of Pig.
> {code}
> $java -cp pig.jar:/home/path/hadoop20cluster org.apache.pig.Main testcase.pig
> [main] INFO  org.apache.pig.Main - Logging error messages to: 
> /homes/viraj/pig_1263420012601.log
> $ls -l pig_1263420012601.log
> ls: pig_1263420012601.log: No such file or directory
> {code}
> The job failed and the log file did not contain anything, the only way to 
> debug was to look into the Jobtracker logs.
> Here are some reasons which would have caused this behavior:
> 1) The underlying filer/NFS had some issues. In that case do we not error on 
> stdout?
> 2) There are some errors from the backend which are not being captured
> Viraj




[jira] Commented: (PIG-1551) Improve dynamic invokers to deal with no-arg methods and array parameters

2010-08-24 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902042#action_12902042
 ] 

Richard Ding commented on PIG-1551:
---

+1.

I'm fine with arrays of primitive types. I can't think of a Java method that 
uses an array of object Long as a parameter.

> Improve dynamic invokers to deal with no-arg methods and array parameters
> -
>
> Key: PIG-1551
> URL: https://issues.apache.org/jira/browse/PIG-1551
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.8.0
>Reporter: Dmitriy V. Ryaboy
>Assignee: Dmitriy V. Ryaboy
> Fix For: 0.8.0
>
> Attachments: PIG-1551.patch, PIG_1551.2.patch, PIG_1551.3.patch
>
>
> PIG-1354 introduced a set of UDFs that can be used to dynamically wrap simple 
> Java methods in a UDF, so that users don't need to create trivial wrappers if 
> they are ok sacrificing some speed.
> This issue is to extend the set of methods that can be wrapped this way to 
> include methods that do not take any arguments, and methods that take arrays 
> of {int,long,float,double,string} as arguments. 
> Arrays are expected to be represented by bags in Pig. Notably, this allows 
> users to wrap statistical functions in o.a.commons.math.stat.StatUtils . 




[jira] Updated: (PIG-1483) [piggybank] Add HadoopJobHistoryLoader to the piggybank

2010-08-24 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1483:
--

Attachment: PIG-1483_1.patch

New patch adding unit test.

> [piggybank] Add HadoopJobHistoryLoader to the piggybank
> ---
>
> Key: PIG-1483
> URL: https://issues.apache.org/jira/browse/PIG-1483
> Project: Pig
>  Issue Type: New Feature
>Reporter: Richard Ding
>Assignee: Richard Ding
> Fix For: 0.8.0
>
> Attachments: PIG-1483.patch, PIG-1483_1.patch
>
>
> PIG-1333 added many script-related entries to the MR job xml file and thus 
> it's now possible to use Pig for querying Hadoop job history/xml files to get 
> script-level usage statistics. What we need is a Pig loader that can parse 
> these files and generate corresponding data objects.
> The goal of this jira is to create a HadoopJobHistoryLoader in piggybank.
> Here is an example that shows the intended usage:
> *Find all the jobs grouped by script and user:*
> {code}
> a = load '/mapred/history/_logs/history/' using HadoopJobHistoryLoader() as 
> (j:map[], m:map[], r:map[]);
> b = foreach a generate (Chararray) j#'PIG_SCRIPT_ID' as id, (Chararray) 
> j#'USER' as user, (Chararray) j#'JOBID' as job; 
> c = filter b by not (id is null);
> d = group c by (id, user);
> e = foreach d generate flatten(group), c.job;
> dump e;
> {code}
> A couple more examples:
> *Find scripts that use only the default parallelism:*
> {code}
> a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], 
> m:map[], r:map[]);
> b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' 
> as script_name, (Long) r#'NUMBER_REDUCES' as reduces;
> c = group b by (id, user, script_name) parallel 10;
> d = foreach c generate group.user, group.script_name, MAX(b.reduces) as 
> max_reduces;
> e = filter d by max_reduces == 1;
> dump e;
> {code}
> *Find the running time of each script (in seconds):*
> {code}
> a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], 
> m:map[], r:map[]);
> b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' 
> as script_name, (Long) j#'SUBMIT_TIME' as start, (Long) j#'FINISH_TIME' as 
> end;
> c = group b by (id, user, script_name);
> d = foreach c generate group.user, group.script_name, (MAX(b.end) - 
> MIN(b.start))/1000;
> dump d;
> {code}
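
The running-time calculation in the last quoted script is meant to be (latest finish - earliest submit) / 1000. The sketch below works through that arithmetic in Java, assuming SUBMIT_TIME and FINISH_TIME are millisecond timestamps (which the division by 1000 suggests). The class name, method name, and sample values are hypothetical.

```java
class ScriptRunningTime {
    // Running time in seconds for a script made up of several jobs,
    // given each job's submit and finish time in milliseconds (assumed unit).
    static long runningTimeSeconds(long[] submitMillis, long[] finishMillis) {
        long earliestSubmit = Long.MAX_VALUE;
        for (long s : submitMillis) {
            earliestSubmit = Math.min(earliestSubmit, s);
        }
        long latestFinish = Long.MIN_VALUE;
        for (long f : finishMillis) {
            latestFinish = Math.max(latestFinish, f);
        }
        // Subtract first, then convert to seconds; the parentheses matter.
        return (latestFinish - earliestSubmit) / 1000;
    }

    public static void main(String[] args) {
        // Two made-up jobs belonging to the same script.
        long[] submit = {1_000_000L, 1_030_000L};
        long[] finish = {1_020_000L, 1_090_000L};
        System.out.println(runningTimeSeconds(submit, finish)); // prints 90
    }
}
```

Without the parentheses around the subtraction, only the earliest submit time would be divided by 1000, and the result would no longer be a duration.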




[jira] Updated: (PIG-1483) [piggybank] Add HadoopJobHistoryLoader to the piggybank

2010-08-24 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1483:
--

Status: Patch Available  (was: Open)

> [piggybank] Add HadoopJobHistoryLoader to the piggybank
> ---
>
> Key: PIG-1483
> URL: https://issues.apache.org/jira/browse/PIG-1483
> Project: Pig
>  Issue Type: New Feature
>Reporter: Richard Ding
>Assignee: Richard Ding
> Fix For: 0.8.0
>
> Attachments: PIG-1483.patch, PIG-1483_1.patch
>
>
> PIG-1333 added many script-related entries to the MR job xml file and thus 
> it's now possible to use Pig for querying Hadoop job history/xml files to get 
> script-level usage statistics. What we need is a Pig loader that can parse 
> these files and generate corresponding data objects.
> The goal of this jira is to create a HadoopJobHistoryLoader in piggybank.
> Here is an example that shows the intended usage:
> *Find all the jobs grouped by script and user:*
> {code}
> a = load '/mapred/history/_logs/history/' using HadoopJobHistoryLoader() as 
> (j:map[], m:map[], r:map[]);
> b = foreach a generate (Chararray) j#'PIG_SCRIPT_ID' as id, (Chararray) 
> j#'USER' as user, (Chararray) j#'JOBID' as job; 
> c = filter b by not (id is null);
> d = group c by (id, user);
> e = foreach d generate flatten(group), c.job;
> dump e;
> {code}
> A couple more examples:
> *Find scripts that use only the default parallelism:*
> {code}
> a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], 
> m:map[], r:map[]);
> b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' 
> as script_name, (Long) r#'NUMBER_REDUCES' as reduces;
> c = group b by (id, user, script_name) parallel 10;
> d = foreach c generate group.user, group.script_name, MAX(b.reduces) as 
> max_reduces;
> e = filter d by max_reduces == 1;
> dump e;
> {code}
> *Find the running time of each script (in seconds):*
> {code}
> a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], 
> m:map[], r:map[]);
> b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' 
> as script_name, (Long) j#'SUBMIT_TIME' as start, (Long) j#'FINISH_TIME' as 
> end;
> c = group b by (id, user, script_name);
> d = foreach c generate group.user, group.script_name, (MAX(b.end) - 
> MIN(b.start))/1000;
> dump d;
> {code}




[jira] Updated: (PIG-1557) couple of issue mapping aliases to jobs

2010-08-24 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1557:
--

Attachment: PIG-1557_1.patch

New patch adds a unit test.

> couple of issue mapping aliases to jobs
> ---
>
> Key: PIG-1557
> URL: https://issues.apache.org/jira/browse/PIG-1557
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Olga Natkovich
>Assignee: Richard Ding
> Fix For: 0.8.0
>
> Attachments: PIG-1557.patch, PIG-1557_1.patch
>
>
> I have a simple script:
> A = load '/user/pig/tests/data/singlefile/studenttab10k' as (name, age, gpa);
> B = group A by name;
> C = foreach B generate group, COUNT(A);
> D = order C by $1;
> E = limit D 10;
> dump E;
> I noticed a couple of issues with alias to job mapping: neither load(A) nor 
> limit(E) shows in the output




[jira] Updated: (PIG-1557) couple of issue mapping aliases to jobs

2010-08-24 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1557:
--

  Status: Patch Available  (was: Open)
Hadoop Flags: [Reviewed]

> couple of issue mapping aliases to jobs
> ---
>
> Key: PIG-1557
> URL: https://issues.apache.org/jira/browse/PIG-1557
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Olga Natkovich
>Assignee: Richard Ding
> Fix For: 0.8.0
>
> Attachments: PIG-1557.patch, PIG-1557_1.patch
>
>
> I have a simple script:
> A = load '/user/pig/tests/data/singlefile/studenttab10k' as (name, age, gpa);
> B = group A by name;
> C = foreach B generate group, COUNT(A);
> D = order C by $1;
> E = limit D 10;
> dump E;
> I noticed a couple of issues with alias to job mapping: neither load(A) nor 
> limit(E) shows in the output




[jira] Updated: (PIG-1557) couple of issue mapping aliases to jobs

2010-08-24 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1557:
--

Status: Resolved  (was: Patch Available)
Resolution: Fixed

> couple of issue mapping aliases to jobs
> ---
>
> Key: PIG-1557
> URL: https://issues.apache.org/jira/browse/PIG-1557
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Olga Natkovich
>Assignee: Richard Ding
> Fix For: 0.8.0
>
> Attachments: PIG-1557.patch, PIG-1557_1.patch
>
>
> I have a simple script:
> A = load '/user/pig/tests/data/singlefile/studenttab10k' as (name, age, gpa);
> B = group A by name;
> C = foreach B generate group, COUNT(A);
> D = order C by $1;
> E = limit D 10;
> dump E;
> I noticed a couple of issues with alias to job mapping: neither load(A) nor 
> limit(E) shows in the output




[jira] Commented: (PIG-1564) add support for multiple filesystems

2010-08-26 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902952#action_12902952
 ] 

Richard Ding commented on PIG-1564:
---

Hi Andrew,

HDataStorage is a thin layer on top of Hadoop FileSystem. Since moving its 
local mode to Hadoop local mode, Pig no longer needs this layer.  We intend to 
remove it in the future.

As for Pig reading data from one file system and writing it to another, this 
has been supported since Pig 0.7.

-Richard 

> add support for multiple filesystems
> 
>
> Key: PIG-1564
> URL: https://issues.apache.org/jira/browse/PIG-1564
> Project: Pig
>  Issue Type: Improvement
>Reporter: Andrew Hitchcock
> Attachments: PIG-1564-1.patch
>
>
> Currently you can't run Pig scripts that read data from one file system and 
> write it to another. Also, Grunt doesn't support CDing from one directory to 
> another on different file systems.




[jira] Resolved: (PIG-1518) multi file input format for loaders

2010-08-26 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding resolved PIG-1518.
---

Hadoop Flags: [Reviewed]
  Resolution: Fixed

Patch is committed to trunk. Thanks Yan.

> multi file input format for loaders
> ---
>
> Key: PIG-1518
> URL: https://issues.apache.org/jira/browse/PIG-1518
> Project: Pig
>  Issue Type: Improvement
>Reporter: Olga Natkovich
>Assignee: Yan Zhou
> Fix For: 0.8.0
>
> Attachments: PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, 
> PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch
>
>
> We frequently run into the situation where Pig needs to deal with small files 
> in the input. In this case a separate map is created for each file, which 
> could be very inefficient. 
> It would be great to have an umbrella input format that can take multiple 
> files and use them in a single split. We would like to see this working with 
> different data formats if possible.
> There are already a couple of input formats doing a similar thing: 
> MultifileInputFormat as well as CombinedInputFormat; however, neither works 
> with the new Hadoop 20 API. 
> We at least want to do a feasibility study for Pig 0.8.0.




[jira] Assigned: (PIG-1569) java properties not honored in case of properties such as stop.on.failure

2010-08-26 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding reassigned PIG-1569:
-

Assignee: Richard Ding

> java properties not honored in case of properties such as stop.on.failure
> -
>
> Key: PIG-1569
> URL: https://issues.apache.org/jira/browse/PIG-1569
> Project: Pig
>  Issue Type: Bug
>Reporter: Thejas M Nair
>Assignee: Richard Ding
> Fix For: 0.8.0
>
>
> In org.apache.pig.Main , properties are being set to default value without 
> checking if the java system properties have been set to something else.
> stop.on.failure, opt.multiquery, aggregate.warning are some properties that 
> have this problem.
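
One way to avoid the problem described above is to consult the Java system properties before falling back to a default. The following is only a sketch of that idea, not the code in org.apache.pig.Main or in any patch; the class name, the helper applyDefault, and the default values shown are assumptions made for illustration.

```java
import java.util.Properties;

class PropertyDefaults {
    // Sets key in props, preferring (1) a Java system property,
    // (2) a value already present in props, (3) the supplied default.
    static void applyDefault(Properties props, String key, String defaultValue) {
        String fromSystem = System.getProperty(key);
        if (fromSystem != null) {
            props.setProperty(key, fromSystem);
        } else if (props.getProperty(key) == null) {
            props.setProperty(key, defaultValue);
        }
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        // Illustrative defaults only; run with e.g. -Dstop.on.failure=true to override.
        applyDefault(props, "stop.on.failure", "false");
        applyDefault(props, "opt.multiquery", "true");
        applyDefault(props, "aggregate.warning", "true");
        System.out.println(props);
    }
}
```

With this ordering, a value passed on the command line with -D is kept rather than being overwritten by the default.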




[jira] Commented: (PIG-1343) pig_log file missing even though Main tells it is creating one and an M/R job fails

2010-08-26 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903072#action_12903072
 ] 

Richard Ding commented on PIG-1343:
---


The new patch logs NPE instead of the intended message:

{code}
[main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2999: Unexpected internal 
error. null
{code}
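
For reference, a trailing "null" like the one above is what Java produces when an exception without a message is concatenated into a string. The sketch below shows that behavior and one possible fallback; it is a guess at the mechanism, not a description of what the patch does, and the class and method names are hypothetical.

```java
class ErrorMessageDemo {
    // Naive formatting: an exception created without a message yields "... null".
    static String naive(Throwable t) {
        return "Unexpected internal error. " + t.getMessage();
    }

    // Fallback formatting: use the exception class name when there is no message.
    static String describe(Throwable t) {
        String msg = t.getMessage();
        return "Unexpected internal error. " + (msg != null ? msg : t.getClass().getName());
    }

    public static void main(String[] args) {
        Throwable npe = new NullPointerException(); // constructed without a message
        System.out.println(naive(npe));
        System.out.println(describe(npe));
    }
}
```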

> pig_log file missing even though Main tells it is creating one and an M/R job 
> fails 
> 
>
> Key: PIG-1343
> URL: https://issues.apache.org/jira/browse/PIG-1343
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.6.0
>Reporter: Viraj Bhat
>Assignee: niraj rai
> Fix For: 0.8.0
>
> Attachments: 1343.patch, PIG-1343-1.patch, pig_1343_2.patch
>
>
> There is a particular case where I was running with the latest trunk of Pig.
> {code}
> $java -cp pig.jar:/home/path/hadoop20cluster org.apache.pig.Main testcase.pig
> [main] INFO  org.apache.pig.Main - Logging error messages to: 
> /homes/viraj/pig_1263420012601.log
> $ls -l pig_1263420012601.log
> ls: pig_1263420012601.log: No such file or directory
> {code}
> The job failed and the log file did not contain anything, the only way to 
> debug was to look into the Jobtracker logs.
> Here are some reasons which would have caused this behavior:
> 1) The underlying filer/NFS had some issues. In that case do we not error on 
> stdout?
> 2) There are some errors from the backend which are not being captured
> Viraj




[jira] Updated: (PIG-1458) aggregate files for replicated join

2010-08-26 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1458:
--

Attachment: PIG-1458.patch

This patch uses the new multi-file-combiner (PIG-1518) to concatenate many 
small files for replicated join. This is based on the assumption that the total 
size of the replicated files should be small enough to fit into main memory. 

> aggregate files for replicated join
> ---
>
> Key: PIG-1458
> URL: https://issues.apache.org/jira/browse/PIG-1458
> Project: Pig
>  Issue Type: Improvement
>Reporter: Olga Natkovich
>Assignee: Richard Ding
> Fix For: 0.8.0
>
> Attachments: PIG-1458.patch
>
>
> We have noticed that if the smaller data in replicated join has many files, 
> this puts an unneeded burden on the name node. Pre-aggregating the files can 
> improve the situation.




[jira] Updated: (PIG-1483) [piggybank] Add HadoopJobHistoryLoader to the piggybank

2010-08-27 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1483:
--

  Status: Resolved  (was: Patch Available)
Hadoop Flags: [Reviewed]
  Resolution: Fixed

> [piggybank] Add HadoopJobHistoryLoader to the piggybank
> ---
>
> Key: PIG-1483
> URL: https://issues.apache.org/jira/browse/PIG-1483
> Project: Pig
>  Issue Type: New Feature
>Reporter: Richard Ding
>Assignee: Richard Ding
> Fix For: 0.8.0
>
> Attachments: PIG-1483.patch, PIG-1483_1.patch
>
>
> PIG-1333 added many script-related entries to the MR job xml file and thus 
> it's now possible to use Pig for querying Hadoop job history/xml files to get 
> script-level usage statistics. What we need is a Pig loader that can parse 
> these files and generate corresponding data objects.
> The goal of this jira is to create a HadoopJobHistoryLoader in piggybank.
> Here is an example that shows the intended usage:
> *Find all the jobs grouped by script and user:*
> {code}
> a = load '/mapred/history/_logs/history/' using HadoopJobHistoryLoader() as 
> (j:map[], m:map[], r:map[]);
> b = foreach a generate (Chararray) j#'PIG_SCRIPT_ID' as id, (Chararray) 
> j#'USER' as user, (Chararray) j#'JOBID' as job; 
> c = filter b by not (id is null);
> d = group c by (id, user);
> e = foreach d generate flatten(group), c.job;
> dump e;
> {code}
> A couple more examples:
> *Find scripts that use only the default parallelism:*
> {code}
> a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], 
> m:map[], r:map[]);
> b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' 
> as script_name, (Long) r#'NUMBER_REDUCES' as reduces;
> c = group b by (id, user, script_name) parallel 10;
> d = foreach c generate group.user, group.script_name, MAX(b.reduces) as 
> max_reduces;
> e = filter d by max_reduces == 1;
> dump e;
> {code}
> *Find the running time of each script (in seconds):*
> {code}
> a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], 
> m:map[], r:map[]);
> b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' 
> as script_name, (Long) j#'SUBMIT_TIME' as start, (Long) j#'FINISH_TIME' as 
> end;
> c = group b by (id, user, script_name);
> d = foreach c generate group.user, group.script_name, (MAX(b.end) - 
> MIN(b.start))/1000;
> dump d;
> {code}




[jira] Commented: (PIG-1343) pig_log file missing even though Main tells it is creating one and an M/R job fails

2010-08-27 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903523#action_12903523
 ] 

Richard Ding commented on PIG-1343:
---


I ran the above script in local mode; both batch mode and interactive mode now 
generate the expected result:

{code}
ERROR 2244: Job failed, hadoop does not return any error message
{code}

> pig_log file missing even though Main tells it is creating one and an M/R job 
> fails 
> 
>
> Key: PIG-1343
> URL: https://issues.apache.org/jira/browse/PIG-1343
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.6.0
>Reporter: Viraj Bhat
>Assignee: niraj rai
> Fix For: 0.8.0
>
> Attachments: 1343.patch, PIG-1343-1.patch, pig_1343_2.patch, 
> pig_1343_4.patch, PIG_1343_5.patch
>
>
> There is a particular case where I was running with the latest trunk of Pig.
> {code}
> $java -cp pig.jar:/home/path/hadoop20cluster org.apache.pig.Main testcase.pig
> [main] INFO  org.apache.pig.Main - Logging error messages to: 
> /homes/viraj/pig_1263420012601.log
> $ls -l pig_1263420012601.log
> ls: pig_1263420012601.log: No such file or directory
> {code}
> The job failed and the log file did not contain anything, the only way to 
> debug was to look into the Jobtracker logs.
> Here are some reasons which would have caused this behavior:
> 1) The underlying filer/NFS had some issues. In that case do we not error on 
> stdout?
> 2) There are some errors from the backend which are not being captured
> Viraj




[jira] Commented: (PIG-1343) pig_log file missing even though Main tells it is creating one and an M/R job fails

2010-08-30 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904267#action_12904267
 ] 

Richard Ding commented on PIG-1343:
---

Patch is committed to the trunk. Thanks Niraj.

> pig_log file missing even though Main tells it is creating one and an M/R job 
> fails 
> 
>
> Key: PIG-1343
> URL: https://issues.apache.org/jira/browse/PIG-1343
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.6.0
>Reporter: Viraj Bhat
>Assignee: niraj rai
> Fix For: 0.8.0
>
> Attachments: 1343.patch, PIG-1343-1.patch, PIG-1343_6.patch, 
> pig_1343_2.patch, pig_1343_4.patch, PIG_1343_5.patch
>
>
> There is a particular case where I was running with the latest trunk of Pig.
> {code}
> $java -cp pig.jar:/home/path/hadoop20cluster org.apache.pig.Main testcase.pig
> [main] INFO  org.apache.pig.Main - Logging error messages to: 
> /homes/viraj/pig_1263420012601.log
> $ls -l pig_1263420012601.log
> ls: pig_1263420012601.log: No such file or directory
> {code}
> The job failed and the log file did not contain anything, the only way to 
> debug was to look into the Jobtracker logs.
> Here are some reasons which would have caused this behavior:
> 1) The underlying filer/NFS had some issues. In that case do we not error on 
> stdout?
> 2) There are some errors from the backend which are not being captured
> Viraj




[jira] Updated: (PIG-1343) pig_log file missing even though Main tells it is creating one and an M/R job fails

2010-08-30 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1343:
--

Attachment: PIG-1343_6.patch

> pig_log file missing even though Main tells it is creating one and an M/R job 
> fails 
> 
>
> Key: PIG-1343
> URL: https://issues.apache.org/jira/browse/PIG-1343
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.6.0
>Reporter: Viraj Bhat
>Assignee: niraj rai
> Fix For: 0.8.0
>
> Attachments: 1343.patch, PIG-1343-1.patch, PIG-1343_6.patch, 
> pig_1343_2.patch, pig_1343_4.patch, PIG_1343_5.patch
>
>
> There is a particular case where I was running with the latest trunk of Pig.
> {code}
> $java -cp pig.jar:/home/path/hadoop20cluster org.apache.pig.Main testcase.pig
> [main] INFO  org.apache.pig.Main - Logging error messages to: 
> /homes/viraj/pig_1263420012601.log
> $ls -l pig_1263420012601.log
> ls: pig_1263420012601.log: No such file or directory
> {code}
> The job failed and the log file did not contain anything, the only way to 
> debug was to look into the Jobtracker logs.
> Here are some reasons which would have caused this behavior:
> 1) The underlying filer/NFS had some issues. In that case do we not error on 
> stdout?
> 2) There are some errors from the backend which are not being captured
> Viraj




[jira] Updated: (PIG-1343) pig_log file missing even though Main tells it is creating one and an M/R job fails

2010-08-30 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1343:
--

  Status: Resolved  (was: Patch Available)
Hadoop Flags: [Reviewed]
  Resolution: Fixed

> pig_log file missing even though Main tells it is creating one and an M/R job 
> fails 
> 
>
> Key: PIG-1343
> URL: https://issues.apache.org/jira/browse/PIG-1343
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.6.0
>Reporter: Viraj Bhat
>Assignee: niraj rai
> Fix For: 0.8.0
>
> Attachments: 1343.patch, PIG-1343-1.patch, PIG-1343_6.patch, 
> pig_1343_2.patch, pig_1343_4.patch, PIG_1343_5.patch
>
>
> There is a particular case where I was running with the latest trunk of Pig.
> {code}
> $java -cp pig.jar:/home/path/hadoop20cluster org.apache.pig.Main testcase.pig
> [main] INFO  org.apache.pig.Main - Logging error messages to: 
> /homes/viraj/pig_1263420012601.log
> $ls -l pig_1263420012601.log
> ls: pig_1263420012601.log: No such file or directory
> {code}
> The job failed and the log file did not contain anything, the only way to 
> debug was to look into the Jobtracker logs.
> Here are some reasons which would have caused this behavior:
> 1) The underlying filer/NFS had some issues. In that case do we not error on 
> stdout?
> 2) There are some errors from the backend which are not being captured
> Viraj




[jira] Assigned: (PIG-1578) PigServer.executeBatch does not return status of failed job

2010-08-30 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding reassigned PIG-1578:
-

Assignee: Richard Ding

> PigServer.executeBatch does not return status of failed job
> ---
>
> Key: PIG-1578
> URL: https://issues.apache.org/jira/browse/PIG-1578
> Project: Pig
>  Issue Type: Bug
>Reporter: Thejas M Nair
>Assignee: Richard Ding
> Fix For: 0.8.0
>
>
> For a failed job, PigServer.executeBatch does not return an ExecJob. 
> ExecJobs are created using output statistics, and the output statistics for 
> jobs that failed do not seem to exist.
> The query I tried was a native mapreduce job, where the output file of the 
> native mr job already exists, causing that job to fail.
> {code}
> A = load '" + INPUT_FILE + "';
> B = mapreduce '" + jarFileName + "' " +
> "Store A into 'table_testNativeMRJobSimple_input' "+
> "Load 'table_testNativeMRJobSimple_output' "+
> "`WordCount table_testNativeMRJobSimple_input " + INPUT_FILE + 
> "`;");
> Store B into 'table_testNativeMRJobSimpleDir';);
> {code}




[jira] Commented: (PIG-1570) native mapreduce operator MR job does not follow same failure handling logic as other pig MR jobs

2010-08-30 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904321#action_12904321
 ] 

Richard Ding commented on PIG-1570:
---

+1.

> native mapreduce operator MR job does not follow same failure handling logic 
> as other pig MR jobs
> -
>
> Key: PIG-1570
> URL: https://issues.apache.org/jira/browse/PIG-1570
> Project: Pig
>  Issue Type: Bug
>Reporter: Thejas M Nair
>Assignee: Thejas M Nair
> Fix For: 0.8.0
>
> Attachments: PIG-1570.1.patch
>
>
> The code path for handling failure in MR job corresponding to native MR is 
> different and does not have the same behavior.
> For example, even if the MR job for mapreduce operator fails, the number of 
> jobs that failed is being reported as 0 in PigStats log.




[jira] Updated: (PIG-1458) aggregate files for replicated join

2010-08-30 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1458:
--

Attachment: PIG-1458_1.patch

New patch addressing review comments.

> aggregate files for replicated join
> ---
>
> Key: PIG-1458
> URL: https://issues.apache.org/jira/browse/PIG-1458
> Project: Pig
>  Issue Type: Improvement
>Reporter: Olga Natkovich
>Assignee: Richard Ding
> Fix For: 0.8.0
>
> Attachments: PIG-1458.patch, PIG-1458_1.patch
>
>
> We have noticed that if the smaller data in replicated join has many files, 
> this puts an unneeded burden on the name node. Pre-aggregating the files can 
> improve the situation.




[jira] Updated: (PIG-1569) java properties not honored in case of properties such as stop.on.failure

2010-08-30 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1569:
--

Status: Patch Available  (was: Open)

> java properties not honored in case of properties such as stop.on.failure
> -
>
> Key: PIG-1569
> URL: https://issues.apache.org/jira/browse/PIG-1569
> Project: Pig
>  Issue Type: Bug
>Reporter: Thejas M Nair
>Assignee: Richard Ding
> Fix For: 0.8.0
>
> Attachments: PIG-1569.patch
>
>
> In org.apache.pig.Main, properties are being set to their default values without 
> checking whether the Java system properties have been set to something else.
> stop.on.failure, opt.multiquery, and aggregate.warning are some of the 
> properties that have this problem.
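
The precedence the report asks for can be sketched as follows. This is a minimal illustration in Python, not the code in org.apache.pig.Main: plain dicts stand in for the Java system properties and for Pig's property set, and the function name and the default values are assumptions made for the example.

```python
# Hypothetical defaults; the real default values in Pig may differ.
DEFAULTS = {
    "stop.on.failure": "false",
    "opt.multiquery": "true",
    "aggregate.warning": "true",
}


def resolve_properties(system_props, defaults=DEFAULTS):
    """Return effective properties, preferring user-supplied values.

    system_props stands in for the Java system properties
    (for example, values passed with -Dkey=value).
    """
    effective = {}
    for key, default in defaults.items():
        # Apply the default only when the user did not set the property.
        effective[key] = system_props.get(key, default)
    return effective


if __name__ == "__main__":
    print(resolve_properties({"stop.on.failure": "true"}))
```

The point of the sketch is the lookup order: a value the user supplied always wins, and the default is only a fallback.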




[jira] Updated: (PIG-1569) java properties not honored in case of properties such as stop.on.failure

2010-08-30 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1569:
--

Attachment: PIG-1569.patch

> java properties not honored in case of properties such as stop.on.failure
> -
>
> Key: PIG-1569
> URL: https://issues.apache.org/jira/browse/PIG-1569
> Project: Pig
>  Issue Type: Bug
>Reporter: Thejas M Nair
>Assignee: Richard Ding
> Fix For: 0.8.0
>
> Attachments: PIG-1569.patch
>
>
> In org.apache.pig.Main, properties are being set to their default values without 
> checking whether the Java system properties have been set to something else.
> stop.on.failure, opt.multiquery, and aggregate.warning are some of the 
> properties that have this problem.




[jira] Commented: (PIG-1458) aggregate files for replicated join

2010-08-30 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904385#action_12904385
 ] 

Richard Ding commented on PIG-1458:
---

Koji,

Please open a jira on increasing the replication factor of the replicated 
files. Currently they use the default replication factor. 

Thanks,
-Richard 

> aggregate files for replicated join
> ---
>
> Key: PIG-1458
> URL: https://issues.apache.org/jira/browse/PIG-1458
> Project: Pig
>  Issue Type: Improvement
>Reporter: Olga Natkovich
>Assignee: Richard Ding
> Fix For: 0.8.0
>
> Attachments: PIG-1458.patch, PIG-1458_1.patch
>
>
> We have noticed that if the smaller input of a replicated join has many files, 
> this puts an unneeded burden on the name node. Pre-aggregating the files can 
> improve the situation.




[jira] Updated: (PIG-1569) java properties not honored in case of properties such as stop.on.failure

2010-08-30 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1569:
--

  Status: Resolved  (was: Patch Available)
Hadoop Flags: [Reviewed]
  Resolution: Fixed

> java properties not honored in case of properties such as stop.on.failure
> -
>
> Key: PIG-1569
> URL: https://issues.apache.org/jira/browse/PIG-1569
> Project: Pig
>  Issue Type: Bug
>Reporter: Thejas M Nair
>Assignee: Richard Ding
> Fix For: 0.8.0
>
> Attachments: PIG-1569.patch
>
>
> In org.apache.pig.Main, properties are being set to their default values without 
> checking whether the Java system properties have been set to something else.
> stop.on.failure, opt.multiquery, and aggregate.warning are some of the 
> properties that have this problem.




[jira] Resolved: (PIG-1458) aggregate files for replicated join

2010-08-30 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding resolved PIG-1458.
---

Hadoop Flags: [Reviewed]
  Resolution: Fixed

> aggregate files for replicated join
> ---
>
> Key: PIG-1458
> URL: https://issues.apache.org/jira/browse/PIG-1458
> Project: Pig
>  Issue Type: Improvement
>Reporter: Olga Natkovich
>Assignee: Richard Ding
> Fix For: 0.8.0
>
> Attachments: PIG-1458.patch, PIG-1458_1.patch
>
>
> We have noticed that if the smaller input of a replicated join has many files, 
> this puts an unneeded burden on the name node. Pre-aggregating the files can 
> improve the situation.




[jira] Commented: (PIG-1458) aggregate files for replicated join

2010-08-30 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904451#action_12904451
 ] 

Richard Ding commented on PIG-1458:
---

Patch committed to trunk.

> aggregate files for replicated join
> ---
>
> Key: PIG-1458
> URL: https://issues.apache.org/jira/browse/PIG-1458
> Project: Pig
>  Issue Type: Improvement
>Reporter: Olga Natkovich
>Assignee: Richard Ding
> Fix For: 0.8.0
>
> Attachments: PIG-1458.patch, PIG-1458_1.patch
>
>
> We have noticed that if the smaller input of a replicated join has many files, 
> this puts an unneeded burden on the name node. Pre-aggregating the files can 
> improve the situation.




[jira] Commented: (PIG-1569) java properties not honored in case of properties such as stop.on.failure

2010-08-30 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904452#action_12904452
 ] 

Richard Ding commented on PIG-1569:
---

Patch committed to trunk.

> java properties not honored in case of properties such as stop.on.failure
> -
>
> Key: PIG-1569
> URL: https://issues.apache.org/jira/browse/PIG-1569
> Project: Pig
>  Issue Type: Bug
>Reporter: Thejas M Nair
>Assignee: Richard Ding
> Fix For: 0.8.0
>
> Attachments: PIG-1569.patch
>
>
> In org.apache.pig.Main, properties are being set to their default values without 
> checking whether the Java system properties have been set to something else.
> stop.on.failure, opt.multiquery, and aggregate.warning are some of the 
> properties that have this problem.




[jira] Commented: (PIG-1483) [piggybank] Add HadoopJobHistoryLoader to the piggybank

2010-08-30 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904453#action_12904453
 ] 

Richard Ding commented on PIG-1483:
---

Patch committed to trunk.

> [piggybank] Add HadoopJobHistoryLoader to the piggybank
> ---
>
> Key: PIG-1483
> URL: https://issues.apache.org/jira/browse/PIG-1483
> Project: Pig
>  Issue Type: New Feature
>Reporter: Richard Ding
>Assignee: Richard Ding
> Fix For: 0.8.0
>
> Attachments: PIG-1483.patch, PIG-1483_1.patch
>
>
> PIG-1333 added many script-related entries to the MR job xml file and thus 
> it's now possible to use Pig for querying Hadoop job history/xml files to get 
> script-level usage statistics. What we need is a Pig loader that can parse 
> these files and generate corresponding data objects.
> The goal of this jira is to create a HadoopJobHistoryLoader in piggybank.
> Here is an example that shows the intended usage:
> *Find all the jobs grouped by script and user:*
> {code}
> a = load '/mapred/history/_logs/history/' using HadoopJobHistoryLoader() as 
> (j:map[], m:map[], r:map[]);
> b = foreach a generate (Chararray) j#'PIG_SCRIPT_ID' as id, (Chararray) 
> j#'USER' as user, (Chararray) j#'JOBID' as job; 
> c = filter b by not (id is null);
> d = group c by (id, user);
> e = foreach d generate flatten(group), c.job;
> dump e;
> {code}
> A couple more examples:
> *Find scripts that use only the default parallelism:*
> {code}
> a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], 
> m:map[], r:map[]);
> b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' 
> as script_name, (Long) r#'NUMBER_REDUCES' as reduces;
> c = group b by (id, user, script_name) parallel 10;
> d = foreach c generate group.user, group.script_name, MAX(b.reduces) as 
> max_reduces;
> e = filter d by max_reduces == 1;
> dump e;
> {code}
> *Find the running time of each script (in seconds):*
> {code}
> a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], 
> m:map[], r:map[]);
> b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' 
> as script_name, (Long) j#'SUBMIT_TIME' as start, (Long) j#'FINISH_TIME' as 
> end;
> c = group b by (id, user, script_name);
> d = foreach c generate group.user, group.script_name, (MAX(b.end) - 
> MIN(b.start))/1000;
> dump d;
> {code}
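
The arithmetic in the last example can be checked with a small sketch. It is written in Python rather than Pig Latin, uses made-up job records, and assumes SUBMIT_TIME and FINISH_TIME are epoch milliseconds; it shows that the subtraction has to be completed before dividing by 1000.

```python
from collections import defaultdict

# Made-up job records: (script_id, submit_time_ms, finish_time_ms).
jobs = [
    ("script-1", 1_000_000, 1_030_000),
    ("script-1", 1_030_000, 1_090_000),
    ("script-2", 2_000_000, 2_005_000),
]


def running_time_seconds(records):
    """Group jobs by script and return (max(end) - min(start)) / 1000."""
    grouped = defaultdict(list)
    for script_id, start, end in records:
        grouped[script_id].append((start, end))
    result = {}
    for script_id, spans in grouped.items():
        first_start = min(start for start, _ in spans)
        last_end = max(end for _, end in spans)
        # Subtract first, then convert milliseconds to seconds.
        result[script_id] = (last_end - first_start) / 1000
    return result


if __name__ == "__main__":
    print(running_time_seconds(jobs))
```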




[jira] Commented: (PIG-1557) couple of issue mapping aliases to jobs

2010-08-30 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904456#action_12904456
 ] 

Richard Ding commented on PIG-1557:
---

Patch committed to trunk.

> couple of issue mapping aliases to jobs
> ---
>
> Key: PIG-1557
> URL: https://issues.apache.org/jira/browse/PIG-1557
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Olga Natkovich
>Assignee: Richard Ding
> Fix For: 0.8.0
>
> Attachments: PIG-1557.patch, PIG-1557_1.patch
>
>
> I have a simple script:
> A = load '/user/pig/tests/data/singlefile/studenttab10k' as (name, age, gpa);
> B = group A by name;
> C = foreach B generate group, COUNT(A);
> D = order C by $1;
> E = limit D 10;
> dump E;
> I noticed a couple of issues with alias to job mapping: neither load(A) nor 
> limit(E) shows in the output




[jira] Commented: (PIG-1334) Make pig artifacts available through maven

2010-09-02 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905744#action_12905744
 ] 

Richard Ding commented on PIG-1334:
---

Scott,

Please create a new Jira for this. Another follow-up jira (PIG-1562) has 
already been opened. 

-Richard

> Make pig artifacts available through maven
> --
>
> Key: PIG-1334
> URL: https://issues.apache.org/jira/browse/PIG-1334
> Project: Pig
>  Issue Type: Improvement
>Reporter: Olga Natkovich
>Assignee: niraj rai
> Fix For: 0.8.0
>
> Attachments: mvn-pig.patch, mvn_pig_2.patch, mvn_pig_3.patch, 
> mvn_pig_4.patch, mvn_pig_5.patch, mvn_pig_6.patch
>
>





[jira] Updated: (PIG-1548) Optimize scalar to consolidate the part file

2010-09-02 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1548:
--

Attachment: PIG-1458.patch


Results of test-patch:

{code}
 [exec] +1 overall.  
 [exec] 
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec] 
 [exec] +1 tests included.  The patch appears to include 3 new or 
modified tests.
 [exec] 
 [exec] +1 javadoc.  The javadoc tool did not generate any warning 
messages.
 [exec] 
 [exec] +1 javac.  The applied patch does not increase the total number 
of javac compiler warnings.
 [exec] 
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs 
warnings.
 [exec] 
 [exec] +1 release audit.  The applied patch does not increase the 
total number of release audit warnings.

{code}

> Optimize scalar to consolidate the part file
> 
>
> Key: PIG-1548
> URL: https://issues.apache.org/jira/browse/PIG-1548
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Reporter: Daniel Dai
>Assignee: Richard Ding
> Fix For: 0.8.0
>
> Attachments: PIG-1458.patch
>
>
> The current scalar implementation writes a scalar file onto dfs. When Pig 
> needs the scalar, it opens the dfs file directly. Each scalar file 
> contains more than one part file though it contains only one record. This 
> puts a huge load on the namenode. We should consolidate the part files before 
> opening them. Another optional step is to put the consolidated file into the 
> distributed cache. This further brings down the load on the namenode.
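
The consolidation step described above can be sketched as follows. This is an illustration on the local file system in Python, not Pig's implementation: the part-* naming pattern, the function names, and the output file name are assumptions for the example, and a real implementation would work against DFS.

```python
import glob
import os
import tempfile


def consolidate_parts(directory, output_name="consolidated"):
    """Concatenate all part-* files in directory into a single file.

    Readers then open one file instead of one file per part.
    """
    output_path = os.path.join(directory, output_name)
    part_paths = sorted(glob.glob(os.path.join(directory, "part-*")))
    with open(output_path, "wb") as out:
        for path in part_paths:
            with open(path, "rb") as part:
                out.write(part.read())
    return output_path


def demo():
    """Build a directory with mostly empty part files and consolidate it."""
    with tempfile.TemporaryDirectory() as tmp:
        for i, content in enumerate([b"", b"42\n", b""]):
            with open(os.path.join(tmp, "part-%05d" % i), "wb") as f:
                f.write(content)
        with open(consolidate_parts(tmp), "rb") as f:
            return f.read()


if __name__ == "__main__":
    print(demo())
```

The demo mirrors the situation in the description: several part files exist, but only one of them carries the single record.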




[jira] Updated: (PIG-1548) Optimize scalar to consolidate the part file

2010-09-02 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1548:
--

Status: Patch Available  (was: Open)

> Optimize scalar to consolidate the part file
> 
>
> Key: PIG-1548
> URL: https://issues.apache.org/jira/browse/PIG-1548
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Reporter: Daniel Dai
>Assignee: Richard Ding
> Fix For: 0.8.0
>
> Attachments: PIG-1458.patch
>
>
> The current scalar implementation writes a scalar file onto dfs. When Pig 
> needs the scalar, it opens the dfs file directly. Each scalar file 
> contains more than one part file though it contains only one record. This 
> puts a huge load on the namenode. We should consolidate the part files before 
> opening them. Another optional step is to put the consolidated file into the 
> distributed cache. This further brings down the load on the namenode.




[jira] Updated: (PIG-1548) Optimize scalar to consolidate the part file

2010-09-03 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1548:
--

Attachment: PIG-1548.patch

> Optimize scalar to consolidate the part file
> 
>
> Key: PIG-1548
> URL: https://issues.apache.org/jira/browse/PIG-1548
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Reporter: Daniel Dai
>Assignee: Richard Ding
> Fix For: 0.8.0
>
> Attachments: PIG-1548.patch
>
>
> The current scalar implementation writes a scalar file onto dfs. When Pig 
> needs the scalar, it opens the dfs file directly. Each scalar file 
> contains more than one part file though it contains only one record. This 
> puts a huge load on the namenode. We should consolidate the part files before 
> opening them. Another optional step is to put the consolidated file into the 
> distributed cache. This further brings down the load on the namenode.




[jira] Updated: (PIG-1548) Optimize scalar to consolidate the part file

2010-09-03 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1548:
--

Attachment: (was: PIG-1458.patch)

> Optimize scalar to consolidate the part file
> 
>
> Key: PIG-1548
> URL: https://issues.apache.org/jira/browse/PIG-1548
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Reporter: Daniel Dai
>Assignee: Richard Ding
> Fix For: 0.8.0
>
> Attachments: PIG-1548.patch
>
>
> The current scalar implementation writes a scalar file onto dfs. When Pig 
> needs the scalar, it opens the dfs file directly. Each scalar file 
> contains more than one part file though it contains only one record. This 
> puts a huge load on the namenode. We should consolidate the part files before 
> opening them. Another optional step is to put the consolidated file into the 
> distributed cache. This further brings down the load on the namenode.




[jira] Commented: (PIG-1543) IsEmpty returns the wrong value after using LIMIT

2010-09-03 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906008#action_12906008
 ] 

Richard Ding commented on PIG-1543:
---

+1. Looks good.

> IsEmpty returns the wrong value after using LIMIT
> -
>
> Key: PIG-1543
> URL: https://issues.apache.org/jira/browse/PIG-1543
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Justin Hu
>Assignee: Daniel Dai
> Fix For: 0.8.0
>
> Attachments: PIG-1543-1.patch
>
>
> 1. Two input files:
> 1a: limit_empty.input_a
> 1
> 1
> 1
> 1b: limit_empty.input_b
> 2
> 2
> 2.
> The pig script: limit_empty.pig
> -- A contains only 1's & B contains only 2's
> A = load 'limit_empty.input_a' as (a1:int);
> B = load 'limit_empty.input_b' as (b1:int);
> C = COGROUP A by a1, B by b1;
> D = FOREACH C generate A, B, (IsEmpty(A)? 0:1), (IsEmpty(B)? 0:1), COUNT(A), 
> COUNT(B);
> store D into 'limit_empty.output/d';
> -- After the script is done, we see the right results:
> -- {(1),(1),(1)}   {}  1   0   3   0
> -- {} {(2),(2)}  0   1   0   2
> C1 = foreach C { Alim = limit A 1; Blim = limit B 1; generate Alim, Blim; }
> D1 = FOREACH C1 generate Alim,Blim, (IsEmpty(Alim)? 0:1), (IsEmpty(Blim)? 
> 0:1), COUNT(Alim), COUNT(Blim);
> store D1 into 'limit_empty.output/d1';
> -- After the script is done, we see the unexpected results:
> -- {(1)}   {}1   1   1   0
> -- {}  {(2)} 1   1   0   1
> dump D;
> dump D1;
> 3. Run the script and redirect stdout (2 dumps) to a file. There are two issues:
> The major one:
> IsEmpty() returns FALSE for empty bag in limit_empty.output/d1/*, while 
> IsEmpty() returns correctly in limit_empty.output/d/*.
> The difference is that one has been applied with "LIMIT" before using 
> IsEmpty().
> The minor one:
> The redirected output only contains the first dump:
> ({(1),(1),(1)},{},1,0,3L,0L)
> ({},{(2),(2)},0,1,0L,2L)
> We expect two more lines like:
> ({(1)},{},1,1,1L,0L)
> ({},{(2)},1,1,0L,1L)
> In addition, an error is reported:
> [main] ERROR org.apache.pig.backend.hadoop.executionengine.HJob - 
> java.lang.ClassCastException: java.lang.Integer cannot be cast to 
> org.apache.pig.data.Tuple
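
The behavior the report expects can be stated as a small Python model. It is only a sketch of the intended semantics (lists stand in for bags, and the helper names are made up for the example); it does not reproduce Pig's execution.

```python
from collections import defaultdict


def cogroup(a, b):
    """Group two lists of ints by value, like COGROUP A BY a1, B BY b1."""
    groups = defaultdict(lambda: ([], []))
    for value in a:
        groups[value][0].append(value)
    for value in b:
        groups[value][1].append(value)
    return [groups[key] for key in sorted(groups)]


def limit(bag, n):
    """Keep at most n items of a bag."""
    return bag[:n]


def is_empty(bag):
    """Return True when the bag has no items."""
    return len(bag) == 0


def summarize(a, b):
    """For each group, apply LIMIT 1 and report emptiness flags and counts."""
    rows = []
    for bag_a, bag_b in cogroup(a, b):
        alim, blim = limit(bag_a, 1), limit(bag_b, 1)
        rows.append((0 if is_empty(alim) else 1,
                     0 if is_empty(blim) else 1,
                     len(alim), len(blim)))
    return rows


if __name__ == "__main__":
    print(summarize([1, 1, 1], [2, 2]))
```

In this model, limiting an empty bag still yields an empty bag, so the emptiness flags after the limit match the ones computed before it.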




[jira] Updated: (PIG-1548) Optimize scalar to consolidate the part file

2010-09-03 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1548:
--

Attachment: PIG-1548_1.patch

The patch excludes some multiquery cases where more information is needed to 
correlate and determine the files to consolidate. We'll consider those cases in 
a separate jira.  

> Optimize scalar to consolidate the part file
> 
>
> Key: PIG-1548
> URL: https://issues.apache.org/jira/browse/PIG-1548
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Reporter: Daniel Dai
>Assignee: Richard Ding
> Fix For: 0.8.0
>
> Attachments: PIG-1548.patch, PIG-1548_1.patch
>
>
> The current scalar implementation writes a scalar file onto dfs. When Pig 
> needs the scalar, it opens the dfs file directly. Each scalar file 
> contains more than one part file though it contains only one record. This 
> puts a huge load on the namenode. We should consolidate the part files before 
> opening them. Another optional step is to put the consolidated file into the 
> distributed cache. This further brings down the load on the namenode.




[jira] Commented: (PIG-1599) pig gives generic message for few cases

2010-09-03 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906153#action_12906153
 ] 

Richard Ding commented on PIG-1599:
---

I manually ran the related tests and they all passed. I'm going to check in the 
patch to the trunk and 0.8 branch.

> pig gives generic message for few cases
> ---
>
> Key: PIG-1599
> URL: https://issues.apache.org/jira/browse/PIG-1599
> Project: Pig
>  Issue Type: Bug
>Reporter: niraj rai
>Assignee: niraj rai
> Attachments: pig-1599_0.patch, pig-1599_1.patch
>
>
> When we run the script:
> register testudf.jar;
> a = load '/user/pig/tests/data/singlefile/studenttab10k' as (name, age, gpa);
> b = load '/user/pig/tests/data/singlefile/studenttab10k' as (name, age, gpa);
> c = cogroup a by name, b by name;
> d = foreach c generate flatten(org.apache.pig.test.udf.evalfunc.BadUdf(a,b));
> dump d;
> we get the generic error:
> "ERROR 2088: Unable to get results for: 
> hdfs://wilbur20.labs.corp.sp1.yahoo.com:9020/tmp/temp1787360727/tmp509618997:org.apache.pig.impl.io.InterStorage".
> The UDF is faulty, and the error reported should instead be:
> ERROR 2078: Caught error from UDF: org.apache.pig.test.udf.evalfunc.BadUdf, 
> Out of bounds access [Index: 2, Size: 2]




[jira] Resolved: (PIG-1599) pig gives generic message for few cases

2010-09-03 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding resolved PIG-1599.
---

Hadoop Flags: [Reviewed]
  Resolution: Fixed

Patch is committed to both trunk and 0.8 branch. Thanks Niraj.

> pig gives generic message for few cases
> ---
>
> Key: PIG-1599
> URL: https://issues.apache.org/jira/browse/PIG-1599
> Project: Pig
>  Issue Type: Bug
>Reporter: niraj rai
>Assignee: niraj rai
> Attachments: pig-1599_0.patch, pig-1599_1.patch
>
>
> When we run the script:
> register testudf.jar;
> a = load '/user/pig/tests/data/singlefile/studenttab10k' as (name, age, gpa);
> b = load '/user/pig/tests/data/singlefile/studenttab10k' as (name, age, gpa);
> c = cogroup a by name, b by name;
> d = foreach c generate flatten(org.apache.pig.test.udf.evalfunc.BadUdf(a,b));
> dump d;
> we get the generic error:
> "ERROR 2088: Unable to get results for: 
> hdfs://wilbur20.labs.corp.sp1.yahoo.com:9020/tmp/temp1787360727/tmp509618997:org.apache.pig.impl.io.InterStorage".
> The UDF is faulty, and the error reported should instead be:
> ERROR 2078: Caught error from UDF: org.apache.pig.test.udf.evalfunc.BadUdf, 
> Out of bounds access [Index: 2, Size: 2]




[jira] Updated: (PIG-1548) Optimize scalar to consolidate the part file

2010-09-03 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1548:
--

  Status: Resolved  (was: Patch Available)
Hadoop Flags: [Reviewed]
  Resolution: Fixed

patch committed to both trunk and 0.8 branch.

> Optimize scalar to consolidate the part file
> 
>
> Key: PIG-1548
> URL: https://issues.apache.org/jira/browse/PIG-1548
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Reporter: Daniel Dai
>Assignee: Richard Ding
> Fix For: 0.8.0
>
> Attachments: PIG-1548.patch, PIG-1548_1.patch
>
>
> The current scalar implementation writes a scalar file onto dfs. When Pig 
> needs the scalar, it opens the dfs file directly. Each scalar file 
> contains more than one part file though it contains only one record. This 
> puts a huge load on the namenode. We should consolidate the part files before 
> opening them. Another optional step is to put the consolidated file into the 
> distributed cache. This further brings down the load on the namenode.




[jira] Updated: (PIG-1479) Embed Pig in scripting languages

2010-09-10 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1479:
--

Attachment: PIG-1479.patch

Thanks Julien. I rebased the patch on the latest trunk and added an option 
(-greek) to the Main class.

Now one can run a "PIG-Greek" script with the following command:

{code}
java -cp pig.jar:: org.apache.pig.Main -g 

{code}

or in local mode: 

{code}
java -cp pig.jar: org.apache.pig.Main -x local -g 
{code}


> Embed Pig in scripting languages
> 
>
> Key: PIG-1479
> URL: https://issues.apache.org/jira/browse/PIG-1479
> Project: Pig
>  Issue Type: New Feature
>Reporter: Julien Le Dem
> Attachments: PIG-1479.patch, pig-greek.tgz
>
>
> It should be possible to embed Pig calls in a scripting language and let 
> functions defined in the same script available as UDFs.
> This is a spin off of https://issues.apache.org/jira/browse/PIG-928 which 
> lets users define UDFs in scripting languages.




[jira] Updated: (PIG-1562) Fix the version for the dependent packages for the maven

2010-09-13 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1562:
--

  Status: Resolved  (was: Patch Available)
Hadoop Flags: [Reviewed]
  Resolution: Fixed

Patch committed to both trunk and 0.8 branch. Thanks Niraj!

> Fix the version for the dependent packages for the maven 
> -
>
> Key: PIG-1562
> URL: https://issues.apache.org/jira/browse/PIG-1562
> Project: Pig
>  Issue Type: Bug
>Reporter: niraj rai
>Assignee: niraj rai
> Fix For: 0.8.0
>
> Attachments: PIG-1562_1.patch, PIG-1562_2.patch, PIG_1562_0.patch
>
>
> We need to fix the set version so that the version is properly set for the 
> dependent packages in the Maven repository.




[jira] Resolved: (PIG-630) provide indication that pig script only partially succeeded

2010-09-13 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding resolved PIG-630.
--

 Assignee: Olga Natkovich
Fix Version/s: 0.8.0
   Resolution: Fixed

This jira has been fixed with MultiQuery optimization and Pig Stats.

> provide indication that pig script only partially succeeded
> ---
>
> Key: PIG-630
> URL: https://issues.apache.org/jira/browse/PIG-630
> Project: Pig
>  Issue Type: Bug
>Reporter: Olga Natkovich
>Assignee: Olga Natkovich
> Fix For: 0.8.0
>
>
> Currently, if you have multiple queries (stores/dumps) within the same pig 
> script, the script returns the result of the last one, which does not provide 
> sufficient information to the users. We need to provide the user with the 
> following information:
> - a return code that indicates the script only partially succeeded
> - an indication of which parts have succeeded
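
One way to derive such a return code is sketched below in Python. The numeric codes and the function names are assumptions chosen for the example, not values confirmed in this thread.

```python
# Hypothetical return codes chosen for this example.
SUCCESS = 0
FAILURE = 2
PARTIAL_FAILURE = 3


def script_return_code(store_results):
    """Map per-store success flags to a single script return code.

    store_results maps an output location to True (succeeded) or False.
    """
    succeeded = [loc for loc, ok in store_results.items() if ok]
    if len(succeeded) == len(store_results):
        return SUCCESS
    if not succeeded:
        return FAILURE
    return PARTIAL_FAILURE


def succeeded_parts(store_results):
    """List the output locations that succeeded."""
    return sorted(loc for loc, ok in store_results.items() if ok)


if __name__ == "__main__":
    results = {"out1": True, "out2": False}
    print(script_return_code(results), succeeded_parts(results))
```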




[jira] Commented: (PIG-1589) add test cases for mapreduce operator which use distributed cache

2010-09-13 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12909061#action_12909061
 ] 

Richard Ding commented on PIG-1589:
---

+1

> add test cases for mapreduce operator which use distributed cache
> -
>
> Key: PIG-1589
> URL: https://issues.apache.org/jira/browse/PIG-1589
> Project: Pig
>  Issue Type: Task
>Reporter: Thejas M Nair
>Assignee: Thejas M Nair
> Fix For: 0.8.0
>
> Attachments: PIG-1589.1.patch, TestWordCount.jar
>
>
> '-files filename' can be specified in the parameters for mapreduce operator 
> to send files to distributed cache. Need to add test cases for that.




[jira] Commented: (PIG-1609) 'union onschema' should give a more useful error message when schema of one of the relations has null column name

2010-09-14 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12909412#action_12909412
 ] 

Richard Ding commented on PIG-1609:
---

+1

> 'union onschema' should give a more useful error message when schema of one 
> of the relations has null column name
> -
>
> Key: PIG-1609
> URL: https://issues.apache.org/jira/browse/PIG-1609
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Thejas M Nair
>Assignee: Thejas M Nair
> Fix For: 0.8.0
>
> Attachments: PIG-1609.1.patch
>
>
> A better error message needs to be given in this case -
> {code}
> grunt> l = load '/tmp/empty.bag' as (i : int);
> grunt> f = foreach l generate i+1;
> grunt> describe f;
> f: {int}
> grunt> u = union onschema l , f;
> 2010-09-10 18:08:13,000 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
> 1000: Error during parsing. Error merging
> schemas for union operator
> Details at logfile: /Users/tejas/pig_nmr_syn/trunk/pig_1284167020897.log
> {code}




[jira] Updated: (PIG-1479) Embed Pig in scripting languages

2010-09-14 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1479:
--

Attachment: PIG-1479_2.patch

In the previous patch, the executeScript method on ScriptPigServer returns a 
list of ExecJobs (one for each store statement in the script). Unfortunately, 
the order of ExecJobs in the list is indeterminate.  

This patch fixes this problem by making the executeScript method return a 
PigStats object. One can then retrieve the output result by the alias 
corresponding to the store statement.

Here is an example:

{code}
P = pig.executeScript("""
A = load '${input}';
... ...
store G into '${output}'; """)

output = P.result("G")  # an OutputStats object
iter = output.iterator()
if iter.hasNext():
    pass  # do something
else:
    pass  # do something else
{code} 
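
The motivation for returning results keyed by alias can be sketched in plain Python. The classes below are hypothetical stand-ins, not the PigStats or OutputStats API; they only show that a lookup by alias does not depend on the order in which the store jobs are listed.

```python
class FakeOutput:
    """Hypothetical stand-in for the output of one store statement."""

    def __init__(self, alias, rows):
        self.alias = alias
        self.rows = rows


class FakeStats:
    """Hypothetical stand-in for a stats object keyed by alias."""

    def __init__(self, outputs):
        # Index outputs by alias so callers never rely on list order.
        self._by_alias = {out.alias: out for out in outputs}

    def result(self, alias):
        return self._by_alias[alias]


# The same outputs listed in two different orders.
run1 = [FakeOutput("G", [(1,)]), FakeOutput("H", [])]
run2 = list(reversed(run1))

if __name__ == "__main__":
    for outputs in (run1, run2):
        stats = FakeStats(outputs)
        print(stats.result("G").rows, outputs[0].alias)
```

Indexing into the list gives a different job in each run, while the lookup by alias returns the same output both times.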

> Embed Pig in scripting languages
> 
>
> Key: PIG-1479
> URL: https://issues.apache.org/jira/browse/PIG-1479
> Project: Pig
>  Issue Type: New Feature
>Reporter: Julien Le Dem
> Attachments: PIG-1479.patch, PIG-1479_2.patch, pig-greek.tgz
>
>
> It should be possible to embed Pig calls in a scripting language and let 
> functions defined in the same script available as UDFs.
> This is a spin off of https://issues.apache.org/jira/browse/PIG-928 which 
> lets users define UDFs in scripting languages.




[jira] Updated: (PIG-1479) Embed Pig in scripting languages

2010-09-14 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1479:
--

Attachment: pig-greek-test.tar

Attached is the updated test program from Julien.

To run the example:

* tar -xvf pig-greek-test.tar
* java -cp pig.jar: org.apache.pig.Main -x local -g script/tc.py

> Embed Pig in scripting languages
> 
>
> Key: PIG-1479
> URL: https://issues.apache.org/jira/browse/PIG-1479
> Project: Pig
>  Issue Type: New Feature
>Reporter: Julien Le Dem
> Attachments: PIG-1479.patch, PIG-1479_2.patch, pig-greek-test.tar, 
> pig-greek.tgz
>
>
> It should be possible to embed Pig calls in a scripting language and make 
> functions defined in the same script available as UDFs.
> This is a spin off of https://issues.apache.org/jira/browse/PIG-928 which 
> lets users define UDFs in scripting languages.




[jira] Commented: (PIG-1607) pig should have separate javadoc.jar in the maven repository

2010-09-15 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12909814#action_12909814
 ] 

Richard Ding commented on PIG-1607:
---


The test result can be viewed here:

{code}
https://repository.apache.org/content/repositories/snapshots/org/apache/hadoop/pig/0.8.0-SNAPSHOT/
{code}

> pig should have separate javadoc.jar in the maven repository
> 
>
> Key: PIG-1607
> URL: https://issues.apache.org/jira/browse/PIG-1607
> Project: Pig
>  Issue Type: Bug
>Reporter: niraj rai
>Assignee: niraj rai
> Attachments: PIG-1607_0.patch, PIG-1607_1.patch, PIG-1607_2.patch
>
>
> At this moment, javadoc is part of the source.jar but pig should have 
> separate javadoc.jar in the maven repository.




[jira] Commented: (PIG-1615) Return code from Pig is 0 even if the job fails when using -M flag

2010-09-16 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910407#action_12910407
 ] 

Richard Ding commented on PIG-1615:
---

This problem exists in Pig 0.7 and is fixed in Pig 0.8.
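
For callers that launch Pig as a child process, the exit status is the only 
failure signal they see, which is why returning 0 on failure is harmful. A 
minimal sketch of such a check follows. The run_and_check helper is 
hypothetical, and a small Python child process stands in for the real 
"java ... org.apache.pig.Main" command line.

```python
import subprocess
import sys


def run_and_check(cmd):
    """Run cmd and return True only if it exited with status 0."""
    completed = subprocess.run(cmd)
    return completed.returncode == 0


# Stand-in child processes: one exits with 2 (as a failed run should),
# the other exits with 0.
failing = [sys.executable, "-c", "import sys; sys.exit(2)"]
passing = [sys.executable, "-c", "import sys; sys.exit(0)"]

print(run_and_check(failing))
print(run_and_check(passing))
```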

> Return code from Pig is 0 even if the job fails when using -M flag
> --
>
> Key: PIG-1615
> URL: https://issues.apache.org/jira/browse/PIG-1615
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.6.0, 0.7.0
>Reporter: Viraj Bhat
> Fix For: 0.8.0
>
>
> I have a Pig script of this form, which I used inside a workflow system such 
> as Oozie.
> {code}
> A = load  '$INPUT' using PigStorage();
> store A into '$OUTPUT';
> {code}
> I run this with multi-query optimization turned off:
> {quote}
> $java -cp ~/pig-svn/trunk/pig.jar:$HADOOP_CONF_DIR org.apache.pig.Main -p 
> INPUT=/user/viraj/junk1 -M -p OUTPUT=/user/viraj/junk2 loadpigstorage.pig
> {quote}
> The directory "/user/viraj/junk1" is not present
> I get the following results:
> {quote}
> Input(s):
> Failed to read data from "/user/viraj/junk1"
> Output(s):
> Failed to produce result in "/user/viraj/junk2"
> {quote}
> This is expected, but the return code is still 0
> {code}
> $ echo $?
> 0
> {code}
> If I run this script with multi-query optimization turned on, it gives a 
> return code of 2, which is correct.
> {code}
> $ java -cp ~/pig-svn/trunk/pig.jar:$HADOOP_CONF_DIR org.apache.pig.Main -p 
> INPUT=/user/viraj/junk1 -p OUTPUT=/user/viraj/junk2 loadpigstorage.pig
> ...
> $ echo $?
> 2
> {code}
> I believe the wrong return code from Pig is causing Oozie to believe that 
> the Pig script succeeded.
> Viraj




[jira] Commented: (PIG-1610) 'union onschema' does handle some cases involving 'namespaced' column names in schema

2010-09-16 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910409#action_12910409
 ] 

Richard Ding commented on PIG-1610:
---

+1

> 'union onschema' does handle some cases involving 'namespaced' column names 
> in schema
> -
>
> Key: PIG-1610
> URL: https://issues.apache.org/jira/browse/PIG-1610
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Thejas M Nair
>Assignee: Thejas M Nair
> Fix For: 0.8.0
>
> Attachments: PIG-1610.1.patch, PIG-1610.2.patch
>
>
> case 1:
> grunt> describe f;  
> f: {l1::a: bytearray,l1::b: bytearray}
> grunt> describe l1;
> l1: {a: bytearray,b: bytearray}
> grunt> dump f;
> (1,11)
> (2,22)
> (3,33)
> grunt> dump l1;
> (1,11)
> (2,22)
> (3,33)
> grunt> u = union onschema f, l1;
> grunt> describe u;
> u: {l1::a: bytearray,l1::b: bytearray}
> -- the dump u gives incorrect results
> grunt> dump u; 
> (,)
> (,)
> (,)
> (1,11)
> (2,22)
> (3,33)
> case 2:
> grunt> u = union onschema l1, f;
> grunt> describe u;
> 2010-09-13 15:11:13,877 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
> 1108: Duplicate schema alias: l1::a
> Details at logfile: /Users/tejas/pig_unions_err2/trunk/pig_1284410413970.log




[jira] Updated: (PIG-1479) Embed Pig in scripting languages

2010-09-17 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1479:
--

Attachment: pig-greek-test.tar

Attached is the test script, modified based on Julien's comment. As for the 
command-line option -g, it can also take a single parameter (the script file 
name) and let Pig determine the script engine from the file extension.
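
The extension-based choice could look roughly like the following sketch. The 
pick_engine function and the ENGINES_BY_EXTENSION table are hypothetical names 
used for illustration, and the ".py" to "jython" entry is an assumption rather 
than something taken from the patch.

```python
import os

# Hypothetical mapping from file extension to script engine name.
ENGINES_BY_EXTENSION = {".py": "jython"}


def pick_engine(script_path, engine=None):
    """Return the explicitly given engine, else infer it from the extension."""
    if engine is not None:
        return engine
    ext = os.path.splitext(script_path)[1].lower()
    if ext not in ENGINES_BY_EXTENSION:
        raise ValueError("cannot determine script engine for " + script_path)
    return ENGINES_BY_EXTENSION[ext]


print(pick_engine("script/tc.py"))
```

An explicitly named engine always wins, so the two-parameter form of -g keeps 
working unchanged.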



> Embed Pig in scripting languages
> 
>
> Key: PIG-1479
> URL: https://issues.apache.org/jira/browse/PIG-1479
> Project: Pig
>  Issue Type: New Feature
>Reporter: Julien Le Dem
> Attachments: PIG-1479.patch, PIG-1479_2.patch, pig-greek-test.tar, 
> pig-greek-test.tar, pig-greek.tgz
>
>
> It should be possible to embed Pig calls in a scripting language and make 
> functions defined in the same script available as UDFs.
> This is a spin off of https://issues.apache.org/jira/browse/PIG-928 which 
> lets users define UDFs in scripting languages.




[jira] Commented: (PIG-1616) 'union onschema' does not use create output with correct schema when udfs are involved

2010-09-20 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12912696#action_12912696
 ] 

Richard Ding commented on PIG-1616:
---

+1

> 'union onschema' does not use create output with correct schema when udfs are 
> involved
> --
>
> Key: PIG-1616
> URL: https://issues.apache.org/jira/browse/PIG-1616
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Thejas M Nair
>Assignee: Thejas M Nair
> Fix For: 0.8.0
>
> Attachments: PIG-1616.1.patch
>
>
> 'union onschema' creates a merged schema based on the input schemas. It does 
> that in the queryparser, and at that stage the udf return type used is the 
> default return type.  The actual return type for the udf is determined later 
> in the TypeCheckingVisitor using EvalFunc.getArgsToFuncMapping().
> 'union onschema' should use the final type for its input relation to create 
> the merged schema.




[jira] Commented: (PIG-1641) Incorrect counters in local mode

2010-09-22 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913736#action_12913736
 ] 

Richard Ding commented on PIG-1641:
---

Hadoop counters are not available in local mode (PIG-1286).

So for now I propose that, in local mode, the Pig stats output be changed to 
something like the following:

{code} 
Job Stats (time in seconds):
JobId  Alias Feature Outputs
job_local_0001 raw MAP_ONLY
job_local_0002 rank_sort SAMPLER
job_local_0003 rank_sort ORDER_BY Processed/user_visits_table,

Input(s):
Successfully read records from: "Data/Raw/UserVisits.dat"

Output(s):
Successfully stored records in: "Processed/user_visits_table"
{code}
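
One way to produce such lines is to treat the record count as optional and 
leave it out when it is unknown. This is only a sketch; the read_line and 
stored_line helpers are hypothetical and do not correspond to existing Pig 
code.

```python
def read_line(path, records=None):
    """Format the Input(s) line; omit the count when it is unknown."""
    if records is None:
        return 'Successfully read records from: "%s"' % path
    return 'Successfully read %d records from: "%s"' % (records, path)


def stored_line(path, records=None):
    """Format the Output(s) line; omit the count when it is unknown."""
    if records is None:
        return 'Successfully stored records in: "%s"' % path
    return 'Successfully stored %d records in: "%s"' % (records, path)


# Local mode: counters are unavailable, so no count is passed.
print(read_line("Data/Raw/UserVisits.dat"))
print(stored_line("Processed/user_visits_table"))
```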

> Incorrect counters in local mode
> 
>
> Key: PIG-1641
> URL: https://issues.apache.org/jira/browse/PIG-1641
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Ashutosh Chauhan
>
> User report, not verified.
> 
> HadoopVersion  PigVersion      UserId  StartedAt            FinishedAt           Features
> 0.20.2         0.8.0-SNAPSHOT  user    2010-09-21 19:25:58  2010-09-21 21:58:42  ORDER_BY
> Success!
> Job Stats (time in seconds):
> JobId  Maps  Reduces  MaxMapTime  MinMapTIme  AvgMapTime  MaxReduceTime  MinReduceTime  AvgReduceTime  Alias  Feature  Outputs
> job_local_0001  0  0  0  0  0  0  0  0  raw  MAP_ONLY
> job_local_0002  0  0  0  0  0  0  0  0  rank_sort  SAMPLER
> job_local_0003  0  0  0  0  0  0  0  0  rank_sort  ORDER_BY  Processed/user_visits_table,
> Input(s):
> Successfully read 0 records from: "Data/Raw/UserVisits.dat"
> Output(s):
> Successfully stored 0 records in: "Processed/user_visits_table"
> However, when I look in the output:
> $ ls -lh Processed/user_visits_table/CG0/
> total 15250760
> -rwxrwxrwx  1 user  _lpoperator   7.3G Sep 21 21:58 part-0*
> It read a 20G input file and generated some output...
> 
> Is it that in local mode counters are not available? If so, instead of 
> printing zeros we should print "Information Unavailable" or some such.




[jira] Assigned: (PIG-1641) Incorrect counters in local mode

2010-09-22 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding reassigned PIG-1641:
-

Assignee: Richard Ding

> Incorrect counters in local mode
> 
>
> Key: PIG-1641
> URL: https://issues.apache.org/jira/browse/PIG-1641
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Ashutosh Chauhan
>Assignee: Richard Ding
>
> User report, not verified.
> 
> HadoopVersion  PigVersion      UserId  StartedAt            FinishedAt           Features
> 0.20.2         0.8.0-SNAPSHOT  user    2010-09-21 19:25:58  2010-09-21 21:58:42  ORDER_BY
> Success!
> Job Stats (time in seconds):
> JobId  Maps  Reduces  MaxMapTime  MinMapTIme  AvgMapTime  MaxReduceTime  MinReduceTime  AvgReduceTime  Alias  Feature  Outputs
> job_local_0001  0  0  0  0  0  0  0  0  raw  MAP_ONLY
> job_local_0002  0  0  0  0  0  0  0  0  rank_sort  SAMPLER
> job_local_0003  0  0  0  0  0  0  0  0  rank_sort  ORDER_BY  Processed/user_visits_table,
> Input(s):
> Successfully read 0 records from: "Data/Raw/UserVisits.dat"
> Output(s):
> Successfully stored 0 records in: "Processed/user_visits_table"
> However, when I look in the output:
> $ ls -lh Processed/user_visits_table/CG0/
> total 15250760
> -rwxrwxrwx  1 user  _lpoperator   7.3G Sep 21 21:58 part-0*
> It read a 20G input file and generated some output...
> 
> Is it that in local mode counters are not available? If so, instead of 
> printing zeros we should print "Information Unavailable" or some such.




[jira] Updated: (PIG-1642) Order by doesn't use estimation to determine the parallelism

2010-09-22 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1642:
--

Summary: Order by doesn't use estimation to determine the parallelism  
(was: Order by doesn't use estimation to determine the paralelism)

> Order by doesn't use estimation to determine the parallelism
> 
>
> Key: PIG-1642
> URL: https://issues.apache.org/jira/browse/PIG-1642
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Richard Ding
> Fix For: 0.8.0
>
>
> With PIG-1249, a simple heuristic is used to determine the number of reducers 
> if it isn't specified (via PARALLEL or default_parallel). For the order by 
> statement, however, it still defaults to 1.
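
The description does not spell out the heuristic. As a rough illustration, the 
sketch below assumes a common size-based form: input bytes divided by a 
per-reducer byte budget, capped at a maximum. The estimate_reducers function, 
its parameter names, and its default values are illustrative assumptions, not 
taken from PIG-1249.

```python
def estimate_reducers(input_bytes, bytes_per_reducer=10**9, max_reducers=999):
    """Estimate a reducer count from input size, clamped to [1, max_reducers]."""
    # Integer ceiling division avoids float rounding on large sizes.
    needed = -(-input_bytes // bytes_per_reducer)
    return max(1, min(max_reducers, needed))


# A 20 GB input with a 1 GB budget per reducer.
print(estimate_reducers(20 * 10**9))
```

Under these assumptions, an order by that always defaults to 1 would ignore 
the estimate entirely, which is the behavior this issue reports.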



