[jira] Created: (PIG-1662) Need better error message for MalFormedProbVecException

2010-10-01 Thread Richard Ding (JIRA)
Need better error message for MalFormedProbVecException
---

 Key: PIG-1662
 URL: https://issues.apache.org/jira/browse/PIG-1662
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Richard Ding
Assignee: Richard Ding
 Fix For: 0.8.0


Instead of the generic error message:

Backend error message
-

Caused by: org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.MalFormedProbVecException: ERROR 2122: Sum of probabilities should be one
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.DiscreteProbabilitySampleGenerator.<init>(DiscreteProbabilitySampleGenerator.java:56)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.setConf(WeightedRangePartitioner.java:128)
    ... 10 more

Pig can easily print out the content of the malformed probability vector.
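
As a rough illustration of the request, the sketch below shows a check whose error message includes the offending vector. It is not taken from PIG-1662.patch: the class name ProbVecCheck, the method validate, and the tolerance EPSILON are hypothetical, and IllegalArgumentException stands in for MalFormedProbVecException so that the example is self-contained.

```java
import java.util.Arrays;

// Hypothetical stand-in for the check that raises ERROR 2122; not Pig's actual code.
final class ProbVecCheck {
    // Assumed tolerance for floating-point rounding; the real threshold may differ.
    static final double EPSILON = 1e-5;

    // Throws if the probabilities do not sum to one, and reports the vector itself.
    static void validate(float[] probVec) {
        double sum = 0.0;
        for (float p : probVec) {
            sum += p;
        }
        if (Math.abs(sum - 1.0) > EPSILON) {
            throw new IllegalArgumentException(
                "Sum of probabilities should be one, but was " + sum
                + " for vector " + Arrays.toString(probVec));
        }
    }
}
```

With a message of this shape, a user who hits the error can see immediately which entries of the vector are off, without rerunning the job under a debugger.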

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1662) Need better error message for MalFormedProbVecException

2010-10-01 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1662:
--

Attachment: PIG-1662.patch

 Need better error message for MalFormedProbVecException
 ---

 Key: PIG-1662
 URL: https://issues.apache.org/jira/browse/PIG-1662
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Richard Ding
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1662.patch





[jira] Updated: (PIG-1662) Need better error message for MalFormedProbVecException

2010-10-01 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1662:
--

Status: Patch Available  (was: Open)

 Need better error message for MalFormedProbVecException
 ---

 Key: PIG-1662
 URL: https://issues.apache.org/jira/browse/PIG-1662
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Richard Ding
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1662.patch





[jira] Commented: (PIG-1656) TOBAG udfs ignores columns with null value; it does not use input type to determine output schema

2010-10-01 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12917050#action_12917050
 ] 

Richard Ding commented on PIG-1656:
---


We need to make it clear how the output schema of TOBAG is generated. For 
example, in the first case, the type is preserved in the inner schema:

{code}
grunt> a = load 'input' as (a0:int, a1:int);
grunt> b = foreach a generate TOBAG(a0, a1);
grunt> describe b;
b: {{int}}
{code}

but not in the second case:

{code}
grunt> a = load 'input' as (a0:int, a1:int);
grunt> c = group a by a0;
grunt> b = foreach c generate TOBAG(a.a0, a.a1);
grunt> describe b;
b: {{NULL}}
{code}

 TOBAG  udfs ignores columns with null value;  it does not use input type to 
 determine output schema
 ---

 Key: PIG-1656
 URL: https://issues.apache.org/jira/browse/PIG-1656
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Fix For: 0.8.0

 Attachments: PIG-1656.1.patch


 TOBAG udf ignores columns with null value
 {code}
 grunt> R4 = foreach B generate $0, TOBAG(id, null, id, null);
 grunt> dump R4;
 1000    {(1),(1)}
 1000    {(2),(2)}
 1000    {(3),(3)}
 1000    {(4),(4)}
 {code}
  TOBAG does not use input type to determine output schema
 {code}
 grunt> B1 = foreach B generate TOBAG(1, 2, 3);
 grunt> describe B1;
 B1: {{null}}
 {code}




[jira] Commented: (PIG-1656) TOBAG udfs ignores columns with null value; it does not use input type to determine output schema

2010-10-01 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12917108#action_12917108
 ] 

Richard Ding commented on PIG-1656:
---

+1

 TOBAG  udfs ignores columns with null value;  it does not use input type to 
 determine output schema
 ---

 Key: PIG-1656
 URL: https://issues.apache.org/jira/browse/PIG-1656
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Fix For: 0.8.0

 Attachments: PIG-1656.1.patch, PIG-1656.2.patch





[jira] Updated: (PIG-1651) PIG class loading mishandled

2010-09-30 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1651:
--

  Status: Resolved  (was: Patch Available)
Hadoop Flags: [Reviewed]
  Resolution: Fixed

 PIG class loading mishandled
 

 Key: PIG-1651
 URL: https://issues.apache.org/jira/browse/PIG-1651
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Yan Zhou
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1651.patch


 If zebra.jar is registered in a Pig script but is not on the CLASSPATH, a 
 query using Zebra fails because the same class appears to be loaded into the 
 JVM more than once, so a static variable set earlier is not seen after an 
 instance of the class is created through reflection. (Once zebra.jar is 
 specified in the CLASSPATH, it works fine.) The exception stack is as follows:
 Backend error message during job submission
 ---
 org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to create input splits for: hdfs://hostname/pathto/zebra_dir :: null
     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:284)
     at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:907)
     at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:801)
     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:752)
     at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:378)
     at org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247)
     at org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279)
     at java.lang.Thread.run(Thread.java:619)
 Caused by: java.lang.NullPointerException
     at org.apache.hadoop.zebra.io.ColumnGroup.getNonDataFilePrefix(ColumnGroup.java:123)
     at org.apache.hadoop.zebra.io.ColumnGroup$CGPathFilter.accept(ColumnGroup.java:2413)
     at org.apache.hadoop.zebra.mapreduce.TableInputFormat$DummyFileInputFormat$MultiPathFilter.accept(TableInputFormat.java:718)
     at org.apache.hadoop.fs.FileSystem$GlobFilter.accept(FileSystem.java:1084)
     at org.apache.hadoop.fs.FileSystem.globStatusInternal(FileSystem.java:919)
     at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:866)
     at org.apache.hadoop.zebra.mapreduce.TableInputFormat$DummyFileInputFormat.listStatus(TableInputFormat.java:780)
     at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:246)
     at org.apache.hadoop.zebra.mapreduce.TableInputFormat.getRowSplits(TableInputFormat.java:863)
     at org.apache.hadoop.zebra.mapreduce.TableInputFormat.getSplits(TableInputFormat.java:1017)
     at org.apache.hadoop.zebra.mapreduce.TableInputFormat.getSplits(TableInputFormat.java:961)
     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:269)
     ... 7 more




[jira] Commented: (PIG-1648) Split combination may return too many block locations to map/reduce framework

2010-09-28 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915889#action_12915889
 ] 

Richard Ding commented on PIG-1648:
---

+1

 Split combination may return too many block locations to map/reduce framework
 -

 Key: PIG-1648
 URL: https://issues.apache.org/jira/browse/PIG-1648
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Yan Zhou
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: PIG-1648.patch


 For instance, suppose one small split has block locations h1, h2 and h3, and 
 another small split has h1, h3 and h4. After combination, the composite split 
 contains 4 block locations. If the number of component splits is large, the 
 number of block locations can be large too. In fact, the block locations 
 serve as a hint to M/R about the best hosts on which to run this composite 
 split, so the list should be short, say 5 hosts, and should name the hosts 
 that hold the most data in this composite split.
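
 The selection described above could be sketched roughly as follows. This is 
 not the code in PIG-1648.patch: the names TopHostsSketch, Split and topHosts 
 are hypothetical, and weighting each host by the full length of every 
 component split it holds is an assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: keep only the hosts that hold the most data in a composite split.
final class TopHostsSketch {
    // One component split: its length in bytes and the hosts that hold its block.
    static final class Split {
        final long length;
        final String[] hosts;

        Split(long length, String... hosts) {
            this.length = length;
            this.hosts = hosts;
        }
    }

    // Returns at most `limit` host names, ordered by the number of bytes they hold.
    static List<String> topHosts(List<Split> splits, int limit) {
        Map<String, Long> bytesPerHost = new HashMap<>();
        for (Split s : splits) {
            for (String h : s.hosts) {
                bytesPerHost.merge(h, s.length, Long::sum);
            }
        }
        List<Map.Entry<String, Long>> entries = new ArrayList<>(bytesPerHost.entrySet());
        // Most data first; ties are broken by host name so the result is deterministic.
        entries.sort((a, b) -> {
            int byBytes = Long.compare(b.getValue(), a.getValue());
            return byBytes != 0 ? byBytes : a.getKey().compareTo(b.getKey());
        });
        List<String> result = new ArrayList<>();
        for (int i = 0; i < entries.size() && i < limit; i++) {
            result.add(entries.get(i).getKey());
        }
        return result;
    }
}
```

 For the two splits in the example, h1 and h3 hold data from both splits, so 
 they sort ahead of h2 and h4 whatever the limit is.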




[jira] Commented: (PIG-1651) PIG class loading mishandled

2010-09-28 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915945#action_12915945
 ] 

Richard Ding commented on PIG-1651:
---

The problem here is that PigContext uses LogicalPlanBuilder.classloader to 
instantiate the LoadFuncs, but the thread's context ClassLoader is a 
different class loader. Hence a static variable set on the class loaded by 
one loader is not visible to the same class loaded by the other loader. The 
solution is to use LogicalPlanBuilder.classloader as the thread's context 
ClassLoader as well.
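
For illustration only, installing a given loader as the thread's context ClassLoader can be sketched with the standard Thread API as below. The class ContextLoaderSketch and the method runWith are hypothetical and are not taken from PIG-1651.patch; restoring the previous loader afterwards is an addition of this sketch.

```java
// Hypothetical sketch: run a task with a chosen loader as the context ClassLoader.
final class ContextLoaderSketch {
    // Installs `loader` for the duration of the task, then restores the previous loader.
    static void runWith(ClassLoader loader, Runnable task) {
        Thread current = Thread.currentThread();
        ClassLoader previous = current.getContextClassLoader();
        current.setContextClassLoader(loader);
        try {
            // Code that looks classes up through the context loader now sees `loader`.
            task.run();
        } finally {
            current.setContextClassLoader(previous);
        }
    }
}
```

When both lookups go through the same loader, there is only one copy of the class, and therefore only one copy of its static variables.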

 PIG class loading mishandled
 

 Key: PIG-1651
 URL: https://issues.apache.org/jira/browse/PIG-1651
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Yan Zhou
Assignee: Richard Ding
 Fix For: 0.8.0





[jira] Updated: (PIG-1651) PIG class loading mishandled

2010-09-28 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1651:
--

Status: Patch Available  (was: Open)

 PIG class loading mishandled
 

 Key: PIG-1651
 URL: https://issues.apache.org/jira/browse/PIG-1651
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Yan Zhou
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1651.patch





[jira] Updated: (PIG-1651) PIG class loading mishandled

2010-09-28 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1651:
--

Attachment: PIG-1651.patch

 PIG class loading mishandled
 

 Key: PIG-1651
 URL: https://issues.apache.org/jira/browse/PIG-1651
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Yan Zhou
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1651.patch





[jira] Commented: (PIG-1649) FRJoin fails to compute number of input files for replicated input

2010-09-28 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915985#action_12915985
 ] 

Richard Ding commented on PIG-1649:
---

+1. Looks good.

 FRJoin fails to compute number of input files for replicated input
 --

 Key: PIG-1649
 URL: https://issues.apache.org/jira/browse/PIG-1649
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Fix For: 0.8.0

 Attachments: PIG-1649.1.patch, PIG-1649.2.patch, PIG-1649.3.patch, 
 PIG-1649.4.patch


 In FRJoin, if the input path contains curly braces, Pig fails to compute the 
 number of input files and logs the following exception:
 10/09/27 14:31:13 WARN mapReduceLayer.MRCompiler: failed to get number of input files
 java.net.URISyntaxException: Illegal character in path at index 12: /user/tejas/{std*txt}
     at java.net.URI$Parser.fail(URI.java:2809)
     at java.net.URI$Parser.checkChars(URI.java:2982)
     at java.net.URI$Parser.parseHierarchical(URI.java:3066)
     at java.net.URI$Parser.parse(URI.java:3024)
     at java.net.URI.<init>(URI.java:578)
     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.hasTooManyInputFiles(MRCompiler.java:1283)
     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.visitFRJoin(MRCompiler.java:1203)
     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFRJoin.visit(POFRJoin.java:188)
     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.compile(MRCompiler.java:475)
     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.compile(MRCompiler.java:454)
     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.compile(MRCompiler.java:336)
     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.compile(MapReduceLauncher.java:468)
     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:116)
     at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:301)
     at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1197)
     at org.apache.pig.PigServer.storeEx(PigServer.java:873)
     at org.apache.pig.PigServer.store(PigServer.java:815)
     at org.apache.pig.PigServer.openIterator(PigServer.java:727)
     at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:612)
     at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:301)
     at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165)
     at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:141)
     at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:76)
     at org.apache.pig.Main.run(Main.java:453)
     at org.apache.pig.Main.main(Main.java:107)
 This does not cause the query to fail. But since the number of input files 
 is not calculated, the optimizations added in PIG-1458 to reduce load on 
 the name node are not used.
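
 The failure at the top of the trace can be reproduced with java.net.URI 
 alone, as in the sketch below. The class GlobUriSketch and the method 
 firstIllegalIndex are hypothetical names used only for this illustration; 
 the sketch shows why a glob pattern cannot be passed to the URI constructor 
 unescaped, not how the patch resolves it.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical sketch: a glob such as {std*txt} is not a legal URI path.
final class GlobUriSketch {
    // Returns the index of the first illegal character, or -1 if the string parses as a URI.
    static int firstIllegalIndex(String path) {
        try {
            new URI(path);
            return -1;
        } catch (URISyntaxException e) {
            return e.getIndex();
        }
    }
}
```

 For /user/tejas/{std*txt} the reported index is 12, the position of the 
 opening curly brace, which matches the logged message.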




[jira] Updated: (PIG-1642) Order by doesn't use estimation to determine the parallelism

2010-09-27 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1642:
--

  Status: Resolved  (was: Patch Available)
Hadoop Flags: [Reviewed]
  Resolution: Fixed

Patch committed to both trunk and 0.8 branch.

 Order by doesn't use estimation to determine the parallelism
 

 Key: PIG-1642
 URL: https://issues.apache.org/jira/browse/PIG-1642
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Richard Ding
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1642.patch, PIG-1642_1.patch, PIG-1642_1.patch


 With PIG-1249, a simple heuristic is used to determine the number of reducers 
 if it isn't specified (via PARALLEL or default_parallel). For the ORDER BY 
 statement, however, the number of reducers still defaults to 1.
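
 This message does not spell out the heuristic. As a sketch under stated 
 assumptions, a size-based estimate of this kind divides the total input size 
 by a target number of bytes per reducer and clamps the result. The class 
 ReducerEstimateSketch, the method estimate, and its parameters are 
 hypothetical; the actual rule and its defaults are defined in PIG-1249.

```java
// Hypothetical sketch of a size-based reducer estimate; see PIG-1249 for the real rule.
final class ReducerEstimateSketch {
    // Returns a reducer count between 1 and maxReducers, proportional to the input size.
    static int estimate(long totalInputBytes, long bytesPerReducer, int maxReducers) {
        // Ceiling division: a partially filled reducer still counts as one reducer.
        long needed = (totalInputBytes + bytesPerReducer - 1) / bytesPerReducer;
        // Never go below one reducer, even for empty input.
        needed = Math.max(1L, needed);
        return (int) Math.min((long) maxReducers, needed);
    }
}
```

 The request in this issue is that ORDER BY use such an estimate as well, 
 instead of falling back to a single reducer.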




[jira] Updated: (PIG-1641) Incorrect counters in local mode

2010-09-27 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1641:
--

  Status: Resolved  (was: Patch Available)
Hadoop Flags: [Reviewed]
  Resolution: Fixed

Patch committed to both trunk and 0.8 branch.

 Incorrect counters in local mode
 

 Key: PIG-1641
 URL: https://issues.apache.org/jira/browse/PIG-1641
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Ashutosh Chauhan
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1641.patch


 User report, not verified.
 <email>
 HadoopVersion  PigVersion      UserId  StartedAt            FinishedAt           Features
 0.20.2         0.8.0-SNAPSHOT  user    2010-09-21 19:25:58  2010-09-21 21:58:42  ORDER_BY
 Success!
 Job Stats (time in seconds):
 JobId           Maps  Reduces  MaxMapTime  MinMapTIme  AvgMapTime  MaxReduceTime  MinReduceTime  AvgReduceTime  Alias      Feature   Outputs
 job_local_0001  0     0        0           0           0           0              0              0              raw        MAP_ONLY
 job_local_0002  0     0        0           0           0           0              0              0              rank_sort  SAMPLER
 job_local_0003  0     0        0           0           0           0              0              0              rank_sort  ORDER_BY  Processed/user_visits_table,
 Input(s):
 Successfully read 0 records from: Data/Raw/UserVisits.dat
 Output(s):
 Successfully stored 0 records in: Processed/user_visits_table
 However, when I look in the output:
 $ ls -lh Processed/user_visits_table/CG0/
 total 15250760
 -rwxrwxrwx  1 user  _lpoperator   7.3G Sep 21 21:58 part-0*
 It read a 20G input file and generated some output...
 </email>
 Is it that counters are not available in local mode? If so, instead of 
 printing zeros we should print "Information Unavailable" or some such.
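
 One way to express that suggestion is sketched below. It is not taken from 
 PIG-1641.patch: the class CounterFormatSketch, the method recordsRead, and 
 the use of -1 as an "unknown" marker are all assumptions made for this 
 illustration.

```java
// Hypothetical sketch: report a missing counter as unavailable rather than as zero.
final class CounterFormatSketch {
    // Assumed marker for "the counter could not be read"; not a Pig constant.
    static final long UNAVAILABLE = -1L;

    // Builds the "Input(s)" summary line for one input location.
    static String recordsRead(long count, String location) {
        if (count == UNAVAILABLE) {
            return "Information unavailable on records read from: " + location;
        }
        return "Successfully read " + count + " records from: " + location;
    }
}
```

 Keeping "unknown" distinct from zero means that a genuinely empty input still 
 reports 0 records, while a missing counter is no longer mistaken for one.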




[jira] Updated: (PIG-1642) Order by doesn't use estimation to determine the parallelism

2010-09-24 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1642:
--

Attachment: PIG-1642.patch

The patch passed test-core.

The results of test-patch:

{code}
 [exec] +1 overall.
 [exec]
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec]
 [exec] +1 tests included.  The patch appears to include 8 new or modified tests.
 [exec]
 [exec] +1 javadoc.  The javadoc tool did not generate any warning messages.
 [exec]
 [exec] +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
 [exec]
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs warnings.
 [exec]
 [exec] +1 release audit.  The applied patch does not increase the total number of release audit warnings.
{code}

 Order by doesn't use estimation to determine the parallelism
 

 Key: PIG-1642
 URL: https://issues.apache.org/jira/browse/PIG-1642
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Richard Ding
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1642.patch





[jira] Updated: (PIG-1642) Order by doesn't use estimation to determine the parallelism

2010-09-24 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1642:
--

Status: Patch Available  (was: Open)

 Order by doesn't use estimation to determine the parallelism
 

 Key: PIG-1642
 URL: https://issues.apache.org/jira/browse/PIG-1642
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Richard Ding
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1642.patch





[jira] Updated: (PIG-1642) Order by doesn't use estimation to determine the parallelism

2010-09-24 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1642:
--

Attachment: PIG-1642_1.patch

 Order by doesn't use estimation to determine the parallelism
 

 Key: PIG-1642
 URL: https://issues.apache.org/jira/browse/PIG-1642
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Richard Ding
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1642.patch, PIG-1642_1.patch





[jira] Updated: (PIG-1642) Order by doesn't use estimation to determine the parallelism

2010-09-24 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1642:
--

Attachment: PIG-1642_1.patch

 Order by doesn't use estimation to determine the parallelism
 

 Key: PIG-1642
 URL: https://issues.apache.org/jira/browse/PIG-1642
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Richard Ding
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1642.patch, PIG-1642_1.patch, PIG-1642_1.patch





[jira] Commented: (PIG-1642) Order by doesn't use estimation to determine the parallelism

2010-09-24 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914667#action_12914667
 ] 

Richard Ding commented on PIG-1642:
---

New patch to address the review comments.

 Order by doesn't use estimation to determine the parallelism
 

 Key: PIG-1642
 URL: https://issues.apache.org/jira/browse/PIG-1642
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Richard Ding
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1642.patch, PIG-1642_1.patch, PIG-1642_1.patch


 With PIG-1249, a simple heuristic is used to determine the number of reducers 
 if it isn't specified (via PARALLEL or default_parallel). For order by 
 statement, however, it still defaults to 1.




[jira] Assigned: (PIG-1642) Order by doesn't use estimation to determine the parallelism

2010-09-23 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding reassigned PIG-1642:
-

Assignee: Richard Ding

 Order by doesn't use estimation to determine the parallelism
 

 Key: PIG-1642
 URL: https://issues.apache.org/jira/browse/PIG-1642
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Richard Ding
Assignee: Richard Ding
 Fix For: 0.8.0


 With PIG-1249, a simple heuristic is used to determine the number of reducers 
 if it isn't specified (via PARALLEL or default_parallel). For order by 
 statement, however, it still defaults to 1.




[jira] Updated: (PIG-1641) Incorrect counters in local mode

2010-09-23 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1641:
--

Fix Version/s: 0.8.0

 Incorrect counters in local mode
 

 Key: PIG-1641
 URL: https://issues.apache.org/jira/browse/PIG-1641
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Ashutosh Chauhan
Assignee: Richard Ding
 Fix For: 0.8.0


 User report, not verified.
 email
 HadoopVersion  PigVersion      UserId  StartedAt            FinishedAt           Features
 0.20.2         0.8.0-SNAPSHOT  user    2010-09-21 19:25:58  2010-09-21 21:58:42  ORDER_BY
 Success!
 Job Stats (time in seconds):
 JobId  Maps  Reduces  MaxMapTime  MinMapTIme  AvgMapTime  MaxReduceTime  MinReduceTime  AvgReduceTime  Alias  Feature  Outputs
 job_local_0001  0  0  0  0  0  0  0  0  raw  MAP_ONLY
 job_local_0002  0  0  0  0  0  0  0  0  rank_sort  SAMPLER
 job_local_0003  0  0  0  0  0  0  0  0  rank_sort  ORDER_BY  Processed/user_visits_table,
 Input(s):
 Successfully read 0 records from: Data/Raw/UserVisits.dat
 Output(s):
 Successfully stored 0 records in: Processed/user_visits_table
 However, when I look in the output:
 $ ls -lh Processed/user_visits_table/CG0/
 total 15250760
 -rwxrwxrwx  1 user  _lpoperator   7.3G Sep 21 21:58 part-0*
 It read a 20G input file and generated some output...
 /email
 Is it that in local mode counters are not available? If so, instead of 
 printing zeros we should print Information Unavailable or some such.




[jira] Updated: (PIG-1641) Incorrect counters in local mode

2010-09-23 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1641:
--

Status: Patch Available  (was: Open)

 Incorrect counters in local mode
 

 Key: PIG-1641
 URL: https://issues.apache.org/jira/browse/PIG-1641
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Ashutosh Chauhan
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1641.patch


 User report, not verified.
 email
 HadoopVersion  PigVersion      UserId  StartedAt            FinishedAt           Features
 0.20.2         0.8.0-SNAPSHOT  user    2010-09-21 19:25:58  2010-09-21 21:58:42  ORDER_BY
 Success!
 Job Stats (time in seconds):
 JobId  Maps  Reduces  MaxMapTime  MinMapTIme  AvgMapTime  MaxReduceTime  MinReduceTime  AvgReduceTime  Alias  Feature  Outputs
 job_local_0001  0  0  0  0  0  0  0  0  raw  MAP_ONLY
 job_local_0002  0  0  0  0  0  0  0  0  rank_sort  SAMPLER
 job_local_0003  0  0  0  0  0  0  0  0  rank_sort  ORDER_BY  Processed/user_visits_table,
 Input(s):
 Successfully read 0 records from: Data/Raw/UserVisits.dat
 Output(s):
 Successfully stored 0 records in: Processed/user_visits_table
 However, when I look in the output:
 $ ls -lh Processed/user_visits_table/CG0/
 total 15250760
 -rwxrwxrwx  1 user  _lpoperator   7.3G Sep 21 21:58 part-0*
 It read a 20G input file and generated some output...
 /email
 Is it that in local mode counters are not available? If so, instead of 
 printing zeros we should print Information Unavailable or some such.




[jira] Updated: (PIG-1641) Incorrect counters in local mode

2010-09-23 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1641:
--

Attachment: PIG-1641.patch

 Incorrect counters in local mode
 

 Key: PIG-1641
 URL: https://issues.apache.org/jira/browse/PIG-1641
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Ashutosh Chauhan
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1641.patch


 User report, not verified.
 email
 HadoopVersion  PigVersion      UserId  StartedAt            FinishedAt           Features
 0.20.2         0.8.0-SNAPSHOT  user    2010-09-21 19:25:58  2010-09-21 21:58:42  ORDER_BY
 Success!
 Job Stats (time in seconds):
 JobId  Maps  Reduces  MaxMapTime  MinMapTIme  AvgMapTime  MaxReduceTime  MinReduceTime  AvgReduceTime  Alias  Feature  Outputs
 job_local_0001  0  0  0  0  0  0  0  0  raw  MAP_ONLY
 job_local_0002  0  0  0  0  0  0  0  0  rank_sort  SAMPLER
 job_local_0003  0  0  0  0  0  0  0  0  rank_sort  ORDER_BY  Processed/user_visits_table,
 Input(s):
 Successfully read 0 records from: Data/Raw/UserVisits.dat
 Output(s):
 Successfully stored 0 records in: Processed/user_visits_table
 However, when I look in the output:
 $ ls -lh Processed/user_visits_table/CG0/
 total 15250760
 -rwxrwxrwx  1 user  _lpoperator   7.3G Sep 21 21:58 part-0*
 It read a 20G input file and generated some output...
 /email
 Is it that in local mode counters are not available? If so, instead of 
 printing zeros we should print Information Unavailable or some such.




[jira] Commented: (PIG-1641) Incorrect counters in local mode

2010-09-22 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913736#action_12913736
 ] 

Richard Ding commented on PIG-1641:
---

Hadoop counters are not available in local mode (PIG-1286).

So for now I propose that, in local mode,  Pig stats output is changed to 
something like the following:

{code} 
Job Stats (time in seconds):
JobId  Alias Feature Outputs
job_local_0001 raw MAP_ONLY
job_local_0002 rank_sort SAMPLER
job_local_0003 rank_sort ORDER_BY Processed/user_visits_table,

Input(s):
Successfully read records from: Data/Raw/UserVisits.dat

Output(s):
Successfully stored records in: Processed/user_visits_table
{code}
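
The proposal above amounts to leaving the record count out of a line when no counter is available. A minimal Python sketch of that formatting rule follows; the function name `format_io_line` and its parameters are hypothetical and are not taken from the Pig stats code.

```python
def format_io_line(verb, preposition, location, record_count=None):
    """Build one Input(s)/Output(s) line of the stats report.

    record_count is None when no counter is available (for example in
    local mode); the count is then omitted instead of printing a
    misleading 0.
    """
    if record_count is None:
        return f"Successfully {verb} records {preposition}: {location}"
    return f"Successfully {verb} {record_count} records {preposition}: {location}"
```

Called as `format_io_line("read", "from", "Data/Raw/UserVisits.dat")`, this yields the same line as in the proposed output above.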

 Incorrect counters in local mode
 

 Key: PIG-1641
 URL: https://issues.apache.org/jira/browse/PIG-1641
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Ashutosh Chauhan

 User report, not verified.
 email
 HadoopVersion  PigVersion      UserId  StartedAt            FinishedAt           Features
 0.20.2         0.8.0-SNAPSHOT  user    2010-09-21 19:25:58  2010-09-21 21:58:42  ORDER_BY
 Success!
 Job Stats (time in seconds):
 JobId  Maps  Reduces  MaxMapTime  MinMapTIme  AvgMapTime  MaxReduceTime  MinReduceTime  AvgReduceTime  Alias  Feature  Outputs
 job_local_0001  0  0  0  0  0  0  0  0  raw  MAP_ONLY
 job_local_0002  0  0  0  0  0  0  0  0  rank_sort  SAMPLER
 job_local_0003  0  0  0  0  0  0  0  0  rank_sort  ORDER_BY  Processed/user_visits_table,
 Input(s):
 Successfully read 0 records from: Data/Raw/UserVisits.dat
 Output(s):
 Successfully stored 0 records in: Processed/user_visits_table
 However, when I look in the output:
 $ ls -lh Processed/user_visits_table/CG0/
 total 15250760
 -rwxrwxrwx  1 user  _lpoperator   7.3G Sep 21 21:58 part-0*
 It read a 20G input file and generated some output...
 /email
 Is it that in local mode counters are not available? If so, instead of 
 printing zeros we should print Information Unavailable or some such.




[jira] Assigned: (PIG-1641) Incorrect counters in local mode

2010-09-22 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding reassigned PIG-1641:
-

Assignee: Richard Ding

 Incorrect counters in local mode
 

 Key: PIG-1641
 URL: https://issues.apache.org/jira/browse/PIG-1641
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Ashutosh Chauhan
Assignee: Richard Ding

 User report, not verified.
 email
 HadoopVersion  PigVersion      UserId  StartedAt            FinishedAt           Features
 0.20.2         0.8.0-SNAPSHOT  user    2010-09-21 19:25:58  2010-09-21 21:58:42  ORDER_BY
 Success!
 Job Stats (time in seconds):
 JobId  Maps  Reduces  MaxMapTime  MinMapTIme  AvgMapTime  MaxReduceTime  MinReduceTime  AvgReduceTime  Alias  Feature  Outputs
 job_local_0001  0  0  0  0  0  0  0  0  raw  MAP_ONLY
 job_local_0002  0  0  0  0  0  0  0  0  rank_sort  SAMPLER
 job_local_0003  0  0  0  0  0  0  0  0  rank_sort  ORDER_BY  Processed/user_visits_table,
 Input(s):
 Successfully read 0 records from: Data/Raw/UserVisits.dat
 Output(s):
 Successfully stored 0 records in: Processed/user_visits_table
 However, when I look in the output:
 $ ls -lh Processed/user_visits_table/CG0/
 total 15250760
 -rwxrwxrwx  1 user  _lpoperator   7.3G Sep 21 21:58 part-0*
 It read a 20G input file and generated some output...
 /email
 Is it that in local mode counters are not available? If so, instead of 
 printing zeros we should print Information Unavailable or some such.




[jira] Updated: (PIG-1642) Order by doesn't use estimation to determine the parallelism

2010-09-22 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1642:
--

Summary: Order by doesn't use estimation to determine the parallelism  
(was: Order by doesn't use estimation to determine the paralelism)

 Order by doesn't use estimation to determine the parallelism
 

 Key: PIG-1642
 URL: https://issues.apache.org/jira/browse/PIG-1642
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Richard Ding
 Fix For: 0.8.0


 With PIG-1249, a simple heuristic is used to determine the number of reducers 
 if it isn't specified (via PARALLEL or default_parallel). For order by 
 statement, however, it still defaults to 1.




[jira] Created: (PIG-1642) Order by doesn't use estimation to determine the paralelism

2010-09-22 Thread Richard Ding (JIRA)
Order by doesn't use estimation to determine the paralelism
---

 Key: PIG-1642
 URL: https://issues.apache.org/jira/browse/PIG-1642
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Richard Ding
 Fix For: 0.8.0


With PIG-1249, a simple heuristic is used to determine the number of reducers 
if it isn't specified (via PARALLEL or default_parallel). For order by 
statement, however, it still defaults to 1.
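
For illustration, here is a rough Python sketch of an input-size heuristic of the kind referred to above. The function name `estimate_reducers` and the two constants are assumptions made for this example and are not taken from the PIG-1249 patch. The point is only that order by could go through the same estimate instead of falling back to 1.

```python
import math

# Assumed defaults, for illustration only.
BYTES_PER_REDUCER = 1_000_000_000
MAX_REDUCERS = 999


def estimate_reducers(total_input_bytes, requested_parallel=-1,
                      bytes_per_reducer=BYTES_PER_REDUCER,
                      max_reducers=MAX_REDUCERS):
    """Return the reducer count for a job.

    An explicit setting (PARALLEL or default_parallel) wins; otherwise the
    count is derived from the input size and clamped to [1, max_reducers].
    """
    if requested_parallel > 0:
        return requested_parallel
    estimated = math.ceil(total_input_bytes / bytes_per_reducer)
    return max(1, min(max_reducers, estimated))
```

Under these assumed constants, a 20 GB input with no explicit parallelism would get 20 reducers rather than 1.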




[jira] Commented: (PIG-1616) 'union onschema' does not use create output with correct schema when udfs are involved

2010-09-20 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12912696#action_12912696
 ] 

Richard Ding commented on PIG-1616:
---

+1

 'union onschema' does not use create output with correct schema when udfs are 
 involved
 --

 Key: PIG-1616
 URL: https://issues.apache.org/jira/browse/PIG-1616
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Fix For: 0.8.0

 Attachments: PIG-1616.1.patch


 'union onschema' creates a merged schema based on the input schemas. It does 
 that in the queryparser, and at that stage the udf return type used is the 
 default return type.  The actual return type for the udf is determined later 
 in the TypeCheckingVisitor using EvalFunc.getArgsToFuncMapping().
 'union onschema' should use the final type for its input relation to create 
 the merged schema.
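
A simplified Python sketch of a by-name schema merge, to make the dependency on input types concrete. The function `merge_schemas_by_name` and the (name, type) pair representation are hypothetical and do not reflect Pig's Schema classes. The merged result is only as good as the types it is given, which is why the inputs should carry the types resolved after type checking.

```python
def merge_schemas_by_name(left, right):
    """Merge two schemas given as lists of (column_name, type) pairs.

    Columns are matched by name; columns missing from one side are kept.
    Raises ValueError when a name is shared but the types differ, since
    picking a common type is outside the scope of this sketch.
    """
    merged = dict(left)
    for name, col_type in right:
        if name in merged and merged[name] != col_type:
            raise ValueError(
                f"type mismatch for column {name!r}: {merged[name]} vs {col_type}"
            )
        merged.setdefault(name, col_type)
    return list(merged.items())
```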




[jira] Updated: (PIG-1479) Embed Pig in scripting languages

2010-09-17 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1479:
--

Attachment: pig-greek-test.tar

Attached the test script, modified based on Julien's comment. As for the 
command-line option -g, it can also take a single parameter (the script file 
name) and let Pig determine the script engine from the file extension.
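
Choosing the engine from the file extension could look roughly like the Python sketch below. The mapping table and the function name `pick_script_engine` are hypothetical; only the Jython entry is suggested by this thread.

```python
import os

# Hypothetical mapping; only the Jython entry is suggested by this thread.
ENGINE_BY_EXTENSION = {".py": "jython"}


def pick_script_engine(script_path):
    """Return the engine name for a script, chosen by its file extension."""
    extension = os.path.splitext(script_path)[1].lower()
    try:
        return ENGINE_BY_EXTENSION[extension]
    except KeyError:
        raise ValueError(
            f"no script engine registered for {extension!r} files"
        ) from None
```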



 Embed Pig in scripting languages
 

 Key: PIG-1479
 URL: https://issues.apache.org/jira/browse/PIG-1479
 Project: Pig
  Issue Type: New Feature
Reporter: Julien Le Dem
 Attachments: PIG-1479.patch, PIG-1479_2.patch, pig-greek-test.tar, 
 pig-greek-test.tar, pig-greek.tgz


 It should be possible to embed Pig calls in a scripting language and make 
 functions defined in the same script available as UDFs.
 This is a spin off of https://issues.apache.org/jira/browse/PIG-928 which 
 lets users define UDFs in scripting languages.




[jira] Commented: (PIG-1615) Return code from Pig is 0 even if the job fails when using -M flag

2010-09-16 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910407#action_12910407
 ] 

Richard Ding commented on PIG-1615:
---

This problem exists in Pig 0.7 and is fixed in Pig 0.8.

 Return code from Pig is 0 even if the job fails when using -M flag
 --

 Key: PIG-1615
 URL: https://issues.apache.org/jira/browse/PIG-1615
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0, 0.7.0
Reporter: Viraj Bhat
 Fix For: 0.8.0


 I have a Pig script of this form, which I used inside a workflow system such 
 as Oozie.
 {code}
 A = load  '$INPUT' using PigStorage();
 store A into '$OUTPUT';
 {code}
 I run this with multi-query optimization turned off:
 {quote}
 $java -cp ~/pig-svn/trunk/pig.jar:$HADOOP_CONF_DIR org.apache.pig.Main -p 
 INPUT=/user/viraj/junk1 -M -p OUTPUT=/user/viraj/junk2 loadpigstorage.pig
 {quote}
 The directory /user/viraj/junk1 is not present
 I get the following results:
 {quote}
 Input(s):
 Failed to read data from /user/viraj/junk1
 Output(s):
 Failed to produce result in /user/viraj/junk2
 {quote}
 This is expected, but the return code is still 0
 {code}
 $ echo $?
 0
 {code}
 If I run this script with multi-query optimization turned on, it gives a 
 return code of 2, which is correct.
 {code}
 $ java -cp ~/pig-svn/trunk/pig.jar:$HADOOP_CONF_DIR org.apache.pig.Main -p 
 INPUT=/user/viraj/junk1 -p OUTPUT=/user/viraj/junk2 loadpigstorage.pig
 ...
 $ echo $?
 2
 {code}
 I believe the wrong return code from Pig is causing Oozie to believe that the 
 Pig script succeeded.
 Viraj




[jira] Commented: (PIG-1610) 'union onschema' does handle some cases involving 'namespaced' column names in schema

2010-09-16 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910409#action_12910409
 ] 

Richard Ding commented on PIG-1610:
---

+1

 'union onschema' does handle some cases involving 'namespaced' column names 
 in schema
 -

 Key: PIG-1610
 URL: https://issues.apache.org/jira/browse/PIG-1610
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Fix For: 0.8.0

 Attachments: PIG-1610.1.patch, PIG-1610.2.patch


 case 1:
 grunt> describe f;
 f: {l1::a: bytearray,l1::b: bytearray}
 grunt> describe l1;
 l1: {a: bytearray,b: bytearray}
 grunt> dump f;
 (1,11)
 (2,22)
 (3,33)
 grunt> dump l1;
 (1,11)
 (2,22)
 (3,33)
 grunt> u = union onschema f, l1;
 grunt> describe u;
 u: {l1::a: bytearray,l1::b: bytearray}
 -- the dump u gives incorrect results
 grunt> dump u;
 (,)
 (,)
 (,)
 (1,11)
 (2,22)
 (3,33)
 case 2:
 grunt> u = union onschema l1, f;
 grunt> describe u;
 2010-09-13 15:11:13,877 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 1108: Duplicate schema alias: l1::a
 Details at logfile: /Users/tejas/pig_unions_err2/trunk/pig_1284410413970.log




[jira] Commented: (PIG-1609) 'union onschema' should give a more useful error message when schema of one of the relations has null column name

2010-09-14 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12909412#action_12909412
 ] 

Richard Ding commented on PIG-1609:
---

+1

 'union onschema' should give a more useful error message when schema of one 
 of the relations has null column name
 -

 Key: PIG-1609
 URL: https://issues.apache.org/jira/browse/PIG-1609
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Fix For: 0.8.0

 Attachments: PIG-1609.1.patch


 A better error message needs to be given in this case -
 {code}
 grunt> l = load '/tmp/empty.bag' as (i : int);
 grunt> f = foreach l generate i+1;
 grunt> describe f;
 f: {int}
 grunt> u = union onschema l , f;
 2010-09-10 18:08:13,000 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 1000: Error during parsing. Error merging
 schemas for union operator
 Details at logfile: /Users/tejas/pig_nmr_syn/trunk/pig_1284167020897.log
 {code}
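
One way to produce a more useful message is to check for unnamed columns before attempting the merge and to name the offending relation and column position. The Python sketch below is illustrative only; `check_aliases` and the (name, type) pair representation are hypothetical and are not Pig's API.

```python
def check_aliases(relation_name, schema):
    """Raise a descriptive error if any column in the schema has no name.

    schema is a list of (column_name, type) pairs; an unnamed column is
    represented by None, as for the result of 'generate i+1' above.
    """
    for position, (name, col_type) in enumerate(schema):
        if name is None:
            raise ValueError(
                f"union onschema: column {position} (type {col_type}) of "
                f"relation {relation_name!r} has no name"
            )
```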




[jira] Updated: (PIG-1479) Embed Pig in scripting languages

2010-09-14 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1479:
--

Attachment: PIG-1479_2.patch

In the previous patch, the executeScript method on ScriptPigServer returns a 
list of ExecJobs (one for each store statement in the script). Unfortunately, 
the order of ExecJobs in the list is indeterminate.  

This patch fixes this problem by making the executeScript method return a 
PigStats object. One can then retrieve the output by the alias corresponding 
to the store statement.

Here is an example:

{code}
P = pig.executeScript("""
A = load '${input}';
... ...
store G into '${output}'; """)

output = P.result("G")  # an OutputStats object
iter = output.iterator()
if iter.hasNext():
    pass  # do something
else:
    pass  # do something else
{code} 

 Embed Pig in scripting languages
 

 Key: PIG-1479
 URL: https://issues.apache.org/jira/browse/PIG-1479
 Project: Pig
  Issue Type: New Feature
Reporter: Julien Le Dem
 Attachments: PIG-1479.patch, PIG-1479_2.patch, pig-greek.tgz


 It should be possible to embed Pig calls in a scripting language and make 
 functions defined in the same script available as UDFs.
 This is a spin off of https://issues.apache.org/jira/browse/PIG-928 which 
 lets users define UDFs in scripting languages.




[jira] Updated: (PIG-1479) Embed Pig in scripting languages

2010-09-14 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1479:
--

Attachment: pig-greek-test.tar

Attach the updated test program from Julien.

To run the example:

* tar -xvf pig-greek-test.tar
* java -cp pig.jar:<jython jar> org.apache.pig.Main -x local -g script/tc.py

 Embed Pig in scripting languages
 

 Key: PIG-1479
 URL: https://issues.apache.org/jira/browse/PIG-1479
 Project: Pig
  Issue Type: New Feature
Reporter: Julien Le Dem
 Attachments: PIG-1479.patch, PIG-1479_2.patch, pig-greek-test.tar, 
 pig-greek.tgz


 It should be possible to embed Pig calls in a scripting language and make 
 functions defined in the same script available as UDFs.
 This is a spin off of https://issues.apache.org/jira/browse/PIG-928 which 
 lets users define UDFs in scripting languages.




[jira] Updated: (PIG-1562) Fix the version for the dependent packages for the maven

2010-09-13 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1562:
--

  Status: Resolved  (was: Patch Available)
Hadoop Flags: [Reviewed]
  Resolution: Fixed

Patch committed to both trunk and 0.8 branch. Thanks Niraj!

 Fix the version for the dependent packages for the maven 
 -

 Key: PIG-1562
 URL: https://issues.apache.org/jira/browse/PIG-1562
 Project: Pig
  Issue Type: Bug
Reporter: niraj rai
Assignee: niraj rai
 Fix For: 0.8.0

 Attachments: PIG-1562_1.patch, PIG-1562_2.patch, PIG_1562_0.patch


 We need to fix the set version so that the version is properly set for the 
 dependent packages in the Maven repository.




[jira] Resolved: (PIG-630) provide indication that pig script only partially succeeded

2010-09-13 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding resolved PIG-630.
--

 Assignee: Olga Natkovich
Fix Version/s: 0.8.0
   Resolution: Fixed

This jira has been fixed with MultiQuery optimization and Pig Stats.

 provide indication that pig script only partially succeeded
 ---

 Key: PIG-630
 URL: https://issues.apache.org/jira/browse/PIG-630
 Project: Pig
  Issue Type: Bug
Reporter: Olga Natkovich
Assignee: Olga Natkovich
 Fix For: 0.8.0


 Currently, if you have multiple queries (stores/dumps) within the same pig 
 script, the script returns the result of the last one, which does not provide 
 sufficient information to the users. We need to provide the user with the 
 following information:
 - return code that indicates the script only partially succeeded
 - indication which parts have succeeded
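
The two items above can be sketched in Python as follows. The helper names are hypothetical, and the partial-failure value 3 is an assumption made for this example; 0 and 2 are the success and failure codes reported elsewhere in this archive.

```python
SUCCESS = 0
FAILURE = 2
PARTIAL_FAILURE = 3  # assumed value, for illustration only


def script_return_code(store_results):
    """Map per-store outcomes (output name -> succeeded?) to one exit code."""
    outcomes = list(store_results.values())
    if all(outcomes):
        return SUCCESS
    if not any(outcomes):
        return FAILURE
    return PARTIAL_FAILURE


def succeeded_outputs(store_results):
    """List the outputs that were produced, so the user can see which parts ran."""
    return sorted(name for name, ok in store_results.items() if ok)
```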




[jira] Commented: (PIG-1589) add test cases for mapreduce operator which use distributed cache

2010-09-13 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12909061#action_12909061
 ] 

Richard Ding commented on PIG-1589:
---

+1

 add test cases for mapreduce operator which use distributed cache
 -

 Key: PIG-1589
 URL: https://issues.apache.org/jira/browse/PIG-1589
 Project: Pig
  Issue Type: Task
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Fix For: 0.8.0

 Attachments: PIG-1589.1.patch, TestWordCount.jar


 '-files filename' can be specified in the parameters for mapreduce operator 
 to send files to distributed cache. Need to add test cases for that.




[jira] Updated: (PIG-1479) Embed Pig in scripting languages

2010-09-10 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1479:
--

Attachment: PIG-1479.patch

Thanks Julien. I rebased the patch with the latest trunk and added an option 
(-greek) in the Main class.

Now one can run a PIG-Greek script with the following command:

{code}
java -cp pig.jar:<jython jar>:<hadoop config dir> org.apache.pig.Main -g <pig-greek script>
{code}

or in local mode: 

{code}
java -cp pig.jar:<jython jar> org.apache.pig.Main -x local -g <pig-greek script>
{code}


 Embed Pig in scripting languages
 

 Key: PIG-1479
 URL: https://issues.apache.org/jira/browse/PIG-1479
 Project: Pig
  Issue Type: New Feature
Reporter: Julien Le Dem
 Attachments: PIG-1479.patch, pig-greek.tgz


 It should be possible to embed Pig calls in a scripting language and make 
 functions defined in the same script available as UDFs.
 This is a spin off of https://issues.apache.org/jira/browse/PIG-928 which 
 lets users define UDFs in scripting languages.




[jira] Updated: (PIG-1548) Optimize scalar to consolidate the part file

2010-09-03 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1548:
--

Attachment: PIG-1548.patch

 Optimize scalar to consolidate the part file
 

 Key: PIG-1548
 URL: https://issues.apache.org/jira/browse/PIG-1548
 Project: Pig
  Issue Type: Improvement
  Components: impl
Reporter: Daniel Dai
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1548.patch


 The current scalar implementation writes a scalar file onto dfs. When Pig 
 needs the scalar, it opens the dfs file directly. Each scalar file 
 contains more than one part file even though it contains only one record. This 
 puts a huge load on the namenode. We should consolidate the part files before 
 opening them. Another optional step is to put the consolidated file into the 
 distributed cache. This further brings down the load on the namenode.
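
The consolidation step can be sketched in Python as below, using a local directory as a stand-in for the dfs output directory. The function `consolidate_part_files` is hypothetical and is not the code in the attached patch.

```python
import shutil
from pathlib import Path


def consolidate_part_files(output_dir, merged_path):
    """Concatenate every part-* file under output_dir into merged_path.

    Works on a local directory as a stand-in for the dfs output directory.
    Returns the number of part files that were merged.
    """
    parts = sorted(Path(output_dir).glob("part-*"))
    with open(merged_path, "wb") as merged:
        for part in parts:
            with open(part, "rb") as source:
                shutil.copyfileobj(source, merged)
    return len(parts)
```

A reader of the scalar then opens one file instead of one per part, which is where the reduction in namenode requests would come from.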




[jira] Updated: (PIG-1548) Optimize scalar to consolidate the part file

2010-09-03 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1548:
--

Attachment: (was: PIG-1458.patch)

 Optimize scalar to consolidate the part file
 

 Key: PIG-1548
 URL: https://issues.apache.org/jira/browse/PIG-1548
 Project: Pig
  Issue Type: Improvement
  Components: impl
Reporter: Daniel Dai
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1548.patch


 The current scalar implementation writes a scalar file onto dfs. When Pig 
 needs the scalar, it opens the dfs file directly. Each scalar file 
 contains more than one part file even though it contains only one record. This 
 puts a huge load on the namenode. We should consolidate the part files before 
 opening them. Another optional step is to put the consolidated file into the 
 distributed cache. This further brings down the load on the namenode.




[jira] Commented: (PIG-1543) IsEmpty returns the wrong value after using LIMIT

2010-09-03 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906008#action_12906008
 ] 

Richard Ding commented on PIG-1543:
---

+1. Looks good.

 IsEmpty returns the wrong value after using LIMIT
 -

 Key: PIG-1543
 URL: https://issues.apache.org/jira/browse/PIG-1543
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Justin Hu
Assignee: Daniel Dai
 Fix For: 0.8.0

 Attachments: PIG-1543-1.patch


 1. Two input files:
 1a: limit_empty.input_a
 1
 1
 1
 1b: limit_empty.input_b
 2
 2
 2.
 The pig script: limit_empty.pig
 -- A contains only 1's and B contains only 2's
 A = load 'limit_empty.input_a' as (a1:int);
 B = load 'limit_empty.input_b' as (b1:int);
 C =COGROUP A by a1, B by b1;
 D = FOREACH C generate A, B, (IsEmpty(A)? 0:1), (IsEmpty(B)? 0:1), COUNT(A), 
 COUNT(B);
 store D into 'limit_empty.output/d';
 -- After the script done, we see the right results:
 -- {(1),(1),(1)}   {}  1   0   3   0
 -- {} {(2),(2)}  0   1   0   2
 C1 = foreach C { Alim = limit A 1; Blim = limit B 1; generate Alim, Blim; }
 D1 = FOREACH C1 generate Alim,Blim, (IsEmpty(Alim)? 0:1), (IsEmpty(Blim)? 
 0:1), COUNT(Alim), COUNT(Blim);
 store D1 into 'limit_empty.output/d1';
 -- After the script done, we see the unexpected results:
 -- {(1)}   {}1   1   1   0
 -- {}  {(2)} 1   1   0   1
 dump D;
 dump D1;
 3. Run the script and redirect the stdout (2 dumps) to a file. There are two issues:
 The major one:
 IsEmpty() returns FALSE for empty bag in limit_empty.output/d1/*, while 
 IsEmpty() returns correctly in limit_empty.output/d/*.
 The difference is that one has been applied with LIMIT before using 
 IsEmpty().
 The minor one:
 The redirected output only contains the first dump:
 ({(1),(1),(1)},{},1,0,3L,0L)
 ({},{(2),(2)},0,1,0L,2L)
 We expect two more lines like:
 ({(1)},{},1,1,1L,0L)
 ({},{(2)},1,1,0L,1L)
 Besides, there is an error that says:
 [main] ERROR org.apache.pig.backend.hadoop.executionengine.HJob - 
 java.lang.ClassCastException: java.lang.Integer cannot be cast to 
 org.apache.pig.data.Tuple




[jira] Resolved: (PIG-1599) pig gives generic message for few cases

2010-09-03 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding resolved PIG-1599.
---

Hadoop Flags: [Reviewed]
  Resolution: Fixed

Patch is committed to both trunk and 0.8 branch. Thanks Niraj.

 pig gives generic message for few cases
 ---

 Key: PIG-1599
 URL: https://issues.apache.org/jira/browse/PIG-1599
 Project: Pig
  Issue Type: Bug
Reporter: niraj rai
Assignee: niraj rai
 Attachments: pig-1599_0.patch, pig-1599_1.patch


 When we run the script:
 register testudf.jar;
 a = load '/user/pig/tests/data/singlefile/studenttab10k' as (name, age, gpa);
 b = load '/user/pig/tests/data/singlefile/studenttab10k' as (name, age, gpa);
 c = cogroup a by name, b by name;
 d = foreach c generate flatten(org.apache.pig.test.udf.evalfunc.BadUdf(a,b));
 dump d;
 we get the error:
 now we get ERROR 2088: Unable to get results for: 
 hdfs://wilbur20.labs.corp.sp1.yahoo.com:9020/tmp/temp1787360727/tmp509618997:org.apache.pig.impl.io.InterStorage.
 The udf is bad udf and it should throw:
 ERROR 2078: Caught error from UDF: org.apache.pig.test.udf.evalfunc.BadUdf, 
 Out of bounds access [Index: 2, Size: 2]




[jira] Updated: (PIG-1548) Optimize scalar to consolidate the part file

2010-09-03 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1548:
--

  Status: Resolved  (was: Patch Available)
Hadoop Flags: [Reviewed]
  Resolution: Fixed

patch committed to both trunk and 0.8 branch.

 Optimize scalar to consolidate the part file
 

 Key: PIG-1548
 URL: https://issues.apache.org/jira/browse/PIG-1548
 Project: Pig
  Issue Type: Improvement
  Components: impl
Reporter: Daniel Dai
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1548.patch, PIG-1548_1.patch


 The current scalar implementation writes a scalar file to DFS. When Pig 
 needs the scalar, it opens the DFS file directly. Each scalar file 
 contains more than one part file even though it holds only one record. This 
 puts a heavy load on the namenode. We should consolidate the part files before 
 opening them. Another optional step is to put the consolidated file into the 
 distributed cache, which further reduces the load on the namenode.
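
The consolidation idea can be sketched as follows. This is a minimal illustration that uses the local file system through java.nio instead of DFS; the class name, the "part-*" naming pattern and the target file name are assumptions made for the example, not details of the actual patch.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class PartFileMerger {
    // Concatenate every "part-*" file in dir into target, in name order.
    // Returns the number of part files that were merged.
    public static int merge(Path dir, Path target) throws IOException {
        List<Path> parts = new ArrayList<>();
        try (DirectoryStream<Path> ds = Files.newDirectoryStream(dir, "part-*")) {
            for (Path p : ds) {
                parts.add(p);
            }
        }
        Collections.sort(parts);
        try (OutputStream out = Files.newOutputStream(target)) {
            for (Path p : parts) {
                Files.copy(p, out); // append this part's bytes to the target
            }
        }
        return parts.size();
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("scalar");
        Files.write(dir.resolve("part-00000"), new byte[0]); // empty part
        Files.write(dir.resolve("part-00001"), "42\n".getBytes(StandardCharsets.UTF_8));
        int merged = merge(dir, dir.resolve("merged"));
        System.out.println(merged + " part files merged");
    }
}
```

A reader then opens one file instead of one per part, which is the source of the reduced namenode load described above.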




[jira] Commented: (PIG-1334) Make pig artifacts available through maven

2010-09-02 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905744#action_12905744
 ] 

Richard Ding commented on PIG-1334:
---

Scott,

Please create a new Jira for this. Another follow-up jira (PIG-1562) has 
already been opened. 

-Richard

 Make pig artifacts available through maven
 --

 Key: PIG-1334
 URL: https://issues.apache.org/jira/browse/PIG-1334
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: niraj rai
 Fix For: 0.8.0

 Attachments: mvn-pig.patch, mvn_pig_2.patch, mvn_pig_3.patch, 
 mvn_pig_4.patch, mvn_pig_5.patch, mvn_pig_6.patch







[jira] Updated: (PIG-1548) Optimize scalar to consolidate the part file

2010-09-02 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1548:
--

Attachment: PIG-1458.patch


Results of test-patch:

{code}
 [exec] +1 overall.  
 [exec] 
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec] 
 [exec] +1 tests included.  The patch appears to include 3 new or modified tests.
 [exec] 
 [exec] +1 javadoc.  The javadoc tool did not generate any warning messages.
 [exec] 
 [exec] +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
 [exec] 
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs warnings.
 [exec] 
 [exec] +1 release audit.  The applied patch does not increase the total number of release audit warnings.

{code}

 Optimize scalar to consolidate the part file
 

 Key: PIG-1548
 URL: https://issues.apache.org/jira/browse/PIG-1548
 Project: Pig
  Issue Type: Improvement
  Components: impl
Reporter: Daniel Dai
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1458.patch


 The current scalar implementation writes a scalar file to DFS. When Pig 
 needs the scalar, it opens the DFS file directly. Each scalar file 
 contains more than one part file even though it holds only one record. This 
 puts a heavy load on the namenode. We should consolidate the part files before 
 opening them. Another optional step is to put the consolidated file into the 
 distributed cache, which further reduces the load on the namenode.




[jira] Commented: (PIG-1343) pig_log file missing even though Main tells it is creating one and an M/R job fails

2010-08-30 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904267#action_12904267
 ] 

Richard Ding commented on PIG-1343:
---

Patch is committed to the trunk. Thanks Niraj.

 pig_log file missing even though Main tells it is creating one and an M/R job 
 fails 
 

 Key: PIG-1343
 URL: https://issues.apache.org/jira/browse/PIG-1343
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat
Assignee: niraj rai
 Fix For: 0.8.0

 Attachments: 1343.patch, PIG-1343-1.patch, PIG-1343_6.patch, 
 pig_1343_2.patch, pig_1343_4.patch, PIG_1343_5.patch


 There is a particular case where I was running with the latest trunk of Pig.
 {code}
 $java -cp pig.jar:/home/path/hadoop20cluster org.apache.pig.Main testcase.pig
 [main] INFO  org.apache.pig.Main - Logging error messages to: 
 /homes/viraj/pig_1263420012601.log
 $ls -l pig_1263420012601.log
 ls: pig_1263420012601.log: No such file or directory
 {code}
 The job failed and the log file did not contain anything; the only way to 
 debug was to look into the JobTracker logs.
 Here are some reasons that could have caused this behavior:
 1) The underlying filer/NFS had some issues. In that case do we not error on 
 stdout?
 2) There are some errors from the backend which are not being captured
 Viraj




[jira] Updated: (PIG-1343) pig_log file missing even though Main tells it is creating one and an M/R job fails

2010-08-30 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1343:
--

Attachment: PIG-1343_6.patch

 pig_log file missing even though Main tells it is creating one and an M/R job 
 fails 
 

 Key: PIG-1343
 URL: https://issues.apache.org/jira/browse/PIG-1343
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat
Assignee: niraj rai
 Fix For: 0.8.0

 Attachments: 1343.patch, PIG-1343-1.patch, PIG-1343_6.patch, 
 pig_1343_2.patch, pig_1343_4.patch, PIG_1343_5.patch


 There is a particular case where I was running with the latest trunk of Pig.
 {code}
 $java -cp pig.jar:/home/path/hadoop20cluster org.apache.pig.Main testcase.pig
 [main] INFO  org.apache.pig.Main - Logging error messages to: 
 /homes/viraj/pig_1263420012601.log
 $ls -l pig_1263420012601.log
 ls: pig_1263420012601.log: No such file or directory
 {code}
 The job failed and the log file did not contain anything; the only way to 
 debug was to look into the JobTracker logs.
 Here are some reasons that could have caused this behavior:
 1) The underlying filer/NFS had some issues. In that case do we not error on 
 stdout?
 2) There are some errors from the backend which are not being captured
 Viraj




[jira] Updated: (PIG-1343) pig_log file missing even though Main tells it is creating one and an M/R job fails

2010-08-30 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1343:
--

  Status: Resolved  (was: Patch Available)
Hadoop Flags: [Reviewed]
  Resolution: Fixed

 pig_log file missing even though Main tells it is creating one and an M/R job 
 fails 
 

 Key: PIG-1343
 URL: https://issues.apache.org/jira/browse/PIG-1343
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat
Assignee: niraj rai
 Fix For: 0.8.0

 Attachments: 1343.patch, PIG-1343-1.patch, PIG-1343_6.patch, 
 pig_1343_2.patch, pig_1343_4.patch, PIG_1343_5.patch


 There is a particular case where I was running with the latest trunk of Pig.
 {code}
 $java -cp pig.jar:/home/path/hadoop20cluster org.apache.pig.Main testcase.pig
 [main] INFO  org.apache.pig.Main - Logging error messages to: 
 /homes/viraj/pig_1263420012601.log
 $ls -l pig_1263420012601.log
 ls: pig_1263420012601.log: No such file or directory
 {code}
 The job failed and the log file did not contain anything; the only way to 
 debug was to look into the JobTracker logs.
 Here are some reasons that could have caused this behavior:
 1) The underlying filer/NFS had some issues. In that case do we not error on 
 stdout?
 2) There are some errors from the backend which are not being captured
 Viraj




[jira] Commented: (PIG-1570) native mapreduce operator MR job does not follow same failure handling logic as other pig MR jobs

2010-08-30 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904321#action_12904321
 ] 

Richard Ding commented on PIG-1570:
---

+1.

 native mapreduce operator MR job does not follow same failure handling logic 
 as other pig MR jobs
 -

 Key: PIG-1570
 URL: https://issues.apache.org/jira/browse/PIG-1570
 Project: Pig
  Issue Type: Bug
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Fix For: 0.8.0

 Attachments: PIG-1570.1.patch


 The code path for handling failure in MR job corresponding to native MR is 
 different and does not have the same behavior.
 For example, even if the MR job for mapreduce operator fails, the number of 
 jobs that failed is being reported as 0 in PigStats log.




[jira] Updated: (PIG-1458) aggregate files for replicated join

2010-08-30 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1458:
--

Attachment: PIG-1458_1.patch

New patch addressing review comments.

 aggregate files for replicated join
 ---

 Key: PIG-1458
 URL: https://issues.apache.org/jira/browse/PIG-1458
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1458.patch, PIG-1458_1.patch


 We have noticed that if the smaller data in replicated join has many files, 
 this puts an unneeded burden on the name node. Pre-aggregating the files can 
 improve the situation.




[jira] Updated: (PIG-1569) java properties not honored in case of properties such as stop.on.failure

2010-08-30 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1569:
--

Status: Patch Available  (was: Open)

 java properties not honored in case of properties such as stop.on.failure
 -

 Key: PIG-1569
 URL: https://issues.apache.org/jira/browse/PIG-1569
 Project: Pig
  Issue Type: Bug
Reporter: Thejas M Nair
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1569.patch


 In org.apache.pig.Main, properties are being set to default values without 
 checking whether the Java system properties have been set to something else.
 stop.on.failure, opt.multiquery, and aggregate.warning are some properties that 
 have this problem.
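
One way to honor system properties is to apply a default only when no value has been supplied. The sketch below illustrates that idea in plain Java. The helper name and the default values shown are assumptions for the example and do not reflect what org.apache.pig.Main actually does.

```java
import java.util.Properties;

public class PropertyDefaults {
    // Apply a default only when neither the properties object nor the
    // JVM system properties already carry a value for the key.
    public static void setDefault(Properties props, String key, String defaultValue) {
        if (props.getProperty(key) != null) {
            return; // already set explicitly, leave it alone
        }
        String fromSystem = System.getProperty(key);
        props.setProperty(key, fromSystem != null ? fromSystem : defaultValue);
    }

    public static void main(String[] args) {
        // Stands in for launching the JVM with -Dstop.on.failure=true.
        System.setProperty("stop.on.failure", "true");

        Properties props = new Properties();
        setDefault(props, "stop.on.failure", "false"); // hypothetical default
        setDefault(props, "opt.multiquery", "true");   // hypothetical default

        System.out.println("stop.on.failure=" + props.getProperty("stop.on.failure"));
        System.out.println("opt.multiquery=" + props.getProperty("opt.multiquery"));
    }
}
```

With this ordering the value given on the command line wins over the built-in default, which is the behavior the issue asks for.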




[jira] Updated: (PIG-1569) java properties not honored in case of properties such as stop.on.failure

2010-08-30 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1569:
--

Attachment: PIG-1569.patch

 java properties not honored in case of properties such as stop.on.failure
 -

 Key: PIG-1569
 URL: https://issues.apache.org/jira/browse/PIG-1569
 Project: Pig
  Issue Type: Bug
Reporter: Thejas M Nair
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1569.patch


 In org.apache.pig.Main, properties are being set to default values without 
 checking whether the Java system properties have been set to something else.
 stop.on.failure, opt.multiquery, and aggregate.warning are some properties that 
 have this problem.




[jira] Commented: (PIG-1458) aggregate files for replicated join

2010-08-30 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904385#action_12904385
 ] 

Richard Ding commented on PIG-1458:
---

Koji,

Please open a jira on increasing the replication factor of the replicated 
files. Currently they use the default replication factor. 

Thanks,
-Richard 

 aggregate files for replicated join
 ---

 Key: PIG-1458
 URL: https://issues.apache.org/jira/browse/PIG-1458
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1458.patch, PIG-1458_1.patch


 We have noticed that if the smaller data in replicated join has many files, 
 this puts an unneeded burden on the name node. Pre-aggregating the files can 
 improve the situation.




[jira] Updated: (PIG-1569) java properties not honored in case of properties such as stop.on.failure

2010-08-30 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1569:
--

  Status: Resolved  (was: Patch Available)
Hadoop Flags: [Reviewed]
  Resolution: Fixed

 java properties not honored in case of properties such as stop.on.failure
 -

 Key: PIG-1569
 URL: https://issues.apache.org/jira/browse/PIG-1569
 Project: Pig
  Issue Type: Bug
Reporter: Thejas M Nair
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1569.patch


 In org.apache.pig.Main, properties are being set to default values without 
 checking whether the Java system properties have been set to something else.
 stop.on.failure, opt.multiquery, and aggregate.warning are some properties that 
 have this problem.




[jira] Resolved: (PIG-1458) aggregate files for replicated join

2010-08-30 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding resolved PIG-1458.
---

Hadoop Flags: [Reviewed]
  Resolution: Fixed

 aggregate files for replicated join
 ---

 Key: PIG-1458
 URL: https://issues.apache.org/jira/browse/PIG-1458
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1458.patch, PIG-1458_1.patch


 We have noticed that if the smaller data in replicated join has many files, 
 this puts an unneeded burden on the name node. Pre-aggregating the files can 
 improve the situation.




[jira] Commented: (PIG-1458) aggregate files for replicated join

2010-08-30 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904451#action_12904451
 ] 

Richard Ding commented on PIG-1458:
---

Patch committed to trunk.

 aggregate files for replicated join
 ---

 Key: PIG-1458
 URL: https://issues.apache.org/jira/browse/PIG-1458
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1458.patch, PIG-1458_1.patch


 We have noticed that if the smaller data in replicated join has many files, 
 this puts an unneeded burden on the name node. Pre-aggregating the files can 
 improve the situation.




[jira] Commented: (PIG-1483) [piggybank] Add HadoopJobHistoryLoader to the piggybank

2010-08-30 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904453#action_12904453
 ] 

Richard Ding commented on PIG-1483:
---

Patch committed to trunk.

 [piggybank] Add HadoopJobHistoryLoader to the piggybank
 ---

 Key: PIG-1483
 URL: https://issues.apache.org/jira/browse/PIG-1483
 Project: Pig
  Issue Type: New Feature
Reporter: Richard Ding
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1483.patch, PIG-1483_1.patch


 PIG-1333 added many script-related entries to the MR job xml file and thus 
 it's now possible to use Pig for querying Hadoop job history/xml files to get 
 script-level usage statistics. What we need is a Pig loader that can parse 
 these files and generate corresponding data objects.
 The goal of this jira is to create a HadoopJobHistoryLoader in piggybank.
 Here is an example that shows the intended usage:
 *Find all the jobs grouped by script and user:*
 {code}
 a = load '/mapred/history/_logs/history/' using HadoopJobHistoryLoader() as 
 (j:map[], m:map[], r:map[]);
 b = foreach a generate (Chararray) j#'PIG_SCRIPT_ID' as id, (Chararray) 
 j#'USER' as user, (Chararray) j#'JOBID' as job; 
 c = filter b by not (id is null);
 d = group c by (id, user);
 e = foreach d generate flatten(group), c.job;
 dump e;
 {code}
 A couple more examples:
 *Find scripts that use only the default parallelism:*
 {code}
 a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], 
 m:map[], r:map[]);
 b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' 
 as script_name, (Long) r#'NUMBER_REDUCES' as reduces;
 c = group b by (id, user, script_name) parallel 10;
 d = foreach c generate group.user, group.script_name, MAX(b.reduces) as 
 max_reduces;
 e = filter d by max_reduces == 1;
 dump e;
 {code}
 *Find the running time of each script (in seconds):*
 {code}
 a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], 
 m:map[], r:map[]);
 b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' 
 as script_name, (Long) j#'SUBMIT_TIME' as start, (Long) j#'FINISH_TIME' as 
 end;
 c = group b by (id, user, script_name);
 d = foreach c generate group.user, group.script_name, (MAX(b.end) - 
 MIN(b.start))/1000;
 dump d;
 {code}
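
The arithmetic behind the running-time script can be illustrated in plain Java. This sketch assumes SUBMIT_TIME and FINISH_TIME are epoch milliseconds, which the division by 1000 in the script suggests; the class and method names are hypothetical.

```java
public class ScriptRuntime {
    // Wall-clock seconds between the earliest submit and the latest finish.
    // Both arrays are assumed to be non-empty and to hold epoch milliseconds.
    public static long runtimeSeconds(long[] submitMillis, long[] finishMillis) {
        long start = Long.MAX_VALUE;
        long end = Long.MIN_VALUE;
        for (long s : submitMillis) {
            start = Math.min(start, s);
        }
        for (long f : finishMillis) {
            end = Math.max(end, f);
        }
        return (end - start) / 1000; // subtract first, then convert ms to s
    }

    public static void main(String[] args) {
        // Two jobs of one script: 90 000 ms from first submit to last finish.
        long[] submit = {1_000_000L, 1_030_000L};
        long[] finish = {1_020_000L, 1_090_000L};
        System.out.println(runtimeSeconds(submit, finish)); // prints 90
    }
}
```

The subtraction has to be parenthesized before the division; dividing only the start time by 1000 would mix milliseconds with seconds.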




[jira] Commented: (PIG-1557) couple of issue mapping aliases to jobs

2010-08-30 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904456#action_12904456
 ] 

Richard Ding commented on PIG-1557:
---

Patch committed to trunk.

 couple of issue mapping aliases to jobs
 ---

 Key: PIG-1557
 URL: https://issues.apache.org/jira/browse/PIG-1557
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Olga Natkovich
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1557.patch, PIG-1557_1.patch


 I have a simple script:
 A = load '/user/pig/tests/data/singlefile/studenttab10k' as (name, age, gpa);
 B = group A by name;
 C = foreach B generate group, COUNT(A);
 D = order C by $1;
 E = limit D 10;
 dump E;
 I noticed a couple of issues with alias-to-job mapping: neither load(A) nor 
 limit(E) shows up in the output.




[jira] Commented: (PIG-1564) add support for multiple filesystems

2010-08-26 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902952#action_12902952
 ] 

Richard Ding commented on PIG-1564:
---

Hi Andrew,

HDataStorage is a thin layer on top of Hadoop FileSystem. Since Pig moved its 
local mode to Hadoop local mode, it no longer needs this layer. We intend to 
remove it in the future.

As for Pig reading data from one file system and writing it to another, this 
has been supported since Pig 0.7.

-Richard 

 add support for multiple filesystems
 

 Key: PIG-1564
 URL: https://issues.apache.org/jira/browse/PIG-1564
 Project: Pig
  Issue Type: Improvement
Reporter: Andrew Hitchcock
 Attachments: PIG-1564-1.patch


 Currently you can't run Pig scripts that read data from one file system and 
 write it to another. Also, Grunt doesn't support CDing from one directory to 
 another on different file systems.




[jira] Resolved: (PIG-1518) multi file input format for loaders

2010-08-26 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding resolved PIG-1518.
---

Hadoop Flags: [Reviewed]
  Resolution: Fixed

Patch is committed to trunk. Thanks Yan.

 multi file input format for loaders
 ---

 Key: PIG-1518
 URL: https://issues.apache.org/jira/browse/PIG-1518
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, 
 PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch


 We frequently run into the situation where Pig needs to deal with small files 
 in the input. In this case a separate map is created for each file, which 
 could be very inefficient. 
 It would be great to have an umbrella input format that can take multiple 
 files and use them in a single split. We would like to see this working with 
 different data formats if possible.
 There are already a couple of input formats doing a similar thing: 
 MultifileInputFormat as well as CombinedInputFormat; however, neither works 
 with the new Hadoop 20 API. 
 We at least want to do a feasibility study for Pig 0.8.0.
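
To make the idea of packing several small files into one split concrete, here is a minimal greedy sketch in plain Java. It works on file sizes only and is an illustration under assumed names and an assumed size limit, not the algorithm used by the committed patch or by any Hadoop input format.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitCombiner {
    // Greedily pack file sizes, in input order, into splits of at most
    // maxSplitBytes. A file larger than the limit gets a split of its own.
    public static List<List<Long>> combine(long[] fileSizes, long maxSplitBytes) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : fileSizes) {
            // Close the current split if adding this file would exceed the limit.
            if (!current.isEmpty() && currentBytes + size > maxSplitBytes) {
                splits.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) {
        long[] sizes = {10, 20, 30, 100, 5};
        List<List<Long>> splits = combine(sizes, 64);
        // Five input files end up in three splits: [10, 20, 30], [100], [5].
        System.out.println(splits.size() + " splits instead of " + sizes.length);
    }
}
```

Fewer splits means fewer map tasks, which is where the efficiency gain described above would come from.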




[jira] Assigned: (PIG-1569) java properties not honored in case of properties such as stop.on.failure

2010-08-26 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding reassigned PIG-1569:
-

Assignee: Richard Ding

 java properties not honored in case of properties such as stop.on.failure
 -

 Key: PIG-1569
 URL: https://issues.apache.org/jira/browse/PIG-1569
 Project: Pig
  Issue Type: Bug
Reporter: Thejas M Nair
Assignee: Richard Ding
 Fix For: 0.8.0


 In org.apache.pig.Main, properties are being set to default values without 
 checking whether the Java system properties have been set to something else.
 stop.on.failure, opt.multiquery, and aggregate.warning are some properties that 
 have this problem.




[jira] Commented: (PIG-1343) pig_log file missing even though Main tells it is creating one and an M/R job fails

2010-08-26 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903072#action_12903072
 ] 

Richard Ding commented on PIG-1343:
---


The new patch logs an NPE instead of the intended message:

{code}
[main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2999: Unexpected internal 
error. null
{code}
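
The trailing "null" in a message like this is what appears when an exception's getMessage() is concatenated into the log line and the exception carries no message. A small plain-Java sketch of the effect and of one possible fallback follows; the class and method names are hypothetical and this is not the Grunt code itself.

```java
public class ErrorMessage {
    // Build a log line, falling back to the exception class name when the
    // exception carries no message, so the line never ends in "null".
    public static String describe(Throwable t) {
        String msg = t.getMessage();
        return "Unexpected internal error. " + (msg != null ? msg : t.getClass().getName());
    }

    public static void main(String[] args) {
        // Constructed explicitly without a message, so getMessage() is null.
        Throwable npe = new NullPointerException();
        System.out.println("Unexpected internal error. " + npe.getMessage());
        System.out.println(describe(npe));
    }
}
```

The first line printed ends in "null", as in the log above; the second names the exception class instead.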

 pig_log file missing even though Main tells it is creating one and an M/R job 
 fails 
 

 Key: PIG-1343
 URL: https://issues.apache.org/jira/browse/PIG-1343
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat
Assignee: niraj rai
 Fix For: 0.8.0

 Attachments: 1343.patch, PIG-1343-1.patch, pig_1343_2.patch


 There is a particular case where I was running with the latest trunk of Pig.
 {code}
 $java -cp pig.jar:/home/path/hadoop20cluster org.apache.pig.Main testcase.pig
 [main] INFO  org.apache.pig.Main - Logging error messages to: 
 /homes/viraj/pig_1263420012601.log
 $ls -l pig_1263420012601.log
 ls: pig_1263420012601.log: No such file or directory
 {code}
 The job failed and the log file did not contain anything; the only way to 
 debug was to look into the JobTracker logs.
 Here are some reasons that could have caused this behavior:
 1) The underlying filer/NFS had some issues. In that case do we not error on 
 stdout?
 2) There are some errors from the backend which are not being captured
 Viraj




[jira] Updated: (PIG-1458) aggregate files for replicated join

2010-08-26 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1458:
--

Attachment: PIG-1458.patch

This patch uses the new multi-file-combiner (PIG-1518) to concatenate many 
small files for replicated join. This is based on the assumption that the total 
size of the replicated files should be small enough to fit into main memory. 

 aggregate files for replicated join
 ---

 Key: PIG-1458
 URL: https://issues.apache.org/jira/browse/PIG-1458
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1458.patch


 We have noticed that if the smaller data in replicated join has many files, 
 this puts an unneeded burden on the name node. Pre-aggregating the files can 
 improve the situation.




[jira] Commented: (PIG-1551) Improve dynamic invokers to deal with no-arg methods and array parameters

2010-08-24 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901992#action_12901992
 ] 

Richard Ding commented on PIG-1551:
---


The typo is still there:

{code}
private static final Class<?> LONG_ARRAY_CLASS = new Long[0].getClass();
{code}

It seems what you want is 

{code}
private static final Class<?> LONG_ARRAY_CLASS = new long[0].getClass();
{code}

so it's consistent with other array classes.

This does raise a question about array parameters: the first form applies to 
methods like _amethod(Long[] nums)_, while the second supports methods like 
_amethod(long[] nums)_. The two are not interchangeable. 
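
The difference between the two array classes can be checked with a short reflection sketch. The class and method names below are made up for the example; only the standard Class.getMethod lookup is used.

```java
public class ArrayParamLookup {
    // Takes an array of primitives.
    public static long sumPrimitive(long[] nums) {
        long total = 0;
        for (long n : nums) {
            total += n;
        }
        return total;
    }

    // Takes an array of boxed values.
    public static long sumBoxed(Long[] nums) {
        long total = 0;
        for (Long n : nums) {
            total += n;
        }
        return total;
    }

    // True if cls has a public method with this name and exactly this parameter type.
    public static boolean hasMethod(Class<?> cls, String name, Class<?> paramType) {
        try {
            cls.getMethod(name, paramType);
            return true;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        Class<?> primitiveArray = new long[0].getClass(); // same as long[].class
        Class<?> boxedArray = new Long[0].getClass();     // same as Long[].class
        System.out.println(hasMethod(ArrayParamLookup.class, "sumPrimitive", primitiveArray)); // true
        System.out.println(hasMethod(ArrayParamLookup.class, "sumPrimitive", boxedArray));     // false
        System.out.println(hasMethod(ArrayParamLookup.class, "sumBoxed", boxedArray));         // true
    }
}
```

Reflection matches parameter types exactly, so a lookup with Long[] does not find a method declared with long[], and the reverse is also true.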

 Improve dynamic invokers to deal with no-arg methods and array parameters
 -

 Key: PIG-1551
 URL: https://issues.apache.org/jira/browse/PIG-1551
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.8.0

 Attachments: PIG-1551.patch, PIG_1551.2.patch


 PIG-1354 introduced a set of UDFs that can be used to dynamically wrap simple 
 Java methods in a UDF, so that users don't need to create trivial wrappers if 
 they are ok sacrificing some speed.
 This issue is to extend the set of methods that can be wrapped this way to 
 include methods that do not take any arguments, and methods that take arrays 
 of {int,long,float,double,string} as arguments. 
 Arrays are expected to be represented by bags in Pig. Notably, this allows 
 users to wrap statistical functions in o.a.commons.math.stat.StatUtils . 




[jira] Commented: (PIG-1343) pig_log file missing even though Main tells it is creating one and an M/R job fails

2010-08-24 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902030#action_12902030
 ] 

Richard Ding commented on PIG-1343:
---

The log file is created when running in batch mode, but not in interactive mode.

 pig_log file missing even though Main tells it is creating one and an M/R job 
 fails 
 

 Key: PIG-1343
 URL: https://issues.apache.org/jira/browse/PIG-1343
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat
Assignee: niraj rai
 Fix For: 0.8.0

 Attachments: 1343.patch, PIG-1343-1.patch


 There is a particular case where I was running with the latest trunk of Pig.
 {code}
 $java -cp pig.jar:/home/path/hadoop20cluster org.apache.pig.Main testcase.pig
 [main] INFO  org.apache.pig.Main - Logging error messages to: 
 /homes/viraj/pig_1263420012601.log
 $ls -l pig_1263420012601.log
 ls: pig_1263420012601.log: No such file or directory
 {code}
 The job failed and the log file did not contain anything; the only way to 
 debug was to look into the JobTracker logs.
 Here are some reasons that could have caused this behavior:
 1) The underlying filer/NFS had some issues. In that case do we not error on 
 stdout?
 2) There are some errors from the backend which are not being captured
 Viraj




[jira] Commented: (PIG-1551) Improve dynamic invokers to deal with no-arg methods and array parameters

2010-08-24 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902042#action_12902042
 ] 

Richard Ding commented on PIG-1551:
---

+1.

I'm fine with arrays of primitive types. I can't think of a Java method that 
takes an array of Long objects as a parameter.

 Improve dynamic invokers to deal with no-arg methods and array parameters
 -

 Key: PIG-1551
 URL: https://issues.apache.org/jira/browse/PIG-1551
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.8.0

 Attachments: PIG-1551.patch, PIG_1551.2.patch, PIG_1551.3.patch


 PIG-1354 introduced a set of UDFs that can be used to dynamically wrap simple 
 Java methods in a UDF, so that users don't need to create trivial wrappers if 
 they are ok sacrificing some speed.
 This issue is to extend the set of methods that can be wrapped this way to 
 include methods that do not take any arguments, and methods that take arrays 
 of {int,long,float,double,string} as arguments. 
 Arrays are expected to be represented by bags in Pig. Notably, this allows 
 users to wrap statistical functions in o.a.commons.math.stat.StatUtils . 




[jira] Updated: (PIG-1483) [piggybank] Add HadoopJobHistoryLoader to the piggybank

2010-08-24 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1483:
--

Attachment: PIG-1483_1.patch

New patch adding unit test.

 [piggybank] Add HadoopJobHistoryLoader to the piggybank
 ---

 Key: PIG-1483
 URL: https://issues.apache.org/jira/browse/PIG-1483
 Project: Pig
  Issue Type: New Feature
Reporter: Richard Ding
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1483.patch, PIG-1483_1.patch


 PIG-1333 added many script-related entries to the MR job xml file and thus 
 it's now possible to use Pig for querying Hadoop job history/xml files to get 
 script-level usage statistics. What we need is a Pig loader that can parse 
 these files and generate corresponding data objects.
 The goal of this jira is to create a HadoopJobHistoryLoader in piggybank.
 Here is an example that shows the intended usage:
 *Find all the jobs grouped by script and user:*
 {code}
 a = load '/mapred/history/_logs/history/' using HadoopJobHistoryLoader() as 
 (j:map[], m:map[], r:map[]);
 b = foreach a generate (Chararray) j#'PIG_SCRIPT_ID' as id, (Chararray) 
 j#'USER' as user, (Chararray) j#'JOBID' as job; 
 c = filter b by not (id is null);
 d = group c by (id, user);
 e = foreach d generate flatten(group), c.job;
 dump e;
 {code}
 A couple more examples:
 *Find scripts that use only the default parallelism:*
 {code}
 a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], 
 m:map[], r:map[]);
 b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' 
 as script_name, (Long) r#'NUMBER_REDUCES' as reduces;
 c = group b by (id, user, script_name) parallel 10;
 d = foreach c generate group.user, group.script_name, MAX(b.reduces) as 
 max_reduces;
 e = filter d by max_reduces == 1;
 dump e;
 {code}
 *Find the running time of each script (in seconds):*
 {code}
 a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], 
 m:map[], r:map[]);
 b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' 
 as script_name, (Long) j#'SUBMIT_TIME' as start, (Long) j#'FINISH_TIME' as 
 end;
 c = group b by (id, user, script_name);
 d = foreach c generate group.user, group.script_name, (MAX(b.end) - 
 MIN(b.start))/1000;
 dump d;
 {code}




[jira] Updated: (PIG-1483) [piggybank] Add HadoopJobHistoryLoader to the piggybank

2010-08-24 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1483:
--

Status: Patch Available  (was: Open)

 [piggybank] Add HadoopJobHistoryLoader to the piggybank
 ---

 Key: PIG-1483
 URL: https://issues.apache.org/jira/browse/PIG-1483
 Project: Pig
  Issue Type: New Feature
Reporter: Richard Ding
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1483.patch, PIG-1483_1.patch


 PIG-1333 added many script-related entries to the MR job xml file and thus 
 it's now possible to use Pig for querying Hadoop job history/xml files to get 
 script-level usage statistics. What we need is a Pig loader that can parse 
 these files and generate corresponding data objects.
 The goal of this jira is to create a HadoopJobHistoryLoader in piggybank.
 Here is an example that shows the intended usage:
 *Find all the jobs grouped by script and user:*
 {code}
 a = load '/mapred/history/_logs/history/' using HadoopJobHistoryLoader() as 
 (j:map[], m:map[], r:map[]);
 b = foreach a generate (Chararray) j#'PIG_SCRIPT_ID' as id, (Chararray) 
 j#'USER' as user, (Chararray) j#'JOBID' as job; 
 c = filter b by not (id is null);
 d = group c by (id, user);
 e = foreach d generate flatten(group), c.job;
 dump e;
 {code}
 A couple more examples:
 *Find scripts that use only the default parallelism:*
 {code}
 a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], 
 m:map[], r:map[]);
 b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' 
 as script_name, (Long) r#'NUMBER_REDUCES' as reduces;
 c = group b by (id, user, script_name) parallel 10;
 d = foreach c generate group.user, group.script_name, MAX(b.reduces) as 
 max_reduces;
 e = filter d by max_reduces == 1;
 dump e;
 {code}
 *Find the running time of each script (in seconds):*
 {code}
 a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], 
 m:map[], r:map[]);
 b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' 
 as script_name, (Long) j#'SUBMIT_TIME' as start, (Long) j#'FINISH_TIME' as 
 end;
 c = group b by (id, user, script_name);
 d = foreach c generate group.user, group.script_name, (MAX(b.end) - 
 MIN(b.start))/1000;
 dump d;
 {code}




[jira] Updated: (PIG-1557) couple of issue mapping aliases to jobs

2010-08-24 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1557:
--

Attachment: PIG-1557_1.patch

New patch adds a unit test.

 couple of issue mapping aliases to jobs
 ---

 Key: PIG-1557
 URL: https://issues.apache.org/jira/browse/PIG-1557
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Olga Natkovich
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1557.patch, PIG-1557_1.patch


 I have a simple script:
 A = load '/user/pig/tests/data/singlefile/studenttab10k' as (name, age, gpa);
 B = group A by name;
 C = foreach B generate group, COUNT(A);
 D = order C by $1;
 E = limit D 10;
 dump E;
 I noticed a couple of issues with alias to job mapping: neither load(A) nor 
 limit(E) shows in the output




[jira] Updated: (PIG-1557) couple of issue mapping aliases to jobs

2010-08-24 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1557:
--

  Status: Patch Available  (was: Open)
Hadoop Flags: [Reviewed]

 couple of issue mapping aliases to jobs
 ---

 Key: PIG-1557
 URL: https://issues.apache.org/jira/browse/PIG-1557
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Olga Natkovich
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1557.patch, PIG-1557_1.patch


 I have a simple script:
 A = load '/user/pig/tests/data/singlefile/studenttab10k' as (name, age, gpa);
 B = group A by name;
 C = foreach B generate group, COUNT(A);
 D = order C by $1;
 E = limit D 10;
 dump E;
 I noticed a couple of issues with alias to job mapping: neither load(A) nor 
 limit(E) shows in the output




[jira] Updated: (PIG-1557) couple of issue mapping aliases to jobs

2010-08-24 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1557:
--

Status: Resolved  (was: Patch Available)
Resolution: Fixed

 couple of issue mapping aliases to jobs
 ---

 Key: PIG-1557
 URL: https://issues.apache.org/jira/browse/PIG-1557
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Olga Natkovich
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1557.patch, PIG-1557_1.patch


 I have a simple script:
 A = load '/user/pig/tests/data/singlefile/studenttab10k' as (name, age, gpa);
 B = group A by name;
 C = foreach B generate group, COUNT(A);
 D = order C by $1;
 E = limit D 10;
 dump E;
 I noticed a couple of issues with alias to job mapping: neither load(A) nor 
 limit(E) shows in the output




[jira] Updated: (PIG-1505) support jars and scripts in dfs

2010-08-23 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1505:
--

Release Note: Pig now supports running scripts and registering jars that 
are stored in HDFS, Amazon S3, or other distributed file systems.   (was: Pig 
now supports running scripts and registering jars that are stored in HDFS, 
Amazon S3, or other distributed file systems. Also added a -R parameter which 
allows users to specify properties in key=value form on the command line.)

Remove -R option. In 0.8 Pig supports generic parameters such as -Dkey=value. 

 support jars and scripts in dfs
 ---

 Key: PIG-1505
 URL: https://issues.apache.org/jira/browse/PIG-1505
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.7.0
Reporter: Andrew Hitchcock
Assignee: Andrew Hitchcock
 Fix For: 0.8.0

 Attachments: PIG-1505-4.patch, pig-jars-and-scripts-from-dfs-3.patch, 
 pig-jars-and-scripts-from-dfs-trunk-1.patch, 
 pig-jars-and-scripts-from-dfs-trunk-2.patch, 
 pig-jars-and-scripts-from-dfs-trunk.patch


 Pig can't operate on files stored in Amazon S3.




[jira] Commented: (PIG-1518) multi file input format for loaders

2010-08-23 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901600#action_12901600
 ] 

Richard Ding commented on PIG-1518:
---

+1. The patch looks good.

A few minor points:

* In PigSplit, the method add(InputSplit split) is not used and can be removed.
* In MapRedUtil, it would be better not to leave the debug verification code in 
the source code.
* In PigRecordReader, the code can be simplified if the initNextRecordReader() 
call is moved from the constructor to the initialize() method.

 multi file input format for loaders
 ---

 Key: PIG-1518
 URL: https://issues.apache.org/jira/browse/PIG-1518
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: PIG-1518.patch, PIG-1518.patch


 We frequently run into the situation where Pig needs to deal with small files 
 in the input. In this case a separate map is created for each file, which 
 can be very inefficient. 
 It would be great to have an umbrella input format that can take multiple 
 files and use them in a single split. We would like to see this working with 
 different data formats if possible.
 There are already a couple of input formats doing a similar thing: 
 MultifileInputFormat as well as CombinedInputFormat; however, neither works 
 with the new Hadoop 20 API. 
 We at least want to do a feasibility study for Pig 0.8.0.
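
As a rough illustration of the idea in the description above (several small files sharing one split), here is a minimal Java sketch. It is not the PIG-1518 patch and does not use the Hadoop API; the class name CombinedSplitSketch, the method combine and the maxSplitBytes parameter are assumptions made up for this example. Files are represented only by their sizes and are packed greedily, in order, until the next file would push a split over the limit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: greedy packing of small files into combined splits.
class CombinedSplitSketch {

    // Groups file sizes, in input order, into splits of at most maxSplitBytes.
    // A single file larger than the limit still gets a split of its own.
    static List<List<Long>> combine(List<Long> fileSizes, long maxSplitBytes) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : fileSizes) {
            // Close the current split if adding this file would exceed the limit.
            if (!current.isEmpty() && currentBytes + size > maxSplitBytes) {
                splits.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) {
        List<Long> sizes = Arrays.asList(10L, 20L, 30L, 40L, 100L, 5L);
        System.out.println(combine(sizes, 60L));
    }
}
```

With one map per file the six inputs above would need six maps; the greedy grouping needs fewer. A real input format would also have to consider block locations, which this sketch ignores.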




[jira] Commented: (PIG-1551) Improve dynamic invokers to deal with no-arg methods and array parameters

2010-08-23 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901656#action_12901656
 ] 

Richard Ding commented on PIG-1551:
---


In Invoker.java, there is a typo:

{code}
private static final Class<?> LONG_ARRAY_CLASS = new String[0].getClass();
{code}

Also, in the unPrimitivize method, this code seems unnecessary:

{code}
} else if (klass.equals(DOUBLE_ARRAY_CLASS)) {
return DOUBLE_ARRAY_CLASS;
{code}

Otherwise the patch looks good.
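
To make the two remarks above concrete, here is a small Java sketch. It is not the code from the patch: the class UnPrimitivizeSketch and the body of unPrimitivize are assumptions, written on the guess that the method maps primitive classes to their wrappers and returns anything else unchanged. Under that assumption an explicit branch that returns an array class for the same array class adds nothing, and the constant for long arrays would presumably be built from long[] rather than String[].

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; the real Invoker.java may be organised differently.
class UnPrimitivizeSketch {

    // Assumed intent of the constant flagged in the comment above.
    static final Class<?> LONG_ARRAY_CLASS = long[].class;

    private static final Map<Class<?>, Class<?>> WRAPPERS = new HashMap<>();
    static {
        WRAPPERS.put(int.class, Integer.class);
        WRAPPERS.put(long.class, Long.class);
        WRAPPERS.put(float.class, Float.class);
        WRAPPERS.put(double.class, Double.class);
    }

    // Maps a primitive class to its wrapper; every other class, including
    // array classes such as double[].class, is returned as-is.
    static Class<?> unPrimitivize(Class<?> klass) {
        Class<?> wrapper = WRAPPERS.get(klass);
        return wrapper != null ? wrapper : klass;
    }

    public static void main(String[] args) {
        System.out.println(unPrimitivize(double.class));
        System.out.println(unPrimitivize(double[].class) == double[].class);
        System.out.println(LONG_ARRAY_CLASS == new long[0].getClass());
    }
}
```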

 Improve dynamic invokers to deal with no-arg methods and array parameters
 -

 Key: PIG-1551
 URL: https://issues.apache.org/jira/browse/PIG-1551
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.8.0

 Attachments: PIG-1551.patch


 PIG-1354 introduced a set of UDFs that can be used to dynamically wrap simple 
 Java methods in a UDF, so that users don't need to create trivial wrappers if 
 they are ok sacrificing some speed.
 This issue is to extend the set of methods that can be wrapped this way to 
 include methods that do not take any arguments, and methods that take arrays 
 of {int,long,float,double,string} as arguments. 
 Arrays are expected to be represented by bags in Pig. Notably, this allows 
 users to wrap statistical functions in o.a.commons.math.stat.StatUtils . 
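
The mechanism described above can be sketched in plain Java reflection. This is only an illustration under stated assumptions, not the Invoker implementation: ArrayInvokerSketch, invokeStatic and toDoubleArray are made-up names, a java.util.List stands in for a Pig bag, and the local mean method stands in for a library function such as one in StatUtils.

```java
import java.lang.reflect.Method;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of reflective invocation with array and no-arg methods.
class ArrayInvokerSketch {

    // Stand-in for a library method that takes a double[] argument.
    public static double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return values.length == 0 ? Double.NaN : sum / values.length;
    }

    // Stand-in for a method that takes no arguments.
    public static String greeting() {
        return "hello";
    }

    // Converts a list of numbers (standing in for a bag) into a double[].
    static double[] toDoubleArray(List<? extends Number> bag) {
        double[] out = new double[bag.size()];
        for (int i = 0; i < out.length; i++) {
            out[i] = bag.get(i).doubleValue();
        }
        return out;
    }

    // Looks up a public static method by name and parameter types and calls it.
    static Object invokeStatic(Class<?> owner, String name,
                               Class<?>[] paramTypes, Object... args) {
        try {
            Method m = owner.getMethod(name, paramTypes);
            return m.invoke(null, args);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot invoke " + name, e);
        }
    }

    public static void main(String[] args) {
        double[] values = toDoubleArray(Arrays.asList(1, 2, 3));
        System.out.println(invokeStatic(ArrayInvokerSketch.class, "mean",
                new Class<?>[] {double[].class}, values));
        System.out.println(invokeStatic(ArrayInvokerSketch.class, "greeting",
                new Class<?>[0]));
    }
}
```

The reflective lookup and call on every invocation is where the speed is sacrificed compared with a hand-written wrapper.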




[jira] Created: (PIG-1560) Build target 'checkstyle' fails

2010-08-23 Thread Richard Ding (JIRA)
Build target 'checkstyle' fails
---

 Key: PIG-1560
 URL: https://issues.apache.org/jira/browse/PIG-1560
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Richard Ding
Assignee: Giridharan Kesavan
 Fix For: 0.8.0



Stack trace:

{code}
/homes/rding/apache-pig/trunk/build.xml:894: java.lang.NoClassDefFoundError: org/apache/commons/logging/LogFactory
    at org.apache.commons.beanutils.ConvertUtilsBean.init(ConvertUtilsBean.java:130)
    at com.puppycrawl.tools.checkstyle.api.AutomaticBean.createBeanUtilsBean(AutomaticBean.java:73)
    at com.puppycrawl.tools.checkstyle.api.AutomaticBean.contextualize(AutomaticBean.java:222)
    at com.puppycrawl.tools.checkstyle.CheckStyleTask.createChecker(CheckStyleTask.java:372)
    at com.puppycrawl.tools.checkstyle.CheckStyleTask.realExecute(CheckStyleTask.java:304)
    at com.puppycrawl.tools.checkstyle.CheckStyleTask.execute(CheckStyleTask.java:265)
    at org.apache.tools.ant.UnknownElement.execute(UnknownElement.java:291)
    at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.tools.ant.dispatch.DispatchUtils.execute(DispatchUtils.java:106)
    at org.apache.tools.ant.Task.perform(Task.java:348)
    at org.apache.tools.ant.Target.execute(Target.java:390)
    at org.apache.tools.ant.Target.performTasks(Target.java:411)
    at org.apache.tools.ant.Project.executeSortedTargets(Project.java:1360)
    at org.apache.tools.ant.Project.executeTarget(Project.java:1329)
    at org.apache.tools.ant.helper.DefaultExecutor.executeTargets(DefaultExecutor.java:41)
    at org.apache.tools.ant.Project.executeTargets(Project.java:1212)
    at org.apache.tools.ant.Main.runBuild(Main.java:801)
    at org.apache.tools.ant.Main.startAnt(Main.java:218)
    at org.apache.tools.ant.launch.Launcher.run(Launcher.java:280)
    at org.apache.tools.ant.launch.Launcher.main(Launcher.java:109)
Caused by: java.lang.ClassNotFoundException: org.apache.commons.logging.LogFactory
    at org.apache.tools.ant.AntClassLoader.findClassInComponents(AntClassLoader.java:1386)
    at org.apache.tools.ant.AntClassLoader.findClass(AntClassLoader.java:1336)
    at org.apache.tools.ant.AntClassLoader.loadClass(AntClassLoader.java:1074)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
    ... 22 more
{code}
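
The trace shows the class loader used for the checkstyle task failing to find org.apache.commons.logging.LogFactory. A quick way to check whether a given class is visible to a given class loader is sketched below in Java; ClasspathCheckSketch and isLoadable are hypothetical names, and this is a diagnostic aid only, not a fix for the build file.

```java
// Hypothetical diagnostic sketch: is a class visible to a class loader?
class ClasspathCheckSketch {

    // Returns true if the named class can be located by the given loader.
    // The class is not initialized, so no static initializers run.
    static boolean isLoadable(String className, ClassLoader loader) {
        try {
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        ClassLoader loader = ClasspathCheckSketch.class.getClassLoader();
        String name = "org.apache.commons.logging.LogFactory";
        System.out.println(name + " loadable: " + isLoadable(name, loader));
    }
}
```

The result depends on the classpath the sketch is run with; it only says something about the checkstyle target if it is run with the same classpath that target uses.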




[jira] Updated: (PIG-1560) Build target 'checkstyle' fails

2010-08-23 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1560:
--

Description: 
Stack trace:

{code}
/trunk/build.xml:894: java.lang.NoClassDefFoundError: org/apache/commons/logging/LogFactory
    at org.apache.commons.beanutils.ConvertUtilsBean.init(ConvertUtilsBean.java:130)
    at com.puppycrawl.tools.checkstyle.api.AutomaticBean.createBeanUtilsBean(AutomaticBean.java:73)
    at com.puppycrawl.tools.checkstyle.api.AutomaticBean.contextualize(AutomaticBean.java:222)
    at com.puppycrawl.tools.checkstyle.CheckStyleTask.createChecker(CheckStyleTask.java:372)
    at com.puppycrawl.tools.checkstyle.CheckStyleTask.realExecute(CheckStyleTask.java:304)
    at com.puppycrawl.tools.checkstyle.CheckStyleTask.execute(CheckStyleTask.java:265)
    at org.apache.tools.ant.UnknownElement.execute(UnknownElement.java:291)
    at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.tools.ant.dispatch.DispatchUtils.execute(DispatchUtils.java:106)
    at org.apache.tools.ant.Task.perform(Task.java:348)
    at org.apache.tools.ant.Target.execute(Target.java:390)
    at org.apache.tools.ant.Target.performTasks(Target.java:411)
    at org.apache.tools.ant.Project.executeSortedTargets(Project.java:1360)
    at org.apache.tools.ant.Project.executeTarget(Project.java:1329)
    at org.apache.tools.ant.helper.DefaultExecutor.executeTargets(DefaultExecutor.java:41)
    at org.apache.tools.ant.Project.executeTargets(Project.java:1212)
    at org.apache.tools.ant.Main.runBuild(Main.java:801)
    at org.apache.tools.ant.Main.startAnt(Main.java:218)
    at org.apache.tools.ant.launch.Launcher.run(Launcher.java:280)
    at org.apache.tools.ant.launch.Launcher.main(Launcher.java:109)
Caused by: java.lang.ClassNotFoundException: org.apache.commons.logging.LogFactory
    at org.apache.tools.ant.AntClassLoader.findClassInComponents(AntClassLoader.java:1386)
    at org.apache.tools.ant.AntClassLoader.findClass(AntClassLoader.java:1336)
    at org.apache.tools.ant.AntClassLoader.loadClass(AntClassLoader.java:1074)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
    ... 22 more
{code}

  was:

Stack trace:

{code}
/homes/rding/apache-pig/trunk/build.xml:894: java.lang.NoClassDefFoundError: org/apache/commons/logging/LogFactory
    at org.apache.commons.beanutils.ConvertUtilsBean.init(ConvertUtilsBean.java:130)
    at com.puppycrawl.tools.checkstyle.api.AutomaticBean.createBeanUtilsBean(AutomaticBean.java:73)
    at com.puppycrawl.tools.checkstyle.api.AutomaticBean.contextualize(AutomaticBean.java:222)
    at com.puppycrawl.tools.checkstyle.CheckStyleTask.createChecker(CheckStyleTask.java:372)
    at com.puppycrawl.tools.checkstyle.CheckStyleTask.realExecute(CheckStyleTask.java:304)
    at com.puppycrawl.tools.checkstyle.CheckStyleTask.execute(CheckStyleTask.java:265)
    at org.apache.tools.ant.UnknownElement.execute(UnknownElement.java:291)
    at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.tools.ant.dispatch.DispatchUtils.execute(DispatchUtils.java:106)
    at org.apache.tools.ant.Task.perform(Task.java:348)
    at org.apache.tools.ant.Target.execute(Target.java:390)
    at org.apache.tools.ant.Target.performTasks(Target.java:411)
    at org.apache.tools.ant.Project.executeSortedTargets(Project.java:1360)
    at org.apache.tools.ant.Project.executeTarget(Project.java:1329)
    at org.apache.tools.ant.helper.DefaultExecutor.executeTargets(DefaultExecutor.java:41)
    at org.apache.tools.ant.Project.executeTargets(Project.java:1212)
    at org.apache.tools.ant.Main.runBuild(Main.java:801)
    at org.apache.tools.ant.Main.startAnt(Main.java:218)
    at org.apache.tools.ant.launch.Launcher.run(Launcher.java:280)
    at org.apache.tools.ant.launch.Launcher.main(Launcher.java:109)
Caused by: java.lang.ClassNotFoundException: org.apache.commons.logging.LogFactory
    at org.apache.tools.ant.AntClassLoader.findClassInComponents(AntClassLoader.java:1386)
    at org.apache.tools.ant.AntClassLoader.findClass(AntClassLoader.java:1336)
    at org.apache.tools.ant.AntClassLoader.loadClass(AntClassLoader.java:1074)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
    ... 22 more
{code}


 Build target 'checkstyle' fails
 ---

 Key: PIG-1560
 URL: 

[jira] Updated: (PIG-1557) couple of issue mapping aliases to jobs

2010-08-23 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1557:
--

Attachment: PIG-1557.patch

The alias for load statement is missing. Add load alias to the alias list.

 couple of issue mapping aliases to jobs
 ---

 Key: PIG-1557
 URL: https://issues.apache.org/jira/browse/PIG-1557
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Olga Natkovich
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1557.patch


 I have a simple script:
 A = load '/user/pig/tests/data/singlefile/studenttab10k' as (name, age, gpa);
 B = group A by name;
 C = foreach B generate group, COUNT(A);
 D = order C by $1;
 E = limit D 10;
 dump E;
 I noticed a couple of issues with alias to job mapping: neither load(A) nor 
 limit(E) shows in the output




[jira] Updated: (PIG-1557) couple of issue mapping aliases to jobs

2010-08-23 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1557:
--

Fix Version/s: 0.8.0

 couple of issue mapping aliases to jobs
 ---

 Key: PIG-1557
 URL: https://issues.apache.org/jira/browse/PIG-1557
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Olga Natkovich
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1557.patch


 I have a simple script:
 A = load '/user/pig/tests/data/singlefile/studenttab10k' as (name, age, gpa);
 B = group A by name;
 C = foreach B generate group, COUNT(A);
 D = order C by $1;
 E = limit D 10;
 dump E;
 I noticed a couple of issues with alias to job mapping: neither load(A) nor 
 limit(E) shows in the output




[jira] Commented: (PIG-1505) support jars and scripts in dfs

2010-08-20 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900811#action_12900811
 ] 

Richard Ding commented on PIG-1505:
---


The results of test-patch:

{code}
 [exec] +1 overall.
 [exec] 
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec] 
 [exec] +1 tests included.  The patch appears to include 3 new or modified tests.
 [exec] 
 [exec] +1 javadoc.  The javadoc tool did not generate any warning messages.
 [exec] 
 [exec] +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
 [exec] 
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs warnings.
 [exec] 
 [exec] +1 release audit.  The applied patch does not increase the total number of release audit warnings.
{code}

I'll commit the patch after running unit tests.




 support jars and scripts in dfs
 ---

 Key: PIG-1505
 URL: https://issues.apache.org/jira/browse/PIG-1505
 Project: Pig
  Issue Type: Improvement
Reporter: Andrew Hitchcock
Assignee: Andrew Hitchcock
 Attachments: PIG-1505-4.patch, pig-jars-and-scripts-from-dfs-3.patch, 
 pig-jars-and-scripts-from-dfs-trunk-1.patch, 
 pig-jars-and-scripts-from-dfs-trunk-2.patch, 
 pig-jars-and-scripts-from-dfs-trunk.patch


 Pig can't operate on files stored in Amazon S3.




[jira] Updated: (PIG-1505) support jars and scripts in dfs

2010-08-20 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1505:
--

Fix Version/s: 0.8.0
Affects Version/s: 0.7.0

 support jars and scripts in dfs
 ---

 Key: PIG-1505
 URL: https://issues.apache.org/jira/browse/PIG-1505
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.7.0
Reporter: Andrew Hitchcock
Assignee: Andrew Hitchcock
 Fix For: 0.8.0

 Attachments: PIG-1505-4.patch, pig-jars-and-scripts-from-dfs-3.patch, 
 pig-jars-and-scripts-from-dfs-trunk-1.patch, 
 pig-jars-and-scripts-from-dfs-trunk-2.patch, 
 pig-jars-and-scripts-from-dfs-trunk.patch


 Pig can't operate on files stored in Amazon S3.




[jira] Updated: (PIG-1334) Make pig artifacts available through maven

2010-08-20 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1334:
--

Hadoop Flags: [Reviewed]
Release Note: 
ant mvn-install: installs the artifact to the local filesystem.
ant mvn-deploy: deploys snapshots to the Apache Nexus repo (looks for 
authentication in ~/.m2/settings.xml).
ant mvn-deploy -Drepo=staging: deploys artifacts for voting before a release; 
this also requires authentication configured in ~/.m2/settings.xml.
Deploying artifacts to the staging repository requires signing the artifacts 
with gpg keys; the mvn-deploy target takes care of signing them. When 
mvn-deploy is run with -Drepo=staging, it asks for the gpg passphrase, 
which needs to be keyed in. Once the deployment is successful, make the 
artifact available in the staging repository by logging in to the staging 
repository and closing the staging by right-clicking the staged artifact at 
http://repository.apache.org


  was:
ant mvn-install: installs the artifact to the local filesystem.
ant mvn-deploy: deploys snapshots to the Apache Nexus repo (looks for 
authentication in ~/.m2/settings.xml).
ant mvn-deploy -Drepo=staging: deploys artifacts for voting before a release; 
this also requires authentication configured in ~/.m2/settings.xml.
Deploying artifacts to the staging repository requires signing the artifacts 
with gpg keys; the mvn-deploy target takes care of signing them. When 
mvn-deploy is run with -Drepo=staging, it asks for the gpg passphrase, 
which needs to be keyed in. Once the deployment is successful, make the 
artifact available in the staging repository by logging in to the staging 
repository and closing the staging by right-clicking the staged artifact at 
http://repository.apache.org
With this patch I have already uploaded artifacts to the staging repository 
(only people with committer access will be able to view them, as the repository 
is not closed yet).


 Make pig artifacts available through maven
 --

 Key: PIG-1334
 URL: https://issues.apache.org/jira/browse/PIG-1334
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: niraj rai
 Fix For: 0.8.0

 Attachments: mvn-pig.patch, mvn_pig_2.patch, mvn_pig_3.patch, 
 mvn_pig_4.patch, mvn_pig_5.patch, mvn_pig_6.patch







[jira] Updated: (PIG-1334) Make pig artifacts available through maven

2010-08-20 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1334:
--

Status: Resolved  (was: Patch Available)
Resolution: Fixed

The patch is committed to the trunk. Thanks Niraj for making this feature 
available.

 Make pig artifacts available through maven
 --

 Key: PIG-1334
 URL: https://issues.apache.org/jira/browse/PIG-1334
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: niraj rai
 Fix For: 0.8.0

 Attachments: mvn-pig.patch, mvn_pig_2.patch, mvn_pig_3.patch, 
 mvn_pig_4.patch, mvn_pig_5.patch, mvn_pig_6.patch







[jira] Updated: (PIG-1505) support jars and scripts in dfs

2010-08-20 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1505:
--

  Status: Resolved  (was: Patch Available)
Hadoop Flags: [Reviewed]
  Resolution: Fixed

All core tests passed. The patch is committed to the trunk. 

Thanks Andrew for contributing this feature!

 support jars and scripts in dfs
 ---

 Key: PIG-1505
 URL: https://issues.apache.org/jira/browse/PIG-1505
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.7.0
Reporter: Andrew Hitchcock
Assignee: Andrew Hitchcock
 Fix For: 0.8.0

 Attachments: PIG-1505-4.patch, pig-jars-and-scripts-from-dfs-3.patch, 
 pig-jars-and-scripts-from-dfs-trunk-1.patch, 
 pig-jars-and-scripts-from-dfs-trunk-2.patch, 
 pig-jars-and-scripts-from-dfs-trunk.patch


 Pig can't operate on files stored in Amazon S3.




[jira] Commented: (PIG-1514) Migrate logical optimization rule: OpLimitOptimizer

2010-08-19 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900376#action_12900376
 ] 

Richard Ding commented on PIG-1514:
---

Patch looks good. A couple of comments:

* It would be better to refactor the graph manipulation code into a helper 
class so that the graph transformation routines (such as swap, insert, remove, 
replace, ...) can be shared by all rules.
* Please remove tabs from the file. 
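
To illustrate the kind of shared helper suggested in the first point, here is a minimal Java sketch of one such routine, swap. It is an assumption-laden example, not Pig's plan API: the plan is modelled as a map from operator name to an ordered list of successor names, and PlanEditSketch and swap are hypothetical names. The routine assumes first has second as its only successor.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a shared graph-transformation helper.
class PlanEditSketch {

    // Swaps two adjacent operators: pred -> first -> second -> rest
    // becomes pred -> second -> first -> rest.
    static void swap(Map<String, List<String>> successors,
                     String first, String second) {
        List<String> afterSecond = new ArrayList<>(
                successors.getOrDefault(second, new ArrayList<>()));
        // Every edge that pointed at first now points at second.
        for (List<String> outs : successors.values()) {
            outs.replaceAll(n -> n.equals(first) ? second : n);
        }
        // second now feeds first, and first takes over second's old successors.
        successors.put(second, new ArrayList<>(Arrays.asList(first)));
        successors.put(first, afterSecond);
    }

    public static void main(String[] args) {
        Map<String, List<String>> plan = new LinkedHashMap<>();
        plan.put("load", new ArrayList<>(Arrays.asList("foreach")));
        plan.put("foreach", new ArrayList<>(Arrays.asList("limit")));
        plan.put("limit", new ArrayList<>(Arrays.asList("store")));
        plan.put("store", new ArrayList<>());
        swap(plan, "foreach", "limit");
        System.out.println(plan);
    }
}
```

Keeping routines like this in one helper class means each rule only states which operators to move, not how the edges are rewired.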

 Migrate logical optimization rule: OpLimitOptimizer
 ---

 Key: PIG-1514
 URL: https://issues.apache.org/jira/browse/PIG-1514
 Project: Pig
  Issue Type: Sub-task
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Xuefu Zhang
 Fix For: 0.8.0

 Attachments: jira-1514-0.patch







[jira] Commented: (PIG-1334) Make pig artifacts available through maven

2010-08-19 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900518#action_12900518
 ] 

Richard Ding commented on PIG-1334:
---

The new output is at 
https://repository.apache.org/content/repositories/snapshots/org/apache/hadoop/pig/0.8.0-SNAPSHOT/

 Make pig artifacts available through maven
 --

 Key: PIG-1334
 URL: https://issues.apache.org/jira/browse/PIG-1334
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: niraj rai
 Fix For: 0.8.0

 Attachments: mvn-pig.patch, mvn_pig_2.patch, mvn_pig_3.patch, 
 mvn_pig_4.patch, mvn_pig_5.patch, mvn_pig_6.patch







[jira] Updated: (PIG-1452) to remove hadoop20.jar from lib and use hadoop from the apache maven repo.

2010-08-18 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1452:
--

Status: Resolved  (was: Patch Available)
Resolution: Fixed

 to remove hadoop20.jar from lib and use hadoop from the apache maven repo.
 --

 Key: PIG-1452
 URL: https://issues.apache.org/jira/browse/PIG-1452
 Project: Pig
  Issue Type: Improvement
  Components: build
Affects Versions: 0.8.0
Reporter: Giridharan Kesavan
Assignee: Giridharan Kesavan
 Fix For: 0.8.0

 Attachments: PIG-1452.PATCH, PIG-1452_3.patch, PIG-1452V2.PATCH, 
 PIG-1452V4.PATCH


 Pig uses Ivy for dependency management, but it still uses hadoop20.jar from 
 the lib folder. 
 Now that we have the hadoop-0.20.2 artifacts available in the maven repo, pig 
 should leverage ivy for resolving/retrieving hadoop artifacts.




[jira] Commented: (PIG-1497) Mandatory rule PartitionFilterOptimizer

2010-08-18 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900100#action_12900100
 ] 

Richard Ding commented on PIG-1497:
---

Looks good. A few comments:

In _PartitionFilterPushDown_:

* In the _check_ method, why was the condition changed from

{code}
if(... || sucs.size() != 1 || ...) {
{code}

 to 

{code}
if(... || succeds.size() == 0 || ...)
{code}

* In _transform_ method, the original code

{code}
// remove this filter from the plan  
mPlan.removeAndReconnect(loFilter);
{code}

is replaced by its own implementation. It seems better to also migrate the 
_removeAndReconnect_ to the new _OperatorPlan_ since the logic there is more 
complicated (keeping the order of connections). 

* The javadoc for the class isn't migrated.

* Several variables (e.g. loadFunc, loLoad, loFilter, ...) now have scope 
within the _PartitionFilterPushDownTransformer_ class, so it would be better to 
put them inside the transformer class.

In addition,

* Need to remove all the tabs from the files and replace them with 4 spaces.
* Several unit tests now fail due to the dependency on other jiras.
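
To illustrate the _removeAndReconnect_ point above (keeping the order of connections), here is a minimal sketch. It is not the Pig _OperatorPlan_ code: the class OrderedPlan, its string node names, and the simplifying precondition of exactly one predecessor and one successor are all assumptions made for the example. The idea is to overwrite the removed node's slot in each neighbour's list rather than disconnect and append.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical plan graph with ordered edge lists; not Pig's OperatorPlan. */
final class OrderedPlan {
    private final Map<String, List<String>> successors = new HashMap<>();
    private final Map<String, List<String>> predecessors = new HashMap<>();

    void add(String node) {
        successors.putIfAbsent(node, new ArrayList<>());
        predecessors.putIfAbsent(node, new ArrayList<>());
    }

    void connect(String from, String to) {
        add(from);
        add(to);
        successors.get(from).add(to);
        predecessors.get(to).add(from);
    }

    List<String> successorsOf(String node) { return successors.get(node); }

    List<String> predecessorsOf(String node) { return predecessors.get(node); }

    /**
     * Removes a node that has exactly one predecessor and one successor
     * (a simplifying assumption) and links the two neighbours directly.
     * The neighbour lists are updated in place, so the position the removed
     * node occupied in each list is kept.
     */
    void removeAndReconnect(String node) {
        List<String> preds = predecessors.get(node);
        List<String> succs = successors.get(node);
        if (preds == null || preds.size() != 1 || succs.size() != 1) {
            throw new IllegalStateException("expected one predecessor and one successor: " + node);
        }
        String pred = preds.get(0);
        String succ = succs.get(0);

        // Overwrite the removed node's slot instead of removing and appending.
        List<String> predSuccs = successors.get(pred);
        predSuccs.set(predSuccs.indexOf(node), succ);
        List<String> succPreds = predecessors.get(succ);
        succPreds.set(succPreds.indexOf(node), pred);

        successors.remove(node);
        predecessors.remove(node);
    }
}
```

A disconnect-then-connect implementation would append the new edge at the end of each list, which changes the input order of operators such as a join; the in-place replacement avoids that.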

 Mandatory rule PartitionFilterOptimizer
 ---

 Key: PIG-1497
 URL: https://issues.apache.org/jira/browse/PIG-1497
 Project: Pig
  Issue Type: Sub-task
  Components: impl
Affects Versions: 0.8.0
Reporter: Daniel Dai
Assignee: Xuefu Zhang
 Fix For: 0.8.0

 Attachments: jira-1497-0.patch


 Need to migrate PartitionFilterOptimizer to new logical optimizer.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1452) to remove hadoop20.jar from lib and use hadoop from the apache maven repo.

2010-08-17 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1452:
--

Attachment: PIG-1452V4.PATCH

New patch fixing the contrib projects. 

 to remove hadoop20.jar from lib and use hadoop from the apache maven repo.
 --

 Key: PIG-1452
 URL: https://issues.apache.org/jira/browse/PIG-1452
 Project: Pig
  Issue Type: Improvement
  Components: build
Affects Versions: 0.8.0
Reporter: Giridharan Kesavan
Assignee: Giridharan Kesavan
 Fix For: 0.8.0

 Attachments: PIG-1452.PATCH, PIG-1452_3.patch, PIG-1452V2.PATCH, 
 PIG-1452V4.PATCH


 Pig uses Ivy for dependency management, but it still uses hadoop20.jar from 
 the lib folder. 
 Now that we have the hadoop-0.20.2 artifacts available in the maven repo, pig 
 should leverage ivy for resolving/retrieving hadoop artifacts.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1452) to remove hadoop20.jar from lib and use hadoop from the apache maven repo.

2010-08-17 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1452:
--

Status: Open  (was: Patch Available)

 to remove hadoop20.jar from lib and use hadoop from the apache maven repo.
 --

 Key: PIG-1452
 URL: https://issues.apache.org/jira/browse/PIG-1452
 Project: Pig
  Issue Type: Improvement
  Components: build
Affects Versions: 0.8.0
Reporter: Giridharan Kesavan
Assignee: Giridharan Kesavan
 Fix For: 0.8.0

 Attachments: PIG-1452.PATCH, PIG-1452_3.patch, PIG-1452V2.PATCH, 
 PIG-1452V4.PATCH


 Pig uses Ivy for dependency management, but it still uses hadoop20.jar from 
 the lib folder. 
 Now that we have the hadoop-0.20.2 artifacts available in the maven repo, pig 
 should leverage ivy for resolving/retrieving hadoop artifacts.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1452) to remove hadoop20.jar from lib and use hadoop from the apache maven repo.

2010-08-17 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1452:
--

Status: Patch Available  (was: Open)

 to remove hadoop20.jar from lib and use hadoop from the apache maven repo.
 --

 Key: PIG-1452
 URL: https://issues.apache.org/jira/browse/PIG-1452
 Project: Pig
  Issue Type: Improvement
  Components: build
Affects Versions: 0.8.0
Reporter: Giridharan Kesavan
Assignee: Giridharan Kesavan
 Fix For: 0.8.0

 Attachments: PIG-1452.PATCH, PIG-1452_3.patch, PIG-1452V2.PATCH, 
 PIG-1452V4.PATCH


 Pig uses Ivy for dependency management, but it still uses hadoop20.jar from 
 the lib folder. 
 Now that we have the hadoop-0.20.2 artifacts available in the maven repo, pig 
 should leverage ivy for resolving/retrieving hadoop artifacts.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1452) to remove hadoop20.jar from lib and use hadoop from the apache maven repo.

2010-08-17 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12899631#action_12899631
 ] 

Richard Ding commented on PIG-1452:
---

The target buildJar-withouthadoop doesn't depend on hadoop20.jar so this 
change doesn't affect this target.

 to remove hadoop20.jar from lib and use hadoop from the apache maven repo.
 --

 Key: PIG-1452
 URL: https://issues.apache.org/jira/browse/PIG-1452
 Project: Pig
  Issue Type: Improvement
  Components: build
Affects Versions: 0.8.0
Reporter: Giridharan Kesavan
Assignee: Giridharan Kesavan
 Fix For: 0.8.0

 Attachments: PIG-1452.PATCH, PIG-1452_3.patch, PIG-1452V2.PATCH, 
 PIG-1452V4.PATCH


 Pig uses Ivy for dependency management, but it still uses hadoop20.jar from 
 the lib folder. 
 Now that we have the hadoop-0.20.2 artifacts available in the maven repo, pig 
 should leverage ivy for resolving/retrieving hadoop artifacts.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1392) Parser fails to recognize valid field

2010-08-16 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1392:
--

  Status: Resolved  (was: Patch Available)
Hadoop Flags: [Reviewed]
  Resolution: Fixed

The parser bug is fixed, but the script then runs into another problem, which 
is tracked by PIG-1545. The workaround is to disable the secondary key 
optimization.

The patch is committed to the trunk.

 Parser fails to recognize valid field
 -

 Key: PIG-1392
 URL: https://issues.apache.org/jira/browse/PIG-1392
 Project: Pig
  Issue Type: Bug
Reporter: Ankur
Assignee: niraj rai
 Fix For: 0.8.0

 Attachments: nested_parser.patch


 With the script below, the parser fails to recognize a valid field in the 
 relation and throws an error:
 A = LOAD '/tmp' as (a:int, b:chararray, c:int);
 B = GROUP A BY (a, b);
 C = FOREACH B { bg = A.(b,c); GENERATE group, bg; } ;
 The error thrown is
 2010-04-23 10:16:20,610 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 1000: Error during parsing. Invalid alias: c in {group: (a: int,b: 
 chararray),A: {a: int,b: chararray,c: int}}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1392) Parser fails to recognize valid field

2010-08-16 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12899003#action_12899003
 ] 

Richard Ding commented on PIG-1392:
---

Thanks Niraj for fixing this issue.

 Parser fails to recognize valid field
 -

 Key: PIG-1392
 URL: https://issues.apache.org/jira/browse/PIG-1392
 Project: Pig
  Issue Type: Bug
Reporter: Ankur
Assignee: niraj rai
 Fix For: 0.8.0

 Attachments: nested_parser.patch


 With the script below, the parser fails to recognize a valid field in the 
 relation and throws an error:
 A = LOAD '/tmp' as (a:int, b:chararray, c:int);
 B = GROUP A BY (a, b);
 C = FOREACH B { bg = A.(b,c); GENERATE group, bg; } ;
 The error thrown is
 2010-04-23 10:16:20,610 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 1000: Error during parsing. Invalid alias: c in {group: (a: int,b: 
 chararray),A: {a: int,b: chararray,c: int}}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1334) Make pig artifacts available through maven

2010-08-16 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12899053#action_12899053
 ] 

Richard Ding commented on PIG-1334:
---

bq. 2. This jar is 11MB and includes a bunch of dependencies, many of which are 
optional:

We should deploy _pig-0.8.0-SNAPSHOT-core.jar_ (which contains only Pig classes) 
instead of _pig-0.8.0-SNAPSHOT.jar_ (which also contains dependent jars).

 Make pig artifacts available through maven
 --

 Key: PIG-1334
 URL: https://issues.apache.org/jira/browse/PIG-1334
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: niraj rai
 Fix For: 0.8.0

 Attachments: mvn-pig.patch, mvn_pig_2.patch, mvn_pig_3.patch, 
 mvn_pig_4.patch




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1452) to remove hadoop20.jar from lib and use hadoop from the apache maven repo.

2010-08-16 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1452:
--

Attachment: PIG-1452_3.patch

I resynced the patch with the trunk and the size of pig.jar now is about 8M.

 to remove hadoop20.jar from lib and use hadoop from the apache maven repo.
 --

 Key: PIG-1452
 URL: https://issues.apache.org/jira/browse/PIG-1452
 Project: Pig
  Issue Type: Improvement
  Components: build
Affects Versions: 0.8.0
Reporter: Giridharan Kesavan
Assignee: Giridharan Kesavan
 Fix For: 0.8.0

 Attachments: PIG-1452.PATCH, PIG-1452_3.patch, PIG-1452V2.PATCH


 Pig uses Ivy for dependency management, but it still uses hadoop20.jar from 
 the lib folder. 
 Now that we have the hadoop-0.20.2 artifacts available in the maven repo, pig 
 should leverage ivy for resolving/retrieving hadoop artifacts.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1541) FR Join shouldn't match null values

2010-08-13 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1541:
--

Attachment: PIG-1541_1.patch

New patch to address the general case where the join key is a tuple.

 FR Join shouldn't match null values
 ---

 Key: PIG-1541
 URL: https://issues.apache.org/jira/browse/PIG-1541
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Richard Ding
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1541.patch, PIG-1541_1.patch


 Here is an example:
 Data input:
 {code}
 1   1
 2
 {code}
 the script 
 {code}
 a = load 'input';
 b = load 'input';
 c = join a by $0, b by $0 using 'repl';
 dump c; 
 {code}
 generates results that match null values:
 {code}
 (1,1,1,1)
 (,2,,2)
 {code}
 The regular join, on the other hand, gives the correct results.
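
 The expected behaviour can be sketched as a small in-memory hash join that 
 never matches null keys. This is only an illustration, not the code in the 
 attached patches: the class NullSafeReplicatedJoin and its methods are 
 hypothetical, rows are modelled as plain lists, and treating a tuple key with 
 any null field as a null key is an assumption.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a replicated (hash) join that never matches null keys. */
final class NullSafeReplicatedJoin {

    /**
     * A key counts as null if it is null itself or, for a tuple key modelled
     * as a List, if any of its fields is null (an assumption of this sketch).
     */
    static boolean isNullKey(Object key) {
        if (key == null) {
            return true;
        }
        if (key instanceof List) {
            for (Object field : (List<?>) key) {
                if (field == null) {
                    return true;
                }
            }
        }
        return false;
    }

    /** Joins left and right on the column at keyIndex, skipping null keys on both sides. */
    static List<List<Object>> join(List<List<Object>> left,
                                   List<List<Object>> right,
                                   int keyIndex) {
        // Build phase: hash the replicated (right) input, leaving out null keys.
        Map<Object, List<List<Object>>> table = new HashMap<>();
        for (List<Object> row : right) {
            Object key = row.get(keyIndex);
            if (isNullKey(key)) {
                continue;
            }
            table.computeIfAbsent(key, k -> new ArrayList<>()).add(row);
        }

        // Probe phase: a null key on the left side never produces output.
        List<List<Object>> out = new ArrayList<>();
        for (List<Object> row : left) {
            Object key = row.get(keyIndex);
            if (isNullKey(key)) {
                continue;
            }
            List<List<Object>> matches =
                    table.getOrDefault(key, Collections.<List<Object>>emptyList());
            for (List<Object> match : matches) {
                List<Object> joined = new ArrayList<>(row);
                joined.addAll(match);
                out.add(joined);
            }
        }
        return out;
    }
}
```

 With the two input rows from the example, (1, 1) and (null, 2), joined on 
 column 0 against themselves, only the row with the non-null key is produced.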

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1448) Detach tuple from inner plans of physical operator

2010-08-13 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12898450#action_12898450
 ] 

Richard Ding commented on PIG-1448:
---

+1. Looks good.

 Detach tuple from inner plans of physical operator 
 ---

 Key: PIG-1448
 URL: https://issues.apache.org/jira/browse/PIG-1448
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.1.0, 0.2.0, 0.3.0, 0.4.0, 0.5.0, 0.6.0, 0.7.0
Reporter: Ashutosh Chauhan
Assignee: Thejas M Nair
 Fix For: 0.8.0

 Attachments: multi_oom_filt.pig, PIG-1448.1.patch


 This is a follow-up on PIG-1446 which only addresses this general problem for 
 a specific instance of For Each. In general, all the physical operators which 
 can have inner plans are vulnerable to this. A few of them include 
 POLocalRearrange, POFilter, POCollectedGroup etc.  Need to fix all of these.  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1541) FR Join shouldn't match null values

2010-08-12 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12897866#action_12897866
 ] 

Richard Ding commented on PIG-1541:
---


Results of test-patch:

{code}
 [exec] +1 overall.  
 [exec] 
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec] 
 [exec] +1 tests included.  The patch appears to include 6 new or modified tests.
 [exec] 
 [exec] +1 javadoc.  The javadoc tool did not generate any warning 
messages.
 [exec] 
 [exec] +1 javac.  The applied patch does not increase the total number 
of javac compiler warnings.
 [exec] 
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs 
warnings.
 [exec] 
 [exec] +1 release audit.  The applied patch does not increase the 
total number of release audit warnings.
{code}

 FR Join shouldn't match null values
 ---

 Key: PIG-1541
 URL: https://issues.apache.org/jira/browse/PIG-1541
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Richard Ding
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1541.patch


 Here is an example:
 Data input:
 {code}
 1   1
 2
 {code}
 the script 
 {code}
 a = load 'input';
 b = load 'input';
 c = join a by $0, b by $0 using 'repl';
 dump c; 
 {code}
 generates results that match null values:
 {code}
 (1,1,1,1)
 (,2,,2)
 {code}
 The regular join, on the other hand, gives the correct results.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1458) aggregate files for replicated join

2010-08-11 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12897451#action_12897451
 ] 

Richard Ding commented on PIG-1458:
---

The proposal is to run another map-reduce job to merge the small files before 
the replicated join. This additional job will be added to the MR plan at the 
compile time.

We consider three cases of a replicated join: 

# The right input is a map-only job and input files exist at the compile time.
# The right input is a map-only job and input files do not exist at the compile 
time.
# The right input is a map-reduce job.

For 1., if the number of files exceeds the threshold specified in the property 
file (_pig.frjoin.merge.files.threshold_), a merge job is added between right 
input job and FR join job.

For 3., if the number of reducers exceeds the threshold specified in the 
property file (_pig.frjoin.merge.files.threshold_), a merge job is added 
between right input job and FR join job.

For 2., if the flag specified in the property file 
(_pig.frjoin.merge.files.optimistic_) is false,  a merge job is added between 
right input job and FR join job. The default value of this flag is false. 
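
The three cases above can be summarized in a small decision function. This is a sketch of the proposal only, not the compiler code: the class FrJoinMergeDecision, the enum RightInput, and the fallback threshold of 100 are hypothetical; only the two property names and the default of false for the optimistic flag come from the description above.

```java
import java.util.Properties;

/** Hypothetical sketch of the merge-job decision for a replicated join. */
final class FrJoinMergeDecision {

    /** The three shapes of the right input considered in the proposal. */
    enum RightInput { MAP_ONLY_FILES_EXIST, MAP_ONLY_FILES_MISSING, MAP_REDUCE }

    static final String THRESHOLD_KEY = "pig.frjoin.merge.files.threshold";
    static final String OPTIMISTIC_KEY = "pig.frjoin.merge.files.optimistic";

    // Assumed fallback; the proposal does not state a default threshold.
    static final String ASSUMED_DEFAULT_THRESHOLD = "100";

    /**
     * @param kind  which of the three cases applies
     * @param count number of input files (case 1) or reducers (case 3);
     *              ignored in case 2, where the files do not exist yet
     * @return true if a merge job should be added between the right input
     *         job and the FR join job
     */
    static boolean needsMergeJob(RightInput kind, int count, Properties props) {
        switch (kind) {
            case MAP_ONLY_FILES_EXIST:
            case MAP_REDUCE: {
                int threshold = Integer.parseInt(
                        props.getProperty(THRESHOLD_KEY, ASSUMED_DEFAULT_THRESHOLD));
                return count > threshold;
            }
            case MAP_ONLY_FILES_MISSING: {
                // The flag defaults to false, so a merge job is added by default.
                boolean optimistic = Boolean.parseBoolean(
                        props.getProperty(OPTIMISTIC_KEY, "false"));
                return !optimistic;
            }
            default:
                throw new IllegalArgumentException("unknown right input kind: " + kind);
        }
    }
}
```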



 aggregate files for replicated join
 ---

 Key: PIG-1458
 URL: https://issues.apache.org/jira/browse/PIG-1458
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Richard Ding
 Fix For: 0.8.0


 We have noticed that if the smaller data in replicated join has many files, 
 this puts an unneeded burden on the name node. Pre-aggregating the files can 
 improve the situation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-103) Shared Job /tmp location should be configurable

2010-08-11 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-103:
-

Tags: documentation

 Shared Job /tmp location should be configurable
 ---

 Key: PIG-103
 URL: https://issues.apache.org/jira/browse/PIG-103
 Project: Pig
  Issue Type: Improvement
  Components: impl
 Environment: Partially shared file:// filesystem (eg NFS)
Reporter: Craig Macdonald
Assignee: niraj rai
 Fix For: 0.8.0

 Attachments: conf_tmp_dir.patch, conf_tmp_dir_2.patch


 Hello,
 I'm investigating running pig in an environment where various parts of the 
 file:// filesystem are available on all nodes. I can tell hadoop to use a 
 file:// file system location for its default, by setting 
 fs.default.name=file://path/to/shared/folder
 However, this creates issues for Pig, as Pig writes its job information in a 
 folder that it assumes is a shared FS (eg DFS). However, in this scenario 
 /tmp is not shared on each machine.
 So /tmp should either be configurable, or Hadoop should tell you the actual 
 full location set in fs.default.name?
 Straightforward solution is to make /tmp/ a property in 
 src/org/apache/pig/impl/io/FileLocalizer.java init(PigContext)
 Any suggestions of property names?
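
 As a rough sketch of that solution, the temporary directory could be read from 
 a property with /tmp as the fallback. The class TempDirConfig and the property 
 name pig.temp.dir are placeholders for illustration only; this is not the 
 FileLocalizer code.

```java
import java.util.Properties;

/** Hypothetical helper that resolves the temp directory from a property. */
final class TempDirConfig {

    // Placeholder property name, chosen for this example only.
    static final String TEMP_DIR_KEY = "pig.temp.dir";

    // The location that is currently hard-coded according to the issue.
    static final String DEFAULT_TEMP_DIR = "/tmp";

    /** Returns the configured temp directory, or /tmp if unset or blank. */
    static String resolveTempDir(Properties props) {
        String dir = props.getProperty(TEMP_DIR_KEY, DEFAULT_TEMP_DIR).trim();
        return dir.isEmpty() ? DEFAULT_TEMP_DIR : dir;
    }
}
```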

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1458) aggregate files for replicated join

2010-08-11 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12897484#action_12897484
 ] 

Richard Ding commented on PIG-1458:
---

For 1. and 2. above, another approach is to do nothing and rely on 
MultiFileInputFormat (PIG-1518) to merge small files. 

 aggregate files for replicated join
 ---

 Key: PIG-1458
 URL: https://issues.apache.org/jira/browse/PIG-1458
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Richard Ding
 Fix For: 0.8.0


 We have noticed that if the smaller data in replicated join has many files, 
 this puts an unneeded burden on the name node. Pre-aggregating the files can 
 improve the situation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


