Re: [VOTE] Release Pig 0.3.0 (candidate 0)

2009-06-22 Thread Alan Gates

Downloaded, ran, ran tutorial, built piggybank.  All looks good.

+1

Alan.

On Jun 18, 2009, at 12:30 PM, Olga Natkovich wrote:


Hi,

I created a candidate build for Pig 0.3.0 release. The main feature of
this release is support for multiquery, which allows sharing computation
across multiple queries within the same script. We see significant
performance improvements (up to an order of magnitude) as the result of
this optimization.

I ran the rat report and made sure that all the source files contain
proper headers. (Not attaching the report since it caused trouble with
the last release.)

Keys used to sign the release candidate are at
http://svn.apache.org/viewvc/hadoop/pig/trunk/KEYS.

Please, download and try the release candidate:
http://people.apache.org/~olga/pig-0.3.0-candidate-0/.

Please, vote by Wednesday, June 24th.

Olga





[jira] Created: (PIG-859) Optimizer throw error on self-joins

2009-06-22 Thread Ashutosh Chauhan (JIRA)
Optimizer throw error on self-joins
---

 Key: PIG-859
 URL: https://issues.apache.org/jira/browse/PIG-859
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.3.0
Reporter: Ashutosh Chauhan
 Fix For: 0.4.0


A self-join causes the optimizer to throw an exception. Consider the
following query:
{code}
grunt> A = load 'a';
grunt> B = Join A by $0, A by $0;
grunt> explain B;

2009-06-20 15:51:38,303 [main] ERROR org.apache.pig.tools.grunt.Grunt -
ERROR 1094: Attempt to insert between two nodes that were not connected.
Details at logfile: pig_1245538027026.log
{code}

Relevant stack-trace from log-file:
{code}

Caused by: org.apache.pig.impl.plan.optimizer.OptimizerException: ERROR 2047: Internal error. Unable to introduce split operators.
at org.apache.pig.impl.logicalLayer.optimizer.ImplicitSplitInserter.transform(ImplicitSplitInserter.java:163)
at org.apache.pig.impl.logicalLayer.optimizer.LogicalOptimizer.optimize(LogicalOptimizer.java:163)
at org.apache.pig.PigServer.compileLp(PigServer.java:844)
at org.apache.pig.PigServer.compileLp(PigServer.java:781)
at org.apache.pig.PigServer.getStorePlan(PigServer.java:723)
at org.apache.pig.PigServer.explain(PigServer.java:566)
... 8 more
Caused by: org.apache.pig.impl.plan.PlanException: ERROR 1094: Attempt to insert between two nodes that were not connected.
at org.apache.pig.impl.plan.OperatorPlan.doInsertBetween(OperatorPlan.java:500)
at org.apache.pig.impl.plan.OperatorPlan.insertBetween(OperatorPlan.java:480)
at org.apache.pig.impl.logicalLayer.optimizer.ImplicitSplitInserter.transform(ImplicitSplitInserter.java:139)
... 13 more
{code}


A possible workaround is:
{code}

grunt> A = load 'a';
grunt> B = load 'a';
grunt> C = join A by $0, B by $0;
grunt> explain C;
{code}
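
The report does not state the root cause. Purely as an illustration, the toy model below shows one way an error like "insert between two nodes that were not connected" can arise: if a plan keeps at most one edge per (from, to) pair, the two inputs of a self-join collapse into a single edge, and a rewrite that expects to handle each input separately finds nothing left the second time. The class ToyPlan, its methods, and the node names are all hypothetical; none of this is Pig's OperatorPlan.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SelfJoinPlanSketch {
    /** Toy plan (hypothetical): each node has a list of distinct successors. */
    static class ToyPlan {
        private final Map<String, List<String>> successors = new HashMap<>();

        void connect(String from, String to) {
            List<String> out = successors.computeIfAbsent(from, k -> new ArrayList<>());
            if (!out.contains(to)) {
                out.add(to); // a second identical edge is silently dropped
            }
        }

        boolean isConnected(String from, String to) {
            List<String> out = successors.get(from);
            return out != null && out.contains(to);
        }

        /** Replaces the edge from -> to with from -> middle -> to. */
        void insertBetween(String from, String middle, String to) {
            if (!isConnected(from, to)) {
                throw new IllegalStateException(
                        "Attempt to insert between two nodes that were not connected.");
            }
            successors.get(from).remove(to);
            connect(from, middle);
            connect(middle, to);
        }
    }

    /** Self-join: both join inputs are the same node A. */
    static boolean selfJoinFails() {
        ToyPlan plan = new ToyPlan();
        plan.connect("A", "JOIN");
        plan.connect("A", "JOIN");                // collapses into the first edge
        plan.insertBetween("A", "SPLIT", "JOIN"); // handles the first input
        try {
            plan.insertBetween("A", "SPLIT", "JOIN"); // second input: edge is gone
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    /** Workaround shape: two distinct loads, so each edge is touched once. */
    static boolean workaroundSucceeds() {
        ToyPlan plan = new ToyPlan();
        plan.connect("A", "JOIN");
        plan.connect("B", "JOIN");
        plan.insertBetween("A", "SPLIT_A", "JOIN");
        plan.insertBetween("B", "SPLIT_B", "JOIN");
        return plan.isConnected("SPLIT_A", "JOIN") && plan.isConnected("SPLIT_B", "JOIN");
    }

    public static void main(String[] args) {
        System.out.println("self-join fails: " + selfJoinFails());
        System.out.println("workaround succeeds: " + workaroundSucceeds());
    }
}
```

In this model the workaround helps only because loading the same file under two aliases gives the plan two distinct nodes; whether the real optimizer fails for the same reason is an assumption.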


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-773) Empty complex constants (empty bag, empty tuple and empty map) should be supported

2009-06-22 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722701#action_12722701
 ] 

Santhosh Srinivasan commented on PIG-773:
-

Comments:

1. Minor comment - the comments on the empty productions all have the same text. 
For tuple and bag, "map" should be changed to tuple and bag respectively.

{code}
+   |{ } // Match the empty content in map.
{code}

2. I am not sure about the test case testEmptyBagConstRecursive. Here the bag 
contains an empty tuple. As a result, the field schema for the bag should 
contain the schema of the empty tuple. The test case will probably fail.

{code}
+@Test
+public void testEmptyBagConstRecursive() throws FrontendException{
+
+    LogicalPlan lp = buildPlan("a = foreach (load 'b') generate {()};");
+    LOForEach foreach = (LOForEach) lp.getLeaves().get(0);
+
+    Schema.FieldSchema bagFs = new Schema.FieldSchema(null,null,DataType.BAG);
+    Schema expectedSchema = new Schema(bagFs);
+
+    assertTrue(Schema.equals(foreach.getSchema(), expectedSchema, false, true));
+}

{code}

3. There are no tests that check whether the empty constants are actually 
created, i.e., there are no checks for the expected empty constants. The test 
below only checks that the parser can parse the new syntax for empty constants. 
In addition, the values generated by the parser have to be checked against the 
expected values for these constants.

{code}
+@Test
+public void testRandomEmptyConst(){
+    // Various random scripts to test recursive nature of parser with empty constants.
+
+    buildPlan("a = foreach (load 'b') generate {({})};");
+    buildPlan("a = foreach (load 'b') generate ({()});");
+    buildPlan("a = foreach (load 'b') generate {(),()};");
+    buildPlan("a = foreach (load 'b') generate ({},{});");
+    buildPlan("a = foreach (load 'b') generate ((),());");
+    buildPlan("a = foreach (load 'b') generate ([],[]);");
+    buildPlan("a = foreach (load 'b') generate {({},{})};");
+    buildPlan("a = foreach (load 'b') generate {([],[])};");
+    buildPlan("a = foreach (load 'b') generate (({},{}));");
+    buildPlan("a = foreach (load 'b') generate (([],[]));");
+}
{code}

 Empty complex constants (empty bag, empty tuple and empty map) should be 
 supported
 --

 Key: PIG-773
 URL: https://issues.apache.org/jira/browse/PIG-773
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.3.0
Reporter: Pradeep Kamath
Priority: Minor
 Attachments: pig-773.patch, pig-773_v2.patch


 We should be able to create empty bag constant using {}, empty tuple constant 
 using (), empty map constant using [] within a pig script




[jira] Commented: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

2009-06-22 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722706#action_12722706
 ] 

Ashutosh Chauhan commented on PIG-820:
--

In the patch, RandomSampleLoader is marked as Serializable and its loader field 
is marked as transient. Since loader is initialized in the constructor and used 
later on, findbugs complains: "This class contains a field that is updated at 
multiple places in the class, thus it seems to be part of the state of the 
class. However, since the field is marked as transient and not set in 
readObject or readResolve, it will contain the default value in any 
deserialized instance of the class." However, there is no need for 
RandomSampleLoader to implement Serializable (and thus for loader to be marked 
as transient), because loader is reconstructed from FuncSpec later on. For the 
same reason, neither PigStorage nor BinStorage implements Serializable. I will 
be submitting a new patch with the required changes.
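
The findbugs warning quoted above describes standard Java serialization behavior, which can be reproduced in isolation. The following is a minimal sketch with a hypothetical stand-in class (SampleLoader is not Pig's RandomSampleLoader): a transient field set only in the constructor is null after a serialize/deserialize round trip, because deserialization does not run the constructor of a Serializable class.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class TransientFieldDemo {
    /** Hypothetical stand-in for a loader-like class with a transient delegate. */
    static class SampleLoader implements Serializable {
        private static final long serialVersionUID = 1L;

        // Set only in the constructor, never in readObject or readResolve.
        private transient StringBuilder loader;

        SampleLoader() {
            this.loader = new StringBuilder("delegate");
        }

        boolean hasLoader() {
            return loader != null;
        }
    }

    /** Serializes the object to a byte array and reads it back. */
    static SampleLoader roundTrip(SampleLoader in) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(in);
        }
        try (ObjectInputStream is =
                new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return (SampleLoader) is.readObject();
        }
    }

    /** Convenience wrapper that reports whether the field survived the round trip. */
    static boolean loaderSurvivesRoundTrip() {
        try {
            return roundTrip(new SampleLoader()).hasLoader();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("original has loader: " + new SampleLoader().hasLoader());
        System.out.println("copy has loader: " + loaderSurvivesRoundTrip());
    }
}
```

Dropping Serializable, as proposed in the comment, sidesteps the problem entirely when the object is rebuilt from a specification rather than shipped in serialized form.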


 PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume 
 another loader
 -

 Key: PIG-820
 URL: https://issues.apache.org/jira/browse/PIG-820
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.3.0, 0.4.0
Reporter: Alan Gates
Assignee: Alan Gates
 Attachments: pig-820.patch


 Currently a sampling job requires that data already be stored in
 BinaryStorage format, since RandomSampleLoader extends BinaryStorage.  For
 order by this has mostly been acceptable, because users tend to use order by
 at the end of their script where other MR jobs have already operated on the
 data and thus it is already being stored in BinaryStorage.  For pig scripts
 that just did an order by, an entire MR job is required to read the data and
 write it out in BinaryStorage format.
 As we begin work on join algorithms that will require sampling, this
 requirement to read the entire input and write it back out will not be
 acceptable.  Join is often the first operation of a script, and thus is much
 more likely to trigger this useless up front translation job.
 Instead RandomSampleLoader can be changed to subsume an existing loader,
 using the user specified loader to read the tuples while handling the
 skipping between tuples itself.  This will require the subsumed loader to
 implement a Samplable Interface, that will look something like:
 {code}
 public interface SamplableLoader extends LoadFunc {
 
 /**
  * Skip ahead in the input stream.
  * @param n number of bytes to skip
  * @return number of bytes actually skipped.  The return semantics are
  * exactly the same as {@link java.io.InputStream#skip(long)}
  */
 public long skip(long n) throws IOException;
 
 /**
  * Get the current position in the stream.
  * @return position in the stream.
  */
 public long getPosition() throws IOException;
 }
 {code}
 The MRCompiler would then check if the loader being used to load data
 implemented the SamplableLoader interface.  If so, rather than create an
 initial MR job to do the translation it would create the sampling job, having
 RandomSampleLoader use the user specified loader.
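
To make the proposal concrete, here is a minimal, self-contained sketch of a sampler that delegates reading to another loader and does the skipping itself. Everything in it is a simplified assumption: RecordLoader stands in for LoadFunc, records are plain newline-delimited strings, and LineLoader, sample, and canSampleDirectly are hypothetical names. Only skip and getPosition mirror the interface proposed in the issue.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class SamplingSketch {
    /** Simplified stand-in for LoadFunc: returns the next record, or null at end of input. */
    interface RecordLoader {
        String getNext() throws IOException;
    }

    /** Mirrors the two methods proposed in the issue. */
    interface SamplableLoader extends RecordLoader {
        long skip(long n) throws IOException;
        long getPosition() throws IOException;
    }

    /** Hypothetical newline-delimited loader that tracks its byte position. */
    static class LineLoader implements SamplableLoader {
        private final InputStream in;
        private long position = 0;

        LineLoader(byte[] data) {
            this.in = new ByteArrayInputStream(data);
        }

        @Override
        public String getNext() throws IOException {
            StringBuilder sb = new StringBuilder();
            boolean sawAny = false;
            int b;
            while ((b = in.read()) != -1) {
                position++;
                sawAny = true;
                if (b == '\n') {
                    break;
                }
                sb.append((char) b);
            }
            return sawAny ? sb.toString() : null;
        }

        @Override
        public long skip(long n) throws IOException {
            long skipped = in.skip(n);
            position += skipped;
            return skipped;
        }

        @Override
        public long getPosition() {
            return position;
        }
    }

    /** The check the compiler would make before choosing a plan. */
    static boolean canSampleDirectly(RecordLoader loader) {
        return loader instanceof SamplableLoader;
    }

    /** Reads a record, skips ahead, drops the partial record it lands in, repeats. */
    static List<String> sample(SamplableLoader loader, long skipBytes, int maxSamples)
            throws IOException {
        List<String> samples = new ArrayList<>();
        String record = loader.getNext(); // the first record starts on a boundary
        while (record != null && samples.size() < maxSamples) {
            samples.add(record);
            loader.skip(skipBytes);
            if (loader.getNext() == null) { // discard the record we landed inside
                break;
            }
            record = loader.getNext();
        }
        return samples;
    }

    /** Convenience wrapper for in-memory text. */
    static List<String> sampleText(String text, long skipBytes, int maxSamples) {
        try {
            byte[] data = text.getBytes(StandardCharsets.US_ASCII);
            return sample(new LineLoader(data), skipBytes, maxSamples);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sampleText("r0\nr1\nr2\nr3\nr4\nr5\nr6\nr7\n", 4, 10));
    }
}
```

The sketch always discards one record after each skip, on the assumption that a skip lands mid-record; a real loader would need its own rule for resynchronizing on a record boundary.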




Re: [VOTE] Release Pig 0.3.0 (candidate 0)

2009-06-22 Thread Arun C Murthy


On Jun 18, 2009, at 12:30 PM, Olga Natkovich wrote:


Hi,

I created a candidate build for Pig 0.3.0 release. The main feature of
this release is support for multiquery, which allows sharing computation
across multiple queries within the same script. We see significant
performance improvements (up to an order of magnitude) as the result of
this optimization.



+1

I downloaded the release, validated checksums and ran the unit-tests  
successfully.


Arun



Re: [VOTE] Release Pig 0.3.0 (candidate 0)

2009-06-22 Thread Nigel Daley

+1.






RE: [VOTE] Release Pig 0.3.0 (candidate 0)

2009-06-22 Thread Pradeep Kamath
+1 for release.

-Pradeep

-Original Message-
From: Alan Gates [mailto:ga...@yahoo-inc.com] 
Sent: Monday, June 22, 2009 9:30 AM
To: priv...@hadoop.apache.org
Cc: pig-dev@hadoop.apache.org; gene...@hadoop.apache.org
Subject: Re: [VOTE] Release Pig 0.3.0 (candidate 0)

Downloaded, ran, ran tutorial, built piggybank.  All looks good.

+1

Alan.





requirements for Pig 1.0?

2009-06-22 Thread Dmitriy Ryaboy
I know there was some discussion of making the types release (0.2) a Pig 1
release, but that got nixed. There wasn't a similar discussion on 0.3.
Has the list of want-to-haves for Pig 1.0 been discussed since?


[jira] Resolved: (PIG-600) PiggyBank compilation instructions don't work

2009-06-22 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich resolved PIG-600.


Resolution: Fixed

Thanks, Dmitry for your help with this issue!

 PiggyBank compilation instructions don't work
 -

 Key: PIG-600
 URL: https://issues.apache.org/jira/browse/PIG-600
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.2.0
Reporter: David Ciemiewicz

 I know that PiggyBank is provided as-is, but the instructions are incomplete; 
 they should include all the steps required to compile PiggyBank.
 http://wiki.apache.org/pig/PiggyBank
 I checked out the types branch version of PiggyBank by modifying the 
 instructions to check out:
 svn co 
 http://svn.apache.org/repos/asf/hadoop/pig/branches/types/contrib/piggybank/
 At step 2 it says:
 To build a jar file that contains all available user defined functions 
 (UDFs), please follow the steps:
 1. Checkout UDF code: svn co 
 http://svn.apache.org/repos/asf/hadoop/pig/trunk/contrib/piggybank
 2. Build the jar file: from trunk/contrib/piggybank/java directory run ant. 
 This will generate piggybank.jar in the same directory.
 So I went into the piggybank/java directory, ran ant, and got the 
 following errors:
 {code}
 -bash-3.00$ ant
 Buildfile: build.xml
 init:
 compile:
  [echo]  *** Compiling Pig UDFs ***
 [javac] Compiling 70 source files to /homes/ciemo/piggybank/java/build/classes
 [javac] /homes/ciemo/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/math/ABS.java:25: cannot find symbol
 [javac] symbol  : class EvalFunc
 [javac] location: package org.apache.pig
 [javac] import org.apache.pig.EvalFunc;
 [javac]  ^
 [javac] /homes/ciemo/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/math/ABS.java:26: cannot find symbol
 [javac] symbol  : class FuncSpec
 [javac] location: package org.apache.pig
 [javac] import org.apache.pig.FuncSpec;
 [javac]  ^
 [javac] /homes/ciemo/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/math/ABS.java:27: package org.apache.pig.data does not exist
 [javac] import org.apache.pig.data.Tuple;
 [javac]   ^
 [javac] /homes/ciemo/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/math/ABS.java:28: package org.apache.pig.impl.logicalLayer.schema does not exist
 [javac] import org.apache.pig.impl.logicalLayer.schema.Schema;
 [javac]   ^
 [javac] /homes/ciemo/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/math/ABS.java:29: package org.apache.pig.data does not exist
 [javac] import org.apache.pig.data.DataType;
 [javac]   ^
 [javac] /homes/ciemo/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/math/ABS.java:30: package org.apache.pig.impl.logicalLayer does not exist
 [javac] import org.apache.pig.impl.logicalLayer.FrontendException;
 [javac]^
 [javac] /homes/ciemo/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/math/ABS.java:31: package org.apache.pig.impl.util does not exist
 [javac] import org.apache.pig.impl.util.WrappedIOException;
 [javac]^
 [javac] /homes/ciemo/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/math/ABS.java:61: cannot find symbol
 [javac] symbol: class EvalFunc
 [javac] public class ABS extends EvalFunc<Double>{
 [javac]  ^
 [javac] /homes/ciemo/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/math/ABS.java:67: cannot find symbol
 [javac] symbol  : class Tuple
 [javac] location: class org.apache.pig.piggybank.evaluation.math.ABS
 [javac] public Double exec(Tuple input) throws IOException {
 [javac]^
 [javac] /homes/ciemo/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/math/ABS.java:85: cannot find symbol
 [javac] symbol  : class Schema
 [javac] location: class org.apache.pig.piggybank.evaluation.math.ABS
 [javac] public Schema outputSchema(Schema input) {
 [javac]^
 [javac] /homes/ciemo/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/math/ABS.java:85: cannot find symbol
 [javac] symbol  : class Schema
 [javac] location: class org.apache.pig.piggybank.evaluation.math.ABS
 [javac] public Schema outputSchema(Schema input) {
 [javac]^
 [javac] /homes/ciemo/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/math/ABS.java:93: cannot find symbol
 

[jira] Updated: (PIG-851) Map type used as return type in UDFs not recognized at all times

2009-06-22 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan updated PIG-851:


Status: In Progress  (was: Patch Available)

 Map type used as return type in UDFs not recognized at all times
 

 Key: PIG-851
 URL: https://issues.apache.org/jira/browse/PIG-851
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.3.0
Reporter: Santhosh Srinivasan
 Attachments: patch_815.txt


 When a UDF returns a map and the outputSchema method is not overridden, Pig 
 does not figure out the data type. As a result, the type is set to unknown, 
 resulting in a run-time failure. An example script and UDF follow:
 {code}
 import java.io.IOException;
 import java.util.HashMap;
 import java.util.Map;

 import org.apache.pig.EvalFunc;
 import org.apache.pig.data.Tuple;

 public class mapUDF extends EvalFunc<Map<Object, Object>> {
     @Override
     public Map<Object, Object> exec(Tuple input) throws IOException {
         return new HashMap<Object, Object>();
     }
     // Note that the outputSchema method is commented out
     /*
     @Override
     public Schema outputSchema(Schema input) {
         try {
             return new Schema(new Schema.FieldSchema(null, null, DataType.MAP));
         } catch (FrontendException e) {
             return null;
         }
     }
     */
 }
 {code}
 {code}
 grunt> a = load 'student_tab.data';
 grunt> b = foreach a generate EXPLODE(1);
 grunt> describe b;
 b: {Unknown}
 grunt> dump b;
 2009-06-15 17:59:01,776 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
 2009-06-15 17:59:01,781 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2080: Foreach currently does not handle type Unknown
 {code}
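
The issue does not say how Pig derives the type when outputSchema is absent. As a hedged illustration only, the sketch below shows how a declared return type can be recovered by reflection from a generic superclass, and why a parameterized argument such as Map<Object, Object> needs separate handling from a plain class such as Double. Func, MapFunc, DoubleFunc, and pigTypeName are hypothetical stand-ins, not Pig classes or Pig's actual inference code.

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.HashMap;
import java.util.Map;

public class ReturnTypeSketch {
    /** Hypothetical stand-in for a generic UDF base class. */
    abstract static class Func<T> {
        abstract T exec(Object input);
    }

    static class MapFunc extends Func<Map<Object, Object>> {
        @Override
        Map<Object, Object> exec(Object input) {
            return new HashMap<Object, Object>();
        }
    }

    static class DoubleFunc extends Func<Double> {
        @Override
        Double exec(Object input) {
            return 0.0;
        }
    }

    /** Returns the raw class of T for a direct subclass of Func<T>. */
    static Class<?> declaredReturnClass(Func<?> f) {
        Type superType = f.getClass().getGenericSuperclass();
        if (!(superType instanceof ParameterizedType)) {
            return Object.class;
        }
        Type arg = ((ParameterizedType) superType).getActualTypeArguments()[0];
        if (arg instanceof ParameterizedType) {
            // Map<Object, Object> is itself parameterized, so it is not a Class;
            // take its raw type instead.
            return (Class<?>) ((ParameterizedType) arg).getRawType();
        }
        return arg instanceof Class ? (Class<?>) arg : Object.class;
    }

    /** Maps the recovered class to an illustrative type name. */
    static String pigTypeName(Func<?> f) {
        Class<?> c = declaredReturnClass(f);
        if (Map.class.isAssignableFrom(c)) {
            return "map";
        }
        if (c == Double.class) {
            return "double";
        }
        return "unknown";
    }

    public static void main(String[] args) {
        System.out.println("MapFunc    -> " + pigTypeName(new MapFunc()));
        System.out.println("DoubleFunc -> " + pigTypeName(new DoubleFunc()));
    }
}
```

A lookup that only accepts type arguments that are instances of Class would fall through to "unknown" for MapFunc; whether that is what happens inside Pig is an assumption.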




[jira] Updated: (PIG-851) Map type used as return type in UDFs not recognized at all times

2009-06-22 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan updated PIG-851:


Patch Info: [Patch Available]

 Map type used as return type in UDFs not recognized at all times
 

 Key: PIG-851
 URL: https://issues.apache.org/jira/browse/PIG-851
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.3.0
Reporter: Santhosh Srinivasan
 Attachments: patch_815.txt





[jira] Created: (PIG-860) Load split by 'file' - not documented in pig latin reference manual

2009-06-22 Thread Thejas M Nair (JIRA)
Load split by 'file' - not documented in pig latin reference manual
---

 Key: PIG-860
 URL: https://issues.apache.org/jira/browse/PIG-860
 Project: Pig
  Issue Type: Task
  Components: documentation
Reporter: Thejas M Nair
Priority: Minor


split by 'file' is not documented in the Pig Latin Reference Manual 
(http://hadoop.apache.org/pig/docs/r0.2.0/piglatin.html).

There is a description of the option here:
http://wiki.apache.org/pig/PigStreamingFunctionalSpec (section 4.3).







[jira] Issue Comment Edited: (PIG-860) Load split by 'file' - not documented in pig latin reference manual

2009-06-22 Thread Thejas M Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722760#action_12722760
 ] 

Thejas M Nair edited comment on PIG-860 at 6/22/09 11:56 AM:
-

Use case, from discussion in user mailing list -

-- Forwarded Message
From: Pradeep Kamath prade...@yah...c.com
Reply-To: pig-u...@hadoop.apache.org pig-u...@hadoop.apache.org
Date: Mon, 22 Jun 2009 11:04:04 -0700
To: pig-u...@hadoop.apache.org
Conversation: How to make the pig unsplittable ?
Subject: RE: How to make the pig unsplittable ?

You can load using the following syntax:
A = load 'inputfile' split by 'file';

The split by 'file' will ensure issplittable is set to false and the
input is not split.

-Pradeep

-Original Message-
From: zhang jianfeng [mailto:zjff..ail.com] 
Sent: Sunday, June 21, 2009 10:48 PM
To: pig-u...@hadoop.apache.org
Subject: How to make the pig unsplittable ?

Hi all,

Because of my input file format, the first line of the file is the
definition of each field, followed by lines of records. So I have not found
a good way to use a custom slicer.

So I'd like Pig not to split my file, but I have not found an easy way to
do that. For now I have to change the code in POLoad and LOLoad to make the
variable isSplitable false.

Is there an easier way to make it unsplittable, such as a configuration option?


Thank you for any help.


Jeff Zhang

-- End of Forwarded Message

 Load split by 'file' - not documented in pig latin reference manual
 ---

 Key: PIG-860
 URL: https://issues.apache.org/jira/browse/PIG-860
 Project: Pig
  Issue Type: Task
  Components: documentation
Reporter: Thejas M Nair
Priority: Minor

 split by 'file'  is not documented in Pig Latin Reference Manual 
 (http://hadoop.apache.org/pig/docs/r0.2.0/piglatin.html).
 There is a description about the option here 
 -http://wiki.apache.org/pig/PigStreamingFunctionalSpec (section 4.3).




[jira] Updated: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

2009-06-22 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-820:
-

Attachment: pig-820_v2.patch

Patch which fixes findbugs warning.

 PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume 
 another loader
 -

 Key: PIG-820
 URL: https://issues.apache.org/jira/browse/PIG-820
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.3.0, 0.4.0
Reporter: Alan Gates
Assignee: Alan Gates
 Fix For: 0.4.0

 Attachments: pig-820.patch, pig-820_v2.patch





[jira] Updated: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

2009-06-22 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-820:
-

Status: Patch Available  (was: Open)

Submitting to hudson

 PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume 
 another loader
 -

 Key: PIG-820
 URL: https://issues.apache.org/jira/browse/PIG-820
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.3.0, 0.4.0
Reporter: Alan Gates
Assignee: Ashutosh Chauhan
 Fix For: 0.4.0

 Attachments: pig-820.patch, pig-820_v2.patch





[jira] Updated: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

2009-06-22 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-820:
-

Fix Version/s: 0.4.0
 Assignee: Ashutosh Chauhan  (was: Alan Gates)
   Status: Open  (was: Patch Available)

 PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume 
 another loader
 -

 Key: PIG-820
 URL: https://issues.apache.org/jira/browse/PIG-820
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.3.0, 0.4.0
Reporter: Alan Gates
Assignee: Ashutosh Chauhan
 Fix For: 0.4.0

 Attachments: pig-820.patch, pig-820_v2.patch


 Currently a sampling job requires that data already be stored in 
 BinaryStorage format, since RandomSampleLoader extends BinaryStorage.  For 
 order by this
 has mostly been acceptable, because users tend to use order by at the end of 
 their script where other MR jobs have already operated on the data and thus it
 is already being stored in BinaryStorage.  For pig scripts that just did an 
 order by, an entire MR job is required to read the data and write it out
 in BinaryStorage format.
 As we begin work on join algorithms that will require sampling, this 
 requirement to read the entire input and write it back out will not be 
 acceptable.
 Join is often the first operation of a script, and thus is much more likely 
 to trigger this useless up front translation job.
 Instead RandomSampleLoader can be changed to subsume an existing loader, 
 using the user specified loader to read the tuples while handling the skipping
 between tuples itself.  This will require the subsumed loader to implement a 
 Samplable Interface, that will look something like:
 {code}
 public interface SamplableLoader extends LoadFunc {

     /**
      * Skip ahead in the input stream.
      * @param n number of bytes to skip
      * @return number of bytes actually skipped.  The return semantics are
      * exactly the same as {@link java.io.InputStream#skip(long)}
      */
     public long skip(long n) throws IOException;

     /**
      * Get the current position in the stream.
      * @return position in the stream.
      */
     public long getPosition() throws IOException;
 }
 {code}
 The MRCompiler would then check whether the loader being used to load data
 implements the SamplableLoader interface.  If so, rather than creating an
 initial MR job to do the translation, it would create the sampling job
 directly, having RandomSampleLoader use the user-specified loader.
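
 To make the proposal concrete, here is a minimal, self-contained sketch of the
 idea. It is not the attached patch and does not use Pig's real classes:
 RecordLoader is a hypothetical stand-in for LoadFunc, and LineLoader,
 SamplingLoader, SamplingDemo and the Plan enum are names invented for
 illustration. It also assumes that a skip can land in the middle of a record,
 so the sampler discards the partial record before reading the next one.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Pig's LoadFunc, reduced to one method.
interface RecordLoader {
    String getNext() throws IOException; // returns null at end of input
}

// The two methods proposed in the issue, layered on the stand-in above.
interface SamplableLoader extends RecordLoader {
    long skip(long n) throws IOException;
    long getPosition() throws IOException;
}

// Toy user-specified loader: one record per newline-terminated line.
class LineLoader implements SamplableLoader {
    private final InputStream in;
    private long pos = 0;

    LineLoader(InputStream in) { this.in = in; }

    public String getNext() throws IOException {
        StringBuilder sb = new StringBuilder();
        boolean readAny = false;
        int c;
        while ((c = in.read()) != -1) {
            pos++;
            readAny = true;
            if (c == '\n') break;
            sb.append((char) c);
        }
        return readAny ? sb.toString() : null;
    }

    public long skip(long n) throws IOException {
        long skipped = in.skip(n); // same semantics as InputStream#skip
        pos += skipped;
        return skipped;
    }

    public long getPosition() { return pos; }
}

// Sampler that subsumes another loader: it decides where to skip,
// while the wrapped loader decides how to parse a record.
class SamplingLoader implements RecordLoader {
    private final SamplableLoader wrapped;
    private final long skipBytes;
    private boolean first = true;

    SamplingLoader(SamplableLoader wrapped, long skipBytes) {
        this.wrapped = wrapped;
        this.skipBytes = skipBytes;
    }

    public String getNext() throws IOException {
        if (!first) {
            wrapped.skip(skipBytes);
            // Assumption: the skip may land mid-record, so drop the remainder.
            if (wrapped.getNext() == null) return null;
        }
        first = false;
        return wrapped.getNext();
    }
}

class SamplingDemo {
    enum Plan { SAMPLE_DIRECTLY, TRANSLATE_THEN_SAMPLE }

    // Stand-in for the check the issue assigns to MRCompiler.
    static Plan plan(RecordLoader loader) {
        return loader instanceof SamplableLoader
                ? Plan.SAMPLE_DIRECTLY
                : Plan.TRANSLATE_THEN_SAMPLE;
    }

    // Collects every record the sampler returns for the given input text.
    static List<String> sample(String text, long skipBytes) {
        byte[] bytes = text.getBytes(StandardCharsets.US_ASCII);
        RecordLoader sampler =
                new SamplingLoader(new LineLoader(new ByteArrayInputStream(bytes)), skipBytes);
        List<String> out = new ArrayList<>();
        try {
            for (String r = sampler.getNext(); r != null; r = sampler.getNext()) {
                out.add(r);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        // Six 5-byte records, skipping 7 bytes between samples.
        System.out.println(sample("aaaa\nbbbb\ncccc\ndddd\neeee\nffff\n", 7)); // prints [aaaa, dddd]
    }
}
```

 The point of the split is that the sampler never needs to understand the
 storage format: any loader that can report a position and skip bytes can be
 sampled in place, and only loaders without that capability would still need
 the translation job.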

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

2009-06-22 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-820:
-

Attachment: pig-820_v3.patch

 PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume 
 another loader
 -

 Key: PIG-820
 URL: https://issues.apache.org/jira/browse/PIG-820
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.3.0, 0.4.0
Reporter: Alan Gates
Assignee: Ashutosh Chauhan
 Fix For: 0.4.0

 Attachments: pig-820.patch, pig-820_v2.patch, pig-820_v3.patch






Re: [VOTE] Release Pig 0.3.0 (candidate 0)

2009-06-22 Thread Daniel Dai

Unit tests pass on Windows. +1 for release.

Daniel

- Original Message - 
From: Santhosh Srinivasan s...@yahoo-inc.com

To: pig-dev@hadoop.apache.org
Sent: Monday, June 22, 2009 2:15 PM
Subject: RE: [VOTE] Release Pig 0.3.0 (candidate 0)


I was able to download the archive, verify the checksum and run the unit
test cases successfully. However, I was not able to run the tutorial.
One of the following has to be fixed:

1. The release notes should not reference the wiki for the tutorial.
2. The wiki for the tutorial has to be updated to allow users to run the
tutorial successfully.

Pending the fix for the aforementioned problems, +1 for the release.

Thanks,
Santhosh 

