[jira] Commented: (PIG-1514) Migrate logical optimization rule: OpLimitOptimizer
[ https://issues.apache.org/jira/browse/PIG-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901279#action_12901279 ]

Daniel Dai commented on PIG-1514:
---------------------------------

One minor correction, adding:
{code}
currentPlan.remove(limit);
{code}
to OptimizeLimit:197

Migrate logical optimization rule: OpLimitOptimizer
---------------------------------------------------
                Key: PIG-1514
                URL: https://issues.apache.org/jira/browse/PIG-1514
            Project: Pig
         Issue Type: Sub-task
         Components: impl
   Affects Versions: 0.7.0
           Reporter: Daniel Dai
           Assignee: Xuefu Zhang
            Fix For: 0.8.0
        Attachments: jira-1514-0.patch

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
Re: [VOTE] Pig to become a top level Apache project
With 9 +1 votes and no -1s the vote passes. I will begin a vote on Hadoop general. Alan. On Aug 18, 2010, at 10:34 AM, Alan Gates wrote: Earlier this week I began a discussion on Pig becoming a TLP (http://bit.ly/byD7L8 ). All of the received feedback was positive. So, let's have a formal vote. I propose we move Pig to a top level Apache project. I propose that the initial PMC of this project be the list of all currently active Pig committers (http://hadoop.apache.org/pig/whoweare.html ) as of 18 August 2010. I nominate Olga Natkovich as the chair of the PMC. (PMC chairs have no more power than other PMC members, but they are responsible for writing regular reports for the Apache board, assigning rights to new committers, etc.) I propose that as part of the resolution that will be forwarded to the Apache board we include that one of the first tasks of the new Pig PMC will be to adopt bylaws for the governance of the project. Alan. P.S. If this vote passes, the next step is that the proposal will be forwarded to the Hadoop PMC for discussion and vote. If the Hadoop PMC vote passes, a formal resolution is then drafted (see http://bit.ly/bvOTRq for an example resolution) and sent to the Apache board. The Apache board will then vote on whether to make Pig a TLP.
Re: August Pig contributor workshop
Olga, We do have another couple of spots. -Dmitriy On Thu, Aug 19, 2010 at 10:28 AM, Olga Natkovich ol...@yahoo-inc.comwrote: Dmitry, Do you have any spots left? Olga -Original Message- From: Russell Jurney [mailto:russell.jur...@gmail.com] Sent: Thursday, August 19, 2010 5:22 AM To: pig-dev@hadoop.apache.org Subject: Re: August Pig contributor workshop Oh, +2 more - Pete Skomoroch and Sam Shah will also attend, for a total of 4 LinkedIners. On Wed, Aug 18, 2010 at 9:18 PM, Alan Gates ga...@yahoo-inc.com wrote: Confirming Olga and I will be there. Alan. On Aug 18, 2010, at 4:45 PM, Dmitriy Ryaboy wrote: Hi folks, Please do RSVP so that we know how many people are coming. Thanks, -Dmitriy On Tue, Aug 17, 2010 at 4:04 PM, Alan Gates ga...@yahoo-inc.com wrote: All, We will be holding the next Pig contributor workshop at Twitter on Wednesday, August 25 from 4-6. The tentative agenda is to discuss: Making Piggybank better Pig and Azkaban integration Plans for features in 0.9 An update on the Howl project Anyone contributing to or interested i
[jira] Created: (PIG-1556) Need a clean way to kill Pig jobs.
Need a clean way to kill Pig jobs.
----------------------------------
                Key: PIG-1556
                URL: https://issues.apache.org/jira/browse/PIG-1556
            Project: Pig
         Issue Type: New Feature
         Components: tools
   Affects Versions: 0.7.0
           Reporter: Aravind Srinivasan
            Fix For: 0.9.0

We need a way to kill a running Pig script cleanly. This is very similar to the hadoop job -kill command. This requirement means the following:
1) Support a "pig -kill scriptID" or similar syntax. The script ID or some unique handle should be easily available for the user to identify a running Pig job.
2) The command will then identify all the MR jobs that are currently spawned by the given Pig script.
3) It will internally use hadoop job -kill to kill each one of those spawned MR jobs.
4) It will do any other necessary cleanup and also make sure all mappers/reducers emanating from this Pig script are killed.

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1178) LogicalPlan and Optimizer are too complex and hard to work with
[ https://issues.apache.org/jira/browse/PIG-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1178:
----------------------------
    Attachment: PIG-1178-7.patch

PIG-1178-7.patch switches the flag to use the new logical plan by default. It fixes most unit tests except:
1. TestMultiQuery.testMultiQueryJiraPig1169, which depends on PIG-1514 and will be fixed automatically once PIG-1514 is checked in
2. TestPruneColumn.testMapKey3
Both test cases are temporarily commented out. All other unit tests pass.

Here is the test-patch result:
[exec] +1 overall.
[exec] +1 @author. The patch does not contain any @author tags.
[exec] +1 tests included. The patch appears to include 36 new or modified tests.
[exec] +1 javadoc. The javadoc tool did not generate any warning messages.
[exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
[exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
[exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.

LogicalPlan and Optimizer are too complex and hard to work with
---------------------------------------------------------------
                Key: PIG-1178
                URL: https://issues.apache.org/jira/browse/PIG-1178
            Project: Pig
         Issue Type: Improvement
           Reporter: Alan Gates
           Assignee: Daniel Dai
            Fix For: 0.8.0
        Attachments: expressions-2.patch, expressions.patch, lp.patch, lp.patch, PIG-1178-4.patch, PIG-1178-5.patch, PIG-1178-6.patch, PIG-1178-7.patch, pig_1178.patch, pig_1178.patch, PIG_1178.patch, pig_1178_2.patch, pig_1178_3.2.patch, pig_1178_3.3.patch, pig_1178_3.4.patch, pig_1178_3.patch

The current implementation of the logical plan and the logical optimizer in Pig has proven to not be easily extensible. Developer feedback has indicated that adding new rules to the optimizer is quite burdensome. In addition, the logical plan has been an area of numerous bugs, many of which have been difficult to fix.
Developers also feel that the logical plan is difficult to understand and maintain. The root cause for these issues is that a number of design decisions that were made as part of the 0.2 rewrite of the front end have now proven to be sub-optimal. The heart of this proposal is to revisit a number of those proposals and rebuild the logical plan with a simpler design that will make it much easier to maintain the logical plan as well as extend the logical optimizer. See http://wiki.apache.org/pig/PigLogicalPlanOptimizerRewrite for full details. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1178) LogicalPlan and Optimizer are too complex and hard to work with
[ https://issues.apache.org/jira/browse/PIG-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1178: Status: Open (was: Patch Available) LogicalPlan and Optimizer are too complex and hard to work with --- Key: PIG-1178 URL: https://issues.apache.org/jira/browse/PIG-1178 Project: Pig Issue Type: Improvement Reporter: Alan Gates Assignee: Daniel Dai Fix For: 0.8.0 Attachments: expressions-2.patch, expressions.patch, lp.patch, lp.patch, PIG-1178-4.patch, PIG-1178-5.patch, PIG-1178-6.patch, PIG-1178-7.patch, pig_1178.patch, pig_1178.patch, PIG_1178.patch, pig_1178_2.patch, pig_1178_3.2.patch, pig_1178_3.3.patch, pig_1178_3.4.patch, pig_1178_3.patch The current implementation of the logical plan and the logical optimizer in Pig has proven to not be easily extensible. Developer feedback has indicated that adding new rules to the optimizer is quite burdensome. In addition, the logical plan has been an area of numerous bugs, many of which have been difficult to fix. Developers also feel that the logical plan is difficult to understand and maintain. The root cause for these issues is that a number of design decisions that were made as part of the 0.2 rewrite of the front end have now proven to be sub-optimal. The heart of this proposal is to revisit a number of those proposals and rebuild the logical plan with a simpler design that will make it much easier to maintain the logical plan as well as extend the logical optimizer. See http://wiki.apache.org/pig/PigLogicalPlanOptimizerRewrite for full details. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1178) LogicalPlan and Optimizer are too complex and hard to work with
[ https://issues.apache.org/jira/browse/PIG-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1178: Status: Patch Available (was: Open) LogicalPlan and Optimizer are too complex and hard to work with --- Key: PIG-1178 URL: https://issues.apache.org/jira/browse/PIG-1178 Project: Pig Issue Type: Improvement Reporter: Alan Gates Assignee: Daniel Dai Fix For: 0.8.0 Attachments: expressions-2.patch, expressions.patch, lp.patch, lp.patch, PIG-1178-4.patch, PIG-1178-5.patch, PIG-1178-6.patch, PIG-1178-7.patch, pig_1178.patch, pig_1178.patch, PIG_1178.patch, pig_1178_2.patch, pig_1178_3.2.patch, pig_1178_3.3.patch, pig_1178_3.4.patch, pig_1178_3.patch The current implementation of the logical plan and the logical optimizer in Pig has proven to not be easily extensible. Developer feedback has indicated that adding new rules to the optimizer is quite burdensome. In addition, the logical plan has been an area of numerous bugs, many of which have been difficult to fix. Developers also feel that the logical plan is difficult to understand and maintain. The root cause for these issues is that a number of design decisions that were made as part of the 0.2 rewrite of the front end have now proven to be sub-optimal. The heart of this proposal is to revisit a number of those proposals and rebuild the logical plan with a simpler design that will make it much easier to maintain the logical plan as well as extend the logical optimizer. See http://wiki.apache.org/pig/PigLogicalPlanOptimizerRewrite for full details. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1178) LogicalPlan and Optimizer are too complex and hard to work with
[ https://issues.apache.org/jira/browse/PIG-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12901543#action_12901543 ] Daniel Dai commented on PIG-1178: - PIG-1178-7.patch committed. LogicalPlan and Optimizer are too complex and hard to work with --- Key: PIG-1178 URL: https://issues.apache.org/jira/browse/PIG-1178 Project: Pig Issue Type: Improvement Reporter: Alan Gates Assignee: Daniel Dai Fix For: 0.8.0 Attachments: expressions-2.patch, expressions.patch, lp.patch, lp.patch, PIG-1178-4.patch, PIG-1178-5.patch, PIG-1178-6.patch, PIG-1178-7.patch, pig_1178.patch, pig_1178.patch, PIG_1178.patch, pig_1178_2.patch, pig_1178_3.2.patch, pig_1178_3.3.patch, pig_1178_3.4.patch, pig_1178_3.patch The current implementation of the logical plan and the logical optimizer in Pig has proven to not be easily extensible. Developer feedback has indicated that adding new rules to the optimizer is quite burdensome. In addition, the logical plan has been an area of numerous bugs, many of which have been difficult to fix. Developers also feel that the logical plan is difficult to understand and maintain. The root cause for these issues is that a number of design decisions that were made as part of the 0.2 rewrite of the front end have now proven to be sub-optimal. The heart of this proposal is to revisit a number of those proposals and rebuild the logical plan with a simpler design that will make it much easier to maintain the logical plan as well as extend the logical optimizer. See http://wiki.apache.org/pig/PigLogicalPlanOptimizerRewrite for full details. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-506) Does pig need a NATIVE keyword?
[ https://issues.apache.org/jira/browse/PIG-506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair updated PIG-506:
------------------------------
    Attachment: PIG-506.patch

The new patch addresses my comments. test-patch results:
[exec] -1 overall.
[exec] +1 @author. The patch does not contain any @author tags.
[exec] +1 tests included. The patch appears to include 10 new or modified tests.
[exec] +1 javadoc. The javadoc tool did not generate any warning messages.
[exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
[exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
[exec] -1 release audit. The applied patch generated 433 release audit warnings (more than the trunk's current 425 warnings).
The release audit warnings are for the javadoc html files. I will commit once all unit tests pass.

Does pig need a NATIVE keyword?
-------------------------------
                Key: PIG-506
                URL: https://issues.apache.org/jira/browse/PIG-506
            Project: Pig
         Issue Type: New Feature
         Components: impl
           Reporter: Alan Gates
           Assignee: Aniket Mokashi
           Priority: Minor
            Fix For: 0.8.0
        Attachments: NativeImplInitial.patch, NativeMapReduceFinale1.patch, NativeMapReduceFinale2.patch, NativeMapReduceFinale3.patch, PIG-506.patch, TestWordCount.jar

Assume a user had a job that broke easily into three pieces. Further assume that pieces one and three were easily expressible in pig, but that piece two needed to be written in map reduce for whatever reason (performance, something that pig could not easily express, legacy job that was too important to change, etc.). Today the user would either have to use map reduce for the entire job or manually handle the stitching together of pig and map reduce jobs. What if instead pig provided a NATIVE keyword that would allow the script to pass off the data stream to the underlying system (in this case map reduce). The semantics of NATIVE would vary by underlying system.
In the map reduce case, we would assume that this indicated a collection of one or more fully contained map reduce jobs, so that pig would store the data, invoke the map reduce jobs, and then read the resulting data to continue. It might look something like this:
{code}
A = load 'myfile';
X = load 'myotherfile';
B = group A by $0;
C = foreach B generate group, myudf(B);
D = native (jar=mymr.jar, infile=frompig outfile=topig);
E = join D by $0, X by $0;
...
{code}
This differs from streaming in that it allows the user to insert an arbitrary amount of native processing, whereas streaming allows the insertion of one binary. It also differs in that, for streaming, data is piped directly into and out of the binary as part of the pig pipeline. Here the pipeline would be broken, data written to disk, the native block invoked, and then data read back from disk. Another alternative is to say this is unnecessary because the user can do the coordination from java, using the PigServer interface to run pig and calling the map reduce job explicitly. The advantages of the native keyword are that the user need not worry about coordination between the jobs; pig will take care of it. Also, the user can make use of existing java applications without being a java programmer.

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1555) [piggybank] add CSV Loader
[ https://issues.apache.org/jira/browse/PIG-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12901556#action_12901556 ] Alan Gates commented on PIG-1555: - +1 If you have a chance sometime I'd be curious to learn the performance characteristics of this versus PigStorage. I'm curious if there is substantial cost to dealing with escaping. [piggybank] add CSV Loader -- Key: PIG-1555 URL: https://issues.apache.org/jira/browse/PIG-1555 Project: Pig Issue Type: New Feature Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Priority: Minor Fix For: 0.8.0 Attachments: PIG_1555.patch Users often ask for a CSV loader that can handle quoted commas. Let's get 'er done. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1508) Make 'docs' target (forrest) work with Java 1.6
[ https://issues.apache.org/jira/browse/PIG-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12901559#action_12901559 ] Alan Gates commented on PIG-1508: - Alright, I'll get this checked in before we branch for 0.8 then. Make 'docs' target (forrest) work with Java 1.6 --- Key: PIG-1508 URL: https://issues.apache.org/jira/browse/PIG-1508 Project: Pig Issue Type: Bug Components: documentation Affects Versions: 0.7.0 Reporter: Carl Steinbach Assignee: Carl Steinbach Attachments: PIG-1508.patch.txt FOR-984 covers the very inconvenient fact that Forrest 0.8 does not work with Java 1.6 The same ticket also suggests a workaround: disabling sitemap and stylesheet validation by setting the forrest.validate.sitemap and forrest.validate.stylesheets properties to false. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-908) Need a way to correlate MR jobs with Pig statements
[ https://issues.apache.org/jira/browse/PIG-908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-908:
-------------------------------

With Pig 0.8.0 we print a summary of the execution that contains (among other things) how aliases mapped to jobs. Example:

JobId                   Maps  Reduces  MaxMapTime  MinMapTime  AvgMapTime  MaxReduceTime  MinReduceTime  AvgReduceTime  Alias  Feature            Outputs
job_201004271216_12712  1     1        3           3           3           12             12             12             B,C    GROUP_BY,COMBINER
job_201004271216_12713  1     1        3           3           3           12             12             12             D      SAMPLER
job_201004271216_12714  1     1        3           3           3           12             12             12             D      ORDER_BY,COMBINER  hdfs://wilbur20.labs.corp.sp1.yahoo.com:9020/tmp/temp743703298/tmp-2019944040,

Need a way to correlate MR jobs with Pig statements
---------------------------------------------------
                Key: PIG-908
                URL: https://issues.apache.org/jira/browse/PIG-908
            Project: Pig
         Issue Type: Wish
           Reporter: Dmitriy V. Ryaboy
           Assignee: Richard Ding
            Fix For: 0.8.0

Complex Pig Scripts often generate many Map-Reduce jobs, especially with the recent introduction of multi-store capabilities. For example, the first script in the Pig tutorial produces 5 MR jobs. There is currently very little support for debugging resulting jobs; if one of the MR jobs fails, it is hard to figure out which part of the script it was responsible for. Explain plans help, but even with the explain plan, a fair amount of effort (and sometimes, experimentation) is required to correlate the failing MR job with the corresponding PigLatin statements. This ticket is created to discuss approaches to alleviating this problem.

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1488) Make HDFS temp dir configurable
[ https://issues.apache.org/jira/browse/PIG-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-1488:
--------------------------------
    Release Note: Pig stores intermediate data generated between MR jobs in a temp location on HDFS. In Pig 0.8.0 this location is configurable by using the pig.temp.dir property. The default is /tmp, which is the same as the hardcoded location in Pig 0.7.0 and earlier versions.

Make HDFS temp dir configurable
-------------------------------
                Key: PIG-1488
                URL: https://issues.apache.org/jira/browse/PIG-1488
            Project: Pig
         Issue Type: Improvement
           Reporter: Olga Natkovich
            Fix For: 0.8.0

Currently it is hardcoded to /tmp. It should be made into a property.

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
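A minimal sketch of how the new property could be set, assuming a pig.properties-style configuration file; the property name comes from the release note above, but the directory path is made up for illustration:

```properties
# Hypothetical entry in conf/pig.properties (Pig 0.8.0+):
# keep intermediate data between MR jobs under a user-owned HDFS
# directory instead of the default /tmp
pig.temp.dir=/user/alice/pig_tmp
```

Since 0.8 also accepts generic -Dkey=value parameters on the command line, passing -Dpig.temp.dir=/user/alice/pig_tmp when launching Pig should have the same effect.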
[jira] Updated: (PIG-1505) support jars and scripts in dfs
[ https://issues.apache.org/jira/browse/PIG-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1505: -- Release Note: Pig now supports running scripts and registering jars that are stored in HDFS, Amazon S3, or other distributed file systems. (was: Pig now supports running scripts and registering jars that are stored in HDFS, Amazon S3, or other distributed file systems. Also added a -R parameter which allows users to specify properties in key=value form on the command line.) Remove -R option. In 0.8 Pig supports generic parameters such as -Dkey=value. support jars and scripts in dfs --- Key: PIG-1505 URL: https://issues.apache.org/jira/browse/PIG-1505 Project: Pig Issue Type: Improvement Affects Versions: 0.7.0 Reporter: Andrew Hitchcock Assignee: Andrew Hitchcock Fix For: 0.8.0 Attachments: PIG-1505-4.patch, pig-jars-and-scripts-from-dfs-3.patch, pig-jars-and-scripts-from-dfs-trunk-1.patch, pig-jars-and-scripts-from-dfs-trunk-2.patch, pig-jars-and-scripts-from-dfs-trunk.patch Pig can't operate on files stored in Amazon S3. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
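A sketch of what this change enables, with made-up paths and UDF names; the only claim taken from the issue is that registered jars may now live on HDFS, Amazon S3, or other distributed file systems:

```pig
-- Hypothetical: register a UDF jar directly from HDFS (PIG-1505, Pig 0.8.0+)
register hdfs://namenode:9000/libs/myudfs.jar;

-- a jar stored on Amazon S3 can be registered the same way
register s3://mybucket/libs/more-udfs.jar;

-- myudfs.Cleanup is a made-up UDF used only to show the registered jar in use
a = load '/data/input' as (line:chararray);
b = foreach a generate myudfs.Cleanup(line);
```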
[jira] Updated: (PIG-1484) BinStorage should support comma separated path
[ https://issues.apache.org/jira/browse/PIG-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-1484:
--------------------------------
    Release Note: In Pig 0.7.0 only a single location is supported as input to BinStorage. (This location can be a file, a directory or a glob.) With Pig 0.8.0 we are making BinStorage (similar to PigStorage) support a list of locations. Example:
a = load '1.bin,2.bin' using BinStorage();

BinStorage should support comma separated path
----------------------------------------------
                Key: PIG-1484
                URL: https://issues.apache.org/jira/browse/PIG-1484
            Project: Pig
         Issue Type: Bug
         Components: impl
   Affects Versions: 0.7.0
           Reporter: Daniel Dai
           Assignee: Daniel Dai
            Fix For: 0.7.0, 0.8.0
        Attachments: PIG-1484-1.patch, PIG-1484-2.patch, PIG-1484-3.patch

BinStorage does not take a comma separated path. The following script fails:
a = load '1.bin,2.bin' using BinStorage();
dump a;

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1447) Tune memory usage of InternalCachedBag
[ https://issues.apache.org/jira/browse/PIG-1447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thejas M Nair updated PIG-1447: --- Status: Patch Available (was: Open) Patch for increasing default value to 20%. No new test cases as this only changes the memory limit default. All core tests pass. Result of test-patch - [exec] -1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] -1 tests included. The patch doesn't appear to include any new or modified tests. [exec] Please justify why no tests are needed for this patch. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. Tune memory usage of InternalCachedBag -- Key: PIG-1447 URL: https://issues.apache.org/jira/browse/PIG-1447 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Thejas M Nair Fix For: 0.8.0 Attachments: L15_modified.pig, L15_modified2.pig, PIG-1447.1.patch We need to find a better value for pig.cachedbag.memusage. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1557) couple of issues mapping aliases to jobs
couple of issues mapping aliases to jobs
----------------------------------------
                Key: PIG-1557
                URL: https://issues.apache.org/jira/browse/PIG-1557
            Project: Pig
         Issue Type: Bug
   Affects Versions: 0.8.0
           Reporter: Olga Natkovich
           Assignee: Richard Ding

I have a simple script:
A = load '/user/pig/tests/data/singlefile/studenttab10k' as (name, age, gpa);
B = group A by name;
C = foreach B generate group, COUNT(A);
D = order C by $1;
E = limit D 10;
dump E;
I noticed a couple of issues with the alias-to-job mapping: neither load (A) nor limit (E) shows up in the output.

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
RE: August Pig contributor workshop
Ok, thanks Dmitry we have at least one more person coming with us. Olga -Original Message- From: Dmitriy Ryaboy [mailto:dvrya...@gmail.com] Sent: Monday, August 23, 2010 10:02 AM To: pig-dev@hadoop.apache.org Subject: Re: August Pig contributor workshop Olga, We do have another couple of spots. -Dmitriy On Thu, Aug 19, 2010 at 10:28 AM, Olga Natkovich ol...@yahoo-inc.comwrote: Dmitry, Do you have any spots left? Olga -Original Message- From: Russell Jurney [mailto:russell.jur...@gmail.com] Sent: Thursday, August 19, 2010 5:22 AM To: pig-dev@hadoop.apache.org Subject: Re: August Pig contributor workshop Oh, +2 more - Pete Skomoroch and Sam Shah will also attend, for a total of 4 LinkedIners. On Wed, Aug 18, 2010 at 9:18 PM, Alan Gates ga...@yahoo-inc.com wrote: Confirming Olga and I will be there. Alan. On Aug 18, 2010, at 4:45 PM, Dmitriy Ryaboy wrote: Hi folks, Please do RSVP so that we know how many people are coming. Thanks, -Dmitriy On Tue, Aug 17, 2010 at 4:04 PM, Alan Gates ga...@yahoo-inc.com wrote: All, We will be holding the next Pig contributor workshop at Twitter on Wednesday, August 25 from 4-6. The tentative agenda is to discuss: Making Piggybank better Pig and Azkaban integration Plans for features in 0.9 An update on the Howl project Anyone contributing to or interested i
[jira] Commented: (PIG-1447) Tune memory usage of InternalCachedBag
[ https://issues.apache.org/jira/browse/PIG-1447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12901576#action_12901576 ] Olga Natkovich commented on PIG-1447: - This is probably the smallest patch I have reviewed recently :). +1 Tune memory usage of InternalCachedBag -- Key: PIG-1447 URL: https://issues.apache.org/jira/browse/PIG-1447 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Thejas M Nair Fix For: 0.8.0 Attachments: L15_modified.pig, L15_modified2.pig, PIG-1447.1.patch We need to find a better value for pig.cachedbag.memusage. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1354) UDFs for dynamic invocation of simple Java methods
[ https://issues.apache.org/jira/browse/PIG-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12901577#action_12901577 ] Olga Natkovich commented on PIG-1354: - Dmitry, Could you add release notes on how to use this? UDFs for dynamic invocation of simple Java methods -- Key: PIG-1354 URL: https://issues.apache.org/jira/browse/PIG-1354 Project: Pig Issue Type: New Feature Affects Versions: 0.8.0 Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: PIG-1354.patch, PIG-1354.patch, PIG-1354.patch The need to create wrapper UDFs for simple Java functions creates unnecessary work for Pig users, slows down the development process, and produces a lot of trivial classes. We can use Java's reflection to allow invoking a number of methods on the fly, dynamically, by creating a generic UDF to accomplish this. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1447) Tune memory usage of InternalCachedBag
[ https://issues.apache.org/jira/browse/PIG-1447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thejas M Nair updated PIG-1447: --- Status: Resolved (was: Patch Available) Hadoop Flags: [Reviewed] Resolution: Fixed Patch committed to trunk. Tune memory usage of InternalCachedBag -- Key: PIG-1447 URL: https://issues.apache.org/jira/browse/PIG-1447 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Thejas M Nair Fix For: 0.8.0 Attachments: L15_modified.pig, L15_modified2.pig, PIG-1447.1.patch We need to find a better value for pig.cachedbag.memusage. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1354) UDFs for dynamic invocation of simple Java methods
[ https://issues.apache.org/jira/browse/PIG-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12901584#action_12901584 ] Dmitriy V. Ryaboy commented on PIG-1354: Olga, There is a follow-up ticket here: https://issues.apache.org/jira/browse/PIG-1551 If that gets committed, I have a pretty detailed explanation of how to use the stuff in http://squarecog.wordpress.com/2010/08/20/upcoming-features-in-pig-0-8-dynamic-invokers/ (happy to put the link in release notes, or just paste the whole post). UDFs for dynamic invocation of simple Java methods -- Key: PIG-1354 URL: https://issues.apache.org/jira/browse/PIG-1354 Project: Pig Issue Type: New Feature Affects Versions: 0.8.0 Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: PIG-1354.patch, PIG-1354.patch, PIG-1354.patch The need to create wrapper UDFs for simple Java functions creates unnecessary work for Pig users, slows down the development process, and produces a lot of trivial classes. We can use Java's reflection to allow invoking a number of methods on the fly, dynamically, by creating a generic UDF to accomplish this. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1354) UDFs for dynamic invocation of simple Java methods
[ https://issues.apache.org/jira/browse/PIG-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12901585#action_12901585 ] Olga Natkovich commented on PIG-1354: - Sounds good, Dmitry. Richard will review and commit the patch and then please paste the release notes. UDFs for dynamic invocation of simple Java methods -- Key: PIG-1354 URL: https://issues.apache.org/jira/browse/PIG-1354 Project: Pig Issue Type: New Feature Affects Versions: 0.8.0 Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: PIG-1354.patch, PIG-1354.patch, PIG-1354.patch The need to create wrapper UDFs for simple Java functions creates unnecessary work for Pig users, slows down the development process, and produces a lot of trivial classes. We can use Java's reflection to allow invoking a number of methods on the fly, dynamically, by creating a generic UDF to accomplish this. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
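A sketch of how the dynamic invokers discussed in this thread are meant to be used, following the pattern described in the linked blog post; treat the InvokeForString class name and its argument convention as illustrative of the feature rather than authoritative:

```pig
-- Bind a static Java method (java.net.URLDecoder.decode) as a Pig eval
-- function without writing a wrapper UDF. "ForString" names the method's
-- return type; the second argument lists the Java parameter types.
DEFINE UrlDecode InvokeForString('java.net.URLDecoder.decode', 'String String');

encoded = load 'urls.txt' as (url:chararray);
decoded = foreach encoded generate UrlDecode(url, 'UTF-8');
```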
[jira] Commented: (PIG-1508) Make 'docs' target (forrest) work with Java 1.6
[ https://issues.apache.org/jira/browse/PIG-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901586#action_12901586 ]

Alan Gates commented on PIG-1508:
---------------------------------

I can't figure out a way to test test-patch.sh without checking it in. And, if this does break something it will make life hard for developers who are trying to get their patches in before the 0.8 branch is cut. So, I propose that I hold off checking this in until we have all other pre-0.8 patches checked in. Then I'll check it in and do extensive testing with test-patch. That way I can quickly fix any issues I find and not disrupt others. Then we can branch for 0.8. Seem reasonable?

As a side note, we still need Java 1.5 for forrest in the site docs. This patch only claims to fix it for the docs target, which it does. I'll open a separate JIRA to fix it on the site side, as it would be really nice to not force people to have 2 versions of Java to build Pig stuff.

Make 'docs' target (forrest) work with Java 1.6
-----------------------------------------------
                Key: PIG-1508
                URL: https://issues.apache.org/jira/browse/PIG-1508
            Project: Pig
         Issue Type: Bug
         Components: documentation
   Affects Versions: 0.7.0
           Reporter: Carl Steinbach
           Assignee: Carl Steinbach
        Attachments: PIG-1508.patch.txt

FOR-984 covers the very inconvenient fact that Forrest 0.8 does not work with Java 1.6. The same ticket also suggests a workaround: disabling sitemap and stylesheet validation by setting the forrest.validate.sitemap and forrest.validate.stylesheets properties to false.

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1311) Pig interfaces should be clearly classified in terms of scope and stability
[ https://issues.apache.org/jira/browse/PIG-1311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12901587#action_12901587 ] Olga Natkovich commented on PIG-1311: - +1, please, commit Pig interfaces should be clearly classified in terms of scope and stability --- Key: PIG-1311 URL: https://issues.apache.org/jira/browse/PIG-1311 Project: Pig Issue Type: Improvement Reporter: Alan Gates Assignee: Alan Gates Fix For: 0.8.0 Attachments: PIG-1311.patch Clearly marking Pig interfaces (Java interfaces but also things like config files, CLIs, Pig Latin syntax and semantics, etc.) to show scope (public/private) and stability (stable/evolving/unstable) will help users understand how to interact with Pig and developers to understand what things they can and cannot change. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1558) build.xml for site directory does not work
build.xml for site directory does not work -- Key: PIG-1558 URL: https://issues.apache.org/jira/browse/PIG-1558 Project: Pig Issue Type: Bug Components: build Affects Versions: 0.8.0 Reporter: Alan Gates Assignee: Alan Gates Priority: Minor Fix For: 0.8.0 Going to the site directory and running ant produces: {code} ant Buildfile: build.xml clean: [delete] Deleting directory /Users/gates/src/pig/apache/site/author/build update: BUILD FAILED /Users/gates/src/pig/apache/site/build.xml:6: Execute failed: java.io.IOException: Cannot run program forrest (in directory /Users/gates/src/pig/apache/site/author): error=2, No such file or directory {code} Also, forrest here still requires Java 1.5, which can be fixed (see PIG-1508). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1552) Nested describe failed when the alias is not referred in the first foreach inner plan
[ https://issues.apache.org/jira/browse/PIG-1552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12901593#action_12901593 ] Daniel Dai commented on PIG-1552: - Unit tests pass. test-patch result: [exec] +1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 3 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. Nested describe failed when the alias is not referred in the first foreach inner plan - Key: PIG-1552 URL: https://issues.apache.org/jira/browse/PIG-1552 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.8.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.8.0 Attachments: PIG-1552-1.patch The following script fails: {code} A = load 'studentab10k' as (name, age, gpa); B = group A by name; C = foreach B { D = distinct A.age; generate group, COUNT(D); } describe C::D; {code} If we remove group from the generate statement, then it works. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1552) Nested describe failed when the alias is not referred in the first foreach inner plan
[ https://issues.apache.org/jira/browse/PIG-1552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1552: Status: Resolved (was: Patch Available) Hadoop Flags: [Reviewed] Resolution: Fixed Patch committed. Nested describe failed when the alias is not referred in the first foreach inner plan - Key: PIG-1552 URL: https://issues.apache.org/jira/browse/PIG-1552 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.8.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.8.0 Attachments: PIG-1552-1.patch The following script fails: {code} A = load 'studentab10k' as (name, age, gpa); B = group A by name; C = foreach B { D = distinct A.age; generate group, COUNT(D); } describe C::D; {code} If we remove group from the generate statement, then it works. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1518) multi file input format for loaders
[ https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12901600#action_12901600 ] Richard Ding commented on PIG-1518: --- +1. The patch looks good. A few minor points: * In PigSplit, the method add(InputSplit split) is not used and can be removed * In MapRedUtil, it would be better to not leave the debug verification code in the source code * In PigRecordReader, the code can be simplified if the initNextRecordReader() call is moved from the constructor to the initialize() method multi file input format for loaders --- Key: PIG-1518 URL: https://issues.apache.org/jira/browse/PIG-1518 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Yan Zhou Fix For: 0.8.0 Attachments: PIG-1518.patch, PIG-1518.patch We frequently run into the situation where Pig needs to deal with small files in the input. In this case a separate map is created for each file, which could be very inefficient. It would be great to have an umbrella input format that can take multiple files and use them in a single split. We would like to see this working with different data formats if possible. There are already a couple of input formats doing a similar thing: MultifileInputFormat as well as CombinedInputFormat; however, neither works with the new Hadoop 20 API. We at least want to do a feasibility study for Pig 0.8.0. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
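The core of the PIG-1518 idea — packing many small input files into fewer splits so fewer map tasks are launched — can be sketched without any Hadoop dependency. This is a simplified, hypothetical model (greedy packing by size), not the actual PigSplit/CombineFileInputFormat code:

```java
import java.util.ArrayList;
import java.util.List;

public class SplitPacker {
    // Greedily pack file sizes (in bytes) into splits no larger than
    // maxSplitBytes, so many small files share one map task.
    static List<List<Long>> pack(long[] fileSizes, long maxSplitBytes) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : fileSizes) {
            // Start a new split if adding this file would exceed the cap.
            if (!current.isEmpty() && currentBytes + size > maxSplitBytes) {
                splits.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        if (!current.isEmpty()) splits.add(current);
        return splits;
    }

    public static void main(String[] args) {
        // Five small files become two splits instead of five map tasks.
        long[] sizes = {10, 20, 30, 90, 5};
        System.out.println(pack(sizes, 100).size()); // 2
    }
}
```

A real implementation would additionally consider data locality (grouping files by host/rack), which is what makes the Hadoop combined-input formats more involved.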
[jira] Updated: (PIG-1558) build.xml for site directory does not work
[ https://issues.apache.org/jira/browse/PIG-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates updated PIG-1558: Attachment: PIG-1558.patch Attached patch makes it so that the ant invocation requires the user to specify the location of forrest. Also, the validation phase of forrest is disabled so that Java 1.6 can be used. Removal of the validation phase does not seem to impact creation of the web pages. build.xml for site directory does not work -- Key: PIG-1558 URL: https://issues.apache.org/jira/browse/PIG-1558 Project: Pig Issue Type: Bug Components: build Affects Versions: 0.8.0 Reporter: Alan Gates Assignee: Alan Gates Priority: Minor Fix For: 0.8.0 Attachments: PIG-1558.patch Going to the site directory and running ant produces: {code} ant Buildfile: build.xml clean: [delete] Deleting directory /Users/gates/src/pig/apache/site/author/build update: BUILD FAILED /Users/gates/src/pig/apache/site/build.xml:6: Execute failed: java.io.IOException: Cannot run program forrest (in directory /Users/gates/src/pig/apache/site/author): error=2, No such file or directory {code} Also, forrest here still requires Java 1.5, which can be fixed (see PIG-1508). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1559) Several things stated in Pig philosophy page are out of date
Several things stated in Pig philosophy page are out of date Key: PIG-1559 URL: https://issues.apache.org/jira/browse/PIG-1559 Project: Pig Issue Type: Bug Components: documentation Affects Versions: 0.7.0 Reporter: Alan Gates Assignee: Alan Gates Priority: Minor Fix For: 0.8.0 The Pig philosophy page says several things that are no longer true (such as that Pig does not have an optimizer (it does now), that we someday hope to support streaming (we already do), that we some day hope to control splits (we don't, we just use what Hadoop gives us now)). These need to be updated to reflect the current situation. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1558) build.xml for site directory does not work
[ https://issues.apache.org/jira/browse/PIG-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12901612#action_12901612 ] Olga Natkovich commented on PIG-1558: - +1 build.xml for site directory does not work -- Key: PIG-1558 URL: https://issues.apache.org/jira/browse/PIG-1558 Project: Pig Issue Type: Bug Components: build Affects Versions: 0.8.0 Reporter: Alan Gates Assignee: Alan Gates Priority: Minor Fix For: 0.8.0 Attachments: PIG-1558.patch Going to the site directory and running ant produces: {code} ant Buildfile: build.xml clean: [delete] Deleting directory /Users/gates/src/pig/apache/site/author/build update: BUILD FAILED /Users/gates/src/pig/apache/site/build.xml:6: Execute failed: java.io.IOException: Cannot run program forrest (in directory /Users/gates/src/pig/apache/site/author): error=2, No such file or directory {code} Also, forrest here still requires Java 1.5, which can be fixed (see PIG-1508). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1508) Make 'docs' target (forrest) work with Java 1.6
[ https://issues.apache.org/jira/browse/PIG-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12901617#action_12901617 ] Alan Gates commented on PIG-1508: - I'm guessing the contrib failures are just because Hudson isn't working properly. I run contrib tests only with 1.6 all the time and don't see issues. The site issues I'm talking about are under pig/site (not pig/trunk). I've already posted another patch (see PIG-1558) to deal with it. Make 'docs' target (forrest) work with Java 1.6 --- Key: PIG-1508 URL: https://issues.apache.org/jira/browse/PIG-1508 Project: Pig Issue Type: Bug Components: documentation Affects Versions: 0.7.0 Reporter: Carl Steinbach Assignee: Carl Steinbach Attachments: PIG-1508.patch.txt FOR-984 covers the very inconvenient fact that Forrest 0.8 does not work with Java 1.6 The same ticket also suggests a workaround: disabling sitemap and stylesheet validation by setting the forrest.validate.sitemap and forrest.validate.stylesheets properties to false. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Pig optimizer
Hey everyone, I was wondering if anybody has any references or suggestions on how to learn about Pig's optimizer besides the source code or Pig's paper. Thanks in advance. Renato M.
[jira] Updated: (PIG-1510) Add `deepCopy` for LogicalExpressions
[ https://issues.apache.org/jira/browse/PIG-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1510: Status: Resolved (was: Patch Available) Hadoop Flags: [Reviewed] Resolution: Fixed Patch committed to trunk. Thanks Swati for contributing! Add `deepCopy` for LogicalExpressions - Key: PIG-1510 URL: https://issues.apache.org/jira/browse/PIG-1510 Project: Pig Issue Type: New Feature Components: data Affects Versions: 0.8.0 Reporter: Swati Jain Assignee: Swati Jain Fix For: 0.8.0 Attachments: deepCopy.patch, deepCopy.patch It would be useful to have a way to `deepCopy` an expression. `deepCopy` will create a new object so that changes made to one object will not reflect in the copy. There are 2 reasons why we don't override clone. * It may be better to use `deepCopy` since the copy semantics are explicit (since deepCopy may be expensive). * A second important reason for defining `deepCopy` as a separate routine is that it can be passed a plan as an argument which will be updated as the expression is copied (through plan.add and plan.connect). The usage would look like the following: {noformat} LogicalExpressionPlan logicalPlan = new LogicalExpressionPlan(); LogicalExpression copyExpression = origExpression.deepCopy( logicalPlan ); {noformat} An immediate motivation for this would be for constructing the expressions that constitute the CNF form of an expression. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
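The deepCopy contract described above — a copy whose mutations never leak back to the original, registered into a destination plan as it is built — can be illustrated with a tiny stand-in expression tree. The classes below are hypothetical, not Pig's actual LogicalExpression API:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal expression node: an operator name plus child expressions.
class Expr {
    final String op;
    final List<Expr> children = new ArrayList<>();
    Expr(String op) { this.op = op; }

    // Recursively copy this subtree, adding each new node to the
    // destination plan (analogous to plan.add / plan.connect).
    Expr deepCopy(List<Expr> plan) {
        Expr copy = new Expr(op);
        plan.add(copy);
        for (Expr c : children) {
            copy.children.add(c.deepCopy(plan));
        }
        return copy;
    }
}

public class DeepCopyDemo {
    public static void main(String[] args) {
        Expr add = new Expr("add");
        add.children.add(new Expr("x"));
        add.children.add(new Expr("const1"));

        List<Expr> newPlan = new ArrayList<>();
        Expr copy = add.deepCopy(newPlan);

        // Mutating the copy leaves the original untouched.
        copy.children.get(0).children.add(new Expr("mutated"));
        System.out.println(newPlan.size() + " "
                + add.children.get(0).children.size()); // 3 0
    }
}
```

This also shows why a plan argument is more useful than Object.clone(): the destination plan is populated as a side effect of the copy, exactly what CNF construction needs.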
[jira] Updated: (PIG-1559) Several things stated in Pig philosophy page are out of date
[ https://issues.apache.org/jira/browse/PIG-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates updated PIG-1559: Attachment: PIG-1559.patch Several things stated in Pig philosophy page are out of date Key: PIG-1559 URL: https://issues.apache.org/jira/browse/PIG-1559 Project: Pig Issue Type: Bug Components: documentation Affects Versions: 0.7.0 Reporter: Alan Gates Assignee: Alan Gates Priority: Minor Fix For: 0.8.0 Attachments: PIG-1559.patch The Pig philosophy page says several things that are no longer true (such as that Pig does not have an optimizer (it does now), that we someday hope to support streaming (we already do), that we some day hope to control splits (we don't, we just use what Hadoop gives us now)). These need to be updated to reflect the current situation. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1559) Several things stated in Pig philosophy page are out of date
[ https://issues.apache.org/jira/browse/PIG-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates updated PIG-1559: Status: Patch Available (was: Open) Several things stated in Pig philosophy page are out of date Key: PIG-1559 URL: https://issues.apache.org/jira/browse/PIG-1559 Project: Pig Issue Type: Bug Components: documentation Affects Versions: 0.7.0 Reporter: Alan Gates Assignee: Alan Gates Priority: Minor Fix For: 0.8.0 Attachments: PIG-1559.patch The Pig philosophy page says several things that are no longer true (such as that Pig does not have an optimizer (it does now), that we someday hope to support streaming (we already do), that we some day hope to control splits (we don't, we just use what Hadoop gives us now)). These need to be updated to reflect the current situation. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: Pig optimizer
Hi, Renato, There is a description of the optimization rules in the Pig Latin reference manual: http://hadoop.apache.org/pig/docs/r0.7.0/piglatin_ref1.html#Optimization+Rules. Is that enough? Daniel Renato Marroquín Mogrovejo wrote: Hey everyone, I was wondering if anybody has any references or suggestions on how to learn about Pig's optimizer besides the source code or Pig's paper. Thanks in advance. Renato M.
is Hudson awol?
Haven't heard anything from Hudson in a while... -D
[jira] Updated: (PIG-1518) multi file input format for loaders
[ https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Zhou updated PIG-1518: -- Attachment: PIG-1518.patch The add method of PigSplit is removed. The debug code is left to facilitate future debugging work. The use of initNextRecordReader is pretty much cloned from org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader and I'll leave it as is too. multi file input format for loaders --- Key: PIG-1518 URL: https://issues.apache.org/jira/browse/PIG-1518 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Yan Zhou Fix For: 0.8.0 Attachments: PIG-1518.patch, PIG-1518.patch, PIG-1518.patch We frequently run into the situation where Pig needs to deal with small files in the input. In this case a separate map is created for each file, which could be very inefficient. It would be great to have an umbrella input format that can take multiple files and use them in a single split. We would like to see this working with different data formats if possible. There are already a couple of input formats doing a similar thing: MultifileInputFormat as well as CombinedInputFormat; however, neither works with the new Hadoop 20 API. We at least want to do a feasibility study for Pig 0.8.0. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1551) Improve dynamic invokers to deal with no-arg methods and array parameters
[ https://issues.apache.org/jira/browse/PIG-1551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12901656#action_12901656 ] Richard Ding commented on PIG-1551: --- In Invoker.java, there is a typo: {code} private static final Class<?> LONG_ARRAY_CLASS = new String[0].getClass(); {code} Also, in the unPrimitivize method, this code seems unnecessary: {code} } else if (klass.equals(DOUBLE_ARRAY_CLASS)) { return DOUBLE_ARRAY_CLASS; {code} Otherwise the patch looks good. Improve dynamic invokers to deal with no-arg methods and array parameters - Key: PIG-1551 URL: https://issues.apache.org/jira/browse/PIG-1551 Project: Pig Issue Type: Improvement Affects Versions: 0.8.0 Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: PIG-1551.patch PIG-1354 introduced a set of UDFs that can be used to dynamically wrap simple Java methods in a UDF, so that users don't need to create trivial wrappers if they are ok sacrificing some speed. This issue is to extend the set of methods that can be wrapped this way to include methods that do not take any arguments, and methods that take arrays of {int,long,float,double,string} as arguments. Arrays are expected to be represented by bags in Pig. Notably, this allows users to wrap statistical functions in o.a.commons.math.stat.StatUtils. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
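The typo flagged above is a class-literal constant built from the wrong array type. A corrected version would presumably build the constant from a Long array; the snippet below is a hypothetical sketch of the fix, not the committed code:

```java
public class ArrayClassConstants {
    // Corrected: the constant for Long[] must come from a Long array,
    // not from new String[0].getClass() as in the flagged typo.
    private static final Class<?> LONG_ARRAY_CLASS = new Long[0].getClass();

    public static void main(String[] args) {
        // The component type confirms which array class the constant holds.
        System.out.println(LONG_ARRAY_CLASS.getComponentType().getSimpleName()); // Long
    }
}
```

Such typos are easy to miss precisely because every `T[0].getClass()` expression type-checks against `Class<?>`; a unit test asserting the component type catches them.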
Re: split operator
Hi Daniel, This is a question from long ago, but I suddenly came up with some more thoughts on this. In a query as simple as this: A = LOAD 'input'; B = FILTER A BY $1 == 1; C = COGROUP A BY $0, B BY $0; the optimizer will insert a split operator to reuse A. According to the source code, a map-reduce job will be ended when it sees split and output the result to A1 and A2 which will be used by two subsequent jobs to process B and C. In this case, the first job does nothing meaningful but copy the source 'input' twice. Is there some optimization applied here (like the MultiQueryOptimizer you mentioned previously)? How? Since I didn't take a look at the MultiQueryOptimizer, it would be a great help if you can briefly describe how MultiQueryOptimizer works. Thanks a lot. -Gang - Original Message - From: Daniel Dai jiany...@yahoo-inc.com To: pig-dev@hadoop.apache.org pig-dev@hadoop.apache.org Date: 2010/7/26 (Mon) 4:58:49 PM Subject: Re: split operator Hi, Gang, It is about multiquery optimization. In MRCompiler, we will create a map-reduce boundary for split; later, in MultiQueryOptimizer, we will merge several splits into one map-reduce job. In this map-reduce job, we will nest several split plans. Daniel Gang Luo wrote: Hi Daniel, in 4.3.1, the example and figure 6 show this. 5.1 last paragraph says the split operator maintains a one-tuple buffer for each branch and talks about how to synchronize multiple branches. I do think that is the in-memory split. Here is the paper: http://www.vldb.org/pvldb/2/vldb09-1074.pdf -Gang - Original Message - From: Daniel Dai jiany...@yahoo-inc.com To: pig-dev@hadoop.apache.org pig-dev@hadoop.apache.org Date: 2010/7/26 (Mon) 2:09:25 PM Subject: Re: split operator Hi, Gang, Which part of the paper are you talking about? We don't do in-memory split. We dump the split result to a temporary file and start a new map-reduce job. 
Split does create a map-reduce boundary (though it is not entirely true; the multiquery optimizer may combine some of these jobs) Daniel Gang Luo wrote: Hi all, according to the VLDB 09 paper, the split operator and all its successive operators reside in memory without any blocking in between. However, the source code (version 0.7) shows that a MR job is actually ended when it meets the split operator and multiple new MR jobs are created, each representing one branch. This write-once-read-multiple-times method is different from the in-memory method mentioned in that paper. Does Pig change the strategy for split, or is there still an in-memory version of split I didn't discover? Thanks, -Gang
[jira] Created: (PIG-1560) Build target 'checkstyle' fails
Build target 'checkstyle' fails --- Key: PIG-1560 URL: https://issues.apache.org/jira/browse/PIG-1560 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Richard Ding Assignee: Giridharan Kesavan Fix For: 0.8.0 Stack trace: {code} /homes/rding/apache-pig/trunk/build.xml:894: java.lang.NoClassDefFoundError: org/apache/commons/logging/LogFactory at org.apache.commons.beanutils.ConvertUtilsBean.init(ConvertUtilsBean.java:130) at com.puppycrawl.tools.checkstyle.api.AutomaticBean.createBeanUtilsBean(AutomaticBean.java:73) at com.puppycrawl.tools.checkstyle.api.AutomaticBean.contextualize(AutomaticBean.java:222) at com.puppycrawl.tools.checkstyle.CheckStyleTask.createChecker(CheckStyleTask.java:372) at com.puppycrawl.tools.checkstyle.CheckStyleTask.realExecute(CheckStyleTask.java:304) at com.puppycrawl.tools.checkstyle.CheckStyleTask.execute(CheckStyleTask.java:265) at org.apache.tools.ant.UnknownElement.execute(UnknownElement.java:291) at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.tools.ant.dispatch.DispatchUtils.execute(DispatchUtils.java:106) at org.apache.tools.ant.Task.perform(Task.java:348) at org.apache.tools.ant.Target.execute(Target.java:390) at org.apache.tools.ant.Target.performTasks(Target.java:411) at org.apache.tools.ant.Project.executeSortedTargets(Project.java:1360) at org.apache.tools.ant.Project.executeTarget(Project.java:1329) at org.apache.tools.ant.helper.DefaultExecutor.executeTargets(DefaultExecutor.java:41) at org.apache.tools.ant.Project.executeTargets(Project.java:1212) at org.apache.tools.ant.Main.runBuild(Main.java:801) at org.apache.tools.ant.Main.startAnt(Main.java:218) at org.apache.tools.ant.launch.Launcher.run(Launcher.java:280) at org.apache.tools.ant.launch.Launcher.main(Launcher.java:109) Caused by: java.lang.ClassNotFoundException: 
org.apache.commons.logging.LogFactory at org.apache.tools.ant.AntClassLoader.findClassInComponents(AntClassLoader.java:1386) at org.apache.tools.ant.AntClassLoader.findClass(AntClassLoader.java:1336) at org.apache.tools.ant.AntClassLoader.loadClass(AntClassLoader.java:1074) at java.lang.ClassLoader.loadClass(ClassLoader.java:248) ... 22 more {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1560) Build target 'checkstyle' fails
[ https://issues.apache.org/jira/browse/PIG-1560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1560: -- Description: Stack trace: {code} /trunk/build.xml:894: java.lang.NoClassDefFoundError: org/apache/commons/logging/LogFactory at org.apache.commons.beanutils.ConvertUtilsBean.init(ConvertUtilsBean.java:130) at com.puppycrawl.tools.checkstyle.api.AutomaticBean.createBeanUtilsBean(AutomaticBean.java:73) at com.puppycrawl.tools.checkstyle.api.AutomaticBean.contextualize(AutomaticBean.java:222) at com.puppycrawl.tools.checkstyle.CheckStyleTask.createChecker(CheckStyleTask.java:372) at com.puppycrawl.tools.checkstyle.CheckStyleTask.realExecute(CheckStyleTask.java:304) at com.puppycrawl.tools.checkstyle.CheckStyleTask.execute(CheckStyleTask.java:265) at org.apache.tools.ant.UnknownElement.execute(UnknownElement.java:291) at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.tools.ant.dispatch.DispatchUtils.execute(DispatchUtils.java:106) at org.apache.tools.ant.Task.perform(Task.java:348) at org.apache.tools.ant.Target.execute(Target.java:390) at org.apache.tools.ant.Target.performTasks(Target.java:411) at org.apache.tools.ant.Project.executeSortedTargets(Project.java:1360) at org.apache.tools.ant.Project.executeTarget(Project.java:1329) at org.apache.tools.ant.helper.DefaultExecutor.executeTargets(DefaultExecutor.java:41) at org.apache.tools.ant.Project.executeTargets(Project.java:1212) at org.apache.tools.ant.Main.runBuild(Main.java:801) at org.apache.tools.ant.Main.startAnt(Main.java:218) at org.apache.tools.ant.launch.Launcher.run(Launcher.java:280) at org.apache.tools.ant.launch.Launcher.main(Launcher.java:109) Caused by: java.lang.ClassNotFoundException: org.apache.commons.logging.LogFactory at 
org.apache.tools.ant.AntClassLoader.findClassInComponents(AntClassLoader.java:1386) at org.apache.tools.ant.AntClassLoader.findClass(AntClassLoader.java:1336) at org.apache.tools.ant.AntClassLoader.loadClass(AntClassLoader.java:1074) at java.lang.ClassLoader.loadClass(ClassLoader.java:248) ... 22 more {code} was: Stack trace: {code} /homes/rding/apache-pig/trunk/build.xml:894: java.lang.NoClassDefFoundError: org/apache/commons/logging/LogFactory at org.apache.commons.beanutils.ConvertUtilsBean.init(ConvertUtilsBean.java:130) at com.puppycrawl.tools.checkstyle.api.AutomaticBean.createBeanUtilsBean(AutomaticBean.java:73) at com.puppycrawl.tools.checkstyle.api.AutomaticBean.contextualize(AutomaticBean.java:222) at com.puppycrawl.tools.checkstyle.CheckStyleTask.createChecker(CheckStyleTask.java:372) at com.puppycrawl.tools.checkstyle.CheckStyleTask.realExecute(CheckStyleTask.java:304) at com.puppycrawl.tools.checkstyle.CheckStyleTask.execute(CheckStyleTask.java:265) at org.apache.tools.ant.UnknownElement.execute(UnknownElement.java:291) at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.tools.ant.dispatch.DispatchUtils.execute(DispatchUtils.java:106) at org.apache.tools.ant.Task.perform(Task.java:348) at org.apache.tools.ant.Target.execute(Target.java:390) at org.apache.tools.ant.Target.performTasks(Target.java:411) at org.apache.tools.ant.Project.executeSortedTargets(Project.java:1360) at org.apache.tools.ant.Project.executeTarget(Project.java:1329) at org.apache.tools.ant.helper.DefaultExecutor.executeTargets(DefaultExecutor.java:41) at org.apache.tools.ant.Project.executeTargets(Project.java:1212) at org.apache.tools.ant.Main.runBuild(Main.java:801) at org.apache.tools.ant.Main.startAnt(Main.java:218) at org.apache.tools.ant.launch.Launcher.run(Launcher.java:280) at 
org.apache.tools.ant.launch.Launcher.main(Launcher.java:109) Caused by: java.lang.ClassNotFoundException: org.apache.commons.logging.LogFactory at org.apache.tools.ant.AntClassLoader.findClassInComponents(AntClassLoader.java:1386) at org.apache.tools.ant.AntClassLoader.findClass(AntClassLoader.java:1336) at org.apache.tools.ant.AntClassLoader.loadClass(AntClassLoader.java:1074) at java.lang.ClassLoader.loadClass(ClassLoader.java:248) ... 22 more {code} Build target 'checkstyle' fails --- Key: PIG-1560 URL:
Re: split operator
Hi, Gang, Yes, that's what MultiQueryOptimizer addresses. After splitting, we split the script into smaller combinable pieces, and MultiQueryOptimizer will combine as many splitters and splittees as possible into the same map-reduce job. So after SplitInserter, you might see more jobs, but you will end up with fewer jobs. The algorithm for MultiQueryOptimizer is: for every splitter, find as many combinable splittees as possible, and combine them into the same mapreduce job. You can find more details at http://wiki.apache.org/pig/PigMultiQueryPerformanceSpecification Daniel Gang Luo wrote: Hi Daniel, This is a question from long ago, but I suddenly came up with some more thoughts on this. In a query as simple as this: A = LOAD 'input'; B = FILTER A BY $1 == 1; C = COGROUP A BY $0, B BY $0; the optimizer will insert a split operator to reuse A. According to the source code, a map-reduce job will be ended when it sees split and output the result to A1 and A2 which will be used by two subsequent jobs to process B and C. In this case, the first job does nothing meaningful but copy the source 'input' twice. Is there some optimization applied here (like the MultiQueryOptimizer you mentioned previously)? How? Since I didn't take a look at the MultiQueryOptimizer, it would be a great help if you can briefly describe how MultiQueryOptimizer works. Thanks a lot. -Gang - Original Message - From: Daniel Dai jiany...@yahoo-inc.com To: pig-dev@hadoop.apache.org pig-dev@hadoop.apache.org Date: 2010/7/26 (Mon) 4:58:49 PM Subject: Re: split operator Hi, Gang, It is about multiquery optimization. In MRCompiler, we will create a map-reduce boundary for split; later, in MultiQueryOptimizer, we will merge several splits into one map-reduce job. In this map-reduce job, we will nest several split plans. Daniel Gang Luo wrote: Hi Daniel, in 4.3.1, the example and figure 6 show this. 5.1 last paragraph says the split operator maintains a one-tuple buffer for each branch and talks about how to synchronize multiple branches. 
I do think that is the in-memory split. Here is the paper: http://www.vldb.org/pvldb/2/vldb09-1074.pdf -Gang - Original Message - From: Daniel Dai jiany...@yahoo-inc.com To: pig-dev@hadoop.apache.org pig-dev@hadoop.apache.org Date: 2010/7/26 (Mon) 2:09:25 PM Subject: Re: split operator Hi, Gang, Which part of the paper are you talking about? We don't do in-memory split. We dump the split result to a temporary file and start a new map-reduce job. Split does create a map-reduce boundary (though it is not entirely true; the multiquery optimizer may combine some of these jobs) Daniel Gang Luo wrote: Hi all, according to the VLDB 09 paper, the split operator and all its successive operators reside in memory without any blocking in between. However, the source code (version 0.7) shows that a MR job is actually ended when it meets the split operator and multiple new MR jobs are created, each representing one branch. This write-once-read-multiple-times method is different from the in-memory method mentioned in that paper. Does Pig change the strategy for split, or is there still an in-memory version of split I didn't discover? Thanks, -Gang
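The job-count effect of the merge described in this thread can be modeled in a few lines. This is a hypothetical simplified model, not the MultiQueryOptimizer code: each splitter and each splittee would naively be its own map-reduce job, while after merging, a splitter and all its combinable splittees share one job.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MultiQueryJobCount {
    // Naive plan: one job for the splitter plus one per splittee branch.
    static int naiveJobs(Map<String, List<String>> splitterToSplittees) {
        int jobs = 0;
        for (List<String> splittees : splitterToSplittees.values()) {
            jobs += 1 + splittees.size();
        }
        return jobs;
    }

    // Merged plan: splitter and its combinable splittees nest in one job.
    static int mergedJobs(Map<String, List<String>> splitterToSplittees) {
        return splitterToSplittees.size();
    }

    public static void main(String[] args) {
        Map<String, List<String>> plan = new LinkedHashMap<>();
        // A (the load) is the splitter; the filter branch (B) and the
        // cogroup branch (C) are its splittees, as in the query above.
        plan.put("A", Arrays.asList("B", "C"));
        System.out.println(naiveJobs(plan) + " -> " + mergedJobs(plan)); // 3 -> 1
    }
}
```

In the real optimizer, "combinable" is constrained (e.g. by what can legally nest in one map or reduce plan), so not every splittee folds into its splitter's job.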
[jira] Updated: (PIG-1518) multi file input format for loaders
[ https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Zhou updated PIG-1518: -- Attachment: PIG-1518.patch Fix a typo; rebase on the latest trunk. multi file input format for loaders --- Key: PIG-1518 URL: https://issues.apache.org/jira/browse/PIG-1518 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Yan Zhou Fix For: 0.8.0 Attachments: PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch We frequently run into the situation where Pig needs to deal with small files in the input. In this case a separate map is created for each file, which could be very inefficient. It would be great to have an umbrella input format that can take multiple files and use them in a single split. We would like to see this working with different data formats if possible. There are already a couple of input formats doing a similar thing: MultifileInputFormat as well as CombinedInputFormat; however, neither works with the new Hadoop 20 API. We at least want to do a feasibility study for Pig 0.8.0. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1515) Migrate logical optimization rule: PushDownForeachFlatten
[ https://issues.apache.org/jira/browse/PIG-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuefu Zhang updated PIG-1515: - Status: Patch Available (was: Open) Migrate logical optimization rule: PushDownForeachFlatten - Key: PIG-1515 URL: https://issues.apache.org/jira/browse/PIG-1515 Project: Pig Issue Type: Sub-task Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Xuefu Zhang Fix For: 0.8.0 Attachments: jira-1515-1.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1557) couple of issue mapping aliases to jobs
[ https://issues.apache.org/jira/browse/PIG-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1557: -- Attachment: PIG-1557.patch The alias for the load statement is missing. Add the load alias to the alias list. couple of issue mapping aliases to jobs --- Key: PIG-1557 URL: https://issues.apache.org/jira/browse/PIG-1557 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Olga Natkovich Assignee: Richard Ding Fix For: 0.8.0 Attachments: PIG-1557.patch I have a simple script: A = load '/user/pig/tests/data/singlefile/studenttab10k' as (name, age, gpa); B = group A by name; C = foreach B generate group, COUNT(A); D = order C by $1; E = limit D 10; dump E; I noticed a couple of issues with alias to job mapping: neither load(A) nor limit(E) shows in the output -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1557) couple of issue mapping aliases to jobs
[ https://issues.apache.org/jira/browse/PIG-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1557: -- Fix Version/s: 0.8.0 couple of issue mapping aliases to jobs --- Key: PIG-1557 URL: https://issues.apache.org/jira/browse/PIG-1557 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Olga Natkovich Assignee: Richard Ding Fix For: 0.8.0 Attachments: PIG-1557.patch I have a simple script: A = load '/user/pig/tests/data/singlefile/studenttab10k' as (name, age, gpa); B = group A by name; C = foreach B generate group, COUNT(A); D = order C by $1; E = limit D 10; dump E; I noticed a couple of issues with alias to job mapping: neither load(A) nor limit(E) shows in the output -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1518) multi file input format for loaders
[ https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Zhou updated PIG-1518: -- Status: Patch Available (was: Open) Release Note: Feature: combine splits of sizes smaller than the value of property pig.maxCombinedSplitSize or, if the property pig.maxCombinedSplitSize is not set, the file system default block size of the load's location. This feature can be turned off by setting the property pig.noSplitCombination to true. When such a combination is performed, a log message like Total input paths (combined) to process : 7 will be logged. This feature is applicable when a user input, or an intermediate input, has many small files to be loaded that would otherwise cause many more under-fed mappers to be launched and a potential slowdown of the execution. This change will not cause any backward compatibility issue, except if a loader implementation makes use of the PigSplit object passed through the prepareToRead method, in which case a rebuild of the loader might be necessary as PigSplit's definition has been modified. However, currently we know of no external use of the object. In addition, if a loader implements IndexableLoadFunc, or implements OrderedLoadFunc and CollectableLoadFunc, its input splits won't be subject to possible combinations. multi file input format for loaders --- Key: PIG-1518 URL: https://issues.apache.org/jira/browse/PIG-1518 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Yan Zhou Fix For: 0.8.0 Attachments: PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch We frequently run into the situation where Pig needs to deal with small files in the input. In this case a separate map is created for each file, which can be very inefficient. It would be great to have an umbrella input format that can take multiple files and use them in a single split. We would like to see this working with different data formats if possible.
There are already a couple of input formats doing a similar thing: MultifileInputFormat as well as CombinedInputFormat; however, neither works with the new Hadoop 20 API. We at least want to do a feasibility study for Pig 0.8.0. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
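Based on the release note above, using the feature from a script would look like the following sketch. The property names come from the note; the size value and load path are illustrative only.

```pig
-- Combine input splits smaller than 128 MB (value in bytes).
set pig.maxCombinedSplitSize 134217728;
-- Or opt out of split combination entirely:
-- set pig.noSplitCombination true;
A = load '/data/many_small_files' using PigStorage();
```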
[jira] Created: (PIG-1561) XMLLoader in Piggybank does not support bz2 or gzip compressed XML files
XMLLoader in Piggybank does not support bz2 or gzip compressed XML files Key: PIG-1561 URL: https://issues.apache.org/jira/browse/PIG-1561 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.7.0 Reporter: Viraj Bhat I have a simple Pig script which uses the XMLLoader after Piggybank is built. {code} register piggybank.jar; A = load '/user/viraj/capacity-scheduler.xml.gz' using org.apache.pig.piggybank.storage.XMLLoader('property') as (docs:chararray); B = limit A 1; dump B; --store B into '/user/viraj/handlegz' using PigStorage(); {code} This returns an empty tuple: {code} () {code} If you supply the uncompressed XML file, you get {code} (<property><name>mapred.capacity-scheduler.queue.my.capacity</name><value>10</value><description>Percentage of the number of slots in the cluster that are guaranteed to be available for jobs in this queue.</description></property>) {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1555) [piggybank] add CSV Loader
[ https://issues.apache.org/jira/browse/PIG-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12901697#action_12901697 ] Dmitriy V. Ryaboy commented on PIG-1555: Alan, The differences I observe when running on actual csv files are within the margin of error -- sometimes CSVLoader comes out on top. Then again I am reading actual CSVs with quoted commas, so it's possible that the similarity in runtimes is due to the fact that PigStorage sees the commas and allocates extra tuple fields. -D [piggybank] add CSV Loader -- Key: PIG-1555 URL: https://issues.apache.org/jira/browse/PIG-1555 Project: Pig Issue Type: New Feature Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Priority: Minor Fix For: 0.8.0 Attachments: PIG_1555.patch Users often ask for a CSV loader that can handle quoted commas. Let's get 'er done. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
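The quoted-comma case Dmitriy mentions is the classic CSV edge case that a plain delimiter split gets wrong. A minimal illustration in Python (not Pig code; just showing why a CSV-aware parser differs from a naive split):

```python
import csv
import io

line = 'name,"Doe, Jane",42\n'

# A naive split on commas breaks the quoted field into two pieces.
naive = line.strip().split(',')

# A CSV-aware parser keeps "Doe, Jane" as a single field.
parsed = next(csv.reader(io.StringIO(line)))

print(naive)   # ['name', '"Doe', ' Jane"', '42']
print(parsed)  # ['name', 'Doe, Jane', '42']
```

This is the behavior difference that makes a dedicated CSVLoader worthwhile even when PigStorage on ',' appears to work on simple inputs.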