Review Request: PIG-1508: Make 'docs' target (forrest) work with Java 1.6

2010-08-26 Thread Carl Steinbach

---
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/725/
---

Review request for Pig Developers.


Summary
---

Remove Pig's dependency on Java5.


This addresses bug PIG-1508.
http://issues.apache.org/jira/browse/PIG-1508


Diffs
-

  build.xml b0a2ada 
  src/docs/forrest.properties 51f1af7 
  test/bin/test-patch.sh 55c449e 

Diff: http://review.cloudera.org/r/725/diff


Testing
---


Thanks,

Carl



[jira] Commented: (PIG-1508) Make 'docs' target (forrest) work with Java 1.6

2010-08-26 Thread HBase Review Board (JIRA)

[ https://issues.apache.org/jira/browse/PIG-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902767#action_12902767 ]

HBase Review Board commented on PIG-1508:
-

Message from: Carl Steinbach c...@cloudera.com

---
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/725/
---





 Make 'docs' target (forrest) work with Java 1.6
 ---

 Key: PIG-1508
 URL: https://issues.apache.org/jira/browse/PIG-1508
 Project: Pig
  Issue Type: Bug
  Components: documentation
Affects Versions: 0.7.0
Reporter: Carl Steinbach
Assignee: Carl Steinbach
 Attachments: PIG-1508.patch.txt


 FOR-984 covers the very inconvenient fact that Forrest 0.8 does not work with 
 Java 1.6.
 The same ticket also suggests a workaround: disabling sitemap and stylesheet 
 validation by setting the forrest.validate.sitemap and forrest.validate.stylesheets 
 properties to false.
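Concretely, the suggested workaround is two lines in src/docs/forrest.properties (the file touched by the attached patch):

```
forrest.validate.sitemap=false
forrest.validate.stylesheets=false
```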

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Pig Contributor meeting notes

2010-08-26 Thread Jeff Zhang
Wonderful, Dmitriy. It's a pity I missed the contributor meeting.
Were any slides shared?



On Wed, Aug 25, 2010 at 8:32 PM, Dmitriy Ryaboy dvrya...@gmail.com wrote:
 Twitter hosted this month's Pig contributor meeting.
 Developers from Yahoo, Twitter, LinkedIn, RichRelevance, and Cloudera were
 present.

 1. Howl
 First, Alan Gates demoed Howl, a project whose goal is to provide a table
 management service for all of Hadoop. The vision is that ultimately you will
 be able to read/write data using regular MR, or Pig, or Hive, and read it
 using any of those three, with full support of a partition-aware metadata
 store that will tell you what data is available, what its schema is, etc.,
 reusing a single table abstraction.

 Currently, tables are created using (a restricted subset of) Hive DDL
 statements; a Howl CLI for this will be created, which will enforce the
 restricted subset.
 Writing to the table using Pig or MapReduce is supported. Reading can
 already be done using all three.

 At the moment, a single Pig store statement can only store into a single
 partition; adding the ability to spray across partitions is on the roadmap.
 This, and a good API for interacting with the metastore, are the two areas
 that were identified as good opportunities for the wider developer community
 to get involved with the project. The source code is on GitHub and is at
 the moment synchronized with the development trunk manually; Yahoo folks
 will look into changing this.

 Security is a concern, and Yahoo will be working on it. Making it possible
 for Hive to write to the tables is at the moment not as high a priority as
 the others listed; it would basically involve just writing a Hive SerDe (an
 equivalent of Pig's StoreFunc).
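As a sketch of that vision (loader, storer, and table names here are hypothetical — the Howl API was still evolving at the time):

```
-- Hypothetical: assumes a Howl-provided loader/storer and a table
-- 'page_views' registered in the Howl metastore.
A = LOAD 'page_views' USING HowlLoader();
B = FILTER A BY datestamp == '20100826';  -- partition-aware read
STORE B INTO 'daily_views' USING HowlStorer();
```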

 2. Azkaban presentation
 Russel Jurney and Richard Park from LinkedIn presented the workflow
 management tool open-sourced by LinkedIn, called Azkaban. It allows you to
 declare job dependencies, has a web interface for launching and monitoring
 jobs, etc. It has a special exec mode for Pig that lets you set some
 Pig-specific options on a per-job basis. It does not currently have
 triggering or job-instance parameter substitution (it does have job-level
 parameter substitution). When asked what Pig could do to make life
 easier for Azkaban, the two things Richard identified were registering jars
 through the grunt command line and a way to monitor the running job -- both
 of these are already in trunk, so we're in pretty good shape for 0.8.

 3. Piggybank discussion
 Kevin Weil led a discussion of the piggybank. There are a few problems with
 it -- it's released on the Pig schedule, and has quite a few barriers to
 submission that are, anecdotally at least, preventing people from
 contributing. Several options were discussed, with the group finally
 settling on starting a community-curated GitHub project for piggybank. It
 will have a number of committers from different companies, and will aim to
 make it easy for folks to contribute (all contribs will still have to have
 tests, and be Apache 2.0-licensed). More details will be forthcoming as we
 figure them out. Initially this project will be seeded with the current
 Piggybank functions some time after 0.8 is branched. The initial list of
 committers is Kevin Weil (Twitter), Dmitriy Ryaboy (Twitter), Carl Steinbach
 (Cloudera), and Russel Jurney (LinkedIn). Yahoo will also nominate someone.
 Please send us any thoughts you might have on this subject. It was suggested
 that a lot of common code might be shared with Hive UDFs, which have the
 same problems as Piggybank does, and that perhaps the project can be another
 collaboration point between the projects. It's not clear how that would work;
 Carl will talk to other Hive people.

 Pig 0.9
 So far the items on the list for 0.9 are: better type propagation /
 resolution story and documentation, perhaps a different parser (ANTLR?), some
 performance tweaks, and map types with fixed-type values. Much is still to be
 decided.

 The next contributor meeting will be hosted by LinkedIn in October.

 -Dmitriy




-- 
Best Regards

Jeff Zhang


Added Pig to the list of projects on Cloudera's public ReviewBoard instance

2010-08-26 Thread Carl Steinbach
Hi,

I added Pig to the list of projects that can be reviewed on Cloudera's public
ReviewBoard instance, located at http://review.cloudera.org (AKA
review.hbase.org).

Review requests and comments are automatically forwarded to the pig-dev mailing
list, and they also get posted back to the original JIRA ticket.

Please refer to the Review Process section of HBase's HowToContribute page for
more information on using ReviewBoard:
http://wiki.apache.org/hadoop/Hbase/HowToContribute

Thanks.

Carl


[jira] Created: (PIG-1569) java properties not honored in case of properties such as stop.on.failure

2010-08-26 Thread Thejas M Nair (JIRA)
java properties not honored in case of properties such as stop.on.failure
-

 Key: PIG-1569
 URL: https://issues.apache.org/jira/browse/PIG-1569
 Project: Pig
  Issue Type: Bug
Reporter: Thejas M Nair
 Fix For: 0.8.0


In org.apache.pig.Main, properties are being set to default values without 
checking whether the corresponding Java system properties have been set to 
something else. stop.on.failure, opt.multiquery, and aggregate.warning are some 
properties that have this problem.
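The fix described above boils down to consulting the JVM system properties before falling back to a hard-coded default. A minimal sketch of the intended precedence (class and method names are made up for illustration; this is not the actual Pig code):

```java
import java.util.Properties;

public class PropDefaults {
    // Hypothetical helper: fall back to the hard-coded default only when
    // neither the Pig configuration nor the JVM system properties supply
    // a value for the key.
    static String getWithDefault(Properties conf, String key, String def) {
        String fromConf = conf.getProperty(key);
        if (fromConf != null) {
            return fromConf;
        }
        String fromSystem = System.getProperty(key);
        if (fromSystem != null) {
            return fromSystem;
        }
        return def;
    }

    public static void main(String[] args) {
        // Simulates launching with -Dstop.on.failure=true on the java command line.
        System.setProperty("stop.on.failure", "true");
        Properties conf = new Properties();
        // The system property wins over the hard-coded default:
        System.out.println(getWithDefault(conf, "stop.on.failure", "false"));
        // A property set nowhere falls back to the default:
        System.out.println(getWithDefault(conf, "opt.multiquery", "true"));
    }
}
```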





[jira] Updated: (PIG-1501) need to investigate the impact of compression on pig performance

2010-08-26 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1501:
--

Status: Patch Available  (was: Open)

This feature will save HDFS space used to store the intermediate data generated by 
Pig and can potentially improve query execution speed. In general, the more 
intermediate data generated, the greater the storage and speed benefits.

There are no backward compatibility issues as a result of this feature.

An example is the following test.pig script:

register pigperf.jar;
A = load '/user/pig/tests/data/pigmix/page_views' using 
org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
as (user, action, timespent:long, query_term, ip_addr, timestamp, 
estimated_revenue, page_info, page_links);
B1 = filter A by timespent == 4;
B = load '/user/pig/tests/data/pigmix/queryterm' as (query_term);
C = join B1 by query_term, B by query_term using 'skewed' parallel 300;
D = distinct C parallel 300;
store D into 'output.lzo';

which is launched as follows:

java -cp /grid/0/gs/conf/current:/grid/0/jars/pig.jar 
-Djava.library.path=/grid/0/gs/hadoop/current/lib/native/Linux-i386-32 
-Dpig.tmpfilecompression=true -Dpig.tmpfilecompression.codec=lzo 
org.apache.pig.Main ./test.pig

 need to investigate the impact of compression on pig performance
 

 Key: PIG-1501
 URL: https://issues.apache.org/jira/browse/PIG-1501
 Project: Pig
  Issue Type: Test
Reporter: Olga Natkovich
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: compress_perf_data.txt, compress_perf_data_2.txt, 
 PIG-1501.patch, PIG-1501.patch, PIG-1501.patch


 We would like to understand how compressing map results as well as 
 reducer output in a chain of MR jobs impacts performance. We can use PigMix 
 queries for this investigation.




[jira] Updated: (PIG-1518) multi file input format for loaders

2010-08-26 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1518:
--

Attachment: PIG-1518.patch

 multi file input format for loaders
 ---

 Key: PIG-1518
 URL: https://issues.apache.org/jira/browse/PIG-1518
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, 
 PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch


 We frequently run into the situation where Pig needs to deal with small files 
 in the input. In this case a separate map is created for each file, which 
 can be very inefficient. 
 It would be great to have an umbrella input format that can take multiple 
 files and use them in a single split. We would like to see this work with 
 different data formats if possible.
 There are already a couple of input formats doing a similar thing: 
 MultiFileInputFormat as well as CombinedInputFormat; however, neither works 
 with the new Hadoop 20 API. 
 We at least want to do a feasibility study for Pig 0.8.0.




Re: Pig optimizer

2010-08-26 Thread Renato Marroquín Mogrovejo
Anyone, please?

Renato M.

2010/8/24 Renato Marroquín Mogrovejo renatoj.marroq...@gmail.com

 Hi Daniel,

 Thanks, but that was not what I was actually looking for. What I want to know
 is, for example, how the optimizer works when the bags' logical plans are
 combined, or, if all commands are reduced at the end to COGROUP commands, how
 is this handled? I know from Pig's paper that the ORDER and LOAD commands
 generate new MapReduce jobs; are there any optimizations for the physical
 plans?
 Thanks in advance.


 Renato M.

 2010/8/23 Daniel Dai jiany...@yahoo-inc.com

 Hi, Renato,
 There is a description of the optimization rules in the Pig Latin reference manual:
 http://hadoop.apache.org/pig/docs/r0.7.0/piglatin_ref1.html#Optimization+Rules.
 Is that enough?

 Daniel


 Renato Marroquín Mogrovejo wrote:

 Hey everyone, I was wondering if anybody has any references or suggestions
 on how to learn about Pig's optimizer besides the source code or Pig's
 paper.
 Thanks in advance.


 Renato M.







[jira] Created: (PIG-1570) native mapreduce operator MR job does not follow same failure handling logic as other pig MR jobs

2010-08-26 Thread Thejas M Nair (JIRA)
native mapreduce operator MR job does not follow same failure handling logic as 
other pig MR jobs
-

 Key: PIG-1570
 URL: https://issues.apache.org/jira/browse/PIG-1570
 Project: Pig
  Issue Type: Bug
Reporter: Thejas M Nair
 Fix For: 0.8.0


The code path for handling failure in the MR job corresponding to the native MR 
operator is different and does not have the same behavior.
For example, even if the MR job for the mapreduce operator fails, the number of 
failed jobs is reported as 0 in the PigStats log.





[jira] Created: (PIG-1571) add a compile time check to see if the output file of native mapreduce operator exists

2010-08-26 Thread Thejas M Nair (JIRA)
add a compile time check to see if the output file of native mapreduce operator 
exists
--

 Key: PIG-1571
 URL: https://issues.apache.org/jira/browse/PIG-1571
 Project: Pig
  Issue Type: Bug
Reporter: Thejas M Nair


If the output file for the native MR operator exists, the query does not fail at 
compile time; it fails only at runtime. Since this file is loaded in the nested 
load of the native MR operator, it should be possible to check for it at compile time.





[jira] Updated: (PIG-506) Does pig need a NATIVE keyword?

2010-08-26 Thread Thejas M Nair (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thejas M Nair updated PIG-506:
--

Attachment: PIG-506.2.patch

PIG-506.2.patch has:
- Changes to get the mapreduce operator working with the new logical plan.
- Changes to the LO/PO Native operators - the store and load for the operator are 
no longer within it; they are part of the plan. As a result, several changes in 
visitors made for handling the load/store within LONative have been reverted.
- A fix for reporting failure when the MR job corresponding to the native operator fails.
- Removed TestNativeMapReduce from the exclude list in the ant target.

Some issues still to be fixed, which I will address as part of new JIRAs:
- PIG-1570: The code path for handling failure in the MR job corresponding to 
native MR is different and does not have the same behavior.
- PIG-1571: If the output file for native MR exists, the query does not fail at 
compile time; it fails only at runtime. Since this file is loaded in the nested 
load of the native MR operator, it should be possible to check for it.


 Does pig need a NATIVE keyword?
 ---

 Key: PIG-506
 URL: https://issues.apache.org/jira/browse/PIG-506
 Project: Pig
  Issue Type: New Feature
  Components: impl
Reporter: Alan Gates
Assignee: Aniket Mokashi
Priority: Minor
 Fix For: 0.8.0

 Attachments: NativeImplInitial.patch, NativeMapReduceFinale1.patch, 
 NativeMapReduceFinale2.patch, NativeMapReduceFinale3.patch, PIG-506.2.patch, 
 PIG-506.patch, TestWordCount.jar


 Assume a user had a job that broke easily into three pieces.  Further assume 
 that pieces one and three were easily expressible in pig, but that piece two 
 needed to be written in map reduce for whatever reason (performance, 
 something that pig could not easily express, legacy job that was too 
 important to change, etc.).  Today the user would either have to use map 
 reduce for the entire job or manually handle the stitching together of pig 
 and map reduce jobs.  What if instead pig provided a NATIVE keyword that 
 would allow the script to pass off the data stream to the underlying system 
 (in this case map reduce).  The semantics of NATIVE would vary by underlying 
 system.  In the map reduce case, we would assume that this indicated a 
 collection of one or more fully contained map reduce jobs, so that pig would 
 store the data, invoke the map reduce jobs, and then read the resulting data 
 to continue.  It might look something like this:
 {code}
 A = load 'myfile';
 X = load 'myotherfile';
 B = group A by $0;
 C = foreach B generate group, myudf(B);
 D = native (jar=mymr.jar, infile=frompig outfile=topig);
 E = join D by $0, X by $0;
 ...
 {code}
 This differs from streaming in that it allows the user to insert an arbitrary 
 amount of native processing, whereas streaming allows the insertion of one 
 binary.  It also differs in that, for streaming, data is piped directly into 
 and out of the binary as part of the pig pipeline.  Here the pipeline would 
 be broken, data written to disk, and the native block invoked, then data read 
 back from disk.
 Another alternative is to say this is unnecessary because the user can do the 
 coordination from java, using the PigServer interface to run pig and calling 
 the map reduce job explicitly.  The advantages of the native keyword are that 
 the user need not be worried about coordination between the jobs; pig will 
 take care of it.  Also the user can make use of existing java applications 
 without being a java programmer.




[jira] Commented: (PIG-1570) native mapreduce operator MR job does not follow same failure handling logic as other pig MR jobs

2010-08-26 Thread Thejas M Nair (JIRA)

[ https://issues.apache.org/jira/browse/PIG-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902919#action_12902919 ]

Thejas M Nair commented on PIG-1570:


Another thing to investigate (somewhat related): there seems to be a problem 
when PigServer is used to execute a query having the native MR operator - I was 
unable to run the tests in local mode, but I am able to run the query in local 
mode from the command line.


 native mapreduce operator MR job does not follow same failure handling logic 
 as other pig MR jobs
 -

 Key: PIG-1570
 URL: https://issues.apache.org/jira/browse/PIG-1570
 Project: Pig
  Issue Type: Bug
Reporter: Thejas M Nair
 Fix For: 0.8.0


 The code path for handling failure in the MR job corresponding to the native MR 
 operator is different and does not have the same behavior. 
 For example, even if the MR job for the mapreduce operator fails, the number of 
 failed jobs is reported as 0 in the PigStats log.




Re: Pig Contributor meeting notes

2010-08-26 Thread Russell Jurney
Slides about Azkaban and Pig:
http://www.slideshare.net/rjurney/azkaban-pig-5057793

On Thu, Aug 26, 2010 at 12:55 AM, Jeff Zhang zjf...@gmail.com wrote:

 Wonderful, Dmitriy, It's pity for me missing the contributor meeting.
 And any ppt shared ?






[jira] Updated: (PIG-1501) need to investigate the impact of compression on pig performance

2010-08-26 Thread Thejas M Nair (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thejas M Nair updated PIG-1501:
---

  Status: Resolved  (was: Patch Available)
Hadoop Flags: [Reviewed]
  Resolution: Fixed

Patch committed to trunk. Thanks Yan!

 need to investigate the impact of compression on pig performance
 

 Key: PIG-1501
 URL: https://issues.apache.org/jira/browse/PIG-1501
 Project: Pig
  Issue Type: Test
Reporter: Olga Natkovich
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: compress_perf_data.txt, compress_perf_data_2.txt, 
 PIG-1501.patch, PIG-1501.patch, PIG-1501.patch


 We would like to understand how compressing map results as well as 
 reducer output in a chain of MR jobs impacts performance. We can use PigMix 
 queries for this investigation.




[jira] Created: (PIG-1572) change default datatype when relations are used as scalar to bytearray

2010-08-26 Thread Thejas M Nair (JIRA)
change default datatype when relations are used as scalar to bytearray
--

 Key: PIG-1572
 URL: https://issues.apache.org/jira/browse/PIG-1572
 Project: Pig
  Issue Type: Bug
Reporter: Thejas M Nair
 Fix For: 0.8.0


When relations are cast to scalars, the current default type is chararray. This 
is inconsistent with the behavior in the rest of Pig Latin.
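For illustration, a script using a relation as a scalar (relation and field names are made up); the issue concerns the implicit type given to C.total when no schema is known:

```
A = LOAD 'sales' AS (amount:double);
B = GROUP A ALL;
C = FOREACH B GENERATE SUM(A.amount) AS total;
-- C.total is used below as a scalar; this issue proposes that its
-- implicit type default to bytearray rather than chararray.
D = FOREACH A GENERATE amount / (double) C.total;
```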





[jira] Updated: (PIG-1518) multi file input format for loaders

2010-08-26 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1518:
--

Attachment: PIG-1518.patch

rebased on the latest trunk

 multi file input format for loaders
 ---

 Key: PIG-1518
 URL: https://issues.apache.org/jira/browse/PIG-1518
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, 
 PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch


 We frequently run into the situation where Pig needs to deal with small files 
 in the input. In this case a separate map is created for each file, which 
 can be very inefficient. 
 It would be great to have an umbrella input format that can take multiple 
 files and use them in a single split. We would like to see this work with 
 different data formats if possible.
 There are already a couple of input formats doing a similar thing: 
 MultiFileInputFormat as well as CombinedInputFormat; however, neither works 
 with the new Hadoop 20 API. 
 We at least want to do a feasibility study for Pig 0.8.0.




[jira] Updated: (PIG-1568) Optimization rule FilterAboveForeach is too restrictive and doesn't handle project * correctly

2010-08-26 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated PIG-1568:
-

Status: Open  (was: Patch Available)

 Optimization rule FilterAboveForeach is too restrictive and doesn't handle 
 project * correctly
 --

 Key: PIG-1568
 URL: https://issues.apache.org/jira/browse/PIG-1568
 Project: Pig
  Issue Type: Bug
Reporter: Xuefu Zhang
Assignee: Xuefu Zhang
 Fix For: 0.8.0

 Attachments: jira-1568-1.patch, jira-1568-1.patch


 The FilterAboveForeach rule optimizes the plan by pushing a filter above the 
 preceding foreach operator. However, during code review, two major problems 
 were found:
 1. The current implementation assumes that if no projection is found in the 
 filter condition then all columns from foreach are projected. This issue 
 prevents the following optimization:
   A = LOAD 'file.txt' AS (a(u,v), b, c);
   B = FOREACH A GENERATE $0, b;
   C = FILTER B BY 8 > 5;
   STORE C INTO 'empty';
 2. The current implementation doesn't handle the * projection, which means 
 project all columns. As a result, it wasn't able to optimize the following:
   A = LOAD 'file.txt' AS (a(u,v), b, c);
   B = FOREACH A GENERATE $0, b;
   C = FILTER B BY Identity.class.getName(*) > 5;
   STORE C INTO 'empty';




[jira] Updated: (PIG-1568) Optimization rule FilterAboveForeach is too restrictive and doesn't handle project * correctly

2010-08-26 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated PIG-1568:
-

Attachment: jira-1568-1.patch

 Optimization rule FilterAboveForeach is too restrictive and doesn't handle 
 project * correctly
 --

 Key: PIG-1568
 URL: https://issues.apache.org/jira/browse/PIG-1568
 Project: Pig
  Issue Type: Bug
Reporter: Xuefu Zhang
Assignee: Xuefu Zhang
 Fix For: 0.8.0

 Attachments: jira-1568-1.patch, jira-1568-1.patch


 The FilterAboveForeach rule optimizes the plan by pushing a filter above the 
 preceding foreach operator. However, during code review, two major problems 
 were found:
 1. The current implementation assumes that if no projection is found in the 
 filter condition then all columns from foreach are projected. This issue 
 prevents the following optimization:
   A = LOAD 'file.txt' AS (a(u,v), b, c);
   B = FOREACH A GENERATE $0, b;
   C = FILTER B BY 8 > 5;
   STORE C INTO 'empty';
 2. The current implementation doesn't handle the * projection, which means 
 project all columns. As a result, it wasn't able to optimize the following:
   A = LOAD 'file.txt' AS (a(u,v), b, c);
   B = FOREACH A GENERATE $0, b;
   C = FILTER B BY Identity.class.getName(*) > 5;
   STORE C INTO 'empty';




[jira] Commented: (PIG-1564) add support for multiple filesystems

2010-08-26 Thread Richard Ding (JIRA)

[ https://issues.apache.org/jira/browse/PIG-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902952#action_12902952 ]

Richard Ding commented on PIG-1564:
---

Hi Andrew,

HDataStorage is a thin layer on top of the Hadoop FileSystem. Since Pig moved 
its local mode to Hadoop local mode, it no longer needs this layer. We intend 
to remove it in the future.

As for Pig reading data from one file system and writing it to another, this 
has been supported since Pig 0.7.

-Richard 
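A cross-filesystem script of the kind described would look something like the following (host names, bucket, and paths are made up):

```
-- Read from one filesystem and write to another by using full URIs.
A = LOAD 'hdfs://namenode-a:8020/data/input' AS (line:chararray);
STORE A INTO 's3n://my-bucket/output';
```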

 add support for multiple filesystems
 

 Key: PIG-1564
 URL: https://issues.apache.org/jira/browse/PIG-1564
 Project: Pig
  Issue Type: Improvement
Reporter: Andrew Hitchcock
 Attachments: PIG-1564-1.patch


 Currently you can't run Pig scripts that read data from one file system and 
 write it to another. Also, Grunt doesn't support CDing from one directory to 
 another on different file systems.




[jira] Updated: (PIG-1568) Optimization rule FilterAboveForeach is too restrictive and doesn't handle project * correctly

2010-08-26 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated PIG-1568:
-

Status: Patch Available  (was: Open)

Regenerated the patch after fixing the failed test case. The test case itself was 
changed because it relied on an internal bug: when a UDF takes no arguments, the 
Pig backend passes the whole input to the UDF. This needs to be corrected. In 
other words, if a UDF doesn't specify any arguments, we assume that it doesn't 
need any input. If a UDF needs all input, it can either specify a star (*) or 
list whatever it requires in the argument list.

A JIRA tracking the Pig backend changes will be created.


 Optimization rule FilterAboveForeach is too restrictive and doesn't handle 
 project * correctly
 --

 Key: PIG-1568
 URL: https://issues.apache.org/jira/browse/PIG-1568
 Project: Pig
  Issue Type: Bug
Reporter: Xuefu Zhang
Assignee: Xuefu Zhang
 Fix For: 0.8.0

 Attachments: jira-1568-1.patch, jira-1568-1.patch


 The FilterAboveForeach rule optimizes the plan by pushing a filter above the 
 preceding foreach operator. However, during code review, two major problems 
 were found:
 1. The current implementation assumes that if no projection is found in the 
 filter condition then all columns from foreach are projected. This issue 
 prevents the following optimization:
   A = LOAD 'file.txt' AS (a(u,v), b, c);
   B = FOREACH A GENERATE $0, b;
   C = FILTER B BY 8 > 5;
   STORE C INTO 'empty';
 2. The current implementation doesn't handle the * projection, which means 
 project all columns. As a result, it wasn't able to optimize the following:
   A = LOAD 'file.txt' AS (a(u,v), b, c);
   B = FOREACH A GENERATE $0, b;
   C = FILTER B BY Identity.class.getName(*) > 5;
   STORE C INTO 'empty';

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1573) PIG shouldn't pass all input to a UDF if the UDF specify no argument

2010-08-26 Thread Xuefu Zhang (JIRA)
PIG shouldn't pass all input to a UDF if the UDF specify no argument


 Key: PIG-1573
 URL: https://issues.apache.org/jira/browse/PIG-1573
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Xuefu Zhang
 Fix For: 0.9.0


Currently, if a user uses a UDF with no arguments in a Pig script, the Pig backend 
assumes that the UDF takes all input, so at run time it passes all input as a 
tuple to the UDF. This assumption is incorrect and causes conceptual confusion. 
If a UDF takes all input, it can specify a star (*) as its argument. If it 
specifies no arguments at all, then we assume that it requires no input data. 

We need to differentiate no input from all input for a UDF. Thus, when a 
UDF specifies no arguments, the backend should pass the UDF an empty tuple.

See notes in PIG-1586 for more information.
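
For illustration, a hypothetical script showing the intended distinction (the 
UDF names here are made up, not real built-ins):

{code}
A = LOAD 'input.txt' AS (x, y, z);
B = FOREACH A GENERATE RANDOM_ID();   -- no arguments: backend should pass an empty tuple
C = FOREACH A GENERATE ROW_SIZE(*);   -- star: explicitly requests the whole input tuple
{code}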



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-1573) PIG shouldn't pass all input to a UDF if the UDF specify no argument

2010-08-26 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang reassigned PIG-1573:


Assignee: Xuefu Zhang

 PIG shouldn't pass all input to a UDF if the UDF specify no argument
 

 Key: PIG-1573
 URL: https://issues.apache.org/jira/browse/PIG-1573
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Xuefu Zhang
Assignee: Xuefu Zhang
 Fix For: 0.9.0


 Currently, if a user uses a UDF with no arguments in a Pig script, the Pig backend 
 assumes that the UDF takes all input, so at run time it passes all input as a 
 tuple to the UDF. This assumption is incorrect and causes conceptual 
 confusion. If a UDF takes all input, it can specify a star (*) as its 
 argument. If it specifies no arguments at all, then we assume that it requires 
 no input data. 
 We need to differentiate no input from all input for a UDF. Thus, when 
 a UDF specifies no arguments, the backend should pass the UDF an empty tuple.
 See notes in PIG-1586 for more information.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (PIG-1518) multi file input format for loaders

2010-08-26 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding resolved PIG-1518.
---

Hadoop Flags: [Reviewed]
  Resolution: Fixed

Patch is committed to trunk. Thanks Yan.

 multi file input format for loaders
 ---

 Key: PIG-1518
 URL: https://issues.apache.org/jira/browse/PIG-1518
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, 
 PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch


 We frequently run into the situation where Pig needs to deal with small files 
 in the input. In this case a separate map is created for each file, which 
 can be very inefficient. 
 It would be great to have an umbrella input format that can take multiple 
 files and use them in a single split. We would like to see this working with 
 different data formats if possible.
 There are already a couple of input formats doing a similar thing: 
 MultifileInputFormat as well as CombinedInputFormat; however, neither works 
 with the new Hadoop 20 API. 
 We at least want to do a feasibility study for Pig 0.8.0.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Pig optimizer

2010-08-26 Thread Daniel Dai

Hi, Renato,
I think you are talking about how we organize different operators into 
map-reduce jobs. Unfortunately there is no document on this currently. Basically 
we put as many operators into one map-reduce job as possible. 
Co-group/Group, Join, Order, Distinct, Cross, and Stream each create a 
map-reduce boundary; most other operators are folded into existing jobs. The 
main logic is inside MRCompiler.java.
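
As a sketch of that rule (my illustration, not from the Pig docs):

{code}
A = LOAD 'input' AS (k, v);                -- map side of job 1
B = GROUP A BY k;                          -- GROUP forces a map-reduce boundary
C = FOREACH B GENERATE group, COUNT(A);    -- folded into job 1's reduce
D = ORDER C BY $1;                         -- ORDER starts a second map-reduce job
STORE D INTO 'output';
{code}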


Daniel

Renato Marroquín Mogrovejo wrote:

Anyone, please?

Renato M.

2010/8/24 Renato Marroquín Mogrovejo renatoj.marroq...@gmail.com

  

Hi Daniel,

Thanks, but that was not what I was actually looking for. What I want to know
is, for example, how the optimizer works when the bags' logical plans are
combined, or, if all commands are reduced in the end to CO-GROUP commands,
how that is handled. I know from Pig's paper that the ORDER and LOAD
commands generate new MapReduce jobs; are there any optimizations for the
physical plans?
Thanks in advance.


Renato M.

2010/8/23 Daniel Dai jiany...@yahoo-inc.com

Hi, Renato,


There is a description of optimization rule in Pig Latin reference menu:
http://hadoop.apache.org/pig/docs/r0.7.0/piglatin_ref1.html#Optimization+Rules.
Is that enough?

Daniel


Renato Marroquín Mogrovejo wrote:

  

Hey everyone, I was wondering if anybody has any references or suggestions on
how to learn about Pig's optimizer besides the source code or Pig's paper.
Thanks in advance.


Renato M.



  




Re: Pig Contributor meeting notes

2010-08-26 Thread Alan Gates


On Aug 26, 2010, at 12:55 AM, Jeff Zhang wrote:


Wonderful, Dmitriy. It's a pity that I missed the contributor meeting.
Was any ppt shared?


Jeff,

We don't want to exclude our contributors who don't happen to live in  
the San Francisco Bay Area.  If we could include you via Skype or some  
other technology we'd be happy to set it up on our end.  Do you think  
something like that would work for you?


Alan.



Re: Added Pig to the list of projects on Cloudera's public ReviewBoard instance

2010-08-26 Thread Dmitriy Ryaboy
Thanks Carl!

On Thu, Aug 26, 2010 at 1:08 AM, Carl Steinbach c...@cloudera.com wrote:

 Hi,

 I added Pig to the list of projects that can be reviewed on Cloudera's
 public
 ReviewBoard instance, located at http://review.cloudera.org (AKA
 review.hbase.org).

 Review requests and comments are automatically forwarded to the pig-dev
 mailing
 list, and they also get posted back to the original JIRA ticket.

 Please refer to the Review Process section of HBase's HowToContribute
 page
 for
 more information on using ReviewBoard:
 http://wiki.apache.org/hadoop/Hbase/HowToContribute

 Thanks.

 Carl



[jira] Updated: (PIG-1343) pig_log file missing even though Main tells it is creating one and an M/R job fails

2010-08-26 Thread niraj rai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

niraj rai updated PIG-1343:
---

Status: Patch Available  (was: Open)

 pig_log file missing even though Main tells it is creating one and an M/R job 
 fails 
 

 Key: PIG-1343
 URL: https://issues.apache.org/jira/browse/PIG-1343
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat
Assignee: niraj rai
 Fix For: 0.8.0

 Attachments: 1343.patch, PIG-1343-1.patch, pig_1343_2.patch


 There is a particular case where I was running with the latest trunk of Pig.
 {code}
 $java -cp pig.jar:/home/path/hadoop20cluster org.apache.pig.Main testcase.pig
 [main] INFO  org.apache.pig.Main - Logging error messages to: 
 /homes/viraj/pig_1263420012601.log
 $ls -l pig_1263420012601.log
 ls: pig_1263420012601.log: No such file or directory
 {code}
 The job failed and the log file did not contain anything, the only way to 
 debug was to look into the Jobtracker logs.
 Here are some reasons which would have caused this behavior:
 1) The underlying filer/NFS had some issues. In that case do we not error on 
 stdout?
 2) There are some errors from the backend which are not being captured
 Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1343) pig_log file missing even though Main tells it is creating one and an M/R job fails

2010-08-26 Thread niraj rai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

niraj rai updated PIG-1343:
---

Attachment: pig_1343_2.patch

Implemented the interactive mode logging as well.

 pig_log file missing even though Main tells it is creating one and an M/R job 
 fails 
 

 Key: PIG-1343
 URL: https://issues.apache.org/jira/browse/PIG-1343
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat
Assignee: niraj rai
 Fix For: 0.8.0

 Attachments: 1343.patch, PIG-1343-1.patch, pig_1343_2.patch


 There is a particular case where I was running with the latest trunk of Pig.
 {code}
 $java -cp pig.jar:/home/path/hadoop20cluster org.apache.pig.Main testcase.pig
 [main] INFO  org.apache.pig.Main - Logging error messages to: 
 /homes/viraj/pig_1263420012601.log
 $ls -l pig_1263420012601.log
 ls: pig_1263420012601.log: No such file or directory
 {code}
 The job failed and the log file did not contain anything, the only way to 
 debug was to look into the Jobtracker logs.
 Here are some reasons which would have caused this behavior:
 1) The underlying filer/NFS had some issues. In that case do we not error on 
 stdout?
 2) There are some errors from the backend which are not being captured
 Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1555) [piggybank] add CSV Loader

2010-08-26 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy updated PIG-1555:
---

  Status: Resolved  (was: Patch Available)
Release Note: 
CSVLoader can be used to load comma-separated value files.
It properly handles commas inside quoted fields, and quotes escaped by 
preceding them with another quote character (Excel-style).
CSVLoader only handles single-line entries; quoting a multi-line value will 
*not* work.
  Resolution: Fixed
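
For reference, a minimal usage sketch (the jar name and full class path here 
are my assumptions about the usual piggybank layout):

{code}
REGISTER piggybank.jar;
-- Excel-style CSV: commas inside quotes are kept with the field, and ""
-- inside a quoted field decodes to a single quote character.
A = LOAD 'data.csv' USING org.apache.pig.piggybank.storage.CSVLoader();
{code}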

 [piggybank] add CSV Loader
 --

 Key: PIG-1555
 URL: https://issues.apache.org/jira/browse/PIG-1555
 Project: Pig
  Issue Type: New Feature
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
Priority: Minor
 Fix For: 0.8.0

 Attachments: PIG_1555.patch


 Users often ask for a CSV loader that can handle quoted commas. Let's get 'er 
 done.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1518) multi file input format for loaders

2010-08-26 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12903031#action_12903031
 ] 

Dmitriy V. Ryaboy commented on PIG-1518:


This is a great feature, thanks Yan.

Could you comment on what the final solution was as far as PigStorage and 
OrderedLoadFunc? I see two ideas (yours and Ashutosh's) in the discussion, but 
not which direction you ultimately took.

 multi file input format for loaders
 ---

 Key: PIG-1518
 URL: https://issues.apache.org/jira/browse/PIG-1518
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, 
 PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch


 We frequently run into the situation where Pig needs to deal with small files 
 in the input. In this case a separate map is created for each file, which 
 can be very inefficient. 
 It would be great to have an umbrella input format that can take multiple 
 files and use them in a single split. We would like to see this working with 
 different data formats if possible.
 There are already a couple of input formats doing a similar thing: 
 MultifileInputFormat as well as CombinedInputFormat; however, neither works 
 with the new Hadoop 20 API. 
 We at least want to do a feasibility study for Pig 0.8.0.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-1569) java properties not honored in case of properties such as stop.on.failure

2010-08-26 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding reassigned PIG-1569:
-

Assignee: Richard Ding

 java properties not honored in case of properties such as stop.on.failure
 -

 Key: PIG-1569
 URL: https://issues.apache.org/jira/browse/PIG-1569
 Project: Pig
  Issue Type: Bug
Reporter: Thejas M Nair
Assignee: Richard Ding
 Fix For: 0.8.0


 In org.apache.pig.Main, properties are being set to default values without 
 checking whether the corresponding Java system properties have been set to 
 something else.
 stop.on.failure, opt.multiquery, and aggregate.warning are some properties that 
 have this problem.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1205) Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc

2010-08-26 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12903043#action_12903043
 ] 

Alan Gates commented on PIG-1205:
-

Comments

# As discussed previously, LoadStoreCaster should be changed so that there is a 
StoreCaster interface that has the toByte methods, and LoadStoreCaster is a 
convenience interface that extends LoadCaster and StoreCaster.
# It looks like with HBASE-1933 HBase is now available via Maven.  Can we pull 
it from Maven rather than check the jar into our lib directory?

Since I know little about HBase I focused my review on the Pig side.


 Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc
 --

 Key: PIG-1205
 URL: https://issues.apache.org/jira/browse/PIG-1205
 Project: Pig
  Issue Type: Sub-task
Affects Versions: 0.7.0
Reporter: Jeff Zhang
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.8.0

 Attachments: PIG_1205.patch, PIG_1205_2.patch, PIG_1205_3.patch, 
 PIG_1205_4.patch, PIG_1205_5.path, PIG_1205_6.patch, PIG_1205_7.patch, 
 PIG_1205_8.patch




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1343) pig_log file missing even though Main tells it is creating one and an M/R job fails

2010-08-26 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12903072#action_12903072
 ] 

Richard Ding commented on PIG-1343:
---


The new patch logs an NPE instead of the intended message:

{code}
[main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2999: Unexpected internal 
error. null
{code}

 pig_log file missing even though Main tells it is creating one and an M/R job 
 fails 
 

 Key: PIG-1343
 URL: https://issues.apache.org/jira/browse/PIG-1343
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat
Assignee: niraj rai
 Fix For: 0.8.0

 Attachments: 1343.patch, PIG-1343-1.patch, pig_1343_2.patch


 There is a particular case where I was running with the latest trunk of Pig.
 {code}
 $java -cp pig.jar:/home/path/hadoop20cluster org.apache.pig.Main testcase.pig
 [main] INFO  org.apache.pig.Main - Logging error messages to: 
 /homes/viraj/pig_1263420012601.log
 $ls -l pig_1263420012601.log
 ls: pig_1263420012601.log: No such file or directory
 {code}
 The job failed and the log file did not contain anything, the only way to 
 debug was to look into the Jobtracker logs.
 Here are some reasons which would have caused this behavior:
 1) The underlying filer/NFS had some issues. In that case do we not error on 
 stdout?
 2) There are some errors from the backend which are not being captured
 Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1343) pig_log file missing even though Main tells it is creating one and an M/R job fails

2010-08-26 Thread niraj rai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

niraj rai updated PIG-1343:
---

Attachment: pig_1343_3.patch

 pig_log file missing even though Main tells it is creating one and an M/R job 
 fails 
 

 Key: PIG-1343
 URL: https://issues.apache.org/jira/browse/PIG-1343
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat
Assignee: niraj rai
 Fix For: 0.8.0

 Attachments: 1343.patch, PIG-1343-1.patch, pig_1343_2.patch, 
 pig_1343_3.patch


 There is a particular case where I was running with the latest trunk of Pig.
 {code}
 $java -cp pig.jar:/home/path/hadoop20cluster org.apache.pig.Main testcase.pig
 [main] INFO  org.apache.pig.Main - Logging error messages to: 
 /homes/viraj/pig_1263420012601.log
 $ls -l pig_1263420012601.log
 ls: pig_1263420012601.log: No such file or directory
 {code}
 The job failed and the log file did not contain anything, the only way to 
 debug was to look into the Jobtracker logs.
 Here are some reasons which would have caused this behavior:
 1) The underlying filer/NFS had some issues. In that case do we not error on 
 stdout?
 2) There are some errors from the backend which are not being captured
 Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1343) pig_log file missing even though Main tells it is creating one and an M/R job fails

2010-08-26 Thread niraj rai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

niraj rai updated PIG-1343:
---

Status: Open  (was: Patch Available)

 pig_log file missing even though Main tells it is creating one and an M/R job 
 fails 
 

 Key: PIG-1343
 URL: https://issues.apache.org/jira/browse/PIG-1343
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat
Assignee: niraj rai
 Fix For: 0.8.0

 Attachments: 1343.patch, PIG-1343-1.patch, pig_1343_2.patch, 
 pig_1343_3.patch


 There is a particular case where I was running with the latest trunk of Pig.
 {code}
 $java -cp pig.jar:/home/path/hadoop20cluster org.apache.pig.Main testcase.pig
 [main] INFO  org.apache.pig.Main - Logging error messages to: 
 /homes/viraj/pig_1263420012601.log
 $ls -l pig_1263420012601.log
 ls: pig_1263420012601.log: No such file or directory
 {code}
 The job failed and the log file did not contain anything, the only way to 
 debug was to look into the Jobtracker logs.
 Here are some reasons which would have caused this behavior:
 1) The underlying filer/NFS had some issues. In that case do we not error on 
 stdout?
 2) There are some errors from the backend which are not being captured
 Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1458) aggregate files for replicated join

2010-08-26 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1458:
--

Attachment: PIG-1458.patch

This patch uses the new multi-file combiner (PIG-1518) to concatenate many 
small files for a replicated join. This is based on the assumption that the total 
size of the replicated files is small enough to fit into main memory. 
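
For context, this is the join variant requested with the 'replicated' hint in 
Pig Latin (aliases here are illustrative):

{code}
big   = LOAD 'big_input'   AS (k, v);
small = LOAD 'small_input' AS (k, w);   -- may consist of many small files
-- Fragment-replicate join: the small side is held in memory on each map task,
-- so its total size must fit in main memory.
J = JOIN big BY k, small BY k USING 'replicated';
{code}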

 aggregate files for replicated join
 ---

 Key: PIG-1458
 URL: https://issues.apache.org/jira/browse/PIG-1458
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1458.patch


 We have noticed that if the smaller data in a replicated join has many files, 
 this puts an unneeded burden on the name node. Pre-aggregating the files can 
 improve the situation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1518) multi file input format for loaders

2010-08-26 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12903102#action_12903102
 ] 

Yan Zhou commented on PIG-1518:
---

It is not combinable if the loader is a CollectableLoadFunc AND an 
OrderedLoadFunc. Since PigStorage is a CollectableLoadFunc but not an 
OrderedLoadFunc, it is combinable.

 multi file input format for loaders
 ---

 Key: PIG-1518
 URL: https://issues.apache.org/jira/browse/PIG-1518
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, 
 PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch


 We frequently run into the situation where Pig needs to deal with small files 
 in the input. In this case a separate map is created for each file, which 
 can be very inefficient. 
 It would be great to have an umbrella input format that can take multiple 
 files and use them in a single split. We would like to see this working with 
 different data formats if possible.
 There are already a couple of input formats doing a similar thing: 
 MultifileInputFormat as well as CombinedInputFormat; however, neither works 
 with the new Hadoop 20 API. 
 We at least want to do a feasibility study for Pig 0.8.0.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1565) additional piggybank datetime and string UDFs

2010-08-26 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12903123#action_12903123
 ] 

Alan Gates commented on PIG-1565:
-

Comments
# ErrorCatchingBase swallows any non-ExecExceptions.  It should print their 
messages out as warnings.  Warnings are collated and the count is reported at the 
end of the job; details are only printed if the user asks for them.  That way 
the user will still be informed that something unexpected happened and can 
investigate further if he wants to.
# On the duplication, it looks to me like INDEX_OF and LAST_INDEX_OF are 
supersets of the functions already in Pig.  You could submit a patch for those 
two functions (which are now builtins) to extend them to take the optional 
third argument.  SPLIT_ON_REGEX looks like a subset of the existing SPLIT 
function that is built into Pig, so other than keeping it as an alias for 
Amazon users who are used to calling SPLIT_ON_REGEX, I'm not clear what the 
value is.

Thanks for contributing all these, this is great.

I'll run test-patch and the unit tests and post the results.


 additional piggybank datetime and string UDFs
 -

 Key: PIG-1565
 URL: https://issues.apache.org/jira/browse/PIG-1565
 Project: Pig
  Issue Type: Improvement
Reporter: Andrew Hitchcock
 Attachments: PIG-1565-1.patch


 Pig is missing a variety of UDFs that might be helpful for users implementing 
 Pig scripts.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-1565) additional piggybank datetime and string UDFs

2010-08-26 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates reassigned PIG-1565:
---

Assignee: Andrew Hitchcock

 additional piggybank datetime and string UDFs
 -

 Key: PIG-1565
 URL: https://issues.apache.org/jira/browse/PIG-1565
 Project: Pig
  Issue Type: Improvement
Reporter: Andrew Hitchcock
Assignee: Andrew Hitchcock
 Fix For: 0.8.0

 Attachments: PIG-1565-1.patch


 Pig is missing a variety of UDFs that might be helpful for users implementing 
 Pig scripts.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1343) pig_log file missing even though Main tells it is creating one and an M/R job fails

2010-08-26 Thread niraj rai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

niraj rai updated PIG-1343:
---

Attachment: pig_1343_4.patch

 pig_log file missing even though Main tells it is creating one and an M/R job 
 fails 
 

 Key: PIG-1343
 URL: https://issues.apache.org/jira/browse/PIG-1343
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat
Assignee: niraj rai
 Fix For: 0.8.0

 Attachments: 1343.patch, PIG-1343-1.patch, pig_1343_2.patch, 
 pig_1343_4.patch


 There is a particular case where I was running with the latest trunk of Pig.
 {code}
 $java -cp pig.jar:/home/path/hadoop20cluster org.apache.pig.Main testcase.pig
 [main] INFO  org.apache.pig.Main - Logging error messages to: 
 /homes/viraj/pig_1263420012601.log
 $ls -l pig_1263420012601.log
 ls: pig_1263420012601.log: No such file or directory
 {code}
 The job failed and the log file did not contain anything, the only way to 
 debug was to look into the Jobtracker logs.
 Here are some reasons which would have caused this behavior:
 1) The underlying filer/NFS had some issues. In that case do we not error on 
 stdout?
 2) There are some errors from the backend which are not being captured
 Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1564) add support for multiple filesystems

2010-08-26 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12903128#action_12903128
 ] 

Alan Gates commented on PIG-1564:
-

We do intend to remove it, though at the moment there is no other way to access 
HDFS for UDFs.  So before we can officially deprecate it we need to come up 
with a replacement.

Andrew, as Richard points out, as of Pig 0.7 load and store functions no longer 
use HDataStorage.  Do you still see this patch as being useful just for UDFs?  
Or are load and store functions the only use cases for it?

 add support for multiple filesystems
 

 Key: PIG-1564
 URL: https://issues.apache.org/jira/browse/PIG-1564
 Project: Pig
  Issue Type: Improvement
Reporter: Andrew Hitchcock
 Attachments: PIG-1564-1.patch


 Currently you can't run Pig scripts that read data from one file system and 
 write it to another. Also, Grunt doesn't support CDing from one directory to 
 another on different file systems.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-506) Does pig need a NATIVE keyword?

2010-08-26 Thread Thejas M Nair (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thejas M Nair updated PIG-506:
--

Attachment: PIG-506.3.patch

Updated patch, earlier patch was missing 
src/org/apache/pig/newplan/logical/relational/LONative.java.

test-patch and core tests are successful. 

 [exec] +1 overall.
 [exec]
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec]
 [exec] +1 tests included.  The patch appears to include 8 new or 
modified tests.
 [exec]
 [exec] +1 javadoc.  The javadoc tool did not generate any warning 
messages.
 [exec]
 [exec] +1 javac.  The applied patch does not increase the total number 
of javac compiler warnings.
 [exec]
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs 
warnings.
 [exec]
 [exec] +1 release audit.  The applied patch does not increase the 
total number of release audit warnings.


 Does pig need a NATIVE keyword?
 ---

 Key: PIG-506
 URL: https://issues.apache.org/jira/browse/PIG-506
 Project: Pig
  Issue Type: New Feature
  Components: impl
Reporter: Alan Gates
Assignee: Aniket Mokashi
Priority: Minor
 Fix For: 0.8.0

 Attachments: NativeImplInitial.patch, NativeMapReduceFinale1.patch, 
 NativeMapReduceFinale2.patch, NativeMapReduceFinale3.patch, PIG-506.2.patch, 
 PIG-506.3.patch, PIG-506.patch, TestWordCount.jar


 Assume a user had a job that broke easily into three pieces.  Further assume 
 that pieces one and three were easily expressible in pig, but that piece two 
 needed to be written in map reduce for whatever reason (performance, 
 something that pig could not easily express, legacy job that was too 
 important to change, etc.).  Today the user would either have to use map 
 reduce for the entire job or manually handle the stitching together of pig 
 and map reduce jobs.  What if instead pig provided a NATIVE keyword that 
 would allow the script to pass off the data stream to the underlying system 
 (in this case map reduce).  The semantics of NATIVE would vary by underlying 
 system.  In the map reduce case, we would assume that this indicated a 
 collection of one or more fully contained map reduce jobs, so that pig would 
 store the data, invoke the map reduce jobs, and then read the resulting data 
 to continue.  It might look something like this:
 {code}
 A = load 'myfile';
 X = load 'myotherfile';
 B = group A by $0;
 C = foreach B generate group, myudf(B);
 D = native (jar=mymr.jar, infile=frompig outfile=topig);
 E = join D by $0, X by $0;
 ...
 {code}
 This differs from streaming in that it allows the user to insert an arbitrary 
 amount of native processing, whereas streaming allows the insertion of one 
 binary.  It also differs in that, for streaming, data is piped directly into 
 and out of the binary as part of the pig pipeline.  Here the pipeline would 
 be broken, data written to disk, and the native block invoked, then data read 
 back from disk.
 Another alternative is to say this is unnecessary because the user can do the 
 coordination from Java, using the PigServer interface to run Pig and calling 
 the map reduce job explicitly.  The advantages of the native keyword are that 
 the user need not be worried about coordination between the jobs, pig will 
 take care of it.  Also the user can make use of existing java applications 
 without being a java programmer.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1565) additional piggybank datetime and string UDFs

2010-08-26 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12903174#action_12903174
 ] 

Alan Gates commented on PIG-1565:
-

 [exec] +1 overall.
 [exec]
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec]
 [exec] +1 tests included.  The patch appears to include 5 new or modified tests.
 [exec]
 [exec] +1 javadoc.  The javadoc tool did not generate any warning messages.
 [exec]
 [exec] +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
 [exec]
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs warnings.
 [exec]
 [exec] +1 release audit.  The applied patch does not increase the total number of release audit warnings.
 [exec]
 [exec]

 additional piggybank datetime and string UDFs
 -

 Key: PIG-1565
 URL: https://issues.apache.org/jira/browse/PIG-1565
 Project: Pig
  Issue Type: Improvement
Reporter: Andrew Hitchcock
Assignee: Andrew Hitchcock
 Fix For: 0.8.0

 Attachments: PIG-1565-1.patch


 Pig is missing a variety of UDFs that might be helpful for users implementing 
 Pig scripts.




Re: Pig Contributor meeting notes

2010-08-26 Thread Jeff Zhang
Alan,

That's great, next time I will try to join the contributor meeting.

On Thu, Aug 26, 2010 at 11:35 AM, Alan Gates ga...@yahoo-inc.com wrote:

 On Aug 26, 2010, at 12:55 AM, Jeff Zhang wrote:

 Wonderful, Dmitriy. It's a pity I missed the contributor meeting.
 Were any slides shared?

 Jeff,

 We don't want to exclude our contributors who don't happen to live in the
 San Francisco Bay Area.  If we could include you via Skype or some other
 technology we'd be happy to set it up on our end.  Do you think something
 like that would work for you?

 Alan.





-- 
Best Regards

Jeff Zhang


Re: Pig Contributor meeting notes

2010-08-26 Thread Jeff Zhang
BTW, Dmitriy actually invited me to join this meeting through
Skype, but it's a pity that I had no time to join it this time.


On Thu, Aug 26, 2010 at 6:15 PM, Jeff Zhang zjf...@gmail.com wrote:
 Alan,

 That's great, next time I will try to join the contributor meeting.

 On Thu, Aug 26, 2010 at 11:35 AM, Alan Gates ga...@yahoo-inc.com wrote:

 On Aug 26, 2010, at 12:55 AM, Jeff Zhang wrote:

 Wonderful, Dmitriy. It's a pity I missed the contributor meeting.
 Were any slides shared?

 Jeff,

 We don't want to exclude our contributors who don't happen to live in the
 San Francisco Bay Area.  If we could include you via Skype or some other
 technology we'd be happy to set it up on our end.  Do you think something
 like that would work for you?

 Alan.





 --
 Best Regards

 Jeff Zhang




-- 
Best Regards

Jeff Zhang


[jira] Updated: (PIG-1343) pig_log file missing even though Main tells it is creating one and an M/R job fails

2010-08-26 Thread niraj rai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

niraj rai updated PIG-1343:
---

Attachment: PIG_1343_5.patch

 pig_log file missing even though Main tells it is creating one and an M/R job fails 
 

 Key: PIG-1343
 URL: https://issues.apache.org/jira/browse/PIG-1343
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat
Assignee: niraj rai
 Fix For: 0.8.0

 Attachments: 1343.patch, PIG-1343-1.patch, pig_1343_2.patch, 
 pig_1343_4.patch, PIG_1343_5.patch


 There is a particular case where I was running with the latest trunk of Pig.
 {code}
 $java -cp pig.jar:/home/path/hadoop20cluster org.apache.pig.Main testcase.pig
 [main] INFO  org.apache.pig.Main - Logging error messages to: 
 /homes/viraj/pig_1263420012601.log
 $ls -l pig_1263420012601.log
 ls: pig_1263420012601.log: No such file or directory
 {code}
 The job failed and the log file did not contain anything; the only way to 
 debug was to look into the JobTracker logs.
 Here are some reasons that could have caused this behavior:
 1) The underlying filer/NFS had some issues. In that case, should we not 
 report an error on stdout?
 2) There are some errors from the backend that are not being captured.
 Viraj




[jira] Updated: (PIG-1562) Fix the version for the dependent packages for the maven

2010-08-26 Thread niraj rai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

niraj rai updated PIG-1562:
---

Attachment: PIG_1562_0.patch

This patch fixes the version issue for the required packages.

 Fix the version for the dependent packages for the maven 
 -

 Key: PIG-1562
 URL: https://issues.apache.org/jira/browse/PIG-1562
 Project: Pig
  Issue Type: Bug
Reporter: niraj rai
Assignee: niraj rai
 Fix For: 0.8.0

 Attachments: PIG_1562_0.patch


 We need to fix the version setting so that the version is properly set for the 
 dependent packages in the Maven repository.
