[jira] Reopened: (PIG-1182) Pig reference manual does not mention syntax for comments

2010-02-19 Thread David Ciemiewicz (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Ciemiewicz reopened PIG-1182:
---


Corinne, I'm not sure why you are so resistant to following the basic 
principle of documenting ALL syntax, including comments, in the reference 
manual. If the document is open to the community to edit, I'm more than 
willing to do the work myself, since I have contributed as a technical writer 
for programming language reference manuals in the past, as well as having been 
a developer of compilers and software development tools.

Also, I think the passage you cited could use a little work on the English: 

Using Comments in Scripts
If you place Pig Latin statements in a script, the script can include comments.

For multi-line comments use /*  */
For single line comments use --
/* myscript.pig
My script includes three simple Pig Latin Statements.
*/

A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float); 
-- load statement
B = FOREACH A GENERATE name;  -- foreach statement
DUMP B;  --dump statement
Case Sensitivity


 Pig reference manual does not mention syntax for comments
 -

 Key: PIG-1182
 URL: https://issues.apache.org/jira/browse/PIG-1182
 Project: Pig
  Issue Type: Bug
  Components: documentation
Affects Versions: 0.5.0
Reporter: David Ciemiewicz
Assignee: Corinne Chandel
 Fix For: 0.7.0


 The Pig 0.5.0 reference manual does not mention how to write comments in your 
 pig code using -- (two dashes).
 http://hadoop.apache.org/pig/docs/r0.5.0/piglatin_reference.html
 Also, does /* */ also work?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Reopened: (PIG-1216) New load store design does not allow Pig to validate inputs and outputs up front

2010-02-19 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan reopened PIG-1216:
---


Reopening as the assumption made for the patch doesn't hold.

 New load store design does not allow Pig to validate inputs and outputs up 
 front
 

 Key: PIG-1216
 URL: https://issues.apache.org/jira/browse/PIG-1216
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Alan Gates
Assignee: Ashutosh Chauhan
 Fix For: 0.7.0

 Attachments: pig-1216.patch, pig-1216_1.patch


 In Pig 0.6 and before, Pig attempts to verify existence of inputs and 
 non-existence of outputs during parsing to avoid run time failures when 
 inputs don't exist or outputs can't be overwritten.  The downside to this was 
 that Pig assumed all inputs and outputs were HDFS files, which made 
 implementation harder for non-HDFS based load and store functions.  In the 
 load store redesign (PIG-966) this was delegated to InputFormats and 
 OutputFormats to avoid this problem and to make use of the checks already 
 being done in those implementations.  Unfortunately, for Pig Latin scripts 
 that run more then one MR job, this does not work well.  MR does not do 
 input/output verification on all the jobs at once.  It does them one at a 
 time.  So if a Pig Latin script results in 10 MR jobs and the file to store 
 to at the end already exists, the first 9 jobs will be run before the 10th 
 job discovers that the whole thing was doomed from the beginning.  
 To avoid this a validate call needs to be added to the new LoadFunc and 
 StoreFunc interfaces.  Pig needs to pass this method enough information that 
 the load function implementer can delegate to InputFormat.getSplits() and the 
 store function implementer to OutputFormat.checkOutputSpecs() if s/he decides 
 to.  Since 90% of all load and store functions use HDFS and PigStorage will 
 also need to, the Pig team should implement a default file existence check on 
 HDFS and make it available as a static method to other Load/Store function 
 implementers.  
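The proposed up-front check can be sketched as follows. This is a hypothetical illustration, not Pig's actual API: `java.io.File` stands in for the HDFS FileSystem calls the real static helper would use, and the class and method names are invented for the example.

```java
import java.io.File;
import java.io.IOException;

// Illustrative sketch of up-front validation: verify that every input exists
// and that no output exists before any MR job is launched, so a doomed
// 10-job pipeline fails immediately instead of after 9 jobs.
class UpFrontValidator {
    static void validate(String[] inputs, String[] outputs) throws IOException {
        for (String in : inputs) {
            if (!new File(in).exists()) {
                throw new IOException("input does not exist: " + in);
            }
        }
        for (String out : outputs) {
            if (new File(out).exists()) {
                throw new IOException("output already exists: " + out);
            }
        }
    }
}
```

A load or store function could delegate to a shared default like this instead of re-implementing the existence check for HDFS.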

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



COMPLETED merge of load-store-redesign branch to trunk

2010-02-19 Thread Pradeep Kamath
The merge from load-store-redesign branch to trunk is now completed. New
commits can now proceed on trunk. The load-store-redesign branch is
deprecated with this merge and no more commits should be done on that
branch.

 

Pradeep

 



From: Pradeep Kamath 
Sent: Thursday, February 18, 2010 11:20 AM
To: Pradeep Kamath; 'pig-dev@hadoop.apache.org';
'pig-u...@hadoop.apache.org'
Subject: BEGINNING merge of load-store-redesign branch to trunk - hold
off commits!

 

Hi,

  I will begin this activity now - a request to all committers to not
commit to trunk or load-store-redesign till I send an all clear message
- I am anticipating this will hopefully be completed by end of day
(Pacific time) tomorrow.

 

Thanks,

Pradeep

 



From: Pradeep Kamath 
Sent: Tuesday, February 16, 2010 11:34 AM
To: 'pig-dev@hadoop.apache.org'; 'pig-u...@hadoop.apache.org'
Subject: Plan to merge load-store-redesign branch to trunk

 

Hi,

   We would like to merge the load-store-redesign branch to trunk
tentatively on Thursday. To do this, I would like to request all
committers to not commit anything to load-store-redesign branch or trunk
during the period of the merge. I will send out a mail to indicate begin
and end of this activity - tentatively I am expecting this to be a day's
period between 9 AM PST Thursday to 9AM PST Friday so I can resolve any
conflicts and run all tests.

 

Pradeep

 



[jira] Commented: (PIG-1188) Padding nulls to the input tuple according to input schema

2010-02-19 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12835944#action_12835944
 ] 

Richard Ding commented on PIG-1188:
---

To summarize where we are:

Right now the Pig project operator pads nulls if the value to be projected 
doesn't exist. As a consequence, the desired result is achieved if PigStorage 
is used and a schema with data types is specified, since in this case Pig 
inserts a project+cast operator for each field in the schema.

In the case where no schema is specified in the load statement, Pig does a 
good job of adhering to Pig's philosophy and lets the program run without 
throwing a runtime exception.

That leaves the case where a schema is specified without data types. There are 
several options:

   * Pig automatically inserts a project operator for each field in the schema 
to ensure the input data matches the schema. The trade-off is a performance 
penalty. Is it worthwhile if most user data is well-behaved?

   * Users can explicitly add a foreach statement after the load statement 
that projects all the fields in the schema. This is similar to the existing 
practice of running a map job first to clean up the data.

   * Pig can also delegate the padding work to the loaders. The problem is 
that the schema currently isn't passed to the loaders.
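The padding in the first option can be sketched as below. This is a minimal illustration only; the tuple is modeled as a plain List and the class name is invented, not Pig's actual tuple or schema classes.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: given a tuple (modeled as a List) and the declared
// schema width, drop extra fields and pad missing ones with null so every
// row matches the schema.
class SchemaPadder {
    static List<Object> padToSchema(List<Object> tuple, int schemaSize) {
        // keep at most schemaSize fields from the input tuple
        List<Object> out =
            new ArrayList<>(tuple.subList(0, Math.min(tuple.size(), schemaSize)));
        // pad short tuples with nulls, as in the desired result in the issue
        while (out.size() < schemaSize) {
            out.add(null);
        }
        return out;
    }
}
```

With a schema of two fields, the row `(1,2,3)` becomes `(1,2)` and the row `(1)` becomes `(1,null)`, matching the desired result in the issue description.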





 Padding nulls to the input tuple according to input schema
 --

 Key: PIG-1188
 URL: https://issues.apache.org/jira/browse/PIG-1188
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Daniel Dai
Assignee: Richard Ding
 Fix For: 0.7.0


 Currently, the number of fields in the input tuple is determined by the data. 
 When we have schema, we should generate input data according to the schema, 
 and padding nulls if necessary. Here is one example:
 Pig script:
 {code}
 a = load '1.txt' as (a0, a1);
 dump a;
 {code}
 Input file:
 {code}
 1   2
 1   2   3
 1
 {code}
 Current result:
 {code}
 (1,2)
 (1,2,3)
 (1)
 {code}
 Desired result:
 {code}
 (1,2)
 (1,2)
 (1, null)
 {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: COMPLETED merge of load-store-redesign branch to trunk

2010-02-19 Thread Gerrit van Vuuren
Great stuff guys,

I've been keen on refactoring the pig HiveRCLoader reader and writer to use the 
new load-store redesign.

 

- Original Message -
From: Pradeep Kamath prade...@yahoo-inc.com
To: pig-dev@hadoop.apache.org pig-dev@hadoop.apache.org; 
pig-u...@hadoop.apache.org pig-u...@hadoop.apache.org
Sent: Fri Feb 19 20:05:54 2010
Subject: COMPLETED merge of load-store-redesign branch to trunk 

The merge from load-store-redesign branch to trunk is now completed. New
commits can now proceed on trunk. The load-store-redesign branch is
deprecated with this merge and no more commits should be done on that
branch.

 

Pradeep

 



From: Pradeep Kamath 
Sent: Thursday, February 18, 2010 11:20 AM
To: Pradeep Kamath; 'pig-dev@hadoop.apache.org';
'pig-u...@hadoop.apache.org'
Subject: BEGINNING merge of load-store-redesign branch to trunk - hold
off commits!

 

Hi,

  I will begin this activity now - a request to all committers to not
commit to trunk or load-store-redesign till I send an all clear message
- I am anticipating this will hopefully be completed by end of day
(Pacific time) tomorrow.

 

Thanks,

Pradeep

 



From: Pradeep Kamath 
Sent: Tuesday, February 16, 2010 11:34 AM
To: 'pig-dev@hadoop.apache.org'; 'pig-u...@hadoop.apache.org'
Subject: Plan to merge load-store-redesign branch to trunk

 

Hi,

   We would like to merge the load-store-redesign branch to trunk
tentatively on Thursday. To do this, I would like to request all
committers to not commit anything to load-store-redesign branch or trunk
during the period of the merge. I will send out a mail to indicate begin
and end of this activity - tentatively I am expecting this to be a day's
period between 9 AM PST Thursday to 9AM PST Friday so I can resolve any
conflicts and run all tests.

 

Pradeep

 



[jira] Updated: (PIG-1215) Make Hadoop jobId more prominent in the client log

2010-02-19 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1215:
--

Attachment: pig-1215_4.patch

Change as suggested by Olga. Other parts of the patch are as before.

 Make Hadoop jobId more prominent in the client log
 --

 Key: PIG-1215
 URL: https://issues.apache.org/jira/browse/PIG-1215
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Ashutosh Chauhan
 Fix For: 0.7.0

 Attachments: pig-1215.patch, pig-1215.patch, pig-1215_1.patch, 
 pig-1215_3.patch, pig-1215_4.patch


 This is a request from applications that want to be able to programmatically 
 parse client logs to find hadoop Ids.
 They would like to see each job id on a separate line in the following format:
 hadoopJobId: job_123456789
 They would also like to see the jobs in the order they are executed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1245) Remove the connection to namenode in HExecutionEngine.init()

2010-02-19 Thread Pradeep Kamath (JIRA)
Remove the connection to namenode in HExecutionEngine.init() 


 Key: PIG-1245
 URL: https://issues.apache.org/jira/browse/PIG-1245
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Pradeep Kamath
 Fix For: 0.7.0


PigContext.connect() calls HExecutionEngine.init(). The former is called from 
the backend map/reduce tasks in DefaultIndexableLoader used in merge join. It 
is not clear that a connection to the namenode is required in 
HExecutionEngine.init().

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1188) Padding nulls to the input tuple according to input schema

2010-02-19 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1188:


Fix Version/s: (was: 0.7.0)

Looks like most common cases are already working. Unlinking from 0.7.0 release.

 Padding nulls to the input tuple according to input schema
 --

 Key: PIG-1188
 URL: https://issues.apache.org/jira/browse/PIG-1188
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Daniel Dai
Assignee: Richard Ding

 Currently, the number of fields in the input tuple is determined by the data. 
 When we have schema, we should generate input data according to the schema, 
 and padding nulls if necessary. Here is one example:
 Pig script:
 {code}
 a = load '1.txt' as (a0, a1);
 dump a;
 {code}
 Input file:
 {code}
 1   2
 1   2   3
 1
 {code}
 Current result:
 {code}
 (1,2)
 (1,2,3)
 (1)
 {code}
 Desired result:
 {code}
 (1,2)
 (1,2)
 (1, null)
 {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1215) Make Hadoop jobId more prominent in the client log

2010-02-19 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1215:
--

Resolution: Fixed
Status: Resolved  (was: Patch Available)

Patch checked-in.

 Make Hadoop jobId more prominent in the client log
 --

 Key: PIG-1215
 URL: https://issues.apache.org/jira/browse/PIG-1215
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Ashutosh Chauhan
 Fix For: 0.7.0

 Attachments: pig-1215.patch, pig-1215.patch, pig-1215_1.patch, 
 pig-1215_3.patch, pig-1215_4.patch


 This is a request from applications that want to be able to programmatically 
 parse client logs to find hadoop Ids.
 They would like to see each job id on a separate line in the following format:
 hadoopJobId: job_123456789
 They would also like to see the jobs in the order they are executed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-966) Proposed rework for LoadFunc, StoreFunc, and Slice/r interfaces

2010-02-19 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12836035#action_12836035
 ] 

Pradeep Kamath commented on PIG-966:


LoadFunc is now an abstract class with default implementations for some of the 
methods - we hope this will aid implementers.  I would like to make the same 
change for StoreFunc. Since PigStorage currently does both load and store, we 
would need to also introduce an interface - StoreFuncInterface so that 
PigStorage can extend LoadFunc and implement StoreFuncInterface. To be 
symmetrical, we would need to also introduce a LoadFuncInterface. This 
interface can be used by implementers who want their LoadFunc implementation 
to extend some other class. We can document and strongly recommend that users 
use only our abstract classes, since that would make them less vulnerable to 
incompatible additions in the future (hopefully, when we add new methods to 
these abstract classes, we will provide default implementations).

I will upload a patch for this unless anyone has strong objections.
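The abstract-class rationale above can be illustrated with a toy example; the method names here are invented for the sketch and are not Pig's actual LoadFunc API. When a method is added to an abstract base class with a default body, existing subclasses keep compiling, whereas the same addition to an interface would break every implementation.

```java
// Illustrative only: an abstract base that later gains a method with a
// default body, and a subclass written before the addition.
abstract class AbstractLoadFunc {
    abstract String getNext();

    // imagine this method was added in a later release, with a default body
    String getSignature() {
        return null;
    }
}

class MyLoader extends AbstractLoadFunc {
    // written against the old release; still compiles after the addition
    String getNext() {
        return "record";
    }
}
```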

 Proposed rework for LoadFunc, StoreFunc, and Slice/r interfaces
 ---

 Key: PIG-966
 URL: https://issues.apache.org/jira/browse/PIG-966
 Project: Pig
  Issue Type: Improvement
  Components: impl
Reporter: Alan Gates
Assignee: Alan Gates

 I propose that we rework the LoadFunc, StoreFunc, and Slice/r interfaces 
 significantly.  See http://wiki.apache.org/pig/LoadStoreRedesignProposal for 
 full details

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1218) Use distributed cache to store samples

2010-02-19 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-1218:


  Resolution: Fixed
Hadoop Flags: [Reviewed]
  Status: Resolved  (was: Patch Available)

Committed patch PIG-1218_2.patch since the merge join changes need to be 
re-worked and will be handled in a different patch.

Thanks Richard!

 Use distributed cache to store samples
 --

 Key: PIG-1218
 URL: https://issues.apache.org/jira/browse/PIG-1218
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Richard Ding
 Fix For: 0.7.0

 Attachments: PIG-1218.patch, PIG-1218_2.patch, PIG-1218_3.patch


 Currently, in the case of skew join and order by, we use a sample file that 
 is just written to the DFS (not the distributed cache) and, as a result, it 
 gets opened and copied around more than necessary. This impacts query 
 performance and also places unnecessary load on the name node

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1182) Pig reference manual does not mention syntax for comments

2010-02-19 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12836055#action_12836055
 ] 

Olga Natkovich commented on PIG-1182:
-

Ciemo,

There is a reason why Corinne created two sections of the document. A single 
document was just too large, which made it hard to manage changes, and it even 
took some time to load.

If I understand correctly, the real issue you are pointing out is that it is 
hard to quickly find the specific information you are looking for. 
Traditionally, indices are used for this purpose, and the Pig documentation 
does not have one. 

Short term, Corinne does not have time to work on it due to other commitments. 
If you or other users would like to help with that, it would certainly be 
appreciated. 

 Pig reference manual does not mention syntax for comments
 -

 Key: PIG-1182
 URL: https://issues.apache.org/jira/browse/PIG-1182
 Project: Pig
  Issue Type: Bug
  Components: documentation
Affects Versions: 0.5.0
Reporter: David Ciemiewicz
Assignee: Corinne Chandel
 Fix For: 0.7.0


 The Pig 0.5.0 reference manual does not mention how to write comments in your 
 pig code using -- (two dashes).
 http://hadoop.apache.org/pig/docs/r0.5.0/piglatin_reference.html
 Also, does /* */ also work?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1246) SequenceFileLoader problem with compressed values

2010-02-19 Thread Derek Brown (JIRA)
SequenceFileLoader problem with compressed values
-

 Key: PIG-1246
 URL: https://issues.apache.org/jira/browse/PIG-1246
 Project: Pig
  Issue Type: Bug
Reporter: Derek Brown


I sent the following to the pig-users list, and Dmitriy said to open a ticket.

http://mail-archives.apache.org/mod_mbox/hadoop-pig-user/201002.mbox/%3c357a70951002191451n6136a3en8475652fc0bd3...@mail.gmail.com%3e

 I'm having a problem getting the SequenceFileLoader, from the Piggybank, to
 read sequence files whose values are block compressed (gzip'd). I'm using
 Pig 0.4.99.0+10, and Hadoop hadoop-0.20.1+152, via Cloudera.

 Did the following:

 * Copied the SequenceFileLoader class into my own project

 * Removed

 public LoadFunc.RequiredFieldResponse
 fieldsToRead(LoadFunc.RequiredFieldList requiredFieldList)

 because LoadFunc.RequiredFieldList isn't resolvable, and added

 public void fieldsToRead(Schema schema)

 * Jarred up the .class file

 * Programmatically created a trivial sequence file of a few lines, with
 IntWritable keys and Text values, using the basic code in an example in
 Hadoop The Definitive Guide

 * That file is successfully read and keys/values displayed, with hadoop fs
 -text, as well as with pig, doing the following:

 grunt> register sequencefileloader.jar;
 grunt> r = load '/path/to/sequence_file' using
 com.foobar.SequenceFileLoader();
 grunt> dump r;

 * The sequence file with the compressed values is successfully read with
 hadoop fs -text

 * When doing the load step in pig with that file, the following results:

 --
 2010-02-19 16:59:14,489 [main] WARN
  org.apache.hadoop.util.NativeCodeLoader
 - Unable to load native-hadoop library for your platform..
 . using builtin-java classes where applicable
 2010-02-19 16:59:14,490 [main] INFO
  org.apache.hadoop.io.compress.CodecPool
 - Got brand-new decompressor
 2010-02-19 16:59:14,498 [main] ERROR org.apache.pig.tools.grunt.Grunt -
 ERROR 1018: Problem determining schema during load
 Details at logfile: /path/to/pig_1266616744562.log
 --

 That log file contains the following:

 --
 org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error
 during
 parsing. Problem determining schema during load
at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1037)
at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:981)
at org.apache.pig.PigServer.registerQuery(PigServer.java:383)
at
 org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:717)
at

 org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:273)
at

 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166)
at

 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:142)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:75)
at org.apache.pig.Main.main(Main.java:363)
 Caused by: org.apache.pig.impl.logicalLayer.parser.ParseException: Problem
 determining schema during load
at

 org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:734)
at

 org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63)
at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1031)
... 8 more
 Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1018:
 Problem determining schema during load
at
 org.apache.pig.impl.logicalLayer.LOLoad.getSchema(LOLoad.java:155)
at

 org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:732)
... 10 more
 Caused by: java.io.EOFException
at java.util.zip.GZIPInputStream.readUByte(GZIPInputStream.java:207)
at
 java.util.zip.GZIPInputStream.readUShort(GZIPInputStream.java:197)
at
 java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:136)
at java.util.zip.GZIPInputStream.init(GZIPInputStream.java:58)
at java.util.zip.GZIPInputStream.init(GZIPInputStream.java:68)
at

 org.apache.hadoop.io.compress.GzipCodec$GzipInputStream$ResetableGZIPInputStream.init(GzipCodec.java:92)
at

 org.apache.hadoop.io.compress.GzipCodec$GzipInputStream.init(GzipCodec.java:101)
at

 org.apache.hadoop.io.compress.GzipCodec.createInputStream(GzipCodec.java:169)
at

 org.apache.hadoop.io.compress.GzipCodec.createInputStream(GzipCodec.java:179)
at
 org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1520)
at
 org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1428)
at
 org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1417)
at
 org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1412)
at
 com.media6.SequenceFileLoader.inferReader(SequenceFileLoader.java:140)
at
 

[jira] Commented: (PIG-966) Proposed rework for LoadFunc, StoreFunc, and Slice/r interfaces

2010-02-19 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12836080#action_12836080
 ] 

Pradeep Kamath commented on PIG-966:


In retrospect, I think we can skip creating a LoadFuncInterface, since 
currently there is no real use case for an interface - we were adding it to 
keep symmetry with StoreFuncInterface and to allow implementations that extend 
other classes to implement the interface. The first motivation is not very 
strong, and the second can also be achieved through composition rather than 
inheritance - it is unclear how inheriting from a different class would 
benefit a Loader implementation over using composition to delegate 
functionality. By introducing a LoadFuncInterface we would be exposing users 
who implement it to backward-incompatible additions in the future. So I think 
we should not add a LoadFuncInterface now and add it ONLY if a real need 
arises. The rest of my proposal (making StoreFunc an abstract class and adding 
a new StoreFuncInterface) still holds.
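The composition alternative mentioned above can be sketched as follows; all class names here are invented for the example. A loader that must extend some unrelated base class can hold a LoadFunc-style object and delegate to it, with no LoadFuncInterface needed.

```java
// Illustrative only: delegation instead of implementing a hypothetical
// LoadFuncInterface.
class BasicLoader {
    String getNext() {
        return "record";
    }
}

abstract class SomeUnrelatedBase {
    int version() {
        return 1;
    }
}

class DelegatingLoader extends SomeUnrelatedBase {
    private final BasicLoader delegate = new BasicLoader(); // composition

    String getNext() {
        return delegate.getNext(); // delegate rather than inherit
    }
}
```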

 Proposed rework for LoadFunc, StoreFunc, and Slice/r interfaces
 ---

 Key: PIG-966
 URL: https://issues.apache.org/jira/browse/PIG-966
 Project: Pig
  Issue Type: Improvement
  Components: impl
Reporter: Alan Gates
Assignee: Alan Gates

 I propose that we rework the LoadFunc, StoreFunc, and Slice/r interfaces 
 significantly.  See http://wiki.apache.org/pig/LoadStoreRedesignProposal for 
 full details

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1233) NullPointerException in AVG

2010-02-19 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1233:


Resolution: Fixed
Status: Resolved  (was: Patch Available)

 NullPointerException in AVG 
 

 Key: PIG-1233
 URL: https://issues.apache.org/jira/browse/PIG-1233
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Ankur
Assignee: Ankur
 Fix For: 0.7.0

 Attachments: jira-1233.patch


 The overridden method getValue() in AVG throws a NullPointerException if 
 accumulate() is never called, leaving the variable 'intermediateCount' 
 initialized to null. Java then throws the exception when it tries to 'unbox' 
 the value for a numeric comparison.
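The bug and its guard can be illustrated with a minimal sketch. The field and method names follow the issue description rather than Pig's actual AVG source, so treat them as stand-ins.

```java
// Illustrative sketch: if accumulate() is never called, intermediateCount
// stays null, and unboxing it in a numeric comparison would throw
// NullPointerException; getValue() must check for null first.
class AvgSketch {
    private Long intermediateCount = null;   // set only by accumulate()
    private Double intermediateSum = null;

    Double getValue() {
        // the null check must come before the comparison unboxes the Long
        if (intermediateCount == null || intermediateCount == 0L) {
            return null;
        }
        return intermediateSum / intermediateCount;
    }
}
```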

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1233) NullPointerException in AVG

2010-02-19 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12836086#action_12836086
 ] 

Olga Natkovich commented on PIG-1233:
-

patch committed to the trunk. Thanks, Ankur!

 NullPointerException in AVG 
 

 Key: PIG-1233
 URL: https://issues.apache.org/jira/browse/PIG-1233
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Ankur
Assignee: Ankur
 Fix For: 0.7.0

 Attachments: jira-1233.patch


 The overridden method getValue() in AVG throws a NullPointerException if 
 accumulate() is never called, leaving the variable 'intermediateCount' 
 initialized to null. Java then throws the exception when it tries to 'unbox' 
 the value for a numeric comparison.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (PIG-961) Integration with Hadoop 21

2010-02-19 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich resolved PIG-961.


Resolution: Fixed

We have already integrated with the Hadoop 20 API.

 Integration with Hadoop 21
 --

 Key: PIG-961
 URL: https://issues.apache.org/jira/browse/PIG-961
 Project: Pig
  Issue Type: New Feature
Reporter: Olga Natkovich
Assignee: Ying He
 Attachments: hadoop21.jar, PIG-961.patch, PIG-961.patch2


 Hadoop 21 is not yet released, but we know that a switch to the new MR API is 
 coming in it. This JIRA is for early integration with that API

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1241) Accumulator is turned on when a map is used with a non-accumulative UDF

2010-02-19 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1241:


Affects Version/s: 0.6.0
Fix Version/s: 0.7.0
 Assignee: Ying He

 Accumulator is turned on when a map is used with a non-accumulative UDF
 ---

 Key: PIG-1241
 URL: https://issues.apache.org/jira/browse/PIG-1241
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Ying He
Assignee: Ying He
 Fix For: 0.7.0

 Attachments: accum.patch


 Exception is thrown for a script like the following:
 register /homes/yinghe/owl/string.jar;
 a = load 'a.txt' as (id, url);
 b = group  a by (id, url);
 c = foreach b generate  COUNT(a), (CHARARRAY) 
 string.URLPARSE(group.url)#'url';
 dump c;
 In this query, URLPARSE() is not accumulative, and it returns a map. 
 The accumulator optimizer fails to check the UDF in this case and tries to 
 run the job in accumulative mode. A ClassCastException is thrown when it 
 tries to cast the UDF to the Accumulator interface.
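The missing check can be sketched as below. The interface and class names are simplified stand-ins for Pig's actual Accumulator interface and UDF classes.

```java
// Illustrative sketch: the optimizer should test the UDF with instanceof
// before casting it to the accumulator interface, instead of assuming every
// UDF in the plan is accumulative.
interface AccumulatorSketch {
    void accumulate(Object bag);
}

class UrlParseUdf {
    // returns a map; deliberately does NOT implement AccumulatorSketch
}

class AccumulatorCheck {
    static boolean canRunAccumulative(Object udf) {
        return udf instanceof AccumulatorSketch; // guard instead of a blind cast
    }
}
```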

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1241) Accumulator is turned on when a map is used with a non-accumulative UDF

2010-02-19 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1241:


Resolution: Fixed
Status: Resolved  (was: Patch Available)

patch committed to the trunk. Thanks, Ying

 Accumulator is turned on when a map is used with a non-accumulative UDF
 ---

 Key: PIG-1241
 URL: https://issues.apache.org/jira/browse/PIG-1241
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Ying He
Assignee: Ying He
 Fix For: 0.7.0

 Attachments: accum.patch


 Exception is thrown for a script like the following:
 register /homes/yinghe/owl/string.jar;
 a = load 'a.txt' as (id, url);
 b = group  a by (id, url);
 c = foreach b generate  COUNT(a), (CHARARRAY) 
 string.URLPARSE(group.url)#'url';
 dump c;
 In this query, URLPARSE() is not accumulative, and it returns a map. 
 The accumulator optimizer fails to check the UDF in this case and tries to 
 run the job in accumulative mode. A ClassCastException is thrown when it 
 tries to cast the UDF to the Accumulator interface.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1247) Error Number makes it hard to debug: ERROR 2999: Unexpected internal error. org.apache.pig.backend.datastorage.DataStorageException cannot be cast to java.lang.Error

2010-02-19 Thread Viraj Bhat (JIRA)
Error Number makes it hard to debug: ERROR 2999: Unexpected internal error. 
org.apache.pig.backend.datastorage.DataStorageException cannot be cast to 
java.lang.Error
-

 Key: PIG-1247
 URL: https://issues.apache.org/jira/browse/PIG-1247
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat
 Fix For: 0.7.0


I have a large script in which there are intermediate store statements; one of 
them writes to a directory I do not have permission to write to. 

The stack trace I get from Pig is this:

2010-02-20 02:16:32,055 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
2999: Unexpected internal error. 
org.apache.pig.backend.datastorage.DataStorageException cannot be cast to 
java.lang.Error

Details at logfile: /home/viraj/pig_1266632145355.log

Pig Stack Trace
---

ERROR 2999: Unexpected internal error. 
org.apache.pig.backend.datastorage.DataStorageException cannot be cast to 
java.lang.Error
java.lang.ClassCastException: 
org.apache.pig.backend.datastorage.DataStorageException cannot be cast to 
java.lang.Error
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.StoreClause(QueryParser.java:3583)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:1407)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:949)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:762)
at 
org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63)
at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1036)
at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:986)
at org.apache.pig.PigServer.registerQuery(PigServer.java:386)
at 
org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:720)
at 
org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:324)
at 
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
at 
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
at org.apache.pig.Main.main(Main.java:386)


The only way to find the error was to look at the javacc-generated 
QueryParser.java code and add a System.out.println().


Here is a script to reproduce the problem:

{code}
A = load '/user/viraj/three.txt' using PigStorage();
B = foreach A generate ['a'#'12'] as b:map[] ;
store B into '/user/secure/pigtest' using PigStorage();
{code}

three.txt has 3 lines which contain nothing but the number 1.

{code}
$ hadoop fs -ls /user/secure/

ls: could not get get listing for 'hdfs://mynamenode/user/secure' : 
org.apache.hadoop.security.AccessControlException: Permission denied: 
user=viraj, access=READ_EXECUTE, inode=secure:secure:users:rwx--

{code}


Viraj




[jira] Commented: (PIG-1247) Error Number makes it hard to debug: ERROR 2999: Unexpected internal error. org.apache.pig.backend.datastorage.DataStorageException cannot be cast to java.lang.Error

2010-02-19 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836107#action_12836107
 ] 

Daniel Dai commented on PIG-1247:
-

This error-handling code is hard-coded by javacc. It seems we currently have no 
way to work around it.
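The failure mode can be illustrated with a hypothetical sketch (class names `CastPitfall` and the nested `DataStorageException` are illustrative, not the generated QueryParser source): the generated catch logic assumes any unexpected Throwable is a java.lang.Error, so the unconditional cast itself throws ClassCastException and masks the original permission problem.

```java
// Hypothetical illustration of the pattern Daniel describes: an
// unconditional (Error) cast in generated error handling fails for
// checked exceptions such as DataStorageException.
public class CastPitfall {
    static class DataStorageException extends Exception {
        DataStorageException(String m) { super(m); }
    }

    static String describe(Throwable t) {
        try {
            Error e = (Error) t;   // the hard-coded assumption
            return "error: " + e.getMessage();
        } catch (ClassCastException cce) {
            // What the user actually sees: the cast fails, and the
            // original cause ("Permission denied") is buried.
            return "ClassCastException masked: " + t.getMessage();
        }
    }

    public static void main(String[] args) {
        // prints "ClassCastException masked: Permission denied"
        System.out.println(describe(new DataStorageException("Permission denied")));
    }
}
```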

 Error Number makes it hard to debug: ERROR 2999: Unexpected internal error. 
 org.apache.pig.backend.datastorage.DataStorageException cannot be cast to 
 java.lang.Error
 -





[jira] Commented: (PIG-928) UDFs in scripting languages

2010-02-19 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836108#action_12836108
 ] 

Ashutosh Chauhan commented on PIG-928:
--

Hey Woody,

Great work!! This will definitely be useful for a lot of Pig users. I have only 
had time for a quick look at your work. One question that struck me: you are 
doing a lot of heavy lifting to provide multi-language support by figuring out 
which language the user is asking for and then using reflection to load the 
appropriate interpreter. It might be easier to use one of the existing 
frameworks (BSF or javax.script), which hide this and allow handling of 
multiple languages transparently (at least, that's what they claim to do). Have 
you taken a look at them? These frameworks would arguably let us support more 
languages without maintaining a lot of code ourselves, though they will likely 
come at a performance cost (certainly CPU, and possibly memory too). 
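To make the suggestion concrete, here is a minimal hypothetical sketch of the javax.script (JSR-223) approach (the class and method names are mine, not from Woody's patch): engines are discovered by name from the classpath, and eval() is the same call for every supported language, so Pig would not need per-language reflection code. Whether an engine is actually found depends on the JDK and classpath (e.g. Nashorn was removed in JDK 15), so the sketch handles the null case.

```java
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;

// Hypothetical sketch of the javax.script approach: engine lookup by
// language name replaces hand-rolled interpreter loading.
public class ScriptEngineSketch {
    static String evalInLanguage(String language, String script) throws Exception {
        ScriptEngine engine = new ScriptEngineManager().getEngineByName(language);
        if (engine == null) {
            // No engine registered for that language on this classpath.
            return "no engine for: " + language;
        }
        // Same eval() call regardless of language.
        return String.valueOf(engine.eval(script));
    }

    public static void main(String[] args) throws Exception {
        System.out.println(evalInLanguage("javascript", "1 + 2"));
    }
}
```

With Jython or JRuby on the classpath, the same `evalInLanguage("python", ...)` or `evalInLanguage("ruby", ...)` call would work unchanged, which is the transparency being claimed.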

 UDFs in scripting languages
 ---

 Key: PIG-928
 URL: https://issues.apache.org/jira/browse/PIG-928
 Project: Pig
  Issue Type: New Feature
Reporter: Alan Gates
 Attachments: package.zip, scripting.tgz, scripting.tgz


 It should be possible to write UDFs in scripting languages such as python, 
 ruby, etc.  This frees users from needing to compile Java, generate a jar, 
 etc.  It also opens Pig to programmers who prefer scripting languages over 
 Java.
