[jira] Commented: (PIG-812) COUNT(*) does not work

2009-07-10 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12729550#action_12729550
 ] 

Hadoop QA commented on PIG-812:
---

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12413078/PIG-812.patch
  against trunk revision 792663.

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no tests are needed for this patch.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/121/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/121/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/121/console

This message is automatically generated.

 COUNT(*) does not work 
 ---

 Key: PIG-812
 URL: https://issues.apache.org/jira/browse/PIG-812
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.2.0
Reporter: Viraj Bhat
Assignee: Benjamin Reed
 Fix For: 0.2.0

 Attachments: PIG-812.patch, PIG-812.pdf, studenttab10k


 Pig script to count the number of rows in a studenttab10k file which contains 
 10k records.
 {code}
 studenttab = LOAD 'studenttab10k' AS (name:chararray, age:int,gpa:float);
 X2 = GROUP studenttab ALL;
 describe X2;
 Y2 = FOREACH X2 GENERATE COUNT(*);
 explain Y2;
 DUMP Y2;
 {code}
 returns the following error
 
 ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator 
 for alias Y2
 Details at logfile: /homes/viraj/pig-svn/trunk/pig_1242783700970.log
 
 If you look at the log file:
 
 Caused by: java.lang.ClassCastException
 at org.apache.pig.builtin.COUNT$Initial.exec(COUNT.java:76)
 at org.apache.pig.builtin.COUNT$Initial.exec(COUNT.java:68)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:201)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:235)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:254)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:204)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:231)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:223)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:245)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:236)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:88)
 at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-794) Use Avro serialization in Pig

2009-07-10 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12729700#action_12729700
 ] 

Alan Gates commented on PIG-794:


I agree with Doug's comments that it's better to use an API to build the schema 
that will give us compile time checking.  I think it will also (hopefully) be 
easier to figure out the schema when reading the code, as it will avoid the 
need to read JSON directly.

I have a general question on the approach.  This is a direct port of Pig's 
BinStorage to use Avro, including the writing of indicator bytes for types.  I 
do not have a deep knowledge of Avro.  But I had assumed that since it was a 
de/serialization framework with types, part of what it would provide was type 
recognition.  That is, can't this code rely on Avro to set the type for it?  Do 
we need to be writing those indicator bytes ourselves?  Perhaps this is the 
same comment that Doug is making about using GenericDatumReader and addField.

In response to Hong's comment, the sync marks are vulnerable as you point out.  
But the loader needs some way to find a proper starting place when it's handed 
any block but the initial block of a file.  I wonder if we could create a new 
sync type.  It would always consist of a 100 byte marker (say the first 25 
prime numbers, or the first 25 digits of pi or something).  We could then write 
a tuple with that sync type every 1000 records in the data.  Loaders that don't 
start at position 0 could then seek to the first sync type they found before they 
began reading.  All loaders would read past the end of their position until 
they saw a sync type.
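
The marker scan described above can be sketched roughly as follows. This is only an illustration, not Pig or Avro code: the class name SyncMarkerScanner, the method findSync, and the encoding of the marker (the first 25 primes as 4-byte big-endian ints, giving 100 bytes) are all assumptions made for the example.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: none of these names exist in Pig or Avro.
final class SyncMarkerScanner {
    // 100-byte marker: the first 25 primes as 4-byte big-endian ints (assumed encoding).
    static final byte[] SYNC = buildMarker();

    static byte[] buildMarker() {
        ByteBuffer buf = ByteBuffer.allocate(100);
        int found = 0;
        for (int n = 2; found < 25; n++) {
            if (isPrime(n)) {
                buf.putInt(n);
                found++;
            }
        }
        return buf.array();
    }

    static boolean isPrime(int n) {
        if (n < 2) return false;
        for (int d = 2; (long) d * d <= n; d++) {
            if (n % d == 0) return false;
        }
        return true;
    }

    // Returns the offset just past the first marker at or after 'from', or -1 if none is found.
    static int findSync(byte[] data, int from) {
        for (int i = Math.max(from, 0); i + SYNC.length <= data.length; i++) {
            int j = 0;
            while (j < SYNC.length && data[i + j] == SYNC[j]) j++;
            if (j == SYNC.length) return i + SYNC.length;
        }
        return -1;
    }
}
```

A loader handed a block that does not start at offset 0 would call something like findSync on its buffer and begin reading tuples at the returned offset; a loader reaching the end of its block would keep reading until the next marker.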

As for this being compatible with non-Pig apps, that isn't the purpose of 
this AvroStorage function.  This is for pig to pass data between MR jobs for 
itself.  Having a tool independent storage format is a bigger project, as it 
requires agreeing on things like sync marks, how to represent different Avro 
objects, etc.

 Use Avro serialization in Pig
 -

 Key: PIG-794
 URL: https://issues.apache.org/jira/browse/PIG-794
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.2.0
Reporter: Rakesh Setty
 Fix For: 0.2.0

 Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, 
 jackson-asl-0.9.4.jar, PIG-794.patch


 We would like to use Avro serialization in Pig to pass data between MR jobs 
 instead of the current BinStorage. Attached is an implementation of 
 AvroBinStorage which performs significantly better compared to BinStorage on 
 our benchmarks.




[jira] Created: (PIG-879) Pig should provide a way for input location string in load statement to be passed as-is to the Loader

2009-07-10 Thread Pradeep Kamath (JIRA)
Pig should provide a way for input location string in load statement to be 
passed as-is to the Loader
-

 Key: PIG-879
 URL: https://issues.apache.org/jira/browse/PIG-879
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.3.0
Reporter: Pradeep Kamath


 Due to multiquery optimization, Pig always converts the filenames to absolute 
URIs (see http://wiki.apache.org/pig/PigMultiQueryPerformanceSpecification - 
section about Incompatible Changes - Path Names and Schemes). This is necessary 
since the script may have cd .. statements between load or store statements 
and if the load statements have relative paths, we would need to convert to 
absolute paths to know where to load/store from. To do this 
QueryParser.massageFilename() has the code below[1] which basically gives the 
fully qualified hdfs path
 
However the issue with this approach is that if the filename string is 
something like 
hdfs://localhost.localdomain:39125/user/bla/1,hdfs://localhost.localdomain:39125/user/bla/2,
 the code below[1] actually translates this to 
hdfs://localhost.localdomain:38264/user/bla/1,hdfs://localhost.localdomain:38264/user/bla/2
 and throws an exception that it is an incorrect path.
 
Some loaders may want to interpret the filenames (the input location string in 
the load statement) in any way they wish and may want Pig to not make absolute 
paths out of them.
 
There are a few options to address this:
1) A command line switch to indicate to Pig that pathnames in the script are 
all absolute and hence Pig should not alter them and should pass them as-is to 
Loaders and Storers. 
2) A keyword in the load and store statements to indicate the same intent to 
Pig.
3) A property which users can supply on the command line or in pig.properties 
to indicate the same intent.
4) A method in LoadFunc - relativeToAbsolutePath(String filename, String 
curDir) which does the conversion to absolute - this way a Loader can choose to 
implement it as a no-op.

Thoughts?
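
To make option 4 concrete, here is a rough sketch of what such a hook might look like. It is not Pig's actual LoadFunc API: the interface LocationResolver and the classes DefaultLoader and MultiPathLoader are hypothetical names, and the default resolution logic is a simplification assumed for illustration.

```java
// Hypothetical sketch of option 4; this is not Pig's actual LoadFunc API.
interface LocationResolver {
    // Assumed default: treat the location as one path and resolve it against curDir.
    default String relativeToAbsolutePath(String filename, String curDir) {
        if (filename.contains("://") || filename.startsWith("/")) {
            return filename; // already fully qualified or absolute
        }
        String base = curDir.endsWith("/") ? curDir : curDir + "/";
        return base + filename;
    }
}

// A loader that keeps the default conversion.
final class DefaultLoader implements LocationResolver { }

// A loader whose location string is, say, a comma-separated list of URIs opts out.
final class MultiPathLoader implements LocationResolver {
    @Override
    public String relativeToAbsolutePath(String filename, String curDir) {
        return filename; // no-op: the loader interprets the string itself
    }
}
```

With this shape, the decision to convert or not lives with each loader, so a script can mix loaders that want absolute paths with loaders that want the raw string.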
 





[jira] Created: (PIG-880) Order by is broken with complex fields

2009-07-10 Thread Olga Natkovich (JIRA)
Order by is broken with complex fields
--

 Key: PIG-880
 URL: https://issues.apache.org/jira/browse/PIG-880
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.3.0
Reporter: Olga Natkovich
 Fix For: 0.4.0


Pig script:

a = load 'studentcomplextab10k' as (smap:map[],c2,c3);
f = foreach a generate smap#'name', smap#'age', smap#'gpa' ;
s = order f by $0;   
store s into 'sc.out';

Stack:

Caused by: java.lang.ArrayStoreException
at java.lang.System.arraycopy(Native Method)
at java.util.Arrays.copyOf(Arrays.java:2763)
at java.util.ArrayList.toArray(ArrayList.java:305)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.convertToArray(WeightedRangePartitioner.java:154)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.configure(WeightedRangePartitioner.java:96)
... 5 more

at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getErrorMessages(Launcher.java:230)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getStats(Launcher.java:179)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:204)
at 
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:265)
at 
org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:769)
at org.apache.pig.PigServer.execute(PigServer.java:762)
at org.apache.pig.PigServer.access$100(PigServer.java:91)
at org.apache.pig.PigServer$Graph.execute(PigServer.java:933)
at org.apache.pig.PigServer.executeBatch(PigServer.java:245)
at 
org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:112)
at 
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
at 
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:140)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:88)
at org.apache.pig.Main.main(Main.java:389)
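
The report does not state the root cause, but the top frames of the trace (ArrayList.toArray, Arrays.copyOf, System.arraycopy) show a standard JDK behavior: copying a list into a typed array fails with ArrayStoreException when an element is not an instance of the array's component type. A minimal sketch of that behavior, with the hypothetical class name ToArrayDemo and String chosen arbitrarily as the target type:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; it only reproduces the JDK behavior seen in the top frames.
final class ToArrayDemo {
    // Returns true if copying the list into a String[] fails with ArrayStoreException.
    static boolean failsToCopy(List<Object> values) {
        try {
            values.toArray(new String[0]);
            return false;
        } catch (ArrayStoreException e) {
            return true;
        }
    }
}
```

Whether the values read from the untyped map in the script above are what triggers this inside WeightedRangePartitioner is an assumption that would need to be checked against the Pig source.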






[jira] Commented: (PIG-879) Pig should provide a way for input location string in load statement to be passed as-is to the Loader

2009-07-10 Thread Hong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12729744#action_12729744
 ] 

Hong Tang commented on PIG-879:
---

From the user's point of view, 1) and 3) are roughly equivalent, and are 
preferred for custom loaders that do not want Pig to do any escaping at all. 





[jira] Commented: (PIG-879) Pig should provide a way for input location string in load statement to be passed as-is to the Loader

2009-07-10 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12729751#action_12729751
 ] 

Dmitriy V. Ryaboy commented on PIG-879:
---

Having this be a global flag through properties wouldn't work for scripts that 
require both behaviors in different load statements.

Maybe a boolean performPathConversion flag which is true by default, and can 
be overridden via the load statement?
Custom Loaders could change what their default is.
I think a boolean flag is more straightforward than a method you have to 
override with a no-op.




[jira] Commented: (PIG-879) Pig should provide a way for input location string in load statement to be passed as-is to the Loader

2009-07-10 Thread Thejas M Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12729755#action_12729755
 ] 

Thejas M Nair commented on PIG-879:
---

The problem with 1) and 3) is that the setting is universal to the grunt shell 
or script.
In cases where the user wants to read from multiple sources with 
different loaders, it will be inconvenient to be forced to use absolute URIs 
for all of them.





[jira] Created: (PIG-881) Pig should ship load udfs to the backend

2009-07-10 Thread Daniel Dai (JIRA)
Pig should ship load udfs to the backend


 Key: PIG-881
 URL: https://issues.apache.org/jira/browse/PIG-881
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.3.0
Reporter: Daniel Dai
 Fix For: 0.4.0


Currently, when we use load udfs, we have to use a register statement. Ideally, 
if the user puts udf jars in the classpath, we could omit the register 
statement and Pig would pick up the udf from the classpath automatically.

However, Pig does not currently ship load udfs, so the classpath approach does 
not work. register works because Pig ships that entire jar. Pig does ship eval 
udfs and storage udfs; we should ship load udfs as well.




[jira] Assigned: (PIG-881) Pig should ship load udfs to the backend

2009-07-10 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai reassigned PIG-881:
--

Assignee: Daniel Dai




[jira] Commented: (PIG-881) Pig should ship load udfs to the backend

2009-07-10 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12729770#action_12729770
 ] 

Daniel Dai commented on PIG-881:


Found a problem; I will upload the patch again shortly.




[jira] Updated: (PIG-881) Pig should ship load udfs to the backend

2009-07-10 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-881:
---

Attachment: (was: PIG-881-1.patch)




[jira] Commented: (PIG-879) Pig should provide a way for input location string in load statement to be passed as-is to the Loader

2009-07-10 Thread Hong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12729771#action_12729771
 ] 

Hong Tang commented on PIG-879:
---

Both are valid arguments. The problem with 2) and 4) is that they require 
changes to the load statement syntax or the load-func api and would take longer 
to deliver. 

I guess we could structure the fix in two phases. Phase One: support 1) and 
3), so that we have the minimum needed to move along without having to disable 
multi-query optimization completely. Users should be able to modify their 
scripts to change all relative paths to absolute ones (such usage should be 
rare enough that most people are not impacted). Phase Two: support either 2) 
or 4) (I do not think we need both). Personally I think 4) would be 
better because the loader should be the one that interprets the location string 
syntax.




[jira] Updated: (PIG-881) Pig should ship load udfs to the backend

2009-07-10 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-881:
---

Attachment: PIG-881-1.patch




[jira] Commented: (PIG-881) Pig should ship load udfs to the backend

2009-07-10 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12729785#action_12729785
 ] 

Olga Natkovich commented on PIG-881:


+1; the patch looks good!




[jira] Created: (PIG-882) log level not propagated to loggers

2009-07-10 Thread Thejas M Nair (JIRA)
log level not propagated to loggers 


 Key: PIG-882
 URL: https://issues.apache.org/jira/browse/PIG-882
 Project: Pig
  Issue Type: Bug
  Components: impl
Reporter: Thejas M Nair


Pig accepts a log level as a parameter, but the level it captures is not 
propagated, so loggers in different classes do not log at the specified level.
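
As a rough illustration of what propagation means here, the sketch below applies a captured level at the root of a logger hierarchy so that per-class loggers without their own level inherit it. It uses java.util.logging purely as a stand-in so the example is self-contained; it is not how Pig configures its logging, and the class name LogLevelPropagation is hypothetical.

```java
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical sketch using java.util.logging as a stand-in for Pig's logging setup.
final class LogLevelPropagation {
    // Applies the requested level to the root logger and its handlers;
    // loggers that have no explicit level of their own inherit it.
    static void applyLevel(Level level) {
        Logger root = Logger.getLogger("");
        root.setLevel(level);
        for (Handler h : root.getHandlers()) {
            h.setLevel(level);
        }
    }
}
```

After a call such as applyLevel(Level.FINE), a logger obtained elsewhere for a specific class reports FINE as loggable without any per-class configuration.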




[jira] Updated: (PIG-881) Pig should ship load udfs to the backend

2009-07-10 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-881:
---

Status: Patch Available  (was: Open)

 Pig should ship load udfs to the backend
 

 Key: PIG-881
 URL: https://issues.apache.org/jira/browse/PIG-881
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.3.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.4.0

 Attachments: PIG-881-1.patch, PIG-881-2.patch





[jira] Commented: (PIG-879) Pig should provide a way for input location string in load statement to be passed as-is to the Loader

2009-07-10 Thread Milind Bhandarkar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12729859#action_12729859
 ] 

Milind Bhandarkar commented on PIG-879:
---

I see some long term issues with all the approaches/options.

First, not all loaders require a path. (e.g. DBLoader) Some paths (e.g. hftp:// 
or hsftp://) do not have a notion of relative or absolute. Indeed, the right 
way to fix this is to change the syntax of load and store statements, so that 
the loader itself deals with the path handling, and not pig. Second, take out 
copyToLocal, cp, mv, and all the dfs shell functionality from pig. These are 
side effects and impose a barrier for optimization. In the current form, they 
do not belong in a dataflow language. Grunt could still support it.

 Pig should provide a way for input location string in load statement to be 
 passed as-is to the Loader
 -

 Key: PIG-879
 URL: https://issues.apache.org/jira/browse/PIG-879
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.3.0
Reporter: Pradeep Kamath

  Due to multiquery optimization, Pig always converts the filenames to 
 absolute URIs (see 
 http://wiki.apache.org/pig/PigMultiQueryPerformanceSpecification - section 
 about Incompatible Changes - Path Names and Schemes). This is necessary since 
 the script may have cd .. statements between load or store statements and 
 if the load statements have relative paths, we would need to convert to 
 absolute paths to know where to load/store from. To do this 
 QueryParser.massageFilename() has the code below[1] which basically gives the 
 fully qualified hdfs path
  
 However the issue with this approach is that if the filename string is 
 something like 
 hdfs://localhost.localdomain:39125/user/bla/1,hdfs://localhost.localdomain:39125/user/bla/2,
  the code below[1] actually translates this to 
 hdfs://localhost.localdomain:38264/user/bla/1,hdfs://localhost.localdomain:38264/user/bla/2
  and throws an exception that it is an incorrect path.
  
 Some loaders may want to interpret the filenames (the input location string 
 in the load statement) in any way they wish and may want Pig to not make 
 absolute paths out of them.
  
 There are a few options to address this:
 1) A command line switch to indicate to Pig that pathnames in the script 
 are all absolute, so Pig should not alter them and should pass them as-is to 
 Loaders and Storers.
 2) A keyword in the load and store statements to indicate the same intent 
 to Pig.
 3) A property which users can supply on the command line or in 
 pig.properties to indicate the same intent.
 4) A method in LoadFunc - relativeToAbsolutePath(String filename, String 
 curDir) - which does the conversion to absolute; this way a Loader can choose 
 to implement it as a no-op.
 Thoughts?
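
As an illustration of option 4 only, here is a minimal Python sketch of a per-loader path hook. The class names and the snake_case method are hypothetical stand-ins for the proposed LoadFunc method, not Pig's actual Java API, and the default resolution logic is an assumption made for the example.

```python
import posixpath


class LoadFunc:
    """Hypothetical stand-in for Pig's LoadFunc; not the real Java interface."""

    def relative_to_absolute_path(self, filename: str, cur_dir: str) -> str:
        # Assumed default: leave URIs and absolute paths alone, and resolve
        # anything else against the current working directory.
        if "://" in filename or filename.startswith("/"):
            return filename
        return posixpath.normpath(posixpath.join(cur_dir, filename))


class AsIsLoader(LoadFunc):
    """Hypothetical loader that wants the location string untouched."""

    def relative_to_absolute_path(self, filename: str, cur_dir: str) -> str:
        # No-op: the loader interprets the location string itself.
        return filename
```

With this shape, the decision about whether a location string is a path at all moves into the loader, which is the point of the option.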
  




[jira] Commented: (PIG-880) Order by is borken with complex fields

2009-07-10 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12729882#action_12729882
 ] 

Pradeep Kamath commented on PIG-880:
---

The root cause of this issue is that, when interpreting map data, PigStorage 
returns each value in the map with the type it deduces from the data: string 
data is returned as String and integer data as Integer. However, the logical 
layer in Pig assumes the values in the map are of type ByteArray, since it 
cannot assume anything more specific. If one of the sampled values forming the 
quantile list is null, it is assumed to have the type of the reduce key of the 
final order by job. In this case, since the order by key is smap#'name', it is 
thought to be of type ByteArray. However, the values resulting from the map 
lookup are actually of type String. This mismatch results in the above 
exception. If nulls are filtered out, map.collect() fails instead, because 
Hadoop thinks the map key type is bytearray but it gets a Text (string).

A proposal to fix this is to change TextDataParser, which is used by PigStorage 
for reading map data, to return ByteArray type for the values in the map.

Thoughts?
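
To make the proposal concrete, here is a small Python sketch contrasting the two behaviours. It is only a model: the function names are hypothetical, the parsing is far simpler than TextDataParser, and Python bytes stand in for Pig's ByteArray.

```python
def parse_map_typed(text: str) -> dict:
    """Models the current behaviour: guess a type for each map value."""
    result = {}
    for pair in text.strip("[]").split(","):
        key, raw = pair.split("#", 1)
        try:
            result[key] = int(raw)   # looks like an integer, so return one
        except ValueError:
            result[key] = raw        # otherwise keep it as a string
    return result


def parse_map_bytes(text: str) -> dict:
    """Models the proposal: every map value stays an uninterpreted byte array."""
    result = {}
    for pair in text.strip("[]").split(","):
        key, raw = pair.split("#", 1)
        result[key] = raw.encode("utf-8")
    return result
```

Under the second variant every value has the same type regardless of its content, which matches what the logical layer already assumes.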



 Order by is borken with complex fields
 --

 Key: PIG-880
 URL: https://issues.apache.org/jira/browse/PIG-880
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.3.0
Reporter: Olga Natkovich
 Fix For: 0.4.0


 Pig script:
 a = load 'studentcomplextab10k' as (smap:map[],c2,c3);
 f = foreach a generate smap#'name', smap#'age', smap#'gpa';
 s = order f by $0;
 store s into 'sc.out';
 Stack:
 Caused by: java.lang.ArrayStoreException
 at java.lang.System.arraycopy(Native Method)
 at java.util.Arrays.copyOf(Arrays.java:2763)
 at java.util.ArrayList.toArray(ArrayList.java:305)
 at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.convertToArray(WeightedRangePartitioner.java:154)
 at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.configure(WeightedRangePartitioner.java:96)
 ... 5 more
 at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getErrorMessages(Launcher.java:230)
 at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getStats(Launcher.java:179)
 at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:204)
 at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:265)
 at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:769)
 at org.apache.pig.PigServer.execute(PigServer.java:762)
 at org.apache.pig.PigServer.access$100(PigServer.java:91)
 at org.apache.pig.PigServer$Graph.execute(PigServer.java:933)
 at org.apache.pig.PigServer.executeBatch(PigServer.java:245)
 at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:112)
 at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
 at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:140)
 at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:88)
 at org.apache.pig.Main.main(Main.java:389)




[jira] Updated: (PIG-881) Pig should ship load udfs to the backend

2009-07-10 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-881:
---

Attachment: PIG-881-3.patch

All unit tests now pass.

 Pig should ship load udfs to the backend
 

 Key: PIG-881
 URL: https://issues.apache.org/jira/browse/PIG-881
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.3.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.4.0

 Attachments: PIG-881-1.patch, PIG-881-2.patch, PIG-881-3.patch





[jira] Updated: (PIG-724) Treating map values in PigStorage

2009-07-10 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-724:
---

Summary: Treating map values in PigStorage  (was: Treating integers and 
strings in PigStorage)

 Treating map values in PigStorage
 -

 Key: PIG-724
 URL: https://issues.apache.org/jira/browse/PIG-724
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.2.1
Reporter: Santhosh Srinivasan
 Fix For: 0.2.1


 Currently, PigStorage treats the materialized string 123 as an integer 
 with the value 123. If the user intended this to be the string 123, 
 PigStorage cannot deal with it. This reasoning also applies to doubles. 
 Because of this, when a map contains values that are meant to be of the same 
 type but manifest the issue described above, Pig fails at runtime. An example 
 illustrates the problem.
 In the example below, a sample row in the data (map.txt) contains the 
 following:
 [key01#35,key02#value01]
 When Pig tries to convert the stream to a map, it creates a Map<Object, 
 Object> where the key is a string and the value is an integer. Running the 
 script shown below results in a run-time error.
 {code}
 grunt> a = load 'map.txt' as (themap: map[]);
 grunt> b = filter a by (chararray)(themap#'key01') == 'hello';
 grunt> dump b;
 2009-03-18 15:19:03,773 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
 2009-03-18 15:19:28,797 [main] ERROR org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Map reduce job failed
 2009-03-18 15:19:28,817 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1081: Cannot cast to chararray. Expected bytearray but received: int
 {code} 
 There are two ways to resolve this issue:
 1. Change the conversion routine for bytesToMap to return a map where the 
 value is a bytearray and not the actual type. This change breaks backward 
 compatibility.
 2. Introduce checks in POCast so that conversions that are legal in the type 
 checking world are allowed, i.e., run-time checks will be made for 
 compatible casts. In the above example, an int can be converted to a 
 chararray, so the cast will be made. If, on the other hand, it were a 
 chararray to int conversion, an exception would be thrown.
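
The run-time checks in option 2 could look roughly like the Python sketch below. This is a model of the idea rather than POCast itself: the function names are hypothetical, Python str and bytes stand in for chararray and bytearray, and the set of casts treated as compatible is an assumption based on the example above.

```python
def cast_to_chararray(value):
    """Run-time-checked cast to chararray (modelled here as str)."""
    if value is None:
        return None
    if isinstance(value, (bytes, bytearray)):
        # The case the cast already expects: an uninterpreted byte array.
        return bytes(value).decode("utf-8")
    if isinstance(value, (str, int, float)):
        # Compatible cast: a value the loader already typed can still be
        # rendered as text, so the cast is allowed at run time.
        return str(value)
    raise TypeError(f"Cannot cast {type(value).__name__} to chararray")


def cast_to_int(value):
    """Run-time-checked cast to int; a chararray source is rejected."""
    if value is None:
        return None
    if isinstance(value, (bytes, bytearray)):
        return int(bytes(value).decode("utf-8"))
    if isinstance(value, int):
        return value
    raise TypeError(f"Cannot cast {type(value).__name__} to int")
```

For the sample row, cast_to_chararray(35) succeeds where the current cast raises ERROR 1081, while cast_to_int("35") still raises, matching the chararray to int case described in option 2.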




[jira] Updated: (PIG-881) Pig should ship load udfs to the backend

2009-07-10 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-881:
---

Status: Patch Available  (was: In Progress)

 Pig should ship load udfs to the backend
 

 Key: PIG-881
 URL: https://issues.apache.org/jira/browse/PIG-881
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.3.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.4.0

 Attachments: PIG-881-1.patch, PIG-881-2.patch, PIG-881-3.patch





[jira] Updated: (PIG-881) Pig should ship load udfs to the backend

2009-07-10 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-881:
---

Status: In Progress  (was: Patch Available)

 Pig should ship load udfs to the backend
 

 Key: PIG-881
 URL: https://issues.apache.org/jira/browse/PIG-881
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.3.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.4.0

 Attachments: PIG-881-1.patch, PIG-881-2.patch, PIG-881-3.patch





[jira] Commented: (PIG-881) Pig should ship load udfs to the backend

2009-07-10 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12729887#action_12729887
 ] 

Olga Natkovich commented on PIG-881:
---

+1 on the patch. The patch process seems to be stuck again. We ran the tests 
manually and they passed, so please commit the patch.

 Pig should ship load udfs to the backend
 

 Key: PIG-881
 URL: https://issues.apache.org/jira/browse/PIG-881
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.3.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.4.0

 Attachments: PIG-881-1.patch, PIG-881-2.patch, PIG-881-3.patch





[jira] Commented: (PIG-881) Pig should ship load udfs to the backend

2009-07-10 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12729892#action_12729892
 ] 

Hadoop QA commented on PIG-881:
---

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12413156/PIG-881-2.patch
  against trunk revision 792663.

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no tests are needed for this patch.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

-1 findbugs.  The patch appears to introduce 1 new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/122/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/122/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/122/console

This message is automatically generated.

 Pig should ship load udfs to the backend
 

 Key: PIG-881
 URL: https://issues.apache.org/jira/browse/PIG-881
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.3.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.4.0

 Attachments: PIG-881-1.patch, PIG-881-2.patch, PIG-881-3.patch

