Re: Revisit Pig Philosophy?

2009-09-18 Thread Milind A Bhandarkar
It's Friday evening, so I have some time to discuss philosophy ;-)

Before we discuss any question about revisiting pig philosophy, the  
first question that needs to be answered is "what is pig" ? (this  
corresponds to the Hindu philosophy's basic argument, that any deep  
personal philosophical investigations need to start with a question  
"koham?" (in Sanskrit, it means 'who am I?'))

So, coming back to approx 4000 years after the origin of that  
philosophy, we need to ask "what is pig?" (incidentally, pig, or  
varaaha in Sanskrit, was the second incarnation of lord Vishnu in  
hindu scriptures, but that's not relevant here.)

What we need to decide is: is pig a dataflow language? I think
not. "Pig Latin" is the language. Pig is referred to in countless
slide decks (aka pig scriptures; btw, I own 50% of these scriptures)
as a runtime system that interprets Pig Latin, kind of like Java and
the JVM. (Duality of nature, called "dwaita" philosophy in Sanskrit, is
applicable here. But I won't go deeper than that.)

So, pig-Latin-the-language's stance could still be that it can be
implemented on any runtime. But pig-the-runtime's philosophy could be
that it is a thin layer on top of hadoop. And all the world could
breathe a sigh of relief (mostly, by not having to answer these
philosophical questions).

So, 'koham' is the 4000 year old question this project needs to  
answer. That's all.

AUM.. (it's Friday.)

- (swami) Milind ;-)

On Sep 18, 2009, at 19:05, "Jeff Hammerbacher" wrote:

> Hey,
>
>> 2. Local mode and other parallel frameworks
>>
>> 
>> Pigs Live Anywhere
>>
>> Pig is intended to be a language for parallel data processing. It is not
>> tied to one particular parallel framework. It has been implemented first
>> on hadoop, but we do not intend that to be only on hadoop.
>> 
>>
>> Are we still holding onto this? What about local mode? Local mode is not
>> being treated on equal footing with that of Hadoop for practical
>> reasons. However, users expect things that work on local mode to work
>> without any hitches on Hadoop.
>>
>> Are we still designing the system assuming that Pig will be stacked on
>> top of other parallel frameworks?
>>
>
> FWIW, I appreciate this philosophical stance from Pig. Allowing locally
> tested scripts to be migrated to the cluster without breakage is a noble
> goal, and keeping the option of (one day) developing an alternative
> execution environment for Pig that runs over HDFS but uses a richer physical
> set of operators than MapReduce would be great.
>
> Of course, those of you who are running Pig in production will have a much
> better sense of the feasibility, rather than desirability, of this
> philosophical stance.
>
> Later,
> Jeff


Re: Revisit Pig Philosophy?

2009-09-18 Thread Jeff Hammerbacher
Hey,

> 2. Local mode and other parallel frameworks
>
> 
> Pigs Live Anywhere
>
> Pig is intended to be a language for parallel data processing. It is not
> tied to one particular parallel framework. It has been implemented first
> on hadoop, but we do not intend that to be only on hadoop.
> 
>
> Are we still holding onto this? What about local mode? Local mode is not
> being treated on equal footing with that of Hadoop for practical
> reasons. However, users expect things that work on local mode to work
> without any hitches on Hadoop.
>
> Are we still designing the system assuming that Pig will be stacked on
> top of other parallel frameworks?
>

FWIW, I appreciate this philosophical stance from Pig. Allowing locally
tested scripts to be migrated to the cluster without breakage is a noble
goal, and keeping the option of (one day) developing an alternative
execution environment for Pig that runs over HDFS but uses a richer physical
set of operators than MapReduce would be great.

Of course, those of you who are running Pig in production will have a much
better sense of the feasibility, rather than desirability, of this
philosophical stance.

Later,
Jeff


[jira] Commented: (PIG-968) findContainingJar fails when there's a + in the path

2009-09-18 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12757596#action_12757596
 ] 

Hadoop QA commented on PIG-968:
---

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12420098/pig-968.txt
  against trunk revision 816723.

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no tests are needed for this patch.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/39/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/39/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/39/console

This message is automatically generated.

> findContainingJar fails when there's a + in the path
> 
>
> Key: PIG-968
> URL: https://issues.apache.org/jira/browse/PIG-968
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.4.0, 0.5.0
>Reporter: Todd Lipcon
> Attachments: pig-968.txt
>
>
> This is the same bug as in MAPREDUCE-714. Please see discussion there.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-752) local mode doesn't read bzip2 and gzip compressed data files

2009-09-18 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12757595#action_12757595
 ] 

Jeff Zhang commented on PIG-752:


Alan,

What does this message mean ?



> local mode doesn't read bzip2 and gzip compressed data files
> 
>
> Key: PIG-752
> URL: https://issues.apache.org/jira/browse/PIG-752
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.4.0
>Reporter: David Ciemiewicz
>Assignee: Jeff Zhang
> Fix For: 0.4.0
>
> Attachments: Pig_752.Patch
>
>
> Problem 1)  use of .bz2 file extension does not store results bzip2 
> compressed in Local mode (-exectype local)
> If I use the .bz2 filename extension in a STORE statement on HDFS, the 
> results are stored with bzip2 compression.
> If I use the .bz2 filename extension in a STORE statement on local file 
> system, the results are NOT stored with bzip2 compression.
> compact.bz2.pig:
> {code}
> A = load 'events.test' using PigStorage();
> store A into 'events.test.bz2' using PigStorage();
> C = load 'events.test.bz2' using PigStorage();
> C = limit C 10;
> dump C;
> {code}
> {code}
> -bash-3.00$ pig -exectype local compact.bz2.pig
> -bash-3.00$ file events.test
> events.test: ASCII English text, with very long lines
> -bash-3.00$ file events.test.bz2
> events.test.bz2: ASCII English text, with very long lines
> -bash-3.00$ cat events.test | bzip2 > events.test.bz2
> -bash-3.00$ file events.test.bz2
> events.test.bz2: bzip2 compressed data, block size = 900k
> {code}
> The output format in local mode is definitely not bzip2, but it should be.
> Problem 2) pig in local mode does not decompress bzip2 compressed files, but 
> should, to be consistent with HDFS
> read.bz2.pig:
> {code}
> A = load 'events.test.bz2' using PigStorage();
> A = limit A 10;
> dump A;
> {code}
> The output should be human readable but is instead garbage, indicating no 
> decompression took place during the load:
> {code}
> -bash-3.00$ pig -exectype local read.bz2.pig
> USING: /grid/0/gs/pig/current
> 2009-04-03 18:26:30,455 [main] INFO  
> org.apache.pig.backend.local.executionengine.LocalPigLauncher - 100% complete!
> 2009-04-03 18:26:30,456 [main] INFO  
> org.apache.pig.backend.local.executionengine.LocalPigLauncher - Success!!
> (BZh91AY&syoz?u?...@{x_?d?|u-??mK???;??4?C??)
> ((R? 6?*m?&???g, 
> ?6?Zj?k,???0?QT?d???hY?#mJ?>[j???z?m?t?u?K)??K5+??)?m?E7j?X?8a??
> ??U?p@@MT?$?B?P??N??=???(z<}gk...@c$\??i]?g:?J)
> a(R?,?u?v???...@?i@??J??!D?)???A?PP?IY??m?
> (mP(i?4,#F[?I)@>?...@??|7^?}U??wwg,?u?$?T???((Q!D?=`*?}hP??_|??=?(??2???m=?xG?(?rC?B?(33??:4?N???t|??T?*??k??NT?x???=?fyv?w>f??4z???4t?)
> (?oou?t???Kwl?3?nCM?WS?;l???P?s?x
> a???e)B??9?  ?44
> ((?...@4?)
> (f)
> (?...@+?d?0@>?U)
> (Q?SR)
> -bash-3.00$ 
> {code}




[jira] Commented: (PIG-948) [Usability] Relating pig script with MR jobs

2009-09-18 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12757587#action_12757587
 ] 

Pradeep Kamath commented on PIG-948:


In my opinion, pig should only print the job id. Users of pig would most likely 
already be using the hadoop job UI and should be able to track down the job 
given the job id. Giving the job id, I think, addresses the original issue of 
relating a pig script to the corresponding MR jobs. While constructing the url is 
not complicated, embedding it in pig code seems ugly, since we will most likely 
not track changes to this url until a user notices it is broken. Giving 
just the job id is useful in itself, I think.

> [Usability] Relating pig script with MR jobs
> 
>
> Key: PIG-948
> URL: https://issues.apache.org/jira/browse/PIG-948
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
>Priority: Minor
> Attachments: pig-948.patch
>
>
> Currently it's hard to find a way to relate a pig script with a specific MR job. 
> In a loaded cluster with multiple simultaneous job submissions, it's not easy 
> to figure out which specific MR jobs were launched for a given pig script. If 
> Pig can provide this info, it will be useful to debug and monitor the jobs 
> resulting from a pig script.
> At the very least, Pig should be able to provide the user the following 
> information:
> 1) Job id of the launched job.
> 2) Complete web url of jobtracker running this job. 




[jira] Updated: (PIG-942) Maps are not implicitly casted

2009-09-18 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-942:
---

  Resolution: Fixed
Hadoop Flags: [Reviewed]
  Status: Resolved  (was: Patch Available)

Patch committed to trunk.

> Maps are not implicitly casted
> --
>
> Key: PIG-942
> URL: https://issues.apache.org/jira/browse/PIG-942
> Project: Pig
>  Issue Type: Bug
>Reporter: Sriranjan Manjunath
>Assignee: Pradeep Kamath
> Fix For: 0.6.0
>
> Attachments: PIG-942.patch
>
>
> A = load 'foo' as (m) throws the following exception when foo has maps.
> java.lang.ClassCastException: org.apache.pig.data.DataByteArray cannot be 
> cast to java.util.Map
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POMapLookUp.getNext(POMapLookUp.java:98)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POMapLookUp.getNext(POMapLookUp.java:115)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:612)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:278)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:204)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:231)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:240)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:249)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:240)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:93)
> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
> at 
> org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2198)
> The same works if I explicitly cast m to a map: A = load 'foo' as (m:[])




[jira] Commented: (PIG-942) Maps are not implicitly casted

2009-09-18 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12757568#action_12757568
 ] 

Daniel Dai commented on PIG-942:


+1




[jira] Updated: (PIG-513) PERFORMANCE: optimize some of the code in DefaultTuple

2009-09-18 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-513:
---

   Resolution: Fixed
Fix Version/s: 0.6.0
   Status: Resolved  (was: Patch Available)

Patch checked in.  Thanks Ashutosh.

> PERFORMANCE: optimize some of the code in DefaultTuple
> --
>
> Key: PIG-513
> URL: https://issues.apache.org/jira/browse/PIG-513
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.2.0
>Reporter: Pradeep Kamath
>Assignee: Pradeep Kamath
> Fix For: 0.6.0
>
> Attachments: PIG-513.patch, pig-513_2.patch
>
>
> The following areas in DefaultTuple.java can be changed:
> The member methods get(), set(), getType() and isNull() all call 
> checkBounds(), which is a redundant call since all 4 of these functions throw 
> ExecException. Instead of doing a bounds check, we can catch the 
> IndexOutOfBoundsException in a try-catch and rethrow it as an ExecException.
> The write() method has the following unused object (d in the code below):
> {code}
> for (int i = 0; i < sz; i++) {
>     try {
>         Object d = get(i);
>     } catch (ExecException ee) {
>         throw new RuntimeException(ee);
>     }
>     DataReaderWriter.writeDatum(out, mFields.get(i));
> }
> {code}
> {noformat}
> The get(i) call in the try block should be replaced by the writeDatum call directly, 
> since d is never used and there is an unnecessary call to get().
> {noformat}
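
The first suggestion above can be sketched roughly as follows. This is a minimal illustration, not the committed patch: SketchTuple and SketchExecException are hypothetical stand-ins for Pig's DefaultTuple and ExecException, and only get() is modeled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Pig's ExecException (an assumption for this sketch).
class SketchExecException extends Exception {
    SketchExecException(String message, Throwable cause) {
        super(message, cause);
    }
}

// Hypothetical stand-in for DefaultTuple; only the field list and get() are modeled.
final class SketchTuple {
    private final List<Object> mFields = new ArrayList<>();

    void append(Object value) {
        mFields.add(value);
    }

    // No separate checkBounds() call: the list's own range check is caught
    // and rethrown as the checked exception that callers already expect.
    Object get(int fieldNum) throws SketchExecException {
        try {
            return mFields.get(fieldNum);
        } catch (IndexOutOfBoundsException e) {
            throw new SketchExecException(
                "Field " + fieldNum + " requested, tuple size is " + mFields.size(), e);
        }
    }

    // Convenience for demonstration: true when get(fieldNum) is rejected.
    static boolean isRejected(SketchTuple t, int fieldNum) {
        try {
            t.get(fieldNum);
            return false;
        } catch (SketchExecException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        SketchTuple t = new SketchTuple();
        t.append("a");
        System.out.println(isRejected(t, 0)); // false
        System.out.println(isRejected(t, 1)); // true
    }
}
```

The in-range path pays for one bounds check (the list's own) instead of two; the out-of-range path still surfaces as the same kind of checked exception.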




[jira] Created: (PIG-969) Default constructor of UDF gets called for UDF with parameterised constructor , if the udf has a getArgToFuncMapping function defined

2009-09-18 Thread Thejas M Nair (JIRA)
Default constructor of UDF gets called for UDF with parameterised constructor , 
if the udf has a getArgToFuncMapping function defined
-

 Key: PIG-969
 URL: https://issues.apache.org/jira/browse/PIG-969
 Project: Pig
  Issue Type: Bug
  Components: impl
Reporter: Thejas M Nair



This issue is discussed in  
http://www.mail-archive.com/pig-u...@hadoop.apache.org/msg00524.html . I am 
able to reproduce the issue. While it is easy to fix the udf, it can take a lot 
of time to figure out the problem (until they find this email conversation!).

The root cause is that when getArgToFuncMapping is defined in the udf, the 
FuncSpec returned by the method replaces the one set by the define statement. The 
constructor arguments get lost. We can handle this in the following ways -

1. Preserve the constructor arguments, and use them with the class name of the 
matching FuncSpec from getArgToFuncMapping. 
2. Give an error if constructor parameters are given for a udf which has 
FuncSpecs returned from getArgToFuncMapping.

The problem with approach 1 is that we are letting the user define the 
FuncSpec, so the user could have defined a FuncSpec with a constructor (though they 
don't have a valid reason to do so). It is also possible that the constructor 
of the different class that matched might not support the same constructor 
parameters. The use of this function outside builtin udfs is also probably not 
common.

With option 2, we are telling the user that this is not a supported use case, 
and the user can easily change the udf to fix the issue, or use the udf which would 
have matched the given parameters (which is unlikely to have the getArgToFuncMapping 
method defined).

I am proposing that we go with option 2 . 
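
A rough sketch of what option 2 could look like, under stated assumptions: SpecSketch and ArgMappingCheck are hypothetical names that only model a class name plus constructor arguments, and they are not Pig's FuncSpec API. A null "matched" spec here stands for a udf that has no getArgToFuncMapping.

```java
// Hypothetical model of a function spec: class name plus constructor arguments.
final class SpecSketch {
    final String className;
    final String[] ctorArgs;

    SpecSketch(String className, String... ctorArgs) {
        this.className = className;
        this.ctorArgs = ctorArgs;
    }
}

final class ArgMappingCheck {
    // defined: the spec from the define statement.
    // matched: the spec chosen from the udf's argument mapping, or null if
    //          the udf does not provide one (assumption of this sketch).
    static SpecSketch resolve(SpecSketch defined, SpecSketch matched) {
        if (matched == null) {
            return defined; // nothing replaces the defined spec
        }
        if (defined.ctorArgs.length > 0) {
            // Option 2: fail loudly instead of silently dropping the arguments.
            throw new IllegalArgumentException(
                "Constructor arguments are not supported for " + defined.className
                    + " because it maps argument types to other functions");
        }
        return matched;
    }

    public static void main(String[] args) {
        SpecSketch plain = new SpecSketch("MyUdf");
        SpecSketch matched = new SpecSketch("MyUdfLong");
        System.out.println(resolve(plain, matched).className); // MyUdfLong
    }
}
```

The point of the sketch is only the ordering of the checks: the error fires exactly in the case where arguments would otherwise be lost.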





[jira] Updated: (PIG-968) findContainingJar fails when there's a + in the path

2009-09-18 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated PIG-968:


Attachment: pig-968.txt




[jira] Updated: (PIG-968) findContainingJar fails when there's a + in the path

2009-09-18 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated PIG-968:


Status: Patch Available  (was: Open)




[jira] Created: (PIG-968) findContainingJar fails when there's a + in the path

2009-09-18 Thread Todd Lipcon (JIRA)
findContainingJar fails when there's a + in the path


 Key: PIG-968
 URL: https://issues.apache.org/jira/browse/PIG-968
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.4.0, 0.5.0
Reporter: Todd Lipcon


This is the same bug as in MAPREDUCE-714. Please see discussion there.
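
The issue text does not spell out the cause. Assuming, as the title suggests, that the jar path taken from a class-loader URL is run through java.net.URLDecoder, the following sketch shows why a literal + gets mangled and one possible workaround. JarPathDecoding and both method names are hypothetical, and this is not the attached patch.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical helper for illustration; not Pig's or Hadoop's actual code.
final class JarPathDecoding {
    // URLDecoder follows form-encoding rules, so '+' is decoded as a space.
    static String naiveDecode(String path) {
        try {
            return URLDecoder.decode(path, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    // Escape literal '+' as %2B first so it survives the decode step.
    static String plusSafeDecode(String path) {
        return naiveDecode(path.replace("+", "%2B"));
    }

    public static void main(String[] args) {
        String path = "/opt/g++/lib/pig.jar";
        System.out.println(naiveDecode(path).equals(path));    // false
        System.out.println(plusSafeDecode(path).equals(path)); // true
    }
}
```

Percent escapes such as %20 still decode as before; only the handling of a literal + changes.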




Revisit Pig Philosophy?

2009-09-18 Thread Santhosh Srinivasan
Pig Developers,

I looked at the Pig philosophy page as it serves as a guideline for
accepting changes to Pig. Is it time to revisit the overall philosophy? 

Reference: http://hadoop.apache.org/pig/philosophy.html

Some items of interest:

1. SQL semantics and Pig

With the recent addition of SQL on top of Pig, we are making changes to
accommodate SQL semantics. Should this be part of Pig's philosophy?

2. Local mode and other parallel frameworks


Pigs Live Anywhere

Pig is intended to be a language for parallel data processing. It is not
tied to one particular parallel framework. It has been implemented first
on hadoop, but we do not intend that to be only on hadoop.


Are we still holding onto this? What about local mode? Local mode is not
being treated on equal footing with that of Hadoop for practical
reasons. However, users expect things that work on local mode to work
without any hitches on Hadoop.

Are we still designing the system assuming that Pig will be stacked on
top of other parallel frameworks?

Thanks,
Santhosh


[jira] Updated: (PIG-651) PERFORMANCE: Use specialized POForEachNoFlatten for cases where the foreach has no flattens

2009-09-18 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-651:
---

Resolution: Won't Fix
Status: Resolved  (was: Patch Available)

> PERFORMANCE: Use specialized POForEachNoFlatten for cases where the foreach 
> has no flattens
> ---
>
> Key: PIG-651
> URL: https://issues.apache.org/jira/browse/PIG-651
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.2.0
>Reporter: Pradeep Kamath
>Assignee: Pradeep Kamath
> Attachments: PIG-651.patch
>
>
> POForEach has a lot of code to handle flattening (cross product) of the fields 
> in the generate. This is relevant only when at least one field in the generate 
> needs to be flattened. If none of the fields in the generate need to be 
> flattened, a simpler and hopefully more efficient POForEach can be 
> used.




[jira] Created: (PIG-967) Proposal for adding a metadata interface to Pig

2009-09-18 Thread Alan Gates (JIRA)
Proposal for adding a metadata interface to Pig
---

 Key: PIG-967
 URL: https://issues.apache.org/jira/browse/PIG-967
 Project: Pig
  Issue Type: Improvement
  Components: impl
Reporter: Alan Gates
Assignee: Alan Gates


Pig needs to have an interface to connect to metadata systems.  
http://wiki.apache.org/pig/MetadataInterfaceProposal proposes an interface for 
this.




[jira] Commented: (PIG-942) Maps are not implicitly casted

2009-09-18 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12757395#action_12757395
 ] 

Hadoop QA commented on PIG-942:
---

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12420050/PIG-942.patch
  against trunk revision 816699.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/38/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/38/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/38/console

This message is automatically generated.




[jira] Created: (PIG-966) Proposed rework for LoadFunc, StoreFunc, and Slice/r interfaces

2009-09-18 Thread Alan Gates (JIRA)
Proposed rework for LoadFunc, StoreFunc, and Slice/r interfaces
---

 Key: PIG-966
 URL: https://issues.apache.org/jira/browse/PIG-966
 Project: Pig
  Issue Type: Improvement
  Components: impl
Reporter: Alan Gates
Assignee: Alan Gates


I propose that we rework the LoadFunc, StoreFunc, and Slice/r interfaces 
significantly.  See http://wiki.apache.org/pig/LoadStoreRedesignProposal for 
full details

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-948) [Usability] Relating pig script with MR jobs

2009-09-18 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12757330#action_12757330
 ] 

Daniel Dai commented on PIG-948:


If Hadoop exposes it, then there is no problem including it. But I don't think 
Hadoop exposes this construction. I agree it is minimal to maintain and useful 
to many users. My concern is putting undocumented Hadoop features into Pig 
code. I do not object to that, but I feel I need more input because we would be 
breaking a convention.

How do other developers feel?




[jira] Commented: (PIG-948) [Usability] Relating pig script with MR jobs

2009-09-18 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12757310#action_12757310
 ] 

Dmitriy V. Ryaboy commented on PIG-948:
---

I don't see a problem with url construction in Pig code.  If Hadoop exposed 
this, then sure, it would be better to use such a feature.

Since Hadoop does not expose it (afaik), it's more useful for the end-user to 
have this url than to have a jobid.  Maintenance on this piece of code is 
minimal -- after all, it's just a simple string concatenation we are talking 
about.  If Hadoop changes how this url is constructed, it will take about 3 
minutes to fix, 2.5 of which will be spent opening a Jira ticket.  

In the meantime, users will have a more usable product than they would without 
this one line of code.
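
For illustration, the concatenation in question might look like the sketch below. JobUrls is a hypothetical helper, and the jobdetails.jsp path is an assumption about the JobTracker web UI layout (the same undocumented detail the thread is debating), not something Hadoop exposes as an API.

```java
// Hypothetical helper for illustration; not part of Pig or Hadoop.
final class JobUrls {
    // jobTrackerHttpAddress: host:port of the JobTracker web UI.
    // The "/jobdetails.jsp?jobid=" path is an assumption about that UI.
    static String jobDetailsUrl(String jobTrackerHttpAddress, String jobId) {
        return "http://" + jobTrackerHttpAddress + "/jobdetails.jsp?jobid=" + jobId;
    }

    public static void main(String[] args) {
        // Example values only; host, port and job id are made up.
        System.out.println(jobDetailsUrl("jt.example.com:50030", "job_200909181905_0001"));
    }
}
```

Keeping the path in one small method means a change in the UI layout touches a single line.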




[jira] Commented: (PIG-592) schema inferred incorrectly

2009-09-18 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12757306#action_12757306
 ] 

Alan Gates commented on PIG-592:


+1, patch looks good.  Let's get this in, as it's an annoying bug.

> schema inferred incorrectly
> ---
>
> Key: PIG-592
> URL: https://issues.apache.org/jira/browse/PIG-592
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.4.0
>Reporter: Christopher Olston
> Fix For: 0.5.0
>
> Attachments: PIG-592-1.patch
>
>
> A simple Pig script that never introduces any schema information:
> A = load 'foo';
> B = foreach (group A by $8) generate group, COUNT($1);
> C = load 'bar';   -- ('bar' has two columns)
> D = join B by $0, C by $0;
> E = foreach D generate $0, $1, $3;
> fails, complaining that $3 does not exist:
> java.io.IOException: Out of bound access. Trying to access non-existent
> column: 3. Schema {B::group: bytearray,long,bytearray} has 3 column(s).
> Apparently Pig gets confused and thinks it knows the schema for C (a single
> bytearray column).




[jira] Commented: (PIG-651) PERFORMANCE: Use specialized POForEachNoFlatten for cases where the foreach has no flattens

2009-09-18 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12757305#action_12757305
 ] 

Olga Natkovich commented on PIG-651:


I agree that there is not enough gain.

> PERFORMANCE: Use specialized POForEachNoFlatten for cases where the foreach 
> has no flattens
> ---
>
> Key: PIG-651
> URL: https://issues.apache.org/jira/browse/PIG-651
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.2.0
>Reporter: Pradeep Kamath
>Assignee: Pradeep Kamath
> Attachments: PIG-651.patch
>
>
> POForEach has a lot of code to handle flattening (cross product) of the fields
> in the generate. This is relevant only when at least one field in the generate
> needs to be flattened. If no field in the generate needs to be flattened, a
> simpler and hopefully more efficient POForEach can be used.
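
To make the trade-off concrete, here is a minimal sketch of the two paths. It does not use Pig's real POForEach classes; the class name, the method names, and the use of plain java.util.List values as stand-ins for bags are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class ForEachSketch {
    // General path: every flattened bag field multiplies the output rows (cross product).
    static List<List<Object>> generateWithFlatten(List<Object> fields, boolean[] flatten) {
        List<List<Object>> rows = new ArrayList<>();
        rows.add(new ArrayList<>());
        for (int i = 0; i < fields.size(); i++) {
            Object field = fields.get(i);
            List<?> values;
            if (flatten[i] && field instanceof List) {
                values = (List<?>) field;                  // one output row per bag element
            } else {
                values = Collections.singletonList(field); // field is kept as a single value
            }
            List<List<Object>> next = new ArrayList<>();
            for (List<Object> prefix : rows) {
                for (Object value : values) {
                    List<Object> row = new ArrayList<>(prefix);
                    row.add(value);
                    next.add(row);
                }
            }
            rows = next;
        }
        return rows;
    }

    // Specialized path: with no flattens there is exactly one output row per input,
    // so the cross-product bookkeeping above can be skipped entirely.
    static List<Object> generateNoFlatten(List<Object> fields) {
        return new ArrayList<>(fields);
    }
}
```

The specialized path saves the nested loops and intermediate lists, which is the kind of gain the issue is after; whether that is worth a second operator is the question raised in the comments.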




[jira] Updated: (PIG-593) RegExLoader stops an non-matching line

2009-09-18 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-593:
---

Resolution: Duplicate
Status: Resolved  (was: Patch Available)

Looks like this issue has already been addressed with a separate patch.

> RegExLoader stops an non-matching line
> --
>
> Key: PIG-593
> URL: https://issues.apache.org/jira/browse/PIG-593
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.1.0
>Reporter: Vadim Zaliva
>Priority: Minor
> Attachments: PIG-593.diff
>
>
> Class RegExLoader and all its subclasses stop if some line does not match the
> provided regular expression.
> In particular, I have noticed this when CombinedLogLoader stopped at the
> following line:
> 58.210.62.24 - - [29/Dec/2008:23:06:57 -0800] "GET /tor/browse/?id=24746&rel=FLY999%40Jack's+Teen+America+22%2FFLY999原創%40單掛D.C.資訊交流網+Jack's+Teen+America+22+cd1.avi HTTP/1.1" 8952 200 "http://img252.imageshack.us/tor/browse/?id=24746&rel=FLY999%40Jack%27s+Teen+America+22" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; )" "-"
> Looks like some Japanese characters here do not match the \S expression used.
> In general I expect it to skip such lines, not to stop processing the data file.
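
The requested behaviour (skip a non-matching line instead of ending the scan) can be sketched as below. This is not the RegExLoader code; the class and method names are hypothetical, and lines are passed in as a plain list to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SkippingRegexReader {
    // Returns the capture groups of every line that matches; non-matching lines are skipped.
    static List<String[]> readMatching(List<String> lines, Pattern pattern) {
        List<String[]> records = new ArrayList<>();
        for (String line : lines) {
            Matcher m = pattern.matcher(line);
            if (!m.matches()) {
                continue; // skip the bad line and keep going, rather than stopping the load
            }
            String[] groups = new String[m.groupCount()];
            for (int i = 0; i < groups.length; i++) {
                groups[i] = m.group(i + 1); // group 0 is the whole match, so start at 1
            }
            records.add(groups);
        }
        return records;
    }
}
```

A loader that returns "no more input" on the first failed match would lose every record after the bad line, which is the symptom reported here.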




[jira] Updated: (PIG-682) Fix the ssh tunneling code

2009-09-18 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-682:
---

Status: Open  (was: Patch Available)

Moving to open until the patch is changed per the comments by Santhosh and 
Pradeep.

> Fix the ssh tunneling code
> --
>
> Key: PIG-682
> URL: https://issues.apache.org/jira/browse/PIG-682
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Benjamin Reed
> Attachments: jsch-0.1.41.jar, PIG-682.patch
>
>
> Hadoop has changed a bit and the ssh-gateway code no longer works. Pig needs
> to be updated to register with the new socket framework. Reporting of
> problems also needs to improve.




[jira] Commented: (PIG-651) PERFORMANCE: Use specialized POForEachNoFlatten for cases where the foreach has no flattens

2009-09-18 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12757297#action_12757297
 ] 

Alan Gates commented on PIG-651:


Is it worth adding this complexity to the code for a 2% speed up?  I'd vote no.

> PERFORMANCE: Use specialized POForEachNoFlatten for cases where the foreach 
> has no flattens
> ---
>
> Key: PIG-651
> URL: https://issues.apache.org/jira/browse/PIG-651
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.2.0
>Reporter: Pradeep Kamath
>Assignee: Pradeep Kamath
> Attachments: PIG-651.patch
>
>
> POForEach has a lot of code to handle flattening (cross product) of the fields
> in the generate. This is relevant only when at least one field in the generate
> needs to be flattened. If no field in the generate needs to be flattened, a
> simpler and hopefully more efficient POForEach can be used.




[jira] Updated: (PIG-752) local mode doesn't read bzip2 and gzip compressed data files

2009-09-18 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-752:
---

Status: Open  (was: Patch Available)

When I try to apply this patch I get:

{code}
patching file src/org/apache/pig/impl/util/IOStreamFactory.java
patching file src/org/apache/pig/backend/hadoop/datastorage/HFile.java
Hunk #1 FAILED at 29.
Hunk #2 FAILED at 67.
2 out of 2 hunks FAILED -- saving rejects to file 
src/org/apache/pig/backend/hadoop/datastorage/HFile.java.rej
patching file src/org/apache/pig/backend/executionengine/PigSlice.java
patching file src/org/apache/pig/impl/io/FileLocalizer.java
patching file 
src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigOutputFormat.java
patching file test/org/apache/pig/test/TestBZip.java
{code}

> local mode doesn't read bzip2 and gzip compressed data files
> 
>
> Key: PIG-752
> URL: https://issues.apache.org/jira/browse/PIG-752
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.4.0
>Reporter: David Ciemiewicz
>Assignee: Jeff Zhang
> Fix For: 0.4.0
>
> Attachments: Pig_752.Patch
>
>
> Problem 1)  use of .bz2 file extension does not store results bzip2 
> compressed in Local mode (-exectype local)
> If I use the .bz2 filename extension in a STORE statement on HDFS, the 
> results are stored with bzip2 compression.
> If I use the .bz2 filename extension in a STORE statement on local file 
> system, the results are NOT stored with bzip2 compression.
> compact.bz2.pig:
> {code}
> A = load 'events.test' using PigStorage();
> store A into 'events.test.bz2' using PigStorage();
> C = load 'events.test.bz2' using PigStorage();
> C = limit C 10;
> dump C;
> {code}
> {code}
> -bash-3.00$ pig -exectype local compact.bz2.pig
> -bash-3.00$ file events.test
> events.test: ASCII English text, with very long lines
> -bash-3.00$ file events.test.bz2
> events.test.bz2: ASCII English text, with very long lines
> -bash-3.00$ cat events.test | bzip2 > events.test.bz2
> -bash-3.00$ file events.test.bz2
> events.test.bz2: bzip2 compressed data, block size = 900k
> {code}
> The output format in local mode is definitely not bzip2, but it should be.
> Problem 2) pig in local mode does not decompress bzip2 compressed files, but
> should, to be consistent with HDFS
> read.bz2.pig:
> {code}
> A = load 'events.test.bz2' using PigStorage();
> A = limit A 10;
> dump A;
> {code}
> The output should be human readable but is instead garbage, indicating no 
> decompression took place during the load:
> {code}
> -bash-3.00$ pig -exectype local read.bz2.pig
> USING: /grid/0/gs/pig/current
> 2009-04-03 18:26:30,455 [main] INFO  
> org.apache.pig.backend.local.executionengine.LocalPigLauncher - 100% complete!
> 2009-04-03 18:26:30,456 [main] INFO  
> org.apache.pig.backend.local.executionengine.LocalPigLauncher - Success!!
> (BZh91AY&syoz?u?...@{x_?d?|u-??mK???;??4?C??)
> ((R? 6?*m?&???g, 
> ?6?Zj?k,???0?QT?d???hY?#mJ?>[j???z?m?t?u?K)??K5+??)?m?E7j?X?8a??
> ??U?p@@MT?$?B?P??N??=???(z<}gk...@c$\??i]?g:?J)
> a(R?,?u?v???...@?i@??J??!D?)???A?PP?IY??m?
> (mP(i?4,#F[?I)@>?...@??|7^?}U??wwg,?u?$?T???((Q!D?=`*?}hP??_|??=?(??2???m=?xG?(?rC?B?(33??:4?N???t|??T?*??k??NT?x???=?fyv?w>f??4z???4t?)
> (?oou?t???Kwl?3?nCM?WS?;l???P?s?x
> a???e)B??9?  ?44
> ((?...@4?)
> (f)
> (?...@+?d?0@>?U)
> (Q?SR)
> -bash-3.00$ 
> {code}
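
Both problems come down to choosing a codec from the file name on store and on load. The sketch below shows that idea only; it is not the attached patch. The class and method names are hypothetical, and because the JDK ships gzip but no bzip2 codec, the bzip2 branch is left as an explicit placeholder where a real fix would plug in a bzip2 stream implementation.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class CodecByExtension {
    enum Codec { NONE, GZIP, BZIP2 }

    // Pick the codec from the file extension, the same rule for local files and HDFS.
    static Codec codecFor(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".gz")) return Codec.GZIP;
        if (lower.endsWith(".bz2")) return Codec.BZIP2;
        return Codec.NONE;
    }

    static OutputStream wrapForStore(String fileName, OutputStream raw) {
        try {
            switch (codecFor(fileName)) {
                case GZIP:
                    return new GZIPOutputStream(raw);
                case BZIP2:
                    // Placeholder: the JDK has no bzip2 codec, a real fix needs one here.
                    throw new UnsupportedOperationException("bzip2 codec not available in this sketch");
                default:
                    return raw;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static InputStream wrapForLoad(String fileName, InputStream raw) {
        try {
            switch (codecFor(fileName)) {
                case GZIP:
                    return new GZIPInputStream(raw);
                case BZIP2:
                    throw new UnsupportedOperationException("bzip2 codec not available in this sketch");
                default:
                    return raw;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Stores text under the given name in memory and loads it back through the same rule.
    static String roundTrip(String fileName, String text) {
        try {
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            try (OutputStream out = wrapForStore(fileName, sink)) {
                out.write(text.getBytes(StandardCharsets.UTF_8));
            }
            try (InputStream in = wrapForLoad(fileName, new ByteArrayInputStream(sink.toByteArray()))) {
                ByteArrayOutputStream result = new ByteArrayOutputStream();
                byte[] buf = new byte[4096];
                int n;
                while ((n = in.read(buf)) != -1) {
                    result.write(buf, 0, n);
                }
                return new String(result.toByteArray(), StandardCharsets.UTF_8);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The point of routing both store and load through one codecFor rule is that local mode and HDFS mode cannot drift apart, which is the inconsistency described in Problems 1 and 2.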




[jira] Updated: (PIG-777) Code refactoring: Create optimization out of store/load post processing code

2009-09-18 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-777:
---

Status: Open  (was: Patch Available)

Moving from patch available to open since the contributed patch has been 
committed and the JIRA is being held open to address other issues.

> Code refactoring: Create optimization out of store/load post processing code
> 
>
> Key: PIG-777
> URL: https://issues.apache.org/jira/browse/PIG-777
> Project: Pig
>  Issue Type: Improvement
>Reporter: Gunther Hagleitner
> Attachments: log_message.patch
>
>
> The postProcessing method in the pig server checks whether a logical graph 
> contains stores to and loads from the same location. If so, it will either 
> connect the store and load, or optimize by throwing out the load and 
> connecting the store predecessor with the successor of the load.
> Ideally the introduction of the store and load connection should happen in
> the query compiler, while the optimization should then happen in a separate
> optimizer step as part of the optimizer framework.




[jira] Updated: (PIG-948) [Usability] Relating pig script with MR jobs

2009-09-18 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-948:
---

Status: Open  (was: Patch Available)

Marking this as open again rather than patch available until issues with job 
tracker URI in the message are resolved.

> [Usability] Relating pig script with MR jobs
> 
>
> Key: PIG-948
> URL: https://issues.apache.org/jira/browse/PIG-948
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
>Priority: Minor
> Attachments: pig-948.patch
>
>
> Currently it's hard to find a way to relate a pig script to a specific MR job.
> In a loaded cluster with multiple simultaneous job submissions, it's not easy
> to figure out which specific MR jobs were launched for a given pig script. If
> Pig can provide this info, it will be useful for debugging and monitoring the
> jobs resulting from a pig script.
> At the very least, Pig should be able to provide the user with the following
> information:
> 1) Job id of the launched job.
> 2) Complete web url of the jobtracker running this job.




[jira] Updated: (PIG-951) Reset parallelism to 1 for indexing job in MergeJoin

2009-09-18 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-951:
---

   Resolution: Fixed
Fix Version/s: 0.6.0
   Status: Resolved  (was: Patch Available)

Patch checked in.  Thanks Ashutosh.

> Reset parallelism to 1 for indexing job in MergeJoin
> 
>
> Key: PIG-951
> URL: https://issues.apache.org/jira/browse/PIG-951
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
> Fix For: 0.6.0
>
> Attachments: pig-951.patch
>
>
> After sampling one tuple from every block, one reducer is used to sort the
> index entries in the reduce phase to produce the sorted index used in the
> actual join job. Thus, the parallelism of the index job should be explicitly
> set to 1. Currently, it is not.
> At present this is a non-issue, since we don't allow any blocking operators
> in the pipeline before merge-join. However, later, when we do allow blocking
> operators, the parallelism of the indexing job will be that of the preceding
> blocking operator. Even then, the job will complete successfully, because all
> tuples will go to only one reducer, since we are grouping on only one key,
> "all". However, it will waste cluster resources by starting all the extra
> reducers, which get no data and thus do nothing.
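
The "extra reducers do nothing" point can be checked with a small sketch. It assumes the usual hash-partitioning arithmetic (non-negative hash code modulo the reducer count); the class and method names are hypothetical and nothing here is taken from the patch.

```java
final class SingleKeyPartitionSketch {
    // Assumed partitioning rule: non-negative hash of the key, modulo the reducer count.
    static int partitionFor(Object key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Counts reducers that would receive no records for the given keys.
    static int idleReducers(Iterable<String> keys, int numReducers) {
        boolean[] used = new boolean[numReducers];
        for (String key : keys) {
            used[partitionFor(key, numReducers)] = true;
        }
        int idle = 0;
        for (boolean u : used) {
            if (!u) {
                idle++;
            }
        }
        return idle;
    }
}
```

With every record keyed by "all", any reducer count N leaves N - 1 reducers idle, so setting the parallelism of the indexing job to 1 loses nothing.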




[jira] Updated: (PIG-942) Maps are not implicitly casted

2009-09-18 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-942:
---

Fix Version/s: 0.6.0
 Assignee: Pradeep Kamath
   Status: Patch Available  (was: Open)

> Maps are not implicitly casted
> --
>
> Key: PIG-942
> URL: https://issues.apache.org/jira/browse/PIG-942
> Project: Pig
>  Issue Type: Bug
>Reporter: Sriranjan Manjunath
>Assignee: Pradeep Kamath
> Fix For: 0.6.0
>
> Attachments: PIG-942.patch
>
>
> A = load 'foo' as (m) throws the following exception when foo has maps.
> java.lang.ClassCastException: org.apache.pig.data.DataByteArray cannot be 
> cast to java.util.Map
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POMapLookUp.getNext(POMapLookUp.java:98)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POMapLookUp.getNext(POMapLookUp.java:115)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:612)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:278)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:204)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:231)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:240)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:249)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:240)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:93)
> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
> at 
> org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2198)
> The same works if I explicitly cast m to a map: A = load 'foo' as (m:[])




[jira] Updated: (PIG-942) Maps are not implicitly casted

2009-09-18 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-942:
---

Attachment: PIG-942.patch

Attached a patch which addresses the issue by introducing an implicit cast to
Map if the input to a Map lookup is not a map. As part of this patch I also
changed LOCast.getFieldSchema() to set lineage information (set its parent to
the expression on which it is operating). The latter change was needed to fix
unit tests which were failing due to the first change.

> Maps are not implicitly casted
> --
>
> Key: PIG-942
> URL: https://issues.apache.org/jira/browse/PIG-942
> Project: Pig
>  Issue Type: Bug
>Reporter: Sriranjan Manjunath
> Attachments: PIG-942.patch
>
>
> A = load 'foo' as (m) throws the following exception when foo has maps.
> java.lang.ClassCastException: org.apache.pig.data.DataByteArray cannot be 
> cast to java.util.Map
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POMapLookUp.getNext(POMapLookUp.java:98)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POMapLookUp.getNext(POMapLookUp.java:115)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:612)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:278)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:204)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:231)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:240)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:249)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:240)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:93)
> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
> at 
> org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2198)
> The same works if I explicitly cast m to a map: A = load 'foo' as (m:[])




[jira] Commented: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)

2009-09-18 Thread Thejas M Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12757242#action_12757242
 ] 

Thejas M Nair commented on PIG-965:
---

The 'common' use case to which these optimizations apply usually has a constant
string specifying the pattern. It makes sense to use these optimizations
(specifically optimization 2) only in such cases, so that the worst case is not
worse off.

Another thing to check is whether there are alternative, faster regex
implementations.

> PERFORMANCE: optimize common case in matches (PORegex)
> --
>
> Key: PIG-965
> URL: https://issues.apache.org/jira/browse/PIG-965
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Reporter: Thejas M Nair
>
> Some frequently seen use cases of the 'matches' comparison operator have the
> following properties -
> 1. The rhs is a constant string, e.g. "c1 matches 'abc%'".
> 2. Regexes that look for a matching prefix, suffix, etc. are very common,
> e.g. 'abc%', '%abc', '%abc%'.
> To optimize for these common cases, PORegex.java can be changed to -
> 1. Compile the pattern (rhs of matches) and re-use it if the pattern string
> has not changed.
> 2. Use string comparisons for simple common regexes (in 2 above).
> The implementation of Hive's LIKE clause uses similar optimizations.
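
A rough sketch of the two proposed optimizations follows. It is not the PORegex code. It assumes java.util.regex syntax, so the prefix, suffix and substring shapes are written as 'abc.*', '.*abc' and '.*abc.*' rather than with the '%' used in the examples above. The class name, the compileCount field and the helper names are hypothetical.

```java
import java.util.regex.Pattern;

final class CachedMatcher {
    private static final String META = "\\[](){}.*+?^$|";
    private String cachedSource;   // pattern string of the last compiled regex
    private Pattern cachedPattern; // compiled form, reused while the string is unchanged
    int compileCount;              // only here so the caching effect can be observed

    boolean matches(String input, String regex) {
        // Optimization 2: plain string operations for literal, prefix, suffix, substring shapes.
        // '.' does not match line terminators by default, so such inputs take the regex path.
        if (!hasLineTerminator(input)) {
            boolean lead = regex.startsWith(".*");
            boolean trail = regex.endsWith(".*");
            int start = lead ? 2 : 0;
            int end = trail ? regex.length() - 2 : regex.length();
            if (end >= start) {
                String core = regex.substring(start, end);
                if (isLiteral(core)) {
                    if (lead && trail) return input.contains(core);
                    if (lead) return input.endsWith(core);
                    if (trail) return input.startsWith(core);
                    return input.equals(core);
                }
            }
        }
        // Optimization 1: compile only when the pattern string has changed.
        if (!regex.equals(cachedSource)) {
            cachedPattern = Pattern.compile(regex);
            cachedSource = regex;
            compileCount++;
        }
        return cachedPattern.matcher(input).matches();
    }

    // True when the text contains no regex metacharacters and can be compared verbatim.
    private static boolean isLiteral(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (META.indexOf(s.charAt(i)) >= 0) return false;
        }
        return true;
    }

    private static boolean hasLineTerminator(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\n' || c == '\r' || c == '\u0085' || c == '\u2028' || c == '\u2029') return true;
        }
        return false;
    }
}
```

Anything that is not a plain literal between optional leading and trailing '.*' falls through to the cached regex, so the worst case costs one extra string scan on top of what the regex path already does.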




[jira] Commented: (PIG-964) Handling null in skewed join

2009-09-18 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12757237#action_12757237
 ] 

Olga Natkovich commented on PIG-964:


patch committed to the trunk

> Handling null  in skewed join
> -
>
> Key: PIG-964
> URL: https://issues.apache.org/jira/browse/PIG-964
> Project: Pig
>  Issue Type: Bug
>Reporter: Sriranjan Manjunath
>Assignee: Sriranjan Manjunath
> Attachments: skewedjoinnull.patch
>
>
> For null tuples, the tuple size is calculated incorrectly and thus skewed
> join ends up expecting a large number of reducers. Further, skewed join 
> should not bail out after the second job if the number of reducers specified 
> by the user is low. It should print a warning message and continue execution.




RE: [VOTE] Release Pig 0.4.0 (candidate 1)

2009-09-18 Thread Olga Natkovich
Yes, I need to roll another version. I resolved the ant test issue but
there is one more with unit tests in piggybank that I am still looking
into.

Olga

-Original Message-
From: Nigel Daley [mailto:nda...@yahoo-inc.com] 
Sent: Thursday, September 17, 2009 10:56 PM
To: pig-dev@hadoop.apache.org
Subject: Re: [VOTE] Release Pig 0.4.0 (candidate 1)

Is anyone else getting javac errors running "ant test"?

compile-sources:
 [javac] Compiling 484 source files to /Users/ndaley/hadoop/verify/pig-0.4.0/build/classes
 [javac] /Users/ndaley/hadoop/verify/pig-0.4.0/src/org/apache/pig/ComparisonFunc.java:22: package org.apache.hadoop.io does not exist
 [javac] import org.apache.hadoop.io.WritableComparable;
 [javac]^
...

Nige

On Sep 17, 2009, at 12:09 PM, Olga Natkovich wrote:

> Hi,
>
> I have fixed the issue causing the failure that Alan reported.
>
> Please test the new release:
> http://people.apache.org/~olga/pig-0.4.0-candidate-1/.
>
> Vote closes on Tuesday, 9/22.
>
> Olga
>
>
> -Original Message-
> From: Olga Natkovich [mailto:ol...@yahoo-inc.com]
> Sent: Monday, September 14, 2009 2:06 PM
> To: pig-dev@hadoop.apache.org; priv...@hadoop.apache.org
> Subject: [VOTE] Release Pig 0.4.0 (candidate 0)
>
> Hi,
>
>
>
> I created a candidate build for Pig 0.4.0 release. The highlights of
> this release are
>
>
>
> -  Performance improvements, especially in the area of JOIN
> support, where we introduced two new join types: skew join to deal with
> data skew and sort merge join to take advantage of sorted data sets.
>
> -  Support for Outer join.
>
> -  Works with Hadoop 0.18
>
>
>
> I ran the release audit and the rat report looked fine. The relevant part is
> attached below.
>
>
>
> Keys used to sign the release are available at
> http://svn.apache.org/viewvc/hadoop/pig/trunk/KEYS?view=markup.
>
>
>
> Please download the release and try it out:
> http://people.apache.org/~olga/pig-0.4.0-candidate-0.
>
>
>
> Should we release this? Vote closes on Thursday, 9/17.
>
>
>
> Olga
>
>
>
>
>
> [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/contrib/CHANGES.txt
> [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/contrib/zebra/CHANGES.txt
> [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/broken-links.xml
> [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/cookbook.html
> [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/index.html
> [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/linkmap.html
> [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/piglatin_reference.html
> [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/piglatin_users.html
> [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/setup.html
> [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/tutorial.html
> [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/udf.html
> [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/api/package-list
> [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/jdiff/changes.html
> [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/jdiff/missingSinces.txt
> [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/jdiff/user_comments_for_pig_0.3.1_to_pig_0.5.0-dev.xml
> [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/jdiff/changes/alldiffs_index_additions.html
> [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/jdiff/changes/alldiffs_index_all.html
> [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/jdiff/changes/alldiffs_index_changes.html
> [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/jdiff/changes/alldiffs_index_removals.html
> [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/jdiff/changes/changes-summary.html
> [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/jdiff/changes/classes_index_additions.html
> [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/jdiff/changes/classes_index_all.html
> [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/jdiff/changes/classes_index_changes.html
> [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/jdiff/changes/classes_index_removals.html
> [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/jdiff/changes/constructors_index_additions.html
> [java]  !?
> /home/olgan/s