[jira] Commented: (PIG-979) Acummulator Interface for UDFs

2009-09-25 Thread Jeff Hammerbacher (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12759848#action_12759848
 ] 

Jeff Hammerbacher commented on PIG-979:
---

One could also cite the SOSP paper from MSR this year comparing the iterator to 
the accumulator interface, though I have a hard time concisely stating their 
conclusions: http://sigops.org/sosp/sosp09/papers/yu-sosp09.pdf

> Acummulator Interface for UDFs
> --
>
> Key: PIG-979
> URL: https://issues.apache.org/jira/browse/PIG-979
> Project: Pig
>  Issue Type: New Feature
>Reporter: Alan Gates
>Assignee: Ying He
>
> Add an accumulator interface for UDFs that would allow them to take a set 
> number of records at a time instead of the entire bag.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-980) Optimizing nested order bys

2009-09-25 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12759815#action_12759815
 ] 

Alan Gates commented on PIG-980:


A common pattern for Pig Latin scripts is:

{code}
A = load 'bla';
B = group A by $0;
C = foreach B {
    D = order A by $1;
    ...
}
{code}

Currently Pig executes this by using POSort on the reduce side, which collects 
all of the records out of the bag produced by POPackage into
a SortedBag.  If this bag is large, it will spill both as part of POPackage 
collecting it and as part of POSort sorting it.

None of this is necessary, however.  Hadoop allows users to specify a sort 
order for data going to the reducer in addition to a partition key.  This can 
be done by defining the Comparator for the job to compare all the fields you 
want sorted, and the Partitioner to look only at the field you want to 
partition on.  So in this case the partitioner would be set to look at $0, and 
the comparator at $0 and $1.
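As a concrete illustration (plain Java with no Hadoop dependency; the class and method names here are hypothetical, not Hadoop's real API), the partitioner/comparator split can be sketched as:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of Hadoop-style secondary sort with no Hadoop
// dependency: a key is a String[] of {partition field $0, sort field $1}.
public class SecondarySortSketch {

    // Partition on $0 only, so every record of a group reaches one reducer.
    static int partition(String[] key, int numReducers) {
        return (key[0].hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Compare on $0, then $1, so each group's records arrive already sorted.
    static final Comparator<String[]> KEY_COMPARATOR = (a, b) -> {
        int c = a[0].compareTo(b[0]);
        return (c != 0) ? c : a[1].compareTo(b[1]);
    };

    public static void main(String[] args) {
        List<String[]> keys = new ArrayList<>(Arrays.asList(
                new String[]{"b", "2"},
                new String[]{"a", "9"},
                new String[]{"a", "1"}));
        keys.sort(KEY_COMPARATOR);
        for (String[] k : keys) {
            System.out.println(k[0] + ":" + k[1]);
        }
    }
}
```

With Hadoop's actual API this logic would live in a Partitioner and a key comparator registered on the JobConf, but the division of labor is the same: partition coarsely, sort finely.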

Beyond avoiding unnecessary sorts and spills, this will also allow us to use 
the proposed Accumulator interface (see PIG-979) for these types
of scripts.


> Optimizing nested order bys
> ---
>
> Key: PIG-980
> URL: https://issues.apache.org/jira/browse/PIG-980
> Project: Pig
>  Issue Type: Improvement
>Reporter: Alan Gates
>Assignee: Ying He
>
> Pig needs to take advantage of secondary sort in Hadoop to optimize nested 
> order bys.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-979) Acummulator Interface for UDFs

2009-09-25 Thread David Ciemiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12759813#action_12759813
 ] 

David Ciemiewicz commented on PIG-979:
--

This JIRA doesn't quite get the gist of why I believe the Accumulator interface 
is of interest.  It isn't just about performance and avoiding retreading the 
same data over and over again.

It is also about providing an interface to support CUMULATIVE_SUM, RANK, and 
other functions of its ilk.

A better code example for justifying this would be:

{code}
A = load 'data' using PigStorage() as ( query: chararray, count: int );
B = order A by count desc parallel 1;
C = foreach B generate
query,
count,
CUMULATIVE_SUM(count) as cumulative_count,
RANK(count) as rank;
{code}

These functions, RANK and CUMULATIVE_SUM, would have persistent state and yet 
would emit one value per tuple passed.  Bags would not be appropriate as 
coded.
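A minimal sketch of what such stateful, value-at-a-time functions would look like (plain Java illustrations, not Pig's actual UDF API):

```java
// Hypothetical sketches of stateful per-tuple functions. These are not
// Pig's actual UDF classes; they only illustrate the persistent state.
public class StatefulFunctions {

    // Emits the running total: one output value per input value.
    static class CumulativeSum {
        private long total = 0;
        long next(long count) {
            total += count;
            return total;
        }
    }

    // Emits the rank of each value in an already-sorted (descending) stream.
    static class Rank {
        private int rank = 0;
        int next(long count) {
            return ++rank;   // ties ignored for simplicity
        }
    }

    public static void main(String[] args) {
        CumulativeSum sum = new CumulativeSum();
        Rank rank = new Rank();
        for (long c : new long[]{50, 30, 20}) {   // counts already ordered desc
            System.out.println(rank.next(c) + "\t" + c + "\t" + sum.next(c));
        }
    }
}
```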

Additionally, the reason for the Accumulator interface is to avoid multiple 
passes over the same data:

For instance, consider the example:

{code}
A = load 'data' using PigStorage() as ( query: chararray, count: int );
B = group A all;
C = foreach B generate
group,
SUM(A.count),
AVG(A.count),
VAR(A.count),
STDEV(A.count),
MIN(A.count),
MAX(A.count),
MEDIAN(A.count);
{code}

Repeatedly shuffling the same values just isn't an optimal way to process data.
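The single-pass alternative can be sketched as one accumulator object updated per function, all fed by the same scan (a hypothetical illustration, not Pig's implementation):

```java
// Hypothetical one-pass aggregation: a single scan feeds every accumulator,
// instead of iterating the bag once per function.
public class OnePassAggregates {
    long n = 0;
    long sum = 0;
    long min = Long.MAX_VALUE;
    long max = Long.MIN_VALUE;

    void accumulate(long count) {
        n++;
        sum += count;
        min = Math.min(min, count);
        max = Math.max(max, count);
    }

    double avg() {
        return (double) sum / n;
    }

    public static void main(String[] args) {
        OnePassAggregates agg = new OnePassAggregates();
        for (long c : new long[]{3, 7, 2, 8}) {
            agg.accumulate(c);   // every function updated in the same pass
        }
        System.out.println("SUM=" + agg.sum + " AVG=" + agg.avg()
                + " MIN=" + agg.min + " MAX=" + agg.max);
    }
}
```

MEDIAN is the odd one out: it needs the full (or sorted) data for the key, so a streaming accumulator alone does not cover it.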



> Acummulator Interface for UDFs
> --
>
> Key: PIG-979
> URL: https://issues.apache.org/jira/browse/PIG-979
> Project: Pig
>  Issue Type: New Feature
>Reporter: Alan Gates
>Assignee: Ying He
>
> Add an accumulator interface for UDFs that would allow them to take a set 
> number of records at a time instead of the entire bag.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-980) Optimizing nested order bys

2009-09-25 Thread Alan Gates (JIRA)
Optimizing nested order bys
---

 Key: PIG-980
 URL: https://issues.apache.org/jira/browse/PIG-980
 Project: Pig
  Issue Type: Improvement
Reporter: Alan Gates
Assignee: Ying He


Pig needs to take advantage of secondary sort in Hadoop to optimize nested 
order bys.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-979) Acummulator Interface for UDFs

2009-09-25 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12759804#action_12759804
 ] 

Alan Gates commented on PIG-979:


Consider a Pig script like the following:

{code}
A = load 'bla';
B = group A by $0;
C = foreach B {
    D = order A by $1;
    generate CUMULATIVE_SUM(D);
}
{code}

Because the UDF needs to see this data in an ordered fashion, it cannot be done 
using Pig's Algebraic interface.  But it
does not need to see all the contents of the bag together.

One way to address this is to add an Accumulator interface that UDFs could 
implement.

{code}
interface Accumulator<T> {

    /**
     * Pass tuples to the UDF.  The passed in bag will contain only records from
     * one key.  It may not contain all the records for that key.  This function
     * will be called repeatedly until all records for the key have been
     * provided to the UDF.
     * @param b one or more tuples, all sharing the same key.
     */
    void accumulate(Bag b);

    /**
     * Called when all records for a key have been passed to accumulate.
     * @return the value of the UDF for this key.
     */
    T getValue();
}
{code}

If all UDFs in a given foreach implement this Accumulator interface, Pig could 
choose to use this method to push records to the UDFs.  It would then not need 
to read all records from the reduce iterator and cache them in memory or on 
disk.
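For illustration, a SUM UDF written against the proposed interface might look like the following sketch (a List of longs stands in for Pig's Bag/Tuple classes; nothing here is the actual implementation):

```java
import java.util.Arrays;
import java.util.List;

// Sketch of a SUM UDF against the proposed Accumulator interface. The "bag"
// here is a stand-in (a List of longs), not Pig's real Bag/Tuple classes.
public class LongSumAccumulator {
    private long sum = 0;

    // Called repeatedly with partial bags of records for the same key.
    void accumulate(List<Long> bag) {
        for (long value : bag) {
            sum += value;
        }
    }

    // Called once all records for the key have been passed to accumulate.
    Long getValue() {
        return sum;
    }

    public static void main(String[] args) {
        LongSumAccumulator udf = new LongSumAccumulator();
        // Pig would hand over the key's records in several small batches...
        udf.accumulate(Arrays.asList(1L, 2L));
        udf.accumulate(Arrays.asList(3L));
        // ...so no single structure ever holds the whole bag.
        System.out.println(udf.getValue());
    }
}
```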

Before we commit to adding this new level of complexity to the language, we 
should performance test it.  Given that we have recently made a change aimed 
at addressing Pig's problem of dying during large non-algebraic group bys (see 
PIG-975), this needs to perform significantly better than that change to 
justify adding it.


> Acummulator Interface for UDFs
> --
>
> Key: PIG-979
> URL: https://issues.apache.org/jira/browse/PIG-979
> Project: Pig
>  Issue Type: New Feature
>Reporter: Alan Gates
>Assignee: Ying He
>
> Add an accumulator interface for UDFs that would allow them to take a set 
> number of records at a time instead of the entire bag.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-979) Acummulator Interface for UDFs

2009-09-25 Thread Alan Gates (JIRA)
Acummulator Interface for UDFs
--

 Key: PIG-979
 URL: https://issues.apache.org/jira/browse/PIG-979
 Project: Pig
  Issue Type: New Feature
Reporter: Alan Gates
Assignee: Ying He


Add an accumulator interface for UDFs that would allow them to take a set 
number of records at a time instead of the entire bag.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-978) ERROR 2100 (hdfs://localhost/tmp/temp175740929/tmp-1126214010 does not exist) and ERROR 2999: (Unexpected internal error. null) when using Multi-Query optimization

2009-09-25 Thread Viraj Bhat (JIRA)
ERROR 2100 (hdfs://localhost/tmp/temp175740929/tmp-1126214010 does not exist) 
and ERROR 2999: (Unexpected internal error. null) when using Multi-Query 
optimization
---

 Key: PIG-978
 URL: https://issues.apache.org/jira/browse/PIG-978
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Viraj Bhat
 Fix For: 0.6.0


I have a Pig script of this form, which I execute using Multi-query 
optimization.

{code}
A = load '/user/viraj/firstinput' using PigStorage();
B = group 
C = ..aggregation function
store C into '/user/viraj/firstinputtempresult/days1';
..
Atab = load '/user/viraj/secondinput' using PigStorage();
Btab = group 
Ctab = ..aggregation function
store Ctab into '/user/viraj/secondinputtempresult/days1';
..
E = load '/user/viraj/firstinputtempresult/' using PigStorage();
F = group 
G = aggregation function
store G into '/user/viraj/finalresult1';

Etab = load '/user/viraj/secondinputtempresult/' using PigStorage();
Ftab = group 
Gtab = aggregation function
store Gtab into '/user/viraj/finalresult2';
{code}


2009-07-20 22:05:44,507 [main] ERROR org.apache.pig.tools.grunt.GruntParser - 
ERROR 2100: hdfs://localhost/tmp/temp175740929/tmp-1126214010 does not exist. 
Details at logfile: /homes/viraj/pigscripts/pig_1248127173601.log

This error is due to a mismatch of store and load commands. The script first 
stores files 
into the 'days1' directory (store C into 
'/user/viraj/firstinputtempresult/days1' using PigStorage();), but it later 
loads from the top level directory (E = load 
'/user/viraj/firstinputtempresult/' using PigStorage()) instead of the original 
directory (/user/viraj/firstinputtempresult/days1).

The current multi-query optimizer can't detect the dependency between these 
two commands because they have different load file paths, so the jobs run 
concurrently and produce the errors.

The workaround is to add an 'exec' or 'run' command after the first two 
stores. This forces the first two store commands to run before the remaining 
commands.
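To make the workaround concrete, the script above would become something like the following sketch (elided parts kept elided; 'exec' forces the pending stores to run before the later statements execute):

{code}
...
store C into '/user/viraj/firstinputtempresult/days1';
..
store Ctab into '/user/viraj/secondinputtempresult/days1';
exec
E = load '/user/viraj/firstinputtempresult/' using PigStorage();
...
{code}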

It would be nice to see this fixed as part of an enhancement to Multi-query: 
either disable Multi-query in this case or throw a warning/error message, so 
that the user can correct the load/store statements.

Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-949) Zebra Bug: splitting map into multiple column group using storage hint causes unexpected behaviour

2009-09-25 Thread Raghu Angadi (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12759789#action_12759789
 ] 

Raghu Angadi commented on PIG-949:
--

I just committed this. Thanks Yan for the fix and Jing for the test!

> Zebra Bug: splitting map into multiple column group using storage hint causes 
> unexpected behaviour
> --
>
> Key: PIG-949
> URL: https://issues.apache.org/jira/browse/PIG-949
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.4.0
> Environment: linux
>Reporter: Alok Singh
>Assignee: Yan Zhou
> Fix For: 0.5.0
>
> Attachments: Pig_949.patch, Pig_949.patch, Pig_949.patch
>
>
> Hi,
> The storage hint specification plays an important part in whether the 
> output table is readable or not.
> Say we have the map 'map'.
> One can split the map into a column group using [map#{k1}, map#{k2}...]; 
> the remaining map fields will automatically be added to the default group.
> If the user tries to create a new column group for the remaining fields as 
> follows, [map#{k1}, map#{k2}, ..][map], i.e. creates a separate column 
> group, the table writer will create the table.
> However, if one tries to load the created table via pig or via map reduce 
> using TableInputFormat, then the reader has problems reading the map.
> We get the following stack trace:
> 09/09/09 00:09:45 INFO mapred.JobClient: Task Id : 
> attempt_200908191538_33939_m_21_2, Status : FAILED
> java.io.IOException: getValue() failed: null
> at 
> org.apache.hadoop.zebra.io.BasicTable$Reader$BTScanner.getValue(BasicTable.java:775)
> at 
> org.apache.hadoop.zebra.mapred.TableRecordReader.next(TableInputFormat.java:717)
> at 
> org.apache.hadoop.zebra.mapred.TableRecordReader.next(TableInputFormat.java:651)
> at 
> org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:191)
> at 
> org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:175)
> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:356)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
> at org.apache.hadoop.mapred.Child.main(Child.java:170)
> Alok

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-949) Zebra Bug: splitting map into multiple column group using storage hint causes unexpected behaviour

2009-09-25 Thread Raghu Angadi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raghu Angadi updated PIG-949:
-

   Resolution: Fixed
Fix Version/s: (was: 0.4.0)
   Status: Resolved  (was: Patch Available)

> Zebra Bug: splitting map into multiple column group using storage hint causes 
> unexpected behaviour
> --
>
> Key: PIG-949
> URL: https://issues.apache.org/jira/browse/PIG-949
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.4.0
> Environment: linux
>Reporter: Alok Singh
>Assignee: Yan Zhou
> Fix For: 0.5.0
>
> Attachments: Pig_949.patch, Pig_949.patch, Pig_949.patch
>
>
> Hi,
> The storage hint specification plays an important part in whether the 
> output table is readable or not.
> Say we have the map 'map'.
> One can split the map into a column group using [map#{k1}, map#{k2}...]; 
> the remaining map fields will automatically be added to the default group.
> If the user tries to create a new column group for the remaining fields as 
> follows, [map#{k1}, map#{k2}, ..][map], i.e. creates a separate column 
> group, the table writer will create the table.
> However, if one tries to load the created table via pig or via map reduce 
> using TableInputFormat, then the reader has problems reading the map.
> We get the following stack trace:
> 09/09/09 00:09:45 INFO mapred.JobClient: Task Id : 
> attempt_200908191538_33939_m_21_2, Status : FAILED
> java.io.IOException: getValue() failed: null
> at 
> org.apache.hadoop.zebra.io.BasicTable$Reader$BTScanner.getValue(BasicTable.java:775)
> at 
> org.apache.hadoop.zebra.mapred.TableRecordReader.next(TableInputFormat.java:717)
> at 
> org.apache.hadoop.zebra.mapred.TableRecordReader.next(TableInputFormat.java:651)
> at 
> org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:191)
> at 
> org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:175)
> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:356)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
> at org.apache.hadoop.mapred.Child.main(Child.java:170)
> Alok

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-949) Zebra Bug: splitting map into multiple column group using storage hint causes unexpected behaviour

2009-09-25 Thread Chao Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Wang updated PIG-949:
--


Already reviewed the patch. +1

> Zebra Bug: splitting map into multiple column group using storage hint causes 
> unexpected behaviour
> --
>
> Key: PIG-949
> URL: https://issues.apache.org/jira/browse/PIG-949
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.4.0
> Environment: linux
>Reporter: Alok Singh
>Assignee: Yan Zhou
> Fix For: 0.4.0, 0.5.0
>
> Attachments: Pig_949.patch, Pig_949.patch, Pig_949.patch
>
>
> Hi,
> The storage hint specification plays an important part in whether the 
> output table is readable or not.
> Say we have the map 'map'.
> One can split the map into a column group using [map#{k1}, map#{k2}...]; 
> the remaining map fields will automatically be added to the default group.
> If the user tries to create a new column group for the remaining fields as 
> follows, [map#{k1}, map#{k2}, ..][map], i.e. creates a separate column 
> group, the table writer will create the table.
> However, if one tries to load the created table via pig or via map reduce 
> using TableInputFormat, then the reader has problems reading the map.
> We get the following stack trace:
> 09/09/09 00:09:45 INFO mapred.JobClient: Task Id : 
> attempt_200908191538_33939_m_21_2, Status : FAILED
> java.io.IOException: getValue() failed: null
> at 
> org.apache.hadoop.zebra.io.BasicTable$Reader$BTScanner.getValue(BasicTable.java:775)
> at 
> org.apache.hadoop.zebra.mapred.TableRecordReader.next(TableInputFormat.java:717)
> at 
> org.apache.hadoop.zebra.mapred.TableRecordReader.next(TableInputFormat.java:651)
> at 
> org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:191)
> at 
> org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:175)
> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:356)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
> at org.apache.hadoop.mapred.Child.main(Child.java:170)
> Alok

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-975) Need a databag that does not register with SpillableMemoryManager and spill data pro-actively

2009-09-25 Thread Ying He (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ying He updated PIG-975:


Attachment: PIG-975.patch4

Added a switch to the old bag.  Setting the property 
pig.cachedbag.type=default switches back to the old default bag. If not 
specified, InternalCachedBag is used.

> Need a databag that does not register with SpillableMemoryManager and spill 
> data pro-actively
> -
>
> Key: PIG-975
> URL: https://issues.apache.org/jira/browse/PIG-975
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.2.0
>Reporter: Ying He
>Assignee: Ying He
> Fix For: 0.2.0
>
> Attachments: internalbag.xls, PIG-975.patch, PIG-975.patch2, 
> PIG-975.patch3, PIG-975.patch4
>
>
> POPackage uses DefaultDataBag during the reduce process to hold data. It is 
> registered with SpillableMemoryManager and prone to OutOfMemoryException.  
> It's better to pro-actively manage the usage of memory. The bag fills 
> memory to a specified amount and dumps the rest to disk.  The amount of 
> memory to hold tuples is configurable. This can avoid out of memory errors.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-958) Splitting output data on key field

2009-09-25 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12759742#action_12759742
 ] 

Pradeep Kamath commented on PIG-958:


The release audit warning, I think, is related to a missing Apache license 
header comment. Can you add the Apache header by pasting it from some other 
source file in svn? Every file needs to have the Apache header as a comment 
at the beginning, so you will need to add it to the beginning of both the 
source and test files. Also, if you agree with any of the review comments, 
you can incorporate those changes when you submit the next version of the 
patch.

> Splitting output data on key field
> --
>
> Key: PIG-958
> URL: https://issues.apache.org/jira/browse/PIG-958
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.4.0
>Reporter: Ankur
> Attachments: 958.v2.patch
>
>
> Pig users often face the need to split the output records into a bunch of 
> files and directories depending on the type of record. Pig's SPLIT operator 
> is useful when record types are few and known in advance. In cases where type 
> is not directly known but is derived dynamically from values of a key field 
> in the output tuple, a custom store function is a better solution.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-977) exit status does not account for JOB_STATUS.TERMINATED

2009-09-25 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12759739#action_12759739
 ] 

Pradeep Kamath commented on PIG-977:


It does look like we only use COMPLETED and FAILED - +1 to remove the other 
unused states - we can add them back when the need arises.

> exit status does not account for JOB_STATUS.TERMINATED
> --
>
> Key: PIG-977
> URL: https://issues.apache.org/jira/browse/PIG-977
> Project: Pig
>  Issue Type: Bug
>Reporter: Thejas M Nair
>
> For determining the exit status of pig query, only JOB_STATUS.FAILED is being 
> used and status TERMINATED is ignored.
> I think the reason for this is that in  ExecJob.JOB_STATUS only FAILED and 
> COMPLETED are being used anywhere. Rest are unused. I think we should comment 
> out the unused parts for now to indicate that, or fix the code  for 
> determining success/failure in GruntParser. executeBatch 
> {code}
> public enum JOB_STATUS {
> QUEUED,
> RUNNING,
> SUSPENDED,
> TERMINATED,
> FAILED,
> COMPLETED,
> }
> {code}
> {code}
> private void executeBatch() throws IOException {
> if (mPigServer.isBatchOn()) {
> if (mExplain != null) {
> explainCurrentBatch();
> }
> if (!mLoadOnly) {
> List jobs = mPigServer.executeBatch();
> for(ExecJob job: jobs) {
> == >  if (job.getStatus() == ExecJob.JOB_STATUS.FAILED) {
> mNumFailedJobs++;
> if (job.getException() != null) {
> LogUtils.writeLog(
>   job.getException(), 
>   
> mPigServer.getPigContext().getProperties().getProperty("pig.logfile"), 
>   log, 
>   
> "true".equalsIgnoreCase(mPigServer.getPigContext().getProperties().getProperty("verbose")),
>   "Pig Stack Trace");
> }
> }
> else {
> mNumSucceededJobs++;
> }
> }
> }
> }
> }
> {code}
> Any opinions ?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-970) Support of HBase 0.20.0

2009-09-25 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-970:
---

Attachment: TEST-org.apache.pig.test.TestHBaseStorage.txt
pig-hbase-20-v2.patch

The issue was the missing Zookeeper lib.  I added that, and now I get what 
looks like a real hbase error.  I have no idea what it means, so I'll let you 
take a look.  I've attached both a new patch (with the changes to build.xml to 
pick up the right libs) and the error log from the test run.

> Support of HBase 0.20.0
> ---
>
> Key: PIG-970
> URL: https://issues.apache.org/jira/browse/PIG-970
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.3.0
>Reporter: Vincent BARAT
> Attachments: build.xml.path, pig-hbase-0.20.0-support.patch, 
> pig-hbase-20-v2.patch, TEST-org.apache.pig.test.TestHBaseStorage.txt
>
>
> The support of HBase is currently very limited and restricted to HBase 0.18.0.
> Because the next releases of PIG will support Hadoop 0.20.0, they should also 
> support HBase 0.20.0.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-958) Splitting output data on key field

2009-09-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12759733#action_12759733
 ] 

Hadoop QA commented on PIG-958:
---

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12420264/958.v2.patch
  against trunk revision 818929.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

-1 release audit.  The applied patch generated 281 release audit warnings 
(more than the trunk's current 279 warnings).

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/46/testReport/
Release audit warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/46/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/46/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/46/console

This message is automatically generated.

> Splitting output data on key field
> --
>
> Key: PIG-958
> URL: https://issues.apache.org/jira/browse/PIG-958
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.4.0
>Reporter: Ankur
> Attachments: 958.v2.patch
>
>
> Pig users often face the need to split the output records into a bunch of 
> files and directories depending on the type of record. Pig's SPLIT operator 
> is useful when record types are few and known in advance. In cases where type 
> is not directly known but is derived dynamically from values of a key field 
> in the output tuple, a custom store function is a better solution.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-970) Support of HBase 0.20.0

2009-09-25 Thread Vincent BARAT (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vincent BARAT updated PIG-970:
--

Attachment: build.xml.path

To show you better what I did on the jar file side, here is the patch I made 
to the build.xml file.

> Support of HBase 0.20.0
> ---
>
> Key: PIG-970
> URL: https://issues.apache.org/jira/browse/PIG-970
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.3.0
>Reporter: Vincent BARAT
> Attachments: build.xml.path, pig-hbase-0.20.0-support.patch
>
>
> The support of HBase is currently very limited and restricted to HBase 0.18.0.
> Because the next releases of PIG will support Hadoop 0.20.0, they should also 
> support HBase 0.20.0.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-970) Support of HBase 0.20.0

2009-09-25 Thread Vincent BARAT (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12759727#action_12759727
 ] 

Vincent BARAT commented on PIG-970:
---

Yes, but I was unable to make TestHBaseStorage work. I guess it was just a 
matter of environment, since the errors were related to classes not found.
I didn't waste too much time on that, actually...
I will try again.

> Support of HBase 0.20.0
> ---
>
> Key: PIG-970
> URL: https://issues.apache.org/jira/browse/PIG-970
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.3.0
>Reporter: Vincent BARAT
> Attachments: pig-hbase-0.20.0-support.patch
>
>
> The support of HBase is currently very limited and restricted to HBase 0.18.0.
> Because the next releases of PIG will support Hadoop 0.20.0, they should also 
> support HBase 0.20.0.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-977) exit status does not account for JOB_STATUS.TERMINATED

2009-09-25 Thread Thejas M Nair (JIRA)
exit status does not account for JOB_STATUS.TERMINATED
--

 Key: PIG-977
 URL: https://issues.apache.org/jira/browse/PIG-977
 Project: Pig
  Issue Type: Bug
Reporter: Thejas M Nair


For determining the exit status of a pig query, only JOB_STATUS.FAILED is 
used; the status TERMINATED is ignored.
I think the reason for this is that in ExecJob.JOB_STATUS, only FAILED and 
COMPLETED are used anywhere; the rest are unused. I think we should comment 
out the unused parts for now to indicate that, or fix the code for 
determining success/failure in GruntParser.executeBatch:

{code}
public enum JOB_STATUS {
    QUEUED,
    RUNNING,
    SUSPENDED,
    TERMINATED,
    FAILED,
    COMPLETED,
}
{code}
{code}
private void executeBatch() throws IOException {
    if (mPigServer.isBatchOn()) {
        if (mExplain != null) {
            explainCurrentBatch();
        }

        if (!mLoadOnly) {
            List<ExecJob> jobs = mPigServer.executeBatch();
            for (ExecJob job : jobs) {
==>             if (job.getStatus() == ExecJob.JOB_STATUS.FAILED) {
                    mNumFailedJobs++;
                    if (job.getException() != null) {
                        LogUtils.writeLog(
                            job.getException(),
                            mPigServer.getPigContext().getProperties().getProperty("pig.logfile"),
                            log,
                            "true".equalsIgnoreCase(mPigServer.getPigContext().getProperties().getProperty("verbose")),
                            "Pig Stack Trace");
                    }
                } else {
                    mNumSucceededJobs++;
                }
            }
        }
    }
}

{code}
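If the second option (fixing GruntParser) is taken, the check at the marked line would need to treat TERMINATED as a failure as well. A minimal sketch (only the enum is taken from the code above; the isFailed helper is hypothetical):

```java
// Sketch of the proposed GruntParser fix: count TERMINATED as a failure
// alongside FAILED. The enum is copied from above; isFailed is hypothetical.
public class JobStatusCheck {

    enum JOB_STATUS { QUEUED, RUNNING, SUSPENDED, TERMINATED, FAILED, COMPLETED }

    static boolean isFailed(JOB_STATUS status) {
        return status == JOB_STATUS.FAILED || status == JOB_STATUS.TERMINATED;
    }

    public static void main(String[] args) {
        int numFailedJobs = 0, numSucceededJobs = 0;
        for (JOB_STATUS s : new JOB_STATUS[]{JOB_STATUS.COMPLETED, JOB_STATUS.TERMINATED}) {
            if (isFailed(s)) {
                numFailedJobs++;
            } else {
                numSucceededJobs++;
            }
        }
        System.out.println(numFailedJobs + " failed, " + numSucceededJobs + " succeeded");
    }
}
```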

Any opinions ?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



RE: [VOTE] Release Pig 0.4.0 (candidate 2)

2009-09-25 Thread Olga Natkovich
With 3 +1s from Hadoop PMC (Alan Gates, Raghu Angadi, and Olga
Natkovich) and no -1s, the release passed the vote. I will be working on
rolling it out next.

Olga 

-Original Message-
From: Raghu Angadi [mailto:rang...@yahoo-inc.com] 
Sent: Tuesday, September 22, 2009 4:12 PM
To: priv...@hadoop.apache.org
Cc: pig-dev@hadoop.apache.org
Subject: Re: [VOTE] Release Pig 0.4.0 (candidate 2)


+1. Ran 'ant test-core'.

contrib/zebra: 'ant test' passed after following the directions as suggested: 
got a patch from PIG-660, and hadoop20.jar from PIG-833. For clarity we might 
attach a patch suitable for PIG-660 for 0.4.

Raghu.

Olga Natkovich wrote:
> Hi,
> 
> The new version is available in
> http://people.apache.org/~olga/pig-0.4.0-candidate-2/.
> 
> I see one failure in a unit test in piggybank (contrib.) but it is not
> related to the functions themselves but seems to be an issue with
> MiniCluster and I don't feel we need to chase this down. I made sure
> that the same test runs ok with Hadoop 20.
> 
> Please, vote by end of day on Thursday, 9/24.
> 
> Olga
> 
> -Original Message-
> From: Olga Natkovich [mailto:ol...@yahoo-inc.com] 
> Sent: Thursday, September 17, 2009 12:09 PM
> To: pig-dev@hadoop.apache.org; priv...@hadoop.apache.org
> Subject: [VOTE] Release Pig 0.4.0 (candidate 1)
> 
> Hi,
> 
> I have fixed the issue causing the failure that Alan reported.
> 
> Please test the new release:
> http://people.apache.org/~olga/pig-0.4.0-candidate-1/.
> 
> Vote closes on Tuesday, 9/22.
> 
> Olga
> 
> 
> -Original Message-
> From: Olga Natkovich [mailto:ol...@yahoo-inc.com] 
> Sent: Monday, September 14, 2009 2:06 PM
> To: pig-dev@hadoop.apache.org; priv...@hadoop.apache.org
> Subject: [VOTE] Release Pig 0.4.0 (candidate 0)
> 
> Hi,
> 
>  
> 
> I created a candidate build for Pig 0.4.0 release. The highlights of
> this release are
> 
>  
> 
> -  Performance improvements, especially in the area of JOIN support, where 
> we introduced two new join types: skew join to deal with data skew and sort 
> merge join to take advantage of sorted data sets.
> 
> -  Support for outer join.
> 
> -  Works with Hadoop 18.
> 
>  
> 
> I ran the release audit and the rat report looked fine. The relevant part
> is attached below.
> 
>  
> 
> Keys used to sign the release are available at
> http://svn.apache.org/viewvc/hadoop/pig/trunk/KEYS?view=markup.
> 
>  
> 
> Please download the release and try it out:
> http://people.apache.org/~olga/pig-0.4.0-candidate-0.
> 
>  
> 
> Should we release this? Vote closes on Thursday, 9/17.
> 
>  
> 
> Olga
> 
>  
> 
>  
> 
>  [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/contrib/CHANGES.txt
>  [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/contrib/zebra/CHANGES.txt
>  [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/broken-links.xml
>  [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/cookbook.html
>  [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/index.html
>  [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/linkmap.html
>  [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/piglatin_reference.html
>  [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/piglatin_users.html
>  [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/setup.html
>  [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/tutorial.html
>  [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/udf.html
>  [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/api/package-list
>  [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/jdiff/changes.html
>  [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/jdiff/missingSinces.txt
>  [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/jdiff/user_comments_for_pig_0.3.1_to_pig_0.5.0-dev.xml
>  [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/jdiff/changes/alldiffs_index_additions.html
>  [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/jdiff/changes/alldiffs_index_all.html
>  [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/jdiff/changes/alldiffs_index_changes.html
>  [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/jdiff/changes/alldiffs_index_removals.html
>  [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/jdiff/changes/changes-summary.html
>  [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/jdiff/changes/classes_index_additions.html
>  [java]  !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/jd

[jira] Commented: (PIG-975) Need a databag that does not register with SpillableMemoryManager and spill data pro-actively

2009-09-25 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12759689#action_12759689
 ] 

Olga Natkovich commented on PIG-975:


Ying, what Pradeep is asking for is more like a safety switch - to give users a 
way to go back to the old implementation if they run into problems with the new 
one. Once we verify that the new code is as stable as the old, we would remove 
the switch. We would also not expose it to users unless they do run into trouble.

> Need a databag that does not register with SpillableMemoryManager and spill 
> data pro-actively
> -
>
> Key: PIG-975
> URL: https://issues.apache.org/jira/browse/PIG-975
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.2.0
>Reporter: Ying He
>Assignee: Ying He
> Fix For: 0.2.0
>
> Attachments: internalbag.xls, PIG-975.patch, PIG-975.patch2, 
> PIG-975.patch3
>
>
> POPackage uses DefaultDataBag during the reduce process to hold data. It is 
> registered with SpillableMemoryManager and prone to OutOfMemoryException.  
> It's better to pro-actively manage the usage of the memory: the bag fills 
> memory to a specified amount and dumps the rest to disk. The amount of 
> memory to hold tuples is configurable. This can avoid out of memory errors.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
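The pro-active spilling idea discussed in this issue can be illustrated with a toy sketch (not the actual PIG-975 patch; the class and names here are invented for illustration, and plain strings stand in for Pig tuples). The bag keeps at most a fixed number of records in memory and writes the overflow straight to a temp file, instead of waiting for a memory manager to request a spill:

```java
import java.io.*;
import java.util.*;

// Toy sketch of a pro-actively spilling bag: at most `cacheLimit` records
// stay in memory; everything beyond that goes straight to a spill file.
class ProactiveBag implements Iterable<String> {
    private final int cacheLimit;
    private final List<String> memory = new ArrayList<>();
    private File spillFile;
    private PrintWriter spillOut;
    private int spilled = 0;

    ProactiveBag(int cacheLimit) { this.cacheLimit = cacheLimit; }

    void add(String record) throws IOException {
        if (memory.size() < cacheLimit) {
            memory.add(record);                 // still under the in-memory cap
        } else {
            if (spillOut == null) {             // open the spill file lazily
                spillFile = File.createTempFile("bag", ".spill");
                spillFile.deleteOnExit();
                spillOut = new PrintWriter(new BufferedWriter(new FileWriter(spillFile)));
            }
            spillOut.println(record);           // overflow goes to disk immediately
            spilled++;
        }
    }

    long size() { return memory.size() + spilled; }

    public Iterator<String> iterator() {
        if (spillOut != null) spillOut.flush();
        List<String> all = new ArrayList<>(memory);
        if (spillFile != null) {                // replay the spilled records
            try (BufferedReader r = new BufferedReader(new FileReader(spillFile))) {
                String line;
                while ((line = r.readLine()) != null) all.add(line);
            } catch (IOException e) { throw new UncheckedIOException(e); }
        }
        return all.iterator();
    }
}

public class Demo {
    public static void main(String[] args) throws IOException {
        ProactiveBag bag = new ProactiveBag(2);
        for (int i = 0; i < 5; i++) bag.add("t" + i);
        System.out.println(bag.size());   // 5: 2 in memory, 3 on disk
        int n = 0;
        for (String s : bag) n++;
        System.out.println(n);            // 5: iteration sees both halves
    }
}
```

The real implementation would additionally serialize tuples in Pig's binary format and account for memory in bytes rather than record counts; the point of the sketch is only the add-path decision that makes spilling deterministic instead of manager-driven.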



[jira] Commented: (PIG-975) Need a databag that does not register with SpillableMemoryManager and spill data pro-actively

2009-09-25 Thread Ying He (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12759681#action_12759681
 ] 

Ying He commented on PIG-975:
-

I think this is too implementation-specific to expose to end users. Frankly, I 
don't think users care which class we use for the data bags. 

> Need a databag that does not register with SpillableMemoryManager and spill 
> data pro-actively
> -
>
> Key: PIG-975
> URL: https://issues.apache.org/jira/browse/PIG-975
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.2.0
>Reporter: Ying He
>Assignee: Ying He
> Fix For: 0.2.0
>
> Attachments: internalbag.xls, PIG-975.patch, PIG-975.patch2, 
> PIG-975.patch3
>
>
> POPackage uses DefaultDataBag during the reduce process to hold data. It is 
> registered with SpillableMemoryManager and prone to OutOfMemoryException.  
> It's better to pro-actively manage the usage of the memory: the bag fills 
> memory to a specified amount and dumps the rest to disk. The amount of 
> memory to hold tuples is configurable. This can avoid out of memory errors.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-942) Maps are not implicitly casted

2009-09-25 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-942:
---

Resolution: Fixed
Status: Resolved  (was: Patch Available)

Unit test was present in the original patch. 

Patch committed to trunk.

> Maps are not implicitly casted
> --
>
> Key: PIG-942
> URL: https://issues.apache.org/jira/browse/PIG-942
> Project: Pig
>  Issue Type: Bug
>Reporter: Sriranjan Manjunath
>Assignee: Pradeep Kamath
> Fix For: 0.6.0
>
> Attachments: PIG-942-2.patch, PIG-942.patch
>
>
> A = load 'foo' as (m) throws the following exception when foo has maps.
> java.lang.ClassCastException: org.apache.pig.data.DataByteArray cannot be 
> cast to java.util.Map
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POMapLookUp.getNext(POMapLookUp.java:98)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POMapLookUp.getNext(POMapLookUp.java:115)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:612)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:278)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:204)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:231)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:240)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:249)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:240)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:93)
> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
> at 
> org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2198)
> The same works if I explicitly cast m to a map: A = load 'foo' as (m:[])

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-958) Splitting output data on key field

2009-09-25 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-958:
---

Status: Open  (was: Patch Available)

> Splitting output data on key field
> --
>
> Key: PIG-958
> URL: https://issues.apache.org/jira/browse/PIG-958
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.4.0
>Reporter: Ankur
> Attachments: 958.v2.patch
>
>
> Pig users often face the need to split the output records into a bunch of 
> files and directories depending on the type of record. Pig's SPLIT operator 
> is useful when record types are few and known in advance. In cases where type 
> is not directly known but is derived dynamically from values of a key field 
> in the output tuple, a custom store function is a better solution.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
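The key-based splitting described in this issue can be sketched in a few lines (a hypothetical illustration, not the attached 958.v2.patch; the class name and tab-separated record layout are assumptions). Records are bucketed by the value of a key field, with each bucket destined for its own output file or directory:

```java
import java.util.*;

// Toy sketch of splitting output records on a key field: each distinct key
// value gets its own bucket, which a store function would write to its own
// file or directory (e.g. out/<key>/part-00000).
public class SplitByKey {
    public static Map<String, List<String>> split(List<String[]> records, int keyIndex) {
        Map<String, List<String>> byKey = new TreeMap<>();
        for (String[] rec : records) {
            // Route the record to the bucket named after its key field.
            byKey.computeIfAbsent(rec[keyIndex], k -> new ArrayList<>())
                 .add(String.join("\t", rec));
        }
        return byKey;   // one map entry per output file
    }

    public static void main(String[] args) {
        List<String[]> recs = Arrays.asList(
            new String[]{"us", "a", "1"},
            new String[]{"uk", "b", "2"},
            new String[]{"us", "c", "3"});
        Map<String, List<String>> parts = split(recs, 0);
        System.out.println(parts.keySet());   // [uk, us]
    }
}
```

This covers exactly the case the description mentions: the set of keys is not known in advance, so the buckets (and hence the output files) are created dynamically as values are encountered.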



[jira] Updated: (PIG-958) Splitting output data on key field

2009-09-25 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-958:
---

Status: Patch Available  (was: Open)

> Splitting output data on key field
> --
>
> Key: PIG-958
> URL: https://issues.apache.org/jira/browse/PIG-958
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.4.0
>Reporter: Ankur
> Attachments: 958.v2.patch
>
>
> Pig users often face the need to split the output records into a bunch of 
> files and directories depending on the type of record. Pig's SPLIT operator 
> is useful when record types are few and known in advance. In cases where the 
> type is not directly known but is derived dynamically from the values of a key 
> field in the output tuple, a custom store function is a better solution.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-975) Need a databag that does not register with SpillableMemoryManager and spill data pro-actively

2009-09-25 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12759645#action_12759645
 ] 

Pradeep Kamath commented on PIG-975:


I think it might be a good idea to have a config parameter (maybe a java -D 
property) that would allow users to choose between spillableBagForReduce and 
NonSpillableBagForReduce, with the non-spillable one being the default. This 
way, if for some reason users find the spillable bag better for their query, 
they can use it.
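The -D switch suggested above could look like the sketch below (the property name "pig.cachedbag.spillable" and both class names are hypothetical, chosen only for illustration). `Boolean.getBoolean` reads a JVM system property, so `java -Dpig.cachedbag.spillable=true` would flip the factory back to the old implementation without any code change:

```java
// Hedged sketch of a system-property safety switch between two bag
// implementations; property and class names are invented for illustration.
public class BagFactory {
    interface Bag { }
    static class NonSpillableBag implements Bag { }   // new default
    static class SpillableBag implements Bag { }      // old behavior, opt-in

    static Bag newReduceBag() {
        // "java -Dpig.cachedbag.spillable=true ..." selects the old bag;
        // Boolean.getBoolean returns false when the property is unset.
        boolean useOld = Boolean.getBoolean("pig.cachedbag.spillable");
        return useOld ? new SpillableBag() : new NonSpillableBag();
    }

    public static void main(String[] args) {
        System.out.println(newReduceBag().getClass().getSimpleName());
    }
}
```

Because the switch lives behind a factory method, removing it later (once the new code is proven stable) touches one line rather than every call site.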

> Need a databag that does not register with SpillableMemoryManager and spill 
> data pro-actively
> -
>
> Key: PIG-975
> URL: https://issues.apache.org/jira/browse/PIG-975
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.2.0
>Reporter: Ying He
>Assignee: Ying He
> Fix For: 0.2.0
>
> Attachments: internalbag.xls, PIG-975.patch, PIG-975.patch2, 
> PIG-975.patch3
>
>
> POPackage uses DefaultDataBag during the reduce process to hold data. It is 
> registered with SpillableMemoryManager and prone to OutOfMemoryException.  
> It's better to pro-actively manage the usage of the memory: the bag fills 
> memory to a specified amount and dumps the rest to disk. The amount of 
> memory to hold tuples is configurable. This can avoid out of memory errors.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-970) Support of HBase 0.20.0

2009-09-25 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12759626#action_12759626
 ] 

Alan Gates commented on PIG-970:


In addition to adding hbase-0.20.0.jar to the lib directory, did you also add 
hbase-0.20.0-test?  

> Support of HBase 0.20.0
> ---
>
> Key: PIG-970
> URL: https://issues.apache.org/jira/browse/PIG-970
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.3.0
>Reporter: Vincent BARAT
> Attachments: pig-hbase-0.20.0-support.patch
>
>
> Support for HBase is currently very limited and restricted to HBase 0.18.0.
> Because the next releases of PIG will support Hadoop 0.20.0, they should also 
> support HBase 0.20.0.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-975) Need a databag that does not register with SpillableMemoryManager and spill data pro-actively

2009-09-25 Thread Ying He (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ying He updated PIG-975:


Attachment: internalbag.xls

performance numbers 

> Need a databag that does not register with SpillableMemoryManager and spill 
> data pro-actively
> -
>
> Key: PIG-975
> URL: https://issues.apache.org/jira/browse/PIG-975
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.2.0
>Reporter: Ying He
>Assignee: Ying He
> Fix For: 0.2.0
>
> Attachments: internalbag.xls, PIG-975.patch, PIG-975.patch2, 
> PIG-975.patch3
>
>
> POPackage uses DefaultDataBag during the reduce process to hold data. It is 
> registered with SpillableMemoryManager and prone to OutOfMemoryException.  
> It's better to pro-actively manage the usage of the memory: the bag fills 
> memory to a specified amount and dumps the rest to disk. The amount of 
> memory to hold tuples is configurable. This can avoid out of memory errors.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-975) Need a databag that does not register with SpillableMemoryManager and spill data pro-actively

2009-09-25 Thread Ying He (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ying He updated PIG-975:


Attachment: PIG-975.patch3

remove synchronization

> Need a databag that does not register with SpillableMemoryManager and spill 
> data pro-actively
> -
>
> Key: PIG-975
> URL: https://issues.apache.org/jira/browse/PIG-975
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.2.0
>Reporter: Ying He
>Assignee: Ying He
> Fix For: 0.2.0
>
> Attachments: PIG-975.patch, PIG-975.patch2, PIG-975.patch3
>
>
> POPackage uses DefaultDataBag during the reduce process to hold data. It is 
> registered with SpillableMemoryManager and prone to OutOfMemoryException.  
> It's better to pro-actively manage the usage of the memory: the bag fills 
> memory to a specified amount and dumps the rest to disk. The amount of 
> memory to hold tuples is configurable. This can avoid out of memory errors.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-942) Maps are not implicitly casted

2009-09-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12759509#action_12759509
 ] 

Hadoop QA commented on PIG-942:
---

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12420393/PIG-942-2.patch
  against trunk revision 818175.

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no tests are needed for this patch.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/45/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/45/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/45/console

This message is automatically generated.

> Maps are not implicitly casted
> --
>
> Key: PIG-942
> URL: https://issues.apache.org/jira/browse/PIG-942
> Project: Pig
>  Issue Type: Bug
>Reporter: Sriranjan Manjunath
>Assignee: Pradeep Kamath
> Fix For: 0.6.0
>
> Attachments: PIG-942-2.patch, PIG-942.patch
>
>
> A = load 'foo' as (m) throws the following exception when foo has maps.
> java.lang.ClassCastException: org.apache.pig.data.DataByteArray cannot be 
> cast to java.util.Map
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POMapLookUp.getNext(POMapLookUp.java:98)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POMapLookUp.getNext(POMapLookUp.java:115)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:612)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:278)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:204)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:231)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:240)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:249)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:240)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:93)
> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
> at 
> org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2198)
> The same works if I explicitly cast m to a map: A = load 'foo' as (m:[])

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-976) Multi-query optimization throws ClassCastException

2009-09-25 Thread Ankur (JIRA)
Multi-query optimization throws ClassCastException
--

 Key: PIG-976
 URL: https://issues.apache.org/jira/browse/PIG-976
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.4.0
Reporter: Ankur


Multi-query optimization fails to merge two branches when one is the result of a 
GROUP ... ALL and the other is the result of a GROUP BY field1, where field1 is 
of type long. Here is the script that fails with multi-query on.

data = LOAD 'test' USING PigStorage('\t') AS (a:long, b:double, c:double); 
A = GROUP data ALL;
B = FOREACH A GENERATE SUM(data.b) AS sum1, SUM(data.c) AS sum2;
C = FOREACH B GENERATE (sum1/sum2) AS rate; 
STORE C INTO 'result1';

D = GROUP data BY a; 
E = FOREACH D GENERATE group AS a, SUM(data.b), SUM(data.c);
STORE E into 'result2';
 
Here is the exception from the logs

java.lang.ClassCastException: org.apache.pig.data.DefaultTuple cannot be cast 
to org.apache.pig.data.DataBag
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:399)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:180)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.processInput(POUserFunc.java:145)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:197)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:235)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:254)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:204)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:231)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:240)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.PODemux.runPipeline(PODemux.java:264)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.PODemux.getNext(PODemux.java:254)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.processOnePackageOutput(PigCombiner.java:196)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:174)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:63)
at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.combineAndSpill(MapTask.java:906)
at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:786)
at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:698)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:228)
at 
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2206)


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-942) Maps are not implicitly casted

2009-09-25 Thread Giridharan Kesavan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giridharan Kesavan updated PIG-942:
---

Status: Patch Available  (was: Open)

> Maps are not implicitly casted
> --
>
> Key: PIG-942
> URL: https://issues.apache.org/jira/browse/PIG-942
> Project: Pig
>  Issue Type: Bug
>Reporter: Sriranjan Manjunath
>Assignee: Pradeep Kamath
> Fix For: 0.6.0
>
> Attachments: PIG-942-2.patch, PIG-942.patch
>
>
> A = load 'foo' as (m) throws the following exception when foo has maps.
> java.lang.ClassCastException: org.apache.pig.data.DataByteArray cannot be 
> cast to java.util.Map
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POMapLookUp.getNext(POMapLookUp.java:98)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POMapLookUp.getNext(POMapLookUp.java:115)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:612)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:278)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:204)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:231)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:240)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:249)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:240)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:93)
> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
> at 
> org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2198)
> The same works if I explicitly cast m to a map: A = load 'foo' as (m:[])

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-942) Maps are not implicitly casted

2009-09-25 Thread Giridharan Kesavan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giridharan Kesavan updated PIG-942:
---

Status: Open  (was: Patch Available)

> Maps are not implicitly casted
> --
>
> Key: PIG-942
> URL: https://issues.apache.org/jira/browse/PIG-942
> Project: Pig
>  Issue Type: Bug
>Reporter: Sriranjan Manjunath
>Assignee: Pradeep Kamath
> Fix For: 0.6.0
>
> Attachments: PIG-942-2.patch, PIG-942.patch
>
>
> A = load 'foo' as (m) throws the following exception when foo has maps.
> java.lang.ClassCastException: org.apache.pig.data.DataByteArray cannot be 
> cast to java.util.Map
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POMapLookUp.getNext(POMapLookUp.java:98)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POMapLookUp.getNext(POMapLookUp.java:115)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:612)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:278)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:204)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:231)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:240)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:249)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:240)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:93)
> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
> at 
> org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2198)
> The same works if I explicitly cast m to a map: A = load 'foo' as (m:[])

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.