[jira] Updated: (PIG-1537) Column pruner causes wrong results when using both Custom Store UDF and PigStorage

2010-08-05 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1537:


 Assignee: Daniel Dai
Fix Version/s: 0.8.0

Daniel, can we test whether this is a problem with 0.8?

Viraj, is this data-specific, and if so, can you provide data to reproduce it? 
Also, do you know which run produces the correct results?

> Column pruner causes wrong results when using both Custom Store UDF and 
> PigStorage
> --
>
> Key: PIG-1537
> URL: https://issues.apache.org/jira/browse/PIG-1537
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Viraj Bhat
>Assignee: Daniel Dai
> Fix For: 0.8.0
>
>
> I have a script of this pattern, which uses 2 StoreFuncs:
> {code}
> register loader.jar
> register piggy-bank/java/build/storage.jar;
> %DEFAULT OUTPUTDIR /user/viraj/prunecol/
> ss_sc_0 = LOAD '/data/click/20100707/0' USING Loader() AS (a, b, c);
> ss_sc_filtered_0 = FILTER ss_sc_0 BY
> a#'id' matches '1.*' OR
> a#'id' matches '2.*' OR
> a#'id' matches '3.*' OR
> a#'id' matches '4.*';
> ss_sc_1 = LOAD '/data/click/20100707/1' USING Loader() AS (a, b, c);
> ss_sc_filtered_1 = FILTER ss_sc_1 BY
> a#'id' matches '65.*' OR
> a#'id' matches '466.*' OR
> a#'id' matches '043.*' OR
> a#'id' matches '044.*' OR
> a#'id' matches '0650.*' OR
> a#'id' matches '001.*';
> ss_sc_all = UNION ss_sc_filtered_0,ss_sc_filtered_1;
> ss_sc_all_proj = FOREACH ss_sc_all GENERATE
> a#'query' as query,
> a#'testid' as testid,
> a#'timestamp' as timestamp,
> a,
> b,
> c;
> ss_sc_all_ord = ORDER ss_sc_all_proj BY query,testid,timestamp PARALLEL 10;
> ss_sc_all_map = FOREACH ss_sc_all_ord  GENERATE a, b, c;
> STORE ss_sc_all_map INTO '$OUTPUTDIR/data/20100707' using Storage();
> ss_sc_all_map_count = group ss_sc_all_map all;
> count = FOREACH ss_sc_all_map_count GENERATE 'record_count' as record_count,COUNT($1);
> STORE count INTO '$OUTPUTDIR/count/20100707' using PigStorage('\u0009');
> {code}
> I run this script using:
> a) java -cp pig0.7.jar script.pig
> b) java -cp pig0.7.jar -t PruneColumns script.pig
> What I observe is that the alias "count" produces the same number of records 
> in both runs, but "ss_sc_all_map" has different sizes when run with the above 2 options.
> Is this due to the fact that 2 StoreFuncs are used?
> Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1199) help includes obsolete options

2010-08-05 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1199:


Attachment: PIG-1199_2.patch

wording cleanup, thanks Corinne!

> help includes obsolete options
> --
>
> Key: PIG-1199
> URL: https://issues.apache.org/jira/browse/PIG-1199
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.6.0
>Reporter: Olga Natkovich
>Assignee: Olga Natkovich
> Fix For: 0.8.0
>
> Attachments: PIG-1199.patch, PIG-1199_2.patch
>
>
> This is confusing to users




[jira] Updated: (PIG-1199) help includes obsolete options

2010-08-05 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1199:


Status: Resolved  (was: Patch Available)
Resolution: Fixed

patch committed




[jira] Updated: (PIG-346) Grunt (help) commands

2010-08-05 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-346:
---

Attachment: PIG-346_2.patch

Changes based on review from Corinne, thanks!

> Grunt (help) commands 
> --
>
> Key: PIG-346
> URL: https://issues.apache.org/jira/browse/PIG-346
> Project: Pig
>  Issue Type: Bug
>Reporter: Corinne Chandel
>Assignee: Olga Natkovich
> Fix For: 0.8.0
>
> Attachments: PIG-346.patch, PIG-346_2.patch
>
>
> I think there are 22 grunt commands  and 2 different lists of the 
> commands can be displayed.
> I. Grunt commands displayed with "grunt> help"
> (1) put 22 grunt commands in alphabetical order
> (2) fix double entry for cd ... cd  and cd   keep cd 
> (3) fix notation for set key value ... set  ''
> (4) add explain
> (5) add illustrate
> (6) add help
> II. Grunt commands displayed with "grunt> asdf" 
> Entering "asdf" is a mistake and generates the message "Was expecting one of:" and a list of 
> grunt commands
> (1) put 22 grunt commands in alphabetical order
> (2) add define
> (3) add du
> 
> 22 Grunt commands in alphabetical order:
> cat 
> cd 
> copyFromLocal  
> copyToLocal  
> cp  
> define  
> describe 
> du 
> dump 
> explain
> help
> illustrate
> kill 
> ls 
> mkdir 
> mv  
> pwd
> quit
> register 
> rm 
> set  ''
> store  into  [using ]




[jira] Updated: (PIG-346) Grunt (help) commands

2010-08-05 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-346:
---

Status: Resolved  (was: Patch Available)
Resolution: Fixed

patch committed.




[jira] Commented: (PIG-1334) Make pig artifacts available through maven

2010-08-05 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12895806#action_12895806
 ] 

Olga Natkovich commented on PIG-1334:
-

This will be supported in all releases from 0.8 onward. For 0.7, we need a 
volunteer to backport it to the 0.7 branch.

> Make pig artifacts available through maven
> --
>
> Key: PIG-1334
> URL: https://issues.apache.org/jira/browse/PIG-1334
> Project: Pig
>  Issue Type: Improvement
>Reporter: Olga Natkovich
>Assignee: niraj rai
> Fix For: 0.8.0
>
> Attachments: mvn-pig.patch, mvn_pig_2.patch, mvn_pig_3.patch, 
> mvn_pig_4.patch
>
>





[jira] Commented: (PIG-1334) Make pig artifacts available through maven

2010-08-05 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12895808#action_12895808
 ] 

Olga Natkovich commented on PIG-1334:
-

Sounds great!




[jira] Commented: (PIG-565) Several builting functions no longer support bytearray

2010-08-05 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12895826#action_12895826
 ] 

Olga Natkovich commented on PIG-565:


ARITY has been deprecated for a while and the code looks completely wrong, so I 
am not going to fix that. SIZE, which replaced it, does the right thing.


> Several builting functions no longer support bytearray
> --
>
> Key: PIG-565
> URL: https://issues.apache.org/jira/browse/PIG-565
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.2.0
>Reporter: Olga Natkovich
>Assignee: Olga Natkovich
> Fix For: 0.8.0
>
>
> ARITY
> DIFF
> TOKENIZE
> All we need to do is to add lookup tables.




[jira] Commented: (PIG-565) Several builting functions no longer support bytearray

2010-08-05 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12895835#action_12895835
 ] 

Olga Natkovich commented on PIG-565:


DIFF handles all the types as expected. However, the documentation in the 0.7.0 
release is slightly off. 

Current docs: "The DIFF function compares two fields in a tuple. If the field 
values match, null is returned. If the field values do not match, the 
non-matching elements are returned."

Should say something like: 

"DIFF takes two bags as arguments and compares them.   Any tuples that are in 
one bag but not the other are returned in a bag. If the bags match, an empty bag 
is returned. If the fields are not bags, then they will be wrapped in tuples 
and returned in a bag if they do not match, or an empty bag will be returned if 
the two records match. The implementation assumes that both bags being passed 
to this function will fit entirely into memory simultaneously. If that is not 
the case, the UDF will still function, but it will be very slow."

I will reassign this bug to Corinne once I am done with it.
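
For readers who want to check the proposed DIFF wording against concrete inputs, here is a minimal Python sketch of the behaviour it describes. It is an illustration only, not Pig's implementation: bags are modelled as Python lists of tuples, the function name diff is hypothetical, and the ordering of the returned list is an artifact of the sketch (Pig bags are unordered).

```python
def diff(a, b):
    """Sketch of the DIFF semantics in the proposed documentation text.

    Bags are modelled as lists of tuples; anything else is a plain field.
    """
    if isinstance(a, list) and isinstance(b, list):
        in_a, in_b = set(a), set(b)
        # Tuples that appear in exactly one of the two bags.
        only_a = [t for t in a if t not in in_b]
        only_b = [t for t in b if t not in in_a]
        return only_a + only_b
    # Non-bag fields: an empty bag if they match, otherwise each value
    # is wrapped in a tuple and returned in a bag.
    if a == b:
        return []
    return [(a,), (b,)]


if __name__ == "__main__":
    print(diff([(1,), (2,)], [(2,), (3,)]))
    print(diff("x", "y"))
```

Both inputs are held fully in memory here, which mirrors the assumption stated in the proposed text.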







[jira] Commented: (PIG-565) Several builting functions no longer support bytearray

2010-08-05 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12895843#action_12895843
 ] 

Olga Natkovich commented on PIG-565:


I also verified that TOKENIZE works as expected with bytearrays.




[jira] Assigned: (PIG-565) Several builting functions no longer support bytearray

2010-08-05 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich reassigned PIG-565:
--

Assignee: Corinne Chandel  (was: Olga Natkovich)

Corinne, please update the DIFF description. Thanks!

> Several builting functions no longer support bytearray
> --
>
> Key: PIG-565
> URL: https://issues.apache.org/jira/browse/PIG-565
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.2.0
>Reporter: Olga Natkovich
>Assignee: Corinne Chandel
> Fix For: 0.8.0
>
>
> ARITY
> DIFF
> TOKENIZE
> All we need to do is to add lookup tables.




[jira] Commented: (PIG-1538) isTwoLevelAccessRequired() returns false for nested relation

2010-08-06 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12896100#action_12896100
 ] 

Olga Natkovich commented on PIG-1538:
-

This ticket has insufficient information. We need to understand the use case to 
decide how to solve the user's problem.

> isTwoLevelAccessRequired() returns false for nested relation
> 
>
> Key: PIG-1538
> URL: https://issues.apache.org/jira/browse/PIG-1538
> Project: Pig
>  Issue Type: Wish
>  Components: impl
>Affects Versions: 0.7.0
>Reporter: Justin Hu
>Priority: Minor
>
> A user depends on the isTwoLevelAccessRequired() method in a UDF and would like 
> the method to return TRUE for a nested schema (for example, a relation with a 
> nested tuple).




[jira] Updated: (PIG-1543) IsEmpty returns the wrong value after using LIMIT

2010-08-16 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1543:


Fix Version/s: 0.8.0

> IsEmpty returns the wrong value after using LIMIT
> -
>
> Key: PIG-1543
> URL: https://issues.apache.org/jira/browse/PIG-1543
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Justin Hu
> Fix For: 0.8.0
>
>
> 1. Two input files:
> 1a: limit_empty.input_a
> 1
> 1
> 1
> 1b: limit_empty.input_b
> 2
> 2
> 2.
> The pig script: limit_empty.pig
> -- A contains only 1's & B contains only 2's
> A = load 'limit_empty.input_a' as (a1:int);
> B = load 'limit_empty.input_b' as (b1:int);
> C = COGROUP A by a1, B by b1;
> D = FOREACH C generate A, B, (IsEmpty(A)? 0:1), (IsEmpty(B)? 0:1), COUNT(A), COUNT(B);
> store D into 'limit_empty.output/d';
> -- After the script is done, we see the right results:
> -- {(1),(1),(1)}   {}  1   0   3   0
> -- {} {(2),(2)}  0   1   0   2
> C1 = foreach C { Alim = limit A 1; Blim = limit B 1; generate Alim, Blim; }
> D1 = FOREACH C1 generate Alim,Blim, (IsEmpty(Alim)? 0:1), (IsEmpty(Blim)? 0:1), COUNT(Alim), COUNT(Blim);
> store D1 into 'limit_empty.output/d1';
> -- After the script is done, we see the unexpected results:
> -- {(1)}   {}1   1   1   0
> -- {}  {(2)} 1   1   0   1
> dump D;
> dump D1;
> 3. Run the script and redirect stdout (the 2 dumps) to a file. There are two issues:
> The major one:
> IsEmpty() returns FALSE for an empty bag in limit_empty.output/d1/*, while 
> IsEmpty() returns the correct value in limit_empty.output/d/*.
> The difference is that in the first case "LIMIT" has been applied before using 
> IsEmpty().
> The minor one:
> The redirected output only contains the first dump:
> ({(1),(1),(1)},{},1,0,3L,0L)
> ({},{(2),(2)},0,1,0L,2L)
> We expect two more lines like:
> ({(1)},{},1,1,1L,0L)
> ({},{(2)},1,1,0L,1L)
> In addition, this error is reported:
> [main] ERROR org.apache.pig.backend.hadoop.executionengine.HJob - 
> java.lang.ClassCastException: java.lang.Integer cannot be cast to 
> org.apache.pig.data.Tuple
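
To make the expected result of the report explicit, the following Python sketch models COGROUP, a nested LIMIT, and IsEmpty on plain lists. It is a toy model under stated assumptions, not Pig code: the helper names cogroup, limit, is_empty, and row are hypothetical, and bags are represented as lists of one-field tuples. With this model, the rows that D1 should contain are the ones the reporter expected, with a 0 wherever the limited bag is empty.

```python
def cogroup(a_values, b_values):
    """Group two single-column inputs by value, as COGROUP A by a1, B by b1."""
    keys = sorted(set(a_values) | set(b_values))
    return [
        (k, [(v,) for v in a_values if v == k], [(v,) for v in b_values if v == k])
        for k in keys
    ]


def limit(bag, n):
    """Keep at most n tuples; an empty bag stays empty."""
    return bag[:n]


def is_empty(bag):
    return len(bag) == 0


def row(bag_a, bag_b):
    """Mirror the GENERATE clause: bags, (IsEmpty? 0:1) flags, counts."""
    return (
        bag_a,
        bag_b,
        0 if is_empty(bag_a) else 1,
        0 if is_empty(bag_b) else 1,
        len(bag_a),
        len(bag_b),
    )


groups = cogroup([1, 1, 1], [2, 2])
d = [row(a, b) for _, a, b in groups]
d1 = [row(limit(a, 1), limit(b, 1)) for _, a, b in groups]

if __name__ == "__main__":
    for r in d + d1:
        print(r)
```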




[jira] Commented: (PIG-1544) proactive-spill bags should share the memory alloted for it

2010-08-16 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899162#action_12899162
 ] 

Olga Natkovich commented on PIG-1544:
-

One way to do this is to only use InternalCachedBags for the bags that we are 
aware of up front. Then we can have a visitor on the plan that counts the 
number of bags needed and divides memory accordingly.
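
The idea in this comment can be sketched roughly as follows, in Python for brevity. Everything here is an assumption made for illustration: the plan is a nested dict, the needs_bag flag and the functions count_bags and per_bag_limit are hypothetical names, and the memory fraction simply stands in for the value of pig.cachedbag.memusage. It is not how Pig's plan visitors are written.

```python
def count_bags(plan):
    """Walk a toy plan tree and count operators flagged as needing a bag."""
    count = 0
    stack = [plan]
    while stack:
        node = stack.pop()
        if node.get("needs_bag"):
            count += 1
        stack.extend(node.get("inputs", []))
    return count


def per_bag_limit(max_heap_bytes, memusage_fraction, num_bags):
    """Split the memory set aside for proactive-spill bags evenly among them."""
    if num_bags < 1:
        raise ValueError("num_bags must be at least 1")
    total = int(max_heap_bytes * memusage_fraction)
    return total // num_bags


# Hypothetical plan: three operators need a bag, one does not.
plan = {
    "needs_bag": True,
    "inputs": [
        {"needs_bag": True},
        {"inputs": [{"needs_bag": True}]},
    ],
}

if __name__ == "__main__":
    n = count_bags(plan)
    print(n, per_bag_limit(1_000_000, 0.25, n))
```

Counting up front only works for bags the planner can see, which is why the comment restricts the approach to bags known in advance.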

> proactive-spill bags should share the memory alloted for it
> ---
>
> Key: PIG-1544
> URL: https://issues.apache.org/jira/browse/PIG-1544
> Project: Pig
>  Issue Type: Bug
>Reporter: Thejas M Nair
>
> Initially, proactive-spill bags were designed for use in (co)group 
> (InternalCachedBag); they knew the total number of proactive bags that were 
> present and shared the memory limit specified using the property 
> pig.cachedbag.memusage.
> But the two proactive bag implementations that were added later, 
> InternalDistinctBag and InternalSortedBag, are not aware of the actual number of 
> bags being used; their users always assume total-numbags = 3. 
> This needs to be fixed so that all proactive-spill bags share the 
> memory limit.




[jira] Commented: (PIG-1544) proactive-spill bags should share the memory alloted for it

2010-08-16 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899227#action_12899227
 ] 

Olga Natkovich commented on PIG-1544:
-

We should not be using these bags for cases like UDFs, for exactly the reason 
you mention.




[jira] Commented: (PIG-1466) Improve log messages for memory usage

2010-08-17 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899489#action_12899489
 ] 

Olga Natkovich commented on PIG-1466:
-

Thejas, your proposal looks good

> Improve log messages for memory usage
> -
>
> Key: PIG-1466
> URL: https://issues.apache.org/jira/browse/PIG-1466
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.7.0
>Reporter: Ashutosh Chauhan
>Assignee: Thejas M Nair
>Priority: Minor
> Fix For: 0.8.0
>
>
> For anything more than a moderately sized dataset, Pig usually emits the following 
> messages:
> {code}
> 2010-05-27 18:28:31,659 INFO org.apache.pig.impl.util.SpillableMemoryManager: low memory handler called (Usage threshold exceeded) init = 4194304(4096K) used = 672012960(656262K) committed = 954466304(932096K) max = 954466304(932096K)
> 2010-05-27 18:10:52,653 INFO org.apache.pig.impl.util.SpillableMemoryManager: low memory handler called (Collection threshold exceeded) init = 4194304(4096K) used = 954466304(932096K) committed = 954466304(932096K) max = 954466304(932096K)
> {code}
> This seems to confuse users a lot. Once these messages are printed, users 
> tend to believe that Pig is having a hard time with memory, is spilling to disk, 
> etc., when in fact Pig might be cruising along with ease. We should be a little 
> more careful about what we print in logs. Currently these messages are printed when a 
> notification is sent by the JVM and some other conditions are met, which may not 
> necessarily indicate a low-memory condition. Furthermore, with 
> {{InternalCachedBag}} embraced everywhere in favor of {{DefaultBag}}, these 
> messages have lost their usefulness. At the very least, we should lower the 
> log level at which these are printed. 
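
As a rough illustration of the "lower the log level" suggestion, the sketch below uses Python's standard logging module as a stand-in for Pig's Java logging. The logger name, the handler function on_low_memory_notification, and the choice to keep an INFO message only when something was actually spilled are all assumptions made for the example; the issue itself only asks that the existing messages be demoted.

```python
import logging

logger = logging.getLogger("sketch.SpillableMemoryManager")


def on_low_memory_notification(used_bytes, max_bytes, spilled_anything):
    """Hypothetical handler invoked when the JVM reports a memory threshold."""
    # Routine notification: demoted to DEBUG so it no longer alarms users.
    logger.debug(
        "low memory handler called: used = %d max = %d", used_bytes, max_bytes
    )
    if spilled_anything:
        # Assumption: only an actual spill is worth reporting at INFO.
        logger.info("spilled bags to disk to free memory")


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    on_low_memory_notification(672012960, 954466304, False)
    on_low_memory_notification(954466304, 954466304, True)
```

With the logger configured at INFO, the first call prints nothing and the second prints a single line.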




[jira] Commented: (PIG-1544) proactive-spill bags should share the memory alloted for it

2010-08-17 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899529#action_12899529
 ] 

Olga Natkovich commented on PIG-1544:
-

So we should not use them in this case either. We should only use internal bags 
for things we know up front.




[jira] Commented: (PIG-1420) Make CONCAT act on all fields of a tuple, instead of just the first two fields of a tuple

2010-08-17 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899537#action_12899537
 ] 

Olga Natkovich commented on PIG-1420:
-

I could not figure out how to re-open this issue. However, the code does not 
work in a Pig script. The main reason is that the code that selects which 
function to use does not yet handle a variable number of arguments. 

grunt> A = load 'studentab10k' as (name: chararray, age: chararray, gpa: chararray);
grunt> B = foreach A generate CONCAT(name, age, gpa);
grunt> C = limit B 10;
grunt> dump C;
2010-08-17 12:17:41,635 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1045: Could not infer the matching function for org.apache.pig.builtin.CONCAT as multiple or none of them fit. Please use an explicit cast.
Details at logfile: /homes/olgan/pig_1282072550328.log
grunt>


> Make CONCAT act on all fields of a tuple, instead of just the first two 
> fields of a tuple
> -
>
> Key: PIG-1420
> URL: https://issues.apache.org/jira/browse/PIG-1420
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.8.0
>Reporter: Russell Jurney
>Assignee: Russell Jurney
> Fix For: 0.8.0
>
> Attachments: addconcat2.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> org.apache.pig.builtin.CONCAT (which acts on DataByteArrays internally) and 
> org.apache.pig.builtin.StringConcat (which acts on Strings internally) both 
> act on the first two fields of a tuple.  This results in ugly nested CONCAT 
> calls like:
> CONCAT(CONCAT(A, ' '), B)
> The more desirable form is:
> CONCAT(A, ' ', B)
> This change will be backwards compatible, provided that no one was relying on 
> the fact that CONCAT ignores fields after the first two in a tuple.  This 
> seems a reasonable assumption to make, or at least a small break in 
> compatibility for a sizable improvement.
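
The equivalence the description relies on, that a variadic CONCAT gives the same result as nesting two-argument calls, can be shown with a small Python sketch. The names concat2 and concat are hypothetical stand-ins, not Pig classes, and the sketch ignores null handling and Pig's function-matching logic entirely.

```python
from functools import reduce


def concat2(a, b):
    """Two-argument concatenation, standing in for the existing behaviour."""
    return a + b


def concat(*fields):
    """Variadic form: fold the two-argument version over all fields, left to right."""
    if len(fields) < 2:
        raise ValueError("concat needs at least two fields")
    return reduce(concat2, fields)


if __name__ == "__main__":
    nested = concat2(concat2("A", " "), "B")
    flat = concat("A", " ", "B")
    print(nested == flat, repr(flat))
```

The same fold works for byte strings, which loosely mirrors the bytearray and string variants mentioned in the description.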




[jira] Commented: (PIG-1420) Make CONCAT act on all fields of a tuple, instead of just the first two fields of a tuple

2010-08-17 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899580#action_12899580
 ] 

Olga Natkovich commented on PIG-1420:
-

This will make it work with bytearrays but not strings

> Make CONCAT act on all fields of a tuple, instead of just the first two 
> fields of a tuple
> -
>
> Key: PIG-1420
> URL: https://issues.apache.org/jira/browse/PIG-1420
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.8.0
>Reporter: Russell Jurney
>Assignee: Russell Jurney
> Fix For: 0.8.0
>
> Attachments: addconcat2.patch, PIG-1420.2.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> org.apache.pig.builtin.CONCAT (which acts on DataByteArrays internally) and 
> org.apache.pig.builtin.StringConcat (which acts on Strings internally) both 
> act on the first two fields of a tuple.  This results in ugly nested CONCAT 
> calls like:
> CONCAT(CONCAT(A, ' '), B)
> The more desirable form is:
> CONCAT(A, ' ', B)
> This change will be backwards compatible, provided that no one was relying on 
> the fact that CONCAT ignores fields after the first two in a tuple.  This 
> seems a reasonable assumption to make, or at least a small break in 
> compatibility for a sizable improvement.




[jira] Commented: (PIG-1447) Tune memory usage of InternalCachedBag

2010-08-17 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899581#action_12899581
 ] 

Olga Natkovich commented on PIG-1447:
-

Did you see any perf improvement?

> Tune memory usage of InternalCachedBag
> --
>
> Key: PIG-1447
> URL: https://issues.apache.org/jira/browse/PIG-1447
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.7.0
>Reporter: Daniel Dai
>Assignee: Thejas M Nair
> Fix For: 0.8.0
>
> Attachments: L15_modified.pig
>
>
> We need to find a better value for "pig.cachedbag.memusage".




[jira] Commented: (PIG-1420) Make CONCAT act on all fields of a tuple, instead of just the first two fields of a tuple

2010-08-17 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899592#action_12899592
 ] 

Olga Natkovich commented on PIG-1420:
-

The only way I know is to actually make the code handle a variable number of 
arguments, but I think it is too late for 0.8. Perhaps we can revisit this for 
0.9?




[jira] Commented: (PIG-1420) Make CONCAT act on all fields of a tuple, instead of just the first two fields of a tuple

2010-08-17 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899608#action_12899608
 ] 

Olga Natkovich commented on PIG-1420:
-

Options 2 and 3 are backward incompatible with 0.7, and we really don't want to break 
compatibility in this release. So I would propose option 1 now and a proper fix in 0.9.




[jira] Commented: (PIG-1524) 'Proactive spill count' is misleading

2010-08-18 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900089#action_12900089
 ] 

Olga Natkovich commented on PIG-1524:
-

I am reviewing this patch

> 'Proactive spill count' is misleading
> -
>
> Key: PIG-1524
> URL: https://issues.apache.org/jira/browse/PIG-1524
> Project: Pig
>  Issue Type: Bug
>Reporter: Thejas M Nair
>Assignee: Thejas M Nair
> Fix For: 0.8.0
>
> Attachments: PIG-1524.2.patch, PIG-1524.3.patch, PIG-1524.patch
>
>
> InternalCachedBag, InternalSortedBag, and InternalDistinctBag increment this 
> counter for every record that they write to disk once they exceed the memory 
> limit. This number is misleading.
> Instead, this counter should be incremented by 1 for each instance of these 
> bags that has spilled to disk.
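
The difference between the two counting schemes in the description can be seen in a toy Python model. The class SketchBag and the counter names records_spilled and bags_spilled are hypothetical; the "spill" is just a counter increment standing in for a disk write, and none of this reflects the actual bag classes.

```python
class SketchBag:
    """Toy bag that keeps up to max_in_memory records and 'spills' the rest."""

    def __init__(self, max_in_memory, counters):
        self.max_in_memory = max_in_memory
        self.counters = counters
        self.in_memory = []
        self.has_spilled = False

    def add(self, record):
        if len(self.in_memory) < self.max_in_memory:
            self.in_memory.append(record)
            return
        # Reported behaviour: the counter grows with every spilled record.
        self.counters["records_spilled"] += 1
        # Proposed behaviour: count each bag once, the first time it spills.
        if not self.has_spilled:
            self.has_spilled = True
            self.counters["bags_spilled"] += 1


counters = {"records_spilled": 0, "bags_spilled": 0}
big = SketchBag(2, counters)
small = SketchBag(2, counters)
for i in range(5):
    big.add(i)
small.add(0)

if __name__ == "__main__":
    print(counters)
```

Here one bag overflows by three records and the other never spills, so the per-record count and the per-bag count diverge.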




[jira] Commented: (PIG-1524) 'Proactive spill count' is misleading

2010-08-18 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900107#action_12900107
 ] 

Olga Natkovich commented on PIG-1524:
-

+1 with a couple of comment cleanups:

(1) The locking comment is misleading because we don't actually lock anything :)
(2) The comment about moving data from a list to an array for sorting also 
needs to be clarified.

Other than that, looks good. Please commit.




[jira] Commented: (PIG-1466) Improve log messages for memory usage

2010-08-19 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900379#action_12900379
 ] 

Olga Natkovich commented on PIG-1466:
-

I will review this patch today

> Improve log messages for memory usage
> -
>
> Key: PIG-1466
> URL: https://issues.apache.org/jira/browse/PIG-1466
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.7.0
>Reporter: Ashutosh Chauhan
>Assignee: Thejas M Nair
>Priority: Minor
> Fix For: 0.8.0
>
> Attachments: PIG-1466.patch
>
>
> For anything more than a moderately sized dataset, Pig usually spits out the 
> following messages:
> {code}
> 2010-05-27 18:28:31,659 INFO org.apache.pig.impl.util.SpillableMemoryManager: 
> low memory handler called (Usage
> threshold exceeded) init = 4194304(4096K) used = 672012960(656262K) committed 
> = 954466304(932096K) max =
> 954466304(932096K)
> 2010-05-27 18:10:52,653 INFO org.apache.pig.impl.util.SpillableMemoryManager: 
> low memory handler called (Collection
> threshold exceeded) init = 4194304(4096K) used = 954466304(932096K) committed 
> = 954466304(932096K) max =
> 954466304(932096K)
> {code}
> This seems to confuse users a lot. Once these messages are printed, users 
> tend to believe that Pig is having a hard time with memory, is spilling to 
> disk, etc., when in fact Pig might be cruising along at ease. We should be a 
> little more careful about what we print in the logs. Currently these are 
> printed when a notification is sent by the JVM and some other conditions are 
> met, which may not necessarily indicate a low-memory condition. Furthermore, 
> with {{InternalCachedBag}} embraced everywhere in favor of {{DefaultBag}}, 
> these messages have lost their usefulness. At the very least, we should 
> lower the log level at which these are printed.




[jira] Commented: (PIG-1466) Improve log messages for memory usage

2010-08-19 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900413#action_12900413
 ] 

Olga Natkovich commented on PIG-1466:
-

Patch looks good. Just one comment:

"memory handler call- Usage threshold exceeded "  and "memory handler call - 
Collection threshold exceeded " need to be made more neutral so that users do 
not think it is a problem. Also, I think we want to log this at info level so 
that we get it by default.




[jira] Updated: (PIG-1434) Allow casting relations to scalars

2010-08-19 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1434:


Release Note: 
PIG-1434 adds functionality that allows casting the elements of a single-tuple 
relation to scalar values. The primary use case is using the values of global 
aggregates in follow-up computations. For instance:

A = load 'mydata' as (userid, clicks);
B = group A all;
C = foreach B generate SUM(A.clicks) as total;
D = foreach A generate userid, clicks/(double)C.total;
dump D;

This example computes the fraction of all clicks that belongs to each user. 
Note that if the SUM is not given a name, a position can be used instead 
(userid, clicks/(double)C.$0). Also note that if an explicit cast is not used, 
an implicit cast is inserted according to the regular Pig rules.

The relation can be used anywhere an expression of that type would make sense. 
This includes FOREACH, FILTER, and SPLIT.

A multi-field tuple can also be used:

A = load 'mydata' as (userid, clicks);
B = group A all;
C = foreach B generate SUM(A.clicks) as total, COUNT(A) as cnt;
D = FILTER A by clicks > C.total/3;
E = foreach D generate userid, clicks/(double)C.total, C.cnt;
dump E;

If a relation contains more than a single tuple, a runtime error is generated: 
"Scalar has more than one row in the output"

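The single-row rule from this release note can be illustrated with a small 
Java sketch. It is not how Pig implements the cast: ScalarSketch, toScalar and 
shareOfTotal are hypothetical names, relations are modeled as plain arrays, and 
rejecting an empty relation is an assumption, since the release note only 
covers the more-than-one-row case.

```java
// Illustration only: a plain-Java model of the single-row rule, not Pig code.
final class ScalarSketch {
    private ScalarSketch() {}

    // Returns the only value of a one-row relation.
    static long toScalar(long[] relation) {
        if (relation.length > 1) {
            throw new IllegalStateException("Scalar has more than one row in the output");
        }
        if (relation.length == 0) {
            // Assumption: the release note does not say what an empty relation yields.
            throw new IllegalStateException("Scalar has no rows in the output");
        }
        return relation[0];
    }

    // Mirrors: D = foreach A generate clicks/(double)C.total;
    static double[] shareOfTotal(long[] clicks) {
        long sum = 0;
        for (long c : clicks) {
            sum += c;
        }
        long total = toScalar(new long[] { sum }); // C holds exactly one row
        double[] shares = new double[clicks.length];
        for (int i = 0; i < clicks.length; i++) {
            shares[i] = clicks[i] / (double) total;
        }
        return shares;
    }
}
```

Passing a two-element array to toScalar raises the same message that the 
release note quotes for relations with more than one tuple.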


> Allow casting relations to scalars
> --
>
> Key: PIG-1434
> URL: https://issues.apache.org/jira/browse/PIG-1434
> Project: Pig
>  Issue Type: Improvement
>Reporter: Olga Natkovich
>Assignee: Aniket Mokashi
> Fix For: 0.8.0
>
> Attachments: scalarImpl.patch, ScalarImpl1.patch, ScalarImpl5.patch, 
> ScalarImplFinale.patch, ScalarImplFinale1.patch, ScalarImplFinaleRebase.patch
>
>
> This jira is to implement a simplified version of the functionality described 
> in https://issues.apache.org/jira/browse/PIG-801.
> The proposal is to allow casting relations to scalar types in foreach.
> Example:
> A = load 'data' as (x, y, z);
> B = group A all;
> C = foreach B generate COUNT(A);
> .
> X = 
> Y = foreach X generate $1/(long) C;
> Couple of additional comments:
> (1) You can only cast relations including a single value or an error will be 
> reported
> (2) Name resolution is needed since relation X might have field named C in 
> which case that field takes precedence.
> (3) Y will look for C closest to it.
> Implementation thoughts:
> The idea is to store C into a file and then convert it into scalar via a UDF. 
> I believe we already have a UDF that Ben Reed contributed for this purpose. 
> Most of the work would be to update the logical plan to
> (1) Store C
> (2) convert the cast to the UDF




[jira] Created: (PIG-1550) better error handling in casting relations to scalars

2010-08-19 Thread Olga Natkovich (JIRA)
better error handling in casting relations to scalars
-

 Key: PIG-1550
 URL: https://issues.apache.org/jira/browse/PIG-1550
 Project: Pig
  Issue Type: Bug
Reporter: Olga Natkovich
Assignee: Thejas M Nair
 Fix For: 0.8.0


I ran the following script:

Input data:

joe 100
sam 20
bob 134

Script:

A = load 'user_clicks' as (user: chararray, clicks: int);
B = group A by user;
C = foreach B generate group, SUM(A.clicks);
D = foreach A generate clicks/(double)C.$1;
dump D;

Since C contains more than one tuple, I expected to get an error, and I did. 
However, the error was not very clear. When the job failed, I did see a valid 
error, although it lacked an error code: 210630 [main] ERROR 
org.apache.pig.tools.pigstats.PigStats  - ERROR 0: Scalar has more than one row 
in the output
However, at the end of processing, I saw a misleading error:

210709 [main] ERROR org.apache.pig.tools.grunt.Grunt  - ERROR 2088: Unable to 
get results for: 
hdfs://wilbur20.labs.corp.sp1.yahoo.com:9020/tmp/temp818551960/tmp1063730945:org.apache.pig.impl.io.InterStorage
10/08/19 17:16:22 ERROR grunt.Grunt: ERROR 2088: Unable to get results for: 
hdfs://wilbur20.labs.corp.sp1.yahoo.com:9020/tmp/temp818551960/tmp1063730945:org.apache.pig.impl.io.InterStorage





[jira] Updated: (PIG-1353) Map-side outer joins

2010-08-20 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1353:


Summary: Map-side outer joins  (was: Map-side joins)

> Map-side outer joins
> 
>
> Key: PIG-1353
> URL: https://issues.apache.org/jira/browse/PIG-1353
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
> Fix For: 0.8.0
>
> Attachments: pig-1353.patch, pig-1353.patch
>
>
> Pig already has a couple of map-side join implementations: Merge Join and 
> Fragmented-Replicate Join. But both of them are pretty restrictive. Merge 
> Join can only join two tables, and can only do an inner join. FR Join can 
> join multiple relations, but it can only do inner and left outer joins. 
> Further, it restricts the sizes of the side relations. It would be nice if 
> we could do map-side joins on multiple tables, as well as inner, left outer, 
> right outer and full outer joins. 
> A lot of the groundwork for this has already been done in PIG-1309. The 
> remaining work will be tracked in this jira.




[jira] Updated: (PIG-1353) Map-side outer joins

2010-08-20 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1353:


Release Note: 
With this patch, it is now possible to perform [left|right|full] outer joins on 
two tables, as well as inner joins on more than two tables, on the map side in 
Pig, provided the data is sorted and one of the loaders implements the 
{{CollectableLoader}} interface. The primary algorithm is based on sort-merge 
join. 

Additional implementation details:
1) No other operations can be done between the load and join statements.
2) Data must be sorted in ASC order.
3) Nulls are considered smaller than everything else. So, if the data contains 
null keys, they should occur before anything else.
4) The left-most loader must implement the CollectableLoader interface as well 
as OrderedLoadFunc.
5) All other loaders must implement IndexableLoadFunc.

Note that the Zebra loader satisfies all of these conditions, so it can be used 
out of the box.
Similar conditions apply to map-side cogroups (PIG-1309) as well.

Example:
A = load 'data1' using org.apache.hadoop.zebra.pig.TableLoader('', 'sorted');
B = load 'data2' using org.apache.hadoop.zebra.pig.TableLoader('', 'sorted');
C = join A by id left, B by id using 'merge';
.




[jira] Updated: (PIG-1309) Map-side Cogroup

2010-08-20 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1309:


Release Note: 
With this patch, it is now possible to perform a map-side cogroup if the data 
is sorted and one of the loaders implements the {{CollectableLoader}} 
interface. The primary algorithm is based on sort-merge join. 

Additional implementation details: 
1) No other operations can be done between the load and cogroup statements. 
2) Data must be sorted in ASC order. 
3) Nulls are considered smaller than everything else. So, if the data contains 
null keys, they should occur before anything else. 
4) The left-most loader must implement the CollectableLoader interface as well 
as OrderedLoadFunc. 
5) All other loaders must implement IndexableLoadFunc. 

Note that the Zebra loader satisfies all of these conditions, so it can be used 
out of the box. 
Similar conditions apply to map-side outer joins (PIG-1353) as well. 

Example: 
A = load 'data1' using org.apache.hadoop.zebra.pig.TableLoader('', 'sorted'); 
B = load 'data2' using org.apache.hadoop.zebra.pig.TableLoader('', 'sorted'); 
C = COGROUP A by id, B by id using 'merge'; 
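
Conditions 2 and 3 above (ascending order, null keys first) can be checked 
with a few lines of Java. This is only a sketch of the ordering rule: 
SortOrderSketch and its methods are hypothetical names, keys are modeled as 
Integer values, and none of this is part of Pig's loader interfaces.

```java
// Illustration only: checks the ordering a sort-merge algorithm expects.
final class SortOrderSketch {
    private SortOrderSketch() {}

    // Ascending comparison that treats null as smaller than any non-null key.
    static int compareNullsFirst(Integer left, Integer right) {
        if (left == null) {
            return (right == null) ? 0 : -1;
        }
        if (right == null) {
            return 1;
        }
        return left.compareTo(right);
    }

    // True if the keys are in ASC order with any null keys at the front.
    static boolean isSortedNullsFirst(Integer[] keys) {
        for (int i = 1; i < keys.length; i++) {
            if (compareNullsFirst(keys[i - 1], keys[i]) > 0) {
                return false;
            }
        }
        return true;
    }
}
```

A key sequence such as null, null, 1, 2, 2, 5 passes the check, while 
1, null, 2 does not, because the null key appears after a non-null one.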


> Map-side Cogroup
> 
>
> Key: PIG-1309
> URL: https://issues.apache.org/jira/browse/PIG-1309
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
> Fix For: 0.7.0, 0.8.0
>
> Attachments: mapsideCogrp.patch, pig-1309_1.patch, pig-1309_2.patch, 
> PIG_1309_7.patch
>
>
> In the never-ending quest to make Pig go faster, we want to parallelize as 
> many relational operations as possible. It's already possible to do Group-by 
> (PIG-984) and Joins (PIG-845, PIG-554) purely on the map side in Pig. This 
> jira is to add a map-side implementation of Cogroup to Pig. Details to follow.




[jira] Updated: (PIG-1406) Allow to run shell commands from grunt

2010-08-20 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1406:


Release Note: 
This JIRA allows running shell commands from within grunt by using the new sh 
command, which has the following format:

sh 

For instance, the command below runs ls in the directory from which you 
started Pig:

grunt> sh ls
bigdata.conf
nightly.conf
.
grunt>

> Allow to run shell commands from grunt
> --
>
> Key: PIG-1406
> URL: https://issues.apache.org/jira/browse/PIG-1406
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.8.0
>Reporter: Olga Natkovich
>Assignee: Jeff Zhang
> Fix For: 0.8.0
>
> Attachments: Pig-1406.patch, Pig-1406_2.patch
>
>
> We had several users asking to be able to run arbitrary shell commands from 
> within grunt. This would work similarly to fs command.




[jira] Updated: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword

2010-08-20 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1249:


Release Note: 
In previous versions of Pig, if the number of reducers was not specified (via 
PARALLEL or default_parallel), the value of 1 was used, which in many cases was 
not a good choice and caused severe performance problems.

In Pig 0.8.0, a simple heuristic is used to come up with a better number based 
on the size of the input data. The user can control the following parameters:

pig.exec.reducers.bytes.per.reducer - defines the number of input bytes per 
reducer; default value is 1000*1000*1000 (1GB)
pig.exec.reducers.max - defines the upper bound on the number of reducers; 
default is 999

The formula is very simple:

#reducers = MIN (pig.exec.reducers.max, total input size (in bytes) / bytes per 
reducer)

This is a very simplistic formula that we will need to improve over time. Note 
that the computed value takes all inputs within the script into account and is 
applied to all the jobs within the Pig script.
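
As a worked example, the formula above can be written out in Java. This is a 
sketch, not Pig's code: ReducerEstimateSketch is a hypothetical name, and 
rounding the quotient up and never returning fewer than one reducer are 
assumptions that the release note does not spell out.

```java
// Illustration only: the heuristic from the release note, with assumed rounding.
final class ReducerEstimateSketch {
    // Defaults quoted in the release note.
    static final long DEFAULT_BYTES_PER_REDUCER = 1000L * 1000 * 1000; // pig.exec.reducers.bytes.per.reducer
    static final int DEFAULT_MAX_REDUCERS = 999;                        // pig.exec.reducers.max

    private ReducerEstimateSketch() {}

    static int estimate(long totalInputBytes, long bytesPerReducer, int maxReducers) {
        if (totalInputBytes < 0 || bytesPerReducer <= 0 || maxReducers <= 0) {
            throw new IllegalArgumentException("sizes and limits must be positive");
        }
        // Assumption: round up so that a partial chunk still gets a reducer.
        long needed = (totalInputBytes + bytesPerReducer - 1) / bytesPerReducer;
        // Assumption: always run at least one reducer.
        needed = Math.max(1L, needed);
        // MIN(pig.exec.reducers.max, input size / bytes per reducer)
        return (int) Math.min((long) maxReducers, needed);
    }
}
```

With the defaults, 5.5 GB of input gives 6 reducers under these assumptions, 
and the 10 TB case mentioned in the issue is capped at 999.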

> Safe-guards against misconfigured Pig scripts without PARALLEL keyword
> --
>
> Key: PIG-1249
> URL: https://issues.apache.org/jira/browse/PIG-1249
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.8.0
>Reporter: Arun C Murthy
>Assignee: Jeff Zhang
>Priority: Critical
> Fix For: 0.8.0
>
> Attachments: PIG-1249-4.patch, PIG-1249.patch, PIG-1249_5.patch, 
> PIG_1249_2.patch, PIG_1249_3.patch
>
>
> It would be *very* useful for Pig to have safe-guards against naive scripts 
> which process a *lot* of data without the use of PARALLEL keyword.
> We've seen a fair number of instances where naive users process huge 
> data-sets (>10TB) with badly mis-configured #reduces e.g. 1 reduce. 




[jira] Updated: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword

2010-08-20 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1249:


Release Note: 
In previous versions of Pig, if the number of reducers was not specified (via 
PARALLEL or default_parallel), the value of 1 was used, which in many cases was 
not a good choice and caused severe performance problems.

In Pig 0.8.0, a simple heuristic is used to come up with a better number based 
on the size of the input data. The user can control the following parameters:

pig.exec.reducers.bytes.per.reducer - defines the number of input bytes per 
reducer; default value is 1000*1000*1000 (1GB)
pig.exec.reducers.max - defines the upper bound on the number of reducers; 
default is 999

The formula is very simple:

#reducers = MIN (pig.exec.reducers.max, total input size (in bytes) / bytes per 
reducer)

This is a very simplistic formula that we will need to improve over time. Note 
that the computed value takes all inputs within the script into account and is 
applied to all the jobs within the Pig script.

Note that this is not a backward-compatible change. Set default_parallel to 1 
to restore the previous behavior.




[jira] Updated: (PIG-972) Make describe work with nested foreach

2010-08-20 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-972:
---

Release Note: 
In earlier versions of Pig, describe could only be applied to outer relations. 
With Pig 0.8.0, describe can also be applied to relations defined in a nested 
foreach. 

Example:

grunt> A = load 'studentab10k' as (name, age, gpa);
grunt> B = group A by name;
grunt> C = foreach B {
>> D = distinct A.age;
>> generate COUNT(D), group;}
grunt> describe C::D;
D: {age: bytearray}

Note that you access the inner relation via the outer one using :: operator.

> Make describe work with nested foreach
> --
>
> Key: PIG-972
> URL: https://issues.apache.org/jira/browse/PIG-972
> Project: Pig
>  Issue Type: Improvement
>Reporter: Olga Natkovich
>Assignee: Aniket Mokashi
> Fix For: 0.8.0
>
> Attachments: NestedDescribeFinale.patch, NestedDescribeFinale1.patch, 
> NestedDescribeProp1.patch, NestedDescribeProp2Initial.patch
>
>
> Currently Parser can't deal with that. This is because describe is part of 
> Grunt parser while the rest of nested foreach is handled by the QueryParser




[jira] Updated: (PIG-1434) Allow casting relations to scalars

2010-08-20 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1434:


Release Note: 
PIG-1434 adds functionality that allows casting the elements of a single-tuple 
relation to scalar values. The primary use case is using the values of global 
aggregates in follow-up computations. For instance:

A = load 'mydata' as (userid, clicks);
B = group A all;
C = foreach B generate SUM(A.clicks) as total;
D = foreach A generate userid, clicks/(double)C.total;
dump D;

This example computes the fraction of all clicks that belongs to each user. 
Note that if the SUM is not given a name, a position can be used instead 
(userid, clicks/(double)C.$0). Also note that if an explicit cast is not used, 
an implicit cast is inserted according to the regular Pig rules, and that when 
the schema can't be inferred, chararray rather than bytearray is used.

The relation can be used anywhere an expression of that type would make sense. 
This includes FOREACH, FILTER, and SPLIT.

A multi-field tuple can also be used:

A = load 'mydata' as (userid, clicks);
B = group A all;
C = foreach B generate SUM(A.clicks) as total, COUNT(A) as cnt;
D = FILTER A by clicks > C.total/3;
E = foreach D generate userid, clicks/(double)C.total, C.cnt;
dump E;

If a relation contains more than a single tuple, a runtime error is generated: 
"Scalar has more than one row in the output"






[jira] Updated: (PIG-282) Custom Partitioner

2010-08-20 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-282:
---

Release Note: 
This feature allows specifying a Hadoop Partitioner for the following 
operations: GROUP/COGROUP, CROSS, DISTINCT, JOIN (except 'skewed' join). The 
Partitioner controls the partitioning of the keys of the intermediate 
map outputs. See 
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/Partitioner.html
 for more details.

To use this feature, add a PARTITION BY clause to the appropriate operator:
A = load 'input_data';
B = group A by $0 PARTITION BY 
org.apache.pig.test.utils.SimpleCustomPartitioner parallel 2;
.
Here is the code for SimpleCustomPartitioner:

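import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.pig.impl.io.PigNullableWritable;

public class SimpleCustomPartitioner extends Partitioner<PigNullableWritable, Writable> {
    @Override
    public int getPartition(PigNullableWritable key, Writable value, int numPartitions) {
        // Integer keys are partitioned by their value, all other keys by hash code.
        if (key.getValueAsPigType() instanceof Integer) {
            return ((Integer) key.getValueAsPigType()).intValue() % numPartitions;
        } else {
            return key.hashCode() % numPartitions;
        }
    }
}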
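
One detail of the arithmetic above is worth a small sketch: Java's % operator 
keeps the sign of the dividend, so the result can be negative when the key or 
its hash code is negative, whereas a partition index is normally expected to 
lie between 0 and numPartitions - 1. The class below is hypothetical, does not 
depend on Hadoop or Pig, and only contrasts % with Math.floorMod; whether 
negative values can actually reach the partitioner depends on the data.

```java
// Illustration only: contrasts the remainder operator with Math.floorMod.
final class PartitionIndexSketch {
    private PartitionIndexSketch() {}

    // Same arithmetic as the example above: negative for negative input.
    static int remainderPartition(int hash, int numPartitions) {
        return hash % numPartitions;
    }

    // Always lands in the range [0, numPartitions) for numPartitions > 0.
    static int floorModPartition(int hash, int numPartitions) {
        return Math.floorMod(hash, numPartitions);
    }
}
```

For non-negative input both methods agree, so switching to floorMod would only 
change where negative keys or hash codes are sent.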

> Custom Partitioner
> --
>
> Key: PIG-282
> URL: https://issues.apache.org/jira/browse/PIG-282
> Project: Pig
>  Issue Type: New Feature
>Affects Versions: 0.7.0
>Reporter: Amir Youssefi
>Assignee: Aniket Mokashi
>Priority: Minor
> Fix For: 0.8.0
>
> Attachments: CustomPartitioner.patch, CustomPartitionerFinale.patch, 
> CustomPartitionerTest.patch
>
>
> By adding custom partitioner we can give control over which output partition 
> a key (/value) goes to. We can add keywords to language e.g. 
> PARTITION BY UDF(...)
> or a similar syntax. UDF returns a number between 0 and n-1 where n is number 
> of output partitions.




[jira] Updated: (PIG-1321) Logical Optimizer: Merge cascading foreach

2010-08-20 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1321:


Release Note: 
This rule merges two foreach statements together if the following 
preconditions are met:

- foreach statements are consecutive
- the second foreach is not nested
- the first foreach statement does not contain flatten 

Example:

(1) Original code:

A = load 'file.txt' as (a, b, c);
B = foreach A generate a+b as u, c-b as v;
C = foreach B generate $0+5, v;
.

(2) Optimized code:

A = load 'file.txt' as (a, b, c);
C = foreach A generate a+b+5, c-b;
..


> Logical Optimizer: Merge cascading foreach
> --
>
> Key: PIG-1321
> URL: https://issues.apache.org/jira/browse/PIG-1321
> Project: Pig
>  Issue Type: Sub-task
>  Components: impl
>Affects Versions: 0.7.0
>Reporter: Daniel Dai
>Assignee: Xuefu Zhang
> Fix For: 0.8.0
>
> Attachments: pig-1321.patch
>
>
> We can merge consecutive foreach statements.
> Eg:
> b = foreach a generate a0#'key1' as b0, a0#'key2' as b1, a1;
> c = foreach b generate b0#'kk1', b0#'kk2', b1, a1;
> => c = foreach a generate a0#'key1'#'kk1', a0#'key1'#'kk2', a0#'key2', a1;




[jira] Updated: (PIG-908) Need a way to correlate MR jobs with Pig statements

2010-08-23 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-908:
---


With Pig 0.8.0 we print a summary of the execution that contains (among other 
things) how aliases map to jobs. Example:

JobId                   Maps  Reduces  MaxMapTime  MinMapTIme  AvgMapTime  MaxReduceTime  MinReduceTime  AvgReduceTime  Alias  Feature            Outputs
job_201004271216_12712  1     1        3           3           3           12             12             12             B,C    GROUP_BY,COMBINER
job_201004271216_12713  1     1        3           3           3           12             12             12             D      SAMPLER
job_201004271216_12714  1     1        3           3           3           12             12             12             D      ORDER_BY,COMBINER  hdfs://wilbur20.labs.corp.sp1.yahoo.com:9020/tmp/temp743703298/tmp-2019944040,
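
For anyone post-processing this summary, the alias-to-job mapping can be recovered with a small script. The Python sketch below assumes each row is whitespace-separated, that the alias list is the tenth field, and that it is never empty; those are assumptions read off the example above, not a documented format, and alias_to_jobs is a hypothetical helper.

```python
SUMMARY_ROWS = [
    "job_201004271216_12712 1 1 3 3 3 12 12 12 B,C GROUP_BY,COMBINER",
    "job_201004271216_12713 1 1 3 3 3 12 12 12 D SAMPLER",
    "job_201004271216_12714 1 1 3 3 3 12 12 12 D ORDER_BY,COMBINER "
    "hdfs://wilbur20.labs.corp.sp1.yahoo.com:9020/tmp/temp743703298/tmp-2019944040,",
]

ALIAS_FIELD = 9  # assumed position of the Alias column (0-based)


def alias_to_jobs(rows):
    """Map each alias to the list of job ids whose row mentions it."""
    mapping = {}
    for row in rows:
        fields = row.split()
        job_id = fields[0]
        for alias in fields[ALIAS_FIELD].split(","):
            mapping.setdefault(alias, []).append(job_id)
    return mapping
```

With the three rows above, alias D maps to two jobs (the sampler and the order-by job), which is the kind of correlation the ticket asks for.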


> Need a way to correlate MR jobs with Pig statements
> ---
>
> Key: PIG-908
> URL: https://issues.apache.org/jira/browse/PIG-908
> Project: Pig
>  Issue Type: Wish
>Reporter: Dmitriy V. Ryaboy
>Assignee: Richard Ding
> Fix For: 0.8.0
>
>
> Complex Pig Scripts often generate many Map-Reduce jobs, especially with the 
> recent introduction of multi-store capabilities.
> For example, the first script in the Pig tutorial produces 5 MR jobs.
> There is currently very little support for debugging resulting jobs; if one 
> of the MR jobs fails, it is hard to figure out which part of the script it 
> was responsible for. Explain plans help, but even with the explain plan, a 
> fair amount of effort (and sometimes, experimentation) is required to 
> correlate the failing MR job with the corresponding PigLatin statements.
> This ticket is created to discuss approaches to alleviating this problem.




[jira] Updated: (PIG-1488) Make HDFS temp dir configurable

2010-08-23 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1488:


Release Note: Pig stores intermediate data generated between MR jobs in a 
temp location on HDFS. In Pig 0.8.0 this location is configurable via the 
pig.temp.dir property. The default is /tmp, which is the same as the hardcoded 
location in Pig 0.7.0 and earlier versions.

> Make HDFS temp dir configurable
> ---
>
> Key: PIG-1488
> URL: https://issues.apache.org/jira/browse/PIG-1488
> Project: Pig
>  Issue Type: Improvement
>Reporter: Olga Natkovich
> Fix For: 0.8.0
>
>
> Currently it is hardcoded to /tmp. It should be made into a property.




[jira] Updated: (PIG-1484) BinStorage should support comma seperated path

2010-08-23 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1484:


Release Note: 
In Pig 0.7.0 only a single location is supported as input to BinStorage. (This 
location can be a file, a directory or a glob.) With Pig 0.8.0 we are making 
BinStorage (similar to PigStorage) support a comma-separated list of locations.

Example:

a = load '1.bin,2.bin' using BinStorage();



> BinStorage should support comma seperated path
> --
>
> Key: PIG-1484
> URL: https://issues.apache.org/jira/browse/PIG-1484
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.7.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.7.0, 0.8.0
>
> Attachments: PIG-1484-1.patch, PIG-1484-2.patch, PIG-1484-3.patch
>
>
> BinStorage does not take a comma-separated path. The following script fails:
> a = load '1.bin,2.bin' using BinStorage();
> dump a;




[jira] Created: (PIG-1557) couple of issue mapping aliases to jobs

2010-08-23 Thread Olga Natkovich (JIRA)
couple of issue mapping aliases to jobs
---

 Key: PIG-1557
 URL: https://issues.apache.org/jira/browse/PIG-1557
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Olga Natkovich
Assignee: Richard Ding


I have a simple script:

A = load '/user/pig/tests/data/singlefile/studenttab10k' as (name, age, gpa);
B = group A by name;
C = foreach B generate group, COUNT(A);
D = order C by $1;
E = limit D 10;
dump E;

I noticed a couple of issues with alias to job mapping: neither load(A) nor 
limit(E) shows in the output





[jira] Commented: (PIG-1447) Tune memory usage of InternalCachedBag

2010-08-23 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901576#action_12901576
 ] 

Olga Natkovich commented on PIG-1447:
-

This is probably the smallest patch I have reviewed recently :). +1

> Tune memory usage of InternalCachedBag
> --
>
> Key: PIG-1447
> URL: https://issues.apache.org/jira/browse/PIG-1447
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.7.0
>Reporter: Daniel Dai
>Assignee: Thejas M Nair
> Fix For: 0.8.0
>
> Attachments: L15_modified.pig, L15_modified2.pig, PIG-1447.1.patch
>
>
> We need to find a better value for "pig.cachedbag.memusage".




[jira] Commented: (PIG-1354) UDFs for dynamic invocation of simple Java methods

2010-08-23 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901577#action_12901577
 ] 

Olga Natkovich commented on PIG-1354:
-

Dmitriy, could you add release notes on how to use this?

> UDFs for dynamic invocation of simple Java methods
> --
>
> Key: PIG-1354
> URL: https://issues.apache.org/jira/browse/PIG-1354
> Project: Pig
>  Issue Type: New Feature
>Affects Versions: 0.8.0
>Reporter: Dmitriy V. Ryaboy
>Assignee: Dmitriy V. Ryaboy
> Fix For: 0.8.0
>
> Attachments: PIG-1354.patch, PIG-1354.patch, PIG-1354.patch
>
>
> The need to create wrapper UDFs for simple Java functions creates unnecessary 
> work for Pig users, slows down the development process, and produces a lot of 
> trivial classes. We can use Java's reflection to allow invoking a number of 
> methods on the fly, dynamically, by creating a generic UDF to accomplish this.




[jira] Commented: (PIG-1354) UDFs for dynamic invocation of simple Java methods

2010-08-23 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901585#action_12901585
 ] 

Olga Natkovich commented on PIG-1354:
-

Sounds good, Dmitriy. Richard will review and commit the patch; after that, 
please paste the release notes.

> UDFs for dynamic invocation of simple Java methods
> --
>
> Key: PIG-1354
> URL: https://issues.apache.org/jira/browse/PIG-1354
> Project: Pig
>  Issue Type: New Feature
>Affects Versions: 0.8.0
>Reporter: Dmitriy V. Ryaboy
>Assignee: Dmitriy V. Ryaboy
> Fix For: 0.8.0
>
> Attachments: PIG-1354.patch, PIG-1354.patch, PIG-1354.patch
>
>
> The need to create wrapper UDFs for simple Java functions creates unnecessary 
> work for Pig users, slows down the development process, and produces a lot of 
> trivial classes. We can use Java's reflection to allow invoking a number of 
> methods on the fly, dynamically, by creating a generic UDF to accomplish this.




[jira] Commented: (PIG-1311) Pig interfaces should be clearly classified in terms of scope and stability

2010-08-23 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901587#action_12901587
 ] 

Olga Natkovich commented on PIG-1311:
-

+1, please, commit

> Pig interfaces should be clearly classified in terms of scope and stability
> ---
>
> Key: PIG-1311
> URL: https://issues.apache.org/jira/browse/PIG-1311
> Project: Pig
>  Issue Type: Improvement
>Reporter: Alan Gates
>Assignee: Alan Gates
> Fix For: 0.8.0
>
> Attachments: PIG-1311.patch
>
>
> Clearly marking Pig interfaces (Java interfaces but also things like config 
> files, CLIs, Pig Latin syntax and semantics, etc.) to show scope 
> (public/private) and stability (stable/evolving/unstable) will help users 
> understand how to interact with Pig and developers to understand what things 
> they can and cannot change.




[jira] Commented: (PIG-1558) build.xml for site directory does not work

2010-08-23 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901612#action_12901612
 ] 

Olga Natkovich commented on PIG-1558:
-

+1

> build.xml for site directory does not work
> --
>
> Key: PIG-1558
> URL: https://issues.apache.org/jira/browse/PIG-1558
> Project: Pig
>  Issue Type: Bug
>  Components: build
>Affects Versions: 0.8.0
>Reporter: Alan Gates
>Assignee: Alan Gates
>Priority: Minor
> Fix For: 0.8.0
>
> Attachments: PIG-1558.patch
>
>
> Going to the site directory and running ant produces:  
> {code}
> ant 
> Buildfile: build.xml
> clean:
>[delete] Deleting directory /Users/gates/src/pig/apache/site/author/build
> update:
> BUILD FAILED
> /Users/gates/src/pig/apache/site/build.xml:6: Execute failed: 
> java.io.IOException: Cannot run program "forrest" (in directory 
> "/Users/gates/src/pig/apache/site/author"): error=2, No such file or directory
> {code}
> Also, forrest here still requires Java 1.5, which can be fixed (see PIG-1508).




[jira] Commented: (PIG-1559) Several things stated in Pig philosophy page are out of date

2010-08-23 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901673#action_12901673
 ] 

Olga Natkovich commented on PIG-1559:
-

+1, looks good

> Several things stated in Pig philosophy page are out of date
> 
>
> Key: PIG-1559
> URL: https://issues.apache.org/jira/browse/PIG-1559
> Project: Pig
>  Issue Type: Bug
>  Components: documentation
>Affects Versions: 0.7.0
>Reporter: Alan Gates
>Assignee: Alan Gates
>Priority: Minor
> Fix For: 0.8.0
>
> Attachments: PIG-1559.patch
>
>
> The Pig philosophy page says several things that are no longer true (such as 
> that Pig does not have an optimizer (it does now), that we someday hope to 
> support streaming (we already do), that we some day hope to control splits 
> (we don't, we just use what Hadoop gives us now)).  These need to be updated 
> to reflect the current situation.




[jira] Updated: (PIG-1562) Fix the version for the dependent packages for the maven

2010-08-24 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1562:


Fix Version/s: 0.8.0

> Fix the version for the dependent packages for the maven 
> -
>
> Key: PIG-1562
> URL: https://issues.apache.org/jira/browse/PIG-1562
> Project: Pig
>  Issue Type: Bug
>Reporter: niraj rai
>Assignee: niraj rai
> Fix For: 0.8.0
>
>
> We need to fix the version setting so that the version is properly set for 
> the dependent packages in the Maven repository.




[jira] Commented: (PIG-1560) Build target 'checkstyle' fails

2010-08-24 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901975#action_12901975
 ] 

Olga Natkovich commented on PIG-1560:
-

please, commit

> Build target 'checkstyle' fails
> ---
>
> Key: PIG-1560
> URL: https://issues.apache.org/jira/browse/PIG-1560
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Richard Ding
>Assignee: Giridharan Kesavan
> Fix For: 0.8.0
>
> Attachments: pig-1560.patch
>
>
> Stack trace:
> {code}
> /trunk/build.xml:894: java.lang.NoClassDefFoundError: 
> org/apache/commons/logging/LogFactory
> at 
> org.apache.commons.beanutils.ConvertUtilsBean.<init>(ConvertUtilsBean.java:130)
> at 
> com.puppycrawl.tools.checkstyle.api.AutomaticBean.createBeanUtilsBean(AutomaticBean.java:73)
> at 
> com.puppycrawl.tools.checkstyle.api.AutomaticBean.contextualize(AutomaticBean.java:222)
> at 
> com.puppycrawl.tools.checkstyle.CheckStyleTask.createChecker(CheckStyleTask.java:372)
> at 
> com.puppycrawl.tools.checkstyle.CheckStyleTask.realExecute(CheckStyleTask.java:304)
> at 
> com.puppycrawl.tools.checkstyle.CheckStyleTask.execute(CheckStyleTask.java:265)
> at 
> org.apache.tools.ant.UnknownElement.execute(UnknownElement.java:291)
> at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at 
> org.apache.tools.ant.dispatch.DispatchUtils.execute(DispatchUtils.java:106)
> at org.apache.tools.ant.Task.perform(Task.java:348)
> at org.apache.tools.ant.Target.execute(Target.java:390)
> at org.apache.tools.ant.Target.performTasks(Target.java:411)
> at 
> org.apache.tools.ant.Project.executeSortedTargets(Project.java:1360)
> at org.apache.tools.ant.Project.executeTarget(Project.java:1329)
> at 
> org.apache.tools.ant.helper.DefaultExecutor.executeTargets(DefaultExecutor.java:41)
> at org.apache.tools.ant.Project.executeTargets(Project.java:1212)
> at org.apache.tools.ant.Main.runBuild(Main.java:801)
> at org.apache.tools.ant.Main.startAnt(Main.java:218)
> at org.apache.tools.ant.launch.Launcher.run(Launcher.java:280)
> at org.apache.tools.ant.launch.Launcher.main(Launcher.java:109)
> Caused by: java.lang.ClassNotFoundException: 
> org.apache.commons.logging.LogFactory
> at 
> org.apache.tools.ant.AntClassLoader.findClassInComponents(AntClassLoader.java:1386)
> at 
> org.apache.tools.ant.AntClassLoader.findClass(AntClassLoader.java:1336)
> at 
> org.apache.tools.ant.AntClassLoader.loadClass(AntClassLoader.java:1074)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
> ... 22 more
> {code}




[jira] Commented: (PIG-1559) Several things stated in Pig philosophy page are out of date

2010-08-24 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901979#action_12901979
 ] 

Olga Natkovich commented on PIG-1559:
-

Looks like the limit issue I was seeing has been addressed in the latest trunk. 

I think we need to add unit tests to catch these things in the future.

> Several things stated in Pig philosophy page are out of date
> 
>
> Key: PIG-1559
> URL: https://issues.apache.org/jira/browse/PIG-1559
> Project: Pig
>  Issue Type: Bug
>  Components: documentation
>Affects Versions: 0.7.0
>Reporter: Alan Gates
>Assignee: Alan Gates
>Priority: Minor
> Fix For: 0.8.0
>
> Attachments: PIG-1559.patch
>
>
> The Pig philosophy page says several things that are no longer true (such as 
> that Pig does not have an optimizer (it does now), that we someday hope to 
> support streaming (we already do), that we some day hope to control splits 
> (we don't, we just use what Hadoop gives us now)).  These need to be updated 
> to reflect the current situation.




[jira] Commented: (PIG-1559) Several things stated in Pig philosophy page are out of date

2010-08-24 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901984#action_12901984
 ] 

Olga Natkovich commented on PIG-1559:
-

sorry, wrong JIRA

> Several things stated in Pig philosophy page are out of date
> 
>
> Key: PIG-1559
> URL: https://issues.apache.org/jira/browse/PIG-1559
> Project: Pig
>  Issue Type: Bug
>  Components: documentation
>Affects Versions: 0.7.0
>Reporter: Alan Gates
>Assignee: Alan Gates
>Priority: Minor
> Fix For: 0.8.0
>
> Attachments: PIG-1559.patch
>
>
> The Pig philosophy page says several things that are no longer true (such as 
> that Pig does not have an optimizer (it does now), that we someday hope to 
> support streaming (we already do), that we some day hope to control splits 
> (we don't, we just use what Hadoop gives us now)).  These need to be updated 
> to reflect the current situation.




[jira] Commented: (PIG-1557) couple of issue mapping aliases to jobs

2010-08-24 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901985#action_12901985
 ] 

Olga Natkovich commented on PIG-1557:
-

Looks like the limit issue I was seeing has been addressed in the latest trunk. 

I think we need to add unit tests to catch these things in the future.



> couple of issue mapping aliases to jobs
> ---
>
> Key: PIG-1557
> URL: https://issues.apache.org/jira/browse/PIG-1557
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Olga Natkovich
>Assignee: Richard Ding
> Fix For: 0.8.0
>
> Attachments: PIG-1557.patch
>
>
> I have a simple script:
> A = load '/user/pig/tests/data/singlefile/studenttab10k' as (name, age, gpa);
> B = group A by name;
> C = foreach B generate group, COUNT(A);
> D = order C by $1;
> E = limit D 10;
> dump E;
> I noticed a couple of issues with alias to job mapping: neither load(A) nor 
> limit(E) shows in the output




[jira] Created: (PIG-1563) SUBSTRING function is broken

2010-08-24 Thread Olga Natkovich (JIRA)
SUBSTRING function is broken


 Key: PIG-1563
 URL: https://issues.apache.org/jira/browse/PIG-1563
 Project: Pig
  Issue Type: Bug
Reporter: Olga Natkovich
Assignee: Yan Zhou
 Fix For: 0.8.0


Script:

A = load 'studenttab10k' as (name, age, gpa);
C = foreach A generate SUBSTRING(name, 0,5);
E = limit C 10;
dump E;

Output is always empty:

()
()
()
()
()
()
()
()
()
()





[jira] Commented: (PIG-1557) couple of issue mapping aliases to jobs

2010-08-24 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902196#action_12902196
 ] 

Olga Natkovich commented on PIG-1557:
-

Looks good. Please, commit

> couple of issue mapping aliases to jobs
> ---
>
> Key: PIG-1557
> URL: https://issues.apache.org/jira/browse/PIG-1557
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Olga Natkovich
>Assignee: Richard Ding
> Fix For: 0.8.0
>
> Attachments: PIG-1557.patch, PIG-1557_1.patch
>
>
> I have a simple script:
> A = load '/user/pig/tests/data/singlefile/studenttab10k' as (name, age, gpa);
> B = group A by name;
> C = foreach B generate group, COUNT(A);
> D = order C by $1;
> E = limit D 10;
> dump E;
> I noticed a couple of issues with alias to job mapping: neither load(A) nor 
> limit(E) shows in the output




[jira] Commented: (PIG-1563) SUBSTRING function is broken

2010-08-24 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902201#action_12902201
 ] 

Olga Natkovich commented on PIG-1563:
-

I think you just need to add the arg mapping function, and Pig will insert 
the casts.


> SUBSTRING function is broken
> 
>
> Key: PIG-1563
> URL: https://issues.apache.org/jira/browse/PIG-1563
> Project: Pig
>  Issue Type: Bug
>Reporter: Olga Natkovich
>Assignee: Yan Zhou
> Fix For: 0.8.0
>
>
> Script:
> A = load 'studenttab10k' as (name, age, gpa);
> C = foreach A generate SUBSTRING(name, 0,5);
> E = limit C 10;
> dump E;
> Output is always empty:
> ()
> ()
> ()
> ()
> ()
> ()
> ()
> ()
> ()
> ()




[jira] Commented: (PIG-1563) SUBSTRING function is broken

2010-08-24 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902211#action_12902211
 ] 

Olga Natkovich commented on PIG-1563:
-

The same needs to be done (and we need unit tests) for the following string 
manipulation functions:

INDEXOF
LAST_INDEX_OF
REPLACE
SPLIT
TRIM

> SUBSTRING function is broken
> 
>
> Key: PIG-1563
> URL: https://issues.apache.org/jira/browse/PIG-1563
> Project: Pig
>  Issue Type: Bug
>Reporter: Olga Natkovich
>Assignee: Yan Zhou
> Fix For: 0.8.0
>
>
> Script:
> A = load 'studenttab10k' as (name, age, gpa);
> C = foreach A generate SUBSTRING(name, 0,5);
> E = limit C 10;
> dump E;
> Output is always empty:
> ()
> ()
> ()
> ()
> ()
> ()
> ()
> ()
> ()
> ()




[jira] Commented: (PIG-1483) [piggybank] Add HadoopJobHistoryLoader to the piggybank

2010-08-27 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903491#action_12903491
 ] 

Olga Natkovich commented on PIG-1483:
-

+1, please, commit

> [piggybank] Add HadoopJobHistoryLoader to the piggybank
> ---
>
> Key: PIG-1483
> URL: https://issues.apache.org/jira/browse/PIG-1483
> Project: Pig
>  Issue Type: New Feature
>Reporter: Richard Ding
>Assignee: Richard Ding
> Fix For: 0.8.0
>
> Attachments: PIG-1483.patch, PIG-1483_1.patch
>
>
> PIG-1333 added many script-related entries to the MR job xml file and thus 
> it's now possible to use Pig for querying Hadoop job history/xml files to get 
> script-level usage statistics. What we need is a Pig loader that can parse 
> these files and generate corresponding data objects.
> The goal of this jira is to create a HadoopJobHistoryLoader in piggybank.
> Here is an example that shows the intended usage:
> *Find all the jobs grouped by script and user:*
> {code}
> a = load '/mapred/history/_logs/history/' using HadoopJobHistoryLoader() as 
> (j:map[], m:map[], r:map[]);
> b = foreach a generate (Chararray) j#'PIG_SCRIPT_ID' as id, (Chararray) 
> j#'USER' as user, (Chararray) j#'JOBID' as job; 
> c = filter b by not (id is null);
> d = group c by (id, user);
> e = foreach d generate flatten(group), c.job;
> dump e;
> {code}
> A couple more examples:
> *Find scripts that use only the default parallelism:*
> {code}
> a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], 
> m:map[], r:map[]);
> b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' 
> as script_name, (Long) r#'NUMBER_REDUCES' as reduces;
> c = group b by (id, user, script_name) parallel 10;
> d = foreach c generate group.user, group.script_name, MAX(b.reduces) as 
> max_reduces;
> e = filter d by max_reduces == 1;
> dump e;
> {code}
> *Find the running time of each script (in seconds):*
> {code}
> a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], 
> m:map[], r:map[]);
> b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' 
> as script_name, (Long) j#'SUBMIT_TIME' as start, (Long) j#'FINISH_TIME' as 
> end;
> c = group b by (id, user, script_name);
> d = foreach c generate group.user, group.script_name, (MAX(b.end) - 
> MIN(b.start))/1000;
> dump d;
> {code}
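
The running-time query above reduces to "latest finish minus earliest submit, per script, divided by 1000". The Python sketch below restates that arithmetic on in-memory records, assuming SUBMIT_TIME and FINISH_TIME are epoch milliseconds (which the division by 1000 suggests). The dictionary keys mirror the aliases in the script, and script_running_times is a hypothetical helper, not part of the loader.

```python
def script_running_times(jobs):
    """Return {(id, user, script_name): seconds} from per-job records."""
    spans = {}
    for job in jobs:
        key = (job["id"], job["user"], job["script_name"])
        # Start from this job's own times if the script has not been seen yet.
        start, end = spans.get(key, (job["start"], job["end"]))
        spans[key] = (min(start, job["start"]), max(end, job["end"]))
    # Parenthesize the difference before dividing: (max_end - min_start) / 1000.
    return {key: (end - start) / 1000 for key, (start, end) in spans.items()}


jobs = [
    {"id": "s1", "user": "alice", "script_name": "a.pig", "start": 1000, "end": 5000},
    {"id": "s1", "user": "alice", "script_name": "a.pig", "start": 4000, "end": 9000},
    {"id": "s2", "user": "bob", "script_name": "b.pig", "start": 2000, "end": 2500},
]
times = script_running_times(jobs)
```

Note the grouping of the subtraction: without the parentheses only MIN(b.start) would be divided by 1000, giving a meaningless mix of units.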




[jira] Commented: (PIG-1518) multi file input format for loaders

2010-08-27 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903501#action_12903501
 ] 

Olga Natkovich commented on PIG-1518:
-

After discussion with Ashutosh and Yan, the agreement is that, in addition to 
checking interfaces, we also need to check whether we are taking advantage of 
the loader properties before deciding whether to combine or not.

For instance, even if the loader implements OrderLoadFunc but there is no merge 
join in the script, we can still combine.

Yan, please, compile the list of valid combinations and update the patch, 
thanks.

> multi file input format for loaders
> ---
>
> Key: PIG-1518
> URL: https://issues.apache.org/jira/browse/PIG-1518
> Project: Pig
>  Issue Type: Improvement
>Reporter: Olga Natkovich
>Assignee: Yan Zhou
> Fix For: 0.8.0
>
> Attachments: PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, 
> PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch
>
>
> We frequently run into the situation where Pig needs to deal with small files 
> in the input. In this case a separate map is created for each file, which 
> could be very inefficient. 
> It would be great to have an umbrella input format that can take multiple 
> files and use them in a single split. We would like to see this working with 
> different data formats if possible.
> There are already a couple of input formats doing a similar thing: 
> MultifileInputFormat as well as CombinedInputFormat; however, neither works 
> with the new Hadoop 20 API. 
> We at least want to do a feasibility study for Pig 0.8.0.




[jira] Commented: (PIG-1563) SUBSTRING function is broken

2010-08-27 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903546#action_12903546
 ] 

Olga Natkovich commented on PIG-1563:
-

I am looking into this to see if I can make it work without double wrapping. So 
far I got the easy case of TRIM to work. I will update the JIRA once I have more 
results.


> SUBSTRING function is broken
> 
>
> Key: PIG-1563
> URL: https://issues.apache.org/jira/browse/PIG-1563
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Olga Natkovich
>Assignee: Dmitriy V. Ryaboy
> Fix For: 0.8.0
>
> Attachments: PIG_1563.patch
>
>
> Script:
> A = load 'studenttab10k' as (name, age, gpa);
> C = foreach A generate SUBSTRING(name, 0,5);
> E = limit C 10;
> dump E;
> Output is always empty:
> ()
> ()
> ()
> ()
> ()
> ()
> ()
> ()
> ()
> ()




[jira] Commented: (PIG-1563) SUBSTRING function is broken

2010-08-27 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903578#action_12903578
 ] 

Olga Natkovich commented on PIG-1563:
-

I was able to get it working (without wrapping) for the functions that have a 
fixed number of arguments:

LAST_INDEX_OF
REPLACE
TRIM

I don't believe there is currently a way to make it work with a variable number 
of args (even if the number of combinations is fixed). Moreover, if we add the 
mapping table in this case, it breaks the case of typed data, which is bad. This 
is the case with the remaining functions, INDEXOF and SPLIT.

So my suggestion is to fix only the first set of functions and delay the rest to 
0.9, when we fix the mapping code.

Dmitriy and others, are you OK with this? If so, I can update the patch to 
reflect this.




> SUBSTRING function is broken
> 
>
> Key: PIG-1563
> URL: https://issues.apache.org/jira/browse/PIG-1563
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Olga Natkovich
>Assignee: Dmitriy V. Ryaboy
> Fix For: 0.8.0
>
> Attachments: PIG_1563.patch
>
>
> Script:
> A = load 'studenttab10k' as (name, age, gpa);
> C = foreach A generate SUBSTRING(name, 0,5);
> E = limit C 10;
> dump E;
> Output is always empty:
> ()
> ()
> ()
> ()
> ()
> ()
> ()
> ()
> ()
> ()




[jira] Updated: (PIG-1502) Document and track system limits

2010-08-27 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1502:


Fix Version/s: 0.9.0
   (was: 0.8.0)

> Document and track system limits
> 
>
> Key: PIG-1502
> URL: https://issues.apache.org/jira/browse/PIG-1502
> Project: Pig
>  Issue Type: Improvement
>Reporter: Olga Natkovich
> Fix For: 0.9.0
>
>
> We need to be able to publish what the system limitations are, to make sure 
> that Pig is used in the way it was intended and tested. For instance, if you 
> combine 30 joins in a single MR job (via multiquery) this might not work.




[jira] Commented: (PIG-1150) VAR() Variance UDF

2010-08-27 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903581#action_12903581
 ] 

Olga Natkovich commented on PIG-1150:
-

Dmitriy, are you planning to add unit tests? Do we still want this in for 0.8? 
(Since it is going into piggybank, we can do this post branching, but then we 
need to test in 2 places.)

> VAR() Variance UDF
> --
>
> Key: PIG-1150
> URL: https://issues.apache.org/jira/browse/PIG-1150
> Project: Pig
>  Issue Type: New Feature
>Affects Versions: 0.5.0
> Environment: UDF, written in Pig 0.5 contrib/
>Reporter: Russell Jurney
>Assignee: Dmitriy V. Ryaboy
> Fix For: 0.8.0
>
> Attachments: var.patch
>
>
> I've implemented a UDF in Pig 0.5 that implements Algebraic and calculates 
> variance in a distributed manner, based on the AVG() builtin.  It works by 
> calculating the count, sum and sum of squares, as described here: 
> http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm
> Is this a worthwhile contribution?  Taking the square root of this value 
> using the contrib SQRT() function gives Standard Deviation, which is missing 
> from Pig.
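
The count / sum / sum-of-squares approach described above fits the initial, intermediate, final shape of an algebraic function. The Python sketch below illustrates that decomposition only; it is not the attached patch. It assumes population variance (dividing by n), since the description does not say which definition the UDF uses, and all function names are hypothetical.

```python
def initial(values):
    """Per-chunk partial aggregate: (count, sum, sum of squares)."""
    values = list(values)
    return (len(values), sum(values), sum(v * v for v in values))


def combine(p, q):
    """Merge two partial aggregates; addition makes this order-independent."""
    return (p[0] + q[0], p[1] + q[1], p[2] + q[2])


def final(partial):
    """Population variance E[x^2] - (E[x])^2, or None for empty input."""
    n, total, total_sq = partial
    if n == 0:
        return None
    mean = total / n
    return total_sq / n - mean * mean


def variance_of_chunks(chunks):
    """Fold per-chunk partials together, as a combiner and reducer would."""
    total = (0, 0, 0)
    for chunk in chunks:
        total = combine(total, initial(chunk))
    return final(total)
```

This formula is simple to distribute but can lose precision when the mean is large relative to the spread; the linked Wikipedia article covers more numerically stable parallel variants.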




[jira] Commented: (PIG-1549) Provide utility to construct CNF form of predicates

2010-08-27 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903591#action_12903591
 ] 

Olga Natkovich commented on PIG-1549:
-

I don't think this patch applies. Can you regenerate the patch with svn diff 
from the latest code and also add unit tests? Thanks.

> Provide utility to construct CNF form of predicates
> ---
>
> Key: PIG-1549
> URL: https://issues.apache.org/jira/browse/PIG-1549
> Project: Pig
>  Issue Type: New Feature
>  Components: impl
>Affects Versions: 0.8.0
>Reporter: Swati Jain
>Assignee: Swati Jain
> Fix For: 0.8.0
>
> Attachments: 0001-Add-CNF-utility-class.patch
>
>
> Provide utility to construct CNF form of predicates




[jira] Commented: (PIG-1494) PIG Logical Optimization: Use CNF in PushUpFilter

2010-08-27 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903593#action_12903593
 ] 

Olga Natkovich commented on PIG-1494:
-

Can this be moved from the 0.8 to the 0.9 release, since we are about to branch for 0.8?

> PIG Logical Optimization: Use CNF in PushUpFilter
> -
>
> Key: PIG-1494
> URL: https://issues.apache.org/jira/browse/PIG-1494
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.7.0
>Reporter: Swati Jain
>Assignee: Swati Jain
>Priority: Minor
> Fix For: 0.8.0
>
>
> The PushUpFilter rule is not able to handle complicated boolean expressions.
> For example, SplitFilter rule is splitting one LOFilter into two by "AND". 
> However it will not be able to split LOFilter if the top level operator is 
> "OR". For example:
> *ex script:*
> A = load 'file_a' USING PigStorage(',') as (a1:int,a2:int,a3:int);
> B = load 'file_b' USING PigStorage(',') as (b1:int,b2:int,b3:int);
> C = load 'file_c' USING PigStorage(',') as (c1:int,c2:int,c3:int);
> J1 = JOIN B by b1, C by c1;
> J2 = JOIN J1 by $0, A by a1;
> D = *Filter J2 by ( (c1 < 10) AND (a3+b3 > 10) ) OR (c2 == 5);*
> explain D;
> In the above example, PushUpFilter is not able to push any filter 
> condition across any join, because the condition contains columns from all 
> branches (inputs). But if we convert this expression into "Conjunctive Normal 
> Form" (CNF), we would be able to push the conjunct (c1 < 10) OR (c2 == 5), 
> which references only columns of C, below both join conditions. Here is the 
> CNF expression for the highlighted line:
> ( (c1 < 10) OR (c2 == 5) ) AND ( (a3+b3 > 10) OR (c2 == 5) )
> *Suggestion:* It would be a good idea to convert LOFilter's boolean 
> expression into CNF, it would then be easy to push parts (conjuncts) of the 
> LOFilter boolean expression selectively. We would also not require rule 
> SplitFilter anymore if we were to add this utility to rule PushUpFilter 
> itself.
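The CNF conversion suggested above can be sketched as a small Java program that distributes OR over AND. This is a hedged illustration only: the types Atom, And and Or and the method toCnf are hypothetical names, not Pig's logical expression classes, and the sketch does not handle NOT (that would need De Morgan's laws first). In the worst case the CNF form can be exponentially larger than the input expression.

```java
final class CnfSketch {
    /** Marker interface for the tiny expression tree used in this sketch. */
    interface Expr {}

    /** A leaf predicate such as "c1 < 10"; treated as opaque text. */
    static final class Atom implements Expr {
        final String text;
        Atom(String text) { this.text = text; }
        @Override public String toString() { return text; }
    }

    static final class And implements Expr {
        final Expr left;
        final Expr right;
        And(Expr left, Expr right) { this.left = left; this.right = right; }
        @Override public String toString() { return "(" + left + " AND " + right + ")"; }
    }

    static final class Or implements Expr {
        final Expr left;
        final Expr right;
        Or(Expr left, Expr right) { this.left = left; this.right = right; }
        @Override public String toString() { return "(" + left + " OR " + right + ")"; }
    }

    /** Converts an AND/OR expression tree into conjunctive normal form. */
    static Expr toCnf(Expr e) {
        if (e instanceof And) {
            And a = (And) e;
            return new And(toCnf(a.left), toCnf(a.right));
        }
        if (e instanceof Or) {
            Or o = (Or) e;
            return distribute(toCnf(o.left), toCnf(o.right));
        }
        return e; // atoms are already in CNF
    }

    /** Given l and r already in CNF, returns the CNF of (l OR r). */
    static Expr distribute(Expr l, Expr r) {
        if (l instanceof And) {
            And a = (And) l;
            return new And(distribute(a.left, r), distribute(a.right, r));
        }
        if (r instanceof And) {
            And a = (And) r;
            return new And(distribute(l, a.left), distribute(l, a.right));
        }
        return new Or(l, r);
    }

    public static void main(String[] args) {
        // ( (c1 < 10) AND (a3+b3 > 10) ) OR (c2 == 5), as in the example script
        Expr filter = new Or(
                new And(new Atom("c1 < 10"), new Atom("a3+b3 > 10")),
                new Atom("c2 == 5"));
        // prints ((c1 < 10 OR c2 == 5) AND (a3+b3 > 10 OR c2 == 5))
        System.out.println(toCnf(filter));
    }
}
```

Once the expression is in this form, each top-level conjunct can be inspected on its own to see which inputs it references, and pushed independently.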




[jira] Updated: (PIG-1566) Support globbing for registering jars in pig script.

2010-08-27 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1566:


Fix Version/s: 0.9.0
   (was: 0.8.0)

It is too late to do this for 0.8 since we are about to branch. We can consider 
this for 0.9 especially if we have volunteers for this work

> Support globbing for registering jars in pig script.
> 
>
> Key: PIG-1566
> URL: https://issues.apache.org/jira/browse/PIG-1566
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.7.0
>Reporter: Ravi Phulari
> Fix For: 0.9.0
>
>
> Currently, users cannot register Pig jars using globbing.
> For example, the following register statement will fail:
> {quote}
> register /etc/jars/*.jar  
> {quote}
> It would be great if we could support such globbing for registering jars.
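As a rough sketch of what glob expansion for register could look like, the Java snippet below expands a pattern such as *.jar against a single local directory using the standard java.nio.file API. It is an assumption-laden illustration, not Pig code: the names JarGlobSketch, expand and sampleDir are hypothetical, and it only covers the local file system (jars on HDFS would need Hadoop's own glob support instead).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class JarGlobSketch {
    /** Expands a glob such as "*.jar" against one local directory; sorted for a stable order. */
    static List<Path> expand(Path dir, String glob) {
        List<Path> matches = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, glob)) {
            for (Path p : stream) {
                matches.add(p);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Collections.sort(matches);
        return matches;
    }

    /** Builds a throwaway directory with two jars and one unrelated file (demo only). */
    static Path sampleDir() {
        try {
            Path dir = Files.createTempDirectory("jars");
            Files.createFile(dir.resolve("b.jar"));
            Files.createFile(dir.resolve("a.jar"));
            Files.createFile(dir.resolve("notes.txt"));
            return dir;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Each match would become one individual "register <path>" statement.
        for (Path jar : expand(sampleDir(), "*.jar")) {
            System.out.println("register " + jar);
        }
    }
}
```

Expanding the glob into individual paths up front keeps the rest of the registration logic unchanged, since it still sees one jar at a time.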




[jira] Assigned: (PIG-1542) log level not propogated to MR task loggers

2010-08-27 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich reassigned PIG-1542:
---

Assignee: niraj rai

This will be looked at after the branch since this is a regression and we don't 
have time to do it now.

> log level not propogated to MR task loggers
> ---
>
> Key: PIG-1542
> URL: https://issues.apache.org/jira/browse/PIG-1542
> Project: Pig
>  Issue Type: Bug
>Reporter: Thejas M Nair
>Assignee: niraj rai
> Fix For: 0.8.0
>
>
> Specifying "-d DEBUG" does not affect the logging of the MR tasks .
> This was fixed earlier in PIG-882 .




[jira] Assigned: (PIG-1543) IsEmpty returns the wrong value after using LIMIT

2010-08-27 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich reassigned PIG-1543:
---

Assignee: Daniel Dai

Daniel, can you check whether this is related to the limit optimizer and whether 
it was addressed by the new optimizer? (This can be done post branch since it is a bug 
split.)

> IsEmpty returns the wrong value after using LIMIT
> -
>
> Key: PIG-1543
> URL: https://issues.apache.org/jira/browse/PIG-1543
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Justin Hu
>Assignee: Daniel Dai
> Fix For: 0.8.0
>
>
> 1. Two input files:
> 1a: limit_empty.input_a
> 1
> 1
> 1
> 1b: limit_empty.input_b
> 2
> 2
> 2. The pig script: limit_empty.pig
> -- A contains only 1's & B contains only 2's
> A = load 'limit_empty.input_a' as (a1:int);
> B = load 'limit_empty.input_b' as (b1:int);
> C =COGROUP A by a1, B by b1;
> D = FOREACH C generate A, B, (IsEmpty(A)? 0:1), (IsEmpty(B)? 0:1), COUNT(A), 
> COUNT(B);
> store D into 'limit_empty.output/d';
> -- After the script is done, we see the right results:
> -- {(1),(1),(1)}   {}  1   0   3   0
> -- {} {(2),(2)}  0   1   0   2
> C1 = foreach C { Alim = limit A 1; Blim = limit B 1; generate Alim, Blim; }
> D1 = FOREACH C1 generate Alim,Blim, (IsEmpty(Alim)? 0:1), (IsEmpty(Blim)? 
> 0:1), COUNT(Alim), COUNT(Blim);
> store D1 into 'limit_empty.output/d1';
> -- After the script is done, we see the unexpected results:
> -- {(1)}   {}1   1   1   0
> -- {}  {(2)} 1   1   0   1
> dump D;
> dump D1;
> 3. Run the script and redirect stdout (the 2 dumps) to a file. There are two issues:
> The major one:
> IsEmpty() returns FALSE for an empty bag in limit_empty.output/d1/*, while 
> IsEmpty() returns the correct value in limit_empty.output/d/*.
> The difference is that "LIMIT" has been applied to the bags in d1 before 
> calling IsEmpty().
> The minor one:
> The redirected output only contains the first dump:
> ({(1),(1),(1)},{},1,0,3L,0L)
> ({},{(2),(2)},0,1,0L,2L)
> We expect two more lines like:
> ({(1)},{},1,1,1L,0L)
> ({},{(2)},1,1,0L,1L)
> Besides, there is an error that says:
> [main] ERROR org.apache.pig.backend.hadoop.executionengine.HJob - 
> java.lang.ClassCastException: java.lang.Integer cannot be cast to 
> org.apache.pig.data.Tuple




[jira] Assigned: (PIG-1567) Optimization rule FilterAboveForeach is too restrictive and doesn't handle project * correctly

2010-08-27 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich reassigned PIG-1567:
---

Assignee: Xuefu Zhang

> Optimization rule FilterAboveForeach is too restrictive and doesn't handle 
> project * correctly
> --
>
> Key: PIG-1567
> URL: https://issues.apache.org/jira/browse/PIG-1567
> Project: Pig
>  Issue Type: Bug
>Reporter: Xuefu Zhang
>Assignee: Xuefu Zhang
> Fix For: 0.8.0
>
>
> The FilterAboveForeach rule optimizes the plan by pushing a filter above the 
> preceding foreach operator. However, during code review, two major problems 
> were found:
> 1. The current implementation assumes that if no projection is found in the 
> filter condition, then all columns from the foreach are projected. This issue 
> prevents the following optimization:
>   A = LOAD 'file.txt' AS (a(u,v), b, c);
>   B = FOREACH A GENERATE $0, b;
>   C = FILTER B BY 8 > 5;
>   STORE C INTO 'empty';
> 2. The current implementation doesn't handle * projection, which means project 
> all columns. As a result, it isn't able to optimize the following:
>   A = LOAD 'file.txt' AS (a(u,v), b, c);
>   B = FOREACH A GENERATE $0, b;
>   C = FILTER B BY Identity.class.getName(*) > 5;
>   STORE C INTO 'empty';
>   




[jira] Assigned: (PIG-1570) native mapreduce operator MR job does not follow same failure handling logic as other pig MR jobs

2010-08-27 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich reassigned PIG-1570:
---

Assignee: Thejas M Nair

> native mapreduce operator MR job does not follow same failure handling logic 
> as other pig MR jobs
> -
>
> Key: PIG-1570
> URL: https://issues.apache.org/jira/browse/PIG-1570
> Project: Pig
>  Issue Type: Bug
>Reporter: Thejas M Nair
>Assignee: Thejas M Nair
> Fix For: 0.8.0
>
>
> The code path for handling failure in MR job corresponding to native MR is 
> different and does not have the same behavior.
> For example, even if the MR job for mapreduce operator fails, the number of 
> jobs that failed is being reported as 0 in PigStats log.




[jira] Assigned: (PIG-1572) change default datatype when relations are used as scalar to bytearray

2010-08-27 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich reassigned PIG-1572:
---

Assignee: Thejas M Nair

> change default datatype when relations are used as scalar to bytearray
> --
>
> Key: PIG-1572
> URL: https://issues.apache.org/jira/browse/PIG-1572
> Project: Pig
>  Issue Type: Bug
>Reporter: Thejas M Nair
>Assignee: Thejas M Nair
> Fix For: 0.8.0
>
>
> When relations are cast to scalar, the current default type is chararray. 
> This is inconsistent with the behavior in rest of pig-latin.




[jira] Commented: (PIG-1150) VAR() Variance UDF

2010-08-27 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903637#action_12903637
 ] 

Olga Natkovich commented on PIG-1150:
-

So should we unlink this from the release?

> VAR() Variance UDF
> --
>
> Key: PIG-1150
> URL: https://issues.apache.org/jira/browse/PIG-1150
> Project: Pig
>  Issue Type: New Feature
>Affects Versions: 0.5.0
> Environment: UDF, written in Pig 0.5 contrib/
>Reporter: Russell Jurney
>Assignee: Dmitriy V. Ryaboy
> Fix For: 0.8.0
>
> Attachments: var.patch
>
>
> I've implemented a UDF in Pig 0.5 that implements Algebraic and calculates 
> variance in a distributed manner, based on the AVG() builtin.  It works by 
> calculating the count, sum and sum of squares, as described here: 
> http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm
> Is this a worthwhile contribution?  Taking the square root of this value 
> using the contrib SQRT() function gives Standard Deviation, which is missing 
> from Pig.




[jira] Commented: (PIG-1563) SUBSTRING function is broken

2010-08-27 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903640#action_12903640
 ] 

Olga Natkovich commented on PIG-1563:
-

which JIRA is that?

I will just get this in - I think that's all I have time today but I can look 
at the other one as well next week

> SUBSTRING function is broken
> 
>
> Key: PIG-1563
> URL: https://issues.apache.org/jira/browse/PIG-1563
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Olga Natkovich
>Assignee: Dmitriy V. Ryaboy
> Fix For: 0.8.0
>
> Attachments: PIG_1563.patch
>
>
> Script:
> A = load 'studenttab10k' as (name, age, gpa);
> C = foreach A generate SUBSTRING(name, 0,5);
> E = limit C 10;
> dump E;
> Output is always empty:
> ()
> ()
> ()
> ()
> ()
> ()
> ()
> ()
> ()
> ()




[jira] Updated: (PIG-1150) VAR() Variance UDF

2010-08-27 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1150:


Fix Version/s: 0.9.0
   (was: 0.8.0)

> VAR() Variance UDF
> --
>
> Key: PIG-1150
> URL: https://issues.apache.org/jira/browse/PIG-1150
> Project: Pig
>  Issue Type: New Feature
>Affects Versions: 0.5.0
> Environment: UDF, written in Pig 0.5 contrib/
>Reporter: Russell Jurney
>Assignee: Dmitriy V. Ryaboy
> Fix For: 0.9.0
>
> Attachments: var.patch
>
>
> I've implemented a UDF in Pig 0.5 that implements Algebraic and calculates 
> variance in a distributed manner, based on the AVG() builtin.  It works by 
> calculating the count, sum and sum of squares, as described here: 
> http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm
> Is this a worthwhile contribution?  Taking the square root of this value 
> using the contrib SQRT() function gives Standard Deviation, which is missing 
> from Pig.




[jira] Resolved: (PIG-529) Want support for loading CSV files

2010-08-27 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich resolved PIG-529.


Resolution: Duplicate

This is a duplicate of PIG-1555, which has been resolved for Pig 0.8.

> Want support for loading CSV files
> --
>
> Key: PIG-529
> URL: https://issues.apache.org/jira/browse/PIG-529
> Project: Pig
>  Issue Type: New Feature
>  Components: data
>Reporter: Tom White
>
> Want to be able to load CSV data into Pig. This needs to handle quoting 
> correctly.




[jira] Resolved: (PIG-771) PigDump does not properly output Chinese UTF8 characters - they are displayed as question marks ??

2010-08-27 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich resolved PIG-771.


Fix Version/s: 0.7.0
   Resolution: Fixed

PigDump is no longer supported

> PigDump does not properly output Chinese UTF8 characters - they are displayed 
> as question marks ??
> --
>
> Key: PIG-771
> URL: https://issues.apache.org/jira/browse/PIG-771
> Project: Pig
>  Issue Type: Bug
>Reporter: David Ciemiewicz
> Fix For: 0.7.0
>
>
> PigDump does not properly output Chinese UTF8 characters.
> The reason for this is that the function Tuple.toString() is called.
> DefaultTuple implements Tuple.toString() and it calls Object.toString() on 
> the opaque object d.
> I think that the code should be changed to call the new 
> DataType.toString() function instead.
> {code}
> @Override
> public String toString() {
>     StringBuilder sb = new StringBuilder();
>     sb.append('(');
>     for (Iterator it = mFields.iterator(); it.hasNext();) {
>         Object d = it.next();
>         if (d != null) {
>             if (d instanceof Map) {
>                 sb.append(DataType.mapToString((Map) d));
>             } else {
>                 sb.append(DataType.toString(d));  // <<< Change this one line
>                 if (d instanceof Long) {
>                     sb.append("L");
>                 } else if (d instanceof Float) {
>                     sb.append("F");
>                 }
>             }
>         } else {
>             sb.append("");
>         }
>         if (it.hasNext())
>             sb.append(",");
>     }
>     sb.append(')');
>     return sb.toString();
> }
> {code}




[jira] Created: (PIG-1577) support to variable number of arguments in UDF

2010-08-27 Thread Olga Natkovich (JIRA)
support to variable number of arguments in UDF
--

 Key: PIG-1577
 URL: https://issues.apache.org/jira/browse/PIG-1577
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Olga Natkovich
 Fix For: 0.9.0


In the current implementation, the functionality that maps arguments to 
classes does not support functions with a variable number of arguments. It also 
does not support functions that accept one of several fixed argument counts. 

This causes problems for string UDFs such as CONCAT, which can take an arbitrary 
number of arguments, and TRIM, which can take 1, 2, or 3 arguments.




[jira] Updated: (PIG-1563) SUBSTRING function is broken

2010-08-27 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1563:


Attachment: PIG_1563_v2.patch

> SUBSTRING function is broken
> 
>
> Key: PIG-1563
> URL: https://issues.apache.org/jira/browse/PIG-1563
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Olga Natkovich
>Assignee: Dmitriy V. Ryaboy
> Fix For: 0.8.0
>
> Attachments: PIG_1563.patch, PIG_1563_v2.patch
>
>
> Script:
> A = load 'studenttab10k' as (name, age, gpa);
> C = foreach A generate SUBSTRING(name, 0,5);
> E = limit C 10;
> dump E;
> Output is always empty:
> ()
> ()
> ()
> ()
> ()
> ()
> ()
> ()
> ()
> ()




[jira] Commented: (PIG-1563) SUBSTRING function is broken

2010-08-27 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903744#action_12903744
 ] 

Olga Natkovich commented on PIG-1563:
-

Uploaded a new patch which does the following:

(1) Adds a mapping function for functions with a fixed number of arguments: 
SUBSTRING, LAST_INDEX_OF, REPLACE, TRIM
(2) Leaves the rest of the functions alone, which means that until 0.9 they will 
only work on typed data. CONCAT is in the same category
(3) Re-uses applicable tests that Dmitry created, thanks!
(4) Adds a couple of e2e tests to make sure that we test the mapping function 
as well

Please review. 

We will keep the issue open till we address (2) in 0.9.



> SUBSTRING function is broken
> 
>
> Key: PIG-1563
> URL: https://issues.apache.org/jira/browse/PIG-1563
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Olga Natkovich
>Assignee: Dmitriy V. Ryaboy
> Fix For: 0.8.0
>
> Attachments: PIG_1563.patch, PIG_1563_v2.patch
>
>
> Script:
> A = load 'studenttab10k' as (name, age, gpa);
> C = foreach A generate SUBSTRING(name, 0,5);
> E = limit C 10;
> dump E;
> Output is always empty:
> ()
> ()
> ()
> ()
> ()
> ()
> ()
> ()
> ()
> ()




[jira] Commented: (PIG-1563) SUBSTRING function is broken

2010-08-30 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904450#action_12904450
 ] 

Olga Natkovich commented on PIG-1563:
-

Dmitry, thanks for the review. I did not discard your function - it was part of 
the patch. I did not change the code to use it just because I already finished 
testing the changes and did not have time to redo the code.

I am fixing some javadoc and release audit failures and will commit the code 
shortly.

> SUBSTRING function is broken
> 
>
> Key: PIG-1563
> URL: https://issues.apache.org/jira/browse/PIG-1563
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Olga Natkovich
>Assignee: Dmitriy V. Ryaboy
> Fix For: 0.8.0
>
> Attachments: PIG_1563.patch, PIG_1563_v2.patch
>
>
> Script:
> A = load 'studenttab10k' as (name, age, gpa);
> C = foreach A generate SUBSTRING(name, 0,5);
> E = limit C 10;
> dump E;
> Output is always empty:
> ()
> ()
> ()
> ()
> ()
> ()
> ()
> ()
> ()
> ()




[jira] Updated: (PIG-1563) Some string functions don't work with bytearray arguments

2010-08-30 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1563:


Summary: Some string functions don't work with bytearray arguments  (was: 
SUBSTRING function is broken)

> Some string functions don't work with bytearray arguments
> -
>
> Key: PIG-1563
> URL: https://issues.apache.org/jira/browse/PIG-1563
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Olga Natkovich
>Assignee: Dmitriy V. Ryaboy
> Fix For: 0.8.0
>
> Attachments: PIG_1563.patch, PIG_1563_v2.patch
>
>
> Script:
> A = load 'studenttab10k' as (name, age, gpa);
> C = foreach A generate SUBSTRING(name, 0,5);
> E = limit C 10;
> dump E;
> Output is always empty:
> ()
> ()
> ()
> ()
> ()
> ()
> ()
> ()
> ()
> ()




[jira] Commented: (PIG-1563) Some string functions don't work with bytearray arguments

2010-08-30 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904462#action_12904462
 ] 

Olga Natkovich commented on PIG-1563:
-

 +1 overall.
 [exec]
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec]
 [exec] +1 tests included.  The patch appears to include 13 new or 
modified tests.
 [exec]
 [exec] +1 javadoc.  The javadoc tool did not generate any warning 
messages.
 [exec]
 [exec] +1 javac.  The applied patch does not increase the total number 
of javac compiler warnings.
 [exec]
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs 
warnings.
 [exec]
 [exec] +1 release audit.  The applied patch does not increase the 
total number of release audit warnings.
 [exec]


> Some string functions don't work with bytearray arguments
> -
>
> Key: PIG-1563
> URL: https://issues.apache.org/jira/browse/PIG-1563
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Olga Natkovich
>Assignee: Dmitriy V. Ryaboy
> Fix For: 0.8.0
>
> Attachments: PIG_1563.patch, PIG_1563_v2.patch
>
>
> Script:
> A = load 'studenttab10k' as (name, age, gpa);
> C = foreach A generate SUBSTRING(name, 0,5);
> E = limit C 10;
> dump E;
> Output is always empty:
> ()
> ()
> ()
> ()
> ()
> ()
> ()
> ()
> ()
> ()




[jira] Commented: (PIG-1563) Some string functions don't work with bytearray arguments

2010-08-30 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904467#action_12904467
 ] 

Olga Natkovich commented on PIG-1563:
-

I made one additional change and renamed SPLIT to STRSPLIT to avoid a conflict 
with the SPLIT operator.

> Some string functions don't work with bytearray arguments
> -
>
> Key: PIG-1563
> URL: https://issues.apache.org/jira/browse/PIG-1563
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Olga Natkovich
>Assignee: Dmitriy V. Ryaboy
> Fix For: 0.8.0
>
> Attachments: PIG_1563.patch, PIG_1563_v2.patch
>
>
> Script:
> A = load 'studenttab10k' as (name, age, gpa);
> C = foreach A generate SUBSTRING(name, 0,5);
> E = limit C 10;
> dump E;
> Output is always empty:
> ()
> ()
> ()
> ()
> ()
> ()
> ()
> ()
> ()
> ()




[jira] Updated: (PIG-1563) Some string functions don't work with bytearray arguments

2010-08-30 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1563:


Attachment: PIG_1563_v3.patch

latest patch

> Some string functions don't work with bytearray arguments
> -
>
> Key: PIG-1563
> URL: https://issues.apache.org/jira/browse/PIG-1563
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Olga Natkovich
>Assignee: Dmitriy V. Ryaboy
> Fix For: 0.8.0
>
> Attachments: PIG_1563.patch, PIG_1563_v2.patch, PIG_1563_v3.patch
>
>
> Script:
> A = load 'studenttab10k' as (name, age, gpa);
> C = foreach A generate SUBSTRING(name, 0,5);
> E = limit C 10;
> dump E;
> Output is always empty:
> ()
> ()
> ()
> ()
> ()
> ()
> ()
> ()
> ()
> ()




[jira] Updated: (PIG-1563) Some string functions don't work with bytearray arguments

2010-08-30 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1563:


Status: Resolved  (was: Patch Available)
Resolution: Fixed

Patch committed. Thanks, Dmitry, for the help and review.

> Some string functions don't work with bytearray arguments
> -
>
> Key: PIG-1563
> URL: https://issues.apache.org/jira/browse/PIG-1563
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Olga Natkovich
>Assignee: Dmitriy V. Ryaboy
> Fix For: 0.8.0
>
> Attachments: PIG_1563.patch, PIG_1563_v2.patch, PIG_1563_v3.patch
>
>
> Script:
> A = load 'studenttab10k' as (name, age, gpa);
> C = foreach A generate SUBSTRING(name, 0,5);
> E = limit C 10;
> dump E;
> Output is always empty:
> ()
> ()
> ()
> ()
> ()
> ()
> ()
> ()
> ()
> ()




[jira] Updated: (PIG-1314) Add DateTime Support to Pig

2010-08-31 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1314:


Fix Version/s: (was: 0.8.0)

Unlinking from 0.8 since we are branching today

> Add DateTime Support to Pig
> ---
>
> Key: PIG-1314
> URL: https://issues.apache.org/jira/browse/PIG-1314
> Project: Pig
>  Issue Type: Bug
>  Components: data
>Affects Versions: 0.7.0
>Reporter: Russell Jurney
>Assignee: Russell Jurney
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> Hadoop/Pig are primarily used to parse log data, and most logs have a 
> timestamp component.  Therefore Pig should support dates as a primitive.
> Can someone familiar with adding types to pig comment on how hard this is?  
> We're looking at doing this, rather than use UDFs.  Is this a patch that 
> would be accepted?




[jira] Updated: (PIG-1429) Add Boolean Data Type to Pig

2010-08-31 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1429:


Fix Version/s: (was: 0.8.0)

Unlinking because we are branching for release today

> Add Boolean Data Type to Pig
> 
>
> Key: PIG-1429
> URL: https://issues.apache.org/jira/browse/PIG-1429
> Project: Pig
>  Issue Type: New Feature
>  Components: data
>Affects Versions: 0.7.0
>Reporter: Russell Jurney
>Assignee: Russell Jurney
> Attachments: working_boolean.patch
>
>   Original Estimate: 8h
>  Remaining Estimate: 8h
>
> Pig needs a Boolean data type.  Pig-1097 is dependent on doing this.  
> I volunteer.  Is there anything beyond the work in src/org/apache/pig/data/ 
> plus unit tests to make this work?  




[jira] Updated: (PIG-1549) Provide utility to construct CNF form of predicates

2010-08-31 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1549:


Fix Version/s: (was: 0.8.0)

Unlinking from 0.8 release since we are about to branch

> Provide utility to construct CNF form of predicates
> ---
>
> Key: PIG-1549
> URL: https://issues.apache.org/jira/browse/PIG-1549
> Project: Pig
>  Issue Type: New Feature
>  Components: impl
>Affects Versions: 0.8.0
>Reporter: Swati Jain
>Assignee: Swati Jain
> Attachments: 0001-Add-CNF-utility-class.patch
>
>
> Provide utility to construct CNF form of predicates




[jira] Resolved: (PIG-1530) PIG Logical Optimization: Push LOFilter above LOCogroup

2010-08-31 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich resolved PIG-1530.
-

Resolution: Duplicate

Xuefu is addressing this issue as part of 
https://issues.apache.org/jira/browse/PIG-1575.

>  PIG Logical Optimization: Push LOFilter above LOCogroup
> 
>
> Key: PIG-1530
> URL: https://issues.apache.org/jira/browse/PIG-1530
> Project: Pig
>  Issue Type: New Feature
>  Components: impl
>Reporter: Swati Jain
>Assignee: Swati Jain
>Priority: Minor
> Fix For: 0.8.0
>
>
> Consider the following:
> {noformat}
> A = load '' USING PigStorage(',') as (a1:int,a2:int,a3:int);
> B = load '' USING PigStorage(',') as (b1:int,b2:int,b3:int);
> G = COGROUP A by (a1,a2) , B by (b1,b2);
> D = Filter G by group.$0 + 5 > group.$1;
> explain D;
> {noformat}
> In the above example, LOFilter can be pushed above LOCogroup. Note there are 
> some tricky NULL issues to think about when the Cogroup is not of type INNER 
> (Similar to issues that need to be thought through when pushing LOFilter on 
> the right side of a LeftOuterJoin).
> Also note that typically the LOFilter in user programs will be below a 
> ForEach-Cogroup pair. To make this really useful, we need to also implement 
> LOFilter pushed across ForEach. 




[jira] Updated: (PIG-1494) PIG Logical Optimization: Use CNF in PushUpFilter

2010-08-31 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1494:



Unlinking from 0.8 since we are about to branch for release

> PIG Logical Optimization: Use CNF in PushUpFilter
> -
>
> Key: PIG-1494
> URL: https://issues.apache.org/jira/browse/PIG-1494
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.7.0
>Reporter: Swati Jain
>Assignee: Swati Jain
>Priority: Minor
>
> The PushUpFilter rule cannot handle complicated boolean expressions.
> For example, the SplitFilter rule splits one LOFilter into two at a top-level 
> "AND". However, it cannot split an LOFilter whose top-level operator is 
> "OR". For example:
> *ex script:*
> A = load 'file_a' USING PigStorage(',') as (a1:int,a2:int,a3:int);
> B = load 'file_b' USING PigStorage(',') as (b1:int,b2:int,b3:int);
> C = load 'file_c' USING PigStorage(',') as (c1:int,c2:int,c3:int);
> J1 = JOIN B by b1, C by c1;
> J2 = JOIN J1 by $0, A by a1;
> D = *Filter J2 by ( (c1 < 10) AND (a3+b3 > 10) ) OR (c2 == 5);*
> explain D;
> In the above example, PushUpFilter cannot push any filter condition across 
> any join because the expression contains columns from all branches (inputs). 
> But if we convert this expression into "Conjunctive Normal Form" (CNF), we 
> would be able to push the conjunct on c1 < 10 and c2 == 5 below both joins. 
> Here is the CNF expression for the highlighted line:
> ( (c1 < 10) OR (c2 == 5) ) AND ( (a3+b3 > 10) OR (c2 == 5) )
> *Suggestion:* It would be a good idea to convert LOFilter's boolean 
> expression into CNF; it would then be easy to push parts (conjuncts) of the 
> LOFilter boolean expression selectively. We would also no longer need the 
> SplitFilter rule if we were to add this utility to the PushUpFilter rule 
> itself.
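
The conversion suggested above can be sketched by distributing OR over AND. The Python below is a minimal illustration, not Pig's optimizer code: the tuple-based expression format, `to_cnf`, and `inputs_of` are hypothetical names, negation is not handled, and the input of a column is guessed from its one-letter prefix (a1 belongs to A, and so on).

```python
import re

# Expressions are nested tuples ("and", l, r) / ("or", l, r); leaves are
# strings holding a comparison. Negation is not handled in this sketch.
def to_cnf(expr):
    """Return a list of clauses; each clause is a frozenset of OR-ed leaves."""
    if isinstance(expr, str):
        return [frozenset([expr])]
    op, left, right = expr
    l, r = to_cnf(left), to_cnf(right)
    if op == "and":
        # A conjunction of two CNF formulas is just the union of their clauses.
        return l + r
    if op == "or":
        # Distribute OR over AND: every left clause is merged with every right one.
        return [a | b for a in l for b in r]
    raise ValueError("unsupported operator: " + op)

def inputs_of(clause):
    """Inputs referenced by a clause, taken from the column prefix (a1 -> 'a')."""
    return {m for leaf in clause for m in re.findall(r"\b([abc])\d+\b", leaf)}

# ( (c1 < 10) AND (a3+b3 > 10) ) OR (c2 == 5)
expr = ("or", ("and", "c1 < 10", "a3+b3 > 10"), "c2 == 5")
clauses = to_cnf(expr)

# A clause that touches a single input is a candidate for pushing to that input.
pushable = [c for c in clauses if len(inputs_of(c)) == 1]

for c in clauses:
    print(sorted(c), "->", sorted(inputs_of(c)))
```

Only the clause built from c1 < 10 and c2 == 5 refers to a single input, which matches the conjunct the description proposes to push. Note that distribution can multiply the number of clauses for deeply nested OR expressions, so a real rule would probably need a size limit.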




[jira] Updated: (PIG-1494) PIG Logical Optimization: Use CNF in PushUpFilter

2010-08-31 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1494:


Fix Version/s: (was: 0.8.0)

> PIG Logical Optimization: Use CNF in PushUpFilter
> -
>
> Key: PIG-1494
> URL: https://issues.apache.org/jira/browse/PIG-1494
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.7.0
>Reporter: Swati Jain
>Assignee: Swati Jain
>Priority: Minor
>




[jira] Commented: (PIG-1506) Need to clarify the difference between null handling in JOIN and COGROUP

2010-08-31 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904785#action_12904785
 ] 

Olga Natkovich commented on PIG-1506:
-

This is what we need to document:

In the case of GROUP/COGROUP, data with NULL keys from the same input is 
grouped together. For instance:

Input data (tab-separated; the age field is empty for sam and bob):

joe 5   2.5
sam 3.0
bob 3.5

script:

A = load 'small' as (name, age, gpa);
B = group A by age;
dump B;

Output:

(5,{(joe,5,2.5)})
(,{(sam,,3.0),(bob,,3.5)})

Note that both records with null age are grouped together.

However, data with null keys from different inputs is considered different and 
will generate multiple tuples in case of cogroup. For instance:

Input: Self cogroup on the same input.

Script:

A = load 'small' as (name, age, gpa);
B = load 'small' as (name, age, gpa);
C = cogroup A by age, B by age;
dump C;

Output:

(5,{(joe,5,2.5)},{(joe,5,2.5)})
(,{(sam,,3.0),(bob,,3.5)},{})
(,{},{(sam,,3.0),(bob,,3.5)})

Note that there are 2 tuples in the output corresponding to the null key: one 
that contains tuples from the first input (with no match from the second) and 
one the other way around.

JOIN adds another interesting twist to this because it follows the SQL 
standard, which means that JOIN by default is an inner join, which throws away 
all the nulls.

Input: the same as for COGROUP

Script:

A = load 'small' as (name, age, gpa);
B = load 'small' as (name, age, gpa);
C = join A by age, B by age;
dump C;

Output:

(joe,5,2.5,joe,5,2.5)

Note that all tuples with a NULL key are filtered out.
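
The three behaviours above can be reproduced with a small in-memory model. This is a Python sketch under stated assumptions, not Pig itself: `None` stands for a Pig null, `group`, `cogroup`, and `join` are hypothetical helpers written only to mirror the rules described in this comment, and the key is given as a column index.

```python
# Toy in-memory model of the null-key behaviour described above.
# None stands for a Pig null; column 1 is `age`.
small = [("joe", 5, 2.5), ("sam", None, 3.0), ("bob", None, 3.5)]

def group(rel, key):
    """GROUP: null keys from the same input land in one group."""
    out = {}
    for t in rel:
        out.setdefault(t[key], []).append(t)
    return out

def cogroup(a, b, key):
    """COGROUP: non-null keys are matched across inputs; null keys are not."""
    keys = []
    for t in a + b:
        if t[key] is not None and t[key] not in keys:
            keys.append(t[key])
    out = [(k, [t for t in a if t[key] == k], [t for t in b if t[key] == k])
           for k in keys]
    # Null keys produce one output tuple per input, never a merged one.
    a_nulls = [t for t in a if t[key] is None]
    b_nulls = [t for t in b if t[key] is None]
    if a_nulls:
        out.append((None, a_nulls, []))
    if b_nulls:
        out.append((None, [], b_nulls))
    return out

def join(a, b, key):
    """Inner JOIN: tuples with a null key never match anything."""
    return [x + y for x in a for y in b
            if x[key] is not None and x[key] == y[key]]

print(group(small, 1))
print(cogroup(small, small, 1))
print(join(small, small, 1))
```

The model yields one null group for GROUP, two separate null tuples for the self-COGROUP, and only the joe row for the JOIN, in line with the outputs listed above.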


> Need to clarify the difference between null handling in JOIN and COGROUP
> 
>
> Key: PIG-1506
> URL: https://issues.apache.org/jira/browse/PIG-1506
> Project: Pig
>  Issue Type: Improvement
>  Components: documentation
>Reporter: Olga Natkovich
>Assignee: Corinne Chandel
> Fix For: 0.8.0
>
>





[jira] Created: (PIG-1584) deal with inner cogroup

2010-08-31 Thread Olga Natkovich (JIRA)
deal with inner cogroup
---

 Key: PIG-1584
 URL: https://issues.apache.org/jira/browse/PIG-1584
 Project: Pig
  Issue Type: Bug
Reporter: Olga Natkovich
 Fix For: 0.9.0


The current implementation of inner in the case of cogroup conflicts with 
join. We need to decide whether to fix inner cogroup or just remove the 
functionality if it is not widely used.




[jira] Commented: (PIG-1506) Need to clarify the difference between null handling in JOIN and COGROUP

2010-08-31 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904829#action_12904829
 ] 

Olga Natkovich commented on PIG-1506:
-

I verified that the 0.8 code does deal correctly with multi-column keys with nulls.

> Need to clarify the difference between null handling in JOIN and COGROUP
> 
>
> Key: PIG-1506
> URL: https://issues.apache.org/jira/browse/PIG-1506
> Project: Pig
>  Issue Type: Improvement
>  Components: documentation
>Reporter: Olga Natkovich
>Assignee: Corinne Chandel
> Fix For: 0.8.0
>
>





[jira] Created: (PIG-1585) Add new properties to help and documentation

2010-08-31 Thread Olga Natkovich (JIRA)
Add new properties to help and documentation


 Key: PIG-1585
 URL: https://issues.apache.org/jira/browse/PIG-1585
 Project: Pig
  Issue Type: Bug
Reporter: Olga Natkovich
Assignee: Olga Natkovich
 Fix For: 0.8.0


New properties:

Compression:

pig.tmpfilecompression (default: false) controls whether temporary files are 
compressed. If it is true, pig.tmpfilecompression.codec specifies which 
compression codec to use. Currently, Pig accepts only "gz" and "lzo" as 
values. Because LZO is under the GPL license, Hadoop may need to be configured 
separately to use the LZO codec. Please refer to 
http://code.google.com/p/hadoop-gpl-compression/wiki/FAQ for details. 

Combining small files:

pig.noSplitCombination - disables combining multiple small files into splits 
of up to the block size





[jira] Commented: (PIG-1501) need to investigate the impact of compression on pig performance

2010-08-31 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904848#action_12904848
 ] 

Olga Natkovich commented on PIG-1501:
-

Ashutosh,

The reason it is off by default is that the default compression codec is gzip, 
which is really slow and most of the time not what you want. Because of the 
licensing issue with LZO, users need to set it up on their own. Once they have 
done so, they can enable compression.

> need to investigate the impact of compression on pig performance
> 
>
> Key: PIG-1501
> URL: https://issues.apache.org/jira/browse/PIG-1501
> Project: Pig
>  Issue Type: Test
>Reporter: Olga Natkovich
>Assignee: Yan Zhou
> Fix For: 0.8.0
>
> Attachments: compress_perf_data.txt, compress_perf_data_2.txt, 
> PIG-1501.patch, PIG-1501.patch, PIG-1501.patch
>
>
> We would like to understand how compressing map results as well as 
> reducer output in a chain of MR jobs impacts performance. We can use PigMix 
> queries for this investigation.




[jira] Assigned: (PIG-1586) Parameter substitution using -param option runs into problems when substituting entire pig statements in a shell script (maybe this is a bash problem)

2010-08-31 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich reassigned PIG-1586:
---

Assignee: Viraj Bhat

Viraj volunteered to print the line that Pig receives as part of parameter 
substitution to see whether the escapes and quotes are eaten by the shell. 
Thanks, Viraj.

> Parameter substitution using -param option runs into problems when substituting 
> entire pig statements in a shell script (maybe this is a bash problem)
> 
>
> Key: PIG-1586
> URL: https://issues.apache.org/jira/browse/PIG-1586
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Viraj Bhat
>Assignee: Viraj Bhat
>
> I have a Pig script as a template:
> {code}
> register Countwords.jar;
> A = $INPUT;
> B = FOREACH A GENERATE
> examples.udf.SubString($0,0,1),
> $1 as num;
> C = GROUP B BY $0;
> D = FOREACH C GENERATE group, SUM(B.num);
> STORE D INTO $OUTPUT;
> {code}
> I attempt parameter substitution using the following shell script:
> {code}
> #!/bin/bash
> java -cp ~/pig-svn/trunk/pig.jar:$HADOOP_CONF_DIR org.apache.pig.Main -r 
> -file sub.pig \
>  -param INPUT="(foreach (COGROUP(load '/user/viraj/dataset1' 
> USING PigStorage() AS (word:chararray,num:int)) by (word),(load 
> '/user/viraj/dataset2' USING PigStorage() AS (word:chararray,num:int)) by 
> (word)) generate flatten(examples.udf.CountWords(\\$0,\\$1,\\$2)))" \
>  -param OUTPUT="\'/user/viraj/output\' USING PigStorage()"
> {code}
> The resulting script after parameter substitution:
> {code}
> register Countwords.jar;
> A = (foreach (COGROUP(load '/user/viraj/dataset1' USING PigStorage() AS 
> (word:chararray,num:int)) by (word),(load '/user/viraj/dataset2' USING 
> PigStorage() AS (word:chararray,num:int)) by (word)) generate 
> flatten(examples.udf.CountWords(runsub.sh,,)));
> B = FOREACH A GENERATE
> examples.udf.SubString($0,0,1),
> $1 as num;
> C = GROUP B BY $0;
> D = FOREACH C GENERATE group, SUM(B.num);
> STORE D INTO /user/viraj/output;
> {code}
> The shell substitutes the $0 before passing it to java. 
> a) Is there a workaround for this?
> b) Is this a Pig param problem?
> Viraj
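
What the shell hands to the program can be inspected without Pig. The sketch below is Python driving a POSIX shell; it assumes an `sh` binary on the PATH, `passed_by_shell` is a hypothetical helper, and $0 is set to runsub.sh only to mimic the report. It shows what the shell delivers, not what Pig's parameter substitution then does with it.

```python
import subprocess

def passed_by_shell(quoted_arg):
    """Return the argument a POSIX shell hands to a program for `quoted_arg`.

    The shell runs with $0 set to 'runsub.sh' and no positional parameters,
    mimicking the script in the report.
    """
    # printf '%s' prints its argument literally, without interpreting backslashes.
    script = "printf '%s' " + quoted_arg
    done = subprocess.run(["sh", "-c", script, "runsub.sh"],
                          capture_output=True, text=True, check=True)
    return done.stdout

# Double quotes as in the report: \\ collapses to \ and $0, $1 are expanded.
double_quoted = passed_by_shell(r'"CountWords(\\$0,\\$1)"')

# Single quotes: the shell passes every character through unchanged.
single_quoted = passed_by_shell(r"'CountWords(\\$0,\\$1)'")

# Double quotes with the dollar sign itself escaped: \\ -> \ and \$ -> $.
escaped_dollar = passed_by_shell(r'"CountWords(\\\$0,\\\$1)"')

print(double_quoted)
print(single_quoted)
print(escaped_dollar)
```

Inside double quotes the shell treats `\\` as one literal backslash and still expands the `$0` that follows, which is consistent with runsub.sh showing up in the substituted script. Single quotes, or an extra backslash in front of each dollar sign, keep the positional variables away from the shell.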




[jira] Resolved: (PIG-1588) Parameter pre-processing of values containing pig positional variables ($0, $1 etc)

2010-08-31 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich resolved PIG-1588.
-

Resolution: Duplicate

This is a duplicate of https://issues.apache.org/jira/browse/PIG-1586, and at 
this point we do not believe that either is a bug in Pig. Viraj is verifying 
that, but we think that the shell removes the escapes before passing the value 
to Pig.

> Parameter pre-processing of values containing pig positional variables ($0, 
> $1 etc)
> ---
>
> Key: PIG-1588
> URL: https://issues.apache.org/jira/browse/PIG-1588
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.7.0
>Reporter: Laukik Chitnis
> Fix For: 0.7.0
>
>
> Pig 0.7 requires the positional variables to be escaped by a \\ when passed 
> as part of a parameter value (either through a command-line param or through 
> param_file), which was not the case in Pig 0.6. Assuming that this was not an 
> intended break in backward compatibility (I could not find it in the release 
> notes), this would be a bug.
> For example, we need to pass
> INPUT=CountWords(\\$0,\\$1,\\$2)
> instead of simply
> INPUT=CountWords($0,$1,$2)




[jira] Resolved: (PIG-1537) Column pruner causes wrong results when using both Custom Store UDF and PigStorage

2010-08-31 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich resolved PIG-1537.
-

Resolution: Fixed

> Column pruner causes wrong results when using both Custom Store UDF and 
> PigStorage
> --
>
> Key: PIG-1537
> URL: https://issues.apache.org/jira/browse/PIG-1537
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Viraj Bhat
>Assignee: Daniel Dai
> Fix For: 0.8.0
>
>
> I have a script that follows this pattern and uses 2 StoreFuncs:
> {code}
> register loader.jar
> register piggy-bank/java/build/storage.jar;
> %DEFAULT OUTPUTDIR /user/viraj/prunecol/
> ss_sc_0 = LOAD '/data/click/20100707/0' USING Loader() AS (a, b, c);
> ss_sc_filtered_0 = FILTER ss_sc_0 BY
> a#'id' matches '1.*' OR
> a#'id' matches '2.*' OR
> a#'id' matches '3.*' OR
> a#'id' matches '4.*';
> ss_sc_1 = LOAD '/data/click/20100707/1' USING Loader() AS (a, b, c);
> ss_sc_filtered_1 = FILTER ss_sc_1 BY
> a#'id' matches '65.*' OR
> a#'id' matches '466.*' OR
> a#'id' matches '043.*' OR
> a#'id' matches '044.*' OR
> a#'id' matches '0650.*' OR
> a#'id' matches '001.*';
> ss_sc_all = UNION ss_sc_filtered_0,ss_sc_filtered_1;
> ss_sc_all_proj = FOREACH ss_sc_all GENERATE
> a#'query' as query,
> a#'testid' as testid,
> a#'timestamp' as timestamp,
> a,
> b,
> c;
> ss_sc_all_ord = ORDER ss_sc_all_proj BY query,testid,timestamp PARALLEL 10;
> ss_sc_all_map = FOREACH ss_sc_all_ord  GENERATE a, b, c;
> STORE ss_sc_all_map INTO '$OUTPUTDIR/data/20100707' using Storage();
> ss_sc_all_map_count = group ss_sc_all_map all;
> count = FOREACH ss_sc_all_map_count GENERATE 'record_count' as 
> record_count,COUNT($1);
> STORE count INTO '$OUTPUTDIR/count/20100707' using PigStorage('\u0009');
> {code}
> I run this script using:
> a) java -cp pig0.7.jar script.pig
> b) java -cp pig0.7.jar -t PruneColumns script.pig
> What I observe is that the alias "count" produces the same number of records, 
> but "ss_sc_all_map" has different sizes when run with the above 2 options.
> Is this due to the fact that 2 StoreFuncs are used?
> Viraj




[jira] Updated: (PIG-747) Logical to Physical Plan Translation fails when temporary alias are created within foreach

2010-08-31 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-747:
---

Fix Version/s: 0.9.0
   (was: 0.8.0)

> Logical to Physical Plan Translation fails when temporary alias are created 
> within foreach
> --
>
> Key: PIG-747
> URL: https://issues.apache.org/jira/browse/PIG-747
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.4.0
>Reporter: Viraj Bhat
>Assignee: Daniel Dai
> Fix For: 0.9.0
>
> Attachments: physicalplan.txt, physicalplanprob.pig, PIG-747-1.patch
>
>
> Consider a Pig script that calculates a new column F inside the foreach 
> as:
> {code}
> A = load 'physicalplan.txt' as (col1,col2,col3);
> B = foreach A {
>D = col1/col2;
>E = col3/col2;
>F = E - (D*D);
>generate
>F as newcol;
> };
> dump B;
> {code}
> This gives the following error:
> ===
> Caused by: org.apache.pig.backend.hadoop.executionengine.physicalLayer.LogicalToPhysicalTranslatorException: ERROR 2015: Invalid physical operators in the physical plan
>     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.LogToPhyTranslationVisitor.visit(LogToPhyTranslationVisitor.java:377)
>     at org.apache.pig.impl.logicalLayer.LOMultiply.visit(LOMultiply.java:63)
>     at org.apache.pig.impl.logicalLayer.LOMultiply.visit(LOMultiply.java:29)
>     at org.apache.pig.impl.plan.DependencyOrderWalkerWOSeenChk.walk(DependencyOrderWalkerWOSeenChk.java:68)
>     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.LogToPhyTranslationVisitor.visit(LogToPhyTranslationVisitor.java:908)
>     at org.apache.pig.impl.logicalLayer.LOForEach.visit(LOForEach.java:122)
>     at org.apache.pig.impl.logicalLayer.LOForEach.visit(LOForEach.java:41)
>     at org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:68)
>     at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
>     at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:246)
>     ... 10 more
> Caused by: org.apache.pig.impl.plan.PlanException: ERROR 0: Attempt to give operator of type org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.Divide multiple outputs.  This operator does not support multiple outputs.
>     at org.apache.pig.impl.plan.OperatorPlan.connect(OperatorPlan.java:158)
>     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.plans.PhysicalPlan.connect(PhysicalPlan.java:89)
>     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.LogToPhyTranslationVisitor.visit(LogToPhyTranslationVisitor.java:373)
>     ... 19 more
> ===



