[jira] Commented: (PIG-1661) Add alternative search-provider to Pig site

2010-10-01 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12917130#action_12917130
 ] 

Ashutosh Chauhan commented on PIG-1661:
---

+1 for experimenting with search-hadoop.
The patch itself is small enough that, even if we find otherwise, it can easily 
be reverted.

> Add alternative search-provider to Pig site
> ---
>
> Key: PIG-1661
> URL: https://issues.apache.org/jira/browse/PIG-1661
> Project: Pig
>  Issue Type: Improvement
>  Components: documentation
>Reporter: Alex Baranau
>Priority: Minor
> Attachments: PIG-1661.patch
>
>
> Use search-hadoop.com service to make available search in Pig sources, MLs, 
> wiki, etc.
> This was initially proposed on user mailing list. The search service was 
> already added in site's skin (common for all Hadoop related projects) via 
> AVRO-626 so this issue is about enabling it for Pig.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1531) Pig gobbles up error messages

2010-10-01 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1531:
--

Status: Resolved  (was: Patch Available)
Resolution: Fixed

Committed to both trunk and 0.8. Thanks, Niraj!

> Pig gobbles up error messages
> -
>
> Key: PIG-1531
> URL: https://issues.apache.org/jira/browse/PIG-1531
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>        Reporter: Ashutosh Chauhan
>Assignee: niraj rai
> Fix For: 0.8.0
>
> Attachments: pig-1531_3.patch, pig-1531_4.patch, PIG-1531_5.patch, 
> PIG_1531.patch, PIG_1531_2.patch
>
>
> Consider the following. I have my own Storer implementing StoreFunc and I am 
> throwing FrontEndException (and other Exceptions derived from PigException) 
> in its various methods. I expect those error messages to be shown in error 
> scenarios. Instead Pig gobbles up my error messages and shows its own generic 
> error message like: 
> {code}
> 010-07-31 14:14:25,414 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
> 2116: Unexpected error. Could not validate the output specification for: 
> default.partitoned
> Details at logfile: /Users/ashutosh/workspace/pig/pig_1280610650690.log
> {code}
> Instead I expect it to display my error messages which it stores away in that 
> log file.




[jira] Commented: (PIG-1641) Incorrect counters in local mode

2010-09-27 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915408#action_12915408
 ] 

Ashutosh Chauhan commented on PIG-1641:
---

Tested manually for local mode. Messages were the same as proposed above. +1 for 
the commit. One minor suggestion is to put a line at the start saying something 
like: "Detected Local mode. Stats reported below may be incomplete." This will 
reinforce to users that stats reporting is not transparent across different 
modes (local vs. map-reduce).

> Incorrect counters in local mode
> 
>
> Key: PIG-1641
> URL: https://issues.apache.org/jira/browse/PIG-1641
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Ashutosh Chauhan
>Assignee: Richard Ding
> Fix For: 0.8.0
>
> Attachments: PIG-1641.patch
>
>
> User report, not verified.
> 
> HadoopVersion  PigVersion      UserId  StartedAt            FinishedAt           Features
> 0.20.2         0.8.0-SNAPSHOT  user    2010-09-21 19:25:58  2010-09-21 21:58:42  ORDER_BY
> Success!
> Job Stats (time in seconds):
> JobId           Maps  Reduces  MaxMapTime  MinMapTime  AvgMapTime  MaxReduceTime  MinReduceTime  AvgReduceTime  Alias      Feature   Outputs
> job_local_0001  0     0        0           0           0           0              0              0              raw        MAP_ONLY
> job_local_0002  0     0        0           0           0           0              0              0              rank_sort  SAMPLER
> job_local_0003  0     0        0           0           0           0              0              0              rank_sort  ORDER_BY  Processed/user_visits_table,
> Input(s):
> Successfully read 0 records from: "Data/Raw/UserVisits.dat"
> Output(s):
> Successfully stored 0 records in: "Processed/user_visits_table"
> However, when I look in the output:
> $ ls -lh Processed/user_visits_table/CG0/
> total 15250760
> -rwxrwxrwx  1 user  _lpoperator   7.3G Sep 21 21:58 part-0*
> It read a 20G input file and generated some output...
> 
> Is it that in local mode counters are not available? If so, instead of 
> printing zeros we should print "Information Unavailable" or some such.




[jira] Created: (PIG-1641) Incorrect counters in local mode

2010-09-22 Thread Ashutosh Chauhan (JIRA)
Incorrect counters in local mode


 Key: PIG-1641
 URL: https://issues.apache.org/jira/browse/PIG-1641
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Ashutosh Chauhan


User report, not verified.



HadoopVersion  PigVersion      UserId  StartedAt            FinishedAt           Features
0.20.2         0.8.0-SNAPSHOT  user    2010-09-21 19:25:58  2010-09-21 21:58:42  ORDER_BY

Success!

Job Stats (time in seconds):
JobId           Maps  Reduces  MaxMapTime  MinMapTime  AvgMapTime  MaxReduceTime  MinReduceTime  AvgReduceTime  Alias      Feature   Outputs
job_local_0001  0     0        0           0           0           0              0              0              raw        MAP_ONLY
job_local_0002  0     0        0           0           0           0              0              0              rank_sort  SAMPLER
job_local_0003  0     0        0           0           0           0              0              0              rank_sort  ORDER_BY  Processed/user_visits_table,

Input(s):
Successfully read 0 records from: "Data/Raw/UserVisits.dat"

Output(s):
Successfully stored 0 records in: "Processed/user_visits_table"


However, when I look in the output:

$ ls -lh Processed/user_visits_table/CG0/
total 15250760
-rwxrwxrwx  1 user  _lpoperator   7.3G Sep 21 21:58 part-0*

It read a 20G input file and generated some output...



Is it that in local mode counters are not available? If so, instead of printing 
zeros we should print "Information Unavailable" or some such.
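The "Information Unavailable" suggestion above can be sketched in plain Java; the class and method names here are illustrative stand-ins, not Pig's actual stats-reporting API:

```java
// Sketch: render a job-stats counter as text, substituting a marker instead of
// a misleading zero when the execution mode cannot supply real counter values.
public class LocalModeStats {

    // Render one counter cell; fall back to a marker when counters are unsupported.
    static String render(long value, boolean countersAvailable) {
        return countersAvailable ? Long.toString(value) : "Information Unavailable";
    }

    // Optional banner, per the review comment on this issue, warning that
    // local-mode stats may be incomplete.
    static String header(boolean localMode) {
        return localMode
            ? "Detected Local mode. Stats reported below may be incomplete.\n"
            : "";
    }
}
```

In map-reduce mode the real values print unchanged; in local mode the marker makes the limitation explicit instead of reporting "0 records" for a 20G input.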




[jira] Commented: (PIG-1531) Pig gobbles up error messages

2010-09-21 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913048#action_12913048
 ] 

Ashutosh Chauhan commented on PIG-1531:
---

Oh Hudson, oh well...

Ran the full suite of 400 minutes of unit tests; all passed. Patch is ready for 
review.

> Pig gobbles up error messages
> -
>
> Key: PIG-1531
> URL: https://issues.apache.org/jira/browse/PIG-1531
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>    Reporter: Ashutosh Chauhan
>Assignee: niraj rai
> Fix For: 0.8.0
>
> Attachments: pig-1531_3.patch, pig-1531_4.patch, PIG_1531.patch, 
> PIG_1531_2.patch
>
>
> Consider the following. I have my own Storer implementing StoreFunc and I am 
> throwing FrontEndException (and other Exceptions derived from PigException) 
> in its various methods. I expect those error messages to be shown in error 
> scenarios. Instead Pig gobbles up my error messages and shows its own generic 
> error message like: 
> {code}
> 010-07-31 14:14:25,414 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
> 2116: Unexpected error. Could not validate the output specification for: 
> default.partitoned
> Details at logfile: /Users/ashutosh/workspace/pig/pig_1280610650690.log
> {code}
> Instead I expect it to display my error messages which it stores away in that 
> log file.




[jira] Updated: (PIG-1531) Pig gobbles up error messages

2010-09-19 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1531:
--

Attachment: pig-1531_4.patch

Added a test case which fails on trunk. Pig still gobbles up error messages. 
The fix is to rethrow the message up the hierarchy. The attached patch contains 
the test case and the fix.
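The rethrow fix described above can be illustrated in plain Java; the class and method names are made up for this sketch and are not taken from the actual patch:

```java
// Sketch: wrapping a user exception with a fixed generic message "gobbles" the
// original error; rethrowing with the cause's own message preserves it.
public class ErrorPropagation {

    // Anti-pattern: the caller only ever sees the generic text.
    static RuntimeException gobble(Exception cause) {
        return new RuntimeException("ERROR 2116: Unexpected error.");
    }

    // Fix: carry the underlying message up the hierarchy (null-safe), and keep
    // the cause attached so the full trace still lands in the log file.
    static RuntimeException rethrow(Exception cause) {
        String msg = cause.getMessage() != null ? cause.getMessage()
                                                : "Unexpected error.";
        return new RuntimeException(msg, cause);
    }
}
```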

> Pig gobbles up error messages
> -
>
> Key: PIG-1531
> URL: https://issues.apache.org/jira/browse/PIG-1531
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>        Reporter: Ashutosh Chauhan
>Assignee: niraj rai
> Fix For: 0.8.0
>
> Attachments: pig-1531_3.patch, pig-1531_4.patch, PIG_1531.patch, 
> PIG_1531_2.patch
>
>
> Consider the following. I have my own Storer implementing StoreFunc and I am 
> throwing FrontEndException (and other Exceptions derived from PigException) 
> in its various methods. I expect those error messages to be shown in error 
> scenarios. Instead Pig gobbles up my error messages and shows its own generic 
> error message like: 
> {code}
> 010-07-31 14:14:25,414 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
> 2116: Unexpected error. Could not validate the output specification for: 
> default.partitoned
> Details at logfile: /Users/ashutosh/workspace/pig/pig_1280610650690.log
> {code}
> Instead I expect it to display my error messages which it stores away in that 
> log file.




[jira] Updated: (PIG-1531) Pig gobbles up error messages

2010-09-19 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1531:
--

Status: Patch Available  (was: Reopened)

> Pig gobbles up error messages
> -
>
> Key: PIG-1531
> URL: https://issues.apache.org/jira/browse/PIG-1531
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>        Reporter: Ashutosh Chauhan
>Assignee: niraj rai
> Fix For: 0.8.0
>
> Attachments: pig-1531_3.patch, pig-1531_4.patch, PIG_1531.patch, 
> PIG_1531_2.patch
>
>
> Consider the following. I have my own Storer implementing StoreFunc and I am 
> throwing FrontEndException (and other Exceptions derived from PigException) 
> in its various methods. I expect those error messages to be shown in error 
> scenarios. Instead Pig gobbles up my error messages and shows its own generic 
> error message like: 
> {code}
> 010-07-31 14:14:25,414 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
> 2116: Unexpected error. Could not validate the output specification for: 
> default.partitoned
> Details at logfile: /Users/ashutosh/workspace/pig/pig_1280610650690.log
> {code}
> Instead I expect it to display my error messages which it stores away in that 
> log file.




[jira] Reopened: (PIG-1531) Pig gobbles up error messages

2010-09-19 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan reopened PIG-1531:
---


Peril of not writing a unit test: resurrection of the bug. Argh..


> Pig gobbles up error messages
> -
>
> Key: PIG-1531
> URL: https://issues.apache.org/jira/browse/PIG-1531
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>        Reporter: Ashutosh Chauhan
>Assignee: niraj rai
> Fix For: 0.8.0
>
> Attachments: pig-1531_3.patch, PIG_1531.patch, PIG_1531_2.patch
>
>
> Consider the following. I have my own Storer implementing StoreFunc and I am 
> throwing FrontEndException (and other Exceptions derived from PigException) 
> in its various methods. I expect those error messages to be shown in error 
> scenarios. Instead Pig gobbles up my error messages and shows its own generic 
> error message like: 
> {code}
> 010-07-31 14:14:25,414 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
> 2116: Unexpected error. Could not validate the output specification for: 
> default.partitoned
> Details at logfile: /Users/ashutosh/workspace/pig/pig_1280610650690.log
> {code}
> Instead I expect it to display my error messages which it stores away in that 
> log file.




[jira] Commented: (PIG-1590) Use POMergeJoin for Left Outer Join when join using 'merge'

2010-09-03 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905995#action_12905995
 ] 

Ashutosh Chauhan commented on PIG-1590:
---

An inner merge join on more than two tables also translates into 
POMergeCogroup + FE + Flatten. That case could likewise be translated to use 
POMergeJoin and enjoy the benefits that come with it, though I suspect it would 
introduce much more complexity in POMergeJoin than the left outer merge join 
case does, so it may not be worth doing.

> Use POMergeJoin for Left Outer Join when join using 'merge'
> ---
>
> Key: PIG-1590
> URL: https://issues.apache.org/jira/browse/PIG-1590
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.8.0
>Reporter: Ashutosh Chauhan
>Priority: Minor
>
> C = join A by $0 left, B by $0 using 'merge';
> will result in map-side sort merge join. Internally, it will translate to use 
> POMergeCogroup + ForEachFlatten. POMergeCogroup places quite a few 
> restrictions on its loaders (A and B in this case) which is cumbersome. 
> Currently, only Zebra is known to satisfy all those requirements. It will be 
> better to use POMergeJoin in this case, since it has far fewer requirements 
> on its loader. Importantly, it works with PigStorage.  Plus, POMergeJoin will 
> be faster than POMergeCogroup + FE-Flatten.




[jira] Commented: (PIG-1309) Sort Merge Cogroup

2010-09-03 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905935#action_12905935
 ] 

Ashutosh Chauhan commented on PIG-1309:
---

Correct. Condition (1) is implied only for user-specified statements.

> Sort Merge Cogroup
> --
>
> Key: PIG-1309
> URL: https://issues.apache.org/jira/browse/PIG-1309
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>    Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
> Fix For: 0.7.0, 0.8.0
>
> Attachments: mapsideCogrp.patch, pig-1309_1.patch, pig-1309_2.patch, 
> PIG_1309_7.patch
>
>
> In never ending quest to make Pig go faster, we want to parallelize as many 
> relational operations as possible. Its already possible to do Group-by( 
> PIG-984 ) and Joins( PIG-845 , PIG-554 ) purely in map-side in Pig. This jira 
> is to add map-side implementation of Cogroup in Pig. Details to follow.




[jira] Commented: (PIG-1598) Pig gobbles up error messages - Part 2

2010-09-02 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905771#action_12905771
 ] 

Ashutosh Chauhan commented on PIG-1598:
---

grunt> c = group a by $0 using 'collected';
grunt> dump c;
2010-09-02 19:24:28,765 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name: c: 
Store(hdfs://server.com:9020/tmp/temp893971773/tmp-1357568439:org.apache.pig.builtin.BinStorage)
 - 1-364 Operator Key: 1-364)
2010-09-02 19:24:28,767 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
2043: Unexpected error during execution.
Details at logfile: /Users/chauhana/workspace/pig-fix-bags/pig_1283478800827.log
grunt> sh tail -n 12 pig_1283478800827.log
... 7 more
Caused by: 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompilerException:
 ERROR 0: While using 'collected' on group; data must be loaded via loader 
implementing CollectableLoadFunc.

Similar cases exist for other error conditions in group and cogroup when using 
'merge' and 'collected' and all the conditions are not met. 

> Pig gobbles up error messages - Part 2
> --
>
> Key: PIG-1598
> URL: https://issues.apache.org/jira/browse/PIG-1598
> Project: Pig
>  Issue Type: Improvement
>Reporter: Ashutosh Chauhan
>
> Another case of PIG-1531 .




[jira] Created: (PIG-1598) Pig gobbles up error messages - Part 2

2010-09-02 Thread Ashutosh Chauhan (JIRA)
Pig gobbles up error messages - Part 2
--

 Key: PIG-1598
 URL: https://issues.apache.org/jira/browse/PIG-1598
 Project: Pig
  Issue Type: Improvement
Reporter: Ashutosh Chauhan


Another case of PIG-1531 .




[jira] Commented: (PIG-1590) Use POMergeJoin for Left Outer Join when join using 'merge'

2010-09-01 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905207#action_12905207
 ] 

Ashutosh Chauhan commented on PIG-1590:
---

It will entail changes in POMergeJoin and LogToPhyTranslationVisitor.

> Use POMergeJoin for Left Outer Join when join using 'merge'
> ---
>
> Key: PIG-1590
> URL: https://issues.apache.org/jira/browse/PIG-1590
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.8.0
>Reporter: Ashutosh Chauhan
>Priority: Minor
>
> C = join A by $0 left, B by $0 using 'merge';
> will result in map-side sort merge join. Internally, it will translate to use 
> POMergeCogroup + ForEachFlatten. POMergeCogroup places quite a few 
> restrictions on its loaders (A and B in this case) which is cumbersome. 
> Currently, only Zebra is known to satisfy all those requirements. It will be 
> better to use POMergeJoin in this case, since it has far fewer requirements 
> on its loader. Importantly, it works with PigStorage.  Plus, POMergeJoin will 
> be faster than POMergeCogroup + FE-Flatten.




[jira] Created: (PIG-1590) Use POMergeJoin for Left Outer Join when join using 'merge'

2010-09-01 Thread Ashutosh Chauhan (JIRA)
Use POMergeJoin for Left Outer Join when join using 'merge'
---

 Key: PIG-1590
 URL: https://issues.apache.org/jira/browse/PIG-1590
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.8.0
Reporter: Ashutosh Chauhan
Priority: Minor


C = join A by $0 left, B by $0 using 'merge';

will result in map-side sort merge join. Internally, it will translate to use 
POMergeCogroup + ForEachFlatten. POMergeCogroup places quite a few restrictions 
on its loaders (A and B in this case) which is cumbersome. Currently, only 
Zebra is known to satisfy all those requirements. It will be better to use 
POMergeJoin in this case, since it has far fewer requirements on its loader. 
Importantly, it works with PigStorage.  Plus, POMergeJoin will be faster than 
POMergeCogroup + FE-Flatten.




[jira] Commented: (PIG-1501) need to investigate the impact of compression on pig performance

2010-08-31 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904843#action_12904843
 ] 

Ashutosh Chauhan commented on PIG-1501:
---

If it's not backward-incompatible, is there any specific reason to default 
pig.tmpfilecompression to false? This seems to be a useful feature, so it 
should be true by default, no?

> need to investigate the impact of compression on pig performance
> 
>
> Key: PIG-1501
> URL: https://issues.apache.org/jira/browse/PIG-1501
> Project: Pig
>  Issue Type: Test
>Reporter: Olga Natkovich
>Assignee: Yan Zhou
> Fix For: 0.8.0
>
> Attachments: compress_perf_data.txt, compress_perf_data_2.txt, 
> PIG-1501.patch, PIG-1501.patch, PIG-1501.patch
>
>
> We would like to understand how compressing map results as well as well as 
> reducer output in a chain of MR jobs impacts performance. We can use PigMix 
> queries for this investigation.




[jira] Updated: (PIG-1531) Pig gobbles up error messages

2010-08-30 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1531:
--

  Status: Resolved  (was: Patch Available)
Hadoop Flags: [Reviewed]
  Resolution: Fixed

> Pig gobbles up error messages
> -
>
> Key: PIG-1531
> URL: https://issues.apache.org/jira/browse/PIG-1531
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>        Reporter: Ashutosh Chauhan
>Assignee: niraj rai
> Fix For: 0.8.0
>
> Attachments: pig-1531_3.patch, PIG_1531.patch, PIG_1531_2.patch
>
>
> Consider the following. I have my own Storer implementing StoreFunc and I am 
> throwing FrontEndException (and other Exceptions derived from PigException) 
> in its various methods. I expect those error messages to be shown in error 
> scenarios. Instead Pig gobbles up my error messages and shows its own generic 
> error message like: 
> {code}
> 010-07-31 14:14:25,414 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
> 2116: Unexpected error. Could not validate the output specification for: 
> default.partitoned
> Details at logfile: /Users/ashutosh/workspace/pig/pig_1280610650690.log
> {code}
> Instead I expect it to display my error messages which it stores away in that 
> log file.




[jira] Commented: (PIG-1531) Pig gobbles up error messages

2010-08-30 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904497#action_12904497
 ] 

Ashutosh Chauhan commented on PIG-1531:
---

Niraj ran all the unit tests; all passed. No complaints from test-patch either. 
Committed to the trunk.
Thanks, Niraj!

> Pig gobbles up error messages
> -
>
> Key: PIG-1531
> URL: https://issues.apache.org/jira/browse/PIG-1531
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>    Reporter: Ashutosh Chauhan
>Assignee: niraj rai
> Fix For: 0.8.0
>
> Attachments: pig-1531_3.patch, PIG_1531.patch, PIG_1531_2.patch
>
>
> Consider the following. I have my own Storer implementing StoreFunc and I am 
> throwing FrontEndException (and other Exceptions derived from PigException) 
> in its various methods. I expect those error messages to be shown in error 
> scenarios. Instead Pig gobbles up my error messages and shows its own generic 
> error message like: 
> {code}
> 010-07-31 14:14:25,414 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
> 2116: Unexpected error. Could not validate the output specification for: 
> default.partitoned
> Details at logfile: /Users/ashutosh/workspace/pig/pig_1280610650690.log
> {code}
> Instead I expect it to display my error messages which it stores away in that 
> log file.




[jira] Updated: (PIG-1531) Pig gobbles up error messages

2010-08-29 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1531:
--

Status: Patch Available  (was: Open)

> Pig gobbles up error messages
> -
>
> Key: PIG-1531
> URL: https://issues.apache.org/jira/browse/PIG-1531
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>        Reporter: Ashutosh Chauhan
>Assignee: niraj rai
> Fix For: 0.8.0
>
> Attachments: pig-1531_3.patch, PIG_1531.patch, PIG_1531_2.patch
>
>
> Consider the following. I have my own Storer implementing StoreFunc and I am 
> throwing FrontEndException (and other Exceptions derived from PigException) 
> in its various methods. I expect those error messages to be shown in error 
> scenarios. Instead Pig gobbles up my error messages and shows its own generic 
> error message like: 
> {code}
> 010-07-31 14:14:25,414 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
> 2116: Unexpected error. Could not validate the output specification for: 
> default.partitoned
> Details at logfile: /Users/ashutosh/workspace/pig/pig_1280610650690.log
> {code}
> Instead I expect it to display my error messages which it stores away in that 
> log file.




[jira] Updated: (PIG-1531) Pig gobbles up error messages

2010-08-29 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1531:
--

Attachment: pig-1531_3.patch

I took a look at the latest patch. There are two minor problems. First, 
pigExec was always null and never assigned a value, so it resulted in an NPE in 
certain code paths. Second, the boolean logic in PigInputFormat needs && instead 
of ||. I thought of correcting it and committing, but then realized Hudson 
hasn't come back with results yet. So, I am uploading a new patch with those 
corrections and submitting to Hudson again. In this patch, I also refactored the 
code a bit so it's easier to read. Have a look and, if it looks fine to you, run 
test-patch and the unit tests and paste the results here so I can commit it.

> Pig gobbles up error messages
> -
>
> Key: PIG-1531
> URL: https://issues.apache.org/jira/browse/PIG-1531
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>    Reporter: Ashutosh Chauhan
>Assignee: niraj rai
> Fix For: 0.8.0
>
> Attachments: pig-1531_3.patch, PIG_1531.patch, PIG_1531_2.patch
>
>
> Consider the following. I have my own Storer implementing StoreFunc and I am 
> throwing FrontEndException (and other Exceptions derived from PigException) 
> in its various methods. I expect those error messages to be shown in error 
> scenarios. Instead Pig gobbles up my error messages and shows its own generic 
> error message like: 
> {code}
> 010-07-31 14:14:25,414 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
> 2116: Unexpected error. Could not validate the output specification for: 
> default.partitoned
> Details at logfile: /Users/ashutosh/workspace/pig/pig_1280610650690.log
> {code}
> Instead I expect it to display my error messages which it stores away in that 
> log file.




[jira] Updated: (PIG-1531) Pig gobbles up error messages

2010-08-29 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1531:
--

Status: Open  (was: Patch Available)

> Pig gobbles up error messages
> -
>
> Key: PIG-1531
> URL: https://issues.apache.org/jira/browse/PIG-1531
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>        Reporter: Ashutosh Chauhan
>Assignee: niraj rai
> Fix For: 0.8.0
>
> Attachments: pig-1531_3.patch, PIG_1531.patch, PIG_1531_2.patch
>
>
> Consider the following. I have my own Storer implementing StoreFunc and I am 
> throwing FrontEndException (and other Exceptions derived from PigException) 
> in its various methods. I expect those error messages to be shown in error 
> scenarios. Instead Pig gobbles up my error messages and shows its own generic 
> error message like: 
> {code}
> 010-07-31 14:14:25,414 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
> 2116: Unexpected error. Could not validate the output specification for: 
> default.partitoned
> Details at logfile: /Users/ashutosh/workspace/pig/pig_1280610650690.log
> {code}
> Instead I expect it to display my error messages which it stores away in that 
> log file.




[jira] Commented: (PIG-1518) multi file input format for loaders

2010-08-27 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903283#action_12903283
 ] 

Ashutosh Chauhan commented on PIG-1518:
---

Yan, 
Sorry for being late on this now that it's committed, but I think you have 
gotten it the other way around: a CollectableLoadFunc is combinable but an 
OrderedLoadFunc is not. Let's go over all three interfaces:

* h4. CollectableLoadFunc: A loader implementing it must make sure that all 
instances of a particular key are present in one split. If you combine splits of 
such a loader, it will still remain a CollectableLoadFunc, because all instances 
of a key will still be in the same split after combination. It is dictating a 
property *within* a split. Thus, it's combinable.
* h4. OrderedLoadFunc: Insists that a loader implementing it must read splits in 
a well-defined order. If you combine the splits, that order may not hold. You 
can't combine splits for this loader. It is defining a property *across* 
multiple splits.
* h4. IndexableLoadFunc: Says that the loader is indexable, meaning that given a 
key it will get you as close as possible to that key. It inherently assumes the 
data is sorted and an index is built for it. Your combined splits may not remain 
sorted anymore. You can't combine splits for this interface either. It is 
defining a property *across* multiple splits.

If you agree with the above, then PigStorage isn't combinable, because 
{code}
public class PigStorage extends FileInputLoadFunc implements 
StoreFuncInterface, LoadPushDown {}
{code}
and 
{code}
public abstract class FileInputLoadFunc extends LoadFunc implements 
OrderedLoadFunc {}
{code}

I also didn't get your logic for *CollectableLoadFunc AND a OrderedLoadFunc*. It 
would help if you could explain that a bit.
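The combinability rule argued in this comment boils down to a simple predicate: splits may be combined only when the loader promises nothing *across* splits. A minimal sketch with stand-in marker interfaces (the names mirror Pig's, but these are empty stand-ins, not the real org.apache.pig types):

```java
// Sketch: a split-combining optimizer could allow combination only for loaders
// whose interfaces promise per-split properties, never cross-split ones.
public class Combinability {
    interface CollectableLoadFunc {}  // per-split property: survives combining
    interface OrderedLoadFunc {}      // cross-split read order: broken by combining
    interface IndexableLoadFunc {}    // cross-split sortedness: broken by combining

    static boolean combinable(Object loader) {
        return !(loader instanceof OrderedLoadFunc)
            && !(loader instanceof IndexableLoadFunc);
    }

    // PigStorage extends FileInputLoadFunc, which implements OrderedLoadFunc,
    // so under this rule it would not be combinable.
    static class PigStorageLike implements OrderedLoadFunc {}
    static class CollectableOnly implements CollectableLoadFunc {}
}
```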


> multi file input format for loaders
> ---
>
> Key: PIG-1518
> URL: https://issues.apache.org/jira/browse/PIG-1518
> Project: Pig
>  Issue Type: Improvement
>Reporter: Olga Natkovich
>Assignee: Yan Zhou
> Fix For: 0.8.0
>
> Attachments: PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, 
> PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch
>
>
> We frequently run in the situation where Pig needs to deal with small files 
> in the input. In this case a separate map is created for each file which 
> could be very inefficient. 
> It would be great to have an umbrella input format that can take multiple 
> files and use them in a single split. We would like to see this working with 
> different data formats if possible.
> There are already a couple of input formats doing similar thing: 
> MultifileInputFormat as well as CombinedInputFormat; however, neither works 
> with the new Hadoop 20 API. 
> We at least want to do a feasibility study for Pig 0.8.0.




[jira] Commented: (PIG-1531) Pig gobbles up error messages

2010-08-25 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902576#action_12902576
 ] 

Ashutosh Chauhan commented on PIG-1531:
---

* In addition to the error message, you also need to set an error code on the 
exception you are throwing.
* Since you are catching exceptions thrown by user code (the StoreFunc 
interface), it is not safe to assume that e.getMessage() will return a non-null 
or non-empty string. That would result in an NPE. You need to check for it and 
provide a generic error message in those cases.
* The generic error message should also contain the output location string, 
since if the user didn't provide a message, the location won't get printed 
otherwise. So you can reword the message as 
"Output location validation failed for: . More information to 
follow:" 
* Since PigException extends IOException, the IOException you are catching can 
also be a PigException; you need to test whether it is, and then set the 
message and error code accordingly.
* In the case of a non-existent input location I am still seeing the generic 
message "ERROR 2997: Unable to recreate exception from backend error: 
org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to 
create input splits for: file:///Users/chauhana/workspace/pig-1531/a". Though 
the full stack trace printed at the end contains the underlying error string, 
it is more confusing because now there are three different error messages 
amid a Java stack trace.
* This warrants a test case for regression purposes. (In fact, the 
error-reporting behavior has already changed since the time I opened this bug.)
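A sketch of the null-message and PigException handling suggested in the bullets above (toy classes and an assumed helper `describe`, not actual Pig code; PigException really does extend IOException):

```java
import java.io.IOException;

public class ErrorPropagation {
    // Stand-in for org.apache.pig.PigException (which extends IOException).
    static class PigException extends IOException {
        final int errorCode;
        PigException(String msg, int errorCode) { super(msg); this.errorCode = errorCode; }
    }

    /** Build the message to surface, never trusting user-thrown exceptions
     *  to carry a non-null, non-empty message (hypothetical helper). */
    static String describe(IOException e, String outputLocation) {
        String msg = e.getMessage();
        if (msg == null || msg.isEmpty()) {
            // Generic fallback that still names the output location.
            msg = "Output location validation failed for: " + outputLocation
                + ". More information to follow:";
        }
        // A caught IOException may really be a PigException carrying a code.
        int code = (e instanceof PigException) ? ((PigException) e).errorCode : 2116;
        return "ERROR " + code + ": " + msg;
    }

    public static void main(String[] args) {
        System.out.println(describe(new IOException((String) null), "default.partitioned"));
        System.out.println(describe(new PigException("table is read-only", 6000), "default.partitioned"));
    }
}
```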

> Pig gobbles up error messages
> -
>
> Key: PIG-1531
> URL: https://issues.apache.org/jira/browse/PIG-1531
> Project: Pig
>  Issue Type: Bug
>    Affects Versions: 0.7.0
>Reporter: Ashutosh Chauhan
>Assignee: niraj rai
> Fix For: 0.8.0
>
> Attachments: PIG_1531.patch
>
>
> Consider the following. I have my own Storer implementing StoreFunc and I am 
> throwing FrontEndException (and other Exceptions derived from PigException) 
> in its various methods. I expect those error messages to be shown in error 
> scenarios. Instead Pig gobbles up my error messages and shows its own generic 
> error message like: 
> {code}
> 010-07-31 14:14:25,414 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
> 2116: Unexpected error. Could not validate the output specification for: 
> default.partitoned
> Details at logfile: /Users/ashutosh/workspace/pig/pig_1280610650690.log
> {code}
> Instead I expect it to display my error messages which it stores away in that 
> log file.




[jira] Updated: (PIG-1309) Map-side Cogroup

2010-08-20 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1309:
--

Release Note: 
With this patch, it is now possible to perform a map-side cogroup if the data 
is sorted and the loaders implement certain interfaces. The primary algorithm 
is based on a sort-merge join with additional restrictions. 

The following preconditions must be met to use this feature: 
1) No other operations can be done between the load and cogroup statements. 
2) Data must be sorted on the join keys in ASC order for all tables. 
3) Nulls are considered smaller than everything. So, if the data contains null 
keys, they should occur before anything else. 
4) The left-most loader must implement the {CollectableLoader} interface as 
well as {OrderedLoadFunc}. 
5) All other loaders must implement IndexableLoadFunc. 
6) Type information must be provided in the schema for all the loaders.

Note that the Zebra loader satisfies all of these conditions, so it can be 
used out of the box. 

Similar conditions apply to map-side outer joins (using merge) (PIG-1353) as 
well. 

Example: 
A = load 'data1' using org.apache.hadoop.zebra.pig.TableLoader('id:int', 
'sorted'); 
B = load 'data2' using org.apache.hadoop.zebra.pig.TableLoader('id:int', 
'sorted'); 
C = COGROUP A by id, B by id using 'merge'; 


  was:
With this patch, it is now possible to perform map-side cogroup if data is 
sorted and loader implements certain interfaces. Primary algorithm is based on 
sort-merge join with additional restrictions. 

Following preconditions must be met to use this feature: 
1) No other operations can be done between load and join statements. 
2) Data must be sorted on join keys for all tables in ASC order. 
3) Nulls are considered smaller then everything. So, if data contains null 
keys, they should occur before anything else. 
4) Left-most loader must implement {CollectableLoader} interface as well as 
{OrderedLoadFunc}. 
5) All other loaders must implement IndexableLoadFunc. 
6) Type information must be provided in schema for all the loaders.

Note that Zebra loader satisfies all of these conditions, so can be used out of 
box. 

Similar conditions apply to map-side outer joins (using merge) (PIG-1353) as 
well. 

Example: 
A = load 'data1' using org.apache.hadoop.zebra.pig.TableLoader('id:int', 
'sorted'); 
B = load 'data2' using org.apache.hadoop.zebra.pig.TableLoader('id:int', 
'sorted'); 
C = COGROUP A by id, B by id using 'merge'; 



> Map-side Cogroup
> 
>
> Key: PIG-1309
> URL: https://issues.apache.org/jira/browse/PIG-1309
>     Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
> Fix For: 0.7.0, 0.8.0
>
> Attachments: mapsideCogrp.patch, pig-1309_1.patch, pig-1309_2.patch, 
> PIG_1309_7.patch
>
>
> In the never-ending quest to make Pig go faster, we want to parallelize as many 
> relational operations as possible. It's already possible to do Group-by 
> (PIG-984) and Joins (PIG-845, PIG-554) purely map-side in Pig. This JIRA 
> is to add a map-side implementation of Cogroup in Pig. Details to follow.




[jira] Updated: (PIG-1353) Map-side outer joins

2010-08-20 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1353:
--

Release Note: 
With this patch, it is now possible to perform [left|right|full] outer joins 
on two tables, as well as inner joins on more than two tables, map-side in Pig, 
if the data is sorted and the loaders implement the required interfaces. The 
primary algorithm is based on a sort-merge join. 

The following preconditions must be met in order to use this feature:
1) No other operations can be done between the load and join statements.
2) Data must be sorted on the join keys in ASC order.
3) Nulls are considered smaller than everything. So, if the data contains null 
keys, they should occur before anything else.
4) The left-most loader must implement the {CollectableLoader} interface as 
well as {OrderedLoadFunc}.
5) All other loaders must implement {IndexableLoadFunc}.   
6) Type information must be provided in the schema for all the loaders. 

Note that the Zebra loader satisfies all of these conditions, so it can be 
used out of the box.

Similar conditions apply to map-side cogroups (PIG-1309) as well.  

Example:
A = load 'data1' using org.apache.hadoop.zebra.pig.TableLoader('id:int', 
'sorted');
B = load 'data2' using org.apache.hadoop.zebra.pig.TableLoader('id:int', 
'sorted');
C = join A by id left, B by id using 'merge';

  was:
With this patch, it is now possible to perform [left|right|full] outer joins on 
two tables as well as inner joins on more then two tables in Pig in map-side if 
data is sorted and one of the loader implements {{CollectableLoader}} 
interface. Primary algorithm is based on sort-merge join. 

Additional implementation details:
1) No other operations can be done between load and join statements.
2) Data must be sorted in ASC order.
3) Nulls are considered smaller then everything. So, if data contains null 
keys, they should occur before anything else.
4) Left-most loader must implement CollectableLoader interface as well as 
OrderedLoadFunc.
5) All other loaders must implement IndexableLoadFunc.   

Note that Zebra loader satisfies all of these conditions, so can be used out of 
box.
Similiar conditions apply to map-side cogroups (PIG-1309) as well.  

Example:
A = load 'data1' using org.apache.hadoop.zebra.pig.TableLoader('', 'sorted');
B = load 'data2' using org.apache.hadoop.zebra.pig.TableLoader('', 'sorted');
C = join A by id left, B by id using 'merge';
.


> Map-side outer joins
> 
>
> Key: PIG-1353
> URL: https://issues.apache.org/jira/browse/PIG-1353
>         Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
> Fix For: 0.8.0
>
> Attachments: pig-1353.patch, pig-1353.patch
>
>
> Pig already has couple of map-side join implementations: Merge Join and 
> Fragmented-Replicate Join. But both of them are pretty restrictive. Merge 
> Join can only join two tables and that too can only do inner join. FR Join 
> can join multiple relations, but it can also only do inner and left outer 
> joins. Further it restricts the sizes of side relations. It will be nice if 
> we can do map side joins on multiple tables as well do inner, left outer, 
> right outer and full outer joins. 
> Lot of groundwork for this has already been done in PIG-1309. Remaining will 
> be tracked in this jira.   




[jira] Updated: (PIG-1309) Map-side Cogroup

2010-08-20 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1309:
--

Release Note: 
With this patch, it is now possible to perform a map-side cogroup if the data 
is sorted and the loaders implement certain interfaces. The primary algorithm 
is based on a sort-merge join with additional restrictions. 

The following preconditions must be met to use this feature: 
1) No other operations can be done between the load and join statements. 
2) Data must be sorted on the join keys in ASC order for all tables. 
3) Nulls are considered smaller than everything. So, if the data contains null 
keys, they should occur before anything else. 
4) The left-most loader must implement the {CollectableLoader} interface as 
well as {OrderedLoadFunc}. 
5) All other loaders must implement IndexableLoadFunc. 
6) Type information must be provided in the schema for all the loaders.

Note that the Zebra loader satisfies all of these conditions, so it can be 
used out of the box. 

Similar conditions apply to map-side outer joins (using merge) (PIG-1353) as 
well. 

Example: 
A = load 'data1' using org.apache.hadoop.zebra.pig.TableLoader('id:int', 
'sorted'); 
B = load 'data2' using org.apache.hadoop.zebra.pig.TableLoader('id:int', 
'sorted'); 
C = COGROUP A by id, B by id using 'merge'; 


  was:
With this patch, it is now possible to perform map-side cogroup if data is 
sorted and one of the loader implements {{CollectableLoader}} interface. 
Primary algorithm is based on sort-merge join. 

Additional implementation details: 
1) No other operations can be done between load and join statements. 
2) Data must be sorted in ASC order. 
3) Nulls are considered smaller then everything. So, if data contains null 
keys, they should occur before anything else. 
4) Left-most loader must implement CollectableLoader interface as well as 
OrderedLoadFunc. 
5) All other loaders must implement IndexableLoadFunc. 

Note that Zebra loader satisfies all of these conditions, so can be used out of 
box. 
Similiar conditions apply to map-side cogroups (PIG-1309) as well. 

Example: 
A = load 'data1' using org.apache.hadoop.zebra.pig.TableLoader('', 'sorted'); 
B = load 'data2' using org.apache.hadoop.zebra.pig.TableLoader('', 'sorted'); 
C = COGROUP A by id, B by id using 'merge'; 



> Map-side Cogroup
> 
>
> Key: PIG-1309
> URL: https://issues.apache.org/jira/browse/PIG-1309
>     Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Ashutosh Chauhan
>Assignee: Ashutosh Chauhan
> Fix For: 0.7.0, 0.8.0
>
> Attachments: mapsideCogrp.patch, pig-1309_1.patch, pig-1309_2.patch, 
> PIG_1309_7.patch
>
>
> In the never-ending quest to make Pig go faster, we want to parallelize as many 
> relational operations as possible. It's already possible to do Group-by 
> (PIG-984) and Joins (PIG-845, PIG-554) purely map-side in Pig. This JIRA 
> is to add a map-side implementation of Cogroup in Pig. Details to follow.




[jira] Commented: (PIG-1486) update ant eclipse-files target to include new jar and remove contrib dirs from build path

2010-08-20 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900768#action_12900768
 ] 

Ashutosh Chauhan commented on PIG-1486:
---

meh.. Before testing, thou shalt apply the patch!
After I applied the patch, it works like a charm. +1

> update ant eclipse-files target to include new jar and remove contrib dirs 
> from build path
> --
>
> Key: PIG-1486
> URL: https://issues.apache.org/jira/browse/PIG-1486
> Project: Pig
>  Issue Type: Bug
>  Components: tools
>Affects Versions: 0.8.0
>Reporter: Thejas M Nair
>Assignee: Thejas M Nair
>Priority: Minor
> Fix For: 0.8.0
>
> Attachments: PIG-1486.1.patch, PIG-1486.2.patch, PIG-1486.patch
>
>
>  .eclipse.templates/.classpath needs to be updated to address the following -
> 1. There is a new jar that is used by the code - guava-r03.jar
> 2. The jar "ANT_HOME/lib/ant.jar" gives an 'unbounded jar' error in eclipse.
> 3. Removing the contrib projects from class path as discussed in PIG-1390, 
> until all libs necessary for the contribs are included in classpath.




[jira] Resolved: (PIG-533) DBloader UDF (initial prototype)

2010-08-19 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan resolved PIG-533.
--

Fix Version/s: 0.8.0
   Resolution: Fixed

PIG-1229 makes this redundant.

> DBloader UDF (initial prototype)
> 
>
> Key: PIG-533
> URL: https://issues.apache.org/jira/browse/PIG-533
> Project: Pig
>  Issue Type: New Feature
>Reporter: Ian Holsman
>Priority: Minor
> Fix For: 0.8.0
>
> Attachments: DbStorage.java
>
>
> This is an initial prototype of a UDF that can insert data into a database 
> directly from Pig.




[jira] Commented: (PIG-1486) update ant eclipse-files target to include new jar and remove contrib dirs from build path

2010-08-19 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900568#action_12900568
 ] 

Ashutosh Chauhan commented on PIG-1486:
---

I did 
svn co https://svn.apache.org/repos/asf/hadoop/pig/trunk/ pig-1486
ant eclipse-files

and then imported pig-1486 as an existing project in Eclipse. I presume that's 
all I need to do.
The patch needs more updates after PIG-1520; essentially, it needs to remove 
owl from Eclipse's build path. Further, Eclipse also reported:
* Unbound classpath variable: 'ANT_HOME/lib/ant.jar' in project 'pig-1486'
* Project 'pig-1486' is missing required library: 'lib/hadoop20.jar'



> update ant eclipse-files target to include new jar and remove contrib dirs 
> from build path
> --
>
> Key: PIG-1486
> URL: https://issues.apache.org/jira/browse/PIG-1486
> Project: Pig
>  Issue Type: Bug
>  Components: tools
>Affects Versions: 0.8.0
>Reporter: Thejas M Nair
>Assignee: Thejas M Nair
>Priority: Minor
> Fix For: 0.8.0
>
> Attachments: PIG-1486.1.patch, PIG-1486.2.patch, PIG-1486.patch
>
>
>  .eclipse.templates/.classpath needs to be updated to address the following -
> 1. There is a new jar that is used by the code - guava-r03.jar
> 2. The jar "ANT_HOME/lib/ant.jar" gives an 'unbounded jar' error in eclipse.
> 3. Removing the contrib projects from class path as discussed in PIG-1390, 
> until all libs necessary for the contribs are included in classpath.




[jira] Commented: (PIG-1420) Make CONCAT act on all fields of a tuple, instead of just the first two fields of a tuple

2010-08-19 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900566#action_12900566
 ] 

Ashutosh Chauhan commented on PIG-1420:
---

> I could not figure out how to re-open this issue.

Issues marked as closed cannot be reopened. Once the patch is committed, the 
committer should mark the issue as resolved, since resolved issues can be 
reopened before the release is rolled out. When the release is rolled out, 
resolved issues should be marked as closed, since there is no point in 
reopening an issue which has already been released. If more work needs to be 
done on that issue, a new JIRA should be created for it for future releases.

> Make CONCAT act on all fields of a tuple, instead of just the first two 
> fields of a tuple
> -
>
> Key: PIG-1420
> URL: https://issues.apache.org/jira/browse/PIG-1420
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.8.0
>Reporter: Russell Jurney
>Assignee: Russell Jurney
> Fix For: 0.8.0
>
> Attachments: addconcat2.patch, PIG-1420.2.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> org.apache.pig.builtin.CONCAT (which acts on DataByteArrays internally) and 
> org.apache.pig.builtin.StringConcat (which acts on Strings internally) both 
> act on the first two fields of a tuple. This results in ugly nested CONCAT 
> calls like:
> CONCAT(CONCAT(A, ' '), B)
> The more desirable form is:
> CONCAT(A, ' ', B)
> This change will be backwards compatible, provided that no one was relying on 
> the fact that CONCAT ignores fields after the first two in a tuple.  This 
> seems a reasonable assumption to make, or at least a small break in 
> compatibility for a sizable improvement.




[jira] Commented: (PIG-1518) multi file input format for loaders

2010-08-14 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898648#action_12898648
 ] 

Ashutosh Chauhan commented on PIG-1518:
---

This feature of combining multiple splits should honor the OrderedLoadFunc 
interface: if a loadfunc implements that interface, then the splits generated 
by it should not be combined. However, it's not clear why FileInputLoadFunc 
implements this interface. AFAIK, the split[] returned by getSplits() on 
FileInputFormat makes no guarantee that the underlying splits will be returned 
in order. Though that is the default behavior right now, and thus making it 
implement OrderedLoadFunc doesn't cause any problem in the current 
implementation, there seems to be no real benefit to FileInputLoadFunc 
implementing it (there is one exception, to which I will come later on). So I 
will argue that FileInputLoadFunc stop implementing OrderedLoadFunc. This will 
have the immediate benefit of making this change useful for all the 
fundamental storage mechanisms of Pig, like PigStorage, BinStorage, 
InterStorage, etc. Dropping an interface from an implementing class can be 
seen as a backward-incompatible change, but I really doubt anyone cares 
whether PigStorage is reading splits in an ordered fashion. 
The only real victim of this change will be MergeJoin, which will stop working 
with PigStorage by default. But we have not seen MergeJoin being used with 
PigStorage in many places. Second, it is anyway based on an assumption about 
FileInputFormat, which may choose to change its behavior in the future. Third, 
the solution to this problem is straightforward: have another loader which 
extends PigStorage and implements OrderedLoadFunc, and which can be used to 
load data for merge join. 

In essence, I am arguing to drop the OrderedLoadFunc interface from 
FileInputLoadFunc so that this feature is useful for a large number of use 
cases.

Yan, you also need to watch out for ReadToEndLoader, which is also making 
assumptions that may break in the presence of this feature.
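The "honor OrderedLoadFunc" rule from the first paragraph can be sketched as follows (hypothetical combiner with split sizes modeled as integers; not the actual PIG-1518 patch):

```java
import java.util.ArrayList;
import java.util.List;

public class SplitCombiner {
    interface OrderedLoadFunc {}  // stand-in marker for Pig's interface

    /** Combine small "splits" (modeled as sizes) into batches up to maxSize,
     *  unless the loader declares an ordering guarantee (hypothetical sketch). */
    static List<List<Integer>> combine(Object loader, List<Integer> splits, int maxSize) {
        List<List<Integer>> out = new ArrayList<>();
        if (loader instanceof OrderedLoadFunc) {
            // Must preserve split order and boundaries: one split per task.
            for (Integer s : splits) out.add(List.of(s));
            return out;
        }
        List<Integer> batch = new ArrayList<>();
        int total = 0;
        for (Integer s : splits) {
            if (total + s > maxSize && !batch.isEmpty()) {
                out.add(batch);
                batch = new ArrayList<>();
                total = 0;
            }
            batch.add(s);
            total += s;
        }
        if (!batch.isEmpty()) out.add(batch);
        return out;
    }

    public static void main(String[] args) {
        List<Integer> splits = List.of(10, 20, 30, 40);
        System.out.println(combine(new Object(), splits, 60).size());             // 2 batches
        System.out.println(combine(new OrderedLoadFunc() {}, splits, 60).size()); // 4, uncombined
    }
}
```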

> multi file input format for loaders
> ---
>
> Key: PIG-1518
> URL: https://issues.apache.org/jira/browse/PIG-1518
> Project: Pig
>  Issue Type: Improvement
>Reporter: Olga Natkovich
>Assignee: Yan Zhou
> Fix For: 0.8.0
>
>
> We frequently run into the situation where Pig needs to deal with small files 
> in the input. In this case a separate map is created for each file, which 
> can be very inefficient. 
> It would be great to have an umbrella input format that can take multiple 
> files and use them in a single split. We would like to see this working with 
> different data formats if possible.
> There are already a couple of input formats doing a similar thing: 
> MultifileInputFormat as well as CombinedInputFormat; however, neither works 
> with the new Hadoop 20 API. 
> We at least want to do a feasibility study for Pig 0.8.0.




[jira] Commented: (PIG-1229) allow pig to write output into a JDBC db

2010-08-04 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12895330#action_12895330
 ] 

Ashutosh Chauhan commented on PIG-1229:
---

Tested and it worked. Committed. Thanks, Aaron and Ankur, for helping fix the 
issue.

> allow pig to write output into a JDBC db
> 
>
> Key: PIG-1229
> URL: https://issues.apache.org/jira/browse/PIG-1229
> Project: Pig
>  Issue Type: New Feature
>  Components: impl
>Reporter: Ian Holsman
>Assignee: Ankur
>Priority: Minor
> Fix For: 0.8.0
>
> Attachments: jira-1229-final.patch, jira-1229-final.test-fix.patch, 
> jira-1229-v2.patch, jira-1229-v3.patch, pig-1229.2.patch, pig-1229.patch
>
>
> UDF to store data into a DB




[jira] Commented: (PIG-1404) PigUnit - Pig script testing simplified.

2010-08-04 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12895318#action_12895318
 ] 

Ashutosh Chauhan commented on PIG-1404:
---

bq. 3. (This one is for other pig developers) Is Piggybank the right place for 
this or should we put it under test? I think this will be really useful for Pig 
users in setting up automated tests of their Pig Latin scripts. Should we 
support it outright rather than put it in piggybank and risk having it go 
unmaintained?

I think it deserves to be put in under test. Having written a few end-to-end 
test cases of Pig in JUnit, I can see it is really useful for Pig itself. Its 
usefulness for Pig users is pretty obvious.

> PigUnit - Pig script testing simplified. 
> -
>
> Key: PIG-1404
> URL: https://issues.apache.org/jira/browse/PIG-1404
> Project: Pig
>  Issue Type: New Feature
>Reporter: Romain Rigaux
>Assignee: Romain Rigaux
> Fix For: 0.8.0
>
> Attachments: commons-lang-2.4.jar, PIG-1404-2.patch, 
> PIG-1404-3-doc.patch, PIG-1404-3.patch, PIG-1404-4-doc.patch, 
> PIG-1404-4.patch, PIG-1404.patch
>
>
> The goal is to provide a simple xUnit framework that enables our Pig scripts 
> to be easily:
>   - unit tested
>   - regression tested
>   - quickly prototyped
> No cluster set up is required.
> For example:
> TestCase
> {code}
>   @Test
>   public void testTop3Queries() {
> String[] args = {
> "n=3",
> };
> test = new PigTest("top_queries.pig", args);
> String[] input = {
> "yahoo\t10",
> "twitter\t7",
> "facebook\t10",
> "yahoo\t15",
> "facebook\t5",
> 
> };
> String[] output = {
> "(yahoo,25L)",
> "(facebook,15L)",
> "(twitter,7L)",
> };
> test.assertOutput("data", input, "queries_limit", output);
>   }
> {code}
> top_queries.pig
> {code}
> data =
> LOAD '$input'
> AS (query:CHARARRAY, count:INT);
>  
> ... 
> 
> queries_sum = 
> FOREACH queries_group 
> GENERATE 
> group AS query, 
> SUM(queries.count) AS count;
> 
> ...
> 
> queries_limit = LIMIT queries_ordered $n;
> STORE queries_limit INTO '$output';
> {code}
> There are 3 modes:
> * LOCAL (if the "pigunit.exectype.local" property is present)
> * MAPREDUCE (uses the cluster specified in the classpath, same as 
> HADOOP_CONF_DIR)
> ** automatic mini cluster (the default; the HADOOP_CONF_DIR to have in 
> the class path will be: ~/pigtest/conf)
> ** pointing to an existing cluster (if the "pigunit.exectype.cluster" 
> property is present)
> For now, it would be nice to see how this idea could be integrated in 
> Piggybank and if PigParser/PigServer could improve their interfaces in order 
> to make PigUnit simple.
> Other components based on PigUnit could be built later:
>   - standalone MiniCluster
>   - notion of workspaces for each test
>   - standalone utility that reads test configuration and generates a test 
> report...
> It is a first prototype, open to suggestions, and can definitely take 
> advantage of feedback.
> How to test, in pig_trunk:
> {code}
> Apply patch
> $pig_trunk ant compile-test
> $pig_trunk ant
> $pig_trunk/contrib/piggybank/java ant test -Dtest.timeout=99
> {code}
> (it takes 15 min in MAPREDUCE minicluster, tests will need to be split in the 
> future between 'unit' and 'integration')
> Many examples are in:
> {code}
> contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/pigunit/TestPigTest.java
> {code}
> When used standalone, do not forget to add commons-lang-2.4.jar and the 
> HADOOP_CONF_DIR of your cluster to your CLASSPATH.




[jira] Commented: (PIG-1229) allow pig to write output into a JDBC db

2010-08-03 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12894963#action_12894963
 ] 

Ashutosh Chauhan commented on PIG-1229:
---

I am still getting the same exception 
{code}
java.io.IOException: JDBC Error
at 
org.apache.pig.piggybank.storage.DBStorage.prepareToWrite(DBStorage.java:291)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.<init>(PigOutputFormat.java:124)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat.getRecordWriter(PigOutputFormat.java:85)
at 
org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.<init>(MapTask.java:488)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:610)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: java.sql.SQLException: Table not found in statement [insert into ttt 
(id, name, ratio) values (?,?,?)]
at org.hsqldb.jdbc.Util.throwError(Unknown Source)
at org.hsqldb.jdbc.jdbcPreparedStatement.<init>(Unknown Source)
at org.hsqldb.jdbc.jdbcConnection.prepareStatement(Unknown Source)
at 
org.apache.pig.piggybank.storage.DBStorage.prepareToWrite(DBStorage.java:288)
... 6 more
{code}

Reading through a few internet forums, it seems there are subtle differences 
between HSQLDB's "stand-alone" mode and its "server" mode. Maybe starting the 
HSQLDB instance in server mode would alleviate the problem.

> allow pig to write output into a JDBC db
> 
>
> Key: PIG-1229
> URL: https://issues.apache.org/jira/browse/PIG-1229
> Project: Pig
>  Issue Type: New Feature
>  Components: impl
>Reporter: Ian Holsman
>Assignee: Ankur
>Priority: Minor
> Fix For: 0.8.0
>
> Attachments: jira-1229-final.patch, jira-1229-final.test-fix.patch, 
> jira-1229-v2.patch, jira-1229-v3.patch, pig-1229.2.patch, pig-1229.patch
>
>
> UDF to store data into a DB




[jira] Commented: (PIG-1516) finalize in bag implementations causes pig to run out of memory in reduce

2010-08-03 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12894945#action_12894945
 ] 

Ashutosh Chauhan commented on PIG-1516:
---

+1. Changes look good.

> finalize in bag implementations causes pig to run out of memory in reduce 
> --
>
> Key: PIG-1516
> URL: https://issues.apache.org/jira/browse/PIG-1516
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Thejas M Nair
>Assignee: Thejas M Nair
> Fix For: 0.8.0
>
> Attachments: PIG-1516.2.patch, PIG-1516.patch
>
>
> *Problem:*
> pig bag implementations that are subclasses of DefaultAbstractBag, have 
> finalize methods implemented. As a result, the garbage collector moves them 
> to a finalization queue, and the memory used is freed only after the 
> finalization happens on it.
> If the bags are not finalized fast enough, a lot of memory is consumed by the 
> finalization queue, and Pig runs out of memory. This can happen if a large 
> number of small bags is being created.
> *Solution:*
> The finalize function exists for the purpose of deleting the spill files that 
> are created when the bag is too large. But if the bags are small enough, no 
> spill files are created, and there is no use of the finalize function.
>  A new class that holds a list of files will be introduced (FileList). This 
> class will have a finalize method that deletes the files. The bags will no 
> longer have finalize methods, and the bags will use FileList instead of 
> ArrayList.
> *Possible workaround for earlier releases:*
> Since the fix is going into 0.8, here is a workaround -
> Disabling the combiner will reduce the number of bags getting created, as 
> there will not be the stage of combining intermediate merge results. But I 
> would recommend disabling it only if you have this problem as it is likely to 
> slow down the query .
> To disable combiner, set the property: -Dpig.exec.nocombiner=true
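The FileList idea from the solution section above can be sketched like this (hypothetical code, not the actual PIG-1516 patch): only the list that holds spill files carries a finalizer, so ordinary small bags never enter the finalization queue.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;

// Sketch of the FileList idea: the *list holder* owns cleanup, so bags
// themselves need no finalize() and small bags are collected normally.
public class FileList extends ArrayList<File> {
    // Delete all spill files; in the described design this is also
    // what finalize() falls back to.
    public void deleteAll() {
        for (File f : this) f.delete();
        clear();
    }

    @Override
    protected void finalize() {  // only FileList instances carry a finalizer
        deleteAll();
    }

    public static void main(String[] args) throws IOException {
        FileList spills = new FileList();
        File f = File.createTempFile("spill", ".bin");
        spills.add(f);
        spills.deleteAll();              // deterministic cleanup path
        System.out.println(f.exists());  // false
    }
}
```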




[jira] Commented: (PIG-1531) Pig gobbles up error messages

2010-08-03 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12894935#action_12894935
 ] 

Ashutosh Chauhan commented on PIG-1531:
---

Another instance where it happens is when the input location doesn't exist; 
the error message shown is 
{code}
org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to 
create input splits for tmp_emtpy_1280539088
{code}
Whereas the underlying exception had a more useful string, which gets lost in 
the log file:
{code}
org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does 
not exist:
hdfs://machine.server.edu/tmp/pig/tmp_tables/tmp_empty_1280539088
{code}

> Pig gobbles up error messages
> -
>
> Key: PIG-1531
> URL: https://issues.apache.org/jira/browse/PIG-1531
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>    Reporter: Ashutosh Chauhan
>Assignee: niraj rai
> Fix For: 0.8.0
>
>
> Consider the following. I have my own Storer implementing StoreFunc and I am 
> throwing FrontEndException (and other Exceptions derived from PigException) 
> in its various methods. I expect those error messages to be shown in error 
> scenarios. Instead Pig gobbles up my error messages and shows its own generic 
> error message like: 
> {code}
> 010-07-31 14:14:25,414 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
> 2116: Unexpected error. Could not validate the output specification for: 
> default.partitoned
> Details at logfile: /Users/ashutosh/workspace/pig/pig_1280610650690.log
> {code}
> Instead I expect it to display my error messages which it stores away in that 
> log file.




[jira] Commented: (PIG-1531) Pig gobbles up error messages

2010-07-31 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12894320#action_12894320
 ] 

Ashutosh Chauhan commented on PIG-1531:
---

This is because in InputOutputFileVisitor#visit() Pig defines its own 
{{errMsg}} String and uses that to throw PlanValidationException. It should use 
the error String of the Exception it has caught. 
I have not checked other places, but I have a hunch this happens in a few 
other places in Pig as well. This is a real usability issue: the generic 
message is usually useless, and Pig misses an opportunity to provide a useful 
bit of information in error scenarios. From that point on, the user has to open 
the log file and scroll through tens of lines of stack trace, and only a reader 
familiar with Pig will spot the error String. 
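A sketch of the pattern this comment recommends: when wrapping a caught exception, reuse its message and chain it as the cause rather than substituting a generic string. The class here is a simplified stand-in, not Pig's actual PlanValidationException:

```java
public class ErrorWrapDemo {
    // Simplified stand-in for Pig's validation exception.
    static class PlanValidationException extends RuntimeException {
        PlanValidationException(String msg, Throwable cause) { super(msg, cause); }
    }

    // Bad: a fixed generic message; the original detail survives only in the log.
    static PlanValidationException wrapGeneric(Exception e) {
        return new PlanValidationException(
            "Unexpected error. Could not validate the output specification", e);
    }

    // Better: surface the underlying message while still chaining the cause.
    static PlanValidationException wrapWithDetail(Exception e) {
        return new PlanValidationException(e.getMessage(), e);
    }

    public static void main(String[] args) {
        Exception root = new java.io.IOException("Input path does not exist: /tmp/t1");
        System.out.println(wrapGeneric(root).getMessage());
        System.out.println(wrapWithDetail(root).getMessage());
    }
}
```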

> Pig gobbles up error messages
> -
>
> Key: PIG-1531
> URL: https://issues.apache.org/jira/browse/PIG-1531
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>    Reporter: Ashutosh Chauhan
> Fix For: 0.8.0
>
>
> Consider the following. I have my own Storer implementing StoreFunc and I am 
> throwing FrontEndException (and other Exceptions derived from PigException) 
> in its various methods. I expect those error messages to be shown in error 
> scenarios. Instead Pig gobbles up my error messages and shows its own generic 
> error message like: 
> {code}
> 010-07-31 14:14:25,414 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
> 2116: Unexpected error. Could not validate the output specification for: 
> default.partitoned
> Details at logfile: /Users/ashutosh/workspace/pig/pig_1280610650690.log
> {code}
> Instead I expect it to display my error messages which it stores away in that 
> log file.




[jira] Created: (PIG-1531) Pig gobbles up error messages

2010-07-31 Thread Ashutosh Chauhan (JIRA)
Pig gobbles up error messages
-

 Key: PIG-1531
 URL: https://issues.apache.org/jira/browse/PIG-1531
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
 Fix For: 0.8.0


Consider the following. I have my own Storer implementing StoreFunc and I am 
throwing FrontEndException (and other Exceptions derived from PigException) in 
its various methods. I expect those error messages to be shown in error 
scenarios. Instead Pig gobbles up my error messages and shows its own generic 
error message like: 
{code}
010-07-31 14:14:25,414 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
2116: Unexpected error. Could not validate the output specification for: 
default.partitoned
Details at logfile: /Users/ashutosh/workspace/pig/pig_1280610650690.log

{code}
Instead I expect it to display my error messages which it stores away in that 
log file.




[jira] Resolved: (PIG-1528) Enable use of similar aliases when doing a join :(ERROR 1108: Duplicate schema alias:)

2010-07-30 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan resolved PIG-1528.
---

Resolution: Duplicate

Duplicate of PIG-859

> Enable use of similar aliases when doing a join :(ERROR 1108: Duplicate 
> schema alias:)
> --
>
> Key: PIG-1528
> URL: https://issues.apache.org/jira/browse/PIG-1528
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.7.0
>Reporter: Viraj Bhat
>
> I am doing a self join:
> Input file is tab separated:
> {code}
> 1   one
> 1   uno
> 2   two
> 2   dos
> 3   three
> 3   tres
> {code}
> vi...@machine~/pigscripts >pig -x local script.pig
> {code}
> A = load 'Adataset.txt' as (key:int, value:chararray);
> C = join A by key, A by key;
> dump C;
> {code} 
> 2010-07-30 23:09:05,422 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
> 1108: Duplicate schema alias: A::key in "C"
> Details at logfile: /homes/viraj/pigscripts/pig_1280531249235.log




[jira] Updated: (PIG-1229) allow pig to write output into a JDBC db

2010-07-27 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1229:
--

Status: Resolved  (was: Patch Available)
Resolution: Fixed

Changes look good. The core test failures look unrelated, since there are no 
changes in Pig's main src/ tree, only in contrib. Thanks, Ian, for your initial 
work, and thanks, Ankur, for your persistence in getting this committed. 

> allow pig to write output into a JDBC db
> 
>
> Key: PIG-1229
> URL: https://issues.apache.org/jira/browse/PIG-1229
> Project: Pig
>  Issue Type: New Feature
>  Components: impl
>Reporter: Ian Holsman
>Assignee: Ankur
>Priority: Minor
> Fix For: 0.8.0
>
> Attachments: jira-1229-final.patch, jira-1229-v2.patch, 
> jira-1229-v3.patch, pig-1229.2.patch, pig-1229.patch
>
>
> UDF to store data into a DB




[jira] Commented: (PIG-1229) allow pig to write output into a JDBC db

2010-07-26 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892378#action_12892378
 ] 

Ashutosh Chauhan commented on PIG-1229:
---

Since the fix to PIG-1424 doesn't look straightforward and I don't think anyone 
is working on it, I suggest unblocking this useful Piggybank functionality from 
Pig's issues. We can take the original approach suggested in the first patch: 
passing the JDBC URL string as a constructor argument instead of as the store 
location. 
Ankur, do you have cycles to generate the patch? We will commit it now so it 
makes it into 0.8.

> allow pig to write output into a JDBC db
> 
>
> Key: PIG-1229
> URL: https://issues.apache.org/jira/browse/PIG-1229
> Project: Pig
>  Issue Type: New Feature
>  Components: impl
>Reporter: Ian Holsman
>Assignee: Ankur
>Priority: Minor
> Fix For: 0.8.0
>
> Attachments: jira-1229-v2.patch, jira-1229-v3.patch, 
> pig-1229.2.patch, pig-1229.patch
>
>
> UDF to store data into a DB




[jira] Commented: (PIG-928) UDFs in scripting languages

2010-07-21 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890840#action_12890840
 ] 

Ashutosh Chauhan commented on PIG-928:
--

Thanks, Aniket, for making those changes. It's getting closer.
* I am still not convinced about the changes required in POUserFunc. That logic 
should really be part of pythonToPig(pyObject). If a Python UDF returns byte[], 
it should be turned into a DataByteArray before it gets back into Pig's 
pipeline. And if we do that conversion in pythonToPig() (which is the right 
place for it), we will need no changes in POUserFunc. 
* As I suggested in my previous comment, in the same method you should avoid 
first creating an Array and then turning it into a List; you can create a List 
upfront and use it.
* Instead of instanceof, a class-equality test will be a wee bit faster: 
instead of (pyObject instanceof PyDictionary), do pyObject.getClass() == 
PyDictionary.class. Obviously, this only works when you know the exact target 
class and not for derived ones.
* parseSchema(String schema) already exists in the 
org.apache.pig.impl.util.Utils class, so there is no need for it in ScriptEngine.
* For the register command, we need to test not only for functionality but for 
regressions as well. Look at TestGrunt.java in the test package to get an idea 
of how to write a test for it.
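The class-equality point above can be sketched with standard collection types standing in for Jython's PyDictionary; note how the exact-class test rejects subclasses while instanceof accepts them:

```java
import java.util.ArrayList;
import java.util.LinkedList;

public class TypeCheckDemo {
    // Exact-class test: true only for ArrayList itself, never a subclass.
    static boolean isExactlyArrayList(Object o) {
        return o.getClass() == ArrayList.class;
    }

    // instanceof: also true for subclasses of ArrayList.
    static boolean isArrayListOrSubclass(Object o) {
        return o instanceof ArrayList;
    }

    public static void main(String[] args) {
        Object plain = new ArrayList<String>();
        Object sub = new ArrayList<String>() {}; // anonymous subclass
        System.out.println(isExactlyArrayList(plain));   // true
        System.out.println(isExactlyArrayList(sub));     // false
        System.out.println(isArrayListOrSubclass(sub));  // true
    }
}
```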

> UDFs in scripting languages
> ---
>
> Key: PIG-928
> URL: https://issues.apache.org/jira/browse/PIG-928
> Project: Pig
>  Issue Type: New Feature
>Reporter: Alan Gates
>Assignee: Aniket Mokashi
> Fix For: 0.8.0
>
> Attachments: calltrace.png, package.zip, PIG-928.patch, 
> pig-greek.tgz, pig.scripting.patch.arnab, pyg.tgz, RegisterPythonUDF3.patch, 
> RegisterPythonUDF4.patch, RegisterPythonUDF_Final.patch, 
> RegisterPythonUDFFinale.patch, RegisterPythonUDFFinale3.patch, 
> RegisterPythonUDFFinale4.patch, RegisterPythonUDFFinale5.patch, 
> RegisterPythonUDFLatest.patch, RegisterScriptUDFDefineParse.patch, 
> scripting.tgz, scripting.tgz, test.zip
>
>
> It should be possible to write UDFs in scripting languages such as python, 
> ruby, etc.  This frees users from needing to compile Java, generate a jar, 
> etc.  It also opens Pig to programmers who prefer scripting languages over 
> Java.




[jira] Commented: (PIG-928) UDFs in scripting languages

2010-07-21 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890845#action_12890845
 ] 

Ashutosh Chauhan commented on PIG-928:
--

Addendum:

* Also, what will happen if the user returns a nil Python object (the null 
equivalent of Java) from a UDF? It looks to me like that will result in an NPE. 
Can you add a test for that, and a similar test case for pigToPython()? 

> UDFs in scripting languages
> ---
>
> Key: PIG-928
> URL: https://issues.apache.org/jira/browse/PIG-928
> Project: Pig
>  Issue Type: New Feature
>Reporter: Alan Gates
>Assignee: Aniket Mokashi
> Fix For: 0.8.0
>
> Attachments: calltrace.png, package.zip, PIG-928.patch, 
> pig-greek.tgz, pig.scripting.patch.arnab, pyg.tgz, RegisterPythonUDF3.patch, 
> RegisterPythonUDF4.patch, RegisterPythonUDF_Final.patch, 
> RegisterPythonUDFFinale.patch, RegisterPythonUDFFinale3.patch, 
> RegisterPythonUDFFinale4.patch, RegisterPythonUDFFinale5.patch, 
> RegisterPythonUDFLatest.patch, RegisterScriptUDFDefineParse.patch, 
> scripting.tgz, scripting.tgz, test.zip
>
>
> It should be possible to write UDFs in scripting languages such as python, 
> ruby, etc.  This frees users from needing to compile Java, generate a jar, 
> etc.  It also opens Pig to programmers who prefer scripting languages over 
> Java.




[jira] Commented: (PIG-1487) Replace "bz" with ".bz" in all the LoadFunc

2010-07-13 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12888182#action_12888182
 ] 

Ashutosh Chauhan commented on PIG-1487:
---

+1 

> Replace "bz" with ".bz"  in all the LoadFunc
> 
>
> Key: PIG-1487
> URL: https://issues.apache.org/jira/browse/PIG-1487
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Jeff Zhang
>Assignee: Jeff Zhang
> Fix For: 0.8.0
>
> Attachments: PIG_1487.patch
>
>
> This issue relates to PIG-1463. Thanks to Ashutosh for finding another place 
> in PigStorage that should be corrected. I checked all the LoadFuncs and found 
> that TextLoader has the same problem. 




[jira] Commented: (PIG-928) UDFs in scripting languages

2010-07-13 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12888100#action_12888100
 ] 

Ashutosh Chauhan commented on PIG-928:
--

* Do you want to allow: {{register myJavaUDFs.jar using 'java' as 
'javaNameSpace'}}? The use-case could be that if we allow namespaces for 
non-Java UDFs, why not allow them for Java UDFs as well? But {{define}} exists 
exactly for this purpose, so it may make sense to throw an exception in such a 
case.
* In ScriptEngine.getJarPath(), shouldn't you throw a FileNotFoundException 
instead of returning null?
* Don't gobble up checked exceptions and then rethrow RuntimeExceptions. Throw 
checked exceptions if you need to.
* ScriptEngine.getInstance() should be a singleton, no?
* In JythonScriptEngine.getFunction(), I think you should check whether 
interpreter.get(functionName) != null and return it, calling 
Interpreter.init(path) only if it is null.
* In JythonUtils, for type conversion you should make use of both the input 
and output schemas (whenever they are available) and avoid doing reflection for 
every element. You can get hold of the input schema through outputSchema() of 
EvalFunc and then do the UDFContext magic to use it. If schema == null || 
schema == bytearray, you need to resort to reflection. Similarly, if the output 
schema is available via decorators, use it to do the type conversions.  
* In JythonUtils.pythonToPig(), in the case of Tuple you first create an 
Object[] and then call Arrays.asList(); you can directly create a List and 
avoid the unnecessary casting. In the same method, you only check for long; 
don't you need to check for int, String, etc. and cast appropriately? Also, in 
the default case I think we can't let the object pass through as Object.class: 
it could be an object of any type and may cause cryptic errors in the pipeline 
if let through. We should throw an exception if we don't know what type of 
object it is. A similar argument applies to the default case of pigToPython(). 
* I didn't get why the changes are required in POUserFunc. Can you explain, and 
also add the explanation as comments in the code?

Testing:

* This is a big enough feature to warrant its own test file, so consider 
adding a new one (maybe TestNonJavaUDF). Additionally, we see frequent 
timeouts on TestEvalPipeline; we don't want it to run any longer.
* Instead of adding the query through the pigServer.registerCode() API, add it 
through pigServer.registerQuery("register myscript.py using 'jython'"). This 
will make sure we are testing the changes in QueryParser.jjt as well.
* Add more tests, specifically for complex types passed to the UDFs (like 
bags) and for returning a bag. You can get bags after doing a group-by. You 
can also take a look at Julien's original patch, which contained a Python 
script; those were, I think, at the right level of complexity to be added as 
test cases in our JUnit tests.

Nit-picks:

* Unnecessary import in JythonFunction.java.
* In PigContext.java, you are using Vector and LinkedList instead of the usual 
ArrayList. Any particular reason? Just curious.
* More documentation (in QueryParser.jjt, ScriptEngine, and JythonScriptEngine, 
specifically for outputSchema, outputSchemaFunction, and schemaFunction).
* Also keep an eye on the recent "mavenization" efforts in Pig; depending on 
when that gets checked in, you may (or may not) need to make changes to ivy.
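The fail-loudly default case recommended for pythonToPig() can be sketched as a plain-Java dispatch; the type names here are simplified stand-ins for Jython's Py* classes, and the byte[] branch stands in for wrapping in DataByteArray:

```java
public class ConvertDemo {
    // Illustrative conversion: handle each known type explicitly and throw
    // on anything unrecognized rather than letting an arbitrary Object
    // through into the pipeline.
    static Object toPigValue(Object py) {
        if (py == null) {
            throw new IllegalArgumentException("null (None) returned from UDF");
        } else if (py instanceof Long || py instanceof Integer || py instanceof String) {
            return py;                       // known scalar: passes through
        } else if (py instanceof byte[]) {
            return new String((byte[]) py);  // stand-in for DataByteArray wrapping
        }
        // Default case: fail loudly instead of passing an unknown type along.
        throw new IllegalArgumentException("Unexpected type: " + py.getClass().getName());
    }

    public static void main(String[] args) {
        System.out.println(toPigValue(42L));
        System.out.println(toPigValue("abc"));
    }
}
```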

> UDFs in scripting languages
> ---
>
> Key: PIG-928
> URL: https://issues.apache.org/jira/browse/PIG-928
> Project: Pig
>  Issue Type: New Feature
>Reporter: Alan Gates
>Assignee: Aniket Mokashi
> Fix For: 0.8.0
>
> Attachments: calltrace.png, package.zip, PIG-928.patch, 
> pig-greek.tgz, pig.scripting.patch.arnab, pyg.tgz, RegisterPythonUDF3.patch, 
> RegisterPythonUDF4.patch, RegisterPythonUDF_Final.patch, 
> RegisterPythonUDFFinale.patch, RegisterPythonUDFFinale3.patch, 
> RegisterPythonUDFFinale4.patch, RegisterPythonUDFFinale5.patch, 
> RegisterScriptUDFDefineParse.patch, scripting.tgz, scripting.tgz, test.zip
>
>
> It should be possible to write UDFs in scripting languages such as python, 
> ruby, etc.  This frees users from needing to compile Java, generate a jar, 
> etc.  It also opens Pig to programmers who prefer scripting languages over 
> Java.




Re: Add "deepCopy" in LogicalExpression

2010-07-13 Thread Ashutosh Chauhan
You are now venturing into the danger zone :) I tried doing something
similar when we were trying to add this kind of logical optimization to
Pig, but was not able to get my head around it at the time. My attempt
is at:
https://issues.apache.org/jira/browse/PIG-1073
But then, I might have been too naive at the time.

Ashutosh
On Tue, Jul 13, 2010 at 10:48, Swati Jain  wrote:
> Hi Alan,
>
> By default clone creates a shallow copy of the object, in the sense that it
> will create a new instance of the object but the references will be the same.
> Any change applied to either object will be reflected in both.
>
> The deep copy I am proposing will create a completely new object, in the
> sense that changes made to either object will not be reflected in the other.
> We could also override clone to do the same; however, it may be better to
> use "deepCopy" since the copy semantics are then explicit (deepCopy may be
> expensive).
>
> A second important reason for the way I defined deepCopy is that I can pass
> a plan as an argument which will be updated as the expression is copied
> (through plan.add() and plan.connect() ).
>
> Please let me know what you think.
>
> Thanks,
> Swati
>
> On Tue, Jul 13, 2010 at 8:46 AM, Alan Gates  wrote:
>
>> How does deepCopy differ from clone?
>>
>> Alan.
>>
>>
>> On Jul 12, 2010, at 11:19 PM, Swati Jain wrote:
>>
>>  Hi,
>>>
>>> I am working on ticket PIG -1494 (
>>> https://issues.apache.org/jira/browse/PIG-1494 ).
>>>
>>> While implementing this functionality (conversion of logical expression
>>> into
>>> CNF), I need to construct the OperatorPlan for the base expressions of the
>>> CNF. For example, given an expression "(c1 < 10) AND (c3+b3 > 10)", the
>>> CNF
>>> form will result in expressions "(c1 < 10)" and "(c3+b3 > 10)". However,
>>> each of these expressions would be referencing the original OperatorPlan
>>> (that of expression "(c1 < 10) AND (c3+b3 > 10)" ) whereas they should
>>> really be referencing their local OperatorPlan post CNF conversion.
>>>
>>> To ensure correctness of the above approach, I am planning to add a
>>> "deepCopy" method to LogicalExpression to create a copy of expressions. In
>>> my opinion, "deepCopy" will be a useful construct to have in general. It
>>> would be used as follows:
>>>
>>> LogicalExpressionPlan logPlan = new LogicalExpressionPlan();
>>> LogicalExpression copyExpression = origExpression->deepcopy( logPlan );
>>>
>>> Please provide feedback if any on the above approach.
>>>
>>> Note that I considered writing a deepCopy visitor but found that approach
>>> flawed because a valid plan is required for a visitor to work correctly,
>>> and
>>> in this case we need to construct that plan as we copy the expression.
>>>
>>> Thanks
>>> Swati
>>>
>>
>>
>
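The shallow-versus-deep distinction discussed in this thread can be sketched with a toy expression node rather than Pig's LogicalExpression; the class and field names below are illustrative only:

```java
import java.util.ArrayList;
import java.util.List;

public class CopyDemo {
    static class Node {
        List<Node> children = new ArrayList<>();
        String name;
        Node(String name) { this.name = name; }

        // Shallow: the new node shares the same children list, so a change
        // through either node is visible through both.
        Node shallowCopy() {
            Node n = new Node(name);
            n.children = this.children;
            return n;
        }

        // Deep: children are recursively copied, so later edits don't leak back.
        Node deepCopy() {
            Node n = new Node(name);
            for (Node c : children) n.children.add(c.deepCopy());
            return n;
        }
    }

    public static void main(String[] args) {
        Node root = new Node("AND");
        root.children.add(new Node("c1 < 10"));
        Node deep = root.deepCopy();
        Node shallow = root.shallowCopy();
        shallow.children.add(new Node("extra"));   // visible through root too
        System.out.println(root.children.size());  // 2: shallow shares the list
        System.out.println(deep.children.size());  // 1: deep copy is isolated
    }
}
```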


[jira] Commented: (PIG-1486) update ant eclipse-files target to include new jar and remove contrib dirs from build path

2010-07-13 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12887885#action_12887885
 ] 

Ashutosh Chauhan commented on PIG-1486:
---

Took a look at the patch. The changes look good. But, because of PIG-1452, some 
additional changes are required: lib/hadoop20.jar needs to be removed from the 
eclipse build path, and hadoop-core.jar, hadoop-test.jar, apache-commons-*, and 
a few other jars need to be added in; these are now pulled from maven repos and 
put in build/ivy/lib/Pig.

> update ant eclipse-files target to include new jar and remove contrib dirs 
> from build path
> --
>
> Key: PIG-1486
> URL: https://issues.apache.org/jira/browse/PIG-1486
> Project: Pig
>  Issue Type: Bug
>  Components: tools
>Affects Versions: 0.8.0
>Reporter: Thejas M Nair
>Assignee: Thejas M Nair
>Priority: Minor
> Fix For: 0.8.0
>
> Attachments: PIG-1486.patch
>
>
>  .eclipse.templates/.classpath needs to be updated to address following -
> 1. There is a new jar that is used by the code - guava-r03.jar
> 2. The jar "ANT_HOME/lib/ant.jar" gives an 'unbounded jar' error in eclipse.
> 3. Removing the contrib projects from class path as discussed in PIG-1390, 
> until all libs necessary for the contribs are included in classpath.




[jira] Commented: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword

2010-07-11 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12887283#action_12887283
 ] 

Ashutosh Chauhan commented on PIG-1249:
---

The map-reduce framework has a jira related to this issue: 
https://issues.apache.org/jira/browse/MAPREDUCE-1521 It has two implications 
for Pig:

1) We need to reconsider whether we still want Pig to set the number of 
reducers on the user's behalf. We could choose not to "intelligently" pick the 
number of reducers and let the framework fail any job that doesn't "correctly" 
specify it. Then Pig is out of this guessing game, and users are forced by the 
framework to specify the number of reducers correctly. 

2) Now that the MR framework will fail jobs based on configured limits, 
operators where Pig does compute and set the number of reducers (like skewed 
join) should be aware of those limits, so that the number of reducers they 
compute falls within them.
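Implication (2) amounts to clamping whatever reducer count Pig computes to the framework-configured ceiling. A minimal sketch; the method and parameter names are hypothetical, not Pig's actual API:

```java
public class ReducerClamp {
    // Clamp a computed reducer count into [1, frameworkMax], where
    // frameworkMax stands in for the framework's configured limit.
    static int clampReducers(int computed, int frameworkMax) {
        return Math.min(Math.max(computed, 1), frameworkMax);
    }

    public static void main(String[] args) {
        System.out.println(clampReducers(5000, 999)); // 999: capped at the limit
        System.out.println(clampReducers(0, 999));    // 1: at least one reducer
    }
}
```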

> Safe-guards against misconfigured Pig scripts without PARALLEL keyword
> --
>
> Key: PIG-1249
> URL: https://issues.apache.org/jira/browse/PIG-1249
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.8.0
>Reporter: Arun C Murthy
>Assignee: Jeff Zhang
>Priority: Critical
> Fix For: 0.8.0
>
> Attachments: PIG-1249-4.patch, PIG-1249.patch, PIG_1249_2.patch, 
> PIG_1249_3.patch
>
>
> It would be *very* useful for Pig to have safe-guards against naive scripts 
> which process a *lot* of data without the use of PARALLEL keyword.
> We've seen a fair number of instances where naive users process huge 
> data-sets (>10TB) with badly mis-configured #reduces e.g. 1 reduce. 




[jira] Commented: (PIG-1491) Failure planning nested FOREACH with DISTINCT, POLoad cannot be cast to POLocalRearrange

2010-07-09 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886906#action_12886906
 ] 

Ashutosh Chauhan commented on PIG-1491:
---

Scott,

It would be useful if you could also paste the Pig script that produced this 
exception.

> Failure planning nested FOREACH with DISTINCT, POLoad cannot be cast to 
> POLocalRearrange
> 
>
> Key: PIG-1491
> URL: https://issues.apache.org/jira/browse/PIG-1491
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Scott Carey
>
> I have a failure that occurs during planning while using DISTINCT in a nested 
> FOREACH. 
> Caused by: java.lang.ClassCastException: 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLoad
>  cannot be cast to 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.SecondaryKeyOptimizer.visitMROp(SecondaryKeyOptimizer.java:352)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceOper.visit(MapReduceOper.java:218)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceOper.visit(MapReduceOper.java:40)
> at 
> org.apache.pig.impl.plan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:67)




[jira] Commented: (PIG-1389) Implement Pig counter to track number of rows for each input files

2010-07-09 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886904#action_12886904
 ] 

Ashutosh Chauhan commented on PIG-1389:
---

+1 

Discussed 3) with Richard offline. Though it would theoretically be better to 
detect the features on the fully compiled and optimized MR plan, that would be 
hard and may not be worth the complexity. So, in this first pass it is fine to 
mark those features while the MR plan's compilation is in progress. As a 
result, in a few corner cases the features marked for an MR operator may not be 
correct; we will fix up those cases as and when they come up.

> Implement Pig counter to track number of rows for each input files 
> ---
>
> Key: PIG-1389
> URL: https://issues.apache.org/jira/browse/PIG-1389
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.7.0
>Reporter: Richard Ding
>Assignee: Richard Ding
> Fix For: 0.8.0
>
> Attachments: PIG-1389.patch, PIG-1389.patch, PIG-1389_1.patch, 
> PIG-1389_2.patch
>
>
> A MR job generated by Pig not only can have multiple outputs (in the case of 
> multiquery) but also can have multiple inputs (in the case of join or 
> cogroup). In both cases, the existing Hadoop counters (e.g. 
> MAP_INPUT_RECORDS, REDUCE_OUTPUT_RECORDS) can not be used to count the number 
> of records in the given input or output.  PIG-1299 addressed the case of 
> multiple outputs.  We need to add new counters for jobs with multiple inputs.




[jira] Updated: (PIG-1309) Map-side Cogroup

2010-07-09 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1309:
--

Status: Resolved  (was: Patch Available)
Resolution: Fixed

Patch checked-in to 0.7 branch as well.

> Map-side Cogroup
> 
>
> Key: PIG-1309
> URL: https://issues.apache.org/jira/browse/PIG-1309
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>        Reporter: Ashutosh Chauhan
>    Assignee: Ashutosh Chauhan
> Fix For: 0.8.0
>
> Attachments: mapsideCogrp.patch, pig-1309_1.patch, pig-1309_2.patch, 
> PIG_1309_7.patch
>
>
> In our never-ending quest to make Pig go faster, we want to parallelize as 
> many relational operations as possible. It is already possible to do group-by 
> (PIG-984) and joins (PIG-845, PIG-554) purely map-side in Pig. This jira is 
> to add a map-side implementation of cogroup in Pig. Details to follow.




[jira] Commented: (PIG-1463) Replace "bz" with ".bz" in setStoreLocation in PigStorage

2010-07-07 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12885997#action_12885997
 ] 

Ashutosh Chauhan commented on PIG-1463:
---

Jeff,

A similar problem exists in getInputFormat() of PigStorage: there is no leading 
'.' before bz2 or bz. As a result, Pig may attempt to load filenames ending 
with bz (such as myfilebz) as compressed bzip files. Would you like to take a 
look?
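The false positive described above is easy to demonstrate: checking endsWith("bz") also matches plain filenames like "myfilebz", while endsWith(".bz") only matches the actual compression extension:

```java
public class ExtensionCheckDemo {
    // Buggy check: matches any name whose last two characters are "bz".
    static boolean looksCompressedWrong(String name) { return name.endsWith("bz"); }

    // Corrected check: requires the '.' separator before the extension.
    static boolean looksCompressedRight(String name) { return name.endsWith(".bz"); }

    public static void main(String[] args) {
        System.out.println(looksCompressedWrong("myfilebz"));  // true: false positive
        System.out.println(looksCompressedRight("myfilebz"));  // false
        System.out.println(looksCompressedRight("data.bz"));   // true
    }
}
```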

> Replace "bz" with ".bz" in setStoreLocation in PigStorage 
> --
>
> Key: PIG-1463
> URL: https://issues.apache.org/jira/browse/PIG-1463
> Project: Pig
>  Issue Type: Bug
>Reporter: Jeff Zhang
>Assignee: Jeff Zhang
> Fix For: 0.8.0
>
> Attachments: PIG_1463.patch
>
>





[jira] Updated: (PIG-1309) Map-side Cogroup

2010-07-02 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1309:
--

Attachment: PIG_1309_7.patch

Backport of merge cogroup for the 0.7 branch. Since hudson can test only trunk, 
I manually ran all the tests; all passed.

> Map-side Cogroup
> 
>
> Key: PIG-1309
> URL: https://issues.apache.org/jira/browse/PIG-1309
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>        Reporter: Ashutosh Chauhan
>    Assignee: Ashutosh Chauhan
> Fix For: 0.8.0
>
> Attachments: mapsideCogrp.patch, pig-1309_1.patch, pig-1309_2.patch, 
> PIG_1309_7.patch
>
>
> In our never-ending quest to make Pig go faster, we want to parallelize as 
> many relational operations as possible. It is already possible to do group-by 
> (PIG-984) and joins (PIG-845, PIG-554) purely map-side in Pig. This jira is 
> to add a map-side implementation of cogroup in Pig. Details to follow.




[jira] Commented: (PIG-1449) RegExLoader hangs on lines that don't match the regular expression

2010-07-01 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884552#action_12884552
 ] 

Ashutosh Chauhan commented on PIG-1449:
---

@Christian,

It would definitely be useful to get the execution time of the tests down; it 
currently takes a while to run all the Pig tests.

> RegExLoader hangs on lines that don't match the regular expression
> --
>
> Key: PIG-1449
> URL: https://issues.apache.org/jira/browse/PIG-1449
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Justin Sanders
>Priority: Minor
> Fix For: 0.8.0
>
> Attachments: PIG-1449-RegExLoaderInfiniteLoopFix.patch, 
> RegExLoader.patch
>
>
> In the 0.7.0 changes to RegExLoader there was a bug introduced where the code 
> will stay in the while loop if the line isn't matched.  Before 0.7.0 these 
> lines would be skipped if they didn't match the regular expression.  The 
> result is the mapper will not respond and will time out with "Task attempt_X 
> failed to report status for 600 seconds. Killing!".
> Here are the steps to recreate the bug:
> Create a text file in HDFS with the following lines:
> test1
> testA
> test2
> Run the following pig script:
> REGISTER /usr/local/pig/contrib/piggybank/java/piggybank.jar;
> test = LOAD '/path/to/test.txt' using 
> org.apache.pig.piggybank.storage.MyRegExLoader('(test\\d)') AS (line);
> dump test;
> Expected result:
> (test1)
> (test3)
> Actual result:
> Job fails to complete after 600 second timeout waiting on the mapper to 
> complete.  The mapper hangs at 33% since it can process the first line but 
> gets stuck into the while loop on the second line.
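The fix described in this issue amounts to always consuming a line per iteration and skipping non-matching lines instead of re-testing them forever. A simplified, self-contained sketch of that loop shape (not the actual RegExLoader code):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexSkipDemo {
    // Each iteration consumes exactly one line; non-matching lines are
    // skipped rather than retried, so the loop always terminates.
    static List<String> matches(List<String> lines, String regex) {
        Pattern p = Pattern.compile(regex);
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            Matcher m = p.matcher(line);
            if (m.matches()) out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        // Mirrors the repro data from the issue: "testA" is simply skipped.
        System.out.println(matches(Arrays.asList("test1", "testA", "test2"), "(test\\d)"));
    }
}
```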




[jira] Updated: (PIG-1449) RegExLoader hangs on lines that don't match the regular expression

2010-07-01 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1449:
--

   Status: Resolved  (was: Patch Available)
Fix Version/s: 0.8.0
   Resolution: Fixed

> RegExLoader hangs on lines that don't match the regular expression
> --
>
> Key: PIG-1449
> URL: https://issues.apache.org/jira/browse/PIG-1449
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Justin Sanders
>Priority: Minor
> Fix For: 0.8.0
>
> Attachments: PIG-1449-RegExLoaderInfiniteLoopFix.patch, 
> RegExLoader.patch
>
>
> In the 0.7.0 changes to RegExLoader there was a bug introduced where the code 
> will stay in the while loop if the line isn't matched.  Before 0.7.0 these 
> lines would be skipped if they didn't match the regular expression.  The 
> result is the mapper will not respond and will time out with "Task attempt_X 
> failed to report status for 600 seconds. Killing!".
> Here are the steps to recreate the bug:
> Create a text file in HDFS with the following lines:
> test1
> testA
> test2
> Run the following pig script:
> REGISTER /usr/local/pig/contrib/piggybank/java/piggybank.jar;
> test = LOAD '/path/to/test.txt' using 
> org.apache.pig.piggybank.storage.MyRegExLoader('(test\\d)') AS (line);
> dump test;
> Expected result:
> (test1)
> (test3)
> Actual result:
> Job fails to complete after 600 second timeout waiting on the mapper to 
> complete.  The mapper hangs at 33% since it can process the first line but 
> gets stuck into the while loop on the second line.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1449) RegExLoader hangs on lines that don't match the regular expression

2010-07-01 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884551#action_12884551
 ] 

Ashutosh Chauhan commented on PIG-1449:
---

Reran the contrib tests. All passed. Patch committed. Thanks, Christian and 
Justin, for working on this!

> RegExLoader hangs on lines that don't match the regular expression
> --
>
> Key: PIG-1449
> URL: https://issues.apache.org/jira/browse/PIG-1449
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Justin Sanders
>Priority: Minor
> Attachments: PIG-1449-RegExLoaderInfiniteLoopFix.patch, 
> RegExLoader.patch
>
>
> In the 0.7.0 changes to RegExLoader there was a bug introduced where the code 
> will stay in the while loop if the line isn't matched.  Before 0.7.0 these 
> lines would be skipped if they didn't match the regular expression.  The 
> result is the mapper will not respond and will time out with "Task attempt_X 
> failed to report status for 600 seconds. Killing!".
> Here are the steps to recreate the bug:
> Create a text file in HDFS with the following lines:
> test1
> testA
> test2
> Run the following pig script:
> REGISTER /usr/local/pig/contrib/piggybank/java/piggybank.jar;
> test = LOAD '/path/to/test.txt' using 
> org.apache.pig.piggybank.storage.MyRegExLoader('(test\\d)') AS (line);
> dump test;
> Expected result:
> (test1)
> (test3)
> Actual result:
> Job fails to complete after 600 second timeout waiting on the mapper to 
> complete.  The mapper hangs at 33% since it can process the first line but 
> gets stuck into the while loop on the second line.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1449) RegExLoader hangs on lines that don't match the regular expression

2010-07-01 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1449:
--

Status: Open  (was: Patch Available)

> RegExLoader hangs on lines that don't match the regular expression
> --
>
> Key: PIG-1449
> URL: https://issues.apache.org/jira/browse/PIG-1449
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Justin Sanders
>Priority: Minor
> Attachments: PIG-1449-RegExLoaderInfiniteLoopFix.patch, 
> RegExLoader.patch
>
>
> In the 0.7.0 changes to RegExLoader there was a bug introduced where the code 
> will stay in the while loop if the line isn't matched.  Before 0.7.0 these 
> lines would be skipped if they didn't match the regular expression.  The 
> result is the mapper will not respond and will time out with "Task attempt_X 
> failed to report status for 600 seconds. Killing!".
> Here are the steps to recreate the bug:
> Create a text file in HDFS with the following lines:
> test1
> testA
> test2
> Run the following pig script:
> REGISTER /usr/local/pig/contrib/piggybank/java/piggybank.jar;
> test = LOAD '/path/to/test.txt' using 
> org.apache.pig.piggybank.storage.MyRegExLoader('(test\\d)') AS (line);
> dump test;
> Expected result:
> (test1)
> (test3)
> Actual result:
> Job fails to complete after 600 second timeout waiting on the mapper to 
> complete.  The mapper hangs at 33% since it can process the first line but 
> gets stuck into the while loop on the second line.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1449) RegExLoader hangs on lines that don't match the regular expression

2010-07-01 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1449:
--

Status: Patch Available  (was: Open)

Running through Hudson.

> RegExLoader hangs on lines that don't match the regular expression
> --
>
> Key: PIG-1449
> URL: https://issues.apache.org/jira/browse/PIG-1449
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Justin Sanders
>Priority: Minor
> Attachments: PIG-1449-RegExLoaderInfiniteLoopFix.patch, 
> RegExLoader.patch
>
>
> In the 0.7.0 changes to RegExLoader there was a bug introduced where the code 
> will stay in the while loop if the line isn't matched.  Before 0.7.0 these 
> lines would be skipped if they didn't match the regular expression.  The 
> result is the mapper will not respond and will time out with "Task attempt_X 
> failed to report status for 600 seconds. Killing!".
> Here are the steps to recreate the bug:
> Create a text file in HDFS with the following lines:
> test1
> testA
> test2
> Run the following pig script:
> REGISTER /usr/local/pig/contrib/piggybank/java/piggybank.jar;
> test = LOAD '/path/to/test.txt' using 
> org.apache.pig.piggybank.storage.MyRegExLoader('(test\\d)') AS (line);
> dump test;
> Expected result:
> (test1)
> (test3)
> Actual result:
> Job fails to complete after 600 second timeout waiting on the mapper to 
> complete.  The mapper hangs at 33% since it can process the first line but 
> gets stuck into the while loop on the second line.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1424) Error logs of streaming should not be placed in output location

2010-07-01 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884365#action_12884365
 ] 

Ashutosh Chauhan commented on PIG-1424:
---

This turns out to be much more involved than I initially thought. The 
assumption that the output/input location is a file-based path exists in more 
than one place in Pig. In particular, Streaming makes this assumption explicit 
and has it in its semantics. We need to be careful about streaming semantics 
before we fix this. More at: http://wiki.apache.org/pig/PigStreamingFunctionalSpec

> Error logs of streaming should not be placed in output location
> ---
>
> Key: PIG-1424
> URL: https://issues.apache.org/jira/browse/PIG-1424
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.7.0
>Reporter: Ashutosh Chauhan
> Fix For: 0.8.0
>
>
> This becomes a problem when the output location is anything other than a 
> filesystem. Output will be written to the DB, but where should the logs 
> generated by streaming go? Clearly, they can't be written into the DB. This 
> blocks PIG-1229, which introduces writing to a DB from Pig.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1389) Implement Pig counter to track number of rows for each input files

2010-06-30 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884116#action_12884116
 ] 

Ashutosh Chauhan commented on PIG-1389:
---

1.
{code}
+/**
+ * Returns the counter name for the given input file name
+ * 
+ * @param fname the input file name
+ * @return the counter name
+ */
+public static String getMultiInputsCounterName(String fname) {
+return MULTI_INPUTS_RECORD_COUNTER +
+new Path(fname).getName();
+}

{code}

It's dangerous to assume that the input is a file name. It may not be; it can 
be a JDBC location string. In particular, new Path(fname) parses fname and 
throws an exception if the String is not in the form it expects. So, at various 
places in the patch, don't assume the path refers to a file location; in 
particular, avoid using Path() and deal in Strings instead.

2. In PigRecordReader, initialization of the counters should be done in 
initialize() instead of getCurrentValue(); that avoids branching on every call 
to getCurrentValue().

3. Marking features in MRCompiler while compilation is still in progress may 
lead to incorrect results. We do a bunch of optimizations *after* the MR plan 
is constructed, during which the plan may get readjusted and features recorded 
in one particular MROper may get pushed into a different MR Oper. A better way 
to do this marking is post-construction of the MR plan: have a visitor walk 
the final MR plan and mark the features in those operators.

4. As an extension of 1, I think having a test for a non-file-based 
input/output location would be really useful. PIG-1229 would have made that 
super-easy.
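
Point 1 above can be sketched as follows. This is an illustrative sketch only: 
the constant and method names are hypothetical, not the ones in the patch. The 
idea is to derive the counter suffix with plain string handling, so that a 
non-file location such as a JDBC string never goes through Path parsing.

```java
// Hypothetical sketch: derive a per-input counter name from an arbitrary
// location string without assuming it is a file-system path.
public class CounterNameSketch {
    static final String MULTI_INPUTS_RECORD_COUNTER = "MultiInputCounters_";

    static String counterNameFor(String location) {
        // Plain string handling: take everything after the last '/',
        // or the whole string when there is no trailing component
        // (e.g. a JDBC URL with no path part).
        int slash = location.lastIndexOf('/');
        String tail = (slash >= 0 && slash + 1 < location.length())
                ? location.substring(slash + 1)
                : location;
        return MULTI_INPUTS_RECORD_COUNTER + tail;
    }

    public static void main(String[] args) {
        // A file-based location and a JDBC-style location both work.
        System.out.println(counterNameFor("hdfs://nn:8020/data/part-00000"));
        System.out.println(counterNameFor("jdbc:mysql://host:3306/db?table=t"));
    }
}
```

With `new Path("jdbc:mysql://host:3306/db?table=t")` the same derivation could 
throw at parse time, which is the failure mode the review warns about.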

> Implement Pig counter to track number of rows for each input files 
> ---
>
> Key: PIG-1389
> URL: https://issues.apache.org/jira/browse/PIG-1389
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.7.0
>Reporter: Richard Ding
>Assignee: Richard Ding
> Fix For: 0.8.0
>
> Attachments: PIG-1389.patch, PIG-1389.patch, PIG-1389_1.patch
>
>
> A MR job generated by Pig not only can have multiple outputs (in the case of 
> multiquery) but also can have multiple inputs (in the case of join or 
> cogroup). In both cases, the existing Hadoop counters (e.g. 
> MAP_INPUT_RECORDS, REDUCE_OUTPUT_RECORDS) can not be used to count the number 
> of records in the given input or output.  PIG-1299 addressed the case of 
> multiple outputs.  We need to add new counters for jobs with multiple inputs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1389) Implement Pig counter to track number of rows for each input files

2010-06-28 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883247#action_12883247
 ] 

Ashutosh Chauhan commented on PIG-1389:
---

In the cases of Merge Join and Merge Cogroup there is an inherent possibility, 
due to the design, of double-counting or under-counting the records from the 
side loaders. So, in those cases the reported numbers may confuse users.

> Implement Pig counter to track number of rows for each input files 
> ---
>
> Key: PIG-1389
> URL: https://issues.apache.org/jira/browse/PIG-1389
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.7.0
>Reporter: Richard Ding
>Assignee: Richard Ding
> Fix For: 0.8.0
>
> Attachments: PIG-1389.patch, PIG-1389.patch
>
>
> A MR job generated by Pig not only can have multiple outputs (in the case of 
> multiquery) but also can have multiple inputs (in the case of join or 
> cogroup). In both cases, the existing Hadoop counters (e.g. 
> MAP_INPUT_RECORDS, REDUCE_OUTPUT_RECORDS) can not be used to count the number 
> of records in the given input or output.  PIG-1299 addressed the case of 
> multiple outputs.  We need to add new counters for jobs with multiple inputs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1470) map/red jobs fail using G1 GC (Couldn't find heap)

2010-06-28 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883205#action_12883205
 ] 

Ashutosh Chauhan commented on PIG-1470:
---

This is actually a bug in G1: 
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6815790 Towards the bottom 
of the page there is a comment: 
{code}
Evaluation  The monitoring and management support for G1 is yet to be 
implemented
{code}

I think until this gets fixed in G1, we should recommend that users not use G1.
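
The failure mode can be illustrated with the java.lang.management API. This is 
a sketch of the kind of probe SpillableMemoryManager performs at construction 
time, not its actual code: it looks for a heap pool whose usage thresholds can 
be monitored, and on early G1 builds no such pool was exposed.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;

// Sketch: find a heap memory pool that supports usage-threshold monitoring.
// Under early G1 this search came up empty, which is why the manager failed
// with "Couldn't find heap".
public class HeapPoolProbe {

    static MemoryPoolMXBean findMonitorablePool() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() == MemoryType.HEAP
                    && pool.isUsageThresholdSupported()
                    && pool.isCollectionUsageThresholdSupported()) {
                return pool;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        MemoryPoolMXBean pool = findMonitorablePool();
        if (pool == null) {
            // The situation the stack trace above shows under G1.
            throw new RuntimeException("Couldn't find heap");
        }
        System.out.println("Monitoring pool: " + pool.getName());
    }
}
```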

> map/red jobs fail using G1 GC (Couldn't find heap)
> --
>
> Key: PIG-1470
> URL: https://issues.apache.org/jira/browse/PIG-1470
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.6.0
> Environment: OS: 2.6.27.19-5-default #1 SMP 2009-02-28 04:40:21 +0100 
> x86_64 x86_64 x86_64 GNU/Linux
> Java: Java(TM) SE Runtime Environment (build 1.6.0_18-b07)
> Hadoop: 0.20.1
>Reporter: Randy Prager
>
> Here is the hadoop map/red configuration (conf/mapred-site.xml) that fails
> {noformat}
>  
> mapred.child.java.opts
> -Xmx300m -XX:+DoEscapeAnalysis -XX:+UseCompressedOops 
> -XX:+UnlockExperimentalVMOptions -XX:+UseG1GC
> 
> {noformat}
> Here is the hadoop map/red configuration that succeeds
> {noformat}
>  
> mapred.child.java.opts
> -Xmx300m -XX:+DoEscapeAnalysis 
> -XX:+UseCompressedOops
> 
> {noformat}
> Here is the exception from the pig script.
> {noformat}
> Backend error message
> -
> org.apache.pig.backend.executionengine.ExecException: ERROR 2081: Unable to 
> set up the load function.
> at 
> org.apache.pig.backend.executionengine.PigSlice.init(PigSlice.java:89)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.SliceWrapper.makeReader(SliceWrapper.java:144)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getRecordReader(PigInputFormat.java:282)
> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:338)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
> at org.apache.hadoop.mapred.Child.main(Child.java:170)
> Caused by: java.lang.RuntimeException: could not instantiate 'PigStorage' 
> with arguments '[,]'
> at 
> org.apache.pig.impl.PigContext.instantiateFuncFromSpec(PigContext.java:519)
> at 
> org.apache.pig.backend.executionengine.PigSlice.init(PigSlice.java:85)
> ... 5 more
> Caused by: java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
> at 
> org.apache.pig.impl.PigContext.instantiateFuncFromSpec(PigContext.java:487)
> ... 6 more
> Caused by: java.lang.RuntimeException: Couldn't find heap
> at 
> org.apache.pig.impl.util.SpillableMemoryManager.<init>(SpillableMemoryManager.java:95)
> at org.apache.pig.data.BagFactory.<init>(BagFactory.java:106)
> at 
> org.apache.pig.data.DefaultBagFactory.<init>(DefaultBagFactory.java:71)
> at org.apache.pig.data.BagFactory.getInstance(BagFactory.java:76)
> at 
> org.apache.pig.builtin.Utf8StorageConverter.<init>(Utf8StorageConverter.java:49)
> at org.apache.pig.builtin.PigStorage.<init>(PigStorage.java:69)
> at org.apache.pig.builtin.PigStorage.<init>(PigStorage.java:79)
> ... 11 more
> {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1466) Improve log messages for memory usage

2010-06-25 Thread Ashutosh Chauhan (JIRA)
Improve log messages for memory usage
-

 Key: PIG-1466
 URL: https://issues.apache.org/jira/browse/PIG-1466
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
Priority: Minor


For anything more than a moderately sized dataset, Pig usually emits the 
following messages:
{code}
2010-05-27 18:28:31,659 INFO org.apache.pig.impl.util.SpillableMemoryManager: 
low memory handler called (Usage
threshold exceeded) init = 4194304(4096K) used = 672012960(656262K) committed = 
954466304(932096K) max =
954466304(932096K)

2010-05-27 18:10:52,653 INFO org.apache.pig.impl.util.SpillableMemoryManager: 
low memory handler called (Collection
threshold exceeded) init = 4194304(4096K) used = 954466304(932096K) committed = 
954466304(932096K) max =
954466304(932096K)
{code}

This seems to confuse users a lot. Once these messages are printed, users tend 
to believe that Pig is having a hard time with memory, is spilling to disk, 
etc., when in fact Pig might be cruising along at ease. We should be a little 
more careful about what we print in the logs. Currently these are printed when 
a notification is sent by the JVM and some other conditions are met, which 
does not necessarily indicate a low-memory condition. Furthermore, with 
{{InternalCachedBag}} embraced everywhere in place of {{DefaultBag}}, these 
messages have lost their usefulness. At the very least, we should lower the 
log level at which they are printed. 
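
For context, those log lines originate from JMX memory-threshold 
notifications. A minimal sketch of the mechanism (illustrative, not Pig's 
actual SpillableMemoryManager code):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryNotificationInfo;
import javax.management.Notification;
import javax.management.NotificationEmitter;

// Sketch: the JVM delivers a notification whenever a pool's usage (or
// collection-usage) threshold is crossed. A crossing is not by itself
// evidence of memory pressure, which is one reason logging every
// notification at INFO misleads users.
public class LowMemoryListenerSketch {

    static boolean register() {
        NotificationEmitter emitter =
                (NotificationEmitter) ManagementFactory.getMemoryMXBean();
        emitter.addNotificationListener((Notification n, Object handback) -> {
            if (MemoryNotificationInfo.MEMORY_THRESHOLD_EXCEEDED.equals(n.getType())) {
                // Logging at DEBUG here would be a better fit than INFO.
                System.out.println("usage threshold crossed: " + n.getMessage());
            }
        }, null, null);
        return true;
    }

    public static void main(String[] args) {
        if (register()) {
            System.out.println("listener registered");
        }
    }
}
```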

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1463) Replace "bz" with ".bz" in setStoreLocation in PigStorage

2010-06-23 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882017#action_12882017
 ] 

Ashutosh Chauhan commented on PIG-1463:
---

+1

> Replace "bz" with ".bz" in setStoreLocation in PigStorage 
> --
>
> Key: PIG-1463
> URL: https://issues.apache.org/jira/browse/PIG-1463
> Project: Pig
>  Issue Type: Bug
>Reporter: Jeff Zhang
>Assignee: Jeff Zhang
> Attachments: PIG_1463.patch
>
>


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1462) No informative error message on parse problem

2010-06-22 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881550#action_12881550
 ] 

Ashutosh Chauhan commented on PIG-1462:
---

This has come up before. As noted on PIG-798, the correct way to achieve this is
{code}
grunt> in = load 'data' using PigStorage() as (m:map[]); 
grunt> tags = foreach in generate (tuple(chararray)) m#'k1' as tagtuple;

grunt> dump tags;

{code}
 
We probably need to add a note about casting in the cookbook. We also need to 
generate a better error message.

> No informative error message on parse problem
> -
>
> Key: PIG-1462
> URL: https://issues.apache.org/jira/browse/PIG-1462
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Ankur
>
> Consider the following script
> in = load 'data' using PigStorage() as (m:map[]);
> tags = foreach in generate m#'k1' as (tagtuple: tuple(chararray));
> dump tags;
> This throws the following error message that does not really say that this is 
> a bad declaration
> org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error during 
> parsing. Encountered "" at line 2, column 38.
> Was expecting one of:
> 
>   at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1170)
>   at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1114)
>   at org.apache.pig.PigServer.registerQuery(PigServer.java:425)
>   at 
> org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:737)
>   at 
> org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:324)
>   at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162)
>   at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138)
>   at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
>   at org.apache.pig.Main.main(Main.java:391)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1461) support union operation that merges based on column names

2010-06-22 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881474#action_12881474
 ] 

Ashutosh Chauhan commented on PIG-1461:
---

W.r.t. the language, I think
{code}
 U = union L1, L2 using 'merge';
{code}
is better than
{code}
U = unionschema L1, L2;
{code}

because U is indeed a union, with duplicate columns eliminated, and the user 
doesn't need to learn a new operator. 
Internally, it's better for Pig to avoid introducing a new physical operator 
if we can.


> support union operation that merges based on column names
> -
>
> Key: PIG-1461
> URL: https://issues.apache.org/jira/browse/PIG-1461
> Project: Pig
>  Issue Type: New Feature
>  Components: impl
>Affects Versions: 0.8.0
>Reporter: Thejas M Nair
> Fix For: 0.8.0
>
>
> When the data has schema, it often makes sense to union on column names in 
> schema rather than the position of the columns. 
> The behavior of existing union operator should remain backward compatible .
> This feature can be supported using either a new operator or extending union 
> to support 'using' clause . I am thinking of having a new operator called 
> either unionschema or merge . Does anybody have any other suggestions for the 
> syntax ?
> example -
> L1 = load 'x' as (a,b);
> L2 = load 'y' as (b,c);
> U = unionschema L1, L2;
> describe U;
> U: {a:bytearray, b:byetarray, c:bytearray}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1427) Monitor and kill runaway UDFs

2010-06-21 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12880881#action_12880881
 ] 

Ashutosh Chauhan commented on PIG-1427:
---

It seems you missed the ivy.xml bits in the latest patch. +1 otherwise; please 
commit if tests pass.

> Monitor and kill runaway UDFs
> -
>
> Key: PIG-1427
> URL: https://issues.apache.org/jira/browse/PIG-1427
> Project: Pig
>  Issue Type: New Feature
>Affects Versions: 0.8.0
>Reporter: Dmitriy V. Ryaboy
>Assignee: Dmitriy V. Ryaboy
> Attachments: guava-r03.jar, monitoredUdf.patch, monitoredUdf.patch, 
> PIG-1427.diff, PIG-1427.diff
>
>
> As a safety measure, it is sometimes useful to monitor UDFs as they execute. 
> It is often preferable to return null or some other default value instead of 
> timing out a runaway evaluation and killing a job. We have in the past seen 
> complex regular expressions lead to job failures due to just half a dozen 
> (out of millions) particularly obnoxious strings.
> It would be great to give Pig users a lightweight way of enabling UDF 
> monitoring.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: the last job in the mapreduce plan

2010-06-16 Thread Ashutosh Chauhan
Without knowing much about your operator, I will suggest a couple of
things. In general, you should avoid designing operators which
require you to take explicit actions on the pipeline (which in your case
requires the pipeline to be closed immediately). There are currently
operators in Pig which do that, but we should carefully consider
adding more, as their interactions in the pipeline lead to special cases.
If that's not feasible, then one option for you could be to write a
visitor which traverses the generated MR plan. If it finds your
operator in the pipeline, it can look at whether any more MR
operators follow the current one and whether it is safe to remove them or
readjust the pipeline.
This may be more complicated than it needs to be; if you can shed
more light on your operator, I may be able to suggest a better
alternative.

Hope it helps,
Ashutosh

On Wed, Jun 16, 2010 at 07:20, Gang Luo  wrote:
> Thanks for replying. Actually, I haven't observed such a thing happening in 
> Pig now. But one of the operators I am implementing in Pig requires ending 
> the current MR operator afterwards. That issue may happen in my case.
>
> -Gang
>
>
>
> ----- Original Message -----
> From: Ashutosh Chauhan 
> To: pig-dev@hadoop.apache.org
> Sent: 2010/6/15 (Tue) 1:24:46 PM
> Subject: Re: the last job in the mapreduce plan
>
> Gang,
>
> What you are saying can never happen because we create a new MR
> operator only when we have a blocking operator which needs to go in
> the next MR operator. We don't create a new MR operator a priori without
> looking at the next physical operator in the pipeline. If you are seeing
> this happening, I would consider that a bug.
>
> Ashutosh
>
> On Tue, Jun 15, 2010 at 09:26, Alan Gates  wrote:
>> I've never seen a case where this happens. Is this a theoretical question
>> or are you seeing this issue?
>>
>> Alan.
>>
>> On Jun 15, 2010, at 8:49 AM, Gang Luo wrote:
>>
>>> Hi,
>>> Is it possible the last MapReduce job in the MR plan only loads something
>>> and stores it without any other processing in between? For example, when
>>> visiting some physical operator, we need to end the current MR operator
>>> after embedding the physical operator into MR operator, and create a new MR
>>> operator for later physical operators. Unfortunately, the following physical
>>> operator is a store, the end of the entire query. In this case, the last MR
>>> operator only contain load and store without any meaningful work in between.
>>> This idle MapReduce job will degrade the performance. Will this happen in
>>> Pig?
>>>
>>> Thanks,
>>> -Gang
>>>
>>>
>>>
>>>
>>
>>
>
>
>
>
>


Re: the last job in the mapreduce plan

2010-06-15 Thread Ashutosh Chauhan
Gang,

What you are saying can never happen, because we create a new MR
operator only when we have a blocking operator which needs to go in
the next MR operator. We don't create a new MR operator a priori without
looking at the next physical operator in the pipeline. If you are seeing
this happening, I would consider that a bug.

Ashutosh

On Tue, Jun 15, 2010 at 09:26, Alan Gates  wrote:
> I've never seen a case where this happens.  Is this a theoretical question
> or are you seeing this issue?
>
> Alan.
>
> On Jun 15, 2010, at 8:49 AM, Gang Luo wrote:
>
>> Hi,
>> Is it possible the last MapReduce job in the MR plan only loads something
>> and stores it without any other processing in between? For example, when
>> visiting some physical operator, we need to end the current MR operator
>> after embedding the physical operator into MR operator, and create a new MR
>> operator for later physical operators. Unfortunately, the following physical
>> operator is a store, the end of the entire query. In this case, the last MR
>> operator only contain load and store without any meaningful work in between.
>> This idle MapReduce job will degrade the performance. Will this happen in
>> Pig?
>>
>> Thanks,
>> -Gang
>>
>>
>>
>>
>
>


[jira] Commented: (PIG-1449) RegExLoader hangs on lines that don't match the regular expression

2010-06-14 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12878663#action_12878663
 ] 

Ashutosh Chauhan commented on PIG-1449:
---

Justin,

Good catch. Can you add your test case as a JUnit test in one of 
piggybank/test/storage/TestRegExLoader or TestMyRegExLoader? That way we'll 
have a regression test for the issue.

> RegExLoader hangs on lines that don't match the regular expression
> --
>
> Key: PIG-1449
> URL: https://issues.apache.org/jira/browse/PIG-1449
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Justin Sanders
>Priority: Minor
> Attachments: RegExLoader.patch
>
>
> In the 0.7.0 changes to RegExLoader there was a bug introduced where the code 
> will stay in the while loop if the line isn't matched.  Before 0.7.0 these 
> lines would be skipped if they didn't match the regular expression.  The 
> result is the mapper will not respond and will time out with "Task attempt_X 
> failed to report status for 600 seconds. Killing!".
> Here are the steps to recreate the bug:
> Create a text file in HDFS with the following lines:
> test1
> testA
> test2
> Run the following pig script:
> REGISTER /usr/local/pig/contrib/piggybank/java/piggybank.jar;
> test = LOAD '/path/to/test.txt' using 
> org.apache.pig.piggybank.storage.MyRegExLoader('(test\\d)') AS (line);
> dump test;
> Expected result:
> (test1)
> (test3)
> Actual result:
> Job fails to complete after 600 second timeout waiting on the mapper to 
> complete.  The mapper hangs at 33% since it can process the first line but 
> gets stuck into the while loop on the second line.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1442) java.lang.OutOfMemoryError: Java heap space (Reopen of PIG-766)

2010-06-12 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12878336#action_12878336
 ] 

Ashutosh Chauhan commented on PIG-1442:
---

This looks like a variant of PIG-1446 and PIG-1448. PigCombiner attaches the 
tuple to the roots of the combine plan but never detaches it. PODemux also 
attaches the tuple to its inner plan but never detaches it. Note that 
PigCombiner may also contain multiple pipelines, depending on the number of 
operations done inside FOREACH, resulting in problems similar to those 
described in PIG-1448.
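
The attach/detach discipline being described can be sketched with stand-in 
types. These are hypothetical, not Pig's real PhysicalOperator API: the point 
is only that every attachInput() needs a matching detachInput(), or the last 
tuple handed to the plan stays reachable for the lifetime of the operator.

```java
// Stand-in for the root operator of an inner (e.g. combine) plan.
class InnerPlanRoot {
    private Object input;                       // retained if never cleared

    void attachInput(Object tuple) { input = tuple; }
    Object process()               { return input; }
    void detachInput()             { input = null; }   // the missing step
}

public class AttachDetachSketch {
    public static void main(String[] args) {
        InnerPlanRoot root = new InnerPlanRoot();
        root.attachInput("tuple-1");
        System.out.println(root.process());
        root.detachInput();             // without this, "tuple-1" leaks
        System.out.println(root.process());  // null: reference released
    }
}
```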

> java.lang.OutOfMemoryError: Java heap space (Reopen of PIG-766)
> ---
>
> Key: PIG-1442
> URL: https://issues.apache.org/jira/browse/PIG-1442
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.2.0, 0.7.0
> Environment: Apache-Hadoop 0.20.2 + Pig 0.7.0 and also for 0.8.0-dev 
> (18/may)
> Hadoop-0.18.3 (cloudera RPMs) + PIG 0.2.0
>Reporter: Dirk Schmid
>
> As mentioned by Ashutosh this is a reopen of 
> https://issues.apache.org/jira/browse/PIG-766 because there is still a 
> problem which causes that PIG scales only by memory.
> For convenience here comes the last entry of the PIG-766-Jira-Ticket:
> {quote}1. Are you getting the exact same stack trace as mentioned in the 
> jira?{quote} Yes the same and some similar traces:
> {noformat}
> java.lang.OutOfMemoryError: Java heap space
>   at java.util.Arrays.copyOf(Arrays.java:2786)
>   at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:94)
>   at java.io.DataOutputStream.write(DataOutputStream.java:90)
>   at java.io.FilterOutputStream.write(FilterOutputStream.java:80)
>   at 
> org.apache.pig.data.DataReaderWriter.writeDatum(DataReaderWriter.java:279)
>   at org.apache.pig.data.DefaultTuple.write(DefaultTuple.java:264)
>   at 
> org.apache.pig.data.DefaultAbstractBag.write(DefaultAbstractBag.java:249)
>   at 
> org.apache.pig.data.DataReaderWriter.writeDatum(DataReaderWriter.java:214)
>   at org.apache.pig.data.DefaultTuple.write(DefaultTuple.java:264)
>   at 
> org.apache.pig.data.DataReaderWriter.writeDatum(DataReaderWriter.java:209)
>   at org.apache.pig.data.DefaultTuple.write(DefaultTuple.java:264)
>   at 
> org.apache.pig.impl.io.PigNullableWritable.write(PigNullableWritable.java:123)
>   at 
> org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90)
>   at 
> org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77)
>   at org.apache.hadoop.mapred.IFile$Writer.append(IFile.java:179)
>   at 
> org.apache.hadoop.mapred.Task$CombineOutputCollector.collect(Task.java:880)
>   at 
> org.apache.hadoop.mapred.Task$NewCombinerRunner$OutputConverter.write(Task.java:1201)
>   at 
> org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.processOnePackageOutput(PigCombiner.java:199)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:161)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:51)
>   at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
>   at 
> org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1222)
>   at 
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.doInMemMerge(ReduceTask.java:2563)
>   at 
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.run(ReduceTask.java:2501)
> java.lang.OutOfMemoryError: Java heap space
>   at org.apache.pig.data.DefaultTuple.(DefaultTuple.java:58)
>   at 
> org.apache.pig.data.DefaultTupleFactory.newTuple(DefaultTupleFactory.java:35)
>   at 
> org.apache.pig.data.DataReaderWriter.bytesToTuple(DataReaderWriter.java:61)
>   at 
> org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:142)
>   at 
> org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:136)
>   at 
> org.apache.pig.data.DefaultAbstractBag.readFields(DefaultAbstractBag.java:263)
>   at 
> org.apache.pig.data.DataReaderWriter.bytesToBag(DataReaderWriter.java:71)
>   at 
> org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:145)
>   at 
> org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.j

[jira] Commented: (PIG-1448) Detach tuple from inner plans of physical operator

2010-06-12 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12878294#action_12878294
 ] 

Ashutosh Chauhan commented on PIG-1448:
---

The problem here is not as bad as it may sound. All the physical operators 
already detach the input tuple after they are done with it: in getNext(), a 
physical operator first calls processInput(), which attaches the input tuple 
and then detaches it at the end. Physical operators contained within inner 
plans do the same. The problem arises when there is a Bin Cond: Pig 
short-circuits one of the branches of the inner plan, so getNext() of the 
operator on that branch is never called and the tuple is never detached. Note 
that in these cases the tuple was already attached, by the operator owning the 
inner plan, to all the roots of the plan. So in this particular case the tuple 
got attached but was never detached, leaving a stray reference that cannot be 
GC'ed. This still is not a problem if there is only a single pipeline in the 
mapper or reducer, since the next time a new key/value pair is read and run 
through the pipeline, the reference is overwritten and the tuple that was not 
detached in the previous run can be GC'ed. Only with a Multi Query optimized 
script may the same pipeline not run when the next key/value pair is read in 
map() or reduce(), so the stray reference is never overwritten. If all of 
these conditions are met, and the tuple itself is large or contains large 
bags, we may end up with an OOME. 
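The attach/detach lifecycle described above can be sketched with a toy model. All class and method names below are illustrative only, not Pig's actual operator API:

```java
import java.util.ArrayList;
import java.util.List;

// Toy operator that holds an attached input tuple.
class ToyOperator {
    Object input;                        // attached tuple; a stray reference if never cleared

    void attachInput(Object tuple) { input = tuple; }
    void detachInput() { input = null; }

    // Mimics processInput(): consume the attached tuple, then detach it
    // so the reference does not pin the tuple in memory.
    Object getNext() {
        Object result = input;
        detachInput();
        return result;
    }
}

// An operator owning an inner plan attaches the tuple to every root.
// If a branch is short-circuited (its getNext() is never called), that
// root keeps a stray reference until something overwrites it.
class ToyInnerPlan {
    final List<ToyOperator> roots = new ArrayList<>();

    void attachToRoots(Object tuple) {
        for (ToyOperator root : roots) root.attachInput(tuple);
    }

    // The fix discussed here: detach from all roots once the owning
    // operator is done, regardless of which branches were pulled.
    void detachFromRoots() {
        for (ToyOperator root : roots) root.detachInput();
    }
}
```

After attaching a tuple to both roots and pulling only one branch, the skipped root still references the tuple; an explicit detachFromRoots() clears it, which is the essence of the fix.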

> Detach tuple from inner plans of physical operator 
> ---
>
> Key: PIG-1448
> URL: https://issues.apache.org/jira/browse/PIG-1448
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.1.0, 0.2.0, 0.3.0, 0.4.0, 0.5.0, 0.6.0, 0.7.0
>Reporter: Ashutosh Chauhan
> Fix For: 0.8.0
>
>
> This is a follow-up on PIG-1446 which only addresses this general problem for 
> a specific instance of For Each. In general, all the physical operators which 
> can have inner plans are vulnerable to this. Few of them include 
> POLocalRearrange, POFilter, POCollectedGroup etc.  Need to fix all of these.  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1448) Detach tuple from inner plans of physical operator

2010-06-12 Thread Ashutosh Chauhan (JIRA)
Detach tuple from inner plans of physical operator 
---

 Key: PIG-1448
 URL: https://issues.apache.org/jira/browse/PIG-1448
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0, 0.6.0, 0.5.0, 0.4.0, 0.3.0, 0.2.0, 0.1.0
Reporter: Ashutosh Chauhan
 Fix For: 0.8.0


This is a follow-up on PIG-1446, which only addresses this general problem for a 
specific instance of For Each. In general, all the physical operators which can 
have inner plans are vulnerable to this; a few of them are POLocalRearrange, 
POFilter, and POCollectedGroup. We need to fix all of these.  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1446) OOME in a query having a bincond in the inner plan of a Foreach.

2010-06-11 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1446:
--

   Status: Resolved  (was: Patch Available)
Fix Version/s: 0.8.0
   0.7.0
   Resolution: Fixed

As usual, hudson is not responding. I manually ran all the unit tests; all of 
them passed. Committed to both trunk and 0.7.

> OOME in a query having a bincond in the inner plan of a Foreach.
> 
>
> Key: PIG-1446
> URL: https://issues.apache.org/jira/browse/PIG-1446
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>        Reporter: Ashutosh Chauhan
>    Assignee: Ashutosh Chauhan
> Fix For: 0.8.0, 0.7.0
>
> Attachments: pig-1446.patch
>
>
> This is seen when For Each is following a group-by and there is a bin cond as 
> an inner plan of For Each.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1428) Add getPigStatusReporter() to PigHadoopLogger

2010-06-10 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12877591#action_12877591
 ] 

Ashutosh Chauhan commented on PIG-1428:
---

So, I read through PIG-889. It seems there never was a documented way to use 
counters, reporters, etc. from UDFs or Load/Store Funcs. Actually, there is a 
hacky way to do it, which exists in DefaultAbstractBag.java: 
{code}
protected void incSpillCount(Enum counter) {
// Increment the spill count
// warn is a misnomer. The function updates the counter. If the update
// fails, it dumps a warning
PigHadoopLogger.getInstance().warn(this, "Spill counter incremented", 
counter);
}
{code}
But in PIG-889 Santhosh argued against this (mis)use of PigLogger. I think we 
need to provide a formal way for Pig users to access counters and reporters 
from our interfaces (UDFs, Load/Store Funcs), as PigHadoopLogger is designed 
for error handling (warning aggregation in particular) and not for this 
purpose. We should also mark this class as internal-only before someone starts 
using it. By the same argument, the above method, where Pig internally makes 
use of its own counters, is flawed and needs to be corrected.

> Add getPigStatusReporter() to PigHadoopLogger
> -
>
> Key: PIG-1428
> URL: https://issues.apache.org/jira/browse/PIG-1428
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Ashutosh Chauhan
>Assignee: Dmitriy V. Ryaboy
> Fix For: 0.8.0
>
> Attachments: PIG-1428.patch, PIG-1428.patch
>
>
> Without this getter method, its not possible to get counters, report progress 
> etc. from UDFs. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1428) Add getPigStatusReporter() to PigHadoopLogger

2010-06-10 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12877616#action_12877616
 ] 

Ashutosh Chauhan commented on PIG-1428:
---

I propose a slightly different approach here. Instead of adding 
getPigStatusReporter() to the PigLogger interface and the corresponding 
implementation in PigHadoopLogger, we can add a static singleton method in 
PigStatusReporter and also add a setContext(TaskInputOutputContext context). 
We can then set the context in the map() and reduce() functions, and users 
will have full access to the reporter object through the static method. This 
will allow us to keep error logging separate from status reporting. 
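A minimal sketch of the singleton proposed above. The names follow the comment, but the code is illustrative only (the real PigStatusReporter and TaskInputOutputContext APIs differ; a plain Object stands in for the Hadoop context here):

```java
// Sketch of the proposed singleton: Pig's map()/reduce() would call
// setContext() once per task, and UDFs would reach the reporter via
// getInstance() without going through PigHadoopLogger.
class ToyStatusReporter {
    private static final ToyStatusReporter INSTANCE = new ToyStatusReporter();
    private Object context;   // stands in for Hadoop's TaskInputOutputContext

    private ToyStatusReporter() {}

    public static ToyStatusReporter getInstance() { return INSTANCE; }

    // Called by the framework from map()/reduce() before the pipeline runs.
    public void setContext(Object context) { this.context = context; }

    // A UDF can increment counters / report progress only once the
    // context has been set by the framework.
    public boolean ready() { return context != null; }
}
```

The design point is that status reporting lives in its own singleton, so error logging (PigLogger) and progress/counter reporting stay independent.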

> Add getPigStatusReporter() to PigHadoopLogger
> -
>
> Key: PIG-1428
> URL: https://issues.apache.org/jira/browse/PIG-1428
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>    Reporter: Ashutosh Chauhan
>Assignee: Dmitriy V. Ryaboy
> Fix For: 0.8.0
>
> Attachments: PIG-1428.patch, PIG-1428.patch
>
>
> Without this getter method, its not possible to get counters, report progress 
> etc. from UDFs. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1446) OOME in a query having a bincond in the inner plan of a Foreach.

2010-06-10 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1446:
--

Status: Patch Available  (was: Open)

> OOME in a query having a bincond in the inner plan of a Foreach.
> 
>
> Key: PIG-1446
> URL: https://issues.apache.org/jira/browse/PIG-1446
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>        Reporter: Ashutosh Chauhan
>    Assignee: Ashutosh Chauhan
> Attachments: pig-1446.patch
>
>
> This is seen when For Each is following a group-by and there is a bin cond as 
> an inner plan of For Each.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1446) OOME in a query having a bincond in the inner plan of a Foreach.

2010-06-10 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1446:
--

Attachment: pig-1446.patch

Sequence of events is as follows:
1) The MultiQuery optimizer combined 30 group-bys into one reducer, so there 
are 30 pipelines in the reducer.
2) Each of these group-bys has a ForEach after it.
3) The ForEach has a bincond in its inner plan.
4) The group-by resulted in large bags (tens of millions of records).
5) The tuple containing the group and bag is attached to the roots of the 
inner plan of the ForEach.
6) The ForEach pulled the tuples through its leaves.
7) Due to short-circuiting in the bincond, one branch of the plan is never 
pulled, leaving a stray reference to a bag that was not actually needed.
8) With 30 multi-query-optimized group-bys, many such bags were left hanging 
around, eating up all the memory.

Fix: detach tuples from the roots once the ForEach is done with them.

> OOME in a query having a bincond in the inner plan of a Foreach.
> 
>
> Key: PIG-1446
> URL: https://issues.apache.org/jira/browse/PIG-1446
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>        Reporter: Ashutosh Chauhan
> Attachments: pig-1446.patch
>
>
> This is seen when For Each is following a group-by and there is a bin cond as 
> an inner plan of For Each.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-1446) OOME in a query having a bincond in the inner plan of a Foreach.

2010-06-10 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan reassigned PIG-1446:
-

Assignee: Ashutosh Chauhan

> OOME in a query having a bincond in the inner plan of a Foreach.
> 
>
> Key: PIG-1446
> URL: https://issues.apache.org/jira/browse/PIG-1446
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>        Reporter: Ashutosh Chauhan
>    Assignee: Ashutosh Chauhan
> Attachments: pig-1446.patch
>
>
> This is seen when For Each is following a group-by and there is a bin cond as 
> an inner plan of For Each.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1446) OOME in a query having a bincond in the inner plan of a Foreach.

2010-06-10 Thread Ashutosh Chauhan (JIRA)
OOME in a query having a bincond in the inner plan of a Foreach.


 Key: PIG-1446
 URL: https://issues.apache.org/jira/browse/PIG-1446
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan


This is seen when For Each is following a group-by and there is a bin cond as 
an inner plan of For Each.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1438) [Performance] MultiQueryOptimizer should also merge DISTINCT jobs

2010-06-09 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12877150#action_12877150
 ] 

Ashutosh Chauhan commented on PIG-1438:
---

+1 please commit.

> [Performance] MultiQueryOptimizer should also merge DISTINCT jobs
> -
>
> Key: PIG-1438
> URL: https://issues.apache.org/jira/browse/PIG-1438
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.7.0
>Reporter: Richard Ding
>Assignee: Richard Ding
> Fix For: 0.8.0
>
> Attachments: PIG-1438.patch, PIG-1438_1.patch
>
>
> Current implementation doesn't merge jobs derived from DISTINCT statements. 
> The reason is that DISTINCT jobs are implemented using a special combiner 
> (DistinctCombiner). But we should be able to merge jobs that have the same 
> type of combiner (e.g. merge multiple DISTINCT jobs into one).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1427) Monitor and kill runaway UDFs

2010-06-08 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12876763#action_12876763
 ] 

Ashutosh Chauhan commented on PIG-1427:
---

@Dmitriy,

Occupied with some work. Will get back to it sometime later this week.  

> Monitor and kill runaway UDFs
> -
>
> Key: PIG-1427
> URL: https://issues.apache.org/jira/browse/PIG-1427
> Project: Pig
>  Issue Type: New Feature
>Affects Versions: 0.8.0
>Reporter: Dmitriy V. Ryaboy
>Assignee: Dmitriy V. Ryaboy
> Attachments: guava-r03.jar, monitoredUdf.patch, monitoredUdf.patch, 
> PIG-1427.diff
>
>
> As a safety measure, it is sometimes useful to monitor UDFs as they execute. 
> It is often preferable to return null or some other default value instead of 
> timing out a runaway evaluation and killing a job. We have in the past seen 
> complex regular expressions lead to job failures due to just half a dozen 
> (out of millions) particularly obnoxious strings.
> It would be great to give Pig users a lightweight way of enabling UDF 
> monitoring.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Closed: (PIG-283) Allow to set arbitrary jobconf key-value pairs inside pig program

2010-06-04 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan closed PIG-283.



> Allow to set arbitrary jobconf key-value pairs inside pig program
> -
>
> Key: PIG-283
> URL: https://issues.apache.org/jira/browse/PIG-283
> Project: Pig
>  Issue Type: New Feature
>  Components: grunt
>Affects Versions: 0.7.0
>Reporter: Christian Kunz
>    Assignee: Ashutosh Chauhan
> Fix For: 0.8.0
>
> Attachments: pig-282.patch
>
>
> It would be useful to be able to set arbitrary JobConf key-value pairs inside 
> a pig program (e.g. in front of a COGROUP statement).
> I wonder whether the simplest way to add this feature is by expanding the 
> 'set' command functionality.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-283) Allow to set arbitrary jobconf key-value pairs inside pig program

2010-06-04 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-283:
-

  Status: Resolved  (was: Patch Available)
Release Note: 
For documentation:

After this patch, it becomes possible to set key-value pairs in the script as 
follows: 
{code}
set mapred.map.tasks.speculative.execution false
set pig.logfile mylogfile.log
set my.arbitrary.key my.arbitary.value
{code}
These key-value pairs will be put in the job-conf by Pig. This is a 
script-wide setting: if a value is defined multiple times for a key in the 
script, the last one takes effect, and that value will be set for all the 
jobs generated by the script. 
  Resolution: Fixed

Re-ran all the tests reported by Hudson as failures; all of them passed. Patch 
committed.
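The script-wide, last-one-wins semantics described in the release note can be sketched as follows. This is a toy parser only, under the assumption that each `set` line is `set <key> <value>`; the real grunt parser is more involved:

```java
import java.util.Properties;

// Collects "set <key> <value>" lines into job-conf style properties.
// If a key is set more than once, the last value wins for the whole
// script, matching the semantics described above.
class ToySetCommand {
    static Properties parse(String[] scriptLines) {
        Properties conf = new Properties();
        for (String line : scriptLines) {
            String trimmed = line.trim();
            if (!trimmed.startsWith("set ")) continue;   // not a set command
            // split the remainder into key and value (value may contain spaces)
            String[] parts = trimmed.substring(4).trim().split("\\s+", 2);
            if (parts.length == 2) conf.setProperty(parts[0], parts[1]);
        }
        return conf;
    }
}
```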



> Allow to set arbitrary jobconf key-value pairs inside pig program
> -
>
> Key: PIG-283
> URL: https://issues.apache.org/jira/browse/PIG-283
> Project: Pig
>  Issue Type: New Feature
>  Components: grunt
>Affects Versions: 0.7.0
>Reporter: Christian Kunz
>    Assignee: Ashutosh Chauhan
> Fix For: 0.8.0
>
> Attachments: pig-282.patch
>
>
> It would be useful to be able to set arbitrary JobConf key-value pairs inside 
> a pig program (e.g. in front of a COGROUP statement).
> I wonder whether the simplest way to add this feature is by expanding the 
> 'set' command functionality.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1437) [Optimization] Rewrite GroupBy-Foreach-flatten(group) to Distinct

2010-06-03 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12875481#action_12875481
 ] 

Ashutosh Chauhan commented on PIG-1437:
---

Since this is a logical transformation of the query plan, the logical 
optimizer is the ideal place for this optimization. But I think it might 
instead be easier to do on the MR plan after it is generated.

> [Optimization] Rewrite GroupBy-Foreach-flatten(group) to Distinct
> -
>
> Key: PIG-1437
> URL: https://issues.apache.org/jira/browse/PIG-1437
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.7.0
>Reporter: Ashutosh Chauhan
>Priority: Minor
>
> Its possible to rewrite queries like this
> {code}
> A = load 'data' as (name,age);
> B = group A by (name,age);
> C = foreach B generate group.name, group.age;
> dump C;
> {code}
> or
> {code}
> A = load 'data' as (name,age);
> B = group A by (name,age);
> C = foreach B generate flatten(group);
> dump C;
> {code}
> to
> {code}
> A = load 'data' as (name,age);
> B = distinct A;
> dump B;
> {code}
> This could only be done if no columns within the bags are referenced 
> subsequently in the script. Since in the Pig-Hadoop world DISTINCT will be 
> executed more efficiently than group-by, this would be a huge win. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1437) [Optimization] Rewrite GroupBy-Foreach-flatten(group) to Distinct

2010-06-03 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1437:
--

Release Note:   (was: Its possible to rewrite queries like this
{code}
A = load 'data' as (name,age);
B = group A by (name,age);
C = foreach B generate group.name, group.age;
dump C;
{code}
or
{code}
A = load 'data' as (name,age);
B = group A by (name,age);
C = foreach B generate flatten(group);
dump C;
{code}

to
{code}
A = load 'data' as (name,age);
B = distinct A;
dump B;
{code}

This could only be done if no columns within the bags are referenced 
subsequently in the script. Since in the Pig-Hadoop world DISTINCT will be 
executed more efficiently than group-by, this would be a huge win.)
 Description: 
Its possible to rewrite queries like this
{code}
A = load 'data' as (name,age);
B = group A by (name,age);
C = foreach B generate group.name, group.age;
dump C;
{code}
or
{code}
A = load 'data' as (name,age);
B = group A by (name,age);
C = foreach B generate flatten(group);
dump C;
{code}

to
{code}
A = load 'data' as (name,age);
B = distinct A;
dump B;
{code}

This could only be done if no columns within the bags are referenced 
subsequently in the script. Since in the Pig-Hadoop world DISTINCT will be 
executed more efficiently than group-by, this would be a huge win. 
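The equivalence this rewrite relies on can be checked on plain Java collections. This is a sketch of the semantics only, not of Pig's optimizer; rows are modeled as strings standing in for (name,age) tuples:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Grouping rows by their full value and then emitting only the group
// keys yields each distinct row exactly once -- the same output as
// DISTINCT, which is why the GroupBy-Foreach-flatten(group) pattern
// can be rewritten when no bag columns are referenced later.
class ToyRewrite {
    static List<String> groupThenFlattenGroup(List<String> rows) {
        // "B = group A by (name,age)": one group per distinct row
        Map<String, List<String>> groups = rows.stream()
            .collect(Collectors.groupingBy(r -> r, LinkedHashMap::new, Collectors.toList()));
        // "C = foreach B generate flatten(group)": keep only the keys
        return new ArrayList<>(groups.keySet());
    }

    static List<String> distinct(List<String> rows) {
        // "B = distinct A"
        return rows.stream().distinct().collect(Collectors.toList());
    }
}
```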

> [Optimization] Rewrite GroupBy-Foreach-flatten(group) to Distinct
> -
>
> Key: PIG-1437
> URL: https://issues.apache.org/jira/browse/PIG-1437
> Project: Pig
>  Issue Type: Bug
>      Components: impl
>Affects Versions: 0.7.0
>Reporter: Ashutosh Chauhan
>Priority: Minor
>
> Its possible to rewrite queries like this
> {code}
> A = load 'data' as (name,age);
> B = group A by (name,age);
> C = foreach B generate group.name, group.age;
> dump C;
> {code}
> or
> {code}
> A = load 'data' as (name,age);
> B = group A by (name,age);
> C = foreach B generate flatten(group);
> dump C;
> {code}
> to
> {code}
> A = load 'data' as (name,age);
> B = distinct A;
> dump B;
> {code}
> This could only be done if no columns within the bags are referenced 
> subsequently in the script. Since in the Pig-Hadoop world DISTINCT will be 
> executed more efficiently than group-by, this would be a huge win. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1437) [Optimization] Rewrite GroupBy-Foreach-flatten(group) to Distinct

2010-06-03 Thread Ashutosh Chauhan (JIRA)
[Optimization] Rewrite GroupBy-Foreach-flatten(group) to Distinct
-

 Key: PIG-1437
 URL: https://issues.apache.org/jira/browse/PIG-1437
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
Priority: Minor




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1433) pig should create success file if mapreduce.fileoutputcommitter.marksuccessfuljobs is true

2010-06-03 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12875326#action_12875326
 ] 

Ashutosh Chauhan commented on PIG-1433:
---

My point was to have all the constant strings in one place instead of each 
class having some of them. It could be either an interface or a class. If an 
interface is considered an anti-pattern, doing it in a class is fine too.

> pig should create success file if 
> mapreduce.fileoutputcommitter.marksuccessfuljobs is true
> --
>
> Key: PIG-1433
> URL: https://issues.apache.org/jira/browse/PIG-1433
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Pradeep Kamath
>Assignee: Pradeep Kamath
> Fix For: 0.8.0
>
> Attachments: PIG-1433.patch
>
>
> pig should create success file if 
> mapreduce.fileoutputcommitter.marksuccessfuljobs is true

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1433) pig should create success file if mapreduce.fileoutputcommitter.marksuccessfuljobs is true

2010-06-03 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12875319#action_12875319
 ] 

Ashutosh Chauhan commented on PIG-1433:
---

+1 for the commit. A couple of notes for the future:
* Since this relates to a Hadoop property, we should consider removing it 
from the Pig codebase when MAPREDUCE-1447 and MAPREDUCE-947 are fixed.
* We have a lot of constant strings in our codebase. For the sake of clean 
code, we should put all of those public static final strings in one top-level 
interface called Constants. This should be part of a separate clean-up jira.

> pig should create success file if 
> mapreduce.fileoutputcommitter.marksuccessfuljobs is true
> --
>
> Key: PIG-1433
> URL: https://issues.apache.org/jira/browse/PIG-1433
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Pradeep Kamath
>Assignee: Pradeep Kamath
> Fix For: 0.8.0
>
> Attachments: PIG-1433.patch
>
>
> pig should create success file if 
> mapreduce.fileoutputcommitter.marksuccessfuljobs is true

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-283) Allow to set arbitrary jobconf key-value pairs inside pig program

2010-05-28 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12873095#action_12873095
 ] 

Ashutosh Chauhan commented on PIG-283:
--

Seems hudson didn't fully recover from its long hospital trip. All failures 
are unrelated and caused by port conflicts. Patch is ready for review.

> Allow to set arbitrary jobconf key-value pairs inside pig program
> -
>
> Key: PIG-283
> URL: https://issues.apache.org/jira/browse/PIG-283
> Project: Pig
>  Issue Type: New Feature
>  Components: grunt
>Affects Versions: 0.7.0
>Reporter: Christian Kunz
>Assignee: Ashutosh Chauhan
> Fix For: 0.8.0
>
> Attachments: pig-282.patch
>
>
> It would be useful to be able to set arbitrary JobConf key-value pairs inside 
> a pig program (e.g. in front of a COGROUP statement).
> I wonder whether the simplest way to add this feature is by expanding the 
> 'set' command functionality.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1427) Monitor and kill runaway UDFs

2010-05-27 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12872862#action_12872862
 ] 

Ashutosh Chauhan commented on PIG-1427:
---

*  Filed PIG-1428 for it.
*  Neat workaround.
*  I guess checking in lib/ is fine. They are using APL.
*  The performance numbers look good. Initially, let's not turn monitoring on 
by default. Later, as we gain more experience with this feature, we should 
enable monitoring by default so as not to waste cluster resources on 
programming errors.

> Monitor and kill runaway UDFs
> -
>
> Key: PIG-1427
> URL: https://issues.apache.org/jira/browse/PIG-1427
> Project: Pig
>  Issue Type: New Feature
>Affects Versions: 0.8.0
>Reporter: Dmitriy V. Ryaboy
>Assignee: Dmitriy V. Ryaboy
> Attachments: monitoredUdf.patch, monitoredUdf.patch
>
>
> As a safety measure, it is sometimes useful to monitor UDFs as they execute. 
> It is often preferable to return null or some other default value instead of 
> timing out a runaway evaluation and killing a job. We have in the past seen 
> complex regular expressions lead to job failures due to just half a dozen 
> (out of millions) particularly obnoxious strings.
> It would be great to give Pig users a lightweight way of enabling UDF 
> monitoring.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-283) Allow to set arbitrary jobconf key-value pairs inside pig program

2010-05-27 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-283:
-

   Status: Patch Available  (was: Open)
Affects Version/s: 0.7.0
Fix Version/s: 0.8.0

> Allow to set arbitrary jobconf key-value pairs inside pig program
> -
>
> Key: PIG-283
> URL: https://issues.apache.org/jira/browse/PIG-283
> Project: Pig
>  Issue Type: New Feature
>  Components: grunt
>Affects Versions: 0.7.0
>Reporter: Christian Kunz
>    Assignee: Ashutosh Chauhan
> Fix For: 0.8.0
>
> Attachments: pig-282.patch
>
>
> It would be useful to be able to set arbitrary JobConf key-value pairs inside 
> a pig program (e.g. in front of a COGROUP statement).
> I wonder whether the simplest way to add this feature is by expanding the 
> 'set' command functionality.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1428) Add getPigStatusReporter() to PigHadoopLogger

2010-05-27 Thread Ashutosh Chauhan (JIRA)
Add getPigStatusReporter() to PigHadoopLogger
-

 Key: PIG-1428
 URL: https://issues.apache.org/jira/browse/PIG-1428
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
 Fix For: 0.8.0


Without this getter method, it's not possible to get counters, report 
progress, etc. from UDFs. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-283) Allow to set arbitrary jobconf key-value pairs inside pig program

2010-05-27 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan reassigned PIG-283:


Assignee: Ashutosh Chauhan

> Allow to set arbitrary jobconf key-value pairs inside pig program
> -
>
> Key: PIG-283
> URL: https://issues.apache.org/jira/browse/PIG-283
> Project: Pig
>  Issue Type: New Feature
>  Components: grunt
>Reporter: Christian Kunz
>    Assignee: Ashutosh Chauhan
> Attachments: pig-282.patch
>
>
> It would be useful to be able to set arbitrary JobConf key-value pairs inside 
> a pig program (e.g. in front of a COGROUP statement).
> I wonder whether the simplest way to add this feature is by expanding the 
> 'set' command functionality.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-283) Allow to set arbitrary jobconf key-value pairs inside pig program

2010-05-27 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-283:
-

Attachment: pig-282.patch

Patch as suggested in the previous comment. This will let users add or 
override key-value pairs in the job conf through grunt or through a script, 
like:
{code}
grunt> set mapred.map.tasks.speculative.execution false
grunt> set pig.logfile mylogfile.log
grunt> set my.arbitrary.key my.arbitary.value 
{code}

> Allow to set arbitrary jobconf key-value pairs inside pig program
> -
>
> Key: PIG-283
> URL: https://issues.apache.org/jira/browse/PIG-283
> Project: Pig
>  Issue Type: New Feature
>  Components: grunt
>Reporter: Christian Kunz
> Attachments: pig-282.patch
>
>
> It would be useful to be able to set arbitrary JobConf key-value pairs inside 
> a pig program (e.g. in front of a COGROUP statement).
> I wonder whether the simplest way to add this feature is by expanding the 
> 'set' command functionality.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-283) Allow to set arbitrary jobconf key-value pairs inside pig program

2010-05-27 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12872315#action_12872315
 ] 

Ashutosh Chauhan commented on PIG-283:
--

The proposal here is as suggested in the description: expand the set command so 
that it can take arbitrary key-value pairs and pass them on to the job conf.

> Allow to set arbitrary jobconf key-value pairs inside pig program
> -
>
> Key: PIG-283
> URL: https://issues.apache.org/jira/browse/PIG-283
> Project: Pig
>  Issue Type: New Feature
>  Components: grunt
>Reporter: Christian Kunz
>
> It would be useful to be able to set arbitrary JobConf key-value pairs inside 
> a pig program (e.g. in front of a COGROUP statement).
> I wonder whether the simplest way to add this feature is by expanding the 
> 'set' command functionality.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1427) Monitor and kill runaway UDFs

2010-05-27 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12872303#action_12872303
 ] 

Ashutosh Chauhan commented on PIG-1427:
---

1. You didn't pay heed to my request for incrementing a counter when the UDF 
times out or throws an exception :) I think that will be pretty useful for 
users to know how many faulty records there are in the dataset that can't be 
processed by the UDF.
2. In getDefaultValue() there seems to be an inconsistency among the different 
if statements. I guess you need to make a distinction between the Integer[] and 
Integer return types and then return the appropriate value.
3. Doing svn co; patch -p0 < monitoredUDF.patch; ant jar results in a build 
failure. It seems ivy is not pulling the guava lib.
4. Since it's a new user-facing interface, having stability/visibility tags 
would really be useful.
5. Since it spawns a new thread for every exec() call, I assume it will have 
some overhead. If you have done a comparison or have numbers for that, it 
would be great if you could share them.

> Monitor and kill runaway UDFs
> -
>
> Key: PIG-1427
> URL: https://issues.apache.org/jira/browse/PIG-1427
> Project: Pig
>  Issue Type: New Feature
>Affects Versions: 0.8.0
>Reporter: Dmitriy V. Ryaboy
>Assignee: Dmitriy V. Ryaboy
> Attachments: monitoredUdf.patch, monitoredUdf.patch
>
>
> As a safety measure, it is sometimes useful to monitor UDFs as they execute. 
> It is often preferable to return null or some other default value instead of 
> timing out a runaway evaluation and killing a job. We have in the past seen 
> complex regular expressions lead to job failures due to just half a dozen 
> (out of millions) particularly obnoxious strings.
> It would be great to give Pig users a lightweight way of enabling UDF 
> monitoring.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1427) Monitor and kill runaway UDFs

2010-05-26 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12872031#action_12872031
 ] 

Ashutosh Chauhan commented on PIG-1427:
---

A useful feature. Couple of comments:

1. Currently, in case of timeouts and errors, you are always returning null. It 
would be useful if the user could specify, in the annotation definition, a 
default return value to be returned in those cases. For example, if my regex 
fails on an input String, I want to return an empty String back. Something like:
{code}
 @MonitoredUDF(timeUnit = TimeUnit.MILLISECONDS, duration = 500, 
defaultReturnValue = "")
{code} 
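Such a default could be carried as an annotation element and read back via reflection when the monitor fires. This is a hypothetical variant of the proposed annotation; the element names and the `MyRegexUdf` class are illustrative, not Pig's actual API.

```java
import java.lang.annotation.*;

// Hypothetical monitoring annotation with a default-return-value element.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface MonitoredUDF {
    long duration() default 1000;           // budget in milliseconds
    String defaultReturnValue() default ""; // returned on timeout/error instead of null
}

@MonitoredUDF(duration = 500, defaultReturnValue = "N/A")
class MyRegexUdf { }

public class AnnotationDemo {
    public static void main(String[] args) {
        MonitoredUDF cfg = MyRegexUdf.class.getAnnotation(MonitoredUDF.class);
        // The monitor would hand back this value when the UDF times out.
        System.out.println(cfg.defaultReturnValue()); // N/A
        System.out.println(cfg.duration());           // 500
    }
}
```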

2. It seems that the PigHadoopLogger.getReporter() method accidentally got 
removed in 0.7 and trunk. This needs to be restored. It would be really cool to 
see on the UI how many of my input records are faulty. Since it is a small 
change, I think you can add that getter method back and then update the 
appropriate counters. 

> Monitor and kill runaway UDFs
> -
>
> Key: PIG-1427
> URL: https://issues.apache.org/jira/browse/PIG-1427
> Project: Pig
>  Issue Type: New Feature
>Affects Versions: 0.8.0
>Reporter: Dmitriy V. Ryaboy
>Assignee: Dmitriy V. Ryaboy
> Attachments: monitoredUdf.patch
>
>
> As a safety measure, it is sometimes useful to monitor UDFs as they execute. 
> It is often preferable to return null or some other default value instead of 
> timing out a runaway evaluation and killing a job. We have in the past seen 
> complex regular expressions lead to job failures due to just half a dozen 
> (out of millions) particularly obnoxious strings.
> It would be great to give Pig users a lightweight way of enabling UDF 
> monitoring.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1424) Error logs of streaming should not be placed in output location

2010-05-26 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871902#action_12871902
 ] 

Ashutosh Chauhan commented on PIG-1424:
---

Until we figure out a proper solution for this, one possibility is to wrap the 
code in my previous comment in a try-catch block. That will unblock PIG-1229 
for commit. We can leave this ticket open if we feel there is a need for a 
better solution. 

> Error logs of streaming should not be placed in output location
> ---
>
> Key: PIG-1424
> URL: https://issues.apache.org/jira/browse/PIG-1424
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.7.0
>Reporter: Ashutosh Chauhan
> Fix For: 0.8.0
>
>
> This becomes a problem when the output location is anything other than a 
> filesystem. Output will be written to a DB, but where should the logs 
> generated by streaming go? Clearly, they can't be written into the DB. This 
> blocks PIG-1229, which introduces writing to a DB from Pig.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1347) Clear up output directory for a failed job

2010-05-26 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871862#action_12871862
 ] 

Ashutosh Chauhan commented on PIG-1347:
---

The patch is pretty straightforward and harmless, as it only removes code and 
does not add anything new. The only concern I have is that 
FileLocalizer.registerDeleteOnFail() is a public method, so it's possible that 
someone using Pig's Java API was previously using this method to do the 
cleanup himself. So this could be considered a backward-incompatible change. 
But Daniel explained to me that this method was meant for Pig's internal 
usage, and cleanup was in any case taken care of by Pig before the recent 
store func changes, so users did not need to worry about it. So, it's 
extremely unlikely that someone is using it. 
So, +1 on committing.

> Clear up output directory for a failed job
> --
>
> Key: PIG-1347
> URL: https://issues.apache.org/jira/browse/PIG-1347
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.7.0
>Reporter: Daniel Dai
>Assignee: Ashitosh Darbarwar
> Fix For: 0.8.0
>
> Attachments: PIG-1347-1.patch
>
>
> FileLocalizer.deleteOnFail is supposed to track the output files that need 
> to be deleted in case the job fails. However, in the current code base, 
> deleteOnFail is dangling: registerDeleteOnFail and triggerDeleteOnFail are 
> called by nobody. We need to bring it back.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-928) UDFs in scripting languages

2010-05-25 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871448#action_12871448
 ] 

Ashutosh Chauhan commented on PIG-928:
--

Arnab,

Thanks for putting together a patch for this. One question I have is about 
register vs. define. Currently you are auto-registering all the functions in the 
script file, and they are then available for later use in the script. But I am 
not sure how we will handle the case of inlined functions. For inline functions, 
{{define}} seems to be the natural choice, as noted in previous comments on the 
jira. And if so, then we need to modify define to support that use case. 
I am wondering whether, to remain consistent, we should always use {{define}} to 
define functions instead of auto-registering them. I also didn't get why there 
would be a need for separate interpreter instances in that case.


> UDFs in scripting languages
> ---
>
> Key: PIG-928
> URL: https://issues.apache.org/jira/browse/PIG-928
> Project: Pig
>  Issue Type: New Feature
>Reporter: Alan Gates
> Fix For: 0.8.0
>
> Attachments: calltrace.png, package.zip, pig-greek.tgz, 
> pig.scripting.patch.arnab, pyg.tgz, scripting.tgz, scripting.tgz, test.zip
>
>
> It should be possible to write UDFs in scripting languages such as python, 
> ruby, etc.  This frees users from needing to compile Java, generate a jar, 
> etc.  It also opens Pig to programmers who prefer scripting languages over 
> Java.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


