[jira] Updated: (PIG-1531) Pig gobbles up error messages

2010-10-01 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1531:
--

Status: Resolved  (was: Patch Available)
Resolution: Fixed

Committed to both trunk and 0.8. Thanks, Niraj!

 Pig gobbles up error messages
 -

 Key: PIG-1531
 URL: https://issues.apache.org/jira/browse/PIG-1531
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
Assignee: niraj rai
 Fix For: 0.8.0

 Attachments: pig-1531_3.patch, pig-1531_4.patch, PIG-1531_5.patch, 
 PIG_1531.patch, PIG_1531_2.patch


 Consider the following. I have my own Storer implementing StoreFunc and I am 
 throwing FrontEndException (and other Exceptions derived from PigException) 
 in its various methods. I expect those error messages to be shown in error 
 scenarios. Instead Pig gobbles up my error messages and shows its own generic 
 error message like: 
 {code}
 2010-07-31 14:14:25,414 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 2116: Unexpected error. Could not validate the output specification for: 
 default.partitoned
 Details at logfile: /Users/ashutosh/workspace/pig/pig_1280610650690.log
 {code}
 Instead I expect it to display my error messages which it stores away in that 
 log file.
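 The failure mode described above can be illustrated with a generic, 
 hypothetical sketch (plain Java, not Pig's actual code): wrapping a user 
 exception in a new generic exception without chaining the cause discards the 
 user's message, while passing it along as the cause keeps it recoverable.
 {code}
 // Hypothetical illustration of the "gobbling" pattern; class and method
 // names are invented for this sketch.
 public class GobbleDemo {

     // Bad: the generic wrapper drops the original message entirely.
     static String gobble(Exception userError) {
         Exception generic = new Exception("ERROR 2116: Unexpected error.");
         return generic.getMessage();
     }

     // Good: chain the cause so the front end can surface the user's message.
     static String preserve(Exception userError) {
         Exception generic =
             new Exception("ERROR 2116: Unexpected error.", userError);
         return generic.getCause().getMessage();
     }

     public static void main(String[] args) {
         Exception e = new Exception("my storer rejected the schema");
         System.out.println(gobble(e));    // generic text only
         System.out.println(preserve(e));  // user's original message survives
     }
 }
 {code}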

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1641) Incorrect counters in local mode

2010-09-27 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12915408#action_12915408
 ] 

Ashutosh Chauhan commented on PIG-1641:
---

Tested manually in local mode. Messages were the same as proposed above. +1 for 
the commit. One minor suggestion is to put a line at the start saying something 
like: "Detected local mode. Stats reported below may be incomplete." This will 
reinforce to users that stats reporting is not consistent across 
different modes (local vs. map-reduce).

 Incorrect counters in local mode
 

 Key: PIG-1641
 URL: https://issues.apache.org/jira/browse/PIG-1641
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Ashutosh Chauhan
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1641.patch


 User report, not verified.
 email
 HadoopVersion  PigVersion      UserId  StartedAt            FinishedAt           Features
 0.20.2         0.8.0-SNAPSHOT  user    2010-09-21 19:25:58  2010-09-21 21:58:42  ORDER_BY
 Success!
 Job Stats (time in seconds):
 JobId           Maps  Reduces  MaxMapTime  MinMapTime  AvgMapTime  MaxReduceTime  MinReduceTime  AvgReduceTime  Alias      Feature   Outputs
 job_local_0001  0     0        0           0           0           0              0              0              raw        MAP_ONLY
 job_local_0002  0     0        0           0           0           0              0              0              rank_sort  SAMPLER
 job_local_0003  0     0        0           0           0           0              0              0              rank_sort  ORDER_BY  Processed/user_visits_table,
 Input(s):
 Successfully read 0 records from: Data/Raw/UserVisits.dat
 Output(s):
 Successfully stored 0 records in: Processed/user_visits_table
 However, when I look in the output:
 $ ls -lh Processed/user_visits_table/CG0/
 total 15250760
 -rwxrwxrwx  1 user  _lpoperator   7.3G Sep 21 21:58 part-0*
 It read a 20G input file and generated some output...
 /email
 Is it that counters are not available in local mode? If so, instead of 
 printing zeros we should print "Information Unavailable" or some such.




[jira] Created: (PIG-1641) Incorrect counters in local mode

2010-09-22 Thread Ashutosh Chauhan (JIRA)
Incorrect counters in local mode


 Key: PIG-1641
 URL: https://issues.apache.org/jira/browse/PIG-1641
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Ashutosh Chauhan


User report, not verified.

email

HadoopVersion  PigVersion      UserId  StartedAt            FinishedAt           Features
0.20.2         0.8.0-SNAPSHOT  user    2010-09-21 19:25:58  2010-09-21 21:58:42  ORDER_BY

Success!

Job Stats (time in seconds):
JobId           Maps  Reduces  MaxMapTime  MinMapTime  AvgMapTime  MaxReduceTime  MinReduceTime  AvgReduceTime  Alias      Feature   Outputs
job_local_0001  0     0        0           0           0           0              0              0              raw        MAP_ONLY
job_local_0002  0     0        0           0           0           0              0              0              rank_sort  SAMPLER
job_local_0003  0     0        0           0           0           0              0              0              rank_sort  ORDER_BY  Processed/user_visits_table,

Input(s):
Successfully read 0 records from: Data/Raw/UserVisits.dat

Output(s):
Successfully stored 0 records in: Processed/user_visits_table


However, when I look in the output:

$ ls -lh Processed/user_visits_table/CG0/
total 15250760
-rwxrwxrwx  1 user  _lpoperator   7.3G Sep 21 21:58 part-0*

It read a 20G input file and generated some output...

/email

Is it that counters are not available in local mode? If so, instead of printing 
zeros we should print "Information Unavailable" or some such.
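The suggested fallback could look like this hypothetical helper (illustrative 
only, not Pig's actual stats code): when a counter value was never populated, 
as can happen in local mode, display "Information Unavailable" instead of a 
misleading zero.

{code}
// Hypothetical sketch; CounterDisplay and format are invented names.
public class CounterDisplay {

    // A null counter value means the runtime never populated it.
    static String format(Long counterValue) {
        return counterValue == null ? "Information Unavailable"
                                    : String.valueOf(counterValue);
    }

    public static void main(String[] args) {
        System.out.println(format(42L));   // normal map-reduce mode
        System.out.println(format(null));  // local mode: counter not populated
    }
}
{code}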




[jira] Commented: (PIG-1531) Pig gobbles up error messages

2010-09-21 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12913048#action_12913048
 ] 

Ashutosh Chauhan commented on PIG-1531:
---

Oh Hudson, oh well...

Ran the full suite of 400 minutes of unit tests; all passed. Patch is ready for 
review.

 Pig gobbles up error messages
 -

 Key: PIG-1531
 URL: https://issues.apache.org/jira/browse/PIG-1531
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
Assignee: niraj rai
 Fix For: 0.8.0

 Attachments: pig-1531_3.patch, pig-1531_4.patch, PIG_1531.patch, 
 PIG_1531_2.patch


 Consider the following. I have my own Storer implementing StoreFunc and I am 
 throwing FrontEndException (and other Exceptions derived from PigException) 
 in its various methods. I expect those error messages to be shown in error 
 scenarios. Instead Pig gobbles up my error messages and shows its own generic 
 error message like: 
 {code}
 2010-07-31 14:14:25,414 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 2116: Unexpected error. Could not validate the output specification for: 
 default.partitoned
 Details at logfile: /Users/ashutosh/workspace/pig/pig_1280610650690.log
 {code}
 Instead I expect it to display my error messages which it stores away in that 
 log file.




[jira] Reopened: (PIG-1531) Pig gobbles up error messages

2010-09-19 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan reopened PIG-1531:
---


Peril of not writing a unit test: resurrection of the bug. Argh..


 Pig gobbles up error messages
 -

 Key: PIG-1531
 URL: https://issues.apache.org/jira/browse/PIG-1531
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
Assignee: niraj rai
 Fix For: 0.8.0

 Attachments: pig-1531_3.patch, PIG_1531.patch, PIG_1531_2.patch


 Consider the following. I have my own Storer implementing StoreFunc and I am 
 throwing FrontEndException (and other Exceptions derived from PigException) 
 in its various methods. I expect those error messages to be shown in error 
 scenarios. Instead Pig gobbles up my error messages and shows its own generic 
 error message like: 
 {code}
 2010-07-31 14:14:25,414 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 2116: Unexpected error. Could not validate the output specification for: 
 default.partitoned
 Details at logfile: /Users/ashutosh/workspace/pig/pig_1280610650690.log
 {code}
 Instead I expect it to display my error messages which it stores away in that 
 log file.




[jira] Updated: (PIG-1531) Pig gobbles up error messages

2010-09-19 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1531:
--

Status: Patch Available  (was: Reopened)

 Pig gobbles up error messages
 -

 Key: PIG-1531
 URL: https://issues.apache.org/jira/browse/PIG-1531
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
Assignee: niraj rai
 Fix For: 0.8.0

 Attachments: pig-1531_3.patch, pig-1531_4.patch, PIG_1531.patch, 
 PIG_1531_2.patch


 Consider the following. I have my own Storer implementing StoreFunc and I am 
 throwing FrontEndException (and other Exceptions derived from PigException) 
 in its various methods. I expect those error messages to be shown in error 
 scenarios. Instead Pig gobbles up my error messages and shows its own generic 
 error message like: 
 {code}
 2010-07-31 14:14:25,414 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 2116: Unexpected error. Could not validate the output specification for: 
 default.partitoned
 Details at logfile: /Users/ashutosh/workspace/pig/pig_1280610650690.log
 {code}
 Instead I expect it to display my error messages which it stores away in that 
 log file.




[jira] Updated: (PIG-1531) Pig gobbles up error messages

2010-09-19 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1531:
--

Attachment: pig-1531_4.patch

Added a test case which fails on trunk; Pig still gobbles up error messages. 
The fix is to rethrow the message up the hierarchy. The attached patch contains 
the test case and the fix.

 Pig gobbles up error messages
 -

 Key: PIG-1531
 URL: https://issues.apache.org/jira/browse/PIG-1531
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
Assignee: niraj rai
 Fix For: 0.8.0

 Attachments: pig-1531_3.patch, pig-1531_4.patch, PIG_1531.patch, 
 PIG_1531_2.patch


 Consider the following. I have my own Storer implementing StoreFunc and I am 
 throwing FrontEndException (and other Exceptions derived from PigException) 
 in its various methods. I expect those error messages to be shown in error 
 scenarios. Instead Pig gobbles up my error messages and shows its own generic 
 error message like: 
 {code}
 2010-07-31 14:14:25,414 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 2116: Unexpected error. Could not validate the output specification for: 
 default.partitoned
 Details at logfile: /Users/ashutosh/workspace/pig/pig_1280610650690.log
 {code}
 Instead I expect it to display my error messages which it stores away in that 
 log file.




[jira] Commented: (PIG-1590) Use POMergeJoin for Left Outer Join when join using 'merge'

2010-09-03 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12905995#action_12905995
 ] 

Ashutosh Chauhan commented on PIG-1590:
---

Inner merge join on more than two tables also translates into 
POMergeCogroup + FE + Flatten. That case, too, could be translated to use 
POMergeJoin and enjoy the benefits that come with it. However, I suspect it 
would introduce much more complexity in POMergeJoin than the left outer merge 
join case, so it may not be worth doing. 

 Use POMergeJoin for Left Outer Join when join using 'merge'
 ---

 Key: PIG-1590
 URL: https://issues.apache.org/jira/browse/PIG-1590
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.8.0
Reporter: Ashutosh Chauhan
Priority: Minor

 C = join A by $0 left, B by $0 using 'merge';
 will result in map-side sort merge join. Internally, it will translate to use 
 POMergeCogroup + ForEachFlatten. POMergeCogroup places quite a few 
 restrictions on its loaders (A and B in this case) which is cumbersome. 
 Currently, only Zebra is known to satisfy all those requirements. It will be 
 better to use POMergeJoin in this case, since it has far fewer requirements 
 on its loader. Importantly, it works with PigStorage.  Plus, POMergeJoin will 
 be faster than POMergeCogroup + FE-Flatten.




[jira] Created: (PIG-1598) Pig gobbles up error messages - Part 2

2010-09-02 Thread Ashutosh Chauhan (JIRA)
Pig gobbles up error messages - Part 2
--

 Key: PIG-1598
 URL: https://issues.apache.org/jira/browse/PIG-1598
 Project: Pig
  Issue Type: Improvement
Reporter: Ashutosh Chauhan


Another case of PIG-1531 .




[jira] Created: (PIG-1590) Use POMergeJoin for Left Outer Join when join using 'merge'

2010-09-01 Thread Ashutosh Chauhan (JIRA)
Use POMergeJoin for Left Outer Join when join using 'merge'
---

 Key: PIG-1590
 URL: https://issues.apache.org/jira/browse/PIG-1590
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.8.0
Reporter: Ashutosh Chauhan
Priority: Minor


C = join A by $0 left, B by $0 using 'merge';

will result in map-side sort merge join. Internally, it will translate to use 
POMergeCogroup + ForEachFlatten. POMergeCogroup places quite a few restrictions 
on its loaders (A and B in this case) which is cumbersome. Currently, only 
Zebra is known to satisfy all those requirements. It will be better to use 
POMergeJoin in this case, since it has far fewer requirements on its loader. 
Importantly, it works with PigStorage. Plus, POMergeJoin will be faster than 
POMergeCogroup + FE-Flatten.




[jira] Commented: (PIG-1590) Use POMergeJoin for Left Outer Join when join using 'merge'

2010-09-01 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12905207#action_12905207
 ] 

Ashutosh Chauhan commented on PIG-1590:
---

It will entail changes in POMergeJoin and LogToPhyTranslationVisitor.

 Use POMergeJoin for Left Outer Join when join using 'merge'
 ---

 Key: PIG-1590
 URL: https://issues.apache.org/jira/browse/PIG-1590
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.8.0
Reporter: Ashutosh Chauhan
Priority: Minor

 C = join A by $0 left, B by $0 using 'merge';
 will result in map-side sort merge join. Internally, it will translate to use 
 POMergeCogroup + ForEachFlatten. POMergeCogroup places quite a few 
 restrictions on its loaders (A and B in this case) which is cumbersome. 
 Currently, only Zebra is known to satisfy all those requirements. It will be 
 better to use POMergeJoin in this case, since it has far fewer requirements 
 on its loader. Importantly, it works with PigStorage.  Plus, POMergeJoin will 
 be faster than POMergeCogroup + FE-Flatten.




[jira] Commented: (PIG-1531) Pig gobbles up error messages

2010-08-30 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12904497#action_12904497
 ] 

Ashutosh Chauhan commented on PIG-1531:
---

Niraj ran all the unit tests. All passed. No complaints from test-patch either. 
Committed to the trunk.
Thanks, Niraj!

 Pig gobbles up error messages
 -

 Key: PIG-1531
 URL: https://issues.apache.org/jira/browse/PIG-1531
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
Assignee: niraj rai
 Fix For: 0.8.0

 Attachments: pig-1531_3.patch, PIG_1531.patch, PIG_1531_2.patch


 Consider the following. I have my own Storer implementing StoreFunc and I am 
 throwing FrontEndException (and other Exceptions derived from PigException) 
 in its various methods. I expect those error messages to be shown in error 
 scenarios. Instead Pig gobbles up my error messages and shows its own generic 
 error message like: 
 {code}
 2010-07-31 14:14:25,414 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 2116: Unexpected error. Could not validate the output specification for: 
 default.partitoned
 Details at logfile: /Users/ashutosh/workspace/pig/pig_1280610650690.log
 {code}
 Instead I expect it to display my error messages which it stores away in that 
 log file.




[jira] Updated: (PIG-1531) Pig gobbles up error messages

2010-08-29 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1531:
--

Attachment: pig-1531_3.patch

I took a look at the latest patch. There are two minor problems. First, 
pigExec was always null and never assigned a value, which resulted in an NPE 
on certain code paths. Second, the boolean logic in PigInputFormat needs && 
instead of ||. I thought of correcting it and committing, but then realized 
Hudson hasn't come back with results yet. So I am uploading a new patch with 
those corrections and submitting to Hudson again. In this patch I also 
refactored the code a bit so it's easier to read. Have a look and, if it looks 
fine to you, run test-patch and the unit tests and paste the results here so I 
can commit it.

 Pig gobbles up error messages
 -

 Key: PIG-1531
 URL: https://issues.apache.org/jira/browse/PIG-1531
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
Assignee: niraj rai
 Fix For: 0.8.0

 Attachments: pig-1531_3.patch, PIG_1531.patch, PIG_1531_2.patch


 Consider the following. I have my own Storer implementing StoreFunc and I am 
 throwing FrontEndException (and other Exceptions derived from PigException) 
 in its various methods. I expect those error messages to be shown in error 
 scenarios. Instead Pig gobbles up my error messages and shows its own generic 
 error message like: 
 {code}
 2010-07-31 14:14:25,414 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 2116: Unexpected error. Could not validate the output specification for: 
 default.partitoned
 Details at logfile: /Users/ashutosh/workspace/pig/pig_1280610650690.log
 {code}
 Instead I expect it to display my error messages which it stores away in that 
 log file.




[jira] Updated: (PIG-1531) Pig gobbles up error messages

2010-08-29 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1531:
--

Status: Patch Available  (was: Open)

 Pig gobbles up error messages
 -

 Key: PIG-1531
 URL: https://issues.apache.org/jira/browse/PIG-1531
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
Assignee: niraj rai
 Fix For: 0.8.0

 Attachments: pig-1531_3.patch, PIG_1531.patch, PIG_1531_2.patch


 Consider the following. I have my own Storer implementing StoreFunc and I am 
 throwing FrontEndException (and other Exceptions derived from PigException) 
 in its various methods. I expect those error messages to be shown in error 
 scenarios. Instead Pig gobbles up my error messages and shows its own generic 
 error message like: 
 {code}
 010-07-31 14:14:25,414 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 2116: Unexpected error. Could not validate the output specification for: 
 default.partitoned
 Details at logfile: /Users/ashutosh/workspace/pig/pig_1280610650690.log
 {code}
 Instead I expect it to display my error messages which it stores away in that 
 log file.




[jira] Commented: (PIG-1531) Pig gobbles up error messages

2010-08-25 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12902576#action_12902576
 ] 

Ashutosh Chauhan commented on PIG-1531:
---

* In addition to the error message, you also need to set an error code on the 
exception you are throwing.
* Since you are catching exceptions thrown by user code (the StoreFunc 
interface), it is not safe to assume that e.getMessage() will be a non-null, 
non-empty string; assuming so will result in an NPE. You need to check for it 
and provide a generic error message in those cases.
* The generic error message should also contain the output location string, 
since if the user didn't provide a message, the location won't get printed. 
You could reword the message as: Output location validation failed for: 
location. More information to follow: 
* Since PigException extends IOException, the IOException you are catching may 
also be a PigException; you need to test whether it is, and then set the 
message and error code.
* In case of a non-existent input location I am still seeing the generic 
message ERROR 2997: Unable to recreate exception from backend error: 
org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to 
create input splits for: file:///Users/chauhana/workspace/pig-1531/a. The full 
stack trace printed at the end does contain the underlying error string, but 
that is even more confusing because there are now three different error 
messages amid a Java stack trace.
* This warrants a test case for regression purposes. (In fact, error reporting 
behavior has already changed since I opened this bug.)
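
The null-safe handling asked for in the second and third points could be 
sketched as follows (class and method names are illustrative, not Pig's real 
API): fall back to a generic message that still names the output location when 
the user exception carries no message.

{code}
// Hypothetical sketch; OutputValidationError and describe are invented names.
public class OutputValidationError {

    static String describe(Exception userError, String location) {
        String msg = userError.getMessage();
        // User code may throw exceptions with a null or empty message;
        // guard against that before using it.
        if (msg == null || msg.isEmpty()) {
            // Generic fallback still mentions the output location.
            return "Output location validation failed for: " + location;
        }
        return "Output location validation failed for: " + location
                + ". More information to follow: " + msg;
    }

    public static void main(String[] args) {
        System.out.println(describe(new Exception((String) null), "/tmp/out"));
        System.out.println(describe(new Exception("bad schema"), "/tmp/out"));
    }
}
{code}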

 Pig gobbles up error messages
 -

 Key: PIG-1531
 URL: https://issues.apache.org/jira/browse/PIG-1531
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
Assignee: niraj rai
 Fix For: 0.8.0

 Attachments: PIG_1531.patch


 Consider the following. I have my own Storer implementing StoreFunc and I am 
 throwing FrontEndException (and other Exceptions derived from PigException) 
 in its various methods. I expect those error messages to be shown in error 
 scenarios. Instead Pig gobbles up my error messages and shows its own generic 
 error message like: 
 {code}
 2010-07-31 14:14:25,414 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 2116: Unexpected error. Could not validate the output specification for: 
 default.partitoned
 Details at logfile: /Users/ashutosh/workspace/pig/pig_1280610650690.log
 {code}
 Instead I expect it to display my error messages which it stores away in that 
 log file.




[jira] Updated: (PIG-1309) Map-side Cogroup

2010-08-20 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1309:
--

Release Note: 
With this patch, it is now possible to perform map-side cogroup if data is 
sorted and loader implements certain interfaces. Primary algorithm is based on 
sort-merge join with additional restrictions. 

Following preconditions must be met to use this feature: 
1) No other operations can be done between load and join statements. 
2) Data must be sorted on join keys for all tables in ASC order. 
3) Nulls are considered smaller than everything, so if the data contains null 
keys, they should occur before anything else. 
4) The left-most loader must implement the {{CollectableLoader}} interface as 
well as {{OrderedLoadFunc}}. 
5) All other loaders must implement {{IndexableLoadFunc}}. 
6) Type information must be provided in the schema for all the loaders.

Note that the Zebra loader satisfies all of these conditions, so it can be 
used out of the box. 

Similar conditions apply to map-side outer joins (using merge) (PIG-1353) as 
well. 

Example: 
A = load 'data1' using org.apache.hadoop.zebra.pig.TableLoader('id:int', 
'sorted'); 
B = load 'data2' using org.apache.hadoop.zebra.pig.TableLoader('id:int', 
'sorted'); 
C = COGROUP A by id, B by id using 'merge'; 


  was:
With this patch, it is now possible to perform map-side cogroup if the data is 
sorted and one of the loaders implements the {{CollectableLoader}} interface. 
The primary algorithm is based on sort-merge join. 

Additional implementation details: 
1) No other operations can be done between load and join statements. 
2) Data must be sorted in ASC order. 
3) Nulls are considered smaller than everything. So, if data contains null 
keys, they should occur before anything else. 
4) Left-most loader must implement CollectableLoader interface as well as 
OrderedLoadFunc. 
5) All other loaders must implement IndexableLoadFunc. 

Note that Zebra loader satisfies all of these conditions, so can be used out of 
box. 
Similar conditions apply to map-side cogroups (PIG-1309) as well. 

Example: 
A = load 'data1' using org.apache.hadoop.zebra.pig.TableLoader('', 'sorted'); 
B = load 'data2' using org.apache.hadoop.zebra.pig.TableLoader('', 'sorted'); 
C = COGROUP A by id, B by id using 'merge'; 



 Map-side Cogroup
 

 Key: PIG-1309
 URL: https://issues.apache.org/jira/browse/PIG-1309
 Project: Pig
  Issue Type: Bug
  Components: impl
Reporter: Ashutosh Chauhan
Assignee: Ashutosh Chauhan
 Fix For: 0.7.0, 0.8.0

 Attachments: mapsideCogrp.patch, pig-1309_1.patch, pig-1309_2.patch, 
 PIG_1309_7.patch


 In the never-ending quest to make Pig go faster, we want to parallelize as 
 many relational operations as possible. It's already possible to do Group-by 
 (PIG-984) and Joins (PIG-845, PIG-554) purely map-side in Pig. This jira is 
 to add a map-side implementation of Cogroup in Pig. Details to follow.




[jira] Updated: (PIG-1353) Map-side outer joins

2010-08-20 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1353:
--

Release Note: 
With this patch, it is now possible to perform [left|right|full] outer joins on 
two tables, as well as inner joins on more than two tables, map-side in Pig if 
the data is sorted and the loaders implement the required interfaces. The 
primary algorithm is based on sort-merge join. 

Following preconditions should be met in order to use this feature:
1) No other operations can be done between load and join statements.
2) Data must be sorted on join keys in ASC order.
3) Nulls are considered smaller than everything, so if the data contains null 
keys, they should occur before anything else.
4) The left-most loader must implement the {{CollectableLoader}} interface as 
well as {{OrderedLoadFunc}}.
5) All other loaders must implement {{IndexableLoadFunc}}.   
6) Type information must be provided in the schema for all the loaders. 

Note that the Zebra loader satisfies all of these conditions, so it can be 
used out of the box.

Similar conditions apply to map-side cogroups (PIG-1309) as well.  

Example:
A = load 'data1' using org.apache.hadoop.zebra.pig.TableLoader('id:int', 
'sorted');
B = load 'data2' using org.apache.hadoop.zebra.pig.TableLoader('id:int', 
'sorted');
C = join A by id left, B by id using 'merge';
.

  was:
With this patch, it is now possible to perform [left|right|full] outer joins on 
two tables, as well as inner joins on more than two tables, map-side in Pig if 
the data is sorted and one of the loaders implements the {{CollectableLoader}} 
interface. The primary algorithm is based on sort-merge join. 

Additional implementation details:
1) No other operations can be done between load and join statements.
2) Data must be sorted in ASC order.
3) Nulls are considered smaller than everything. So, if data contains null 
keys, they should occur before anything else.
4) Left-most loader must implement CollectableLoader interface as well as 
OrderedLoadFunc.
5) All other loaders must implement IndexableLoadFunc.   

Note that Zebra loader satisfies all of these conditions, so can be used out of 
box.
Similar conditions apply to map-side cogroups (PIG-1309) as well.  

Example:
A = load 'data1' using org.apache.hadoop.zebra.pig.TableLoader('', 'sorted');
B = load 'data2' using org.apache.hadoop.zebra.pig.TableLoader('', 'sorted');
C = join A by id left, B by id using 'merge';
.


 Map-side outer joins
 

 Key: PIG-1353
 URL: https://issues.apache.org/jira/browse/PIG-1353
 Project: Pig
  Issue Type: Improvement
  Components: impl
Reporter: Ashutosh Chauhan
Assignee: Ashutosh Chauhan
 Fix For: 0.8.0

 Attachments: pig-1353.patch, pig-1353.patch


 Pig already has a couple of map-side join implementations: Merge Join and 
 Fragmented-Replicate Join. But both of them are pretty restrictive. Merge 
 Join can join only two tables, and only as an inner join. FR Join can join 
 multiple relations, but it too can do only inner and left outer joins; 
 further, it restricts the sizes of the side relations. It will be nice if we 
 can do map-side joins on multiple tables, as well as inner, left outer, 
 right outer and full outer joins. 
 A lot of the groundwork for this has already been done in PIG-1309. The 
 remaining work will be tracked in this jira.   




[jira] Updated: (PIG-1309) Map-side Cogroup

2010-08-20 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1309:
--

Release Note: 
With this patch, it is now possible to perform map-side cogroup if data is 
sorted and loader implements certain interfaces. Primary algorithm is based on 
sort-merge join with additional restrictions. 

Following preconditions must be met to use this feature: 
1) No other operations can be done between load and cogroup statements. 
2) Data must be sorted on join keys for all tables in ASC order. 
3) Nulls are considered smaller than everything, so if the data contains null 
keys, they should occur before anything else. 
4) The left-most loader must implement the {{CollectableLoader}} interface as 
well as {{OrderedLoadFunc}}. 
5) All other loaders must implement {{IndexableLoadFunc}}. 
6) Type information must be provided in the schema for all the loaders.

Note that the Zebra loader satisfies all of these conditions, so it can be 
used out of the box. 

Similar conditions apply to map-side outer joins (using merge) (PIG-1353) as 
well. 

Example: 
A = load 'data1' using org.apache.hadoop.zebra.pig.TableLoader('id:int', 
'sorted'); 
B = load 'data2' using org.apache.hadoop.zebra.pig.TableLoader('id:int', 
'sorted'); 
C = COGROUP A by id, B by id using 'merge'; 


  was:
With this patch, it is now possible to perform map-side cogroup if data is 
sorted and loader implements certain interfaces. Primary algorithm is based on 
sort-merge join with additional restrictions. 

Following preconditions must be met to use this feature: 
1) No other operations can be done between load and join statements. 
2) Data must be sorted on join keys for all tables in ASC order. 
3) Nulls are considered smaller then everything. So, if data contains null 
keys, they should occur before anything else. 
4) Left-most loader must implement {CollectableLoader} interface as well as 
{OrderedLoadFunc}. 
5) All other loaders must implement IndexableLoadFunc. 
6) Type information must be provided in schema for all the loaders.

Note that the Zebra loader satisfies all of these conditions, so it can be used 
out of the box. 

Similar conditions apply to map-side outer joins (using merge) (PIG-1353) as 
well. 

Example: 
A = load 'data1' using org.apache.hadoop.zebra.pig.TableLoader('id:int', 
'sorted'); 
B = load 'data2' using org.apache.hadoop.zebra.pig.TableLoader('id:int', 
'sorted'); 
C = COGROUP A by id, B by id using 'merge'; 



 Map-side Cogroup
 

 Key: PIG-1309
 URL: https://issues.apache.org/jira/browse/PIG-1309
 Project: Pig
  Issue Type: Bug
  Components: impl
Reporter: Ashutosh Chauhan
Assignee: Ashutosh Chauhan
 Fix For: 0.7.0, 0.8.0

 Attachments: mapsideCogrp.patch, pig-1309_1.patch, pig-1309_2.patch, 
 PIG_1309_7.patch


 In our never-ending quest to make Pig go faster, we want to parallelize as many 
 relational operations as possible. It's already possible to do Group-by ( 
 PIG-984 ) and Joins ( PIG-845 , PIG-554 ) purely map-side in Pig. This jira 
 is to add a map-side implementation of Cogroup in Pig. Details to follow.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1420) Make CONCAT act on all fields of a tuple, instead of just the first two fields of a tuple

2010-08-19 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12900566#action_12900566
 ] 

Ashutosh Chauhan commented on PIG-1420:
---

 I could not figure out how to re-open this issue.

Issues marked as closed cannot be reopened. Once the patch is committed, the 
committer should mark the issue as resolved, since resolved issues can still be 
reopened before the release is rolled out. When the release is rolled out, 
resolved issues should be marked as closed, since there is no point in reopening 
an issue which has already been released. If more work needs to be done on that 
issue, a new jira should be created for it for future releases.

 Make CONCAT act on all fields of a tuple, instead of just the first two 
 fields of a tuple
 -

 Key: PIG-1420
 URL: https://issues.apache.org/jira/browse/PIG-1420
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.8.0
Reporter: Russell Jurney
Assignee: Russell Jurney
 Fix For: 0.8.0

 Attachments: addconcat2.patch, PIG-1420.2.patch

   Original Estimate: 24h
  Remaining Estimate: 24h

 org.apache.pig.builtin.CONCAT (which acts on DataByteArray's internally) and 
 org.apache.pig.builtin.StringConcat (which acts on Strings internally), both 
 act on the first two fields of a tuple.  This results in ugly nested CONCAT 
 calls like:
 CONCAT(CONCAT(A, ' '), B)
 The more desirable form is:
 CONCAT(A, ' ', B)
 This change will be backwards compatible, provided that no one was relying on 
 the fact that CONCAT ignores fields after the first two in a tuple.  This 
 seems a reasonable assumption to make, or at least a small break in 
 compatibility for a sizable improvement.
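
 The proposed behavior amounts to folding over every field of the tuple instead 
 of stopping at two. An illustrative Java sketch (plain strings stand in for 
 Pig tuple fields; this is not the actual patch):

```java
import java.util.Arrays;
import java.util.List;

public class ConcatAllFields {

    // Current behavior: only the first two fields are joined.
    static String concatFirstTwo(List<String> tuple) {
        return tuple.get(0) + tuple.get(1);
    }

    // Proposed behavior: every field of the tuple is joined, so
    // CONCAT(A, ' ', B) works without nesting CONCAT calls.
    static String concatAll(List<String> tuple) {
        StringBuilder sb = new StringBuilder();
        for (String field : tuple) {
            sb.append(field);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> t = Arrays.asList("A", " ", "B");
        System.out.println(concatFirstTwo(t)); // field "B" is silently dropped
        System.out.println(concatAll(t));      // A B
    }
}
```

 The compatibility argument above falls directly out of the sketch: the two 
 functions differ only for tuples with more than two fields, which the old code 
 ignored.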

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1486) update ant eclipse-files target to include new jar and remove contrib dirs from build path

2010-08-19 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12900568#action_12900568
 ] 

Ashutosh Chauhan commented on PIG-1486:
---

I did 
svn co https://svn.apache.org/repos/asf/hadoop/pig/trunk/ pig-1486
ant eclipse-files

and then imported pig-1486 as an existing project in eclipse. I presume that's 
all I need to do.
The patch needs more updates after PIG-1520. Essentially, it needs to remove owl 
from eclipse's build path. Further, eclipse also reported
* Unbound classpath variable: 'ANT_HOME/lib/ant.jar' in project 'pig-1486'
* Project 'pig-1486' is missing required library: 'lib/hadoop20.jar'



 update ant eclipse-files target to include new jar and remove contrib dirs 
 from build path
 --

 Key: PIG-1486
 URL: https://issues.apache.org/jira/browse/PIG-1486
 Project: Pig
  Issue Type: Bug
  Components: tools
Affects Versions: 0.8.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
Priority: Minor
 Fix For: 0.8.0

 Attachments: PIG-1486.1.patch, PIG-1486.2.patch, PIG-1486.patch


  .eclipse.templates/.classpath needs to be updated to address following -
 1. There is a new jar that is used by the code - guava-r03.jar
 2. The jar ANT_HOME/lib/ant.jar gives an 'unbounded jar' error in eclipse.
 3. Removing the contrib projects from class path as discussed in PIG-1390, 
 until all libs necessary for the contribs are included in classpath.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (PIG-533) DBloader UDF (initial prototype)

2010-08-19 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan resolved PIG-533.
--

Fix Version/s: 0.8.0
   Resolution: Fixed

PIG-1229 makes this redundant.

 DBloader UDF (initial prototype)
 

 Key: PIG-533
 URL: https://issues.apache.org/jira/browse/PIG-533
 Project: Pig
  Issue Type: New Feature
Reporter: Ian Holsman
Priority: Minor
 Fix For: 0.8.0

 Attachments: DbStorage.java


 This is an initial prototype of a UDF that can insert data into a database 
 directly from Pig.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1518) multi file input format for loaders

2010-08-14 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12898648#action_12898648
 ] 

Ashutosh Chauhan commented on PIG-1518:
---

This feature of combining multiple splits should honor the OrderedLoadFunc 
interface: if a loadfunc implements that interface, the splits it generates 
should not be combined. However, it's not clear why FileInputLoadFunc 
implements this interface. AFAIK, the split[] returned by getSplits() on 
FileInputFormat makes no guarantee that the underlying splits will be returned 
in an ordered fashion. That happens to be the default behavior right now, so 
implementing OrderedLoadFunc doesn't cause any problem in the current 
implementation, but there seems to be no real benefit to FileInputLoadFunc 
implementing it (with one exception, which I will come to later). So, I will 
argue that FileInputLoadFunc stop implementing OrderedLoadFunc. The immediate 
benefit is that this change becomes useful to all the fundamental storage 
mechanisms of Pig, like PigStorage, BinStorage, InterStorage, etc. Dropping an 
interface from an implementing class can be seen as a backward-incompatible 
change, but I really doubt anyone cares whether PigStorage is reading splits in 
an ordered fashion. 
The only real victim of this change will be MergeJoin, which will stop working 
with PigStorage by default. But we have not seen MergeJoin being used with 
PigStorage in many places. Second, it is anyway based on an assumption about 
FileInputFormat, which may change its behavior in the future. Third, the 
solution is straightforward: another loader which extends PigStorage and 
implements OrderedLoadFunc can be used to load data for merge join. 

In essence, I am arguing for dropping the OrderedLoadFunc interface from 
FileInputLoadFunc so that this feature is useful for a large number of use 
cases.
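
If FileInputLoadFunc drops the interface, the split-combining feature only 
needs a simple guard: combine splits unless the loader advertises an ordering 
promise. A hypothetical sketch (the real OrderedLoadFunc interface has methods; 
a bare marker interface is used here purely for illustration):

```java
// Marker for loaders that promise ordered splits; the real Pig interface
// has methods, but a bare marker suffices to illustrate the guard.
interface OrderedLoadFunc {}

class PlainLoader {}

class OrderedLoader extends PlainLoader implements OrderedLoadFunc {}

public class CombineGuard {

    // Combine small splits only when the loader makes no ordering promise;
    // combining ordered splits would silently break e.g. merge join.
    static boolean mayCombineSplits(PlainLoader loader) {
        return !(loader instanceof OrderedLoadFunc);
    }

    public static void main(String[] args) {
        System.out.println(mayCombineSplits(new PlainLoader()));   // true
        System.out.println(mayCombineSplits(new OrderedLoader())); // false
    }
}
```

With such a guard in place, whether PigStorage benefits from split combining is 
decided entirely by whether it implements the marker, which is exactly the 
contract being argued about above.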

Yan, you also need to watch out for ReadToEndLoader, which is also making 
assumptions that may break in the presence of this feature.

 multi file input format for loaders
 ---

 Key: PIG-1518
 URL: https://issues.apache.org/jira/browse/PIG-1518
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Yan Zhou
 Fix For: 0.8.0


 We frequently run into the situation where Pig needs to deal with small files 
 in the input. In this case a separate map is created for each file, which 
 can be very inefficient. 
 It would be great to have an umbrella input format that can take multiple 
 files and use them in a single split. We would like to see this working with 
 different data formats if possible.
 There are already a couple of input formats doing a similar thing: 
 MultifileInputFormat as well as CombinedInputFormat; however, neither works 
 with the new Hadoop 20 API. 
 We at least want to do a feasibility study for Pig 0.8.0.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1404) PigUnit - Pig script testing simplified.

2010-08-04 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12895318#action_12895318
 ] 

Ashutosh Chauhan commented on PIG-1404:
---

bq. 3. (This one is for other pig developers) Is Piggybank the right place for 
this or should we put it under test? I think this will be really useful for Pig 
users in setting up automated tests of their Pig Latin scripts. Should we 
support it outright rather than put it in piggybank and risk having it go 
unmaintained?

I think it deserves to be put in under test. Having written a few end-to-end 
test cases of pig in junit, I can see it is really useful for Pig itself. Its 
usefulness for pig users is pretty obvious.

 PigUnit - Pig script testing simplified. 
 -

 Key: PIG-1404
 URL: https://issues.apache.org/jira/browse/PIG-1404
 Project: Pig
  Issue Type: New Feature
Reporter: Romain Rigaux
Assignee: Romain Rigaux
 Fix For: 0.8.0

 Attachments: commons-lang-2.4.jar, PIG-1404-2.patch, 
 PIG-1404-3-doc.patch, PIG-1404-3.patch, PIG-1404-4-doc.patch, 
 PIG-1404-4.patch, PIG-1404.patch


 The goal is to provide a simple xUnit framework that enables our Pig scripts 
 to be easily:
   - unit tested
   - regression tested
   - quickly prototyped
 No cluster setup is required.
 For example:
 TestCase
 {code}
   @Test
   public void testTop3Queries() {
 String[] args = {
 "n=3",
 };
 test = new PigTest("top_queries.pig", args);
 String[] input = {
 "yahoo\t10",
 "twitter\t7",
 "facebook\t10",
 "yahoo\t15",
 "facebook\t5",
 
 };
 String[] output = {
 "(yahoo,25L)",
 "(facebook,15L)",
 "(twitter,7L)",
 };
 test.assertOutput("data", input, "queries_limit", output);
   }
 {code}
 top_queries.pig
 {code}
 data =
 LOAD '$input'
 AS (query:CHARARRAY, count:INT);
  
 ... 
 
 queries_sum = 
 FOREACH queries_group 
 GENERATE 
 group AS query, 
 SUM(queries.count) AS count;
 
 ...
 
 queries_limit = LIMIT queries_ordered $n;
 STORE queries_limit INTO '$output';
 {code}
 There are 3 modes:
 * LOCAL (if the pigunit.exectype.local property is present)
 * MAPREDUCE (uses the cluster specified in the classpath, same as 
 HADOOP_CONF_DIR)
 ** automatic mini cluster (this is the default, and the HADOOP_CONF_DIR to 
 have in the class path will be: ~/pigtest/conf)
 ** pointing to an existing cluster (if the pigunit.exectype.cluster property 
 is present)
 For now, it would be nice to see how this idea could be integrated in 
 Piggybank and if PigParser/PigServer could improve their interfaces in order 
 to make PigUnit simple.
 Other components based on PigUnit could be built later:
   - standalone MiniCluster
   - notion of workspaces for each test
   - standalone utility that reads test configuration and generates a test 
 report...
 It is a first prototype, open to suggestions, and can definitely take 
 advantage of feedback.
 How to test, in pig_trunk:
 {code}
 Apply patch
 $pig_trunk ant compile-test
 $pig_trunk ant
 $pig_trunk/contrib/piggybank/java ant test -Dtest.timeout=99
 {code}
 (it takes 15 min in MAPREDUCE minicluster, tests will need to be split in the 
 future between 'unit' and 'integration')
 Many examples are in:
 {code}
 contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/pigunit/TestPigTest.java
 {code}
 When used standalone, do not forget to add commons-lang-2.4.jar and the 
 HADOOP_CONF_DIR of your cluster to your CLASSPATH.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1531) Pig gobbles up error messages

2010-08-03 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12894935#action_12894935
 ] 

Ashutosh Chauhan commented on PIG-1531:
---

Another instance where this happens is when the input location doesn't exist; 
the error message shown is 
{code}
org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to 
create input splits for tmp_emtpy_1280539088
{code}
Whereas the underlying exception had a more useful string, which gets lost in 
the log file
{code}
org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does 
not exist:
hdfs://machine.server.edu/tmp/pig/tmp_tables/tmp_empty_1280539088
{code}

 Pig gobbles up error messages
 -

 Key: PIG-1531
 URL: https://issues.apache.org/jira/browse/PIG-1531
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
Assignee: niraj rai
 Fix For: 0.8.0


 Consider the following. I have my own Storer implementing StoreFunc and I am 
 throwing FrontEndException (and other Exceptions derived from PigException) 
 in its various methods. I expect those error messages to be shown in error 
 scenarios. Instead Pig gobbles up my error messages and shows its own generic 
 error message like: 
 {code}
 010-07-31 14:14:25,414 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 2116: Unexpected error. Could not validate the output specification for: 
 default.partitoned
 Details at logfile: /Users/ashutosh/workspace/pig/pig_1280610650690.log
 {code}
 Instead I expect it to display my error messages which it stores away in that 
 log file.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1516) finalize in bag implementations causes pig to run out of memory in reduce

2010-08-03 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12894945#action_12894945
 ] 

Ashutosh Chauhan commented on PIG-1516:
---

+1. Changes look good.

 finalize in bag implementations causes pig to run out of memory in reduce 
 --

 Key: PIG-1516
 URL: https://issues.apache.org/jira/browse/PIG-1516
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Fix For: 0.8.0

 Attachments: PIG-1516.2.patch, PIG-1516.patch


 *Problem:*
 Pig bag implementations that are subclasses of DefaultAbstractBag have 
 finalize methods implemented. As a result, the garbage collector moves them 
 to a finalization queue, and the memory used is freed only after the 
 finalization happens on it.
 If the bags are not finalized fast enough, a lot of memory is consumed by the 
 finalization queue, and pig runs out of memory. This can happen if a large 
 number of small bags are being created.
 *Solution:*
 The finalize function exists for the purpose of deleting the spill files that 
 are created when the bag is too large. But if the bags are small enough, no 
 spill files are created, and there is no use for the finalize function.
  A new class that holds a list of files will be introduced (FileList). This 
 class will have a finalize method that deletes the files. The bags will no 
 longer have finalize methods, and the bags will use FileList instead of 
 ArrayList<File>.
 *Possible workaround for earlier releases:*
 Since the fix is going into 0.8, here is a workaround -
 Disabling the combiner will reduce the number of bags getting created, as 
 there will not be the stage of combining intermediate merge results. But I 
 would recommend disabling it only if you have this problem, as it is likely to 
 slow down the query.
 To disable the combiner, set the property: -Dpig.exec.nocombiner=true
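
 The solution above can be sketched roughly as follows: the finalizer moves off 
 the (numerous) bag objects onto one small holder of spill files, so only bags 
 that actually spilled ever touch the finalization queue. A simplified, 
 hypothetical rendering of the FileList idea (finalize() is shown because that 
 is the mechanism under discussion, though it is discouraged in modern Java):

```java
import java.io.File;
import java.util.ArrayList;

/** Holds a bag's spill files; only this small object carries a finalizer. */
class FileList extends ArrayList<File> {
    @Override
    protected void finalize() {
        // Delete spill files when the holder is garbage-collected.
        for (File f : this) f.delete();
    }
}

/** Sketch of a bag after the fix: no finalize() on the bag itself. */
class SmallBagSketch {
    // Created lazily, only when the bag actually spills to disk; small
    // bags never allocate it, so they skip the finalization queue entirely.
    FileList spillFiles;

    public static void main(String[] args) {
        FileList fl = new FileList();
        fl.add(new File("/tmp/pig_spill_0"));
        System.out.println(fl.size()); // 1
    }
}
```

 The key point is that millions of small, never-spilled bags now become 
 ordinary garbage, while the rare spilled bag still gets its files cleaned up 
 via the FileList it holds.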

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1229) allow pig to write output into a JDBC db

2010-08-03 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12894963#action_12894963
 ] 

Ashutosh Chauhan commented on PIG-1229:
---

I am still getting the same exception 
{code}
java.io.IOException: JDBC Error
at 
org.apache.pig.piggybank.storage.DBStorage.prepareToWrite(DBStorage.java:291)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.init(PigOutputFormat.java:124)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat.getRecordWriter(PigOutputFormat.java:85)
at 
org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.init(MapTask.java:488)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:610)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: java.sql.SQLException: Table not found in statement [insert into ttt 
(id, name, ratio) values (?,?,?)]
at org.hsqldb.jdbc.Util.throwError(Unknown Source)
at org.hsqldb.jdbc.jdbcPreparedStatement.init(Unknown Source)
at org.hsqldb.jdbc.jdbcConnection.prepareStatement(Unknown Source)
at 
org.apache.pig.piggybank.storage.DBStorage.prepareToWrite(DBStorage.java:288)
... 6 more
{code}

Reading through a few internet forums, it seems that there are subtle 
differences between the stand-alone and server modes of hsqldb. Maybe starting 
the hsqldb instance in server mode would alleviate the problem.

 allow pig to write output into a JDBC db
 

 Key: PIG-1229
 URL: https://issues.apache.org/jira/browse/PIG-1229
 Project: Pig
  Issue Type: New Feature
  Components: impl
Reporter: Ian Holsman
Assignee: Ankur
Priority: Minor
 Fix For: 0.8.0

 Attachments: jira-1229-final.patch, jira-1229-final.test-fix.patch, 
 jira-1229-v2.patch, jira-1229-v3.patch, pig-1229.2.patch, pig-1229.patch


 UDF to store data into a DB

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1531) Pig gobbles up error messages

2010-07-31 Thread Ashutosh Chauhan (JIRA)
Pig gobbles up error messages
-

 Key: PIG-1531
 URL: https://issues.apache.org/jira/browse/PIG-1531
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
 Fix For: 0.8.0


Consider the following. I have my own Storer implementing StoreFunc and I am 
throwing FrontEndException (and other Exceptions derived from PigException) in 
its various methods. I expect those error messages to be shown in error 
scenarios. Instead Pig gobbles up my error messages and shows its own generic 
error message like: 
{code}
010-07-31 14:14:25,414 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
2116: Unexpected error. Could not validate the output specification for: 
default.partitoned
Details at logfile: /Users/ashutosh/workspace/pig/pig_1280610650690.log

{code}
Instead I expect it to display my error messages which it stores away in that 
log file.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1229) allow pig to write output into a JDBC db

2010-07-26 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12892378#action_12892378
 ] 

Ashutosh Chauhan commented on PIG-1229:
---

Since the fix to PIG-1424 doesn't look straightforward and I don't think anyone 
is working on it, I suggest unblocking this useful piggybank functionality from 
Pig's issues. We can take the original approach suggested in the first patch: 
passing the jdbc url string as a constructor argument instead of the store 
location. 
Ankur, do you have cycles to generate the patch, which we will commit now so it 
makes it into 0.8?

 allow pig to write output into a JDBC db
 

 Key: PIG-1229
 URL: https://issues.apache.org/jira/browse/PIG-1229
 Project: Pig
  Issue Type: New Feature
  Components: impl
Reporter: Ian Holsman
Assignee: Ankur
Priority: Minor
 Fix For: 0.8.0

 Attachments: jira-1229-v2.patch, jira-1229-v3.patch, 
 pig-1229.2.patch, pig-1229.patch


 UDF to store data into a DB

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-928) UDFs in scripting languages

2010-07-21 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12890845#action_12890845
 ] 

Ashutosh Chauhan commented on PIG-928:
--

Addendum:

* Also, what will happen if the user returns a nil Python object (the 
equivalent of Java's null) from the UDF? It looks to me like that will result 
in an NPE. Can you add a test for that, and a similar test case for 
pigToPython()?

 UDFs in scripting languages
 ---

 Key: PIG-928
 URL: https://issues.apache.org/jira/browse/PIG-928
 Project: Pig
  Issue Type: New Feature
Reporter: Alan Gates
Assignee: Aniket Mokashi
 Fix For: 0.8.0

 Attachments: calltrace.png, package.zip, PIG-928.patch, 
 pig-greek.tgz, pig.scripting.patch.arnab, pyg.tgz, RegisterPythonUDF3.patch, 
 RegisterPythonUDF4.patch, RegisterPythonUDF_Final.patch, 
 RegisterPythonUDFFinale.patch, RegisterPythonUDFFinale3.patch, 
 RegisterPythonUDFFinale4.patch, RegisterPythonUDFFinale5.patch, 
 RegisterPythonUDFLatest.patch, RegisterScriptUDFDefineParse.patch, 
 scripting.tgz, scripting.tgz, test.zip


 It should be possible to write UDFs in scripting languages such as python, 
 ruby, etc.  This frees users from needing to compile Java, generate a jar, 
 etc.  It also opens Pig to programmers who prefer scripting languages over 
 Java.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1486) update ant eclipse-files target to include new jar and remove contrib dirs from build path

2010-07-13 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12887885#action_12887885
 ] 

Ashutosh Chauhan commented on PIG-1486:
---

Took a look at the patch. Changes look good. But, because of PIG-1452, some 
additional changes are required: lib/hadoop20.jar needs to be removed from the 
eclipse build path, and hadoop-core.jar, hadoop-test.jar, apache-commons-*, and 
a few other jars need to be added in, since these are now pulled in from maven 
repos and put in build/ivy/lib/Pig.

 update ant eclipse-files target to include new jar and remove contrib dirs 
 from build path
 --

 Key: PIG-1486
 URL: https://issues.apache.org/jira/browse/PIG-1486
 Project: Pig
  Issue Type: Bug
  Components: tools
Affects Versions: 0.8.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
Priority: Minor
 Fix For: 0.8.0

 Attachments: PIG-1486.patch


  .eclipse.templates/.classpath needs to be updated to address following -
 1. There is a new jar that is used by the code - guava-r03.jar
 2. The jar ANT_HOME/lib/ant.jar gives an 'unbounded jar' error in eclipse.
 3. Removing the contrib projects from class path as discussed in PIG-1390, 
 until all libs necessary for the contribs are included in classpath.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-928) UDFs in scripting languages

2010-07-13 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12888100#action_12888100
 ] 

Ashutosh Chauhan commented on PIG-928:
--

* Do you want to allow: {{register myJavaUDFs.jar using 'java' as 
'javaNameSpace'}} ? The use-case could be that if we are allowing namespaces 
for non-Java, why not allow them for Java udfs as well. But then {{define}} 
exists exactly for this purpose, so it may make sense to throw an exception for 
such a case.
* In ScriptEngine.getJarPath(), shouldn't you throw a FileNotFoundException 
instead of returning null?
* Don't gobble up checked exceptions and then rethrow RuntimeExceptions. Throw 
checked exceptions, if you need to.
* ScriptEngine.getInstance() should be a singleton, no?
* In JythonScriptEngine.getFunction(), I think you should check whether 
interpreter.get(functionName) != null and then return it, and call 
Interpreter.init(path) only if it is null.
* In JythonUtils, for doing type conversion you should make use of both input 
and output schemas (whenever they are available) and avoid doing reflection for 
every element. You can get hold of the input schema through outputSchema() of 
EvalFunc and then do UDFContext magic to use it. If schema == null || schema == 
bytearray, you need to resort to reflection. Similarly, if the output schema is 
available via decorators, use it to do type conversions.  
* In JythonUtils.pythonToPig(), in the case of Tuple you first create an 
Object[] and then do Arrays.asList(); you can directly create a List<Object> 
and avoid the unnecessary casting. In the same method, you are only checking 
for long; don't you need to check for int, String, etc. and then cast 
appropriately? Also, in the default case I think we can't let the object pass 
through as-is using Object.class; it could be an object of any type and may 
cause cryptic errors in the pipeline if let through. We should throw an 
exception if we don't know what type of object it is. A similar argument 
applies to the default case of pigToPython(). 
* I didn't get why the changes are required in POUserFunc. Can you explain, and 
also add it as comments in the code?

Testing:

* This is a big enough feature to warrant its own test file, so consider 
adding a new test file (maybe TestNonJavaUDF). Additionally, we see frequent 
timeouts on TestEvalPipeline; we don't want it to run any longer.
* Instead of adding the query through the pigServer.registerCode() api, add it 
through pigServer.registerQuery("register myscript.py using jython"). This will 
make sure we are testing the changes in QueryParser.jjt as well.
* Add more tests. Specifically, for complex types passed to the udfs (like bag) 
and returning a bag. You can get bags after doing a group-by. You can also take 
a look at Julien's original patch, which contained a python script. Those, I 
guess, were at the right level of complexity to be added as test cases in our 
junit tests.

Nit-picks:

* Unnecessary import in JythonFunction.java
* In PigContext.java, you are using Vector and LinkedList instead of the usual 
ArrayList. Any particular reason for it? Just curious.
* More documentation (in QueryParser.jjt, ScriptEngine, JythonScriptEngine 
(specifically for outputSchema, outputSchemaFunction, schemaFunction))
* Also keep an eye on the recent mavenization efforts of Pig; depending on when 
it gets checked in, you may (or may not) need to make changes to ivy
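
The type-conversion points above (handle each known scalar type, and fail 
loudly in the default case rather than letting an arbitrary object into the 
pipeline) can be sketched like this. The class and method names are 
hypothetical; the real code dispatches on Jython PyObjects, not plain Java 
objects:

```java
public class PythonToPigSketch {

    // Convert a value coming back from a script UDF into something the Pig
    // pipeline can carry. Handle every type we know; in the default case,
    // throw instead of letting an arbitrary Object leak downstream and
    // cause cryptic errors later. (Hypothetical stand-in for illustration.)
    static Object toPigValue(Object o) {
        if (o == null) {
            return null; // nil/None maps to Pig's null, avoiding an NPE
        }
        if (o instanceof Integer || o instanceof Long
                || o instanceof Double || o instanceof String) {
            return o; // scalars Pig already understands
        }
        throw new IllegalArgumentException(
            "Unknown type in python-to-pig conversion: " + o.getClass());
    }

    public static void main(String[] args) {
        System.out.println(toPigValue(42L));   // 42
        System.out.println(toPigValue("abc")); // abc
    }
}
```

The same shape applies in the other direction (pigToPython): an explicit 
dispatch per supported type, with an exception in the default branch.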

 UDFs in scripting languages
 ---

 Key: PIG-928
 URL: https://issues.apache.org/jira/browse/PIG-928
 Project: Pig
  Issue Type: New Feature
Reporter: Alan Gates
Assignee: Aniket Mokashi
 Fix For: 0.8.0

 Attachments: calltrace.png, package.zip, PIG-928.patch, 
 pig-greek.tgz, pig.scripting.patch.arnab, pyg.tgz, RegisterPythonUDF3.patch, 
 RegisterPythonUDF4.patch, RegisterPythonUDF_Final.patch, 
 RegisterPythonUDFFinale.patch, RegisterPythonUDFFinale3.patch, 
 RegisterPythonUDFFinale4.patch, RegisterPythonUDFFinale5.patch, 
 RegisterScriptUDFDefineParse.patch, scripting.tgz, scripting.tgz, test.zip


 It should be possible to write UDFs in scripting languages such as python, 
 ruby, etc.  This frees users from needing to compile Java, generate a jar, 
 etc.  It also opens Pig to programmers who prefer scripting languages over 
 Java.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1487) Replace bz with .bz in all the LoadFunc

2010-07-13 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12888182#action_12888182
 ] 

Ashutosh Chauhan commented on PIG-1487:
---

+1 

 Replace bz with .bz  in all the LoadFunc
 

 Key: PIG-1487
 URL: https://issues.apache.org/jira/browse/PIG-1487
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Jeff Zhang
Assignee: Jeff Zhang
 Fix For: 0.8.0

 Attachments: PIG_1487.patch


 This issue relates to PIG-1463. Thanks to Ashutosh for finding another place 
 in PigStorage that should be corrected. I checked all the LoadFuncs and found 
 that TextLoader also has the same problem. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword

2010-07-12 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12887283#action_12887283
 ] 

Ashutosh Chauhan commented on PIG-1249:
---

The map-reduce framework has a jira related to this issue: 
https://issues.apache.org/jira/browse/MAPREDUCE-1521 It has two implications 
for Pig:

1) We need to reconsider whether we still want Pig to set the number of 
reducers on the user's behalf. We can choose not to intelligently pick the # of 
reducers and instead let the framework fail any job which doesn't correctly 
specify the # of reducers. Then Pig is out of this guessing game, and users are 
forced by the framework to correctly specify the # of reducers. 

2) Now that the MR framework will fail jobs based on configured limits, 
operators where Pig does compute and set the number of reducers (like skewed 
join, etc.) should now be aware of those limits, so that the # of reducers 
computed by them falls within those limits.
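
Implication 2 boils down to clamping any operator-computed parallelism against 
the framework's configured ceiling. A hypothetical sketch (the limit handling 
is invented for illustration and is not Pig's actual code):

```java
public class ReducerEstimate {

    // Clamp an operator-computed reducer count (e.g. from skewed join's
    // sampling) to the framework-configured ceiling; a non-positive limit
    // means no ceiling is configured.
    static int clampReducers(int computed, int frameworkMax) {
        if (frameworkMax <= 0) {
            return computed;
        }
        return Math.min(computed, frameworkMax);
    }

    public static void main(String[] args) {
        System.out.println(clampReducers(500, 300)); // 300
        System.out.println(clampReducers(50, 300));  // 50
    }
}
```

With a guard like this, an estimate that exceeds the configured limit is 
silently capped instead of producing a job the framework would reject outright.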

 Safe-guards against misconfigured Pig scripts without PARALLEL keyword
 --

 Key: PIG-1249
 URL: https://issues.apache.org/jira/browse/PIG-1249
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Arun C Murthy
Assignee: Jeff Zhang
Priority: Critical
 Fix For: 0.8.0

 Attachments: PIG-1249-4.patch, PIG-1249.patch, PIG_1249_2.patch, 
 PIG_1249_3.patch


 It would be *very* useful for Pig to have safe-guards against naive scripts 
 which process a *lot* of data without the use of the PARALLEL keyword.
 We've seen a fair number of instances where naive users process huge 
 data-sets (10TB) with a badly mis-configured #reduces, e.g. 1 reduce. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1309) Map-side Cogroup

2010-07-09 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1309:
--

Status: Resolved  (was: Patch Available)
Resolution: Fixed

Patch checked-in to 0.7 branch as well.

 Map-side Cogroup
 

 Key: PIG-1309
 URL: https://issues.apache.org/jira/browse/PIG-1309
 Project: Pig
  Issue Type: Bug
  Components: impl
Reporter: Ashutosh Chauhan
Assignee: Ashutosh Chauhan
 Fix For: 0.8.0

 Attachments: mapsideCogrp.patch, pig-1309_1.patch, pig-1309_2.patch, 
 PIG_1309_7.patch


 In our never-ending quest to make Pig go faster, we want to parallelize as many 
 relational operations as possible. It's already possible to do Group-by ( 
 PIG-984 ) and Joins ( PIG-845 , PIG-554 ) purely map-side in Pig. This jira 
 is to add a map-side implementation of Cogroup in Pig. Details to follow.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1389) Implement Pig counter to track number of rows for each input files

2010-07-09 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12886904#action_12886904
 ] 

Ashutosh Chauhan commented on PIG-1389:
---

+1 

Discussed 3) with Richard offline. Though it would theoretically be better to 
determine the features from the fully compiled and optimized MR plan, that 
would be hard and may not be worth the complexity. So, in this first pass it 
is fine to mark those features while the MR plan's compilation is in progress. 
As a result, in a few corner cases the features marked for an MR Oper may not 
be correct. We will fix up those cases as and when they come up.

 Implement Pig counter to track number of rows for each input files 
 ---

 Key: PIG-1389
 URL: https://issues.apache.org/jira/browse/PIG-1389
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.7.0
Reporter: Richard Ding
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1389.patch, PIG-1389.patch, PIG-1389_1.patch, 
 PIG-1389_2.patch


 An MR job generated by Pig can have not only multiple outputs (in the case of 
 multiquery) but also multiple inputs (in the case of join or cogroup). In 
 both cases, the existing Hadoop counters (e.g. MAP_INPUT_RECORDS, 
 REDUCE_OUTPUT_RECORDS) cannot be used to count the number of records in a 
 given input or output.  PIG-1299 addressed the case of multiple outputs.  We 
 need to add new counters for jobs with multiple inputs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1491) Failure planning nested FOREACH with DISTINCT, POLoad cannot be cast to POLocalRearrange

2010-07-09 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12886906#action_12886906
 ] 

Ashutosh Chauhan commented on PIG-1491:
---

Scott,

It would be useful if you could also paste the Pig script that produced this 
exception.

 Failure planning nested FOREACH with DISTINCT, POLoad cannot be cast to 
 POLocalRearrange
 

 Key: PIG-1491
 URL: https://issues.apache.org/jira/browse/PIG-1491
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Scott Carey

 I have a failure that occurs during planning while using DISTINCT in a nested 
 FOREACH. 
 Caused by: java.lang.ClassCastException: 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLoad
  cannot be cast to 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.SecondaryKeyOptimizer.visitMROp(SecondaryKeyOptimizer.java:352)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceOper.visit(MapReduceOper.java:218)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceOper.visit(MapReduceOper.java:40)
 at 
 org.apache.pig.impl.plan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:67)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1449) RegExLoader hangs on lines that don't match the regular expression

2010-07-02 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12884551#action_12884551
 ] 

Ashutosh Chauhan commented on PIG-1449:
---

Reran the contrib tests. All passed. Patch committed. Thanks, Christian and 
Justin, for working on this!

 RegExLoader hangs on lines that don't match the regular expression
 --

 Key: PIG-1449
 URL: https://issues.apache.org/jira/browse/PIG-1449
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Justin Sanders
Priority: Minor
 Attachments: PIG-1449-RegExLoaderInfiniteLoopFix.patch, 
 RegExLoader.patch


 In the 0.7.0 changes to RegExLoader, a bug was introduced where the code 
 stays in the while loop if a line isn't matched.  Before 0.7.0, such lines 
 were skipped if they didn't match the regular expression.  The result is 
 that the mapper does not respond and times out with Task attempt_X 
 failed to report status for 600 seconds. Killing!.
 Here are the steps to recreate the bug:
 Create a text file in HDFS with the following lines:
 test1
 testA
 test2
 Run the following pig script:
 REGISTER /usr/local/pig/contrib/piggybank/java/piggybank.jar;
 test = LOAD '/path/to/test.txt' using 
 org.apache.pig.piggybank.storage.MyRegExLoader('(test\\d)') AS (line);
 dump test;
 Expected result:
 (test1)
 (test2)
 Actual result:
 The job fails to complete after the 600-second timeout waiting on the 
 mapper.  The mapper hangs at 33% since it can process the first line but 
 gets stuck in the while loop on the second line.
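The intended loop behavior can be sketched as follows. This is a simplified stand-in for the loader's read loop (class and method names are hypothetical, not the actual RegExLoader code): on a non-matching line the loop must advance to the next line instead of retrying the same one:

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch of the fix: skip lines that do not match the
// regular expression and stop only at end of input, rather than spinning
// forever on the first non-matching line.
public class SkipNonMatching {

    // Returns the next captured group, or null at end of input.
    public static String nextMatch(Iterator<String> lines, Pattern p) {
        while (lines.hasNext()) {        // advances on every iteration
            Matcher m = p.matcher(lines.next());
            if (m.matches()) {
                return m.group(1);       // emit the captured group
            }
            // non-matching line: fall through and read the next one
        }
        return null;                     // end of input reached
    }

    public static void main(String[] args) {
        Iterator<String> lines = Arrays.asList("test1", "testA", "test2").iterator();
        Pattern p = Pattern.compile("(test\\d)");
        System.out.println(nextMatch(lines, p)); // prints test1
        System.out.println(nextMatch(lines, p)); // prints test2 (testA skipped)
    }
}
```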

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1449) RegExLoader hangs on lines that don't match the regular expression

2010-07-02 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1449:
--

   Status: Resolved  (was: Patch Available)
Fix Version/s: 0.8.0
   Resolution: Fixed

 RegExLoader hangs on lines that don't match the regular expression
 --

 Key: PIG-1449
 URL: https://issues.apache.org/jira/browse/PIG-1449
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Justin Sanders
Priority: Minor
 Fix For: 0.8.0

 Attachments: PIG-1449-RegExLoaderInfiniteLoopFix.patch, 
 RegExLoader.patch


 In the 0.7.0 changes to RegExLoader, a bug was introduced where the code 
 stays in the while loop if a line isn't matched.  Before 0.7.0, such lines 
 were skipped if they didn't match the regular expression.  The result is 
 that the mapper does not respond and times out with Task attempt_X 
 failed to report status for 600 seconds. Killing!.
 Here are the steps to recreate the bug:
 Create a text file in HDFS with the following lines:
 test1
 testA
 test2
 Run the following pig script:
 REGISTER /usr/local/pig/contrib/piggybank/java/piggybank.jar;
 test = LOAD '/path/to/test.txt' using 
 org.apache.pig.piggybank.storage.MyRegExLoader('(test\\d)') AS (line);
 dump test;
 Expected result:
 (test1)
 (test2)
 Actual result:
 The job fails to complete after the 600-second timeout waiting on the 
 mapper.  The mapper hangs at 33% since it can process the first line but 
 gets stuck in the while loop on the second line.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1449) RegExLoader hangs on lines that don't match the regular expression

2010-07-02 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12884552#action_12884552
 ] 

Ashutosh Chauhan commented on PIG-1449:
---

@Christian,

It would definitely be useful to get the execution time of the tests down. 
Currently it takes a while to run all the Pig tests.

 RegExLoader hangs on lines that don't match the regular expression
 --

 Key: PIG-1449
 URL: https://issues.apache.org/jira/browse/PIG-1449
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Justin Sanders
Priority: Minor
 Fix For: 0.8.0

 Attachments: PIG-1449-RegExLoaderInfiniteLoopFix.patch, 
 RegExLoader.patch


 In the 0.7.0 changes to RegExLoader, a bug was introduced where the code 
 stays in the while loop if a line isn't matched.  Before 0.7.0, such lines 
 were skipped if they didn't match the regular expression.  The result is 
 that the mapper does not respond and times out with Task attempt_X 
 failed to report status for 600 seconds. Killing!.
 Here are the steps to recreate the bug:
 Create a text file in HDFS with the following lines:
 test1
 testA
 test2
 Run the following pig script:
 REGISTER /usr/local/pig/contrib/piggybank/java/piggybank.jar;
 test = LOAD '/path/to/test.txt' using 
 org.apache.pig.piggybank.storage.MyRegExLoader('(test\\d)') AS (line);
 dump test;
 Expected result:
 (test1)
 (test2)
 Actual result:
 The job fails to complete after the 600-second timeout waiting on the 
 mapper.  The mapper hangs at 33% since it can process the first line but 
 gets stuck in the while loop on the second line.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1309) Map-side Cogroup

2010-07-02 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1309:
--

Attachment: PIG_1309_7.patch

Backport of merge cogroup for the 0.7 branch. Since Hudson can test only 
trunk, I manually ran all the tests; all passed.

 Map-side Cogroup
 

 Key: PIG-1309
 URL: https://issues.apache.org/jira/browse/PIG-1309
 Project: Pig
  Issue Type: Bug
  Components: impl
Reporter: Ashutosh Chauhan
Assignee: Ashutosh Chauhan
 Fix For: 0.8.0

 Attachments: mapsideCogrp.patch, pig-1309_1.patch, pig-1309_2.patch, 
 PIG_1309_7.patch


 In the never-ending quest to make Pig go faster, we want to parallelize as 
 many relational operations as possible. It's already possible to do Group-by 
 (PIG-984) and Joins (PIG-845, PIG-554) purely on the map side in Pig. This 
 jira is to add a map-side implementation of Cogroup in Pig. Details to follow.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1424) Error logs of streaming should not be placed in output location

2010-07-01 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12884365#action_12884365
 ] 

Ashutosh Chauhan commented on PIG-1424:
---

This turns out to be much more involved than I initially thought. The 
assumption that the output/input location is a file-based path exists in more 
than one place in Pig. In particular, Streaming makes this assumption explicit 
and has it in its semantics. We need to be careful about streaming semantics 
before we fix this. More at: http://wiki.apache.org/pig/PigStreamingFunctionalSpec

 Error logs of streaming should not be placed in output location
 ---

 Key: PIG-1424
 URL: https://issues.apache.org/jira/browse/PIG-1424
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
 Fix For: 0.8.0


 This becomes a problem when the output location is anything other than a 
 filesystem. Output will be written to the DB, but where should the logs 
 generated by streaming go? Clearly, they can't be written into the DB. This 
 blocks PIG-1229, which introduces writing to a DB from Pig.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1449) RegExLoader hangs on lines that don't match the regular expression

2010-07-01 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1449:
--

Status: Open  (was: Patch Available)

 RegExLoader hangs on lines that don't match the regular expression
 --

 Key: PIG-1449
 URL: https://issues.apache.org/jira/browse/PIG-1449
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Justin Sanders
Priority: Minor
 Attachments: PIG-1449-RegExLoaderInfiniteLoopFix.patch, 
 RegExLoader.patch


 In the 0.7.0 changes to RegExLoader, a bug was introduced where the code 
 stays in the while loop if a line isn't matched.  Before 0.7.0, such lines 
 were skipped if they didn't match the regular expression.  The result is 
 that the mapper does not respond and times out with Task attempt_X 
 failed to report status for 600 seconds. Killing!.
 Here are the steps to recreate the bug:
 Create a text file in HDFS with the following lines:
 test1
 testA
 test2
 Run the following pig script:
 REGISTER /usr/local/pig/contrib/piggybank/java/piggybank.jar;
 test = LOAD '/path/to/test.txt' using 
 org.apache.pig.piggybank.storage.MyRegExLoader('(test\\d)') AS (line);
 dump test;
 Expected result:
 (test1)
 (test2)
 Actual result:
 The job fails to complete after the 600-second timeout waiting on the 
 mapper.  The mapper hangs at 33% since it can process the first line but 
 gets stuck in the while loop on the second line.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1449) RegExLoader hangs on lines that don't match the regular expression

2010-07-01 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1449:
--

Status: Patch Available  (was: Open)

Running through Hudson.

 RegExLoader hangs on lines that don't match the regular expression
 --

 Key: PIG-1449
 URL: https://issues.apache.org/jira/browse/PIG-1449
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Justin Sanders
Priority: Minor
 Attachments: PIG-1449-RegExLoaderInfiniteLoopFix.patch, 
 RegExLoader.patch


 In the 0.7.0 changes to RegExLoader, a bug was introduced where the code 
 stays in the while loop if a line isn't matched.  Before 0.7.0, such lines 
 were skipped if they didn't match the regular expression.  The result is 
 that the mapper does not respond and times out with Task attempt_X 
 failed to report status for 600 seconds. Killing!.
 Here are the steps to recreate the bug:
 Create a text file in HDFS with the following lines:
 test1
 testA
 test2
 Run the following pig script:
 REGISTER /usr/local/pig/contrib/piggybank/java/piggybank.jar;
 test = LOAD '/path/to/test.txt' using 
 org.apache.pig.piggybank.storage.MyRegExLoader('(test\\d)') AS (line);
 dump test;
 Expected result:
 (test1)
 (test2)
 Actual result:
 The job fails to complete after the 600-second timeout waiting on the 
 mapper.  The mapper hangs at 33% since it can process the first line but 
 gets stuck in the while loop on the second line.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1389) Implement Pig counter to track number of rows for each input files

2010-06-30 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12884116#action_12884116
 ] 

Ashutosh Chauhan commented on PIG-1389:
---

1.
{code}
+/**
+ * Returns the counter name for the given input file name
+ * 
+ * @param fname the input file name
+ * @return the counter name
+ */
+public static String getMultiInputsCounterName(String fname) {
+return MULTI_INPUTS_RECORD_COUNTER +
+new Path(fname).getName();
+}

{code}

It's dangerous to assume that the input is a file name. It may not be; it can 
be a jdbc location string. In particular, new Path(fname) parses fname and 
throws an exception if the String is not in the form it expects. So, at 
various places in the patch, don't assume the path refers to a file location; 
in particular, avoid using Path() and deal in Strings.
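A String-based alternative along these lines might look like the following. This is a sketch only: it reuses the constant name from the quoted snippet, but the value and the segment-extraction heuristic are assumptions, not the actual patch:

```java
// Illustrative sketch of the review suggestion: derive the counter name
// from the input location as a plain String instead of new Path(fname),
// so non-file locations (e.g. JDBC URLs) do not throw during parsing.
public class CounterNames {
    // Assumed prefix value; the real constant lives in Pig's codebase.
    static final String MULTI_INPUTS_RECORD_COUNTER = "MultiInputsCounter_";

    public static String getMultiInputsCounterName(String location) {
        // Take the last '/'-separated segment if one exists; otherwise
        // use the whole location string unchanged (e.g. a JDBC string).
        int slash = location.lastIndexOf('/');
        String name = (slash >= 0) ? location.substring(slash + 1) : location;
        return MULTI_INPUTS_RECORD_COUNTER + name;
    }

    public static void main(String[] args) {
        // A non-file location passes through without any path parsing.
        System.out.println(getMultiInputsCounterName("jdbc:mysql:inventory"));
    }
}
```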

2. In PigRecordReader, initialization of Counters should be done in 
initialize() instead of getCurrentValue(); that will avoid branching on every 
call of getCurrentValue().

3. Marking features in MRCompiler while compilation is still in progress may 
lead to incorrect results. We do a bunch of optimizations *after* the MR plan 
is constructed, during which the plan may get readjusted and features that 
were in one particular MROper may get pushed into a different MR Oper. A 
better way to do this marking is post-construction of the MR plan: have a 
visitor which walks the final MR plan and marks the features in those 
operators.

4. As an extension of 1., I think having a test for a non-file-based 
input/output location would really be useful. PIG-1229 would have made that 
super-easy.

 Implement Pig counter to track number of rows for each input files 
 ---

 Key: PIG-1389
 URL: https://issues.apache.org/jira/browse/PIG-1389
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.7.0
Reporter: Richard Ding
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1389.patch, PIG-1389.patch, PIG-1389_1.patch


 An MR job generated by Pig can have not only multiple outputs (in the case of 
 multiquery) but also multiple inputs (in the case of join or cogroup). In 
 both cases, the existing Hadoop counters (e.g. MAP_INPUT_RECORDS, 
 REDUCE_OUTPUT_RECORDS) cannot be used to count the number of records in a 
 given input or output.  PIG-1299 addressed the case of multiple outputs.  We 
 need to add new counters for jobs with multiple inputs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1466) Improve log messages for memory usage

2010-06-25 Thread Ashutosh Chauhan (JIRA)
Improve log messages for memory usage
-

 Key: PIG-1466
 URL: https://issues.apache.org/jira/browse/PIG-1466
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
Priority: Minor


For anything more than a moderately sized dataset, Pig usually emits the 
following messages:
{code}
2010-05-27 18:28:31,659 INFO org.apache.pig.impl.util.SpillableMemoryManager: 
low memory handler called (Usage
threshold exceeded) init = 4194304(4096K) used = 672012960(656262K) committed = 
954466304(932096K) max =
954466304(932096K)

2010-05-27 18:10:52,653 INFO org.apache.pig.impl.util.SpillableMemoryManager: 
low memory handler called (Collection
threshold exceeded) init = 4194304(4096K) used = 954466304(932096K) committed = 
954466304(932096K) max =
954466304(932096K)
{code}

This seems to confuse users a lot. Once these messages are printed, users tend 
to believe that Pig is having a hard time with memory, is spilling to disk, 
etc., but in fact Pig might be cruising along at ease. We should be a little 
more careful about what we print in the logs. Currently these are printed when 
a notification is sent by the JVM and some other conditions are met, which may 
not necessarily indicate a low-memory condition. Furthermore, with 
{{InternalCachedBag}} embraced everywhere in favor of {{DefaultBag}}, these 
messages have lost their usefulness. At the very least, we should lower the 
log level at which they are printed. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1463) Replace bz with .bz in setStoreLocation in PigStorage

2010-06-23 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12882017#action_12882017
 ] 

Ashutosh Chauhan commented on PIG-1463:
---

+1

 Replace bz with .bz in setStoreLocation in PigStorage 
 --

 Key: PIG-1463
 URL: https://issues.apache.org/jira/browse/PIG-1463
 Project: Pig
  Issue Type: Bug
Reporter: Jeff Zhang
Assignee: Jeff Zhang
 Attachments: PIG_1463.patch




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1462) No informative error message on parse problem

2010-06-22 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12881550#action_12881550
 ] 

Ashutosh Chauhan commented on PIG-1462:
---

This has come up before. As noted on PIG-798, the correct way to achieve this is
{code}
grunt> in = load 'data' using PigStorage() as (m:map[]);
grunt> tags = foreach in generate (tuple(chararray)) m#'k1' as tagtuple;
grunt> dump tags;
{code}
 
We probably need to add a note about casting in the cookbook. We also need to 
generate a better error message.

 No informative error message on parse problem
 -

 Key: PIG-1462
 URL: https://issues.apache.org/jira/browse/PIG-1462
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Ankur

 Consider the following script
 in = load 'data' using PigStorage() as (m:map[]);
 tags = foreach in generate m#'k1' as (tagtuple: tuple(chararray));
 dump tags;
 This throws the following error message that does not really say that this is 
 a bad declaration
 org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error during 
 parsing. Encountered  at line 2, column 38.
 Was expecting one of:
 
   at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1170)
   at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1114)
   at org.apache.pig.PigServer.registerQuery(PigServer.java:425)
   at 
 org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:737)
   at 
 org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:324)
   at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162)
   at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138)
   at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
   at org.apache.pig.Main.main(Main.java:391)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1427) Monitor and kill runaway UDFs

2010-06-21 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12880881#action_12880881
 ] 

Ashutosh Chauhan commented on PIG-1427:
---

It seems you missed the ivy.xml bits in the latest patch. +1 otherwise; please 
commit if tests pass.

 Monitor and kill runaway UDFs
 -

 Key: PIG-1427
 URL: https://issues.apache.org/jira/browse/PIG-1427
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.8.0
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
 Attachments: guava-r03.jar, monitoredUdf.patch, monitoredUdf.patch, 
 PIG-1427.diff, PIG-1427.diff


 As a safety measure, it is sometimes useful to monitor UDFs as they execute. 
 It is often preferable to time out a runaway evaluation and return null or 
 some other default value instead of killing the job. We have in the past seen 
 complex regular expressions lead to job failures due to just half a dozen 
 (out of millions of) particularly obnoxious strings.
 It would be great to give Pig users a lightweight way of enabling UDF 
 monitoring.
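A rough sketch of such lightweight monitoring, assuming a timeout-and-fallback policy. The names and the executor-based approach are illustrative; Pig's actual MonitoredUDF implementation may differ:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Illustrative sketch: run an evaluation under a time limit and fall back
// to a default value instead of letting a runaway call hang the task.
public class TimedEval {
    // Daemon threads so a stuck evaluation never blocks JVM shutdown.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r);
        t.setDaemon(true);
        return t;
    });

    public static <T> T evalWithTimeout(Callable<T> udf, long millis, T fallback) {
        Future<T> f = POOL.submit(udf);
        try {
            return f.get(millis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | InterruptedException | ExecutionException e) {
            f.cancel(true);      // interrupt the runaway evaluation
            return fallback;     // default value instead of failing the job
        }
    }

    public static void main(String[] args) {
        System.out.println(evalWithTimeout(() -> 6 * 7, 1000L, -1)); // prints 42
    }
}
```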

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1449) RegExLoader hangs on lines that don't match the regular expression

2010-06-14 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12878663#action_12878663
 ] 

Ashutosh Chauhan commented on PIG-1449:
---

Justin,

Good catch. Can you fold your test case into junit in one of 
piggybank/test/storage/TestRegExLoader or TestMyRegExLoader? That way we'll 
have a regression test for the issue.

 RegExLoader hangs on lines that don't match the regular expression
 --

 Key: PIG-1449
 URL: https://issues.apache.org/jira/browse/PIG-1449
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Justin Sanders
Priority: Minor
 Attachments: RegExLoader.patch


 In the 0.7.0 changes to RegExLoader, a bug was introduced where the code 
 stays in the while loop if a line isn't matched.  Before 0.7.0, such lines 
 were skipped if they didn't match the regular expression.  The result is 
 that the mapper does not respond and times out with Task attempt_X 
 failed to report status for 600 seconds. Killing!.
 Here are the steps to recreate the bug:
 Create a text file in HDFS with the following lines:
 test1
 testA
 test2
 Run the following pig script:
 REGISTER /usr/local/pig/contrib/piggybank/java/piggybank.jar;
 test = LOAD '/path/to/test.txt' using 
 org.apache.pig.piggybank.storage.MyRegExLoader('(test\\d)') AS (line);
 dump test;
 Expected result:
 (test1)
 (test2)
 Actual result:
 The job fails to complete after the 600-second timeout waiting on the 
 mapper.  The mapper hangs at 33% since it can process the first line but 
 gets stuck in the while loop on the second line.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1448) Detach tuple from inner plans of physical operator

2010-06-12 Thread Ashutosh Chauhan (JIRA)
Detach tuple from inner plans of physical operator 
---

 Key: PIG-1448
 URL: https://issues.apache.org/jira/browse/PIG-1448
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0, 0.6.0, 0.5.0, 0.4.0, 0.3.0, 0.2.0, 0.1.0
Reporter: Ashutosh Chauhan
 Fix For: 0.8.0


This is a follow-up on PIG-1446, which addresses this general problem only for 
a specific instance of ForEach. In general, all the physical operators which 
can have inner plans are vulnerable to this; a few of them are 
POLocalRearrange, POFilter, POCollectedGroup, etc. We need to fix all of these.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1448) Detach tuple from inner plans of physical operator

2010-06-12 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12878294#action_12878294
 ] 

Ashutosh Chauhan commented on PIG-1448:
---

The problem here is not as bad as it may sound. All the physical operators 
already detach the input tuple after they are done with it: in getNext(), a 
physical operator first calls processInput(), which attaches the input tuple 
and then detaches it at the end. Physical operators contained within inner 
plans do the same. The problem arises when there is a BinCond: Pig 
short-circuits one of the branches of the inner plan, in which case getNext() 
of the operator on that branch is never called and thus the tuple is never 
detached. Note that in these cases the tuple was already attached, by the 
operator which holds the inner plan, to all the roots of that plan. So in this 
particular use case the tuple got attached but was never detached, leaving a 
stray reference which cannot be GC'ed. This still will not be a problem if 
there is only a single pipeline in the mapper or reducer, since the next time 
a new key/value pair is read and run through the pipeline, the reference is 
overwritten and the tuple that was not detached in the previous run can be 
GC'ed. Only in a multi-query-optimized script may the same pipeline not run 
when the next key/value pair is read in map() or reduce(), so the stray 
reference is never overwritten. If all of these conditions are met and the 
tuple itself is large or contains large bags, we may end up with an OOME. 
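The attach/detach discipline at issue can be illustrated with a toy operator (a hypothetical class, not Pig's actual physical-operator API): detaching on every exit path covers the short-circuited branch as well as the normal one.

```java
// Illustrative sketch: an operator that attaches its input tuple to the
// roots of an inner plan must also detach it on every code path, including
// short-circuited branches; otherwise the tuple stays reachable and cannot
// be garbage collected, which can lead to OOME for large tuples or bags.
public class InnerPlanOp {
    private Object attachedInput;  // stands in for the attached Tuple

    public Object process(Object input, boolean shortCircuit) {
        attachedInput = input;             // attach to the inner-plan roots
        try {
            if (shortCircuit) {
                return null;               // branch whose getNext() never runs
            }
            return attachedInput;          // normal evaluation path
        } finally {
            attachedInput = null;          // detach on every exit path
        }
    }

    public boolean holdsReference() {
        return attachedInput != null;
    }

    public static void main(String[] args) {
        InnerPlanOp op = new InnerPlanOp();
        op.process("tuple", true);               // short-circuited branch
        System.out.println(op.holdsReference()); // prints false
    }
}
```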

 Detach tuple from inner plans of physical operator 
 ---

 Key: PIG-1448
 URL: https://issues.apache.org/jira/browse/PIG-1448
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.1.0, 0.2.0, 0.3.0, 0.4.0, 0.5.0, 0.6.0, 0.7.0
Reporter: Ashutosh Chauhan
 Fix For: 0.8.0


 This is a follow-up on PIG-1446, which addresses this general problem only 
 for a specific instance of ForEach. In general, all the physical operators 
 which can have inner plans are vulnerable to this; a few of them are 
 POLocalRearrange, POFilter, POCollectedGroup, etc. We need to fix all of these.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1442) java.lang.OutOfMemoryError: Java heap space (Reopen of PIG-766)

2010-06-12 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12878336#action_12878336
 ] 

Ashutosh Chauhan commented on PIG-1442:
---

This looks like a variant of PIG-1446 and PIG-1448 PigCombiner attaches the 
tuple to the roots of combine plan, but never detaches them. PODemux also 
attach the tuple to the inner plan, but never detaches it. Note that 
PigCombiner may also contain multiple pipelines depending on number of 
operations done inside For Each resulting in similar problems as described in 
PIG-1448.

 java.lang.OutOfMemoryError: Java heap space (Reopen of PIG-766)
 ---

 Key: PIG-1442
 URL: https://issues.apache.org/jira/browse/PIG-1442
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.2.0, 0.7.0
 Environment: Apache-Hadoop 0.20.2 + Pig 0.7.0 and also for 0.8.0-dev 
 (18/may)
 Hadoop-0.18.3 (cloudera RPMs) + PIG 0.2.0
Reporter: Dirk Schmid

 As mentioned by Ashutosh, this is a reopen of 
 https://issues.apache.org/jira/browse/PIG-766 because there is still a 
 problem which causes Pig to scale only with memory.
 For convenience, here is the last entry of the PIG-766 Jira ticket:
 {quote}1. Are you getting the exact same stack trace as mentioned in the 
 jira?{quote} Yes the same and some similar traces:
 {noformat}
java.lang.OutOfMemoryError: Java heap space
  at java.util.Arrays.copyOf(Arrays.java:2786)
  at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:94)
  at java.io.DataOutputStream.write(DataOutputStream.java:90)
  at java.io.FilterOutputStream.write(FilterOutputStream.java:80)
  at org.apache.pig.data.DataReaderWriter.writeDatum(DataReaderWriter.java:279)
  at org.apache.pig.data.DefaultTuple.write(DefaultTuple.java:264)
  at org.apache.pig.data.DefaultAbstractBag.write(DefaultAbstractBag.java:249)
  at org.apache.pig.data.DataReaderWriter.writeDatum(DataReaderWriter.java:214)
  at org.apache.pig.data.DefaultTuple.write(DefaultTuple.java:264)
  at org.apache.pig.data.DataReaderWriter.writeDatum(DataReaderWriter.java:209)
  at org.apache.pig.data.DefaultTuple.write(DefaultTuple.java:264)
  at org.apache.pig.impl.io.PigNullableWritable.write(PigNullableWritable.java:123)
  at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90)
  at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77)
  at org.apache.hadoop.mapred.IFile$Writer.append(IFile.java:179)
  at org.apache.hadoop.mapred.Task$CombineOutputCollector.collect(Task.java:880)
  at org.apache.hadoop.mapred.Task$NewCombinerRunner$OutputConverter.write(Task.java:1201)
  at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
  at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.processOnePackageOutput(PigCombiner.java:199)
  at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:161)
  at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:51)
  at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
  at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1222)
  at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.doInMemMerge(ReduceTask.java:2563)
  at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.run(ReduceTask.java:2501)
java.lang.OutOfMemoryError: Java heap space
  at org.apache.pig.data.DefaultTuple.<init>(DefaultTuple.java:58)
  at org.apache.pig.data.DefaultTupleFactory.newTuple(DefaultTupleFactory.java:35)
  at org.apache.pig.data.DataReaderWriter.bytesToTuple(DataReaderWriter.java:61)
  at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:142)
  at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:136)
  at org.apache.pig.data.DefaultAbstractBag.readFields(DefaultAbstractBag.java:263)
  at org.apache.pig.data.DataReaderWriter.bytesToBag(DataReaderWriter.java:71)
  at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:145)
  at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:136)
  at org.apache.pig.data.DataReaderWriter.bytesToTuple(DataReaderWriter.java:63)
  at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:142)
  at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:136)
  at
{noformat}

[jira] Updated: (PIG-1446) OOME in a query having a bincond in the inner plan of a Foreach.

2010-06-11 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1446:
--

   Status: Resolved  (was: Patch Available)
Fix Version/s: 0.8.0
   0.7.0
   Resolution: Fixed

As usual, Hudson is not responding. I manually ran all the unit tests; all of 
them passed. Committed to both trunk and 0.7.

 OOME in a query having a bincond in the inner plan of a Foreach.
 

 Key: PIG-1446
 URL: https://issues.apache.org/jira/browse/PIG-1446
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
Assignee: Ashutosh Chauhan
 Fix For: 0.8.0, 0.7.0

 Attachments: pig-1446.patch


 This is seen when For Each is following a group-by and there is a bin cond as 
 an inner plan of For Each.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-1446) OOME in a query having a bincond in the inner plan of a Foreach.

2010-06-10 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan reassigned PIG-1446:
-

Assignee: Ashutosh Chauhan

 OOME in a query having a bincond in the inner plan of a Foreach.
 

 Key: PIG-1446
 URL: https://issues.apache.org/jira/browse/PIG-1446
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
Assignee: Ashutosh Chauhan
 Attachments: pig-1446.patch


 This is seen when For Each is following a group-by and there is a bin cond as 
 an inner plan of For Each.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1446) OOME in a query having a bincond in the inner plan of a Foreach.

2010-06-10 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1446:
--

Attachment: pig-1446.patch

The sequence of events is as follows:
1) The MultiQuery optimizer combined 30 group-bys into one reducer, so there are 30 
pipelines in the reducer.
2) Each of these group-bys has a ForEach after it.
3) The ForEach has a bincond in its inner plan.
4) The group-bys resulted in large bags (tens of millions of records).
5) The tuple containing the group and the bag is attached to the roots of the inner plan of the ForEach.
6) The ForEach pulled the tuples through its leaves.
7) Due to short-circuiting in the bincond, one branch of the plan is never pulled, 
resulting in a stray reference to a bag which was actually not needed.
8) Because the MultiQuery optimizer combined 30 group-bys, we had many such bags 
hanging around, eating up all the memory.

Fix: Detach tuples from the roots once the ForEach is done with them.
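
The attach/detach idea behind the fix can be sketched as follows. This is a minimal illustration only; the class and method names (InnerPlanRoot, attachInput, detachInput) are hypothetical stand-ins, not Pig's actual PhysicalOperator API.

```java
import java.util.ArrayList;
import java.util.List;

// A root of a ForEach inner plan that may hold a reference to a huge bag.
class InnerPlanRoot {
    private Object input;

    void attachInput(Object tuple) { this.input = tuple; }

    // Without this call, a branch short-circuited by a bincond keeps
    // its attached tuple (and the bag inside it) reachable.
    void detachInput() { this.input = null; }

    boolean holdsInput() { return input != null; }
}

public class ForEachSketch {
    // Attach the incoming tuple to every root, process, then detach
    // so no stray references pin large bags in memory.
    public static boolean processAndDetach(List<InnerPlanRoot> roots, Object tuple) {
        for (InnerPlanRoot r : roots) r.attachInput(tuple);
        // ... results are pulled through the leaves; a bincond may never
        // visit some roots, leaving their inputs referenced ...
        for (InnerPlanRoot r : roots) r.detachInput();  // the fix
        for (InnerPlanRoot r : roots) if (r.holdsInput()) return false;
        return true;  // no stray references remain
    }

    public static void main(String[] args) {
        List<InnerPlanRoot> roots = new ArrayList<>();
        roots.add(new InnerPlanRoot());
        roots.add(new InnerPlanRoot());
        System.out.println(processAndDetach(roots, new Object()));
    }
}
```

With 30 merged pipelines, each detached root is one fewer bag the garbage collector must keep alive.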

 OOME in a query having a bincond in the inner plan of a Foreach.
 

 Key: PIG-1446
 URL: https://issues.apache.org/jira/browse/PIG-1446
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
 Attachments: pig-1446.patch


 This is seen when For Each is following a group-by and there is a bin cond as 
 an inner plan of For Each.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1446) OOME in a query having a bincond in the inner plan of a Foreach.

2010-06-10 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1446:
--

Status: Patch Available  (was: Open)

 OOME in a query having a bincond in the inner plan of a Foreach.
 

 Key: PIG-1446
 URL: https://issues.apache.org/jira/browse/PIG-1446
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
Assignee: Ashutosh Chauhan
 Attachments: pig-1446.patch


 This is seen when For Each is following a group-by and there is a bin cond as 
 an inner plan of For Each.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1428) Add getPigStatusReporter() to PigHadoopLogger

2010-06-10 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12877616#action_12877616
 ] 

Ashutosh Chauhan commented on PIG-1428:
---

I propose a slightly different approach here. Instead of adding 
getPigStatusReporter() to the PigLogger interface and the corresponding 
implementation in PigHadoopLogger, we can add a static singleton method to 
PigStatusReporter and also add a setContext(TaskInputOutputContext context) method. We 
can then set the context in the map() and reduce() functions, and users will have 
full access to the reporter object through the static method. This will allow 
us to keep error logging separate from status reporting. 
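
A minimal sketch of the proposed shape, assuming only what is described above (the real PigStatusReporter wraps Hadoop's TaskInputOutputContext; here the context is a plain Object to keep the example self-contained):

```java
// Hedged sketch of the proposed static-singleton reporter.
public class StatusReporterSketch {
    private static final StatusReporterSketch INSTANCE = new StatusReporterSketch();

    // Stands in for Hadoop's TaskInputOutputContext in this sketch.
    private Object context;

    private StatusReporterSketch() {}

    // Static access point: UDFs can reach the reporter without any
    // PigLogger involvement.
    public static StatusReporterSketch getInstance() { return INSTANCE; }

    // Called once from map()/reduce() so counters/progress are available later.
    public void setContext(Object ctx) { this.context = ctx; }

    public boolean hasContext() { return context != null; }
}
```

The design point is separation of concerns: the logger stays an error-handling facility, while status reporting gets its own explicit entry point.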

 Add getPigStatusReporter() to PigHadoopLogger
 -

 Key: PIG-1428
 URL: https://issues.apache.org/jira/browse/PIG-1428
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.8.0

 Attachments: PIG-1428.patch, PIG-1428.patch


 Without this getter method, its not possible to get counters, report progress 
 etc. from UDFs. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1428) Add getPigStatusReporter() to PigHadoopLogger

2010-06-10 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12877591#action_12877591
 ] 

Ashutosh Chauhan commented on PIG-1428:
---

So, I read through PIG-889. It seems there never was a documented way to 
use counters, reporters, etc. from UDFs and Load/Store Funcs. Actually, there is a 
hacky way to do it, which exists in DefaultAbstractBag.java:
{code}
protected void incSpillCount(Enum counter) {
    // Increment the spill count.
    // "warn" is a misnomer: the function updates the counter. If the update
    // fails, it dumps a warning.
    PigHadoopLogger.getInstance().warn(this, "Spill counter incremented", counter);
}
{code}
But in PIG-889 Santhosh has argued against this (mis)use of PigLogger. I 
think we need to provide a formal way for Pig users to access counters and 
reporters from our interfaces (UDFs, Load/Store Funcs), since PigHadoopLogger is designed for 
error handling (warning aggregation in particular) and not for this purpose. 
And we should mark this class as internal-only before someone starts using it. 
By the same argument, the above method, where Pig internally makes use of its 
own counters, is flawed and needs to be corrected.

 Add getPigStatusReporter() to PigHadoopLogger
 -

 Key: PIG-1428
 URL: https://issues.apache.org/jira/browse/PIG-1428
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.8.0

 Attachments: PIG-1428.patch, PIG-1428.patch


 Without this getter method, its not possible to get counters, report progress 
 etc. from UDFs. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1438) [Performance] MultiQueryOptimizer should also merge DISTINCT jobs

2010-06-09 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12877150#action_12877150
 ] 

Ashutosh Chauhan commented on PIG-1438:
---

+1 please commit.

 [Performance] MultiQueryOptimizer should also merge DISTINCT jobs
 -

 Key: PIG-1438
 URL: https://issues.apache.org/jira/browse/PIG-1438
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.7.0
Reporter: Richard Ding
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1438.patch, PIG-1438_1.patch


 Current implementation doesn't merge jobs derived from DISTINCT statements. 
 The reason is that DISTINCT jobs are implemented using a special combiner 
 (DistinctCombiner). But we should be able to merge jobs that have the same 
 type of combiner (e.g. merge multiple DISTINCT jobs into one).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1427) Monitor and kill runaway UDFs

2010-06-08 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12876763#action_12876763
 ] 

Ashutosh Chauhan commented on PIG-1427:
---

@Dmitriy,

Occupied with some work. Will get back to it sometime later this week.  

 Monitor and kill runaway UDFs
 -

 Key: PIG-1427
 URL: https://issues.apache.org/jira/browse/PIG-1427
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.8.0
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
 Attachments: guava-r03.jar, monitoredUdf.patch, monitoredUdf.patch, 
 PIG-1427.diff


 As a safety measure, it is sometimes useful to monitor UDFs as they execute. 
 It is often preferable to return null or some other default value instead of 
 timing out a runaway evaluation and killing a job. We have in the past seen 
 complex regular expressions lead to job failures due to just half a dozen 
 (out of millions) particularly obnoxious strings.
 It would be great to give Pig users a lightweight way of enabling UDF 
 monitoring.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-283) Allow to set arbitrary jobconf key-value pairs inside pig program

2010-06-04 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-283:
-

  Status: Resolved  (was: Patch Available)
Release Note: 
For documentation:

After this patch, it becomes possible to set key-value pairs in 
the script as follows: 
{code}
set mapred.map.tasks.speculative.execution false
set pig.logfile mylogfile.log
set my.arbitrary.key my.arbitrary.value
{code}
These key-value pairs will be put into the job conf by Pig. This is a script-wide 
setting: if a value is defined multiple times for a key in the script, the 
last one takes effect, and that value will be set for all 
the jobs generated by the script. 
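
The last-one-wins semantics can be illustrated with plain java.util.Properties, which is roughly how such settings flow into the job conf. This is a simplified sketch, not Pig's actual code path:

```java
import java.util.Properties;

public class SetSemanticsSketch {
    // Apply 'set key value' statements in script order; a later value
    // for the same key overwrites any earlier one.
    public static Properties apply(String[][] setStatements) {
        Properties conf = new Properties();
        for (String[] kv : setStatements) {
            conf.setProperty(kv[0], kv[1]);  // last definition wins
        }
        return conf;
    }

    public static void main(String[] args) {
        Properties conf = apply(new String[][] {
            {"pig.logfile", "first.log"},
            {"pig.logfile", "mylogfile.log"},  // this one takes effect
        });
        System.out.println(conf.getProperty("pig.logfile"));  // prints mylogfile.log
    }
}
```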
  Resolution: Fixed

Re-ran all the test reported by Hudson as failures. All of them passed. Patch 
committed.



 Allow to set arbitrary jobconf key-value pairs inside pig program
 -

 Key: PIG-283
 URL: https://issues.apache.org/jira/browse/PIG-283
 Project: Pig
  Issue Type: New Feature
  Components: grunt
Affects Versions: 0.7.0
Reporter: Christian Kunz
Assignee: Ashutosh Chauhan
 Fix For: 0.8.0

 Attachments: pig-282.patch


 It would be useful to be able to set arbitrary JobConf key-value pairs inside 
 a pig program (e.g. in front of a COGROUP statement).
 I wonder whether the simplest way to add this feature is by expanding the 
 'set' command functionality.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1433) pig should create success file if mapreduce.fileoutputcommitter.marksuccessfuljobs is true

2010-06-03 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12875319#action_12875319
 ] 

Ashutosh Chauhan commented on PIG-1433:
---

+1 for the commit. A couple of notes for the future:
* Since this is related to a Hadoop property, we should consider removing this 
from the Pig codebase when MAPREDUCE-1447 and MAPREDUCE-947 are fixed.
* We have a lot of constant strings in our codebase. For the sake of clean code, 
we should put all of those public static final strings in one top-level interface 
called Constants. This should be part of a separate code clean-up jira.

 pig should create success file if 
 mapreduce.fileoutputcommitter.marksuccessfuljobs is true
 --

 Key: PIG-1433
 URL: https://issues.apache.org/jira/browse/PIG-1433
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
 Fix For: 0.8.0

 Attachments: PIG-1433.patch


 pig should create success file if 
 mapreduce.fileoutputcommitter.marksuccessfuljobs is true

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1433) pig should create success file if mapreduce.fileoutputcommitter.marksuccessfuljobs is true

2010-06-03 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12875326#action_12875326
 ] 

Ashutosh Chauhan commented on PIG-1433:
---

My point was to have all constant strings in one place instead of each class 
having some of them. It could be either an interface or a class. If an interface is 
considered an anti-pattern, doing it in a class is fine too.

 pig should create success file if 
 mapreduce.fileoutputcommitter.marksuccessfuljobs is true
 --

 Key: PIG-1433
 URL: https://issues.apache.org/jira/browse/PIG-1433
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
 Fix For: 0.8.0

 Attachments: PIG-1433.patch


 pig should create success file if 
 mapreduce.fileoutputcommitter.marksuccessfuljobs is true

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1437) [Optimization] Rewrite GroupBy-Foreach-flatten(group) to Distinct

2010-06-03 Thread Ashutosh Chauhan (JIRA)
[Optimization] Rewrite GroupBy-Foreach-flatten(group) to Distinct
-

 Key: PIG-1437
 URL: https://issues.apache.org/jira/browse/PIG-1437
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
Priority: Minor




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1437) [Optimization] Rewrite GroupBy-Foreach-flatten(group) to Distinct

2010-06-03 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1437:
--

Release Note:   (was: Its possible to rewrite queries like this
{code}
A = load 'data' as (name,age);
B = group A by (name,age);
C = foreach B generate group.name, group.age;
dump C;
{code}
or
{code} 
A = load 'data' as (name,age);
B = group A by (name,age);
C = foreach B generate flatten(group);
dump C;
{code}

to
{code}
A = load 'data' as (name,age);
B = distinct A;
dump B;
{code}

This could only be done if no columns within the bags are referenced 
subsequently in the script. Since in the Pig-Hadoop world DISTINCT will be executed 
more efficiently than group-by, this will be a huge win.)
 Description: 
It's possible to rewrite queries like this
{code}
A = load 'data' as (name,age);
B = group A by (name,age);
C = foreach B generate group.name, group.age;
dump C;
{code}
or
{code} 
A = load 'data' as (name,age);
B = group A by (name,age);
C = foreach B generate flatten(group);
dump C;
{code}

to
{code}
A = load 'data' as (name,age);
B = distinct A;
dump B;
{code}

This could only be done if no columns within the bags are referenced 
subsequently in the script. Since in the Pig-Hadoop world DISTINCT will be executed 
more efficiently than group-by, this will be a huge win. 

 [Optimization] Rewrite GroupBy-Foreach-flatten(group) to Distinct
 -

 Key: PIG-1437
 URL: https://issues.apache.org/jira/browse/PIG-1437
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
Priority: Minor

 It's possible to rewrite queries like this
 {code}
 A = load 'data' as (name,age);
 B = group A by (name,age);
 C = foreach B generate group.name, group.age;
 dump C;
 {code}
 or
 {code} 
 A = load 'data' as (name,age);
 B = group A by (name,age);
 C = foreach B generate flatten(group);
 dump C;
 {code}
 to
 {code}
 A = load 'data' as (name,age);
 B = distinct A;
 dump B;
 {code}
 This could only be done if no columns within the bags are referenced 
 subsequently in the script. Since in the Pig-Hadoop world DISTINCT will be 
 executed more efficiently than group-by, this will be a huge win. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-283) Allow to set arbitrary jobconf key-value pairs inside pig program

2010-05-28 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12873095#action_12873095
 ] 

Ashutosh Chauhan commented on PIG-283:
--

Seems Hudson didn't fully recover from its long hospital trip. All failures are 
unrelated and due to port conflicts. Patch is ready for review.

 Allow to set arbitrary jobconf key-value pairs inside pig program
 -

 Key: PIG-283
 URL: https://issues.apache.org/jira/browse/PIG-283
 Project: Pig
  Issue Type: New Feature
  Components: grunt
Affects Versions: 0.7.0
Reporter: Christian Kunz
Assignee: Ashutosh Chauhan
 Fix For: 0.8.0

 Attachments: pig-282.patch


 It would be useful to be able to set arbitrary JobConf key-value pairs inside 
 a pig program (e.g. in front of a COGROUP statement).
 I wonder whether the simplest way to add this feature is by expanding the 
 'set' command functionality.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1427) Monitor and kill runaway UDFs

2010-05-27 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12872303#action_12872303
 ] 

Ashutosh Chauhan commented on PIG-1427:
---

1. You didn't pay heed to my request for incrementing a counter when a UDF times out 
or throws an exception :) I think that will be pretty useful for users to know 
how many faulty records there are in the dataset which can't be processed by 
the UDF.
2. In getDefaultValue() there seems to be an inconsistency among the different 
if statements. I guess you need to make a distinction between the Integer[] and 
Integer return types and then return the appropriate return value.
3. Doing svn co; patch -p0 < monitoredUDF.patch; ant jar results in a build 
failure. It seems ivy is not pulling the guava lib.
4. Since it's a user-facing new interface, having a stability/visibility tag would 
really be useful.
5. Since it spawns a new thread for every exec() call, I assume it will have 
some overhead. If you have done some comparison or have numbers for that, it 
would be great if you could share them.

 Monitor and kill runaway UDFs
 -

 Key: PIG-1427
 URL: https://issues.apache.org/jira/browse/PIG-1427
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.8.0
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
 Attachments: monitoredUdf.patch, monitoredUdf.patch


 As a safety measure, it is sometimes useful to monitor UDFs as they execute. 
 It is often preferable to return null or some other default value instead of 
 timing out a runaway evaluation and killing a job. We have in the past seen 
 complex regular expressions lead to job failures due to just half a dozen 
 (out of millions) particularly obnoxious strings.
 It would be great to give Pig users a lightweight way of enabling UDF 
 monitoring.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-283) Allow to set arbitrary jobconf key-value pairs inside pig program

2010-05-27 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12872315#action_12872315
 ] 

Ashutosh Chauhan commented on PIG-283:
--

The proposal here is as suggested in the description: expand the set command so that 
it can take arbitrary key-value pairs and pass them on to the job conf.

 Allow to set arbitrary jobconf key-value pairs inside pig program
 -

 Key: PIG-283
 URL: https://issues.apache.org/jira/browse/PIG-283
 Project: Pig
  Issue Type: New Feature
  Components: grunt
Reporter: Christian Kunz

 It would be useful to be able to set arbitrary JobConf key-value pairs inside 
 a pig program (e.g. in front of a COGROUP statement).
 I wonder whether the simplest way to add this feature is by expanding the 
 'set' command functionality.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-283) Allow to set arbitrary jobconf key-value pairs inside pig program

2010-05-27 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-283:
-

Attachment: pig-282.patch

Patch as suggested in the previous comment. This will let users add or override 
key-value pairs in the job conf through grunt or through a script, like:
{code}
grunt> set mapred.map.tasks.speculative.execution false
grunt> set pig.logfile mylogfile.log
grunt> set my.arbitrary.key my.arbitrary.value
{code}

 Allow to set arbitrary jobconf key-value pairs inside pig program
 -

 Key: PIG-283
 URL: https://issues.apache.org/jira/browse/PIG-283
 Project: Pig
  Issue Type: New Feature
  Components: grunt
Reporter: Christian Kunz
 Attachments: pig-282.patch


 It would be useful to be able to set arbitrary JobConf key-value pairs inside 
 a pig program (e.g. in front of a COGROUP statement).
 I wonder whether the simplest way to add this feature is by expanding the 
 'set' command functionality.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-283) Allow to set arbitrary jobconf key-value pairs inside pig program

2010-05-27 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan reassigned PIG-283:


Assignee: Ashutosh Chauhan

 Allow to set arbitrary jobconf key-value pairs inside pig program
 -

 Key: PIG-283
 URL: https://issues.apache.org/jira/browse/PIG-283
 Project: Pig
  Issue Type: New Feature
  Components: grunt
Reporter: Christian Kunz
Assignee: Ashutosh Chauhan
 Attachments: pig-282.patch


 It would be useful to be able to set arbitrary JobConf key-value pairs inside 
 a pig program (e.g. in front of a COGROUP statement).
 I wonder whether the simplest way to add this feature is by expanding the 
 'set' command functionality.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1428) Add getPigStatusReporter() to PigHadoopLogger

2010-05-27 Thread Ashutosh Chauhan (JIRA)
Add getPigStatusReporter() to PigHadoopLogger
-

 Key: PIG-1428
 URL: https://issues.apache.org/jira/browse/PIG-1428
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
 Fix For: 0.8.0


Without this getter method, its not possible to get counters, report progress 
etc. from UDFs. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-283) Allow to set arbitrary jobconf key-value pairs inside pig program

2010-05-27 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-283:
-

   Status: Patch Available  (was: Open)
Affects Version/s: 0.7.0
Fix Version/s: 0.8.0

 Allow to set arbitrary jobconf key-value pairs inside pig program
 -

 Key: PIG-283
 URL: https://issues.apache.org/jira/browse/PIG-283
 Project: Pig
  Issue Type: New Feature
  Components: grunt
Affects Versions: 0.7.0
Reporter: Christian Kunz
Assignee: Ashutosh Chauhan
 Fix For: 0.8.0

 Attachments: pig-282.patch


 It would be useful to be able to set arbitrary JobConf key-value pairs inside 
 a pig program (e.g. in front of a COGROUP statement).
 I wonder whether the simplest way to add this feature is by expanding the 
 'set' command functionality.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1427) Monitor and kill runaway UDFs

2010-05-27 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12872862#action_12872862
 ] 

Ashutosh Chauhan commented on PIG-1427:
---

*  Filed PIG-1428 for it.
*  Neat workaround.
*  I guess checking in lib/ is fine. They are using APL.
*  Performance numbers look good. Initially, let's not turn monitoring on by default. 
Later, as we gain more experience with this feature, we should turn monitoring on by 
default so as not to waste cluster resources because of programming errors.

 Monitor and kill runaway UDFs
 -

 Key: PIG-1427
 URL: https://issues.apache.org/jira/browse/PIG-1427
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.8.0
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
 Attachments: monitoredUdf.patch, monitoredUdf.patch


 As a safety measure, it is sometimes useful to monitor UDFs as they execute. 
 It is often preferable to return null or some other default value instead of 
 timing out a runaway evaluation and killing a job. We have in the past seen 
 complex regular expressions lead to job failures due to just half a dozen 
 (out of millions) particularly obnoxious strings.
 It would be great to give Pig users a lightweight way of enabling UDF 
 monitoring.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1347) Clear up output directory for a failed job

2010-05-26 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12871862#action_12871862
 ] 

Ashutosh Chauhan commented on PIG-1347:
---

The patch is pretty straightforward and harmless, as it only removes code and does 
not add anything new. The only concern I have is that 
FileLocalizer.registerDeleteOnFail() is a public method, so it's possible that 
someone using Pig's Java API was previously using this method to do the cleanup 
themselves. So this can be considered a backward-incompatible change. But 
Daniel explained to me that this method was meant for Pig's internal usage, and 
cleanup in any case was taken care of by Pig before the recent store func 
changes, so users need not worry about it. So it's extremely unlikely that 
anyone is using it. 
So, +1 on committing.

 Clear up output directory for a failed job
 --

 Key: PIG-1347
 URL: https://issues.apache.org/jira/browse/PIG-1347
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Ashitosh Darbarwar
 Fix For: 0.8.0

 Attachments: PIG-1347-1.patch


 FileLocalizer.deleteOnFail suppose to track the output files need to be 
 deleted in case the job fails. However, in the current code base, 
 deleteOnFail is dangling. registerDeleteOnFail and triggerDeleteOnFail is 
 called by nobody. We need to bring it back.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1424) Error logs of streaming should not be placed in output location

2010-05-26 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12871902#action_12871902
 ] 

Ashutosh Chauhan commented on PIG-1424:
---

Till we figure out a proper solution for this, one possibility is to wrap the 
code in my previous comment in a try-catch block. That will unblock PIG-1229 
for commit. We can leave this ticket open if we feel there is a need for a 
better solution. 

 Error logs of streaming should not be placed in output location
 ---

 Key: PIG-1424
 URL: https://issues.apache.org/jira/browse/PIG-1424
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
 Fix For: 0.8.0


 This becomes a problem when output location is anything other then a 
 filesystem. Output will be written to DB but where the logs generated by 
 streaming should go? Clearly, they cant be written into DB. This blocks 
 PIG-1229 which introduces writing to DB from Pig.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1427) Monitor and kill runaway UDFs

2010-05-26 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12872031#action_12872031
 ] 

Ashutosh Chauhan commented on PIG-1427:
---

A useful feature. Couple of comments:

1. Currently in case of time outs and error you are always returning null. It 
will be useful if user can specify a default return value as a definition of 
his annotation which is returned in those cases. For example if my regex fails 
on an input String, I want to return an empty String back. Something like:
{code}
 @MonitoredUDF(timeUnit = TimeUnit.MILLISECONDS, duration = 500, 
defaultReturnValue = )
{code} 

2. It seems that the PigHadoopLogger.getReporter() method accidentally got removed 
in 0.7 and trunk. This needs to be restored. It would be really cool to see on the 
UI how many of my input records are faulty. Since it is a small change, I think 
you can add that getter method back and then update the appropriate 
counters. 
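The timeout-with-default pattern being discussed can be sketched in plain Java. This is an illustrative stand-in, not Pig's actual MonitoredUDF machinery; the helper name and signature here are assumptions:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class TimeoutEval {
    // Run a task with a time budget; on timeout or error, return a
    // caller-supplied default instead of failing the whole job.
    static <T> T evalWithTimeout(Callable<T> task, long millis, T defaultValue) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<T> f = pool.submit(task);
        try {
            return f.get(millis, TimeUnit.MILLISECONDS);
        } catch (Exception e) {   // TimeoutException, ExecutionException, ...
            f.cancel(true);       // try to stop the runaway evaluation
            return defaultValue;  // e.g. "" when a regex match runs away
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // A "runaway" evaluation that will not finish within the 100 ms budget
        String out = evalWithTimeout(() -> {
            Thread.sleep(10_000);
            return "never";
        }, 100, "");
        System.out.println("[" + out + "]"); // prints "[]"
    }
}
```

The default value takes the place of the timed-out result, so a handful of pathological records do not fail the whole job.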

 Monitor and kill runaway UDFs
 -

 Key: PIG-1427
 URL: https://issues.apache.org/jira/browse/PIG-1427
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.8.0
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
 Attachments: monitoredUdf.patch


 As a safety measure, it is sometimes useful to monitor UDFs as they execute. 
 It is often preferable to return null or some other default value instead of 
 timing out a runaway evaluation and killing a job. We have in the past seen 
 complex regular expressions lead to job failures due to just half a dozen 
 (out of millions) particularly obnoxious strings.
 It would be great to give Pig users a lightweight way of enabling UDF 
 monitoring.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-766) ava.lang.OutOfMemoryError: Java heap space

2010-05-25 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12871253#action_12871253
 ] 

Ashutosh Chauhan commented on PIG-766:
--

Dirk,

1. Are you getting the exact same stack trace as mentioned in the jira?
2. Which operations are you doing in your query - join, group-by, any others?
3. What load/store func are you using to read and write data? PigStorage or 
your own?
4. What are your data size and the memory available to your tasks?
5. Do you have very large records in your dataset, like hundreds of MB for one 
record?

It would be great if you can paste here the script from which you get this 
exception.

 ava.lang.OutOfMemoryError: Java heap space
 --

 Key: PIG-766
 URL: https://issues.apache.org/jira/browse/PIG-766
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.2.0, 0.7.0
 Environment: Hadoop-0.18.3 (cloudera RPMs).
 mapred.child.java.opts=-Xmx1024m
Reporter: Vadim Zaliva

 My pig script always fails with the following error:
 Java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2786)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:94)
at java.io.DataOutputStream.write(DataOutputStream.java:90)
at java.io.FilterOutputStream.write(FilterOutputStream.java:80)
at 
 org.apache.pig.data.DataReaderWriter.writeDatum(DataReaderWriter.java:213)
at org.apache.pig.data.DefaultTuple.write(DefaultTuple.java:291)
at 
 org.apache.pig.data.DefaultAbstractBag.write(DefaultAbstractBag.java:233)
at 
 org.apache.pig.data.DataReaderWriter.writeDatum(DataReaderWriter.java:162)
at org.apache.pig.data.DefaultTuple.write(DefaultTuple.java:291)
at 
 org.apache.pig.impl.io.PigNullableWritable.write(PigNullableWritable.java:83)
at 
 org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90)
at 
 org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77)
at org.apache.hadoop.mapred.IFile$Writer.append(IFile.java:156)
at 
 org.apache.hadoop.mapred.MapTask$MapOutputBuffer.spillSingleRecord(MapTask.java:857)
at 
 org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:467)
at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:101)
at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:219)
at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:208)
at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:86)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
at 
 org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2198)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-928) UDFs in scripting languages

2010-05-25 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12871448#action_12871448
 ] 

Ashutosh Chauhan commented on PIG-928:
--

Arnab,

Thanks for putting together a patch for this. One question I have is about 
register vs. define. Currently you are auto-registering all the functions in the 
script file, and then they are available for later use in the script. But I am not 
sure how we will handle the case of inlined functions. For inline functions, 
{{define}} seems to be a natural choice, as noted in previous comments of the 
jira. And if so, then we need to modify define to support that use case. I am 
wondering whether, to remain consistent, we should always use {{define}} to define 
non-native functions instead of auto-registering them. I also didn't get why there 
would be a need for separate interpreter instances in that case.


 UDFs in scripting languages
 ---

 Key: PIG-928
 URL: https://issues.apache.org/jira/browse/PIG-928
 Project: Pig
  Issue Type: New Feature
Reporter: Alan Gates
 Fix For: 0.8.0

 Attachments: calltrace.png, package.zip, pig-greek.tgz, 
 pig.scripting.patch.arnab, pyg.tgz, scripting.tgz, scripting.tgz, test.zip


 It should be possible to write UDFs in scripting languages such as python, 
 ruby, etc.  This frees users from needing to compile Java, generate a jar, 
 etc.  It also opens Pig to programmers who prefer scripting languages over 
 Java.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1424) Error logs of streaming should not be placed in output location

2010-05-20 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12869688#action_12869688
 ] 

Ashutosh Chauhan commented on PIG-1424:
---

Since all the logs generated by Pig in the backend end up in the log directory of the 
task tracker, logs generated by the streaming binary should also go there and not into 
the output location.
The location is set in JobControlCompiler.java, 
line 460:
{code}
conf.set("pig.streaming.log.dir", 
new Path(outputPath, LOG_DIR).toString());
{code} 

 Error logs of streaming should not be placed in output location
 ---

 Key: PIG-1424
 URL: https://issues.apache.org/jira/browse/PIG-1424
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
 Fix For: 0.8.0


 This becomes a problem when the output location is anything other than a 
 filesystem. Output will be written to the DB, but where should the logs generated by 
 streaming go? Clearly, they can't be written into the DB. This blocks 
 PIG-1229, which introduces writing to a DB from Pig.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1229) allow pig to write output into a JDBC db

2010-05-20 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12869692#action_12869692
 ] 

Ashutosh Chauhan commented on PIG-1229:
---

Cool. I created PIG-1424 to track the Pig issue.

 allow pig to write output into a JDBC db
 

 Key: PIG-1229
 URL: https://issues.apache.org/jira/browse/PIG-1229
 Project: Pig
  Issue Type: New Feature
  Components: impl
Reporter: Ian Holsman
Assignee: Ankur
Priority: Minor
 Fix For: 0.8.0

 Attachments: jira-1229-v2.patch, jira-1229-v3.patch, 
 pig-1229.2.patch, pig-1229.patch


 UDF to store data into a DB

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1381) Need a way for Pig to take an alternative property file

2010-05-13 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12867220#action_12867220
 ] 

Ashutosh Chauhan commented on PIG-1381:
---

+1 on the changes. 
For completeness, we can also check in an empty pig.properties in the conf dir 
and then add comments in both pig.properties and pig-default.properties noting that if 
a user wants to pass some properties, doing it through pig-default.properties will 
have no effect; instead, they should add the extra properties they want to 
add/override in pig.properties.

 Need a way for Pig to take an alternative property file
 ---

 Key: PIG-1381
 URL: https://issues.apache.org/jira/browse/PIG-1381
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: V.V.Chaitanya Krishna
 Fix For: 0.7.0, 0.8.0

 Attachments: PIG-1381-1.patch, PIG-1381-2.patch, PIG-1381-3.patch, 
 PIG-1381-4.patch


 Currently, Pig reads the first pig.properties found in the CLASSPATH. Pig has a 
 default pig.properties, and if the user has a different pig.properties, there 
 will be a conflict since we can only read one. There are a couple of ways to 
 solve it:
 1. Give a command line option for user to pass an additional property file
 2. Change the name for default pig.properties to pig-default.properties, and 
 user can give a pig.properties to override
 3. Further, can we consider to use pig-default.xml/pig-site.xml, which seems 
 to be more natural for hadoop community. If so, we shall provide backward 
 compatibility to also read pig.properties, pig-cluster-hadoop-site.xml. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1229) allow pig to write output into a JDBC db

2010-05-13 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1229:
--

Attachment: pig-1229.patch

Ankur,

Sorry for getting back late on this. I fiddled with your latest patch and was 
able to make some progress on it. I am able to get rid of those Path problems 
(it looks like Pig itself is not dealing with them correctly in one place). I think 
the patch that I attached should work, but I am not able to get the test case 
to pass because of an hsqldb problem which I am not able to resolve. I keep 
getting this error from it:
{noformat}
Caused by: java.sql.SQLException: The database is already in use by another 
process: org.hsqldb.persist.NIOLockFile@4abea04e[file 
=/private/tmp/batchtest.lck, exists=true, locked=false, valid=false, fl =null]: 
java.lang.Exception: checkHeartbeat(): lock file [/private/tmp/batchtest.lck] 
is presumably locked by another process.
at org.hsqldb.jdbc.Util.sqlException(Unknown Source)
at org.hsqldb.jdbc.jdbcConnection.init(Unknown Source)
at org.hsqldb.jdbcDriver.getConnection(Unknown Source)
at org.hsqldb.jdbcDriver.connect(Unknown Source)
at java.sql.DriverManager.getConnection(DriverManager.java:582)
at java.sql.DriverManager.getConnection(DriverManager.java:185)
at 
org.apache.pig.piggybank.storage.DBStorage.prepareToWrite(DBStorage.java:274)

{noformat}
Anyways here are the changes I made:
1.
{code}
Index:src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java
===
-conf.set("pig.streaming.log.dir", 
-new Path(outputPath, LOG_DIR).toString());
+//conf.set("pig.streaming.log.dir", 
+//new Path(outputPath, LOG_DIR).toString());
 conf.set("pig.streaming.task.output.dir", outputPath);
 }
{code}
This looks like a problem in Pig. Here Pig is incorrectly assuming that it can 
put logs generated during the stream command in the output location, which is wrong 
if the output location is something like a DB. Since this needs changes in the main Pig 
code, I suggest opening a new jira for it and tracking it there.

2. Then in DBStorage.java
{code}
@Override
public void setStoreLocation(String location, Job job) throws IOException {
  job.getConfiguration().set("pig.db.conn.string", location);
}
@Override
public RecordWriter<NullWritable, NullWritable> getRecordWriter(
TaskAttemptContext context) throws IOException, InterruptedException {
  jdbcURL = context.getConfiguration().get("pig.db.conn.string");
  return null;
}
{code} 
We need to save the DB connection string in the job in setStoreLocation() and then 
retrieve it in the backend in getRecordWriter(). 

3. In DBStorage.java
{code}
@Override
public void cleanupOnFailure(String location, Job job) throws 
IOException {
  log.error("Job has failed.");
}
{code}
You need to override this StoreFunc method, since the default 
implementation assumes a FileSystem as the output location. Currently, I left it 
as a no-op, but it can be improved to do rollbacks, release DB connections, etc. 


 allow pig to write output into a JDBC db
 

 Key: PIG-1229
 URL: https://issues.apache.org/jira/browse/PIG-1229
 Project: Pig
  Issue Type: New Feature
  Components: impl
Reporter: Ian Holsman
Assignee: Ankur
Priority: Minor
 Fix For: 0.8.0

 Attachments: jira-1229-v2.patch, jira-1229-v3.patch, pig-1229.patch


 UDF to store data into a DB

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1390) Provide a target to generate eclipse-related classpath and files

2010-04-30 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12862951#action_12862951
 ] 

Ashutosh Chauhan commented on PIG-1390:
---

I gave it a go and did as mentioned in the previous comment:

{noformat}
These are the steps that could be followed and imported to eclipse in a faster 
way :
1. checkout the trunk code.
2. run ant eclipse-files.
3. open eclipse and import the existing project.
{noformat}

Though pig itself compiled fine and is ready to go, the contrib projects 
(owl, zebra, piggybank/hiverc) didn't compile, I think because it either didn't 
download the dependencies of those projects or didn't include them in the build path. 
So an unfriendly red cross appears next to the project. If I remove them from 
the build path, things are good. Did I do something wrong, or is this expected?

 Provide a target to generate eclipse-related classpath and files
 

 Key: PIG-1390
 URL: https://issues.apache.org/jira/browse/PIG-1390
 Project: Pig
  Issue Type: Improvement
  Components: build
Affects Versions: 0.7.0, 0.8.0
Reporter: V.V.Chaitanya Krishna
Assignee: V.V.Chaitanya Krishna
 Fix For: 0.8.0

 Attachments: PIG-1390-2.patch, PIG-1390-3.patch, 
 PIG-eclipse_support.patch


 Currently, after checking out from svn repository, there is no provision to 
 auto-generate eclipse-related classpath and files , which could help in 
 import into eclipse directly.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1395) Mapside cogroup runs out of memory

2010-04-27 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1395:
--

Status: Resolved  (was: Patch Available)
Resolution: Fixed

Patch checked-in with updated comment.

 Mapside cogroup runs out of memory
 --

 Key: PIG-1395
 URL: https://issues.apache.org/jira/browse/PIG-1395
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.8.0
Reporter: Ashutosh Chauhan
Assignee: Ashutosh Chauhan
 Fix For: 0.8.0

 Attachments: cogrp_mem.patch


 In a particular scenario, when there aren't a lot of tuples with the same key in a 
 relation (i.e. there aren't many repeating keys), map tasks doing cogroup 
 fail with a GC overhead exception.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-798) Schema errors when using PigStorage and none when using BinStorage in FOREACH??

2010-04-26 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12861122#action_12861122
 ] 

Ashutosh Chauhan commented on PIG-798:
--

1.
{noformat}
 b = foreach a generate (chararray) $0 as name; 
{noformat}

2.
{noformat}
B = foreach A generate $0 as name:chararray;
{noformat}

@Viraj,

Discussed with Alan and Daniel. The language semantics for achieving this 
functionality with any loader is form 1. The fact that form 2 works for BinStorage 
is unfortunate and is a bug. It is something which is currently there for 
backward compatibility and will eventually be removed. 

 Schema errors when using PigStorage and none when using BinStorage in 
 FOREACH??
 ---

 Key: PIG-798
 URL: https://issues.apache.org/jira/browse/PIG-798
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.2.0, 0.3.0, 0.4.0, 0.5.0, 0.6.0, 0.7.0, 0.8.0
Reporter: Viraj Bhat
 Attachments: binstoragecreateop, schemaerr.pig, visits.txt


 In the following script I have a tab separated text file, which I load using 
 PigStorage() and store using BinStorage()
 {code}
 A = load '/user/viraj/visits.txt' using PigStorage() as (name:chararray, 
 url:chararray, time:chararray);
 B = group A by name;
 store B into '/user/viraj/binstoragecreateop' using BinStorage();
 dump B;
 {code}
 I later load file 'binstoragecreateop' in the following way.
 {code}
 A = load '/user/viraj/binstoragecreateop' using BinStorage();
 B = foreach A generate $0 as name:chararray;
 dump B;
 {code}
 Result
 ===
 (Amy)
 (Fred)
 ===
 The above code works properly and returns the right results. If I use 
 PigStorage() to achieve the same, I get the following error.
 {code}
 A = load '/user/viraj/visits.txt' using PigStorage();
 B = foreach A generate $0 as name:chararray;
 dump B;
 {code}
 ===
 {code}
 2009-05-02 03:58:50,662 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 1022: Type mismatch merging schema prefix. Field Schema: bytearray. Other 
 Field Schema: name: chararray
 Details at logfile: /home/viraj/pig-svn/trunk/pig_1241236728311.log
 {code}
 ===
 So why should the semantics of BinStorage() be different from PigStorage(), 
 where it is ok not to specify a schema? Should it not be consistent across 
 both?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1395) Mapside cogroup runs out of memory

2010-04-26 Thread Ashutosh Chauhan (JIRA)
Mapside cogroup runs out of memory
--

 Key: PIG-1395
 URL: https://issues.apache.org/jira/browse/PIG-1395
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.8.0
Reporter: Ashutosh Chauhan
Assignee: Ashutosh Chauhan
 Fix For: 0.8.0


In a particular scenario, when there aren't a lot of tuples with the same key in a 
relation (i.e. there aren't many repeating keys), map tasks doing cogroup fail 
with a GC overhead exception.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1395) Mapside cogroup runs out of memory

2010-04-26 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1395:
--

Status: Patch Available  (was: Open)

 Mapside cogroup runs out of memory
 --

 Key: PIG-1395
 URL: https://issues.apache.org/jira/browse/PIG-1395
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.8.0
Reporter: Ashutosh Chauhan
Assignee: Ashutosh Chauhan
 Fix For: 0.8.0

 Attachments: cogrp_mem.patch


 In a particular scenario, when there aren't a lot of tuples with the same key in a 
 relation (i.e. there aren't many repeating keys), map tasks doing cogroup 
 fail with a GC overhead exception.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1229) allow pig to write output into a JDBC db

2010-04-26 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12861177#action_12861177
 ] 

Ashutosh Chauhan commented on PIG-1229:
---

Ankur,

The stack trace above is out of sync with trunk. Can you upload the patch with 
the alternative approach that you are trying? I think it might be possible to 
get this working.

 allow pig to write output into a JDBC db
 

 Key: PIG-1229
 URL: https://issues.apache.org/jira/browse/PIG-1229
 Project: Pig
  Issue Type: New Feature
  Components: impl
Reporter: Ian Holsman
Assignee: Ankur
Priority: Minor
 Fix For: 0.8.0

 Attachments: jira-1229-v2.patch


 UDF to store data into a DB

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1381) Need a way for Pig to take an alternative property file

2010-04-26 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12861186#action_12861186
 ] 

Ashutosh Chauhan commented on PIG-1381:
---

Do we need to have two different property files? One possibility is to not 
package pig.properties in pig.jar and then include it in the classpath 
while invoking Pig. (We can modify the pig shell script to include it in the path 
by default.) Then the user can add/delete/modify pig.properties as they wish, as 
well as override default properties. 

The disadvantage of two property files is that it is sometimes confusing which property 
is getting picked up (the one in default or the one the user specified). If there is 
only one property file, there is only one way to specify properties to Pig, 
which I think is the better way of doing it. 
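The defaults-plus-overrides behavior under discussion can be illustrated with java.util.Properties chaining; this is a generic sketch, not Pig's loading code, and the property names below are only examples:

```java
import java.util.Properties;

public class PropsOverride {
    public static void main(String[] args) {
        // Stand-in for built-in defaults (pig-default.properties in option 2)
        Properties defaults = new Properties();
        defaults.setProperty("pig.exec.reducers.max", "999");
        defaults.setProperty("pig.logfile", "pig.log");

        // Stand-in for the user's pig.properties; the Properties(defaults)
        // constructor makes user values win, while unset keys fall through
        // to the defaults.
        Properties user = new Properties(defaults);
        user.setProperty("pig.logfile", "/tmp/my-pig.log");

        System.out.println(user.getProperty("pig.logfile"));           // /tmp/my-pig.log
        System.out.println(user.getProperty("pig.exec.reducers.max")); // 999
    }
}
```

With a single user-visible file layered over defaults, there is only one place a property can come from, which addresses the "which file won?" confusion.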


 Need a way for Pig to take an alternative property file
 ---

 Key: PIG-1381
 URL: https://issues.apache.org/jira/browse/PIG-1381
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
 Fix For: 0.8.0


 Currently, Pig reads the first pig.properties found in the CLASSPATH. Pig has a 
 default pig.properties, and if the user has a different pig.properties, there 
 will be a conflict since we can only read one. There are a couple of ways to 
 solve it:
 1. Give a command line option for user to pass an additional property file
 2. Change the name for default pig.properties to pig-default.properties, and 
 user can give a pig.properties to override
 3. Further, can we consider to use pig-default.xml/pig-site.xml, which seems 
 to be more natural for hadoop community. If so, we shall provide backward 
 compatibility to also read pig.properties, pig-cluster-hadoop-site.xml. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-798) Schema errors when using PigStorage and none when using BinStorage in FOREACH??

2010-04-24 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12860598#action_12860598
 ] 

Ashutosh Chauhan commented on PIG-798:
--

You can specify a schema in FOREACH ... GENERATE with the PigStorage loader as follows:
{code}
grunt> a = load 'data' using PigStorage();
grunt> b = foreach a generate (chararray) $0 as name; 
grunt> describe b;
b: {name: chararray}
grunt> dump b;
{code}

I get the expected result.

 Schema errors when using PigStorage and none when using BinStorage in 
 FOREACH??
 ---

 Key: PIG-798
 URL: https://issues.apache.org/jira/browse/PIG-798
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.2.0, 0.3.0, 0.4.0, 0.5.0, 0.6.0, 0.7.0, 0.8.0
Reporter: Viraj Bhat
 Attachments: binstoragecreateop, schemaerr.pig, visits.txt


 In the following script I have a tab separated text file, which I load using 
 PigStorage() and store using BinStorage()
 {code}
 A = load '/user/viraj/visits.txt' using PigStorage() as (name:chararray, 
 url:chararray, time:chararray);
 B = group A by name;
 store B into '/user/viraj/binstoragecreateop' using BinStorage();
 dump B;
 {code}
 I later load file 'binstoragecreateop' in the following way.
 {code}
 A = load '/user/viraj/binstoragecreateop' using BinStorage();
 B = foreach A generate $0 as name:chararray;
 dump B;
 {code}
 Result
 ===
 (Amy)
 (Fred)
 ===
 The above code works properly and returns the right results. If I use 
 PigStorage() to achieve the same, I get the following error.
 {code}
 A = load '/user/viraj/visits.txt' using PigStorage();
 B = foreach A generate $0 as name:chararray;
 dump B;
 {code}
 ===
 {code}
 2009-05-02 03:58:50,662 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 1022: Type mismatch merging schema prefix. Field Schema: bytearray. Other 
 Field Schema: name: chararray
 Details at logfile: /home/viraj/pig-svn/trunk/pig_1241236728311.log
 {code}
 ===
 So why should the semantics of BinStorage() be different from PigStorage(), 
 where it is ok not to specify a schema? Should it not be consistent across 
 both?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1339) International characters in column names not supported

2010-04-24 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12860606#action_12860606
 ] 

Ashutosh Chauhan commented on PIG-1339:
---

This works fine in grunt: 
{code}
grunt> a = load '1-3.txt' using PigStorage() as (あいうえお);
grunt> dump a;
{code}

This gives the expected result. The problem is when it is fed to Pig as a script:
{code}
bin/pig myscript.pig
{code}
This gives the exception you showed above. It looks like a bug in 
PigScriptParser.jj, which should read the stream from the script file as UTF-8.
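The suggested fix is to open the script with an explicit UTF-8 decoder instead of the platform default charset. A minimal, hypothetical illustration (not the actual PigScriptParser.jj code):

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class Utf8ScriptRead {
    public static void main(String[] args) throws Exception {
        // A script line with Japanese column names, as UTF-8 bytes on disk
        byte[] scriptBytes =
                "a = load '1-3.txt' as (あいうえお);".getBytes(StandardCharsets.UTF_8);

        // Decoding with an explicit UTF-8 charset preserves the identifiers;
        // new InputStreamReader(in) without a charset uses the platform
        // default and can mangle multi-byte characters before the lexer
        // ever sees them.
        BufferedReader r = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(scriptBytes), StandardCharsets.UTF_8));
        String line = r.readLine();
        System.out.println(line.contains("あいうえお")); // prints "true"
    }
}
```

Grunt works because the interactive console already decodes input correctly; only the file-reading path needs the explicit charset.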

 International characters in column names not supported
 --

 Key: PIG-1339
 URL: https://issues.apache.org/jira/browse/PIG-1339
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0, 0.7.0, 0.8.0
Reporter: Viraj Bhat

 There is a particular use-case in which someone specifies a column name to be 
 in International characters.
 {code}
 inputdata = load '/user/viraj/inputdata.txt' using PigStorage() as (あいうえお);
 describe inputdata;
 dump inputdata;
 {code}
 ==
 Pig Stack Trace
 ---
 ERROR 1000: Error during parsing. Lexical error at line 1, column 64.  
 Encountered: \u3042 (12354), after : 
 org.apache.pig.impl.logicalLayer.parser.TokenMgrError: Lexical error at line 
 1, column 64.  Encountered: \u3042 (12354), after : 
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParserTokenManager.getNextToken(QueryParserTokenManager.java:1791)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_scan_token(QueryParser.java:8959)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_51(QueryParser.java:7462)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_120(QueryParser.java:7769)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_106(QueryParser.java:7787)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_63(QueryParser.java:8609)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_32(QueryParser.java:8621)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3_4(QueryParser.java:8354)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_2_4(QueryParser.java:6903)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:1249)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:911)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:700)
 at 
 org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63)
 at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1164)
 at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1114)
 at org.apache.pig.PigServer.registerQuery(PigServer.java:425)
 at 
 org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:737)
 at 
 org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:324)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138)
 at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
 at org.apache.pig.Main.main(Main.java:391)
 ==
 Thanks Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1211) Pig script runs half way after which it reports syntax error

2010-04-24 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12860614#action_12860614
 ] 

Ashutosh Chauhan commented on PIG-1211:
---

Oh, I got confused. From your earlier comment, it seemed to me you were saying 
that we should add a -checkscript command line option. From your latest 
comment, are you suggesting that we should add a syntax checker which always 
runs (i.e., without needing any command line directive) before the query starts to 
execute, thereby catching as many user errors as possible? I think this is a 
reasonable ask and will be useful to users. This might be the first step 
towards making the distinction between pig compile time and run time explicit to the 
user. If we go full length here, we might as well do what Milind suggested 
earlier (and in a recent mail thread): add a compilation phase which 
first runs a syntax checker, then generates object code (essentially the job jar) 
from the pig script. This compiled object can then be handed over to the run-time 
(hadoop cluster). Wow, pig-latin is evolving towards a true language :)   

 Pig script runs half way after which it reports syntax error
 

 Key: PIG-1211
 URL: https://issues.apache.org/jira/browse/PIG-1211
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat
 Fix For: 0.8.0


 I have a Pig script which is structured in the following way
 {code}
 register cp.jar
 dataset = load '/data/dataset/' using PigStorage('\u0001') as (col1, col2, 
 col3, col4, col5);
 filtered_dataset = filter dataset by (col1 == 1);
 proj_filtered_dataset = foreach filtered_dataset generate col2, col3;
 rmf $output1;
 store proj_filtered_dataset into '$output1' using PigStorage();
 second_stream = foreach filtered_dataset  generate col2, col4, col5;
 group_second_stream = group second_stream by col4;
 output2 = foreach group_second_stream {
  a =  second_stream.col2
  b =   distinct second_stream.col5;
  c = order b by $0;
  generate 1 as key, group as keyword, MYUDF(c, 100) as finalcalc;
 }
 rmf  $output2;
 --syntax error here
 store output2 to '$output2' using PigStorage();
 {code}
 I run this script using the Multi-query option; it runs successfully till the 
 first store but later fails with a syntax error. 
 The usage of the HDFS option rmf causes the first store to execute. 
 The only option that I have is to run an explain before running this script 
 grunt> explain -script myscript.pig -out explain.out
 or to move the rmf statements to the top of the script.
 Here are some questions:
 a) Can we have an option to do something like checkscript instead of 
 explain to get the same syntax error?  In this way I can ensure that I do not 
 run for 3-4 hours before encountering a syntax error
 b) Can pig not figure out a way to re-order the rmf statements since all the 
 store directories are variables
 Thanks
 Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-1390) Provide a target to generate eclipse-related classpath and files

2010-04-22 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan reassigned PIG-1390:
-

Assignee: V.V.Chaitanya Krishna

 Provide a target to generate eclipse-related classpath and files
 

 Key: PIG-1390
 URL: https://issues.apache.org/jira/browse/PIG-1390
 Project: Pig
  Issue Type: Improvement
  Components: build
Affects Versions: 0.7.0, 0.8.0
Reporter: V.V.Chaitanya Krishna
Assignee: V.V.Chaitanya Krishna
 Fix For: 0.8.0

 Attachments: PIG-eclipse_support.patch


 Currently, after checking out from svn repository, there is no provision to 
 auto-generate eclipse-related classpath and files , which could help in 
 import into eclipse directly.




[jira] Commented: (PIG-1211) Pig script runs half way after which it reports syntax error

2010-04-21 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12859462#action_12859462
 ] 

Ashutosh Chauhan commented on PIG-1211:
---

bq. Can we have an option to do something like checkscript instead of explain 
to get the same syntax error? In this way I can ensure that I do not run for 
3-4 hours before encountering a syntax error

Though it's possible to add something like checkscript, it would be syntactic 
sugar, since it would do exactly the same thing as explain does (just without 
printing the plan at the end). So I am thinking: shall we tell users to run 
explain to catch syntax errors, instead of adding this new command-line option? 
What do others think?

 Pig script runs half way after which it reports syntax error
 

 Key: PIG-1211
 URL: https://issues.apache.org/jira/browse/PIG-1211
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat
 Fix For: 0.8.0


 I have a Pig script which is structured in the following way
 {code}
 register cp.jar
 dataset = load '/data/dataset/' using PigStorage('\u0001') as (col1, col2, 
 col3, col4, col5);
 filtered_dataset = filter dataset by (col1 == 1);
 proj_filtered_dataset = foreach filtered_dataset generate col2, col3;
 rmf $output1;
 store proj_filtered_dataset into '$output1' using PigStorage();
 second_stream = foreach filtered_dataset  generate col2, col4, col5;
 group_second_stream = group second_stream by col4;
 output2 = foreach group_second_stream {
  a =  second_stream.col2
  b =   distinct second_stream.col5;
  c = order b by $0;
  generate 1 as key, group as keyword, MYUDF(c, 100) as finalcalc;
 }
 rmf  $output2;
 --syntax error here
 store output2 to '$output2' using PigStorage();
 {code}
 When I run this script using the Multi-query option, it runs successfully till 
 the first store but later fails with a syntax error. 
 The usage of the HDFS option rmf causes the first store to execute. 
 The only options I have are to run an explain before running this script 
 grunt> explain -script myscript.pig -out explain.out
 or to move the rmf statements to the top of the script
 Here are some questions:
 a) Can we have an option to do something like checkscript instead of 
 explain to get the same syntax error?  In this way I can ensure that I do not 
 run for 3-4 hours before encountering a syntax error
 b) Can pig not figure out a way to re-order the rmf statements since all the 
 store directories are variables
 Thanks
 Viraj




[jira] Commented: (PIG-1345) Link casting errors in POCast to actual lines numbers in Pig script

2010-04-21 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12859471#action_12859471
 ] 

Ashutosh Chauhan commented on PIG-1345:
---

This will involve recording line numbers (and possibly more metadata) from the 
parser to the logical layer, then to the physical layer, then to the backend, 
and then back in case of exceptions. This has been discussed before in some 
detail in PIG-908. Linking it against that.
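
The metadata plumbing described above might look roughly like this sketch 
(Python, purely illustrative; the classes and functions are invented and are 
not Pig's internal API): each operator carries its source location, and 
translation between plan layers copies it forward so backend errors can point 
at a script line.

```python
# Hedged sketch of propagating source positions across plan layers.
# All names are hypothetical stand-ins for the logical/physical operators
# discussed in PIG-908 / PIG-1345.
from dataclasses import dataclass

@dataclass
class SourceLoc:
    line: int           # line number in the original pig script

@dataclass
class LogicalOp:
    name: str
    loc: SourceLoc      # recorded by the parser

@dataclass
class PhysicalOp:
    name: str
    loc: SourceLoc      # copied, not recomputed, during translation

def to_physical(op):
    # Translation preserves the original script location so that a
    # backend failure can be mapped back to the offending script line.
    return PhysicalOp(name=op.name, loc=op.loc)

def backend_error(op, msg):
    return "line %d: %s" % (op.loc.line, msg)

cast = to_physical(LogicalOp('POCast', SourceLoc(line=22)))
print(backend_error(cast, 'IMPLICIT_CAST_TO_MAP'))
```

The key design point is that the location travels with the operator through 
every rewrite, so no layer needs to reconstruct it after the fact.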

 Link casting errors in POCast to actual lines numbers in Pig script
 ---

 Key: PIG-1345
 URL: https://issues.apache.org/jira/browse/PIG-1345
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat

 For the purpose of easy debugging, it would be nice to find out where my 
 warnings are coming from in the pig script. 
 The only known process is to comment out lines in the Pig script and see if 
 these warnings go away.
 2010-01-13 21:34:13,697 [main] WARN  org.apache.pig.PigServer - Encountered 
 Warning IMPLICIT_CAST_TO_MAP 2 time(s) line 22 
 2010-01-13 21:34:13,698 [main] WARN  org.apache.pig.PigServer - Encountered 
 Warning IMPLICIT_CAST_TO_LONG 2 time(s) line 23
 2010-01-13 21:34:13,698 [main] WARN  org.apache.pig.PigServer - Encountered 
 Warning IMPLICIT_CAST_TO_BAG 1 time(s). line 26
 I think this may need us to keep track of the line numbers of the Pig script 
 (via our javacc parser) and maintain them in the logical and physical plans.
 It would help users in debugging simple errors/warning related to casting.
 Is this enhancement listed in the  http://wiki.apache.org/pig/PigJournal?
 Do we need to change the parser to something other than javacc to make this 
 task simpler?
 Standardize on Parser and Scanner Technology
 Viraj




[jira] Updated: (PIG-1345) Link casting errors in POCast to actual lines numbers in Pig script

2010-04-21 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-1345:
--

Parent: PIG-908
Issue Type: Sub-task  (was: Improvement)

 Link casting errors in POCast to actual lines numbers in Pig script
 ---

 Key: PIG-1345
 URL: https://issues.apache.org/jira/browse/PIG-1345
 Project: Pig
  Issue Type: Sub-task
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat

 For the purpose of easy debugging, it would be nice to find out where my 
 warnings are coming from in the pig script. 
 The only known process is to comment out lines in the Pig script and see if 
 these warnings go away.
 2010-01-13 21:34:13,697 [main] WARN  org.apache.pig.PigServer - Encountered 
 Warning IMPLICIT_CAST_TO_MAP 2 time(s) line 22 
 2010-01-13 21:34:13,698 [main] WARN  org.apache.pig.PigServer - Encountered 
 Warning IMPLICIT_CAST_TO_LONG 2 time(s) line 23
 2010-01-13 21:34:13,698 [main] WARN  org.apache.pig.PigServer - Encountered 
 Warning IMPLICIT_CAST_TO_BAG 1 time(s). line 26
 I think this may need us to keep track of the line numbers of the Pig script 
 (via our javacc parser) and maintain them in the logical and physical plans.
 It would help users in debugging simple errors/warning related to casting.
 Is this enhancement listed in the  http://wiki.apache.org/pig/PigJournal?
 Do we need to change the parser to something other than javacc to make this 
 task simpler?
 Standardize on Parser and Scanner Technology
 Viraj




[jira] Commented: (PIG-1339) International characters in column names not supported

2010-04-20 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12859152#action_12859152
 ] 

Ashutosh Chauhan commented on PIG-1339:
---

This is not reproducible on trunk; I get the expected output. Viraj, can you 
please verify whether it works for you on trunk?

 International characters in column names not supported
 --

 Key: PIG-1339
 URL: https://issues.apache.org/jira/browse/PIG-1339
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat

 There is a particular use-case in which someone specifies a column name in 
 international characters.
 {code}
 inputdata = load '/user/viraj/inputdata.txt' using PigStorage() as (あいうえお);
 describe inputdata;
 dump inputdata;
 {code}
 ==
 Pig Stack Trace
 ---
 ERROR 1000: Error during parsing. Lexical error at line 1, column 64.  
 Encountered: \u3042 (12354), after : 
 org.apache.pig.impl.logicalLayer.parser.TokenMgrError: Lexical error at line 
 1, column 64.  Encountered: \u3042 (12354), after : 
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParserTokenManager.getNextToken(QueryParserTokenManager.java:1791)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_scan_token(QueryParser.java:8959)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_51(QueryParser.java:7462)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_120(QueryParser.java:7769)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_106(QueryParser.java:7787)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_63(QueryParser.java:8609)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_32(QueryParser.java:8621)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3_4(QueryParser.java:8354)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_2_4(QueryParser.java:6903)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:1249)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:911)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:700)
 at 
 org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63)
 at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1164)
 at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1114)
 at org.apache.pig.PigServer.registerQuery(PigServer.java:425)
 at 
 org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:737)
 at 
 org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:324)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138)
 at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
 at org.apache.pig.Main.main(Main.java:391)
 ==
 Thanks Viraj




[jira] Commented: (PIG-1341) BinStorage cannot convert DataByteArray to Chararray and results in FIELD_DISCARDED_TYPE_CONVERSION_FAILED

2010-04-20 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12859157#action_12859157
 ] 

Ashutosh Chauhan commented on PIG-1341:
---

I think BinStorage is an internal way of moving data around in Pig and it 
should be treated that way; we should discourage users from relying on it. 
Otherwise, we need to add capabilities like the one requested here. An 
important consequence of making such a change is that we then can't swap out 
BinStorage for other storage mechanisms. If Avro (or protobuf or whatever) 
proved to be a better replacement for BinStorage, we couldn't just swap it in 
unless we added to it all the capabilities that BinStorage has. Therefore, I 
suggest keeping the capabilities of BinStorage minimal.  

 BinStorage cannot convert DataByteArray to Chararray and results in 
 FIELD_DISCARDED_TYPE_CONVERSION_FAILED
 --

 Key: PIG-1341
 URL: https://issues.apache.org/jira/browse/PIG-1341
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat
Assignee: Richard Ding
 Attachments: PIG-1341.patch


 Script reads in BinStorage data and tries to convert a column which is in 
 DataByteArray to Chararray. 
 {code}
 raw = load 'sampledata' using BinStorage() as (col1,col2, col3);
 --filter out null columns
 A = filter raw by col1#'bcookie' is not null;
 B = foreach A generate col1#'bcookie'  as reqcolumn;
 describe B;
 --B: {regcolumn: bytearray}
 X = limit B 5;
 dump X;
 B = foreach A generate (chararray)col1#'bcookie'  as convertedcol;
 describe B;
 --B: {convertedcol: chararray}
 X = limit B 5;
 dump X;
 {code}
 The first dump produces:
 (36co9b55onr8s)
 (36co9b55onr8s)
 (36hilul5oo1q1)
 (36hilul5oo1q1)
 (36l4cj15ooa8a)
 The second dump produces:
 ()
 ()
 ()
 ()
 ()
 It also throws an error message: FIELD_DISCARDED_TYPE_CONVERSION_FAILED 5 
 time(s).
 Viraj




[jira] Commented: (PIG-798) Schema errors when using PigStorage and none when using BinStorage in FOREACH??

2010-04-20 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12859159#action_12859159
 ] 

Ashutosh Chauhan commented on PIG-798:
--

Viraj,

I am confused by this description. It seems to me that you are first storing 
some data using BinStorage and then loading it using PigStorage. If that is so, 
it obviously will not work: PigStorage and BinStorage aren't interoperable in 
this way. Specifically, data stored using BinStorage can only be loaded using 
BinStorage.
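
The point about matching writers and readers can be illustrated outside Pig 
with any binary serializer (a Python sketch using pickle as a stand-in for 
BinStorage; only the analogy carries over, none of this is Pig code): bytes 
written by a binary format only round-trip through the same format, and a 
text-oriented reader cannot make sense of them.

```python
# Binary output must be read back with the matching reader, just as
# BinStorage output must be loaded with BinStorage. pickle here is a
# stand-in for any binary serializer.
import pickle

record = ('Amy', 'http://example.com', '3')
blob = pickle.dumps(record)           # analogous to storing with BinStorage

assert pickle.loads(blob) == record   # matching reader: round-trips fine

try:
    # Treating the binary blob as tab-separated text (the PigStorage
    # analogy) fails immediately.
    blob.decode('ascii').split('\t')
except UnicodeDecodeError as e:
    print('mismatched reader:', type(e).__name__)
```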

 Schema errors when using PigStorage and none when using BinStorage in 
 FOREACH??
 ---

 Key: PIG-798
 URL: https://issues.apache.org/jira/browse/PIG-798
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.2.0
Reporter: Viraj Bhat
 Attachments: binstoragecreateop, schemaerr.pig, visits.txt


 In the following script I have a tab separated text file, which I load using 
 PigStorage() and store using BinStorage()
 {code}
 A = load '/user/viraj/visits.txt' using PigStorage() as (name:chararray, 
 url:chararray, time:chararray);
 B = group A by name;
 store B into '/user/viraj/binstoragecreateop' using BinStorage();
 dump B;
 {code}
 I later load file 'binstoragecreateop' in the following way.
 {code}
 A = load '/user/viraj/binstoragecreateop' using BinStorage();
 B = foreach A generate $0 as name:chararray;
 dump B;
 {code}
 Result
 ===
 (Amy)
 (Fred)
 ===
 The above code work properly and returns the right results. If I use 
 PigStorage() to achieve the same, I get the following error.
 {code}
 A = load '/user/viraj/visits.txt' using PigStorage();
 B = foreach A generate $0 as name:chararray;
 dump B;
 {code}
 ===
 {code}
 2009-05-02 03:58:50,662 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 1022: Type mismatch merging schema prefix. Field Schema: bytearray. Other 
 Field Schema: name: chararray
 Details at logfile: /home/viraj/pig-svn/trunk/pig_1241236728311.log
 {code}
 ===
 So why should the semantics of BinStorage() be different from PigStorage(), 
 where it is ok not to specify a schema? Should they not be consistent across 
 both?




[jira] Commented: (PIG-1378) har url not usable in Pig scripts

2010-04-19 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12858709#action_12858709
 ] 

Ashutosh Chauhan commented on PIG-1378:
---

{noformat}
grunt> a = load 
'har://namenode-location/user/viraj/project/subproject/files/size/data'; 
grunt> dump a;
{noformat}

 This is incorrect. You need to do the following:
{noformat}
grunt> a = load 
'har://hdfs-namenode.foo.com:8020/user/viraj/project/subproject/files/size/data';
 
grunt> dump a;
{noformat}

Note that after the har scheme comes hdfs, then a -(dash), followed by the 
namenode host, then a :(colon), followed by the port number (8020), and then 
the location of your har archive. 
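
As a quick illustration of how such a URL decomposes (Python standard library 
only; the host name is the example one used above, not a real cluster):

```python
# Decompose the example har URL with the standard URL parser: the scheme
# is har, and the authority part carries the underlying-filesystem prefix
# (hdfs-), the namenode host, and the port.
from urllib.parse import urlparse

u = urlparse('har://hdfs-namenode.foo.com:8020'
             '/user/viraj/project/subproject/files/size/data')
print(u.scheme)    # har
print(u.hostname)  # hdfs-namenode.foo.com
print(u.port)      # 8020
print(u.path)      # /user/viraj/project/subproject/files/size/data
```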


 har url not usable in Pig scripts
 -

 Key: PIG-1378
 URL: https://issues.apache.org/jira/browse/PIG-1378
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
Reporter: Viraj Bhat
 Fix For: 0.8.0


 I am trying to use har (Hadoop Archives) in my Pig script.
 I can use them through the HDFS shell
 {noformat}
 $hadoop fs -ls 'har:///user/viraj/project/subproject/files/size/data'
 Found 1 items
 -rw---   5 viraj users1537234 2010-04-14 09:49 
 user/viraj/project/subproject/files/size/data/part-1
 {noformat}
 Using similar URL's in grunt yields
 {noformat}
 grunt> a = load 'har:///user/viraj/project/subproject/files/size/data'; 
 grunt> dump a;
 {noformat}
 {noformat}
 2010-04-14 22:08:48,814 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 2998: Unhandled internal error. 
 org.apache.pig.impl.logicalLayer.FrontendException: ERROR 0: Incompatible 
 file URI scheme: har : hdfs
 2010-04-14 22:08:48,814 [main] WARN  org.apache.pig.tools.grunt.Grunt - There 
 is no log file to write to.
 2010-04-14 22:08:48,814 [main] ERROR org.apache.pig.tools.grunt.Grunt - 
 java.lang.Error: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 0: 
 Incompatible file URI scheme: har : hdfs
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.LoadClause(QueryParser.java:1483)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:1245)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:911)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:700)
 at 
 org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63)
 at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1164)
 at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1114)
 at org.apache.pig.PigServer.registerQuery(PigServer.java:425)
 at 
 org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:737)
 at 
 org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:324)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138)
 at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:75)
 at org.apache.pig.Main.main(Main.java:357)
 Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 0: 
 Incompatible file URI scheme: har : hdfs
 at org.apache.pig.LoadFunc.getAbsolutePath(LoadFunc.java:249)
 at org.apache.pig.LoadFunc.relativeToAbsolutePath(LoadFunc.java:62)
 at 
 org.apache.pig.impl.logicalLayer.parser.QueryParser.LoadClause(QueryParser.java:1472)
 ... 13 more
 {noformat}
 According to Jira http://issues.apache.org/jira/browse/PIG-1234 I try the 
 following as stated in the original description
 {noformat}
 grunt> a = load 
 'har://namenode-location/user/viraj/project/subproject/files/size/data'; 
 grunt> dump a;
 {noformat}
 {noformat}
 Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: 
 Unable to create input splits for: 
 har://namenode-location/user/viraj/project/subproject/files/size/data'; 
 ... 8 more
 Caused by: java.io.IOException: No FileSystem for scheme: namenode-location
 at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1375)
 at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
 at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
 at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
 at org.apache.hadoop.fs.HarFileSystem.initialize(HarFileSystem.java:104)
 at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378)
 at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:193)
 at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175)
 at 
 .apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:208)
 at 
 
