[jira] Updated: (PIG-1420) Make CONCAT act on all fields of a tuple, instead of just the first two fields of a tuple
[ https://issues.apache.org/jira/browse/PIG-1420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Russell Jurney updated PIG-1420:

    Status: In Progress  (was: Patch Available)

I don't know what resume progress does, but I'm about to find out.

Make CONCAT act on all fields of a tuple, instead of just the first two fields of a tuple

    Key: PIG-1420
    URL: https://issues.apache.org/jira/browse/PIG-1420
    Project: Pig
    Issue Type: Improvement
    Components: impl
    Affects Versions: 0.8.0
    Reporter: Russell Jurney
    Fix For: 0.7.0
    Attachments: concat.patch
    Original Estimate: 24h
    Remaining Estimate: 24h

org.apache.pig.builtin.CONCAT (which acts on DataByteArrays internally) and org.apache.pig.builtin.StringConcat (which acts on Strings internally) both act only on the first two fields of a tuple. This results in ugly nested CONCAT calls like:

    CONCAT(CONCAT(A, ' '), B)

The more desirable form is:

    CONCAT(A, ' ', B)

This change will be backwards compatible, provided that no one was relying on the fact that CONCAT ignores fields after the first two in a tuple. This seems a reasonable assumption to make, or at least a small break in compatibility for a sizable improvement.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
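For illustration only, the behaviour proposed in PIG-1420 can be sketched in plain Java. This is not the attached concat.patch and does not use Pig's classes: the tuple is modelled as a java.util.List, the class and method names are hypothetical, and the null handling (any null field makes the result null) is an assumption.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for a CONCAT that folds over every field of a tuple. */
final class ConcatAllFieldsSketch {

    /**
     * Concatenates all fields of the "tuple" (modelled here as a List)
     * instead of only the first two. Returns null for a null or empty tuple.
     */
    static String concatAll(List<?> tuple) {
        if (tuple == null || tuple.isEmpty()) {
            return null;
        }
        StringBuilder sb = new StringBuilder();
        for (Object field : tuple) {
            if (field == null) {
                return null; // assumption: a null field makes the whole result null
            }
            sb.append(field);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Counterpart of CONCAT(A, ' ', B) with A = "hello" and B = "world".
        System.out.println(concatAll(Arrays.asList("hello", " ", "world")));
    }
}
```

With two fields the loop gives the same result as the existing two-argument form, which is why the change would only be visible to scripts that pass extra fields today.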
[jira] Commented: (PIG-566) Dump and store outputs do not match for PigStorage
[ https://issues.apache.org/jira/browse/PIG-566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12868155#action_12868155 ]

Mridul Muralidharan commented on PIG-566:

Just to point out an error in the comment above. For the case of

* Integer.MIN_INTEGER <= d < Integer.MAX_INTEGER+1: pig should return floor(d), not ceil(d)

Dump and store outputs do not match for PigStorage

    Key: PIG-566
    URL: https://issues.apache.org/jira/browse/PIG-566
    Project: Pig
    Issue Type: Bug
    Affects Versions: 0.7.0, 0.8.0
    Reporter: Santhosh Srinivasan
    Assignee: Gianmarco De Francisci Morales
    Priority: Minor
    Fix For: 0.8.0
    Attachments: PIG-566.patch, PIG-566.patch, PIG-566.patch, PIG-566.patch, PIG-566.patch

The dump and store formats for PigStorage do not match for longs and floats.

{code}
grunt> y = foreach x generate {(2985671202194220139L)};
grunt> describe y;
y: {{(long)}}
grunt> dump y;
({(2985671202194220139L)})
grunt> store y into 'y';
grunt> cat y
{(2985671202194220139)}
{code}
[jira] Assigned: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword
[ https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Zhang reassigned PIG-1249:

    Assignee: Jeff Zhang

Safe-guards against misconfigured Pig scripts without PARALLEL keyword

    Key: PIG-1249
    URL: https://issues.apache.org/jira/browse/PIG-1249
    Project: Pig
    Issue Type: Improvement
    Reporter: Arun C Murthy
    Assignee: Jeff Zhang
    Priority: Critical
    Fix For: 0.8.0

It would be *very* useful for Pig to have safe-guards against naive scripts which process a *lot* of data without the use of the PARALLEL keyword. We've seen a fair number of instances where naive users process huge data-sets (10TB) with a badly mis-configured number of reduces, e.g. 1 reduce.
[jira] Updated: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword
[ https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Zhang updated PIG-1249:

    Attachment: PIG-1249.patch

The current idea is borrowed from Hive: use the input file size to estimate the number of reducers. Two parameters can be set for this purpose:

    pig.exec.reducers.bytes.per.reducer   // the number of bytes of input for each reducer
    pig.exec.reducers.max                 // the maximum number of reducers

This only works for HDFS; it won't work for other data sources such as HBase or Cassandra.
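A minimal sketch of how such an estimate could be computed, assuming the reducer count is the input size divided by the bytes-per-reducer setting, rounded up and capped at the maximum. The class and method names are hypothetical and the rounding is an assumption; this is not the code in PIG-1249.patch, and the values used in main are examples, not defaults.

```java
/** Hypothetical helper illustrating a size-based reducer estimate. */
final class ReducerEstimateSketch {

    /**
     * bytesPerReducer plays the role of pig.exec.reducers.bytes.per.reducer and
     * maxReducers the role of pig.exec.reducers.max. Assumption: round up, then cap.
     */
    static int estimateReducers(long totalInputBytes, long bytesPerReducer, int maxReducers) {
        if (bytesPerReducer <= 0 || maxReducers <= 0) {
            throw new IllegalArgumentException("bytesPerReducer and maxReducers must be positive");
        }
        if (totalInputBytes <= 0) {
            return 1; // always run at least one reducer
        }
        // Ceiling division written to avoid overflow on very large inputs.
        long needed = totalInputBytes / bytesPerReducer
                + (totalInputBytes % bytesPerReducer == 0 ? 0 : 1);
        return (int) Math.max(1L, Math.min(needed, (long) maxReducers));
    }

    public static void main(String[] args) {
        long tenTerabytes = 10L * 1024 * 1024 * 1024 * 1024;
        long oneGigabyte = 1024L * 1024 * 1024;
        // 10 TB at 1 GB per reducer would need 10240 reducers, so the cap applies.
        System.out.println(estimateReducers(tenTerabytes, oneGigabyte, 999));
    }
}
```

The cap matters for exactly the case described in the issue: a 10TB input no longer runs with a single reduce, but it also cannot request an unbounded number of them.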
[jira] Updated: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword
[ https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Zhang updated PIG-1249:

    Status: Patch Available  (was: Open)
    Affects Version/s: 0.8.0
[jira] Commented: (PIG-1420) Make CONCAT act on all fields of a tuple, instead of just the first two fields of a tuple
[ https://issues.apache.org/jira/browse/PIG-1420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12868273#action_12868273 ]

Dmitriy V. Ryaboy commented on PIG-1420:

I intend to look at the patch today.
[jira] Created: (PIG-1421) [Zebra] Pig script with Zebra data storage brings down name node due to excessive name node call.
[Zebra] Pig script with Zebra data storage brings down name node due to excessive name node call.

    Key: PIG-1421
    URL: https://issues.apache.org/jira/browse/PIG-1421
    Project: Pig
    Issue Type: Bug
    Affects Versions: 0.7.0
    Reporter: Xuefu Zhang
    Assignee: Xuefu Zhang
    Fix For: 0.7.0

Because Pig calls setLocation() on the LoadFunc API on both the frontend and the backend, and Zebra accesses the name node in its implementation, the name node becomes unresponsive because of the number of name node calls.
[jira] Updated: (PIG-1421) [Zebra] Pig script with Zebra data storage brings down name node due to excessive name node call.
[ https://issues.apache.org/jira/browse/PIG-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xuefu Zhang updated PIG-1421:

    Attachment: jira1421.patch
[jira] Updated: (PIG-1421) [Zebra] Pig script with Zebra data storage brings down name node due to excessive name node call.
[ https://issues.apache.org/jira/browse/PIG-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xuefu Zhang updated PIG-1421:

    Status: Patch Available  (was: Open)
    Hadoop Flags: [Reviewed]

Fix the issue by making sure that when setLocation() is called, no name node access is conducted.
[jira] Updated: (PIG-1415) LoadFunc signature is not correct in LoadFunc.getSchema sometimes
[ https://issues.apache.org/jira/browse/PIG-1415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1415:

    Status: Resolved  (was: Patch Available)
    Hadoop Flags: [Reviewed]
    Fix Version/s: 0.7.0  (was: 0.8.0)
    Resolution: Fixed

Patch committed to both 0.7 branch and trunk.

LoadFunc signature is not correct in LoadFunc.getSchema sometimes

    Key: PIG-1415
    URL: https://issues.apache.org/jira/browse/PIG-1415
    Project: Pig
    Issue Type: Bug
    Components: impl
    Affects Versions: 0.7.0
    Reporter: Daniel Dai
    Assignee: Daniel Dai
    Fix For: 0.7.0
    Attachments: PIG-1415-1.patch

The following script does not set the signature correctly when we call LoadFunc.getSchema:

    a = load 'xxx' using TableLoader('xxx') as (a, b, c);

However, if we don't give a schema to a, we get the right signature:

    a = load 'xxx' using TableLoader('xxx');

Diagnosis: The parser generates the LoadClause before it reaches the rule Alias = LoadClause, which is where the signature is actually set on the LOLoad. When we give a schema, the parser calls LOLoad.setSchema(), which internally calls LoadFunc.determineSchema. At that time, the signature has not been set yet. This relates to the change that caches determinedSchema in LOLoad [PIG-1317|https://issues.apache.org/jira/browse/PIG-1317]. Before that change, we would later call LoadFunc.getSchema() again using the right signature. Now that determinedSchema is cached, LoadFunc never gets a chance to see the right signature inside LoadFunc.getSchema().

Solution: We should not call LoadFunc.determineSchema inside LOLoad.setSchema().
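The ordering problem in the PIG-1415 diagnosis can be reproduced with a toy model. Everything below is hypothetical: SketchLoader and SketchLoad only stand in for LoadFunc and LOLoad, the signature value is made up, and the real parser is reduced to a single method. The sketch only shows why caching a schema before the signature is assigned hides the correct signature for good.

```java
/** Toy model of "schema determined and cached before the signature is set". */
final class SchemaCacheOrderingSketch {

    /** Hypothetical loader whose schema lookup depends on a signature set by the caller. */
    static final class SketchLoader {
        private String signature; // stays null until the caller assigns it

        void setSignature(String signature) {
            this.signature = signature;
        }

        String getSchema() {
            return "schema-for-" + signature;
        }
    }

    /** Hypothetical load operator that caches the first schema it determines. */
    static final class SketchLoad {
        private final SketchLoader loader;
        private String cachedSchema;

        SketchLoad(SketchLoader loader) {
            this.loader = loader;
        }

        String determineSchema() {
            if (cachedSchema == null) {
                cachedSchema = loader.getSchema(); // cached on first call, never refreshed
            }
            return cachedSchema;
        }
    }

    /** Simulates parsing a load statement, with or without the early schema lookup. */
    static String parse(boolean determineInsideSetSchema) {
        SketchLoader loader = new SketchLoader();
        SketchLoad load = new SketchLoad(loader);
        if (determineInsideSetSchema) {
            load.determineSchema(); // runs before the signature exists
        }
        loader.setSignature("a_sig"); // happens later, once the alias is known
        return load.determineSchema();
    }

    public static void main(String[] args) {
        System.out.println(parse(true));  // prints "schema-for-null"
        System.out.println(parse(false)); // prints "schema-for-a_sig"
    }
}
```

Dropping the early lookup, as the stated solution does, lets the first and only cached result be computed with the signature in place.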
[jira] Updated: (PIG-566) Dump and store outputs do not match for PigStorage
[ https://issues.apache.org/jira/browse/PIG-566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-566:

    Status: Resolved  (was: Patch Available)
    Hadoop Flags: [Reviewed]
    Resolution: Fixed

Yes, my mistake. Thanks Mridul. Fortunately Gianmarco doesn't listen to me :)

I manually tested the patch; all tests pass. Committed to trunk. Thanks Gianmarco!
[Travel Assistance] - Applications Open for ApacheCon NA 2010
The Travel Assistance Committee is now taking in applications for those wanting to attend ApacheCon North America (NA) 2010, which is taking place between the 1st and 5th November in Atlanta.

The Travel Assistance Committee is looking for people who would like to be able to attend ApacheCon, but who need some financial support in order to be able to get there. There are limited places available, and all applications will be scored on their individual merit.

Financial assistance is available to cover travel to the event, either in part or in full, depending on circumstances. However, the support available for those attending only the barcamp is smaller than that for people attending the whole event. The Travel Assistance Committee aims to support all ApacheCons and cross-project events, so it may be prudent for those in Asia and the EU to wait for an event closer to them.

More information can be found on the main Apache website at http://www.apache.org/travel/index.html, where you will also find a link to the online application and details for submitting.

Applications for travel assistance are now being accepted, and will close on the 7th July 2010.

Good luck to all those who apply.

You are welcome to tweet or blog as appropriate.

Regards,

The Travel Assistance Committee.
[jira] Updated: (PIG-1421) [Zebra] Pig script with Zebra data storage brings down name node due to excessive name node call.
[ https://issues.apache.org/jira/browse/PIG-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xuefu Zhang updated PIG-1421:

    Attachment:     (was: jira1421.patch)
[jira] Updated: (PIG-1421) [Zebra] Pig script with Zebra data storage brings down name node due to excessive name node call.
[ https://issues.apache.org/jira/browse/PIG-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xuefu Zhang updated PIG-1421:

    Attachment: PIG-1421.patch

Fix includes:

1. Make setLocation() lightweight and make sure it performs no name node access. Note that setLocation() is a new API on LoadFunc introduced in 0.7. UDFContext is used for some cases.
2. Remove the code for setting properties (INPUT_FE and INPUT_DELETED_CGS) in TableInputFormat because it's ineffective.
3. Move the logic in #2 to TableInputFormat.setInputPaths() and make sure that it's only done once (because setInputPaths() is called multiple times in the Pig code path).
4. Remove unnecessary list status calls in the Zebra IO layer.
5. Remove the code that makes name node calls for sorted tables in the Pig code path.
6. Make sure that the clob check is only done on the front end.
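Items 1 and 3 of the fix list follow a common pattern: keep the frequently called method cheap, and guard the expensive work so it runs at most once. The sketch below illustrates that pattern only. SketchTableLoader and its counter are hypothetical, the "name node call" is simulated by incrementing a field, and none of this is taken from PIG-1421.patch or the Zebra sources.

```java
/** Hypothetical loader showing a cheap setLocation() and a once-only expensive step. */
final class LightweightLocationSketch {

    static final class SketchTableLoader {
        private String location;
        private boolean inputPathsResolved;
        private int simulatedNameNodeCalls; // stand-in for real file system round trips

        /** Cheap: only records the location, so repeated calls cost nothing remote. */
        void setLocation(String location) {
            this.location = location;
        }

        /** Expensive work happens at most once, however often this is called. */
        void setInputPaths() {
            if (inputPathsResolved) {
                return;
            }
            simulatedNameNodeCalls++; // the one place that would touch the name node
            inputPathsResolved = true;
        }

        String getLocation() {
            return location;
        }

        int getSimulatedNameNodeCalls() {
            return simulatedNameNodeCalls;
        }
    }

    public static void main(String[] args) {
        SketchTableLoader loader = new SketchTableLoader();
        for (int i = 0; i < 5; i++) {
            loader.setLocation("/data/table"); // called repeatedly on front end and back end
            loader.setInputPaths();
        }
        System.out.println(loader.getSimulatedNameNodeCalls()); // prints 1
    }
}
```

A real loader would also have to decide whether a changed location should reset the guard; the sketch leaves that out.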
[jira] Updated: (PIG-1421) [Zebra] Pig script with Zebra data storage brings down name node due to excessive name node call.
[ https://issues.apache.org/jira/browse/PIG-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xuefu Zhang updated PIG-1421:

    Status: Patch Available  (was: Open)
[jira] Commented: (PIG-1421) [Zebra] Pig script with Zebra data storage brings down name node due to excessive name node call.
[ https://issues.apache.org/jira/browse/PIG-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12868368#action_12868368 ]

Yan Zhou commented on PIG-1421:

Local Hudson results are as follows:

[exec] -1 overall.
[exec]
[exec] +1 @author. The patch does not contain any @author tags.
[exec]
[exec] -1 tests included. The patch doesn't appear to include any new or modified tests.
[exec] Please justify why no tests are needed for this patch.
[exec]
[exec] +1 javadoc. The javadoc tool did not generate any warning messages.
[exec]
[exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
[exec]
[exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
[exec]
[exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.

No test case is added, as the problem is related to excessive name node calls on a real cluster. We manually checked the fix to confirm that the name node works without any hiccups.
[jira] Commented: (PIG-1421) [Zebra] Pig script with Zebra data storage brings down name node due to excessive name node call.
[ https://issues.apache.org/jira/browse/PIG-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12868369#action_12868369 ]

Yan Zhou commented on PIG-1421:

+1
[jira] Commented: (PIG-1421) [Zebra] Pig script with Zebra data storage brings down name node due to excessive name node call.
[ https://issues.apache.org/jira/browse/PIG-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12868375#action_12868375 ]

Xuefu Zhang commented on PIG-1421:

The original problem happens only in a stressed scenario, so it's difficult to provide a unit test case to cover it. Given this, the Hudson result can be ignored.
[jira] Updated: (PIG-1421) [Zebra] Pig script with Zebra data storage brings down name node due to excessive name node call.
[ https://issues.apache.org/jira/browse/PIG-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yan Zhou updated PIG-1421:

    Status: Resolved  (was: Patch Available)
    Resolution: Fixed

Committed to the trunk and the 0.7 branch.