[jira] Updated: (PIG-1351) [Zebra] No type check when we write to the basic table
[ https://issues.apache.org/jira/browse/PIG-1351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Wang updated PIG-1351:
---------------------------
    Status: Open (was: Patch Available)

[Zebra] No type check when we write to the basic table
------------------------------------------------------
    Key: PIG-1351
    URL: https://issues.apache.org/jira/browse/PIG-1351
    Project: Pig
    Issue Type: Improvement
    Affects Versions: 0.7.0
    Reporter: Chao Wang
    Assignee: Chao Wang
    Fix For: 0.8.0
    Attachments: PIG-1351.patch, PIG-1351.patch

In Zebra, there is no type check when writing to a basic table. For example, given the schema f1:int, f2:string, we can still write the tuple (abc, 123) without any error, which is clearly undesirable.

To address this, we will perform a limited amount of type checking in Zebra: each writer checks only its first row. This serves purely as a sanity check for cases where users specify the output schema incorrectly. We do NOT perform rigorous type checking on every row, for performance reasons.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
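
The first-row check described above could look roughly like the sketch below. This is only an illustration under stated assumptions, not the code in PIG-1351.patch: the class FirstRowTypeChecker, the ColumnType enum, and the beforeInsert() method are hypothetical names, and rows are modelled as plain Java lists rather than Pig tuples.

```java
import java.util.List;

/** Hypothetical sketch of a first-row-only type check; not Zebra's actual code. */
final class FirstRowTypeChecker {
    /** Simplified column types, for illustration only. */
    enum ColumnType {
        INT(Integer.class), STRING(String.class);

        final Class<?> javaClass;

        ColumnType(Class<?> javaClass) {
            this.javaClass = javaClass;
        }
    }

    private final List<ColumnType> schema;
    private boolean firstRowChecked = false;

    FirstRowTypeChecker(List<ColumnType> schema) {
        this.schema = schema;
    }

    /** Validates only the first row this writer sees; later rows are not inspected. */
    void beforeInsert(List<Object> row) {
        if (firstRowChecked) {
            return; // sanity check already done, skip the per-row cost
        }
        if (row.size() != schema.size()) {
            throw new IllegalArgumentException(
                "Expected " + schema.size() + " columns but got " + row.size());
        }
        for (int i = 0; i < schema.size(); i++) {
            Object value = row.get(i);
            // Null values are accepted here; that is an assumption of this sketch.
            if (value != null && !schema.get(i).javaClass.isInstance(value)) {
                throw new IllegalArgumentException(
                    "Column " + i + " expects " + schema.get(i)
                        + " but got " + value.getClass().getSimpleName());
            }
        }
        firstRowChecked = true;
    }
}
```

With the schema f1:int, f2:string, a writer whose first row is (abc, 123) fails immediately, while a writer whose first row is (123, abc) accepts every later row without further checks. That is the trade-off the description makes: cheap detection of a wrong output schema, with no guarantee for later rows.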
[jira] Updated: (PIG-1351) [Zebra] No type check when we write to the basic table
[ https://issues.apache.org/jira/browse/PIG-1351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Wang updated PIG-1351:
---------------------------
    Attachment: (was: PIG-1351.patch)
[jira] Updated: (PIG-1351) [Zebra] No type check when we write to the basic table
[ https://issues.apache.org/jira/browse/PIG-1351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Wang updated PIG-1351:
---------------------------
    Attachment: (was: PIG-1351.patch)
[jira] Updated: (PIG-1351) [Zebra] No type check when we write to the basic table
[ https://issues.apache.org/jira/browse/PIG-1351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Wang updated PIG-1351:
---------------------------
    Attachment: PIG-1351.patch
[jira] Updated: (PIG-1375) [Zebra] To support writing multiple Zebra tables through Pig
[ https://issues.apache.org/jira/browse/PIG-1375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Wang updated PIG-1375:
---------------------------
    Attachment: PIG-1375.patch

[Zebra] To support writing multiple Zebra tables through Pig
------------------------------------------------------------
    Key: PIG-1375
    URL: https://issues.apache.org/jira/browse/PIG-1375
    Project: Pig
    Issue Type: New Feature
    Affects Versions: 0.8.0
    Reporter: Chao Wang
    Assignee: Chao Wang
    Fix For: 0.8.0
    Attachments: PIG-1375.patch, PIG-1375.patch

Zebra already supports multiple outputs for map/reduce jobs, but not when users access Zebra through Pig. This jira addresses that gap: we plan to support writing to multiple output tables through Pig as well.

We propose to support the following Pig store statements with multiple outputs:

    store relation into 'loc1,loc2,loc3' using org.apache.hadoop.zebra.pig.TableStorer('storagehint_string', 'complete name of your custom partition class', 'some arguments to partition class'); /* if partition class arguments are needed */

    store relation into 'loc1,loc2,loc3' using org.apache.hadoop.zebra.pig.TableStorer('storagehint_string', 'complete name of your custom partition class'); /* if no partition class arguments are needed */

Note that users specify up to three arguments: the storage hint string, the complete name of the partition class, and the partition class arguments string.
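
To illustrate how a comma-separated location string and a user-supplied partition class might fit together, here is a minimal sketch. It is not taken from PIG-1375.patch: the MultiOutputSketch class, the OutputPartitioner interface, and both method names are hypothetical, and rows are modelled as plain Java lists.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of routing rows to one of several output tables. */
final class MultiOutputSketch {
    /** Stand-in for a user-supplied partition class; not Zebra's real interface. */
    interface OutputPartitioner {
        /** Returns the index of the output table that should receive the row. */
        int partition(List<Object> row, int numOutputs);
    }

    /** Splits a location string such as 'loc1,loc2,loc3' into individual locations. */
    static List<String> parseLocations(String commaSeparated) {
        List<String> locations = new ArrayList<>();
        for (String part : commaSeparated.split(",")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                locations.add(trimmed);
            }
        }
        return locations;
    }

    /** Picks the output location for a row, rejecting out-of-range partition indexes. */
    static String route(List<Object> row, List<String> locations, OutputPartitioner partitioner) {
        int index = partitioner.partition(row, locations.size());
        if (index < 0 || index >= locations.size()) {
            throw new IllegalStateException(
                "Partition index " + index + " is outside 0.." + (locations.size() - 1));
        }
        return locations.get(index);
    }
}
```

In this sketch the order of the locations in the store statement defines the index space the partition class works in, so a partitioner that returns 0 sends the row to loc1. Whether Zebra validates the returned index in the same way is an assumption here.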
[jira] Updated: (PIG-1375) [Zebra] To support writing multiple Zebra tables through Pig
[ https://issues.apache.org/jira/browse/PIG-1375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Wang updated PIG-1375:
---------------------------
    Status: Patch Available (was: Open)
    Affects Version/s: 0.7.0 (was: 0.8.0)
[jira] Updated: (PIG-1375) [Zebra] To support writing multiple Zebra tables through Pig
[ https://issues.apache.org/jira/browse/PIG-1375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Wang updated PIG-1375:
---------------------------
    Attachment: PIG-1375.patch

Thanks to Xuefu for the feedback. I updated the patch to incorporate comments 2 and 4.

For comment 1): the indentation change is incidental; it makes the files touched by this feature follow Zebra's indentation policy of two spaces.

For comment 3): the flag idea needs to be justified by further performance profiling. The check here should be trivial compared with other operations such as generateKey() and insert().

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1342) [Zebra] Avoid making unnecessary name node calls for writes in Zebra
[ https://issues.apache.org/jira/browse/PIG-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Wang updated PIG-1342:
---------------------------
    Status: Patch Available (was: Open)

[Zebra] Avoid making unnecessary name node calls for writes in Zebra
--------------------------------------------------------------------
    Key: PIG-1342
    URL: https://issues.apache.org/jira/browse/PIG-1342
    Project: Pig
    Issue Type: Improvement
    Affects Versions: 0.6.0, 0.7.0
    Reporter: Chao Wang
    Assignee: Chao Wang
    Fix For: 0.8.0
    Attachments: PIG-1342.patch, PIG-1342.patch

Currently, table-level and column-group-level metadata is extracted from the job configuration object and written to HDFS within checkOutputSpec(). Later, the back-end writers open these files to read the metadata they need for writing. This puts extra load on the name node, since every writer must make name node calls to open the files.

We propose the following approach: back-end writers extract the metadata directly from the job configuration object, rather than making name node calls and reading it from HDFS.
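
The proposal can be pictured with the sketch below, which is an illustration only and not the code in PIG-1342.patch. A plain Map stands in for the job configuration object so that the example is self-contained; the ConfMetaSketch class, the two property keys, and the publish() and load() methods are all hypothetical.

```java
import java.util.Map;

/** Hypothetical sketch: writers read table metadata from the job configuration. */
final class ConfMetaSketch {
    // Both property names are made up for this example.
    static final String SCHEMA_KEY = "example.zebra.output.schema";
    static final String STORAGE_HINT_KEY = "example.zebra.output.storagehint";

    /** Front end: record the metadata once in the job configuration. */
    static void publish(Map<String, String> jobConf, String schema, String storageHint) {
        jobConf.put(SCHEMA_KEY, schema);
        jobConf.put(STORAGE_HINT_KEY, storageHint);
    }

    /**
     * Back end: each writer reads the metadata from its copy of the configuration,
     * so no file has to be opened and no name node call is made for this step.
     * Returns {schema, storageHint}.
     */
    static String[] load(Map<String, String> jobConf) {
        String schema = jobConf.get(SCHEMA_KEY);
        String storageHint = jobConf.get(STORAGE_HINT_KEY);
        if (schema == null || storageHint == null) {
            throw new IllegalStateException("Table metadata is missing from the job configuration");
        }
        return new String[] {schema, storageHint};
    }
}
```

The point of the design is that the job configuration is already shipped to every task, so reading from it adds no per-writer name node traffic, whereas opening a metadata file costs one name node call per writer.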
[jira] Updated: (PIG-1342) [Zebra] Avoid making unnecessary name node calls for writes in Zebra
[ https://issues.apache.org/jira/browse/PIG-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Wang updated PIG-1342:
---------------------------
    Attachment: PIG-1342.patch
[jira] Commented: (PIG-1342) [Zebra] Avoid making unnecessary name node calls for writes in Zebra
[ https://issues.apache.org/jira/browse/PIG-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859523#action_12859523 ]

Chao Wang commented on PIG-1342:
--------------------------------

Rebased the patch against the latest trunk.
[jira] Updated: (PIG-1342) [Zebra] Avoid making unnecessary name node calls for writes in Zebra
[ https://issues.apache.org/jira/browse/PIG-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Wang updated PIG-1342:
---------------------------
    Status: Open (was: Patch Available)
[jira] Updated: (PIG-1342) [Zebra] Avoid making unnecessary name node calls for writes in Zebra
[ https://issues.apache.org/jira/browse/PIG-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Wang updated PIG-1342:
---------------------------
    Status: Patch Available (was: Open)

According to the test result log, the test case TestFinish failed. I ran this test case manually against the Pig trunk plus the patch, and it passed. This looks like an environment issue, so I am resubmitting the patch.