[jira] Created: (PIG-833) Storage access layer
Storage access layer
Key: PIG-833
URL: https://issues.apache.org/jira/browse/PIG-833
Project: Pig
Issue Type: New Feature
Reporter: Jay Tang
A layer is needed to provide a high-level data access abstraction and a tabular view of data in Hadoop; it could free Pig users from implementing their own data storage/retrieval code. This layer should also include a columnar storage format in order to provide fast data projection, CPU/space-efficient data serialization, and a schema language to manage physical storage metadata. Eventually it could also support predicate pushdown for further performance improvement. Initially, this layer could be a contrib project in Pig and become a Hadoop subproject later on.
--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-833) Storage access layer
[ https://issues.apache.org/jira/browse/PIG-833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742201#action_12742201 ]
Jay Tang commented on PIG-833:
Zebra has a dependency on TFile, which is available in Hadoop 0.20; that's why the compilation instructions are more complicated. A new wiki at http://wiki.apache.org/pig/zebra will provide more information on Zebra.
Storage access layer
Key: PIG-833
URL: https://issues.apache.org/jira/browse/PIG-833
Project: Pig
Issue Type: New Feature
Reporter: Jay Tang
Attachments: hadoop20.jar.bz2, PIG-833-zebra.patch, PIG-833-zebra.patch.bz2, PIG-833-zebra.patch.bz2, TEST-org.apache.hadoop.zebra.pig.TestCheckin1.txt, test.out, zebra-javadoc.tgz
[jira] Assigned: (PIG-1140) [zebra] Use of Hadoop 2.0 APIs
[ https://issues.apache.org/jira/browse/PIG-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jay Tang reassigned PIG-1140:
Assignee: Xuefu Zhang
[zebra] Use of Hadoop 2.0 APIs
Key: PIG-1140
URL: https://issues.apache.org/jira/browse/PIG-1140
Project: Pig
Issue Type: Improvement
Affects Versions: 0.6.0
Reporter: Yan Zhou
Assignee: Xuefu Zhang
Fix For: 0.7.0
Attachments: zebra.0209
Currently, Zebra is still using already deprecated Hadoop 1.8 APIs. Need to upgrade to its 2.0 APIs.
[jira] Assigned: (PIG-1137) [zebra] get* methods of Zebra Map/Reduce APIs need improvements
[ https://issues.apache.org/jira/browse/PIG-1137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jay Tang reassigned PIG-1137:
Assignee: Yan Zhou
[zebra] get* methods of Zebra Map/Reduce APIs need improvements
Key: PIG-1137
URL: https://issues.apache.org/jira/browse/PIG-1137
Project: Pig
Issue Type: Improvement
Affects Versions: 0.6.0
Reporter: Yan Zhou
Assignee: Yan Zhou
Fix For: 0.7.0
Currently the set* methods take external Zebra objects, namely objects of ZebraStorageHint, ZebraSchema, ZebraSortInfo or ZebraProjection. Correspondingly, the get* methods should return such objects instead of a String or Zebra internal objects like Schema.
[jira] Updated: (PIG-1139) [zebra] Encapsulation of check of ZebraSortInfo by a Zebra reader; the check by a writer could be better encapsulated
[ https://issues.apache.org/jira/browse/PIG-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jay Tang updated PIG-1139:
Fix Version/s: 0.8.0 (was: 0.7.0)
[zebra] Encapsulation of check of ZebraSortInfo by a Zebra reader; the check by a writer could be better encapsulated
Key: PIG-1139
URL: https://issues.apache.org/jira/browse/PIG-1139
Project: Pig
Issue Type: Improvement
Affects Versions: 0.6.0
Reporter: Yan Zhou
Priority: Minor
Fix For: 0.8.0
Currently, the ZebraSortInfo that a user passes to the Map/Reduce writer, namely through BasicTableOutputFormat.setStorageInfo, is sanity-checked by SortInfo.parse(), although the whole check could be performed in a method that takes a ZebraSortInfo object. On the reader side, the sanity check is left entirely to the caller of the TableInputFormat.requireSortedTable method; it should be encapsulated in a new SortInfo method.
[jira] Updated: (PIG-1137) [zebra] get* methods of Zebra Map/Reduce APIs need improvements
[ https://issues.apache.org/jira/browse/PIG-1137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jay Tang updated PIG-1137:
Fix Version/s: 0.8.0 (was: 0.7.0)
[zebra] get* methods of Zebra Map/Reduce APIs need improvements
Key: PIG-1137
URL: https://issues.apache.org/jira/browse/PIG-1137
Project: Pig
Issue Type: Improvement
Affects Versions: 0.6.0
Reporter: Yan Zhou
Assignee: Yan Zhou
Fix For: 0.8.0
[jira] Updated: (PIG-1120) [zebra] should support using org.apache.hadoop.zebra.pig.TableStorer() if user does not want to specify storage hint
[ https://issues.apache.org/jira/browse/PIG-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jay Tang updated PIG-1120:
Fix Version/s: 0.8.0 (was: 0.7.0)
[zebra] should support using org.apache.hadoop.zebra.pig.TableStorer() if user does not want to specify storage hint
Key: PIG-1120
URL: https://issues.apache.org/jira/browse/PIG-1120
Project: Pig
Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Jing Huang
Fix For: 0.8.0
If the user doesn't want to specify a storage hint, the current Zebra implementation only supports org.apache.hadoop.zebra.pig.TableStorer('') (note the empty string argument). We should also support org.apache.hadoop.zebra.pig.TableStorer(), as we already do for org.apache.hadoop.zebra.pig.TableLoader().
Sample Pig script:
    register /grid/0/dev/hadoopqa/jars/zebra.jar;
    a = load '1.txt' as (a:int, b:float,c:long,d:double,e:chararray,f:bytearray,r1(f1:chararray,f2:chararray),m1:map[]);
    b = load '2.txt' as (a:int, b:float,c:long,d:double,e:chararray,f:bytearray,r1(f1:chararray,f2:chararray),m1:map[]);
    c = join a by a, b by a;
    d = foreach c generate a::a, a::b, b::c;
    describe d;
    dump d;
    store d into 'join3' using org.apache.hadoop.zebra.pig.TableStorer('');
    --this will fail
    --store d into 'join3' using org.apache.hadoop.zebra.pig.TableStorer( );
[jira] Resolved: (PIG-1138) [zebra] Support of PIG's new Load/Store Interfaces
[ https://issues.apache.org/jira/browse/PIG-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jay Tang resolved PIG-1138.
Resolution: Duplicate
Fix Version/s: 0.7.0
Duplicate of PIG-1140
[zebra] Support of PIG's new Load/Store Interfaces
Key: PIG-1138
URL: https://issues.apache.org/jira/browse/PIG-1138
Project: Pig
Issue Type: Improvement
Affects Versions: 0.7.0
Reporter: Yan Zhou
Fix For: 0.7.0
[jira] Commented: (PIG-1223) [zebra] Add cli to help admin zebra
[ https://issues.apache.org/jira/browse/PIG-1223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12848329#action_12848329 ]
Jay Tang commented on PIG-1223:
Yongqiang, could you comment on what kind of admin features you're looking for?
[zebra] Add cli to help admin zebra
Key: PIG-1223
URL: https://issues.apache.org/jira/browse/PIG-1223
Project: Pig
Issue Type: Wish
Reporter: He Yongqiang
[jira] Updated: (PIG-1306) [zebra] Support of locally sorted input splits
[ https://issues.apache.org/jira/browse/PIG-1306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jay Tang updated PIG-1306:
Fix Version/s: 0.7.0
[zebra] Support of locally sorted input splits
Key: PIG-1306
URL: https://issues.apache.org/jira/browse/PIG-1306
Project: Pig
Issue Type: Improvement
Reporter: Yan Zhou
Assignee: Yan Zhou
Fix For: 0.7.0
Currently, Zebra supports sorted or unsorted input splits on sorted tables or sorted table unions. The sorted input splits are based on key ranges that do not overlap, so the splits are in effect globally sorted: each split is locally sorted, and the key ranges of different splits do not overlap. The biggest problem with key-range splits is the performance hit when data skew is present, particularly when a key range consists solely of one duplicated key. That chunk of duplicate keys is virtually unsplittable regardless of how many mappers are available: it has to be processed by a single mapper. On the other hand, there are scenarios where globally sorted splits are overkill and locally sorted splits are good enough. One example is the use of Zebra sorted tables as the probe table in a map-side merge inner join.
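To make the skew problem in PIG-1306 concrete, here is a minimal Python sketch. It is not Zebra code: the function names and the row-count splitting policy are assumptions made purely for illustration. It contrasts key-range splits, where all rows sharing a key must land in the same split, with locally sorted splits that are cut by row count only.

```python
from itertools import groupby


def key_range_splits(sorted_keys, num_splits):
    """Globally sorted splits: a boundary may only fall between distinct keys,
    so a run of duplicate keys can never be divided."""
    runs = [list(group) for _, group in groupby(sorted_keys)]
    target = max(1, len(sorted_keys) // num_splits)
    splits, current = [], []
    for run in runs:
        current.extend(run)
        # Close the split once it is big enough, keeping room for a last one.
        if len(current) >= target and len(splits) < num_splits - 1:
            splits.append(current)
            current = []
    if current:
        splits.append(current)
    return splits


def locally_sorted_splits(sorted_keys, num_splits):
    """Locally sorted splits: cut by row count. Each split is sorted,
    but neighbouring splits may share a key."""
    size = max(1, -(-len(sorted_keys) // num_splits))  # ceiling division
    return [sorted_keys[i:i + size] for i in range(0, len(sorted_keys), size)]


# Hypothetical sample: 12 rows, heavily skewed on key 5.
keys = [1] + [5] * 9 + [7, 8]
global_sizes = [len(s) for s in key_range_splits(keys, 4)]
local_splits = locally_sorted_splits(keys, 4)
local_sizes = [len(s) for s in local_splits]
print(global_sizes, local_sizes)
```

With this skewed sample, the key-range strategy produces split sizes [10, 2] even though four splits were requested, while the row-count strategy produces four sorted splits of three rows each, at the cost of key 5 appearing in every split.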
[jira] Commented: (PIG-1331) Owl Hadoop Table Management Service
[ https://issues.apache.org/jira/browse/PIG-1331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850347#action_12850347 ]
Jay Tang commented on PIG-1331:
Owl has an internal metastore whose relational table and partition model is similar to that of Hive's metastore. Owl goes beyond this and provides a uniform data access mechanism on top of multiple storage formats. This interface can be leveraged by Pig and MapReduce applications. There is room for collaboration between Owl and Hive so that we could eventually converge on a common metastore for Hadoop.
Owl Hadoop Table Management Service
Key: PIG-1331
URL: https://issues.apache.org/jira/browse/PIG-1331
Project: Pig
Issue Type: New Feature
Reporter: Jay Tang
This JIRA is a proposal to create a Hadoop table management service: Owl. Today, MapReduce and Pig applications interact directly with HDFS directories and files and must deal with low-level data management issues such as storage format, serialization/compression schemes, data layout, and efficient data access, often with different solutions. Owl aims to provide a standard way to address these issues and to abstract away the complexities of reading/writing huge amounts of data from/to HDFS. Owl has a data access API that is modeled after the traditional Hadoop InputFormat and a management API to manipulate Owl objects. This JIRA is related to PIG-823 (Hadoop Metadata Service), as Owl has an internal metadata store. Owl integrates with different storage modules, such as Zebra, through a pluggable architecture. Initially, the proposal is to submit Owl as a Pig contrib project. Over time, it makes sense to move it to a Hadoop subproject.
[jira] Commented: (PIG-1331) Owl Hadoop Table Management Service
[ https://issues.apache.org/jira/browse/PIG-1331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850638#action_12850638 ]
Jay Tang commented on PIG-1331:
Owl's data access API, OwlInputFormat, provides a uniform API to access data stored in different storage formats such as Zebra, RCFile, and SequenceFile. It is a single data access abstraction on top of disparate data.
Owl Hadoop Table Management Service
Key: PIG-1331
URL: https://issues.apache.org/jira/browse/PIG-1331
Project: Pig
Issue Type: New Feature
Affects Versions: 0.8.0
Reporter: Jay Tang
Attachments: owl.contrib.3.tgz
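The idea of a single access abstraction over disparate formats can be sketched as a driver registry. This is illustrative Python, not Owl's OwlInputFormat: the names DRIVERS and read_records are hypothetical, and CSV and JSON Lines merely stand in for formats such as Zebra, RCFile, and SequenceFile.

```python
import csv
import io
import json


def read_csv(payload):
    """Stand-in driver: parse CSV text into a list of dict records."""
    return list(csv.DictReader(io.StringIO(payload)))


def read_jsonl(payload):
    """Stand-in driver: parse JSON Lines text into a list of dict records."""
    return [json.loads(line) for line in payload.splitlines() if line.strip()]


# Hypothetical registry: one entry per pluggable storage format.
DRIVERS = {"csv": read_csv, "jsonl": read_jsonl}


def read_records(storage_format, payload):
    """Single entry point: callers never touch format-specific code."""
    try:
        driver = DRIVERS[storage_format]
    except KeyError:
        raise ValueError(f"no driver registered for {storage_format!r}") from None
    return driver(payload)


from_csv = read_records("csv", "id,name\n1,a\n2,b\n")
from_jsonl = read_records(
    "jsonl", '{"id": "1", "name": "a"}\n{"id": "2", "name": "b"}\n'
)
print(from_csv == from_jsonl)
```

The caller sees the same list of records whichever driver produced it; supporting another format means registering one more function rather than changing any consumer.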
[jira] Commented: (PIG-1331) Owl Hadoop Table Management Service
[ https://issues.apache.org/jira/browse/PIG-1331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850688#action_12850688 ]
Jay Tang commented on PIG-1331:
Carl, from a serialization/deserialization perspective, the functionality appears similar. Owl also handles other storage layer interactions like data pruning. Owl supports partition and column pruning; we plan to support row pruning via predicate pushdown. The goal is to push data filtering work down. If a storage layer does not support a certain filter capability, Owl would provide an implementation.
Owl Hadoop Table Management Service
Key: PIG-1331
URL: https://issues.apache.org/jira/browse/PIG-1331
Project: Pig
Issue Type: New Feature
Affects Versions: 0.8.0
Reporter: Jay Tang
Attachments: owl.contrib.3.tgz
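The fallback behavior described in this comment might look roughly like the following Python sketch. Everything here is hypothetical: the store classes, the supports_row_filter flag, and the (column, value) equality filter are assumptions, not Owl APIs. The access layer pushes the row filter down when the storage driver supports it and otherwise filters the rows itself.

```python
class ColumnOnlyStore:
    """Hypothetical driver that can prune columns but cannot filter rows."""

    supports_row_filter = False

    def __init__(self, rows):
        self._rows = list(rows)

    def scan(self, columns, where=None):
        # Column pruning only; any row filter passed in is ignored.
        for row in self._rows:
            yield {c: row[c] for c in columns}


class FilteringStore(ColumnOnlyStore):
    """Hypothetical driver that also evaluates an equality filter itself."""

    supports_row_filter = True

    def scan(self, columns, where=None):
        for row in self._rows:
            if where is None or row[where[0]] == where[1]:
                yield {c: row[c] for c in columns}


def read_table(store, columns, where=None):
    """Return the projected rows; `where` is an optional (column, value) pair."""
    if where is None:
        return list(store.scan(columns))
    if store.supports_row_filter:
        # Pushdown: the storage layer does the filtering.
        return list(store.scan(columns, where))
    # Fallback: fetch the filter column too, filter here, then project.
    column, value = where
    needed = list(dict.fromkeys(list(columns) + [column]))
    return [
        {c: row[c] for c in columns}
        for row in store.scan(needed)
        if row[column] == value
    ]


rows = [
    {"id": 1, "region": "us", "v": 10},
    {"id": 2, "region": "eu", "v": 20},
    {"id": 3, "region": "us", "v": 30},
]
pushed = read_table(FilteringStore(rows), ["id"], ("region", "us"))
fallback = read_table(ColumnOnlyStore(rows), ["id"], ("region", "us"))
print(pushed, fallback)
```

Both calls return the same rows; the only difference is where the filtering work happens, which is the point of pushing it down when the storage layer is capable.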
[jira] Commented: (PIG-1331) Owl Hadoop Table Management Service
[ https://issues.apache.org/jira/browse/PIG-1331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851230#action_12851230 ]
Jay Tang commented on PIG-1331:
Ashish, the goal of Owl is to provide a table-like abstraction to manage Hadoop data. The design would allow customer MapReduce applications, Pig Latin, and even the Hive query language to consume data via Owl's interface. Our vision is to build a full data life cycle management stack that encompasses data creation, notification, consumption, retention, and security management. Owl would make things easier for a MapReduce application writer or for someone building another query processing language on top of it. We will update the Owl wiki page with more detailed information.
Owl Hadoop Table Management Service
Key: PIG-1331
URL: https://issues.apache.org/jira/browse/PIG-1331
Project: Pig
Issue Type: New Feature
Affects Versions: 0.8.0
Reporter: Jay Tang
Attachments: owl.contrib.3.tgz
[jira] Commented: (PIG-1331) Owl Hadoop Table Management Service
[ https://issues.apache.org/jira/browse/PIG-1331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851998#action_12851998 ]
Jay Tang commented on PIG-1331:
There seems to be an issue with the Maven repo. We'll attach jar files and update the build scripts.
Owl Hadoop Table Management Service
Key: PIG-1331
URL: https://issues.apache.org/jira/browse/PIG-1331
Project: Pig
Issue Type: New Feature
Affects Versions: 0.8.0
Reporter: Jay Tang
Attachments: build.log, owl.contrib.3.tgz
[jira] Updated: (PIG-1367) [zebra] Map-side Cogroup Test case is needed on 0.7 if the feature is supported in 0.7
[ https://issues.apache.org/jira/browse/PIG-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jay Tang updated PIG-1367:
Fix Version/s: site (was: 0.7.0)
[zebra] Map-side Cogroup Test case is needed on 0.7 if the feature is supported in 0.7
Key: PIG-1367
URL: https://issues.apache.org/jira/browse/PIG-1367
Project: Pig
Issue Type: New Feature
Affects Versions: 0.7.0
Reporter: Yan Zhou
Fix For: site
PIG-1315 has the Zebra support for this feature and for the map-side group-by. It also has the test case for map-side COGROUP, while the test case for map-side GROUP-BY is in PIG-1357. However, PIG-1315 was committed to the trunk as a whole but committed to the 0.7 branch without the map-side group-by test case, because Pig has yet to decide whether the feature will be in the 0.7 release. This JIRA is created for tracking purposes should Pig decide to support map-side COGROUP in 0.7. If not, it should eventually be marked invalid.
--
This message is automatically generated by JIRA.
- If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
- For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Updated: (PIG-1367) [zebra] Map-side Cogroup Test case is needed on 0.7 if the feature is supported in 0.7
[ https://issues.apache.org/jira/browse/PIG-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jay Tang updated PIG-1367:
Fix Version/s: 0.8.0 (was: site)
[zebra] Map-side Cogroup Test case is needed on 0.7 if the feature is supported in 0.7
Key: PIG-1367
URL: https://issues.apache.org/jira/browse/PIG-1367
Project: Pig
Issue Type: New Feature
Affects Versions: 0.7.0
Reporter: Yan Zhou
Fix For: 0.8.0
[jira] Updated: (PIG-1350) [Zebra] Zebra column names cannot have leading _
[ https://issues.apache.org/jira/browse/PIG-1350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jay Tang updated PIG-1350:
Fix Version/s: 0.8.0 (was: 0.7.0)
[Zebra] Zebra column names cannot have leading _
Key: PIG-1350
URL: https://issues.apache.org/jira/browse/PIG-1350
Project: Pig
Issue Type: Improvement
Reporter: Xuefu Zhang
Assignee: Xuefu Zhang
Fix For: 0.8.0
Attachments: pig-1350.patch, pig-1350.patch
Disallowing '_' as the leading character of column names in a Zebra schema is too restrictive; the restriction should be lifted.
[jira] Commented: (PIG-1331) Owl Hadoop Table Management Service
[ https://issues.apache.org/jira/browse/PIG-1331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12863835#action_12863835 ]
Jay Tang commented on PIG-1331:
Yes, Jeff. Owl, as a table management service, has a metadata module. Please see http://wiki.apache.org/pig/owl for more information.
Owl Hadoop Table Management Service
Key: PIG-1331
URL: https://issues.apache.org/jira/browse/PIG-1331
Project: Pig
Issue Type: New Feature
Affects Versions: 0.8.0
Reporter: Jay Tang
Assignee: Ajay Kidave
Fix For: 0.8.0
Attachments: anttestoutput.tgz, build.log, ivy_version.patch, owl.contrib.3.tgz, owl.contrib.4.tar.gz