[jira] Commented: (PIG-1077) [Zebra] to support record(row)-based file split in Zebra's TableInputFormat
[ https://issues.apache.org/jira/browse/PIG-1077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780998#action_12780998 ]

Alan Gates commented on PIG-1077:
---------------------------------

Patch checked into 0.6 branch.

> [Zebra] to support record(row)-based file split in Zebra's TableInputFormat
> ----------------------------------------------------------------------------
>
>                 Key: PIG-1077
>                 URL: https://issues.apache.org/jira/browse/PIG-1077
>             Project: Pig
>          Issue Type: New Feature
>    Affects Versions: 0.4.0
>            Reporter: Chao Wang
>            Assignee: Chao Wang
>             Fix For: 0.6.0, 0.7.0
>         Attachments: patch_Pig1077
>
> TFile currently supports splitting by record sequence number (see HADOOP-6218). We want to use this to provide record(row)-based input split support in Zebra.
>
> One prominent benefit: for very large data files we can create much finer-grained input splits than before, when we could only create one big split per big file.
>
> In more detail, the new row-based getSplits() works by default (when the user does not specify the number of splits to generate) as follows:
>
> 1) Select the biggest column group in terms of data size, split each of its TFiles at HDFS block boundaries (64 MB or 128 MB), and collect the resulting physical byte offsets per TFile. For example, assume for the 1st TFile we get offset1, offset2, ..., offset10.
>
> 2) Invoke TFile.getRecordNumNear(long offset) to get the record number of a key-value pair near each byte offset. For the example above, say we get recordNum1, recordNum2, ..., recordNum10.
>
> 3) Stitch together [0, recordNum1], [recordNum1+1, recordNum2], ..., [recordNum9+1, recordNum10], [recordNum10+1, lastRecordNum] across all column groups to form 11 record-based input splits for the 1st TFile.
>
> 4) For each input split, create a TFile scanner through TFile.createScannerByRecordNum(long beginRecNum, long endRecNum).
>
> Note: the conversion from byte offset to record number is done by each mapper rather than at job initialization, because the conversion incurs some TFile reading overhead.
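For illustration only, here is a minimal Java sketch of the scheme described in steps 1-4 above. It is written against the TFile.Reader methods named in the description and in HADOOP-6218 (getRecordNumNear, createScannerByRecordNum, getEntryCount); the class and helper names (RowSplitSketch, RowSplit, toRowSplits, scanSplit) are hypothetical and are not part of the actual Zebra patch. The inclusive [begin, end] ranges follow the description's notation; the real API's end-bound convention may differ by one.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.file.tfile.TFile;

public class RowSplitSketch {

  /** Hypothetical record-range split: [beginRecNum, endRecNum], inclusive per the description. */
  static class RowSplit {
    final long beginRecNum;
    final long endRecNum;
    RowSplit(long begin, long end) { this.beginRecNum = begin; this.endRecNum = end; }
  }

  /**
   * Steps 2-3: turn the block-boundary byte offsets of one TFile into contiguous
   * record-number ranges. In the actual design this conversion happens in each
   * mapper, not at job initialization; it is shown in one place here for clarity.
   */
  static List<RowSplit> toRowSplits(TFile.Reader reader, long[] blockOffsets)
      throws IOException {
    List<RowSplit> splits = new ArrayList<RowSplit>();
    long begin = 0;
    for (long offset : blockOffsets) {
      // Record number of a key-value pair near this byte offset (HADOOP-6218).
      long recNum = reader.getRecordNumNear(offset);
      if (recNum >= begin) {
        splits.add(new RowSplit(begin, recNum));
        begin = recNum + 1;
      }
    }
    // Trailing split covering the remaining records up to the last one in the file.
    long lastRecNum = reader.getEntryCount() - 1;
    if (begin <= lastRecNum) {
      splits.add(new RowSplit(begin, lastRecNum));
    }
    return splits;
  }

  /** Step 4: each mapper scans only its own record range. */
  static void scanSplit(TFile.Reader reader, RowSplit split) throws IOException {
    // Adjust the end bound if the API treats endRecNum as exclusive.
    TFile.Reader.Scanner scanner =
        reader.createScannerByRecordNum(split.beginRecNum, split.endRecNum);
    try {
      while (!scanner.atEnd()) {
        TFile.Reader.Scanner.Entry entry = scanner.entry();
        // ... deserialize the row from entry and hand it to the record reader ...
        scanner.advance();
      }
    } finally {
      scanner.close();
    }
  }
}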
[jira] Commented: (PIG-1077) [Zebra] to support record(row)-based file split in Zebra's TableInputFormat
[ https://issues.apache.org/jira/browse/PIG-1077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780735#action_12780735 ]

Yan Zhou commented on PIG-1077:
-------------------------------

This patch is also targeted for the 0.6 release, so it needs to be on the 0.6 branch too.
[jira] Commented: (PIG-1077) [Zebra] to support record(row)-based file split in Zebra's TableInputFormat
[ https://issues.apache.org/jira/browse/PIG-1077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1293#action_1293 ]

Hadoop QA commented on PIG-1077:
--------------------------------

+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12424874/patch_Pig1077
against trunk revision 835499.

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 104 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/49/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/49/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/49/console

This message is automatically generated.
[jira] Commented: (PIG-1077) [Zebra] to support record(row)-based file split in Zebra's TableInputFormat
[ https://issues.apache.org/jira/browse/PIG-1077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12777680#action_12777680 ]

Yan Zhou commented on PIG-1077:
-------------------------------

+1