[jira] Commented: (PIG-1077) [Zebra] to support record(row)-based file split in Zebra's TableInputFormat
[ https://issues.apache.org/jira/browse/PIG-1077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780998#action_12780998 ]

Alan Gates commented on PIG-1077:
---------------------------------

Patch checked into 0.6 branch.

> [Zebra] to support record(row)-based file split in Zebra's TableInputFormat
> ----------------------------------------------------------------------------
>
>                 Key: PIG-1077
>                 URL: https://issues.apache.org/jira/browse/PIG-1077
>             Project: Pig
>          Issue Type: New Feature
>    Affects Versions: 0.4.0
>            Reporter: Chao Wang
>            Assignee: Chao Wang
>             Fix For: 0.6.0, 0.7.0
>         Attachments: patch_Pig1077
>
> TFile currently supports splitting by record sequence number (see HADOOP-6218). We want to use this to provide record(row)-based input split support in Zebra.
>
> One prominent benefit: for very large data files we can create much finer-grained input splits than before, when we could only create one big split per big file.
>
> In more detail, the new row-based getSplits() works by default (when the user does not specify the number of splits to generate) as follows:
>
> 1) Select the biggest column group in terms of data size, split each of its TFiles at HDFS block boundaries (64 MB or 128 MB), and collect the resulting physical byte offsets per TFile. For example, assume for the 1st TFile we get offset1, offset2, ..., offset10.
>
> 2) Invoke TFile.getRecordNumNear(long offset) to get the record number of a key-value pair near each byte offset. For the example above, say we get recordNum1, recordNum2, ..., recordNum10.
>
> 3) Stitch together [0, recordNum1], [recordNum1+1, recordNum2], ..., [recordNum9+1, recordNum10], [recordNum10+1, lastRecordNum] across all column groups to form 11 record-based input splits for the 1st TFile.
>
> 4) For each input split, create a TFile scanner through TFile.createScannerByRecordNum(long beginRecNum, long endRecNum).
>
> Note: the conversion from byte offset to record number is done by each mapper rather than at job initialization, because the conversion incurs some TFile reading overhead.
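For illustration only, here is a minimal Java sketch of the scheme described in steps 1-4 above. It is written against the TFile.Reader methods named in the description and in HADOOP-6218 (getRecordNumNear, createScannerByRecordNum, getEntryCount); the class and helper names (RowSplitSketch, RowSplit, toRowSplits, scanSplit) are hypothetical and are not part of the actual Zebra patch. The inclusive [begin, end] ranges follow the description's notation; the real API's end-bound convention may differ by one.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.file.tfile.TFile;

public class RowSplitSketch {

  /** Hypothetical record-range split: [beginRecNum, endRecNum], inclusive per the description. */
  static class RowSplit {
    final long beginRecNum;
    final long endRecNum;
    RowSplit(long begin, long end) { this.beginRecNum = begin; this.endRecNum = end; }
  }

  /**
   * Steps 2-3: turn the block-boundary byte offsets of one TFile into contiguous
   * record-number ranges. In the actual design this conversion happens in each
   * mapper, not at job initialization; it is shown in one place here for clarity.
   */
  static List<RowSplit> toRowSplits(TFile.Reader reader, long[] blockOffsets)
      throws IOException {
    List<RowSplit> splits = new ArrayList<RowSplit>();
    long begin = 0;
    for (long offset : blockOffsets) {
      // Record number of a key-value pair near this byte offset (HADOOP-6218).
      long recNum = reader.getRecordNumNear(offset);
      if (recNum >= begin) {
        splits.add(new RowSplit(begin, recNum));
        begin = recNum + 1;
      }
    }
    // Trailing split covering the remaining records up to the last one in the file.
    long lastRecNum = reader.getEntryCount() - 1;
    if (begin <= lastRecNum) {
      splits.add(new RowSplit(begin, lastRecNum));
    }
    return splits;
  }

  /** Step 4: each mapper scans only its own record range. */
  static void scanSplit(TFile.Reader reader, RowSplit split) throws IOException {
    // Adjust the end bound if the API treats endRecNum as exclusive.
    TFile.Reader.Scanner scanner =
        reader.createScannerByRecordNum(split.beginRecNum, split.endRecNum);
    try {
      while (!scanner.atEnd()) {
        TFile.Reader.Scanner.Entry entry = scanner.entry();
        // ... deserialize the row from entry and hand it to the record reader ...
        scanner.advance();
      }
    } finally {
      scanner.close();
    }
  }
}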
[jira] Commented: (PIG-1077) [Zebra] to support record(row)-based file split in Zebra's TableInputFormat
[ https://issues.apache.org/jira/browse/PIG-1077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780735#action_12780735 ]

Yan Zhou commented on PIG-1077:
-------------------------------

This patch is also targeted for the 0.6 release, so it needs to be on the 0.6 branch too.
[jira] Commented: (PIG-1077) [Zebra] to support record(row)-based file split in Zebra's TableInputFormat
[ https://issues.apache.org/jira/browse/PIG-1077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1293#action_1293 ]

Hadoop QA commented on PIG-1077:
--------------------------------

+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12424874/patch_Pig1077
against trunk revision 835499.

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 104 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/49/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/49/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/49/console

This message is automatically generated.
[jira] Commented: (PIG-1077) [Zebra] to support record(row)-based file split in Zebra's TableInputFormat
[ https://issues.apache.org/jira/browse/PIG-1077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12777680#action_12777680 ]

Yan Zhou commented on PIG-1077:
-------------------------------

+1