[jira] [Commented] (MAPREDUCE-5018) Support raw binary data with Hadoop streaming

2019-02-28 Thread Hadoop QA (JIRA)


[ https://issues.apache.org/jira/browse/MAPREDUCE-5018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16781095#comment-16781095 ]

Hadoop QA commented on MAPREDUCE-5018:
--

-1 overall

|| Vote || Subsystem || Runtime || Comment ||
|  0 | reexec | 0m 23s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
|| || || || trunk Compile Tests ||
|  0 | mvndep | 6m 18s | Maven dependency ordering for branch |
| +1 | mvninstall | 16m 23s | trunk passed |
| +1 | compile | 15m 28s | trunk passed |
| +1 | checkstyle | 2m 55s | trunk passed |
| +1 | mvnsite | 3m 7s | trunk passed |
| +1 | shadedclient | 11m 36s | branch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 3m 51s | trunk passed |
| +1 | javadoc | 2m 30s | trunk passed |
|| || || || Patch Compile Tests ||
|  0 | mvndep | 0m 27s | Maven dependency ordering for patch |
| +1 | mvninstall | 1m 45s | the patch passed |
| +1 | compile | 14m 19s | the patch passed |
| +1 | javac | 14m 19s | the patch passed |
| -0 | checkstyle | 3m 3s | root: The patch generated 8 new + 2 unchanged - 0 fixed = 10 total (was 2) |
| +1 | mvnsite | 3m 0s | the patch passed |
| +1 | shellcheck | 0m 0s | There were no new shellcheck issues. |
| +1 | shelldocs | 0m 36s | There were no new shelldocs issues. |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedclient | 11m 49s | patch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 4m 14s | the patch passed |
| -1 | javadoc | 1m 9s | hadoop-common-project_hadoop-common generated 2 new + 0 unchanged - 0 fixed = 2 total (was 0) |
|| || || || Other Tests ||
| +1 | unit | 10m 11s | hadoop-common in the patch passed. |
| +1 | unit | 5m 32s | hadoop-mapreduce-client-core in the patch passed. |
| +1 | unit | 6m 38s | hadoop-streaming in the patch passed. |
| +1 | asflicense | 0m 53s | The patch does not generate ASF License warnings. |
|    |  | 131m 24s |  |

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f |
| JIRA Issue | MAPREDUCE-5018 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12644886/MAPREDUCE-5018.patch |
| Optional Tests | dupname

[jira] [Commented] (MAPREDUCE-5018) Support raw binary data with Hadoop streaming

2019-02-28 Thread Ruslan Dautkhanov (JIRA)


[ https://issues.apache.org/jira/browse/MAPREDUCE-5018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16780917#comment-16780917 ]

Ruslan Dautkhanov commented on MAPREDUCE-5018:
--

Is there any workaround for this? It would be great to be able to use the Hadoop Streaming facility for binary files.

> Support raw binary data with Hadoop streaming
> -
>
> Key: MAPREDUCE-5018
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5018
> Project: Hadoop Map/Reduce
>  Issue Type: New Feature
>  Components: contrib/streaming
>Affects Versions: 1.1.2
>Reporter: Jay Hacker
>Assignee: Steven Willis
>Priority: Minor
>  Labels: BB2015-05-TBR
> Attachments: MAPREDUCE-5018-branch-1.1.patch, MAPREDUCE-5018.patch, 
> MAPREDUCE-5018.patch, justbytes.jar, mapstream
>
>
> People often have a need to run older programs over many files, and turn to 
> Hadoop streaming as a reliable, performant batch system.  There are good 
> reasons for this:
> 1. Hadoop is convenient: they may already be using it for mapreduce jobs, and 
> it is easy to spin up a cluster in the cloud.
> 2. It is reliable: HDFS replicates data and the scheduler retries failed jobs.
> 3. It is reasonably performant: it moves the code to the data, maintaining 
> locality, and scales with the number of nodes.
> Historically Hadoop is of course oriented toward processing key/value pairs, 
> and so needs to interpret the data passing through it.  Unfortunately, this 
> makes it difficult to use Hadoop streaming with programs that don't deal in 
> key/value pairs, or with binary data in general.  For example, something as 
> simple as running md5sum to verify the integrity of files will not give the 
> correct result, due to Hadoop's interpretation of the data.  
> There have been several attempts at binary serialization schemes for Hadoop 
> streaming, such as TypedBytes (HADOOP-1722); however, these are still aimed 
> at efficiently encoding key/value pairs, and not passing data through 
> unmodified.  Even the "RawBytes" serialization scheme adds length fields to 
> the data, rendering it not-so-raw.
> I often have a need to run a Unix filter on files stored in HDFS; currently, 
> the only way I can do this on the raw data is to copy the data out and run 
> the filter on one machine, which is inconvenient, slow, and unreliable.  It 
> would be very convenient to run the filter as a map-only job, allowing me to 
> build on existing (well-tested!) building blocks in the Unix tradition 
> instead of reimplementing them as mapreduce programs.
> However, most existing tools don't know about file splits, and so want to 
> process whole files; and of course many expect raw binary input and output.  
> The solution is to run a map-only job with an InputFormat and OutputFormat 
> that just pass raw bytes and don't split.  It turns out to be a little more 
> complicated with streaming; I have attached a patch with the simplest 
> solution I could come up with.  I call the format "JustBytes" (as "RawBytes" 
> was already taken), and it should be usable with most recent versions of 
> Hadoop.
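
For reference, a map-only invocation along the lines described above might look like the sketch below. It is only a sketch: the JustBytes class names, the {{-io justbytes}} identifier, and the streaming jar path are assumptions about the attached patch rather than details confirmed in this issue.

{code}
# Illustrative only: class names, the "justbytes" identifier, and jar paths
# are assumptions, not documented parts of the patch.  Runs md5sum as a
# map-only streaming job over whole, unsplit files.
hadoop jar "$HADOOP_HOME"/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -libjars justbytes.jar \
    -D mapreduce.job.reduces=0 \
    -io justbytes \
    -inputformat org.apache.hadoop.streaming.JustBytesInputFormat \
    -outputformat org.apache.hadoop.streaming.JustBytesOutputFormat \
    -input /data/raw \
    -output /data/raw-md5 \
    -mapper md5sum
{code}

With zero reducers the bytes the mapper writes go straight to the output format, so nothing is re-keyed or re-sorted on the way out.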



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org



[jira] [Commented] (MAPREDUCE-5018) Support raw binary data with Hadoop streaming

2015-03-09 Thread Hadoop QA (JIRA)

[ https://issues.apache.org/jira/browse/MAPREDUCE-5018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14354299#comment-14354299 ]

Hadoop QA commented on MAPREDUCE-5018:
--

-1 overall.  Here are the results of testing the latest attachment
  http://issues.apache.org
  against trunk revision 47f7f18.

-1 patch.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5260//console

This message is automatically generated.

 Support raw binary data with Hadoop streaming
 -

 Key: MAPREDUCE-5018
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5018
 Project: Hadoop Map/Reduce
  Issue Type: New Feature
  Components: contrib/streaming
Affects Versions: 1.1.2
Reporter: Jay Hacker
Assignee: Steven Willis
Priority: Minor
 Attachments: MAPREDUCE-5018-branch-1.1.patch, MAPREDUCE-5018.patch, 
 MAPREDUCE-5018.patch, justbytes.jar, mapstream


 People often have a need to run older programs over many files, and turn to 
 Hadoop streaming as a reliable, performant batch system.  There are good 
 reasons for this:
 1. Hadoop is convenient: they may already be using it for mapreduce jobs, and 
 it is easy to spin up a cluster in the cloud.
 2. It is reliable: HDFS replicates data and the scheduler retries failed jobs.
 3. It is reasonably performant: it moves the code to the data, maintaining 
 locality, and scales with the number of nodes.
 Historically Hadoop is of course oriented toward processing key/value pairs, 
 and so needs to interpret the data passing through it.  Unfortunately, this 
 makes it difficult to use Hadoop streaming with programs that don't deal in 
 key/value pairs, or with binary data in general.  For example, something as 
 simple as running md5sum to verify the integrity of files will not give the 
 correct result, due to Hadoop's interpretation of the data.  
 There have been several attempts at binary serialization schemes for Hadoop 
 streaming, such as TypedBytes (HADOOP-1722); however, these are still aimed 
 at efficiently encoding key/value pairs, and not passing data through 
 unmodified.  Even the RawBytes serialization scheme adds length fields to 
 the data, rendering it not-so-raw.
 I often have a need to run a Unix filter on files stored in HDFS; currently, 
 the only way I can do this on the raw data is to copy the data out and run 
 the filter on one machine, which is inconvenient, slow, and unreliable.  It 
 would be very convenient to run the filter as a map-only job, allowing me to 
 build on existing (well-tested!) building blocks in the Unix tradition 
 instead of reimplementing them as mapreduce programs.
 However, most existing tools don't know about file splits, and so want to 
 process whole files; and of course many expect raw binary input and output.  
 The solution is to run a map-only job with an InputFormat and OutputFormat 
 that just pass raw bytes and don't split.  It turns out to be a little more 
 complicated with streaming; I have attached a patch with the simplest 
 solution I could come up with.  I call the format JustBytes (as RawBytes 
 was already taken), and it should be usable with most recent versions of 
 Hadoop.





[jira] [Commented] (MAPREDUCE-5018) Support raw binary data with Hadoop streaming

2014-06-16 Thread Hadoop QA (JIRA)

[ https://issues.apache.org/jira/browse/MAPREDUCE-5018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14032514#comment-14032514 ]

Hadoop QA commented on MAPREDUCE-5018:
--

-1 overall.  Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12644886/MAPREDUCE-5018.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 1 new or modified test files.

+1 javac.  The applied patch does not increase the total number of javac compiler warnings.

-1 javadoc.  The javadoc tool appears to have generated 2 warning messages.
See https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4662//artifact/trunk/patchprocess/diffJavadocWarnings.txt for details.

+1 eclipse:eclipse.  The patch built with eclipse:eclipse.

+1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.

+1 release audit.  The applied patch does not increase the total number of release audit warnings.

+1 core tests.  The patch passed unit tests in hadoop-common-project/hadoop-common hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-tools/hadoop-streaming.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4662//testReport/
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4662//console

This message is automatically generated.

 Support raw binary data with Hadoop streaming
 -

 Key: MAPREDUCE-5018
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5018
 Project: Hadoop Map/Reduce
  Issue Type: New Feature
  Components: contrib/streaming
Affects Versions: trunk, 1.1.2
Reporter: Jay Hacker
Assignee: Steven Willis
Priority: Minor
 Attachments: MAPREDUCE-5018-branch-1.1.patch, MAPREDUCE-5018.patch, 
 MAPREDUCE-5018.patch, justbytes.jar, mapstream


 People often have a need to run older programs over many files, and turn to 
 Hadoop streaming as a reliable, performant batch system.  There are good 
 reasons for this:
 1. Hadoop is convenient: they may already be using it for mapreduce jobs, and 
 it is easy to spin up a cluster in the cloud.
 2. It is reliable: HDFS replicates data and the scheduler retries failed jobs.
 3. It is reasonably performant: it moves the code to the data, maintaining 
 locality, and scales with the number of nodes.
 Historically Hadoop is of course oriented toward processing key/value pairs, 
 and so needs to interpret the data passing through it.  Unfortunately, this 
 makes it difficult to use Hadoop streaming with programs that don't deal in 
 key/value pairs, or with binary data in general.  For example, something as 
 simple as running md5sum to verify the integrity of files will not give the 
 correct result, due to Hadoop's interpretation of the data.  
 There have been several attempts at binary serialization schemes for Hadoop 
 streaming, such as TypedBytes (HADOOP-1722); however, these are still aimed 
 at efficiently encoding key/value pairs, and not passing data through 
 unmodified.  Even the RawBytes serialization scheme adds length fields to 
 the data, rendering it not-so-raw.
 I often have a need to run a Unix filter on files stored in HDFS; currently, 
 the only way I can do this on the raw data is to copy the data out and run 
 the filter on one machine, which is inconvenient, slow, and unreliable.  It 
 would be very convenient to run the filter as a map-only job, allowing me to 
 build on existing (well-tested!) building blocks in the Unix tradition 
 instead of reimplementing them as mapreduce programs.
 However, most existing tools don't know about file splits, and so want to 
 process whole files; and of course many expect raw binary input and output.  
 The solution is to run a map-only job with an InputFormat and OutputFormat 
 that just pass raw bytes and don't split.  It turns out to be a little more 
 complicated with streaming; I have attached a patch with the simplest 
 solution I could come up with.  I call the format JustBytes (as RawBytes 
 was already taken), and it should be usable with most recent versions of 
 Hadoop.





[jira] [Commented] (MAPREDUCE-5018) Support raw binary data with Hadoop streaming

2013-05-22 Thread Jay Hacker (JIRA)

[ https://issues.apache.org/jira/browse/MAPREDUCE-5018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13664486#comment-13664486 ]

Jay Hacker commented on MAPREDUCE-5018:
---

You're welcome!  

It might be easier to just split your inputs yourself before putting them in 
HDFS (see {{split(1)}}), but perhaps your files are already in HDFS.

JustBytes shouldn't modify or interpret your data at all; it reads an entire 
file in binary, gives those exact bytes to your mapper, and writes out the 
exact bytes your mapper gives.  It does not know or care about newlines.  I 
would encourage you to run {{md5sum}} on your data outside HDFS and via 
{{mapstream}} to verify that it is not changing your data at all, and let me 
know if it is.
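
A check along those lines might look like the sketch below; the {{mapstream}} arguments shown are a hypothetical usage of the attached wrapper script, since its exact syntax isn't spelled out in this comment.

{code}
# Illustrative verification sketch; paths and the mapstream call are assumptions.
md5sum big-file.bin                  # digest of the original file, outside HDFS
hadoop fs -put big-file.bin /raw/
./mapstream /raw /raw.md5 md5sum     # hypothetical: run md5sum over /raw as a map-only job
hadoop fs -cat '/raw.md5/*'          # digests computed inside the job; compare with the local one
{code}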

 Support raw binary data with Hadoop streaming
 -

 Key: MAPREDUCE-5018
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5018
 Project: Hadoop Map/Reduce
  Issue Type: New Feature
  Components: contrib/streaming
Reporter: Jay Hacker
Priority: Minor
 Attachments: justbytes.jar, MAPREDUCE-5018.patch, mapstream


 People often have a need to run older programs over many files, and turn to 
 Hadoop streaming as a reliable, performant batch system.  There are good 
 reasons for this:
 1. Hadoop is convenient: they may already be using it for mapreduce jobs, and 
 it is easy to spin up a cluster in the cloud.
 2. It is reliable: HDFS replicates data and the scheduler retries failed jobs.
 3. It is reasonably performant: it moves the code to the data, maintaining 
 locality, and scales with the number of nodes.
 Historically Hadoop is of course oriented toward processing key/value pairs, 
 and so needs to interpret the data passing through it.  Unfortunately, this 
 makes it difficult to use Hadoop streaming with programs that don't deal in 
 key/value pairs, or with binary data in general.  For example, something as 
 simple as running md5sum to verify the integrity of files will not give the 
 correct result, due to Hadoop's interpretation of the data.  
 There have been several attempts at binary serialization schemes for Hadoop 
 streaming, such as TypedBytes (HADOOP-1722); however, these are still aimed 
 at efficiently encoding key/value pairs, and not passing data through 
 unmodified.  Even the RawBytes serialization scheme adds length fields to 
 the data, rendering it not-so-raw.
 I often have a need to run a Unix filter on files stored in HDFS; currently, 
 the only way I can do this on the raw data is to copy the data out and run 
 the filter on one machine, which is inconvenient, slow, and unreliable.  It 
 would be very convenient to run the filter as a map-only job, allowing me to 
 build on existing (well-tested!) building blocks in the Unix tradition 
 instead of reimplementing them as mapreduce programs.
 However, most existing tools don't know about file splits, and so want to 
 process whole files; and of course many expect raw binary input and output.  
 The solution is to run a map-only job with an InputFormat and OutputFormat 
 that just pass raw bytes and don't split.  It turns out to be a little more 
 complicated with streaming; I have attached a patch with the simplest 
 solution I could come up with.  I call the format JustBytes (as RawBytes 
 was already taken), and it should be usable with most recent versions of 
 Hadoop.



[jira] [Commented] (MAPREDUCE-5018) Support raw binary data with Hadoop streaming

2013-05-15 Thread PrateekM (JIRA)

[ https://issues.apache.org/jira/browse/MAPREDUCE-5018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13659194#comment-13659194 ]

PrateekM commented on MAPREDUCE-5018:
-

Yes, in that case it's fine. We are creating a modified version of JustBytesInputFormat that does the splits, since we can split our binary data into fixed-length records. Thanks for JustBytes!

One more query: in places our data contains \n and \r characters as part of the binary data, and we don't want stdin to interpret these characters, since that corrupts the data by the time it reaches the mapper.
Is there anything that can be done? I don't want to hex-encode it before writing it to the stream to the mapper.

 Support raw binary data with Hadoop streaming
 -

 Key: MAPREDUCE-5018
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5018
 Project: Hadoop Map/Reduce
  Issue Type: New Feature
  Components: contrib/streaming
Reporter: Jay Hacker
Priority: Minor
 Attachments: justbytes.jar, MAPREDUCE-5018.patch, mapstream


 People often have a need to run older programs over many files, and turn to 
 Hadoop streaming as a reliable, performant batch system.  There are good 
 reasons for this:
 1. Hadoop is convenient: they may already be using it for mapreduce jobs, and 
 it is easy to spin up a cluster in the cloud.
 2. It is reliable: HDFS replicates data and the scheduler retries failed jobs.
 3. It is reasonably performant: it moves the code to the data, maintaining 
 locality, and scales with the number of nodes.
 Historically Hadoop is of course oriented toward processing key/value pairs, 
 and so needs to interpret the data passing through it.  Unfortunately, this 
 makes it difficult to use Hadoop streaming with programs that don't deal in 
 key/value pairs, or with binary data in general.  For example, something as 
 simple as running md5sum to verify the integrity of files will not give the 
 correct result, due to Hadoop's interpretation of the data.  
 There have been several attempts at binary serialization schemes for Hadoop 
 streaming, such as TypedBytes (HADOOP-1722); however, these are still aimed 
 at efficiently encoding key/value pairs, and not passing data through 
 unmodified.  Even the RawBytes serialization scheme adds length fields to 
 the data, rendering it not-so-raw.
 I often have a need to run a Unix filter on files stored in HDFS; currently, 
 the only way I can do this on the raw data is to copy the data out and run 
 the filter on one machine, which is inconvenient, slow, and unreliable.  It 
 would be very convenient to run the filter as a map-only job, allowing me to 
 build on existing (well-tested!) building blocks in the Unix tradition 
 instead of reimplementing them as mapreduce programs.
 However, most existing tools don't know about file splits, and so want to 
 process whole files; and of course many expect raw binary input and output.  
 The solution is to run a map-only job with an InputFormat and OutputFormat 
 that just pass raw bytes and don't split.  It turns out to be a little more 
 complicated with streaming; I have attached a patch with the simplest 
 solution I could come up with.  I call the format JustBytes (as RawBytes 
 was already taken), and it should be usable with most recent versions of 
 Hadoop.



[jira] [Commented] (MAPREDUCE-5018) Support raw binary data with Hadoop streaming

2013-05-10 Thread Jay Hacker (JIRA)

[ https://issues.apache.org/jira/browse/MAPREDUCE-5018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13654863#comment-13654863 ]

Jay Hacker commented on MAPREDUCE-5018:
---

[~pratem], you're right, there are cases where it's not efficient.  Consider 
this though: if you have 100 TB of files in HDFS that you want to md5sum (or 
what have you), would you rather do an inefficient distributed md5sum on the 
cluster, or copy 100 TB out to a single machine and wait for a single md5sum?  
Can you even fit that on one machine?

You still gain reliability: there are multiple copies of each file, and failed 
jobs get restarted.  It's also just convenient.

Here's the trick to make it efficient: use many files, and set the block size 
of individual files big enough to fit the whole file:

{{hadoop fs -D dfs.block.size=1073741824 -put ...}}

Then all reads are local, and you get all the performance Hadoop can give you.
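
Concretely, the upload and a quick sanity check might look like the sketch below (paths are illustrative; on Hadoop 2.x and later the property is spelled {{dfs.blocksize}}, with {{dfs.block.size}} kept as a deprecated alias):

{code}
# Upload each file with a per-file block size at least as large as the file,
# so every file occupies a single block and each map task reads locally.
hadoop fs -D dfs.block.size=1073741824 -put data/*.bin /raw/
# Confirm that each file really is stored as a single block:
hadoop fsck /raw -files -blocks
{code}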

 Support raw binary data with Hadoop streaming
 -

 Key: MAPREDUCE-5018
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5018
 Project: Hadoop Map/Reduce
  Issue Type: New Feature
  Components: contrib/streaming
Reporter: Jay Hacker
Priority: Minor
 Attachments: justbytes.jar, MAPREDUCE-5018.patch, mapstream


 People often have a need to run older programs over many files, and turn to 
 Hadoop streaming as a reliable, performant batch system.  There are good 
 reasons for this:
 1. Hadoop is convenient: they may already be using it for mapreduce jobs, and 
 it is easy to spin up a cluster in the cloud.
 2. It is reliable: HDFS replicates data and the scheduler retries failed jobs.
 3. It is reasonably performant: it moves the code to the data, maintaining 
 locality, and scales with the number of nodes.
 Historically Hadoop is of course oriented toward processing key/value pairs, 
 and so needs to interpret the data passing through it.  Unfortunately, this 
 makes it difficult to use Hadoop streaming with programs that don't deal in 
 key/value pairs, or with binary data in general.  For example, something as 
 simple as running md5sum to verify the integrity of files will not give the 
 correct result, due to Hadoop's interpretation of the data.  
 There have been several attempts at binary serialization schemes for Hadoop 
 streaming, such as TypedBytes (HADOOP-1722); however, these are still aimed 
 at efficiently encoding key/value pairs, and not passing data through 
 unmodified.  Even the RawBytes serialization scheme adds length fields to 
 the data, rendering it not-so-raw.
 I often have a need to run a Unix filter on files stored in HDFS; currently, 
 the only way I can do this on the raw data is to copy the data out and run 
 the filter on one machine, which is inconvenient, slow, and unreliable.  It 
 would be very convenient to run the filter as a map-only job, allowing me to 
 build on existing (well-tested!) building blocks in the Unix tradition 
 instead of reimplementing them as mapreduce programs.
 However, most existing tools don't know about file splits, and so want to 
 process whole files; and of course many expect raw binary input and output.  
 The solution is to run a map-only job with an InputFormat and OutputFormat 
 that just pass raw bytes and don't split.  It turns out to be a little more 
 complicated with streaming; I have attached a patch with the simplest 
 solution I could come up with.  I call the format JustBytes (as RawBytes 
 was already taken), and it should be usable with most recent versions of 
 Hadoop.



[jira] [Commented] (MAPREDUCE-5018) Support raw binary data with Hadoop streaming

2013-02-21 Thread Hadoop QA (JIRA)

[ https://issues.apache.org/jira/browse/MAPREDUCE-5018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13583302#comment-13583302 ]

Hadoop QA commented on MAPREDUCE-5018:
--

-1 overall.  Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12570317/MAPREDUCE-5018.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.

-1 javac.  The patch appears to cause the build to fail.

Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3352//console

This message is automatically generated.

 Support raw binary data with Hadoop streaming
 -

 Key: MAPREDUCE-5018
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5018
 Project: Hadoop Map/Reduce
  Issue Type: New Feature
  Components: contrib/streaming
Reporter: Jay Hacker
Priority: Minor
 Attachments: MAPREDUCE-5018.patch


 People often have a need to run older programs over many files, and turn to 
 Hadoop streaming as a reliable, performant batch system.  There are good 
 reasons for this:
 1. Hadoop is convenient: they may already be using it for mapreduce jobs, and 
 it is easy to spin up a cluster in the cloud.
 2. It is reliable: HDFS replicates data and the scheduler retries failed jobs.
 3. It is reasonably performant: it moves the code to the data, maintaining 
 locality, and scales with the number of nodes.
 Historically Hadoop is of course oriented toward processing key/value pairs, 
 and so needs to interpret the data passing through it.  Unfortunately, this 
 makes it difficult to use Hadoop streaming with programs that don't deal in 
 key/value pairs, or with binary data in general.  For example, something as 
 simple as running md5sum to verify the integrity of files will not give the 
 correct result, due to Hadoop's interpretation of the data.  
 There have been several attempts at binary serialization schemes for Hadoop 
 streaming, such as TypedBytes (HADOOP-1722); however, these are still aimed 
 at efficiently encoding key/value pairs, and not passing data through 
 unmodified.  Even the RawBytes serialization scheme adds length fields to 
 the data, rendering it not-so-raw.
 I often have a need to run a Unix filter on files stored in HDFS; currently, 
 the only way I can do this on the raw data is to copy the data out and run 
 the filter on one machine, which is inconvenient, slow, and unreliable.  It 
 would be very convenient to run the filter as a map-only job, allowing me to 
 build on existing (well-tested!) building blocks in the Unix tradition 
 instead of reimplementing them as mapreduce programs.
 However, most existing tools don't know about file splits, and so want to 
 process whole files; and of course many expect raw binary input and output.  
 The solution is to run a map-only job with an InputFormat and OutputFormat 
 that just pass raw bytes and don't split.  It turns out to be a little more 
 complicated with streaming; I have attached a patch with the simplest 
 solution I could come up with.  I call the format JustBytes (as RawBytes 
 was already taken), and it should be usable with most recent versions of 
 Hadoop.



[jira] [Commented] (MAPREDUCE-5018) Support raw binary data with Hadoop streaming

2013-02-21 Thread Hadoop QA (JIRA)

[ https://issues.apache.org/jira/browse/MAPREDUCE-5018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13583352#comment-13583352 ]

Hadoop QA commented on MAPREDUCE-5018:
--

-1 overall.  Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12570328/mapstream
  against trunk revision .

-1 patch.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3353//console

This message is automatically generated.

 Support raw binary data with Hadoop streaming
 -

 Key: MAPREDUCE-5018
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5018
 Project: Hadoop Map/Reduce
  Issue Type: New Feature
  Components: contrib/streaming
Reporter: Jay Hacker
Priority: Minor
 Attachments: justbytes.jar, MAPREDUCE-5018.patch, mapstream


 People often have a need to run older programs over many files, and turn to 
 Hadoop streaming as a reliable, performant batch system.  There are good 
 reasons for this:
 1. Hadoop is convenient: they may already be using it for mapreduce jobs, and 
 it is easy to spin up a cluster in the cloud.
 2. It is reliable: HDFS replicates data and the scheduler retries failed jobs.
 3. It is reasonably performant: it moves the code to the data, maintaining 
 locality, and scales with the number of nodes.
 Historically Hadoop is of course oriented toward processing key/value pairs, 
 and so needs to interpret the data passing through it.  Unfortunately, this 
 makes it difficult to use Hadoop streaming with programs that don't deal in 
 key/value pairs, or with binary data in general.  For example, something as 
 simple as running md5sum to verify the integrity of files will not give the 
 correct result, due to Hadoop's interpretation of the data.  
 There have been several attempts at binary serialization schemes for Hadoop 
 streaming, such as TypedBytes (HADOOP-1722); however, these are still aimed 
 at efficiently encoding key/value pairs, and not passing data through 
 unmodified.  Even the RawBytes serialization scheme adds length fields to 
 the data, rendering it not-so-raw.
 I often have a need to run a Unix filter on files stored in HDFS; currently, 
 the only way I can do this on the raw data is to copy the data out and run 
 the filter on one machine, which is inconvenient, slow, and unreliable.  It 
 would be very convenient to run the filter as a map-only job, allowing me to 
 build on existing (well-tested!) building blocks in the Unix tradition 
 instead of reimplementing them as mapreduce programs.
 However, most existing tools don't know about file splits, and so want to 
 process whole files; and of course many expect raw binary input and output.  
 The solution is to run a map-only job with an InputFormat and OutputFormat 
 that just pass raw bytes and don't split.  It turns out to be a little more 
 complicated with streaming; I have attached a patch with the simplest 
 solution I could come up with.  I call the format JustBytes (as RawBytes 
 was already taken), and it should be usable with most recent versions of 
 Hadoop.
