[jira] [Updated] (MAPREDUCE-5018) Support raw binary data with Hadoop streaming

2014-06-16  Steven Willis (JIRA)

 [ https://issues.apache.org/jira/browse/MAPREDUCE-5018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Willis updated MAPREDUCE-5018:
--------------------------------------

Target Version/s: 1.1.2, trunk  (was: trunk, 1.1.2)
  Status: Patch Available  (was: Open)

 Support raw binary data with Hadoop streaming
 ---------------------------------------------

 Key: MAPREDUCE-5018
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5018
 Project: Hadoop Map/Reduce
  Issue Type: New Feature
  Components: contrib/streaming
Affects Versions: 1.1.2, trunk
Reporter: Jay Hacker
Assignee: Steven Willis
Priority: Minor
 Attachments: MAPREDUCE-5018-branch-1.1.patch, MAPREDUCE-5018.patch, 
 MAPREDUCE-5018.patch, justbytes.jar, mapstream


 People often need to run older programs over many files, and they turn to 
 Hadoop streaming as a reliable, performant batch system.  There are good 
 reasons for this:
 1. Hadoop is convenient: they may already be using it for MapReduce jobs, and 
 it is easy to spin up a cluster in the cloud.
 2. It is reliable: HDFS replicates data and the scheduler retries failed jobs.
 3. It is reasonably performant: it moves the code to the data, maintaining 
 locality, and scales with the number of nodes.
 Historically, of course, Hadoop has been oriented toward processing key/value 
 pairs, and so it needs to interpret the data passing through it.  
 Unfortunately, this makes it difficult to use Hadoop streaming with programs 
 that don't deal in key/value pairs, or with binary data in general.  For 
 example, something as simple as running md5sum to verify the integrity of 
 files will not give the correct result, due to Hadoop's interpretation of 
 the data.
 There have been several attempts at binary serialization schemes for Hadoop 
 streaming, such as TypedBytes (HADOOP-1722); however, these are still aimed 
 at efficiently encoding key/value pairs, and not passing data through 
 unmodified.  Even the RawBytes serialization scheme adds length fields to 
 the data, rendering it not-so-raw.
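 To make "not-so-raw" concrete: a RawBytes-style encoder writes a 4-byte 
 length before every blob, so the byte stream a downstream filter sees is not 
 byte-identical to the file.  A minimal illustration (plain Java, for 
 demonstration only; this is not code from the attached patch):

     import java.io.ByteArrayOutputStream;
     import java.io.DataOutputStream;
     import java.io.IOException;

     public class FramingDemo {
       public static void main(String[] args) throws IOException {
         byte[] data = "hello\n".getBytes("UTF-8");

         // RawBytes-style framing: a 4-byte big-endian length, then the bytes.
         ByteArrayOutputStream framed = new ByteArrayOutputStream();
         DataOutputStream out = new DataOutputStream(framed);
         out.writeInt(data.length);  // bytes 00 00 00 06 now precede the data
         out.write(data);

         // Truly raw pass-through: the bytes and nothing else.
         ByteArrayOutputStream raw = new ByteArrayOutputStream();
         raw.write(data);

         // md5sum over the framed stream cannot match md5sum over the file.
         System.out.println(framed.size() + " framed vs " + raw.size() + " raw");
       }
     }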
 I often need to run a Unix filter on files stored in HDFS; currently, the 
 only way I can do this on the raw data is to copy the data out and run the 
 filter on one machine, which is inconvenient, slow, and unreliable.  It would 
 be very convenient to run the filter as a map-only job, allowing me to build 
 on existing (well-tested!) building blocks in the Unix tradition instead of 
 reimplementing them as MapReduce programs.
 However, most existing tools don't know about file splits, and so want to 
 process whole files; and of course many expect raw binary input and output.  
 The solution is to run a map-only job with an InputFormat and OutputFormat 
 that just pass raw bytes and don't split.  It turns out to be a little more 
 complicated with streaming; I have attached a patch with the simplest 
 solution I could come up with.  I call the format JustBytes (as RawBytes 
 was already taken), and it should be usable with most recent versions of 
 Hadoop.
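 To sketch the shape of such formats, here is a minimal, illustrative 
 whole-file input format (written against the newer org.apache.hadoop.mapreduce 
 API; the class and method names are examples of the general pattern, not 
 necessarily what the attached patch uses):

     import java.io.IOException;
     import org.apache.hadoop.conf.Configuration;
     import org.apache.hadoop.fs.FSDataInputStream;
     import org.apache.hadoop.fs.FileSystem;
     import org.apache.hadoop.fs.Path;
     import org.apache.hadoop.io.BytesWritable;
     import org.apache.hadoop.io.IOUtils;
     import org.apache.hadoop.io.NullWritable;
     import org.apache.hadoop.mapreduce.InputSplit;
     import org.apache.hadoop.mapreduce.JobContext;
     import org.apache.hadoop.mapreduce.RecordReader;
     import org.apache.hadoop.mapreduce.TaskAttemptContext;
     import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
     import org.apache.hadoop.mapreduce.lib.input.FileSplit;

     // Never splits; hands each mapper the complete, uninterpreted contents
     // of one file as a single record.
     public class WholeFileInputFormat
         extends FileInputFormat<NullWritable, BytesWritable> {

       @Override
       protected boolean isSplitable(JobContext context, Path file) {
         return false;  // one mapper per whole file
       }

       @Override
       public RecordReader<NullWritable, BytesWritable> createRecordReader(
           InputSplit split, TaskAttemptContext context) {
         return new WholeFileRecordReader();
       }

       static class WholeFileRecordReader
           extends RecordReader<NullWritable, BytesWritable> {
         private FileSplit split;
         private Configuration conf;
         private final BytesWritable value = new BytesWritable();
         private boolean processed = false;

         @Override
         public void initialize(InputSplit split, TaskAttemptContext context) {
           this.split = (FileSplit) split;
           this.conf = context.getConfiguration();
         }

         @Override
         public boolean nextKeyValue() throws IOException {
           if (processed) {
             return false;  // the single whole-file record was already emitted
           }
           byte[] contents = new byte[(int) split.getLength()];
           Path file = split.getPath();
           FileSystem fs = file.getFileSystem(conf);
           FSDataInputStream in = fs.open(file);
           try {
             IOUtils.readFully(in, contents, 0, contents.length);
             value.set(contents, 0, contents.length);
           } finally {
             IOUtils.closeStream(in);
           }
           processed = true;
           return true;
         }

         @Override public NullWritable getCurrentKey() { return NullWritable.get(); }
         @Override public BytesWritable getCurrentValue() { return value; }
         @Override public float getProgress() { return processed ? 1.0f : 0.0f; }
         @Override public void close() { }
       }
     }

 The output side is the mirror image: write the value bytes verbatim, with no 
 keys, no separators, and no length framing (again illustrative, not the 
 patch's code):

     import java.io.IOException;
     import org.apache.hadoop.fs.FSDataOutputStream;
     import org.apache.hadoop.fs.Path;
     import org.apache.hadoop.io.BytesWritable;
     import org.apache.hadoop.io.NullWritable;
     import org.apache.hadoop.mapreduce.RecordWriter;
     import org.apache.hadoop.mapreduce.TaskAttemptContext;
     import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

     public class RawPassThroughOutputFormat
         extends FileOutputFormat<NullWritable, BytesWritable> {

       @Override
       public RecordWriter<NullWritable, BytesWritable> getRecordWriter(
           TaskAttemptContext job) throws IOException {
         Path file = getDefaultWorkFile(job, "");
         final FSDataOutputStream out =
             file.getFileSystem(job.getConfiguration()).create(file, false);
         return new RecordWriter<NullWritable, BytesWritable>() {
           @Override
           public void write(NullWritable key, BytesWritable value)
               throws IOException {
             out.write(value.getBytes(), 0, value.getLength());  // bytes only
           }
           @Override
           public void close(TaskAttemptContext context) throws IOException {
             out.close();
           }
         };
       }
     }

 With a pair of formats like these, the md5sum use case becomes a map-only 
 streaming job along the lines of the following hypothetical invocation (the 
 remaining complication, which the patch addresses, is keeping the streaming 
 pipe itself binary-clean):

     hadoop jar hadoop-streaming.jar \
       -D mapred.reduce.tasks=0 \
       -inputformat WholeFileInputFormat \
       -outputformat RawPassThroughOutputFormat \
       -input /data/in -output /data/out \
       -mapper /usr/bin/md5sum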



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAPREDUCE-5018) Support raw binary data with Hadoop streaming

2014-06-16  Steven Willis (JIRA)

 [ https://issues.apache.org/jira/browse/MAPREDUCE-5018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Willis updated MAPREDUCE-5018:
--------------------------------------

Target Version/s: 1.1.2, trunk  (was: trunk, 1.1.2)
  Status: Open  (was: Patch Available)

 Support raw binary data with Hadoop streaming
 ---------------------------------------------

 Key: MAPREDUCE-5018
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5018
 Project: Hadoop Map/Reduce
  Issue Type: New Feature
  Components: contrib/streaming
Affects Versions: 1.1.2, trunk
Reporter: Jay Hacker
Assignee: Steven Willis
Priority: Minor
 Attachments: MAPREDUCE-5018-branch-1.1.patch, MAPREDUCE-5018.patch, 
 MAPREDUCE-5018.patch, justbytes.jar, mapstream




[jira] [Commented] (MAPREDUCE-4035) why is there no javadoc, sources jars published in the maven repo for hadoop-core 0.20.2*, 1.0.X?

2014-05-16  Steven Willis (JIRA)

[ https://issues.apache.org/jira/browse/MAPREDUCE-4035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1427#comment-1427 ]

Steven Willis commented on MAPREDUCE-4035:
------------------------------------------

Is there a reason no one's working on this? I don't know exactly what needs to 
be done from an infrastructure standpoint to get this working, but it's a very 
important thing to have. Is it as easy as adding the maven-source-plugin, as 
[~michaelisvy] says? Does anything need to change from a publishing perspective?

 why is there no javadoc, sources jars published in the maven repo for 
 hadoop-core 0.20.2*, 1.0.X?
 -----------------------------------------------------------------------

 Key: MAPREDUCE-4035
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4035
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: build
Affects Versions: 0.20.2, 0.20.203.0, 0.20.204.0, 0.20.205.0, 1.0.0, 
 1.0.1, 1.0.2
Reporter: Jim Donofrio
Priority: Minor

 Why is there no javadoc, sources jars published in the maven repo for 
 hadoop-core 0.20.2*, 1.0.X?





[jira] [Updated] (MAPREDUCE-5018) Support raw binary data with Hadoop streaming

2014-05-15  Steven Willis (JIRA)

 [ https://issues.apache.org/jira/browse/MAPREDUCE-5018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Willis updated MAPREDUCE-5018:
--------------------------------------

Assignee: Steven Willis
Target Version/s: 1.1.2, trunk  (was: trunk, 1.1.2)
  Status: Open  (was: Patch Available)

 Support raw binary data with Hadoop streaming
 ---------------------------------------------

 Key: MAPREDUCE-5018
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5018
 Project: Hadoop Map/Reduce
  Issue Type: New Feature
  Components: contrib/streaming
Reporter: Jay Hacker
Assignee: Steven Willis
Priority: Minor
 Attachments: MAPREDUCE-5018-branch-1.1.patch, MAPREDUCE-5018.patch, 
 MAPREDUCE-5018.patch, justbytes.jar, mapstream




[jira] [Updated] (MAPREDUCE-5018) Support raw binary data with Hadoop streaming

2014-05-15  Steven Willis (JIRA)

 [ https://issues.apache.org/jira/browse/MAPREDUCE-5018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Willis updated MAPREDUCE-5018:
--------------------------------------

 Target Version/s: 1.1.2, trunk  (was: trunk, 1.1.2)
Affects Version/s: trunk, 1.1.2
           Status: Patch Available  (was: Open)

 Support raw binary data with Hadoop streaming
 ---------------------------------------------

 Key: MAPREDUCE-5018
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5018
 Project: Hadoop Map/Reduce
  Issue Type: New Feature
  Components: contrib/streaming
Affects Versions: 1.1.2, trunk
Reporter: Jay Hacker
Assignee: Steven Willis
Priority: Minor
 Attachments: MAPREDUCE-5018-branch-1.1.patch, MAPREDUCE-5018.patch, 
 MAPREDUCE-5018.patch, justbytes.jar, mapstream




[jira] [Updated] (MAPREDUCE-5018) Support raw binary data with Hadoop streaming

2014-05-15  Steven Willis (JIRA)

 [ https://issues.apache.org/jira/browse/MAPREDUCE-5018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Willis updated MAPREDUCE-5018:
--------------------------------------

Attachment: MAPREDUCE-5018-branch-1.1.patch

A patch for the 1.1 branch

 Support raw binary data with Hadoop streaming
 ---------------------------------------------

 Key: MAPREDUCE-5018
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5018
 Project: Hadoop Map/Reduce
  Issue Type: New Feature
  Components: contrib/streaming
Reporter: Jay Hacker
Priority: Minor
 Attachments: MAPREDUCE-5018-branch-1.1.patch, MAPREDUCE-5018.patch, 
 justbytes.jar, mapstream




[jira] [Updated] (MAPREDUCE-5018) Support raw binary data with Hadoop streaming

2014-05-14  Steven Willis (JIRA)

 [ https://issues.apache.org/jira/browse/MAPREDUCE-5018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Willis updated MAPREDUCE-5018:
--------------------------------------

Attachment: MAPREDUCE-5018.patch

New patch with tests

 Support raw binary data with Hadoop streaming
 ---------------------------------------------

 Key: MAPREDUCE-5018
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5018
 Project: Hadoop Map/Reduce
  Issue Type: New Feature
  Components: contrib/streaming
Reporter: Jay Hacker
Priority: Minor
 Attachments: MAPREDUCE-5018-branch-1.1.patch, MAPREDUCE-5018.patch, 
 MAPREDUCE-5018.patch, justbytes.jar, mapstream

