You might also find this link useful: <https://github.com/cloudera/seismichadoop>

Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com

On Wed, Jan 16, 2013 at 9:19 PM, Mohammad Tariq <[email protected]> wrote:

> Since SEGY files are flat binary files, you might have a tough
> time dealing with them, as there is no native InputFormat for
> them. You can strip off the EBCDIC+Binary header (the initial 3600
> bytes) and store the SEGY file as Sequence Files, where each
> trace (Trace Header+Trace Data) would be the value and the
> trace no. could be the key.
>
> Otherwise you have to write a custom InputFormat to deal with
> that. Using Sequence Files would also improve performance, since
> they are already in key-value form.
>
> Warm Regards,
> Tariq
> https://mtariq.jux.com/
> cloudfront.blogspot.com
>
> On Wed, Jan 16, 2013 at 9:13 PM, Mohit Anchlia <[email protected]> wrote:
>
>> Look at the block size concept in Hadoop and see if that is what you are
>> looking for
>>
>> Sent from my iPhone
>>
>> On Jan 16, 2013, at 7:31 AM, Kaliyug Antagonist <[email protected]> wrote:
>>
>> I want to load a SegY <http://en.wikipedia.org/wiki/SEG_Y> file onto
>> HDFS of a 3-node Apache Hadoop cluster.
>>
>> To summarize, the SegY file consists of:
>>
>> 1. 3200 bytes *textual header*
>> 2. 400 bytes *binary header*
>> 3. Variable bytes *data*
>>
>> 99.99% of the file's size is due to the variable-length data, which is
>> a collection of thousands of contiguous traces. For any SegY file to make
>> sense, it must have the textual header + binary header + at least one
>> trace of data. What I want to achieve is to split a large SegY file
>> across the Hadoop cluster so that a smaller SegY file is available on
>> each node for local processing.
>>
>> The scenario is as follows:
>>
>> 1. The SegY file is large (above 10GB) and resides on the local file
>> system of the NameNode machine
>> 2. The file is to be split across the nodes in such a way that each
>> node has a small SegY file with a strict structure: 3200 bytes
>> *textual header* + 400 bytes *binary header* + variable bytes *data*
>>
>> Obviously, I can't blindly use FSDataOutputStream or
>> hadoop fs -copyFromLocal, as this may not ensure the format in which
>> the chunks of the larger file are required.
>>
>> Please guide me as to how I must proceed.
>>
>> Thanks and regards!
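
The suggestion quoted above (strip the 3600-byte EBCDIC+binary header, then treat each trace as a value keyed by its trace number) can be sketched roughly as follows. This is only an illustration in plain Python, not Hadoop code and not how seismichadoop does it. It assumes a SEG-Y rev 1 layout in which every trace has the same length, taken from the binary header (samples per trace at file bytes 3221-3222, data sample format code at bytes 3225-3226), with a 240-byte trace header. The helper names `read_file_header`, `trace_length`, `iter_traces` and `build_small_segy` are hypothetical, made up for this example.

```python
import io
import struct
from typing import BinaryIO, Iterable, Iterator, Tuple

TEXT_HEADER_BYTES = 3200     # EBCDIC textual header
BINARY_HEADER_BYTES = 400    # binary file header
FILE_HEADER_BYTES = TEXT_HEADER_BYTES + BINARY_HEADER_BYTES  # 3600 in total
TRACE_HEADER_BYTES = 240     # header in front of every trace

# Bytes per sample for the common SEG-Y rev 1 format codes (assumed subset).
BYTES_PER_SAMPLE = {1: 4, 2: 4, 3: 2, 5: 4, 8: 1}


def read_file_header(f: BinaryIO) -> bytes:
    """Read the 3600-byte textual + binary header from the start of the file."""
    header = f.read(FILE_HEADER_BYTES)
    if len(header) != FILE_HEADER_BYTES:
        raise ValueError("file is shorter than the 3600-byte SEG-Y header")
    return header


def trace_length(file_header: bytes) -> int:
    """Length in bytes of one trace record (trace header + trace data).

    Assumption: all traces are fixed-length, as declared in the binary header.
    """
    binary = file_header[TEXT_HEADER_BYTES:]
    (samples_per_trace,) = struct.unpack(">H", binary[20:22])  # file bytes 3221-3222
    (format_code,) = struct.unpack(">h", binary[24:26])        # file bytes 3225-3226
    return TRACE_HEADER_BYTES + samples_per_trace * BYTES_PER_SAMPLE[format_code]


def iter_traces(f: BinaryIO, length: int) -> Iterator[Tuple[int, bytes]]:
    """Yield (trace number, raw trace bytes) pairs; f must be positioned after the header."""
    trace_no = 0
    while True:
        record = f.read(length)
        if not record:
            return
        if len(record) != length:
            raise ValueError("truncated trace at end of file")
        trace_no += 1
        yield trace_no, record


def build_small_segy(file_header: bytes, traces: Iterable[bytes]) -> bytes:
    """Re-attach the original header to a subset of traces to form a smaller SEG-Y file."""
    return file_header + b"".join(traces)


# Synthetic input: 3 traces of 4 IEEE-float samples each (format code 5).
binary = bytearray(BINARY_HEADER_BYTES)
struct.pack_into(">H", binary, 20, 4)
struct.pack_into(">h", binary, 24, 5)
demo_trace = bytes(TRACE_HEADER_BYTES) + struct.pack(">4f", 1.0, 2.0, 3.0, 4.0)
demo_file = io.BytesIO(bytes(TEXT_HEADER_BYTES) + bytes(binary) + demo_trace * 3)

header = read_file_header(demo_file)
records = list(iter_traces(demo_file, trace_length(header)))
print([(key, len(value)) for key, value in records])  # [(1, 256), (2, 256), (3, 256)]
```

In an actual Hadoop job, each (trace number, trace bytes) pair would presumably be appended to a SequenceFile instead of being collected in a list, and `build_small_segy` shows how the saved header could be put back in front of a group of traces when a node needs a well-formed SegY file.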
