Take a look at the block size concept in Hadoop (HDFS) and see whether that is what you are looking for.
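
If a plain block split turns out not to be enough, because every piece needs its own textual and binary header, one option is to pre-split the file locally before copying it to HDFS. Below is a rough Python sketch of that idea, based on the layout described in the quoted message. It is not a Hadoop API: `split_segy`, `traces_per_part` and the `part-NNNNN.sgy` file names are made up for illustration. It assumes a big-endian SEG-Y rev 1 file with no extended textual headers, a 240-byte trace header, the sample format code in bytes 3225-3226 of the file, and the per-trace sample count in bytes 115-116 of each trace header.

```python
import os
import struct

TEXT_HEADER_BYTES = 3200    # textual header, as described in the question
BINARY_HEADER_BYTES = 400   # binary header, as described in the question
TRACE_HEADER_BYTES = 240    # assumption: standard SEG-Y trace header size
# Assumption: bytes per sample for SEG-Y rev 1 data sample format codes.
BYTES_PER_SAMPLE = {1: 4, 2: 4, 3: 2, 4: 4, 5: 4, 8: 1}


def split_segy(src_path, out_dir, traces_per_part):
    """Split src_path into pieces of at most traces_per_part whole traces.

    Every piece starts with a copy of the 3600 header bytes.
    Returns the list of paths written (hypothetical naming scheme).
    """
    header_bytes = TEXT_HEADER_BYTES + BINARY_HEADER_BYTES
    os.makedirs(out_dir, exist_ok=True)
    parts = []
    out = None
    try:
        with open(src_path, "rb") as src:
            headers = src.read(header_bytes)
            if len(headers) < header_bytes:
                raise ValueError("file is shorter than the two headers")
            # Sample format code: bytes 3225-3226 of the file (1-based).
            (fmt,) = struct.unpack(">H", headers[3224:3226])
            sample_size = BYTES_PER_SAMPLE[fmt]
            count = 0
            while True:
                trace_header = src.read(TRACE_HEADER_BYTES)
                if len(trace_header) < TRACE_HEADER_BYTES:
                    break  # end of file
                # Samples in this trace: bytes 115-116 of the trace header.
                (n_samples,) = struct.unpack(">H", trace_header[114:116])
                data = src.read(n_samples * sample_size)
                if len(data) < n_samples * sample_size:
                    raise ValueError("truncated trace data")
                if out is None:
                    # Start a new piece and give it its own headers.
                    path = os.path.join(out_dir, "part-%05d.sgy" % len(parts))
                    out = open(path, "wb")
                    out.write(headers)
                    parts.append(path)
                out.write(trace_header)
                out.write(data)
                count += 1
                if count == traces_per_part:
                    out.close()
                    out = None
                    count = 0
    finally:
        if out is not None:
            out.close()
    return parts
```

Each piece could then be copied with `hadoop fs -copyFromLocal`. A piece that is no larger than the HDFS block size is stored as a single block, so choosing `traces_per_part` with the block size in mind keeps each small SegY file in one piece.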

Sent from my iPhone

On Jan 16, 2013, at 7:31 AM, Kaliyug Antagonist <[email protected]> wrote:

> I want to load a SegY file onto HDFS of a 3-node Apache Hadoop cluster.
>
> To summarize, the SegY file consists of:
>
> 3200 bytes textual header
> 400 bytes binary header
> Variable bytes data
>
> The 99.99% size of the file is due to the variable bytes data which is
> collection of thousands of contiguous traces. For any SegY file to make
> sense, it must have the textual header+binary header+at least one trace of
> data. What I want to achieve is to split a large SegY file onto the Hadoop
> cluster so that a smaller SegY file is available on each node for local
> processing.
>
> The scenario is as follows:
>
> The SegY file is large in size (above 10GB) and is resting on the local file
> system of the NameNode machine
>
> The file is to be split on the nodes in such a way each node has a small SegY
> file with a strict structure - 3200 bytes textual header + 400 bytes binary
> header + variable bytes data
>
> As obvious, I can't blindly use FSDataOutputStream or
> hadoop fs -copyFromLocal as this may not ensure the format in which the
> chunks of the larger file are required
>
> Please guide me as to how I must proceed.
>
> Thanks and regards !
