Re: Loading file to HDFS with custom chunk structure

Mohammad Tariq Wed, 16 Jan 2013 07:57:57 -0800

You might also find this link <https://github.com/cloudera/seismichadoop> useful.


Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com


On Wed, Jan 16, 2013 at 9:19 PM, Mohammad Tariq <[email protected]> wrote:

> Since SEGY files are flat binary files, you might have a tough
> time dealing with them, as there is no native InputFormat for
> them. You can strip off the EBCDIC + binary header (the initial
> 3600 bytes) and store the SEGY file as Sequence Files, where each
> trace (trace header + trace data) would be the value and the
> trace number could be the key.
>
> Otherwise you have to write a custom InputFormat to deal with
> that. Using Sequence Files would improve performance as well,
> since they are already in key-value form.
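
The Sequence File layout suggested above (skip the 3600-byte header, then emit one record per trace keyed by trace number) could be sketched roughly as follows. This is only an illustration under several assumptions: every trace has the same length, there are no extended textual headers, and the sample count and format code are read from the binary header at file bytes 3221-3222 and 3225-3226. The class name SegyTraceSplitter and the sink callback are hypothetical; the sink stands in for whatever actually writes the key-value pairs, for example a Sequence File writer.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.util.function.BiConsumer;

public class SegyTraceSplitter {
    static final int FILE_HEADER_BYTES = 3200 + 400; // textual + binary header
    static final int TRACE_HEADER_BYTES = 240;       // fixed-size trace header

    /** Bytes per sample for the data sample format codes handled in this sketch. */
    static int bytesPerSample(int formatCode) {
        switch (formatCode) {
            case 1: case 2: case 5: return 4; // IBM float, 32-bit int, IEEE float
            case 3: return 2;                 // 16-bit int
            case 8: return 1;                 // 8-bit int
            default:
                throw new IllegalArgumentException("Unsupported format code: " + formatCode);
        }
    }

    /**
     * Skips the 3600-byte file header and passes each trace (trace header + data)
     * to the sink with a 1-based trace number as key. Returns the trace count.
     * Assumption: all traces have the sample count given in the binary header.
     */
    static long split(InputStream raw, BiConsumer<Long, byte[]> sink) throws IOException {
        DataInputStream in = new DataInputStream(raw);
        byte[] fileHeader = new byte[FILE_HEADER_BYTES];
        in.readFully(fileHeader);

        ByteBuffer hdr = ByteBuffer.wrap(fileHeader);      // big-endian by default
        int samplesPerTrace = hdr.getShort(3220) & 0xFFFF; // file bytes 3221-3222
        int formatCode = hdr.getShort(3224);               // file bytes 3225-3226
        int traceLength = TRACE_HEADER_BYTES + samplesPerTrace * bytesPerSample(formatCode);

        long traceNo = 0;
        int first;
        while ((first = in.read()) != -1) {                // -1 means a clean end of file
            byte[] trace = new byte[traceLength];
            trace[0] = (byte) first;
            in.readFully(trace, 1, traceLength - 1);       // throws EOFException on a truncated trace
            sink.accept(++traceNo, trace);
        }
        return traceNo;
    }

    public static void main(String[] args) throws IOException {
        // Synthetic file: 3 traces of 4 samples each, format code 5 (4-byte IEEE float).
        byte[] file = new byte[FILE_HEADER_BYTES + 3 * (TRACE_HEADER_BYTES + 4 * 4)];
        ByteBuffer.wrap(file).putShort(3220, (short) 4).putShort(3224, (short) 5);
        long n = split(new ByteArrayInputStream(file),
                (key, value) -> System.out.println(key + " -> " + value.length + " bytes"));
        System.out.println(n + " traces");
    }
}
```

Reading from a stream rather than a byte array keeps memory use at one trace at a time, which matters for a file above 10GB.
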
>
> Warm Regards,
> Tariq
> https://mtariq.jux.com/
> cloudfront.blogspot.com
>
>
> On Wed, Jan 16, 2013 at 9:13 PM, Mohit Anchlia <[email protected]> wrote:
>
>> Look at the block size concept in Hadoop and see if that is what you are
>> looking for.
>>
>> Sent from my iPhone
>>
>> On Jan 16, 2013, at 7:31 AM, Kaliyug Antagonist <
>> [email protected]> wrote:
>>
>> I want to load a SegY <http://en.wikipedia.org/wiki/SEG_Y> file onto
>> HDFS of a 3-node Apache Hadoop cluster.
>>
>> To summarize, the SegY file consists of:
>>
>>    1. 3200 bytes *textual header*
>>    2. 400 bytes *binary header*
>>    3. Variable bytes *data*
>>
>> 99.99% of the file size is due to the variable bytes data, which is a
>> collection of thousands of contiguous traces. For any SegY file to make
>> sense, it must have the textual header + binary header + at least one
>> trace of data. What I want to achieve is to split a large SegY file across
>> the Hadoop cluster so that a smaller SegY file is available on each node
>> for local processing.
>>
>> The scenario is as follows:
>>
>>    1. The SegY file is large (above 10GB) and resides on the
>>    local file system of the NameNode machine
>>    2. The file is to be split across the nodes in such a way that each
>>    node has a small SegY file with a strict structure: 3200 bytes
>>    *textual header* + 400 bytes *binary header* + variable bytes *data*.
>>    Obviously, I can't blindly use FSDataOutputStream or
>>    hadoop fs -copyFromLocal, as this may not ensure the format in which
>>    the chunks of the larger file are required
>>
>> Please guide me as to how I must proceed.
>>
>> Thanks and regards !
>>
>>
>
