Re: Streaming value of (200MB) from a SequenceFile

Sandy Ryza Sun, 31 Mar 2013 22:59:39 -0700

Hi Rahul,

I don't think saving the stream for later use would work - I was just
suggesting that if only some aggregate statistics needed to be calculated,
they could be calculated at read time instead of in the mapper.  Nothing
requires a Writable to contain all the data that it reads.
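
As a rough sketch of that idea (not code from this thread), a value class
could consume the bytes in fixed-size chunks and keep only running
aggregates.  The class name, the byte-sum statistic, and the
BytesWritable-style layout (4-byte length followed by the raw bytes) are all
assumptions for illustration.  In a real job the class would implement
org.apache.hadoop.io.Writable; that is omitted here so the snippet compiles
without Hadoop on the classpath, and wiring it into a SequenceFile or MapFile
reader is not shown.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// Hypothetical read-side value type. The two methods mirror the
// signatures of Hadoop's Writable interface.
class ValueStatsWritable {
    private long length;   // number of payload bytes consumed
    private long byteSum;  // sum of unsigned byte values (example aggregate)

    // Assumed layout: a 4-byte length, then that many raw bytes.
    public void readFields(DataInput in) throws IOException {
        int remaining = in.readInt();
        length = remaining;
        byteSum = 0;
        byte[] chunk = new byte[64 * 1024]; // reused buffer; memory stays bounded
        while (remaining > 0) {
            int n = Math.min(chunk.length, remaining);
            in.readFully(chunk, 0, n);
            for (int i = 0; i < n; i++) {
                byteSum += chunk[i] & 0xFF;
            }
            remaining -= n;
        }
    }

    // The payload is never retained, so it cannot be written back out.
    public void write(DataOutput out) throws IOException {
        throw new UnsupportedOperationException("read-side sketch only");
    }

    public long getLength() { return length; }
    public long getByteSum() { return byteSum; }
}
```

After readFields returns, only the two long fields are held, regardless of
how large the value was.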


That's a good point that you can pass the locations of the files.  A
drawback is that Hadoop tries to schedule mappers on the nodes where their
input data is stored.  With this approach the input split holds only the
paths, so the large files the mapper then opens may live on other nodes,
and the locality advantage is lost.

200 MB is not too small a file for Hadoop.  A typical HDFS block size is 64
MB or 128 MB, so a file that's larger than that is not unreasonable.
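
To put numbers on that, here is a small sketch of the block arithmetic.  The
helper name is made up for illustration; the only inputs are the file size
and the two block sizes mentioned above.  The NameNode tracks metadata per
file and per block, so the block count gives a rough sense of what each such
file costs it.

```java
// Hypothetical helper: how many HDFS blocks a file of a given size occupies.
class BlockCount {
    static final long MB = 1024L * 1024L;

    // Ceiling of fileBytes / blockBytes, using integer arithmetic.
    static long blocksFor(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;
    }
}
```

A 200 MB file comes out to 4 blocks at a 64 MB block size and 2 blocks at
128 MB.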

-Sandy

On Sun, Mar 31, 2013 at 8:56 PM, Rahul Bhattacharjee <
[email protected]> wrote:

> Sorry for the multiple replies.
>
> There is one more thing that can be done (I guess) for streaming the
> values rather than constructing the whole object itself. We can store the
> value in HDFS as a file and have its location as the value passed to the
> mapper. The mapper can then open a stream using the location specified.
>
> Not sure if a 200 MB file would qualify as a small file wrt Hadoop, or if
> too many 200 MB files would have any impact on the NN.
>
> Thanks,
> Rahul
>
>
>
> On Mon, Apr 1, 2013 at 9:02 AM, Rahul Bhattacharjee <
> [email protected]> wrote:
>
>> Hi Sandy,
>>
>> I am also new to Hadoop and have a question here.
>> The Writable does have a DataInput stream so that the objects can be
>> constructed from the byte stream.
>> Are you suggesting saving the stream for later use? Later on we cannot
>> ascertain the state of the stream.
>> For a large value, I think we can actually take the useful part and
>> emit it from a mapper. We might also have a custom input format to
>> do this, so that the large value doesn't even reach the mapper.
>>
>> Am I missing anything here?
>>
>> Thanks,
>> Rahul
>>
>>
>>
>> On Sat, Mar 30, 2013 at 11:22 PM, Jerry Lam <[email protected]> wrote:
>>
>>> Hi everyone,
>>>
>>> I'm having trouble streaming individual key-value pairs of 200MB to 1GB
>>> from a MapFile.
>>> I need to stream the large value to an OutputStream instead of reading
>>> the entire value before processing, because reading it all potentially
>>> uses too much memory.
>>>
>>> I read the API for MapFile; next(WritableComparable key, Writable
>>> val) does not return an input stream.
>>>
>>> How can I accomplish this?
>>>
>>> Thanks,
>>>
>>> Jerry
>>>
>>
>>
>
