Hi Rahul, I don't think saving the stream for later use would work - I was just suggesting that if only some aggregate statistics needed to be calculated, they could be calculated at read time instead of in the mapper. Nothing requires a Writable to contain all the data that it reads.
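To make the read-time idea concrete, here is a rough sketch rather than a tested Hadoop implementation. The class name StreamingStats and its getters are made up for illustration. The wire format (a 4-byte length prefix followed by the raw bytes) is an assumption about how the value was written. The readFields method has the same shape as Writable.readFields(DataInput), but the class does not import Hadoop, so it compiles on its own. A real version would implement org.apache.hadoop.io.Writable and supply write(DataOutput) as well.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical value type: reads a large payload in fixed-size chunks and
// keeps only aggregates, so memory use stays bounded by the buffer size.
public class StreamingStats {
    private final byte[] buffer = new byte[8192];
    private long byteCount;
    private long byteSum;

    // Same signature shape as Hadoop's Writable.readFields(DataInput).
    // ASSUMPTION: the value is serialized as an int length followed by
    // exactly that many payload bytes.
    public void readFields(DataInput in) throws IOException {
        byteCount = 0;
        byteSum = 0;
        int remaining = in.readInt();
        while (remaining > 0) {
            int chunk = Math.min(remaining, buffer.length);
            in.readFully(buffer, 0, chunk);
            for (int i = 0; i < chunk; i++) {
                byteSum += buffer[i] & 0xFF; // treat each byte as unsigned
            }
            byteCount += chunk;
            remaining -= chunk;
        }
    }

    public long getByteCount() {
        return byteCount;
    }

    public long getByteSum() {
        return byteSum;
    }

    // Small in-memory demonstration; no Hadoop or HDFS involved.
    public static void main(String[] args) throws IOException {
        byte[] payload = new byte[20000];
        for (int i = 0; i < payload.length; i++) {
            payload[i] = (byte) (i % 7);
        }
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(payload.length);
        out.write(payload);
        out.flush();

        StreamingStats stats = new StreamingStats();
        stats.readFields(new DataInputStream(
                new ByteArrayInputStream(bytes.toByteArray())));
        System.out.println(stats.getByteCount() + " bytes, sum " + stats.getByteSum());
    }
}
```

The same loop could copy each chunk to an OutputStream instead of summing it, which is closer to what Jerry asked for, though the reader would then need a way to get hold of that stream before next() is called.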
That's a good point that you can pass the locations of the files. A drawback is that Hadoop attempts to co-locate mappers with their input data, and this approach would negate that locality advantage: the scheduler only sees the input that holds the paths, not the large files they point to.

200 MB is not too small a file for Hadoop. A typical HDFS block size is 64 MB or 128 MB, so a file larger than that is not unreasonable.

-Sandy

On Sun, Mar 31, 2013 at 8:56 PM, Rahul Bhattacharjee <[email protected]> wrote:

> Sorry for the multiple replies.
>
> There is one more thing that can be done (I guess) for streaming the
> values rather than constructing the whole object itself. We can store the
> value in HDFS as a file and have the location as the value of the mapper.
> The mapper can open a stream using the location specified.
>
> Not sure if a 200 MB file would qualify as a small file wrt Hadoop or if
> too many 200 MB files would have any impact on the NN.
>
> Thanks,
> Rahul
>
> On Mon, Apr 1, 2013 at 9:02 AM, Rahul Bhattacharjee <[email protected]> wrote:
>
>> Hi Sandy,
>>
>> I am also new to Hadoop and have a question here.
>> The Writable does have a DataInput stream so that the objects can be
>> constructed from the byte stream.
>> Are you suggesting to save the stream for later use? But later we cannot
>> ascertain the state of the stream.
>> For a large value, I think we can actually take the useful part and
>> emit it from a mapper. We might also have a custom input format to
>> do this so that the large value doesn't even reach the mapper.
>>
>> Am I missing anything here?
>>
>> Thanks,
>> Rahul
>>
>> On Sat, Mar 30, 2013 at 11:22 PM, Jerry Lam <[email protected]> wrote:
>>
>>> Hi everyone,
>>>
>>> I'm having a problem streaming individual key-value pairs of 200 MB to
>>> 1 GB from a MapFile.
>>> I need to stream the large value to an OutputStream instead of reading
>>> the entire value before processing, because it potentially uses too much
>>> memory.
>>>
>>> I read the API for MapFile; next(WritableComparable key, Writable val)
>>> does not return an input stream.
>>>
>>> How can I accomplish this?
>>>
>>> Thanks,
>>>
>>> Jerry
