Andreas, I don't remember off the top of my head, but I think both are correct: returning an empty locations list, and using the cardinality of the set as the "length". Double-check on the Hadoop map-reduce user list.
D

On Mon, Jan 31, 2011 at 11:11 AM, Andreas Paepcke <[email protected]> wrote:
> I pull records from a remote Web site. I have a subclass of
> RecordReader, which knows how to retrieve those records one by one
> from a Web stream. The Web site is set up such that I can run multiple
> such readers, each pulling a distinct subset of the records from the
> site.
>
> My strategy plan: In my subclass of InputFormat I figure out a good
> load balance, given a number of mapper machines. My splits will then
> identify record subsets, which each mapper is to pull from the Web
> site and process at runtime.
>
> Two questions:
> 1. InputSplit wants a "list of nodes by name where the data for the
>    split would be local." But the records will be pulled at runtime
>    by each mapper. So no data is local to a node. Is it safe to
>    return an empty String[] from my getLocations() implementation?
> 2. The getLength() method also seems geared towards files. In my
>    case I would presumably just return the number of records each
>    split will retrieve from the Web server? I.e. the cardinality of
>    my subsets?
>
> Thanks!
>
> Andreas
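For what it's worth, here is a rough sketch of the kind of split discussed above, with getLocations() returning an empty array and getLength() returning the record count. The class name WebRecordSplit and the firstRecord/recordCount fields are made up for illustration; how a record subset is actually identified depends on the Web site. In a real job the class would extend org.apache.hadoop.mapreduce.InputSplit and implement org.apache.hadoop.io.Writable; both are left out here so the sketch compiles without Hadoop on the classpath.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// Hypothetical split describing one subset of the remote records.
// Assumption: a subset is a contiguous range, identified by the index
// of its first record plus a count. Adjust to the site's real scheme.
class WebRecordSplit {
    private long firstRecord;  // index of the first record in this subset
    private long recordCount;  // number of records in this subset

    // No-argument constructor, so the framework can create an empty
    // instance and then fill it in through readFields().
    WebRecordSplit() {
    }

    WebRecordSplit(long firstRecord, long recordCount) {
        this.firstRecord = firstRecord;
        this.recordCount = recordCount;
    }

    public long getFirstRecord() {
        return firstRecord;
    }

    // "Length" of the split: here the cardinality of the record subset
    // rather than a byte count.
    public long getLength() {
        return recordCount;
    }

    // No node holds the data locally, so there are no preferred hosts.
    public String[] getLocations() {
        return new String[0];
    }

    // Serialization in the style of Hadoop's Writable, so the split
    // could be shipped from the job client to the mapper.
    public void write(DataOutput out) throws IOException {
        out.writeLong(firstRecord);
        out.writeLong(recordCount);
    }

    public void readFields(DataInput in) throws IOException {
        firstRecord = in.readLong();
        recordCount = in.readLong();
    }
}
```

The InputFormat subclass would then return one such object per mapper from getSplits(), and the RecordReader would read the first-record index and count from its split to decide which records to pull.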
