I pull records from a remote Web site. I have a subclass of
RecordReader, which knows how to retrieve those records one by one
from a Web stream. The Web site is set up such that I can run multiple
such readers, each pulling a distinct subset of the records from the
site.
My plan: in my subclass of InputFormat I work out a good load
balance for a given number of mapper machines. My splits then
identify the record subsets that each mapper is to pull from the Web
site and process at runtime.
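To make the load-balancing step concrete, here is a minimal sketch in plain Java. It assumes the records can be addressed by a contiguous index range 0..totalRecords-1, which may not hold for the real site. SplitPlanner, RecordRange and partition are hypothetical names used for illustration only; none of them is Hadoop API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: divides a record index range evenly among mappers.
class SplitPlanner {

    // Hypothetical half-open range [start, end) of record indices.
    static final class RecordRange {
        final long start;
        final long end;

        RecordRange(long start, long end) {
            this.start = start;
            this.end = end;
        }

        long size() {
            return end - start;
        }
    }

    // Returns numMappers ranges whose sizes differ by at most one record.
    // Assumption: records are numbered 0..totalRecords-1.
    static List<RecordRange> partition(long totalRecords, int numMappers) {
        if (totalRecords < 0 || numMappers <= 0) {
            throw new IllegalArgumentException("need totalRecords >= 0 and numMappers > 0");
        }
        List<RecordRange> ranges = new ArrayList<>();
        long base = totalRecords / numMappers;   // minimum records per mapper
        long extra = totalRecords % numMappers;  // the first 'extra' mappers get one more
        long start = 0;
        for (int i = 0; i < numMappers; i++) {
            long size = base + (i < extra ? 1 : 0);
            ranges.add(new RecordRange(start, start + size));
            start += size;
        }
        return ranges;
    }
}
```

Each RecordRange would then back one split, so the number of splits equals the number of mappers I am balancing for.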
Two questions:
1. InputSplit wants a "list of nodes by name where the data for the
split would be local." But the records will be pulled at runtime
by each mapper. So no data is local to a node. Is it safe to
return an empty String[] from my getLocations() implementation?
2. The getLength() method also seems geared towards files. In my
case I would presumably just return the number of records each
split will retrieve from the Web server? I.e. the cardinality of
my subsets?
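To illustrate both questions, here is a sketch of the split I have in mind. It mirrors only the two methods I am asking about and deliberately does not extend Hadoop's InputSplit, so that it compiles on its own. WebRecordSplit and its fields are hypothetical names. Whether the framework is happy with these return values is exactly what I would like confirmed.

```java
// Hypothetical stand-in for my InputSplit subclass. A real version would
// extend Hadoop's InputSplit and also handle serialization.
class WebRecordSplit {
    private final long firstRecord;  // index of the first record in this subset (assumed addressing scheme)
    private final long recordCount;  // number of records this split will pull

    WebRecordSplit(long firstRecord, long recordCount) {
        this.firstRecord = firstRecord;
        this.recordCount = recordCount;
    }

    long getFirstRecord() {
        return firstRecord;
    }

    // Question 2: report the cardinality of the subset rather than a byte count.
    long getLength() {
        return recordCount;
    }

    // Question 1: no node holds the data locally, so return no location hints.
    String[] getLocations() {
        return new String[0];
    }
}
```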
Thanks!
Andreas