Hi Yaron
Replies inline below.
On 09/11/2012 07:41 AM, Yaron Gonen wrote:
Hi,
After reviewing the class's (not very complicated) code, I have some
questions I hope someone can answer:
* (more general question) Are there many use-cases for using
DBInputFormat? Do most Hadoop jobs take their input from files or DBs?
Bejoy's right: most jobs read their data from HDFS or some other
distributed store, which can feed M/R at a sufficient rate. DBInputFormat
can still be helpful for pulling pointers to other sources of data (e.g.
file paths for filers where the actual binary content is stored).
* What happens when the database is updated during mappers' data
retrieval phase? Is there a way to lock the database before the
data retrieval phase and release it afterwards?
The whole job creates a transaction against the RDBMS that ensures
consistent state throughout the job. Depending on the source and
settings, this might lock an entire table or lock only the rows selected
by the query.
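For reference, here is a minimal sketch of what such a read transaction
looks like at the plain JDBC level. As far as I recall, DBInputFormat
does something similar on each connection it opens (so per task rather
than once for the whole job), but treat that as an assumption. The
ConsistentRead class and its method names are made up for illustration.
Whether SERIALIZABLE locks the table or only the selected rows depends
on the database.

```java
import java.sql.Connection;
import java.sql.SQLException;

// Hypothetical helper, not part of Hadoop: brackets a read in one transaction.
final class ConsistentRead {
    private ConsistentRead() {}

    // Call before running the SELECT on this connection.
    static void begin(Connection conn) throws SQLException {
        // Set the isolation level first; changing it mid-transaction is
        // implementation-defined in JDBC.
        conn.setTransactionIsolation(Connection.TRANSACTION_SERIALIZABLE);
        // Turning auto-commit off keeps all following statements in one
        // transaction until commit() or rollback().
        conn.setAutoCommit(false);
    }

    // Call once all rows have been read; releases whatever the database held.
    static void end(Connection conn) throws SQLException {
        conn.commit();
    }
}
```

A long-running SERIALIZABLE transaction can block writers for the whole
read, so it is worth checking what your database actually locks.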
* Since all mappers open a connection to the same DB, one cannot
use hundreds of mappers. Is there a solution to this problem?
It depends on the connection limits and the number of rows requested. In
my experience, the server runs into other problems before it hits
connection-count limits.
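If connections do become the bottleneck, the usual lever is the number
of splits, since each split means one mapper and one connection. My
understanding is that DBInputFormat divides the row count by the
configured number of map tasks and has each mapper page through its own
slice. The sketch below only illustrates that arithmetic in plain Java.
SplitPlanner, RowRange and the LIMIT/OFFSET clause are assumptions for
illustration, not Hadoop's actual code, and paging syntax varies by
database.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical planner: caps the number of concurrent DB connections by
// capping the number of row ranges (one range per mapper).
final class SplitPlanner {
    private SplitPlanner() {}

    // Half-open row range [start, end) that a single mapper would read.
    static final class RowRange {
        final long start;
        final long end;

        RowRange(long start, long end) {
            this.start = start;
            this.end = end;
        }

        long length() {
            return end - start;
        }

        // Paging clause a mapper could append to its SELECT.
        String pagingClause() {
            return "LIMIT " + length() + " OFFSET " + start;
        }
    }

    // Splits rowCount rows into at most maxConnections contiguous ranges.
    static List<RowRange> plan(long rowCount, int maxConnections) {
        if (rowCount < 0 || maxConnections < 1) {
            throw new IllegalArgumentException("rowCount >= 0 and maxConnections >= 1 required");
        }
        // Never create more ranges than there are rows (but always at least one).
        int n = (int) Math.min(maxConnections, Math.max(1L, rowCount));
        long base = rowCount / n;   // rows every range gets
        long extra = rowCount % n;  // first 'extra' ranges get one more row
        List<RowRange> ranges = new ArrayList<>(n);
        long start = 0;
        for (int i = 0; i < n; i++) {
            long len = base + (i < extra ? 1 : 0);
            ranges.add(new RowRange(start, start + len));
            start += len;
        }
        return ranges;
    }
}
```

For example, SplitPlanner.plan(10, 3) yields the ranges [0,4), [4,7) and
[7,10), so at most three mappers hit the database at once.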
Thanks,
Yaron