Hi Yaron

Replies inline below.

On 09/11/2012 07:41 AM, Yaron Gonen wrote:
Hi,
After reviewing the class's (not very complicated) code, I have some questions I hope someone can answer:

  * (more general question) Are there many use-cases for using
    DBInputFormat? Do most Hadoop jobs take their input from files or DBs?

Bejoy's right: most jobs read their input from HDFS or some other distributed store, which can feed M/R at a sufficient rate. DBInputFormat can be helpful for pulling pointers to other sources of data (e.g. file paths on filers where the actual binary content is stored).

  * What happens when the database is updated during mappers' data
    retrieval phase? Is there a way to lock the database before the
    data retrieval phase and release it afterwards?

The whole job creates a transaction against the RDBMS that ensures a consistent state throughout the job. Depending on the source and settings, this might lock an entire table or only the rows selected by the query.

  * Since all mappers open a connection to the same DBS, one cannot
    use hundreds of mappers. Is there a solution to this problem?

That depends on the server's connection limits and the number of rows requested. In my experience, the server ran into other problems before it hit any connection count limit.
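
As a rough illustration of why the mapper count bounds the connection count: as far as I understand it, DBInputFormat divides the total row count by the number of map tasks, and each task reads its slice over its own connection with a LIMIT/OFFSET style query. Capping the number of map tasks therefore caps the concurrent connections. The sketch below is plain Java that only mimics that arithmetic; it is not Hadoop's code, and SplitSketch, RowRange, and the example query are made-up names for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: mimics how a row count might be divided among mappers.
final class SplitSketch {

    /** One mapper's share of the result set: rows [start, end). */
    static final class RowRange {
        final long start;
        final long end;

        RowRange(long start, long end) {
            this.start = start;
            this.end = end;
        }

        long length() {
            return end - start;
        }

        /** Appends a LIMIT/OFFSET clause so the query reads only this slice. */
        String toQuery(String baseQuery) {
            return baseQuery + " LIMIT " + length() + " OFFSET " + start;
        }
    }

    /** Splits rowCount rows into one range per mapper; the last takes the remainder. */
    static List<RowRange> splits(long rowCount, int mappers) {
        if (mappers < 1) {
            throw new IllegalArgumentException("need at least one mapper");
        }
        List<RowRange> ranges = new ArrayList<>();
        long chunk = rowCount / mappers;
        for (int i = 0; i < mappers; i++) {
            long start = i * chunk;
            long end = (i == mappers - 1) ? rowCount : start + chunk;
            ranges.add(new RowRange(start, end));
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Assumed example query; a stable ORDER BY keeps the slices disjoint.
        String base = "SELECT path FROM files ORDER BY id";
        for (RowRange r : splits(1000, 4)) {
            System.out.println(r.toQuery(base));
        }
    }
}
```

With 4 mappers there are at most 4 simultaneous connections, each pulling a quarter of the rows; the trade-off is that each connection holds its query open for longer.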

Thanks,
Yaron

