Re: Some general questions about DBInputFormat

Nick Jones Tue, 11 Sep 2012 06:01:27 -0700

Hi Yaron

Replies inline below.


On 09/11/2012 07:41 AM, Yaron Gonen wrote:

Hi,
After reviewing the class's (not very complicated) code, I have some questions I hope someone can answer:
  * (more general question) Are there many use-cases for using
    DBInputFormat? Do most Hadoop jobs take their input from files or DBs?

Bejoy's right: most jobs utilize data across HDFS or some other distributed architecture to feed M/R at a sufficient rate. DBInputFormat could be helpful in pulling pointers to other sources of data (e.g. file paths for filers where the actual binary content is stored).


  * What happens when the database is updated during mappers' data
    retrieval phase? Is there a way to lock the database before the
    data retrieval phase and release it afterwards?

The whole job creates a transaction against the RDBMS that ensures consistent state throughout the job. Depending on the source and settings, this might lock an entire table or lock only the rows selected by the query.


  * Since all mappers open a connection to the same DBS, one cannot
    use hundreds of mappers. Is there a solution to this problem?

Depends on the connection limits and the number of rows requested. I've found that the server suffered other problems first, before running into connection count limitations.
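
To make the relationship between mapper count and database load concrete, here is a minimal sketch, modeled on how DBInputFormat appears to carve up its input: the total row count is divided evenly across the requested number of map tasks, and each task then runs one paged query over its own connection. The class name SplitSketch, its methods, and the example query are hypothetical and written only for illustration; the sketch uses plain Java and does not call any Hadoop API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of Hadoop.
class SplitSketch {

    // Returns one {offset, length} pair per map task.
    // Assumes numMappers > 0. The last split absorbs the remainder rows.
    static List<long[]> splits(long rowCount, int numMappers) {
        long chunkSize = rowCount / numMappers;
        List<long[]> out = new ArrayList<>();
        for (int i = 0; i < numMappers; i++) {
            long start = i * chunkSize;
            long end = (i + 1 == numMappers) ? rowCount : start + chunkSize;
            out.add(new long[] {start, end - start});
        }
        return out;
    }

    // Builds the paged query a single task would run for its split.
    static String pagedQuery(String baseQuery, long[] split) {
        return baseQuery + " LIMIT " + split[1] + " OFFSET " + split[0];
    }

    public static void main(String[] args) {
        // Assumed example: 10 rows spread over 3 mappers means 3 connections.
        for (long[] s : splits(10, 3)) {
            System.out.println(pagedQuery("SELECT path FROM files ORDER BY id", s));
        }
    }
}
```

Under these assumptions the number of simultaneous connections is bounded by the number of map tasks running at once, so lowering the requested map task count (for example through JobConf.setNumMapTasks) is the simplest lever. Note that large OFFSET values can be expensive on the database side, which may be one of the "other problems" that show up before connection limits do.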


Thanks,
Yaron
