Thanks for the fast response.

Nick, regarding locking a table: as far as I understand from the code, each mapper opens its own connection to the DB. I didn't see any code where the job creates a transaction and passes it to the mappers. Did I miss something?

Again, thanks!
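
P.S. To make the question concrete, here is a rough, self-contained sketch of how I read the splitting logic. It is not the actual Hadoop source: the names SplitSketch, Split, planSplits and queryFor are made up for illustration, and the table and query are hypothetical. As far as I can tell, the row count is divided into one range per mapper, and each mapper then runs its own LIMIT/OFFSET query on its own connection.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of how a table appears to be divided among mappers.
// All names here are hypothetical; this is not the Hadoop implementation.
final class SplitSketch {

    // Half-open row range [start, end) handled by a single mapper.
    static final class Split {
        final long start;
        final long end;

        Split(long start, long end) {
            this.start = start;
            this.end = end;
        }

        long length() {
            return end - start;
        }
    }

    // Divide rowCount rows into numMappers ranges; the last range takes the remainder.
    static List<Split> planSplits(long rowCount, int numMappers) {
        if (numMappers <= 0) {
            throw new IllegalArgumentException("numMappers must be positive");
        }
        List<Split> splits = new ArrayList<>();
        long chunk = rowCount / numMappers;
        for (int i = 0; i < numMappers; i++) {
            long start = i * chunk;
            long end = (i + 1 == numMappers) ? rowCount : start + chunk;
            splits.add(new Split(start, end));
        }
        return splits;
    }

    // Build the query one mapper would issue, independently of the other mappers.
    static String queryFor(String baseQuery, Split split) {
        return baseQuery + " LIMIT " + split.length() + " OFFSET " + split.start;
    }

    public static void main(String[] args) {
        // Hypothetical table with 10 rows read by 3 mappers.
        for (Split split : planSplits(10, 3)) {
            System.out.println(queryFor("SELECT id, name FROM employees ORDER BY id", split));
        }
    }
}
```

If that reading is right, the queries are independent statements, so a row inserted or deleted between two of them would shift the offsets, and a row could be read twice or skipped unless something else holds a lock or a snapshot for the whole job.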
On Tue, Sep 11, 2012 at 4:00 PM, Nick Jones <[email protected]> wrote:

> Hi Yaron
>
> Replies inline below.
>
> On 09/11/2012 07:41 AM, Yaron Gonen wrote:
>
>> Hi,
>> After reviewing the class's (not very complicated) code, I have some
>> questions I hope someone can answer:
>>
>> * (more general question) Are there many use-cases for using
>>   DBInputFormat? Do most Hadoop jobs take their input from files or DBs?
>
> Bejoy's right, most jobs utilize data across HDFS or some other
> distributed architecture to feed M/R at a sufficient rate. DBInputFormat
> could be helpful in pulling pointers to other sources of data (e.g. file
> paths for filers where actual binary content is stored).
>
>> * What happens when the database is updated during mappers' data
>>   retrieval phase? Is there a way to lock the database before the
>>   data retrieval phase and release it afterwards?
>
> The whole job creates a transaction against the RDBMS that ensures
> consistent state throughout the job. Depending on the source and settings,
> this might entirely lock a table or lock the rows selected by the query.
>
>> * Since all mappers open a connection to the same DBS, one cannot
>>   use hundreds of mappers. Is there a solution to this problem?
>
> Depends on the connection limits and the number of rows requested. I've
> found that the server suffered other problems first before connection count
> limitations.
>
>> Thanks,
>> Yaron
