Hi Yaron,
I haven't looked at or used it in a while, but I seem to remember that each
mapper's SQL request was wrapped in a transaction to prevent the number
of rows changing. DBInputFormat uses
Connection.TRANSACTION_SERIALIZABLE from java.sql.Connection to prevent
changes in the number of rows selected from a where clause.
The locking behavior I observed may also have been related to how MySQL
was set up at the time.
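For illustration, here is a minimal sketch of what wrapping a mapper's read in a serializable transaction looks like in plain JDBC. The class and method names are hypothetical and this is not DBInputFormat's actual source; only Connection.TRANSACTION_SERIALIZABLE and the two setter calls are standard java.sql API.

```java
import java.sql.Connection;
import java.sql.SQLException;

/** Hypothetical helper (not Hadoop code): prepares a connection for a consistent read. */
final class SerializableRead {

    /**
     * Turns off auto-commit so that the following statements share one
     * transaction, then asks the driver for the strictest isolation level.
     * Whether this locks rows, ranges, or a whole table depends on the
     * database and its storage engine.
     */
    static void prepare(Connection conn) throws SQLException {
        conn.setAutoCommit(false);
        conn.setTransactionIsolation(Connection.TRANSACTION_SERIALIZABLE);
    }
}
```

Each mapper would call something like this on its own connection, so the isolation applies per mapper rather than to the job as a whole.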
On 09/11/2012 09:25 AM, Yaron Gonen wrote:
Thanks for the fast response.
Nick, regarding locking a table: as far as I understood from the code,
each mapper opens its own connection to the DB. I didn't see any code
such that the job creates a transaction and passes it to the mapper.
Did I miss something?
Again, thanks!
On Tue, Sep 11, 2012 at 4:00 PM, Nick Jones wrote:
Hi Yaron,
Replies inline below.
On 09/11/2012 07:41 AM, Yaron Gonen wrote:
Hi,
After reviewing the class's (not very complicated) code, I
have some questions I hope someone can answer:
* (more general question) Are there many use-cases for using
DBInputFormat? Do most Hadoop jobs take their input from
files or DBs?
Bejoy's right, most jobs utilize data across HDFS or some other
distributed architecture to feed M/R at a sufficient rate.
DBInputFormat could be helpful in pulling pointers to other
sources of data (e.g. file paths for filers where actual binary
content is stored).
* What happens when the database is updated during the mappers'
data retrieval phase? Is there a way to lock the database
before the data retrieval phase and release it afterwards?
The whole job creates a transaction against the RDBMS that ensures
consistent state throughout the job. Depending on the source and
settings, this might entirely lock a table or lock the selected
rows by the query.
* Since all mappers open a connection to the same database, one
cannot use hundreds of mappers. Is there a solution to this problem?
Depends on the connection limits and the number of rows requested.
In my experience, the server ran into other problems before it
hit its connection limit.
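One way to keep the connection count bounded is to fix the number of splits up front, since each split means one mapper and one connection. The sketch below shows how a row count could be cut into LIMIT/OFFSET ranges. RowRangeSplitter and Range are hypothetical names, not Hadoop classes, and the paging clause assumes MySQL/PostgreSQL syntax.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not Hadoop code): cuts a row count into per-mapper ranges. */
final class RowRangeSplitter {

    /** Half-open row range [start, end) that a single mapper would read. */
    static final class Range {
        final long start;
        final long end;

        Range(long start, long end) {
            this.start = start;
            this.end = end;
        }

        long length() {
            return end - start;
        }

        /** Paging clause to append to the mapper's SELECT; syntax varies by database. */
        String limitClause() {
            return " LIMIT " + length() + " OFFSET " + start;
        }
    }

    /**
     * Divides rowCount rows into exactly `mappers` ranges. Every range gets
     * rowCount / mappers rows, and the last range also takes the remainder.
     * If rowCount is smaller than mappers, all ranges but the last are empty.
     */
    static List<Range> split(long rowCount, int mappers) {
        if (rowCount < 0 || mappers < 1) {
            throw new IllegalArgumentException("need rowCount >= 0 and mappers >= 1");
        }
        List<Range> ranges = new ArrayList<>();
        long chunk = rowCount / mappers;
        for (int i = 0; i < mappers; i++) {
            long start = i * chunk;
            long end = (i == mappers - 1) ? rowCount : start + chunk;
            ranges.add(new Range(start, end));
        }
        return ranges;
    }
}
```

With 10 rows and 3 mappers this yields the ranges 0-3, 3-6 and 6-10, so the job opens three connections no matter how large the table is.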
Thanks,
Yaron