Hi Yaron,

I haven't looked at or used it in a while, but I seem to remember that each mapper's SQL request was wrapped in a transaction to keep the number of rows from changing. DBInputFormat sets Connection.TRANSACTION_SERIALIZABLE (from java.sql.Connection) so that the set of rows selected by a WHERE clause does not change while it is being read.

The locking behavior I observed may also have been related to how MySQL was set up at the time.
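
For reference, here is a minimal sketch of what that kind of per-connection setup looks like in plain JDBC. The class and method names are made up for illustration and are not copied from DBInputFormat. Whether the database enforces the isolation level with table locks, row locks, or snapshots depends on the engine and its configuration.

```java
import java.sql.Connection;
import java.sql.SQLException;

// Illustrative only: a hypothetical helper, not Hadoop's actual code.
class IsolationSketch {

    // Turns off auto-commit and requests SERIALIZABLE isolation, so that all
    // reads on this connection happen inside one transaction.
    static Connection prepareForConsistentRead(Connection conn) throws SQLException {
        conn.setAutoCommit(false);
        conn.setTransactionIsolation(Connection.TRANSACTION_SERIALIZABLE);
        return conn;
    }

    // Maps a java.sql.Connection isolation constant to a readable name.
    static String isolationName(int level) {
        switch (level) {
            case Connection.TRANSACTION_NONE:             return "NONE";
            case Connection.TRANSACTION_READ_UNCOMMITTED: return "READ_UNCOMMITTED";
            case Connection.TRANSACTION_READ_COMMITTED:   return "READ_COMMITTED";
            case Connection.TRANSACTION_REPEATABLE_READ:  return "REPEATABLE_READ";
            case Connection.TRANSACTION_SERIALIZABLE:     return "SERIALIZABLE";
            default:                                      return "UNKNOWN(" + level + ")";
        }
    }

    public static void main(String[] args) {
        // No database is needed to show which level is being requested.
        System.out.println(isolationName(Connection.TRANSACTION_SERIALIZABLE));
    }
}
```

Since each mapper opens its own connection, a setup like this would give each mapper its own transaction rather than one transaction shared across the job.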

On 09/11/2012 09:25 AM, Yaron Gonen wrote:
Thanks for the fast response.
Nick, regarding locking a table: as far as I understood from the code, each mapper opens its own connection to the DB. I didn't see any code such that the job creates a transaction and passes it to the mapper. Did I miss something?
Again, thanks!


On Tue, Sep 11, 2012 at 4:00 PM, Nick Jones wrote:

    Hi Yaron

    Replies inline below.


    On 09/11/2012 07:41 AM, Yaron Gonen wrote:

        Hi,
        After reviewing the class's (not very complicated) code, I
        have some questions I hope someone can answer:

          * (more general question) Are there many use-cases for using
            DBInputFormat? Do most Hadoop jobs take their input from
            files or DBs?

    Bejoy's right: most jobs use data spread across HDFS or some other
    distributed architecture to feed M/R at a sufficient rate.
    DBInputFormat could be helpful for pulling pointers to other
    sources of data (e.g. file paths for filers where the actual binary
    content is stored).


          * What happens when the database is updated during the mappers'
            data retrieval phase? Is there a way to lock the database
            before the data retrieval phase and release it afterwards?

    The whole job creates a transaction against the RDBMS that ensures
    a consistent state throughout the job.  Depending on the source and
    its settings, this might lock the entire table or only the rows
    selected by the query.


          * Since all mappers open a connection to the same DBS, one
            cannot use hundreds of mappers. Is there a solution to this
            problem?

    That depends on the connection limits and the number of rows
    requested. In my experience, the server ran into other problems
    before it hit any connection-count limit.


        Thanks,
        Yaron




