Hi Zoltan,
thank you for your patch. Would you mind posting it to the Sqoop review board (https://reviews.apache.org)?
Jarcec

On Sun, Sep 16, 2012 at 04:19:53PM +0000, Zoltán Tóth-Czifra wrote:
> FYI, I found a simple way to implement this and created an issue for Sqoop
> with a patch.
>
> Let's see if it gets accepted.
>
> https://issues.apache.org/jira/browse/SQOOP-604
> ________________________________________
> From: Zoltán Tóth-Czifra [[email protected]]
> Sent: Friday, September 14, 2012 12:35 PM
> To: Jarek Jarcec Cecho; [email protected]
> Subject: RE: Throttling inserts to avoid replication lags
>
> Hi Jarcec,
>
> Thank you very much for your answer! I really appreciate that you are
> thinking with me.
> Regarding the number of mappers to export, yes, we can keep it low, but as
> you said, Sqoop will try its best for the highest throughput so even one
> mapper can cause replication lag.
>
> Your idea of the non-replicated tables could work, but I'm almost sure we'll
> need to discard it, because it's impossible to maintain with a few hundred
> machines, all constantly changing, adding new servers, creating new exports,
> etc...
>
> The solutions we had in mind so far:
>
> MySQL Proxy
> http://dev.mysql.com/downloads/mysql-proxy/
> It is an unofficial project for MySQL, and it seems to be stopped somehow. It
> doesn't seem to support throttling out of the box, but in theory with using
> Lua scripts one can write a system to limit the number of queries. This,
> however, is not a guarantee to limit data throughput (imagine one huge insert
> with thousands of lines...) and doesn't seem to be ready for production.
>
> Message Queues
> We had in mind a solution where we completely discard Sqoop and write our own
> solution which somehow puts exported lines from Hive to a message queue and
> there we can already process it the way we want. I see this as a very complex
> and costly solution.
>
> Contributing to Sqoop
> This is what I see now as the best option - creating our own branch of Sqoop
> and adding the throttling feature.
>
> If anyone has something else in mind, it's really appreciated.
>
> Thanks!
> ________________________________________
> From: Jarek Jarcec Cecho [[email protected]]
> Sent: Thursday, September 13, 2012 12:19 PM
> To: [email protected]
> Subject: Re: Throttling inserts to avoid replication lags
>
> Hi Zoltan,
> Sqoop is trying for the best throughput to move data from source to
> destination, so your issue might be tricky to solve. I was thinking about it
> and I do have a couple of ideas:
>
> 1) Did you try to limit the number of concurrent connections using the "-m"
> parameter?
>
> 2) I can imagine that huge parallelism in Sqoop can give MySQL's
> single-threaded replication a hard time. Thinking out of the box, what about
> creating a table that won't be replicated (MySQL can limit replication on
> both database and table level) on all your nodes and performing your load to
> all of them (it doesn't matter whether sequentially or in parallel). Once
> every node has the data, you can atomically switch the table on all nodes at
> once. I'm not sure whether it's feasible nor whether it will actually work.
> I'm just trying to help.
>
> Jarcec
>
> On Thu, Sep 13, 2012 at 08:41:13AM +0000, Zoltán Tóth-Czifra wrote:
> > Hi,
> >
> > Thank you for your answers!
> >
> > I have been reading about Sqoop2, but since it's still under development it
> > doesn't really serve me. Besides, my problem is not limiting connections,
> > but somehow limiting the throughput of even one connection.
> >
> > This problem might not be Sqoop-specific, but I wondered if anyone has
> > faced this and solved it somehow.
> >
> > Thank you!
> > ________________________________________
> > From: Kathleen Ting [[email protected]]
> > Sent: Thursday, September 13, 2012 1:27 AM
> > To: [email protected]
> > Subject: Re: Throttling inserts to avoid replication lags
> >
> > Chuck, Zoltán,
> >
> > In Sqoop 2, it has been discussed that connections will allow the
> > specification of a resource policy in that resources will be managed
> > by limiting the total number of physical Connections open at one time
> > and with an option to disable Connections.
> >
> > More info:
> > https://blogs.apache.org/sqoop/entry/apache_sqoop_highlights_of_sqoop
> >
> > Regards, Kathleen
> >
> > On Wed, Sep 12, 2012 at 8:08 AM, Connell, Chuck
> > <[email protected]> wrote:
> > > In my opinion, this is not a Sqoop problem. It is related to the RDBMS and
> > > the way it handles high-volume updates. Those updates might be coming from
> > > Sqoop, or they might be coming from a realtime stock market price feed.
> > >
> > > I would go ahead and test the system as is. Let Sqoop do all its updates.
> > > If you actually have a problem with inconsistencies or poor performance,
> > > then I would deal with it as a purely MySQL issue.
> > >
> > > (A low-tech approach… run the sqoop jobs at night??)
> > >
> > > Chuck
> > >
> > > From: Zoltán Tóth-Czifra [mailto:[email protected]]
> > > Sent: Wednesday, September 12, 2012 10:48 AM
> > > To: [email protected]
> > > Subject: Throttling inserts to avoid replication lags
> > >
> > > Hi guys,
> > >
> > > We are using Sqoop (cdh3u3) to export Hive tables to relational databases.
> > > Usually these databases are only used by business intelligence to further
> > > analyze and filter the data. However, in certain cases we need to export
> > > to relational databases that are heavily accessed by our products and
> > > users.
> > >
> > > Our concern is that Sqoop exports would interfere with this random access
> > > of our users. Temporal inconsistency of the data can be solved with a
> > > staging table and an atomic swap; however, we are concerned about the
> > > replication lag between the master and the slaves.
> > >
> > > If we write large data quickly with Sqoop to the master (even to a staging
> > > table), that takes time to be replicated to the slaves (minutes) and
> > > causes an inconsistency we can't allow, that is, other writes from our
> > > users will be queued up. I wonder if any of you had similar problems. We
> > > are talking about a MySQL cluster by the way.
> > >
> > > As far as I know, Sqoop doesn't have any built-in throttle functionality
> > > (for example a delay between inserts). We have been thinking of solving
> > > this with a proxy, but the existing solutions on the market are very
> > > incomplete.
> > >
> > > Any other ideas? The more transparent, the better.
> > >
> > > Thanks!
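
For illustration only, the "delay between inserts" idea from the original message could be sketched roughly as below. This is a generic Python sketch, not Sqoop code and not the approach taken in the SQOOP-604 patch. The function name `throttled_batches` and its parameters (`batch_size`, `max_rows_per_sec`, `clock`, `sleep`) are hypothetical. It caps the average number of rows per second rather than the number of statements, since a single large insert can carry thousands of rows.

```python
import time
from typing import Callable, Iterable, Iterator, List, Sequence


def throttled_batches(
    rows: Iterable[Sequence],
    batch_size: int,
    max_rows_per_sec: float,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[List[Sequence]]:
    """Yield batches of rows, pausing so the average rate stays under the cap.

    Hypothetical helper: the caller is expected to run one multi-row INSERT
    per yielded batch.
    """
    if batch_size <= 0 or max_rows_per_sec <= 0:
        raise ValueError("batch_size and max_rows_per_sec must be positive")

    start = clock()
    sent = 0  # rows already handed to the caller
    batch: List[Sequence] = []

    def pace() -> None:
        # Earliest moment at which `sent` rows may have been written
        # without exceeding max_rows_per_sec on average.
        earliest = start + sent / max_rows_per_sec
        delay = earliest - clock()
        if delay > 0:
            sleep(delay)

    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            pace()
            yield batch
            sent += len(batch)
            batch = []

    if batch:  # final, possibly smaller batch
        pace()
        yield batch
```

A caller would loop over `throttled_batches(rows, 1000, 5000.0)` and issue one insert per batch. Note that this only limits the row rate on the writer side; it does not measure replication lag on the slaves, so the cap would still have to be tuned by observation.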
