Hi Zoltan,
Thank you for your patch. Would you mind posting it to the Sqoop review board 
(https://reviews.apache.org)?

Jarcec

On Sun, Sep 16, 2012 at 04:19:53PM +0000, Zoltán Tóth-Czifra wrote:
> FYI, I found a simple way to implement this and created a Sqoop issue 
> with a patch.
> 
> Let's see if it gets accepted.
> 
> https://issues.apache.org/jira/browse/SQOOP-604
> ________________________________________
> From: Zoltán Tóth-Czifra [[email protected]]
> Sent: Friday, September 14, 2012 12:35 PM
> To: Jarek Jarcec Cecho; [email protected]
> Subject: RE: Throttling inserts to avoid replication lags
> 
> Hi Jarcec,
> 
> Thank you very much for your answer! I really appreciate that you are 
> thinking with me.
> Regarding the number of mappers to export, yes, we can keep it low, but as 
> you said, Sqoop will try its best for the highest throughput so even one 
> mapper can cause replication lag.
> 
> Your idea of the non-replicated tables could work, but I'm almost sure we'll 
> need to discard it, because it's impossible to maintain with a few hundred 
> machines, all constantly changing, adding new servers, creating new exports, 
> etc...
> 
> The solutions we had in mind so far:
> 
> MySQL Proxy
> http://dev.mysql.com/downloads/mysql-proxy/
> It is an unofficial project for MySQL, and it seems to have been stopped 
> somehow. It doesn't seem to support throttling out of the box, but in theory 
> one can use Lua scripts to write a system that limits the number of queries. 
> This, however, does not guarantee a limit on data throughput (imagine one 
> huge insert with thousands of lines...) and it doesn't seem to be ready for 
> production.
> 
> Message Queues
> We had in mind a solution where we completely discard Sqoop and write our own 
> solution which somehow puts exported lines from Hive to a message queue and 
> there we can process it the way we want. I see this as a very complex and 
> costly solution.
> 
> Contributing to Sqoop
> This is what I see now as the best option - creating our own branch of Sqoop 
> and adding the throttling feature.
> 
> If anyone has something else in mind, it's really appreciated.
> 
> Thanks!
> ________________________________________
> From: Jarek Jarcec Cecho [[email protected]]
> Sent: Thursday, September 13, 2012 12:19 PM
> To: [email protected]
> Subject: Re: Throttling inserts to avoid replication lags
> 
> Hi Zoltan,
> Sqoop tries to achieve the best possible throughput when moving data from 
> source to destination, so your issue might be tricky to solve. I was 
> thinking about it and I do have a couple of ideas:
> 
> 1) Did you try to limit the number of concurrent connections using the "-m" 
> parameter?
> 
> 2) I can imagine that huge parallelism in Sqoop can give MySQL's 
> single-threaded replication a hard time. Thinking outside the box, what about 
> creating a table that won't be replicated (MySQL can limit replication on 
> both the database and the table level) on all your nodes and performing your 
> load to all of them (it doesn't matter whether sequentially or in parallel)? 
> Once every node has the data, you can atomically switch the table on all 
> nodes at once. I'm not sure whether it's feasible or whether it will actually 
> work. I'm just trying to help.
> 
> Jarcec
> 
> On Thu, Sep 13, 2012 at 08:41:13AM +0000, Zoltán Tóth-Czifra wrote:
> > Hi,
> >
> > Thank you for your answers!
> >
> > I have been reading about Sqoop2, but since it's still under development it 
> > doesn't really serve me. Besides, my problem is not limiting connections, 
> > but somehow limiting the throughput of even one connection.
> >
> > This problem might not be Sqoop-specific, but I wondered if anyone has 
> > faced this and solved it somehow.
> >
> > Thank you!
> > ________________________________________
> > From: Kathleen Ting [[email protected]]
> > Sent: Thursday, September 13, 2012 1:27 AM
> > To: [email protected]
> > Subject: Re: Throttling inserts to avoid replication lags
> >
> > Chuck, Zoltán,
> >
> > In Sqoop 2, it has been discussed that connections will allow the
> > specification of a resource policy: resources will be managed by
> > limiting the total number of physical Connections open at one time,
> > with an option to disable Connections.
> >
> > More info: 
> > https://blogs.apache.org/sqoop/entry/apache_sqoop_highlights_of_sqoop
> >
> > Regards, Kathleen
> >
> > On Wed, Sep 12, 2012 at 8:08 AM, Connell, Chuck
> > <[email protected]> wrote:
> > > In my opinion, this is not a Sqoop problem. It is related to the RDBMS and
> > > the way it handles high-volume updates. Those updates might be coming from
> > > Sqoop, or they might be coming from a realtime stock market price feed.
> > >
> > >
> > >
> > > I would go ahead and test the system as is. Let Sqoop do all its updates.
> > > If you actually have a problem with inconsistencies or poor performance,
> > > then I would deal with it as a purely MySQL issue.
> > >
> > >
> > >
> > > (A low-tech approach… run the sqoop jobs at night??)
> > >
> > >
> > >
> > > Chuck
> > >
> > >
> > >
> > >
> > >
> > > From: Zoltán Tóth-Czifra [mailto:[email protected]]
> > > Sent: Wednesday, September 12, 2012 10:48 AM
> > > To: [email protected]
> > > Subject: Throttling inserts to avoid replication lags
> > >
> > >
> > >
> > > Hi guys,
> > >
> > >
> > >
> > > We are using Sqoop (cdh3u3) to export Hive tables to relational databases.
> > > Usually these databases are only used by business intelligence to further
> > > analyze and filter the data. However, in certain cases we need to export
> > > to relational databases that are heavily accessed by our products and
> > > users.
> > >
> > >
> > >
> > > Our concern is that Sqoop exports would interfere with this random access
> > > of our users. Temporal inconsistency of the data can be solved with a
> > > staging table and an atomic swap; however, we are concerned about the
> > > replication lag between the master and the slaves.
> > >
> > >
> > >
> > > If we write a large amount of data quickly with Sqoop to the master (even
> > > to a staging table), it takes time (minutes) to be replicated to the
> > > slaves and causes an inconsistency we can't allow, that is, other writes
> > > from our users will be queued up. I wonder if any of you have had similar
> > > problems. We are talking about a MySQL cluster, by the way.
> > >
> > >
> > >
> > > As far as I know, Sqoop doesn't have any built-in throttle functionality
> > > (for example a delay between inserts). We have been thinking of solving
> > > this with a proxy, but the existing solutions on the market are very
> > > incomplete.
> > >
> > >
> > >
> > > Any other ideas? The more transparent, the better.
> > >
> > >
> > >
> > > Thanks!
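
The "delay between inserts" idea raised in this thread can be illustrated 
with a minimal sketch. This is not the SQOOP-604 patch and does not use any 
Sqoop API; the function names (throttle_delay, throttled_export), the 
write_batch callback and the max_rows_per_second parameter are all 
hypothetical, chosen only for illustration. The sketch assumes rows are 
written in batches and pauses after each batch so that the average rate 
stays at or below a configured cap.

```python
import time


def throttle_delay(rows_sent, elapsed_seconds, max_rows_per_second):
    """Seconds to pause so the average rate stays at or below the cap."""
    if max_rows_per_second <= 0:
        raise ValueError("max_rows_per_second must be positive")
    # Minimum time that sending rows_sent rows should have taken at the cap.
    min_elapsed = rows_sent / max_rows_per_second
    return max(0.0, min_elapsed - elapsed_seconds)


def throttled_export(batches, write_batch, max_rows_per_second,
                     clock=time.monotonic, sleep=time.sleep):
    """Write each batch, pausing between batches to respect the rate cap.

    write_batch is a hypothetical callback that performs the actual insert;
    clock and sleep are injectable so the logic can be tested without waiting.
    """
    start = clock()
    rows_sent = 0
    for batch in batches:
        write_batch(batch)
        rows_sent += len(batch)
        delay = throttle_delay(rows_sent, clock() - start, max_rows_per_second)
        if delay > 0:
            sleep(delay)
    return rows_sent
```

Because the cap applies to rows rather than statements, one huge 
multi-row insert still counts in full against the budget, which is the 
weakness of query-count limiting mentioned above for the proxy approach. 
The pause only happens after a batch has been written, so smaller batches 
give smoother throughput.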
