Re: Efficient use of buffered writes in a post-HTablePool world?

Nick Dimiduk Fri, 19 Dec 2014 11:17:28 -0800

Thanks for the reminder about the Multiplexer, Andrew. It sort of solves
this problem, but I think its semantics of dropping writes are not desirable
in the general case. Further, my understanding was that the new connection
implementation is designed to handle this kind of use case (hence cc'ing
Lars).
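
For illustration only, the kind of long-lived, shared buffering discussed in
this thread might be sketched as below. This is not HBase code and not the
connection implementation mentioned above: SharedWriteBuffer, maxBuffered and
flusher are hypothetical names, the threshold counts items rather than bytes,
and the flusher callback stands in for whatever batched send the client would
perform. Nothing is dropped; the thread that fills the buffer performs the
flush itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/**
 * Hypothetical illustration only; not part of the HBase client API.
 * One long-lived, thread-safe buffer shared by many short-lived callers.
 */
final class SharedWriteBuffer<T> {
    private final int maxBuffered;            // flush threshold, in items (assumption: not bytes)
    private final Consumer<List<T>> flusher;  // stand-in for the batched send; must be thread-safe
    private List<T> pending = new ArrayList<>();

    SharedWriteBuffer(int maxBuffered, Consumer<List<T>> flusher) {
        if (maxBuffered < 1) {
            throw new IllegalArgumentException("maxBuffered must be >= 1");
        }
        this.maxBuffered = maxBuffered;
        this.flusher = flusher;
    }

    /** Buffers one write; sends a batch once the threshold is reached. */
    void add(T write) {
        List<T> batch = null;
        synchronized (this) {
            pending.add(write);
            if (pending.size() >= maxBuffered) {
                batch = pending;              // hand the full list to this caller
                pending = new ArrayList<>();  // later writers start a fresh list
            }
        }
        if (batch != null) {
            flusher.accept(batch); // outside the lock so other writers are not blocked
        }
    }

    /** Sends whatever is buffered; call at shutdown so nothing is left behind. */
    void flush() {
        List<T> batch;
        synchronized (this) {
            if (pending.isEmpty()) {
                return;
            }
            batch = pending;
            pending = new ArrayList<>();
        }
        flusher.accept(batch);
    }
}
```

With one such buffer per table held for the lifetime of the application,
callers could stay short-lived while batches still grow to a useful size. The
usual trade-off remains: writes accepted but not yet flushed are lost if the
client dies.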


On Fri, Dec 19, 2014 at 11:02 AM, Andrew Purtell <[email protected]>
wrote:
>
> Aaron: Please post a copy of that feedback on the JIRA, pretty sure we will
> be having an improvement discussion there.
>
> On Fri, Dec 19, 2014 at 10:58 AM, Aaron Beppu <[email protected]>
> wrote:
> >
> > Nick : Thanks, I've created an issue [1].
> >
> > Pradeep : Yes, I have considered using that. However for the moment, we've
> > set it out of scope, since our migration from 0.94 -> 0.98 is already a bit
> > complicated, and we hoped to isolate these changes by not moving
> > to the async client until after the current migration is complete.
> >
> > Andrew : HTableMultiplexer does seem like it would solve our buffered write
> > problem, albeit in an awkward way -- thanks! It kind of seems like HTable
> > should then (if autoFlush == false) send writes to the multiplexer, rather
> > than storing them in its own short-lived writeBuffer. If nothing else, it's
> > still super confusing that HTableInterface exposes setAutoFlush() and
> > setWriteBufferSize(), given that the writeBuffer won't meaningfully buffer
> > anything if all tables are short-lived.
> >
> > [1] https://issues.apache.org/jira/browse/HBASE-12728
> >
> > On Fri, Dec 19, 2014 at 10:31 AM, Andrew Purtell <[email protected]>
> > wrote:
> > >
> > > I believe HTableMultiplexer[1] is meant to stand in for HTablePool for
> > > buffered writing. FWIW, I've not used it.
> > >
> > > 1: https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTableMultiplexer.html
> > >
> > >
> > > On Fri, Dec 19, 2014 at 9:00 AM, Nick Dimiduk <[email protected]> wrote:
> > > >
> > > > Hi Aaron,
> > > >
> > > > Your analysis is spot on and I do not believe this is by design. I see
> > > > the write buffer is owned by the table, while I would have expected there
> > > > to be a buffer per table, all managed by the connection. I suggest you
> > > > raise a blocker ticket vs the 1.0.0 release that's just around the corner
> > > > to give this the attention it needs. Let me know if you're not into JIRA;
> > > > I can raise one on your behalf.
> > > >
> > > > cc Lars, Enis.
> > > >
> > > > Nice work Aaron.
> > > > -n
> > > >
> > > > On Wed, Dec 17, 2014 at 6:44 PM, Aaron Beppu <[email protected]> wrote:
> > > > >
> > > > > Hi All,
> > > > >
> > > > > TLDR; in the absence of HTablePool, if HTable instances are short-lived,
> > > > > how should clients use buffered writes?
> > > > >
> > > > > I’m working on migrating a codebase from using 0.94.6 (CDH4.4) to 0.98.6
> > > > > (CDH5.2). One issue I’m confused by is how to effectively use buffered
> > > > > writes now that HTablePool has been deprecated[1].
> > > > >
> > > > > In our 0.94 code, a pathway could get a table from the pool, configure it
> > > > > with table.setAutoFlush(false); and write Puts to it. Those writes would
> > > > > then go to the table instance’s writeBuffer, and would only be flushed
> > > > > when the buffer was full, or when we were ready to close out the pool.
> > > > > We were intentionally choosing to have fewer, larger writes from the
> > > > > client to the cluster, and we knew we were giving up a degree of safety
> > > > > in exchange (i.e. if the client dies after it’s accepted a write but
> > > > > before the flush for that write occurs, the data is lost). This seems to
> > > > > be generally considered a reasonable choice (cf. the HBase Book [2],
> > > > > section 14.8.4).
> > > > >
> > > > > However in the 0.98 world, without HTablePool, the endorsed pattern [3]
> > > > > seems to be to create a new HTable via table =
> > > > > stashedHConnection.getTable(tableName, myExecutorService). However, even
> > > > > if we do table.setAutoFlush(false), because that table instance is
> > > > > short-lived, its buffer never gets full. We’ll create a table instance,
> > > > > write a put to it, try to close the table, and the close call will
> > > > > trigger a (synchronous) flush. Thus, not having HTablePool seems like it
> > > > > would cause us to have many more small writes from the client to the
> > > > > cluster, and basically wipe out the advantage of turning off autoflush.
> > > > >
> > > > > More concretely :
> > > > >
> > > > > // Given these two helpers ...
> > > > >
> > > > > private HTableInterface getAutoFlushTable(String tableName) throws
> > > > > IOException {
> > > > >   // (autoflush is true by default)
> > > > >   return storedConnection.getTable(tableName, executorService);
> > > > > }
> > > > >
> > > > > private HTableInterface getBufferedTable(String tableName) throws
> > > > > IOException {
> > > > >   HTableInterface table = getAutoFlushTable(tableName);
> > > > >   table.setAutoFlush(false);
> > > > >   return table;
> > > > > }
> > > > >
> > > > > // It's my contention that these two methods would behave almost
> > > > > // identically, except the first will hit a synchronous flush during
> > > > > // the put call, and the second will flush during the (hidden) close
> > > > > // call on table.
> > > > >
> > > > > private void writeAutoFlushed(Put somePut) throws IOException {
> > > > >   try (HTableInterface table = getAutoFlushTable(tableName)) {
> > > > >     table.put(somePut); // will do synchronous flush
> > > > >   }
> > > > > }
> > > > >
> > > > > private void writeBuffered(Put somePut) throws IOException {
> > > > >   try (HTableInterface table = getBufferedTable(tableName)) {
> > > > >     table.put(somePut);
> > > > >   } // auto-close will trigger synchronous flush
> > > > > }
> > > > >
> > > > > It seems like the only way to avoid this is to have long-lived HTable
> > > > > instances, which get reused for multiple writes. However, since the
> > > > > actual writes are driven from highly concurrent code, and since HTable is
> > > > > not threadsafe, this would involve having a number of HTable instances,
> > > > > and a control mechanism for leasing them out to individual threads safely.
> > > > > Except at this point it seems like we will have recreated HTablePool,
> > > > > which suggests that we’re doing something deeply wrong.
> > > > >
> > > > > What am I missing here? Since the HTableInterface.setAutoFlush method
> > > > > still exists, it must be anticipated that users will still want to buffer
> > > > > writes. What’s the recommended way to actually buffer a meaningful number
> > > > > of writes, from a multithreaded context, that doesn’t just amount to
> > > > > creating a table pool?
> > > > >
> > > > > Thanks in advance,
> > > > > Aaron
> > > > >
> > > > > [1] https://issues.apache.org/jira/browse/HBASE-6580
> > > > > [2] http://hbase.apache.org/book/perf.writing.html
> > > > > [3] https://issues.apache.org/jira/browse/HBASE-6580?focusedCommentId=13501302&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13501302
> > > > > 
> > > > >
> > > >
> > >
> > >
> > > --
> > > Best regards,
> > >
> > >    - Andy
> > >
> > > Problems worthy of attack prove their worth by hitting back. - Piet Hein
> > > (via Tom White)
> > >
> >
>
>
> --
> Best regards,
>
>    - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)
>
