On Tue, Apr 10, 2012 at 9:19 AM, Tom Brown <[email protected]> wrote:
> Jacques,
>
> The technique I've been trying to use is similar to a bloom filter
> (except that it's more space efficient).

Got it. I didn't realize.

> It's my understanding that bloom filters in HBase are only implemented
> in the context of finding individual columns (for improving read
> performance). Are there specific bloom operations I can use atomically
> on a specific cell?

Your understanding is correct. My statement was about using the data
structure as a compressed version of a duplication filter, not any HBase
feature.

> Thanks!
>
> --Tom
>
> On Tue, Apr 10, 2012 at 12:01 AM, Jacques <[email protected]> wrote:
> > What about maintaining a bloom filter in addition to an increment to
> > minimize double counting? You couldn't do it atomically without some
> > custom work, but it would get you mostly there. If you wanted to be
> > fancy you could actually maintain the bloom filter as a bunch of
> > separate columns to avoid update contention.
> >
> > On Apr 9, 2012 10:14 PM, "Tom Brown" <[email protected]> wrote:
> >
> >> Andy,
> >>
> >> I am a big fan of the Increment class. Unfortunately, I'm not doing
> >> simple increments for the viewer count. I will be receiving duplicate
> >> messages from a particular client for a specific cube cell, and don't
> >> want them to be counted twice (my stats don't have to be 100%
> >> accurate, but the expected rate of duplicates will be higher than the
> >> allowable error rate).
> >>
> >> I created an RPC endpoint coprocessor to perform this function, but
> >> performance suffered heavily under load (it appears that the endpoint
> >> performs all functions serially).
> >>
> >> When I tried implementing it as a region observer, I was unsure how
> >> to correctly replace the provided "put" with my own. When I issued a
> >> put from within "prePut", the server blocked the new put (waiting for
> >> the "prePut" to finish). Should I be attempting to modify the WALEdit
> >> object?
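Jacques's bloom-filter-as-duplication-filter idea above can be sketched outside HBase. This is a minimal illustration of the data structure only; the class and method names are invented for the example (not an HBase API), and sizing the bit array and hash count against a real error budget is left out:

```python
import hashlib


class BloomDedup:
    """Bloom filter used as a duplication filter: seen_then_add() may
    report a false positive (claiming a duplicate that wasn't one), but
    never a false negative, so a count guarded by it can only slightly
    undercount -- acceptable for approximate viewer stats."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)  # fixed-size bit array

    def _positions(self, key):
        # Derive num_hashes positions by salting one strong hash.
        for i in range(self.num_hashes):
            h = hashlib.sha1(b"%d:%s" % (i, key.encode())).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def seen_then_add(self, key):
        """Return True if key was (probably) seen before; record it."""
        seen = True
        for pos in self._positions(key):
            byte, bit = divmod(pos, 8)
            if not (self.bits[byte] >> bit) & 1:
                seen = False
            self.bits[byte] |= 1 << bit
        return seen


# Count unique viewers while suppressing duplicate messages:
dedup = BloomDedup()
count = 0
for viewer in ["u1", "u2", "u1", "u3", "u2"]:
    if not dedup.seen_then_add(viewer):
        count += 1
# count == 3 here (u1, u2, u3); the duplicates were filtered out
```

Storing the bit array as several separate columns, as suggested, would amount to splitting `self.bits` into chunks so concurrent writers rarely touch the same chunk.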
> >>
> >> Is there a way to extend the functionality of "Increment" to provide
> >> arbitrary bitwise operations on the contents of a field?
> >>
> >> Thanks again!
> >>
> >> --Tom
> >>
> >> > If it helps, yes this is possible:
> >> >
> >> >> Can I observe updates to a
> >> >> particular table and replace the provided data with my own? (The
> >> >> client calls "put" with the actual user ID, my co-processor replaces
> >> >> it with a computed value, so the actual user ID never gets stored in
> >> >> HBase).
> >> >
> >> > Since your option #2 requires atomic updates to the data structure,
> >> > have you considered native atomic increments? See
> >> >
> >> > http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html#incrementColumnValue%28byte[],%20byte[],%20byte[],%20long,%20boolean%29
> >> >
> >> > or
> >> >
> >> > http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Increment.html
> >> >
> >> > The former is a round trip for each value update. The latter allows
> >> > you to pack multiple updates into a single round trip. This would
> >> > give you accurate counts even with concurrent writers.
> >> >
> >> > It should be possible for you to do partial aggregation on the
> >> > client side too, whenever parallel requests colocate multiple
> >> > updates to the same cube within some small window of time.
> >> >
> >> > Best regards,
> >> >
> >> >    - Andy
> >> >
> >> > Problems worthy of attack prove their worth by hitting back. - Piet
> >> > Hein (via Tom White)
> >> >
> >> > ----- Original Message -----
> >> >> From: Tom Brown <[email protected]>
> >> >> To: [email protected]
> >> >> Cc:
> >> >> Sent: Monday, April 9, 2012 9:48 AM
> >> >> Subject: Add client complexity or use a coprocessor?
> >> >>
> >> >> To whom it may concern,
> >> >>
> >> >> Ignoring the complexities of gathering the data, assume that I
> >> >> will be tracking millions of unique viewers.
> >> >> Updates from each of our millions of clients are gathered in a
> >> >> centralized platform and spread among a group of machines for
> >> >> processing and inserting into HBase (assume that this group can
> >> >> be scaled horizontally). The data is stored in an OLAP cube
> >> >> format, and one of the metrics I'm tracking across various
> >> >> attributes is viewership (how many people from Y are watching X).
> >> >>
> >> >> I'm writing to ask for your thoughts on the most appropriate way
> >> >> to structure my data so I can count unique TV viewers (assume a
> >> >> service like Netflix or Hulu).
> >> >>
> >> >> Here are the solutions I'm considering:
> >> >>
> >> >> 1. Store each unique user ID as the cell name within the cube(s)
> >> >> where it occurs. This has the advantage of 100% accuracy, but the
> >> >> downside is the enormous space required to store each unique
> >> >> cell. Consuming this data is also problematic, as the only way to
> >> >> provide a viewership count is by counting each cell. To save the
> >> >> overhead of sending each cell over the network, counting could be
> >> >> done by a coprocessor on the region server, but that still
> >> >> doesn't avoid the overhead of reading each cell from disk. I'm
> >> >> also not sure what happens if a single row is larger than an
> >> >> entire region (48 bytes per user ID * 10,000,000 users = 480MB).
> >> >>
> >> >> 2. Store a byte array that allows estimating unique viewers (with
> >> >> a small margin of error*). Add a co-processor for updating this
> >> >> column so I can guarantee that updates to a specific OLAP cell
> >> >> will be atomic. The main benefit of this path is that the nodes
> >> >> that update HBase can be less complex. Another benefit I see is
> >> >> that I can just add more HBase regions as scale requires.
> >> >> However, I'm not sure if I can use a coprocessor the way I want:
> >> >> can I observe updates to a particular table and replace the
> >> >> provided data with my own? (The client calls "put" with the
> >> >> actual user ID, my co-processor replaces it with a computed
> >> >> value, so the actual user ID never gets stored in HBase.)
> >> >>
> >> >> 3. Store a byte array that allows estimating unique viewers (with
> >> >> a small margin of error*). Re-arrange my architecture so that
> >> >> each OLAP cell is only updated by a single node. The main benefit
> >> >> of this would be that I don't need to worry about atomic
> >> >> operations in HBase, since all updates for a single cell will be
> >> >> atomic and serial. The biggest downside is that I believe it will
> >> >> add significant complexity to my overall architecture.
> >> >>
> >> >> Thanks for your time, and I look forward to hearing your thoughts.
> >> >>
> >> >> Sincerely,
> >> >> Tom Brown
> >> >>
> >> >> *(For information about the byte array mentioned in #2 and #3, see:
> >> >> http://highscalability.com/blog/2012/4/5/big-data-counting-how-to-count-a-billion-distinct-objects-us.html
> >> >> )
> >> >
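The "byte array that allows estimating unique viewers" in options #2 and #3 refers to a cardinality sketch such as HyperLogLog (the subject of the linked article). Below is a minimal sketch of the idea, with parameters chosen for brevity rather than a real error budget; nothing here is an HBase API. The property that matters for this thread: an update is a per-register maximum, so duplicate messages change nothing, and two sketches merge by byte-wise max, which is why a single fixed-size cell per OLAP cube entry could hold one.

```python
import hashlib
import math


class HyperLogLog:
    """Tiny HyperLogLog sketch: m = 2^p one-byte registers, each holding
    the largest 'rank' (leading-zero count + 1) seen among hashed items
    routed to it. Adding a duplicate never changes any register, and two
    sketches combine by element-wise max of their registers."""

    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p
        self.registers = bytearray(self.m)  # fixed size: m bytes total

    def add(self, item):
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)   # remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank          # per-register max, idempotent

    def merge(self, other):
        """Union of two sketches: element-wise max of the registers."""
        for i, r in enumerate(other.registers):
            if r > self.registers[i]:
                self.registers[i] = r

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:       # small-range correction
            return self.m * math.log(self.m / zeros)
        return raw


# 50 distinct viewers, with every message delivered twice:
hll = HyperLogLog()
for i in range(50):
    hll.add("viewer-%d" % i)
    hll.add("viewer-%d" % i)  # duplicate message: registers unchanged
print(round(hll.estimate()))  # close to 50
```

Because merging is a byte-wise max, option #3's "single writer per cell" constraint could be relaxed: several writers could each keep a local sketch and a combiner could merge them, at the cost of the custom atomic update Tom is asking about in option #2.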
