On Tue, Apr 10, 2012 at 9:19 AM, Tom Brown <[email protected]> wrote:
> Jacques,
>
> The technique I've been trying to use is similar to a bloom filter
> (except that it's more space efficient).

Got it. I didn't realize.

> It's my understanding that bloom filters in HBase are only implemented
> in the context of finding individual columns (for improving read
> performance). Are there specific bloom operations I can use atomically
> on a specific cell?

Your understanding is correct. My statement was about using the data
structure as a compressed version of a duplication filter, not any HBase
feature.

> Thanks!
>
> --Tom
>
> On Tue, Apr 10, 2012 at 12:01 AM, Jacques <[email protected]> wrote:
> > What about maintaining a bloom filter in addition to an increment to
> > minimize double counting? You couldn't do it atomically without some
> > custom work, but it would get you mostly there. If you wanted to be
> > fancy you could actually maintain the bloom filter as a bunch of
> > separate columns to avoid update contention.
> >
> > On Apr 9, 2012 10:14 PM, "Tom Brown" <[email protected]> wrote:
> >
> >> Andy,
> >>
> >> I am a big fan of the Increment class. Unfortunately, I'm not doing
> >> simple increments for the viewer count. I will be receiving duplicate
> >> messages from a particular client for a specific cube cell, and don't
> >> want them to be counted twice (my stats don't have to be 100%
> >> accurate, but the expected rate of duplicates will be higher than the
> >> allowable error rate).
> >>
> >> I created an RPC endpoint coprocessor to perform this function, but
> >> performance suffered heavily under load (it appears that the endpoint
> >> performs all functions serially).
> >>
> >> When I tried implementing it as a region observer, I was unsure how
> >> to correctly replace the provided "put" with my own. When I issued a
> >> put from within "prePut", the server blocked the new put (waiting for
> >> the "prePut" to finish). Should I be attempting to modify the WALEdit
> >> object?
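Jacques's bloom-filter-as-duplication-filter idea above can be sketched outside HBase. This is a minimal illustration of the data structure only; the class and method names are invented for the example (not an HBase API), and sizing the bit array and hash count against a real error budget is left out:

```python
import hashlib


class BloomDedup:
    """Bloom filter used as a duplication filter: seen_then_add() may
    report a false positive (claiming a duplicate that wasn't one), but
    never a false negative, so a count guarded by it can only slightly
    undercount -- acceptable for approximate viewer stats."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)  # fixed-size bit array

    def _positions(self, key):
        # Derive num_hashes positions by salting one strong hash.
        for i in range(self.num_hashes):
            h = hashlib.sha1(b"%d:%s" % (i, key.encode())).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def seen_then_add(self, key):
        """Return True if key was (probably) seen before; record it."""
        seen = True
        for pos in self._positions(key):
            byte, bit = divmod(pos, 8)
            if not (self.bits[byte] >> bit) & 1:
                seen = False
            self.bits[byte] |= 1 << bit
        return seen


# Count unique viewers while suppressing duplicate messages:
dedup = BloomDedup()
count = 0
for viewer in ["u1", "u2", "u1", "u3", "u2"]:
    if not dedup.seen_then_add(viewer):
        count += 1
# count == 3 here (u1, u2, u3); the duplicates were filtered out
```

Storing the bit array as several separate columns, as suggested, would amount to splitting `self.bits` into chunks so concurrent writers rarely touch the same chunk.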
> >>
> >> Is there a way to extend the functionality of "Increment" to provide
> >> arbitrary bitwise operations on the contents of a field?
> >>
> >> Thanks again!
> >>
> >> --Tom
> >>
> >> > If it helps, yes this is possible:
> >> >
> >> >> Can I observe updates to a
> >> >> particular table and replace the provided data with my own? (The
> >> >> client calls "put" with the actual user ID, my co-processor replaces
> >> >> it with a computed value, so the actual user ID never gets stored in
> >> >> HBase).
> >> >
> >> > Since your option #2 requires atomic updates to the data structure,
> >> > have you considered native atomic increments? See
> >> >
> >> > http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html#incrementColumnValue%28byte[],%20byte[],%20byte[],%20long,%20boolean%29
> >> >
> >> > or
> >> >
> >> > http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Increment.html
> >> >
> >> > The former is a round trip for each value update. The latter allows
> >> > you to pack multiple updates into a single round trip. This would
> >> > give you accurate counts even with concurrent writers.
> >> >
> >> > It should be possible for you to do partial aggregation on the
> >> > client side too, whenever parallel requests colocate multiple
> >> > updates to the same cube within some small window of time.
> >> >
> >> > Best regards,
> >> >
> >> >    - Andy
> >> >
> >> > Problems worthy of attack prove their worth by hitting back. - Piet
> >> > Hein (via Tom White)
> >> >
> >> > ----- Original Message -----
> >> >> From: Tom Brown <[email protected]>
> >> >> To: [email protected]
> >> >> Cc:
> >> >> Sent: Monday, April 9, 2012 9:48 AM
> >> >> Subject: Add client complexity or use a coprocessor?
> >> >>
> >> >> To whom it may concern,
> >> >>
> >> >> Ignoring the complexities of gathering the data, assume that I
> >> >> will be tracking millions of unique viewers.
> >> >> Updates from each of our millions of clients are gathered in a
> >> >> centralized platform and spread among a group of machines for
> >> >> processing and inserting into HBase (assume that this group can
> >> >> be scaled horizontally). The data is stored in an OLAP cube
> >> >> format, and one of the metrics I'm tracking across various
> >> >> attributes is viewership (how many people from Y are watching X).
> >> >>
> >> >> I'm writing to ask for your thoughts on the most appropriate way
> >> >> to structure my data so I can count unique TV viewers (assume a
> >> >> service like Netflix or Hulu).
> >> >>
> >> >> Here are the solutions I'm considering:
> >> >>
> >> >> 1. Store each unique user ID as the cell name within the cube(s)
> >> >> where it occurs. This has the advantage of 100% accuracy, but the
> >> >> downside is the enormous space required to store each unique
> >> >> cell. Consuming this data is also problematic, as the only way to
> >> >> provide a viewership count is by counting each cell. To save the
> >> >> overhead of sending each cell over the network, counting could be
> >> >> done by a coprocessor on the region server, but that still
> >> >> doesn't avoid the overhead of reading each cell from disk. I'm
> >> >> also not sure what happens if a single row is larger than an
> >> >> entire region (48 bytes per user ID * 10,000,000 users = 480MB).
> >> >>
> >> >> 2. Store a byte array that allows estimating unique viewers (with
> >> >> a small margin of error*). Add a co-processor for updating this
> >> >> column so I can guarantee that updates to a specific OLAP cell
> >> >> will be atomic. The main benefit of this path is that the nodes
> >> >> that update HBase can be less complex. Another benefit I see is
> >> >> that I can just add more HBase regions as scale requires.
> >> >> However, I'm not sure if I can use a coprocessor the way I want:
> >> >> can I observe updates to a particular table and replace the
> >> >> provided data with my own? (The client calls "put" with the
> >> >> actual user ID, my co-processor replaces it with a computed
> >> >> value, so the actual user ID never gets stored in HBase.)
> >> >>
> >> >> 3. Store a byte array that allows estimating unique viewers (with
> >> >> a small margin of error*). Re-arrange my architecture so that
> >> >> each OLAP cell is only updated by a single node. The main benefit
> >> >> of this would be that I don't need to worry about atomic
> >> >> operations in HBase, since all updates for a single cell will be
> >> >> atomic and serial. The biggest downside is that I believe it will
> >> >> add significant complexity to my overall architecture.
> >> >>
> >> >> Thanks for your time, and I look forward to hearing your thoughts.
> >> >>
> >> >> Sincerely,
> >> >> Tom Brown
> >> >>
> >> >> *(For information about the byte array mentioned in #2 and #3, see:
> >> >> http://highscalability.com/blog/2012/4/5/big-data-counting-how-to-count-a-billion-distinct-objects-us.html
> >> >> )
> >> >
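The "byte array that allows estimating unique viewers" in options #2 and #3 refers to a cardinality sketch such as HyperLogLog (the subject of the linked article). Below is a minimal sketch of the idea, with parameters chosen for brevity rather than a real error budget; nothing here is an HBase API. The property that matters for this thread: an update is a per-register maximum, so duplicate messages change nothing, and two sketches merge by byte-wise max, which is why a single fixed-size cell per OLAP cube entry could hold one.

```python
import hashlib
import math


class HyperLogLog:
    """Tiny HyperLogLog sketch: m = 2^p one-byte registers, each holding
    the largest 'rank' (leading-zero count + 1) seen among hashed items
    routed to it. Adding a duplicate never changes any register, and two
    sketches combine by element-wise max of their registers."""

    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p
        self.registers = bytearray(self.m)  # fixed size: m bytes total

    def add(self, item):
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)   # remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank          # per-register max, idempotent

    def merge(self, other):
        """Union of two sketches: element-wise max of the registers."""
        for i, r in enumerate(other.registers):
            if r > self.registers[i]:
                self.registers[i] = r

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:       # small-range correction
            return self.m * math.log(self.m / zeros)
        return raw


# 50 distinct viewers, with every message delivered twice:
hll = HyperLogLog()
for i in range(50):
    hll.add("viewer-%d" % i)
    hll.add("viewer-%d" % i)  # duplicate message: registers unchanged
print(round(hll.estimate()))  # close to 50
```

Because merging is a byte-wise max, option #3's "single writer per cell" constraint could be relaxed: several writers could each keep a local sketch and a combiner could merge them, at the cost of the custom atomic update Tom is asking about in option #2.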
