Andy,

I am a big fan of the Increment class. Unfortunately, I'm not doing
simple increments for the viewer count. I will be receiving duplicate
messages from a particular client for a specific cube cell, and I
don't want them to be counted twice. (My stats don't have to be 100%
accurate, but the expected rate of duplicates will be higher than the
allowable error rate.)

I created an RPC endpoint coprocessor to perform this function, but
performance suffered heavily under load (it appears that the endpoint
executes all requests serially). When I tried implementing it as a
region observer instead, I was unsure how to correctly replace the
provided "put" with my own: when I issued a put from within "prePut",
the server blocked the new put (waiting for the "prePut" to finish).
Should I be attempting to modify the WALEdit object instead? Is there
a way to extend the functionality of "Increment" to provide arbitrary
bitwise operations on the contents of a field?

Thanks again!

--Tom
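
P.S. For reference, here is a minimal sketch of the direction I'm now
guessing the region observer should take: rewriting the KeyValues
inside the client's Put from within "prePut", rather than issuing a
second put from the hook (which, as I found, blocks waiting on the
same row). The column family and the computeValue() helper are
made-up placeholders, and this assumes the 0.92-style coprocessor API:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
import org.apache.hadoop.hbase.util.Bytes;

public class UserIdRewriteObserver extends BaseRegionObserver {

  // Made-up column family holding the cube's viewer data.
  private static final byte[] FAMILY = Bytes.toBytes("v");

  @Override
  public void prePut(ObserverContext<RegionCoprocessorEnvironment> c,
                     Put put, WALEdit edit, boolean writeToWAL)
      throws IOException {
    // Rewrite the KeyValues inside the client's Put in place instead
    // of issuing a second put from inside the hook.
    List<KeyValue> kvs = put.getFamilyMap().get(FAMILY);
    if (kvs == null) {
      return;
    }
    List<KeyValue> rewritten = new ArrayList<KeyValue>(kvs.size());
    for (KeyValue kv : kvs) {
      byte[] computed = computeValue(kv.getValue());
      rewritten.add(new KeyValue(kv.getRow(), kv.getFamily(),
          kv.getQualifier(), kv.getTimestamp(), computed));
    }
    put.getFamilyMap().put(FAMILY, rewritten);
  }

  // Placeholder: map the raw user ID to whatever computed value the
  // cell should actually store (a hash, an HLL register, etc.).
  private byte[] computeValue(byte[] userId) {
    return Bytes.toBytes(Bytes.hashCode(userId));
  }
}

(I'm assuming the WALEdit is built from the Put's family map after
"prePut" returns, so modifying the Put in place should keep the WAL
consistent; I'd appreciate confirmation of that.)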

> If it helps, yes this is possible:
>
>> Can I observe updates to a particular table and replace the
>> provided data with my own? (The client calls "put" with the actual
>> user ID, my co-processor replaces it with a computed value, so the
>> actual user ID never gets stored in HBase.)
>
> Since your option #2 requires atomic updates to the data structure,
> have you considered native atomic increments? See
>
> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html#incrementColumnValue%28byte[],%20byte[],%20byte[],%20long,%20boolean%29
>
> or
>
> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Increment.html
>
> The former is a round trip for each value update. The latter allows
> you to pack multiple updates into a single round trip. This would
> give you accurate counts even with concurrent writers.
>
> It should be possible for you to do partial aggregation on the
> client side too, whenever parallel requests colocate multiple
> updates to the same cube within some small window of time.
>
> Best regards,
>
>    - Andy
>
> Problems worthy of attack prove their worth by hitting back.
>   - Piet Hein (via Tom White)
>
> ----- Original Message -----
>> From: Tom Brown <[email protected]>
>> To: [email protected]
>> Cc:
>> Sent: Monday, April 9, 2012 9:48 AM
>> Subject: Add client complexity or use a coprocessor?
>>
>> To whom it may concern,
>>
>> Ignoring the complexities of gathering the data, assume that I will
>> be tracking millions of unique viewers. Updates from each of our
>> millions of clients are gathered in a centralized platform and
>> spread among a group of machines for processing and inserting into
>> HBase (assume that this group can be scaled horizontally). The data
>> is stored in an OLAP cube format, and one of the metrics I'm
>> tracking across various attributes is viewership (how many people
>> from Y are watching X).
>>
>> I'm writing this to ask for your thoughts on the most appropriate
>> way to structure my data so I can count unique TV viewers (assume a
>> service like Netflix or Hulu).
>>
>> Here are the solutions I'm considering:
>>
>> 1. Store each unique user ID as the cell name within the cube(s) in
>> which it occurs. This has the advantage of 100% accuracy, but the
>> downside is the enormous space required to store each unique cell.
>> Consuming this data is also problematic, as the only way to produce
>> a viewership count is to count each cell. To save the overhead of
>> sending each cell over the network, the counting could be done by a
>> coprocessor on the region server, but that still doesn't avoid the
>> overhead of reading each cell from disk. I'm also not sure what
>> happens if a single row grows larger than an entire region (48
>> bytes per user ID * 10,000,000 users = 480MB).
>>
>> 2. Store a byte array that allows estimating unique viewers (with a
>> small margin of error*). Add a co-processor for updating this
>> column so I can guarantee that updates to a specific OLAP cell will
>> be atomic. The main benefit of this path is that the nodes that
>> update HBase can be less complex. Another benefit I see is that I
>> can just add more HBase regions as scale requires. However, I'm not
>> sure if I can use a coprocessor the way I want: can I observe
>> updates to a particular table and replace the provided data with my
>> own? (The client calls "put" with the actual user ID, my
>> co-processor replaces it with a computed value, so the actual user
>> ID never gets stored in HBase.)
>>
>> 3. Store a byte array that allows estimating unique viewers (with a
>> small margin of error*). Re-arrange my architecture so that each
>> OLAP cell is only updated by a single node. The main benefit of
>> this approach is that I wouldn't need to worry about atomic
>> operations in HBase, since all updates to a single cell would be
>> applied serially. The biggest downside is that I believe it would
>> add significant complexity to my overall architecture.
>>
>> Thanks for your time, and I look forward to hearing your thoughts.
>>
>> Sincerely,
>> Tom Brown
>>
>> *(For information about the byte array mentioned in #2 and #3, see:
>> http://highscalability.com/blog/2012/4/5/big-data-counting-how-to-count-a-billion-distinct-objects-us.html)
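
P.P.S. Andy, to make sure I understand the Increment suggestion, is
the idea something like the sketch below? One Increment carrying
several partially aggregated counts for the same cube row, applied
atomically in a single round trip. (The table, family, qualifier, and
row key names here are all made up.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Increment;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class CubeIncrement {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "olap_cube");

    // One cube row (e.g., a show/time bucket); the amounts are
    // partial aggregates collected client-side over a small window.
    Increment inc = new Increment(Bytes.toBytes("showX|2012040917"));
    inc.addColumn(Bytes.toBytes("m"), Bytes.toBytes("views|US"), 3L);
    inc.addColumn(Bytes.toBytes("m"), Bytes.toBytes("views|UK"), 1L);

    // All columns in the Increment are applied atomically, one RPC.
    Result result = table.increment(inc);
    table.close();
  }
}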
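
P.P.P.S. For anyone reading this thread later: here is a toy,
self-contained version of the kind of byte array options #2 and #3
refer to. It uses simple linear counting rather than the HyperLogLog
scheme from the linked article, but it shows the property I care
about: duplicate user IDs set the same bit, so they can never be
counted twice, and merging two arrays is a plain bitwise OR, which is
exactly the operation I was wishing "Increment" could apply
server-side.

public class LinearCounter {

  private final byte[] bits;

  public LinearCounter(int sizeInBytes) {
    bits = new byte[sizeInBytes];
  }

  // Hash the user ID to a bit position and set it. Duplicates hit
  // the same bit, so offering the same ID twice is a no-op.
  public void offer(String userId) {
    int h = userId.hashCode() & 0x7fffffff; // toy hash; use a real one
    int bit = h % (bits.length * 8);
    bits[bit / 8] |= (byte) (1 << (bit % 8));
  }

  // Standard linear-counting estimate: n ~ -m * ln(z / m), where m
  // is the total number of bits and z is the number still zero.
  public long estimate() {
    int m = bits.length * 8;
    int zeros = 0;
    for (byte b : bits) {
      zeros += 8 - Integer.bitCount(b & 0xff);
    }
    return Math.round(-m * Math.log((double) zeros / m));
  }

  // Merging two counters (e.g., inside a coprocessor) is a bitwise
  // OR over the stored bytes, so order and repetition don't matter.
  public void merge(LinearCounter other) {
    for (int i = 0; i < bits.length; i++) {
      bits[i] |= other.bits[i];
    }
  }
}

Usage would be counter.offer(userId) per update and
counter.estimate() per read. The estimate saturates once the bitmap
fills, so the array has to be sized for the expected cardinality.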
