Hi,

I've been watching these presentations about real-time user segmentation
using HBase by RichRelevance:
https://www.youtube.com/watch?v=dPnuOv3CPQ0
http://www.slideshare.net/Hadoop_Summit/doctor-nguyen-june27425pmroom230av2

It's a really great detailed talk, highly recommended. They use it to
calculate segments by evaluating and combining rules like "All users who
did EventX with MetricY Between dates D1 and D2 at least N times". It seems
to be working well for them.

But there are one or two things I can't figure out. Would anyone be
interested in talking about how they did it? Or, if anyone here has
implemented a similar scenario and would be willing to chat, I'd love to
get together and swap ideas about implementing some variations of this
approach. I'm very happy to share my experience so far.

On to the details: it looks like they used cell versioning to store
multiple clickstream events per day (they mention this briefly in another
version of the video, http://vimeo.com/70500725). They must have increased
the column family's max-versions setting to something quite high in order
to do this.

How did they use cell versions to store so many values? I was under the
impression that raising max versions much beyond about 100 would result in
very large HFiles?
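Just so we're talking about the same thing: by "raising max versions" I
mean something like the following in the HBase shell (the table and column
family names here are my guesses, not anything from the talk):

```shell
# Hypothetical: raise the number of retained cell versions on the 'events'
# column family so each (row, column, timestamp) version can hold one
# clickstream event. VERSIONS is a per-column-family setting in HBase,
# and the default is very low.
alter 'user_events', { NAME => 'events', VERSIONS => 10000 }
```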

Then they somehow figure out how many times the event happened between
those dates. I'm assuming they calculate running totals for each user in
memory as they iterate the scan results, but that could be expensive if you
need an in-memory hashtable covering millions of users.
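One thought I had (my assumption, not something they said in the talk): if
the row key leads with the user id, a scan returns each user's cells
contiguously in sorted order, so you'd only ever need one running counter
in memory at a time rather than a map of all users. A minimal sketch of
that idea in Python, with the scan stubbed out as a sorted list of
(user_id, event_timestamp) pairs:

```python
# Sketch: count qualifying events per user from a rowkey-sorted scan,
# keeping only a single counter in memory at a time. In HBase the sort
# order would come from the row key itself; here it's a sorted list.

def users_with_at_least(scan_results, t_start, t_end, min_count):
    """Yield user ids with >= min_count events in [t_start, t_end]."""
    current_user, count = None, 0
    for user_id, ts in scan_results:
        if user_id != current_user:
            # New user reached: emit the previous one if it qualified.
            if current_user is not None and count >= min_count:
                yield current_user
            current_user, count = user_id, 0
        if t_start <= ts <= t_end:
            count += 1
    # Don't forget the final user in the scan.
    if current_user is not None and count >= min_count:
        yield current_user

# Example: users who did the event at least 2 times between t=10 and t=20
rows = [("alice", 5), ("alice", 12), ("alice", 15),
        ("bob", 11), ("carol", 12), ("carol", 19)]
print(list(users_with_at_least(rows, 10, 20, 2)))  # → ['alice', 'carol']
```

That would keep memory constant regardless of user count, though I don't
know if that's what they actually do.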

Thanks,

Meena
