Hi Nick,
The key question to ask about this use case is the access pattern. Do you need
real-time access to new information as it is created? (I.e., if someone reads an
article, do your queries need to reflect that immediately?) If not, and a batch
approach is fine (say, nightly processing), then Hadoop is a good first step: it
can map/reduce over the logs and aggregate and index the data, either creating
static reports like you describe below for all the users & pages, or inserting
the results into a database.
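As a sketch of that batch path, a map/reduce job that counts views per page over
the raw logs might look something like this (the tab-separated log format and
field order are just assumptions for illustration):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Assumes each log line is "timestamp<TAB>user<TAB>page<TAB>action".
    public class PageViewCount {
        public static class ViewMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            public void map(LongWritable key, Text line, Context ctx)
                    throws IOException, InterruptedException {
                String[] fields = line.toString().split("\t");
                ctx.write(new Text(fields[2]), ONE); // emit (page, 1)
            }
        }

        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text page, Iterable<IntWritable> counts,
                    Context ctx) throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable c : counts) sum += c.get();
                ctx.write(page, new IntWritable(sum)); // total views per page
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "page view count");
            job.setJarByClass(PageViewCount.class);
            job.setMapperClass(ViewMapper.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

A nightly job along these lines can write its output back to HDFS for static
reports, or feed a load step into a database.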
If, on the other hand, you need real-time ingest & reporting, that's where
HBase would be a better fit. You could write the code so that every "log" you
write is actually passed to an API that inserts the data into one or more HBase
tables in real time.
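For example, a minimal sketch of such an API using the HBase Java client might
look like the following; the table name ("page_events"), column family ("e"),
and row-key layout are illustrative assumptions, not a recommendation:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    // Illustrative sketch only; table/column names are assumptions.
    public class EventLogger {
        private final HTable table;

        public EventLogger(Configuration conf) throws IOException {
            table = new HTable(conf, "page_events");
        }

        // Called from the logging path for every event.
        public void logEvent(String page, String user, String action)
                throws IOException {
            // Zero-pad the timestamp so row keys sort chronologically.
            String ts = String.format("%013d", System.currentTimeMillis());
            Put put = new Put(Bytes.toBytes(page + "|" + ts));
            put.add(Bytes.toBytes("e"), Bytes.toBytes("user"),
                    Bytes.toBytes(user));
            put.add(Bytes.toBytes("e"), Bytes.toBytes("action"),
                    Bytes.toBytes(action));
            table.put(put);
        }
    }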
The trick is that since HBase doesn't have built-in secondary indexing, you'd
have to write the data in ways that can provide low-latency answers to the
queries below. That likely means denormalizing the data; in your case, probably
one table keyed by page & time ("all users that have been to the webpage X in
the last N days"), another keyed by user & time ("all the pages seen by a given
user"), and so on. In other words, every action would result in multiple writes
into HBase. (The same thing happens in a relational database: an "index" means
the data is written in two places, but there it's hidden from you and
maintained transparently.) Once you get that working, though, you can have very
fast access to real-time data for any of those queries.
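To make that concrete, here's a rough sketch of how "all users that have been
to the webpage X in the last N days" becomes a simple row-key range scan once a
table is keyed by page & time; the table and column names are again made up for
illustration:

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PageQueries {
        // "All users that have been to webpage X in the last N days",
        // answered by a range scan over rows keyed as page|timestamp
        // (zero-padded timestamps so the keys sort chronologically).
        public static void usersOnPage(HTable table, String page, int days)
                throws IOException {
            long now = System.currentTimeMillis();
            long cutoff = now - days * 24L * 3600 * 1000;
            Scan scan = new Scan(
                    Bytes.toBytes(page + "|" + String.format("%013d", cutoff)),
                    Bytes.toBytes(page + "|" + String.format("%013d", now)));
            ResultScanner scanner = table.getScanner(scan);
            try {
                for (Result r : scanner) {
                    byte[] user = r.getValue(Bytes.toBytes("e"),
                                             Bytes.toBytes("user"));
                    System.out.println(Bytes.toString(user));
                }
            } finally {
                scanner.close();
            }
        }
    }

The other queries would follow the same pattern, each against the table whose
row key matches its access path.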
Note: that's a lot of work. If you don't really need real-time ingest and
query, Hadoop queries over the logs are much simpler (especially with tools
like Hive, which lets you write real SQL statements and automatically
translates them into Hadoop map/reduce jobs).
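For instance, assuming the raw logs were mapped into a Hive table called
page_events with columns (ts, user_id, page, action), that same "users on page
X in the last N days" question could be written as plain SQL and submitted
through Hive's JDBC driver, with Hive running it as map/reduce jobs; the host,
port, and schema here are assumptions:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQuery {
        public static void main(String[] args) throws Exception {
            // Hive's JDBC driver; connection details are assumptions.
            Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
            Connection con = DriverManager.getConnection(
                    "jdbc:hive://localhost:10000/default", "", "");
            Statement stmt = con.createStatement();
            // Distinct users on page X in the last 30 days.
            ResultSet rs = stmt.executeQuery(
                    "SELECT DISTINCT user_id FROM page_events " +
                    "WHERE page = 'X' AND ts > unix_timestamp() - 30 * 86400");
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
            con.close();
        }
    }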
Also, 10 TB isn't outside the range that a traditional database can handle
(given the right hardware, schema design, and indexing). You may find it
simpler to model your problem that way, either using Hadoop as a bridge between
the raw log data and the database (if offline processing is OK) or inserting
directly. The key benefit of going the Hadoop/HBase route is horizontal
scalability: even if you don't know your eventual size target, you can be
confident that you can scale linearly by adding hardware. That's critical if
you're Google or Facebook, but less often required for smaller businesses.
Don't over-engineer ... :)
Ian
On Nov 8, 2012, at 3:00 AM, Nick maillard wrote:
Hi everyone
I'm currently testing HBase/Hadoop in terms of performance but also in terms of
applicability. After some trials and reading, I'm wondering if HBase is well
suited to the use case I'm currently testing.
Say I had logs from websites listing users going to a webpage, reading an
article, liking a piece of data, commenting, or even bookmarking.
I would store these logs over a long period and for many different websites,
and I would like to query the data with these questions:
- All users that have been to the webpage X in the last N days.
- All users that have liked and then bookmarked a page within a range of Y days.
- All the pages that have been commented on X times in the last N days.
- All users that have commented on a page W and liked a page P.
- All pages seen, liked, or commented on by a given user.
As you can see, this might be a very SQL way of thinking. Since the questions
are different in nature, my understanding is that I would have a different
table to answer each of them.
Am I correct? How could this be represented, and would SQL be a better fit?
The data would be large, around 10 TB.
Regards