Some HBase developers had a meetup/Hackathon today at Facebook's Palo Alto
office (as announced last week on the list). Following are some notes I took
- please feel free to reply with any corrections, etc. Apologies if I've
misspelled anyone's name!
Attendees/intros:
- Jonathan Gray @ FB
- Stack @ SU
- Matthieu Lieber @ Datameer (building a connector)
- Lars George - at Cloudera - writing HBase book
- Gary Helmling @ Trend Micro - been working on coprocessors
- Mingjie Lai @ Trend Micro
- Todd Lipcon @ Cloudera
- Joshua Ho @ Trend Micro
- Eugene Koontz @ Trend Micro - working on ZK authentication
- Andrew Purtell @ Trend Micro - working on CPs, also stability testing of
0.90 on a web crawler synthetic use case
- Nicolas Spiegelberg @ FB - working on miscellaneous stuff in HBase for
scaling the FB messaging system
- Karthik Ranganathan @ FB - not sure what he's going to work on
- Kannan Muthukuruppan - same as above
- Amit @ FB - same
- Ryan Rawson @ StumbleUpon - not sure what he's working on
- JD Cryans @ StumbleUpon - testing replication in trunk, interested in CPs
- Himanshu - student at Uni of Alberta - using CPs for his thesis topic
- Vaidhav @ GumGum - using HBase in production, still running 0.20.6, wants
to upgrade to 0.90
- Ken Weiner @ GumGum - same as above.
Mingjie Lai presenting from his blog post on coprocessors
Discussion:
- Jon brings up that the pre/post hooks for get don't allow the coprocessor
to *not* call the original get.
- eg completely sub out "get" implementation for something else
- the current API is more like pre/post filters, but you can't skip the
normal implementation
- we should probably amend API to allow this
- some discussion of taking some kind of Context argument - we'll break
this out into a separate discussion later in the day
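The bypass idea being discussed might look something like this minimal plain-Java sketch. All names here (HookContext, preGet, bypass()) are invented for illustration - this is not the actual coprocessor API, just the shape of "let the pre-hook skip the normal implementation":

```java
import java.util.HashMap;
import java.util.Map;

public class BypassSketch {
    /** Context the hook can use to short-circuit the default implementation. */
    static class HookContext {
        boolean bypassed = false;
        void bypass() { bypassed = true; }
    }

    /** Hypothetical observer: may fill in the result itself and call ctx.bypass(). */
    interface GetObserver {
        String preGet(HookContext ctx, String row);
        String postGet(HookContext ctx, String row, String result);
    }

    static final Map<String, String> CORE_STORE = new HashMap<>();

    static String get(GetObserver observer, String row) {
        HookContext ctx = new HookContext();
        String result = observer.preGet(ctx, row);
        if (!ctx.bypassed) {
            // Only run the normal get path if the pre-hook did not bypass it.
            result = CORE_STORE.get(row);
        }
        return observer.postGet(ctx, row, result);
    }

    public static void main(String[] args) {
        CORE_STORE.put("r1", "stored-value");
        // Observer that completely subs out "get" for the row "cached".
        GetObserver obs = new GetObserver() {
            public String preGet(HookContext ctx, String row) {
                if (row.equals("cached")) { ctx.bypass(); return "from-cache"; }
                return null;
            }
            public String postGet(HookContext ctx, String row, String result) {
                return result;
            }
        };
        System.out.println(get(obs, "r1"));     // prints stored-value (normal path)
        System.out.println(get(obs, "cached")); // prints from-cache (bypassed)
    }
}
```

With only pre/post filters and no bypass(), the "cached" case would still hit the core store - which is exactly the limitation Jon raised.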
- Some discussion about doing aggregation using coprocessors:
- RPCs for CommandTarget are per region, not per region server
- "Toy" mapreduce (i.e. no spills or failure handling) is easy to do for
simple sum/average aggregates
- HBASE-1512 has basic aggregate functions
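As a rough illustration of the per-region model (this is not the HBASE-1512 API - all names are made up): each region computes a partial aggregate over only its own rows, and the client merges the partials, much like a toy map/reduce:

```java
public class RegionAggregateSketch {
    /** Partial aggregate computed server-side, one per region. */
    static class Partial {
        long sum; long count;
        Partial(long sum, long count) { this.sum = sum; this.count = count; }
    }

    /** Runs on one region: aggregates only the rows that region holds. */
    static Partial regionPartial(long[] values) {
        long sum = 0;
        for (long v : values) sum += v;
        return new Partial(sum, values.length);
    }

    /** Client-side merge of the per-region partials into a global average. */
    static double mergeAverage(Partial[] partials) {
        long sum = 0, count = 0;
        for (Partial p : partials) { sum += p.sum; count += p.count; }
        return (double) sum / count;
    }

    public static void main(String[] args) {
        Partial[] partials = {
            regionPartial(new long[]{1, 2, 3}),   // region 1's rows
            regionPartial(new long[]{4, 5}),      // region 2's rows
        };
        System.out.println(mergeAverage(partials)); // prints 3.0
    }
}
```

The "per region, not per region server" point shows up here: the client has to fan out one call per region and do the final merge itself.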
- Some discussion about how this actually is used:
- You have to set the CP on the table descriptor
- There is no class unloading currently implemented, so rolling restart
is required to update
- There are some hacks around this, but nothing suitable for production
- For development, unit tests are a good way - there are some unit test
examples that come with CPs
- Lars has experience from some years ago with various frameworks that
do things like this
- People are generally afraid of class unloading, since it's very hairy
- In general, if you have to patch a CP, use the same upgrade procedure
as you would for a bug in HBase itself
- Apache Felix might do stuff like this (OSGI stack)
- This might necessitate refactoring other parts of HBase, pulling in
more dependencies, etc - lots of churn, overkill at this point
- Consensus seems to be at this point that CPs are "wizard level" -
people deploying them should treat them like part of HBase itself, not be
redeploying willy-nilly
- Long term maybe we have a coprocessor with support for some language
like JRuby for simple user-level things like triggers, etc
-- Lunch --
After lunch, discussion:
- Use cases we're thinking about:
- Eventually consistent secondary indexing
- In schema, define which columns are indexed
- Observer on put would notice puts and write index writes to WAL
- On WAL replay, make sure they've been replayed to index tables as well
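A toy sketch of the indexing idea, with plain maps standing in for the data and index tables (the key layout, separator, and names are made up; the real design would also record index writes in the WAL, as noted above):

```java
import java.util.Map;
import java.util.TreeMap;

public class IndexObserverSketch {
    static final Map<String, String> DATA_TABLE = new TreeMap<>();
    static final Map<String, String> INDEX_TABLE = new TreeMap<>();

    /** Index row key: indexed value first, so a scan by value finds the data rows. */
    static String indexKey(String value, String dataRow) {
        return value + "|" + dataRow;
    }

    /** The put path with the observer hook: write the data, then the index entry. */
    static void put(String row, String value) {
        DATA_TABLE.put(row, value);
        // Observer side effect: eventually consistent index write. On WAL
        // replay, these entries would be re-applied to the index table too.
        INDEX_TABLE.put(indexKey(value, row), row);
    }

    public static void main(String[] args) {
        put("user42", "alice@example.com");
        put("user7", "bob@example.com");
        // Scan the index (sorted by value) to find rows by indexed column.
        INDEX_TABLE.forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
```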
- Server-side search query join
- Row-level increment "group commit"
- Some rows are really hot in a lot of ICV use cases
- If an increment comes in and the row is locked, queue the increment
- When the row is released, the queued increments get combined and
committed together
- Don't respond to clients until the commit has been synced
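The queue-and-combine idea can be sketched like this (class and method names are invented; real row locks, WAL sync, and client responses are elided):

```java
import java.util.ArrayList;
import java.util.List;

public class IncrementCoalescerSketch {
    private long counter = 0;
    private boolean rowLocked = false;
    private final List<Long> pending = new ArrayList<>();

    /** Either applies the increment immediately or queues it behind the lock. */
    synchronized void increment(long delta) {
        if (rowLocked) {
            pending.add(delta);   // hot row: queue instead of blocking
        } else {
            counter += delta;
        }
    }

    /** On lock release, combine all queued increments into a single commit. */
    synchronized void releaseRow() {
        long combined = 0;
        for (long d : pending) combined += d;
        pending.clear();
        counter += combined;      // one commit/sync covers the whole batch;
        rowLocked = false;        // only then would clients be answered
    }

    synchronized void lockRow() { rowLocked = true; }
    synchronized long value() { return counter; }

    public static void main(String[] args) {
        IncrementCoalescerSketch row = new IncrementCoalescerSketch();
        row.increment(1);      // applied directly
        row.lockRow();
        row.increment(2);      // queued while the row is locked
        row.increment(3);      // queued
        row.releaseRow();      // 2 and 3 combined, committed together
        System.out.println(row.value()); // prints 6
    }
}
```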
- Himanshu: discussing adding CP support for materializing an
intermediate result set and then scanning it
- use cases: eg top K
- some discussion about validity of use cases - for top K most people
would prefer to maintain it incrementally rather than doing a batch scan
followed by scanning the result set
- can we make a more general API for doing scanner-like actions
from CPs?
- Gary:
- Multi-row transactions like Percolator
- Todd:
- Offline blob storage - for large puts, write them to a side file on HDFS
and just store a reference.
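A rough sketch of the reference scheme, with a map standing in for the HDFS side file (the size threshold, "REF:" marker, and naming are all illustrative):

```java
import java.util.HashMap;
import java.util.Map;

public class BlobRefSketch {
    static final int INLINE_LIMIT = 16;                        // illustrative threshold
    static final Map<String, byte[]> TABLE = new HashMap<>();
    static final Map<String, byte[]> SIDE_STORE = new HashMap<>(); // stands in for HDFS

    /** Large values go to the side store; the table keeps only a reference. */
    static void put(String row, byte[] value) {
        if (value.length > INLINE_LIMIT) {
            String ref = "blob://" + row;                      // reference to side file
            SIDE_STORE.put(ref, value);
            TABLE.put(row, ("REF:" + ref).getBytes());
        } else {
            TABLE.put(row, value);
        }
    }

    /** The read path follows the reference transparently. (A real scheme
     *  would need an unambiguous marker, not a string prefix.) */
    static byte[] get(String row) {
        byte[] stored = TABLE.get(row);
        String s = new String(stored);
        return s.startsWith("REF:") ? SIDE_STORE.get(s.substring(4)) : stored;
    }

    public static void main(String[] args) {
        put("small", "hi".getBytes());
        put("big", new byte[1024]);                            // goes to the side store
        System.out.println(get("small").length);               // prints 2
        System.out.println(get("big").length);                 // prints 1024
    }
}
```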
- Ryan:
- Basic datastructures (list append, etc)
- Potentially operate on JSON structures
- Some general discussion about embedding a scripting language - Jython,
JRuby, and JavaScript were all brought up. Some people will continue to look
into this on JIRA - JavaScript seems to be winning for now since Rhino is
included in the JDK.
After discussion, different people broke off to work on different things -
some coprocessors, other general work on 0.90 stability, etc. Most people
stuck around for dinner at 6, and then filtered out between 7 and 8:30pm
with no particular wrap-up.
--
Todd Lipcon
Software Engineer, Cloudera