Hey all, I'm working on some tools for doing data integration and building machine learning models w/Crunch, Mahout, and (soon!) HCatalog, and I wrote about what I'm up to here:
http://blog.cloudera.com/blog/2013/03/cloudera_ml_data_science_tools/
and the code is here: https://github.com/cloudera/ml

I wanted to answer a couple of questions preemptively, if you don't mind:

Q: Why?
A: I started planning out the next version of my data science course, and I was concerned that my students were going to spend too much time on data integration tasks (e.g., converting CSVs to Vectors) that really should be automated. I obviously enjoy writing my Java MR stuff in Crunch, and I thought it would be a good idea to open source the tools to showcase how awesome Crunch can be.

Q: Why not do this as part of the Crunch or Mahout projects?
A: Dependency management. Crunch doesn't depend on Mahout, and Mahout doesn't depend on Crunch, and I think that for the sanity of the developers of both projects, it should stay that way. Dependency management is already enough of a nightmare for Hadoop projects that I didn't want to do anything to make it worse. I will contribute anything from the toolkit back to Crunch that is deemed useful by the community (e.g., the reservoir sampling stuff in CRUNCH-178) and doesn't introduce any new dependencies.

Q: Where is this going?
A: I'm going to be co-developing the tools and the coursework for the class, so I have a reasonably good idea of what features I need to add, with HCatalog integration and ensemble models being the two major items on the TODO list. I'm not looking to build a tool for every ML algorithm ever invented, just a small set of core models that are easy to use, easy to tune, and thus easy for new data scientists to get started with.

If there's anything else folks are curious about, please just let me know and I'd be happy to answer.

Josh

-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
