I'm in the middle of designing the architecture for a new project, which has various machine-learning and related components, including recommender systems, search engines and sequence [common intersection] matching.
Usually I use MongoDB (as the DB), Redis (as the cache) and Celery (as the queue, backed by Redis). I don't have experience with Hadoop, but I was thinking of using it for the machine-learning side, as this will become a Big Data problem quite quickly. To push the data into Hadoop I would either use a connector of some description or push the MongoDB backups into HDFS at set intervals.

However, I was thinking it might be better to put the whole thing on Hadoop: store all persistent data in Hadoop and do all the layers in Apache Spark, with caching remaining in Redis. Is that a viable option? Most of what I see discusses Spark (and Hadoop in general) for analytics only. Apache Phoenix exposes a nice read/write interface over HBase, so I might use that if Spark ends up being the wrong solution.

A rough sketch of the kind of batch pipeline I have in mind is at the end of this mail.

Thanks for all suggestions,

Alec Taylor

PS: I need this for both "Big" and "Small" data. Note that I am using the Cloudera definition of "Big Data", i.e. processing/storage across more than one machine.
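Here is the sketch I mentioned above. It is only an illustration of the batch layer, assuming the MongoDB collections get exported to HDFS as newline-delimited JSON on a schedule (e.g. via mongoexport), and assuming a recent Spark with the DataFrame-based ML API; all paths, column names and parameters are placeholders rather than my real schema:

    from pyspark.sql import SparkSession
    from pyspark.ml.recommendation import ALS

    spark = (SparkSession.builder
             .appName("recommender-batch")
             .getOrCreate())

    # Batch layer: read the latest export straight off HDFS.
    # (hypothetical path; ALS expects numeric user/item ids)
    ratings = spark.read.json("hdfs:///exports/ratings/latest/")

    # Collaborative-filtering model for the recommender component.
    als = ALS(userCol="user_id", itemCol="item_id", ratingCol="rating",
              coldStartStrategy="drop")
    model = als.fit(ratings)

    # Top 10 recommendations per user, handed to the serving layer.
    recs = model.recommendForAllUsers(10)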
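For the "caching remains in Redis" part, I'd push the output of that job into Redis for serving, along these lines (again a sketch; the key layout and TTL are placeholders, and `recs` is the DataFrame produced by recommendForAllUsers() in the sketch above):

    import json
    import redis

    def push_recs_to_redis(recs, host="localhost", port=6379, ttl=86400):
        # `recs` has one row per user with an array of
        # (item id, predicted rating) structs.
        # collect() is acceptable only because top-N per user is small
        # compared to the raw data; otherwise I'd use foreachPartition.
        r = redis.StrictRedis(host=host, port=port)
        for row in recs.collect():
            items = [rec[0] for rec in row["recommendations"]]
            r.setex("recs:%d" % row["user_id"], ttl, json.dumps(items))

The web tier would then just read "recs:<user_id>" out of Redis, the same way it does today with the Celery/MongoDB setup.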
