I have a couple of MongoDB structures that contain docs, terms associated
with each vector dimension, term weights, docids for similar docs,
clusters, docs included in the clusters, etc. They come from several
sequence files in HDFS, so I'm just looking for a way to conveniently do
the post-Mahout processing. If each sequence file were in Mongo with
keys indexed, I can imagine how to connect the dots. Also, I'm creating a
prototype, so I'm trying to find the easiest way to do it. Since the data has
to get into Mongo anyway, I thought doing it sooner in the pipeline would be
simplest. I realize that I don't need to export into human-readable JSON and
could write to Mongo directly, and that is certainly an option.
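To be concrete about what I mean by writing directly, here is a rough sketch
of the kind of thing I have in mind. It's just the plain Hadoop and Mongo Java
APIs; it assumes the sequence files hold Text/VectorWritable key-value pairs,
and the HDFS path, database, and collection names are all made up:

    import java.util.Iterator;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.math.VectorWritable;

    import com.mongodb.BasicDBObject;
    import com.mongodb.DBCollection;
    import com.mongodb.Mongo;

    public class SeqToMongo {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // hypothetical input: one part file from the Mahout vector output
        Path path = new Path("hdfs:///mahout/vectors/part-r-00000");

        Mongo mongo = new Mongo("localhost", 27017);
        // made-up database and collection names
        DBCollection docs = mongo.getDB("mydb").getCollection("docs");

        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        Text key = new Text();                       // doc id / name
        VectorWritable value = new VectorWritable(); // term weight vector
        while (reader.next(key, value)) {
          BasicDBObject doc = new BasicDBObject("_id", key.toString());
          BasicDBObject weights = new BasicDBObject();
          Vector v = value.get();
          for (Iterator<Vector.Element> it = v.iterateNonZero(); it.hasNext();) {
            Vector.Element e = it.next();
            // dimension index -> weight; mapping the index back to a term
            // would come from the dictionary file
            weights.put(String.valueOf(e.index()), e.get());
          }
          doc.put("weights", weights);
          docs.insert(doc);
        }
        reader.close();
        mongo.close();
      }
    }

That's roughly the "write to Mongo directly" option; the question is whether
something like it already exists, or whether seqdumper could grow an output
format so I don't have to write it per structure.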
I looked for a way to use Mongo as a generic backing store for
Hadoop/Mahout but struck out (I'm not even sure that would be a good idea
anyway). I did see the Pig integration and your code for the
MongoDBDataModel in the recommender, but neither seemed to apply to my
case.
Any advice is appreciated.
On 3/17/12 4:01 PM, Sean Owen wrote:
What do you mean by indexed here?
On Sat, Mar 17, 2012 at 10:56 PM, Pat Ferrel<[email protected]> wrote:
I need to digest some Mahout files and merge them into a MongoDB database.
Since digesting would be a lot easier if the Mahout keys were indexed, I
wonder if a "seqdumper --format json or mongodb" option might be useful. It would
make my life easier, but maybe there is already a better way to do this?