Dear All,
I was wondering if the following is possible using MapReduce.
I would like to create a job that loops over a set of documents,
tokenizes them into ngrams, and stores, for each ngram, not only its
count but also _which_ document(s) contained it. In other
words, the key would be the ngram and the value would be an integer (the
count) _and_ an array of document IDs.
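To make the desired output concrete, here is a rough sketch in plain Java
that only simulates the map and reduce steps in memory. It does not use any
Hadoop classes, and every name in it (NgramIndexSketch, Posting, map, reduce,
build) is made up for illustration. Splitting on whitespace to get word
ngrams and lowercasing the text are also just assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// In-memory simulation of the idea; not a real MapReduce job.
// All class and method names are hypothetical.
class NgramIndexSketch {

    // Value for one ngram: total occurrence count plus the distinct document IDs.
    static final class Posting {
        int count = 0;
        final TreeSet<String> docIds = new TreeSet<>();
    }

    // "Map" step: emit one (ngram, docId) pair per word ngram in the document.
    static List<String[]> map(String docId, String text, int n) {
        List<String[]> pairs = new ArrayList<>();
        String cleaned = text.trim().toLowerCase(Locale.ROOT);
        if (cleaned.isEmpty()) {
            return pairs;
        }
        String[] tokens = cleaned.split("\\s+"); // assumption: whitespace tokenization
        for (int i = 0; i + n <= tokens.length; i++) {
            String ngram = String.join(" ", Arrays.copyOfRange(tokens, i, i + n));
            pairs.add(new String[] {ngram, docId});
        }
        return pairs;
    }

    // "Reduce" step: all docIds emitted for one ngram become a count and an ID set.
    static Posting reduce(Iterable<String> docIds) {
        Posting posting = new Posting();
        for (String docId : docIds) {
            posting.count++;            // one emitted pair = one occurrence
            posting.docIds.add(docId);  // the set drops repeated IDs
        }
        return posting;
    }

    // Driver: run map on every document, group pairs by ngram, then reduce each group.
    static Map<String, Posting> build(Map<String, String> docs, int n) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            for (String[] pair : map(doc.getKey(), doc.getValue(), n)) {
                grouped.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
            }
        }
        Map<String, Posting> index = new TreeMap<>();
        for (Map.Entry<String, List<String>> group : grouped.entrySet()) {
            index.put(group.getKey(), reduce(group.getValue()));
        }
        return index;
    }
}
```

So the mapper would emit (ngram, document ID) pairs, and the reducer would
turn all the IDs seen for one ngram into the count plus the ID list. In a real
job I assume the value would need to be some custom type that can hold both.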
Is this something that can be done? Any pointers would be appreciated.
I am using Java, btw.
Thank you,
Natalia Connolly