- Adding parsing logic directly in your mappers/reducers is the simplest, least elegant way to do it; just writing JSON strings as your values is a similarly simple option.
- A step up is to write custom Writables that parse the data (see the first sketch below).
- The truly portable and "right" way is to define a schema and use Avro to parse it (see the second sketch below). Unlike manually adding parsing to app logic, or adding JSON deserialization to your mappers/reducers, proper Avro serialization improves performance and app portability while also making the code more maintainable (it interoperates with plain Java domain objects).
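To make the Writable option concrete for the ngram question below, here is a minimal sketch of a custom value type holding a count plus a list of document ids. The class and field names are just illustrative, not anything standard:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Writable;

// Custom value type: a count plus the ids of the documents
// that contained the ngram.
public class NgramStats implements Writable {
  private int count;
  private List<String> docIds = new ArrayList<String>();

  public void write(DataOutput out) throws IOException {
    out.writeInt(count);
    // Write the list length first so readFields knows how
    // many ids to read back.
    out.writeInt(docIds.size());
    for (String id : docIds) {
      out.writeUTF(id);
    }
  }

  public void readFields(DataInput in) throws IOException {
    count = in.readInt();
    docIds.clear();
    int n = in.readInt();
    for (int i = 0; i < n; i++) {
      docIds.add(in.readUTF());
    }
  }
}

You'd then declare NgramStats as your job's value class and emit it from the reducer alongside the ngram key.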
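And here is a rough sketch of the Avro route, building the equivalent record with Avro's generic API. The schema and field names are again just illustrative; in a real job you would typically wire the schema up through the AvroKey/AvroValue classes from the avro-mapred artifact rather than build records by hand like this:

import java.util.Arrays;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class AvroSketch {
  // The schema is defined once, in data, instead of being baked
  // into hand-written (de)serialization code.
  static final Schema SCHEMA = new Schema.Parser().parse(
      "{\"type\":\"record\",\"name\":\"NgramStats\",\"fields\":["
      + "{\"name\":\"count\",\"type\":\"int\"},"
      + "{\"name\":\"docIds\",\"type\":{\"type\":\"array\",\"items\":\"string\"}}]}");

  public static void main(String[] args) {
    GenericRecord rec = new GenericData.Record(SCHEMA);
    rec.put("count", 3);
    rec.put("docIds", Arrays.asList("doc1", "doc7"));
    System.out.println(rec); // prints the record as JSON
  }
}

Because the schema travels with the data, any Avro-aware tool (or a plain Java domain object generated from the schema) can read the output without knowing anything about your job's code.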
