One way I can think of is to define your own InputFormat and RecordReader so that each record is 'a paragraph' or 'a sentence'. The reason is that, by default, TextInputFormat (the standard FileInputFormat implementation for text) treats each line terminated by a standard end-of-line character as one record. Here, you want one paragraph per record instead of one line. Once you override the RecordReader, you control how a 'record' is defined, and each such record is what gets passed to your mapper.
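To make the idea of 'one paragraph per record' concrete, here is a rough sketch in plain Java, with no Hadoop classes involved. It only illustrates the grouping logic that a custom RecordReader would need when it builds the next record. The class name `ParagraphRecords` and both method names are made up for this illustration, and it assumes that paragraphs are separated by one or more blank lines and that sentences end with '.', '!' or '?'.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop: shows the record-boundary logic only.
class ParagraphRecords {

    // Groups lines into paragraph records. Assumption: one or more blank
    // lines separate paragraphs; lines inside a paragraph are joined by a space.
    static List<String> splitParagraphs(String text) {
        List<String> paragraphs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String line : text.split("\\R")) {      // \R matches any line break
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                // A blank line closes the paragraph collected so far, if any.
                if (current.length() > 0) {
                    paragraphs.add(current.toString());
                    current.setLength(0);
                }
            } else {
                if (current.length() > 0) {
                    current.append(' ');
                }
                current.append(trimmed);
            }
        }
        // The last paragraph may not be followed by a blank line.
        if (current.length() > 0) {
            paragraphs.add(current.toString());
        }
        return paragraphs;
    }

    // Naive sentence count: splits on runs of '.', '!' or '?'.
    // Abbreviations such as "e.g." will be over-counted.
    static int countSentences(String paragraph) {
        int count = 0;
        for (String piece : paragraph.split("[.!?]+")) {
            if (!piece.trim().isEmpty()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "First line. Same paragraph!\nStill the first one.\n\n\nSecond paragraph?\n";
        List<String> paragraphs = splitParagraphs(text);
        System.out.println("paragraphs: " + paragraphs.size());
        for (String p : paragraphs) {
            System.out.println(countSentences(p) + " sentence(s): " + p);
        }
    }
}
```

In a real job, the same grouping would live inside the RecordReader, so that each map() call receives one paragraph as its value. The mapper could then emit a count of 1 per paragraph (or the sentence count per paragraph) and the reducer would sum them up.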
Some starting points, e.g. look here to define and implement your own RecordReader for FileInputFormat:

http://bigdatacircus.com/2012/08/01/wordcount-with-custom-record-reader-of-textinputformat/
http://www.infoq.com/articles/HadoopInputFormat
http://hadoopi.wordpress.com/2013/05/31/custom-recordreader-processing-string-pattern-delimited-records/

Regards,
Shahab

On Sat, Nov 1, 2014 at 11:45 AM, Raghavendra Chandra <[email protected]> wrote:

> Hi There,
>
> I have couple of doubts in Hadoop, it would be really helpful if anyone
> can answer these questions or if this is already answered somewhere, the
> link to that would be helpful.
>
> Below are my doubts:
>
> 1. How to count the number of paragraphs in a text file using java map
> reduce ?
>
> 2. How to count the number of sentences in a paragraph/file using java map
> reduce ?
>
> Please let me know where I can get the map reduce programs list with
> different use cases.
>
> Looking forward for your responses.
