It's possible you could (ab)use updateStateByKey or mapWithState for this. But honestly it's probably a lot more straightforward to just choose a batch interval that gives you a reasonable file size for most of your keys, then use filecrush or something similar to deal with the HDFS small-file problem.
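
If you do want to try the mapWithState route, here's a rough, untested sketch of what I meant. Everything specific to your setup is stubbed out: the socket source and the take(8) keying are just stand-ins for your Kafka direct stream and whatever message attribute you key on, the 64MB FlushBytes threshold and the checkpoint path are made-up values, and the actual HDFS write is left as a placeholder. Keep in mind this buffers up to ~64MB per key in the state store, which is why I'd call it abuse:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

object AccumulateByKey {

  // Flush a key's buffer once it approaches the HDFS block size (placeholder: 64MB).
  val FlushBytes = 64L * 1024 * 1024

  // Per-key state: the buffered records plus their accumulated size in bytes.
  case class Buffer(records: Vector[String], bytes: Long)

  // Emits Some((key, records)) only once the key has accumulated enough to flush;
  // otherwise keeps buffering in the state and emits None.
  def accumulate(key: String, value: Option[String],
                 state: State[Buffer]): Option[(String, Vector[String])] = {
    val prev = state.getOption().getOrElse(Buffer(Vector.empty, 0L))
    val next = value.fold(prev)(v => Buffer(prev.records :+ v, prev.bytes + v.length))
    if (next.bytes >= FlushBytes) {
      state.remove()            // buffer is being emitted, drop it from state
      Some((key, next.records))
    } else {
      state.update(next)        // not big enough yet, keep accumulating
      None
    }
  }

  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("accumulate-by-key"), Seconds(30))
    ssc.checkpoint("/tmp/accumulate-by-key")   // mapWithState requires checkpointing

    // Placeholder source and keying; in your case this would be the Kafka direct
    // stream, keyed by whatever attribute you pull out of the message.
    val keyed = ssc.socketTextStream("localhost", 9999).map(line => (line.take(8), line))

    keyed
      .mapWithState(StateSpec.function(accumulate _))
      .flatMap(opt => opt)                     // keep only the keys that are ready to flush
      .foreachRDD { rdd =>
        rdd.foreach { case (k, records) =>
          // Write this key's records out to HDFS here; left as a placeholder.
          println(s"would flush ${records.size} records for key $k")
        }
      }

    ssc.start()
    ssc.awaitTermination()
  }
}

You'd also want a timeout or batch-count cap so slow keys eventually get flushed rather than sitting in state forever.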
On Mon, Feb 1, 2016 at 10:11 PM, p pathiyil <pathi...@gmail.com> wrote:
> Hi,
>
> Are there any ways to store DStreams / RDDs read from Kafka in memory to be
> processed at a later time? What we need to do is read data from Kafka,
> process it to be keyed by some attribute that is present in the Kafka
> messages, and write out the data related to each key once we have
> accumulated enough data for that key to write out a file that is close to
> the HDFS block size, say 64MB. We are looking for ways to avoid periodically
> writing out files of the entire Kafka content and then later running a
> second job to read those files and split them out into another set of files
> as necessary.
>
> Thanks.