I am using Cloudera’s example source to collect a sample of Twitter’s stream
partitioned by year -> month -> day -> hour.
https://github.com/cloudera/cdh-twitter-example/blob/master/flume-sources/src/main/java/com/cloudera/flume/source/TwitterSource.java
The timestamp of an event is set by:
headers.put("timestamp", String.valueOf(status.getCreatedAt().getTime()));
My agent config:
TwitterAgent.sinks.HDFS.hdfs.path=hdfs://kronos.feeb.co:8020/user/flume/tweets/%Y/%m/%d/%H/
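As far as I understand the Flume docs, these are the HDFS sink options that decide how events land in time buckets; the escape sequences are expanded from the event's timestamp header unless hdfs.useLocalTimeStamp is turned on. For reference (only the hdfs.path line is quoted from my config, the other lines just show the relevant knobs with their documented defaults):

TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://kronos.feeb.co:8020/user/flume/tweets/%Y/%m/%d/%H/
# Bucketing-related options (documented defaults shown):
TwitterAgent.sinks.HDFS.hdfs.useLocalTimeStamp = false
TwitterAgent.sinks.HDFS.hdfs.round = false
TwitterAgent.sinks.HDFS.hdfs.roundValue = 1
TwitterAgent.sinks.HDFS.hdfs.roundUnit = second
# hdfs.timeZone defaults to the agent's local time zone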
However, I see that almost every hourly directory contains at least one record (more often several) whose timestamp falls in the last second of the previous hour.
Is there any way to prevent these overlaps in the data?
They make hourly aggregation unnecessarily messy if I don't want to drop any records.
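To make the messiness concrete: right now the only clean way I see to aggregate hour H is to re-check every record's created_at against the directory it sits in, along these lines (a minimal sketch, not something I'd want to run in production; it assumes the bucket path is in UTC and uses a crude regex instead of a JSON parser just to stay self-contained):

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative check: given the hour a file was bucketed into (taken from
// its HDFS path) and a raw tweet JSON line, decide whether the tweet's
// created_at actually falls inside that hour.
public class HourBucketCheck {

  // Twitter's created_at format in the raw JSON, e.g. "Tue May 14 10:59:59 +0000 2013"
  private static final SimpleDateFormat CREATED_AT =
      new SimpleDateFormat("EEE MMM dd HH:mm:ss Z yyyy", Locale.ENGLISH);

  // Parses the date part of the bucket path, e.g. .../tweets/2013/05/14/11/
  private static final SimpleDateFormat BUCKET =
      new SimpleDateFormat("yyyy/MM/dd/HH");

  private static final Pattern CREATED_AT_FIELD =
      Pattern.compile("\"created_at\":\"([^\"]+)\"");

  static {
    // Assumption: the bucket path was rendered in UTC.
    BUCKET.setTimeZone(TimeZone.getTimeZone("UTC"));
  }

  // True if the tweet's created_at lies inside the hour named by the bucket path.
  static boolean belongsToBucket(String bucketPath, String tweetJson) throws Exception {
    Matcher m = CREATED_AT_FIELD.matcher(tweetJson);
    if (!m.find()) {
      return false; // no created_at field found; treat as out of place
    }
    Date created = CREATED_AT.parse(m.group(1));
    Date bucketStart = BUCKET.parse(bucketPath);
    long hourMillis = 60L * 60L * 1000L;
    long offset = created.getTime() - bucketStart.getTime();
    return offset >= 0 && offset < hourMillis;
  }

  public static void main(String[] args) throws Exception {
    String bucket = "2013/05/14/11";
    String tweet = "{\"created_at\":\"Tue May 14 10:59:59 +0000 2013\",\"text\":\"...\"}";
    // Prints false: the record sits under hour 11 but was created in hour 10.
    System.out.println(belongsToBucket(bucket, tweet));
  }
}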