I am using Cloudera’s example source to collect a sample of Twitter’s stream, 
with the output partitioned by year -> month -> day -> hour. 
https://github.com/cloudera/cdh-twitter-example/blob/master/flume-sources/src/main/java/com/cloudera/flume/source/TwitterSource.java
 

The timestamp of each event is set by:
headers.put("timestamp", String.valueOf(status.getCreatedAt().getTime()));

My agent config:
TwitterAgent.sinks.HDFS.hdfs.path=hdfs://kronos.feeb.co:8020/user/flume/tweets/%Y/%m/%d/%H/

However, I see that almost every hour contains at least one record (more 
often several) from the last second of the previous hour. 
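To check which hour directory a given record ought to land in, the mapping from the "timestamp" header (epoch milliseconds) to a %Y/%m/%d/%H path can be reproduced outside Flume. The following is only a minimal sketch, not Flume's own code: the class name BucketCheck and the method bucketPath are hypothetical, and the time zone is passed in explicitly because the zone the sink actually uses depends on the agent's configuration.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

// Hypothetical helper (not part of Flume): maps an epoch-millisecond
// timestamp to a year/month/day/hour path, mirroring %Y/%m/%d/%H.
public class BucketCheck {
    private static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("yyyy/MM/dd/HH");

    // The zone is an explicit parameter; which zone matches the sink's
    // behaviour is an assumption to verify against the agent config.
    static String bucketPath(long epochMillis, ZoneId zone) {
        return FMT.withZone(zone).format(Instant.ofEpochMilli(epochMillis));
    }

    public static void main(String[] args) {
        ZoneId utc = ZoneId.of("UTC");
        // Last second of the first hour of the epoch.
        System.out.println(bucketPath(3_599_000L, utc)); // 1970/01/01/00
        // First instant of the following hour.
        System.out.println(bucketPath(3_600_000L, utc)); // 1970/01/01/01
    }
}
```

Feeding the created-at value of one of the stray records through bucketPath and comparing the result with the directory it was actually written to shows whether the header value or the path resolution is responsible.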

Is there any way to prevent these overlaps in the data? 
They make hourly aggregation without dropping data unnecessarily messy.
