Are you accounting for data that has been written but not yet hflushed at the time of the query? Basically, one transaction per sink?
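For example, files the HDFS sink still has open carry the in-use suffix (".tmp" by default), and events written to them may not be visible to a Hive query until the file is hflushed or closed. A quick check along these lines would show whether that accounts for the missing ~0.1% (a sketch only; the paths assume the hdfs.path layout from your config, and one event per line since you use fileType = DataStream):

    # List files the HDFS sink has not closed yet; their most recent
    # events may not be visible to readers at query time.
    hdfs dfs -ls -R /data | grep '\.tmp$'

    # Count events only in files that have been closed/rolled
    # (closed file names end in the numeric counter, not ".tmp").
    hdfs dfs -cat '/data/*/log.*[0-9]' | wc -l

If those counts match your online count once the open files roll, nothing is actually lost; the query is just racing the sink.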
Thanks,
Hari

On Wed, Jan 21, 2015 at 6:42 PM, Jay Alexander <[email protected]> wrote:

> I used *flume-ng 1.5* to collect logs.
> There are two agents in the data flow, running on two separate hosts,
> and the data is sent *from agent1 to agent2*.
> The agents' components are as follows:
>
> agent1: spooling dir source --> file channel --> avro sink
> agent2: avro source --> file channel --> hdfs sink
>
> But roughly 0.1% of about a million events appears to be lost. To find
> the problem I tried these steps:
> 1. Looked through both agents' logs: no errors or exceptions.
> 2. Checked both agents' monitoring metrics: the number of events put
>    into and taken from each channel is always equal.
> 3. Counted the events with a Hive query and with a shell command over
>    the HDFS files, respectively: the two counts match each other but
>    are lower than the number of events produced online.
>
> These are the two agents' configurations:
>
> #agent1
> agent1.sources = src_spooldir
> agent1.channels = chan_file
> agent1.sinks = sink_avro
>
> #source
> agent1.sources.src_spooldir.type = spooldir
> agent1.sources.src_spooldir.spoolDir = /data/logs/flume-spooldir
> agent1.sources.src_spooldir.interceptors = i1
>
> #interceptors
> agent1.sources.src_spooldir.interceptors.i1.type = regex_extractor
> agent1.sources.src_spooldir.interceptors.i1.regex = (\\d{4}-\\d{2}-\\d{2}).*
> agent1.sources.src_spooldir.interceptors.i1.serializers = s1
> agent1.sources.src_spooldir.interceptors.i1.serializers.s1.name = dt
>
> #sink
> agent1.sinks.sink_avro.type = avro
> agent1.sinks.sink_avro.hostname = 10.235.2.212
> agent1.sinks.sink_avro.port = 9910
>
> #channel
> agent1.channels.chan_file.type = file
> agent1.channels.chan_file.checkpointDir = /data/flume/agent1/checkpoint
> agent1.channels.chan_file.dataDirs = /data/flume/agent1/data
>
> agent1.sources.src_spooldir.channels = chan_file
> agent1.sinks.sink_avro.channel = chan_file
>
> # agent2
> agent2.sources = source1
> agent2.channels = channel1
> agent2.sinks = sink1
>
> # source
> agent2.sources.source1.type = avro
> agent2.sources.source1.bind = 10.235.2.212
> agent2.sources.source1.port = 9910
>
> # sink
> agent2.sinks.sink1.type = hdfs
> agent2.sinks.sink1.hdfs.fileType = DataStream
> agent2.sinks.sink1.hdfs.filePrefix = log
> agent2.sinks.sink1.hdfs.path = hdfs://hnd.hadoop.jsh:8020/data/%{dt}
> agent2.sinks.sink1.hdfs.rollInterval = 600
> agent2.sinks.sink1.hdfs.rollSize = 0
> agent2.sinks.sink1.hdfs.rollCount = 0
> agent2.sinks.sink1.hdfs.idleTimeout = 300
> agent2.sinks.sink1.hdfs.round = true
> agent2.sinks.sink1.hdfs.roundValue = 10
> agent2.sinks.sink1.hdfs.roundUnit = minute
>
> # channel
> agent2.channels.channel1.type = file
> agent2.channels.channel1.checkpointDir = /data/flume/agent2/checkpoint
> agent2.channels.channel1.dataDirs = /data/flume/agent2/data
>
> agent2.sinks.sink1.channel = channel1
> agent2.sources.source1.channels = channel1
>
> Any suggestions are welcome!
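On the "one transaction per sink" point: with the config above the HDFS sink uses its default batch size, so each sink transaction takes a fixed-size batch of events from the file channel, and a batch that is mid-transaction at query time is not on HDFS yet. Making the sizes explicit removes one variable; a sketch (parameter names are standard file-channel/HDFS-sink settings, the values here are only illustrative):

    # agent2: make the sink batch and channel transaction sizes explicit.
    # The sink's hdfs.batchSize must not exceed the channel's
    # transactionCapacity.
    agent2.channels.channel1.transactionCapacity = 10000
    agent2.sinks.sink1.hdfs.batchSize = 10000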
