Hi Bojan,
Sorry about being late in responding to this.
Your setup is of course possible, using host headers or just headers
supplied by whatever is feeding the data to Flume.
The issue is that when HDFS has to write to 120 different files, batches get
split up across the files, so the writes are not particularly efficient.
One approach is to just write everything to the same file and then
post-process it. Another is to try to group files to sinks: you could use
an interceptor to set specific header(s) and route events carrying them to
a specific channel. There are multiple strategies here, each with its own
benefits and drawbacks, but at the end of the day, writing one big
HDFS file is far more efficient than writing lots of small ones.
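For illustration only, here is a rough sketch of the header-based approach.
The agent and component names, the port and the paths below are made up,
and the %{...} escapes assume the corresponding header is actually set on
each event, either by your SDK client or by an interceptor:

collector.sources = avroIn
collector.channels = memCh
collector.sinks = hdfsOut

# Avro source receiving events from the SDK clients on the log servers
collector.sources.avroIn.type = avro
collector.sources.avroIn.bind = 0.0.0.0
collector.sources.avroIn.port = 4141
collector.sources.avroIn.channels = memCh

# The built-in host interceptor adds a "host" header, but it records the
# hostname of the agent it runs on. To tag the originating server, either
# run an agent with this interceptor on each log server, or have the SDK
# client set the header itself and drop these two lines.
collector.sources.avroIn.interceptors = hostInt
collector.sources.avroIn.interceptors.hostInt.type = host

collector.channels.memCh.type = memory
collector.channels.memCh.capacity = 100000
collector.channels.memCh.transactionCapacity = 1000

# A single HDFS sink; the %{host} escape in hdfs.path groups events by
# header value instead of requiring one sink per server. A second,
# client-set header could be escaped too (e.g. /logs/%{host}/%{logname},
# where "logname" is a hypothetical header naming the source file), but
# every distinct path means another open file, which is exactly what
# hurts batching.
collector.sinks.hdfsOut.type = hdfs
collector.sinks.hdfsOut.channel = memCh
collector.sinks.hdfsOut.hdfs.path = /logs/%{host}
collector.sinks.hdfsOut.hdfs.fileType = DataStream
collector.sinks.hdfsOut.hdfs.rollInterval = 300
collector.sinks.hdfsOut.hdfs.batchSize = 1000

# Alternative grouping: a multiplexing channel selector routes events with
# a particular header value to a particular channel (and from there sink):
# collector.sources.avroIn.selector.type = multiplexing
# collector.sources.avroIn.selector.header = host
# collector.sources.avroIn.selector.mapping.server0 = memCh
# collector.sources.avroIn.selector.default = memCh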
On 11/06/2013 06:39 PM, Bojan Kostić wrote:
It was late when I wrote the last mail, and my explanation was not clear.
I will illustrate:
20 servers, each with 60 different log files.
I was thinking that I could have this kind of structure on hdfs:
/logs/server0/logstat0.log
/logs/server0/logstat1.log
.
.
.
/logs/server20/logstat0.log
.
.
.
But from your info I see that I can't do that.
I could try to add a server id column to every file and then aggregate
the files from all servers into one file:
/logs/logstat0.log
/logs/logstat1.log
.
.
.
But then again I would need 60 sinks.
On Nov 6, 2013 2:02 AM, "Roshan Naik" <[email protected]> wrote:
I assume you mean you have 120 source files to be streamed into HDFS.
There is not a 1-1 correspondence between source files and destination
HDFS files. If they are on the same host, you can have them all picked up
through one source, one channel and one HDFS sink... winding up in a
single HDFS file.
In case you have a config with multiple HDFS sinks (part of a single agent
or spanning multiple agents), you want to ensure each HDFS sink writes to
a separate file in HDFS.
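For example, a minimal single-flow config might look like the sketch below.
The names and paths are hypothetical, and the spooling directory source
assumes rotated/completed log files are dropped into one directory; for
live files you would need an exec source or the SDK client instead:

a1.sources = spool
a1.channels = c1
a1.sinks = k1

a1.sources.spool.type = spooldir
a1.sources.spool.spoolDir = /var/log/flume-spool
a1.sources.spool.channels = c1

a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000

# One HDFS sink, one target directory: events from all the source files
# end up in a single HDFS file per roll.
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /logs/combined
a1.sinks.k1.hdfs.filePrefix = logstat
a1.sinks.k1.hdfs.fileType = DataStream

# If you add more HDFS sinks (in this agent or another), give each its own
# hdfs.path or hdfs.filePrefix so they write to separate files.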
On Tue, Nov 5, 2013 at 4:41 PM, Bojan Kostić <[email protected]> wrote:
Hello Roshan,
Thanks for the response.
But I am now confused. If I have 120 files, do I need to configure
120 sinks/sources/channels separately? Or have I missed something in
the docs?
Maybe I should use a fan-out flow? But then again I would have to set
120 params.
Best regards.
On Nov 5, 2013 8:47 PM, "Roshan Naik" <[email protected]> wrote:
Yes, to avoid them clobbering each other's writes.
On Tue, Nov 5, 2013 at 4:34 AM, Bojan Kostić <[email protected]> wrote:
Sorry for the late response, but I lost this email somehow.
Thanks for the read, it is a nice start even though it is old.
And the numbers are really promising.
I'm testing the memory channel; there are about 20 data
sources (log servers) with 60 different files each.
My RPC client app is basic, like in the examples, but it
has load balancing across two Flume agents which are
writing data to HDFS.
I think I read somewhere that you should have one sink
per file. Is that true?
Best regards, and sorry again for the late response.
On Oct 22, 2013 8:50 AM, "Juhani Connolly" <[email protected]> wrote:
Hi Bojan,
This is pretty old, but Mike did some testing on
performance about a year and a half ago:
https://cwiki.apache.org/confluence/display/FLUME/Flume+NG+Syslog+Performance+Test+2012-04-30
He was getting a max of 70k events/sec on a single
machine.
Thing is, this is a result of a huge number of variables:
- Parallelization of flows allows better parallel processing
- Use of the memory channel as opposed to a slower persistent channel
- Possibly the source; I have no idea how you wrote your app
- Batching of events is important. Also, are all events written to one
file, or are they split over many? Every file is processed separately.
- Network congestion and your HDFS setup
Reaching 100k events per second is definitely possible.
The resources you need for it will vary significantly
depending on your setup. The more HA-type features you
use, the slower delivery is likely to become. On the
flip side, allowing fairly lax conditions that carry a
small potential for data loss (on a crash, for example,
memory channel contents are gone) will allow close to
100k even on a single machine.
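As a rough illustration, these are the kinds of knobs involved. This is
only a fragment with arbitrary numbers, not a recommendation:

a1.channels.c1.type = memory
# Events buffered in memory; lost if the agent crashes
a1.channels.c1.capacity = 1000000
# Events moved per transaction
a1.channels.c1.transactionCapacity = 10000

# Bigger sink batches mean fewer, larger writes per HDFS file
a1.sinks.k1.hdfs.batchSize = 10000

# Several sinks can drain the same channel in parallel, each writing its
# own HDFS file (distinct hdfs.path or hdfs.filePrefix)
a1.sinks = k1 k2
a1.sinks.k2.type = hdfs
a1.sinks.k2.channel = c1
a1.sinks.k2.hdfs.batchSize = 10000
a1.sinks.k2.hdfs.filePrefix = part2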
On 10/14/2013 09:00 PM, Bojan Kostić wrote:
Hi, this is my first post here, but I have been playing
with Flume for some time now.
My question is: how well does Flume scale?
Can Flume ingest 100k+ events per second? Has
anyone tried something like this?
I created a simple test and the results are really slow.
I wrote a simple app with an RPC client with
fallback, using the Flume SDK, which reads a
dummy log file.
In the end I have two Flume agents which are
writing to HDFS.
rollInterval = 60
And in HDFS I get files of ~12MB.
Do I need to use some complex topology with 3
tiers?
How many Flume agents should write to HDFS?
Best regards.