Mohit,

Are you using the memory channel? You mention you are getting OOMEs, but you don't say what heap you are setting on the Flume JVM.
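If it is the memory channel, remember the whole channel lives on the heap, so channel capacity times event size has to fit inside -Xmx. As a rough sketch (untested, and the numbers here are placeholders to tune for your load, not recommendations):

    # conf/flume-env.sh -- raise the agent heap; the stock flume-ng script
    # only asks for a tiny heap by default (20MB last I looked)
    export JAVA_OPTS="-Xms512m -Xmx1024m"

    # memory channel sizing -- capacity is counted in events,
    # and every buffered event is held on the heap
    agent.channels = mc
    agent.channels.mc.type = memory
    agent.channels.mc.capacity = 100000
    agent.channels.mc.transactionCapacity = 1000

At ~137 bytes per entry, 100k buffered events is only ~14MB of payload (plus per-event header and object overhead), so a 512MB-1GB heap leaves plenty of headroom at your 2000 events/sec.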
Don't run an agent on the namenode. Occasionally you will see folks installing an agent on one of the datanodes in the cluster, but it's not typically recommended.

It's fine to install the agent on your webserver, but perhaps a more scalable approach would be to dedicate two servers to Flume agents. This will allow you to load balance your writes into the Flume pipeline at some point. As you scale you will not want every agent writing to HDFS, so at some point you may consider adding a collector tier that aggregates the flow and reduces the number of connections going into your HDFS cluster (rough config sketch at the bottom of this mail).

-Jeff

On Thu, Apr 3, 2014 at 6:20 AM, Mohit Durgapal <durgapalmo...@gmail.com> wrote:

> Hi,
>
> We are setting up a Flume cluster, but we are facing some issues related
> to heap size (out of memory). Is there a standard configuration for a
> standard load?
>
> If there is, can you suggest what it would be for the load stats given
> below?
>
> Also, we are not sure which topology to go with for our use case.
>
> We basically have two web servers which can generate logs at a rate of
> 2000 entries per second, each entry around 137 bytes in size.
>
> Currently we use rsyslog (writing to a TCP port), to which a PHP script
> writes these logs. And we are running a local Flume agent on each
> webserver; these local agents listen on a TCP port and put data directly
> into HDFS.
>
> So localhost:tcpport is the "flume source" and "hdfs" is the flume sink.
>
> I am torn between three approaches:
>
> Approach 1: Web server, rsyslog & Flume agent on the same machine, and a
> Flume collector running on the namenode in the Hadoop cluster to collect
> the data and dump it into HDFS.
>
> Approach 2: Web server and rsyslog on the same machine, and a Flume
> collector (listening on a remote port for events written by rsyslog on
> the web server) running on the namenode in the Hadoop cluster to collect
> the data and dump it into HDFS.
>
> Approach 3: Web server, rsyslog & Flume agent on the same machine, with
> all agents writing directly to HDFS.
>
> Also, we are using Hive, so we are writing directly into partitioned
> directories, and we want an approach that allows us to write to hourly
> partitions.
>
> I hope that's not too vague.
>
> Regards,
> Mohit Durgapal
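P.S. To make the two-tier idea concrete, here is a rough, untested sketch of the kind of config I mean. Hostnames, ports, and paths are made up, so adjust for your environment. The webserver agent takes the rsyslog feed and round-robins it across two collectors; each collector fans the flow into HDFS using the HDFS sink's time escapes, which line up with the hourly Hive partitions you asked about:

    # --- webserver agent: rsyslog in, avro out to two collectors ---
    ws.sources = syslog
    ws.channels = mc
    ws.sinks = c1 c2

    ws.sources.syslog.type = syslogtcp
    ws.sources.syslog.host = 127.0.0.1
    ws.sources.syslog.port = 5140
    ws.sources.syslog.channels = mc

    ws.channels.mc.type = memory
    ws.channels.mc.capacity = 100000

    ws.sinks.c1.type = avro
    ws.sinks.c1.hostname = collector1.example.com
    ws.sinks.c1.port = 4545
    ws.sinks.c1.channel = mc

    ws.sinks.c2.type = avro
    ws.sinks.c2.hostname = collector2.example.com
    ws.sinks.c2.port = 4545
    ws.sinks.c2.channel = mc

    # round-robin across the collectors; failover is the other processor type
    ws.sinkgroups = g1
    ws.sinkgroups.g1.sinks = c1 c2
    ws.sinkgroups.g1.processor.type = load_balance
    ws.sinkgroups.g1.processor.selector = round_robin

    # --- collector agent: avro in, hourly-partitioned HDFS out ---
    col.sources = avro
    col.channels = mc
    col.sinks = hdfs

    col.sources.avro.type = avro
    col.sources.avro.bind = 0.0.0.0
    col.sources.avro.port = 4545
    col.sources.avro.channels = mc

    col.channels.mc.type = memory
    col.channels.mc.capacity = 100000

    col.sinks.hdfs.type = hdfs
    col.sinks.hdfs.channel = mc
    # %Y%m%d and %H expand from the event timestamp (or the collector's
    # clock, with useLocalTimeStamp), giving one directory per hour for Hive
    col.sinks.hdfs.hdfs.path = hdfs://yournamenode:8020/logs/dt=%Y%m%d/hr=%H
    col.sinks.hdfs.hdfs.fileType = DataStream
    col.sinks.hdfs.hdfs.useLocalTimeStamp = true

With this layout only the two collectors hold connections into HDFS, and you can add webserver agents without touching the cluster side.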