Thanks Andrew, Ashish and Sharinder for your responses.

I have a large number of JSON files, about 2 KB each, on Tomcat servers. We
are using rsync to get the files from the Tomcat servers to the EC2 compute
instances. Let's say we have 4 Tomcat servers: do we need 4 EC2 machines
with Flume on them?

On each Flume machine, we have a folder that is rsynced from the Tomcat
server folder. The Flume source then points to that input folder, and after
processing (we are planning to use Morphlines) the output is written as CSV
files and uploaded to S3.
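
To make this concrete, the per-machine setup I have in mind looks roughly
like the sketch below. The agent name, directories, bucket name, morphline
file and morphline id are placeholders I made up. Writing to S3 through the
HDFS sink with an s3n:// path is an assumption on my part (it would need the
Hadoop S3 jars and credentials available to the agent), and I am also
assuming each JSON document sits on a single line so that the default line
deserializer yields one event per document.

```properties
# One agent per EC2 machine: spooling directory -> file channel -> S3
a1.sources = src1
a1.channels = ch1
a1.sinks = s3sink

# Source reads files that rsync drops into the input folder (placeholder path)
a1.sources.src1.type = spooldir
a1.sources.src1.spoolDir = /data/flume/incoming
# Assumption: skip rsync's dot-prefixed temporary files until they are renamed
a1.sources.src1.ignorePattern = ^[.].*$
a1.sources.src1.channels = ch1

# Morphline interceptor does the JSON -> CSV step (file and id are placeholders)
a1.sources.src1.interceptors = morph
a1.sources.src1.interceptors.morph.type = org.apache.flume.sink.solr.morphline.MorphlineInterceptor$Builder
a1.sources.src1.interceptors.morph.morphlineFile = /etc/flume/conf/json-to-csv.conf
a1.sources.src1.interceptors.morph.morphlineId = jsonToCsv

# Durable channel so events survive an agent restart (placeholder paths)
a1.channels.ch1.type = file
a1.channels.ch1.checkpointDir = /data/flume/checkpoint
a1.channels.ch1.dataDirs = /data/flume/data

# Assumption: HDFS sink pointed at an S3 bucket (hypothetical bucket name)
a1.sinks.s3sink.type = hdfs
a1.sinks.s3sink.channel = ch1
a1.sinks.s3sink.hdfs.path = s3n://example-bucket/flume/csv
a1.sinks.s3sink.hdfs.fileType = DataStream
a1.sinks.s3sink.hdfs.writeFormat = Text
```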

Can anyone send me some examples of a tiered Flume architecture? By
collection agents, do you mean a set of machines in which each machine gets
data from multiple Tomcat servers? And behind the Collection tier, is there
a second set of machines in a 1-1 relationship with it, i.e. a
Transformation tier whose Flume instances run Morphlines and then write the
CSV output to S3? Also, does this setup support HA, etc.?
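
My current understanding of the two-tier layout, which I would be glad to
have corrected, is something like the sketch below: collection agents
forward events over Avro to transformation agents, and a failover sink
group covers the hop between the tiers. Agent names, host names, ports,
paths and the bucket are all placeholders, and the S3 write via the HDFS
sink is again an assumption.

```properties
# --- Collection-tier agent (reads the rsynced folder, forwards events) ---
collector.sources = src1
collector.channels = ch1
collector.sinks = k1 k2
collector.sinkgroups = g1

collector.sources.src1.type = spooldir
collector.sources.src1.spoolDir = /data/flume/incoming
collector.sources.src1.channels = ch1

collector.channels.ch1.type = file

# Two Avro sinks, each pointing at a different transformation agent
collector.sinks.k1.type = avro
collector.sinks.k1.channel = ch1
collector.sinks.k1.hostname = transform-1.example.internal
collector.sinks.k1.port = 4545

collector.sinks.k2.type = avro
collector.sinks.k2.channel = ch1
collector.sinks.k2.hostname = transform-2.example.internal
collector.sinks.k2.port = 4545

# Failover: k1 is preferred, k2 takes over while k1 is unavailable
collector.sinkgroups.g1.sinks = k1 k2
collector.sinkgroups.g1.processor.type = failover
collector.sinkgroups.g1.processor.priority.k1 = 10
collector.sinkgroups.g1.processor.priority.k2 = 5
collector.sinkgroups.g1.processor.maxpenalty = 10000

# --- Transformation-tier agent (same config on transform-1 and transform-2) ---
transformer.sources = avroIn
transformer.channels = ch1
transformer.sinks = s3sink

transformer.sources.avroIn.type = avro
transformer.sources.avroIn.bind = 0.0.0.0
transformer.sources.avroIn.port = 4545
transformer.sources.avroIn.channels = ch1

# Morphline step runs here instead of on the collectors (placeholder file/id)
transformer.sources.avroIn.interceptors = morph
transformer.sources.avroIn.interceptors.morph.type = org.apache.flume.sink.solr.morphline.MorphlineInterceptor$Builder
transformer.sources.avroIn.interceptors.morph.morphlineFile = /etc/flume/conf/json-to-csv.conf
transformer.sources.avroIn.interceptors.morph.morphlineId = jsonToCsv

transformer.channels.ch1.type = file

transformer.sinks.s3sink.type = hdfs
transformer.sinks.s3sink.channel = ch1
transformer.sinks.s3sink.hdfs.path = s3n://example-bucket/flume/csv
transformer.sinks.s3sink.hdfs.fileType = DataStream
```

If I read the Flume user guide correctly, the tiers would not need a 1-1
mapping in this layout, since any collector can list several transformers
in its sink group, and setting processor.type to load_balance instead of
failover would spread traffic across them.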

Please advise.

Thanks.

On Thu, Sep 4, 2014 at 11:08 PM, Ashish <[email protected]> wrote:

> I would recommend using an Interceptor for this and possibly a modified
> Flume topology. If the JSON files have a large number of rows, or there is
> a very high number of files, go for a Collection tier and use another level
> of agents that use interceptors for the DB lookup and CSV generation.
> Something like:
>
> Collection Agents -> Transformation Agents (writing to S3 Sinks)
>
> You can scale out the Transformation/Collection layer agents based on the
> traffic volume.
>
> thanks
>
>
>
>
> On Fri, Sep 5, 2014 at 8:23 AM, Kevin Warner <[email protected]>
> wrote:
>
>> Hello All,
>> We have the following configuration:
>> Source->Channel->Sink
>>
>> Now, the source is pointing to a folder that has lots of JSON files. The
>> channel is file based so that there is fault tolerance, and the Sink is
>> putting CSV files on S3.
>>
>> Now, there is code written in the Sink that takes the JSON events, does
>> some MySQL database lookups and generates CSV files to be put into S3.
>>
>> The question is: is the Sink the right place for this code, or should the
>> code be running in the Channel, since the ACID guarantees are provided by
>> the Channel? Please advise.
>>
>> -Kev
>>
>>
>
>
>
> --
> thanks
> ashish
>
> Blog: http://www.ashishpaliwal.com/blog
> My Photo Galleries: http://www.pbase.com/ashishpaliwal
>
