Why do you need to remove the timestamp from the file names? As for creating the files on an hourly basis, it sounds like what you want is a dataset partitioned by hour. This is very easy to do using Kite and Flume's Kite integration. You'd create a partitioned dataset similar to the ratings dataset shown in this example:

https://github.com/kite-sdk/kite/wiki/Hands-on-Kite-Lab-1:-Using-the-Kite-CLI-(Solution)

and then you'd configure Flume to write to the dataset similar to how we do here:

https://github.com/kite-sdk/kite-examples/blob/snapshot/logging/flume.properties

This will write your data into a directory structure that looks like:

/flume/events/year=2014/month=09/day=24/hour=14

You can then access the dataset using Kite's view API to get just the last hour of data:

view:hdfs:/flume/events?timestamp=($AN_HOUR_AGO,)

That would get all of the data from an hour ago until now. Feel free to email the Kite user list ([email protected]) or me offline if you want more help with those aspects of it.

If you wanted to do the same thing with the regular Flume HDFS sink, you need to either have a timestamp header in your events or set hdfs.useLocalTimeStamp to true to use a local timestamp, and then you need to define your path with a pattern:

agent1.sinks.hdfs-sink1.hdfs.path = hdfs://192.168.145.127:9000/flume/events/year=%y/month=%m/day=%d/hour=%H

See this link for more details: http://flume.apache.org/FlumeUserGuide.html#hdfs-sink

-Joey
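For reference, a minimal sketch of what that hour-partitioned HDFS sink might look like, reusing the agent and sink names from the thread (agent1, hdfs-sink1) and the same NameNode address; the roll and round values below are illustrative, not taken from the thread:

agent1.sinks.hdfs-sink1.type = hdfs
# %Y/%m/%d/%H expand to a four-digit year, month, day and hour, matching the
# /flume/events/year=2014/month=09/day=24/hour=14 layout shown above
agent1.sinks.hdfs-sink1.hdfs.path = hdfs://192.168.145.127:9000/flume/events/year=%Y/month=%m/day=%d/hour=%H
# Use the agent's local clock instead of requiring a timestamp header on every event
agent1.sinks.hdfs-sink1.hdfs.useLocalTimeStamp = true
# Round the timestamp down to the hour before it is substituted into the path
agent1.sinks.hdfs-sink1.hdfs.round = true
agent1.sinks.hdfs-sink1.hdfs.roundValue = 1
agent1.sinks.hdfs-sink1.hdfs.roundUnit = hour
# Roll on time only, as in the original configuration (interval illustrative)
agent1.sinks.hdfs-sink1.hdfs.rollInterval = 3600
agent1.sinks.hdfs-sink1.hdfs.rollCount = 0
agent1.sinks.hdfs-sink1.hdfs.rollSize = 0
agent1.sinks.hdfs-sink1.hdfs.writeFormat = Text
agent1.sinks.hdfs-sink1.hdfs.fileType = DataStream

With the hour encoded in the directory structure, the timestamp in the individual file names no longer matters for pulling out an hour's worth of data.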
On Wed, Sep 24, 2014 at 2:31 PM, Hanish Bansal <[email protected]> wrote:
> Thanks for the reply!!
>
> Reposting this since my earlier post was skipped.
>
> I have configured Flume and my application to use the log4j-appender. As
> described in the log4j-appender documentation, I have configured the agent
> properly (Avro source, memory channel and HDFS sink).
>
> I am able to collect logs in HDFS.
>
> The hdfs-sink configuration is:
>
> agent1.sinks.hdfs-sink1.type = hdfs
> agent1.sinks.hdfs-sink1.hdfs.path = hdfs://192.168.145.127:9000/flume/events/
> agent1.sinks.hdfs-sink1.hdfs.rollInterval=30
> agent1.sinks.hdfs-sink1.hdfs.rollCount=0
> agent1.sinks.hdfs-sink1.hdfs.rollSize=0
> agent1.sinks.hdfs-sink1.hdfs.writeFormat=Text
> agent1.sinks.hdfs-sink1.hdfs.fileType=DataStream
>
> Files created in HDFS are named based on the hdfs.filePrefix and
> hdfs.fileSuffix configuration and also include the current timestamp in
> the name.
>
> I am getting HDFS files in the following format:
> /flume/events/FlumeData.1411543838171
> /flume/events/FlumeData.1411544272696
> ....
> ....
>
> A custom file name can be specified using the hdfs.filePrefix and
> hdfs.fileSuffix configuration.
>
> What I want to know is: can I somehow remove the timestamp from the file
> name?
>
> I am planning to use the Flume log4j-appender in the following way; please
> let me know if there is any issue:
>
> 1. A distributed application (say map-reduce) is running on different nodes.
> 2. A Flume agent is running on only one node.
> 3. The application on all nodes is configured to use the same agent. This
> way all nodes publish logs to the same Avro source.
> 4. The Flume agent writes the data to HDFS using the HDFS sink. Files
> should be created on an hourly basis, with one file created for the logs
> of all nodes from the last hour.
>
> Thanks!!
>
> On 24/09/2014 9:57 pm, "Hari Shreedharan" <[email protected]> wrote:
>>
>> You can’t write to the same path from more than one sink - either the
>> directories have to be different or the hdfs.filePrefix has to be
>> different. This is because HDFS actually allows only one writer per file.
>> If multiple JVMs try to write to the same file, HDFS will throw an
>> exception to the other JVMs indicating that the lease has expired.
>>
>> Thanks, Hari
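Hari's constraint (one HDFS writer per file) is usually handled by giving each agent's sink its own file prefix. A rough sketch, assuming one agent per node writing to the shared directory; the prefix and suffix values are hypothetical:

# HDFS sink on node 1; the agent on node 2 would be identical except for the prefix
agent1.sinks.hdfs-sink1.type = hdfs
agent1.sinks.hdfs-sink1.hdfs.path = hdfs://192.168.145.127:9000/flume/events/
agent1.sinks.hdfs-sink1.hdfs.filePrefix = node1
agent1.sinks.hdfs-sink1.hdfs.fileSuffix = .log
# On node 2: agent1.sinks.hdfs-sink1.hdfs.filePrefix = node2

Note that this keeps every node's files in the same directory, but not in one shared file; a single file written to by several agents at once is exactly what HDFS's one-writer-per-file rule forbids.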
>>
>> On Wed, Sep 24, 2014 at 7:04 AM, Hanish Bansal <[email protected]> wrote:
>>>
>>> Another way to do this would be:
>>>
>>> Run a Flume agent on each node and configure the HDFS sink to use the
>>> same path.
>>>
>>> This way the logs of all nodes could also be kept in a single file in
>>> HDFS (log aggregation of all nodes).
>>>
>>> Please let me know if there is any issue with either use case.
>>>
>>> On Wed, Sep 24, 2014 at 7:21 PM, Hanish Bansal <[email protected]> wrote:
>>>>
>>>> Thanks Joey!!
>>>>
>>>> That's a great help for me.
>>>>
>>>> I have configured Flume and my application to use the log4j-appender.
>>>> As described in the log4j-appender documentation, I have configured the
>>>> agent properly (Avro source, memory channel and HDFS sink).
>>>>
>>>> I am able to collect logs in HDFS.
>>>>
>>>> The hdfs-sink configuration is:
>>>>
>>>> agent1.sinks.hdfs-sink1.type = hdfs
>>>> agent1.sinks.hdfs-sink1.hdfs.path = hdfs://192.168.145.127:9000/flume/events/
>>>> agent1.sinks.hdfs-sink1.hdfs.rollInterval=30
>>>> agent1.sinks.hdfs-sink1.hdfs.rollCount=0
>>>> agent1.sinks.hdfs-sink1.hdfs.rollSize=0
>>>> agent1.sinks.hdfs-sink1.hdfs.writeFormat=Text
>>>> agent1.sinks.hdfs-sink1.hdfs.fileType=DataStream
>>>>
>>>> Files created in HDFS are named based on the hdfs.filePrefix and
>>>> hdfs.fileSuffix configuration and also include the current timestamp
>>>> in the name.
>>>>
>>>> I am getting HDFS files in the following format:
>>>> /flume/events/FlumeData.1411543838171
>>>> /flume/events/FlumeData.1411544272696
>>>> ....
>>>> ....
>>>>
>>>> A custom file name can be specified using the hdfs.filePrefix and
>>>> hdfs.fileSuffix configuration.
>>>>
>>>> What I want to know is: can I somehow remove the timestamp from the
>>>> file name?
>>>>
>>>> I am planning to use the Flume log4j-appender in the following way;
>>>> please let me know if there is any issue:
>>>>
>>>> A distributed application (say map-reduce) is running on different nodes.
>>>> A Flume agent is running on only one node.
>>>> The application on all nodes is configured to use the same agent. This
>>>> way all nodes publish logs to the same Avro source.
>>>> The Flume agent writes the data to HDFS. Files should be created on an
>>>> hourly basis, with one file created for the logs of all nodes from the
>>>> last hour.
>>>>
>>>> Thanks!!
>>>>
>>>> On Tue, Sep 23, 2014 at 9:38 PM, Joey Echeverria <[email protected]> wrote:
>>>>>
>>>>> Inline below.
>>>>>
>>>>> On Mon, Sep 22, 2014 at 7:25 PM, Hanish Bansal <[email protected]> wrote:
>>>>>>
>>>>>> Thanks for the reply!!
>>>>>>
>>>>>> According to the architecture, multiple agents will run (one agent on
>>>>>> each node) and all agents will send log data (as events) to a common
>>>>>> collector, and that collector will write the data to HDFS.
>>>>>>
>>>>>> Here I have some questions:
>>>>>> How will the collector receive the data from the agents and write it
>>>>>> to HDFS?
>>>>>
>>>>> Agents are typically connected by having the AvroSink of one agent
>>>>> connect to the AvroSource of the next agent in the chain.
>>>>>>
>>>>>> Will data coming from different agents be mixed up with each other,
>>>>>> since the collector will write the data to HDFS in the order it
>>>>>> arrives at the collector?
>>>>>
>>>>> Data will enter the collector's channel in the order it arrives. Flume
>>>>> doesn't do any re-ordering of data.
>>>>>>
>>>>>> Can we use this logging feature for real-time logging? We can accept
>>>>>> this if all logs end up in HDFS in the same order in which they were
>>>>>> generated by the application.
>>>>>
>>>>> It depends on your requirements. If you need events to be sorted by
>>>>> timestamp, then writing directly out of Flume to HDFS is not
>>>>> sufficient. If you want your data in HDFS, then the best bet is to
>>>>> write to a partitioned dataset and run a batch job to sort partitions.
>>>>> For example, if you partitioned by hour, you'd have an hourly job to
>>>>> sort the last hour's worth of data. Alternatively, you could write the
>>>>> data to HBase, which can sort the data by timestamp during ingest. In
>>>>> either case, Kite would make it easier for you to integrate with the
>>>>> underlying store as it can write to both HDFS and HBase.
>>>>>
>>>>> -Joey
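A rough sketch of the agent-to-collector chaining Joey describes (the AvroSink of each node-level agent pointing at the collector's AvroSource); the agent names, host name, ports and channel settings are hypothetical:

# Node-tier agent, one per application node: accepts events locally and forwards them to the collector
node-agent.sources = avro-in
node-agent.channels = mem
node-agent.sinks = avro-out
node-agent.sources.avro-in.type = avro
node-agent.sources.avro-in.bind = 0.0.0.0
node-agent.sources.avro-in.port = 41414
node-agent.sources.avro-in.channels = mem
node-agent.channels.mem.type = memory
node-agent.sinks.avro-out.type = avro
node-agent.sinks.avro-out.hostname = collector-host
node-agent.sinks.avro-out.port = 4545
node-agent.sinks.avro-out.channel = mem

# Collector agent: receives from all node agents and writes to HDFS in arrival order
collector.sources = avro-in
collector.channels = mem
collector.sinks = hdfs-out
collector.sources.avro-in.type = avro
collector.sources.avro-in.bind = 0.0.0.0
collector.sources.avro-in.port = 4545
collector.sources.avro-in.channels = mem
collector.channels.mem.type = memory
collector.sinks.hdfs-out.type = hdfs
collector.sinks.hdfs-out.hdfs.path = hdfs://192.168.145.127:9000/flume/events/
collector.sinks.hdfs-out.channel = mem

As Joey notes, the collector writes events in the order they arrive, so any sorting by timestamp has to happen downstream.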
>>>>>>
>>>>>> Regards,
>>>>>> Hanish
>>>>>>
>>>>>> On 22/09/2014 10:12 pm, "Joey Echeverria" <[email protected]> wrote:
>>>>>>>
>>>>>>> Hi Hanish,
>>>>>>>
>>>>>>> The Log4jAppender is designed to connect to a Flume agent running an
>>>>>>> AvroSource. So, you'd configure Flume similar to [1] and then point
>>>>>>> the Log4jAppender at your agent using the log4j properties you
>>>>>>> linked to.
>>>>>>>
>>>>>>> The Log4jAppender will use Avro to inspect the object being logged
>>>>>>> to determine its schema and to serialize it to bytes, which become
>>>>>>> the body of the events sent to Flume. If you're logging Strings,
>>>>>>> which is most common, then the schema will just be a Schema.String.
>>>>>>> There are two ways that schema information can be passed. You can
>>>>>>> configure the Log4jAppender with a schema URL that will be sent in
>>>>>>> the event headers, or you can leave that out and a JSON
>>>>>>> representation of the schema will be sent as a header with each
>>>>>>> event. The URL is more efficient as it avoids sending extra
>>>>>>> information with each record, but you can leave it out to start
>>>>>>> your testing.
>>>>>>>
>>>>>>> With regards to your second question, the answer is no. Flume does
>>>>>>> not attempt to re-order events, so your logs will appear in arrival
>>>>>>> order. What I would do is write the data to a partitioned directory
>>>>>>> structure and then have a Crunch job that sorts each partition as
>>>>>>> it closes.
>>>>>>>
>>>>>>> You might consider taking a look at the Kite SDK[2], as we have some
>>>>>>> examples that show how to do the logging[3] and it can also handle
>>>>>>> getting the data properly partitioned on HDFS.
>>>>>>>
>>>>>>> HTH,
>>>>>>>
>>>>>>> -Joey
>>>>>>>
>>>>>>> [1] http://flume.apache.org/FlumeUserGuide.html#avro-source
>>>>>>> [2] http://kitesdk.org/docs/current/
>>>>>>> [3] https://github.com/kite-sdk/kite-examples/tree/snapshot/logging
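The log4j side of the setup Joey describes might look roughly like the following; the host name, port, and schema location are hypothetical, and AvroSchemaUrl can be left out to have a JSON representation of the schema sent in each event's headers instead:

log4j.rootLogger = INFO, flume
log4j.appender.flume = org.apache.flume.clients.log4jappender.Log4jAppender
# Host and port of the Flume agent's Avro source
log4j.appender.flume.Hostname = flume-agent-host
log4j.appender.flume.Port = 41414
# Don't let logging calls throw if the agent is unreachable
log4j.appender.flume.UnsafeMode = true
# Optional: reference the schema by URL instead of shipping the JSON schema with every event
log4j.appender.flume.AvroSchemaUrl = hdfs://192.168.145.127:9000/schemas/event.avsc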
>>>>>>>
>>>>>>> On Mon, Sep 22, 2014 at 4:21 AM, Hanish Bansal <[email protected]> wrote:
>>>>>>> > Hi All,
>>>>>>> >
>>>>>>> > I want to use the Flume log4j-appender for logging of a map-reduce
>>>>>>> > application which is running on different nodes. My use case is to
>>>>>>> > have the logs from all nodes in a centralized location (say HDFS)
>>>>>>> > with time-based synchronization.
>>>>>>> >
>>>>>>> > As described in the links below, Flume has its own appender which
>>>>>>> > can be used to log an application to HDFS directly from log4j.
>>>>>>> >
>>>>>>> > http://www.addsimplicity.com/adding_simplicity_an_engi/2010/10/sending-logs-down-the-flume.html
>>>>>>> >
>>>>>>> > http://flume.apache.org/FlumeUserGuide.html#log4j-appender
>>>>>>> >
>>>>>>> > Could anyone please tell me whether these logs are synchronized on
>>>>>>> > a time basis or not?
>>>>>>> >
>>>>>>> > What I mean by time-based synchronization is: having the logs from
>>>>>>> > different nodes in sorted order of time.
>>>>>>> >
>>>>>>> > Also, could anyone provide a link that describes how the Flume
>>>>>>> > log4j-appender works internally?
>>>>>>> >
>>>>>>> > --
>>>>>>> > Thanks & Regards
>>>>>>> > Hanish Bansal

--
Joey Echeverria
