Azuryy, I'm pretty sure that I can 'cat' the latest batch. Please see the evidence below:
(1)
>>> Flume.conf:

    a1.sinks.k1.hdfs.rollInterval = 3600
    a1.sinks.k1.hdfs.batchSize = 10

>>> I sent 21 events and I could 'cat' and verify this:

    $ hdfs dfs -cat /user/mon/input/flume/test/15-01-27/data.2015.01.27.13.1422361100490.tmp | wc -l
    21

>>> But when I submitted a MapReduce job on the above directory, it only picked up 11 records
    (batchSize is 10, but it always processes one event more than the batch size):

    Map-Reduce Framework:
        Map input records=11

(2)
>>> I then sent 9 more events and I could see that they were appended to the file:

    $ hdfs dfs -cat /user/wdtmon/atlas_xrd_mon/input/flume/test/15-01-27/data.2015.01.27.13.1422361100490.tmp | wc -l
    30

>>> However, when I executed the MapReduce job on the file, it still picked up only those 11 events:

    Map-Reduce Framework:
        Map input records=11

Any idea what's going on? (A small diagnostic sketch is appended below the quoted thread.)

On 27 January 2015 at 08:30, Azuryy Yu <[email protected]> wrote:

> Are you sure you can 'cat' the latest batch of the data on HDFS?
> For Flume, the data is available only after the file is rolled, because Flume
> only calls FileSystem.close() during file rolling.
>
>
> On Mon, Jan 26, 2015 at 8:17 PM, Uthayan Suthakar <
> [email protected]> wrote:
>
>> I have a Flume agent which streams data into an HDFS sink (appending to the
>> same file), which I can "hdfs dfs -cat" and see from HDFS. However, when I run
>> a MapReduce job on the folder that contains the appended data, it only picks
>> up the first batch that was flushed (batchSize = 100) into HDFS. The rest are
>> not picked up, although I can cat and see them. When I execute the MapReduce
>> job after the file is rolled (closed), it picks up all the data.
>>
>> Do you know why the MR job is failing to find the rest of the batches even
>> though they exist?
>>
>> This is what I'm trying to do:
>>
>> 1) Read a constant data flow from a message queue and write it into HDFS.
>> 2) Rolling is configured by interval (1 hour), e.g. hdfs.rollInterval = 3600.
>> 3) The number of events written to the file before flushing into HDFS is set
>>    to 100, e.g. hdfs.batchSize = 100.
>> 4) The append configuration is enabled at a lower level, e.g.
>>    hdfs.append.support = true.
>>
>> Snippet from the Flume source:
>>
>>    if (conf.getBoolean("hdfs.append.support", false) == true && hdfs.isFile(dstPath)) {
>>      outStream = hdfs.append(dstPath);
>>    } else {
>>      outStream = hdfs.create(dstPath);
>>    }
>>
>> 5) Now, all configurations for appending data into HDFS are in place.
>> 6) I tested Flume and I could see an hdfs://test/data/input/event1.tmp file
>>    get written into HDFS.
>> 7) When I ran hdfs dfs -cat hdfs://test/data/input/event1.tmp, I could see all
>>    the data that had been appended to the file, e.g. 500+ events.
>> 8) However, when I executed a simple MR job to read the folder
>>    hdfs://test/data/input, it only picked up the first 100 events, although
>>    the file had over 500 events.
>>
>> So it would appear that Flume is in fact appending data into HDFS, but the MR
>> job is failing to pick up everything, perhaps a block caching issue or a
>> partition issue? Has anyone come across this issue?
>>
>
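
P.S. In case it's related: a MapReduce job computes its input splits from the file
length that the NameNode reports (FileStatus.getLen()), whereas 'cat' on a file that
is still open for write can also read the flushed-but-not-yet-closed bytes of the
last block. Below is a rough diagnostic sketch I put together with the standard
Hadoop FileSystem API to compare the two lengths for the .tmp file; the class name
and the path argument are just placeholders, so treat it as an illustration rather
than a definitive check.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class VisibleLengthCheck {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        Path path = new Path(args[0]); // e.g. the data.*.tmp file above
        FileSystem fs = path.getFileSystem(conf);

        // Length recorded by the NameNode; FileInputFormat uses this value
        // when it plans the input splits for the MR job.
        long reportedLen = fs.getFileStatus(path).getLen();

        // Bytes actually readable through open(); for a file that is still
        // open for write this can include data flushed to the last block
        // but not yet reflected in the NameNode's length.
        long readableLen = 0;
        FSDataInputStream in = fs.open(path);
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) > 0) {
          readableLen += n;
        }
        in.close();

        System.out.println("NameNode-reported length: " + reportedLen);
        System.out.println("Readable length via open(): " + readableLen);
      }
    }

If the two numbers differ for the .tmp file, that would explain why the MR job only
sees the first batch while 'cat' sees everything.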
