Hello,
I have a Spark Streaming textFileStream watching an HDFS folder for new
csv files every minute. I must have gotten the timing just right: I
copied 5 new csv files ("part-0004" among them) into the watched folder
and then got an IOException saying the path "part-0004._COPYING_" does
not exist as the DStream kicked off a job. It looks like the 'hdfs dfs
-put' command first creates a temporary *._COPYING_ file, which got
picked up as one of the new files during the scan, but by the time the
job tried to read the data the temp file was already gone (renamed to
its final name), hence the path-does-not-exist error.
To get around this I expect I'll have to use fileStream instead and pass
in a filter function on filenames. Has anyone already written a filter
that weeds out other common temp files that I haven't run into yet?
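In case it helps the discussion, here's the kind of filter I'm picturing.
The patterns beyond *._COPYING_ are guesses at the usual suspects
(_SUCCESS/_temporary markers, "."-prefixed hidden files like checksums,
generic *.tmp), not a vetted list:

```scala
object TempFileFilter {
  // Returns true for names that look like finished data files,
  // false for names matching common in-flight/temp patterns.
  def isDataFile(name: String): Boolean =
    !name.startsWith(".") &&          // hidden files, e.g. .crc checksums
    !name.startsWith("_") &&          // _SUCCESS, _temporary from MR jobs
    !name.endsWith("._COPYING_") &&   // in-flight 'hdfs dfs -put' copies
    !name.endsWith(".tmp")            // generic temp suffix
}

// Wiring it into fileStream (ssc being an existing StreamingContext;
// mapping to the value gives the same line DStream textFileStream does):
// import org.apache.hadoop.fs.Path
// import org.apache.hadoop.io.{LongWritable, Text}
// import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
// val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](
//     "hdfs:///watched/folder",
//     (p: Path) => TempFileFilter.isDataFile(p.getName),
//     newFilesOnly = true
//   ).map(_._2.toString)
```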
On another note, this problem seems to make the textFileStream DStream
error-prone, if not useless, as is. Perhaps a file filter should be a
required parameter when creating one, in which case some sensible
defaults would also be nice.
Chris Regnier
-------------------------
Visualization Developer
Oculus Info Inc.