Assuming you are running Linux, an easy option would be to use the
Linux tail command to extract the last line (or last few lines) of
each file and save it to a separate file/directory before feeding
that to Spark. It shouldn't be hard to write a shell script that runs
tail on every file in a directory (or on every object in an S3 bucket
via the AWS CLI).
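Something along these lines for the local-directory case (an untested
sketch; both paths are made up):

#!/bin/sh
# Write the last line of every file in SRC to a matching file in DST.
# SRC and DST are hypothetical -- adjust to your layout.
SRC=/data/incoming
DST=/data/lastlines
mkdir -p "$DST"
for f in "$SRC"/*; do
  [ -f "$f" ] || continue            # skip subdirectories
  tail -n 1 "$f" > "$DST/$(basename "$f")"
done

For S3, I believe aws s3api get-object also accepts a --range option,
so you could fetch just the last few KB of each object (e.g.
--range bytes=-4096) rather than downloading whole files.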
If you really want this kind of file preprocessing done in Spark, you
will have to extend Spark's DataFrameReader API, which may not be an
easy task if you don't have experienced Scala developers.
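That said, if your files are uncompressed text, you can get close to
the file-seek approach you describe without a custom data source: list
the files on the driver, then seek near the end of each one inside an
RDD using Hadoop's FileSystem API. A rough, untested sketch
(spark-shell style; the bucket name and the 64 KB tail size are
assumptions, and depending on your setup you may need to ship the
driver's Hadoop configuration to the executors rather than building a
fresh one):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val dir = new Path("s3a://some-bucket/some-prefix/")  // made-up location
val tailBytes = 64 * 1024L                            // assumed max line length

// Driver side: listing the files is cheap; no file data is read here.
val files = FileSystem.get(dir.toUri, sc.hadoopConfiguration)
  .listStatus(dir)
  .filter(_.isFile)
  .map(s => (s.getPath.toString, s.getLen))
  .toSeq

// Executor side: seek to (length - tailBytes), read only the tail, and
// keep the text after the last newline. This cannot work on .gz files,
// since you can't seek into a gzip stream.
val lastLines = sc.parallelize(files).map { case (p, len) =>
  val path = new Path(p)
  val fs   = FileSystem.get(path.toUri, new Configuration())
  val in   = fs.open(path)
  try {
    val start = math.max(0L, len - tailBytes)
    in.seek(start)
    val buf = new Array[Byte]((len - start).toInt)
    in.readFully(buf)
    val text = new String(buf, "UTF-8").stripLineEnd
    (p, text.substring(text.lastIndexOf('\n') + 1))
  } finally in.close()
}

lastLines.toDF("file", "last_line").show(false)

The obvious caveat is that tailBytes must be larger than your longest
line, so this only works if you can bound the line length up front.
Hope this helps...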
-- ND
On 8/2/21 6:50 PM, Sayeh Roshan wrote:
Hi users,
Does anyone here have experience writing Spark code that reads just
the last line of each text file in a directory, S3 bucket, etc.?
I am looking for a solution that doesn’t require reading the whole
file. I basically wonder whether you can create a DataFrame/RDD using
a file seek. Not sure whether such a thing is already available in
Spark.
Thank you very much in advance.
---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org