Spark Streaming
<https://spark.apache.org/docs/latest/streaming-programming-guide.html> is
the best fit for this use case. Basically, you create a streaming context
pointing to that directory and set the batch interval (in your case, 5
minutes). Spark Streaming will only pick up new files as they appear in the
directory; files that have already been processed are not read again. At
the end of every batch, you could move the processed files to another
directory, so that if the application crashes it doesn't reprocess
everything from the beginning (a sketch of this is shown after the example
below).

Example: print the contents of the files in the HDFS directory /sigmoid:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext("spark://akhldz:7077", "Streaming Job",
      Seconds(300), "/home/akhld/mobi/spark-streaming/spark-0.8.0-incubating",
      List("target/scala-2.9.3/simple-project_2.9.3-1.0.jar"))
    val logData = ssc.textFileStream("hdfs://127.0.0.1:54310/sigmoid/")
    logData.print()
    ssc.start()
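
For the "move processed files" step, here is a minimal sketch using the
Hadoop FileSystem API. The moveProcessed helper and the
/sigmoid/processed/ destination are assumptions for illustration; Spark
Streaming does not move files for you, and you'd want to be careful to only
move files from batches that have already completed so you don't race with
newly arriving files:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Hypothetical helper (not part of Spark): move everything under `src`
    // into `dest` so it isn't re-read after a restart.
    def moveProcessed(src: String, dest: String): Unit = {
      val conf = new Configuration()
      // Resolve the filesystem from the source URI (here, HDFS).
      val fs = new Path(src).getFileSystem(conf)
      val destDir = new Path(dest)
      if (!fs.exists(destDir)) fs.mkdirs(destDir)
      fs.listStatus(new Path(src)).foreach { file =>
        fs.rename(file.getPath, new Path(destDir, file.getPath.getName))
      }
    }

    // Example usage (paths assumed):
    // moveProcessed("hdfs://127.0.0.1:54310/sigmoid/",
    //               "hdfs://127.0.0.1:54310/sigmoid/processed/")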


Thanks
Best Regards


On Tue, Aug 19, 2014 at 2:53 AM, salemi <alireza.sal...@udo.edu> wrote:

> Hi,
>
> My data source stores the incoming data to HDFS every 10 seconds. The
> naming convention is save-<timestamp>.csv (see below):
>
> drwxr-xr-x      ali supergroup  0 B     0       0 B
>  save-1408396065000.csv
> drwxr-xr-x      ali supergroup  0 B     0       0 B
>  save-1408396070000.csv
> drwxr-xr-x      ali supergroup  0 B     0       0 B
>  save-1408396075000.csv
> drwxr-xr-x      ali supergroup  0 B     0       0 B
>  save-1408396080000.csv
>
> I would like to periodically (every 5 min) read the files and process
> them. Is there a good example out there of how to implement this? How do
> I know what part of the data I have already processed?
>
> Thanks,
> Ali
>
