I don't think you need to do it this way.

Take a look at the "Level of Parallelism in Data Receiving" section of the
streaming programming guide:
http://spark.apache.org/docs/latest/streaming-programming-guide.html

"Receiving multiple data streams can therefore be achieved by creating
multiple input DStreams and configuring them to receive different
partitions of the data stream from the source(s). ... These multiple
DStreams can be unioned together to create a single DStream. Then the
transformations that were being applied on a single input DStream can be
applied on the unified stream."
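
In your case that could look something like the sketch below. This is just
a minimal PySpark outline, and it assumes each node drops its CSV into its
own HDFS directory (the /data/node1 ... /data/node5 paths and the 5-minute
batch interval are my assumptions about your setup, not something you
stated):

    # One textFileStream per node directory, unioned into a single DStream.
    # The /data/nodeN paths are hypothetical placeholders.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="CsvReports")
    ssc = StreamingContext(sc, 300)  # 300 s = 5-minute batch interval

    # One input DStream per node's drop directory
    streams = [ssc.textFileStream("/data/node%d" % i) for i in range(1, 6)]

    # Union them; transformations written for one DStream now run over all 5
    unified = ssc.union(*streams)
    rows = unified.map(lambda line: line.split(","))  # naive CSV parsing
    rows.pprint()

    ssc.start()
    ssc.awaitTermination()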

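That said, to answer your first question directly: textFileStream monitors
a directory and picks up every new file that appears in it, so you may not
need to merge at all. If you still want one big file, the simplest route I
know of from Python is to shell out to the hdfs CLI (getmerge writes to
the local filesystem, so you push the result back up with put). A rough
sketch, with hypothetical paths:

    import subprocess

    def merge_hdfs_dir(src_dir, dest_path, tmp_local="/tmp/merged.csv"):
        # Concatenate every file under src_dir into one local file
        subprocess.check_call(["hdfs", "dfs", "-getmerge", src_dir, tmp_local])
        # Push the merged file back into HDFS (-f overwrites an old copy)
        subprocess.check_call(["hdfs", "dfs", "-put", "-f", tmp_local, dest_path])

    # Write the merged file outside the watched directory so the stream
    # does not pick it up and reprocess the same data.
    merge_hdfs_dir("/data", "/merged/all.csv")
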

On Wed, Mar 30, 2016 at 11:08 PM, kramer2...@126.com <kramer2...@126.com>
wrote:

> Hi
>
> My environment is described like below:
>
> 5 nodes, each node generates a big CSV file every 5 minutes. I need Spark
> Streaming to analyze these 5 files every five minutes to generate some
> reports.
>
> I am planning to do it in this way:
>
> 1. Put those 5 files into an HDFS directory called /data
> 2. Merge them into one big file in that directory
> 3. Use the Spark Streaming constructor textFileStream('/data') to generate
> my inputDStream
>
> The problem with this approach is that I do not know how to merge the 5
> files in HDFS. It seems very difficult to do in Python.
>
> So my questions are:
>
> 1. Can you tell me how to merge files in HDFS with Python?
> 2. Do you know some other way to feed those files into Spark?
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-design-the-input-source-of-spark-stream-tp26641.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


-- 
http://www.femibyte.com/twiki5/bin/view/Tech/
http://www.nextmatrix.com
"Great spirits have always encountered violent opposition from mediocre
minds." - Albert Einstein.
