Hi,

My environment is as follows: 5 nodes, each of which generates a large CSV file every 5 minutes. I need Spark Streaming to analyze these 5 files every five minutes and generate a report. I am planning to do it this way:

1. Put those 5 files into an HDFS directory called /data.
2. Merge them into one big file in that directory.
3. Use the Spark Streaming constructor textFileStream('/data') to create my input DStream.

The problem with this approach is that I do not know how to merge the 5 files in HDFS; it seems difficult to do from Python. So my questions are:

1. Can you tell me how to merge files in HDFS from Python?
2. Do you know some other way to feed those files into Spark?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-design-the-input-source-of-spark-stream-tp26641.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
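For question 1, one common workaround is to shell out to the `hdfs` CLI from Python: `hdfs dfs -getmerge` concatenates every file under an HDFS directory into a single local file, which can then be pushed back into HDFS with `hdfs dfs -put`. A minimal sketch, with all paths as placeholders:

```python
import subprocess

def build_getmerge_cmd(src_dir, local_path):
    # 'hdfs dfs -getmerge' concatenates all files under src_dir
    # into one file on the LOCAL filesystem.
    return ["hdfs", "dfs", "-getmerge", src_dir, local_path]

def build_put_cmd(local_path, dest_path):
    # Push the merged local file back into HDFS; -f overwrites
    # an existing destination.
    return ["hdfs", "dfs", "-put", "-f", local_path, dest_path]

def merge_in_hdfs(src_dir, local_tmp, dest_path):
    # Note: this round-trips the data through local disk, which is
    # acceptable for moderately sized files but wasteful for huge ones.
    subprocess.run(build_getmerge_cmd(src_dir, local_tmp), check=True)
    subprocess.run(build_put_cmd(local_tmp, dest_path), check=True)
```

One caution: if the merged file is written back into the same directory that textFileStream is watching, Spark will treat it as a new file and process the data a second time, so the merged output should land in a different directory.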
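For question 2, the merge step may not be needed at all: textFileStream monitors a directory and processes every new file that appears in it during a batch interval, so the five per-node files can simply be moved into /data and Spark will read them together. A minimal PySpark sketch, where the report logic (a plain row count) and the CSV parsing are placeholders for the real analysis:

```python
def parse_csv_line(line):
    # Naive CSV split with no quoting support; a simplification
    # standing in for the real parser.
    return line.split(",")

def main():
    # Spark imports kept inside main() so the pure helper above
    # can be used without a Spark installation.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="CsvReport")
    # A 300-second batch interval matches the 5-minute file cadence.
    ssc = StreamingContext(sc, 300)

    # textFileStream picks up every NEW file moved into /data,
    # so the five files need not be merged beforehand.
    rows = ssc.textFileStream("hdfs:///data").map(parse_csv_line)
    rows.count().pprint()

    ssc.start()
    ssc.awaitTermination()

# main() would be launched on the cluster via spark-submit;
# it is deliberately not called here.
```

One detail worth knowing: textFileStream only notices files that appear atomically, so the usual pattern is to write each file elsewhere and then rename/move it into /data once it is complete.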