This is a very late reply to this thread. If you are trying to read XML files from a directory and put them into a stream, there are two approaches that may work.
1. Something like this:

       streamingContext.fileStream[LongWritable, Text, XmlInputFormat](<directory>)

   The XmlInputFormat class <https://github.com/apache/mahout/blob/ad84344e4055b1e6adff5779339a33fa29e1265d/examples/src/main/java/org/apache/mahout/classifier/bayes/XmlInputFormat.java> is what Woody suggested. If this InputFormat works correctly, then any new XML files created in <directory> should be read as an RDD in a DStream. However, there is no guarantee that it will read one file at a time: if two files are created within the same batch interval, both will be read together in the same batch.

2. If you want to manually control how the RDDs are fed, take a look at streamingContext.queueStream. It lets you create RDDs manually and push them into a queue. Spark Streaming will pull the RDDs from that queue and treat them as a stream.

Hope this helps. Apologies for the late response.

On Thu, Jan 30, 2014 at 5:55 AM, Mayur Rustagi <[email protected]> wrote:

> Hi,
> I am using Spark Streaming for this, in Streaming I am trying to open the
> file as text file and Dstream.
> Regards
> Mayur
>
> Mayur Rustagi
> Ph: +919632149971
> http://www.sigmoidanalytics.com
> https://twitter.com/mayur_rustagi
>
> On Thu, Jan 30, 2014 at 7:17 PM, Woody Christy <[email protected]> wrote:
>
>> Take a look at the Mahout xmlinputformat class. That should get you
>> started.
>>
>> On Thu, Jan 30, 2014 at 5:08 AM, Mayur Rustagi <[email protected]> wrote:
>>
>>> I am trying to load xml in streaming and convert to csv and store it.
>>> When I use textfile it separates the file on "\n" and hence breaks the
>>> parser. Is it possible to receive the data one file at a time from the hdfs
>>> folder ?
>>>
>>> Mayur Rustagi
>>> Ph: +919632149971
>>> http://www.sigmoidanalytics.com
>>> https://twitter.com/mayur_rustagi
>>>
>>
>> --
>>
>> Woody Christy
>> Solutions Architect | Partner Engineering | Cloudera Inc
>> @woodychristy
>>
>
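P.S. For the queueStream route (option 2 in my reply), here is a rough sketch of what I have in mind, not a tested implementation. It makes a few assumptions that are not part of the thread: the object name XmlQueueStreamSketch, the 10 second batch interval, the 5 second polling loop and the local[2] master are all made up for illustration, and it uses SparkContext.wholeTextFiles (which needs Spark 1.0 or later) to load each file as a single (path, content) record so that newlines inside the XML do not split it. With oneAtATime = true, each batch takes at most one RDD from the queue, which should give one file per batch.

```scala
import scala.collection.mutable

import org.apache.hadoop.fs.Path
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical driver: polls a directory and feeds one RDD per new file into a queue.
object XmlQueueStreamSketch {
  def main(args: Array[String]): Unit = {
    val inputDir = args(0) // directory to watch, e.g. an HDFS folder

    // Assumed settings for a local trial run; adjust for a real cluster.
    val conf = new SparkConf().setAppName("XmlQueueStreamSketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))
    val sc = ssc.sparkContext

    // Each queued RDD holds (file path, whole file content) pairs.
    val queue = new mutable.Queue[RDD[(String, String)]]()

    // oneAtATime = true: every batch interval consumes at most one queued RDD.
    val xmlDocs = ssc.queueStream(queue, oneAtATime = true)

    // Placeholder for the real work (parse the XML, convert to CSV, store it).
    xmlDocs.foreachRDD { rdd =>
      rdd.collect().foreach { case (path, xml) =>
        println(s"$path: ${xml.length} characters")
      }
    }

    ssc.start()

    // Simple polling loop on the driver; files already queued are remembered in `seen`.
    val dirPath = new Path(inputDir)
    val fs = dirPath.getFileSystem(sc.hadoopConfiguration)
    val seen = mutable.Set[String]()
    while (true) {
      val files = fs.listStatus(dirPath).filter(_.isFile).map(_.getPath.toString)
      for (file <- files if seen.add(file)) {
        // One RDD per file, so each file arrives in its own batch.
        queue.synchronized { queue += sc.wholeTextFiles(file) }
      }
      Thread.sleep(5000)
    }
  }
}
```

Two caveats: the loop may pick up a file that is still being written, so it is safer to write files elsewhere and move them into the directory once complete, and as far as I know queueStream does not support checkpointing, so the queue contents are lost if the driver restarts.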
