You can use fileStream for that; have a look at Mahout's XMLInputFormat <https://github.com/apache/mahout/blob/ad84344e4055b1e6adff5779339a33fa29e1265d/examples/src/main/java/org/apache/mahout/classifier/bayes/XmlInputFormat.java>. It should give you the full XML document as one record (as opposed to an XML record spread across multiple line records in textFileStream). This thread <http://apache-spark-user-list.1001560.n3.nabble.com/Parsing-a-large-XML-file-using-Spark-td19239.html> also has some discussion around it.
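Once each record in the stream is a complete XML document (which is what XmlInputFormat gives you), the parsing step itself is simple. A minimal sketch using Python's standard library; the `<event>` element and its fields are hypothetical, not taken from the thread:

```python
import xml.etree.ElementTree as ET

def parse_record(xml_string):
    """Parse one whole-XML record (as delivered by an XML-aware
    input format) into a plain dict. Element names are illustrative."""
    root = ET.fromstring(xml_string)
    return {child.tag: child.text for child in root}

# In a Spark job this would typically run inside a map() over the
# stream's (key, value) pairs, e.g. stream.map(lambda kv: parse_record(kv[1])).
sample = "<event><id>42</id><status>ok</status></event>"
print(parse_record(sample))  # {'id': '42', 'status': 'ok'}
```

The same function works unchanged whether the record arrives from a DStream, an RDD, or a local test string, which makes the parsing logic easy to unit-test outside the cluster.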
Thanks
Best Regards

On Mon, Jun 22, 2015 at 12:23 AM, Yong Feng <fengyong...@gmail.com> wrote:

> Hi Spark Experts
>
> I have a customer who wants to monitor incoming data files (in XML
> format), analyze them, and then put the analyzed data into a DB. The
> size of each file is about 30MB (or even less in future). Spark
> Streaming seems promising.
>
> After learning Spark Streaming and googling how Spark Streaming
> handles XML files, I found there seems to be no existing Spark
> Streaming utility that recognizes a whole XML file and parses it. The
> fileStream seems line-oriented. There is a suggestion to put the whole
> XML file on one line, but that requires pre-processing the files,
> which brings unexpected I/O.
>
> Can anyone throw some light on it? It would be great if there are some
> sample codes for me to start with.
>
> Thanks
>
> Yong