I was building a small app to stream messages from Kafka via Spark. Each
message is a separate XML document. I wrote a simple app to do so (this app
expects each XML message to be on a single line):
from __future__ import print_function
from pyspark.sql import Row
import xml.etree.ElementTree as ET
import sys
from pyspark import SparkContext
from pyspark import SparkConf
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
## This is where you parse the XML
def create_dict(rt, parent_tag=None, result=None):
    # Flatten the tree into {tag: text}. Each leaf is stored under the tag of
    # its top-level ancestor, so leaves under the same ancestor overwrite
    # each other. A fresh dict is built for every message.
    if result is None:
        result = {}
    for child in rt:
        # Children of the root start a new key; deeper levels inherit it.
        key = child.tag if parent_tag is None else parent_tag
        if len(child):  # element has children
            create_dict(child, key, result)
        else:
            result[key] = child.text
    return result

def parse_xml_to_row(xmlString):
    root = ET.fromstring(xmlString.encode('utf-8'))
    return Row(**create_dict(root))

def toCSVLine(data):
    return ','.join(str(d) for d in data)
## Parsing code part ends here
#sc.stop()
# Configure Spark
conf = SparkConf().setAppName("PythonStreamingKafkaWordCount")
conf = conf.setMaster("local[*]")
sc = SparkContext(conf=conf)
sc.setLogLevel("WARN")
ssc = StreamingContext(sc, 10)
zkQuorum, topic = 'localhost:2182', 'topic-name'
kvs = KafkaUtils.createStream(ssc, zkQuorum, "spark-streaming-consumer",
                              {topic: 1})
lines = kvs.map(lambda x: x[1]).map(parse_xml_to_row).map(toCSVLine)
# lines.pprint()
lines.saveAsTextFiles('where you want to write the file ')
ssc.start()
ssc.awaitTerminationOrTimeout(50)
ssc.stop()
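One caveat about the flattening step above: leaves that share a top-level ancestor are stored under the same key, so later values replace earlier ones. Below is a minimal sketch of one way around that, keying each leaf by its underscore-joined tag path instead. The function name flatten_xml and the sample message are made up for illustration, and repeated sibling tags would still collide.

```python
import xml.etree.ElementTree as ET


def flatten_xml(element, prefix=""):
    """Return {tag_path: text} for every leaf element below `element`."""
    fields = {}
    for child in element:
        # Build a key such as "address_city" from the chain of enclosing tags.
        path = prefix + "_" + child.tag if prefix else child.tag
        if len(child):  # the element has children, so recurse into it
            fields.update(flatten_xml(child, path))
        else:
            fields[path] = child.text
    return fields


# Hypothetical single-line message, as it might arrive from Kafka.
sample = "<order><id>42</id><address><city>London</city><zip>W6</zip></address></order>"
fields = flatten_xml(ET.fromstring(sample))
print(sorted(fields.items()))
```

The resulting dict can be passed to Row(**fields) in the same way as in parse_xml_to_row.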
Hope this is helpful.
Puneet
From: Hyukjin Kwon [mailto:gurwls...@gmail.com]
Sent: Monday, August 22, 2016 4:34 PM
To: Diwakar Dhanuskodi
Cc: Darin McBeath; Jörn Franke; Felix Cheung; user
Subject: Re: Best way to read XML data from RDD
Do you mind sharing your code and sample data? It should work with a single
XML document, if I remember correctly.
2016-08-22 19:53 GMT+09:00 Diwakar Dhanuskodi
<diwakar.dhanusk...@gmail.com>:
Hi Darin,
Are you using this utility to parse single-line XML?
Sent from Samsung Mobile.
-------- Original message --------
From: Darin McBeath <ddmcbe...@yahoo.com>
Date: 21/08/2016 17:44 (GMT+05:30)
To: Hyukjin Kwon <gurwls...@gmail.com>, Jörn Franke <jornfra...@gmail.com>
Cc: Diwakar Dhanuskodi <diwakar.dhanusk...@gmail.com>, Felix Cheung
<felixcheun...@hotmail.com>, user <user@spark.apache.org>
Subject: Re: Best way to read XML data from RDD
Another option would be to look at spark-xml-utils. We use this extensively in
the manipulation of our XML content.
https://github.com/elsevierlabs-os/spark-xml-utils
There are quite a few examples. Depending on your preference (and what you
want to do), you could use xpath, xquery, or xslt to transform, extract, or
filter.
As mentioned below, you will want to initialize the parser in a mapPartitions
call (one of the examples shows this).
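To illustrate the per-partition pattern in Python, here is a minimal sketch. It does not use the spark-xml-utils API (that library targets the JVM); FieldExtractor is a hypothetical stand-in for any object that is costly to construct, and the paths and sample messages are assumptions made up for the example.

```python
import xml.etree.ElementTree as ET


class FieldExtractor(object):
    """Hypothetical stand-in for an object that is costly to construct."""

    instances_created = 0  # counts constructions, for illustration only

    def __init__(self, paths):
        FieldExtractor.instances_created += 1
        self.paths = paths

    def extract(self, xml_string):
        root = ET.fromstring(xml_string)
        # findtext returns None when the path does not match.
        return [root.findtext(path) for path in self.paths]


def extract_partition(messages):
    # Runs once per partition: build the extractor a single time,
    # then reuse it for every message the partition iterator yields.
    extractor = FieldExtractor(["id", "address/city"])
    for message in messages:
        yield extractor.extract(message)


# Local check with a plain iterator standing in for one partition.
partition = iter([
    "<order><id>1</id><address><city>London</city></address></order>",
    "<order><id>2</id><address><city>Leeds</city></address></order>",
])
rows = list(extract_partition(partition))
print(rows)
```

On an RDD this would be applied as rdd.mapPartitions(extract_partition), so the construction cost is paid once per partition rather than once per message.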
Hope this is helpful.
Darin.
________________________________
From: Hyukjin Kwon <gurwls...@gmail.com>
To: Jörn Franke <jornfra...@gmail.com>
Cc: Diwakar Dhanuskodi <diwakar.dhanusk...@gmail.com>; Felix Cheung
<felixcheun...@hotmail.com>; user <user@spark.apache.org>
Sent: Sunday, August 21, 2016 6:10 AM
Subject: Re: Best way to read XML data from RDD
Hi Diwakar,
The spark-xml library can take an RDD as its source.
```
val df = new XmlReader()
.withRowTag("book")
.xmlRdd(sqlContext, rdd)
```
If performance is critical, I would also recommend paying attention to the
creation and destruction of the parser.
If the parser is not serializable, you can create it once per partition
within mapPartitions, just like
https://github.com/apache/spark/blob/ac84fb64dd85257da06f93a48fed9bb188140423/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L322-L325
I hope this is helpful.
2016-08-20 15:10 GMT+09:00 Jörn Franke
<jornfra...@gmail.com>:
I fear the issue is that this will create and destroy an XML parser object 2
million times, which is very inefficient; it does not really look like a parser
performance issue. Can't you do something about the choice of format? Could you
ask your supplier to deliver another format (ideally Avro or something similar)?
Otherwise you could create just one XML parser object per node, but sharing it
among the parallel tasks on the same node is tricky.
The other possibility would simply be more hardware...
>
>On 20 Aug 2016, at 06:41, Diwakar Dhanuskodi
><diwakar.dhanusk...@gmail.com> wrote:
>
>
>Yes. It accepts an XML file as a source, but not an RDD. The XML data embedded
>inside JSON is streamed from a Kafka cluster, so I can get it as an RDD.
>>Right now I am using the spark.xml XML.loadstring method inside an RDD map
>>function, but performance-wise I am not happy, as it takes 4 minutes to
>>parse XML from 2 million messages in a 3-node environment (100G, 4 CPUs each).
>>
>>
>>-------- Original message --------
>>From: Felix Cheung <felixcheun...@hotmail.com>
>>Date: 20/08/2016 09:49 (GMT+05:30)
>>To: Diwakar Dhanuskodi <diwakar.dhanusk...@gmail.com>, user
>><user@spark.apache.org>
>>Cc:
>>Subject: Re: Best way to read XML data from RDD
>>
>>
>>Have you tried
>>
>>https://github.com/databricks/spark-xml
>>?
>>
>>
>>
>>
>>
>>On Fri, Aug 19, 2016 at 1:07 PM -0700, "Diwakar Dhanuskodi"
>><diwakar.dhanusk...@gmail.com> wrote:
>>
>>
>>Hi,
>>
>>
>>There is an RDD with JSON data. I can read the JSON data using rdd.read.json.
>>The JSON data has XML data in a couple of key-value pairs.
>>
>>
>>What is the best way to read and parse XML from an RDD? Are there any
>>XML libraries specifically for Spark? Could anyone help with this?
>>
>>
>>Thanks.
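For the original question at the bottom of the thread (XML stored as string values inside JSON records), a minimal Python sketch of the per-record step might look like the following. The key name "payload", the helper name xml_fields_from_json, and the sample record are assumptions made up for the example; nothing here is specific to Spark.

```python
import json
import xml.etree.ElementTree as ET


def xml_fields_from_json(record, xml_keys):
    """Parse a JSON record and return {key: root Element} for the XML-valued keys."""
    doc = json.loads(record)
    parsed = {}
    for key in xml_keys:
        value = doc.get(key)
        if value:  # skip keys that are missing or empty
            parsed[key] = ET.fromstring(value.encode("utf-8"))
    return parsed


# Hypothetical record: the "payload" key holds an XML document as a string.
record = json.dumps({"source": "kafka", "payload": "<book><title>Spark</title></book>"})
parsed = xml_fields_from_json(record, ["payload"])
title = parsed["payload"].findtext("title")
print(title)
```

A function like this could be called from inside a map or mapPartitions function, pulling the needed values out of the parsed elements before returning them.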