Re: Parsing a large XML file using Spark

2015-11-04 Thread Jin
I recently worked a bit on datasources and Parquet in Spark, and someone
asked me to make an XML datasource plugin, so I did:

https://github.com/HyukjinKwon/spark-xml

It tries to get rid of the in-line format, similar to the JSON datasource in Spark.

Although I haven't added a CI tool for this yet, it appears to work for the
test code and rough use cases.
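
For reference, a read with the plugin would look roughly like the sketch below.
This is only a minimal sketch: the format identifier and the rowTag option follow
the spark-xml README as I know it from the later Databricks-maintained version, and
the path and tag are made-up placeholders, so check the repository README for the
exact names in the version you pick up.

import org.apache.spark.sql.SQLContext

// Sketch only: load an XML file as a DataFrame with the spark-xml datasource.
// `sc` is assumed to be an existing SparkContext; path and rowTag are placeholders.
val sqlContext = new SQLContext(sc)
val df = sqlContext.read
  .format("com.databricks.spark.xml")   // format name may differ in this early version
  .option("rowTag", "page")             // each <page> element becomes one row
  .load("hdfs:///data/enwiki.xml")

df.printSchema()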

Maybe you can try this :).

Thanks







Re: Parsing a large XML file using Spark

2014-11-21 Thread Prannoy
Hi,

Parallel processing of XML files may be an issue because of the tags in the
XML file. The file has to stay intact: while parsing, the parser matches each
start tag with its end tag, and if the file is distributed in parts to the
workers, a worker may or may not see both the start and end tag of an element
within its part, which will raise an exception.
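
To make that concrete, here is a small hedged sketch (the path and the <page>
tag are made up, and `sc` is assumed to be an existing SparkContext) of the
naive line-based read that runs into exactly this problem:

// Naive approach: read the XML file as plain lines of text.
val lines = sc.textFile("hdfs:///data/big.xml")

// Each partition only sees its own slice of lines, so a <page> ... </page> pair
// that straddles a partition boundary cannot be reassembled locally; trying to
// parse a partition's text as XML fails on the dangling start or end tag.
val perPartition = lines.mapPartitions { it =>
  val chunk = it.mkString("\n")
  // scala.xml.XML.loadString(chunk) would throw here for a partial document
  Iterator(chunk.length)
}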

Thanks.

On Wed, Nov 19, 2014 at 6:26 AM, ssimanta [via Apache Spark User List] 
ml-node+s1001560n19239...@n3.nabble.com wrote:

 If there is one big XML file (e.g., the 44 GB Wikipedia dump, or the larger
 dump that also includes all the revision information) stored in HDFS, is it
 possible to parse it in parallel/faster using Spark? Or do we have to use
 something like a PullParser or Iteratee?

 My current solution is to read the single XML file in a first pass, write it
 out to HDFS as smaller files, and then read those small files in parallel on
 the Spark workers.

 Thanks
 -Soumya






Re: Parsing a large XML file using Spark

2014-11-21 Thread Paul Brown
Unfortunately, unless you impose restrictions on the XML file (e.g., where
namespaces are declared, whether entity replacement is used, etc.), you
really can't parse only a piece of it, even if you have start/end elements
grouped together. If you want to deal effectively (and scalably) with
large XML files consisting of many records, the right thing to do is to
write them as one XML document per line, just like the one-JSON-document-per-line
convention, at which point the data can be split effectively. Something like
Woodstox and a little custom code should make an effective pre-processor.

Once you have the line-delimited XML, you can shred it however you want:
JAXB, Jackson XML, etc.
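
As a hedged illustration of that pre-processing step, the sketch below uses the
standard StAX API (Woodstox is picked up as the implementation if it is on the
classpath) to copy each record element onto its own line. The record tag name,
the file arguments, and the whitespace handling are all assumptions for the
sketch, not a drop-in tool.

import java.io.{FileInputStream, PrintWriter, StringWriter}
import javax.xml.stream.{XMLEventWriter, XMLInputFactory, XMLOutputFactory}

object XmlRecordsToLines {
  def main(args: Array[String]): Unit = {
    val recordTag = "record"                     // made-up record element name
    val reader = XMLInputFactory.newInstance()   // Woodstox is used if on the classpath
      .createXMLEventReader(new FileInputStream(args(0)))
    val outFactory = XMLOutputFactory.newInstance()
    val out = new PrintWriter(args(1))

    var writer: XMLEventWriter = null
    var buffer: StringWriter = null

    while (reader.hasNext) {
      val event = reader.nextEvent()
      if (writer == null) {
        // Start buffering when a record element opens (assumes records do not nest).
        if (event.isStartElement &&
            event.asStartElement.getName.getLocalPart == recordTag) {
          buffer = new StringWriter()
          writer = outFactory.createXMLEventWriter(buffer)
          writer.add(event)
        }
      } else {
        writer.add(event)
        if (event.isEndElement &&
            event.asEndElement.getName.getLocalPart == recordTag) {
          writer.close()
          // One record per line: collapse internal newlines so a line-based split works.
          out.println(buffer.toString.replaceAll("\\s*\\n\\s*", " "))
          writer = null
        }
      }
    }
    out.close()
  }
}

Once the records are one per line, pulling them back into Spark is just a plain
textFile plus whichever shredder you prefer, e.g. something along the lines of
sc.textFile("records.xml").map(scala.xml.XML.loadString).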

—
p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/

On Fri, Nov 21, 2014 at 3:38 AM, Prannoy pran...@sigmoidanalytics.com
wrote:

 Hi,

 Parallel processing of XML files may be an issue because of the tags in the
 XML file. The file has to stay intact: while parsing, the parser matches each
 start tag with its end tag, and if the file is distributed in parts to the
 workers, a worker may or may not see both the start and end tag of an element
 within its part, which will raise an exception.

 Thanks.

 On Wed, Nov 19, 2014 at 6:26 AM, ssimanta [via Apache Spark User List] wrote:

 If there is one big XML file (e.g., the 44 GB Wikipedia dump, or the larger
 dump that also includes all the revision information) stored in HDFS, is it
 possible to parse it in parallel/faster using Spark? Or do we have to use
 something like a PullParser or Iteratee?

 My current solution is to read the single XML file in a first pass, write it
 out to HDFS as smaller files, and then read those small files in parallel on
 the Spark workers.

 Thanks
 -Soumya








Re: Parsing a large XML file using Spark

2014-11-21 Thread andy petrella
Actually, it's a real

On Tue Nov 18 2014 at 2:52:00 AM Tobias Pfeiffer t...@preferred.jp wrote:

 Hi,

 see https://www.mail-archive.com/dev@spark.apache.org/msg03520.html for
 one solution.

 One issue with those XML files is that they cannot be processed line by
 line in parallel; plus you inherently need shared/global state to parse XML
 or check for well-formedness, I think. (Same issue with multi-line JSON, by
 the way.)

 Tobias




Re: Parsing a large XML file using Spark

2014-11-21 Thread andy petrella
(sorry about the previous spam... Google Inbox didn't let me cancel the
miserable send action :-/)

So, what I was about to say: it's a real pain in the ass to parse the
Wikipedia articles in the dump because of these multi-line articles...

However, there is a way to manage that quite easily, although I found it
rather slow.

*1/ use XML reader*
Use the "org.apache.hadoop" % "hadoop-streaming" % "1.0.4" dependency.

*2/ configure the hadoop job*
import org.apache.hadoop.streaming.StreamXmlRecordReader
import org.apache.hadoop.mapred.JobConf
val jobConf = new JobConf()
jobConf.set("stream.recordreader.class",
  "org.apache.hadoop.streaming.StreamXmlRecordReader")
jobConf.set("stream.recordreader.begin", "<page>")
jobConf.set("stream.recordreader.end", "</page>")
org.apache.hadoop.mapred.FileInputFormat.addInputPaths(jobConf,
  s"hdfs://$master:9000/data.xml")

// Load the documents (one record per <page> element).
val documents = sparkContext.hadoopRDD(jobConf,
  classOf[org.apache.hadoop.streaming.StreamInputFormat],
  classOf[org.apache.hadoop.io.Text],
  classOf[org.apache.hadoop.io.Text])


*3/ use the result as XML doc*
import scala.xml.XML
val texts = documents.map(_._1.toString)
  .map { s =>
    val xml = XML.loadString(s)
    val id = (xml \ "id").text.toDouble
    val title = (xml \ "title").text
    val text = (xml \ "revision" \ "text").text.replaceAll("\\W", " ")
    val tknzed = text.split("\\W").filter(_.size > 3).toList
    (id, title, tknzed)
  }

HTH
andy
On Tue Nov 18 2014 at 2:52:00 AM Tobias Pfeiffer t...@preferred.jp wrote:

 Hi,

 see https://www.mail-archive.com/dev@spark.apache.org/msg03520.html for
 one solution.

 One issue with those XML files is that they cannot be processed line by
 line in parallel; plus you inherently need shared/global state to parse XML
 or check for well-formedness, I think. (Same issue with multi-line JSON, by
 the way.)

 Tobias




Re: Parsing a large XML file using Spark

2014-11-18 Thread Tobias Pfeiffer
Hi,

see https://www.mail-archive.com/dev@spark.apache.org/msg03520.html for one
solution.

One issue with those XML files is that they cannot be processed line by
line in parallel; plus you inherently need shared/global state to parse XML
or check for well-formedness, I think. (Same issue with multi-line JSON, by
the way.)

Tobias