Hi Zhang & all,

Thanks so much for your earlier response. I am trying to get to the final solution. We can parse the data successfully, but I am getting a NullPointerException while mapping it back. Can you please advise on the findings below?
spark-shell --packages com.databricks:spark-xml_2.11:0.4.1

//----------------------------------
scala> spark.version
res19: String = 2.0.0

scala> def xml2DF(xml: String) = {
     |   val xRDD = sc.parallelize(Seq(xml))
     |   val df = new XmlReader().xmlRdd(spark.sqlContext, xRDD)
     |   val new_df = df.withColumn("comment", explode(df("Comments.Comment"))).select($"comment.Description", $"comment.Title")
     |   new_df.collectAsList
     | }
xml2DF: (xml: String)java.util.List[org.apache.spark.sql.Row]

//----------------------------------
scala> val xml3 = "<books><Comments><Comment><Title>Title3.1</Title><Description>Descript3.1</Description></Comment><Comment><Title>Title3333.1</Title><Description>Descript3333.1</Description></Comment></Comments></books>"

scala> xml2DF(xml3)
res18: java.util.List[org.apache.spark.sql.Row] = [[Descript3.1,Title3.1], [Descript3333.1,Title3333.1]]
// xml2DF works as expected

//----------------------------------
scala> val rdd = sc.textFile("file:///home/spark/XML_Project/data.txt")
scala> val xml_pRDDs = rdd.map(x => (x.split(',')(0).toInt, x.split(',')(3)))

scala> xml_pRDDs.map(x => (x._1, "call xml2DF " + x._2)).collect
res17: Array[(Int, String)] = Array((1,call xml2DF <books><Comments><Comment><Title>Title1.1</Title><Description>Descript1.1</D..

//----------------------------------
scala> xml_pRDDs.map(x => (x._1, xml2DF(x._2))).collect
// This gives a NullPointerException. I was expecting a result like:
// Array[(Int, java.util.List[org.apache.spark.sql.Row])] = Array((1,[[Descript3.1,Title3.1], [Descript3333.1,Title3333.1]]), ...

====================
Below is the content of the data.txt file:
1,Amol,Kolhapur,<books><Comments><Comment><Title>Title1.1</Title><Description>Descript1.1</Description></Comment><Comment><Title>Title1.2</Title><Description>Descript1.2</Description></Comment><Comment><Title>Title1.3</Title><Description>Descript1.3</Description></Comment></Comments></books>
2,Ameet,Bangalore,<books><Comments><Comment><Title>Title2.1</Title><Description>Descript2.1</Description></Comment><Comment><Title>Title2.2</Title><Description>Descript2.2</Description></Comment></Comments></books>
3,Rajesh,Jaipur,<books><Comments><Comment><Title>Title3.1</Title><Description>Descript3.1</Description></Comment><Comment><Title>Title3.2</Title><Description>Descript3.2</Description></Comment><Comment><Title>Title3.3</Title><Description>Descript3.3</Description></Comment><Comment><Title>Title3.4</Title><Description>Descript3.4</Description></Comment></Comments></books>

Regards,
Amol

On 6/29/17, Yong Zhang <java8...@hotmail.com> wrote:
> scala> spark.version
> res6: String = 2.1.1
>
> scala> val rdd =
>   sc.parallelize(Seq("""<books><Comments><Comment><Title>Title1.1</Title><Description>Description_1.1</Description></Comment>
>     <Comment><Title>Title1.2</Title><Description>Description_1.2</Description></Comment>
>     <Comment><Title>Title1.3</Title><Description>Description_1.3</Description></Comment>
>     </Comments></books>"""))
> rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at
>   parallelize at <console>:24
>
> scala> import com.databricks.spark.xml.XmlReader
>
> scala> val df = new XmlReader().xmlRdd(spark.sqlContext, rdd)
> df: org.apache.spark.sql.DataFrame = [Comments: struct<Comment:
>   array<struct<Description:string,Title:string>>>]
>
> scala> df.printSchema
> root
>  |-- Comments: struct (nullable = true)
>  |    |-- Comment: array (nullable = true)
>  |    |    |-- element: struct (containsNull = true)
>  |    |    |    |-- Description: string (nullable = true)
>  |    |    |    |-- Title: string (nullable = true)
>
> scala> df.show(false)
> +--------------------------------------------------------------------------------------------------+
> |Comments                                                                                          |
> +--------------------------------------------------------------------------------------------------+
> |[WrappedArray([Description_1.1,Title1.1], [Description_1.2,Title1.2], [Description_1.3,Title1.3])]|
> +--------------------------------------------------------------------------------------------------+
>
> scala> df.withColumn("comment",
>   explode(df("Comments.Comment"))).select($"comment.Description",
>   $"comment.Title").show
> +---------------+--------+
> |    Description|   Title|
> +---------------+--------+
> |Description_1.1|Title1.1|
> |Description_1.2|Title1.2|
> |Description_1.3|Title1.3|
> +---------------+--------+
>
> ________________________________
> From: Talap, Amol <amol.ta...@capgemini.com>
> Sent: Thursday, June 29, 2017 9:38 AM
> To: Judit Planas; user@spark.apache.org
> Subject: RE: SparkSQL to read XML Blob data to create multiple rows
>
> Thanks Judit, Ayan
>
> Judit,
>
> You almost got it. The explode might help here.
> But when I tried it, I found that load() refuses to read from the xmlcomment column of oracle_data.
> scala> val xmlDF = sqlContext.sql("SELECT * FROM oracle_data")
> 17/06/29 18:31:58 INFO parse.ParseDriver: Parsing command: SELECT * FROM oracle_data
> 17/06/29 18:31:58 INFO parse.ParseDriver: Parse Completed
> …
> scala> val xmlDF_flattened =
>   xmlDF.withColumn("xmlcomment",explode(sqlContext.read.format("com.databricks.spark.xml").option("rowTag","book").load($"xmlcomment")))
> <console>:22: error: overloaded method value load with alternatives:
>   ()org.apache.spark.sql.DataFrame <and>
>   (path: String)org.apache.spark.sql.DataFrame
>  cannot be applied to (org.apache.spark.sql.ColumnName)
>        val xmlDF_flattened =
>   xmlDF.withColumn("xmlcomment",explode(sqlContext.read.format("com.databricks.spark.xml").option("rowTag","book").load($"xmlcomment")))
>
> Ayan,
>
> The output of books_inexp.show was as below:
> title, author
> Midnight Rain, Ralls, Kim
> Maeve Ascendant, Corets, Eva
>
> Regards,
> Amol
>
> From: Judit Planas [mailto:judit.pla...@epfl.ch]
> Sent: Thursday, June 29, 2017 3:46 AM
> To: user@spark.apache.org
> Subject: Re: SparkSQL to read XML Blob data to create multiple rows
>
> Hi Amol,
>
> Not sure I understand your question completely, but the SQL function
> "explode" may help you:
> http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.explode
> Here you can find a nice example:
> https://stackoverflow.com/questions/38210507/explode-in-pyspark
>
> HTH,
> Judit
>
> On 29/06/17 09:05, ayan guha wrote:
>
> Hi
>
> Not sure if I follow your issue. Can you please post the output of
> books_inexp.show()?
>
> On Thu, Jun 29, 2017 at 2:30 PM, Talap, Amol <amol.ta...@capgemini.com> wrote:
>
> Hi:
>
> We are trying to parse XML data to get the output below from the given input sample.
> Can someone suggest a way to pass one DataFrame's output into the load() function,
> or any other alternative to get this output.
> Input data from the Oracle table (XMLComment holds the XML blob):
>
> SequenceID | Name    | City      | XMLComment
> 1          | Amol    | Kolhapur  | <books><Comments><Comment><Title>Title1.1</Title><Description>Description_1.1</Description><Comment><Title>Title1.2</Title><Description>Description_1.2</Description><Comment><Title>Title1.3</Title><Description>Description_1.3</Description></Comment></Comments></books>
> 2          | Suresh  | Mumbai    | <books><Comments><Comment><Title>Title2</Title><Description>Description_2</Description></Comment></Comments></books>
> 3          | Vishal  | Delhi     | <books><Comments><Comment><Title>Title3</Title><Description>Description_3</Description></Comment></Comments></books>
> 4          | Swastik | Bangalore | <books><Comments><Comment><Title>Title4</Title><Description>Description_4</Description></Comment></Comments></books>
>
> Expected output using Spark SQL:
>
> SequenceID | Name    | City      | Title    | Description
> 1          | Amol    | Kolhapur  | Title1.1 | Description_1.1
> 1          | Amol    | Kolhapur  | Title1.2 | Description_1.2
> 1          | Amol    | Kolhapur  | Title1.3 | Description_1.3
> 2          | Suresh  | Mumbai    | Title2   | Description_2
> 3          | Vishal  | Delhi     | Title3   | Description_3
> 4          | Swastik | Bangalore | Title4   | Description_4
>
> I am able to parse a single XML document in spark-shell using the approach below,
> but how do we apply the same for all rows?
> https://community.hortonworks.com/questions/71538/parsing-xml-in-spark-rdd.html
> val dfX =
>   sqlContext.read.format("com.databricks.spark.xml").option("rowTag","book").load("books.xml")
> val xData = dfX.registerTempTable("books")
> dfX.printSchema()
> val books_inexp = sqlContext.sql("select title,author from books where price<10")
> books_inexp.show
>
> Regards,
> Amol
>
> This message contains information that may be privileged or confidential and
> is the property of the Capgemini Group. It is intended only for the person
> to whom it is addressed. If you are not the intended recipient, you are not
> authorized to read, print, retain, copy, disseminate, distribute, or use
> this message or any part thereof. If you receive this message in error,
> please notify the sender immediately and delete all copies of this message.
>
> --
> Best Regards,
> Ayan Guha

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
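A note on the NullPointerException in the top post: xml2DF references sc and spark, which exist only on the driver, so calling it from inside xml_pRDDs.map executes it on executors where those handles are null. One way around this is to parse each blob inside the map with a plain (non-Spark) XML parser. Below is a minimal sketch using the JDK's built-in DOM parser; the helper name parseComments is illustrative, not from the thread:

```scala
import java.io.ByteArrayInputStream
import javax.xml.parsers.DocumentBuilderFactory
import org.w3c.dom.Element

// Parse one XML blob into (Description, Title) pairs using the JDK DOM
// parser. Unlike xml2DF, this closes over no SparkContext/SparkSession,
// so it can safely run inside an RDD transformation on executors.
def parseComments(xml: String): List[(String, String)] = {
  val doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
    .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")))
  val comments = doc.getElementsByTagName("Comment")
  (0 until comments.getLength).toList.map { i =>
    val c = comments.item(i).asInstanceOf[Element]
    val text = (tag: String) => c.getElementsByTagName(tag).item(0).getTextContent
    (text("Description"), text("Title"))
  }
}

// Same sample blob as xml3 in the transcript above:
val xml3 = "<books><Comments><Comment><Title>Title3.1</Title><Description>Descript3.1</Description></Comment><Comment><Title>Title3333.1</Title><Description>Descript3333.1</Description></Comment></Comments></books>"
println(parseComments(xml3))
// List((Descript3.1,Title3.1), (Descript3333.1,Title3333.1))
```

Each row of the pair RDD could then be mapped without touching Spark on the executors, e.g. xml_pRDDs.map(x => (x._1, parseComments(x._2))).collect, which should yield the Array[(Int, List[(String, String)])] shape the top post was expecting.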