Thanks Judit, Ayan Judit, You almost got it. The explode might help here. But when I tried I see load() doesn’t like to read from xmlcomment column on oracle_data.
scala> val xmlDF = sqlContext.sql("SELECT * FROM oracle_data") 17/06/29 18:31:58 INFO parse.ParseDriver: Parsing command: SELECT * FROM oracle_data 17/06/29 18:31:58 INFO parse.ParseDriver: Parse Completed … scala> val xmlDF_flattened = xmlDF.withColumn("xmlcomment",explode(sqlContext.read.format("com.databricks.spark.xml").option("rowTag","book").load($"xmlcomment"))) <console>:22: error: overloaded method value load with alternatives: ()org.apache.spark.sql.DataFrame <and> (path: String)org.apache.spark.sql.DataFrame cannot be applied to (org.apache.spark.sql.ColumnName) val xmlDF_flattened = xmlDF.withColumn("xmlcomment",explode(sqlContext.read.format("com.databricks.spark.xml").option("rowTag","book").load($"xmlcomment"))) Ayan, Output of books_inexp.show was as below title, author Midnight Rain, Ralls, Kim Maeve Ascendant, Corets, Eva Regards, Amol From: Judit Planas [mailto:judit.pla...@epfl.ch] Sent: Thursday, June 29, 2017 3:46 AM To: user@spark.apache.org Subject: Re: SparkSQL to read XML Blob data to create multiple rows Hi Amol, Not sure I understand completely your question, but the SQL function "explode" may help you: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.explode Here you can find a nice example: https://stackoverflow.com/questions/38210507/explode-in-pyspark HTH, Judit On 29/06/17 09:05, ayan guha wrote: Hi Not sure if I follow your issue. Can you please post output of books_inexp.show()? On Thu, Jun 29, 2017 at 2:30 PM, Talap, Amol <amol.ta...@capgemini.com<mailto:amol.ta...@capgemini.com>> wrote: Hi: We are trying to parse XML data to get below output from given input sample. Can someone suggest a way to pass one DFrames output into load() function or any other alternative to get this output. Input Data from Oracle Table XMLBlob: SequenceID Name City XMLComment 1 Amol Kolhapur <books><Comments><Comment><Title>Title1.1</Title><Description>Description_1.1</Description><Comment><Title>Title1.2</Title><Description>Description_1.2</Description><Comment><Title>Title1.3</Title><Description>Description_1.3</Description></Comment></Comments></books> 2 Suresh Mumbai <books><Comments><Comment><Title>Title2</Title><Description>Description_2</Description></Comment></Comments></books> 3 Vishal Delhi <books><Comments><Comment><Title>Title3</Title><Description>Description_3</Description></Comment></Comments></books> 4 Swastik Bangalore <books><Comments><Comment><Title>Title4</Title><Description>Description_4</Description></Comment></Comments></books> Output Data Expected using Spark SQL: SequenceID Name City Title Description 1 Amol Kolhapur Title1.1 Description_1.1 1 Amol Kolhapur Title1.1 Description_1.2 1 Amol Kolhapur Title1.3 Description_1.3 2 Suresh Mumbai Title2 Description_2 3 Vishal Delhi Title3.1 Description_3.1 4 Swastik Bangalore Title4 Description_4 I am able to parse single XML using below approach in spark-shell using example below but how do we apply the same recursively for all rows ? https://community.hortonworks.com/questions/71538/parsing-xml-in-spark-rdd.html. val dfX = sqlContext.read.format("com.databricks.spark.xml").option("rowTag","book").load("books.xml") val xData = dfX.registerTempTable("books") dfX.printSchema() val books_inexp =sqlContext.sql("select title,author from books where price<10") books_inexp.show Regards, Amol This message contains information that may be privileged or confidential and is the property of the Capgemini Group. It is intended only for the person to whom it is addressed. If you are not the intended recipient, you are not authorized to read, print, retain, copy, disseminate, distribute, or use this message or any part thereof. If you receive this message in error, please notify the sender immediately and delete all copies of this message. -- Best Regards, Ayan Guha This message contains information that may be privileged or confidential and is the property of the Capgemini Group. It is intended only for the person to whom it is addressed. If you are not the intended recipient, you are not authorized to read, print, retain, copy, disseminate, distribute, or use this message or any part thereof. If you receive this message in error, please notify the sender immediately and delete all copies of this message.