This is the code I am using to parse the XML file:


import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}
import com.databricks.spark.xml

object XmlProcessing {

  def main(args: Array[String]): Unit = {
    // Run locally with a single worker thread.
    val conf = new SparkConf()
      .setAppName("XmlProcessing")
      .setMaster("local")

    val sc = new SparkContext(conf)
    val sqlContext: SQLContext = new SQLContext(sc)

    loadXMLdata(sqlContext)
  }

  def loadXMLdata(sqlContext: SQLContext): Unit = {
    // Treat each <book> element as one row of the DataFrame.
    val df: DataFrame = sqlContext.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "book")
      .load("/home/prathamsh/Workspace/Xml/datafiles/sample.xml")

    // Print the schema inferred from the file.
    df.printSchema()
  }

}
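
One possible variation, sketched under assumptions rather than tested against the real file: instead of relying on schema inference, pass an explicit schema to the reader and declare orderId as an array, since a book can carry many orderId elements. The object name XmlSchemaSketch, the helper loadBooks, and the choice of string types for every field are hypothetical. Whether spark-xml honours this schema the same way in every release is also an assumption.

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}

object XmlSchemaSketch {

  // Hypothetical schema for one <book> row. The field names come from the
  // sample XML; the types are assumptions (everything is read as a string).
  val bookSchema: StructType = StructType(Seq(
    StructField("name", StringType, nullable = true),
    StructField("price", StringType, nullable = true),
    // A book may have many <orderId> children, so model it as an array.
    StructField("orderId", ArrayType(StringType), nullable = true)
  ))

  // Read the file with the explicit schema instead of inferring one.
  def loadBooks(sqlContext: SQLContext, path: String): DataFrame =
    sqlContext.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "book")
      .schema(bookSchema)
      .load(path)
}
```

Because the schema is supplied up front, printSchema() on the result would always list these three fields, so calling show() on the DataFrame returned by XmlSchemaSketch.loadBooks(sqlContext, path) is the more useful check of whether any book rows were actually matched.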

On Sun, Feb 21, 2016 at 7:10 PM, Sebastian Piu <[email protected]>
wrote:

> Can you paste the code you are using?
>
> On Sun, 21 Feb 2016, 13:19 Prathamesh Dharangutte <[email protected]>
> wrote:
>
>> I am trying to parse an XML file using spark-xml, but for some reason when I
>> print the schema it shows only root instead of the hierarchy. I am using
>> SQLContext to read the data. I am following this video:
>> https://www.youtube.com/watch?v=NemEp53yGbI
>>
>> The structure of the XML file is roughly like this:
>>
>> <books>
>>   <book>
>>     <name></name>
>>     <price></price>
>>     <orderId></orderId>
>>   </book>
>>   <book>
>>     <!-- some more data -->
>>   </book>
>> </books>
>>
>> For some books there are multiple orders, i.e. a large number of orders,
>> while for others it occurs just once and is empty. I set the "rowTag"
>> option to book. How do I proceed, or is there another way to tackle this
>> problem? Help would be much appreciated. Thank you.
>>
>
