The XPath approach works reasonably well for simple XML files.

However, for complex XML files that change frequently and need to be
ingested in real time, you might look at a third-party solution, e.g. here:
https://dataworkssummit.com/san-jose-2018/session/add-a-spark-to-your-etl/

On Mon, Jun 11, 2018 at 3:05 PM, kristijan berta <kik...@gmail.com> wrote:

> Thanks Jörn. Is the only alternative to use the xpath UDFs? That works, as
> shown below, but it is tedious.
>
> Here is the example:
>
> *$cat employees.xml*
> <employee>
> <id>1</id>
> <name>Satish Kumar</name>
> <designation>Technical Lead</designation>
> </employee>
> <employee>
> <id>2</id>
> <name>Ramya</name>
> <designation>Testing</designation>
> </employee>
>
> *Step 1: Bring each record onto one line by executing the command below*
>
> $cat employees.xml | tr -d '&' | tr '\n' ' ' | tr '\r' ' ' | sed
> 's|</employee>|</employee>\n|g' | grep -v '^\s*$' > employees_records.xml
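>
> The same pipeline, split up with comments purely for readability (the
> commands themselves are unchanged):
>
> cat employees.xml |                       # the original multi-line XML
>   tr -d '&' |                             # drop literal ampersands
>   tr '\n' ' ' |                           # join everything onto one line
>   tr '\r' ' ' |                           # and remove carriage returns
>   sed 's|</employee>|</employee>\n|g' |   # new line after each closing tag
>   grep -v '^\s*$' > employees_records.xml # drop blank lines, write result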
>
> *$cat employees_records.xml*
> <employee> <id>1</id> <name>Satish Kumar</name> <designation>Technical
> Lead</designation> </employee>
> <employee> <id>2</id> <name>Ramya</name> <designation>Testing</designation>
> </employee>
>
> *Step 2: Load the file into HDFS*
>
> *$hadoop fs -mkdir /user/hive/sample-xml-inputs*
>
> *$hadoop fs -put employees_records.xml /user/hive/sample-xml-inputs*
>
> *$hadoop fs -cat /user/hive/sample-xml-inputs/employees_records.xml*
> <employee> <id>1</id> <name>Satish Kumar</name><designation>Technical
> Lead</designation> </employee>
> <employee> <id>2</id> <name>Ramya</name> <designation>Testing</designation>
> </employee>
>
> *Step 3: Create an external Hive table pointing at the XML file*
>
> *hive>create external table xml_table_org( xmldata string) LOCATION
> '/user/hive/sample-xml-inputs/';*
>
> *hive> select * from xml_table_org;*
> *OK*
> <employee> <id>1</id> <name>Satish Kumar</name> <designation>Technical
> Lead</designation> </employee>
> <employee> <id>2</id> <name>Ramya</name> <designation>Testing</designation>
> </employee>
>
> *Step 4: From the staging table we can query the elements and load them
> into another table.*
>
> *hive> CREATE TABLE xml_table AS SELECT
> xpath_int(xmldata,'employee/id'),xpath_string(xmldata,'employee/name'),xpath_string(xmldata,'employee/designation')
> FROM xml_table_org;*
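>
> If you want named columns in the new table, the same CTAS can carry aliases
> (a minor, illustrative variation of the statement above; the alias names are
> arbitrary):
>
> hive> CREATE TABLE xml_table AS
>     > SELECT xpath_int(xmldata, 'employee/id') AS id,
>     >        xpath_string(xmldata, 'employee/name') AS name,
>     >        xpath_string(xmldata, 'employee/designation') AS designation
>     > FROM xml_table_org;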
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 9 June 2018 at 07:42, Jörn Franke <jornfra...@gmail.com> wrote:
>
>> Yes.
>>
>> Serde must have been removed then in 2.x.
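>>
>> One way to check what the packaged SerDe was compiled against (a sketch,
>> using the hivexmlserde jar from this thread) is to print its class
>> declaration with javap and see whether it still implements the removed
>> org.apache.hadoop.hive.serde2.SerDe interface:
>>
>> # prints the class declaration, including what XmlSerDe extends/implements
>> javap -classpath hivexmlserde-1.0.5.3.jar com.ibm.spss.hive.serde2.xml.XmlSerDe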
>>
>>
>>
>> On 8. Jun 2018, at 23:52, Mich Talebzadeh <mich.talebza...@gmail.com>
>> wrote:
>>
>> OK, I am looking at this jar file:
>>
>>  jar tf hive-serde-3.0.0.jar|grep -i abstractserde
>> org/apache/hadoop/hive/serde2/AbstractSerDe.class
>>
>> Is this the correct one?
>>
>> Thanks
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 8 June 2018 at 22:34, Mich Talebzadeh <mich.talebza...@gmail.com>
>> wrote:
>>
>>> Thanks Jörn, so what is the resolution? Do I need another jar file?
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 8 June 2018 at 21:56, Jörn Franke <jornfra...@gmail.com> wrote:
>>>
>>>> Ah, I see now: SerDe is a deprecated interface; if I am not wrong, it has
>>>> been replaced by the abstract class AbstractSerDe.
>>>>
>>>> On 8. Jun 2018, at 22:22, Mich Talebzadeh <mich.talebza...@gmail.com>
>>>> wrote:
>>>>
>>>> Thanks Jorn.
>>>>
>>>> Spark 2.3.3 (labelled as stable)
>>>>
>>>> First I put the jar file hivexmlserde-1.0.5.3.jar under $HIVE_HOME/lib and
>>>> also loaded it explicitly with ADD JAR in the Hive session:
>>>>
>>>> hive> ADD JAR hdfs://rhes75:9000/jars/hivexmlserde-1.0.5.3.jar;
>>>> Added 
>>>> [/tmp/hive/7feb5165-780b-4ab6-aca8-f516d0388823_resources/hivexmlserde-1.0.5.3.jar]
>>>> to class path
>>>> Added resources: [hdfs://rhes75:9000/jars/hivexmlserde-1.0.5.3.jar]
>>>>
>>>> Then I ran the simple example given here:
>>>> <https://github.com/dvasilen/Hive-XML-SerDe/issues/41>
>>>>
>>>> hive> CREATE TABLE xml_41 (imap map<string,string>)
>>>>     > ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
>>>>     > WITH SERDEPROPERTIES (
>>>>     >   "column.xpath.imap"="/file-format/data-set/element",
>>>>     >   "xml.map.specification.element"="@name->#content"
>>>>     > )
>>>>     > STORED AS
>>>>     >   INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
>>>>     >   OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
>>>>     > TBLPROPERTIES (
>>>>     >   "xmlinput.start"="<file-format>",
>>>>     >   "xmlinput.end"="</file-format>"
>>>>     > );
>>>>
>>>> FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. org/apache/hadoop/hive/serde2/SerDe
>>>>
>>>> And this is the full error:
>>>> 2018-06-08T21:17:20,775  INFO [7feb5165-780b-4ab6-aca8-f516d0388823 main] ql.Driver: Starting task [Stage-0:DDL] in serial mode
>>>> 2018-06-08T21:17:20,776 ERROR [7feb5165-780b-4ab6-aca8-f516d0388823 main] exec.DDLTask: java.lang.NoClassDefFoundError: org/apache/hadoop/hive/serde2/SerDe
>>>>         at java.lang.ClassLoader.defineClass1(Native Method)
>>>>         at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
>>>>         at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
>>>>         at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
>>>>         at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
>>>>         at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
>>>>         at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
>>>>         at java.security.AccessController.doPrivileged(Native Method)
>>>>         at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
>>>>         at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>>>>         at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>>>>         at java.lang.ClassLoader.loadClass(ClassLoader.java:411)
>>>>         at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>>>>         at java.lang.Class.forName0(Native Method)
>>>>         at java.lang.Class.forName(Class.java:348)
>>>>         at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2134)
>>>>         at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2099)
>>>>         at org.apache.hadoop.hive.ql.exec.DDLTask.validateSerDe(DDLTask.java:4213)
>>>>         at org.apache.hadoop.hive.ql.plan.CreateTableDesc.toTable(CreateTableDesc.java:723)
>>>>         at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:4321)
>>>>         at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:354)
>>>>         at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:199)
>>>>         at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:100)
>>>>         at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:2183)
>>>>         at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1839)
>>>>         at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1526)
>>>>         at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237)
>>>>         at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1227)
>>>>         at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:233)
>>>>         at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:184)
>>>>         at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
>>>>         at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:821)
>>>>         at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:759)
>>>>         at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:686)
>>>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>         at java.lang.reflect.Method.invoke(Method.java:498)
>>>>         at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
>>>>         at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
>>>> Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hive.serde2.SerDe
>>>>         at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>>>>         at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>>>>         at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>>>>         at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>>>>         ... 40 more
>>>>
>>>> The jar file has the classes!
>>>>
>>>> jar tf hivexmlserde-1.0.5.3.jar
>>>> META-INF/
>>>> META-INF/MANIFEST.MF
>>>> com/
>>>> com/ibm/
>>>> com/ibm/spss/
>>>> com/ibm/spss/hive/
>>>> com/ibm/spss/hive/serde2/
>>>> com/ibm/spss/hive/serde2/xml/
>>>> com/ibm/spss/hive/serde2/xml/objectinspector/
>>>> com/ibm/spss/hive/serde2/xml/processor/
>>>> com/ibm/spss/hive/serde2/xml/processor/java/
>>>> com/ibm/spss/hive/serde2/xml/HiveXmlRecordReader.class
>>>> com/ibm/spss/hive/serde2/xml/objectinspector/XmlListObjectInspector.class
>>>> com/ibm/spss/hive/serde2/xml/objectinspector/XmlMapObjectInspector.class
>>>> com/ibm/spss/hive/serde2/xml/objectinspector/XmlObjectInspectorFactory$1.class
>>>> com/ibm/spss/hive/serde2/xml/objectinspector/XmlObjectInspectorFactory.class
>>>> com/ibm/spss/hive/serde2/xml/objectinspector/XmlStructObjectInspector$1.class
>>>> com/ibm/spss/hive/serde2/xml/objectinspector/XmlStructObjectInspector.class
>>>> com/ibm/spss/hive/serde2/xml/processor/AbstractXmlProcessor$1.class
>>>> com/ibm/spss/hive/serde2/xml/processor/AbstractXmlProcessor$2.class
>>>> com/ibm/spss/hive/serde2/xml/processor/AbstractXmlProcessor.class
>>>> com/ibm/spss/hive/serde2/xml/processor/java/JavaXmlProcessor$1.class
>>>> com/ibm/spss/hive/serde2/xml/processor/java/JavaXmlProcessor$2.class
>>>> com/ibm/spss/hive/serde2/xml/processor/java/JavaXmlProcessor.class
>>>> com/ibm/spss/hive/serde2/xml/processor/java/JavaXmlQuery.class
>>>> com/ibm/spss/hive/serde2/xml/processor/java/NodeArray.class
>>>> com/ibm/spss/hive/serde2/xml/processor/SerDeArray.class
>>>> com/ibm/spss/hive/serde2/xml/processor/XmlMapEntry.class
>>>> com/ibm/spss/hive/serde2/xml/processor/XmlMapFacet$Type.class
>>>> com/ibm/spss/hive/serde2/xml/processor/XmlMapFacet.class
>>>> com/ibm/spss/hive/serde2/xml/processor/XmlNode$1.class
>>>> com/ibm/spss/hive/serde2/xml/processor/XmlNode$2.class
>>>> com/ibm/spss/hive/serde2/xml/processor/XmlNode.class
>>>> com/ibm/spss/hive/serde2/xml/processor/XmlNodeArray.class
>>>> com/ibm/spss/hive/serde2/xml/processor/XmlProcessor.class
>>>> com/ibm/spss/hive/serde2/xml/processor/XmlProcessorContext.class
>>>> com/ibm/spss/hive/serde2/xml/processor/XmlQuery.class
>>>> com/ibm/spss/hive/serde2/xml/processor/XmlTransformer.class
>>>> com/ibm/spss/hive/serde2/xml/processor/XmlUtils$1.class
>>>> com/ibm/spss/hive/serde2/xml/processor/XmlUtils.class
>>>> com/ibm/spss/hive/serde2/xml/SplittableXmlInputFormat.class
>>>> com/ibm/spss/hive/serde2/xml/XmlInputFormat$XmlRecordReader.class
>>>> com/ibm/spss/hive/serde2/xml/XmlInputFormat.class
>>>> com/ibm/spss/hive/serde2/xml/XmlSerDe$1.class
>>>> com/ibm/spss/hive/serde2/xml/XmlSerDe.class
>>>> META-INF/maven/
>>>> META-INF/maven/com.ibm.spss.hive.serde2.xml/
>>>> META-INF/maven/com.ibm.spss.hive.serde2.xml/hivexmlserde/
>>>> META-INF/maven/com.ibm.spss.hive.serde2.xml/hivexmlserde/pom.xml
>>>> META-INF/maven/com.ibm.spss.hive.serde2.xml/hivexmlserde/pom.properties
>>>>
>>>>
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>>
>>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>> On 8 June 2018 at 17:58, Jörn Franke <jornfra...@gmail.com> wrote:
>>>>
>>>>> Can you get the log files and start Hive with more detailed logging (see
>>>>> the example below)? It could be that not all libraries are loaded (I don't
>>>>> remember any more, but I think this one needs more; I can look in my docs
>>>>> next week), or that it does not support maps (not sure).
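>>>>>
>>>>> For the Hive CLI, one common way to get more detailed logging is shown
>>>>> below (a sketch only; the exact option can vary by version, and
>>>>> HiveServer2 is configured differently):
>>>>>
>>>>> # run the CLI with DEBUG-level logging written to the console
>>>>> hive --hiveconf hive.root.logger=DEBUG,console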
>>>>> You can first try a simpler extraction with a single String field to see
>>>>> if it works, something like the sketch below.
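>>>>>
>>>>> A minimal sketch of such a table (the table and column names are just
>>>>> illustrative; it reuses the SerDe, xpath root and start/end tags from
>>>>> your DDL):
>>>>>
>>>>> CREATE TABLE xml_simple (element string)
>>>>> ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
>>>>> WITH SERDEPROPERTIES (
>>>>>   "column.xpath.element"="/file-format/data-set/element/text()"
>>>>> )
>>>>> STORED AS
>>>>>   INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
>>>>>   OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
>>>>> TBLPROPERTIES (
>>>>>   "xmlinput.start"="<file-format>",
>>>>>   "xmlinput.end"="</file-format>"
>>>>> );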
>>>>>
>>>>> Hive has always relied on external libraries for XML support. I used the
>>>>> one below with Hive 1.x, but it should also work with 2.x (3.x I am not
>>>>> sure about, but it should if it works in 2.x).
>>>>>
>>>>>
>>>>> On 8. Jun 2018, at 17:53, Mich Talebzadeh <mich.talebza...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> I tried Hive 2.0.1, 2.3.2 and now Hive 3.
>>>>>
>>>>> I explicitly added the hivexmlserde jar file with ADD JAR as shown below:
>>>>>
>>>>> 0: jdbc:hive2://rhes75:10099/default> ADD JAR
>>>>> hdfs://rhes75:9000/jars/hivexmlserde-1.0.5.3.jar;
>>>>> No rows affected (0.002 seconds)
>>>>>
>>>>> But I still cannot create an XML table:
>>>>>
>>>>> 0: jdbc:hive2://rhes75:10099/default> CREATE TABLE xml_41 (imap map<string,string>)
>>>>> ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
>>>>> WITH SERDEPROPERTIES (
>>>>>   "column.xpath.imap"="/file-format/data-set/element",
>>>>>   "xml.map.specification.element"="@name->#content"
>>>>> )
>>>>> STORED AS INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
>>>>> OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
>>>>> TBLPROPERTIES (
>>>>>   "xmlinput.start"="<file-format>",
>>>>>   "xmlinput.end"="</file-format>"
>>>>> );
>>>>>
>>>>> Error: Error while processing statement: FAILED: Execution Error,
>>>>> return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask.
>>>>> org/apache/hadoop/hive/serde2/SerDe (state=08S01,code=1)
>>>>>
>>>>> Does anyone know the cause of this or which version of Hive supports
>>>>> creating an XML table?
>>>>>
>>>>> Thanks
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>>
>>>>>
>>>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>
>>>>>
>>>>>
>>>>> http://talebzadehmich.wordpress.com
>>>>>
>>>>>
>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>>> any loss, damage or destruction of data or any other property which may
>>>>> arise from relying on this email's technical content is explicitly
>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>> arising from such loss, damage or destruction.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
>
>
>
