The XPath stuff works reasonably well for simple XML files. However for complex XML files that change frequently and need to be ingested in realtime you might look at a 3rd party solution, e.g. here: https://dataworkssummit.com/san-jose-2018/session/add-a-spark-to-your-etl/
On Mon, Jun 11, 2018 at 3:05 PM, kristijan berta <kik...@gmail.com> wrote: > thanks Jorn. The only alternative is to use xpath UDF? Works as shown > below but tedious > > Like the example below > > *$cat employees.xml* > <employee> > <id>1</id> > <name>Satish Kumar</name> > <designation>Technical Lead</designation> > </employee> > <employee> > <id>2</id> > <name>Ramya</name> > <designation>Testing</designation> > </employee> > > *Step:1 Bring each record to one line, by executing below command* > > $cat employees.xml | tr -d '&' | tr '\n' ' ' | tr '\r' ' ' | sed > 's|</employee>|</employee>\n|g' | grep -v '^\s*$' > employees_records.xml > > *$cat employees_records.xml* > <employee> <id>1</id> <name>Satish Kumar</name> <designation>Technical > Lead</designation> </employee> > <employee> <id>2</id> <name>Ramya</name> <designation>Testing</designation> > </employee> > > *tep:2 Load the file to HDFS* > > *$hadoop fs -mkdir /user/hive/sample-xml-inputs* > > *$hadoop fs -put employees_records.xml /user/hive/sample-xml-inputs* > > *$hadoop fs -cat /user/hive/sample-xml-inputs/employees_records.xml* > <employee> <id>1</id> <name>Satish Kumar</name><designation>Technical > Lead</designation> </employee> > <employee> <id>2</id> <name>Ramya</name> <designation>Testing</designation> > </employee> > > *Step:3 Create a Hive table and point to xml file* > > *hive>create external table xml_table_org( xmldata string) LOCATION > '/user/hive/sample-xml-inputs/';* > > *hive> select * from xml_table_org;* > *OK* > <employee> <id>1</id> <name>Satish Kumar</name> <designation>Technical > Lead</designation> </employee> > <employee> <id>2</id> <name>Ramya</name> <designation>Testing</designation> > </employee> > > *Step 4: From the stage table we can query the elements and load it to > other table.* > > *hive> CREATE TABLE xml_table AS SELECT > xpath_int(xmldata,'employee/id'),xpath_string(xmldata,'employee/name'),xpath_string(xmldata,'employee/designation') > FROM xml_table_org;* > > Dr Mich Talebzadeh > > > > LinkedIn * > https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* > > > > http://talebzadehmich.wordpress.com > > > *Disclaimer:* Use it at your own risk. Any and all responsibility for any > loss, damage or destruction of data or any other property which may arise > from relying on this email's technical content is explicitly disclaimed. > The author will in no case be liable for any monetary damages arising from > such loss, damage or destruction. > > > > On 9 June 2018 at 07:42, Jörn Franke <jornfra...@gmail.com> wrote: > >> Yes. >> >> Serde must have been removed then in 2.x. >> >> >> >> On 8. Jun 2018, at 23:52, Mich Talebzadeh <mich.talebza...@gmail.com> >> wrote: >> >> Ok I am looking at this jar file >> >> jar tf hive-serde-3.0.0.jar|grep -i abstractserde >> org/apache/hadoop/hive/serde2/AbstractSerDe.class >> >> Is this the correct one? >> >> Thanks >> >> >> Dr Mich Talebzadeh >> >> >> >> LinkedIn * >> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* >> >> >> >> http://talebzadehmich.wordpress.com >> >> >> *Disclaimer:* Use it at your own risk. Any and all responsibility for >> any loss, damage or destruction of data or any other property which may >> arise from relying on this email's technical content is explicitly >> disclaimed. The author will in no case be liable for any monetary damages >> arising from such loss, damage or destruction. >> >> >> >> On 8 June 2018 at 22:34, Mich Talebzadeh <mich.talebza...@gmail.com> >> wrote: >> >>> Thanks Jorn so what is the resolution? do I need another jar file? >>> >>> Dr Mich Talebzadeh >>> >>> >>> >>> LinkedIn * >>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* >>> >>> >>> >>> http://talebzadehmich.wordpress.com >>> >>> >>> *Disclaimer:* Use it at your own risk. Any and all responsibility for >>> any loss, damage or destruction of data or any other property which may >>> arise from relying on this email's technical content is explicitly >>> disclaimed. The author will in no case be liable for any monetary damages >>> arising from such loss, damage or destruction. >>> >>> >>> >>> On 8 June 2018 at 21:56, Jörn Franke <jornfra...@gmail.com> wrote: >>> >>>> Oha i see now Serde is a deprecated Interface , if i am not wrong it >>>> has been replaced by the abstract class abstractserde >>>> >>>> On 8. Jun 2018, at 22:22, Mich Talebzadeh <mich.talebza...@gmail.com> >>>> wrote: >>>> >>>> Thanks Jorn. >>>> >>>> Spark 2.3.3 (labelled as stable) >>>> >>>> First I put the jar file hivexmlserde-1.0.5.3.jar under $HIVE_HOME/lib >>>> and explicitly loaded with ADD JAR as well in hive session >>>> >>>> hive> ADD JAR hdfs://rhes75:9000/jars/hivexmlserde-1.0.5.3.jar; >>>> Added >>>> [/tmp/hive/7feb5165-780b-4ab6-aca8-f516d0388823_resources/hivexmlserde-1.0.5.3.jar] >>>> to class path >>>> Added resources: [hdfs://rhes75:9000/jars/hivexmlserde-1.0.5.3.jar] >>>> >>>> Then I ran a simple code given here >>>> <https://github.com/dvasilen/Hive-XML-SerDe/issues/41> >>>> >>>> hive> CREATE TABLE xml_41 (imap map<string,string>) > ROW FORMAT >>>> SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe' > WITH >>>> SERDEPROPERTIES ( > >>>> "column.xpath.imap"="/file-format/data-set/element", >>>> > "xml.map.specification.element"="@name->#content" > ) > >>>> STORED AS > INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat' >>>> > OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat' >>>> > TBLPROPERTIES ( > "xmlinput.start"="<file-format>", > >>>> "xmlinput.end"="</file-format>" > ); FAILED: Execution Error, >>>> return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. >>>> org/apache/hadoop/hive/serde2/SerDe And this is full error >>>> 2018-06-08T21:17:20,775 INFO [7feb5165-780b-4ab6-aca8-f516d0388823 >>>> main] ql.Driver: Starting task [Stage-0:DDL] in serial mode >>>> 2018-06-08T21:17:20,776 ERROR [7feb5165-780b-4ab6-aca8-f516d0388823 >>>> main] exec.DDLTask: java.lang.NoClassDefFoundError: >>>> org/apache/hadoop/hive/serde2/SerDe at >>>> java.lang.ClassLoader.defineClass1(Native Method) at >>>> java.lang.ClassLoader.defineClass(ClassLoader.java:763) at >>>> java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) >>>> at java.net.URLClassLoader.defineClass(URLClassLoader.java:467) >>>> at java.net.URLClassLoader.access$100(URLClassLoader.java:73) >>>> at java.net.URLClassLoader$1.run(URLClassLoader.java:368) >>>> at java.net.URLClassLoader$1.run(URLClassLoader.java:362) >>>> at java.security.AccessController.doPrivileged(Native Method) >>>> at java.net.URLClassLoader.findClass(URLClassLoader.java:361) >>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424) >>>> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331) >>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:411) >>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357) >>>> at java.lang.Class.forName0(Native Method) at >>>> java.lang.Class.forName(Class.java:348) at >>>> org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2134) >>>> at >>>> org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2099) >>>> at >>>> org.apache.hadoop.hive.ql.exec.DDLTask.validateSerDe(DDLTask.java:4213) >>>> at >>>> org.apache.hadoop.hive.ql.plan.CreateTableDesc.toTable(CreateTableDesc.java:723) >>>> at >>>> org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:4321) >>>> at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:354) >>>> at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:199) >>>> at >>>> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:100) >>>> at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:2183) >>>> at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1839) >>>> at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1526) >>>> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237) >>>> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1227) >>>> at >>>> org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:233) >>>> at >>>> org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:184) >>>> at >>>> org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403) >>>> at >>>> org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:821) >>>> at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:759) >>>> at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:686) >>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) >>>> at >>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) >>>> at >>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) >>>> at java.lang.reflect.Method.invoke(Method.java:498) at >>>> org.apache.hadoop.util.RunJar.run(RunJar.java:221) at >>>> org.apache.hadoop.util.RunJar.main(RunJar.java:136) Caused by: >>>> java.lang.ClassNotFoundException: org.apache.hadoop.hive.serde2.SerDe >>>> at java.net.URLClassLoader.findClass(URLClassLoader.java:381) >>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at >>>> sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331) >>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357) ... >>>> 40 more The jar file has the classes! >>>> >>>> jar tf hivexmlserde-1.0.5.3.jar >>>> META-INF/ >>>> META-INF/MANIFEST.MF >>>> com/ >>>> com/ibm/ >>>> com/ibm/spss/ >>>> com/ibm/spss/hive/ >>>> com/ibm/spss/hive/serde2/ >>>> com/ibm/spss/hive/serde2/xml/ >>>> com/ibm/spss/hive/serde2/xml/objectinspector/ >>>> com/ibm/spss/hive/serde2/xml/processor/ >>>> com/ibm/spss/hive/serde2/xml/processor/java/ >>>> com/ibm/spss/hive/serde2/xml/HiveXmlRecordReader.class >>>> com/ibm/spss/hive/serde2/xml/objectinspector/XmlListObjectIn >>>> spector.class >>>> com/ibm/spss/hive/serde2/xml/objectinspector/XmlMapObjectIns >>>> pector.class >>>> com/ibm/spss/hive/serde2/xml/objectinspector/XmlObjectInspec >>>> torFactory$1.class >>>> com/ibm/spss/hive/serde2/xml/objectinspector/XmlObjectInspec >>>> torFactory.class >>>> com/ibm/spss/hive/serde2/xml/objectinspector/XmlStructObject >>>> Inspector$1.class >>>> com/ibm/spss/hive/serde2/xml/objectinspector/XmlStructObject >>>> Inspector.class >>>> com/ibm/spss/hive/serde2/xml/processor/AbstractXmlProcessor$1.class >>>> com/ibm/spss/hive/serde2/xml/processor/AbstractXmlProcessor$2.class >>>> com/ibm/spss/hive/serde2/xml/processor/AbstractXmlProcessor.class >>>> com/ibm/spss/hive/serde2/xml/processor/java/JavaXmlProcessor$1.class >>>> com/ibm/spss/hive/serde2/xml/processor/java/JavaXmlProcessor$2.class >>>> com/ibm/spss/hive/serde2/xml/processor/java/JavaXmlProcessor.class >>>> com/ibm/spss/hive/serde2/xml/processor/java/JavaXmlQuery.class >>>> com/ibm/spss/hive/serde2/xml/processor/java/NodeArray.class >>>> com/ibm/spss/hive/serde2/xml/processor/SerDeArray.class >>>> com/ibm/spss/hive/serde2/xml/processor/XmlMapEntry.class >>>> com/ibm/spss/hive/serde2/xml/processor/XmlMapFacet$Type.class >>>> com/ibm/spss/hive/serde2/xml/processor/XmlMapFacet.class >>>> com/ibm/spss/hive/serde2/xml/processor/XmlNode$1.class >>>> com/ibm/spss/hive/serde2/xml/processor/XmlNode$2.class >>>> com/ibm/spss/hive/serde2/xml/processor/XmlNode.class >>>> com/ibm/spss/hive/serde2/xml/processor/XmlNodeArray.class >>>> com/ibm/spss/hive/serde2/xml/processor/XmlProcessor.class >>>> com/ibm/spss/hive/serde2/xml/processor/XmlProcessorContext.class >>>> com/ibm/spss/hive/serde2/xml/processor/XmlQuery.class >>>> com/ibm/spss/hive/serde2/xml/processor/XmlTransformer.class >>>> com/ibm/spss/hive/serde2/xml/processor/XmlUtils$1.class >>>> com/ibm/spss/hive/serde2/xml/processor/XmlUtils.class >>>> com/ibm/spss/hive/serde2/xml/SplittableXmlInputFormat.class >>>> com/ibm/spss/hive/serde2/xml/XmlInputFormat$XmlRecordReader.class >>>> com/ibm/spss/hive/serde2/xml/XmlInputFormat.class >>>> com/ibm/spss/hive/serde2/xml/XmlSerDe$1.class >>>> com/ibm/spss/hive/serde2/xml/XmlSerDe.class >>>> META-INF/maven/ >>>> META-INF/maven/com.ibm.spss.hive.serde2.xml/ >>>> META-INF/maven/com.ibm.spss.hive.serde2.xml/hivexmlserde/ >>>> META-INF/maven/com.ibm.spss.hive.serde2.xml/hivexmlserde/pom.xml >>>> META-INF/maven/com.ibm.spss.hive.serde2.xml/hivexmlserde/pom.properties >>>> >>>> >>>> >>>> Dr Mich Talebzadeh >>>> >>>> >>>> >>>> LinkedIn * >>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* >>>> >>>> >>>> >>>> http://talebzadehmich.wordpress.com >>>> >>>> >>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for >>>> any loss, damage or destruction of data or any other property which may >>>> arise from relying on this email's technical content is explicitly >>>> disclaimed. The author will in no case be liable for any monetary damages >>>> arising from such loss, damage or destruction. >>>> >>>> >>>> >>>> On 8 June 2018 at 17:58, Jörn Franke <jornfra...@gmail.com> wrote: >>>> >>>>> Can you get the log files and start Hive with more detailled logs? >>>>> In could be that not all libraries are loaded (i don’t remember >>>>> anymore but I think this one needs more , I can look next week in my docs) >>>>> or that it does not support maps (not sure). >>>>> You can try first with a more simpler extraction with a String field >>>>> to see if it works . >>>>> >>>>> Hive has always had external libraries for xml support and I used the >>>>> one below with Hive 1.x, but it should also work with 2.x (3 not sure, but >>>>> it should if it works in 2.x) >>>>> >>>>> >>>>> On 8. Jun 2018, at 17:53, Mich Talebzadeh <mich.talebza...@gmail.com> >>>>> wrote: >>>>> >>>>> I tried Hive 2.0.1, 2.3.2 and now Hive 3/ >>>>> >>>>> I explicitly added hivexmlserde jar file as ADD JAR shown below >>>>> >>>>> 0: jdbc:hive2://rhes75:10099/default> ADD JAR >>>>> hdfs://rhes75:9000/jars/hivexmlserde-1.0.5.3.jar; >>>>> No rows affected (0.002 seconds) >>>>> >>>>> But still cannot create an xml table >>>>> >>>>> 0: jdbc:hive2://rhes75:10099/default> CREATE TABLE xml_41 (imap >>>>> map<string,string>) ROW FORMAT SERDE >>>>> 'com.ibm.spss.hive.serde2.xml.XmlSerDe' >>>>> WITH SERDEPROPERTIES ("column.xpath.imap"="/file-fo >>>>> rmat/data-set/element","xml.map.specification.element"="@name->#content") >>>>> STORED AS INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat' >>>>> OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat' >>>>> TBLPROPERTIES ("xmlinput.start"="<file-forma >>>>> t>","xmlinput.end"="</file-format>"); >>>>> >>>>> Error: Error while processing statement: FAILED: Execution Error, >>>>> return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. >>>>> org/apache/hadoop/hive/serde2/SerDe (state=08S01,code=1) >>>>> >>>>> Does anyone know the cause of this or which version of Hive supports >>>>> creating an XML table? >>>>> >>>>> Thanks >>>>> >>>>> Dr Mich Talebzadeh >>>>> >>>>> >>>>> >>>>> LinkedIn * >>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* >>>>> >>>>> >>>>> >>>>> http://talebzadehmich.wordpress.com >>>>> >>>>> >>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for >>>>> any loss, damage or destruction of data or any other property which may >>>>> arise from relying on this email's technical content is explicitly >>>>> disclaimed. The author will in no case be liable for any monetary damages >>>>> arising from such loss, damage or destruction. >>>>> >>>>> >>>>> >>>>> >>>> >>> >> > > > >