Hi All, greetings! I need some help reading a Hive table from PySpark where the table's transactional property is set to 'true' (in other words, the ACID properties are enabled). The full stack trace and the table definition are below. Could you please help me resolve this error?
18/03/01 11:06:22 INFO BlockManagerMaster: Registered BlockManager
18/03/01 11:06:22 INFO EventLoggingListener: Logging events to hdfs:///spark-history/local-1519923982155
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.6.3
      /_/

Using Python version 2.7.12 (default, Jul  2 2016 17:42:40)
SparkContext available as sc, HiveContext available as sqlContext.
>>> from pyspark.sql import HiveContext
>>> hive_context = HiveContext(sc)
>>> hive_context.sql("select count(*) from load_etl.trpt_geo_defect_prod_dec07_del_blank").show()
18/03/01 11:09:45 INFO HiveContext: Initializing execution hive, version 1.2.1
18/03/01 11:09:45 INFO ClientWrapper: Inspected Hadoop version: 2.7.3.2.6.0.3-8
18/03/01 11:09:45 INFO ClientWrapper: Loaded org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version 2.7.3.2.6.0.3-8
18/03/01 11:09:46 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
18/03/01 11:09:46 INFO ObjectStore: ObjectStore, initialize called
18/03/01 11:09:46 INFO Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
18/03/01 11:09:46 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored
18/03/01 11:09:50 INFO ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
18/03/01 11:09:50 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
18/03/01 11:09:50 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
18/03/01 11:09:53 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
18/03/01 11:09:53 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
18/03/01 11:09:54 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY
18/03/01 11:09:54 INFO ObjectStore: Initialized ObjectStore
18/03/01 11:09:54 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
18/03/01 11:09:54 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
18/03/01 11:09:54 INFO HiveMetaStore: Added admin role in metastore
18/03/01 11:09:54 INFO HiveMetaStore: Added public role in metastore
18/03/01 11:09:55 INFO HiveMetaStore: No user is added in admin role, since config is empty
18/03/01 11:09:55 INFO HiveMetaStore: 0: get_all_databases
18/03/01 11:09:55 INFO audit: ugi=devu...@ip.com ip=unknown-ip-addr cmd=get_all_databases
18/03/01 11:09:55 INFO HiveMetaStore: 0: get_functions: db=default pat=*
18/03/01 11:09:55 INFO audit: ugi=devu...@ip.com ip=unknown-ip-addr cmd=get_functions: db=default pat=*
18/03/01 11:09:55 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MResourceUri" is tagged as "embedded-only" so does not have its own datastore table.
18/03/01 11:09:55 INFO SessionState: Created local directory: /tmp/22ea9ac9-23d1-4247-9e02-ce45809cd9ae_resources
18/03/01 11:09:55 INFO SessionState: Created HDFS directory: /tmp/hive/hdetldev/22ea9ac9-23d1-4247-9e02-ce45809cd9ae
18/03/01 11:09:55 INFO SessionState: Created local directory: /tmp/hdetldev/22ea9ac9-23d1-4247-9e02-ce45809cd9ae
18/03/01 11:09:55 INFO SessionState: Created HDFS directory: /tmp/hive/hdetldev/22ea9ac9-23d1-4247-9e02-ce45809cd9ae/_tmp_space.db
18/03/01 11:09:55 INFO HiveContext: default warehouse location is /user/hive/warehouse
18/03/01 11:09:55 INFO HiveContext: Initializing HiveMetastoreConnection version 1.2.1 using Spark classes.
18/03/01 11:09:55 INFO ClientWrapper: Inspected Hadoop version: 2.7.3.2.6.0.3-8
18/03/01 11:09:55 INFO ClientWrapper: Loaded org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version 2.7.3.2.6.0.3-8
18/03/01 11:09:56 INFO metastore: Trying to connect to metastore with URI thrift://ip.com:9083
18/03/01 11:09:56 INFO metastore: Connected to metastore.
18/03/01 11:09:56 INFO SessionState: Created local directory: /tmp/24379bb3-8ddf-4716-b68d-07ac0f92d9f1_resources
18/03/01 11:09:56 INFO SessionState: Created HDFS directory: /tmp/hive/hdetldev/24379bb3-8ddf-4716-b68d-07ac0f92d9f1
18/03/01 11:09:56 INFO SessionState: Created local directory: /tmp/hdetldev/24379bb3-8ddf-4716-b68d-07ac0f92d9f1
18/03/01 11:09:56 INFO SessionState: Created HDFS directory: /tmp/hive/hdetldev/24379bb3-8ddf-4716-b68d-07ac0f92d9f1/_tmp_space.db
18/03/01 11:09:56 INFO ParseDriver: Parsing command: select count(*) from load_etl.trpt_geo_defect_prod_dec07_del_blank
18/03/01 11:09:57 INFO ParseDriver: Parse Completed
18/03/01 11:09:57 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 813.6 KB, free 510.3 MB)
18/03/01 11:09:57 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 57.5 KB, free 510.3 MB)
18/03/01 11:09:57 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:35508 (size: 57.5 KB, free: 511.1 MB)
18/03/01 11:09:57 INFO SparkContext: Created broadcast 0 from showString at NativeMethodAccessorImpl.java:-2
18/03/01 11:09:58 INFO PerfLogger: <PERFLOG method=OrcGetSplits from=org.apache.hadoop.hive.ql.io.orc.ReaderImpl>
18/03/01 11:09:58 INFO deprecation: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/hdp/current/spark-client/python/pyspark/sql/dataframe.py", line 257, in show
    print(self._jdf.showString(n, truncate))
  File "/var/opt/teradata/anaconda4.1.1/anaconda/lib/python2.7/site-packages/py4j-0.10.6-py2.7.egg/py4j/java_gateway.py", line 1160, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/hdp/current/spark-client/python/pyspark/sql/utils.py", line 45, in deco
    return f(*a, **kw)
  File "/var/opt/teradata/anaconda4.1.1/anaconda/lib/python2.7/site-packages/py4j-0.10.6-py2.7.egg/py4j/protocol.py", line 320, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o44.showString.
: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], output=[_c0#60L])
+- TungstenExchange SinglePartition, None
   +- TungstenAggregate(key=[], functions=[(count(1),mode=Partial,isDistinct=false)], output=[count#63L])
      +- HiveTableScan MetastoreRelation load_etl, trpt_geo_defect_prod_dec07_del_blank, None
    at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49)
    at org.apache.spark.sql.execution.aggregate.TungstenAggregate.doExecute(TungstenAggregate.scala:80)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
    at org.apache.spark.sql.execution.ConvertToSafe.doExecute(rowFormatConverters.scala:56)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
    at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:187)
    at org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165)
    at org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174)
    at org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1500)
    at org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1500)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
    at org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2087)
    at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1499)
    at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1506)
    at org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1376)
    at org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1375)
    at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2100)
    at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1375)
    at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1457)
    at org.apache.spark.sql.DataFrame.showString(DataFrame.scala:170)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:209)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
TungstenExchange SinglePartition, None
+- TungstenAggregate(key=[], functions=[(count(1),mode=Partial,isDistinct=false)], output=[count#63L])
   +- HiveTableScan MetastoreRelation load_etl, trpt_geo_defect_prod_dec07_del_blank, None
    at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49)
    at org.apache.spark.sql.execution.Exchange.doExecute(Exchange.scala:247)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
    at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.apply(TungstenAggregate.scala:86)
    at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.apply(TungstenAggregate.scala:80)
    at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:48)
    ... 36 more
Caused by: java.lang.RuntimeException: serious problem
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1021)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
    at org.apache.spark.ShuffleDependency.<init>(Dependency.scala:91)
    at org.apache.spark.sql.execution.Exchange.prepareShuffleDependency(Exchange.scala:220)
    at org.apache.spark.sql.execution.Exchange$$anonfun$doExecute$1.apply(Exchange.scala:254)
    at org.apache.spark.sql.execution.Exchange$$anonfun$doExecute$1.apply(Exchange.scala:248)
    at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:48)
    ... 44 more
Caused by: java.util.concurrent.ExecutionException: java.lang.NumberFormatException: For input string: "0003024_0000"
    at java.util.concurrent.FutureTask.report(FutureTask.java:122)
    at java.util.concurrent.FutureTask.get(FutureTask.java:192)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:998)
    ... 75 more
Caused by: java.lang.NumberFormatException: For input string: "0003024_0000"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.lang.Long.parseLong(Long.java:589)
    at java.lang.Long.parseLong(Long.java:631)
    at org.apache.hadoop.hive.ql.io.AcidUtils.parseDelta(AcidUtils.java:310)
    at org.apache.hadoop.hive.ql.io.AcidUtils.getAcidState(AcidUtils.java:379)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator.call(OrcInputFormat.java:634)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator.call(OrcInputFormat.java:620)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    ... 1 more

Here is the detail of the table creation:

Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.2.1000.2.6.0.3-8 by Apache Hive
0: jdbc:hive2://toplxhdmd001.rights.com> show create table load_etl.trpt_geo_defect_prod_dec07_del_blank;

CREATE TABLE `load_etl.trpt_geo_defect_prod_dec07_del_blank`(
  `line_seg_nbr` int,
  `track_type` string,
  `track_sdtk_nbr` string,
  `mile_post_beg` double,
  `ss_nbr` int,
  `ss_len` int,
  `ris1mpb` double,
  `mile_label` string,
  `test_dt` string,
  `def_prty` string,
  `def_nbr` int,
  `def_type` string,
  `def_ampltd` double,
  `def_lgth` int,
  `car_cd` string,
  `tsc_cd` string,
  `class` string,
  `test_fspd` string,
  `test_pspd` string,
  `restr_fspd` string,
  `restr_pspd` string,
  `def_land_mark` string,
  `repeat_cd` string,
  `mp_incr_cd` string,
  `test_trk_dir` string,
  `eff_dt` string,
  `trk_file` string,
  `dfct_cor_dt` string,
  `dfct_acvt` string,
  `dfct_slw_ord_ind` string,
  `emp_id` string,
  `eff_ts` string,
  `dfct_cor_tm` string,
  `dfct_freight_spd` int,
  `dfct_amtrak_spd` int,
  `mile_post_sfx` string,
  `work_order_id` string,
  `loc_id_beg` string,
  `loc_id_end` string,
  `link_id` string,
  `lst_maint_ts` string,
  `del_ts` string,
  `gps_longitude` double,
  `gps_latitude` double,
  `geo_car_nme` string,
  `rept_gc_nme` string,
  `rept_dfct_tst` string,
  `rept_dfct_nbr` int,
  `restr_trk_cls` string,
  `tst_hist_cd` string,
  `cret_ts` string,
  `ylw_grp_nbr` int,
  `geo_dfct_grp_nme` string,
  `supv_rollup_cd` string,
  `dfct_stat_cd` string,
  `lst_maint_id` string,
  `del_rsn_cd` string,
  `umt_prcs_user_id` string,
  `gdfct_vinsp_srestr` string,
  `gc_opr_init` string)
CLUSTERED BY (
  geo_car_nme)
INTO 2 BUCKETS
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  'hdfs://HADOOP02/apps/hive/warehouse/load_etl.db/trpt_geo_defect_prod_dec07_del_blank'
TBLPROPERTIES (
  'numFiles'='4',
  'numRows'='0',
  'rawDataSize'='0',
  'totalSize'='2566942',
  'transactional'='true',
  'transient_lastDdlTime'='1518695199')

Thanks,
D
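P.S. In case it helps narrow things down: the innermost failure is Long.parseLong("0003024_0000") inside AcidUtils.parseDelta, i.e. while listing the table location the ORC ACID reader hit a directory-name fragment it could not interpret as a transaction id. A rough pure-Python illustration of that parse (the parse_long helper is mine, just to mimic the Java behaviour, not anything from Hive):

```python
import re

def parse_long(s):
    """Rough stand-in for Java's Long.parseLong: accept only an optional
    sign followed by decimal digits. (Python 3's own int() is looser here;
    it would treat '_' as a digit separator, so a regex check is used.)"""
    if not re.fullmatch(r"[+-]?\d+", s):
        raise ValueError('For input string: "%s"' % s)
    return int(s)

print(parse_long("0003024"))       # a plain transaction id parses fine -> 3024
try:
    parse_long("0003024_0000")     # the fragment from the stack trace
except ValueError as e:
    print(e)                       # For input string: "0003024_0000"
```

So it looks to me like the reader is choking on how the delta files under the table location are named, rather than on the query itself.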