RE: HiveContext, schemaRDD.printSchema get different dataTypes, feature or a bug? really strange and surprised...
Thanks a lot, Hao, this finally solved the problem. The changes to CSVSerDe are here:
https://github.com/chutium/csv-serde/commit/22c667c003e705613c202355a8791978d790591e

By the way, "add jar" never works for us in Spark Hive or in hive-thriftserver, so we build Spark with libraryDependencies += csv-serde ...

Or maybe we should try adding it to SPARK_CLASSPATH?

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/HiveContext-schemaRDD-printSchema-get-different-dataTypes-feature-or-a-bug-really-strange-and-surpri-tp8035p8166.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
Re: HiveContext, schemaRDD.printSchema get different dataTypes, feature or a bug? really strange and surprised...
Hi Cheng,

thank you very much for helping me finally uncover the secret behind this magic...

Actually, we defined this external table with

    SID STRING
    REQUEST_ID STRING
    TIMES_DQ TIMESTAMP
    TOTAL_PRICE FLOAT
    ...

Using "desc table ext_fullorders", the columns are only shown as

    [# col_name     data_type    comment ]
    ...
    [times_dq       string       from deserializer ]
    [total_price    string       from deserializer ]
    ...

because, as you said, CSVSerde sets all field object inspectors to javaStringObjectInspector, which is why the comments say "from deserializer".

The StorageDescriptor, however, holds the real user-defined types. Using "desc extended table ext_fullorders" we can see that its sd:StorageDescriptor contains:

    FieldSchema(name:times_dq, type:timestamp, comment:null), FieldSchema(name:total_price, type:float, comment:null)

and Spark's HiveContext reads the schema info from this StorageDescriptor:
https://github.com/apache/spark/blob/7e191fe29bb09a8560cd75d453c4f7f662dff406/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L316

So in the SchemaRDD, the fields in each Row were filled with strings (via fillObject, all of the values were retrieved from CSVSerDe with javaStringObjectInspector), but Spark considers some of them to be float or timestamp (the schema info was taken from sd:StorageDescriptor).

Crazy... and sorry for the update on the weekend...

A little more about how I found this problem and why it causes trouble for us.

We use the new Spark thrift server. Querying normal managed Hive tables works fine, but when we try to access external tables with a custom SerDe such as this CSVSerDe, we get a ClassCastException such as:

    java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Float

The reason is here:
https://github.com/apache/spark/blob/d94a44d7caaf3fe7559d9ad7b10872fa16cf81ca/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/server/SparkSQLOperationManager.scala#L104-L105

Spark's thrift server tries to get a float value from the SparkRow, because in the schema info (sd:StorageDescriptor) this column is float, but in the SparkRow this field was actually filled with a string value...
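The mismatch described in this message can be reproduced in miniature without Spark or Hive. The following is only a sketch under stated assumptions: the class SchemaMismatchDemo, its methods, and the semicolon-separated input line are hypothetical stand-ins and are not Spark's Row, Hive's SerDe interface, or the real file contents. It shows a "deserializer" that returns every field as a String while the caller trusts a declared type of float, so the typed getter fails in the same way as row.getFloat(2).

```java
import java.util.regex.Pattern;

// Hypothetical model of the problem; not Spark or Hive code.
final class SchemaMismatchDemo {
    // Declared column types, standing in for what the StorageDescriptor reports.
    static final String[] DECLARED_TYPES = {"string", "string", "float", "timestamp"};

    // Stand-in for a SerDe whose object inspectors are all string inspectors:
    // every field comes back as a java.lang.String, whatever the declared type is.
    static Object[] deserializeAllAsStrings(String line, char separator) {
        String[] parts = line.split(Pattern.quote(String.valueOf(separator)), -1);
        Object[] row = new Object[parts.length];
        System.arraycopy(parts, 0, row, 0, parts.length);
        return row;
    }

    // Typed getter that trusts the declared schema, like a getFloat call on a row.
    static float getFloat(Object[] row, int ordinal) {
        return (Float) row[ordinal]; // ClassCastException if the slot holds a String
    }

    static String getString(Object[] row, int ordinal) {
        return (String) row[ordinal];
    }

    public static void main(String[] args) {
        Object[] row = deserializeAllAsStrings(
                "51402123123;12344000123454;240.00;2014-04-14 00:03:49.082000", ';');
        System.out.println("declared type: " + DECLARED_TYPES[2]
                + ", runtime class: " + row[2].getClass().getName());
        System.out.println("getString works: " + getString(row, 2));
        try {
            getFloat(row, 2);
        } catch (ClassCastException e) {
            System.out.println("getFloat fails with a ClassCastException");
        }
    }
}
```

The string getter succeeds and the float getter throws, which matches the behaviour reported for row.getString(2) and row.getFloat(2) in this thread.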
RE: HiveContext, schemaRDD.printSchema get different dataTypes, feature or a bug? really strange and surprised...
Yes, the root cause is that the output ObjectInspector in the SerDe implementation doesn't reflect the real type info. Hive actually provides an API, TypeInfoUtils.getStandardJavaObjectInspectorFromTypeInfo(TypeInfo), for this mapping. You probably need to update the code here:
https://github.com/ogrodnek/csv-serde/blob/master/src/main/java/com/bizo/hive/serde/csv/CSVSerde.java#L60

-----Original Message-----
From: chutium [mailto:teng@gmail.com]
Sent: Monday, September 01, 2014 2:58 AM
To: d...@spark.incubator.apache.org
Subject: Re: HiveContext, schemaRDD.printSchema get different dataTypes, feature or a bug? really strange and surprised...
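Hao's point, that the deserializer's output should agree with the declared column types, can be illustrated with plain Java. This is a sketch only and deliberately does not use the Hive API: TypedFieldConverters, converterFor, and convertRow are hypothetical names, and the type-name switch stands in for whatever mapping TypeInfoUtils.getStandardJavaObjectInspectorFromTypeInfo provides in a real SerDe. The assumption illustrated is that once each column is described with its declared type, the values handed back have to be converted to match.

```java
import java.sql.Timestamp;
import java.util.Locale;
import java.util.function.Function;

// Hypothetical stand-in for a type-aware deserializer; not the Hive API.
final class TypedFieldConverters {
    // Pick a converter from the declared type name (only three types in this sketch).
    static Function<String, Object> converterFor(String declaredType) {
        switch (declaredType.toLowerCase(Locale.ROOT)) {
            case "string":
                return s -> s;
            case "float":
                return s -> Float.valueOf(s.trim());
            case "timestamp":
                // Expects the JDBC escape format yyyy-mm-dd hh:mm:ss[.fffffffff]
                return s -> Timestamp.valueOf(s.trim());
            default:
                throw new IllegalArgumentException(
                        "type not covered by this sketch: " + declaredType);
        }
    }

    // Convert the raw CSV fields so that each value matches its declared type.
    static Object[] convertRow(String[] rawFields, String[] declaredTypes) {
        if (rawFields.length != declaredTypes.length) {
            throw new IllegalArgumentException("field count does not match schema");
        }
        Object[] row = new Object[rawFields.length];
        for (int i = 0; i < rawFields.length; i++) {
            row[i] = converterFor(declaredTypes[i]).apply(rawFields[i]);
        }
        return row;
    }

    public static void main(String[] args) {
        Object[] row = convertRow(
                new String[] {"51402123123", "240.00", "2014-04-14 00:03:49.082000"},
                new String[] {"string", "float", "timestamp"});
        for (Object value : row) {
            System.out.println(value.getClass().getSimpleName() + ": " + value);
        }
    }
}
```

With values converted this way, a typed getter that trusts the declared schema finds a Float where it expects one. In this sketch a malformed value makes the converter throw; how a real SerDe should treat such values is a separate decision that this example does not settle.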
Re: HiveContext, schemaRDD.printSchema get different dataTypes, feature or a bug? really strange and surprised...
I believe in your case, the “magic” happens in TableReader.fillObject:
https://github.com/apache/spark/blob/4fa2fda88fc7beebb579ba808e400113b512533b/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L706-L712

Here we unwrap the field value according to the object inspector of that field. It seems that somehow a FloatObjectInspector is specified for the total_price field. I don’t think CSVSerde is responsible for this, since it sets all field object inspectors to javaStringObjectInspector (here: https://github.com/ogrodnek/csv-serde/blob/f315c1ae4b21a8288eb939e7c10f3b29c1a854ef/src/main/java/com/bizo/hive/serde/csv/CSVSerde.java#L59-L61 ).

Which version of Spark SQL are you using? If you are using a snapshot version, please provide the exact Git commit hash.

Thanks!

On Tue, Aug 26, 2014 at 8:29 AM, chutium teng@gmail.com wrote:

oops, i tried on a managed table, column types will not be changed

so it is mostly due to the serde lib CSVSerDe ( https://github.com/ogrodnek/csv-serde/blob/master/src/main/java/com/bizo/hive/serde/csv/CSVSerde.java#L123 ) or maybe CSVReader from opencsv?...

but if the columns are defined as string, no matter what type returned from custom SerDe or CSVReader, they should be cast to string at the end right?

why do not use the schema from hive metadata directly?
HiveContext, schemaRDD.printSchema get different dataTypes, feature or a bug? really strange and surprised...
Is there any dataType auto-conversion or detection in HiveContext?

All columns of a table are defined as string in the Hive metastore. One column is total_price, with values like 123.45, and this column is then recognized as dataType Float in HiveContext...

Is this a feature or a bug? It really surprised me... How is it implemented? If it is a feature, can I turn it off? I want to get a schemaRDD with exactly the same datatypes as defined in the Hive metadata. I know the column total_price should contain float values, but that is not guaranteed. What happens if there is a broken line in my huge CSV file? Or what if some total_price is 9,123.45 or $123.45 or something similar?

==

An example of this in our environment: MapR v3 cluster, newest Spark GitHub master cloned yesterday, built with

    sbt/sbt -Dhadoop.version=1.0.3-mapr-3.0.3 -Phive assembly

hive-site.xml is configured.

==

spark-shell script:

    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    hiveContext.sql("use our_live_db")
    hiveContext.sql("desc formatted et_fullorders").collect.foreach(println)
    ...
    ...
    14/08/26 15:47:09 INFO SparkContext: Job finished: collect at SparkPlan.scala:85, took 0.0305408 s
    [# col_name        data_type    comment ]
    []
    [sid               string       from deserializer ]
    [request_id        string       from deserializer ]
    [*times_dq         string*      from deserializer ]
    [*total_price      string*      from deserializer ]
    [order_id          string       from deserializer ]
    []
    [# Partition Information ]
    [# col_name        data_type    comment ]
    []
    [wt_date           string       None ]
    [country           string       None ]
    []
    [# Detailed Table Information ]
    [Database:         our_live_db ]
    [Owner:            client02 ]
    [CreateTime:       Fri Jan 31 12:23:40 CET 2014 ]
    [LastAccessTime:   UNKNOWN ]
    [Protect Mode:     None ]
    [Retention:        0 ]
    [Location:         maprfs:/mapr/cluster01.xxx.net/common/external_tables/et_fullorders ]
    [Table Type:       EXTERNAL_TABLE ]
    [Table Parameters: ]
    [    EXTERNAL                 TRUE ]
    [    transient_lastDdlTime    1391167420 ]
    []
    [# Storage Information ]
    [SerDe Library:    com.bizo.hive.serde.csv.CSVSerde ]
    [InputFormat:      org.apache.hadoop.mapred.TextInputFormat ]
    [OutputFormat:     org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat ]
    [Compressed:       No ]
    [Num Buckets:      -1 ]
    [Bucket Columns:   [] ]
    [Sort Columns:     [] ]
    [Storage Desc Params: ]
    [    separatorChar            ; ]
    [    serialization.format     1 ]

Then create a schemaRDD from this table:

    val result = hiveContext.sql("select sid, order_id, total_price, times_dq from et_fullorders where wt_date='2014-04-14' and country='uk' limit 5")

OK, now printSchema...

    scala> result.printSchema
    root
     |-- sid: string (nullable = true)
     |-- order_id: string (nullable = true)
     |-- *total_price: float* (nullable = true)
     |-- *times_dq: timestamp* (nullable = true)

total_price was STRING, but in the schemaRDD it is now FLOAT, and times_dq is now TIMESTAMP.

Really strange and surprising...

Even stranger:

    scala> result.map(row => row.getString(2)).collect.foreach(println)

I got

    240.00
    45.83
    21.67
    95.83
    120.83

but

    scala> result.map(row => row.getFloat(2)).collect.foreach(println)
    14/08/26 16:01:24 ERROR Executor: Exception in task 0.0 in stage 9.0 (TID 8)
    java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Float
        at scala.runtime.BoxesRunTime.unboxToFloat(BoxesRunTime.java:114)

==

By the way, the files in this external table are gzipped CSV files:

    14/08/26 15:49:56 INFO HadoopRDD: Input split: maprfs:/mapr/cluster01.xxx.net/common/external_tables/et_fullorders/wt_date=2014-04-14/country=uk/getFullOrders_2014-04-14.csv.gz:0+16990

and the data in them:

    scala> result.collect.foreach(println)
    [51402123123,12344000123454,240.00,2014-04-14 00:03:49.082000]
    [51402110123,12344000123455,45.83,2014-04-14 00:04:13.639000]
    [51402129123,12344000123458,21.67,2014-04-14 00:09:12.276000]
    [51402092123,12344000132457,95.83,2014-04-14 00:09:42.228000]
    [51402135123,12344000123460,120.83,2014-04-14 00:12:44.742000]

We use CSVSerDe: https://drone.io/github.com/ogrodnek/csv-serde/files/target/csv-serde-1.1.2-0.11.0-all.jar
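On the question of broken lines and values such as 9,123.45 or $123.45: the sketch below does not show what Hive or Spark do with such rows. It only shows, under the assumption that a value is parsed with the standard Java float parser, that these strings are rejected, and one possible way to surface that as an empty result instead of an exception. LenientPriceParser and parsePrice are hypothetical names introduced for illustration.

```java
import java.util.Optional;

// Hypothetical helper; illustrates plain Java float parsing, not Hive or Spark behaviour.
final class LenientPriceParser {
    // Returns the parsed value, or Optional.empty() when the text is not a plain float literal.
    static Optional<Float> parsePrice(String raw) {
        if (raw == null) {
            return Optional.empty();
        }
        try {
            return Optional.of(Float.parseFloat(raw.trim()));
        } catch (NumberFormatException e) {
            // Thousands separators, currency symbols and empty fields all end up here.
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        for (String raw : new String[] {"240.00", "9,123.45", "$123.45", ""}) {
            System.out.println("'" + raw + "' parses: " + parsePrice(raw).isPresent());
        }
    }
}
```

Whether such rows should become nulls, be dropped, or fail the job is a design choice for the table owner. Keeping the column declared as string and parsing explicitly, as above, is one way to keep that choice visible.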
Re: HiveContext, schemaRDD.printSchema get different dataTypes, feature or a bug? really strange and surprised...
Oops, I tried it on a managed table, and the column types are not changed there.

So it is most likely due to the SerDe lib CSVSerDe (https://github.com/ogrodnek/csv-serde/blob/master/src/main/java/com/bizo/hive/serde/csv/CSVSerde.java#L123) or maybe CSVReader from opencsv?...

But if the columns are defined as string, then no matter what type is returned from the custom SerDe or CSVReader, they should be cast to string in the end, right?

Why not use the schema from the Hive metadata directly?