I still see the same “Unresolved attributes” error when using hql + backticks.

Here’s a code snippet that replicates this behavior:

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val sampleRDD = sc.parallelize(Array("""{"key.one": "value1", "key.two": "value2"}"""))
val sampleTable = hiveContext.jsonRDD(sampleRDD)
sampleTable.registerAsTable("sample_table")
hiveContext.hql("SELECT `key.one` FROM sample_table")

From: Michael Armbrust <mich...@databricks.com>
Reply-To: "user@spark.apache.org" <user@spark.apache.org>
Date: Thursday, July 31, 2014 at 11:20 AM
To: "user@spark.apache.org" <user@spark.apache.org>
Subject: Re: Inconsistent Spark SQL behavior when column names contain dots

Ideally you'd use backticks to reference columns that contain weird characters. 
I don't believe this works in the sql parser, but I'm curious whether using the 
hql parser in HiveContext would work for you?

If you wanted to add support for this in the sql parser, I'd check out 
SqlParser.scala.  Though it is likely we will abandon that code in the next 
release for something more complete.
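
For anyone curious what such a patch would involve, here is a purely 
illustrative sketch (not Spark's actual SqlParser code, and all names are 
hypothetical) of the general technique: a backtick-quoted identifier rule in a 
Scala combinator parser, so that a name like `key.one` keeps the dot as part of 
the column name:

import scala.util.parsing.combinator.RegexParsers

object BacktickIdentParser extends RegexParsers {
  // Plain identifier: letters, digits, underscores.
  val ident: Parser[String] = """[A-Za-z_][A-Za-z0-9_]*""".r
  // Backtick-quoted identifier: any run of non-backtick characters between
  // backticks, so the dot survives as part of the name.
  val quotedIdent: Parser[String] = "`" ~> """[^`]+""".r <~ "`"
  val columnName: Parser[String] = quotedIdent | ident

  def parseColumn(input: String): ParseResult[String] = parseAll(columnName, input)
}

// BacktickIdentParser.parseColumn("`key.one`").get  // yields "key.one"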


On Thu, Jul 31, 2014 at 11:16 AM, Budde, Adam <bu...@amazon.com> wrote:
I’m working with a dataset where each row is stored as a single-line flat JSON 
object. I want to leverage Spark SQL to run relational queries on this data. 
Many of the object keys in this dataset have dots in them, e.g.:

{ "key.number1": "value1", "key.number2": "value2" … }

I can successfully load the data as an RDD in Spark and construct a Spark SQL 
table using the jsonRDD function. If I print the schema of the table, I see 
that Spark SQL infers the full object key, dot included, as the column name:

> sqlTable.printSchema()
root
|-- key.number1: StringType
|-- key.number2: StringType
…

However, when I try to use one of these column names in a query, it seems that 
the Spark SQL parser always assumes I’m trying to reference a nested attribute. 
The same thing happens when using HiveQL. If there’s a way to escape the dot in 
the column name, I haven’t found it:

> sqlContext.sql("SELECT key.number1 FROM sql_table LIMIT 1").first
== Query Plan ==
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: 'key.number1, tree
Project ['key.number1]
…
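
To make the nesting interpretation concrete, here is a minimal sketch 
(hypothetical data and names) where the same dotted syntax does resolve, 
because key really is a nested object:

val nestedRDD = sc.parallelize(Array("""{"key": {"number1": "value1", "number2": "value2"}}"""))
val nestedTable = sqlContext.jsonRDD(nestedRDD)
nestedTable.registerAsTable("nested_table")
// key.number1 resolves here as field access on the nested key struct,
// not as a single column literally named "key.number1".
sqlContext.sql("SELECT key.number1 FROM nested_table LIMIT 1").first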

This is not a critical issue by any means; it's simple enough to use map() to 
transform the dots to underscores after loading the JSON data as text. I just 
wanted to reach out to the community for guidance on whether this issue 
warrants a bug report. To me, the behavior seems inconsistent: you can create a 
table with column names containing dots, but as far as I can tell you cannot 
reference such columns in a query.
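
For completeness, a minimal sketch of that map() workaround, assuming dots only 
ever appear inside key names (a blanket character replacement would also 
rewrite any values that happen to contain dots); the input path is hypothetical:

// Rewrite dots to underscores in the raw JSON lines before schema inference.
val rawRDD = sc.textFile("sample_data.json")  // hypothetical input path
val cleanedRDD = rawRDD.map(_.replace('.', '_'))
val cleanedTable = sqlContext.jsonRDD(cleanedRDD)
cleanedTable.registerAsTable("clean_table")
sqlContext.sql("SELECT key_number1 FROM clean_table LIMIT 1").first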

Also, I’d greatly appreciate it if anybody has any pointers as to where in the 
source I should be looking if I wanted to patch this issue in my local branch. 
I’ve taken a glance at some of the Spark SQL Catalyst code but I’m afraid I’m 
too much of a Scala novice to make much headway here.

For reference, I’m using Spark 1.0.1. Thanks for your input.

Adam
