Ideally you'd use backticks to reference columns that contain weird characters. I don't believe this works in the plain SQL parser, but I'm curious whether the hql parser in HiveContext would work for you?
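Something like the following is what I have in mind. This is an untested sketch against the 1.0.x API, and the file and table names are placeholders for your own:

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)

    // Load the one-object-per-line JSON and register it as a table, as
    // you already do with sqlContext.
    val jsonTable = hiveContext.jsonRDD(sc.textFile("objects.json"))
    jsonTable.registerAsTable("sql_table")

    // HiveQL escapes identifiers with backticks, so with luck the dot is
    // treated as a literal character rather than a nested-field access.
    hiveContext.hql("SELECT `key.number1` FROM sql_table LIMIT 1").first()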
If you wanted to add support for this in the SQL parser, I'd check out SqlParser.scala; rough sketches of both a parser change and the map() workaround you mention are at the bottom of this message. Though it is likely we will abandon that code in the next release for something more complete.

On Thu, Jul 31, 2014 at 11:16 AM, Budde, Adam <bu...@amazon.com> wrote:

> I'm working with a dataset where each row is stored as a single-line flat
> JSON object. I want to leverage Spark SQL to run relational queries on
> this data. Many of the object keys in this dataset have dots in them,
> e.g.:
>
>     { "key.number1": "value1", "key.number2": "value2", ... }
>
> I can successfully load the data as an RDD in Spark and construct a Spark
> SQL table using the jsonRDD function. If I print the schema of the table,
> I see that Spark SQL infers the full object key, dot included, as the
> column name:
>
>     > sqlTable.printSchema()
>     root
>      |-- key.number1: StringType
>      |-- key.number2: StringType
>     ...
>
> However, when I try to use one of these column names in a query, it seems
> that the Spark SQL parser always assumes I'm trying to reference a nested
> attribute. The same thing happens when using HiveQL. If there's a way to
> escape the dot in the column name, I haven't found it:
>
>     > sqlContext.sql("SELECT key.number1 FROM sql_table LIMIT 1").first
>     == Query Plan ==
>     org.apache.spark.sql.catalyst.errors.package$TreeNodeException:
>     Unresolved attributes: 'key.number1, tree
>     Project ['key.number1]
>     ...
>
> This is not a critical issue by any means; it's simple enough to use
> map() to transform the dots to underscores after loading the JSON data as
> text. I just wanted to reach out to the community for some guidance as to
> whether or not this issue warrants a bug report. To me, this behavior
> seems inconsistent: you can create a table with column names containing
> dots, but AFAICT you cannot include such columns in a query.
>
> Also, I'd greatly appreciate it if anybody has pointers as to where in
> the source I should be looking if I wanted to patch this issue in my
> local branch. I've taken a glance at some of the Spark SQL Catalyst code,
> but I'm afraid I'm too much of a Scala novice to make much headway here.
>
> For reference, I'm using Spark 1.0.1. Thanks for your input.
>
> Adam
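For the parser itself: SqlParser.scala is built on Scala's parser combinators, and a quoted-identifier rule could take roughly the shape below. To be clear, this is illustrative only, not the actual Catalyst code, and the names are made up:

    import scala.util.parsing.combinator.RegexParsers

    object IdentifierParser extends RegexParsers {
      // Either a backtick-quoted identifier, whose contents are taken
      // literally (a dot is just a character), or a plain identifier.
      val identifier: Parser[String] =
        "`" ~> "[^`]+".r <~ "`" |
        "[a-zA-Z_][a-zA-Z0-9_]*".r

      def parse(input: String): ParseResult[String] =
        parseAll(identifier, input)
    }

    // IdentifierParser.parse("`key.number1`") yields "key.number1" as one
    // attribute name rather than a nested reference.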
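And for the map() workaround you mention, something like this should do it. One caveat: a blind replaceAll also rewrites dots inside values, so a JSON-aware rename would be safer if your values can contain dots:

    // Rewrite dots to underscores in the raw text before jsonRDD sees it.
    val raw = sc.textFile("objects.json")
    val underscored = raw.map(_.replaceAll("\\.", "_"))
    hiveContext.jsonRDD(underscored).registerAsTable("sql_table_underscored")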