I have created https://issues.apache.org/jira/browse/SPARK-2775 to track it.


On Thu, Jul 31, 2014 at 11:47 AM, Budde, Adam <bu...@amazon.com> wrote:

>  I still see the same “Unresolved attributes” error when using hql +
> backticks.
>
>  Here’s a code snippet that replicates this behavior:
>
>  val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
> val sampleRDD = sc.parallelize(Array("""{"key.one": "value1", "key.two":
> "value2"}"""))
> val sampleTable = hiveContext.jsonRDD(sampleRDD)
> sampleTable.registerAsTable("sample_table")
> hiveContext.hql("SELECT `key.one` FROM sample_table")
>
>   From: Michael Armbrust <mich...@databricks.com>
> Reply-To: "user@spark.apache.org" <user@spark.apache.org>
> Date: Thursday, July 31, 2014 at 11:20 AM
> To: "user@spark.apache.org" <user@spark.apache.org>
> Subject: Re: Inconsistent Spark SQL behavior when column names contain
> dots
>
>   Ideally you'd use backticks to reference columns that contain weird
> characters.  I don't believe this works in the sql parser, but I'm curious
> whether using the hql parser in HiveContext would work for you.
>
>  If you wanted to add support for this in the sql parser I'd check out
> SqlParser.scala.  Though it is likely we will abandon that code in the
> next release for something more complete.
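>
>  To be concrete, roughly the kind of change I mean (just an illustrative
> parser-combinator sketch, not the actual SqlParser.scala code):
>
> import scala.util.parsing.combinator.RegexParsers
>
> object IdentifierParser extends RegexParsers {
>   // A plain identifier: letters, digits, and underscores.
>   val plainIdent: Parser[String] = """[a-zA-Z_][a-zA-Z0-9_]*""".r
>   // A backtick-quoted identifier: anything except a backtick between the
>   // backticks, so a name like `key.one` stays a single column reference.
>   val quotedIdent: Parser[String] = "`" ~> """[^`]+""".r <~ "`"
>   val ident: Parser[String] = quotedIdent | plainIdent
>
>   def parseIdent(s: String): ParseResult[String] = parseAll(ident, s)
> }
>
> // IdentifierParser.parseIdent("`key.one`") yields Success containing "key.one"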
>
>
> On Thu, Jul 31, 2014 at 11:16 AM, Budde, Adam <bu...@amazon.com> wrote:
>
>>  I’m working with a dataset where each row is stored as a single-line
>> flat JSON object. I want to leverage Spark SQL to run relational queries on
>> this data. Many of the object keys in this dataset have dots in them, e.g.:
>>
>>  { "key.number1": "value1", "key.number2": "value2", … }
>>
>>  I can successfully load the data as an RDD in Spark and construct a
>> Spark SQL table using the jsonRDD function. If I print the schema of the
>> table, I see that Spark SQL infers the full object key, dot included, as
>> the column name:
>>
>>  > sqlTable.printSchema()
>> root
>> |-- key.number1: StringType
>> |-- key.number2: StringType
>> …
>>
>>  However, when I try to use one of these column names in a query, it
>> seems that the Spark SQL parser always assumes I’m trying to reference a
>> nested attribute. The same thing happens when using HiveQL. If there’s a
>> way to escape the dot in the column name, I haven’t found it:
>>
>>  > sqlContext.sql("SELECT key.number1 FROM sql_table LIMIT 1").first
>> == Query Plan ==
>> org.apache.spark.sql.catalyst.errors.package$TreeNodeException:
>> Unresolved attributes: 'key.number1, tree
>> Project ['key.number1]
>> …
>>
>>  This is not a critical issue by any means; it's simple enough to use
>> map() to transform the dots to underscores after loading the JSON data as
>> text. I just wanted to reach out to the community for some guidance as to
>> whether or not this issue warrants a bug report. To me, this behavior seems
>> inconsistent: you can create a table with column names containing dots,
>> but AFAICT you cannot include such columns in a query.
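>>
>>  Roughly what I mean by the map() workaround (the input path here is just
>> a placeholder, and blindly swapping every dot for an underscore is only
>> safe because the values in my data never contain dots):
>>
>>  val rawRDD = sc.textFile("sample.json")
>> val underscoredRDD = rawRDD.map(_.replace('.', '_'))
>> val sqlTable = sqlContext.jsonRDD(underscoredRDD)
>> sqlTable.registerAsTable("sql_table")
>> sqlContext.sql("SELECT key_number1 FROM sql_table LIMIT 1").first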
>>
>>  Also, I’d greatly appreciate it if anybody has any pointers as to where
>> in the source I should be looking if I wanted to patch this issue in my
>> local branch. I’ve taken a glance at some of the Spark SQL Catalyst code
>> but I’m afraid I’m too much of a Scala novice to make much headway here.
>>
>>  For reference, I’m using Spark 1.0.1. Thanks for your input.
>>
>>  Adam
>>
>
>
