Depends on which compile time you are talking about.

*scala compile time*: No. The information about which columns are available
usually comes from a file or an external database, which may or may not be
accessible to scalac.

*query compile time*: Yes. While your program is running, but before any Spark
jobs are launched, the query is analyzed, and an exception is thrown if
either syntax is used to reference an invalid column.
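To make the distinction concrete, here is a minimal sketch against the
Spark 1.3-era API (the file name "people.json" and the column names are
hypothetical; this needs a running Spark environment):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("demo").setMaster("local"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._  // brings the $"..." syntax into scope

// Hypothetical input file with a "name" column.
val df = sqlContext.jsonFile("people.json")

// This line compiles fine: scalac only sees a Column built from a String,
// not the schema. But constructing the DataFrame triggers analysis, so an
// org.apache.spark.sql.AnalysisException is thrown here, at runtime,
// before any Spark job is launched.
val bad = df.select($"nmae")  // typo for "name"
```

The same applies to df("nmae"): the typo is caught when the query is
analyzed, not when the program is compiled.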

There was a prototype implementation
<https://github.com/marmbrus/sql-typed> that
invoked the Spark SQL analyzer from within the Scala compiler using macros,
which gave true compile-time checking.  However, this required scalac to
have access to the Hive metastore or to the files being scanned, which is
often difficult to achieve in practice.  Additionally, since you can
construct DataFrames using arbitrary Scala code, even with proper
configuration it is not always possible to determine whether a reference is
valid unless you have actually run the code that constructs the DataFrame
you are referencing.

Michael

On Wed, May 13, 2015 at 7:43 PM, Dean Wampler <deanwamp...@gmail.com> wrote:

> Is the $"foo" or mydf("foo") or both checked at compile time to verify
> that the column reference is valid? Thx.
>
> Dean
>
> On Wednesday, May 13, 2015, Michael Armbrust <mich...@databricks.com>
> wrote:
>
>> I would not say that either method is preferred (neither is
>> old/deprecated).  One advantage to the second is that you are referencing a
>> column from a specific dataframe, instead of just providing a string that
>> will be resolved much like an identifier in a SQL query.
>>
>> This means given:
>> df1 = [id: int, name: string ....]
>> df2 = [id: int, zip: int]
>>
>> I can do something like:
>>
>> df1.join(df2, df1("id") === df2("id"))
>>
>> Whereas I would need aliases if I was only using strings:
>>
>> df1.as("a").join(df2.as("b"), $"a.id" === $"b.id")
>>
>> On Wed, May 13, 2015 at 9:55 AM, Diana Carroll <dcarr...@cloudera.com>
>> wrote:
>>
>>> I'm just getting started with Spark SQL and DataFrames in 1.3.0.
>>>
>>> I notice that the Spark API shows a different syntax for referencing
>>> columns in a dataframe than the Spark SQL Programming Guide.
>>>
>>> For instance, the API docs for the select method show this:
>>> df.select($"colA", $"colB")
>>>
>>>
>>> Whereas the programming guide shows this:
>>> df.filter(df("name") > 21).show()
>>>
>>> I tested and both the $"column" and df(column) syntax works, but I'm
>>> wondering which is *preferred*.  Is one the original and one a new
>>> feature we should be using?
>>>
>>> Thanks,
>>> Diana
>>> (Spark Curriculum Developer for Cloudera)
>>>
>>
>>
>
> --
> Dean Wampler, Ph.D.
> Author: Programming Scala, 2nd Edition
> <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
> Typesafe <http://typesafe.com>
> @deanwampler <http://twitter.com/deanwampler>
> http://polyglotprogramming.com
>
>