Depends on which compile time you are talking about.

*scala compile time*: No, the information about which columns are available usually comes from a file or an external database, which may or may not be accessible to scalac.
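A toy plain-Scala sketch of this point (the types and interpolator below are illustrative stand-ins, not Spark's real classes): both syntaxes just wrap a String, so only a later analysis step with a schema in hand can reject a bad name.

```scala
// Toy sketch (illustrative names, not Spark's internals): both column
// syntaxes only carry a String at Scala compile time, so scalac has
// nothing to check the name against. Validity is established later,
// when the analyzer can see a schema.

final case class Column(name: String)

// Hypothetical stand-in for Spark's $"..." interpolator.
implicit class ColumnInterpolator(sc: StringContext) {
  def $(args: Any*): Column = Column(sc.s(args: _*))
}

final case class DataFrame(schema: Set[String]) {
  // df("foo") style: also just wraps the string.
  def apply(name: String): Column = Column(name)

  // "Query compile time": resolution against the schema, at runtime.
  def resolve(col: Column): Either[String, Column] =
    if (schema(col.name)) Right(col)
    else Left(s"cannot resolve '${col.name}' among [${schema.mkString(", ")}]")
}

val df = DataFrame(Set("id", "name"))
df.resolve($"name")   // resolves: "name" is in the schema
df.resolve(df("age")) // fails at analysis, though scalac compiled it happily
```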
*query compile time*: While your program is running, but before any Spark jobs are launched, the query is analyzed and an exception will be thrown if either syntax references an invalid column.

There was a prototype implementation <https://github.com/marmbrus/sql-typed> that invoked the Spark SQL analyzer from within the Scala compiler using macros, which gave true compile-time checking. However, this required scalac to have access to the Hive metastore or the files being scanned, which is often difficult to achieve in practice. Additionally, since you can construct DataFrames using arbitrary Scala code, even with proper configuration it is not always possible to tell whether a reference is valid unless you have actually run the code that constructs the DataFrame being referenced.

Michael

On Wed, May 13, 2015 at 7:43 PM, Dean Wampler <deanwamp...@gmail.com> wrote:

> Is the $"foo" or mydf("foo") or both checked at compile time to verify
> that the column reference is valid? Thx.
>
> Dean
>
> On Wednesday, May 13, 2015, Michael Armbrust <mich...@databricks.com> wrote:
>
>> I would not say that either method is preferred (neither is
>> old/deprecated). One advantage to the second is that you are referencing
>> a column from a specific DataFrame, instead of just providing a string
>> that will be resolved much like an identifier in a SQL query.
>>
>> This means, given:
>>
>>     df1 = [id: int, name: string, ...]
>>     df2 = [id: int, zip: int]
>>
>> I can do something like:
>>
>>     df1.join(df2, df1("id") === df2("id"))
>>
>> whereas I would need aliases if I were only using strings:
>>
>>     df1.as("a").join(df2.as("b"), $"a.id" === $"b.id")
>>
>> On Wed, May 13, 2015 at 9:55 AM, Diana Carroll <dcarr...@cloudera.com> wrote:
>>
>>> I'm just getting started with Spark SQL and DataFrames in 1.3.0.
>>>
>>> I notice that the Spark API shows a different syntax for referencing
>>> columns in a DataFrame than the Spark SQL Programming Guide.
>>>
>>> For instance, the API docs for the select method show this:
>>>
>>>     df.select($"colA", $"colB")
>>>
>>> whereas the programming guide shows this:
>>>
>>>     df.filter(df("name") > 21).show()
>>>
>>> I tested both, and both the $"column" and df("column") syntaxes work,
>>> but I'm wondering which is *preferred*. Is one the original and one a
>>> new feature we should be using?
>>>
>>> Thanks,
>>> Diana
>>> (Spark Curriculum Developer for Cloudera)
>>
>
> --
> Dean Wampler, Ph.D.
> Author: Programming Scala, 2nd Edition <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
> Typesafe <http://typesafe.com>
> @deanwampler <http://twitter.com/deanwampler>
> http://polyglotprogramming.com
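Michael's join example above can be sketched with a toy resolver in plain Scala (all names here are hypothetical, not Spark's actual analyzer): a bare string is looked up across every frame in scope, like a SQL identifier, so it can be ambiguous, while df1("id") is bound to one frame from the start.

```scala
// Toy sketch of identifier resolution over a join (hypothetical names,
// not Spark's analyzer). A Frame stands in for an aliased DataFrame.

final case class Frame(alias: String, columns: Set[String]) {
  // df("col") style: the reference is bound to this specific frame.
  def apply(col: String): (Frame, String) = (this, col)
}

// $"..." style: search every frame in scope for the name; an optional
// "alias.col" prefix narrows the search, as in SQL.
def resolveString(frames: Seq[Frame], ref: String): Either[String, (Frame, String)] = {
  val (prefix, col) = ref.split('.') match {
    case Array(p, c) => (Some(p), c)
    case _           => (None, ref)
  }
  frames.filter(f => prefix.forall(_ == f.alias) && f.columns(col)) match {
    case Seq(only) => Right((only, col))
    case Seq()     => Left(s"cannot resolve '$ref'")
    case _         => Left(s"ambiguous reference '$ref'")
  }
}

val df1 = Frame("a", Set("id", "name"))
val df2 = Frame("b", Set("id", "zip"))

resolveString(Seq(df1, df2), "id")   // ambiguous: both frames have "id"
resolveString(Seq(df1, df2), "a.id") // the alias prefix disambiguates
df1("id")                            // bound to df1 directly; no alias needed
```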