(Apologies if this has arrived more than once. I've subscribed to the list and tried posting via email with no success. This is an intentional repost to see if things are going through yet.)
I've been having lots of trouble today with DataFrames whose columns have dots in their names. I know that in many places backticks can be used to quote column names, but the problem I'm running into now is that I can't drop a column that has *no* dots in its name when there are *other* columns in the table that do. Here's some code that tries four ways of dropping the column; one throws a weird exception, one is a semi-expected no-op, and the other two work.

import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class SparkExample {
    public static void main(String[] args) {
        /* Get the Spark and SQL contexts. Setting spark.ui.enabled to false
         * keeps Spark from using its built-in dependency on Jersey. */
        SparkConf conf = new SparkConf()
                .setMaster("local[*]")
                .setAppName("test")
                .set("spark.ui.enabled", "false");
        JavaSparkContext sparkContext = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sparkContext);

        /* Create a schema with two columns, one of which has no dots (a_b),
         * and the other of which does (a.c). */
        StructType schema = new StructType(new StructField[] {
                DataTypes.createStructField("a_b", DataTypes.StringType, false),
                DataTypes.createStructField("a.c", DataTypes.IntegerType, false)
        });

        /* Create an RDD of Rows, and then convert it into a DataFrame. */
        List<Row> rows = Arrays.asList(
                RowFactory.create("t", 2),
                RowFactory.create("u", 4));
        JavaRDD<Row> rdd = sparkContext.parallelize(rows);
        DataFrame df = sqlContext.createDataFrame(rdd, schema);

        /* Four ways to attempt dropping a_b from the DataFrame. We'll try
         * calling each one of these and look at the results (or the
         * resulting exception). */
        Function<DataFrame,DataFrame> x1 = d -> d.drop("a_b");          // exception
        Function<DataFrame,DataFrame> x2 = d -> d.drop("`a_b`");        // no-op
        Function<DataFrame,DataFrame> x3 = d -> d.drop(d.col("a_b"));   // works
        Function<DataFrame,DataFrame> x4 = d -> d.drop(d.col("`a_b`")); // works

        int i = 0;
        for (Function<DataFrame,DataFrame> x : Arrays.asList(x1, x2, x3, x4)) {
            System.out.println("Case " + i++);
            try {
                x.apply(df).show();
            } catch (Exception e) {
                e.printStackTrace(System.out);
            }
        }
    }
}

The full output appears at the end of this message. Case 1 is a no-op, which I think I can understand: DataFrame.drop(String) doesn't do any resolution (it doesn't need to), so d.drop("`a_b`") does nothing because there's no column whose name is literally "`a_b`". The third and fourth cases work because DataFrame.col() does do resolution, and both "a_b" and "`a_b`" resolve correctly. But why does the first case fail? And why with the message that it does? Why is it trying to resolve "a.c" at all in this case?
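In the meantime, the workaround I've settled on is to re-select everything except the column I want to drop, backtick-quoting each name so that dotted columns resolve as literal column names rather than field accesses. A rough sketch (dropByName is my own helper, not a Spark API, and it assumes no column name itself contains a backtick; it needs java.util.ArrayList and org.apache.spark.sql.Column in addition to the imports above):

    /* Drop a column by re-selecting all the others, quoting each name
     * in backticks so that dots aren't treated as field accesses. */
    static DataFrame dropByName(DataFrame df, String name) {
        List<Column> keep = new ArrayList<>();
        for (String c : df.columns()) {
            if (!c.equals(name)) {
                keep.add(df.col("`" + c + "`"));
            }
        }
        return df.select(keep.toArray(new Column[keep.size()]));
    }

With that, dropByName(df, "a_b").show() prints, as far as I can tell, the same single-column table as cases 2 and 3 below. Anyway, here's the full output: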
Case 0
org.apache.spark.sql.AnalysisException: cannot resolve 'a.c' given input columns a_b, a.c;
	at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:60)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:318)
	at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:107)
	at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:117)
	at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:121)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
	at scala.collection.immutable.List.foreach(List.scala:318)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
	at scala.collection.AbstractTraversable.map(Traversable.scala:105)
	at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:121)
	at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:125)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
	at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
	at scala.collection.AbstractIterator.to(Iterator.scala:1157)
	at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
	at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
	at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
	at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
	at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:125)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:57)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50)
	at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:105)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
	at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
	at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:133)
	at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$withPlan(DataFrame.scala:2165)
	at org.apache.spark.sql.DataFrame.select(DataFrame.scala:751)
	at org.apache.spark.sql.DataFrame.drop(DataFrame.scala:1286)
	at SparkExample.lambda$0(SparkExample.java:45)
	at SparkExample.main(SparkExample.java:54)

Case 1
+---+---+
|a_b|a.c|
+---+---+
|  t|  2|
|  u|  4|
+---+---+

Case 2
+---+
|a.c|
+---+
|  2|
|  4|
+---+

Case 3
+---+
|a.c|
+---+
|  2|
|  4|
+---+

Thanks in advance,
Joshua

--
Joshua Taylor, http://www.cs.rpi.edu/~tayloj/
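P.S. Staring at the stack trace a bit more: the exception comes out of DataFrame.select (DataFrame.scala:751), which is called from DataFrame.drop (DataFrame.scala:1286). So my guess, and it's only a guess from the trace rather than from reading the Spark source, is that drop(String) re-selects the remaining columns by their unquoted names, which would make case 0 roughly equivalent to:

    /* Speculation based on the trace, not the actual implementation of
     * drop: selecting the remaining column by its unquoted dotted name
     * produces, I'd expect, the same "cannot resolve 'a.c'" exception. */
    df.select("a.c").show();

That would at least explain why dropping a_b ends up trying (and failing) to resolve a.c.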