(Apologies if this has arrived more than once. I've subscribed to the list and tried posting via email with no success. This is an intentional repost to see if things are going through yet.)
I've been having lots of trouble today with DataFrames whose columns have dots in their names. I know that in many places backticks can be used to quote column names, but the problem I'm running into now is that I can't drop a column that has *no* dots in its name when there are *other* columns in the table that do. Here's some code that tries four ways of dropping the column; one throws a weird exception, one is a semi-expected no-op, and the other two work.

import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class SparkExample {
    public static void main(String[] args) {
        /* Get the Spark and SQL contexts. Setting spark.ui.enabled to false
         * keeps Spark from using its built-in dependency on Jersey. */
        SparkConf conf = new SparkConf()
                .setMaster("local[*]")
                .setAppName("test")
                .set("spark.ui.enabled", "false");
        JavaSparkContext sparkContext = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sparkContext);

        /* Create a schema with two columns, one of which has no dots (a_b),
         * and the other of which does (a.c). */
        StructType schema = new StructType(new StructField[] {
                DataTypes.createStructField("a_b", DataTypes.StringType, false),
                DataTypes.createStructField("a.c", DataTypes.IntegerType, false)
        });

        /* Create an RDD of Rows, and then convert it into a DataFrame. */
        List<Row> rows = Arrays.asList(
                RowFactory.create("t", 2),
                RowFactory.create("u", 4));
        JavaRDD<Row> rdd = sparkContext.parallelize(rows);
        DataFrame df = sqlContext.createDataFrame(rdd, schema);

        /* Four ways to attempt dropping a_b from the DataFrame. We'll try
         * calling each one of these and look at the results (or the
         * resulting exception). */
        Function<DataFrame,DataFrame> x1 = d -> d.drop("a_b");          // exception
        Function<DataFrame,DataFrame> x2 = d -> d.drop("`a_b`");        // no-op
        Function<DataFrame,DataFrame> x3 = d -> d.drop(d.col("a_b"));   // works
        Function<DataFrame,DataFrame> x4 = d -> d.drop(d.col("`a_b`")); // works

        int i = 0;
        for (Function<DataFrame,DataFrame> x : Arrays.asList(x1, x2, x3, x4)) {
            System.out.println("Case " + i++);
            try {
                x.apply(df).show();
            } catch (Exception e) {
                e.printStackTrace(System.out);
            }
        }
    }
}

The full output appears at the end of this message. Case 1 is a no-op, which I think I can understand: DataFrame.drop(String) doesn't do any resolution (it doesn't need to), so d.drop("`a_b`") does nothing because there's no column whose name is literally "`a_b`". The third and fourth cases work because DataFrame.col() does do resolution, and both "a_b" and "`a_b`" resolve correctly. But why does the first case fail? And why with the message that it does? Why is it trying to resolve "a.c" at all in this case?
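In the meantime, the workaround I've settled on is to re-select everything except the column I want to drop, backtick-quoting each name so that dotted columns resolve as literal column names rather than field accesses. A rough sketch (dropByName is my own helper, not a Spark API, and it assumes no column name itself contains a backtick; it needs java.util.ArrayList and org.apache.spark.sql.Column in addition to the imports above):

    /* Drop a column by re-selecting all the others, quoting each name
     * in backticks so that dots aren't treated as field accesses. */
    static DataFrame dropByName(DataFrame df, String name) {
        List<Column> keep = new ArrayList<>();
        for (String c : df.columns()) {
            if (!c.equals(name)) {
                keep.add(df.col("`" + c + "`"));
            }
        }
        return df.select(keep.toArray(new Column[keep.size()]));
    }

With that, dropByName(df, "a_b").show() prints, as far as I can tell, the same single-column table as cases 2 and 3 below. Anyway, here's the full output: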
Case 0
org.apache.spark.sql.AnalysisException: cannot resolve 'a.c' given input columns a_b, a.c;
	at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:60)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:318)
	at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:107)
	at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:117)
	at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:121)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
	at scala.collection.immutable.List.foreach(List.scala:318)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
	at scala.collection.AbstractTraversable.map(Traversable.scala:105)
	at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:121)
	at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:125)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
	at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
	at scala.collection.AbstractIterator.to(Iterator.scala:1157)
	at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
	at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
	at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
	at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
	at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:125)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:57)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50)
	at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:105)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
	at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
	at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:133)
	at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$withPlan(DataFrame.scala:2165)
	at org.apache.spark.sql.DataFrame.select(DataFrame.scala:751)
	at org.apache.spark.sql.DataFrame.drop(DataFrame.scala:1286)
	at SparkExample.lambda$0(SparkExample.java:45)
	at SparkExample.main(SparkExample.java:54)

Case 1
+---+---+
|a_b|a.c|
+---+---+
|  t|  2|
|  u|  4|
+---+---+

Case 2
+---+
|a.c|
+---+
|  2|
|  4|
+---+

Case 3
+---+
|a.c|
+---+
|  2|
|  4|
+---+

Thanks in advance,
Joshua

--
Joshua Taylor, http://www.cs.rpi.edu/~tayloj/
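P.S. Staring at the stack trace a bit more: the exception comes out of DataFrame.select (DataFrame.scala:751), which is called from DataFrame.drop (DataFrame.scala:1286). So my guess, and it's only a guess from the trace rather than from reading the Spark source, is that drop(String) re-selects the remaining columns by their unquoted names, which would make case 0 roughly equivalent to:

    /* Speculation based on the trace, not the actual implementation of
     * drop: selecting the remaining column by its unquoted dotted name
     * produces, I'd expect, the same "cannot resolve 'a.c'" exception. */
    df.select("a.c").show();

That would at least explain why dropping a_b ends up trying (and failing) to resolve a.c.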