I believe this will be fixed in Spark 1.5: https://github.com/apache/spark/pull/7237
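Until that lands, the usual workaround is to filter with a Column expression instead of a SQL string; the Column API builds the filter predicate directly and never hits the expression parser that chokes on the reserved word. A minimal sketch against your example (I haven't run this on 1.4.0 specifically, but these APIs have been in PySpark since 1.3):

from pyspark.sql import functions as F

counts = dataFrame.groupBy('title').count()

# Column expressions bypass the SQL expression parser entirely
counts.filter(counts['count'] > 1).show()
counts.filter(F.col('count') > 1).show()

# If you need to keep using SQL strings, rename the column first
renamed = counts.withColumnRenamed('count', 'cnt')
renamed.filter("cnt > 1").show()

Note that withColumnRenamed returns a new DataFrame, so that one call already covers the "make a new DataFrame and rename the column" step you asked about.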
On Wed, Jul 22, 2015 at 3:04 PM, Young, Matthew T <matthew.t.yo...@intel.com> wrote:

> I'm trying to do some simple counting and aggregation in an IPython
> notebook with Spark 1.4.0, and I have encountered behavior that looks
> like a bug.
>
> When I try to filter rows out of a DataFrame with a column named count,
> I get a large error message. I would just avoid naming things count,
> except that this is the default column name created by the count()
> operation in pyspark.sql.GroupedData.
>
> The small example program below demonstrates the issue.
>
> from pyspark.sql import SQLContext
> sqlContext = SQLContext(sc)
> dataFrame = sc.parallelize([("foo",), ("foo",), ("bar",)]).toDF(["title"])
> counts = dataFrame.groupBy('title').count()
> counts.filter("title = 'foo'").show()  # Works
> counts.filter("count > 1").show()      # Errors out
>
> I can also reproduce the issue by entering these commands in a PySpark
> shell session.
>
> I suspect that the error has something to do with Spark wanting to call
> the count() function in place of looking at the count column.
>
> The error message is as follows:
>
> Py4JJavaError                             Traceback (most recent call last)
> <ipython-input-29-62a1b7c71f21> in <module>()
> ----> 1 counts.filter("count > 1").show()  # Errors out
>
> C:\Users\User\Downloads\spark-1.4.0-bin-hadoop2.6\python\pyspark\sql\dataframe.pyc in filter(self, condition)
>     774         """
>     775         if isinstance(condition, basestring):
> --> 776             jdf = self._jdf.filter(condition)
>     777         elif isinstance(condition, Column):
>     778             jdf = self._jdf.filter(condition._jc)
>
> C:\Python27\lib\site-packages\py4j\java_gateway.pyc in __call__(self, *args)
>     536         answer = self.gateway_client.send_command(command)
>     537         return_value = get_return_value(answer, self.gateway_client,
> --> 538             self.target_id, self.name)
>     539
>     540         for temp_arg in temp_args:
>
> C:\Python27\lib\site-packages\py4j\protocol.pyc in get_return_value(answer, gateway_client, target_id, name)
>     298                 raise Py4JJavaError(
>     299                     'An error occurred while calling {0}{1}{2}.\n'.
> --> 300                     format(target_id, '.', name), value)
>     301             else:
>     302                 raise Py4JError(
>
> Py4JJavaError: An error occurred while calling o229.filter.
> : java.lang.RuntimeException: [1.7] failure: ``('' expected but `>' found
>
> count > 1
>       ^
>         at scala.sys.package$.error(package.scala:27)
>         at org.apache.spark.sql.catalyst.SqlParser.parseExpression(SqlParser.scala:45)
>         at org.apache.spark.sql.DataFrame.filter(DataFrame.scala:652)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>         at java.lang.reflect.Method.invoke(Unknown Source)
>         at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>         at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>         at py4j.Gateway.invoke(Gateway.java:259)
>         at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>         at py4j.commands.CallCommand.execute(CallCommand.java:79)
>         at py4j.GatewayConnection.run(GatewayConnection.java:207)
>         at java.lang.Thread.run(Unknown Source)
>
> Is there a recommended workaround for the inability to filter on a column
> named count? Do I have to make a new DataFrame and rename the column just
> to work around this bug? What's the best way to do that?
>
> Thanks,
>
> -- Matthew Young