I believe this will be fixed in Spark 1.5: https://github.com/apache/spark/pull/7237
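Until that lands, the usual workaround is to filter with a Column expression instead of a SQL string; the Column API builds the filter predicate directly and never hits the expression parser that chokes on the reserved word. A minimal sketch against your example (I haven't run this on 1.4.0 specifically, but these APIs have been in PySpark since 1.3):

from pyspark.sql import functions as F

counts = dataFrame.groupBy('title').count()

# Column expressions bypass the SQL expression parser entirely
counts.filter(counts['count'] > 1).show()
counts.filter(F.col('count') > 1).show()

# If you need to keep using SQL strings, rename the column first
renamed = counts.withColumnRenamed('count', 'cnt')
renamed.filter("cnt > 1").show()

Note that withColumnRenamed returns a new DataFrame, so that one call already covers the "make a new DataFrame and rename the column" step you asked about.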
On Wed, Jul 22, 2015 at 3:04 PM, Young, Matthew T <matthew.t.yo...@intel.com> wrote:

> I'm trying to do some simple counting and aggregation in an IPython
> notebook with Spark 1.4.0, and I have encountered behavior that looks
> like a bug.
>
> When I try to filter rows out of a DataFrame with a column named count,
> I get a large error message. I would just avoid naming things count,
> except that this is the default column name created by the count()
> operation in pyspark.sql.GroupedData.
>
> The small example program below demonstrates the issue.
>
> from pyspark.sql import SQLContext
> sqlContext = SQLContext(sc)
> dataFrame = sc.parallelize([("foo",), ("foo",), ("bar",)]).toDF(["title"])
> counts = dataFrame.groupBy('title').count()
> counts.filter("title = 'foo'").show()  # Works
> counts.filter("count > 1").show()      # Errors out
>
> I can also reproduce the issue by entering these commands in a PySpark
> shell session.
>
> I suspect that the error has something to do with Spark wanting to call
> the count() function in place of looking at the count column.
>
> The error message is as follows:
>
> Py4JJavaError                             Traceback (most recent call last)
> <ipython-input-29-62a1b7c71f21> in <module>()
> ----> 1 counts.filter("count > 1").show()  # Errors out
>
> C:\Users\User\Downloads\spark-1.4.0-bin-hadoop2.6\python\pyspark\sql\dataframe.pyc in filter(self, condition)
>     774         """
>     775         if isinstance(condition, basestring):
> --> 776             jdf = self._jdf.filter(condition)
>     777         elif isinstance(condition, Column):
>     778             jdf = self._jdf.filter(condition._jc)
>
> C:\Python27\lib\site-packages\py4j\java_gateway.pyc in __call__(self, *args)
>     536         answer = self.gateway_client.send_command(command)
>     537         return_value = get_return_value(answer, self.gateway_client,
> --> 538             self.target_id, self.name)
>     539
>     540         for temp_arg in temp_args:
>
> C:\Python27\lib\site-packages\py4j\protocol.pyc in get_return_value(answer, gateway_client, target_id, name)
>     298                 raise Py4JJavaError(
>     299                     'An error occurred while calling {0}{1}{2}.\n'.
> --> 300                     format(target_id, '.', name), value)
>     301             else:
>     302                 raise Py4JError(
>
> Py4JJavaError: An error occurred while calling o229.filter.
> : java.lang.RuntimeException: [1.7] failure: ``('' expected but `>' found
>
> count > 1
>       ^
>         at scala.sys.package$.error(package.scala:27)
>         at org.apache.spark.sql.catalyst.SqlParser.parseExpression(SqlParser.scala:45)
>         at org.apache.spark.sql.DataFrame.filter(DataFrame.scala:652)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>         at java.lang.reflect.Method.invoke(Unknown Source)
>         at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>         at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>         at py4j.Gateway.invoke(Gateway.java:259)
>         at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>         at py4j.commands.CallCommand.execute(CallCommand.java:79)
>         at py4j.GatewayConnection.run(GatewayConnection.java:207)
>         at java.lang.Thread.run(Unknown Source)
>
> Is there a recommended workaround for the inability to filter on a column
> named count? Do I have to make a new DataFrame and rename the column just
> to work around this bug? What's the best way to do that?
>
> Thanks,
>
> -- Matthew Young