First of all, thank you for your replies. I was previously doing this via a
normal JDBC connection and it worked without problems; then I liked the idea
that Spark SQL could take care of opening/closing the connection for me.

I also tried single quotes, since that was my first guess, but it didn't
work. I fear I will have to look at the Spark code, since it seems I'm the
only one with this issue. BTW, I'm testing with Spark 1.3.0.

Best,
Francesco
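A workaround that sidesteps the pushdown entirely, for anyone who needs one:
embed the predicate in the dbtable option as a subquery, so MySQL applies the
filter before Spark ever compiles one. A minimal sketch against the Spark 1.3
API -- the URL, credentials, and table name are placeholders, not the ones
from this thread:

    // In spark-shell (Spark 1.3), where sqlContext is predefined.
    val newRows = sqlContext.load("jdbc", Map(
      "url"     -> "jdbc:mysql://dbhost:3306/mydb?user=me&password=secret",
      "dbtable" -> "(SELECT * FROM mytable WHERE status = 'new') AS t"))
    newRows.first()

Because the WHERE clause is written by hand, Spark's filter compilation never
touches the string literal. For what it's worth, the symptom here (an
equality filter on a string column arriving at MySQL as an unquoted literal)
appears to match a quoting bug in the JDBC filter pushdown of early Spark
releases, which was fixed in later versions.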
On Fri, May 1, 2015, 00:54 ayan guha <guha.a...@gmail.com> wrote:

I think you need to specify 'new' in single quotes. My guess is that the
query showing up in the DB is like

    ... where status = new
or
    ... where status = "new"

In either case MySQL assumes new is a column. What you need is the form below:

    ... where status = 'new'

You need to provide your quotes accordingly.

The easiest way to check would be to run the query over a separate JDBC
connection to MySQL from a simple standalone programme, not in Spark.
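Such a standalone check might look like the following -- a minimal sketch,
assuming MySQL Connector/J is on the classpath; the URL, credentials, and
table name are placeholders:

    import java.sql.DriverManager

    object QuoteCheck {
      def main(args: Array[String]): Unit = {
        // Placeholder URL and credentials -- substitute your own.
        val conn = DriverManager.getConnection(
          "jdbc:mysql://dbhost:3306/mydb", "me", "secret")
        try {
          val stmt = conn.createStatement()
          // Correctly quoted string literal: this should succeed.
          val rs = stmt.executeQuery(
            "SELECT COUNT(*) FROM mytable WHERE status = 'new'")
          if (rs.next()) println("rows with status='new': " + rs.getLong(1))
          // The unquoted form is what the Spark error suggests is being
          // generated; running it by hand reproduces the failure:
          //   SELECT COUNT(*) FROM mytable WHERE status = new
          //   => Unknown column 'new' in 'where clause'
        } finally {
          conn.close()
        }
      }
    }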
On 1 May 2015 07:47, "Burak Yavuz" <brk...@gmail.com> wrote:

Is "new" a reserved word for MySQL?

On Thu, Apr 30, 2015 at 2:41 PM, Francesco Bigarella
<francesco.bigare...@gmail.com> wrote:

Do you know how I can check that? I googled a bit but couldn't find a clear
explanation about it. I also tried to use explain(), but it doesn't really
help. I still find it unusual that I have this issue only with the equality
operator and not with the others.

Thank you,
F

On Wed, Apr 29, 2015 at 3:03 PM ayan guha <guha.a...@gmail.com> wrote:

Looks like your DF is based on a MySQL DB via JDBC, and the error is thrown
by MySQL. Can you see what SQL finally gets fired against MySQL? Spark is
pushing the predicate down to MySQL, so it's not a Spark problem per se.

On Wed, Apr 29, 2015 at 9:56 PM, Francesco Bigarella
<francesco.bigare...@gmail.com> wrote:

Hi all,

I was testing the DataFrame filter functionality and I found what I think is
a strange behaviour. My dataframe testDF, obtained by loading a MySQL table
via JDBC, has the following schema:

    root
     |-- id: long (nullable = false)
     |-- title: string (nullable = true)
     |-- value: string (nullable = false)
     |-- status: string (nullable = false)

What I want to do is filter my dataset to obtain all rows that have
status = "new".

    scala> testDF.filter(testDF("id") === 1234).first()

works fine (also with the integer value within double quotes). However, if I
try to use the same statement to filter on the status column (also with
changes in the syntax - see below), the program suddenly breaks.

Any of the following

    scala> testDF.filter(testDF("status") === "new")
    scala> testDF.filter("status = 'new'")
    scala> testDF.filter($"status" === "new")

generates the error:

    INFO scheduler.DAGScheduler: Job 3 failed: runJob at SparkPlan.scala:121,
    took 0.277907 s

    org.apache.spark.SparkException: Job aborted due to stage failure:
    Task 0 in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in
    stage 3.0 (TID 12, <node name>):
    com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Unknown column
    'new' in 'where clause'
      at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
      at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
      at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
      at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
      at com.mysql.jdbc.Util.handleNewInstance(Util.java:411)
      at com.mysql.jdbc.Util.getInstance(Util.java:386)
      at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1052)
      at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3597)
      at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3529)
      at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:1990)
      at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2151)
      at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2625)
      at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:2119)
      at com.mysql.jdbc.PreparedStatement.executeQuery(PreparedStatement.java:2283)
      at org.apache.spark.sql.jdbc.JDBCRDD$$anon$1.<init>(JDBCRDD.scala:328)
      at org.apache.spark.sql.jdbc.JDBCRDD.compute(JDBCRDD.scala:309)
      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
      at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
      at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
      at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
      at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
      at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
      at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
      at org.apache.spark.scheduler.Task.run(Task.scala:64)
      at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      at java.lang.Thread.run(Thread.java:745)

Does filter work only on columns of integer type? What is the exact
behaviour of the filter function, and what is the best way to handle the
query I am trying to execute?

Thank you,
Francesco

--
Best Regards,
Ayan Guha
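On the open question of how to see what SQL finally gets fired against
MySQL: one way, assuming an account with the SUPER privilege on the server,
is to switch on the general query log, run the Spark job, and read the
captured statements back. A sketch with placeholder connection details:

    import java.sql.DriverManager

    // Placeholder URL and credentials -- substitute your own; the account
    // needs the SUPER privilege to toggle the log.
    val conn = DriverManager.getConnection(
      "jdbc:mysql://dbhost:3306/mydb", "me", "secret")
    val stmt = conn.createStatement()

    // Capture every statement the server receives in mysql.general_log.
    stmt.execute("SET GLOBAL log_output = 'TABLE'")
    stmt.execute("SET GLOBAL general_log = 'ON'")

    // ... run the failing testDF.filter(...) job now, then read back:
    val rs = stmt.executeQuery(
      "SELECT argument FROM mysql.general_log " +
      "ORDER BY event_time DESC LIMIT 20")
    while (rs.next()) println(rs.getString("argument"))

    // Switch the log off again -- it is expensive on a busy server.
    stmt.execute("SET GLOBAL general_log = 'OFF'")
    conn.close()

If the pushed-down predicate shows up in the log as status = new with no
quotes, that confirms ayan's diagnosis that MySQL is parsing new as a column
name rather than a string literal.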