[jira] [Comment Edited] (PHOENIX-3664) Pyspark: pushing filter by date against apache phoenix

Pablo Castilla (JIRA) Thu, 16 Feb 2017 01:58:32 -0800

    [ 
https://issues.apache.org/jira/browse/PHOENIX-3664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15869655#comment-15869655
 ]


Pablo Castilla edited comment on PHOENIX-3664 at 2/16/17 9:57 AM:
------------------------------------------------------------------

At the end we have switched from python to scala and all seems to work at very 
good speed. We haven't found the phoenixTableAsRDD in python.

We would prefer python as we use it in machine learning implementations, but 
they are very very similar so moving to scala is not a big deal.

Thanks for helping! :)


was (Author: pablocastilla):
At the end we have switched from python to scala and all seems to work at very 
good speed. We haven't found the phoenixTableAsRDD in python.

We would prefer python as we use it in machine learning implementations, but 
they are very very similar so moving to scala is not a big deal.

> Pyspark: pushing filter by date against apache phoenix
> ------------------------------------------------------
>
>                 Key: PHOENIX-3664
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-3664
>             Project: Phoenix
>          Issue Type: Bug
>    Affects Versions: 4.7.0
>         Environment: Azure HDIndight (HDI 3.5) - pyspark using phoenix 
> client. (Spark 1.6.3 - HBase 1.1.2 under HDP 2.5)
>            Reporter: Pablo Castilla
>
> I am trying to filter by date in apache phoenix from pyspark. The column in 
> phoenix is created as Date and the filter is a datetime. When I use explain I 
> see spark doesn't push the filter to phoenix. I have tried a lot of 
> combinations without luck.
> Any way to do it?
> df = sqlContext.read \
>    .format("org.apache.phoenix.spark") \
>   .option("table", "TABLENAME") \
>   .option("zkUrl",zookepperServer +":2181:/hbase-unsecure" ) \
>   .load()
> print(df.printSchema())
> startValidation = datetime.datetime.now()
> print(df.filter(df['FH'] >startValidation).explain(True))
> Results:
> root
>  |-- METER_ID: string (nullable = true)
>  |-- FH: date (nullable = true)
> None
>    == Parsed Logical Plan ==
> 'Filter (FH#53 > 1486726683446150)
> +- 
> Relation[METER_ID#52,FH#53,SUMMERTIME#54,MAGNITUDE#55,SOURCE#56,ENTRY_DATETIME#57,BC#58,T_VAL_AE#59,T_VAL_AI#60,T_VAL_R1#61,T_VAL_R2#62,T_VAL_R3#63,T_VAL_R4#64]
>  PhoenixRelation(DAILYREADS,10.0.0.13:2181:/hbase-unsecure)
> == Analyzed Logical Plan ==
> METER_ID: string, FH: date, SUMMERTIME: string, MAGNITUDE: int, SOURCE: int, 
> ENTRY_DATETIME: date, BC: string, T_VAL_AE: int, T_VAL_AI: int, T_VAL_R1: 
> int, T_VAL_R2: int, T_VAL_R3: int, T_VAL_R4: int
> Filter (cast(FH#53 as string) > cast(1486726683446150 as string))
> +- 
> Relation[METER_ID#52,FH#53,SUMMERTIME#54,MAGNITUDE#55,SOURCE#56,ENTRY_DATETIME#57,BC#58,T_VAL_AE#59,T_VAL_AI#60,T_VAL_R1#61,T_VAL_R2#62,T_VAL_R3#63,T_VAL_R4#64]
>  PhoenixRelation(DAILYREADS,10.0.0.13:2181:/hbase-unsecure)
> == Optimized Logical Plan ==
> Filter (cast(FH#53 as string) > 2017-02-10 11:38:03.44615)
> +- 
> Relation[METER_ID#52,FH#53,SUMMERTIME#54,MAGNITUDE#55,SOURCE#56,ENTRY_DATETIME#57,BC#58,T_VAL_AE#59,T_VAL_AI#60,T_VAL_R1#61,T_VAL_R2#62,T_VAL_R3#63,T_VAL_R4#64]
>  PhoenixRelation(DAILYREADS,10.0.0.13:2181:/hbase-unsecure)
> == Physical Plan ==
> Filter (cast(FH#53 as string) > 2017-02-10 11:38:03.44615)
> +- Scan 
> PhoenixRelation(DAILYREADS,10.0.0.13:2181:/hbase-unsecure)[METER_ID#52,FH#53,SUMMERTIME#54,MAGNITUDE#55,SOURCE#56,ENTRY_DATETIME#57,BC#58,T_VAL_AE#59,T_VAL_AI#60,T_VAL_R1#61,T_VAL_R2#62,T_VAL_R3#63,T_VAL_R4#64]
> None
> if I set the FH column as timestamp it pushes the filter but throws an 
> exception:
> Caused by: org.apache.phoenix.exception.PhoenixParserException: ERROR 604 
> (42P00): Syntax error. Mismatched input. Expecting "RPAREN", got "12" at line 
> 1, column 219.
>     at 
> org.apache.phoenix.exception.PhoenixParserException.newException(PhoenixParserException.java:33)
>     at org.apache.phoenix.parse.SQLParser.parseStatement(SQLParser.java:111)
>     at 
> org.apache.phoenix.jdbc.PhoenixStatement$PhoenixStatementParser.parseStatement(PhoenixStatement.java:1280)
>     at 
> org.apache.phoenix.jdbc.PhoenixStatement.parseStatement(PhoenixStatement.java:1363)
>     at 
> org.apache.phoenix.jdbc.PhoenixStatement.compileQuery(PhoenixStatement.java:1373)
>     at 
> org.apache.phoenix.jdbc.PhoenixStatement.optimizeQuery(PhoenixStatement.java:1368)
>     at 
> org.apache.phoenix.mapreduce.PhoenixInputFormat.getQueryPlan(PhoenixInputFormat.java:122)
>     ... 102 more
> Caused by: MismatchedTokenException(106!=129)
>     at 
> org.apache.phoenix.parse.PhoenixSQLParser.recoverFromMismatchedToken(PhoenixSQLParser.java:360)
>     at 
> org.apache.phoenix.shaded.org.antlr.runtime.BaseRecognizer.match(BaseRecognizer.java:115)
>     at 
> org.apache.phoenix.parse.PhoenixSQLParser.not_expression(PhoenixSQLParser.java:6862)
>     at 
> org.apache.phoenix.parse.PhoenixSQLParser.and_expression(PhoenixSQLParser.java:6677)
>     at 
> org.apache.phoenix.parse.PhoenixSQLParser.or_expression(PhoenixSQLParser.java:6614)
>     at 
> org.apache.phoenix.parse.PhoenixSQLParser.expression(PhoenixSQLParser.java:6579)
>     at 
> org.apache.phoenix.parse.PhoenixSQLParser.single_select(PhoenixSQLParser.java:4615)
>     at 
> org.apache.phoenix.parse.PhoenixSQLParser.unioned_selects(PhoenixSQLParser.java:4697)
>     at 
> org.apache.phoenix.parse.PhoenixSQLParser.select_node(PhoenixSQLParser.java:4763)
>     at 
> org.apache.phoenix.parse.PhoenixSQLParser.oneStatement(PhoenixSQLParser.java:789)
>     at 
> org.apache.phoenix.parse.PhoenixSQLParser.statement(PhoenixSQLParser.java:508)
>     at org.apache.phoenix.parse.SQLParser.parseStatement(SQLParser.java:108)
>     ... 107 more



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Comment Edited] (PHOENIX-3664) Pyspark: pushing filter by date against apache phoenix

Reply via email to