I ended up doing this for the time being.  It works but I *think* that
timestamp seems like a rational partitionColumn and I'm wondering if
there's a more built in way:


> df = spark.read.jdbc(
>
>     url=os.environ["JDBCURL"],
>
>     table="schema.table",
>
>     predicates=predicates
>
> )
>

where Predicates is a list of:

>  ["timestamp between '2017-08-01 00:00:01' and '2017-09-01 00:00:01'",

 "timestamp between '2017-09-01 00:00:01' and '2017-10-01 00:00:01'",

 "timestamp between '2017-10-01 00:00:01' and '2017-11-01 00:00:01'"]




Gets quite nice performance (better than I expected).

Thanks!

On 18 September 2017 at 13:21, lucas.g...@gmail.com <lucas.g...@gmail.com>
wrote:

>  I'm pretty sure you can use a timestamp as a partitionColumn, It's
> Timestamp type in MySQL.  It's at base a numeric type and Spark requires a
> numeric type passed in.
>
> This doesn't work as the where parameter in MySQL becomes raw numerics
> which won't query against the mysql Timestamp.
>
>
> minTimeStamp = 1325605540 <-- This is wrong, but I'm not sure what to put
>> in here.
>>
>> maxTimeStamp = 1505641420
>>
>> numPartitions = 20*7
>>
>>
>> dt = spark.read \
>>
>>     .format("jdbc") \
>>
>>     .option("url", os.environ["JDBC_URL"]) \
>>
>>     .option("dbtable", "schema.table") \
>>
>>     .option("numPartitions", numPartitions) \
>>
>>     .option("partitionColumn", "Timestamp") \
>>
>>     .option("lowerBound", minTimeStamp) \
>>
>>     .option("upperBound", maxTimeStamp) \
>>
>>     .load()
>>
>
> mysql DB schema:
>
>> create table table
>>
>> (
>>
>> EventId VARCHAR(50) not null primary key,
>>
>> userid VARCHAR(200) null,
>>
>> Timestamp TIMESTAMP(19) default CURRENT_TIMESTAMP not null,
>>
>> Referrer VARCHAR(4000) null,
>>
>> ViewedUrl VARCHAR(4000) null
>>
>> );
>>
>> create index Timestamp on Fact_PageViewed (Timestamp);
>>
>
> I'm obviously doing it wrong, but couldn't find anything obvious while
> digging around.
>
> The query that gets generated looks like this (not exactly, it's optimized
> to include some upstream query parameters):
>
>>
>> *SELECT *`Timestamp`,`Referrer`,`EventId`,`UserId`,`ViewedUrl`
>> *FROM *schema.table  (*Timestamp*)
>> *WHERE  Timestamp *>= 1452916570 *AND Timestamp *< 1454202540;  <-- this
>> doesn't query against mysql timestamp type meaningfully.
>
>
> Thanks!
>
> Gary Lucas
>
>

Reply via email to