I ended up doing this for the time being. It works but I *think* that timestamp seems like a rational partitionColumn and I'm wondering if there's a more built in way:
> df = spark.read.jdbc( > > url=os.environ["JDBCURL"], > > table="schema.table", > > predicates=predicates > > ) > where Predicates is a list of: > ["timestamp between '2017-08-01 00:00:01' and '2017-09-01 00:00:01'", "timestamp between '2017-09-01 00:00:01' and '2017-10-01 00:00:01'", "timestamp between '2017-10-01 00:00:01' and '2017-11-01 00:00:01'"] Gets quite nice performance (better than I expected). Thanks! On 18 September 2017 at 13:21, lucas.g...@gmail.com <lucas.g...@gmail.com> wrote: > I'm pretty sure you can use a timestamp as a partitionColumn, It's > Timestamp type in MySQL. It's at base a numeric type and Spark requires a > numeric type passed in. > > This doesn't work as the where parameter in MySQL becomes raw numerics > which won't query against the mysql Timestamp. > > > minTimeStamp = 1325605540 <-- This is wrong, but I'm not sure what to put >> in here. >> >> maxTimeStamp = 1505641420 >> >> numPartitions = 20*7 >> >> >> dt = spark.read \ >> >> .format("jdbc") \ >> >> .option("url", os.environ["JDBC_URL"]) \ >> >> .option("dbtable", "schema.table") \ >> >> .option("numPartitions", numPartitions) \ >> >> .option("partitionColumn", "Timestamp") \ >> >> .option("lowerBound", minTimeStamp) \ >> >> .option("upperBound", maxTimeStamp) \ >> >> .load() >> > > mysql DB schema: > >> create table table >> >> ( >> >> EventId VARCHAR(50) not null primary key, >> >> userid VARCHAR(200) null, >> >> Timestamp TIMESTAMP(19) default CURRENT_TIMESTAMP not null, >> >> Referrer VARCHAR(4000) null, >> >> ViewedUrl VARCHAR(4000) null >> >> ); >> >> create index Timestamp on Fact_PageViewed (Timestamp); >> > > I'm obviously doing it wrong, but couldn't find anything obvious while > digging around. > > The query that gets generated looks like this (not exactly, it's optimized > to include some upstream query parameters): > >> >> *SELECT *`Timestamp`,`Referrer`,`EventId`,`UserId`,`ViewedUrl` >> *FROM *schema.table (*Timestamp*) >> *WHERE Timestamp *>= 1452916570 *AND Timestamp *< 1454202540; <-- this >> doesn't query against mysql timestamp type meaningfully. > > > Thanks! > > Gary Lucas > >