(Total newb here: longtime SQL Server person, but new to Python and Arrow.) I'm trying to export a dataset for use with AWS Athena, and as part of that I want to partition it.
In order to partition by day and hour (I only have a datetime, aka TIMESTAMP), in my SELECT I computed two columns (a date and an hour), passed them into the Table, then used them as the `partition_cols`. So far, so good. But the Athena documentation says you can't create partitions on columns that exist within the table data. From https://docs.aws.amazon.com/athena/latest/ug/create-table.html:

> Partitioned columns don't exist within the table data itself. If you use a value for col_name that is the same as a table column, you get an error. For more information, see Partitioning Data.

So I'm stumped. Short of explicitly pointing each partition at a folder, is there a way to do this with Arrow? Thanks!

```python
import pandas as pd
import pypyodbc
import pyarrow as pa
import pyarrow.parquet as pq

con_string = ('Driver={SQL Server};'
              'Server=myserver;'
              'Database=mydb;'
              'App=myname;'  # note: it's App, not Application Name!
              'Trusted_Connection=yes')
cnxn = pypyodbc.connect(con_string)

query = """
SELECT *,
       convert(date, submitted_datetime) as subdate,
       datepart(hour, submitted_datetime) as subhour
FROM mytable
WHERE submitted_datetime >= '20190720'
  AND submitted_datetime < '20190723'
"""

result_port_map = pd.read_sql(query, cnxn)
table = pa.Table.from_pandas(result_port_map)
pq.write_to_dataset(table,
                    root_path='mytable',
                    partition_cols=['subdate', 'subhour'])
```
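For context on the layout in question: `write_to_dataset` with `partition_cols` produces Hive-style `key=value` directories from the partition column values. Here is a minimal, stdlib-only sketch (no pyarrow needed, and `partition_path` is just an illustrative helper, not a library function) of the path that a given `submitted_datetime` would map to under my `subdate`/`subhour` scheme:

```python
from datetime import datetime
from pathlib import PurePosixPath

def partition_path(root: str, submitted: datetime) -> str:
    """Build the Hive-style partition path (key=value directories)
    that a row's subdate/subhour values would map to."""
    subdate = submitted.date().isoformat()  # mirrors convert(date, ...)
    subhour = submitted.hour                # mirrors datepart(hour, ...)
    return str(PurePosixPath(root) / f"subdate={subdate}" / f"subhour={subhour}")

print(partition_path("mytable", datetime(2019, 7, 20, 13, 5)))
# mytable/subdate=2019-07-20/subhour=13
```

So the directory names carry the partition values; my question is whether those values also have to live inside the Parquet files themselves, which is what the Athena docs seem to forbid.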
