AH! No, I missed that in the documentation. I was making sure I had everything right before I actually ran it. Let me go play with it some more - I'll comment again if there are issues with the columns not being removed. Thanks for the quick response!
-----Original Message-----
From: Wes McKinney <[email protected]>
Sent: Wednesday, July 31, 2019 5:04 PM
To: [email protected]
Subject: (External Email) Re: Trying to partition for AWS Athena - the partition name can't be a field, but that's how Arrow does it?

Are you getting an error of some kind running this code? write_to_dataset is supposed to remove the partition columns from each fragment that is written to the dataset.

On Wed, Jul 31, 2019 at 4:28 PM Bourgon, Michael <[email protected]> wrote:
>
> (total newb – longtime SQL Server person, but new to Python & Arrow)
>
> I’m trying to export a dataset for use with AWS Athena. As part of that, I want to partition it.
>
> In order to partition by day & hour (I only have a datetime aka TIMESTAMP), in my SELECT I computed two columns (a date and an hour), passed them to the Table, then used them as the partition_cols. So far, so good.
>
> But reading through the Athena documentation, you can’t create the partitions on fields that exist within the data.
>
> https://docs.aws.amazon.com/athena/latest/ug/create-table.html
>
> “Partitioned columns don't exist within the table data itself. If you use a value for col_name that is the same as a table column, you get an error. For more information, see Partitioning Data.”
>
> So, I’m stumped. Short of explicitly pointing each partition at a folder, is there a way to do this with Arrow?
>
> Thanks!
>
> import pandas as pd
> import pypyodbc
> import pyarrow.parquet as pq
> import pyarrow as pa
>
> con_string = ('Driver={SQL Server};'
>               'Server=myserver;'
>               'Database=mydb;'
>               'App=myname;'  # It's not application name!
>               'Trusted_Connection=yes')
> cnxn = pypyodbc.connect(con_string)
>
> query = """
> SELECT *,
>        convert(date, submitted_datetime) as subdate,
>        datepart(hour, submitted_datetime) as subhour
> FROM mytable
> WHERE submitted_datetime >= '20190720' and submitted_datetime < '20190723'
> """
> result_port_map = pd.read_sql(query, cnxn)
>
> table = pa.Table.from_pandas(result_port_map)
>
> pq.write_to_dataset(table, root_path='mytable',
>                     partition_cols=['subdate', 'subhour'])
