I'd like to optimize querying and converting a list of Oracle tables into
pandas dataframes.
The eventual goal is to convert to Parquet, write to disk, then upload to
S3, but for now I just want to focus on the pandas / sqlalchemy /
parallelism part. My code sort of works, but it's very slow and seems to
hang after completing 10 tables.
Any advice for speeding things up or alternative suggestions?
import sqlalchemyfrom sqlalchemy.orm import sessionmaker, scoped_sessionfrom
multiprocessing.dummy import Pool as ThreadPool from multiprocessing import
Poolimport pyarrow as paimport pyarrow.parquet as pq
def process_chunk(chunk, table_name, index):
table = pa.Table.from_pandas(chunk)
local_file_name = "./" + table_name + "-" + str(index) + ".parquet"
pq.write_table(table, local_file_name)
def process_table(table):
db_session = DBSession()
# helper function that creates the SQL query (select col1, col2, col3, ...,
colX from table)
query = setup_query(table)
i=0
# is this the right way to use the db_session?
for chunk in pd.read_sql(query, db_session.bind, chunksize=300000):
process_chunk(chunk, table, i)
i+=1
oracle_connect_str = #string_here#
oracle_engine = sqlalchemy.create_engine(
oracle_connect_str,
arraysize=10000)
# set up session object to be used by threadsDBSession = scoped_session(
sessionmaker(
autoflush=True,
autocommit=False,
bind=oracle_engine
))
pool = ThreadPool(4)
table_list = ['tbl1','tbl2','tbl3','tbl4','tbl5',...,'tbl20']
# use pool.map instead of creating boiler-plate threading class
pool.map(process_table, table_list)
# are these in the right spots?
pool.close()
pool.join()
Thanks!
--
SQLAlchemy -
The Python SQL Toolkit and Object Relational Mapper
http://www.sqlalchemy.org/
To post example code, please provide an MCVE: Minimal, Complete, and Verifiable
Example. See http://stackoverflow.com/help/mcve for a full description.
---
You received this message because you are subscribed to the Google Groups
"sqlalchemy" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
To view this discussion on the web visit
https://groups.google.com/d/msgid/sqlalchemy/0e7bf9a8-0335-4a07-af90-6b217daeb2ea%40googlegroups.com.