The resulting library is on GitHub: https://github.com/EDS-APHP/spark-postgres While there is room for improvement, it is able to read/write postgres data with the COPY statement, which allows reading/writing **very large** tables without problems.
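For illustration, here is a minimal sketch of what a COPY-based write path can look like, using the standard PostgreSQL JDBC driver's CopyManager. The connection string, credentials, and table name are placeholders, and this is not necessarily the library's exact API:

```scala
import java.io.StringReader
import java.sql.DriverManager
import org.postgresql.copy.CopyManager
import org.postgresql.core.BaseConnection

object CopyWriteSketch {
  def main(args: Array[String]): Unit = {
    // Placeholder connection details; adjust for your setup.
    val conn = DriverManager
      .getConnection("jdbc:postgresql://localhost:5432/mydb", "user", "pass")
      .asInstanceOf[BaseConnection]
    val copy = new CopyManager(conn)

    // COPY FROM STDIN bulk-loads CSV rows far faster than row-by-row INSERTs.
    val csv = new StringReader("1,alice\n2,bob\n")
    val rows = copy.copyIn("COPY my_table FROM STDIN WITH CSV", csv)
    println(s"loaded $rows rows")
    conn.close()
  }
}
```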
On Sat, Dec 29, 2018 at 01:06:00PM +0100, Nicolas Paris wrote:
> Hi
>
> The spark postgres JDBC reader is limited because it relies on basic
> SELECT statements with fetchsize and crashes on large tables even if
> multiple partitions are set up with lower/upper bounds.
>
> I am about to write a new postgres JDBC reader based on "COPY TO STDOUT".
> It would stream the data and produce CSV on the filesystem (hdfs or
> local). The CSV would then be parsed with the spark CSV reader to
> produce a dataframe. It would send multiple "COPY TO STDOUT" statements,
> one for each executor.
>
> Right now, I am able to loop over an output stream and write the string
> somewhere. I am wondering what would be the best way to process the
> resulting string stream, in particular the best way to direct it to an
> hdfs folder or maybe parse it on the fly into a dataframe.
>
> Thanks,
>
> --
> nicolas

--
nicolas
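The read path the quoted message describes — "COPY TO STDOUT" streamed to the filesystem, then parsed with the spark CSV reader — might look roughly like the sketch below. It dumps to a local path for simplicity (an HDFS stream would work the same way), and the connection details, table name, and output path are placeholders; a distributed version would run one COPY per executor over disjoint slices of the table:

```scala
import java.io.FileOutputStream
import java.sql.DriverManager
import org.postgresql.copy.CopyManager
import org.postgresql.core.BaseConnection
import org.apache.spark.sql.SparkSession

object CopyReadSketch {
  def main(args: Array[String]): Unit = {
    // Placeholder connection details; adjust for your setup.
    val conn = DriverManager
      .getConnection("jdbc:postgresql://localhost:5432/mydb", "user", "pass")
      .asInstanceOf[BaseConnection]
    val copy = new CopyManager(conn)

    // Stream the query result out as CSV without loading it into memory.
    val out = new FileOutputStream("/tmp/my_table.csv")
    copy.copyOut("COPY (SELECT * FROM my_table) TO STDOUT WITH CSV", out)
    out.close()
    conn.close()

    // Parse the dumped CSV into a dataframe with the spark CSV reader.
    val spark = SparkSession.builder.appName("copy-read").getOrCreate()
    val df = spark.read.csv("/tmp/my_table.csv")
    df.show()
  }
}
```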