I'm wondering if you can export the JSON from Postgres to a JSON document, and then write it to Parquet using Drill. This link may have some ideas:
https://hashrocket.com/blog/posts/create-quick-json-data-dumps-from-postgresql

Thanks.
--Robert
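A rough sketch of that approach, in case it helps (the table name, output path, and workspace below are placeholders, and it assumes Drill's default writable dfs.tmp workspace with store.format left at parquet):

-- PostgreSQL side: dump each row as one JSON object per line,
-- along the lines of the linked post ("mytable" is hypothetical).
COPY (SELECT row_to_json(t) FROM mytable t) TO '/tmp/mytable.json';

-- Drill side: read that JSON file and write it back out as Parquet.
CREATE TABLE dfs.tmp.`mytable_parquet` AS
SELECT * FROM dfs.`/tmp/mytable.json`;

This keeps the large dataset out of the JDBC path entirely; Drill only ever sees the file on disk.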
On Wed, Aug 15, 2018 at 10:16 AM, Reid Thompson <[email protected]> wrote:
> Thanks for your help.
>
> Yes, it solves the out of memory failure, but then I ran into the json
> issue.
>
> I'm not sure that the ticket represents what I currently need. I.e., at
> this point I'm attempting to do a simple pull of data from PostgreSQL
> and write to parquet, nothing more. I think that the ticket is about
> wanting to be able to directly query for content in json(b) columns
> from Drill.
>
> thanks,
> Reid
>
> On Wed, 2018-08-15 at 19:27 +0300, Vitalii Diravka wrote:
> > [EXTERNAL SOURCE]
> >
> > Glad to see that it helps and you've solved the issue.
> >
> > I have seen you asked about PostgreSQL JSONB in another topic,
> > but it looks like it is not supported for now and should be
> > implemented in the context of DRILL-5087.
> >
> > Kind regards
> > Vitalii
> >
> > On Wed, Aug 15, 2018 at 3:36 PM Reid Thompson <[email protected]> wrote:
> > > Vitalii,
> > >
> > > Yes. Per https://jdbc.postgresql.org/documentation/head/connect.html
> > > (the page lists numerous settings available):
> > >
> > > defaultRowFetchSize = int
> > >
> > > Determines the number of rows fetched in a ResultSet per trip to the
> > > database. Limiting the number of rows fetched with each trip avoids
> > > unnecessary memory consumption and, as a consequence,
> > > OutOfMemoryException.
> > >
> > > The default is zero, meaning that the ResultSet will fetch all rows
> > > at once. Negative numbers are not allowed.
> > >
> > > On another topic:
> > > Is there any way to have Drill properly recognize PostgreSQL's json
> > > and jsonb types? I have tables with both, and am getting this error:
> > >
> > > org.apache.drill.common.exceptions.UserException: UNSUPPORTED_OPERATION
> > > ERROR: A column you queried has a data type that is not currently
> > > supported by the JDBC storage plugin. The column's name was actionjson
> > > and its JDBC data type was OTHER.
> > >
> > > thanks,
> > > reid
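One possible workaround for the OTHER data type error, untested here: cast the jsonb column to text on the PostgreSQL side so the JDBC plugin sees a plain character type instead of an unsupported one. The view, table, and id column below are hypothetical; only the actionjson column name comes from the error above.

-- PostgreSQL: expose the jsonb column as plain text for Drill.
CREATE VIEW actions_for_drill AS
SELECT id, actionjson::text AS actionjson
FROM actions;

Drill should then read the column as text, and CONVERT_FROM(actionjson, 'JSON') may work for pulling values out of it on the Drill side.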
> > >
> > > On Wed, 2018-08-15 at 14:44 +0300, Vitalii Diravka wrote:
> > > > [EXTERNAL SOURCE]
> > > >
> > > > Hi Reid,
> > > >
> > > > Am I right that the defaultRowFetchSize=10000 property in the URL
> > > > solves that OOM issue?
> > > > If so, possibly it could be useful to have this information in the
> > > > Drill docs [1].
> > > >
> > > > [1] https://drill.apache.org/docs/rdbms-storage-plugin/
> > > >
> > > > Kind regards
> > > > Vitalii
> > > >
> > > > On Tue, Aug 14, 2018 at 4:17 PM Reid Thompson <[email protected]> wrote:
> > > > > Using the below parameters in the URL and looking in the defined
> > > > > logfile indicates that the fetch size is being set to 10000, as
> > > > > expected.
> > > > >
> > > > > Just to note that it appears that the defaultRowFetchSize param
> > > > > sets the fetch size and signifies that a cursor should be used.
> > > > > It is different from the originally noted defaultFetchSize param,
> > > > > and it appears that PostgreSQL doesn't require useCursorFetch=true
> > > > > or defaultAutoCommit=false.
> > > > >
> > > > > ...snip..
> > > > > "url": "jdbc:postgresql://myhost.mydomain.com/mydb?useCursorFetch=true&defaultAutoCommit=false&loggerLevel=TRACE&loggerFile=/tmp/jdbc.log&defaultRowFetchSize=10000",
> > > > > ...snip..
> > > > >
> > > > > On Tue, 2018-08-14 at 07:26 -0400, Reid Thompson wrote:
> > > > > > Attempting with the below still fails.
> > > > > > Looking at pg_stat_activity, it doesn't appear that a cursor is
> > > > > > being created. It's still attempting to pull all the data at once.
> > > > > >
> > > > > > thanks,
> > > > > > reid
> > > > > >
> > > > > > On Mon, 2018-08-13 at 14:18 -0400, Reid Thompson wrote:
> > > > > > > Vitalii,
> > > > > > >
> > > > > > > Ok, thanks. I had found that report, but didn't note the option
> > > > > > > related to defaultAutoCommit.
> > > > > > > > [1] https://issues.apache.org/jira/browse/DRILL-4177
> > > > > > >
> > > > > > > So, something along the lines of
> > > > > > >
> > > > > > > ..snip..
> > > > > > > "url": "jdbc:postgresql://myhost.mydomain.com/ateb?useCursorFetch=true&defaultFetchSize=10000&defaultAutoCommit=false",
> > > > > > > ..snip..
> > > > > > >
> > > > > > > thanks,
> > > > > > > reid
> > > > > > >
> > > > > > > On Mon, 2018-08-13 at 20:33 +0300, Vitalii Diravka wrote:
> > > > > > > > [EXTERNAL SOURCE]
> > > > > > > >
> > > > > > > > Hi Reid,
> > > > > > > >
> > > > > > > > Looks like your issue is similar to DRILL-4177 [1].
> > > > > > > > It was related to a MySQL connection; it looks like there is a
> > > > > > > > similar issue with PostgreSQL.
> > > > > > > > Looking at the Postgres documentation, the code needs to
> > > > > > > > explicitly set the connection's autocommit mode to false, e.g.
> > > > > > > > conn.setAutoCommit(false) [2]. For data sizes of 10 million
> > > > > > > > rows or more, this is a must.
> > > > > > > >
> > > > > > > > You could disable the "Auto Commit" option as a session option [3],
> > > > > > > > or do it within the plugin config URL with the following
> > > > > > > > property: defaultAutoCommit=false [4].
> > > > > > > >
> > > > > > > > [1] https://issues.apache.org/jira/browse/DRILL-4177
> > > > > > > > [2] https://jdbc.postgresql.org/documentation/93/query.html#fetchsize-example
> > > > > > > > [3] https://www.postgresql.org/docs/9.3/static/ecpg-sql-set-autocommit.html
> > > > > > > > [4] https://jdbc.postgresql.org/documentation/head/ds-cpds.html
> > > > > > > >
> > > > > > > > Kind regards
> > > > > > > > Vitalii
> > > > > > > >
> > > > > > > > On Mon, Aug 13, 2018 at 3:03 PM Reid Thompson <[email protected]> wrote:
> > > > > > > > > My standalone host is configured with 16GB RAM, 8 CPUs. Using
> > > > > > > > > drill-embedded (single host standalone), I am attempting to pull data
> > > > > > > > > from PostgreSQL tables to parquet files via CTAS. Smaller datasets work
> > > > > > > > > fine, but larger data sets fail (for example ~11GB) with
> > > > > > > > > "java.lang.OutOfMemoryError: GC overhead limit exceeded". Can someone
> > > > > > > > > advise on how to get past this?
> > > > > > > > >
> > > > > > > > > Is there a way to have Drill stream this data from PostgreSQL to parquet
> > > > > > > > > files on disk, or does the data set have to be completely loaded into
> > > > > > > > > memory before it can be written to disk? The documentation indicates
> > > > > > > > > that Drill will spill to disk to avoid memory issues, so I had hoped
> > > > > > > > > that it would be straightforward to extract from the DB to disk.
> > > > > > > > >
> > > > > > > > > Should I not be attempting this via CTAS? What are the other options?
> > > > > > > > >
> > > > > > > > > thanks,
> > > > > > > > > reid
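For reference, a minimal sketch of the CTAS being discussed, assuming a jdbc storage plugin named "pg" whose connection URL includes the defaultRowFetchSize property from this thread, and a hypothetical table public.mytable (the exact schema path the plugin exposes can be checked with SHOW SCHEMAS):

-- Drill: write the PostgreSQL table out as Parquet via CTAS.
ALTER SESSION SET `store.format` = 'parquet';

CREATE TABLE dfs.tmp.`mytable_parquet` AS
SELECT *
FROM pg.public.mytable;

With defaultRowFetchSize set in the plugin URL, the driver fetches rows in batches instead of materializing the whole result set, which is what appears to have resolved the GC overhead error earlier in the thread.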
