Thanks for your help.

Yes, it solves the out-of-memory failure, but then I ran into the JSON
issue.

I'm not sure that the ticket represents what I currently need, i.e. at
this point I'm attempting to do a simple pull of data from PostgreSQL
and write it to parquet, nothing more.  I think that the ticket indicates
wanting to be able to directly query content in json(b) columns from
Drill.
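
For reference, the pull itself is just a CTAS.  A minimal sketch of driving
the same CTAS from Java via Drill's JDBC driver (the plugin name "pg", the
table names, and the drillbit address are all placeholders):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class PgToParquet {
        public static void main(String[] args) throws Exception {
            // connect to a local drillbit and run the CTAS there; Drill reads
            // through the "pg" JDBC storage plugin and writes parquet files
            // under the dfs.tmp workspace
            try (Connection conn =
                     DriverManager.getConnection("jdbc:drill:drillbit=localhost");
                 Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE dfs.tmp.`mytable_parquet` AS "
                         + "SELECT * FROM pg.public.mytable");
            }
        }
    }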

thanks,
Reid

On Wed, 2018-08-15 at 19:27 +0300, Vitalii Diravka wrote:
>  
> Glad to see that it helped and you've solved the issue.
> 
> I saw you asked about PostgreSQL JSONB in another topic, 
> but it looks like it is not supported for now and should be implemented in 
> the context of DRILL-5087
> 
> Kind regards
> Vitalii
> 
> 
> On Wed, Aug 15, 2018 at 3:36 PM Reid Thompson <[email protected]> 
> wrote:
> > Vitalii,
> > 
> > yes. Per https://jdbc.postgresql.org/documentation/head/connect.html 
> > (the page lists numerous available settings):
> > 
> > defaultRowFetchSize = int
> > 
> > Determines the number of rows fetched in a ResultSet by one fetch, i.e. one 
> > trip to the database. Limiting the number of rows fetched with each trip to 
> > the database avoids unnecessary memory consumption and, as a consequence, 
> > OutOfMemoryException.
> > 
> > The default is zero, meaning that the ResultSet will fetch all rows at 
> > once. Negative numbers are not allowed.
> > 
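> > A minimal sketch of what that property amounts to in plain Java (host, db, 
> > credentials, and table name are placeholders; note that in plain JDBC the 
> > pgjdbc driver only uses a cursor when autocommit is off):
> > 
> >     import java.sql.Connection;
> >     import java.sql.DriverManager;
> >     import java.sql.ResultSet;
> >     import java.sql.Statement;
> > 
> >     public class FetchSizeDemo {
> >         public static void main(String[] args) throws Exception {
> >             String url = "jdbc:postgresql://myhost.mydomain.com/mydb"
> >                        + "?defaultRowFetchSize=10000";
> >             try (Connection conn =
> >                      DriverManager.getConnection(url, "user", "password")) {
> >                 conn.setAutoCommit(false); // cursor-based fetch needs this off
> >                 try (Statement st = conn.createStatement();
> >                      ResultSet rs = st.executeQuery("SELECT * FROM mytable")) {
> >                     while (rs.next()) {
> >                         // rows arrive 10000 per round trip, not all at once
> >                     }
> >                 }
> >             }
> >         }
> >     }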
> > 
> > On another topic: is there any way to have Drill properly recognize
> > PostgreSQL's json and jsonb types?  I have tables with both, and am
> > getting this error:
> > 
> >  org.apache.drill.common.exceptions.UserException: UNSUPPORTED_OPERATION
> >  ERROR: A column you queried has a data type that is not currently
> >  supported by the JDBC storage plugin. The column's name was actionjson
> >  and its JDBC data type was OTHER.
> > 
> > 
> > thanks,
> > reid
> > 
> > On Wed, 2018-08-15 at 14:44 +0300, Vitalii Diravka wrote:
> > >  
> > > Hi Reid,
> > > 
> > > Am I right that the defaultRowFetchSize=10000 property in the URL solves 
> > > that OOM issue?
> > > If so, it could possibly be useful to have this information in the Drill 
> > > docs [1].
> > > 
> > > [1] https://drill.apache.org/docs/rdbms-storage-plugin/
> > > 
> > > Kind regards
> > > Vitalii
> > > 
> > > 
> > > On Tue, Aug 14, 2018 at 4:17 PM Reid Thompson 
> > > <[email protected]> wrote:
> > > > Using the below parameters in the URL and looking in the defined logfile
> > > > indicates that the fetch size is being set to 10000, as expected. 
> > > > 
> > > > Just to note: it appears that the defaultRowFetchSize param both sets the
> > > > fetch size and signifies that a cursor should be used.  It is different
> > > > from the originally noted defaultFetchSize param, and it appears that
> > > > PostgreSQL doesn't require useCursorFetch=true or 
> > > > defaultAutoCommit=false.
> > > > 
> > > > ...snip..
> > > >   "url": 
> > > > "jdbc:postgresql://myhost.mydomain.com/mydb?useCursorFetch=true&defaultAutoCommit=false&loggerLevel=TRACE&loggerFile=/tmp/jdbc.log&defaultRowFetchSize=10000",
> > > > ...snip..
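> > > > 
> > > > If that holds, presumably the URL can be trimmed to just the one
> > > > property, something like:
> > > > 
> > > > ...snip..
> > > >   "url": 
> > > > "jdbc:postgresql://myhost.mydomain.com/mydb?defaultRowFetchSize=10000",
> > > > ...snip..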
> > > > 
> > > > 
> > > > 
> > > > On Tue, 2018-08-14 at 07:26 -0400, Reid Thompson wrote:
> > > > > Attempting with the below still fails.
> > > > > Looking at pg_stat_activity, it doesn't appear that a cursor is being
> > > > > created; it's still attempting to pull all the data at once.
> > > > > 
> > > > > thanks,
> > > > > reid
> > > > > On Mon, 2018-08-13 at 14:18 -0400, Reid Thompson wrote:
> > > > > > Vitalii,
> > > > > > 
> > > > > > Ok, thanks.  I had found that report, but didn't note the option
> > > > > > related to defaultAutoCommit. 
> > > > > > > [1] https://issues.apache.org/jira/browse/DRILL-4177
> > > > > > 
> > > > > > 
> > > > > > so, something along the lines of
> > > > > > 
> > > > > > ..snip..
> > > > > >   "url": 
> > > > > > "jdbc:postgresql://myhost.mydomain.com/ateb?useCursorFetch=true&defaultFetchSize=10000&defaultAutoCommit=false",
> > > > > > ..snip..
> > > > > > 
> > > > > > 
> > > > > > thanks,
> > > > > > reid
> > > > > > 
> > > > > > On Mon, 2018-08-13 at 20:33 +0300, Vitalii Diravka wrote:
> > > > > > >  
> > > > > > > Hi Reid,
> > > > > > > 
> > > > > > > Looks like your issue is similar to DRILL-4177 [1].
> > > > > > > It was related to a MySQL connection, and it looks like the same 
> > > > > > > issue exists with PostgreSQL.
> > > > > > > Looking at the Postgres documentation, the code needs to 
> > > > > > > explicitly set the connection autocommit mode to false, e.g. 
> > > > > > > conn.setAutoCommit(false) [2]. For data sizes of 10 million 
> > > > > > > rows and up, this is a must.
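> > > > > > > 
> > > > > > > A compact sketch of the pattern from [2] (connection details and 
> > > > > > > table name are placeholders):
> > > > > > > 
> > > > > > >     import java.sql.*;
> > > > > > > 
> > > > > > >     public class CursorFetch {
> > > > > > >         public static void main(String[] args) throws Exception {
> > > > > > >             try (Connection conn = DriverManager.getConnection(
> > > > > > >                     "jdbc:postgresql://localhost/mydb", "user", "password")) {
> > > > > > >                 conn.setAutoCommit(false); // required for cursor use
> > > > > > >                 try (Statement st = conn.createStatement()) {
> > > > > > >                     st.setFetchSize(10000); // rows per trip to the server
> > > > > > >                     try (ResultSet rs = st.executeQuery("SELECT * FROM mytable")) {
> > > > > > >                         while (rs.next()) {
> > > > > > >                             // process rows as they stream in
> > > > > > >                         }
> > > > > > >                     }
> > > > > > >                 }
> > > > > > >             }
> > > > > > >         }
> > > > > > >     }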
> > > > > > > 
> > > > > > > You could disable the "Auto Commit" option as a session option [3], 
> > > > > > > or do it within the plugin config URL with the following property: 
> > > > > > > defaultAutoCommit=false [4]
> > > > > > > 
> > > > > > > [1] https://issues.apache.org/jira/browse/DRILL-4177
> > > > > > > [2] 
> > > > > > > https://jdbc.postgresql.org/documentation/93/query.html#fetchsize-example
> > > > > > > [3] 
> > > > > > > https://www.postgresql.org/docs/9.3/static/ecpg-sql-set-autocommit.html
> > > > > > > [4] https://jdbc.postgresql.org/documentation/head/ds-cpds.html
> > > > > > > 
> > > > > > > Kind regards
> > > > > > > Vitalii
> > > > > > > 
> > > > > > > 
> > > > > > > On Mon, Aug 13, 2018 at 3:03 PM Reid Thompson 
> > > > > > > <[email protected]> wrote:
> > > > > > > > My standalone host is configured with 16GB RAM and 8 CPUs.  Using
> > > > > > > > drill-embedded (single-host standalone), I am attempting to 
> > > > > > > > pull data
> > > > > > > > from PostgreSQL tables to parquet files via CTAS. Smaller 
> > > > > > > > datasets work
> > > > > > > > fine, but larger datasets fail (for example, ~11GB) with
> > > > > > > > "java.lang.OutOfMemoryError: GC overhead limit exceeded".  Can 
> > > > > > > > someone
> > > > > > > > advise on how to get past this?
> > > > > > > > 
> > > > > > > > Is there a way to have Drill stream this data from PostgreSQL 
> > > > > > > > to parquet
> > > > > > > > files on disk, or does the dataset have to be completely 
> > > > > > > > loaded into
> > > > > > > > memory before it can be written to disk?  The documentation 
> > > > > > > > indicates
> > > > > > > > that Drill will spill to disk to avoid memory issues, so I had 
> > > > > > > > hoped
> > > > > > > > that it would be straightforward to extract from the DB to disk.
> > > > > > > > 
> > > > > > > > Should I not be attempting this via CTAS?  What are the other 
> > > > > > > > options?
> > > > > > > > 
> > > > > > > > 
> > > > > > > > thanks,
> > > > > > > > reid
> > > > > > > > 
