It was probably lost in the wall of text, but the point I was trying to convey was that fetching all rows via protobuf crashed PQS, while fetching them via JSON didn't. Basically, the only way I can get protobuf serialization to work is to (1) bump up the PQS memory and/or (2) fetch in batches; I don't have to do either with JSON.
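For reference, the batching workaround I mean looks roughly like the sketch below. It is just the fetch script from step 2.2 further down, reworked to pull rows in chunks with fetchmany() instead of a single fetchall(); the PQS URL is a placeholder and the chunk size is arbitrary.

import phoenixdb

database_url = 'http://localhost:8765/'  # placeholder; point this at your PQS
sql = 'select ind_id, a, b, c from test_perf'

conn = phoenixdb.connect(database_url, autocommit=True)
cursor = conn.cursor()
# leave cursor.itersize at its default (2000) so PQS serves frames of
# 2000 rows instead of materializing the whole result set at once
cursor.execute(sql)

total = 0
while True:
    batch = cursor.fetchmany(2000)  # chunk size is arbitrary
    if not batch:
        break
    total += len(batch)
print('fetched %d rows' % total)

With the default itersize the same ~6 million rows come back without PQS running out of memory; it's only the fetch-everything-in-one-frame case (itersize = -2) that pushes PQS over its heap.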
- Manoj

On Fri, Nov 9, 2018 at 10:59 AM Josh Elser <els...@apache.org> wrote:

> Manoj,
>
> re: #2 Please start by reading:
> https://plumbr.io/outofmemoryerror/gc-overhead-limit-exceeded
>
> This should give you a thorough explanation of what this error means.
> You need to increase the memory footprint and/or JVM GC properties for
> PQS to address your issue. When you request all of the rows at once, of
> course PQS needs to hold them all in memory. Please see my earlier
> instructions on how to set JVM properties for PQS.
>
> re: #3
>
> I don't understand what you mean by "need to make protobuf faster".
> Just telling me what you observe from a high elevation doesn't help me
> give you any kind of recommendations. I would imagine that you first
> need to solve the above issue before you start requesting large batches
> of records from PQS (with *either* serialization).
>
> It sounds like the JSON serialization in the Python driver does not
> match what PQS/Avatica currently expects. Avatica provides no guarantee
> of wire compatibility for JSON, so you may just be hitting code that
> needs to be updated.
>
> On 11/8/18 7:14 PM, Manoj Ganesan wrote:
> > Thanks for the pointers, Josh.
> >
> > Here's a set of instructions to demonstrate the performance issues
> > I'm seeing when using protobuf serialization. Hopefully these are
> > comprehensive enough to reproduce the issue. If not, please let me
> > know and I'm happy to provide clarifications.
> >
> > Basic outline of the instructions:
> > 1. Create and populate a test table
> > 2. Use the Python adapter to fetch rows
> > 3. JSON serialization
> >
> > 1. Create and populate a test table
> >
> > 1.1. Create a table:
> >
> > $> /usr/lib/phoenix/bin/sqlline.py localhost
> > CREATE TABLE test_perf (ind_id integer not null, a tinyint, b tinyint,
> > c varchar(12), constraint pk primary key (ind_id));
> >
> > 1.2. I wrote this simple Python script to populate the table with
> > about 6 million rows:
> >
> > # data_generator.py
> > import string
> > import csv
> > import random
> >
> > for i in range(0, 6000000):
> >     print ','.join([str(i),
> >                     str(random.randint(-127, 127)),
> >                     str(random.randint(-127, 127)),
> >                     ''.join(random.choice(string.ascii_uppercase + string.digits)
> >                             for _ in range(12))])
> >
> > # run the script
> > $> python data_generator.py > in.csv
> >
> > 1.3. Import rows into the table:
> >
> > $> /opt/phoenix/bin/psql.py -t TEST_PERF <cluster_address> in.csv
> >
> > 2. Use the Python adapter to fetch rows
> >
> > 2.1. Install the Python adapter
> >
> > The instructions are detailed in the project README
> > (https://github.com/apache/phoenix/tree/master/python/phoenixdb).
> >
> > pip install phoenixdb
> >
> > 2.2. Write a script to fetch rows
> >
> > import phoenixdb
> > from time import time
> >
> > database_url = '<your_pqs_url>'
> > sql = 'select ind_id, a, b, c from test_perf'
> >
> > conn = phoenixdb.connect(database_url, autocommit=True)
> >
> > print 'connecting to %s' % database_url
> > cursor = conn.cursor()
> > cursor.itersize = -2  # to fetch all rows at once
> > s = time()
> > print 'executing query %s' % sql
> > cursor.execute(sql)
> > print 'time to execute = %s' % (time() - s)
> >
> > s = time()
> > print 'fetching results'
> > results = cursor.fetchall()
> > print 'time to fetchall = %s' % (time() - s)
> >
> > 2.3. Execute the fetch script
> >
> > When I execute the above script to fetch all rows at once (i.e.,
> > using cursor.itersize = -2), PQS crashes and returns an error:
> >
> > Traceback (most recent call last):
> >   File "queries/protobuf_client.py", line 12, in <module>
> >     cursor = conn.cursor()
> >   File "/Users/mganesan/ttam/local-scripts/venv/lib/python2.7/site-packages/phoenixdb/cursor.py", line 201, in execute
> >     operation, first_frame_max_size=self.itersize)
> >   File "/Users/mganesan/ttam/local-scripts/venv/lib/python2.7/site-packages/phoenixdb/avatica.py", line 405, in prepare_and_execute
> >     response_data = self._apply(request, 'ExecuteResponse')
> >   File "/Users/mganesan/ttam/local-scripts/venv/lib/python2.7/site-packages/phoenixdb/avatica.py", line 213, in _apply
> >     parse_error_page(response_body)
> >   File "/Users/mganesan/ttam/local-scripts/venv/lib/python2.7/site-packages/phoenixdb/avatica.py", line 117, in parse_error_page
> >     raise errors.InternalError(message)
> > phoenixdb.errors.InternalError: ('GC overhead limit exceeded', None, None, None)
> >
> > If I fetch 2000 rows at a time (i.e., cursor.itersize = 2000, which
> > is also the default), the fetch finishes in 360s.
> >
> > 3. JSON serialization
> >
> > One of the issues I mentioned in my previous email was that JSON
> > serialization does not seem to be supported by the Python client,
> > even though I was making the queries as per the Avatica JSON
> > reference
> > (https://calcite.apache.org/avatica/docs/json_reference.html). The
> > client I wrote always received 0 rows, but no errors. On digging
> > through the source a bit, I realized that with the version used
> > within Phoenix (the latest?), 2 new arguments are expected with the
> > 'prepareAndExecute' method: 'maxRowsTotal' and 'maxRowsInFirstFrame'.
> > Essentially the request should look like the following (note the
> > last 2 args):
> >
> > {
> >     'request': 'prepareAndExecute',
> >     'sql': 'select ind_id, a, b, c from test_perf',
> >     'connectionId': '2747c631-440e-405b-8185-44d1cc7f266c',
> >     'statementId': 1,
> >     'maxRowCount': -2,
> >     'maxRowsTotal': -2,
> >     'maxRowsInFirstFrame': -2,
> > }
> >
> > "sqlline-thin.py -s JSON" works because the Avatica client driver
> > that Phoenix/sqlline uses makes the right HTTP calls, with the
> > correct args.
> >
> > The response time using JSON serialization and fetching *all* rows
> > (cursor.itersize = -2) was 60s, which is much faster than the 360s
> > it takes with protobuf and cursor.itersize = 2000 (note that protobuf
> > did not work when fetching all rows at once, i.e.,
> > cursor.itersize = -2).
> >
> > Because of these issues, for now we have switched PQS to JSON
> > serialization and updated our clients to use the same. We're
> > obviously very much interested in understanding how the protobuf
> > path can be made faster.
> >
> > Thanks again for the help!
> > Manoj
> >
> > On Tue, Nov 6, 2018 at 9:51 AM Josh Elser <els...@apache.org> wrote:
> >
> > > On 11/5/18 10:10 PM, Manoj Ganesan wrote:
> > > > Thanks for the pointers, Josh. I'm working on getting a
> > > > representative concise test to demonstrate the issue.
> > > >
> > > > Meanwhile, I had one question regarding the following:
> > > >
> > > > > You are right that the operations in PQS should be exactly the
> > > > > same, regardless of the client you're using -- that is how this
> > > > > architecture works.
> > > >
> > > > IIUC, this means the following 2 methods should yield the same
> > > > result:
> > > >
> > > > 1. sqlline-thin.py -s JSON <query_file>
> > > > 2. using a python avatica client script making JSON requests
> > >
> > > That's correct. Any client which speaks to PQS should see the same
> > > results. There may be bugs in the client implementation, of course,
> > > which make this statement false.
> > >
> > > > I made the following change in hbase-site.xml on the PQS host:
> > > >
> > > > <property>
> > > >     <name>phoenix.queryserver.serialization</name>
> > > >     <value>JSON</value>
> > > > </property>
> > > >
> > > > I notice that executing "sqlline-thin.py -s JSON <query_file>"
> > > > returns results just fine. However, when I use a simple script to
> > > > try the same query, it returns 0 rows. I'm attaching the Python
> > > > script here. The script essentially makes HTTP calls using the
> > > > Avatica JSON reference
> > > > <https://calcite.apache.org/avatica/docs/json_reference.html>. I
> > > > assumed that the sqlline-thin wrapper (when passed the -s JSON
> > > > flag) also makes HTTP calls based on the JSON reference, is that
> > > > not correct?
> > >
> > > Apache mailing lists strip attachments. Please consider hosting it
> > > somewhere else, along with instructions/scripts to generate the
> > > required tables. Please provide some more analysis of the problem
> > > than just a summarization of what you see as an end user -- I don't
> > > have the cycles or interest to debug the entire system for you :)
> > >
> > > Avatica is a protocol that interprets JDBC using some serialization
> > > (JSON or Protobuf today) and a transport (only HTTP) to a remote
> > > server to run the JDBC operations. So, yes: an Avatica client is
> > > always using HTTP, given whatever serialization you instruct it to
> > > use.
> > >
> > > > I'll work on getting some test cases here soon to illustrate this
> > > > as well as the performance problem.
> > > >
> > > > Thanks again!
> > > > Manoj
> > > >
> > > > On Mon, Nov 5, 2018 at 10:43 AM Josh Elser <els...@apache.org>
> > > > wrote:
> > > >
> > > > > Is the OOME issue regardless of using the Java client
> > > > > (sqlline-thin) and the Python client? I would like to know more
> > > > > about this one. If you can share something that reproduces the
> > > > > problem for you, I'd like to look into it. The only suggestion
> > > > > I have at this point in time is to make sure you set a
> > > > > reasonable max-heap size in hbase-env.sh (e.g. -Xmx) via
> > > > > PHOENIX_QUERYSERVER_OPTS and have HBASE_CONF_DIR pointing to
> > > > > the right directory when you launch PQS.
> > > > >
> > > > > Regarding performance, as you've described it, it sounds like
> > > > > the Python driver is just slower than the Java driver. You are
> > > > > right that the operations in PQS should be exactly the same,
> > > > > regardless of the client you're using -- that is how this
> > > > > architecture works. Avatica is a wire protocol that all clients
> > > > > use to talk to PQS. More digging/information you can provide
> > > > > about the exact circumstances (and, again, steps/environment to
> > > > > reproduce what you see) would be extremely helpful.
> > > > >
> > > > > Thanks, Manoj.
> > > > >
> > > > > - Josh
> > > > >
> > > > > On 11/2/18 7:16 PM, Manoj Ganesan wrote:
> > > > > > Thanks, Josh, for the response!
> > > > > >
> > > > > > I would definitely like to use protobuf serialization, but
> > > > > > I'm observing performance issues trying to run queries with
> > > > > > a large number of results.
> > > > > > One problem is that I observe PQS runs out of memory when
> > > > > > it's trying to (what looks like to me) serialize the results
> > > > > > in Avatica. The other is that the phoenixdb python adapter
> > > > > > itself spends a large amount of time in the logic
> > > > > > <https://github.com/apache/phoenix/blob/master/python/phoenixdb/phoenixdb/cursor.py#L248>
> > > > > > where it's converting the protobuf rows to python objects.
> > > > > >
> > > > > > Interestingly, when we use sqlline-thin.py instead of python
> > > > > > phoenixdb, the protobuf serialization works fine and
> > > > > > responses are fast. It's not clear to me why PQS would have
> > > > > > problems when using the python adapter and not when using
> > > > > > sqlline-thin; do they follow different code paths (especially
> > > > > > around serialization)?
> > > > > >
> > > > > > Thanks again,
> > > > > > Manoj
> > > > > >
> > > > > > On Fri, Nov 2, 2018 at 4:05 PM Josh Elser <els...@apache.org>
> > > > > > wrote:
> > > > > >
> > > > > > > I would strongly suggest you do not use the JSON
> > > > > > > serialization.
> > > > > > >
> > > > > > > The JSON support is implemented via Jackson, which has no
> > > > > > > means to make backwards compatibility "easy". In contrast,
> > > > > > > protobuf makes this extremely easy, and we have multiple
> > > > > > > examples over the past years where we've been able to fix
> > > > > > > bugs in a backwards-compatible manner.
> > > > > > >
> > > > > > > If you want the thin client to continue to work across
> > > > > > > versions, stick with protobuf.
> > > > > > >
> > > > > > > On 11/2/18 5:27 PM, Manoj Ganesan wrote:
> > > > > > > > Hey everyone,
> > > > > > > >
> > > > > > > > I'm trying to make the Python phoenixdb adapter work with
> > > > > > > > JSON serialization on PQS.
> > > > > > > >
> > > > > > > > I'm using Phoenix 4.14 and the adapter works fine with
> > > > > > > > protobuf, but when I try making it work with an older
> > > > > > > > version of phoenixdb (before the JSON-to-protobuf switch
> > > > > > > > was introduced), it just returns 0 rows. I don't see
> > > > > > > > anything in particular wrong with the HTTP requests
> > > > > > > > themselves, and they seem to conform to the Avatica JSON
> > > > > > > > spec
> > > > > > > > (http://calcite.apache.org/avatica/docs/json_reference.html).
> > > > > > > >
> > > > > > > > Here's the result (with some debug statements) that
> > > > > > > > returns 0 rows.
> > > > > > > > Notice the *"firstFrame":{"offset":0,"done":true,"rows":[]* below:
> > > > > > > >
> > > > > > > > request body = {"maxRowCount": -2, "connectionId":
> > > > > > > > "68c05d12-5770-47d6-b3e4-dba556db4790", "request":
> > > > > > > > "prepareAndExecute", "statementId": 3, "sql": "SELECT
> > > > > > > > col1, col2 from table limit 20"}
> > > > > > > > request headers = {'content-type': 'application/json'}
> > > > > > > > _post_request: got response {'fp': <socket._fileobject
> > > > > > > > object at 0x7f858330b9d0>, 'status': 200, 'will_close':
> > > > > > > > False, 'chunk_left': 'UNKNOWN', 'length': 1395, 'strict':
> > > > > > > > 0, 'reason': 'OK', 'version': 11, 'debuglevel': 0, 'msg':
> > > > > > > > <httplib.HTTPMessage instance at 0x7f84fb50be18>,
> > > > > > > > 'chunked': 0, '_method': 'POST'}
> > > > > > > > response.read(): body =
> > > > > > > > {"response":"executeResults","missingStatement":false,"rpcMetadata":{"response":"rpcMetadata","serverAddress":"ip-10-55-6-247:8765"},"results":[{"response":"resultSet","connectionId":"68c05d12-5770-47d6-b3e4-dba556db4790","statementId":3,"ownStatement":true,"signature":{"columns":[{"ordinal":0,"autoIncrement":false,"caseSensitive":false,"searchable":true,"currency":false,"nullable":0,"signed":true,"displaySize":40,"label":"COL1","columnName":"COL1","schemaName":"","precision":0,"scale":0,"tableName":"TABLE","catalogName":"","type":{"type":"scalar","id":4,"name":"INTEGER","rep":"PRIMITIVE_INT"},"readOnly":true,"writable":false,"definitelyWritable":false,"columnClassName":"java.lang.Integer"},{"ordinal":1,"autoIncrement":false,"caseSensitive":false,"searchable":true,"currency":false,"nullable":0,"signed":true,"displaySize":40,"label":"COL2","columnName":"COL2","schemaName":"","precision":0,"scale":0,"tableName":"TABLE","catalogName":"","type":{"type":"scalar","id":4,"name":"INTEGER","rep":"PRIMITIVE_INT"},"readOnly":true,"writable":false,"definitelyWritable":false,"columnClassName":"java.lang.Integer"}],"sql":null,"parameters":[],"cursorFactory":{"style":"LIST","clazz":null,"fieldNames":null},"statementType":null},*"firstFrame":{"offset":0,"done":true,"rows":[]*},"updateCount":-1,"rpcMetadata":{"response":"rpcMetadata","serverAddress":"ip-10-55-6-247:8765"}}]}
> > > > > > > >
> > > > > > > > The same query issued against a PQS started with PROTOBUF
> > > > > > > > serialization and using a newer phoenixdb adapter returns
> > > > > > > > the correct number of rows.
> > > > > > > >
> > > > > > > > Has anyone had luck making this work?
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Manoj
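P.S. For anyone else hitting the 0-rows problem above: what fixed it for me was adding the two newer arguments ('maxRowsTotal' and 'maxRowsInFirstFrame') to the 'prepareAndExecute' request, as described in section 3 of my instructions earlier in this thread. A rough sketch of the raw HTTP call is below. It assumes the requests library, a PQS configured with JSON serialization, and that a connection and statement have already been opened via the 'openConnection' and 'createStatement' requests from the Avatica JSON reference; the URL and IDs are placeholders.

import json
import requests

pqs_url = 'http://localhost:8765/'  # placeholder PQS address

request_body = {
    'request': 'prepareAndExecute',
    'connectionId': '2747c631-440e-405b-8185-44d1cc7f266c',  # placeholder, from openConnection
    'statementId': 1,                                        # placeholder, from createStatement
    'sql': 'select ind_id, a, b, c from test_perf',
    'maxRowCount': -2,
    # the two arguments newer Avatica versions expect; without them the
    # server answered me with an empty firstFrame like the one above
    'maxRowsTotal': -2,
    'maxRowsInFirstFrame': -2,
}

response = requests.post(pqs_url,
                         data=json.dumps(request_body),
                         headers={'content-type': 'application/json'})
print(response.json())

With those two arguments included, the executeResults payload should come back with rows in firstFrame instead of the empty frame shown above.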