Thanks for the pointers Josh.
Here’s a set of instructions to demonstrate the performance issues I’m
seeing when using protobuf serialization. Hopefully these are
comprehensive enough to reproduce the issue. If not, please let me know
and I’m happy to provide clarifications.
Basic outline of the instructions:
1. Create and populate a test table
2. Use the python adapter to fetch rows
3. JSON serialization
1. Create and populate a test table
1.1. Create a table:
$> /usr/lib/phoenix/bin/sqlline.py localhost
CREATE TABLE test_perf (ind_id integer not null, a tinyint , b tinyint ,
c varchar(12), constraint pk primary key (ind_id));
1.2. I wrote this simple python script to populate the table with about
6 million rows:
# data_generator.py
import string
import csv
import random
for i in range(0, 6000000):
print ','.join([str(i), str(random.randint(-127, 127)),
str(random.randint(-127, 127)),
''.join(random.choice(string.ascii_uppercase + string.digits) for _ in
range(12))])
# run the script
$> python data_generator.py > in.csv
1.3. Import rows into the table
$> /opt/phoenix/bin/psql.py -t TEST_PERF <cluster_address> in.csv
2. Use the python adapter to fetch rows
2.1. Install the Python adapter
The instructions are detailed on the project README
(https://github.com/apache/phoenix/tree/master/python/phoenixdb).
pip install phoenixdb
2.2. Write a script to fetch rows
import phoenixdb
from time import time
database_url = ‘<your_pqs_url>’
sql = 'select ind_id, a, b, c from test_perf'
conn = phoenixdb.connect(database_url, autocommit=True)
print 'connecting to %s' % database_url
cursor = conn.cursor()
cursor.itersize = -2 # to fetch all rows at once
s = time()
print 'executing query %s' % sql
cursor.execute()
print 'time to execute = %s' % (time() - s)
s = time()
print 'fetching results'
results = cursor.fetchall()
print 'time to fetchall = %s' % (time() - s)
2.3. Execute the fetch script
When I execute the above script to fetch all rows at once (i.e., using
cursor.itersize = -2), PQS crashes and returns and error:
Traceback (most recent call last):
File "queries/protobuf_client.py", line 12, in <module>
cursor = conn.cursor()
File
"/Users/mganesan/ttam/local-scripts/venv/lib/python2.7/site-packages/phoenixdb/cursor.py",
line 201, in execute
operation, first_frame_max_size=self.itersize)
File
"/Users/mganesan/ttam/local-scripts/venv/lib/python2.7/site-packages/phoenixdb/avatica.py",
line 405, in prepare_and_execute
response_data = self._apply(request, 'ExecuteResponse')
File
"/Users/mganesan/ttam/local-scripts/venv/lib/python2.7/site-packages/phoenixdb/avatica.py",
line 213, in _apply
parse_error_page(response_body)
File
"/Users/mganesan/ttam/local-scripts/venv/lib/python2.7/site-packages/phoenixdb/avatica.py",
line 117, in parse_error_page
raise errors.InternalError(message)
phoenixdb.errors.InternalError: ('GC overhead limit exceeded', None,
None, None)
If I fetch it 2000 rows at a time (i.e., cursor.itersize = 2000, which
is also the default), the fetch finishes in 360s.
3. JSON serialization
One of the issues I mentioned in my previous email was about how JSON
serialization seems to be not supported via the Python client even
though I was making the queries as per the Avatica JSON reference
(https://calcite.apache.org/avatica/docs/json_reference.html). The
client I wrote always received 0 rows, but no errors. On digging through
the source a bit, I realized that with the version being used within
Phoenix (the latest?), 2 new arguments are expected with the
‘prepareAndExecute’ method: ‘maxRowsTotal’ and ‘maxRowsInFirstFrame’.
Essentially the request should look like the following (note the last 2
args):
{
'request': 'prepareAndExecute',
'sql': 'select ind_id, a, b, c from test_perf’,
'connectionId': '2747c631-440e-405b-8185-44d1cc7f266c',
'statementId': 1,
'maxRowCount': -2,
'maxRowsTotal': -2,
'maxRowsInFirstFrame': -2,
}
“sqlline-thin.json -s JSON” works because the Avatica client driver
Phoenix/Sqlline uses makes the right HTTP calls, with the correct args.
The response time with using JSON serialization and fetching *all* rows
(cursor.itersize = -2) was 60s, which is much faster than the 360s that
it takes with protobuf when passing cursor.itersize=2K (note that
protobuf did not work with fetching all rows at once, i.e.,
cursor.itersize=-2)
Because of these issues, for now we have switched PQS to using JSON
serialization and updated our clients to use the same. We’re obviously
very much interested in understanding how the protobuf path can be made
faster.
Thanks again for the help!
Manoj
On Tue, Nov 6, 2018 at 9:51 AM Josh Elser <els...@apache.org
<mailto:els...@apache.org>> wrote:
On 11/5/18 10:10 PM, Manoj Ganesan wrote:
> Thanks for the pointers Josh. I'm working on getting a
representative
> concise test to demonstrate the issue.
>
> Meanwhile, I had one question regarding the following:
>
> You are right that the operations in PQS should be exactly
the same,
> regardless of the client you're using -- that is how this
> architecture works.
>
>
> IIUC, this means the following 2 methods should yield the same
result:
>
> 1. sqlline-thin.py -s JSON <query_file>
> 2. using a python avatica client script making JSON requests
That's correct. Any client which speaks to PQS should see the same
results. There may be bugs in the client implementation, of course,
which make this statement false.
> I made the following change in hbase-site.xml on the PQS host:
>
> <property>
> <name>phoenix.queryserver.serialization</name>
> <value>JSON</value>
> </property>
>
> I notice that executing "sqlline-thin.py -s JSON <query_file>"
returns
> results just fine. However, when I use a simple script to try the
same
> query, it returns 0 rows. I'm attaching the Python script here. The
> script essentially makes HTTP calls using the Avatica JSON reference
> <https://calcite.apache.org/avatica/docs/json_reference.html>. I
assumed
> that the sqlline-thin wrapper (when passed the -s JSON flag) also
make
> HTTP calls based on the JSON reference, is that not correct?
Apache mailing lists strip attachments. Please consider hosting it
somewhere else, along with instructions/scripts to generate the
required
tables. Please provide some more analysis of the problem than just a
summarization of what you see as an end-user -- I don't have the cycles
or interest to debug the entire system for you :)
Avatica is a protocol that interprets JDBC using some serialization
(JSON or Protobuf today) and a transport (only HTTP) to a remote server
to run the JDBC oeprations. So, yes: an Avatica client is always using
HTTP, given whatever serialization you instruct it to use.
> I'll work on getting some test cases here soon to illustrate this as
> well as the performance problem.
>
> Thanks again!
> Manoj
>
> On Mon, Nov 5, 2018 at 10:43 AM Josh Elser <els...@apache.org
<mailto:els...@apache.org>
> <mailto:els...@apache.org <mailto:els...@apache.org>>> wrote:
>
> Is the OOME issue regardless of using the Java client
(sqlline-thin)
> and
> the Python client? I would like to know more about this one.
If you can
> share something that reproduces the problem for you, I'd like
to look
> into it. The only suggestion I have at this point in time is
to make
> sure you set a reasonable max-heap size in hbase-env.sh (e.g.
-Xmx) via
> PHOENIX_QUERYSERVER_OPTS and have HBASE_CONF_DIR pointing to
the right
> directory when you launch PQS.
>
> Regarding performance, as you've described it, it sounds like the
> Python
> driver is just slower than the Java driver. You are right
that the
> operations in PQS should be exactly the same, regardless of
the client
> you're using -- that is how this architecture works. Avatica
is a wire
> protocol that all clients use to talk to PQS. More
digging/information
> you can provide about the exact circumstances (and, again,
> steps/environment to reproduce what you see) would be
extremely helpful.
>
> Thanks Manoj.
>
> - Josh
>
> On 11/2/18 7:16 PM, Manoj Ganesan wrote:
> > Thanks Josh for the response!
> >
> > I would definitely like to use protobuf serialization, but I'm
> observing
> > performance issues trying to run queries with a large
number of
> results.
> > One problem is that I observe PQS runs out of memory, when its
> trying to
> > (what looks like to me) serialize the results in Avatica. The
> other is
> > that the phoenixdb python adapter itself spends a large
amount of
> time
> > in the logic
> >
>
<https://github.com/apache/phoenix/blob/master/python/phoenixdb/phoenixdb/cursor.py#L248>
>
> > where its converting the protobuf rows to python objects.
> >
> > Interestingly when we use sqlline-thin.py instead of python
> phoenixdb,
> > the protobuf serialization works fine and responses are fast.
> It's not
> > clear to me why PQS would have problems when using the python
> adapter
> > and not when using sqlline-thin, do they follow different
code paths
> > (especially around serialization)?
> >
> > Thanks again,
> > Manoj
> >
> > On Fri, Nov 2, 2018 at 4:05 PM Josh Elser
<els...@apache.org <mailto:els...@apache.org>
> <mailto:els...@apache.org <mailto:els...@apache.org>>
> > <mailto:els...@apache.org <mailto:els...@apache.org>
<mailto:els...@apache.org <mailto:els...@apache.org>>>> wrote:
> >
> > I would strongly suggest you do not use the JSON
serialization.
> >
> > The JSON support is implemented via Jackson which has no
> means to make
> > backwards compatibility "easy". On the contrast, protobuf
> makes this
> > extremely easy and we have multiple examples over the past
> years where
> > we've been able to fix bugs in a backwards compatible
manner.
> >
> > If you want the thin client to continue to work across
> versions, stick
> > with protobuf.
> >
> > On 11/2/18 5:27 PM, Manoj Ganesan wrote:
> > > Hey everyone,
> > >
> > > I'm trying to use the Python phoenixdb adapter work
with JSON
> > > serialization on PQS.
> > >
> > > I'm using Phoenix 4.14 and the adapter works fine with
> protobuf, but
> > > when I try making it work with an older version of
phoenixdb
> > (before the
> > > JSON to protobuf switch was introduced), it just
returns 0
> rows.
> > I don't
> > > see anything in particular wrong with the HTTP requests
> itself,
> > and they
> > > seem to conform to the Avatica JSON spec
> > >
(http://calcite.apache.org/avatica/docs/json_reference.html).
> > >
> > > Here's the result (with some debug statements) that
> returns 0 rows.
> > > Notice the
> *"firstFrame":{"offset":0,"done":true,"rows":[]* below:
> > >
> > > request body = {"maxRowCount": -2, "connectionId":
> > > "68c05d12-5770-47d6-b3e4-dba556db4790", "request":
> > "prepareAndExecute",
> > > "statementId": 3, "sql": "SELECT col1, col2 from table
> limit 20"}
> > > request headers = {'content-type': 'application/json'}
> > > _post_request: got response {'fp': <socket._fileobject
> object at
> > > 0x7f858330b9d0>, 'status': 200, 'will_close': False,
> 'chunk_left':
> > > 'UNKNOWN', 'length': 1395, 'strict': 0, 'reason': 'OK',
> > 'version': 11,
> > > 'debuglevel': 0, 'msg': <httplib.HTTPMessage
instance at
> > > 0x7f84fb50be18>, 'chunked': 0, '_method': 'POST'}
> > > response.read(): body =
> > >
> >
>
{"response":"executeResults","missingStatement":false,"rpcMetadata":{"response":"rpcMetadata","serverAddress":"ip-10-55-6-247:8765"},"results":[{"response":"resultSet","connectionId":"68c05d12-5770-47d6-b3e4-dba556db4790","statementId":3,"ownStatement":true,"signature":{"columns":[{"ordinal":0,"autoIncrement":false,"caseSensitive":false,"searchable":true,"currency":false,"nullable
> > >
> >
>
":0,"signed":true,"displaySize":40,"label":"COL1","columnName":"COL1","schemaName":"","precision":0,"scale":0,"tableName":"TABLE","catalogName":"","type":{"type":"scalar","id":4,"name":"INTEGER","rep":"PRIMITIVE_INT"},"readOnly":true,"writable":false,"definitelyWritable":false,"columnClassName":"java.lang.Integer"},{"ordinal":1,"autoIncrement":false,"caseSensitive":false,"searchable":true,"currency":false,"nullable":0,"signed":true,"displaySize":40,"label":"COL2","columnName":"COL2","schemaName":"","precision":0,"scale":0,"tableName":"TABLE","catalogName":"","type":{"type":"scalar","id":4,"name":"INTEGER","rep":"PRIMITIVE_INT"},"readOnly":true,"writable":false,"definitelyWritable":false,"columnClassName":"java.lang.Integer"}],"sql":null,"parameters":[],"cursorFactory":{"style":"LIST","clazz":null,"fieldNames":null},"statementType":null},*"firstFrame":{"offset":0,"done":true,"rows":[]*},"updateCount":-1,"rpcMetadata":{"response":"rpcMetadata","serverAddress":"ip-10-55-6-247:8765"}}]}
> >
> > >
> > >
> > > The same query issued against a PQS started with
PROTOBUF
> > serialization
> > > and using a newer phoenixdb adapter returns the correct
> number of
> > rows.
> > >
> > > Has anyone had luck making this work?
> > >
> > > Thanks,
> > > Manoj
> > >
> >
>