Re: Phoenix as a source for Spark processing

2018-03-08 Thread Josh Elser
How large is each row in this case? Or, better yet, how large is the 
table in HBase?


You're spreading out approximately 7 "clients" to each Regionserver 
fetching results (100/14). So, you should have pretty decent saturation 
from Spark into HBase.


I'd take a look at the EXPLAIN plan for your SELECT DISTINCT to really 
understand what Phoenix is doing. For example, are you getting ample 
saturation of the resources your servers have available (32 cores/128 GB 
memory is pretty good)? Validating how busy Spark is actually keeping 
HBase, and how much time is spent transforming the data, would also be 
worthwhile. Another point: are you excessively scanning data that you 
could otherwise avoid with a different row key structure, e.g. via a 
skip-scan (which would show up in the EXPLAIN plan)?
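
For illustration, checking the plan is a one-liner (table and column 
names here are placeholders for your own):

  -- Shows how Phoenix will execute the query; look for FULL SCAN vs.
  -- RANGE/SKIP SCAN and the SERVER AGGREGATE INTO DISTINCT ROWS step.
  EXPLAIN
  SELECT DISTINCT c1, c2, c3, c4, c5, c6
  FROM source_table;

Roughly speaking, a FULL SCAN plus SERVER AGGREGATE INTO DISTINCT ROWS 
means every region is read and de-duplicated server-side, while a RANGE 
or SKIP SCAN line would indicate the row key is pruning data for you.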


You may actually find that the built-in UPSERT SELECT logic out-performs 
the Spark integration, since you aren't actually doing any transformation 
logic inside of Spark.
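
A rough sketch of that (placeholder table/column names), which keeps the 
work inside Phoenix/HBase instead of round-tripping rows through Spark 
executors:

  UPSERT INTO target_table (c1, c2, c3, c4, c5, c6)
  SELECT DISTINCT c1, c2, c3, c4, c5, c6
  FROM source_table;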



On 3/5/18 3:14 PM, Stepan Migunov wrote:

Hi Josh, thank you for response!

Our cluster has 14 nodes (32 cores each/128 GB memory). The source Phoenix
table contains about 1 billion records (100 columns). We start a Spark job
with about 100 executors. Spark executes a SELECT from the source table
(selecting 6 columns with DISTINCT) and writes the output to another Phoenix
table. We expect the target table to contain about 100 million records.
HBase has 14 region servers, and both tables are salted with SALT_BUCKETS=42.
The Spark job runs via YARN.


-Original Message-
From: Josh Elser [mailto:els...@apache.org]
Sent: Monday, March 5, 2018 9:14 PM
To: user@phoenix.apache.org
Subject: Re: Phoenix as a source for Spark processing

Hi Stepan,

Can you better ballpark the Phoenix-Spark performance you've seen (e.g.
how much hardware do you have, how many spark executors did you use, how
many region servers)? Also, what versions of software are you using?

I don't think there are any firm guidelines on how you can solve this
problem, but you've found the tools available to you.

* You can try Phoenix+Spark to run over the Phoenix tables in place
* You can use Phoenix+Hive to offload the data into Hive for queries

If Phoenix-Spark wasn't fast enough, I'd imagine using the Phoenix-Hive
integration to query the data would be similarly not fast enough.

It's possible that the bottleneck is something we could fix in the
integration, or in the configuration of Spark and/or Phoenix. We'd need you
to help quantify this better :)

On 3/4/18 6:08 AM, Stepan Migunov wrote:

In our software we need to combine fast interactive access to the data
with quite complex data processing. I know that Phoenix is intended for fast
access, but I hoped that I could also use Phoenix as a source for complex
processing with Spark. Unfortunately, Phoenix + Spark shows very poor
performance. E.g., querying a big table (about a billion records) with
DISTINCT takes about 2 hours. At the same time, this task with a Hive source
takes a few minutes. Is this expected? Does it mean that Phoenix is
absolutely not suitable for batch processing with Spark and I should
duplicate the data to Hive and process it with Hive?



Re: Phoenix as a source for Spark processing

2018-03-08 Thread Josh Elser
I would guess that Hive would always be capable of out-matching what 
HBase/Phoenix can do for this type of workload (bulk transformation). 
That said, I'm not ready to tell you that you can't get the Phoenix-Spark 
integration performing better. See the other thread where you provide 
more details.


It's important to remember that Phoenix is designed to shine when you 
have workloads which require updates to a single row/column. The 
underlying I/O system is much different in HBase compared to Hive, in 
order to serve the random-update use case.
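
For example (placeholder names), the kind of single-row, single-column 
write that Phoenix and HBase are optimized for looks like:

  -- Touches exactly one row key; HBase serves this without any scan.
  UPSERT INTO my_table (pk, col1) VALUES (12345, 'new-value');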


On 3/7/18 4:08 AM, Stepan Migunov wrote:

Some more details... We have done some simple tests to compare the read/write 
performance of Spark+Hive and Spark+Phoenix. We now have the following results:

Copy table (no transformations, about 800 million records):

Hive (TEZ): 752 sec

Spark:
  From Hive to Hive: 2463 sec
  From Phoenix to Hive: 13310 sec
  From Hive to Phoenix: > 30240 sec

We use Spark 2.2.1, HBase 1.1.2, Phoenix 4.13, Hive 2.1.1.

So it seems that Spark + Phoenix leads to a great performance degradation. Any 
thoughts?

On 2018/03/04 11:08:56, Stepan Migunov  
wrote:

In our software we need to combine fast interactive access to the data with 
quite complex data processing. I know that Phoenix is intended for fast access, 
but I hoped that I could also use Phoenix as a source for complex processing 
with Spark. Unfortunately, Phoenix + Spark shows very poor performance. E.g., 
querying a big table (about a billion records) with DISTINCT takes about 2 
hours. At the same time, this task with a Hive source takes a few minutes. Is 
this expected? Does it mean that Phoenix is absolutely not suitable for batch 
processing with Spark and I should duplicate the data to Hive and process it 
with Hive?



Re: Direct HBase vs. Phoenix query performance

2018-03-08 Thread James Taylor
Hi Marcell,
It'd be helpful to see the table DDL and the query too, along with an idea
of how many regions might be involved in the query. If a query is commonly
run, usually you'll design the row key around optimizing it. If you have
other, simpler queries that have determined your row key, then another
alternative is to add one or more secondary indexes. Another common
technique is to denormalize your data in ways that precompute the join to
avoid having to do it at run time.
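
For example (hypothetical table and column names), a covered secondary index
re-sorts the data on the columns a query filters and joins on, so it can be
answered without touching the data table's row key order:

  -- Index keyed on the filter/join columns; INCLUDE covers the selected
  -- column so the query can be served from the index table alone.
  CREATE INDEX idx_t_by_customer ON my_table (customer_id, event_date)
      INCLUDE (amount);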

With joins, make sure to order your tables from post-filtered largest (on
the LHS) to smallest (on the RHS). Also, if you're joining on the PK of both
tables, you should use the USE_SORT_MERGE_JOIN hint. Another common tuning
exercise is determining the best parallelization to use (i.e. the guidepost
width), or even disabling parallelization for more than an entire region's
worth of data.
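
As a hypothetical sketch of both points (invented table and column names),
with the larger post-filtered table on the left and the smaller on the right:

  -- The hint applies when joining on the PK of both tables; scan
  -- parallelization is governed by guideposts (phoenix.stats.guidepost.width).
  SELECT /*+ USE_SORT_MERGE_JOIN */ l.pk, l.val, r.other_val
  FROM large_table l
  JOIN small_table r ON r.pk = l.pk
  WHERE l.event_date > TO_DATE('2018-03-01');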

It'd also be interesting to see the raw HBase code for a query of this
complexity.

Thanks,
James

On Thu, Mar 8, 2018 at 1:03 PM, Marcell Ortutay 
wrote:

> Hi,
>
> I am using Phoenix at my company for a large query that is meant to be run
> in real time as part of our application. The query involves several
> aggregations, anti-joins, and an inner query. Here is the (anonymized)
> query plan: https://gist.github.com/ortutay23andme/1da620472cc469ed2d8a6fdd0cc7eb01
>
> The query performance on this is not great; it takes about 5sec to execute
> the query, and moreover it performs badly under load. If we run ~4qps of
> this query, Phoenix starts to time out and slow down a lot (queries take
> >30sec).
>
> For comparison, I wrote a simple Go script that runs a similar query
> talking directly to HBase. The performance on it is substantially better.
> It executes in ~1.5sec, and can handle loads of ~50-100qps on the same
> cluster.
>
> I'm wondering if anyone has ideas on what might be causing this difference
> in performance? Are there configs / optimizations we can do in Phoenix to
> bring the performance closer to direct HBase queries?
>
> I can provide context on the table sizes etc. if needed.
>
> Thanks,
> Marcell
>
>


Direct HBase vs. Phoenix query performance

2018-03-08 Thread Marcell Ortutay
Hi,

I am using Phoenix at my company for a large query that is meant to be run
in real time as part of our application. The query involves several
aggregations, anti-joins, and an inner query. Here is the (anonymized)
query plan:
https://gist.github.com/ortutay23andme/1da620472cc469ed2d8a6fdd0cc7eb01

The query performance on this is not great; it takes about 5sec to execute
the query, and moreover it performs badly under load. If we run ~4qps of
this query, Phoenix starts to time out and slow down a lot (queries take
>30sec).

For comparison, I wrote a simple Go script that runs a similar query
talking directly to HBase. The performance on it is substantially better.
It executes in ~1.5sec, and can handle loads of ~50-100qps on the same
cluster.

I'm wondering if anyone has ideas on what might be causing this difference
in performance? Are there configs / optimizations we can do in Phoenix to
bring the performance closer to direct HBase queries?

I can provide context on the table sizes etc. if needed.

Thanks,
Marcell


Re: Runtime DDL supported?

2018-03-08 Thread James Taylor
Thanks for digging that up, Miles. I've added a comment on the JIRA on how
to go about implementing it here:
https://issues.apache.org/jira/browse/PHOENIX-3547?focusedCommentId=16391739&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16391739

That would be a good first contribution if you're up for it.

Thanks,
James

On Wed, Mar 7, 2018 at 5:09 PM, Miles Spielberg  wrote:

> We found https://issues.apache.org/jira/browse/PHOENIX-3547, which seems
> to be precisely our problem. We would want at least the option to use a
> bigint rather than the int in the JIRA to accommodate massive growth. While
> we intend to have many tenants, we don't intend to use the Phoenix
> "tenant_id" to differentiate them, and instead manage them at our
> application layer, so separate counters per Phoenix tenant would not help
> in our situation.
>
> Miles Spielberg
> Staff Software Engineer
>
> O. 650.485.1102
> 900 Jefferson Ave
> Redwood City, CA 94063
>
> On Wed, Feb 28, 2018 at 10:27 PM, James Taylor 
> wrote:
>
>> Please file a JIRA as it’d be feasible to change this limitation. The
>> easiest way would be to have a separate counter for each tenant. Another
>> way to reduce the number of indexes on tenant specific views would be to
>> factor out common columns to global views and create indexes there.
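>>
>> As a hypothetical sketch (invented names), the common columns live in one
>> global view with a single index, instead of one index per tenant-specific view:
>>
>>   CREATE VIEW common_v AS SELECT * FROM base_t WHERE kind = 'common';
>>   CREATE INDEX idx_common_col1 ON common_v (col1);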
>>
>> On Tue, Feb 27, 2018 at 3:40 PM Miles Spielberg  wrote:
>>
>>> As we discussed, indexes across views are stored in a single HBase table
>>> associated with the original table (_IDX_). That's grand for
>>> limiting the number of HBase tables created, but I just realized that the
>>> actual index data within is differentiated by the 16-bit "viewIndexId",
>>> which limits us to 64K indexes across all views for a given table. That's
>>> concerning for our use case, especially if it's a cumulative autoincrement
>>> across all CREATE INDEX and DROP INDEX operations over the lifetime of the
>>> base table.
>>>
>>> Is there any workaround for this? A quick grep across the source
>>> indicates that the length of viewIndexId is currently hard-coded.
>>>
>>> At least, this limitation should probably be added to the list of
>>> caveats and warnings at https://phoenix.apache.org/views.html.
>>>
>>> Miles Spielberg
>>> Staff Software Engineer
>>>
>>> O. 650.485.1102
>>> 900 Jefferson Ave
>>> Redwood City, CA 94063
>>>
>>> On Thu, Feb 22, 2018 at 7:42 AM, James Taylor 
>>> wrote:
>>>
 Another option would be to use dynamic columns[1] when querying across
 views. You’d have to disable column encoding [2] in this case.

 [1] http://phoenix.apache.org/dynamic_columns.html
 [2] http://phoenix.apache.org/columnencoding.html
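
 As a hypothetical sketch (invented names), the otherwise-undeclared columns
 are given types inline at query time rather than in per-view DDL:

   -- f1 and f2 are dynamic columns, declared only for this query.
   SELECT pk, f1, f2
   FROM base_t (f1 VARCHAR, f2 BIGINT)
   WHERE pk = ?;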

 On Wed, Feb 21, 2018 at 11:57 PM Miles Spielberg  wrote:

> I believe each query in a UNION needs to have the same result tuple
> format, which would work in this toy example, but in the general case each
> view would have a different schema. We could make the result tuples 
> conform
> with each other by selecting NULL literals for every column except those 
> in
> a view. It would get quite verbose though. Assuming f1,f2,f3 all have
> incompatible types, were you suggesting something like this?
>
> Select f1, null, null from v1 where PK=?
> Union all
> Select null, f2, null from v2 where PK=?
> Union all
> Select null, null, f3 from v3 where PK=?
>
> We might just run separate parallel queries against each view and
> merge the results client side. I would guess this should perform well 
> since
> the block cache can be leveraged for queries after the first.
>
> We could also use the HBase API to run a point row get. We'd have to
> reimplement decoding for Phoenix's column values, which is not ideal but
> quite doable.
>
> Sent from my iPhone
>
> On Feb 21, 2018, at 9:09 PM, James Taylor 
> wrote:
>
> Have you tried a UNION ALL query on (f1, f2, f3) instead? It seems
> you’re on a good track with multiple views over a single (or handful) of
> physical table(s).
>
> On Wed, Feb 21, 2018 at 6:45 PM Miles Spielberg  wrote:
>
>> I've done some experimentation with views, with a schema resembling
>> this:
>>
>> create table t1(
>>>
>>> pk bigint not null 

Re: [DISCUSS] Include python-phoenixdb into Phoenix

2018-03-08 Thread Ankit Singhal
Thanks, Lukáš and Josh. I'm now putting up the formal thread for voting.

On Fri, Mar 2, 2018 at 2:50 AM, Josh Elser  wrote:

> He appears! Thanks for weighing in. Comments inline.
>
> On Thu, Mar 1, 2018 at 3:55 PM, Lukáš Lalinský  wrote:
> > I'm fine with including the code. The license already matches Phoenix, is
> > there anything else that you would need regarding that?
>
> There's an official IP clearance process we need to follow, but your
> comment here should be sufficient going off of memory (the mailing
> list memorializing it is good).
>
> > Releasing Python code involves building the source distribution package
> > and uploading it to PyPI. If you end up using the python-phoenixdb project,
> > I can give you full access to https://pypi.python.org/pypi/phoenixdb.
> >
> > Wheels are mostly useful when you have non-Python code in the package,
> > which is not the case here.
>
> Ok, thanks. I'm obviously not an expert here.
>
> > This is the (simple) process I use:
> >
> > https://github.com/lalinsky/python-phoenixdb/blob/master/RELEASING.rst
> >
> > You could just publish the tar.gz on Apache servers, but uploading to
> > PyPI is preferable.
>
> A source distribution is easiest for us at the moment (and fits ASF
> policy). I'd need to investigate publishing to PyPI. It might just
> be something we can do when the release is approved (in spirit, it's
> no different than Maven Central...)
>
> > One thing that needs to be taken care of is CI. Unlike Phoenix itself,
> > the test suite is unable to run an embedded version of Phoenix/HBase.
>
> Got it. We can try to figure out some automation here to make that
> happen. IIRC you have something already with Docker. I'm less worried
> about this part :)
>
> > Lukas
> >
> >
> >
> > On Thu, Mar 1, 2018 at 8:38 PM, Josh Elser  wrote:
> >>
> >> Obviously, I'm in favor of this. Thanks for bringing it up again, Ankit.
> >>
> >> I remember Lukas giving his "blessing" to the project being pulled up
> >> into Apache Phoenix in the past. It seems like there are multiple
> >> authors on the pyPhoenix fork -- we'll need to make sure that we have
> >> all authors on board with this decision.
> >>
> >> It seems like making a Python "Wheel" would be the lowest barrier to
> >> entry for getting something usable to people from the ASF. I don't
> >> know what, if any, infrastructure exists to distribute Python modules.
> >> https://packaging.python.org/glossary/#term-built-distribution
> >>
> >> I feel like a sub-directory in the phoenix repository would be the
> >> easiest to make this work (for now). A separate repo would also work.
> >> I know phoenixdb has some tests -- I assume the same tests (plus more)
> >> exist for pyPhoenix but I have not validated this.
> >>
> >> On Thu, Mar 1, 2018 at 10:34 AM, Ankit Singhal 
> wrote:
> >> > Re-opening the thread, as I can extend help towards the IP clearance
> >> > process[1] to include python-phoenixdb[2] if Lukáš Lalinský and the
> >> > community agree to it.
> >> >
> >> > I found both repositories[2][3] to be on par, but we should first let
> >> > Lukáš Lalinský (the original owner) confirm that he agrees to
> >> > contribute his work[2].
> >> >
> >> > @dev, we can discuss any concerns/suggestions before I start a
> >> > formal vote for inclusion.
> >> >
> >> > Details on packaging and progress will be tracked on JIRA[4].
> >> >
> >> > [1] http://incubator.apache.org/ip-clearance/ip-clearance-template.html
> >> > [2] https://github.com/lalinsky/python-phoenixdb
> >> > [3] https://github.com/Pirionfr/pyPhoenix
> >> > [4] https://issues.apache.org/jira/browse/PHOENIX-4636
> >> >
> >> > Regards,
> >> > Ankit Singhal
> >> >
> >> >
> >> > On Tue, Apr 11, 2017 at 1:30 AM, James Taylor  >
> >> > wrote:
> >> >>
> >> >> Thanks, Dimitri. A pull request would be great. Looks like the
> >> >> outstanding items are:
> >> >> - Fixing licensing
> >> >> - Deciding on package structure. FWIW, the phoenix-spark module has
> >> >>   Scala code, so there is a little bit of precedent for non-Java code.
> >> >> - Tweaking our release process/scripts to handle Python appropriately.
> >> >>   Is there an ASF way of releasing Python libraries?
> >> >>
> >> >> I haven't looked yet, but is there good test coverage for all this?
> >> >> What about backward compatibility - any tests in place to validate
> >> >> that nothing gets broken from release to release?
> >> >>
> >> >> Thanks,
> >> >> James
> >> >>
> >> >> On Mon, Apr 10, 2017 at 8:08 AM, Dimitri  wrote:
> >> >>>
> >> >>> Hi,
> >> >>>
> >> >>> DB API 2.0 and SQLAlchemy are in pyPhoenix.
> >> >>>
> >> >>> I can fix the licensing.
> >> >>>
> >> >>> Tell us what you decide.
> >> >>>
> >> >>>
> >> >>> Dimitri.
> >> >>>
> >> >>>
> >> >>> 2017-03-27 0:48 GMT+02:00 Josh Elser