Re: Salting based on partial rowkeys

2018-11-05 Thread Gerald Sangudi
Hi folks,

I would like to continue the discussion below from a few weeks ago. I would
like to address the feedback from Thomas, Jaanai, and Sergey.

In thinking about this some more, it amounts to introducing a form of hash
partitioning in Phoenix + HBase. For this to work in the general case,
region splits and merges would need to be disabled, and the regions
pre-defined. This is already supported by DisableRegionSplitPolicy.
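
To make "regions pre-defined" concrete, here is a minimal sketch. It assumes each rowkey begins with a one-byte bucket number and that the table is pre-split on that byte; the function names are hypothetical, and this is not taken from Phoenix or HBase code.

```python
import bisect


def bucket_split_points(num_buckets: int) -> list[bytes]:
    """Split keys that give each one-byte bucket prefix its own region."""
    if not 1 <= num_buckets <= 256:
        raise ValueError("a one-byte prefix supports 1 to 256 buckets")
    # Region i covers keys whose first byte is i, so boundaries are 1..N-1.
    return [bytes([b]) for b in range(1, num_buckets)]


def region_for_key(row_key: bytes, split_points: list[bytes]) -> int:
    """Index of the region holding row_key, assuming no splits or merges."""
    return bisect.bisect_right(split_points, row_key[:1])
```

With splits and merges disabled, the bucket-to-region mapping in this sketch stays fixed, which is what lets a query target specific regions.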

I understand that HBase is range-partitioned. What we are proposing is to
allow specialized use cases for users to manage their own partitioning. All
other users would be unaffected.

Jaanai also mentioned that salting is meant to address write hotspots. That
is true. We are proposing an additional use of salting (or if you prefer,
another feature altogether) for specialized use cases. Again, all other
users would be unaffected.

We have a lot of data in HBase, and some of our queries would benefit from
some sort of hash partitioning. That is the crux of our proposal.

Some specific responses:

@Josh -- exactly.

@Thomas -- if we can partition the data exactly how we want, we can make
sure that certain queries do not go across regions. We can then scan only
the matching regions and perform the full aggregation within each
matching region, without having to do a merge.
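
A rough sketch of the idea, under stated assumptions: the salt is computed from a leading subset of the rowkey columns (following Josh's summary in the quoted thread), all key columns are BIGINT, and CRC32 stands in for whatever hash Phoenix actually uses. The names `salt_bucket` and `buckets_to_scan` are hypothetical.

```python
import struct
import zlib

NUM_BUCKETS = 60  # matches SALT_BUCKETS=60 in the quoted example


def salt_bucket(key_columns, num_salt_columns, num_buckets=NUM_BUCKETS):
    """Bucket computed from only the first num_salt_columns key columns.

    CRC32 is a stand-in hash chosen for illustration, not Phoenix's own.
    """
    salted = key_columns[:num_salt_columns]
    prefix = struct.pack(f">{len(salted)}q", *salted)
    return zlib.crc32(prefix) % num_buckets


def buckets_to_scan(id_1_values, num_buckets=NUM_BUCKETS):
    """Buckets a query on id_1 must visit when the salt depends on id_1 only."""
    return {salt_bucket((v,), 1, num_buckets) for v in id_1_values}


# Rows sharing id_1 = 2 but differing in id_2 and other_key.
rows = [(2, id_2, other) for id_2 in range(50) for other in range(5)]

partial = {salt_bucket(r, 1) for r in rows}  # salt on id_1 only
full = {salt_bucket(r, 3) for r in rows}     # salt on the full rowkey
```

In this sketch `partial` contains a single bucket, so every row for `id_1 = 2` can be aggregated in one place, while `full` spreads the same rows over many buckets. A query with `id_1 in (2,3)` would visit at most two buckets rather than all 60.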

@Jaanai, @Sergey -- I hope I explained about hotspotting above. We would
also be fine calling it something other than salting. Maybe that's better
to avoid confusion.

@Lars -- you posted questions in PHOENIX-4757. I will respond there.

Thanks everyone for all the feedback on this. Our goal is to discuss all
the concerns, and then finally get a yea or nay consensus from the
committers.

Gerald

On Sun, Sep 16, 2018 at 9:52 PM la...@apache.org  wrote:

>  I added some comments on the PHOENIX-4757
>
> On Thursday, September 13, 2018, 6:42:12 PM PDT, Josh Elser <
> els...@apache.org> wrote:
>
>  Ahh, I get you now.
>
> For a composite primary key made up of columns 1 through N, you want
> similar controls to compute the value of the salt based on a sequence of
> the columns 1 through M where M <= N (instead of always on all columns).
>
> For large numbers of salt buckets and a scan over a facet, you prune
> your search space considerably. Makes sense to me!
>
> On 9/13/18 6:37 PM, Gerald Sangudi wrote:
> > In case the text formatting is lost below, I also added it as a comment in
> > the JIRA ticket:
> >
> > https://issues.apache.org/jira/browse/PHOENIX-4757
> >
> >
> > On Thu, Sep 13, 2018 at 3:24 PM, Gerald Sangudi 
> > wrote:
> >
> >> Sorry I missed Josh's reply; I've subscribed to the dev list now.
> >>
> >> Below is a copy-and-paste from our internal document. Thanks in advance
> >> for your review and additional feedback on this.
> >>
> >> Gerald
> >>
> >> Background
> >>
> >> We make extensive use of multi-column rowkeys and salting in our
> >> different apache phoenix deployments. We frequently perform group-by
> >> aggregations on these data along a specific dimension that would benefit
> >> from predictably partitioning the data along that dimension.
> >>
> >> Proposal:
> >>
> >> We propose to add table metadata to allow schema designers to constrain
> >> salting to a subset of the rowkey, rather than the full rowkey as it is
> >> today. This will introduce a mechanism to partition data on a per-table
> >> basis along a single dimension without application changes or much change
> >> to the phoenix runtime logic. We expect this will result in substantially
> >> faster group-by’s along the salted dimension and negligible penalties
> >> elsewhere. This feature has also been proposed in PHOENIX-4757 where it
> >> was pointed out that partitioning and sorting data along different
> >> dimensions is a common pattern in other datastores as well.
> >>
> >> Theoretically, it could cause hotspotting when querying along the salted
> >> dimension without the leading rowkey - that would be an anti-pattern.
> >>
> >> Usage Example
> >>
> >> Current:
> >>
> >> Schema:
> >>
> >> CREATE TABLE relationship (
> >>     id_1 BIGINT NOT NULL,
> >>     id_2 BIGINT NOT NULL,
> >>     other_key BIGINT NOT NULL,
> >>     val SMALLINT,
> >>     CONSTRAINT pk PRIMARY KEY (id_1, id_2, other_key))
> >> SALT_BUCKETS=60;
> >>
> >> Query:
> >>
> >> Select id_2, sum(val)
> >> From relationship
> >> Where id_1 in (2,3)
> >> Group by id_2
> >>
> >> Explain:
> >>
> >> 0: jdbc:phoenix:> EXPLAIN Select id_2, sum(val) From relationship Where id_1 in (2,3) Group by id_2;
> >> | PLAN | EST_BY
> >> | CLIENT 60-CHUNK PARALLEL 60-WAY SKIP SCAN ON 120 KEYS OVER RELATIONSHIP [0,2] - [59,3] | null |
> >> | SERVER AGGREGATE INTO DISTINCT 

[jira] [Resolved] (PHOENIX-5002) Don't load or disable Indexer coprocessor for non-indexed tables

2018-11-05 Thread Vincent Poon (JIRA)


 [ https://issues.apache.org/jira/browse/PHOENIX-5002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vincent Poon resolved PHOENIX-5002.
---
Resolution: Won't Fix

> Don't load or disable Indexer coprocessor for non-indexed tables
> 
>
>              Key: PHOENIX-5002
>              URL: https://issues.apache.org/jira/browse/PHOENIX-5002
>          Project: Phoenix
>       Issue Type: Improvement
> Affects Versions: 4.14.1
>         Reporter: Vincent Poon
>         Priority: Major
>
> It seems the Indexer coprocessor is loaded for tables even if they have no 
> indexes.
> There is some overhead such as write locking within Phoenix - we should 
> investigate whether we can avoid loading the Indexer coproc or disable it.


