The conclusion of HBaseConAsia 2019 will be available later. For now, here are my notes from the round table meeting after the conference. A bit long...
First we talked about splittable meta. At Xiaomi we have a cluster with nearly 200k regions, where meta is easily overloaded and cannot recover. Anoop said we could try read replicas, but agreed that read replicas cannot solve all the problems; in the end we still need to split meta.

Then we talked about SQL. Allan Yang said that most of their customers want a secondary index, even more than SQL. For a globally strongly consistent secondary index, we agreed that the only safe way is to use transactions; other 'local' solutions run into trouble when regions are split or merged. Xiaomi has a global secondary index solution; maybe we should open source it?

Then we came back to SQL. We talked about Phoenix; its problem is well known: not stable enough. We even had a user on the mailing list say he/she would never use Phoenix again. Alibaba and Huawei both have in-house SQL solutions, and Huawei also presented theirs at HBaseConAsia 2019; they will try to open source it. We could also introduce a SQL proxy in the hbase-connectors repo: no push-down support at first, with all logic done on the proxy side, and optimize later.

Some people said that the current feature set for 3.0.0 is not good enough to attract more users, especially small companies: only internal improvements, no user-visible features. SQL and secondary index are very important.

Yu Li talked about CCSMap; we still want it to be released in 3.0.0. One problem is its relationship with in-memory compaction: theoretically they should not conflict, but in practice they do. The Xiaomi folks mentioned that in-memory compaction still has bugs, even in basic mode: the MVCC writePoint may get stuck and hang the region server. Jieshan Bi asked why not just use CCSMap to replace the ConcurrentSkipListMap; Yu Li said the point is better memory usage, since the index and the data can be placed together.

Then we started to talk about HBase on cloud. For now it is a bit difficult to deploy HBase on cloud, as we need to deploy ZooKeeper and HDFS first.
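To make the "transactions are the only safe way" point about global secondary indexes concrete, here is a purely illustrative Python sketch (none of these names are real HBase APIs): if the data row and its index row are written as two independent mutations, a crash in between leaves the index inconsistent, while an atomic apply keeps both or neither.

```python
# Toy model of a table plus a global secondary index on a "city" column.
# All names are invented for illustration; this is not the HBase client API.

data = {}   # row key -> {"city": value}
index = {}  # "city:<value>:<row>" -> row key

def put_non_atomic(row, city, crash_between=False):
    """Two independent writes: unsafe if we crash in the middle."""
    data[row] = {"city": city}
    if crash_between:
        return  # simulated crash: the index update is lost
    index[f"city:{city}:{row}"] = row

def put_transactional(row, city, crash=False):
    """Both mutations applied atomically: either both land or neither."""
    if crash:
        return  # crash before commit: nothing becomes visible
    data[row] = {"city": city}
    index[f"city:{city}:{row}"] = row

put_non_atomic("u1", "beijing", crash_between=True)
# The data row exists but its index entry is missing: a dangling read path.
missing = [r for r in data if f"city:{data[r]['city']}:{r}" not in index]
print(missing)  # ['u1']
```

The same dangling-entry problem is why 'local' index schemes break across region splits and merges: the two writes stop being co-located, so atomicity can no longer be had for free.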
Then we talked about HBOSS and the WAL abstraction (HBASE-20952). Wellington said that HBOSS basically works; it uses s3a and ZooKeeper to help simulate the semantics of HDFS. We could introduce our own 'FileSystem' interface, not the Hadoop one, and remove the dependency on atomic renaming so that 'FileSystem' implementations become easier to write. On the WAL abstraction, Wellington said some people are still working on it, but for now they focus on patching Ratis rather than abstracting the WAL system first. We agreed that a better way is to abstract the WAL system at a level higher than FileSystem, so that we could even use Kafka to store the WAL.

Then we talked about the FPGA usage for compaction at Alibaba. Jieshan Bi said that at Huawei they offload compaction to the storage layer. As an open source solution, maybe we could offload compaction to Spark and then use something like bulk load to let the region server pick up the new HFiles. The problem with doing compaction inside the region server is CPU cost and GC pressure: we need to scan every cell, so the CPU cost is high. Yu Li talked about the page-based compaction in the Flink state store; maybe it could also benefit HBase.

Then it was time for MOB. Huawei said MOB cannot solve their problem. The data still has to be read through RPC, and it also puts pressure on the memstore, since the memstore is still fairly small compared to MOB cells. We will also flush a lot even though there are only a small number of MOB cells in the memstore, so we still need to compact a lot. So maybe the suitable scenario for MOB is when most of your data is small and only a small fraction is a bit larger; there MOB can improve performance, and users do not need another system to store the larger values. Huawei said they implement the logic on the client side: if the data is larger than a threshold, the client goes to another storage system rather than HBase.
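The core of the "offload compaction to Spark, then bulk load the result" idea is that a major compaction is essentially a k-way merge of sorted files where only the newest version of each key survives, and that merge does not have to run inside the region server. A minimal sketch, with HFiles modeled as plain sorted lists (the real format and the Spark plumbing are of course far more involved):

```python
import heapq

# Each "HFile" is modeled as a list of (key, seq, value) tuples, sorted by
# key. Purely illustrative; real HFiles carry much more structure.

def compact(hfiles):
    """K-way merge keeping only the newest version (highest seq) per key."""
    # Sort each key group newest-first so the first occurrence wins.
    merged = heapq.merge(*hfiles, key=lambda kv: (kv[0], -kv[1]))
    out, last_key = [], None
    for key, seq, value in merged:
        if key != last_key:          # first (newest) version of this key
            out.append((key, seq, value))
            last_key = key
    return out

f1 = [("a", 1, "old-a"), ("c", 2, "c2")]
f2 = [("a", 3, "new-a"), ("b", 1, "b1")]
print(compact([f1, f2]))
# [('a', 3, 'new-a'), ('b', 1, 'b1'), ('c', 2, 'c2')]
```

Because the merge only needs sorted inputs and one pass, it maps naturally onto an external batch job, leaving the region server to do a cheap atomic swap of the resulting file, much like an ordinary bulk load.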
Alibaba said that if we want to support large blobs, we need to introduce a streaming API. The Kuaishou folks said they do not use MOB; they just store the data on HDFS and the index in HBase, the typical solution.

Then we talked about which company will host next year's HBaseConAsia. It will be Tencent or Huawei, or both, probably in Shenzhen. And since there is no HBaseCon in America any more (it is called 'NoSQL Day' now), maybe next year we could just call the conference HBaseCon.

Then we came back to SQL again. Alibaba said that most of their customers are migrating from old businesses, so they need 'full' SQL support; that is why they need Phoenix. And lots of small companies want to run OLAP queries directly on the database; they do not want to use ETL. So maybe in the SQL proxy (planned above) we should delegate OLAP queries to Spark SQL or something else, rather than just rejecting them. A Phoenix committer said that the Phoenix community is currently re-evaluating its relationship with HBase, because lots of things broke when upgrading to HBase 2.1.x. They plan to break the tie between Phoenix and HBase, which means Phoenix plans to also run on other storage systems. Note: this was not said at the meeting, but personally I think this may be good news; since Phoenix is no longer HBase-only, we have more reasons to introduce our own SQL layer.

Then we talked about Kudu. It is faster than HBase on scans. If we want to increase scan performance we need a larger block size, but that leads to slower random reads, so there is a trade-off. The Kuaishou folks asked whether HBase could store HFiles in a columnar format. The answer is no; as said above, it would slow down random reads. But we could learn from what Google has done in Bigtable: write a copy of the data in Parquet format to another FileSystem, so users can just scan the Parquet files for better analytical performance.
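The client-side pattern Huawei and Kuaishou both described (keep big values out of HBase, keep only a pointer there) can be sketched as a size-threshold router. Everything below is invented for illustration; the dicts stand in for an HBase table and a blob store such as HDFS:

```python
BLOB_THRESHOLD = 1024  # bytes; the actual cut-off would be tuned per workload

hbase = {}       # stand-in for an HBase table: row -> (kind, payload)
blob_store = {}  # stand-in for HDFS / object storage: path -> bytes

def put(row, value):
    """Small values go inline into 'HBase'; large values are written to the
    blob store and only a reference is kept in the table."""
    if len(value) <= BLOB_THRESHOLD:
        hbase[row] = ("inline", value)
    else:
        path = f"/blobs/{row}"
        blob_store[path] = value
        hbase[row] = ("ref", path)

def get(row):
    """Reads resolve references transparently, so callers see one API."""
    kind, payload = hbase[row]
    return payload if kind == "inline" else blob_store[payload]

put("small", b"x" * 10)
put("large", b"y" * 4096)
print(get("large") == b"y" * 4096)  # True: resolved through the pointer
```

This keeps huge cells away from the memstore and compactions entirely, which is exactly the pressure MOB was said not to relieve; the cost is that the client must now manage consistency between two systems.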
And if they want the newest data, they can ask HBase for it, and that portion should be small. This is more like a complete solution in which HBase is only one piece, but at least we could introduce some APIs in HBase so users can build the solution in their own environment. And if you do not care about the newest data, you can also use replication to replicate the data to ES or other systems and search there.

Then Didi talked about their problems using HBase. They use Kylin, so they also have lots of regions, and meta is a problem for them too. The pressure on ZooKeeper is also a problem, as the replication queues are stored on zk. Since 2.1, ZooKeeper is only used as an external storage in the replication implementation, so it is possible to switch to other storages, such as etcd. But it is still a bit difficult to store the data in a system table: right now we need to start the replication system before the WAL system, but if we want to store the replication data in an HBase table, the WAL system obviously must be started before the replication system, as we need the region of that system table online first, and opening it writes an open marker to the WAL. We need to find a way to break this deadlock.

They also mentioned that the rsgroup feature creates a big znode on ZooKeeper, as they have lots of tables. We have HBASE-22514, which aims to solve this problem. And last, they shared their experience upgrading from 0.98 to 1.4.x: the versions should be compatible, but in practice there were problems. They agreed to post a blog about this.

And the Flipkart folks said they will open source their test suite, which focuses on consistency (Jepsen?). This is good news; hopefully we will get another useful tool besides ITBLL.

That's all. Thanks for reading.
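The read side of the Parquet-copy idea discussed above boils down to merging a possibly stale bulk snapshot with the small delta of newest rows served by HBase, with the newest rows winning. A minimal sketch with invented names:

```python
# Stale analytical copy (think: Parquet files scanned by Spark).
snapshot = {"u1": 10, "u2": 20, "u3": 30}
# Small delta of the newest rows, fetched from HBase.
recent = {"u2": 25, "u4": 40}

def merged_view(snapshot, recent):
    """Bulk data from the snapshot, overridden by the newest rows."""
    view = dict(snapshot)
    view.update(recent)   # recent writes shadow the stale copy
    return view

print(merged_view(snapshot, recent))
# {'u1': 10, 'u2': 25, 'u3': 30, 'u4': 40}
```

The point of the meeting discussion is that HBase itself would only need to expose APIs for producing the copy and for bounding the "recent" delta; the merge would live in the user's analytical layer.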