Also, if you're updating that many values and not doing it in bulk (map/reduce 
straight to HFiles), you'll want to give the region servers as much heap as 
possible, set the store-file and blocking-store-file thresholds astronomically 
high, and make the memstore size for the table (how much is buffered in memory 
before HBase flushes to disk) as large as possible. This is to avoid compactions 
slowing you down and causing timeouts. You can also break the UPSERT SELECT into 
smaller chunks and manually compact in between to mitigate. The same strategy 
applies to other large updates that go through the regular HBase write path, 
such as building or rebuilding indexes.
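A minimal sketch of the per-table overrides and manual compaction described above, via the HBase shell. The table name and every value here are illustrative assumptions — size them for your cluster and verify the property names against your HBase version:

```shell
# Raise flush/blocking thresholds for the target table only, so normal
# tables keep their defaults. MY_TABLE and all values are assumptions.
hbase shell <<'EOF'
alter 'MY_TABLE', CONFIGURATION => {
  'hbase.hregion.memstore.flush.size' => '1073741824',
  'hbase.hstore.blockingStoreFiles'   => '200',
  'hbase.hstore.compactionThreshold'  => '50'
}
EOF

# Between UPSERT SELECT chunks, trigger a compaction manually instead of
# letting it kick in mid-load:
hbase shell <<'EOF'
major_compact 'MY_TABLE'
EOF
```

Remember to reset the overrides (or drop the CONFIGURATION block) once the load finishes, or the table will keep its loosened compaction behavior.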

> On Jun 18, 2017, at 11:41 AM, Jonathan Leech <jonat...@gmail.com> wrote:
> 
> Another thing to consider, but only if your 1:1 mapping keeps the primary 
> keys the same, is to snapshot the table and restore it under the new name, 
> with a schema that is the union of the old and new schemas. I would put the 
> new columns in a new column family. Then use UPSERT SELECT, MapReduce, or 
> Spark to transform the data, and finally drop the columns from the old 
> schema. This strategy could cut the amount of work roughly in half and 
> avoid sending data over the network.
> 
>> On Jun 17, 2017, at 5:06 PM, Randy Hu <ruw...@gmail.com> wrote:
>> 
>> If I count the number of trailing zeros correctly, that's 15 billion records;
>> any solution based on the HBase PUT interface (UPSERT SELECT) would probably
>> take far more time than you expect. It would be better to use the
>> map/reduce-based bulk importer provided by Phoenix:
>> 
>> https://phoenix.apache.org/bulk_dataload.html
>> 
>> The importer leverages HBase bulk-load mode to convert all the data into
>> HBase storage files (HFiles) and then hands them over to HBase in the final
>> stage, thus avoiding the network and disk random-access costs of going
>> through the HBase region servers.
>> 
>> Randy
>> 
>> On Fri, Jun 16, 2017 at 9:51 AM, Pedro Boado [via Apache Phoenix User List]
>> <ml+s1124778n3675...@n5.nabble.com> wrote:
>> 
>>> Hi guys,
>>> 
>>> We are trying to populate a Phoenix table based on a 1:1 projection of
>>> another table with around 15,000,000,000 records via an UPSERT SELECT in
>>> the Phoenix client. We've noticed very poor performance (I suspect the
>>> client is using a single-threaded approach) and lots of issues with client
>>> timeouts.
>>> 
>>> Is there a better way of approaching this problem?
>>> 
>>> Cheers!
>>> Pedro
>>> 
>>> 
>>> ------------------------------
>>> If you reply to this email, your message will be added to the discussion
>>> below:
>>> http://apache-phoenix-user-list.1124778.n5.nabble.com/
>>> Best-strategy-for-UPSERT-SELECT-in-large-table-tp3675.html
