Re: Create View of Existing HBase table

2017-06-19 Thread Randy Hu
You are very welcome. Glad it helped.

People using Phoenix on top of existing HBase tables for a SQL interface
are very likely to run into the same issue. I wonder whether the Phoenix
team could change the strategy to use the maximum timestamp for selects on
views. This would not cause problems with UPSERT SELECT, since views are
not updatable from Phoenix.
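
For reference, the "CurrentSCN" workaround discussed further down the thread
looks roughly like this from JDBC (just a sketch: the connection URL
jdbc:phoenix:zk-host:2181 is a placeholder, and the exact upper bound your
Phoenix version accepts for CurrentSCN may vary):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Properties;

public class ViewScnExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // "CurrentSCN" pins the timestamp used for reads; a very large value
        // makes cells written with "future" timestamps visible through the view.
        props.setProperty("CurrentSCN", Long.toString(Long.MAX_VALUE - 1));

        try (Connection conn = DriverManager.getConnection(
                     "jdbc:phoenix:zk-host:2181", props);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM \"bro\"")) {
            while (rs.next()) {
                System.out.println("rows visible: " + rs.getLong(1));
            }
        }
    }
}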

Randy

On Mon, Jun 19, 2017 at 10:52 AM, M. Aaron Bossert [via Apache Phoenix User
List]  wrote:

> Thanks for that pointer!  When I got to work this morning, I redid my
> ingest with all timestamps in milliseconds and the view populated nicely.
> Thanks again!
>
> Sent from my iPhone
>
> > On Jun 18, 2017, at 01:12, Randy Hu <[hidden email]
> > wrote:
> >
> > That's likely the problem. The value in HBase/Java is in milliseconds (
> > System.currentTimeMillis()), as explained in the link:
> >
> > https://currentmillis.com/tutorials/system-currentTimeMillis.html#unix-timestamp
> >
> > If the packet timestamp is a long value in microseconds, then it is 1,000
> > times the value expected by HBase, which would look like a future timestamp
> > to HBase.
> >
> > Randy
> >
> > On Sat, Jun 17, 2017 at 8:40 PM, M. Aaron Bossert [via Apache Phoenix
> User
> > List] <[hidden email]
> > wrote:
> >
> >> One potential difference might be resolution.  The network packets have
> >> precision down to microsecond...could that cause an issue?
> >>
> >> Sent from my iPhone
> >>
> >>> On Jun 17, 2017, at 20:32, Randy Hu <[hidden email]
> >> > wrote:
> >>>
> >>> Yes, the problem I had experienced was future timestamps in cells.
> >>> Setting "CurrentSCN" to the maximum value forces the scan/get to use
> >>> that value as the timestamp, so all cells are returned.
> >>>
> >>> Not sure if the network packet timestamp you copied is the same as a
> >>> Unix timestamp:
> >>>
> >>> https://currentmillis.com/tutorials/system-currentTimeMillis.html#unix-timestamp
> >>>
> >>> For testing purposes, maybe you can insert some cells without an explicit
> >>> timestamp to confirm whether the timestamp is the issue.
> >>>
> >>> Randy
> >>>
> >>> On Sat, Jun 17, 2017 at 6:21 PM, M. Aaron Bossert [via Apache Phoenix
> >> User
> >>> List] <[hidden email]
> >> > wrote:
> >>>
>  I looked through the discussion and it seems like their issue was
>  timestamps in the future?  I do use explicit timestamps for the data. I am
>  storing network traffic and am using the packet's timestamp as the HBase
>  timestamp, which is often some time in the past when ingesting bulk
>  collection files for forensic-type analysis.
> 
>  How would setting the SCN value help?  Would I set it in the past to match
>  the oldest or newest packets?
> 
>  Sent from my iPhone
> 
> > On Jun 17, 2017, at 18:01, Randy Hu <[hidden email]
>  > wrote:
> >
> > Not sure if you had a custom timestamp set when the values were persisted.
> > Here is the timestamp issue I discovered with views. The solution
> > (workaround) is in the last post:
> >
> > http://apache-phoenix-user-list.1124778.n5.nabble.com/View-timestamp-on-existing-table-potential-defect-td3475.html
> >
> > Randy
> >
> > On Fri, Jun 16, 2017 at 2:07 PM, M. Aaron Bossert [via Apache
> Phoenix
>  User
> > List] <[hidden email]
>  > wrote:
> >
> >> *I have an existing HBase table with the following
> characteristics:*
> >>
> >> hbase(main):032:0> describe 'bro'
> >>
> >> Table bro is ENABLED
> >>
> >>
> >>
> >> bro, {TABLE_ATTRIBUTES => {coprocessor$1 =>
> >> '|org.apache.phoenix.coprocessor.ScanRegionObserver|805306366|',
> >> coprocessor$2 =>
> >> '|org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver|805306366|',
> >> coprocessor$3 =>
> >> '|org.apache.phoenix.coprocessor.GroupedAggregateRegionObserver|805306366|',
> >> coprocessor$4 =>
> >> '|org.apache.phoenix.coprocessor.ServerCachingEndpointImpl|805306366|',
> >> METADATA => {'COMPACTION_ENABLED' => 'true'}
> >> }
> >>
> >>
> >>
> >> COLUMN FAMILIES DESCRIPTION
> >>
> >>
> >>
> >> {NAME => 'bm', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false',
> >> KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER',
> >> COMPRESSION => 'LZ4', MIN_VERSIONS => '0', BLOCKCACHE => 'true',
> >> BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
> >>
> >>
> >>
> >> {NAME => 'uid', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY 

Re: Best strategy for UPSERT SELECT in large table

2017-06-19 Thread Jonathan Leech
I think you could add additional pk columns, but not change or remove existing 
ones.

> On Jun 19, 2017, at 11:58 AM, Michael Young  wrote:
> 
> Regarding your idea to use the snapshot/restore method (with a new name):
> is it possible to add a PK column with that approach?  For example, if I
> wanted to change a PK column type from VARCHAR to FLOAT, is this possible?
> 
> 
> 
>> On Sun, Jun 18, 2017 at 10:50 AM, Jonathan Leech  wrote:
>> Also, if you're updating that many values and not doing it in bulk / 
>> mapreduce / straight to hfiles, you'll want to give the region servers as 
>> much heap as possible, set store files and blocking store files 
>> astronomically high, and set the memory size for the table before HBase 
>> flushes to disk as large as possible. This is to avoid compactions slowing 
>> you down and causing timeouts. You can also break up the upsert selects into 
>> smaller chunks and manually compact in between to mitigate. The above 
>> strategy also applies for other large updates in the regular HBase write 
>> path, such as building or rebuilding indexes.
>> 
>> > On Jun 18, 2017, at 11:41 AM, Jonathan Leech  wrote:
>> >
>> > Another thing to consider, but only if your 1:1 mapping keeps the primary 
>> > keys the same, is to snapshot the table and restore it with the new name, 
>> > and a schema that is the union of the old and new schemas. I would put the 
>> > new columns in a new column family. Then use upsert select, mapreduce, or 
>> > Spark to transform the data, then drop the columns from the old schema. 
>> > This strategy could cut the amount of work to be done by half and not send 
>> > data over the network.
>> >
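
The snapshot/restore step mentioned above can be scripted with the HBase Admin
API; here is a rough sketch (the snapshot and table names are placeholders, and
the Phoenix DDL for the new table with the union schema still has to be created
separately):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class SnapshotCloneExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            // Snapshot the existing table, then restore it under the new name.
            admin.snapshot("old_table_snap", TableName.valueOf("OLD_TABLE"));
            admin.cloneSnapshot("old_table_snap", TableName.valueOf("NEW_TABLE"));
            // Clean up the snapshot once the clone is verified.
            admin.deleteSnapshot("old_table_snap");
        }
    }
}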
>> >> On Jun 17, 2017, at 5:06 PM, Randy Hu  wrote:
>> >>
>> >> If I count the number of trailing zeros correctly, that's 15 billion
>> >> records; any solution based on the HBase PUT path (UPSERT SELECT) would
>> >> probably take far longer than you expect. It would be better to use the
>> >> MapReduce-based bulk importer provided by Phoenix:
>> >>
>> >> https://phoenix.apache.org/bulk_dataload.html
>> >>
>> >> The importer leverages HBase bulk-load mode to convert all data into HBase
>> >> storage files, then hands them over to HBase in the final stage, thus
>> >> avoiding the network and random disk access costs of going through the
>> >> HBase region servers.
>> >>
>> >> Randy
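
The bulk loader above is normally launched from the command line (hadoop jar
phoenix-<version>-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool ...),
but it can also be driven from Java; a minimal sketch, assuming the source data
has already been exported to CSV (the table name, input path, and ZooKeeper
quorum are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.util.ToolRunner;
import org.apache.phoenix.mapreduce.CsvBulkLoadTool;

public class BulkLoadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Writes HFiles via MapReduce and hands them to HBase at the end,
        // bypassing the region servers' normal write path.
        int exit = ToolRunner.run(conf, new CsvBulkLoadTool(), new String[] {
                "--table", "TARGET_TABLE",
                "--input", "/data/export/target_table.csv",
                "--zookeeper", "zk-host:2181"
        });
        System.exit(exit);
    }
}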
>> >>
>> >> On Fri, Jun 16, 2017 at 9:51 AM, Pedro Boado [via Apache Phoenix User 
>> >> List]
>> >>  wrote:
>> >>
>> >>> Hi guys,
>> >>>
>> >>> We are trying to populate a Phoenix table based on a 1:1 projection of
>> >>> another table with around 15.000.000.000 records via an UPSERT SELECT in
>> >>> phoenix client. We've noticed a very poor performance ( I suspect the
>> >>> client is using a single-threaded approach ) and lots of issues with 
>> >>> client
>> >>> timeouts.
>> >>>
>> >>> Is there a better way of approaching this problem?
>> >>>
>> >>> Cheers!
>> >>> Pedro
>> >>>
>> >>>
> 


Re: Can't run map-reduce index builder because my view/idx is lower case

2017-06-19 Thread Batyrshin Alexander
Hello again,

Could you please help me run the MapReduce index builder for a view with a
lower-case name?

Here is my test on Phoenix 4.8.2:

CREATE TABLE "table" (
c1 varchar,
c2 varchar,
c3 varchar
CONSTRAINT pk PRIMARY KEY (c1,c2,c3)
)

CREATE VIEW "table_view"
AS SELECT * FROM "table" WHERE c3 = 'X';

CREATE INDEX "table_view_idx" ON "table_view" (c2, c1) ASYNC;

sudo -u hadoop ./bin/hbase org.apache.phoenix.mapreduce.index.IndexTool 
--data-table '"table_view"' --index-table '"table_view_idx"' --output-path 
ASYNC_IDX_HFILES

2017-06-19 21:27:17,716 ERROR [main] index.IndexTool: An exception occurred 
while performing the indexing job: IllegalArgumentException:  TABLE_VIEW_IDX is 
not an index table for TABLE_VIEW  at:
java.lang.IllegalArgumentException:  TABLE_VIEW_IDX is not an index table for 
TABLE_VIEW
at org.apache.phoenix.mapreduce.index.IndexTool.run(IndexTool.java:190)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at org.apache.phoenix.mapreduce.index.IndexTool.main(IndexTool.java:394)


> On 17 Jun 2017, at 03:55, Batyrshin Alexander <0x62...@gmail.com> wrote:
> 
>  Hello,
> I'm trying to build an ASYNC index following the example from
> https://phoenix.apache.org/secondary_indexing.html
> 
> My issue is that my view name and index name are lower case, so the
> MapReduce job raises an error:
> 
> 2017-06-17 03:45:56,506 ERROR [main] index.IndexTool: An exception occurred 
> while performing the indexing job: IllegalArgumentException:  
> INVOICES_V4_INDEXED_FUZZY_IDX is not an index table for 
> INVOICES_V4_INDEXED_FUZZY
> 



Re: Best strategy for UPSERT SELECT in large table

2017-06-19 Thread Michael Young
Regarding your idea to use the snapshot/restore method (with a new name):
is it possible to add a PK column with that approach?  For example, if I
wanted to change a PK column type from VARCHAR to FLOAT, is this possible?



On Sun, Jun 18, 2017 at 10:50 AM, Jonathan Leech  wrote:

> Also, if you're updating that many values and not doing it in bulk /
> mapreduce / straight to hfiles, you'll want to give the region servers as
> much heap as possible, set store files and blocking store files
> astronomically high, and set the memory size for the table before HBase
> flushes to disk as large as possible. This is to avoid compactions slowing
> you down and causing timeouts. You can also break up the upsert selects
> into smaller chunks and manually compact in between to mitigate. The above
> strategy also applies for other large updates in the regular HBase write
> path, such as building or rebuilding indexes.
>
> > On Jun 18, 2017, at 11:41 AM, Jonathan Leech  wrote:
> >
> > Another thing to consider, but only if your 1:1 mapping keeps the
> primary keys the same, is to snapshot the table and restore it with the new
> name, and a schema that is the union of the old and new schemas. I would
> put the new columns in a new column family. Then use upsert select,
> mapreduce, or Spark to transform the data, then drop the columns from the
> old schema. This strategy could cut the amount of work to be done by half
> and not send data over the network.
> >
> >> On Jun 17, 2017, at 5:06 PM, Randy Hu  wrote:
> >>
> >> If I count the number of trailing zeros correctly, that's 15 billion
> >> records; any solution based on the HBase PUT path (UPSERT SELECT) would
> >> probably take far longer than you expect. It would be better to use the
> >> MapReduce-based bulk importer provided by Phoenix:
> >>
> >> https://phoenix.apache.org/bulk_dataload.html
> >>
> >> The importer leverages HBase bulk-load mode to convert all data into HBase
> >> storage files, then hands them over to HBase in the final stage, thus
> >> avoiding the network and random disk access costs of going through the
> >> HBase region servers.
> >>
> >> Randy
> >>
> >> On Fri, Jun 16, 2017 at 9:51 AM, Pedro Boado [via Apache Phoenix User
> List]
> >>  wrote:
> >>
> >>> Hi guys,
> >>>
> >>> We are trying to populate a Phoenix table based on a 1:1 projection of
> >>> another table with around 15.000.000.000 records via an UPSERT SELECT
> in
> >>> phoenix client. We've noticed a very poor performance ( I suspect the
> >>> client is using a single-threaded approach ) and lots of issues with
> client
> >>> timeouts.
> >>>
> >>> Is there a better way of approaching this problem?
> >>>
> >>> Cheers!
> >>> Pedro
> >>>
> >>>
>


Re: Create View of Existing HBase table

2017-06-19 Thread M. Aaron Bossert
Thanks for that pointer!  When I got to work this morning, I redid my ingest
with all timestamps in milliseconds and the view populated nicely.  Thanks again!
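
For anyone hitting the same thing, here is a minimal sketch of converting a
packet-capture timestamp from microseconds to milliseconds before writing the
HBase cell (the row key and value are placeholders; the "bm" family and
"proto" qualifier are taken from the table described further down):

import java.util.concurrent.TimeUnit;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class PacketIngestExample {
    public static void main(String[] args) throws Exception {
        long packetTsMicros = 1497744000123456L; // capture timestamp in microseconds
        // HBase cell timestamps are epoch milliseconds, so convert before
        // using the packet time as the explicit cell timestamp.
        long cellTsMillis = TimeUnit.MICROSECONDS.toMillis(packetTsMicros);

        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("bro"))) {
            Put put = new Put(Bytes.toBytes("row-key"));
            put.addColumn(Bytes.toBytes("bm"), Bytes.toBytes("proto"),
                    cellTsMillis, Bytes.toBytes("tcp"));
            table.put(put);
        }
    }
}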

Sent from my iPhone

> On Jun 18, 2017, at 01:12, Randy Hu  wrote:
> 
> That's likely the problem. The value in HBase/Java is in milliseconds (
> System.currentTimeMillis()), as explained in the link:
> 
> https://currentmillis.com/tutorials/system-currentTimeMillis.html#unix-timestamp
> 
> If the packet timestamp is a long value in microseconds, then it is 1,000
> times the value expected by HBase, which would look like a future timestamp
> to HBase.
> 
> Randy
> 
> On Sat, Jun 17, 2017 at 8:40 PM, M. Aaron Bossert [via Apache Phoenix User
> List]  wrote:
> 
>> One potential difference might be resolution.  The network packets have
>> precision down to microsecond...could that cause an issue?
>> 
>> Sent from my iPhone
>> 
>>> On Jun 17, 2017, at 20:32, Randy Hu <[hidden email]
>> > wrote:
>>> 
>>> Yes, the problem I had experienced was future timestamps in cells.
>>> Setting "CurrentSCN" to the maximum value forces the scan/get to use
>>> that value as the timestamp, so all cells are returned.
>>> 
>>> Not sure if the network packet timestamp you copied is the same as a Unix
>>> timestamp:
>>> 
>>> https://currentmillis.com/tutorials/system-currentTimeMillis.html#unix-timestamp
>>> 
>>> For testing purposes, maybe you can insert some cells without an explicit
>>> timestamp to confirm whether the timestamp is the issue.
>>> 
>>> Randy
>>> 
>>> On Sat, Jun 17, 2017 at 6:21 PM, M. Aaron Bossert [via Apache Phoenix
>> User
>>> List] <[hidden email]
>> > wrote:
>>> 
 I looked through the discussion and it seems like their issue was
 timestamps in the future?  I do use explicit timestamps for the data. I am
 storing network traffic and am using the packet's timestamp as the HBase
 timestamp, which is often some time in the past when ingesting bulk
 collection files for forensic-type analysis.
 
 How would setting the SCN value help?  Would I set it in the past to match
 the oldest or newest packets?
 
 Sent from my iPhone
 
> On Jun 17, 2017, at 18:01, Randy Hu <[hidden email]
 > wrote:
> 
> Not sure if you had a custom timestamp set when the values were persisted.
> Here is the timestamp issue I discovered with views. The solution
> (workaround) is in the last post:
> 
> http://apache-phoenix-user-list.1124778.n5.nabble.com/View-timestamp-on-existing-table-potential-defect-td3475.html
> 
> Randy
> 
> On Fri, Jun 16, 2017 at 2:07 PM, M. Aaron Bossert [via Apache Phoenix
 User
> List] <[hidden email]
 > wrote:
> 
>> *I have an existing HBase table with the following characteristics:*
>> 
>> hbase(main):032:0> describe 'bro'
>> 
>> Table bro is ENABLED
>> 
>> 
>> 
>> bro, {TABLE_ATTRIBUTES => {coprocessor$1 =>
>> '|org.apache.phoenix.coprocessor.ScanRegionObserver|805306366|',
>> coprocessor$2 =>
>> '|org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver|805306366|',
>> coprocessor$3 =>
>> '|org.apache.phoenix.coprocessor.GroupedAggregateRegionObserver|805306366|',
>> coprocessor$4 =>
>> '|org.apache.phoenix.coprocessor.ServerCachingEndpointImpl|805306366|',
>> METADATA => {'COMPACTION_ENABLED' => 'true'}
>> }
>> 
>> 
>> 
>> COLUMN FAMILIES DESCRIPTION
>> 
>> 
>> 
>> {NAME => 'bm', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false',
>> KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER',
>> COMPRESSION => 'LZ4', MIN_VERSIONS => '0', BLOCKCACHE => 'true',
>> BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
>> 
>> 
>> 
>> {NAME => 'uid', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false',
>> KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER',
>> COMPRESSION => 'LZ4', MIN_VERSIONS => '0', BLOCKCACHE => 'true',
>> BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
>> 
>> 
>> 
>> 2 row(s) in 0.0300 seconds
>> 
>> 
>> *When I try to create a Phoenix VIEW (Phoenix 4.7) as such:*
>> 
>> 
>> 0: jdbc:phoenix:pocc2.net:2181> CREATE VIEW "bro" (pk VARCHAR PRIMARY KEY,
>> "bm"."id.orig_h" VARCHAR, "bm"."id.orig_p" VARCHAR, "bm"."id_resp_h" VARCHAR,
>> "bm"."id_resp_p" VARCHAR, "bm"."proto" VARCHAR) AS SELECT * FROM "bro";
>> 
>> No rows affected (0.019 seconds)
>> 
>> 
>> *I see no errors at all, but when I try to