Re: Hash join confusion

2016-09-28 Thread Sumit Nigam
Thanks Maryann.
Yes, let me switch to sort-merge join because the other query uses many more 
columns. Also, if I just change the hint to use sort-merge join, would that be 
enough, or do I need to sort both the driving query and the subquery with the 
same ORDER BY for the merge?
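
For reference, a minimal sketch of the hint syntax (the table and column names
below are made up, not the actual schema); as far as I understand, the sort-merge
plan sorts both sides itself, so an extra ORDER BY should not be needed:

    SELECT /*+ USE_SORT_MERGE_JOIN */ a.ID, a.CURRENT_TIMESTAMP
    FROM BIG_TBL a
    INNER JOIN (SELECT MAX(CURRENT_TIMESTAMP) AS mct, ID AS tid
                FROM BIG_TBL GROUP BY ID) tmp
    ON a.ID = tmp.tid AND a.CURRENT_TIMESTAMP = tmp.mct;
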
As an aside, is there a document to interpret explain plan?
Thanks,
Sumit
  From: Maryann Xue 
 To: Sumit Nigam  
Cc: "user@phoenix.apache.org" 
 Sent: Thursday, September 29, 2016 11:03 AM
 Subject: Re: Hash join confusion
   
Thank you, Sumit, for trying this out! So right now it's very clear that the 
table to be cached IS too big, so there should be no point in using hash join in 
this case. Is the other table much smaller, or is it about the same size or 
even bigger? If it's considerably smaller you can probably rewrite your query 
to do the join the other way; otherwise let's just stick to sort-merge join.

Thanks,
Maryann
On Wed, Sep 28, 2016 at 10:29 PM, Sumit Nigam  wrote:

Thank you Maryann.
I am not using multi-tenancy for these tables. Increasing 
phoenix.coprocessor.maxServerCacheTimeToLiveMs and the corresponding cache size 
config just delayed the error.
I have also started seeing some memory problem -
Caused by: org.apache.phoenix.memory.InsufficientMemoryException: Requested 
memory of 22871932 bytes could not be allocated from remaining memory of 
776776654 bytes from global pool of 778469376 bytes after waiting for 1ms.
at org.apache.phoenix.memory.GlobalMemoryManager.allocateBytes(GlobalMemoryManager.java:78)
at org.apache.phoenix.memory.GlobalMemoryManager.access$300(GlobalMemoryManager.java:30)
at org.apache.phoenix.memory.GlobalMemoryManager$GlobalMemoryChunk.resize(GlobalMemoryManager.java:139)
What I am having trouble with is that the total size of the CSV produced by the 
sub-query is only ~7 MB. I have 12 region servers with 5 GB heap each. So, when 
this result gets sent across to all region servers to perform the server-side 
join, I am not sure why a memory issue should show up (or a timeout occur). Any 
insights?

These tables are salted. Not sure if it is the 
https://issues.apache.org/jira/browse/PHOENIX-2900 issue. 
Switching to sort-merge join helped. But I am not sure if that is the right 
solution going forward.
Thanks again!
Sumit

  From: Maryann Xue 
 To: "user@phoenix.apache.org" ; Sumit Nigam 
 
 Sent: Wednesday, September 28, 2016 11:36 PM
 Subject: Re: Hash join confusion
   
Yes, Sumit, the sub-query will get cached in hash join. Are you using 
multi-tenancy for these tables? If yes, you might want to check out Phoenix 4.7 
or 4.8, since a related bug fix got in the 4.7 release. 
https://issues.apache.org/jira/browse/PHOENIX-2381?jql=project%20%3D%20PHOENIX%20AND%20text%20~%20%22hash%20cache%20id%22
Otherwise I think it's the hash cache timeout issue, in which case changing 
phoenix.coprocessor.maxServerCacheTimeToLiveMs might be helpful.

Thanks,
Maryann
On Tue, Sep 27, 2016 at 10:02 PM, Sumit Nigam  wrote:

Hi,
Is there any document which can help me understand explain plan output in 
detail? Or, which piece of code should I look at, to get an idea?
Here is the explain plan for the inner join query below. Can anyone help in 
explaining it to me? Like, as per the plan, which table is being cached, etc.?
Here, indx_exdocb is the index table (on ID) and exDocStoreb is the main table 
with row key (current_timestamp, ID).
+--+
|                   PLAN                   |
+--+
| CLIENT 36-CHUNK PARALLEL 36-WAY FULL SCAN OVER exDocStoreb |
|     PARALLEL INNER-JOIN TABLE 0 (SKIP MERGE) |
|         CLIENT 36-CHUNK PARALLEL 36-WAY RANGE SCAN OVER indx_exdocb [0,'42ecf4abd4bd7e7606025dc8eee3de6a3cc04418cbc2619ddc01f54d88d7c3bf'] - [0,'42ecf4abd4bd7e7606025dc8eee3de6a3cc04418cbc2619ddc01f54d88d7c3bg' |
|             SERVER FILTER BY FIRST KEY ONLY |
|             SERVER AGGREGATE INTO ORDERED DISTINCT ROWS BY ["ID"] |
|         CLIENT MERGE SORT |
|     DYNAMIC SERVER FILTER BY (A.CURRENT_TIMESTAMP, A.ID) IN ((TMP.MCT, TMP.TID)) |
+--+
Also, is there a way to turn on a more verbose explain plan? Like seeing the 
number of bytes and rows that each step results in?
Thanks,
Sumit

  From: Sumit Nigam 
 To: Users Mail List Phoenix  
 Sent: Tuesday, September 27, 2016 9:17 PM
 Subject: Hash join confusion
  
Hi,
I am using HBase 1.1 with Phoenix 4.6. 
I have a table with row key (current_timestamp, id) which is salted, and an 
index on (id). This table has ~3 million records.
I have a query like the one given below.
SELECT  ID, CURRENT_TIMESTAMP,  from TBL as a inner join ( select

Re: Hash join confusion

2016-09-28 Thread Maryann Xue
Thank you, Sumit, for trying this out! So right now it's very clear that the
table to be cached IS too big, so there should be no point in using hash
join in this case. Is the other table much smaller, or is it about the same
size or even bigger? If it's considerably smaller you can probably rewrite
your query to do the join the other way; otherwise let's just stick to
sort-merge join.
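
To illustrate "the other way" concretely, a rough sketch with made-up table names
(not the actual schema): per the Phoenix joins documentation cited later in this
thread, the right-hand side of a hash join is the side built into the server-side
hash cache, so putting the considerably smaller relation on the right keeps that
cache small:

    -- Hypothetical tables: LARGE_TBL is the big relation, SMALL_TBL the small one.
    -- SMALL_TBL, being on the right-hand side, is the one cached on the servers.
    SELECT l.ID, l.CREATED_AT
    FROM LARGE_TBL l
    INNER JOIN SMALL_TBL s ON l.ID = s.ID;
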


Thanks,
Maryann

On Wed, Sep 28, 2016 at 10:29 PM, Sumit Nigam  wrote:

> Thank you Maryann.
>
> I am not using multi-tenancy for these tables. Increasing phoenix.
> coprocessor.maxServerCacheTimeToLiveMs and the corresponding cache size
> config just delayed the error.
>
> I have also started seeing some memory problem -
>
> Caused by: org.apache.phoenix.memory.*InsufficientMemoryException*: Requested 
> memory of 22871932 bytes could not be allocated from remaining memory of 
> 776776654 bytes from global pool of 778469376 bytes after waiting for 1ms.
>   at 
> org.apache.phoenix.memory.GlobalMemoryManager.allocateBytes(GlobalMemoryManager.java:78)
>   at 
> org.apache.phoenix.memory.GlobalMemoryManager.access$300(GlobalMemoryManager.java:30)
>   at 
> org.apache.phoenix.memory.GlobalMemoryManager$GlobalMemoryChunk.resize(GlobalMemoryManager.java:139)
>
>
> What I am having trouble with is, that the total size of csv produced by
> sub-query is only ~7Mb. I have 12 region servers with 5GB heap each. So,
> when this result gets sent across to all region servers to perform the
> server side join, not sure why a memory issue should show up (or a time out
> occur). Any insights?
>
> These tables are salted. Not sure if it is https://issues.apache.org/
> jira/browse/PHOENIX-2900 issue.
>
> Switching to sort merge join helped. But not sure if that is the right
> solution going forward.
>
> Thanks again!
> Sumit
>
>
> --
> *From:* Maryann Xue 
> *To:* "user@phoenix.apache.org" ; Sumit Nigam <
> sumit_o...@yahoo.com>
> *Sent:* Wednesday, September 28, 2016 11:36 PM
> *Subject:* Re: Hash join confusion
>
> Yes, Sumit, the sub-query will get cached in hash join. Are you using
> multi-tenancy for these tables? If yes, you might want to checkout Phoenix
> 4.7 or 4.8, since a related bug fix got in the 4.7 release.
> https://issues.apache.org/jira/browse/PHOENIX-2381?jql=project%20%
> 3D%20PHOENIX%20AND%20text%20~%20%22hash%20cache%20id%22
>
> Otherwise I think it's the hash cache timeout issue, in which case
> changing phoenix.coprocessor.maxServerCacheTimeToLiveMs might be helpful.
>
>
> Thanks,
> Maryann
>
> On Tue, Sep 27, 2016 at 10:02 PM, Sumit Nigam 
> wrote:
>
> Hi,
>
> Is there any document which can help me understand explain plan output in
> detail? Or, which piece of code should I look at, to get an idea?
>
> Here is explain plan for inner join query below. Can anyone help in
> explaining it to me? Like, as per the plan which table is being cached,
> etc.?
> Here, *indx_exdocb* is index table* (on ID) *and *exDocStoreb *is main
> table with rowkey as (current_timestamp, ID).
>
> +--+
> |   PLAN   |
> +--+
> | CLIENT 36-CHUNK PARALLEL 36-WAY FULL SCAN OVER *exDocStoreb* |
> | PARALLEL INNER-JOIN TABLE 0 (SKIP MERGE) |
> | CLIENT 36-CHUNK PARALLEL 36-WAY RANGE SCAN OVER *indx_exdocb*
> [0,'42ecf4abd4bd7e7606025dc8eee3de6a3cc04418cbc2619ddc01f54d88d7c3bf'] -
> [0,'42ecf4abd4bd7e7606025dc8eee3de6a3cc04418cbc2619ddc01f54d88d7c3bg' |
> | SERVER FILTER BY FIRST KEY ONLY |
> | SERVER AGGREGATE INTO ORDERED DISTINCT ROWS BY ["ID"] |
> | CLIENT MERGE SORT|
> | DYNAMIC SERVER FILTER BY (A.CURRENT_TIMESTAMP, A.ID) IN ((TMP.MCT,
> TMP.TID)) |
> +--+
>
> Also, is there a way to turn ON more verbose explain plan? Like, seeing
> number of bytes, rows that each step results in?
>
> Thanks,
> Sumit
>
> --
> *From:* Sumit Nigam 
> *To:* Users Mail List Phoenix 
> *Sent:* Tuesday, September 27, 2016 9:17 PM
> *Subject:* Hash join confusion
>
> Hi,
>
> I am using hbase 1.1 with phoenix 4.6.
>
> I have a table with row key as (current_timestamp, id) which is salted and
> index on (id). This table has ~3 million records.
>
> I have a query like given below.
>
> SELECT  ID, CURRENT_TIMESTAMP,  from TBL
>as a inner join (
> select max(CURRENT_TIMESTAMP) as mct, ID
> as tid from TBL where ID like '42ecf4abd4bd7e7606025dc8eee3de
> 6a3cc04418cbc2619ddc01f54d88d7c3bf%' group by ID) as tmp
>on a.ID=tmp.tid and a.CURRENT_TIMESTAMP=tmp.mct
>
>
> The query hangs for long and finally fails with a timeout. I have 12
> region servers 

Re: Hash join confusion

2016-09-28 Thread Sumit Nigam
Thank you Maryann.
I am not using multi-tenancy for these tables. Increasing 
phoenix.coprocessor.maxServerCacheTimeToLiveMs and the corresponding cache size 
config just delayed the error. 
I have also started seeing some memory problem -
Caused by: org.apache.phoenix.memory.InsufficientMemoryException: Requested 
memory of 22871932 bytes could not be allocated from remaining memory of 
776776654 bytes from global pool of 778469376 bytes after waiting for 1ms.
at 
org.apache.phoenix.memory.GlobalMemoryManager.allocateBytes(GlobalMemoryManager.java:78)
at 
org.apache.phoenix.memory.GlobalMemoryManager.access$300(GlobalMemoryManager.java:30)
at 
org.apache.phoenix.memory.GlobalMemoryManager$GlobalMemoryChunk.resize(GlobalMemoryManager.java:139)
What I am having trouble with is that the total size of the CSV produced by the 
sub-query is only ~7 MB. I have 12 region servers with 5 GB heap each. So, when 
this result gets sent across to all region servers to perform the server-side 
join, I am not sure why a memory issue should show up (or a timeout occur). Any 
insights?

These tables are salted. Not sure if it is the 
https://issues.apache.org/jira/browse/PHOENIX-2900 issue. 
Switching to sort-merge join helped. But I am not sure if that is the right 
solution going forward.
Thanks again!
Sumit

  From: Maryann Xue 
 To: "user@phoenix.apache.org" ; Sumit Nigam 
 
 Sent: Wednesday, September 28, 2016 11:36 PM
 Subject: Re: Hash join confusion
   
Yes, Sumit, the sub-query will get cached in hash join. Are you using 
multi-tenancy for these tables? If yes, you might want to check out Phoenix 4.7 
or 4.8, since a related bug fix got in the 4.7 release. 
https://issues.apache.org/jira/browse/PHOENIX-2381?jql=project%20%3D%20PHOENIX%20AND%20text%20~%20%22hash%20cache%20id%22
Otherwise I think it's the hash cache timeout issue, in which case changing 
phoenix.coprocessor.maxServerCacheTimeToLiveMs might be helpful.

Thanks,
Maryann
On Tue, Sep 27, 2016 at 10:02 PM, Sumit Nigam  wrote:

Hi,
Is there any document which can help me understand explain plan output in 
detail? Or, which piece of code should I look at, to get an idea?
Here is the explain plan for the inner join query below. Can anyone help in 
explaining it to me? Like, as per the plan, which table is being cached, etc.?
Here, indx_exdocb is the index table (on ID) and exDocStoreb is the main table 
with row key (current_timestamp, ID).
+--+
|                   PLAN                   |
+--+
| CLIENT 36-CHUNK PARALLEL 36-WAY FULL SCAN OVER exDocStoreb |
|     PARALLEL INNER-JOIN TABLE 0 (SKIP MERGE) |
|         CLIENT 36-CHUNK PARALLEL 36-WAY RANGE SCAN OVER indx_exdocb [0,'42ecf4abd4bd7e7606025dc8eee3de6a3cc04418cbc2619ddc01f54d88d7c3bf'] - [0,'42ecf4abd4bd7e7606025dc8eee3de6a3cc04418cbc2619ddc01f54d88d7c3bg' |
|             SERVER FILTER BY FIRST KEY ONLY |
|             SERVER AGGREGATE INTO ORDERED DISTINCT ROWS BY ["ID"] |
|         CLIENT MERGE SORT |
|     DYNAMIC SERVER FILTER BY (A.CURRENT_TIMESTAMP, A.ID) IN ((TMP.MCT, TMP.TID)) |
+--+
Also, is there a way to turn on a more verbose explain plan? Like seeing the 
number of bytes and rows that each step results in?
Thanks,
Sumit

  From: Sumit Nigam 
 To: Users Mail List Phoenix  
 Sent: Tuesday, September 27, 2016 9:17 PM
 Subject: Hash join confusion
  
Hi,
I am using HBase 1.1 with Phoenix 4.6. 
I have a table with row key (current_timestamp, id) which is salted, and an 
index on (id). This table has ~3 million records.
I have a query like the one given below.
SELECT  ID, CURRENT_TIMESTAMP,  from TBL as a
    inner join (select max(CURRENT_TIMESTAMP) as mct, ID as tid from TBL
                where ID like '42ecf4abd4bd7e7606025dc8eee3de6a3cc04418cbc2619ddc01f54d88d7c3bf%'
                group by ID) as tmp
    on a.ID=tmp.tid and a.CURRENT_TIMESTAMP=tmp.mct


The query hangs for a long time and finally fails with a timeout. I have 12 
region servers, each with a 5 GB heap, and the total number of records 
satisfying the above query is 62K, whose CSV dump is only ~10 MB.
DoNotRetryIOException: Could not find hash cache for join Id: Ӧ�8�D�. The cache 
might have expired and have been removed

and - 
Caused by: java.sql.SQLException: Encountered exception in sub plan [0] 
execution.
at org.apache.phoenix.execute.HashJoinPlan.iterator(HashJoinPlan.java:175)
at com.infa.products.ldm.ingestion.server.java.hadoop.impl.FixPhoenixIngestInputFormat.getQueryPlan(FixPhoenixIngestInputFormat.java:94)
... 22 more
and
Caused by: java.sql.SQLException: java.util.concurrent.TimeoutException
at org.apache.phoenix.cache.ServerCacheClient.addServerCache(

Re: bulk-delete spark phoenix

2016-09-28 Thread Josh Mahonin
Hi Fabio,

You could probably just execute a regular DELETE query from a JDBC call,
which is generally safe to do either from the Spark driver or within an
executor. As long as auto-commit is enabled, it's an entirely server-side
operation: https://phoenix.apache.org/language/#delete
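
As a minimal sketch (the table and predicate are made up, not from the actual
schema), the statement issued over JDBC would just be a plain Phoenix DELETE:

    -- Hypothetical table and column; with auto-commit enabled this is executed
    -- on the server side rather than pulling rows back to the client.
    DELETE FROM EVENTS WHERE CREATED_DATE < TO_DATE('2016-01-01 00:00:00');
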

Josh

On Wed, Sep 28, 2016 at 2:13 PM, fabio ferrante 
wrote:

> Hi,
>
> I would like to perform a bulk delete to HBase using Apache Phoenix from
> Spark. Using Phoenix-Spark plugin i can successfully perform a bulk load
> using saveToPhoenix method from PhoenixRDD but how i can perform a bulk
> delete? There isn't a deleteFromPhoenix method in PhoenixRDD. Is that
> correct? Implement such method is a trivial task?
>
> Thanks in advance,
>  Fabio.
>


bulk-delete spark phoenix

2016-09-28 Thread fabio ferrante
Hi,
 
I would like to perform a bulk delete in HBase using Apache Phoenix from
Spark. Using the Phoenix-Spark plugin I can successfully perform a bulk load
using the saveToPhoenix method from PhoenixRDD, but how can I perform a bulk
delete? There isn't a deleteFromPhoenix method in PhoenixRDD. Is that
correct? Would implementing such a method be a trivial task?
 
Thanks in advance,
 Fabio.


Re: Hash join confusion

2016-09-28 Thread Maryann Xue
Yes, Sumit, the sub-query will get cached in hash join. Are you using
multi-tenancy for these tables? If yes, you might want to check out Phoenix
4.7 or 4.8, since a related bug fix got in the 4.7 release.
https://issues.apache.org/jira/browse/PHOENIX-2381?jql=project%20%3D%20PHOENIX%20AND%20text%20~%20%22hash%20cache%20id%22

Otherwise I think it's the hash cache timeout issue, in which case changing
phoenix.coprocessor.maxServerCacheTimeToLiveMs might be helpful.
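
For reference, a sketch of how that property is typically set (the value is only
an example, and my understanding is that this is a server-side setting, so it
goes into hbase-site.xml on the region servers and needs a restart to take effect):

    <!-- Example value only: keep the hash cache alive for 5 minutes. -->
    <property>
      <name>phoenix.coprocessor.maxServerCacheTimeToLiveMs</name>
      <value>300000</value>
    </property>
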


Thanks,
Maryann

On Tue, Sep 27, 2016 at 10:02 PM, Sumit Nigam  wrote:

> Hi,
>
> Is there any document which can help me understand explain plan output in
> detail? Or, which piece of code should I look at, to get an idea?
>
> Here is explain plan for inner join query below. Can anyone help in
> explaining it to me? Like, as per the plan which table is being cached,
> etc.?
> Here, *indx_exdocb* is index table* (on ID) *and *exDocStoreb *is main
> table with rowkey as (current_timestamp, ID).
>
> +--+
> |   PLAN   |
> +--+
> | CLIENT 36-CHUNK PARALLEL 36-WAY FULL SCAN OVER *exDocStoreb* |
> | PARALLEL INNER-JOIN TABLE 0 (SKIP MERGE) |
> | CLIENT 36-CHUNK PARALLEL 36-WAY RANGE SCAN OVER *indx_exdocb*
> [0,'42ecf4abd4bd7e7606025dc8eee3de6a3cc04418cbc2619ddc01f54d88d7c3bf'] -
> [0,'42ecf4abd4bd7e7606025dc8eee3de6a3cc04418cbc2619ddc01f54d88d7c3bg' |
> | SERVER FILTER BY FIRST KEY ONLY |
> | SERVER AGGREGATE INTO ORDERED DISTINCT ROWS BY ["ID"] |
> | CLIENT MERGE SORT|
> | DYNAMIC SERVER FILTER BY (A.CURRENT_TIMESTAMP, A.ID) IN ((TMP.MCT,
> TMP.TID)) |
> +--+
>
> Also, is there a way to turn ON more verbose explain plan? Like, seeing
> number of bytes, rows that each step results in?
>
> Thanks,
> Sumit
>
> --
> *From:* Sumit Nigam 
> *To:* Users Mail List Phoenix 
> *Sent:* Tuesday, September 27, 2016 9:17 PM
> *Subject:* Hash join confusion
>
> Hi,
>
> I am using hbase 1.1 with phoenix 4.6.
>
> I have a table with row key as (current_timestamp, id) which is salted and
> index on (id). This table has ~3 million records.
>
> I have a query like given below.
>
> SELECT  ID, CURRENT_TIMESTAMP,  from TBL
>as a inner join (
> select max(CURRENT_TIMESTAMP) as mct, ID
> as tid from TBL where ID like '42ecf4abd4bd7e7606025dc8eee3de
> 6a3cc04418cbc2619ddc01f54d88d7c3bf%' group by ID) as tmp
>on a.ID=tmp.tid and a.CURRENT_TIMESTAMP=tmp.mct
>
>
> The query hangs for long and finally fails with a timeout. I have 12
> region servers each with 5GB heap and also the total records satisfying the
> above query is 62K whose CSV dump is ~10MB only.
>
> DoNotRetryIOException: Could not find *hash cache for join Id*: Ӧ�8�D�.
> The cache might have expired and have been removed
>
> and -
>
> Caused by: java.sql.SQLException: Encountered exception in sub plan [0] 
> execution.
>   at 
> org.apache.phoenix.execute.HashJoinPlan.iterator(HashJoinPlan.java:175)
>   at 
> com.infa.products.ldm.ingestion.server.java.hadoop.impl.FixPhoenixIngestInputFormat.getQueryPlan(FixPhoenixIngestInputFormat.java:94)
>   ... 22 more
>
>
> and
> Caused by: java.sql.SQLException:
> java.util.concurrent.*TimeoutException*
>
> at 
> org.apache.phoenix.cache.ServerCacheClient.addServerCache(ServerCacheClient.java:264)
>
>
> I can try playing around with parameters such as 
> phoenix.coprocessor.maxServerCacheTimeToLiveMs and switching to 
> sort_merge_join actually helped.
>
> But my question is: as per Joins | Apache Phoenix 
> in a case such as *lhs* INNER JOIN 
> *rhs*, it is *rhs* which will be built as a hash table in the server cache. So, in 
> the above query I assume this gets cached?
>
>  select max(CURRENT_TIMESTAMP) as mct, ID as tid from TBL where ID like 
> '42ecf4abd4bd7e7606025dc8eee3de6a3cc04418cbc2619ddc01f54d88d7c3bf%' group by 
> ID) as tmp
>
> Thanks,
> Sumit
>
>
>
>


Re: bulk-upsert spark phoenix

2016-09-28 Thread Josh Mahonin
Hi Antonio,

Certainly, a JIRA ticket with a patch would be fantastic.

Thanks!

Josh

On Wed, Sep 28, 2016 at 12:08 PM, Antonio Murgia 
wrote:

> Thank you very much for your insights Josh, if I decide to develop a small
> Phoenix Library that does, through Spark, what the CSV loader does, I'll
> surely write to the mailing list, or open a Jira, or maybe even open a PR,
> right?
>
> Thank you again
>
> #A.M.
>
> On 09/28/2016 05:10 PM, Josh Mahonin wrote:
>
> Hi Antonio,
>
> You're correct, the phoenix-spark output uses the Phoenix Hadoop
> OutputFormat under the hood, which effectively does a parallel, batch JDBC
> upsert. It should scale depending on the number of Spark executors,
> RDD/DataFrame parallelism, and number of HBase RegionServers, though
> admittedly there's a lot of overhead involved.
>
> The CSV Bulk loading tool uses MapReduce, it's not integrated with Spark.
> It's likely possible to do so, but it's probably a non-trivial amount of
> work. If you're interested in taking it on, I'd start with looking at the
> following classes:
>
> https://github.com/apache/phoenix/blob/master/phoenix-
> core/src/main/java/org/apache/phoenix/mapreduce/CsvBulkLoadTool.java
> https://github.com/apache/phoenix/blob/master/phoenix-
> core/src/main/java/org/apache/phoenix/mapreduce/AbstractBulkLoadTool.java
> https://github.com/apache/phoenix/blob/master/phoenix-
> core/src/main/java/org/apache/phoenix/mapreduce/PhoenixOutputFormat.java
> https://github.com/apache/phoenix/blob/master/phoenix-
> core/src/main/java/org/apache/phoenix/mapreduce/PhoenixRecordWriter.java
> https://github.com/apache/phoenix/blob/master/phoenix-
> spark/src/main/scala/org/apache/phoenix/spark/DataFrameFunctions.scala
>
> Good luck,
>
> Josh
>
> On Tue, Sep 27, 2016 at 10:43 AM, Antonio Murgia 
> wrote:
>
>> Hi,
>>
>> I would like to perform a Bulk insert to HBase using Apache Phoenix from
>> Spark. I tried using Apache Spark Phoenix library but, as far as I was
>> able to understand from the code, it looks like it performs a jdbc batch
>> of upserts (am I right?). Instead I want to perform a Bulk load like the
>> one described in this blog post
>> (https://zeyuanxy.github.io/HBase-Bulk-Loading/) but taking advance of
>> the automatic transformation between java/scala types to Bytes.
>>
>> I'm actually using phoenix 4.5.2, therefore I cannot use hive to
>> manipulate the phoenix table, and if it possible i want to avoid to
>> spawn a MR job that reads data from csv
>> (https://phoenix.apache.org/bulk_dataload.html). Actually i just want to
>> do what the csv loader is doing with MR but programmatically with Spark
>> (since the data I want to persist is already loaded in memory).
>>
>> Thank you all!
>>
>>
>
>


Re: bulk-upsert spark phoenix

2016-09-28 Thread Antonio Murgia
Thank you very much for your insights Josh, if I decide to develop a 
small Phoenix Library that does, through Spark, what the CSV loader 
does, I'll surely write to the mailing list, or open a Jira, or maybe 
even open a PR, right?


Thank you again

#A.M.


On 09/28/2016 05:10 PM, Josh Mahonin wrote:

Hi Antonio,

You're correct, the phoenix-spark output uses the Phoenix Hadoop 
OutputFormat under the hood, which effectively does a parallel, batch 
JDBC upsert. It should scale depending on the number of Spark 
executors, RDD/DataFrame parallelism, and number of HBase 
RegionServers, though admittedly there's a lot of overhead involved.


The CSV Bulk loading tool uses MapReduce, it's not integrated with 
Spark. It's likely possible to do so, but it's probably a non-trivial 
amount of work. If you're interested in taking it on, I'd start with 
looking at the following classes:


https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/mapreduce/CsvBulkLoadTool.java
https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/mapreduce/AbstractBulkLoadTool.java
https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/mapreduce/PhoenixOutputFormat.java
https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/mapreduce/PhoenixRecordWriter.java
https://github.com/apache/phoenix/blob/master/phoenix-spark/src/main/scala/org/apache/phoenix/spark/DataFrameFunctions.scala

Good luck,

Josh

On Tue, Sep 27, 2016 at 10:43 AM, Antonio Murgia  wrote:

Hi,

I would like to perform a Bulk insert to HBase using Apache Phoenix from
Spark. I tried using Apache Spark Phoenix library but, as far as I was
able to understand from the code, it looks like it performs a jdbc batch
of upserts (am I right?). Instead I want to perform a Bulk load like the
one described in this blog post
(https://zeyuanxy.github.io/HBase-Bulk-Loading/) but taking advance of
the automatic transformation between java/scala types to Bytes.

I'm actually using phoenix 4.5.2, therefore I cannot use hive to
manipulate the phoenix table, and if it possible i want to avoid to
spawn a MR job that reads data from csv
(https://phoenix.apache.org/bulk_dataload.html). Actually i just want to
do what the csv loader is doing with MR but programmatically with Spark
(since the data I want to persist is already loaded in memory).

Thank you all!






Re: bulk-upsert spark phoenix

2016-09-28 Thread Josh Mahonin
Hi Antonio,

You're correct, the phoenix-spark output uses the Phoenix Hadoop
OutputFormat under the hood, which effectively does a parallel, batch JDBC
upsert. It should scale depending on the number of Spark executors,
RDD/DataFrame parallelism, and number of HBase RegionServers, though
admittedly there's a lot of overhead involved.

The CSV Bulk loading tool uses MapReduce, it's not integrated with Spark.
It's likely possible to do so, but it's probably a non-trivial amount of
work. If you're interested in taking it on, I'd start with looking at the
following classes:

https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/mapreduce/CsvBulkLoadTool.java
https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/mapreduce/AbstractBulkLoadTool.java
https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/mapreduce/PhoenixOutputFormat.java
https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/mapreduce/PhoenixRecordWriter.java
https://github.com/apache/phoenix/blob/master/phoenix-spark/src/main/scala/org/apache/phoenix/spark/DataFrameFunctions.scala

Good luck,

Josh

On Tue, Sep 27, 2016 at 10:43 AM, Antonio Murgia 
wrote:

> Hi,
>
> I would like to perform a Bulk insert to HBase using Apache Phoenix from
> Spark. I tried using Apache Spark Phoenix library but, as far as I was
> able to understand from the code, it looks like it performs a jdbc batch
> of upserts (am I right?). Instead I want to perform a Bulk load like the
> one described in this blog post
> (https://zeyuanxy.github.io/HBase-Bulk-Loading/) but taking advance of
> the automatic transformation between java/scala types to Bytes.
>
> I'm actually using phoenix 4.5.2, therefore I cannot use hive to
> manipulate the phoenix table, and if it possible i want to avoid to
> spawn a MR job that reads data from csv
> (https://phoenix.apache.org/bulk_dataload.html). Actually i just want to
> do what the csv loader is doing with MR but programmatically with Spark
> (since the data I want to persist is already loaded in memory).
>
> Thank you all!
>
>


Re: Loading via MapReduce, Not Moving HFiles to HBase

2016-09-28 Thread Gabriel Reid
Hi Ravi,

It looks like those log file entries you posted are from a mapreduce task.
Could you post the output of the command that you're using to start the
actual job (i.e. console output of "hadoop jar ...").

- Gabriel

On Wed, Sep 28, 2016 at 1:49 PM, Ravi Kumar Bommada 
wrote:

> Hi All,
>
> I’m trying to load data via phoenix mapreduce referring to below screen:
>
>
> HFiles are getting created, each HFile is of size 300MB and 176 such
> HFiles are there, but after that files are not moving to HBase. i.e when
> I’m querying HBase I’m not able to see data.According to the logs below
> data commit is successful.
>
> Please suggest, if I’m missing any configuration.
>
> Provided:
>
> Using property: -Dhbase.mapreduce.bulkload.max.hfiles.perRegion.
> perFamily=1024
>
> Last Few Logs:
> 2016-09-27 07:27:35,845 INFO [main] org.apache.hadoop.io.compress.CodecPool:
> Got brand-new decompressor [.snappy]
> 2016-09-27 07:27:35,846 INFO [main] org.apache.hadoop.io.compress.CodecPool:
> Got brand-new decompressor [.snappy]
> 2016-09-27 07:27:35,846 INFO [main] org.apache.hadoop.io.compress.CodecPool:
> Got brand-new decompressor [.snappy]
> 2016-09-27 07:27:35,846 INFO [main] org.apache.hadoop.mapred.Merger:
> Merging 64 intermediate segments out of a total of 127
> 2016-09-27 07:28:21,238 INFO [main] org.apache.hadoop.mapred.Merger: Down
> to the last merge-pass, with 64 segments left of total size: -40111574372
> bytes
> 2016-09-27 07:30:24,933 INFO [main] org.apache.hadoop.mapred.Merger:
> Merging 179 sorted segments
> 2016-09-27 07:30:24,965 INFO [main] org.apache.hadoop.mapred.Merger: Down
> to the last merge-pass, with 0 segments left of total size: 4736 bytes
> 2016-09-27 07:30:24,967 INFO [main] org.apache.hadoop.mapred.Merger:
> Merging 179 sorted segments
> 2016-09-27 07:30:24,999 INFO [main] org.apache.hadoop.mapred.Merger: Down
> to the last merge-pass, with 0 segments left of total size: 4736 bytes
> 2016-09-27 07:30:25,000 INFO [main] org.apache.hadoop.mapred.Merger:
> Merging 179 sorted segments
> 2016-09-27 07:30:25,033 INFO [main] org.apache.hadoop.mapred.Merger: Down
> to the last merge-pass, with 0 segments left of total size: 4736 bytes
> 2016-09-27 07:30:25,035 INFO [main] org.apache.hadoop.mapred.Merger:
> Merging 179 sorted segments
> 2016-09-27 07:30:25,068 INFO [main] org.apache.hadoop.mapred.Merger: Down
> to the last merge-pass, with 0 segments left of total size: 4736 bytes
> 2016-09-27 07:30:25,723 INFO [main] org.apache.hadoop.mapred.Task:
> Task:attempt_1467713708066_29809_m_16_0 is done. And is in the
> process of committing
> 2016-09-27 07:30:25,788 INFO [main] org.apache.hadoop.mapred.Task: Task
> 'attempt_1467713708066_29809_m_16_0' done.
>
>
> Regard’s
>
> Ravi Kumar B
> Mob: +91 9591144511
>
>
>
>


Loading via MapReduce, Not Moving HFiles to HBase

2016-09-28 Thread Ravi Kumar Bommada
Hi All,

I'm trying to load data via phoenix mapreduce referring to below screen:



HFiles are getting created; each HFile is 300 MB in size and there are 176 such 
HFiles, but after that the files are not moving to HBase, i.e. when I'm querying 
HBase I'm not able to see data. According to the logs below, the data commit is 
successful.

Please suggest, if I'm missing any configuration.

Provided:

Using property: -Dhbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily=1024

Last Few Logs:
2016-09-27 07:27:35,845 INFO [main] org.apache.hadoop.io.compress.CodecPool: 
Got brand-new decompressor [.snappy]
2016-09-27 07:27:35,846 INFO [main] org.apache.hadoop.io.compress.CodecPool: 
Got brand-new decompressor [.snappy]
2016-09-27 07:27:35,846 INFO [main] org.apache.hadoop.io.compress.CodecPool: 
Got brand-new decompressor [.snappy]
2016-09-27 07:27:35,846 INFO [main] org.apache.hadoop.mapred.Merger: Merging 64 
intermediate segments out of a total of 127
2016-09-27 07:28:21,238 INFO [main] org.apache.hadoop.mapred.Merger: Down to 
the last merge-pass, with 64 segments left of total size: -40111574372 bytes
2016-09-27 07:30:24,933 INFO [main] org.apache.hadoop.mapred.Merger: Merging 
179 sorted segments
2016-09-27 07:30:24,965 INFO [main] org.apache.hadoop.mapred.Merger: Down to 
the last merge-pass, with 0 segments left of total size: 4736 bytes
2016-09-27 07:30:24,967 INFO [main] org.apache.hadoop.mapred.Merger: Merging 
179 sorted segments
2016-09-27 07:30:24,999 INFO [main] org.apache.hadoop.mapred.Merger: Down to 
the last merge-pass, with 0 segments left of total size: 4736 bytes
2016-09-27 07:30:25,000 INFO [main] org.apache.hadoop.mapred.Merger: Merging 
179 sorted segments
2016-09-27 07:30:25,033 INFO [main] org.apache.hadoop.mapred.Merger: Down to 
the last merge-pass, with 0 segments left of total size: 4736 bytes
2016-09-27 07:30:25,035 INFO [main] org.apache.hadoop.mapred.Merger: Merging 
179 sorted segments
2016-09-27 07:30:25,068 INFO [main] org.apache.hadoop.mapred.Merger: Down to 
the last merge-pass, with 0 segments left of total size: 4736 bytes
2016-09-27 07:30:25,723 INFO [main] org.apache.hadoop.mapred.Task: 
Task:attempt_1467713708066_29809_m_16_0 is done. And is in the process of 
committing
2016-09-27 07:30:25,788 INFO [main] org.apache.hadoop.mapred.Task: Task 
'attempt_1467713708066_29809_m_16_0' done.


Regard's

Ravi Kumar B
Mob: +91 9591144511





Recall: Loading via MapReduce, Not Moving HFiles to HBase

2016-09-28 Thread Ravi Kumar Bommada
Ravi Kumar Bommada would like to recall the message, "Loading via MapReduce, 
Not Moving HFiles to HBase".

Loading via MapReduce, Not Moving HFiles to HBase

2016-09-28 Thread Ravi Kumar Bommada
Hi All,

I'm trying to load data via phoenix mapreduce referring to below screen:



HFiles are getting created; each HFile is 300 MB in size and there are 176 such 
HFiles, but after that the files are not moving to HBase, i.e. when I'm querying 
HBase I'm not able to see data. According to the logs below, the data commit is 
successful.

Please suggest, if I'm missing any configuration.

Provided:
Command Executed:

HADOOP_CLASSPATH=/opt/cloudera/parcels/CDH/lib/hbase/hbase-protocol-1.0.0-cdh5.5.1.jar:/opt/cloudera/parcels/CLABS_PHOENIX-4.5.2-1.clabs_phoenix1.2.0.p0.774/lib/phoenix/phoenix-1.2.0-client.jar:/opt/cloudera/parcels/CDH/lib/hbase/conf
 hadoop jar jtac-case-import-0.0.1-SNAPSHOT.jar 
net.juniper.cs.sr.bulkload.edge.index.InstallBaseIndexBulkLoadTool 
-Dhbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily=1024 --table 
 --input /tmp/ibase/details/chunks/07 --output 
/tmp/output/ibindexedgedeltaload10 --zookeeper 

Last Few Logs:
2016-09-27 07:27:35,845 INFO [main] org.apache.hadoop.io.compress.CodecPool: 
Got brand-new decompressor [.snappy]
2016-09-27 07:27:35,846 INFO [main] org.apache.hadoop.io.compress.CodecPool: 
Got brand-new decompressor [.snappy]
2016-09-27 07:27:35,846 INFO [main] org.apache.hadoop.io.compress.CodecPool: 
Got brand-new decompressor [.snappy]
2016-09-27 07:27:35,846 INFO [main] org.apache.hadoop.mapred.Merger: Merging 64 
intermediate segments out of a total of 127
2016-09-27 07:28:21,238 INFO [main] org.apache.hadoop.mapred.Merger: Down to 
the last merge-pass, with 64 segments left of total size: -40111574372 bytes
2016-09-27 07:30:24,933 INFO [main] org.apache.hadoop.mapred.Merger: Merging 
179 sorted segments
2016-09-27 07:30:24,965 INFO [main] org.apache.hadoop.mapred.Merger: Down to 
the last merge-pass, with 0 segments left of total size: 4736 bytes
2016-09-27 07:30:24,967 INFO [main] org.apache.hadoop.mapred.Merger: Merging 
179 sorted segments
2016-09-27 07:30:24,999 INFO [main] org.apache.hadoop.mapred.Merger: Down to 
the last merge-pass, with 0 segments left of total size: 4736 bytes
2016-09-27 07:30:25,000 INFO [main] org.apache.hadoop.mapred.Merger: Merging 
179 sorted segments
2016-09-27 07:30:25,033 INFO [main] org.apache.hadoop.mapred.Merger: Down to 
the last merge-pass, with 0 segments left of total size: 4736 bytes
2016-09-27 07:30:25,035 INFO [main] org.apache.hadoop.mapred.Merger: Merging 
179 sorted segments
2016-09-27 07:30:25,068 INFO [main] org.apache.hadoop.mapred.Merger: Down to 
the last merge-pass, with 0 segments left of total size: 4736 bytes
2016-09-27 07:30:25,723 INFO [main] org.apache.hadoop.mapred.Task: 
Task:attempt_1467713708066_29809_m_16_0 is done. And is in the process of 
committing
2016-09-27 07:30:25,788 INFO [main] org.apache.hadoop.mapred.Task: Task 
'attempt_1467713708066_29809_m_16_0' done.


Regard's

Ravi Kumar B
Mob: +91 9591144511





Re: Phoenix ResultSet.next() takes a long time for first row

2016-09-28 Thread Ankit Singhal
Sorry Sasi, missed your last mails.

It seems that you have one region in the table, or the query is touching one
region because of the monotonically increasing key ['MK00100','YOU',4].
The varying performance is because you may have filters which are aggressive and
skip lots of rows in between (*0* (7965 ms), *2041* (7155 ms), *4126* (1630 ms)),
and that's why the server is taking time.

Can you try after salting the table?
https://phoenix.apache.org/salted.html
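
A minimal sketch of what salting looks like in the DDL (the table, columns, and
bucket count below are made-up examples, not your actual schema):

    -- Hypothetical table; SALT_BUCKETS spreads monotonically increasing keys
    -- across regions so a single region server does not become a hotspot.
    CREATE TABLE IF NOT EXISTS SPL_SALTED (
        col1 VARCHAR NOT NULL,
        col2 VARCHAR NOT NULL,
        col7 TIMESTAMP NOT NULL,
        CONSTRAINT pk PRIMARY KEY (col1, col2, col7)
    ) SALT_BUCKETS = 16;
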




On Wed, Sep 28, 2016 at 10:47 AM, Sasikumar Natarajan 
wrote:

> Any one has suggestions for the performance issue discussed in this
> thread?. Your suggestions would help me resolve this issue.
>
> Infrastructure details:
>
> Azure HDInsight HBase
>
> Type        Node Size   Cores   Nodes
> Head        D3 V2       8       2
> Region      D3 V2       16      4
> ZooKeeper   D3 V2       12      3
> Thanks,
> Sasikumar Natarajan.
>
>
> On Fri, Sep 23, 2016 at 7:57 AM, Sasikumar Natarajan 
> wrote:
>
>> Also its not only the first time it takes time when we call
>> ResultSet.next().
>>
>> When we iterate over ResultSet, it takes a long time initially and then
>> iterates faster. Again after few iterations, it takes sometime and this
>> goes on.
>>
>>
>>
>> Sample observation:
>>
>>
>>
>> Total Rows available on ResultSet : 5130
>>
>> Statement.executeQuery() has taken : 702 ms
>>
>> ResultSet Indices at which long time has been taken : *0*  (7965 ms),
>> *2041* (7155 ms), *4126 *(1630 ms)
>>
>> On Fri, Sep 23, 2016 at 7:52 AM, Sasikumar Natarajan 
>> wrote:
>>
>>> Hi Ankit,
>>>Where does the server processing happens, on the HBase
>>> cluster or the server where Phoenix core runs.
>>>
>>> PFB the details you have asked for,
>>>
>>> Query:
>>>
>>> SELECT col1, col2, col5, col7, col11, col12 FROM SPL_FINAL where
>>> col1='MK00100' and col2='YOU' and col3=4 and col5 in (?,?,?,?,?) and ((col7
>>> between to_date('2016-08-01 00:00:00.000') and to_date('2016-08-05
>>> 23:59:59.000')) or (col8 between to_date('2016-08-01 00:00:00.000') and
>>> to_date('2016-08-05 23:59:59.000')))
>>>
>>>
>>> Explain plan:
>>>
>>> CLIENT 1-CHUNK PARALLEL 1-WAY RANGE SCAN OVER SPL_FINAL
>>> ['MK00100','YOU',4]
>>> SERVER FILTER BY (COL5 IN ('100','101','105','234','653') AND
>>> ((COL7 >= TIMESTAMP '2016-08-01 00:00:00.000' AND COL7 <= TIMESTAMP
>>> '2016-08-05 23:59:59.000') OR (COL8 >= TIMESTAMP '2016-08-01 00:00:00.000'
>>> AND COL8 <= TIMESTAMP '2016-08-05 23:59:59.000')))
>>> DDL:
>>>
>>> CREATE TABLE IF NOT EXISTS SPL_FINAL
>>> (col1 VARCHAR NOT NULL,
>>> col2 VARCHAR NOT NULL,
>>> col3 INTEGER NOT NULL,
>>> col4 INTEGER NOT NULL,
>>> col5 VARCHAR NOT NULL,
>>> col6 VARCHAR NOT NULL,
>>> col7 TIMESTAMP NOT NULL,
>>> col8 TIMESTAMP NOT NULL,
>>> ext.col9 VARCHAR,
>>> ext.col10 VARCHAR,
>>> pri.col11 VARCHAR[], //this column contains 3600 items in every row
>>> pri.col12 VARCHAR
>>> ext.col13 BOOLEAN
>>> CONSTRAINT SPL_FINAL_PK PRIMARY KEY (col1, col2, col3, col4, col5, col6,
>>> col7, col8)) COMPRESSION='SNAPPY';
>>>
>>> Thanks,
>>> Sasikumar Natarajan.
>>>
>>> On Thu, Sep 22, 2016 at 12:36 PM, Ankit Singhal <
>>> ankitsingha...@gmail.com> wrote:
>>>
 Share some more details about the query, DDL and explain plan. In
 Phoenix, there are cases where we do some server processing at the time
 when rs.next() is called first time but subsequent next() should be faster.

 On Thu, Sep 22, 2016 at 9:52 AM, Sasikumar Natarajan  wrote:

> Hi,
> I'm using Apache Phoenix core 4.4.0-HBase-1.1 library to query the
> data available on Phoenix server.
>
> preparedStatement.executeQuery()  seems to be taking less time. But
> to enter into *while (rs.next()) {} *takes a long time. I would like
> to know what is causing the delay to make the ResultSet ready. Please 
> share
> your thoughts on this.
>
>
> --
> Regards,
> Sasikumar Natarajan
>


>>>
>>>
>>> --
>>> Regards,
>>> Sasikumar Natarajan
>>>
>>
>>
>>
>> --
>> Regards,
>> Sasikumar Natarajan
>>
>
>
>
> --
> Regards,
> Sasikumar Natarajan
>


Phoenix Bulkload MapReduce Performance Issue

2016-09-28 Thread Ravi Kumar Bommada
Hi All,


I'm running Phoenix MapReduce with 20 input files, each of size 100 MB. All map 
and reduce tasks completed in 45 minutes, except the last reduce task, which is 
taking 3 hrs to complete.

Please suggest how I can optimize this.

Regard's

Ravi Kumar B
Mob: +91 9591144511