Re: Kerberos ticket renewal

2016-03-19 Thread Sanooj Padmakumar
This is the error in the log when it fails

ERROR org.apache.hadoop.security.UserGroupInformation -
PriviledgedActionException as: (auth:KERBEROS)
cause:javax.security.sasl.SaslException: GSS initiate failed [Caused by
GSSException: No valid credentials provided (Mechanism level: Failed to
find any Kerberos tgt)]

On Wed, Mar 16, 2016 at 8:35 PM, Sanooj Padmakumar 
wrote:

> Hi Anil
>
> Thanks for your reply.
>
> We do not do anything explicitly in the code to do the ticket renwal ,
> what we do is run a cron job for the user for which the ticket has to be
> renewed.  But with this approach we need a restart to get the thing going
> after the ticket expiry
>
> We use the following connection url for getting the phoenix connection
> jdbc:phoenix:::/hbase:: keytab>
>
> This along with the entries in hbase-site.xml & core-site.xml are passed
> to the connection object
>
> Thanks
> Sanooj Padmakumar
>
> On Tue, Mar 15, 2016 at 12:04 AM, anil gupta 
> wrote:
>
>> Hi,
>>
>> At my previous job, we had web-services fetching data from a secure hbase
>> cluster. We never needed to renew the lease by restarting webserver. Our
>> app used to renew the ticket. I think, Phoenix/HBase already handles
>> renewing ticket. Maybe you need to look into your kerberos environment
>> settings.  How are you authenticating with Phoenix/HBase?
>> Sorry, I dont remember the exact kerberos setting that we had.
>>
>> HTH,
>> Anil Gupta
>>
>> On Mon, Mar 14, 2016 at 11:00 AM, Sanooj Padmakumar 
>> wrote:
>>
>>> Hi
>>>
>>> We have a rest style micro service application fetching data from hbase
>>> using Phoenix. The cluster is kerberos secured and we run a cron to renew
>>> the kerberos ticket on the machine where the micro service is deployed.
>>>
>>> But it always needs a restart of micro service java process to get the
>>> kerberos ticket working once after its expired.
>>>
>>> Is there a way I can avoid this restart?
>>>
>>> Any pointers will be very helpful. Thanks
>>>
>>> PS : We have a Solr based micro service which works without a restart.
>>>
>>> Regards
>>> Sanooj
>>>
>>
>>
>>
>> --
>> Thanks & Regards,
>> Anil Gupta
>>
>
>
>
> --
> Thanks,
> Sanooj Padmakumar
>



-- 
Thanks,
Sanooj Padmakumar


Adding table compression

2016-03-19 Thread Michael McAllister
All

Are there any known issues if we use the hbase shell to alter a phoenix table 
to apply compression? We're currently using Phoenix 4.4 on HDP 2.3.4.

I plan on testing, but also want to double check for any gotchas.

Michael McAllister
Staff Data Warehouse Engineer | Decision Systems
mmcallis...@homeaway.com | C: 512.423.7447


Re: Does phoenix CsvBulkLoadTool write to WAL/Memstore

2016-03-19 Thread Pariksheet Barapatre
Hi Vamsi,

How many rows are you expecting out of your transformation, and what is
the frequency of the job?

If there are relatively few rows (< ~100K, though this depends on cluster
size as well), you can go ahead with the phoenix-spark plug-in and increase
the batch size to accommodate more rows; otherwise use the CsvBulkLoadTool.

Thanks
Pari

On 16 March 2016 at 20:03, Vamsi Krishna  wrote:

> Thanks Gabriel & Ravi.
>
> I have a data processing job wirtten in Spark-Scala.
> I do a join on data from 2 data files (CSV files) and do data
> transformation on the resulting data. Finally load the transformed data
> into phoenix table using Phoenix-Spark plugin.
> On seeing that Phoenix-Spark plugin goes through regular HBase write path
> (writes to WAL), i'm thinking of option 2 to reduce the job execution time.
>
> *Option 2:* Do data transformation in Spark and write the transformed
> data to a CSV file and use Phoenix CsvBulkLoadTool to load data into
> Phoenix table.
>
> Has anyone tried this kind of exercise? Any thoughts.
>
> Thanks,
> Vamsi Attluri
>
> On Tue, Mar 15, 2016 at 9:40 PM Ravi Kiran 
> wrote:
>
>> Hi Vamsi,
>>The upserts through Phoenix-spark plugin definitely go through WAL .
>>
>>
>> On Tue, Mar 15, 2016 at 5:56 AM, Gabriel Reid 
>> wrote:
>>
>>> Hi Vamsi,
>>>
>>> I can't answer your question abotu the Phoenix-Spark plugin (although
>>> I'm sure that someone else here can).
>>>
>>> However, I can tell you that the CsvBulkLoadTool does not write to the
>>> WAL or to the Memstore. It simply writes HFiles and then hands those
>>> HFiles over to HBase, so the memstore and WAL are never
>>> touched/affected by this.
>>>
>>> - Gabriel
>>>
>>>
>>> On Tue, Mar 15, 2016 at 1:41 PM, Vamsi Krishna 
>>> wrote:
>>> > Team,
>>> >
>>> > Does phoenix CsvBulkLoadTool write to HBase WAL/Memstore?
>>> >
>>> > Phoenix-Spark plugin:
>>> > Does saveToPhoenix method on RDD[Tuple] write to HBase WAL/Memstore?
>>> >
>>> > Thanks,
>>> > Vamsi Attluri
>>> > --
>>> > Vamsi Attluri
>>>
>>
>> --
> Vamsi Attluri
>



-- 
Cheers,
Pari


Re: Problem Updating Stats

2016-03-19 Thread Benjamin Kim
Ankit,

I did not see any problems when connecting with the Phoenix sqlline client. So,
below is what you asked for. I hope that you can give us insight into
fixing this.

hbase(main):005:0> describe 'SYSTEM.STATS'
Table SYSTEM.STATS is ENABLED

SYSTEM.STATS, {TABLE_ATTRIBUTES => {
  coprocessor$1 => '|org.apache.phoenix.coprocessor.ScanRegionObserver|805306366|',
  coprocessor$2 => '|org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver|805306366|',
  coprocessor$3 => '|org.apache.phoenix.coprocessor.GroupedAggregateRegionObserver|805306366|',
  coprocessor$4 => '|org.apache.phoenix.coprocessor.ServerCachingEndpointImpl|805306366|',
  coprocessor$5 => '|org.apache.hadoop.hbase.coprocessor.MultiRowMutationEndpoint|805306366|',
  coprocessor$6 => '|org.apache.hadoop.hbase.regionserver.LocalIndexSplitter|805306366|',
  METADATA => {'SPLIT_POLICY' => 'org.apache.phoenix.schema.MetaDataSplitPolicy'}}

COLUMN FAMILIES DESCRIPTION

{NAME => '0', DATA_BLOCK_ENCODING => 'FAST_DIFF', BLOOMFILTER => 'ROW',
 REPLICATION_SCOPE => '0', VERSIONS => '3', COMPRESSION => 'NONE', MIN_VERSIONS => '0',
 TTL => 'FOREVER', KEEP_DELETED_CELLS => 'true', BLOCKSIZE => '65536',
 IN_MEMORY => 'false', BLOCKCACHE => 'true'}

1 row(s) in 0.0280 seconds

Thanks,
Ben


> On Mar 15, 2016, at 11:59 PM, Ankit Singhal  wrote:
> 
> Yes it seems to. 
> Did you get any error related to SYSTEM.STATS when the client is connected 
> first time ?
> 
> can you please describe your system.stats table and paste the output here.
> 
> On Wed, Mar 16, 2016 at 3:24 AM, Benjamin Kim  > wrote:
> When trying to run update status on an existing table in hbase, I get error:
> Update stats:
> UPDATE STATISTICS "ops_csv" ALL
> error:
> ERROR 504 (42703): Undefined column. columnName=REGION_NAME
> Looks like the meta data information is messed up, ie. there is no column 
> with name REGION_NAME in this table.
> I see similar errors for other tables that we currently have in hbase.
> 
> We are using CDH 5.5.2, HBase 1.0.0, and Phoenix 4.5.2.
> 
> Thanks,
> Ben
> 



Re: Problem Updating Stats

2016-03-19 Thread Ankit Singhal
OK, or you could have dropped the SYSTEM.STATS table from a SQL client at
CURRENT_SCN=7 and reconnected the client. If the client doesn't see this table,
it will create it automatically.
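
In case it helps, roughly what that looks like from JDBC (the ZooKeeper quorum
is a placeholder, and this is only a sketch of the CurrentSCN approach, not a
tested recipe; CurrentSCN is the standard Phoenix connection property):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.Properties;

public class DropStatsTable {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Connect "in the past" so the client is allowed to drop the system table.
        props.setProperty("CurrentSCN", "7");
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181", props);
             Statement stmt = conn.createStatement()) {
            stmt.execute("DROP TABLE SYSTEM.STATS");
        }
        // Reconnect without CurrentSCN; the client recreates SYSTEM.STATS on its own.
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181")) {
            // Nothing else to do here.
        }
    }
}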

On Thu, Mar 17, 2016 at 3:14 AM, Benjamin Kim  wrote:

> I got it to work by uninstalling Phoenix and reinstalling it again. I had
> to wipe clean all components.
>
> Thanks,
> Ben
>
>
> On Mar 16, 2016, at 10:47 AM, Ankit Singhal 
> wrote:
>
> It seems from the attached logs that you have upgraded phoenix to 4.7
> version and now you are using old client to connect with it.
> "Update statistics" command and guideposts will not work with old client
> after upgradation to 4.7, you need to use the new client for such
> operations.
>
> On Wed, Mar 16, 2016 at 10:55 PM, Benjamin Kim  wrote:
>
>> | TABLE_CAT | TABLE_SCHEM | TABLE_NAME | COLUMN_NAME            | DATA_TYPE |
>> +-----------+-------------+------------+------------------------+-----------+
>> |           | SYSTEM      | STATS      | PHYSICAL_NAME          | 12        |
>> |           | SYSTEM      | STATS      | COLUMN_FAMILY          | 12        |
>> |           | SYSTEM      | STATS      | GUIDE_POST_KEY         | -3        |
>> |           | SYSTEM      | STATS      | GUIDE_POSTS_WIDTH      | -5        |
>> |           | SYSTEM      | STATS      | LAST_STATS_UPDATE_TIME | 91        |
>> |           | SYSTEM      | STATS      | GUIDE_POSTS_ROW_COUNT  | -5        |
>>
>> I have attached the SYSTEM.CATALOG contents.
>>
>> Thanks,
>> Ben
>>
>>
>>
>> On Mar 16, 2016, at 9:34 AM, Ankit Singhal 
>> wrote:
>>
>> Sorry Ben, I may not be clear in first comment but I need you to describe
>> SYSTEM.STATS in some sql client so that I can see the columns present.
>> And also please scan 'SYSTEM.CATALOG' ,{RAW=>true} in hbase shell and
>> attach a output here.
>>
>> On Wed, Mar 16, 2016 at 8:55 PM, Benjamin Kim  wrote:
>>
>>> Ankit,
>>>
>>> I did not see any problems when connecting with the phoenix sqlline
>>> client. So, below is the what you asked for. I hope that you can give us
>>> insight into fixing this.
>>>
>>> hbase(main):005:0> describe 'SYSTEM.STATS'
>>> Table SYSTEM.STATS is ENABLED
>>>
>>>
>>> SYSTEM.STATS, {TABLE_ATTRIBUTES => {coprocessor$1 =>
>>> '|org.apache.phoenix.coprocessor.ScanRegionObserver|805306366|',
>>> coprocessor$2 => '|org.apache.phoenix.coprocessor.UngroupedAggr
>>> egateRegionObserver|805306366|', coprocessor$3 =>
>>> '|org.apache.phoenix.coprocessor.GroupedAggregateRegionObserver|805306366|',
>>> coprocessor$4 => '|org.apache.phoenix.coprocessor.Serv
>>> erCachingEndpointImpl|805306366|', coprocessor$5 =>
>>> '|org.apache.hadoop.hbase.coprocessor.MultiRowMutationEndpoint|805306366|',
>>> coprocessor$6 => '|org.apache.hadoop.hbase.regionserv
>>> er.LocalIndexSplitter|805306366|', METADATA => {'SPLIT_POLICY' =>
>>> 'org.apache.phoenix.schema.MetaDataSplitPolicy'}}
>>>
>>> COLUMN FAMILIES DESCRIPTION
>>>
>>>
>>> {NAME => '0', DATA_BLOCK_ENCODING => 'FAST_DIFF', BLOOMFILTER => 'ROW',
>>> REPLICATION_SCOPE => '0', VERSIONS => '3', COMPRESSION => 'NONE',
>>> MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP
>>> _DELETED_CELLS => 'true', BLOCKSIZE => '65536', IN_MEMORY => 'false',
>>> BLOCKCACHE => 'true'}
>>>
>>> 1 row(s) in 0.0280 seconds
>>>
>>> Thanks,
>>> Ben
>>>
>>>
>>> On Mar 15, 2016, at 11:59 PM, Ankit Singhal 
>>> wrote:
>>>
>>> Yes it seems to.
>>> Did you get any error related to SYSTEM.STATS when the client is
>>> connected first time ?
>>>
>>> can you please describe your system.stats table and paste the output
>>> here.
>>>
>>> On Wed, Mar 16, 2016 at 3:24 AM, Benjamin Kim 
>>> wrote:
>>>
 When trying to run update status on an existing table in hbase, I get
 error:

 Update stats:

 UPDATE STATISTICS "ops_csv" ALL

 error:

 ERROR 504 (42703): Undefined column. columnName=REGION_NAME

 Looks like the meta data information is messed up, ie. there is no
 column with name REGION_NAME in this table.

 I see similar errors for other tables that we currently have in hbase.

 We are using CDH 5.5.2, HBase 1.0.0, and Phoenix 4.5.2.

 Thanks,
 Ben

>>>
>>>
>>>
>>
>>
>>
>
>


Re: Implement Custom Aggregate Functions in Phoenix

2016-03-19 Thread James Taylor
No, custom UDFs can be added dynamically as described here:
https://phoenix.apache.org/udf.html. No need to re-build Phoenix. It's just
custom aggregates that would require rebuilding.

FYI, we have support for UPPER and LOWER already.
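
For example, a rough sketch of the dynamic route from that page (class name,
jar location, and table are made-up placeholders; the client also needs
phoenix.functions.allowUserDefinedFunctions=true):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class RegisterUdfExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181");
             Statement stmt = conn.createStatement()) {
            // Register a scalar UDF whose class lives in a jar under hbase.dynamic.jars.dir.
            stmt.execute("CREATE FUNCTION my_reverse(VARCHAR) RETURNS VARCHAR "
                    + "AS 'com.mycompany.udf.MyReverseFunction' "
                    + "USING JAR 'hdfs://namenode:8020/hbase/lib/my-udf.jar'");
            // Use it like any built-in function.
            try (ResultSet rs = stmt.executeQuery("SELECT my_reverse(NAME) FROM MY_TABLE")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
        }
    }
}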

On Thu, Mar 17, 2016 at 6:09 PM, Swapna Swapna 
wrote:

> Thank you James for swift response.
>
> is the process (adding to phoenix-core and rebuild the jar)  remains the
> same for custom UDF's as well  (like as for custom aggregate functions)?
>
> ex: we have UDF's like  UPPER, LOWER ,etc
>
> On Thu, Mar 17, 2016 at 5:53 PM, James Taylor 
> wrote:
>
>> Hi Swapna,
>> We don't support custom aggregate functions, only scalar functions
>> (see PHOENIX-2069). For a custom aggregate function, you'd need to add them
>> to phoenix-core and rebuild the jar. We're open to adding them to the code
>> base if they're general enough. That's how FIRST_VALUE, LAST_VALUE, and
>> NTH_VALUE made it in.
>> Thanks,
>> James
>>
>> On Thu, Mar 17, 2016 at 12:11 PM, Swapna Swapna 
>> wrote:
>>
>>> Hi,
>>>
>>> I found this in Phoenix UDF documentation:
>>>
>>>- After compiling your code to a jar, you need to deploy the jar
>>>into the HDFS. It would be better to add the jar to HDFS folder 
>>> configured
>>>for hbase.dynamic.jars.dir.
>>>
>>>
>>> My question is, can that be any 'udf-user-specific' jar which need to be
>>> copied to HDFS or would it need to register the function and update the
>>> custom UDF classes inside phoenix-core.jar and rebuild the
>>> 'phoenix-core.jar'
>>>
>>> Regards
>>> Swapna
>>>
>>>
>>>
>>>
>>> On Fri, Jan 29, 2016 at 6:31 PM, James Taylor 
>>> wrote:
>>>
 Hi Swapna,
 We currently don't support custom aggregate UDF, and it looks like you
 found the JIRA here: PHOENIX-2069. It would be a natural extension of UDFs.
 Would be great to capture your use case and requirements on the JIRA to
 make sure the functionality will meet your needs.
 Thanks,
 James

 On Fri, Jan 29, 2016 at 1:47 PM, Swapna Swapna 
 wrote:

> Hi,
>
> I would like to know the approach to implement and register custom
> aggregate functions in Phoenix like the way we have built-in aggregate
> functions like SUM, COUNT,etc
>
> Please help.
>
> Thanks
> Swapna
>


>>>
>>
>


Re: array support issue

2016-03-19 Thread James Taylor
How can users know what to expect when they're using an undocumented,
unsupported, non public API?

On Thu, Mar 17, 2016 at 6:20 PM, Nick Dimiduk  wrote:

> > Applications should never query the SYSTEM.CATALOG directly. Instead
> they should go through the DatabaseMetaData interface from
> Connection.getMetaData().
>
> I may have this detail wrong, but the point remains: applications are
> getting an incorrect value, or misinterpreting the correct value they
> receive. From what I can see, this issue is unique to Phoenix.
>
> On Thursday, March 10, 2016, James Taylor  wrote:
>
>> Applications should never query the SYSTEM.CATALOG directly. Instead they
>> should go through the DatabaseMetaData interface from
>> Connection.getMetaData(). For column type information, you'd use the
>> DatabaseMetaData#getColumn method[1] which would return the standard SQL
>> type for ARRAY in the "DATA_TYPE" or resultSet.getInt(5) and the type name
>> in the "TYPE_NAME" or resultSet.getString(6). For arrays, the String is of
>> the form " ARRAY". The max length or precision would in
>> the "COLUMN_SIZE" or resultSet.getInt(7).
>>
>> If we do want query access to the system catalog table, we should
>> implement the INFORMATION_SCHEMA (PHOENIX-24) by creating a view on top of
>> the SYSTEM.CATALOG.
>>
>> Thanks,
>> James
>>
>>
>> [1]
>> https://docs.oracle.com/javase/7/docs/api/java/sql/DatabaseMetaData.html#getColumns(java.lang.String,%20java.lang.String,%20java.lang.String,%20java.lang.String)
>>
>> On Thu, Mar 10, 2016 at 8:14 AM, Nick Dimiduk  wrote:
>>
>>> I believe I've seen something similar while interfacing between Apache
>>> NiFi and Phoenix. From the bit of debugging I did yesterday with my
>>> colleague, NiFi is querying the system table for schema information about
>>> the target table. My target table has a VARCHAR ARRAY column, which is
>>> reported as DATA_TYPE 2003. NiFi then provides this type number to the
>>> setObject method. Phoenix actually has no type registered to DATA_TYPE
>>> 2003. Perhaps Squirrel is doing something similar?
>>>
>>> I think Phoenix either needs a generic Array type registered to 2003
>>> that can dispatch to the appropriate PDataType implementation, or it needs
>>> to store the correct array DATA_TYPE number in the schema table. In this
>>> case, it should be 3000 (base array offset) + 12 (varchar) = 3012.
>>>
>>> On Wed, Jan 13, 2016 at 1:05 AM, Bulvik, Noam 
>>> wrote:
>>>
 we have upgraded to the server to use the parcel and we created table
 with varchar array column
 when working with client java client like squirrel  we still get the
 same error (Error: org.apache.phoenix.schema.IllegalDataException:
 Unsupported sql type: VARCHAR ARRAY) and from sqlline.py it still works 
 fine



 any idea what else to check



 *From:* Bulvik, Noam
 *Sent:* Thursday, October 8, 2015 7:41 PM
 *To:* 'user@phoenix.apache.org' 
 *Subject:* array support issue



 Hi all,



 We are using CDH 5.4 and phoenix 4.4. When we try to use the client jar
 (from squirrel ) to query table with array column we get the following
 error  (even when doing simple thing like select  from >>> array>:



 Error: org.apache.phoenix.schema.IllegalDataException: Unsupported sql
 type: VARCHAR ARRAY

 SQLState:  null

 ErrorCode: 0



 The same SQL from SQL command line (sqlline.py) it works fine (BTW only
 from phoenix 4.3.1 with 4.4 there is CDH compatibility issue .



 Any idea how it can be fixed?



 Regards ,

 Noam

 --

 PRIVILEGED AND CONFIDENTIAL
 PLEASE NOTE: The information contained in this message is privileged
 and confidential, and is intended only for the use of the individual to
 whom it is addressed and others who have been specifically authorized to
 receive it. If you are not the intended recipient, you are hereby notified
 that any dissemination, distribution or copying of this communication is
 strictly prohibited. If you have received this communication in error, or
 if any problems occur with transmission, please contact sender. Thank you.

>>>
>>>
>>


Re: Kerberos ticket renewal

2016-03-19 Thread Sergey Soldatov
Where do you see this error? Is it on the client side? Ideally you don't
need to renew the ticket yourself, since the Phoenix driver gets the required
information (principal name and keytab path) from the JDBC connection
string and performs User.login itself.
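
For example, a minimal sketch (quorum, principal, and keytab path are
placeholders), with hbase-site.xml and core-site.xml on the client classpath:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class KerberizedPhoenixClient {
    public static void main(String[] args) throws Exception {
        // jdbc:phoenix:<quorum>:<port>:<zk root>:<principal>:<keytab path>
        String url = "jdbc:phoenix:zk1,zk2,zk3:2181:/hbase:appuser@EXAMPLE.COM:"
                + "/etc/security/keytabs/appuser.keytab";
        try (Connection conn = DriverManager.getConnection(url);
             PreparedStatement ps = conn.prepareStatement("SELECT COL1 FROM MY_TABLE LIMIT 10");
             ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}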

Thanks,
Sergey

On Wed, Mar 16, 2016 at 11:02 AM, Sanooj Padmakumar  wrote:
> This is the error in the log when it fails
>
> ERROR org.apache.hadoop.security.UserGroupInformation -
> PriviledgedActionException as: (auth:KERBEROS)
> cause:javax.security.sasl.SaslException: GSS initiate failed [Caused by
> GSSException: No valid credentials provided (Mechanism level: Failed to find
> any Kerberos tgt)]
>
> On Wed, Mar 16, 2016 at 8:35 PM, Sanooj Padmakumar 
> wrote:
>>
>> Hi Anil
>>
>> Thanks for your reply.
>>
>> We do not do anything explicitly in the code to do the ticket renwal ,
>> what we do is run a cron job for the user for which the ticket has to be
>> renewed.  But with this approach we need a restart to get the thing going
>> after the ticket expiry
>>
>> We use the following connection url for getting the phoenix connection
>> jdbc:phoenix:::/hbase::> keytab>
>>
>> This along with the entries in hbase-site.xml & core-site.xml are passed
>> to the connection object
>>
>> Thanks
>> Sanooj Padmakumar
>>
>> On Tue, Mar 15, 2016 at 12:04 AM, anil gupta 
>> wrote:
>>>
>>> Hi,
>>>
>>> At my previous job, we had web-services fetching data from a secure hbase
>>> cluster. We never needed to renew the lease by restarting webserver. Our app
>>> used to renew the ticket. I think, Phoenix/HBase already handles renewing
>>> ticket. Maybe you need to look into your kerberos environment settings.  How
>>> are you authenticating with Phoenix/HBase?
>>> Sorry, I dont remember the exact kerberos setting that we had.
>>>
>>> HTH,
>>> Anil Gupta
>>>
>>> On Mon, Mar 14, 2016 at 11:00 AM, Sanooj Padmakumar 
>>> wrote:

 Hi

 We have a rest style micro service application fetching data from hbase
 using Phoenix. The cluster is kerberos secured and we run a cron to renew
 the kerberos ticket on the machine where the micro service is deployed.

 But it always needs a restart of micro service java process to get the
 kerberos ticket working once after its expired.

 Is there a way I can avoid this restart?

 Any pointers will be very helpful. Thanks

 PS : We have a Solr based micro service which works without a restart.

 Regards
 Sanooj
>>>
>>>
>>>
>>>
>>> --
>>> Thanks & Regards,
>>> Anil Gupta
>>
>>
>>
>>
>> --
>> Thanks,
>> Sanooj Padmakumar
>
>
>
>
> --
> Thanks,
> Sanooj Padmakumar


Re: Kerberos ticket renewal

2016-03-19 Thread Sanooj Padmakumar
Hi Anil

Thanks for your reply.

We do not do anything explicitly in the code to do the ticket renewal; what
we do is run a cron job for the user whose ticket has to be
renewed. But with this approach we need a restart to get things going
after the ticket expires.

We use the following connection URL for getting the Phoenix connection:
jdbc:phoenix:<zookeeper quorum>:<port>:/hbase:<principal>:<path to keytab>

This along with the entries in hbase-site.xml & core-site.xml are passed to
the connection object

Thanks
Sanooj Padmakumar

On Tue, Mar 15, 2016 at 12:04 AM, anil gupta  wrote:

> Hi,
>
> At my previous job, we had web-services fetching data from a secure hbase
> cluster. We never needed to renew the lease by restarting webserver. Our
> app used to renew the ticket. I think, Phoenix/HBase already handles
> renewing ticket. Maybe you need to look into your kerberos environment
> settings.  How are you authenticating with Phoenix/HBase?
> Sorry, I dont remember the exact kerberos setting that we had.
>
> HTH,
> Anil Gupta
>
> On Mon, Mar 14, 2016 at 11:00 AM, Sanooj Padmakumar 
> wrote:
>
>> Hi
>>
>> We have a rest style micro service application fetching data from hbase
>> using Phoenix. The cluster is kerberos secured and we run a cron to renew
>> the kerberos ticket on the machine where the micro service is deployed.
>>
>> But it always needs a restart of micro service java process to get the
>> kerberos ticket working once after its expired.
>>
>> Is there a way I can avoid this restart?
>>
>> Any pointers will be very helpful. Thanks
>>
>> PS : We have a Solr based micro service which works without a restart.
>>
>> Regards
>> Sanooj
>>
>
>
>
> --
> Thanks & Regards,
> Anil Gupta
>



-- 
Thanks,
Sanooj Padmakumar


Re: Implement Custom Aggregate Functions in Phoenix

2016-03-19 Thread Swapna Swapna
Yes, we do have support for UPPER and LOWER. I just provided them as an example
of a UDF.

For custom UDFs, I understand that we can go ahead and create a custom UDF
jar.

But how do we register that function?

As per the blog, i found the below lines:

*Finally, we'll need to register our new function. For this, you'll need to
edit the ExpressionType enum and include your new built-in function.
There's room for improvement here to allow registration of user defined
functions outside of the phoenix jar. However, you'd need to be able to
ensure your class is available on the HBase server class path since this
will be executed on the server side at runtime.*

Does that mean that, to register my custom function, I should edit the
*ExpressionType* enum that exists in Phoenix and rebuild the *phoenix jar?*




On Thu, Mar 17, 2016 at 6:17 PM, James Taylor 
wrote:

> No, custom UDFs can be added dynamically as described here:
> https://phoenix.apache.org/udf.html. No need to re-build Phoenix. It's
> just custom aggregates that would require rebuilding.
>
> FYI, we have support for UPPER and LOWER already.
>
> On Thu, Mar 17, 2016 at 6:09 PM, Swapna Swapna 
> wrote:
>
>> Thank you James for swift response.
>>
>> is the process (adding to phoenix-core and rebuild the jar)  remains the
>> same for custom UDF's as well  (like as for custom aggregate functions)?
>>
>> ex: we have UDF's like  UPPER, LOWER ,etc
>>
>> On Thu, Mar 17, 2016 at 5:53 PM, James Taylor 
>> wrote:
>>
>>> Hi Swapna,
>>> We don't support custom aggregate functions, only scalar functions
>>> (see PHOENIX-2069). For a custom aggregate function, you'd need to add them
>>> to phoenix-core and rebuild the jar. We're open to adding them to the code
>>> base if they're general enough. That's how FIRST_VALUE, LAST_VALUE, and
>>> NTH_VALUE made it in.
>>> Thanks,
>>> James
>>>
>>> On Thu, Mar 17, 2016 at 12:11 PM, Swapna Swapna 
>>> wrote:
>>>
 Hi,

 I found this in Phoenix UDF documentation:

- After compiling your code to a jar, you need to deploy the jar
into the HDFS. It would be better to add the jar to HDFS folder 
 configured
for hbase.dynamic.jars.dir.


 My question is, can that be any 'udf-user-specific' jar which need to
 be copied to HDFS or would it need to register the function and update the
 custom UDF classes inside phoenix-core.jar and rebuild the
 'phoenix-core.jar'

 Regards
 Swapna




 On Fri, Jan 29, 2016 at 6:31 PM, James Taylor 
 wrote:

> Hi Swapna,
> We currently don't support custom aggregate UDF, and it looks like you
> found the JIRA here: PHOENIX-2069. It would be a natural extension of 
> UDFs.
> Would be great to capture your use case and requirements on the JIRA to
> make sure the functionality will meet your needs.
> Thanks,
> James
>
> On Fri, Jan 29, 2016 at 1:47 PM, Swapna Swapna  > wrote:
>
>> Hi,
>>
>> I would like to know the approach to implement and register custom
>> aggregate functions in Phoenix like the way we have built-in aggregate
>> functions like SUM, COUNT,etc
>>
>> Please help.
>>
>> Thanks
>> Swapna
>>
>
>

>>>
>>
>


Re: Does phoenix CsvBulkLoadTool write to WAL/Memstore

2016-03-19 Thread Vamsi Krishna
Thanks Gabriel & Ravi.

I have a data processing job written in Spark/Scala.
I do a join on data from 2 data files (CSV files) and do data
transformation on the resulting data. Finally, I load the transformed data
into a Phoenix table using the Phoenix-Spark plugin.
Since the Phoenix-Spark plugin goes through the regular HBase write path
(writes to the WAL), I'm thinking of option 2 to reduce the job execution time.

*Option 2:* Do data transformation in Spark and write the transformed data
to a CSV file and use Phoenix CsvBulkLoadTool to load data into Phoenix
table.

Has anyone tried this kind of exercise? Any thoughts.

Thanks,
Vamsi Attluri

On Tue, Mar 15, 2016 at 9:40 PM Ravi Kiran 
wrote:

> Hi Vamsi,
>The upserts through Phoenix-spark plugin definitely go through WAL .
>
>
> On Tue, Mar 15, 2016 at 5:56 AM, Gabriel Reid 
> wrote:
>
>> Hi Vamsi,
>>
>> I can't answer your question abotu the Phoenix-Spark plugin (although
>> I'm sure that someone else here can).
>>
>> However, I can tell you that the CsvBulkLoadTool does not write to the
>> WAL or to the Memstore. It simply writes HFiles and then hands those
>> HFiles over to HBase, so the memstore and WAL are never
>> touched/affected by this.
>>
>> - Gabriel
>>
>>
>> On Tue, Mar 15, 2016 at 1:41 PM, Vamsi Krishna 
>> wrote:
>> > Team,
>> >
>> > Does phoenix CsvBulkLoadTool write to HBase WAL/Memstore?
>> >
>> > Phoenix-Spark plugin:
>> > Does saveToPhoenix method on RDD[Tuple] write to HBase WAL/Memstore?
>> >
>> > Thanks,
>> > Vamsi Attluri
>> > --
>> > Vamsi Attluri
>>
>
> --
Vamsi Attluri


how to tune phoenix CsvBulkLoadTool job

2016-03-19 Thread Vamsi Krishna
Hi,

I'm using CsvBulkLoadTool to load a csv data file into Phoenix/HBase table.

HDP Version : 2.3.2 (Phoenix Version : 4.4.0, HBase Version: 1.1.2)
CSV file size: 97.6 GB
No. of records: 1,439,000,238
Cluster: 13 node
Phoenix table salt-buckets: 13
Phoenix table compression: snappy
HBase table size after loading: 26.6 GB

The job completed in *1hrs, 39mins, 43sec*.
Average Map Time 5mins, 25sec
Average Shuffle Time *47mins, 46sec*
Average Merge Time 12mins, 22sec
Average Reduce Time *32mins, 9sec*

I'm looking for an opportunity to tune this job.
Could someone please help me with some pointers on how to tune this job?
Please let me know if you need to know any cluster configuration parameters
that I'm using.

*This is only a performance test. My PRODUCTION data file is 7x bigger.*

Thanks,
Vamsi Attluri

-- 
Vamsi Attluri


Re: array support issue

2016-03-19 Thread Nick Dimiduk
> Applications should never query the SYSTEM.CATALOG directly. Instead they
should go through the DatabaseMetaData interface from
Connection.getMetaData().

I may have this detail wrong, but the point remains: applications are
getting an incorrect value, or misinterpreting the correct value they
receive. From what I can see, this issue is unique to Phoenix.
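
For reference, the DatabaseMetaData route James describes below looks roughly
like this (connection URL and schema/table names are placeholders):

import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.DriverManager;
import java.sql.ResultSet;

public class ColumnTypeLookup {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181")) {
            DatabaseMetaData md = conn.getMetaData();
            // catalog, schema pattern, table name pattern, column name pattern
            try (ResultSet rs = md.getColumns(null, "MYSCHEMA", "MYTABLE", null)) {
                while (rs.next()) {
                    String name = rs.getString("COLUMN_NAME");
                    int sqlType = rs.getInt("DATA_TYPE");        // java.sql.Types.ARRAY (2003) for arrays
                    String typeName = rs.getString("TYPE_NAME"); // e.g. "VARCHAR ARRAY"
                    int size = rs.getInt("COLUMN_SIZE");
                    System.out.println(name + " " + sqlType + " " + typeName + " " + size);
                }
            }
        }
    }
}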

On Thursday, March 10, 2016, James Taylor  wrote:

> Applications should never query the SYSTEM.CATALOG directly. Instead they
> should go through the DatabaseMetaData interface from
> Connection.getMetaData(). For column type information, you'd use the
> DatabaseMetaData#getColumn method[1] which would return the standard SQL
> type for ARRAY in the "DATA_TYPE" or resultSet.getInt(5) and the type name
> in the "TYPE_NAME" or resultSet.getString(6). For arrays, the String is of
> the form " ARRAY". The max length or precision would in
> the "COLUMN_SIZE" or resultSet.getInt(7).
>
> If we do want query access to the system catalog table, we should
> implement the INFORMATION_SCHEMA (PHOENIX-24) by creating a view on top of
> the SYSTEM.CATALOG.
>
> Thanks,
> James
>
>
> [1]
> https://docs.oracle.com/javase/7/docs/api/java/sql/DatabaseMetaData.html#getColumns(java.lang.String,%20java.lang.String,%20java.lang.String,%20java.lang.String)
>
> On Thu, Mar 10, 2016 at 8:14 AM, Nick Dimiduk  > wrote:
>
>> I believe I've seen something similar while interfacing between Apache
>> NiFi and Phoenix. From the bit of debugging I did yesterday with my
>> colleague, NiFi is querying the system table for schema information about
>> the target table. My target table has a VARCHAR ARRAY column, which is
>> reported as DATA_TYPE 2003. NiFi then provides this type number to the
>> setObject method. Phoenix actually has no type registered to DATA_TYPE
>> 2003. Perhaps Squirrel is doing something similar?
>>
>> I think Phoenix either needs a generic Array type registered to 2003 that
>> can dispatch to the appropriate PDataType implementation, or it needs to
>> store the correct array DATA_TYPE number in the schema table. In this case,
>> it should be 3000 (base array offset) + 12 (varchar) = 3012.
>>
>> On Wed, Jan 13, 2016 at 1:05 AM, Bulvik, Noam > > wrote:
>>
>>> we have upgraded to the server to use the parcel and we created table
>>> with varchar array column
>>> when working with client java client like squirrel  we still get the
>>> same error (Error: org.apache.phoenix.schema.IllegalDataException:
>>> Unsupported sql type: VARCHAR ARRAY) and from sqlline.py it still works fine
>>>
>>>
>>>
>>> any idea what else to check
>>>
>>>
>>>
>>> *From:* Bulvik, Noam
>>> *Sent:* Thursday, October 8, 2015 7:41 PM
>>> *To:* 'user@phoenix.apache.org
>>> ' <
>>> user@phoenix.apache.org
>>> >
>>> *Subject:* array support issue
>>>
>>>
>>>
>>> Hi all,
>>>
>>>
>>>
>>> We are using CDH 5.4 and phoenix 4.4. When we try to use the client jar
>>> (from squirrel ) to query table with array column we get the following
>>> error  (even when doing simple thing like select  from >> array>:
>>>
>>>
>>>
>>> Error: org.apache.phoenix.schema.IllegalDataException: Unsupported sql
>>> type: VARCHAR ARRAY
>>>
>>> SQLState:  null
>>>
>>> ErrorCode: 0
>>>
>>>
>>>
>>> The same SQL from SQL command line (sqlline.py) it works fine (BTW only
>>> from phoenix 4.3.1 with 4.4 there is CDH compatibility issue .
>>>
>>>
>>>
>>> Any idea how it can be fixed?
>>>
>>>
>>>
>>> Regards ,
>>>
>>> Noam
>>>
>>> --
>>>
>>> PRIVILEGED AND CONFIDENTIAL
>>> PLEASE NOTE: The information contained in this message is privileged and
>>> confidential, and is intended only for the use of the individual to whom it
>>> is addressed and others who have been specifically authorized to receive
>>> it. If you are not the intended recipient, you are hereby notified that any
>>> dissemination, distribution or copying of this communication is strictly
>>> prohibited. If you have received this communication in error, or if any
>>> problems occur with transmission, please contact sender. Thank you.
>>>
>>
>>
>


Re: Implement Custom Aggregate Functions in Phoenix

2016-03-19 Thread Swapna Swapna
Thank you, James, for the swift response.

Does the process (adding to phoenix-core and rebuilding the jar) remain the
same for custom UDFs as well (as it does for custom aggregate functions)?

For example, we have UDFs like UPPER, LOWER, etc.

On Thu, Mar 17, 2016 at 5:53 PM, James Taylor 
wrote:

> Hi Swapna,
> We don't support custom aggregate functions, only scalar functions
> (see PHOENIX-2069). For a custom aggregate function, you'd need to add them
> to phoenix-core and rebuild the jar. We're open to adding them to the code
> base if they're general enough. That's how FIRST_VALUE, LAST_VALUE, and
> NTH_VALUE made it in.
> Thanks,
> James
>
> On Thu, Mar 17, 2016 at 12:11 PM, Swapna Swapna 
> wrote:
>
>> Hi,
>>
>> I found this in Phoenix UDF documentation:
>>
>>- After compiling your code to a jar, you need to deploy the jar into
>>the HDFS. It would be better to add the jar to HDFS folder configured for
>>hbase.dynamic.jars.dir.
>>
>>
>> My question is, can that be any 'udf-user-specific' jar which need to be
>> copied to HDFS or would it need to register the function and update the
>> custom UDF classes inside phoenix-core.jar and rebuild the
>> 'phoenix-core.jar'
>>
>> Regards
>> Swapna
>>
>>
>>
>>
>> On Fri, Jan 29, 2016 at 6:31 PM, James Taylor 
>> wrote:
>>
>>> Hi Swapna,
>>> We currently don't support custom aggregate UDF, and it looks like you
>>> found the JIRA here: PHOENIX-2069. It would be a natural extension of UDFs.
>>> Would be great to capture your use case and requirements on the JIRA to
>>> make sure the functionality will meet your needs.
>>> Thanks,
>>> James
>>>
>>> On Fri, Jan 29, 2016 at 1:47 PM, Swapna Swapna 
>>> wrote:
>>>
 Hi,

 I would like to know the approach to implement and register custom
 aggregate functions in Phoenix like the way we have built-in aggregate
 functions like SUM, COUNT,etc

 Please help.

 Thanks
 Swapna

>>>
>>>
>>
>


Re: Adding table compression

2016-03-19 Thread Vladimir Rodionov
Nope, it should be transparent.

New data will be compressed on flush and old data will be compressed during
next compaction.
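
If you'd rather do it from code than the hbase shell, a rough sketch with the
HBase 1.x admin API (table and family names are placeholders; the optional
major compaction just rewrites existing files with the new codec sooner):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.io.compress.Compression;
import org.apache.hadoop.hbase.util.Bytes;

public class EnableSnappyCompression {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        TableName table = TableName.valueOf("MY_PHOENIX_TABLE");
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            HTableDescriptor desc = admin.getTableDescriptor(table);
            // Phoenix puts non-PK columns in column family "0" by default.
            HColumnDescriptor family = desc.getFamily(Bytes.toBytes("0"));
            family.setCompressionType(Compression.Algorithm.SNAPPY);
            admin.modifyTable(table, desc);
            // Optional: compact now instead of waiting for the next compaction.
            admin.majorCompact(table);
        }
    }
}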

-Vlad

On Fri, Mar 18, 2016 at 12:55 PM, Michael McAllister <
mmcallis...@homeaway.com> wrote:

> All
>
>
>
> Are there any known issues if we use the hbase shell to alter a phoenix
> table to apply compression? We’re currently using Phoenix 4.4 on HDP 2.3.4.
>
>
>
> I plan on testing, but also want to double check for any gotchas.
>
>
>
> Michael McAllister
>
> Staff Data Warehouse Engineer | Decision Systems
>
> mmcallis...@homeaway.com | C: 512.423.7447
>
>


Question about in-flight new rows while index creation in progress

2016-03-19 Thread Li Gao
Hi Community,

I want to understand and confirm whether it is expected behavior that a
long-running index creation will capture all in-flight new rows written to the
data table while the index creation is still in progress.

i.e., when I issue CREATE INDEX there are only 1 million rows in the data table;
after I issue the CREATE INDEX call, and before CREATE INDEX finishes, 9
million new rows are inserted into the data table.

So when CREATE INDEX completes, will the index cover the total of 10 million rows?

Thanks,
Li


Re: how to decode phoenix data under hbase

2016-03-19 Thread Sanooj Padmakumar
Hi Kevin,

You can access the data created using Phoenix with the Java HBase API. Use
the sample code below.

Keep in mind that for VARCHAR-based columns (i.e. for columns whose size is
unknown, Phoenix uses a separator) we need to use
QueryConstants.SEPARATOR_BYTE_ARRAY as the separator, while for number-based
columns we don't need any separator (since Phoenix keeps a fixed size for such
columns).

// Build the row key exactly the way Phoenix encodes it: VARCHAR PK columns
// are separated by a zero byte (QueryConstants.SEPARATOR_BYTE_ARRAY).
byte[] startRow = ByteUtil.concat(
    PVarchar.INSTANCE.toBytes("primaryKeyCol1Value"),
    QueryConstants.SEPARATOR_BYTE_ARRAY,
    PVarchar.INSTANCE.toBytes("primaryKeyCol2Value"));

Get get = new Get(startRow);
Result result = table.get(get);

// Decode a VARCHAR column value back into a String.
String colValue = Bytes.toString(result.getValue(Bytes.toBytes("colFamily"),
    Bytes.toBytes("colName")));

Also read about PrefixFilter and range filters

Hope this helps

Sanooj


On Tue, Mar 15, 2016 at 2:33 PM, kevin  wrote:

> HI,all
> I create a table under phoenix and upsert somedata. I turn to hbase
> client and scan the new table.
> I got data like :
> column=0:NAME, timestamp=1458028540810, value=\xE5\xB0\x8F\xE6\x98\x8E
>
> I don't know how to decode the value to normal string.what's the
> codeset?
>



-- 
Thanks,
Sanooj Padmakumar


Re: how to decode phoenix data under hbase

2016-03-19 Thread anil gupta
Hi Kevin,

You should use the Phoenix command line (sqlline or a client like SQuirreL) or
the Phoenix API to read data written via Phoenix. One of the biggest advantages
of Phoenix is that it converts long, int, date, etc. into a human-readable
format at the time of displaying data (unlike the raw binary you see in HBase).
Have a look at the Phoenix website to find out how to use Phoenix to query data.
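
For example, reading the same NAME column back through Phoenix is just plain
JDBC (the ZooKeeper quorum and table name are placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ReadViaPhoenix {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181");
             Statement stmt = conn.createStatement();
             // Phoenix decodes the stored bytes back into typed, readable values.
             ResultSet rs = stmt.executeQuery("SELECT NAME FROM MY_TABLE LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString("NAME"));
            }
        }
    }
}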

Thanks,
Anil

On Wed, Mar 16, 2016 at 8:13 AM, Sanooj Padmakumar 
wrote:

> Hi Kevin,
>
> You can access the data created using phoenix with java hbase api .. Use
> the sample code below..
>
> Keep in mind for varchar (i.e. for columns whose size is unknown phoenix
> uses separator) based columns we need to use
> QueryConstants.SEPARATOR_BYTE_ARRAY as the separator and for number based
> columns we dont need any separator (since phoenix keeps fixed size for such
> columns)
>
> byte[] startRow = ByteUtil.concat(PVarchar.INSTANCE
> .toBytes("primaryKeyCol1Value"),
> QueryConstants.SEPARATOR_BYTE_ARRAY,
> PVarchar.INSTANCE.toBytes("primaryKeyCol2Value");
>
> Get get = new Get(startRow);
> Result result = table.get(get);
>
> String colValue =
> Bytes.toString(result.getValue(Bytes.toBytes("colFamily"),
> Bytes.toBytes("colName")));
>
> Also read about PrefixFilter and range filters
>
> Hope this helps
>
> Sanooj
>
>
> On Tue, Mar 15, 2016 at 2:33 PM, kevin  wrote:
>
>> HI,all
>> I create a table under phoenix and upsert somedata. I turn to hbase
>> client and scan the new table.
>> I got data like :
>> column=0:NAME, timestamp=1458028540810, value=\xE5\xB0\x8F\xE6\x98\x8E
>>
>> I don't know how to decode the value to normal string.what's the
>> codeset?
>>
>
>
>
> --
> Thanks,
> Sanooj Padmakumar
>



-- 
Thanks & Regards,
Anil Gupta


Re: Problem Updating Stats

2016-03-19 Thread Benjamin Kim
I got it to work by uninstalling Phoenix and reinstalling it again. I had to 
wipe clean all components.

Thanks,
Ben

> On Mar 16, 2016, at 10:47 AM, Ankit Singhal  wrote:
> 
> It seems from the attached logs that you have upgraded phoenix to 4.7 version 
> and now you are using old client to connect with it.
> "Update statistics" command and guideposts will not work with old client 
> after upgradation to 4.7, you need to use the new client for such operations.
> 
> On Wed, Mar 16, 2016 at 10:55 PM, Benjamin Kim  > wrote:
> | TABLE_CAT | TABLE_SCHEM | TABLE_NAME | COLUMN_NAME            | DATA_TYPE |
> +-----------+-------------+------------+------------------------+-----------+
> |           | SYSTEM      | STATS      | PHYSICAL_NAME          | 12        |
> |           | SYSTEM      | STATS      | COLUMN_FAMILY          | 12        |
> |           | SYSTEM      | STATS      | GUIDE_POST_KEY         | -3        |
> |           | SYSTEM      | STATS      | GUIDE_POSTS_WIDTH      | -5        |
> |           | SYSTEM      | STATS      | LAST_STATS_UPDATE_TIME | 91        |
> |           | SYSTEM      | STATS      | GUIDE_POSTS_ROW_COUNT  | -5        |
> 
> I have attached the SYSTEM.CATALOG contents.
> 
> Thanks,
> Ben
> 
> 
> 
>> On Mar 16, 2016, at 9:34 AM, Ankit Singhal > > wrote:
>> 
>> Sorry Ben, I may not be clear in first comment but I need you to describe 
>> SYSTEM.STATS in some sql client so that I can see the columns present.
>> And also please scan 'SYSTEM.CATALOG' ,{RAW=>true} in hbase shell and attach 
>> a output here.
>> 
>> On Wed, Mar 16, 2016 at 8:55 PM, Benjamin Kim > > wrote:
>> Ankit,
>> 
>> I did not see any problems when connecting with the phoenix sqlline client. 
>> So, below is the what you asked for. I hope that you can give us insight 
>> into fixing this.
>> 
>> hbase(main):005:0> describe 'SYSTEM.STATS'
>> Table SYSTEM.STATS is ENABLED
>>  
>>
>> SYSTEM.STATS, {TABLE_ATTRIBUTES => {coprocessor$1 => 
>> '|org.apache.phoenix.coprocessor.ScanRegionObserver|805306366|', 
>> coprocessor$2 => '|org.apache.phoenix.coprocessor.UngroupedAggr
>> egateRegionObserver|805306366|', coprocessor$3 => 
>> '|org.apache.phoenix.coprocessor.GroupedAggregateRegionObserver|805306366|', 
>> coprocessor$4 => '|org.apache.phoenix.coprocessor.Serv
>> erCachingEndpointImpl|805306366|', coprocessor$5 => 
>> '|org.apache.hadoop.hbase.coprocessor.MultiRowMutationEndpoint|805306366|', 
>> coprocessor$6 => '|org.apache.hadoop.hbase.regionserv
>> er.LocalIndexSplitter|805306366|', METADATA => {'SPLIT_POLICY' => 
>> 'org.apache.phoenix.schema.MetaDataSplitPolicy'}}
>>   
>> COLUMN FAMILIES DESCRIPTION  
>>  
>>
>> {NAME => '0', DATA_BLOCK_ENCODING => 'FAST_DIFF', BLOOMFILTER => 'ROW', 
>> REPLICATION_SCOPE => '0', VERSIONS => '3', COMPRESSION => 'NONE', 
>> MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP
>> _DELETED_CELLS => 'true', BLOCKSIZE => '65536', IN_MEMORY => 'false', 
>> BLOCKCACHE => 'true'}
>>   
>> 1 row(s) in 0.0280 seconds
>> 
>> Thanks,
>> Ben
>> 
>> 
>>> On Mar 15, 2016, at 11:59 PM, Ankit Singhal >> > wrote:
>>> 
>>> Yes it seems to. 
>>> Did you get any error related to SYSTEM.STATS when the client is connected 
>>> first time ?
>>> 
>>> can you please describe your system.stats table and paste the output here.
>>> 
>>> On Wed, Mar 16, 2016 at 3:24 AM, Benjamin Kim >> > wrote:
>>> When trying to run update status on an existing table in hbase, I get error:
>>> Update stats:
>>> UPDATE STATISTICS "ops_csv" ALL
>>> error:
>>> ERROR 504 (42703): Undefined column. columnName=REGION_NAME
>>> Looks like the meta data information is messed up, ie. there is no column 
>>> with name REGION_NAME in this table.
>>> I see similar errors for other tables that we currently have in hbase.
>>> 
>>> We are using CDH 5.5.2, HBase 1.0.0, and Phoenix 4.5.2.
>>> 
>>> Thanks,
>>> Ben
>>> 
>> 
>> 
> 
> 
> 



Re: Implement Custom Aggregate Functions in Phoenix

2016-03-19 Thread James Taylor
No need to register your custom UDFs. Did you see these directions:
https://phoenix.apache.org/udf.html#How_to_write_custom_UDF?

Have you tried it yet?

On Thu, Mar 17, 2016 at 6:49 PM, Swapna Swapna 
wrote:

> Yes, we do have support UPPER and LOWER. I just provided as an example to
> refer to UDF.
>
> For custom UDF's, i understand that we can go ahead and create custom UDF
> jar.
>
> but how do we register that function?
>
> As per the blog, i found the below lines:
>
> *Finally, we'll need to register our new function. For this, you'll need
> to edit the ExpressionType enum and include your new built-in function.
> There's room for improvement here to allow registration of user defined
> functions outside of the phoenix jar. However, you'd need to be able to
> ensure your class is available on the HBase server class path since this
> will be executed on the server side at runtime.*
>
>  Does that mean, to register my custom function, i should edit the 
> *ExpressionType
> enum *exists in Phoenix and rebuild the *phoenix jar?*
>
>
>
>
> On Thu, Mar 17, 2016 at 6:17 PM, James Taylor 
> wrote:
>
>> No, custom UDFs can be added dynamically as described here:
>> https://phoenix.apache.org/udf.html. No need to re-build Phoenix. It's
>> just custom aggregates that would require rebuilding.
>>
>> FYI, we have support for UPPER and LOWER already.
>>
>> On Thu, Mar 17, 2016 at 6:09 PM, Swapna Swapna 
>> wrote:
>>
>>> Thank you James for swift response.
>>>
>>> is the process (adding to phoenix-core and rebuild the jar)  remains
>>> the same for custom UDF's as well  (like as for custom aggregate functions)?
>>>
>>> ex: we have UDF's like  UPPER, LOWER ,etc
>>>
>>> On Thu, Mar 17, 2016 at 5:53 PM, James Taylor 
>>> wrote:
>>>
 Hi Swapna,
 We don't support custom aggregate functions, only scalar functions
 (see PHOENIX-2069). For a custom aggregate function, you'd need to add them
 to phoenix-core and rebuild the jar. We're open to adding them to the code
 base if they're general enough. That's how FIRST_VALUE, LAST_VALUE, and
 NTH_VALUE made it in.
 Thanks,
 James

 On Thu, Mar 17, 2016 at 12:11 PM, Swapna Swapna  wrote:

> Hi,
>
> I found this in Phoenix UDF documentation:
>
>- After compiling your code to a jar, you need to deploy the jar
>into the HDFS. It would be better to add the jar to HDFS folder 
> configured
>for hbase.dynamic.jars.dir.
>
>
> My question is, can that be any 'udf-user-specific' jar which need to
> be copied to HDFS or would it need to register the function and update the
> custom UDF classes inside phoenix-core.jar and rebuild the
> 'phoenix-core.jar'
>
> Regards
> Swapna
>
>
>
>
> On Fri, Jan 29, 2016 at 6:31 PM, James Taylor 
> wrote:
>
>> Hi Swapna,
>> We currently don't support custom aggregate UDF, and it looks like
>> you found the JIRA here: PHOENIX-2069. It would be a natural extension of
>> UDFs. Would be great to capture your use case and requirements on the 
>> JIRA
>> to make sure the functionality will meet your needs.
>> Thanks,
>> James
>>
>> On Fri, Jan 29, 2016 at 1:47 PM, Swapna Swapna <
>> talktoswa...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I would like to know the approach to implement and register custom
>>> aggregate functions in Phoenix like the way we have built-in aggregate
>>> functions like SUM, COUNT,etc
>>>
>>> Please help.
>>>
>>> Thanks
>>> Swapna
>>>
>>
>>
>

>>>
>>
>


Re: how to tune phoenix CsvBulkLoadTool job

2016-03-19 Thread Gabriel Reid
Hi Vamsi,

I see from your counters that the number of map spill records is
double the number of map output records, so I think that raising the
mapreduce.task.io.sort.mb setting on the job should improve the
shuffle throughput.
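
For what it's worth, that setting can be passed straight through to the bulk
load job; a rough sketch of a programmatic launch (the value, table, input
path, and quorum below are placeholders, and the same -D option works on the
hadoop jar command line):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.util.ToolRunner;
import org.apache.phoenix.mapreduce.CsvBulkLoadTool;

public class BulkLoadLauncher {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        int exitCode = ToolRunner.run(conf, new CsvBulkLoadTool(), new String[] {
                // Give the map-side sort buffer more room to cut down on spills.
                "-D", "mapreduce.task.io.sort.mb=512",
                "--table", "MY_TABLE",
                "--input", "/data/my_table.csv",
                "--zookeeper", "zk1,zk2,zk3:2181"
        });
        System.exit(exitCode);
    }
}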

However, like I said before, I think that the first thing to try is
increasing the number of regions.

Indeed, increasing the number of regions can potentially increase
parallelism for reads by Phoenix, although Phoenix actually internally
does sub-region reads as-is, so there probably won't be a huge effect
either way in terms of read performance.

Aggregate queries shouldn't be impacted much either way. The increased
parallelism that Phoenix does to do sub-region reads is still in place
regardless. In addition, aggregate reads are done per region (or
sub-region split), and then the aggregation results are combined to
give the whole aggregate result. Having five times as many regions
(for example) would increase the number of portions of the aggregation
that need to be combined, but this should still be very minor in
comparison to the total amount of work required to do aggregations, so
it also shouldn't have a major effect either way.

- Gabriel

On Wed, Mar 16, 2016 at 7:15 PM, Vamsi Krishna  wrote:
> Thanks Gabriel,
> Please find the job counters attached.
>
> Would increasing the splitting affect the reads?
> I assume a simple read would be benefitted by increased splitting as it
> increases the parallelism.
> But, how would it impact the aggregate queries?
>
> Vamsi Attluri
>
> On Wed, Mar 16, 2016 at 9:06 AM Gabriel Reid  wrote:
>>
>> Hi Vamsi,
>>
>> The first thing that I notice looking at the info that you've posted
>> is that you have 13 nodes and 13 salt buckets (which I assume also
>> means that you have 13 regions).
>>
>> A single region is the unit of parallelism that is used for reducers
>> in the CsvBulkLoadTool (or HFile-writing MapReduce job in general), so
>> currently you're only getting an average of a single reduce process
>> per node on your cluster. Assuming that you have multiple cores in
>> each of those nodes, you will probably get a decent improvement in
>> performance by further splitting your destination table so that it has
>> multiple regions per node (thereby triggering multiple reduce tasks
>> per node).
>>
>> Would you also be able to post the full set of job counters that are
>> shown after the job is completed? This would also be helpful in
>> pinpointing things that can be (possibly) tuned.
>>
>> - Gabriel
>>
>>
>> On Wed, Mar 16, 2016 at 1:28 PM, Vamsi Krishna 
>> wrote:
>> > Hi,
>> >
>> > I'm using CsvBulkLoadTool to load a csv data file into Phoenix/HBase
>> > table.
>> >
>> > HDP Version : 2.3.2 (Phoenix Version : 4.4.0, HBase Version: 1.1.2)
>> > CSV file size: 97.6 GB
>> > No. of records: 1,439,000,238
>> > Cluster: 13 node
>> > Phoenix table salt-buckets: 13
>> > Phoenix table compression: snappy
>> > HBase table size after loading: 26.6 GB
>> >
>> > The job completed in 1hrs, 39mins, 43sec.
>> > Average Map Time 5mins, 25sec
>> > Average Shuffle Time 47mins, 46sec
>> > Average Merge Time 12mins, 22sec
>> > Average Reduce Time 32mins, 9sec
>> >
>> > I'm looking for an opportunity to tune this job.
>> > Could someone please help me with some pointers on how to tune this job?
>> > Please let me know if you need to know any cluster configuration
>> > parameters
>> > that I'm using.
>> >
>> > This is only a performance test. My PRODUCTION data file is 7x bigger.
>> >
>> > Thanks,
>> > Vamsi Attluri
>> >
>> > --
>> > Vamsi Attluri
>
> --
> Vamsi Attluri


Re: how to tune phoenix CsvBulkLoadTool job

2016-03-19 Thread Gabriel Reid
Hi Vamsi,

The first thing that I notice looking at the info that you've posted
is that you have 13 nodes and 13 salt buckets (which I assume also
means that you have 13 regions).

A single region is the unit of parallelism that is used for reducers
in the CsvBulkLoadTool (or HFile-writing MapReduce job in general), so
currently you're only getting an average of a single reduce process
per node on your cluster. Assuming that you have multiple cores in
each of those nodes, you will probably get a decent improvement in
performance by further splitting your destination table so that it has
multiple regions per node (thereby triggering multiple reduce tasks
per node).
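
If the table can be recreated, the simplest way to get more regions up front is
a higher SALT_BUCKETS value at creation time; a rough sketch (the DDL, bucket
count, and compression setting are illustrative only):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateSaltedTable {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181");
             Statement stmt = conn.createStatement()) {
            // e.g. 3 buckets per node on a 13-node cluster => 39 initial regions,
            // which also means 39 reducers for the CsvBulkLoadTool job.
            stmt.execute("CREATE TABLE MY_TABLE ("
                    + "  ID VARCHAR NOT NULL PRIMARY KEY,"
                    + "  VAL VARCHAR"
                    + ") SALT_BUCKETS = 39, COMPRESSION = 'SNAPPY'");
        }
    }
}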

Would you also be able to post the full set of job counters that are
shown after the job is completed? This would also be helpful in
pinpointing things that can be (possibly) tuned.

- Gabriel


On Wed, Mar 16, 2016 at 1:28 PM, Vamsi Krishna  wrote:
> Hi,
>
> I'm using CsvBulkLoadTool to load a csv data file into Phoenix/HBase table.
>
> HDP Version : 2.3.2 (Phoenix Version : 4.4.0, HBase Version: 1.1.2)
> CSV file size: 97.6 GB
> No. of records: 1,439,000,238
> Cluster: 13 node
> Phoenix table salt-buckets: 13
> Phoenix table compression: snappy
> HBase table size after loading: 26.6 GB
>
> The job completed in 1hrs, 39mins, 43sec.
> Average Map Time 5mins, 25sec
> Average Shuffle Time 47mins, 46sec
> Average Merge Time 12mins, 22sec
> Average Reduce Time 32mins, 9sec
>
> I'm looking for an opportunity to tune this job.
> Could someone please help me with some pointers on how to tune this job?
> Please let me know if you need to know any cluster configuration parameters
> that I'm using.
>
> This is only a performance test. My PRODUCTION data file is 7x bigger.
>
> Thanks,
> Vamsi Attluri
>
> --
> Vamsi Attluri


Re: Implement Custom Aggregate Functions in Phoenix

2016-03-19 Thread Swapna Swapna
Hi,

I found this in Phoenix UDF documentation:

   - After compiling your code to a jar, you need to deploy the jar into
   the HDFS. It would be better to add the jar to HDFS folder configured for
   hbase.dynamic.jars.dir.


My question is: can that be any user-specific UDF jar which needs to be
copied to HDFS, or would we need to register the function, update the
custom UDF classes inside phoenix-core.jar, and rebuild
'phoenix-core.jar'?

Regards
Swapna




On Fri, Jan 29, 2016 at 6:31 PM, James Taylor 
wrote:

> Hi Swapna,
> We currently don't support custom aggregate UDF, and it looks like you
> found the JIRA here: PHOENIX-2069. It would be a natural extension of UDFs.
> Would be great to capture your use case and requirements on the JIRA to
> make sure the functionality will meet your needs.
> Thanks,
> James
>
> On Fri, Jan 29, 2016 at 1:47 PM, Swapna Swapna 
> wrote:
>
>> Hi,
>>
>> I would like to know the approach to implement and register custom
>> aggregate functions in Phoenix like the way we have built-in aggregate
>> functions like SUM, COUNT,etc
>>
>> Please help.
>>
>> Thanks
>> Swapna
>>
>
>