Re: Runtime DDL supported?

2018-03-07 Thread Miles Spielberg
We found https://issues.apache.org/jira/browse/PHOENIX-3547, which seems to
be precisely our problem. We would want at least the option to use a bigint
rather than the int in the JIRA to accommodate massive growth. While we
intend to have many tenants, we don't intend to use the Phoenix "tenant_id"
to differentiate them, and instead manage them at our application layer, so
separate counters per Phoenix tenant would not help in our situation.

Miles Spielberg
Staff Software Engineer


O. 650.485.1102
900 Jefferson Ave
Redwood City, CA 94063

On Wed, Feb 28, 2018 at 10:27 PM, James Taylor wrote:

> Please file a JIRA as it’d be feasible to change this limitation. The
> easiest way would be to have a separate counter for each tenant. Another
> way to reduce the number of indexes on tenant specific views would be to
> factor out common columns to global views and create indexes there.
>
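A minimal sketch of that global-view idea (the table, view, and column names
here are hypothetical; a real schema would factor out whichever columns the
tenant views actually share):

    create table base_t (
        pk bigint not null primary key
    );

    -- columns common to every tenant-specific view live in one global view
    create view common_v (
        shared_col varchar
    ) as select * from base_t;

    -- a single index on the global view, instead of one index per tenant view
    create index common_v_shared_col on common_v (shared_col);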
> On Tue, Feb 27, 2018 at 3:40 PM Miles Spielberg  wrote:
>
>> As we discussed, indexes across views are stored in a single HBase table
>> associated with the original table (_IDX_). That's grand for
>> limiting the number of HBase tables created, but I just realized that the
>> actual index data within is differentiated by the 16-bit "viewIndexId",
>> which limits us to 64K indexes across all views for a given table. That's
>> concerning for our use case, especially if it's a cumulative autoincrement
>> across all CREATE INDEX and DROP INDEX operations over the lifetime of the
>> base table.
>>
>> Is there any workaround for this? A quick grep across the source
>> indicates that the length of viewIndexId is currently hard-coded.
>>
>> At least, this limitation should probably be added to the list of caveats
>> and warnings at https://phoenix.apache.org/views.html.
>>
>> Miles Spielberg
>> Staff Software Engineer
>>
>>
>> O. 650.485.1102
>> 900 Jefferson Ave
>> Redwood City, CA 94063
>>
>> On Thu, Feb 22, 2018 at 7:42 AM, James Taylor wrote:
>>
>>> Another option would be to use dynamic columns[1] when querying across
>>> views. You’d have to disable column encoding [2] in this case.
>>>
>>> [1] http://phoenix.apache.org/dynamic_columns.html
>>> [2] http://phoenix.apache.org/columnencoding.html
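
A rough sketch of the dynamic-columns alternative against the t1/v1/v2/v3
schema quoted further down in this thread, assuming f1, f2, f3 were written
through the views and column encoding is disabled on t1:

    -- declare f1, f2, f3 as dynamic columns on the base table and read them
    -- in a single point query, without going through the individual views
    select pk, f1, f2, f3
    from t1 (f1 varchar, f2 varchar, f3 varchar)
    where pk = ?;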
>>>
>>> On Wed, Feb 21, 2018 at 11:57 PM Miles Spielberg  wrote:
>>>
 I believe each query in a UNION needs to have the same result tuple
 format, which would work in this toy example, but in the general case each
 view would have a different schema. We could make the result tuples conform
 with each other by selecting NULL literals for every column except those in
 a view. It would get quite verbose though. Assuming f1,f2,f3 all have
 incompatible types, were you suggesting something like this?

 Select f1, null, null from v1 where PK=?
 Union all
 Select null, f2, null from v2 where PK=?
 Union all
 Select null, null, f3 from v3 where PK=?

 We might just run separate parallel queries against each view and merge
 the results client side. I would guess this should perform well since the
 block cache can be leveraged for queries after the first.

 We could also use the HBase API to run a point row get. We'd have to
 reimplement decoding for Phoenix's column values, which is not ideal but
 quite doable.

 Sent from my iPhone

 On Feb 21, 2018, at 9:09 PM, James Taylor wrote:

 Have you tried a UNION ALL query on (f1, f2, f3) instead? It seems you're on
 a good track with multiple views over a single physical table (or a handful
 of them).

 On Wed, Feb 21, 2018 at 6:45 PM Miles Spielberg  wrote:

> I've done some experimentation with views, with a schema resembling
> this:
>
>> create table t1(
>>     pk bigint not null primary key
>> );
>>
>> create view v1(
>>     f1 varchar
>> ) AS SELECT * FROM t1;
>>
>> create INDEX v1_f1 ON v1(f1);
>>
>> create view v2(
>>     f2 varchar
>> ) AS SELECT * FROM t1;
>>
>> create INDEX v2_f2 ON v2(f2);
>>
>> create view v3(
>>     f3 varchar
>> ) AS SELECT * FROM t1;
>>
>> create INDEX v3_f3 ON v3(f3);
>>
> Most of the time we'll be accessing data via the indexed views, but we'd
> also like to be able to query all columns (f1, f2, f3) for a given pk. At
> the HBase level, this should be doable as a point get on t1. The SQL-y way
> to express this would

Re: LIMIT statement when loading data in Phoenix Spark module.

2018-03-07 Thread alexander . scherbatiy
Is there documentation which describes which queries will be propagated to the
server, and how, during data fetching with Phoenix Spark?

Thanks,
Alexandr.  

07.03.2018, 16:24, "Xavier Jodoin" :
> it will limit the number of rows fetched by the client
>
> On 2018-03-07 07:54 AM, alexander.scherba...@yandex.com wrote:
>>  Does it work such that only the limited number of rows is sent from each
>> HBase Region Server to the client?
>>
>>  I ask because I could use a WHERE clause in Spark SQL in the same way,
>> instead of passing the predicate.
>>
>>  Thanks,
>>  Alexandr.
>>
>>  07.03.2018, 15:35, "Xavier Jodoin" :
>>>  You can do it directly with spark sql
>>>
>>>  Xavier
>>>
>>>  On 2018-03-07 06:38 AM, alexander.scherba...@yandex.com wrote:
    Hello,

    I use the Phoenix Spark plugin to load data from HBase.

    There is the SparkSqlContextFunctions.phoenixTableAsDataFrame() method
    which allows getting a Dataset for the given table name, columns, and a
    predicate.

    Is it possible to also provide a LIMIT clause so that the number of
    retrieved rows is restricted?

    Thanks,
    Alexander.


Re: LIMIT statement when loading data in Phoenix Spark module.

2018-03-07 Thread Xavier Jodoin

it will limit the number of rows fetched by the client
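
A small sketch of how that can be expressed directly in Spark SQL, assuming
the Phoenix table was first registered as a Spark temporary view (for example
via the SparkSqlContextFunctions.phoenixTableAsDataFrame(...) call discussed
in this thread, followed by createOrReplaceTempView("my_table") on the
resulting DataFrame in Spark 2.x). The table and column names below are
placeholders:

    -- my_table is the temporary view backed by the Phoenix table;
    -- the LIMIT is part of the Spark SQL statement and, as noted above,
    -- it limits the number of rows fetched by the client
    SELECT COL1, COL2
    FROM my_table
    WHERE COL1 = 'some_value'
    LIMIT 100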


On 2018-03-07 07:54 AM, alexander.scherba...@yandex.com wrote:

Does it work such that only the limited number of rows is sent from each
HBase Region Server to the client?

I ask because I could use a WHERE clause in Spark SQL in the same way, instead
of passing the predicate.

Thanks,
Alexandr.

07.03.2018, 15:35, "Xavier Jodoin" :

You can do it directly with spark sql

Xavier

On 2018-03-07 06:38 AM, alexander.scherba...@yandex.com wrote:

  Hello,

  I use the Phoenix Spark plugin to load data from HBase.

  There is the SparkSqlContextFunctions.phoenixTableAsDataFrame() method which
  allows getting a Dataset for the given table name, columns, and a predicate.

  Is it possible to also provide a LIMIT clause so that the number of retrieved
  rows is restricted?

  Thanks,
  Alexander.



Re: LIMIT statement when loading data in Phoenix Spark module.

2018-03-07 Thread alexander . scherbatiy
Does it work such that only the limited number of rows is sent from each
HBase Region Server to the client?

I ask because I could use a WHERE clause in Spark SQL in the same way, instead
of passing the predicate.

Thanks,
Alexandr.

07.03.2018, 15:35, "Xavier Jodoin" :
> You can do it directly with spark sql
>
> Xavier
>
> On 2018-03-07 06:38 AM, alexander.scherba...@yandex.com wrote:
>>  Hello,
>>
>>  I use the Phoenix Spark plugin to load data from HBase.
>>
>>  There is the SparkSqlContextFunctions.phoenixTableAsDataFrame() method
>>  which allows getting a Dataset for the given table name, columns, and a
>>  predicate.
>>
>>  Is it possible to also provide a LIMIT clause so that the number of
>>  retrieved rows is restricted?
>>
>>  Thanks,
>>  Alexander.


Re: LIMIT statement when loading data in Phoenix Spark module.

2018-03-07 Thread Xavier Jodoin

You can do it directly with spark sql

Xavier


On 2018-03-07 06:38 AM, alexander.scherba...@yandex.com wrote:

Hello,

I use the Phoenix Spark plugin to load data from HBase.

There is the SparkSqlContextFunctions.phoenixTableAsDataFrame() method which
allows getting a Dataset for the given table name, columns, and a predicate.

Is it possible to also provide a LIMIT clause so that the number of retrieved
rows is restricted?

Thanks,
Alexander.





LIMIT statement when loading data in Phoenix Spark module.

2018-03-07 Thread alexander . scherbatiy
Hello,

I use the Phoenix Spark plugin to load data from HBase.

There is the SparkSqlContextFunctions.phoenixTableAsDataFrame() method which
allows getting a Dataset for the given table name, columns, and a predicate.

Is it possible to also provide a LIMIT clause so that the number of retrieved
rows is restricted?

Thanks,
Alexander. 




Re: Phoenix as a source for Spark processing

2018-03-07 Thread Stepan Migunov
Some more details... We have done some simple tests to compare read/write
performance of Spark+Hive and Spark+Phoenix. Here are the results:

Copying a table (with no transformations, about 800 million records):
Hive (TEZ): 752 sec

Spark:
From Hive to Hive: 2463 sec
From Phoenix to Hive: 13310 sec
From Hive to Phoenix: > 30240 sec

We use Spark 2.2.1, HBase 1.1.2, Phoenix 4.13, Hive 2.1.1.

So it seems that Spark + Phoenix leads to significant performance degradation.
Any thoughts?

On 2018/03/04 11:08:56, Stepan Migunov wrote:
> In our software we need to combine fast interactive access to the data with
> quite complex data processing. I know that Phoenix is intended for fast
> access, but I hoped that I could also use Phoenix as a source for complex
> processing with Spark. Unfortunately, Phoenix + Spark shows very poor
> performance. E.g., querying a big (about a billion records) table with
> DISTINCT takes about 2 hours. At the same time, the same task with a Hive
> source takes a few minutes. Is this expected? Does it mean that Phoenix is
> absolutely not suitable for batch processing with Spark, and that I should
> duplicate the data to Hive and process it there?
>