PySpark and Phoenix Dynamic Columns

2017-02-23 Thread Craig Roberts
Hi all,

I've got a (very) basic Spark application in Python that selects some basic
information from my Phoenix table. I can't quite figure out how (or even if
I can) select dynamic columns through this, however.

Here's what I have;

from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext

conf = SparkConf().setAppName("pysparkPhoenixLoad").setMaster("local")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

df = sqlContext.read.format("org.apache.phoenix.spark") \
   .option("table", """MYTABLE("dyamic_column" VARCHAR)""") \
   .option("zkUrl", "127.0.0.1:2181:/hbase-unsecure") \
   .load()

df.show()
df.printSchema()


I get a "org.apache.phoenix.schema.TableNotFoundException:" error for the
above.

If I try to register the data frame as a table and query it with SQL:

sqlContext.registerDataFrameAsTable(df, "test")
sqlContext.sql("""SELECT * FROM test("dynamic_column" VARCHAR)""")


I get a bit of a strange exception:

py4j.protocol.Py4JJavaError: An error occurred while calling o37.sql.
: java.lang.RuntimeException: [1.19] failure: ``union'' expected but `('
found

SELECT * FROM test("dynamic_column" VARCHAR)



Does anybody have a pointer on whether this is supported and how I might be
able to query a dynamic column? I haven't found much information on the
wider Internet about Spark + Phoenix integration for this kind of thing...
Simple selects are working. Final note: I have (rather stupidly) lower-cased
my column names in Phoenix, so I need to quote them when I execute a query
(I'll be changing this as soon as possible).
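
For reference, this is the plain Phoenix SQL shape of what I'm trying to do
through Spark (a sketch only; the ID key column and the sample value are made
up, the dynamic-column syntax itself is the documented Phoenix form):

    -- Write a dynamic column (declared inline with its type):
    UPSERT INTO MYTABLE (ID, "dynamic_column" VARCHAR) VALUES (1, 'example');

    -- Read it back by declaring it again in the FROM clause; quoted because
    -- the column name is lower-cased:
    SELECT ID, "dynamic_column"
    FROM MYTABLE("dynamic_column" VARCHAR)
    WHERE "dynamic_column" IS NOT NULL;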

Any assistance would be appreciated :)
-- Craig


Re: Select date range performance issue

2017-02-23 Thread Jonathan Leech
1. No, I am not confused. A skip scan would "skip" over entire ranges of obj_id 
and all create_dt values for it. This will only be effective if there are far 
fewer distinct values of obj_id than there are total rows. If there are too many 
distinct obj_ids then it either won't speed the query up at all, or not enough, 
but it's simple to try it and see.
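
A minimal sketch of the hinted query, reusing the table and columns from your
schema (SKIP_SCAN is a standard Phoenix hint):

    -- Ask Phoenix to skip-scan across obj_id prefixes instead of a full scan:
    SELECT /*+ SKIP_SCAN */ KEYWORD, SUM(COUNT)
    FROM MY_TABLE
    WHERE CREATE_DT > TO_TIMESTAMP('2016-03-01 00:00')
      AND CREATE_DT < TO_TIMESTAMP('2016-04-01 00:00')
    GROUP BY KEYWORD;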

2. Your index isn't used because it doesn't contain the other columns used in 
the query; i.e. your query isn't "covered". You get the column(s) defined in 
the index + anything in the row key. You can also use the "include" keyword to 
add other columns to the index. Alternatively, you can look at "local" indexes, 
or it may be possible to nest the query with a sub-select that fetches the 
desired primary key values from the index by create_dt and the others from the 
main table.
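
A sketch of a covered version of your index (OBJ_ID and KEYWORD are already
carried over from the primary key, so only COUNT needs to be included):

    CREATE INDEX IDX1_CREATE_DT ON MY_TABLE (CREATE_DT) INCLUDE (COUNT);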

3. No, not at all. Phoenix will assign the internal hbase timestamp of the row 
to whatever you set create_dt to. It can also automatically set it to the 
current time when you create the row, if you want it to. This has other 
implications, e.g. if you set a TTL, versions, etc. in hbase. It can speed up 
queries, especially those that execute on the most recent data written, but 
prior to hbase compaction. Advanced stuff, and performance is highly dependent 
on your specific use case and hbase compaction settings...
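
A sketch of the DDL, assuming the table can be recreated (ROW_TIMESTAMP is
declared on a date/time primary key column at CREATE TABLE time; see the
Phoenix ROW_TIMESTAMP docs for the exact restrictions):

    CREATE TABLE MY_TABLE (
        OBJ_ID VARCHAR(20) NOT NULL,
        CREATE_DT TIMESTAMP NOT NULL,
        KEYWORD VARCHAR(100) NOT NULL,
        COUNT INTEGER,
        -- Map CREATE_DT onto the native HBase cell timestamp:
        CONSTRAINT PK PRIMARY KEY (OBJ_ID, CREATE_DT ROW_TIMESTAMP, KEYWORD)
    );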

> On Feb 23, 2017, at 7:59 PM, NaHeon Kim  wrote:
> 
> Thanks for suggestion.
> 
> Here's further questions:
> 1. create_dt (not obj_id, I think you mixed them up) would have a large set of 
> distinct dates, so a SKIP_SCAN hint might not be useful.
> 
> 2. I created secondary index on create_dt
>    create index IDX1_CREATE_DT on MY_TABLE(CREATE_DT);
> 
> However, EXPLAIN still shows query plan of FULL SCAN.
> Giving index hint on SELECT doesn't work as well.
> 
> CLIENT 2-CHUNK 0 ROWS 0 BYTES PARALLEL 2-WAY FULL SCAN OVER NEWS_KEYWORD_COUNT
>SERVER FILTER BY (CREATE_DT > TIMESTAMP '2017-03-01 00:00:00.000' AND 
> CREATE_DT < TIMESTAMP '2017-04-01 00:00:00.000')
>SERVER AGGREGATE INTO DISTINCT ROWS BY [KEYWORD]
> CLIENT MERGE SORT
> CLIENT 100 ROW LIMIT
> 
> 3. ROW_TIMESTAMP is the current query execution time, right?
> Then it's not the right choice. :-(
> 
> 
> 2017-02-24 1:54 GMT+09:00 Jonathan Leech :
>> If there are not a large number of distinct values of obj_id, try a 
>> SKIP_SCAN hint. Otherwise, the secondary index should work, make sure it's 
>> actually used via explain. Finally, you might try the ROW_TIMESTAMP feature 
>> if it fits your use case.
>> 
>>> On Feb 22, 2017, at 11:30 PM, NaHeon Kim  wrote:
>>> 
>>> Hi all,
>>> I've seen a performance problem when selecting rows within a date range.
>>> 
>>> My table schema is:
>>> 
>>> CREATE TABLE MY_TABLE (
>>>OBJ_ID varchar(20) not null,
>>>CREATE_DT timestamp not null,
>>>KEYWORD varchar(100) not null,
>>>COUNT integer,
>>>CONSTRAINT PK PRIMARY KEY (OBJ_ID,CREATE_DT,KEYWORD)
>>> );
>>> 
>>> MY_TABLE has almost 5,200,000 rows,
>>> CREATE_DT has about 6 months range.
>>> 
>>> And executed query:
>>> 
>>> SELECT KEYWORD, SUM(COUNT)
>>> FROM MY_TABLE
>>> WHERE CREATE_DT > to_timestamp('2016-03-01 00:00')
>>> AND CREATE_DT < to_timestamp('2016-04-01 00:00')
>>> GROUP BY KEYWORD;
>>> 
>>> It took 46 seconds, much slower than expected, since CREATE_DT is part of the 
>>> row key.
>>> I created a secondary index on CREATE_DT but there's no improvement.
>>> 
>>> Query plan looks weird:
>>> CLIENT 2-CHUNK 0 ROWS 0 BYTES PARALLEL 2-WAY FULL SCAN OVER 
>>> NEWS_KEYWORD_COUNT
>>>  SERVER FILTER BY (CREATE_DT > TIMESTAMP '2017-03-01 00:00:00.000' AND 
>>> CREATE_DT < TIMESTAMP '2017-04-01 00:00:00.000')
>>>  SERVER AGGREGATE INTO DISTINCT ROWS BY [KEYWORD]
>>> CLIENT MERGE SORT
>>> CLIENT 100 ROW LIMIT
>>> 
>>> But if CREATE_DT comes first in the row key, the plan says a range scan will be done.
>>> 
>>> Any suggestion? : )
>>> 
>>> Thanks,
>>> NaHeon
>>> 
> 


Re: Select date range performance issue

2017-02-23 Thread NaHeon Kim
Thanks for suggestion.

Here's further questions:
1. create_dt (not obj_id, I think you mixed them up) would have a large set of
distinct dates, so a SKIP_SCAN hint might not be useful.

2. I created a secondary index on create_dt:
   create index IDX1_CREATE_DT on MY_TABLE(CREATE_DT);

However, EXPLAIN still shows a FULL SCAN query plan.
Giving an index hint on the SELECT doesn't work either (a sketch of the hinted
form follows the plan below).

CLIENT 2-CHUNK 0 ROWS 0 BYTES PARALLEL 2-WAY FULL SCAN OVER
NEWS_KEYWORD_COUNT
   SERVER FILTER BY (CREATE_DT > TIMESTAMP '2017-03-01 00:00:00.000' AND
CREATE_DT < TIMESTAMP '2017-04-01 00:00:00.000')
   SERVER AGGREGATE INTO DISTINCT ROWS BY [KEYWORD]
CLIENT MERGE SORT
CLIENT 100 ROW LIMIT
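
For reference, a sketch of the hinted form (the standard Phoenix
/*+ INDEX(table index) */ hint, using the index name above):

    SELECT /*+ INDEX(MY_TABLE IDX1_CREATE_DT) */ KEYWORD, SUM(COUNT)
    FROM MY_TABLE
    WHERE CREATE_DT > TO_TIMESTAMP('2016-03-01 00:00')
      AND CREATE_DT < TO_TIMESTAMP('2016-04-01 00:00')
    GROUP BY KEYWORD;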

3. ROW_TIMESTAMP is the current query execution time, right?
Then it's not the right choice. :-(


2017-02-24 1:54 GMT+09:00 Jonathan Leech :

> If there are not a large number of distinct values of obj_id, try a
> SKIP_SCAN hint. Otherwise, the secondary index should work, make sure it's
> actually used via explain. Finally, you might try the ROW_TIMESTAMP feature
> if it fits your use case.
>
> On Feb 22, 2017, at 11:30 PM, NaHeon Kim  wrote:
>
> Hi all,
> I've seen a performance problem when selecting rows within a date range.
>
> My table schema is:
>
> CREATE TABLE MY_TABLE (
>OBJ_ID varchar(20) not null,
>CREATE_DT timestamp not null,
>KEYWORD varchar(100) not null,
>COUNT integer,
>CONSTRAINT PK PRIMARY KEY (OBJ_ID,CREATE_DT,KEYWORD)
> );
>
> MY_TABLE has almost 5,200,000 rows,
> CREATE_DT has about 6 months range.
>
> And executed query:
>
> SELECT KEYWORD, SUM(COUNT)
> FROM MY_TABLE
> WHERE CREATE_DT > to_timestamp('2016-03-01 00:00')
> AND CREATE_DT < to_timestamp('2016-04-01 00:00')
> GROUP BY KEYWORD;
>
> It took 46 seconds, much slower than expected, since CREATE_DT is part of the
> row key.
> I created a secondary index on CREATE_DT but there's no improvement.
>
> Query plan looks weird:
> CLIENT 2-CHUNK 0 ROWS 0 BYTES PARALLEL 2-WAY FULL SCAN OVER
> NEWS_KEYWORD_COUNT
>  SERVER FILTER BY (CREATE_DT > TIMESTAMP '2017-03-01 00:00:00.000' AND
> CREATE_DT < TIMESTAMP '2017-04-01 00:00:00.000')
>  SERVER AGGREGATE INTO DISTINCT ROWS BY [KEYWORD]
> CLIENT MERGE SORT
> CLIENT 100 ROW LIMIT
>
> But if CREATE_DT comes first in the row key, the plan says a range scan will be done.
>
> Any suggestion? : )
>
> Thanks,
> NaHeon
>
>


Re: Phoenix Query Server tenant_id

2017-02-23 Thread Michael Young
The missing quotes were the issue.  That fixed it.  Thanks!


On Wed, Feb 22, 2017 at 8:16 PM, Josh Elser  wrote:

> Also, remember that Bash is going to interpret that semi-colon in your URL
> if you don't quote it. It will be treated as two separate commands:
>
> `/usr/hdp/2.5.0.0-1245/phoenix/bin/sqlline-thin.py
> http://pqs1.mydomain.com:8765` and `TenantId=myTenant`
>
> Please try `/usr/hdp/2.5.0.0-1245/phoenix/bin/sqlline-thin.py
> 'http://pqs1.mydomain.com:8765;TenantId=myTenant'`, and, if that still
> does not work, feel free to open a JIRA issue to investigate why this is
> happening.
>
> Michael Young wrote:
>
>> In response to your question:
>>  >> So, just that I'm on the same page as you, when you invoke the Java
>> application yourself, it works fine, but when you use sqlline-thin.py,
>> the extra argument is dropped?
>> Yes. The property gets passed when invoking the Java application, but
>> not when adding it to the sqlline-thin.py command line.
>> It is simple to fix in my local dev test environment, but not in our
>> production environment, unfortunately.
>> We are using Phoenix 4.7 (from HDP 2.5 Community release).
>>
>> On Wed, Feb 22, 2017 at 4:07 PM, Josh Elser wrote:
>>
>> So, just that I'm on the same page as you, when you invoke the Java
>> application yourself, it works fine, but when you use
>> sqlline-thin.py, the extra argument is dropped?
>>
>> If so, that just sounds like a bug in sqlline-thin.py that we could
>> fix...
>>
>> If you're skilled in Python (or even just adventurous), I'd guess it
>> would be a relatively simple fix :)
>>
>> Michael Young wrote:
>>
>> Hi Josh,
>>
>> you are right, the TenantId property does get passed through
>> when I run
>> the command by hand (see below).
>>
>> Phoenix Version: phoenix-4.7.0.2.5.0.0-1245
>>
>> java $PHOENIX_OPTS \
>> -cp $CLASSPATH \
>> -Dlog4j.configuration=file:/usr/hdp/2.5.0.0-1245/phoenix/bin/log4j.properties \
>> org.apache.phoenix.queryserver.client.SqllineWrapper \
>> -d org.apache.phoenix.queryserver.client.Driver \
>> -u "jdbc:phoenix:thin:url=http://pqs1.mydomain.com:8765;serialization=PROTOBUF;TenantId=myTenant" \
>> -n none -p none --color=true --fastConnect=false --verbose=true \
>> --incremental=false --isolation=TRANSACTION_READ_COMMITTED
>>
>> However, with Phoenix version 4.7.0.2.5.0.0-1245 (which comes with HDP
>> 2.5), the python script doesn't pass any property along. Here's the
>> obfuscated output when I connect to PQS:
>>
>> [prompt]$ /usr/hdp/2.5.0.0-1245/phoenix/bin/sqlline-thin.py http://pqs1.mydomain.com:8765;TenantId=myTenant
>>
>> Setting property: [incremental, false]
>> Setting property: [isolation, TRANSACTION_READ_COMMITTED]
>> issuing: !connect jdbc:phoenix:thin:url=http://pqs1.mydomain:8765;serialization=PROTOBUF none none org.apache.phoenix.queryserver.client.Driver
>> Connecting to jdbc:phoenix:thin:url=http://pqs1.mydomain:8765;serialization=PROTOBUF
>> SLF4J: Class path contains multiple SLF4J bindings.
>> SLF4J: Found binding in [jar:file:/usr/hdp/2.5.0.0-1245/phoenix/phoenix-4.7.0.2.5.0.0-1245-thin-client.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> SLF4J: Found binding in [jar:file:/usr/hdp/2.5.0.0-1245/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
>> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
>> Connected to: Apache Phoenix (version unknown version)
>> Driver: Phoenix Remote JDBC Driver (version unknown version)
>> Autocommit status: true
>> Transaction isolation: TRANSACTION_READ_COMMITTED
>> Building list of tables and columns for tab-completion (set fastconnect to true to skip)...
>> 7603/7603 (100%) Done
>> Done
>> sqlline version 1.1.8
>> 0: jdbc:phoenix:thin:url=http://pqs1.mydomain> !list
>> 1 active connection:
>>  #0  open jdbc:phoenix:thin:url=http://pqs1.mydomain:8765;serialization=PROTOBUF
>>
>>
>> Is this something that has changed in newer versions of Phoenix?
>>
>> On Mon, Feb 20, 2017 at 1:47 PM, Josh Elser wrote:
>>
>>  See
>> 

Re: Phoenix query performance

2017-02-23 Thread Pradheep Shanmugam
Hi Arvind,

The row key is PARENTID, OWNERORGID, MILESTONETYPEID, PARENTTYPE.
Each parentid will have a list of MILESTONETYPEID values (19661, 1, 2, etc.), so
your query will return all the parentids. I am looking for the parentids that do
not have a given MILESTONETYPEID (19661 here), as sketched below.
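
In other words it is an anti-join. A sketch of the intent in plain SQL, with the
same filters as the original query (whether Phoenix's NOT EXISTS subquery support
executes this any faster than the LEFT OUTER JOIN below is something I'd have to
test):

    -- Parent ids in the time window that have no row with milestone type 19661:
    SELECT msbo1.PARENTID
    FROM msbo_phoenix_comp_rowkey msbo1
    WHERE msbo1.PARENTTYPE = 'SHIPMENT'
      AND msbo1.OWNERORGID = 100
      AND msbo1.PARENTREFERENCETIME BETWEEN 1479964000 AND 1480464000
      AND NOT EXISTS (
            SELECT 1
            FROM msbo_phoenix_comp_rowkey msbo2
            WHERE msbo2.PARENTID = msbo1.PARENTID
              AND msbo2.PARENTTYPE = 'SHIPMENT'
              AND msbo2.OWNERORGID = 100
              AND msbo2.MILESTONETYPEID = 19661
      )
    GROUP BY msbo1.PARENTID
    ORDER BY msbo1.PARENTID;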

Thanks,
Pradheep

From: Arvind S
Reply-To: "user@phoenix.apache.org"
Date: Thursday, February 23, 2017 at 1:19 AM
To: "user@phoenix.apache.org"
Subject: Re: Phoenix query performance

why cant you reduce your query to

  select msbo1.PARENTID
  from msbo_phoenix_comp_rowkey msbo1
  where msbo1.PARENTTYPE = 'SHIPMENT'
   and msbo1.OWNERORGID = 100
   and msbo1.MILESTONETYPEID != 19661
   and msbo1.PARENTREFERENCETIME between 1479964000 and 1480464000
  group by msbo1.PARENTID
  order by msbo1.PARENTID


??

looks like that's what you get by this left outer.



Cheers !!
Arvind

On 22 February 2017 at 22:58, Pradheep Shanmugam wrote:
Hi,

We have an HBase cluster with 8 region servers with 20G memory.
We have a table with 1 column family, along with a secondary index.
The following query took only a few milliseconds when we had less data (< 1 million rows).
After adding more data (~30M rows) the performance declined and it took about a
minute or more (not stable):

select msbo1.PARENTID
  from msbo_phoenix_comp_rowkey msbo1
  left outer join (
 select PARENTID,MILESTONETYPEID
   from msbo_phoenix_comp_rowkey
  where PARENTREFERENCETIME between 1479964000 and 
1480464000
and OWNERORGID = 100
and PARENTTYPE = 'SHIPMENT'
and MILESTONETYPEID = 19661
group by PARENTID,MILESTONETYPEID
 ) msbo2
  on msbo1.PARENTID = msbo2.PARENTID
  where msbo1.PARENTTYPE = 'SHIPMENT'
   and msbo1.OWNERORGID = 100
   and msbo2.MILESTONETYPEID is null
   and msbo1.PARENTREFERENCETIME between 1479964000 and 
1480464000
group by msbo1.PARENTID
  order by msbo1.PARENTID

The RHS returns about 500K rows, the LHS about 18M rows, and the final result is
about 500K rows.

MSBO_PHOENIX_COMP_ROWKEY_INDEX_PARENTTYPE_OWNERORGID_MILESTONETYPEID_PARENTREFERENCETIME
  is the index

Query plan:
CLIENT 8-CHUNK PARALLEL 8-WAY RANGE SCAN OVER 
MSBO_PHOENIX_COMP_ROWKEY_INDEX_PARENTTYPE_OWNERORGID_MILESTONETYPEID_PARENTREFERENCETIME
 [0,'SHIPMENT',100]
SERVER FILTER BY FIRST KEY ONLY AND 
(TO_UNSIGNED_LONG("PARENTREFERENCETIME") >= 1477958400 AND 
TO_UNSIGNED_LONG("PARENTREFERENCETIME") <= 1480464000)
SERVER AGGREGATE INTO DISTINCT ROWS BY ["MSBO1.:PARENTID"]
CLIENT MERGE SORT
PARALLEL LEFT-JOIN TABLE 0
CLIENT 8-CHUNK PARALLEL 8-WAY RANGE SCAN OVER 
MSBO_PHOENIX_COMP_ROWKEY_INDEX_PARENTTYPE_OWNERORGID_MILESTONETYPEID_PARENTREFERENCETIME
 [0,'SHIPMENT',100,19661,1,477,958,400] - [0,'SHIPMENT',100,19661,1,480,464,000]
SERVER FILTER BY FIRST KEY ONLY
SERVER AGGREGATE INTO DISTINCT ROWS BY ["PARENTID", 
"MILESTONETYPEID"]
CLIENT MERGE SORT
AFTER-JOIN SERVER FILTER BY MSBO2.MILESTONETYPEID IS NULL

Attached the phoenix log.
I see the caching is set to 100 and "maxResultSize":2097152. Is that something
that can be tuned to help?
Is it the client merge sort that is consuming most of the time, and can it be
improved? Is there any other tuning possible?

Thanks,
Pradheep



Re: Select date range performance issue

2017-02-23 Thread Jonathan Leech
If there are not a large number of distinct values of obj_id, try a SKIP_SCAN 
hint. Otherwise, the secondary index should work, make sure it's actually used 
via explain. Finally, you might try the ROW_TIMESTAMP feature if it fits your 
use case.

> On Feb 22, 2017, at 11:30 PM, NaHeon Kim  wrote:
> 
> Hi all,
> I've seen a performance problem when selecting rows within a date range.
> 
> My table schema is:
> 
> CREATE TABLE MY_TABLE (
>OBJ_ID varchar(20) not null,
>CREATE_DT timestamp not null,
>KEYWORD varchar(100) not null,
>COUNT integer,
>CONSTRAINT PK PRIMARY KEY (OBJ_ID,CREATE_DT,KEYWORD)
> );
> 
> MY_TABLE has almost 5,200,000 rows,
> CREATE_DT has about 6 months range.
> 
> And executed query:
> 
> SELECT KEYWORD, SUM(COUNT)
> FROM MY_TABLE
> WHERE CREATE_DT > to_timestamp('2016-03-01 00:00')
> AND CREATE_DT < to_timestamp('2016-04-01 00:00')
> GROUP BY KEYWORD;
> 
> It took 46 seconds, much slower than expected, since CREATE_DT is part of the 
> row key.
> I created a secondary index on CREATE_DT but there's no improvement.
> 
> Query plan looks weird:
> CLIENT 2-CHUNK 0 ROWS 0 BYTES PARALLEL 2-WAY FULL SCAN OVER NEWS_KEYWORD_COUNT
>  SERVER FILTER BY (CREATE_DT > TIMESTAMP '2017-03-01 00:00:00.000' AND 
> CREATE_DT < TIMESTAMP '2017-04-01 00:00:00.000')
>  SERVER AGGREGATE INTO DISTINCT ROWS BY [KEYWORD]
> CLIENT MERGE SORT
> CLIENT 100 ROW LIMIT
> 
> But if CREATE_DT comes first in the row key, the plan says a range scan will be done.
> 
> Any suggestion? : )
> 
> Thanks,
> NaHeon
> 


Re: Are values of a sequence deleted after an incrementation?

2017-02-23 Thread Josh Elser

I believe a sequence only tracks its current value.

When your client requests 100 values, it would use 1-100, but Phoenix 
only needs to know that the next value it can give out is 101. I'm not 
100% sure, but I think this is how it works.
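
A small sketch of what I mean (the sequence and table names are made up; CACHE
controls how many values a client reserves per round trip):

    -- Only the sequence's current/next value is persisted, not each
    -- individual value handed out to clients:
    CREATE SEQUENCE my_seq START WITH 1 INCREMENT BY 1 CACHE 100;

    -- Hypothetical use of the sequence when writing rows:
    UPSERT INTO my_table (id, name) VALUES (NEXT VALUE FOR my_seq, 'example');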


What are you concerned about?

Cheyenne Forbes wrote:

I am not sure how sequences work in Phoenix but something popped up in
my mind.

If I request 100 values from a sequence, will there be 100 values stored
in the database or just 1 value which is "100", telling Phoenix that this
is the number to be incremented?