Re: Is first query to a table region way slower?

2018-01-29 Thread Mujtaba Chohan
Just to remove one variable, can you repeat the same test after truncating
Phoenix Stats table? (either truncate SYSTEM.STATS from HBase shell or use
sql: delete from SYSTEM.STATS)
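For example, either of these should do it (SYSTEM.STATS is the Phoenix stats
table, so both commands clear the same data):

    -- from sqlline
    DELETE FROM SYSTEM.STATS;

    # or from the HBase shell
    truncate 'SYSTEM.STATS'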

On Mon, Jan 29, 2018 at 4:36 PM, Pedro Boado  wrote:

> Yes there is a rs.next().
>
> In fact if I run this SELECT * FROM table LIMIT 1 in a loop for four
> different tables in the same cluster I get relatively consistent response
> times across iterations, but same pattern if I execute the code over and
> over again. So basically first call per table is way slower.
>
> And for some reason call to TABLE4 is way slower than the others ( only
> difference is this table being quite big compared to the others ) .
>
> By hooking a jmeter to the vm I see new threads being created and
> destroyed in both hconnection and phoenix threadpools per loop ( I am not
> pooling connections ) , and quite a lot of network IO in the IPC Network
> thread to one of the RS during the 4 seconds the first query takes (
> basically this thread is doing Net IO during 60-70% of the 4200 msec ) .
>
>
>  Starting healthcheck '1'
>  Checking table TABLE1 state took 874 msec.
>  Checking table TABLE2 state took 471 msec.
>  Checking table TABLE3 state took 844 msec.
>  Checking table TABLE4 state took 4234 msec.
>  Starting healthcheck '2'
>  Checking table TABLE1 state took 103 msec.
>  Checking table TABLE2 state took 98 msec.
>  Checking table TABLE3 state took 78 msec.
>  Checking table TABLE4 state took 148 msec.
>  Starting healthcheck '3'
>  Checking table TABLE1 state took 351 msec.
>  Checking table TABLE2 state took 108 msec.
>  Checking table TABLE3 state took 84 msec.
>  Checking table TABLE4 state took 137 msec.
>  Starting healthcheck '4'
>  Checking table TABLE1 state took 102 msec.
>  Checking table TABLE2 state took 94 msec.
>  Checking table TABLE3 state took 77 msec.
>  Checking table TABLE4 state took 138 msec.
>  Starting healthcheck '5'
>  Checking table TABLE1 state took 103 msec.
>  Checking table TABLE2 state took 93 msec.
>  Checking table TABLE3 state took 77 msec.
>  Checking table TABLE4 state took 142 msec.
> ...
>
>
> Any other idea maybe?
>
>
>
>
>
> On 29 Jan 2018 01:55, "James Taylor"  wrote:
>
>> Did you do an rs.next() on the first query? Sounds related to HConnection
>> establishment. Also, the least expensive query is SELECT 1 FROM T LIMIT 1.
>>
>> Thanks,
>> James
>>
>> On Sun, Jan 28, 2018 at 5:39 PM Pedro Boado 
>> wrote:
>>
>>> Hi all,
>>>
>>> I'm running into issues with a java springboot app that ends up querying
>>> a Phoenix cluster (from out of the cluster) through the non-thin client.
>>>
>>> Basically this application has a high latency - around 2 to 4 seconds -
>>> for the first query per  primary key to each region of a table with 180M
>>> records ( and 10 regions ) . Following calls - for different keys within
>>> the same region - have an average response time of ~60-80ms. No secondary
>>> indexes involved. No writes to the table at all during these queries.
>>>
>>> I don't think it's related to HConnection establishment, as the connection
>>> was already established before the query ran ( a SELECT * FROM table LIMIT 1
>>> is executed as soon as the datasource is created )
>>>
>>> I've been doing some quick profiling and almost all the time is spent
>>> inside the actual jdbc call.
>>>
>>> So here's the question: in your experience, is this normal behaviour -
>>> so I have to workaround the problem from application code warming up
>>> connections during app startup -  or is it something unusual? Any
>>> experience reducing first query latencies?
>>>
>>> Thanks!
>>>
>>>


Re: Efficient way to get the row count of a table

2017-12-19 Thread Mujtaba Chohan
Another alternative outside Phoenix is to use the RowCounter M/R job:
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html
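For example, a typical invocation from the command line (table name is a
placeholder) looks like:

    hbase org.apache.hadoop.hbase.mapreduce.RowCounter MY_TABLE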

On Tue, Dec 19, 2017 at 3:18 PM, James Taylor 
wrote:

> If it needs to be 100% accurate, then count(*) is the only way. If your
> data is write-once data, you might be able to track the row count at the
> application level through some kind of atomic counter in a different table
> (but this will likely be brittle). If you can live with an estimate, you
> could enable statistics [1], optionally configuring Phoenix not to use
> stats for parallelization [2], and query the SYSTEM.STATS table to get an
> estimate [3].
>
> Another interesting alternative if you want the approximate row count when
> you have a where clause would be to use the new table sampling feature [4].
> You'd also want stats enabled for this to be more accurate too.
>
> Thanks,
> James
>
>
> [1] https://phoenix.apache.org/update_statistics.html
> [2] phoenix.use.stats.parallelization=false
> [3] select sum(GUIDE_POSTS_ROW_COUNT) from SYSTEM.STATS where
> physical_name='my_schema.my_table'
>  and COLUMN_FAMILY='my_first_column_family' -- necessary only if you
> have multiple column families
> [4] https://phoenix.apache.org/tablesample.html
>
> On Tue, Dec 19, 2017 at 2:57 PM, Jins George 
> wrote:
>
>> Hi,
>>
>> Is there a way to get the total row count of a phoenix table without
>> running select count(*) from table ?
>> my use case is to monitor the record count in a table every x minutes, so
>> didn't want to put load on the system by running a select count(*) query.
>>
>> Thanks,
>> Jins George
>>
>
>


Re: Spark & UpgradeInProgressException: Cluster is being concurrently upgraded from 4.11.x to 4.12.x

2017-11-10 Thread Mujtaba Chohan
Probably being hit by https://issues.apache.org/jira/browse/PHOENIX-4335.
Please upgrade to 4.13.0 which will be available by EOD today.

On Fri, Nov 10, 2017 at 8:37 AM, Stepan Migunov <
stepan.migu...@firstlinesoftware.com> wrote:

> Hi,
>
>
>
> I have just upgraded my cluster to Phoenix 4.12 and got an issue with
> tasks running on Spark 2.2 (yarn cluster mode). Any attempts to use method
> phoenixTableAsDataFrame to load data from existing database causes an
> exception (see below).
>
>
>
> The tasks worked fine on version 4.11. I have checked the connection with
> sqlline - it works now and shows that the version is 4.12. Moreover, I have
> noticed that if I limit the number of executors to one, the Spark task
> executes successfully too!
>
>
>
> It looks like the executors running in parallel "interfere" with each other
> and cannot acquire the version mutex.
>
>
>
> Any suggestions please?
>
>
>
> *final Connection connection =
> ConnectionUtil.getInputConnection(configuration, overridingProps);*
>
> *User class threw exception: org.apache.spark.SparkException: Job aborted
> due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent
> failure: Lost task 0.3 in stage 0.0 (TID 36, n7701-hdp005, executor 26):
> java.lang.RuntimeException:
> org.apache.phoenix.exception.UpgradeInProgressException: Cluster is being
> concurrently upgraded from 4.11.x to 4.12.x. Please retry establishing
> connection.*
>
> *at
> org.apache.phoenix.mapreduce.PhoenixInputFormat.getQueryPlan(PhoenixInputFormat.java:201)*
>
> *at
> org.apache.phoenix.mapreduce.PhoenixInputFormat.createRecordReader(PhoenixInputFormat.java:76)*
>
> *at
> org.apache.spark.rdd.NewHadoopRDD$$anon$1.liftedTree1$1(NewHadoopRDD.scala:180)*
>
> *at
> org.apache.spark.rdd.NewHadoopRDD$$anon$1.(NewHadoopRDD.scala:179)*
>
> *at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:134)*
>
> *at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)*
>
> *at org.apache.phoenix.spark.PhoenixRDD.compute(PhoenixRDD.scala:64)*
>
> *at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)*
>
> *at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)*
>
> *at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)*
>
> *at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)*
>
> *at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)*
>
> *at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)*
>
> *at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)*
>
> *at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)*
>
> *at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)*
>
> *at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)*
>
> *at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)*
>
> *at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)*
>
> *at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)*
>
> *at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)*
>
> *at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)*
>
> *at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)*
>
> *at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)*
>
> *at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)*
>
> *at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)*
>
> *at org.apache.spark.scheduler.Task.run(Task.scala:108)*
>
> *at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)*
>
> *at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)*
>
> *at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)*
>
> *at java.lang.Thread.run(Thread.java:745)*
>
> *Caused by: org.apache.phoenix.exception.UpgradeInProgressException:
> Cluster is being concurrently upgraded from 4.11.x to 4.12.x. Please retry
> establishing connection.*
>
> *at
> org.apache.phoenix.query.ConnectionQueryServicesImpl.acquireUpgradeMutex(ConnectionQueryServicesImpl.java:3173)*
>
> *at
> org.apache.phoenix.query.ConnectionQueryServicesImpl.upgradeSystemTables(ConnectionQueryServicesImpl.java:2567)*
>
> *at
> org.apache.phoenix.query.ConnectionQueryServicesImpl$12.call(ConnectionQueryServicesImpl.java:2440)*
>
> *at
> org.apache.phoenix.query.ConnectionQueryServicesImpl$12.call(ConnectionQueryServicesImpl.java:2360)*
>
> *at
> org.apache.phoenix.util.PhoenixContextExecutor.call(PhoenixContextExecutor.java:76)*
>
> *at
> org.apache.phoenix.query.ConnectionQueryServicesImpl.init(ConnectionQueryServicesImpl.java:2360)*
>
> *at
> org.apache.phoenix.jdbc.PhoenixDriver.getConnectionQueryServices(PhoenixDriver.java:255)*
>
> *at
> org.apache.phoenix.jdbc.PhoenixEmbeddedDriver.createConnection(PhoenixEmbeddedDriver.java:150)*
>
> *at org.apache.phoenix.jdbc.PhoenixDriver.connect(PhoenixDriver.java:221)*
>
> *at java.sql.DriverManager.getConnection(DriverManager.java:664)*
>
> *at 

Re: Short Tables names and column names

2017-05-30 Thread Mujtaba Chohan
Column mapping is applicable to mutable tables as well. It's only
SINGLE_CELL_ARRAY_WITH_OFFSETS that works with immutable data. So keep the
column family name short (i.e. not the table name); that, with Phoenix column
mapping enabled, will cover the points you mentioned in the original post.
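For example, a rough DDL sketch (table/column names are made up, and the
COLUMN_ENCODED_BYTES property comes from the column-encoding page linked in my
earlier reply):

    CREATE TABLE MY_SCHEMA.EVENTS (
        EVENT_ID VARCHAR PRIMARY KEY,
        d.MY_VERY_IMPORTANT_ATTRIBUTE VARCHAR,  -- long column names are fine with column mapping
        d.ANOTHER_DESCRIPTIVE_METRIC BIGINT     -- "d" keeps the column family name short
    ) COLUMN_ENCODED_BYTES = 1;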

On Tue, May 30, 2017 at 1:36 PM, Ash N <742...@gmail.com> wrote:

> Hi Mujtaba,
>
> Thank you for your immediate response.
>
> I read the link and the blog, and still cannot conclude the advice for
> MUTABLE tables :(
>
> Invariably most of our tables are MUTABLE, meaning updates *will *occur.
>
> So in that case...
>
> Should we keep our *table *names short?
> and
> Should we keep our *column *names short?
>
>
> Based on the link and the blog - I understand that we can get away with
> long names for IMMUTABLE tables - (Updates do not occur)
>
> please help.
>
> thanks,
> -ash
>
>
> On Tue, May 30, 2017 at 4:16 PM, Mujtaba Chohan <mujt...@apache.org>
> wrote:
>
>> Holds true for Phoenix as well and it provides built-in support for
>> column mapping so you can still use long column names, see
>> http://phoenix.apache.org/columnencoding.html. Also see related
>> performance optimization, SINGLE_CELL_ARRAY_WITH_OFFSETS encoding for
>> immutable data.
>>
>> On Tue, May 30, 2017 at 1:09 PM, Ash N <742...@gmail.com> wrote:
>>
>>> Hello All,
>>>
>>> it is recommended to keep HBase column family and attribute names short.
>>> does this recommendation apply to Apache Phoenix as well?
>>> Keep the table and column names short?
>>>
>>> 6.3.2.1. Column Families
>>>
>>> Try to keep the ColumnFamily names as small as possible, preferably one
>>> character (e.g. "d" for data/default).
>>>
>>> See Section 9.7.5.4, “KeyValue”
>>> <http://hbase.apache.org/0.94/book/regions.arch.html#keyvalue> for more
>>> information on HBase stores data internally to see why this is important.
>>> 6.3.2.2. Attributes
>>>
>>> Although verbose attribute names (e.g., "myVeryImportantAttribute") are
>>> easier to read, prefer shorter attribute names (e.g., "via") to store in
>>> HBase.
>>>
>>> See Section 9.7.5.4, “KeyValue”
>>> <http://hbase.apache.org/0.94/book/regions.arch.html#keyvalue> for more
>>> information on HBase stores data internally to see why this is important.
>>>
>>> thanks for your help.
>>>
>>> -ash
>>>
>>>
>>>
>>>
>>
>


Re: Short Tables names and column names

2017-05-30 Thread Mujtaba Chohan
Holds true for Phoenix as well and it provides built-in support for column
mapping so you can still use long column names, see
http://phoenix.apache.org/columnencoding.html. Also see related performance
optimization, SINGLE_CELL_ARRAY_WITH_OFFSETS encoding for immutable data.

On Tue, May 30, 2017 at 1:09 PM, Ash N <742...@gmail.com> wrote:

> Hello All,
>
> it is recommended to keep HBase column family and attribute names short.
> does this recommendation apply to Apache Phoenix as well?
> Keep the table and column names short?
>
> 6.3.2.1. Column Families
>
> Try to keep the ColumnFamily names as small as possible, preferably one
> character (e.g. "d" for data/default).
>
> See Section 9.7.5.4, “KeyValue”
>  for more
> information on HBase stores data internally to see why this is important.
> 6.3.2.2. Attributes
>
> Although verbose attribute names (e.g., "myVeryImportantAttribute") are
> easier to read, prefer shorter attribute names (e.g., "via") to store in
> HBase.
>
> See Section 9.7.5.4, “KeyValue”
>  for more
> information on HBase stores data internally to see why this is important.
>
> thanks for your help.
>
> -ash
>
>
>
>


Re: Export large query results to CSV

2017-05-15 Thread Mujtaba Chohan
You might be able to use sqlline to export. Use !outputformat csv and
!record commands to export as CSV locally.
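Roughly (output path and table name are placeholders):

    !outputformat csv
    !record /tmp/query_output.csv
    SELECT * FROM MY_TABLE;
    !record

The second !record stops recording.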

On Sun, May 14, 2017 at 8:22 PM, Josh Elser  wrote:

> I am not aware of any mechanisms in Phoenix that will automatically write
> formatted data, locally or remotely. This will require you to write some
> code.
>
>
> cmbendre wrote:
>
>> Hi,
>>
>> Some of our queries on Phoenix cluster gives millions of rows as a result.
>> How do i export these results to a csv file or s3 location ?
>>
>> The query output size is approximately in range of GBs
>>
>> Thanks,
>> Chaitanya
>>
>>
>>
>> --
>> View this message in context: http://apache-phoenix-user-lis
>> t.1124778.n5.nabble.com/Export-large-query-results-to-CSV-tp3530.html
>> Sent from the Apache Phoenix User List mailing list archive at Nabble.com.
>>
>


Re: Using Apache perf with Hbase 1.1

2016-10-18 Thread Mujtaba Chohan
>
> Cannot get all table regions
>

Check that there are no offline regions. See the related thread here.

On Tue, Oct 18, 2016 at 2:11 PM, Pradheep Shanmugam <
pradheep.shanmu...@infor.com> wrote:

> Hi,
>
> I am trying to connect pherf to hbase cluster running hbase 1.1 using
> below. Could you please help me connect pherf to hbase cluster running v 1.1
>
> java -Xms512m -Xmx3072m  -cp "/home/ambari/pherf/phoenix/
> bin/../phoenix-pherf/config:/etc/hbase/conf:/home/ambari/
> pherf/phoenix/bin/../phoenix-client/target/phoenix-server-
> client-4.7.0-HBase-1.1.jar:/home/ambari/pherf/phoenix/bin/
> ../phoenix-pherf/target/phoenix-pherf-4.8.1-HBase-1.1.jar"
> -Dlog4j.configuration=file:/home/ambari/pherf/phoenix/bin/log4j.properties
> org.apache.phoenix.pherf.Pherf -drop all -l -q -z hbase-perf-rs1
> -schemaFile '.*user_defined_schema.sql' -scenarioFile
> '.*user_defined_scenario.xml’
> And I get below exception
>
> Exception in thread "main" java.lang.NoClassDefFoundError:
> org/apache/commons/cli/ParseException
> at java.lang.Class.getDeclaredMethods0(Native Method)
> at java.lang.Class.privateGetDeclaredMethods(Class.java:2615)
> at java.lang.Class.getMethod0(Class.java:2856)
> at java.lang.Class.getMethod(Class.java:1668)
> at sun.launcher.LauncherHelper.getMainMethod(LauncherHelper.java:494)
> at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:486)
> Caused by: java.lang.ClassNotFoundException: org.apache.commons.cli.
> ParseException
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> ... 6 more
>
> When I tried to connect using below,
> java -Xms512m -Xmx3072m  -cp "/home/ambari/pherf/phoenix/
> bin/../phoenix-pherf/config:/etc/hbase/conf:/home/ambari/
> pherf/phoenix/bin/../phoenix-client/target/phoenix-4.9.0-
> HBase-1.2-SNAPSHOT-client.jar:/home/ambari/pherf/phoenix/
> bin/../phoenix-pherf/target/phoenix-pherf-4.9.0-HBase-1.2-SNAPSHOT-minimal.jar"
> -Dlog4j.configuration=file:/home/ambari/pherf/phoenix/bin/log4j.properties
> org.apache.phoenix.pherf.Pherf -drop all -l -q -z hbase-perf-rs1
> -schemaFile '.*user_defined_schema.sql' -scenarioFile
> '.*user_defined_scenario.xml’
>
> I got below error. I thought it could be because of connecting to hbase
> 1.1 with 1.2 client?
>
> java.sql.SQLException: ERROR 1102 (XCL02): Cannot get all table regions.
> at org.apache.phoenix.exception.SQLExceptionCode$Factory$1.
> newException(SQLExceptionCode.java:457)
> at org.apache.phoenix.exception.SQLExceptionInfo.buildException(
> SQLExceptionInfo.java:145)
> at org.apache.phoenix.query.ConnectionQueryServicesImpl.
> getAllTableRegions(ConnectionQueryServicesImpl.java:549)
> at org.apache.phoenix.iterate.BaseResultIterators.getParallelScans(
> BaseResultIterators.java:542)
> at org.apache.phoenix.iterate.BaseResultIterators.getParallelScans(
> BaseResultIterators.java:477)
> at org.apache.phoenix.iterate.BaseResultIterators.(
> BaseResultIterators.java:370)
> at org.apache.phoenix.iterate.ParallelIterators.(
> ParallelIterators.java:60)
> at org.apache.phoenix.execute.ScanPlan.newIterator(ScanPlan.java:218)
> at org.apache.phoenix.execute.BaseQueryPlan.iterator(
> BaseQueryPlan.java:341)
> at org.apache.phoenix.execute.BaseQueryPlan.iterator(
> BaseQueryPlan.java:206)
> at org.apache.phoenix.jdbc.PhoenixStatement$1.call(
> PhoenixStatement.java:290)
> at org.apache.phoenix.jdbc.PhoenixStatement$1.call(
> PhoenixStatement.java:270)
> at org.apache.phoenix.call.CallRunner.run(CallRunner.java:53)
> at org.apache.phoenix.jdbc.PhoenixStatement.executeQuery(
> PhoenixStatement.java:269)
> at org.apache.phoenix.jdbc.PhoenixStatement.executeQuery(
> PhoenixStatement.java:1476)
> at org.apache.phoenix.jdbc.PhoenixDatabaseMetaData.getTables(
> PhoenixDatabaseMetaData.java:1149)
> at org.apache.phoenix.pherf.util.PhoenixUtil.getTableMetaData(
> PhoenixUtil.java:220)
> at org.apache.phoenix.pherf.util.PhoenixUtil.deleteTables(
> PhoenixUtil.java:192)
> at org.apache.phoenix.pherf.Pherf.run(Pherf.java:234)
> at org.apache.phoenix.pherf.Pherf.main(Pherf.java:188)
> Caused by: java.io.IOException: hconnection-0xa59a583 closed
> at org.apache.hadoop.hbase.client.ConnectionManager$
> HConnectionImplementation.getKeepAliveZooKeeperWatcher(
> ConnectionManager.java:1685)
> at org.apache.hadoop.hbase.client.ZooKeeperRegistry.isTableOnlineState(
> ZooKeeperRegistry.java:122)
> at org.apache.hadoop.hbase.client.ConnectionManager$
> HConnectionImplementation.isTableDisabled(ConnectionManager.java:979)
> at 

Re: Question regarding designing row keys

2016-10-04 Thread Mujtaba Chohan
If you lead with the timestamp key, you might want to consider experimenting
with salting, as writes would hotspot on a single region if keys are
monotonically increasing.
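For example, a sketch of a salted table with the key order proposed below
(column names/types are assumptions based on this thread):

    CREATE TABLE METRIC_TABLE (
        TS          TIMESTAMP NOT NULL,
        METRIC_TYPE VARCHAR NOT NULL,
        METRIC_ID   VARCHAR NOT NULL,
        VAL         DOUBLE
        CONSTRAINT PK PRIMARY KEY (TS, METRIC_TYPE, METRIC_ID)
    ) SALT_BUCKETS = 16;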

On Tue, Oct 4, 2016 at 8:04 AM, Ciureanu Constantin <
ciureanu.constan...@gmail.com> wrote:

> select * from metric_table where metric_type='x'
> -- so far so good
>
> and timestamp > 'start_date' and timestamp < 'end_date'.
> -- here in case the timestamp is long (BIGINT in Phoenix) - it should work
> fine!
> Try also with "timestamp between (x and y)"
>
> Anyway - my proposal would be to reverse the key parts - have timestamp
> first, then metric type, then other parts in the key.
>
> Using the timestamp it would define the start+stop of the scan range
> (that's a must, step 1) - then it would filter locally the metric type with
> Skips when it's not the searched value then some other parts of the key
> with lower importance (if any of them are part of the where clause).
>
>  Note: This new key proposal would solve your current use-case / but
> wouldn't be perfect for potential new use-case - then you would need
> indexes or duplicated data in other tables ...
>
> 2016-10-04 6:03 GMT+02:00 Krishna :
>
>> You have two options:
>> - Modify your primary key to include metric_type & timestamp as leading
>> columns.
>> - Create an index on metric_type & timestamp
>>
>> On Monday, October 3, 2016, Kanagha  wrote:
>>
>>> Sorry for the confusion.
>>>
>>> metric_type,
>>> timestamp,
>>> metricId  is defined as the primary key via Phoenix for metric_table.
>>>
>>> Thanks
>>>
>>> Kanagha
>>>
>>> On Mon, Oct 3, 2016 at 3:41 PM, Michael McAllister <
>>> mmcallis...@homeaway.com> wrote:
>>>
 >

 there is no indexing available on this table yet.

 >



 So you haven’t defined a primary key constraint? Can you share your
 table creation DDL?



 Michael McAllister

 Staff Data Warehouse Engineer | Decision Systems

 mmcallis...@homeaway.com | C: 512.423.7447 | skype:
 michael.mcallister.ha | webex: https://h.a/mikewebex

 This electronic communication (including any attachment) is
 confidential.  If you are not an intended recipient of this communication,
 please be advised that any disclosure, dissemination, distribution, copying
 or other use of this communication or any attachment is strictly
 prohibited.  If you have received this communication in error, please
 notify the sender immediately by reply e-mail and promptly destroy all
 electronic and printed copies of this communication and any attachment.



 *From: *Kanagha 
 *Reply-To: *"user@phoenix.apache.org" 
 *Date: *Monday, October 3, 2016 at 5:32 PM
 *To: *"u...@hbase.apache.org" , "
 user@phoenix.apache.org" 
 *Subject: *Re: Question regarding designing row keys



 there is no indexing available on this table yet.

>>>
>>>
>


Re: Phoenix has slow response times compared to HBase

2016-08-31 Thread Mujtaba Chohan
Something seems inherently wrong in these test results.

* How are you running Phoenix queries? Were the concurrent Phoenix queries
using the same JVM? Was the JVM restarted after changing number of
concurrent users?
* Is the response time plotted when query is executed for the first time or
second or average of both?
* Is the UUID filtered on randomly distributed? Does UUID match a single
row?
* It seems that even non-concurrent Phoenix query which filters on UUID
takes 500ms in your environment. Can you try the same query in Sqlline a
few times and see how much time it takes for each run?
* What is the explain <https://phoenix.apache.org/language/#explain> plan
for your Phoenix query?
* If it's slow in Sqlline as well then try truncating your SYSTEM.STATS
table and reconnect Sqlline and execute the query again
* Can you share your table schema and how you ran Phoenix queries and your
HBase equivalent code?
* Any phoenix tuning defaults that you changed?

Thanks,
Mujtaba

(previous response wasn't complete before I hit send)


On Wed, Aug 31, 2016 at 10:40 AM, Mujtaba Chohan <mujt...@apache.org> wrote:

> Something seems inherently wrong in these test results.
>
> * How are you running Phoenix queries? Were the concurrent Phoenix queries
> using the same JVM? Was the JVM restarted after changing number of
> concurrent users?
> * Is the response time plotted when query is executed for the first time
> or second or average of both?
> * Is the UUID filtered on randomly distributed? Does UUID match a single
> row?
> * It seems that even non-concurrent Phoenix query which filters on UUID
> takes 500ms in your environment. Can you try the same query in Sqlline a
> few times and see how much time it takes for each run?
> * If it's slow in Sqlline as well then try truncating your SYSTEM.STATS
> * Can you share your table schema and how you ran Phoenix queries and your
> HBase equivalent code?
>
>
>
>
> On Wed, Aug 31, 2016 at 5:42 AM, Narros, Eduardo (ELS-LON) <
> e.nar...@elsevier.com> wrote:
>
>> Hi,
>>
>>
>> We are exploring starting to use Phoenix and have done some load tests to
>> see whether Phoenix would scale. We have noted that compared to HBase,
>> Phoenix response times have a much slower average as the number of
>> concurrent users increases. We are trying to understand whether this is
>> expected or there is something we are missing out.
>>
>>
>> This is the test we have performed:
>>
>>
>>- Create table (20 columns) and load it with 400 million records
>>indexed via a column called 'uuid'.
>>- Perform the following queries using 10,20,100,200,400 and 600 users
>>per second, each user will perform each query twice:
>>   - Phoenix: select * from schema.DOCUMENTS where uuid = ?
>>   - Phoenix: select /*+ SERIAL SMALL */* from schema.DOCUMENTS where
>>   uuid = ?
>>   - Hbase equivalent to: select * from schema.DOCUMENTS where uuid =
>>   ?
>>- The results are attached and they show that Phoenix response times
>>are at least an order of magnitude above those of HBase
>>
>> The tests were run from the Master node of a CDH5.7.2 cluster with
>> Phoenix 4.7.0.
>>
>> Are these test results expected?
>>
>> Kind Regards,
>>
>> Edu
>>
>>
>> --
>>
>> Elsevier Limited. Registered Office: The Boulevard, Langford Lane,
>> Kidlington, Oxford, OX5 1GB, United Kingdom, Registration No. 1982084,
>> Registered in England and Wales.
>>
>
>


Re: Phoenix has slow response times compared to HBase

2016-08-31 Thread Mujtaba Chohan
Something seems inherently wrong in these test results.

* How are you running Phoenix queries? Were the concurrent Phoenix queries
using the same JVM? Was the JVM restarted after changing number of
concurrent users?
* Is the response time plotted when query is executed for the first time or
second or average of both?
* Is the UUID filtered on randomly distributed? Does UUID match a single
row?
* It seems that even non-concurrent Phoenix query which filters on UUID
takes 500ms in your environment. Can you try the same query in Sqlline a
few times and see how much time it takes for each run?
* If it's slow in Sqlline as well then try truncating your SYSTEM.STATS
* Can you share your table schema and how you ran Phoenix queries and your
HBase equivalent code?




On Wed, Aug 31, 2016 at 5:42 AM, Narros, Eduardo (ELS-LON) <
e.nar...@elsevier.com> wrote:

> Hi,
>
>
> We are exploring starting to use Phoenix and have done some load tests to
> see whether Phoenix would scale. We have noted that compared to HBase,
> Phoenix response times have a much slower average as the number of
> concurrent users increases. We are trying to understand whether this is
> expected or there is something we are missing out.
>
>
> This is the test we have performed:
>
>
>- Create table (20 columns) and load it with 400 million records
>indexed via a column called 'uuid'.
>- Perform the following queries using 10,20,100,200,400 and 600 users
>per second, each user will perform each query twice:
>   - Phoenix: select * from schema.DOCUMENTS where uuid = ?
>   - Phoenix: select /*+ SERIAL SMALL */* from schema.DOCUMENTS where
>   uuid = ?
>   - Hbase equivalent to: select * from schema.DOCUMENTS where uuid = ?
>- The results are attached and they show that Phoenix response times
>are at least an order of magnitude above those of HBase
>
> The tests were run from the Master node of a CDH5.7.2 cluster with Phoenix
> 4.7.0.
>
> Are these test results expected?
>
> Kind Regards,
>
> Edu
>
>
> --
>
> Elsevier Limited. Registered Office: The Boulevard, Langford Lane,
> Kidlington, Oxford, OX5 1GB, United Kingdom, Registration No. 1982084,
> Registered in England and Wales.
>


Re:

2016-07-28 Thread Mujtaba Chohan
To use the pherf-cluster.py script, make sure the $HBASE_DIR/bin/hbase file is
available, since it is used to construct the classpath. Also add the following
line to the script before java_cmd is executed, to verify that the *hbasecp*
variable contains the Phoenix jar: print "Classpath used to launch pherf: " + hbasecp

Also try running pherf-standalone.py, which does not need any variable to be
set and uses the fat phoenix-client.jar with all dependencies bundled.
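For instance, reusing the same flags from your pherf-cluster.py command:

    ./pherf-standalone.py -drop all -l -q -z localhost \
        -schemaFile ./config/datamodel/user_defined_schema.sql \
        -scenarioFile ./config/scenario/user_defined_scenario.xml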

- mujtaba

On Thu, Jul 28, 2016 at 10:14 AM, Nathan Davis 
wrote:

> Hi All,
> I'm trying to run pherf-cluster.py against an EMR cluster (on the master
> server). The command I'm using is `HBASE_DIR=/usr/lib/hbase
> ./pherf-cluster.py -drop all -l -q -z localhost -schemaFile
> ./config/datamodel/user_defined_schema.sql -scenarioFile
> ./config/scenario/user_defined_scenario.xml`. I get error 
> "java.lang.NoClassDefFoundError:
> org/apache/phoenix/schema/TableNotFoundException". Below is part of my
> terminal session that shows the applicable directories and the failed pherf
> command.
>
> [ec2-user@ip-10-2-* bin]$ pwd
>> /usr/lib/phoenix/bin
>>
>
>
>> [ec2-user@ip-10-2-* bin]$ ls -l /usr/lib/phoenix/
>> total 221104
>> drwxr-xr-x 3 root root 4096 Jul 28 16:54 bin
>> -rw-r--r-- 1 root root 98170649 Jul  8 03:18
>> phoenix-4.7.0-HBase-1.2-client.jar
>> -rw-r--r-- 1 root root  4898513 Jul  8 03:18
>> phoenix-4.7.0-HBase-1.2-client-minimal.jar
>> -rw-r--r-- 1 root root 46138953 Jul  8 03:18
>> phoenix-4.7.0-HBase-1.2-client-spark.jar
>> -rw-r--r-- 1 root root 31312803 Jul  8 03:18
>> phoenix-4.7.0-HBase-1.2-client-without-hbase.jar
>> -rw-r--r-- 1 root root 25644258 Jul  8 03:18
>> phoenix-4.7.0-HBase-1.2-server.jar
>> -rw-r--r-- 1 root root 6044 Jul  8 03:18
>> phoenix-4.7.0-HBase-1.2-tests.jar
>> -rw-r--r-- 1 root root  4152940 Jul  8 03:18
>> phoenix-4.7.0-HBase-1.2-thin-client.jar
>> -rw-r--r-- 1 root root 2884 Jul  8 03:18
>> phoenix-assembly-4.7.0-HBase-1.2-tests.jar
>> lrwxrwxrwx 1 root root   34 Jul 22 17:32 phoenix-client.jar ->
>> phoenix-4.7.0-HBase-1.2-client.jar
>> -rw-r--r-- 1 root root  3631295 Jul  8 03:18
>> phoenix-core-4.7.0-HBase-1.2.jar
>> -rw-r--r-- 1 root root  1674792 Jul  8 03:18
>> phoenix-core-4.7.0-HBase-1.2-tests.jar
>> -rw-r--r-- 1 root root35501 Jul  8 03:18
>> phoenix-flume-4.7.0-HBase-1.2.jar
>> -rw-r--r-- 1 root root23736 Jul  8 03:18
>> phoenix-flume-4.7.0-HBase-1.2-tests.jar
>> -rw-r--r-- 1 root root   159771 Jul  8 03:18
>> phoenix-pherf-4.7.0-HBase-1.2.jar
>> -rw-r--r-- 1 root root  4479303 Jul  8 03:18
>> phoenix-pherf-4.7.0-HBase-1.2-minimal.jar
>> -rw-r--r-- 1 root root58160 Jul  8 03:18
>> phoenix-pherf-4.7.0-HBase-1.2-tests.jar
>> -rw-r--r-- 1 root root42216 Jul  8 03:18
>> phoenix-pig-4.7.0-HBase-1.2.jar
>> -rw-r--r-- 1 root root43578 Jul  8 03:18
>> phoenix-pig-4.7.0-HBase-1.2-tests.jar
>> -rw-r--r-- 1 root root18810 Jul  8 03:18
>> phoenix-server-4.7.0-HBase-1.2.jar
>> -rw-r--r-- 1 root root  3357692 Jul  8 03:18
>> phoenix-server-4.7.0-HBase-1.2-runnable.jar
>> -rw-r--r-- 1 root root20170 Jul  8 03:18
>> phoenix-server-4.7.0-HBase-1.2-tests.jar
>> -rw-r--r-- 1 root root10451 Jul  8 03:18
>> phoenix-server-client-4.7.0-HBase-1.2.jar
>> -rw-r--r-- 1 root root 7139 Jul  8 03:18
>> phoenix-server-client-4.7.0-HBase-1.2-tests.jar
>> lrwxrwxrwx 1 root root   34 Jul 22 17:32 phoenix-server.jar ->
>> phoenix-4.7.0-HBase-1.2-server.jar
>> -rw-r--r-- 1 root root77327 Jul  8 03:18
>> phoenix-spark-4.7.0-HBase-1.2.jar
>> -rw-r--r-- 1 root root91730 Jul  8 03:18
>> phoenix-spark-4.7.0-HBase-1.2-tests.jar
>> lrwxrwxrwx 1 root root   39 Jul 22 17:32 phoenix-thin-client.jar ->
>> phoenix-4.7.0-HBase-1.2-thin-client.jar
>> -rw-r--r-- 1 root root16329 Jul  8 03:18
>> phoenix-tracing-webapp-4.7.0-HBase-1.2.jar
>> -rw-r--r-- 1 root root  2284964 Jul  8 03:18
>> phoenix-tracing-webapp-4.7.0-HBase-1.2-runnable.jar
>> -rw-r--r-- 1 root root 8065 Jul  8 03:18
>> phoenix-tracing-webapp-4.7.0-HBase-1.2-tests.jar
>>
>
>
>> [ec2-user@ip-10-2-1-118 bin]$ ls -l /usr/lib/hbase/
>> total 152732
>> drwxr-xr-x 4 root root 4096 Jul 22 17:32 bin
>> lrwxrwxrwx 1 root root   15 Jul 22 17:32 conf -> /etc/hbase/conf
>> -rw-r--r-- 1 root root20861 Jul  8 02:17 hbase-annotations-1.2.1.jar
>> -rw-r--r-- 1 root root14224 Jul  8 02:17
>> hbase-annotations-1.2.1-tests.jar
>> lrwxrwxrwx 1 root root   27 Jul 22 17:32 hbase-annotations.jar ->
>> hbase-annotations-1.2.1.jar
>> -rw-r--r-- 1 root root  1297581 Jul  8 02:17 hbase-client-1.2.1.jar
>> lrwxrwxrwx 1 root root   22 Jul 22 17:32 hbase-client.jar ->
>> hbase-client-1.2.1.jar
>> -rw-r--r-- 1 root root   576307 Jul  8 02:17 hbase-common-1.2.1.jar
>> -rw-r--r-- 1 root root   228279 Jul  8 02:17 hbase-common-1.2.1-tests.jar
>> lrwxrwxrwx 1 root root   22 Jul 22 17:32 hbase-common.jar ->
>> hbase-common-1.2.1.jar
>> -rw-r--r-- 1 root root   131596 Jul  8 02:17 hbase-examples-1.2.1.jar
>> lrwxrwxrwx 1 

Re: How to tell when an insertion has "finished"

2016-07-28 Thread Mujtaba Chohan
Oh sorry, I thought the OP was referring to HDFS-level replication.

On Thu, Jul 28, 2016 at 3:48 PM, James Taylor <jamestay...@apache.org>
wrote:

> I believe you can also measure the depth of the replication queue to know
> what's pending. HBase replication is asynchronous, so you're right that
> Phoenix would return while replication may still be occurring.
>
> On Thu, Jul 28, 2016 at 12:06 PM, Mujtaba Chohan <mujt...@apache.org>
> wrote:
>
>> A query run for the first time would be slower because the data is not yet
>> in the HBase cache, rather than because things have not settled. Replication
>> shouldn't be putting load on the cluster, which you can check by turning
>> replication off. On the HBase side, the way to force things to be optimal
>> before running perf queries is to do a major compaction and wait for it to
>> complete.
>>
>> - mujtaba
>>
>> On Thu, Jul 28, 2016 at 8:09 AM, Heather, James (ELS) <
>> james.heat...@elsevier.com> wrote:
>>
>>> If you upsert lots of rows into a table, presumably Phoenix will return
>>> as soon as HBase has received the data, but before the data has been
>>> replicated?
>>>
>>>
>>> Is there a way to tell when everything has "settled", i.e., when
>>> everything has finished replicating or whatever it needs to do?
>>>
>>>
>>> The reason I ask is that this might affect our benchmarking. If we add
>>> lots of rows, and then run some sample queries straight away, they might
>>> return more slowly initially, if the replication is still taking place.
>>>
>>>
>>> (Does this make sense? I'm not completely clear on how HBase replication
>>> works anyway.)
>>>
>>>
>>> James
>>>
>>> --
>>>
>>> Elsevier Limited. Registered Office: The Boulevard, Langford Lane,
>>> Kidlington, Oxford, OX5 1GB, United Kingdom, Registration No. 1982084,
>>> Registered in England and Wales.
>>>
>>
>>
>


Re: How to tell when an insertion has "finished"

2016-07-28 Thread Mujtaba Chohan
A query run for the first time would be slower because the data is not yet in
the HBase cache, rather than because things have not settled. Replication
shouldn't be putting load on the cluster, which you can check by turning
replication off. On the HBase side, the way to force things to be optimal
before running perf queries is to do a major compaction and wait for it to
complete.
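For example, from the HBase shell (table name is a placeholder):

    major_compact 'MY_TABLE'

and then watch the region server UI / compaction queue until it finishes.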

- mujtaba

On Thu, Jul 28, 2016 at 8:09 AM, Heather, James (ELS) <
james.heat...@elsevier.com> wrote:

> If you upsert lots of rows into a table, presumably Phoenix will return as
> soon as HBase has received the data, but before the data has been
> replicated?
>
>
> Is there a way to tell when everything has "settled", i.e., when
> everything has finished replicating or whatever it needs to do?
>
>
> The reason I ask is that this might affect our benchmarking. If we add
> lots of rows, and then run some sample queries straight away, they might
> return more slowly initially, if the replication is still taking place.
>
>
> (Does this make sense? I'm not completely clear on how HBase replication
> works anyway.)
>
>
> James
>
> --
>
> Elsevier Limited. Registered Office: The Boulevard, Langford Lane,
> Kidlington, Oxford, OX5 1GB, United Kingdom, Registration No. 1982084,
> Registered in England and Wales.
>


Re: Local Phoenix installation for testing

2016-07-21 Thread Mujtaba Chohan
It would be simpler and more reliable if you used the mini-cluster with
Phoenix for unit tests. In your project, which should already have
phoenix-core as a dependency, just extend your test class from
org.apache.phoenix.end2end.BaseHBaseManagedTimeIT, which will take care of
mini-cluster setup with Phoenix. Example:
https://git-wip-us.apache.org/repos/asf?p=phoenix.git;a=blob;f=phoenix-pherf/src/it/java/org/apache/phoenix/pherf/SchemaReaderIT.java;h=4ff1fb506ad9d362f493ee394dbe161d4e47e501;hb=refs/heads/4.x-HBase-0.98
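A minimal sketch of such a test (assumes the phoenix-core test-jar is on your
test classpath and that getUrl() from the base class points at the
mini-cluster, as in the linked example):

    import static org.junit.Assert.assertEquals;
    import static org.junit.Assert.assertTrue;

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;

    import org.apache.phoenix.end2end.BaseHBaseManagedTimeIT;
    import org.junit.Test;

    public class MyServiceIT extends BaseHBaseManagedTimeIT {

        @Test
        public void upsertAndSelect() throws Exception {
            // getUrl() returns the JDBC URL of the mini-cluster started by the base class
            try (Connection conn = DriverManager.getConnection(getUrl())) {
                conn.createStatement().execute(
                    "CREATE TABLE IF NOT EXISTS MY_TEST (PK VARCHAR PRIMARY KEY, V VARCHAR)");
                conn.createStatement().execute("UPSERT INTO MY_TEST VALUES ('a', 'b')");
                conn.commit();
                ResultSet rs = conn.createStatement().executeQuery(
                    "SELECT V FROM MY_TEST WHERE PK = 'a'");
                assertTrue(rs.next());
                assertEquals("b", rs.getString(1));
            }
        }
    }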

On Thu, Jul 21, 2016 at 2:49 PM, James Taylor 
wrote:

> Ah, I see. You can run the unit tests against a real cluster by
> setting hbase.test.cluster.distributed to true as an environment variable
> or in your hbase-site.xml. See BaseTest.isDistributedClusterModeEnabled().
> I believe the connection information is gotten from the hbase-site.xml in
> this case.
>
> HTH,
> James
>
> On Thu, Jul 21, 2016 at 2:43 PM, Simon Wang  wrote:
>
>> I should have been more clear about the use case. I apologize.
>>
>> So we have a java service that queries Phoenix through jdbc. For security
>> reasons, directly connecting to HBase cluster from local isn’t allowed.
>> Uploading and rebuilding on remote machine every time for testing isn’t the
>> most efficient dev process. So we are wondering if we can do end2end tests
>> locally by setting up Phoenix & HBase on local machine.
>>
>> For example, a way to make Phoenix work with standalone/minicluster mode
>> HBase would be nice.
>>
>> Best,
>> Simon
>>
>> On Jul 21, 2016, at 2:37 PM, James Taylor  wrote:
>>
>> Hi Simon,
>> Do you mean to run the unit tests? There's no setup required. You can
>> directly run the unit tests through maven or Eclipse.
>> Thanks,
>> James
>>
>> On Thu, Jul 21, 2016 at 2:34 PM, Simon Wang 
>> wrote:
>>
>>> Hi all,
>>>
>>> Does anyone have previous experience of setting up Phoenix locally for
>>> testing purposes? I looked into HBase mini cluster but I can’t figure out
>>> how Phoenix should work with it.
>>>
>>> Thanks in advance!
>>>
>>> Best,
>>> Simon
>>
>>
>>
>>
>


Re: phoenix.query.maxServerCacheBytes not used

2016-07-19 Thread Mujtaba Chohan
phoenix.query.maxServerCacheBytes is a client side parameter. If you are
using bin/sqlline.py then set this property in bin/hbase-site.xml and
restart sqlline.
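i.e. something like this in bin/hbase-site.xml on the machine where you run
sqlline (same property/value you already have, just on the client side):

    <property>
      <name>phoenix.query.maxServerCacheBytes</name>
      <value>419430400</value>
    </property>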

- mujtaba

On Tue, Jul 19, 2016 at 1:59 PM, Nathan Davis 
wrote:

> Hi,
> I am running a standalone HBase locally with Phoenex installed by dropping
> the jars into HBase lib directory. I have added the following to my
> hbase-site.xml and restarted HBase:
>
>   
>> phoenix.query.maxServerCacheBytes
>> 419430400
>>   
>>   
>> phoenix.query.maxGlobalMemoryPercentage
>> 25
>>   
>
>
> However, I am still getting the following error when doing a regular inner
> join to an 5mill-sized RHS table (Notice that the error says "...maximum
> allowed size (104857600 bytes)" even though I have changed that setting to
> 400MB):
>
> java.sql.SQLException: Encountered exception in sub plan [0] execution.
>> at org.apache.phoenix.execute.HashJoinPlan.iterator(HashJoinPlan.java:193)
>> at org.apache.phoenix.execute.HashJoinPlan.iterator(HashJoinPlan.java:138)
>> at
>> org.apache.phoenix.jdbc.PhoenixStatement$1.call(PhoenixStatement.java:276)
>> at
>> org.apache.phoenix.jdbc.PhoenixStatement$1.call(PhoenixStatement.java:261)
>> at org.apache.phoenix.call.CallRunner.run(CallRunner.java:53)
>> at
>> org.apache.phoenix.jdbc.PhoenixStatement.executeQuery(PhoenixStatement.java:260)
>> at
>> org.apache.phoenix.jdbc.PhoenixStatement.execute(PhoenixStatement.java:248)
>> at
>> org.apache.phoenix.jdbc.PhoenixPreparedStatement.execute(PhoenixPreparedStatement.java:172)
>> at
>> org.apache.phoenix.jdbc.PhoenixPreparedStatement.execute(PhoenixPreparedStatement.java:177)
>> at
>> org.apache.phoenix.jdbc.PhoenixConnection.executeStatements(PhoenixConnection.java:354)
>> at
>> org.apache.phoenix.util.PhoenixRuntime.executeStatements(PhoenixRuntime.java:298)
>> at org.apache.phoenix.util.PhoenixRuntime.main(PhoenixRuntime.java:243)
>> Caused by: org.apache.phoenix.join.MaxServerCacheSizeExceededException:
>> Size of hash cache (104857638 bytes) exceeds the maximum allowed size
>> (104857600 bytes)
>> at
>> org.apache.phoenix.join.HashCacheClient.serialize(HashCacheClient.java:110)
>> at
>> org.apache.phoenix.join.HashCacheClient.addHashCache(HashCacheClient.java:83)
>> at
>> org.apache.phoenix.execute.HashJoinPlan$HashSubPlan.execute(HashJoinPlan.java:381)
>> at org.apache.phoenix.execute.HashJoinPlan$1.call(HashJoinPlan.java:162)
>> at org.apache.phoenix.execute.HashJoinPlan$1.call(HashJoinPlan.java:158)
>> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>> at
>> org.apache.phoenix.job.JobManager$InstrumentedJobFutureTask.run(JobManager.java:183)
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> at java.lang.Thread.run(Thread.java:745)
>
>
> It seems like my `maxServerCacheBytes` setting is not getting picked up,
> but not sure why. I'm pretty newb to Phoenix so I'm sure it's something
> simple...
>
> Thanks up front for the help!
>
> -Nathan Davis
>


Re: Index tables at scale

2016-07-11 Thread Mujtaba Chohan
FYI, if your keys are not written in order, i.e. you are not concerned about
write hot-spotting/write throughput, then try writing your data to an
un-salted table. Read performance for an un-salted table can be comparable to
or better than a salted one with stats
<https://phoenix.apache.org/update_statistics.html>.
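Collecting stats is just (table name is a placeholder):

    UPDATE STATISTICS MY_TABLE;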

On Mon, Jul 11, 2016 at 2:31 PM, Simon Wang <simon.w...@airbnb.com> wrote:

> These indexes will be salted indeed (so is the data table). If all indexes
> reside in the same table, there will be only 512 regions in total (256 for
> the data table, 256 for the combined index table). The combined index table
> will indeed be 12x as large as a single index table, but it doesn't cover all
> columns so it should be fine.
>
> On Jul 11, 2016, at 2:26 PM, James Taylor <jamestay...@apache.org> wrote:
>
> Will the index be salted (and that's why it's 256 regions per table)? If
> not, how many regions would there be if all indexes are in the same table
> (assuming the table is 12x bigger than one index table)?
>
> On Monday, July 11, 2016, Simon Wang <simon.w...@airbnb.com> wrote:
>
>> Thanks, Mujtaba. What you wrote is exactly what I meant. While not all
>> our tables needs these many regions and indexes, the num of regions/region
>> server can grow quickly.
>>
>> -Simon
>>
>> On Jul 11, 2016, at 2:17 PM, Mujtaba Chohan <mujt...@apache.org> wrote:
>>
>> 12 index tables * 256 region per table = ~3K regions for index tables
>> assuming we are talking of covered index which implies 200+ regions/region
>> server on a 15 node cluster.
>>
>> On Mon, Jul 11, 2016 at 1:58 PM, James Taylor <jamestay...@apache.org>
>> wrote:
>>
>>> Hi Simon,
>>>
>>> I might be missing something, but with 12 separate index tables or 1
>>> index table, the amount of data will be the same. Won't there be the same
>>> number of regions either way?
>>>
>>> Thanks,
>>> James
>>>
>>> On Sun, Jul 10, 2016 at 10:50 PM, Simon Wang <simon.w...@airbnb.com>
>>> wrote:
>>>
>>>> Hi James,
>>>>
>>>> Thanks for the response.
>>>>
>>>> In our use case, there is a 256 region table, and we want to build ~12
>>>> indexes on it. We have 15 region servers. If each index is in its own
>>>> table, that would be a total of 221 regions per region server of this
>>>> single table. I think the extra write time cost is okay. But the number of
>>>> regions is too high for us.
>>>>
>>>> Best,
>>>> Simon
>>>>
>>>>
>>>> On Jul 9, 2016, at 1:18 AM, James Taylor <jamestay...@apache.org>
>>>> wrote:
>>>>
>>>> Hi Simon,
>>>> The reason we've taken this approach with views is that it's possible
>>>> with multi-tenancy that the number of views would grow unbounded since you
>>>> might end up with a view per tenant (100K or 1M views or more - clearly too
>>>> many for HBase to handle as separate tables).
>>>>
>>>> With secondary indexes directly on physical tables, you're somewhat
>>>> bounded by the hit you're willing to take on the write side, as the cost of
>>>> maintaining the index is similar to the cost of the write to the data
>>>> table. So the extra number of physical tables for indexes seems within the
>>>> bounds of what HBase could handle.
>>>>
>>>> How many secondary indexes are you creating and are you ok with the
>>>> extra write-time cost?
>>>>
>>>> From a code consistency standpoint, using the same approach across
>>>> local, global, and view indexes might simplify things, though. Please file
>>>> a JIRA with a bit more detail on your use case.
>>>>
>>>> Thanks,
>>>> James
>>>>
>>>>
>>>>
>>>> On Fri, Jul 8, 2016 at 8:59 PM, Simon Wang <simon.w...@airbnb.com>
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I am writing to ask if there is a way to let Phoenix store all indexes
>>>>> on a single table in the same HBase table. If each index must be stored in
>>>>> a separate table, creating more than a few indexes on table with a large
>>>>> number of regions will not scale well.
>>>>>
>>>>> From what I have learned, when Phoenix builds indexes on a view, it
>>>>> stores all indexes in a table associated with the underlying table of the
>>>>> view. e.g. if V1 is a view of T1, all indexes on V1 will be stored in
>>>>> _IDX_T1. It would be great if this behavior can be optionally turned on 
>>>>> for
>>>>> indexes on tables.
>>>>>
>>>>> Best,
>>>>> Simon
>>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>


Re: Index tables at scale

2016-07-11 Thread Mujtaba Chohan
12 index tables * 256 region per table = ~3K regions for index tables
assuming we are talking of covered index which implies 200+ regions/region
server on a 15 node cluster.

On Mon, Jul 11, 2016 at 1:58 PM, James Taylor 
wrote:

> Hi Simon,
>
> I might be missing something, but with 12 separate index tables or 1 index
> table, the amount of data will be the same. Won't there be the same number
> of regions either way?
>
> Thanks,
> James
>
> On Sun, Jul 10, 2016 at 10:50 PM, Simon Wang 
> wrote:
>
>> Hi James,
>>
>> Thanks for the response.
>>
>> In our use case, there is a 256 region table, and we want to build ~12
>> indexes on it. We have 15 region servers. If each index is in its own
>> table, that would be a total of 221 regions per region server of this
>> single table. I think the extra write time cost is okay. But the number of
>> regions is too high for us.
>>
>> Best,
>> Simon
>>
>>
>> On Jul 9, 2016, at 1:18 AM, James Taylor  wrote:
>>
>> Hi Simon,
>> The reason we've taken this approach with views is that it's possible
>> with multi-tenancy that the number of views would grow unbounded since you
>> might end up with a view per tenant (100K or 1M views or more - clearly too
>> many for HBase to handle as separate tables).
>>
>> With secondary indexes directly on physical tables, you're somewhat
>> bounded by the hit you're willing to take on the write side, as the cost of
>> maintaining the index is similar to the cost of the write to the data
>> table. So the extra number of physical tables for indexes seems within the
>> bounds of what HBase could handle.
>>
>> How many secondary indexes are you creating and are you ok with the extra
>> write-time cost?
>>
>> From a code consistency standpoint, using the same approach across local,
>> global, and view indexes might simplify things, though. Please file a JIRA
>> with a bit more detail on your use case.
>>
>> Thanks,
>> James
>>
>>
>>
>> On Fri, Jul 8, 2016 at 8:59 PM, Simon Wang  wrote:
>>
>>> Hi all,
>>>
>>> I am writing to ask if there is a way to let Phoenix store all indexes
>>> on a single table in the same HBase table. If each index must be stored in
>>> a separate table, creating more than a few indexes on table with a large
>>> number of regions will not scale well.
>>>
>>> From what I have learned, when Phoenix builds indexes on a view, it
>>> stores all indexes in a table associated with the underlying table of the
>>> view. e.g. if V1 is a view of T1, all indexes on V1 will be stored in
>>> _IDX_T1. It would be great if this behavior can be optionally turned on for
>>> indexes on tables.
>>>
>>> Best,
>>> Simon
>>
>>
>>
>>
>


Re: Phoenix performance at scale

2016-07-08 Thread Mujtaba Chohan
>
> How do response times vary as the number of rows in a table increases?
> How do response times vary as the number of HBase nodes increases?
>

It's linear, but many factors affect what that line/curve looks like, as it
depends on the type of query you are executing and how data gets spread over
the region servers.

For example, if you are running an aggregate query over the entire table: as
the data size grows and the table gets split from a single region to, say, 20
regions (2 regions/region server on a 10-node cluster), Phoenix is able to
better utilize resources on each region server in parallel with data spread
across the cluster, compared to fewer rows all ending up on a single region.
Stats also come into play to effectively utilize 100% of available resources
even when there are only a few region(s) per region server.

However, there is a limit to how much you can gain from parallelism due to
limits of disk I/O, CPU, etc., so the overall trend you would see is linear as
data grows to billions of rows. In the following graph, the dotted line will
move to the right as you add more nodes.

[image: Inline image 2]


> How do response times vary as the number of secondary indexes on a table
> increases


If all columns are covered, then write time would slow down by approximately
100% for each index. Reads would depend on how effectively you can use the
index to reduce the number of rows scanned.
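For reference, a covered index here means the INCLUDE form - a sketch with
made-up table/column names:

    CREATE INDEX IDX_ORDERS_BY_CUSTOMER ON ORDERS (CUSTOMER_ID, ORDER_DATE)
        INCLUDE (TOTAL_AMOUNT, STATUS);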

- mujtaba

On Fri, Jul 8, 2016 at 8:02 AM, Heather, James (ELS) <
james.heat...@elsevier.com> wrote:

> Are there any stats/guidelines/figures available for how well Phoenix
> performs as size increases? I'm interested particularly in three things:
>
>
>1. How do response times vary as the number of rows in a table
>increases?
>2. How do response times vary as the number of secondary indexes on a
>table increases?
>3. How do response times vary as the number of HBase nodes increases?
>
>
> I'm expecting that each one will be roughly linear, but I'd appreciate any
> links to any studies that have been done.
>
> This is also going on the assumption that the table structure is well
> defined: obviously adding nodes won't help if there is significant region
> hotspotting.
>
> James
>
> --
>
> Elsevier Limited. Registered Office: The Boulevard, Langford Lane,
> Kidlington, Oxford, OX5 1GB, United Kingdom, Registration No. 1982084,
> Registered in England and Wales.
>


Re: Number of Columns in Phoenix Table

2016-06-29 Thread Mujtaba Chohan
I haven't exhaustively perf tested it, but I have a Phoenix table with 15K
columns in a single column family, storing values in only 20 or so columns per
row, and its performance seems on par with a table with few columns.

On Wed, Jun 29, 2016 at 3:27 AM, Siddharth Ubale <
siddharth.ub...@syncoms.com> wrote:

> Hi ,
>
>
>
> Is there any limit on the number of columns that can be stored in a
> Phoenix table where all the columns are qualifiers of a single Column
> family ?
>
> I would like to create a table which will scale horizontally with dynamic
> columns and should not be more than 2500 columns , each of these columns
> will be column qualifier for one column family in hbase table.
>
> I intend to store data in only 100 columns of the 2500 created columns for
> each row  as the remaining will not serve any purpose for the row data.
>
> How would phoenix perform in such a situation?
>
>
>
>
>
>
>
> Thanks,
>
> Siddharth Ubale,
>
>
>


Re: Phoenix Performance issue

2016-05-11 Thread Mujtaba Chohan
This is with 4.5.2-HBase-0.98 and 4.x-HBase-0.98 head, got almost the same
numbers with both.

On Wed, May 11, 2016 at 12:19 AM, Naveen Nahata <nahata.ii...@gmail.com>
wrote:

> Thanks Mujtaba.
>
> Could you tell me which version of phoenix are you using ?
>
> -Naveen Nahata
>
> On 11 May 2016 at 04:12, Mujtaba Chohan <mujt...@apache.org> wrote:
>
>> Tried the following in Sqlline/Phoenix and HBase shell. Both take ~20ms
>> for
>> point lookups with local HBase.
>>
>> hbase(main):015:0> get 'MYTABLE','a'
>> COLUMN
>> CELL
>>
>>  0:MYCOLtimestamp=1462515518048,
>> value=b
>>
>>  0:_0   timestamp=1462515518048,
>> value=
>>
>> 2 row(s) in 0.0190 seconds
>>
>> 0: jdbc:phoenix:localhost> select * from mytable where pk1='a';
>> +--++
>> | PK1  | MYCOL  |
>> +--++
>> | a| b  |
>> +--++
>> 1 row selected (0.028 seconds)
>>
>> In your test, are you factoring out initial cost of setting up Phoenix
>> connection? If no then see performance of subsequent runs by measuring
>> time
>> in a loop for executeStatement and iterate over resultSet.
>>
>> -mujtaba
>>
>>
>> On Tue, May 10, 2016 at 12:55 PM, Naveen Nahata ( SC ) <
>> naveen.nah...@flipkart.com> wrote:
>>
>> > Hi,
>> >
>> > I am using phoenix 4.5.2-HBase-0.98 to connect HBase. To benchmark
>> > phoenix perforance executed select statement on primary key using
>> phoenix
>> > driver and hbase client.
>> >
>> > Surprisingly figured out PhoenixDriver is approx. 10~15 times slower
>> then
>> > hbase client.
>> >
>> >
>> > ​
>> > Addition to this looked explain statement from phoenix, which stats
>> query
>> > is look up on one key.
>> >
>> >
>> >
>> > ​
>> > If query on look up on 1 key why its taking so long ?
>> >
>> > Code Ref.
>> >
>> > // Connecting phoenix
>> >
>> > String sql = "select * from fklogistics.shipment where shipmentId =
>> 'WSRR4271782117'";
>> > long startTime = System.nanoTime();
>> > ResultSet rs1 = st.executeQuery(sql);
>> > long endTime = System.nanoTime();
>> > long duration = endTime - startTime;
>> > System.out.println("Time take by phoenix :" + duration);
>> >
>> > // Connecting HBase
>> >
>> > Get get = new Get(row);
>> > startTime = System.nanoTime();
>> > Result rs = table1.get(get);
>> > endTime = System.nanoTime();
>> > duration = endTime - startTime;
>> > System.out.println("Time take by hbase :" + duration);
>> >
>> > Please suggest why query is so slow ? Also will upgrading phoenix
>> driver can help in this ?
>> >
>> > Thanks & Regards,
>> >
>> > Naveen Nahata
>> >
>> >
>> >
>>
>
>


Re: Region Server Crash On Upsert Query Execution

2016-03-31 Thread Mujtaba Chohan
For Phoenix, phoenix.query.maxGlobalMemoryPercentage is 15% of the heap
(https://phoenix.apache.org/tuning.html). Block cache and memstore memory
settings are via the usual HBase settings, and their usage is exposed via JMX
at http://:60030/jmx. Was there any useful info in the GC logs? Also, 2GB heap
is on the low side; can you rerun your test with the heap set to 5 and 10GB?
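For example (region server hostname is a placeholder; HBASE_HEAPSIZE is the
standard knob in conf/hbase-env.sh):

    # dump region server metrics, including block cache and memstore usage
    curl http://region-server-host:60030/jmx

    # in conf/hbase-env.sh - value is in MB, so this sets a 10GB heap
    export HBASE_HEAPSIZE=10240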

On Thu, Mar 31, 2016 at 7:01 AM, Amit Shah <amits...@gmail.com> wrote:

> Another such instance of the crash is described below.
>
>
> When the regions are evenly distributed across the 3 region servers, one
> of the region servers crashes without any errors in the logs. It has long GC
> pauses. The heap usage on the server had not gone above 900 MB and the
> allocated heap is up to 2 GB. Attached are the logs and a jconsole screenshot.
>
>
>
> Wonder what is causing the GC pauses? Any idea how the region server heap
> is distributed across the block cache, Phoenix usage, memstore, etc.?
>
>
> Thanks,
>
> Amit.
>
>
>
> On Thu, Mar 31, 2016 at 7:14 PM, Amit Shah <amits...@gmail.com> wrote:
>
>> There have been multiple reasons of the region server jvm crash. For one
>> of such errors, the logs are attached. Let me know your inputs.
>>
>> Thanks,
>> Amit.
>>
>>
>> On Thu, Mar 31, 2016 at 6:15 PM, Mujtaba Chohan <mujt...@apache.org>
>> wrote:
>>
>>> Can you attach the last couple of hundred lines from the RS log before it
>>> crashed? Also, what's the RS heap size?
>>>
>>>
>>> On Thu, Mar 31, 2016 at 1:48 AM, Amit Shah <amits...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> We have been experimenting with hbase (version 1.0) and phoenix (version
>>>> 4.6) for our OLAP workload. In order to precalculate aggregates we have
>>>> been executing an upsert Phoenix query that aggregates raw data (over 10
>>>> mil records) to generate an OLAP cube.
>>>>
>>>> While executing the query, one of the region servers in a cluster of 3
>>>> RS crashes. I am trying to figure out what could be causing the region
>>>> server to crash.
>>>> The server shows high disk operations before the jvm crashed. Kindly
>>>> find the disk and other stats attached.
>>>>
>>>> Any suggestions on where could I look into would be helpful.
>>>>
>>>> The upsert query that was executed is
>>>>
>>>> upsert into AGENT_TER_PRO
>>>> (AGENT_ID,TERRITORY_ID,PRODUCT_ID,SUM_TOTAL_SALES,SUM_TOTAL_EXPENSES,SUM_UNIT_CNT_SOLD,AVG_PRICE_PER_UNIT)
>>>> select /*+ INDEX(TRANSACTIONS  AG_TER_PRO2) */
>>>>  AGENT_ID,TERRITORY_ID,PRODUCT_ID, sum(TOTAL_SALES)
>>>> SUM_TOTAL_SALES,sum(TOTAL_EXPENSES) SUM_TOTAL_EXPENSES,sum(UNIT_CNT_SOLD)
>>>> SUM_UNIT_CNT_SOLD,AVG(PRICE_PER_UNIT)  AVG_PRICE_PER_UNIT  from
>>>> TRANSACTIONS   group by AGENT_ID,TERRITORY_ID,PRODUCT_ID;
>>>>
>>>> Thanks,
>>>> Amit.
>>>>
>>>>
>>>
>>
>


Re: Tephra not starting correctly.

2016-03-31 Thread Mujtaba Chohan
Shouldn't be a bug there, as it has been working in our environment. To
verify, can you please try this? Copy only the tephra and tephra-env.sh files
supplied with Phoenix into a new directory with the HBASE_HOME env variable
set, and then run tephra.

Thanks,
Mujtaba

On Wed, Mar 30, 2016 at 9:59 PM, F21 <f21.gro...@gmail.com> wrote:

> I just downloaded the tephra 0.7.0 from github and extracted it into the
> container.
>
> Using the same setup as before, I ran:
> export HBASE_CP=/opt/hbase/lib
> export HBASE_HOME=/opt/hbase
>
> Running the standalone tephra using ./tephra start worked correctly and it
> was able to become the leader.
>
> Do you think this might be a bug?
>
> On 31/03/2016 11:53 AM, Mujtaba Chohan wrote:
>
> I still see you have the following on classpath:
> opt/hbase/phoenix-assembly/target/*
>
> On Wed, Mar 30, 2016 at 5:42 PM, F21 <f21.gro...@gmail.com> wrote:
>
>> Thanks for the hints.
>>
>> If I remove the client jar, it complains about a missing class:
>> 2016-03-31 00:38:25,929 INFO  [main] tephra.TransactionServiceMain:
>> Starting TransactionServiceMain
>> Exception in thread "main" java.lang.NoClassDefFoundError:
>> com/google/common/util/concurrent/Service$Listener
>> at
>> co.cask.tephra.distributed.TransactionService.doStart(TransactionService.java:78)
>> at
>> com.google.common.util.concurrent.AbstractService.start(AbstractService.java:90)
>> at
>> com.google.common.util.concurrent.AbstractService.startAndWait(AbstractService.java:129)
>> at
>> co.cask.tephra.TransactionServiceMain.start(TransactionServiceMain.java:116)
>> at
>> co.cask.tephra.TransactionServiceMain.doMain(TransactionServiceMain.java:83)
>> at
>> co.cask.tephra.TransactionServiceMain.main(TransactionServiceMain.java:47)
>> Caused by: java.lang.ClassNotFoundException:
>> com.google.common.util.concurrent.Service$Listener
>> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>> ... 6 more
>> 2016-03-31 00:38:25,931 INFO  [Thread-0] tephra.TransactionServiceMain:
>> Stopping TransactionServiceMain
>>
>> After adding the client-without-hbase jar, I get a missing method error:
>> java.lang.NoSuchMethodError:
>> co.cask.tephra.TransactionManager.addListener(Lcom/google/common/util/concurrent/Service$Listener;Ljava/util/concurrent/Executor;)V
>> at
>> co.cask.tephra.distributed.TransactionService$1.leader(TransactionService.java:83)
>> at
>> org.apache.twill.internal.zookeeper.LeaderElection.becomeLeader(LeaderElection.java:229)
>> at
>> org.apache.twill.internal.zookeeper.LeaderElection.access$1800(LeaderElection.java:53)
>> at
>> org.apache.twill.internal.zookeeper.LeaderElection$5.onSuccess(LeaderElection.java:207)
>> at
>> org.apache.twill.internal.zookeeper.LeaderElection$5.onSuccess(LeaderElection.java:186)
>> at
>> com.google.common.util.concurrent.Futures$5.run(Futures.java:768)
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> at java.lang.Thread.run(Thread.java:745)
>>
>> I am not very familiar with java or phoenix itself, but here's my
>> classpath:
>> 2016-03-31 00:41:06,062 INFO  [main] zookeeper.ZooKeeper: Client
>> environment:java.class.path=/opt/hbase/bin/../lib/hadoop-mapreduce-client-core-2.5.1.jar:/opt/hbase/bin/../lib/api-asn1-api-1.0.0-M20.jar:/opt/hbase/bin/../lib/hadoop-mapreduce-client-app-2.5.1.jar:/opt/hbase/bin/../lib/commons-beanutils-1.7.0.jar:/opt/hbase/bin/../lib/jsp-2.1-6.1.14.jar:/opt/hbase/bin/../lib/jasper-compiler-5.5.23.jar:/opt/hbase/bin/../lib/hbase-rest-1.1.3.jar:/opt/hbase/bin/../lib/hadoop-annotations-2.5.1.jar:/opt/hbase/bin/../lib/hbase-hadoop2-compat-1.1.3.jar:/opt/hbase/bin/../lib/hadoop-common-2.5.1.jar:/opt/hbase/bin/../lib/disruptor-3.3.0.jar:/opt/hbase/bin/../lib/jackson-core-asl-1.9.13.jar:/opt/hbase/bin/../lib/aopalliance-1.0.jar:/opt/hbase/bin/../lib/jaxb-api-2.2.2.jar:/opt/hbase/bin/../lib/jaxb-impl-2.2.3-1.jar:/opt/hbase/bin/../lib/java-xmlbuilder-0.4.jar:/opt/hbase/bin/../lib/protobuf-java-2.5.0.jar:/opt/hbase/bin/../lib/junit-4.12.jar:/opt/hbase/bin/../lib/hbase-shell-1.1.3.jar:/opt/hbase/bin/../lib/phoenix-4.7.0-HBase-1.1-server.jar:/opt/hbase/bin/..
>> /lib/hbase-it-1.1.3.jar:/opt/h

Re: Tephra not starting correctly.

2016-03-30 Thread Mujtaba Chohan
hadoop-mapreduce/.//hadoop-mapreduce-client-hs-2.7.1.2.4.0.0-169.jar:/usr/hdp/2.4.0.0-169/hadoop-mapreduce/.//apacheds-kerberos-codec-2.0.0-M15.jar:/usr/hdp/2.4.0.0-169/hadoop-mapreduce/.//hadoop-mapreduce-client-core.jar:/usr/hdp/2.4.0.0-169/hadoop-mapreduce/.//jsch-0.1.42.jar:/usr/hdp/2.4.0.0-169/hadoop-mapreduce/.//hadoop-mapreduce-client-hs.jar:/usr/hdp/2.4.0.0-169/hadoop-mapreduce/.//jersey-json-1.9.jar:/usr/hdp/2.4.0.0-169/hadoop-mapreduce/.//hadoop-mapreduce-client-app.jar:/usr/hdp/2.4.0.0-169/hadoop-mapreduce/.//commons-collections-3.2.2.jar:/usr/hdp/2.4.0.0-169/h
> adoop-mapreduce/.//hadoop-sls-2.7.1.2.4.0.0-169.jar:/usr/hdp/2.4.0.0-169/hadoop-mapreduce/.//hadoop-mapreduce-client-hs-plugins.jar:/usr/hdp/2.4.0.0-169/hadoop-mapreduce/.//hadoop-extras-2.7.1.2.4.0.0-169.jar:/usr/hdp/2.4.0.0-169/hadoop-mapreduce/.//hadoop-ant.jar:/usr/hdp/2.4.0.0-169/hadoop-mapreduce/.//mockito-all-1.8.5.jar:/usr/hdp/2.4.0.0-169/hadoop-mapreduce/.//servlet-api-2.5.jar:/usr/hdp/2.4.0.0-169/hadoop-mapreduce/.//jersey-server-1.9.jar:/usr/hdp/2.4.0.0-169/hadoop-mapreduce/.//jackson-mapper-asl-1.9.13.jar:/usr/hdp/2.4.0.0-169/hadoop-mapreduce/.//commons-cli-1.2.jar:/usr/hdp/2.4.0.0-169/hadoop-mapreduce/.//curator-framework-2.7.1.jar:/usr/hdp/2.4.0.0-169/hadoop-mapreduce/.//commons-logging-1.1.3.jar:/usr/hdp/2.4.0.0-169/hadoop-mapreduce/.//jackson-jaxrs-1.9.13.jar:/usr/hdp/2.4.0.0-169/hadoop-mapreduce/.//hadoop-datajoin-2.7.1.2.4.0.0-169.jar:/usr/hdp/2.4.0.0-169/hadoop-mapreduce/.//httpclient-4.2.5.jar:/usr/hdp/2.4.0.0-169/hadoop-mapreduce/.//guava-11.0.2.jar:/usr/hdp/2.4.
> 0.0-169/hadoop-mapreduce/.//snappy-java-1.0.4.1.jar:/usr/hdp/2.4.0.0-169/hadoop-mapreduce/.//hadoop-distcp-2.7.1.2.4.0.0-169.jar:/usr/hdp/2.4.0.0-169/hadoop-mapreduce/.//commons-httpclient-3.1.jar:/usr/hdp/2.4.0.0-169/hadoop-mapreduce/.//hadoop-ant-2.7.1.2.4.0.0-169.jar:/usr/hdp/2.4.0.0-169/hadoop-mapreduce/.//jsr305-3.0.0.jar:/usr/hdp/2.4.0.0-169/hadoop-mapreduce/.//commons-net-3.1.jar:/usr/hdp/2.4.0.0-169/hadoop-mapreduce/.//metrics-core-3.0.1.jar:/usr/hdp/2.4.0.0-169/hadoop-mapreduce/.//commons-configuration-1.6.jar:/usr/hdp/2.4.0.0-169/hadoop-mapreduce/.//jetty-util-6.1.26.hwx.jar:/usr/hdp/2.4.0.0-169/hadoop-mapreduce/.//hadoop-archives-2.7.1.2.4.0.0-169.jar:/usr/hdp/2.4.0.0-169/hadoop-mapreduce/.//hadoop-gridmix-2.7.1.2.4.0.0-169.jar:/usr/hdp/2.4.0.0-169/hadoop-mapreduce/.//jetty-6.1.26.hwx.jar:/usr/hdp/2.4.0.0-169/hadoop-mapreduce/.//apacheds-i18n-2.0.0-M15.jar:/usr/hdp/2.4.0.0-169/hadoop-mapreduce/.//commons-codec-1.4.jar:/usr/hdp/2.4.0.0-169/hadoop-mapreduce/.//hadoop-archive
> s.jar:/usr/hdp/2.4.0.0-169/hadoop-mapreduce/.//hadoop-openstack-2.7.1.2.4.0.0-169.jar:/usr/hdp/2.4.0.0-169/hadoop-mapreduce/.//hamcrest-core-1.3.jar:/usr/hdp/2.4.0.0-169/hadoop-mapreduce/.//commons-digester-1.8.jar:/usr/hdp/2.4.0.0-169/hadoop-mapreduce/.//jettison-1.1.jar:/usr/hdp/2.4.0.0-169/hadoop-mapreduce/.//hadoop-mapreduce-client-jobclient.jar:/usr/hdp/2.4.0.0-169/hadoop-mapreduce/.//avro-1.7.4.jar:/usr/hdp/2.4.0.0-169/hadoop-mapreduce/.//hadoop-sls.jar:/usr/hdp/2.4.0.0-169/hadoop-mapreduce/.//junit-4.11.jar:/usr/hdp/2.4.0.0-169/hadoop-mapreduce/.//hadoop-gridmix.jar:/usr/hdp/2.4.0.0-169/hadoop-mapreduce/.//xz-1.0.jar:/usr/hdp/2.4.0.0-169/hadoop-mapreduce/.//netty-3.6.2.Final.jar:/usr/hdp/2.4.0.0-169/hadoop-mapreduce/.//microsoft-windowsazure-storage-sdk-0.6.0.jar:/usr/hdp/2.4.0.0-169/hadoop-mapreduce/.//commons-beanutils-core-1.8.0.jar:/usr/hdp/2.4.0.0-169/hadoop-mapreduce/.//curator-recipes-2.7.1.jar:/usr/hdp/2.4.0.0-169/hadoop-mapreduce/.//htrace-core-3.1.0-incubating.jar:/u
> sr/hdp/2.4.0.0-169/hadoop-mapreduce/.//commons-math3-3.1.1.jar:/usr/hdp/2.4.0.0-169/hadoop-mapreduce/.//jsp-api-2.1.jar:/usr/hdp/2.4.0.0-169/hadoop-mapreduce/.//hadoop-mapreduce-examples-2.7.1.2.4.0.0-169.jar:/usr/hdp/2.4.0.0-169/hadoop-mapreduce/.//hadoop-mapreduce-client-hs-plugins-2.7.1.2.4.0.0-169.jar:/usr/hdp/2.4.0.0-169/hadoop-mapreduce/.//hadoop-mapreduce-client-common-2.7.1.2.4.0.0-169.jar:/usr/hdp/2.4.0.0-169/hadoop-mapreduce/.//activation-1.1.jar:/usr/hdp/2.4.0.0-169/hadoop-mapreduce/.//commons-lang-2.6.jar:/usr/hdp/2.4.0.0-169/hadoop-mapreduce/.//hadoop-mapreduce-client-jobclient-tests.jar:/usr/hdp/2.4.0.0-169/hadoop-mapreduce/.//hadoop-auth-2.7.1.2.4.0.0-169.jar:/usr/hdp/2.4.0.0-169/hadoop-mapreduce/.//hadoop-extras.jar:/usr/hdp/2.4.0.0-169/hadoop-mapreduce/.//httpcore-4.2.5.jar:/usr/hdp/2.4.0.0-169/hadoop-mapreduce/.//hadoop-openstack.jar:/usr/hdp/2.4.0.0-169/hadoop-mapreduce/.//asm-3.2.jar:/usr/hdp/2.4.0.0-169/hadoop-mapreduce/.//gson-2.2.4.jar:/usr/hdp/2.4.0.0-169/hado
> op-mapreduce/.//hadoop-auth.jar:/usr/hdp/2.4.0.0-169/hadoop-mapreduce/.//hadoop-mapreduce-client-jobclient-2.7.1.2.4.0.0-169-tests.jar:/usr/hdp/2.4.0.0-169/hadoop-mapreduce/.//commons-lang3-3.3.2.jar:/usr/hdp/2.4.0.0-169/hadoop-mapreduce/.//jackson-xc-1.9.13.jar:/usr/hdp/2.4.0.0-169/hadoop-mapreduce/.//log4j-1.2.17.jar
> :
>
>
> On 31/03/2016 11:35 AM, Mujtaba Chohan wrote:
>
> Y

Re: Tephra not starting correctly.

2016-03-30 Thread Mujtaba Chohan
You definitely need hbase.zookeeper.quorum set to be able to
connect. I think what's happening is that the Phoenix client jar on your
classpath (which is not needed, since hbase/lib/* plus phoenix-server.jar on
the classpath should contain all the necessary libraries) bundles Guava v13
classes, whereas Tephra works with Guava v12, which is already in the
HBase/lib directory.

To solve this, either remove phoenix-client.jar from the classpath and add any
missing library it then complains about, or remove the Guava classes bundled
in phoenix-client.jar, and then start Tephra.

On Wed, Mar 30, 2016 at 5:07 PM, F21 <f21.gro...@gmail.com> wrote:

> I removed the following from hbase-site.xml and tephra started correctly:
>
>   
> hbase.zookeeper.quorum
> f826338-zookeeper.f826338
>   
>
> However, it now keeps trying to connect to zookeeper on localhost, which
> wouldn't work, because my zookeeper is on another host:
>
> 2016-03-31 00:06:21,972 WARN  [main-SendThread(localhost:2181)]
> zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error,
> closing socket connection and attempting reconnect
> java.net.ConnectException: Connection refused
> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> at
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
> at
> org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
> at
> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
>
> Any ideas how this can be fixed?
>
>
>
> On 31/03/2016 4:47 AM, Mujtaba Chohan wrote:
>
> Few pointers:
>
> - phoenix-core-*.jar is a subset of phoenix-*-server.jar so just
> phoenix-*-server.jar in hbase/lib is enough for region servers and master.
> - phoenix-server-*-runnable.jar and phoenix-*-server.jar should be enough
> for query server. Client jar would only duplicate HBase classes in
> hbase/lib.
> - Check for exception starting tephra in
> /tmp/tephra-*/tephra-service-*.log (assuming this is the log location
> configured in your tephra-env.sh)
>
> - mujtaba
>
>
> On Wed, Mar 30, 2016 at 2:54 AM, F21 <f21.gro...@gmail.com> wrote:
>
>> I have been trying to get tephra working, but wasn't able to get it
>> starting successfully.
>>
>> I have a HDFS and HBase 1.1 cluster running in docker containers. I have
>> confirmed that Phoenix, HDFS and HBase are both working correctly. Phoenix
>> and the Phoenix query server are also installed correctly and I can access
>> the cluster using Squirrel SQL with the thin client.
>>
>> Here's what I have done:
>>
>> In the hbase-site.xml of the region servers and masters, add the
>> following:
>>
>> 
>>   data.tx.snapshot.dir
>>   /tmp/tephra/snapshots
>> 
>>
>> 
>>   data.tx.timeout
>>   60
>> 
>>
>> In the hbase-site.xml of the phoenix query server, add:
>>
>> 
>>   phoenix.transactions.enabled
>>   true
>> 
>>
>> For the master, copy the following to hbase/lib:
>> phoenix-4.7.0-HBase-1.1-server
>> phoenix-core-4.7.0-HBase-1.1
>>
>> Also, copy tephra and tephra-env.sh to hbase/bin
>>
>> On the region server, copy the following to hbase/lib:
>> phoenix-4.7.0-HBase-1.1-server
>> phoenix-core-4.7.0-HBase-1.1
>>
>> For the phoenix query server, copy the following to hbase/lib:
>> phoenix-server-4.7.0-HBase-1.1-runnable
>> phoenix-4.7.0-HBase-1.1-client
>>
>> This is what I get when I try to start tephra on the master:
>>
>> root@f826338-hmaster1:/opt/hbase/bin# ./tephra start
>> Wed Mar 30 09:54:08 UTC 2016 Starting tephra service on
>> f826338-hmaster1.f826338
>> Running class co.cask.tephra.TransactionServiceMain
>>
>> root@f826338-hmaster1:/opt/hbase/bin# ./tephra status
>> checking status
>>  * tephra is not running
>>
>> Any pointers appreciated! :)
>>
>>
>>
>
>


Re: Tephra not starting correctly.

2016-03-30 Thread Mujtaba Chohan
Few pointers:

- phoenix-core-*.jar is a subset of phoenix-*-server.jar so just
phoenix-*-server.jar in hbase/lib is enough for region servers and master.
- phoenix-server-*-runnable.jar and phoenix-*-server.jar should be enough
for query server. Client jar would only duplicate HBase classes in
hbase/lib.
- Check for exception starting tephra in /tmp/tephra-*/tephra-service-*.log
(assuming this is the log location configured in your tephra-env.sh)

- mujtaba


On Wed, Mar 30, 2016 at 2:54 AM, F21  wrote:

> I have been trying to get tephra working, but wasn't able to get it
> starting successfully.
>
> I have a HDFS and HBase 1.1 cluster running in docker containers. I have
> confirmed that Phoenix, HDFS and HBase are both working correctly. Phoenix
> and the Phoenix query server are also installed correctly and I can access
> the cluster using Squirrel SQL with the thin client.
>
> Here's what I have done:
>
> In the hbase-site.xml of the region servers and masters, add the following:
>
> 
>   data.tx.snapshot.dir
>   /tmp/tephra/snapshots
> 
>
> 
>   data.tx.timeout
>   60
> 
>
> In the hbase-site.xml of the phoenix query server, add:
>
> 
>   phoenix.transactions.enabled
>   true
> 
>
> For the master, copy the following to hbase/lib:
> phoenix-4.7.0-HBase-1.1-server
> phoenix-core-4.7.0-HBase-1.1
>
> Also, copy tephra and tephra-env.sh to hbase/bin
>
> On the region server, copy the following to hbase/lib:
> phoenix-4.7.0-HBase-1.1-server
> phoenix-core-4.7.0-HBase-1.1
>
> For the phoenix query server, copy the following to hbase/lib:
> phoenix-server-4.7.0-HBase-1.1-runnable
> phoenix-4.7.0-HBase-1.1-client
>
> This is what I get when I try to start tephra on the master:
>
> root@f826338-hmaster1:/opt/hbase/bin# ./tephra start
> Wed Mar 30 09:54:08 UTC 2016 Starting tephra service on
> f826338-hmaster1.f826338
> Running class co.cask.tephra.TransactionServiceMain
>
> root@f826338-hmaster1:/opt/hbase/bin# ./tephra status
> checking status
>  * tephra is not running
>
> Any pointers appreciated! :)
>
>
>


Re: Speeding Up Group By Queries

2016-03-29 Thread Mujtaba Chohan
Optimization did help somewhat but not to the extent I was expecting. See
chart below.

[image: Inline image 1]

Can you share your table schema so I can experiment with it? Another thing
you can try is reducing the guidepost width
(https://phoenix.apache.org/tuning.html) for this table by executing UPDATE
STATISTICS TRANSACTIONS SET "phoenix.stats.guidepost.width"=5000;

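A sketch of applying that from JDBC and then re-checking the plan (the connection URL is a placeholder; the width value is the one suggested above):

import java.sql.*;

public class TuneGuideposts {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
             Statement stmt = conn.createStatement()) {
            // Lower the guidepost width for this one table and rebuild its stats.
            stmt.execute("UPDATE STATISTICS TRANSACTIONS SET \"phoenix.stats.guidepost.width\" = 5000");
            // Each row of the EXPLAIN result is one line of the plan; a smaller
            // guidepost width should show up as more chunks in the CLIENT line.
            try (ResultSet rs = stmt.executeQuery(
                    "EXPLAIN SELECT SUM(UNIT_CNT_SOLD), SUM(TOTAL_SALES) "
                            + "FROM TRANSACTIONS GROUP BY T_COUNTRY")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
        }
    }
}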



On Tue, Mar 29, 2016 at 6:45 AM, Amit Shah <amits...@gmail.com> wrote:

> Hi Mujtaba,
>
> I did try the two optimization techniques by recreating the table and then
> loading it again with 10 mil records. They do not seem to help out much in
> terms of the timings. Kindly find the phoenix log file attached. Let me
> know if I am missing anything.
>
> Thanks,
> Amit.
>
> On Mon, Mar 28, 2016 at 11:44 PM, Mujtaba Chohan <mujt...@apache.org>
> wrote:
>
>> Here's the chart for time it takes for each of the parallel scans after
>> split. On RS where data is not read from disk scan gets back in ~20 secs
>> but for the RS which has 6 it's ~45 secs.
>>
>> [image: Inline image 2]
>>
>>  Yes I see disk reads with 607 ios/second on the hosts that stores 6
>>> regions
>>>
>>
>> Two things that you should try to reduce disk reads or maybe a
>> combination of both 1. Have only the columns used in your group by query in
>> a separate column family CREATE TABLE T (K integer primary key,
>> GRPBYCF.UNIT_CNT_SOLD integer, GRPBYCF.TOTAL_SALES integer,
>> GRPBYCF.T_COUNTRY varchar, ...) 2. Turn on snappy compression for your
>> table ALTER TABLE T SET COMPRESSION='SNAPPY' followed by a major
>> compaction.
>>
>> I tried to compact the table from the hbase web UI
>>>
>>
>> You need to do *major_compact* from HBase shell. From UI it's minor.
>>
>> - mujtaba
>>
>> On Mon, Mar 28, 2016 at 12:32 AM, Amit Shah <amits...@gmail.com> wrote:
>>
>>> Thanks Mujtaba and James for replying back.
>>>
>>> Mujtaba, Below are details to your follow up queries
>>>
>>> 1. How wide is your table
>>>
>>>
>>> I have 26 columns in the TRANSACTIONS table with a couple of columns
>>> combined to be marked as a primary key
>>>
>>> 2. How many region servers is your data distributed on and what's the
>>>> heap size?
>>>
>>>
>>> When I posted the initial readings of the query taking around 2 minutes,
>>> I had one region server storing 4 regions for the 10 mil records
>>> TRANSACTIONS table. The heap size on the master server is 1 GB while the
>>> region server has 3.63 GB heap setting.
>>>
>>> Later I added 2 more region servers to the cluster and configured them
>>> as data nodes and region servers. After this step, the regions got split on
>>> two region servers with the count as 2 on one region server and 6 on
>>> another. I didn't follow what action caused this region split or was it
>>> automatically done by hbase (load balancer??)
>>>
>>> 3. Do you see lots of disk I/O on region servers during aggregation?
>>>
>>>
>>>  Yes I see disk reads with 607 ios/second on the hosts that stores 6
>>> regions. Kindly find the disk io statistics attached as images.
>>>
>>> 4. Can you try your query after major compacting your table?
>>>
>>>
>>> I tried to compact the table from the hbase web UI. For some reason, the
>>> compaction table attribute on the web ui is still shown as NONE. After
>>> these changes, the query time is down to *42 secs. *
>>> Is compression different from compaction? Would the query performance
>>> improve by compressing the data by one of the algorithms? Logically it
>>> doesn't sound right though.
>>>
>>> Can you also replace log4j.properties with the attached one and reply
>>>> back with phoenix.log created by executing your query in sqlline?
>>>
>>>
>>> After replacing the log4j.properties, I have captured the logs for the
>>> group by query execution and attached.
>>>
>>>
>>> James,
>>> If I follow the queries that you pasted, I see the index getting used
>>> but if I try to explain the query plan on the pre-loaded TRANSACTIONS table
>>> I do not see the index being used. Probably the query plan is changing
>>> based on whether the table has data or not.
>>>
>>> The query time is reduced down to 42 secs right now. Let me know if you
>>> have more suggestions on to improve it further.
>>>
>>> Th

Re: Speeding Up Group By Queries

2016-03-28 Thread Mujtaba Chohan
Here's the chart of the time it takes for each of the parallel scans after the
split. On the RS where data is not read from disk, the scan comes back in ~20
secs, but for the RS which has 6 regions it's ~45 secs.

[image: Inline image 2]

 Yes I see disk reads with 607 ios/second on the hosts that stores 6 regions
>

Two things that you should try to reduce disk reads, or maybe a combination
of both (a JDBC sketch follows this list):

1. Have only the columns used in your group-by query in a separate column
   family: CREATE TABLE T (K integer primary key, GRPBYCF.UNIT_CNT_SOLD
   integer, GRPBYCF.TOTAL_SALES integer, GRPBYCF.T_COUNTRY varchar, ...)
2. Turn on Snappy compression for your table: ALTER TABLE T SET
   COMPRESSION='SNAPPY', followed by a major compaction.
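A sketch of both suggestions as they could be applied from JDBC (the table, column and column-family names mirror the example above and are placeholders for the real schema):

import java.sql.*;

public class ReduceGroupByDiskReads {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
             Statement stmt = conn.createStatement()) {
            // 1. Keep only the columns used by the group-by query in their own
            //    column family (GRPBYCF) so the aggregation scans fewer store files.
            stmt.execute("CREATE TABLE IF NOT EXISTS T ("
                    + " K INTEGER PRIMARY KEY,"
                    + " GRPBYCF.UNIT_CNT_SOLD INTEGER,"
                    + " GRPBYCF.TOTAL_SALES INTEGER,"
                    + " GRPBYCF.T_COUNTRY VARCHAR)");
            // 2. Enable Snappy compression; existing store files are only rewritten
            //    compressed after a major compaction (HBase shell: major_compact 'T').
            stmt.execute("ALTER TABLE T SET COMPRESSION='SNAPPY'");
        }
    }
}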

I tried to compact the table from the hbase web UI
>

You need to do *major_compact* from HBase shell. From UI it's minor.

- mujtaba

On Mon, Mar 28, 2016 at 12:32 AM, Amit Shah <amits...@gmail.com> wrote:

> Thanks Mujtaba and James for replying back.
>
> Mujtaba, Below are details to your follow up queries
>
> 1. How wide is your table
>
>
> I have 26 columns in the TRANSACTIONS table with a couple of columns
> combined to be marked as a primary key
>
> 2. How many region servers is your data distributed on and what's the heap
>> size?
>
>
> When I posted the initial readings of the query taking around 2 minutes, I
> had one region server storing 4 regions for the 10 mil records TRANSACTIONS
> table. The heap size on the master server is 1 GB while the region server
> has 3.63 GB heap setting.
>
> Later I added 2 more region servers to the cluster and configured them as
> data nodes and region servers. After this step, the regions got split on
> two region servers with the count as 2 on one region server and 6 on
> another. I didn't follow what action caused this region split or was it
> automatically done by hbase (load balancer??)
>
> 3. Do you see lots of disk I/O on region servers during aggregation?
>
>
>  Yes I see disk reads with 607 ios/second on the hosts that stores 6
> regions. Kindly find the disk io statistics attached as images.
>
> 4. Can you try your query after major compacting your table?
>
>
> I tried to compact the table from the hbase web UI. For some reason, the
> compaction table attribute on the web ui is still shown as NONE. After
> these changes, the query time is down to *42 secs. *
> Is compression different from compaction? Would the query performance
> improve by compressing the data by one of the algorithms? Logically it
> doesn't sound right though.
>
> Can you also replace log4j.properties with the attached one and reply back
>> with phoenix.log created by executing your query in sqlline?
>
>
> After replacing the log4j.properties, I have captured the logs for the
> group by query execution and attached.
>
>
> James,
> If I follow the queries that you pasted, I see the index getting used but
> if I try to explain the query plan on the pre-loaded TRANSACTIONS table I
> do not see the index being used. Probably the query plan is changing based
> on whether the table has data or not.
>
> The query time is reduced down to 42 secs right now. Let me know if you
> have more suggestions on to improve it further.
>
> Thanks,
> Amit.
>
> On Sat, Mar 26, 2016 at 4:21 AM, James Taylor <jamestay...@apache.org>
> wrote:
>
>> Hi Amit,
>> Using 4.7.0-HBase-1.1 release, I see the index being used for that query
>> (see below). An index will help some, as the aggregation can be done in
>> place as the scan over the index is occurring (as opposed to having to hold
>> the distinct values found during grouping in memory per chunk of work and
>> sorting each chunk on the client). It's not going to prevent the entire
>> index from being scanned though. You'll need a WHERE clause to prevent that.
>>
>> 0: jdbc:phoenix:localhost> create table TRANSACTIONS (K integer primary
>> key, UNIT_CNT_SOLD integer, TOTAL_SALES integer, T_COUNTRY varchar);
>> No rows affected (1.32 seconds)
>> 0: jdbc:phoenix:localhost> CREATE INDEX TRANSACTIONS_COUNTRY_INDEX ON
>> TRANSACTIONS (T_COUNTRY) INCLUDE (UNIT_CNT_SOLD, TOTAL_SALES);
>> No rows affected (6.452 seconds)
>> 0: jdbc:phoenix:localhost> explain SELECT SUM(UNIT_CNT_SOLD),
>> SUM(TOTAL_SALES) FROM TRANSACTIONS GROUP BY T_COUNTRY;
>>
>> +--+
>> |   PLAN
>>   |
>>
>> +--+
>> | CLIENT 1-CHUNK PARALLEL 1-WAY FULL SCAN OVER TRANSACTIONS_COUNTRY_INDEX
>>  |
>> | SERVER AGGREGATE INTO ORDERED DISTINCT ROWS BY ["T_COU

Re: Speeding Up Group By Queries

2016-03-25 Thread Mujtaba Chohan
That seems excessively slow for 10M rows, which should be in the order of a
few seconds at most without an index.

1. How wide is your table?
2. How many region servers is your data distributed on, and what's the heap size?
3. Do you see lots of disk I/O on region servers during aggregation?
4. Can you try your query after major compacting your table?

Can you also replace log4j.properties with the attached one and reply back
with phoenix.log created by executing your query in sqlline?

Thanks,
Mujtaba


On Fri, Mar 25, 2016 at 6:56 AM, Amit Shah  wrote:

> Hi,
>
> I am trying to evaluate apache hbase (version 1.0.0) and phoenix (version
> 4.6) deployed through cloudera for our OLAP workfload. I have a table
> that has 10 mil rows. I try to execute the below roll up query and it takes
> around 2 mins to return 1,850 rows.
>
> SELECT SUM(UNIT_CNT_SOLD), SUM(TOTAL_SALES) FROM TRANSACTIONS GROUP BY
> T_COUNTRY;
>
> I tried applying the "joining with indices" example given on the website
>  on the TRANSACTIONS table by
> creating an index on the grouped by column as below but that doesn't help.
>
> CREATE INDEX TRANSACTIONS_COUNTRY_INDEX ON TRANSACTIONS (T_COUNTRY)
> INCLUDE (UNIT_CNT_SOLD, TOTAL_SALES);
>
> This index is not getting used when the query is executed. The query plan
> is as below
>
> +--+
> |   PLAN   |
> +--+
> | CLIENT 31-CHUNK PARALLEL 31-WAY FULL SCAN OVER TRANSACTIONS |
> | SERVER AGGREGATE INTO DISTINCT ROWS BY [T_COUNTRY] |
> | CLIENT MERGE SORT|
> +--+
>
> Theoretically can secondary indexes help improve the performance of group
> by queries?
>
> Any suggestions on what are different options in phoenix I could try out
> to speed up GROUP BY queries?
>
> Thanks,
> Amit.
>


log4j.properties
Description: Binary data


Re: looking for help with Pherf setup

2016-02-29 Thread Mujtaba Chohan
This is a ClassNotFoundException. Can you make sure the Phoenix jar is
available on the classpath for Pherf? If Phoenix is available in the HBase/lib
directory and the HBASE_DIR environment variable is set, then that should fix
it. Also, to test things out first, you can run pherf_standalone.py with a
local HBase to see if everything works as expected locally before testing on
the cluster.

On Mon, Feb 29, 2016 at 8:49 AM, Peter Savage 
wrote:

> Hello,
>
> I'm trying to setup pherf and running into a few glitches. I have been
> using this page here:
>
> https://phoenix.apache.org/pherf.html
>
> but it appears to be slightly out of date, and I have not been able to get
> a test to run.
>
> We tried, using this sql -
>
> https://github.com/apache/phoenix/blob/master/phoenix-pherf/src/test/resources/datamodel/test_schema.sql
> to
> populate our schema, and then we ran against that scenario.xml, we received
> the following error:
>
>
> hadoop@localhost bin]$ ./pherf-cluster.py -drop all -l -q -z localhost 
> -schemaFile
> ../sandbox/database_schema.sql -scenarioFile ../sandbox/scenario.xml
>
> HBASE_DIR environment variable is currently set to: /opt/hbase
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in
> [jar:file:/opt/hbase/lib/slf4j-log4j12-1.7.5.jar!/org/
> slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in
> [jar:file:/opt/hadoop/share/hadoop/common/lib/slf4j-
> log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
> explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> Exception in thread "main" java.lang.NoClassDefFoundError:
> org/apache/phoenix/schema/TableNotFoundException
> at org.apache.phoenix.pherf.Pherf.(Pherf.java:52)
> at org.apache.phoenix.pherf.Pherf.main(Pherf.java:188)
> Caused by: java.lang.ClassNotFoundException:
> org.apache.phoenix.schema.TableNotFoundException
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> ... 2 more
>
>
> We did check that the table was there, so this is a bit puzzling.
>
> Thanks,
>
> Peter
>


Re: YCSB with Phoenix?

2016-02-19 Thread Mujtaba Chohan
You can apply this patch on YCSB to test out Phoenix with a variable number
of VARCHAR fields, as well as to test out combinations of single/multiple CFs,
compression and salt buckets. See usage details here
.

You can also use Pherf

which is a Phoenix-specific performance test suite that uses XML scenarios
to define how synthetic data is generated and queried. Its launcher
scripts are in the phoenix/bin directory.

HTH

On Fri, Feb 19, 2016 at 9:54 AM, Gaurav Kanade 
wrote:

> Hi All
>
> I am relatively new to Phoenix and was working on some performance
> tuning/benchmarking experiments and tried to search online for whether
> there exists YCSB client to go through Phoenix.
>
> I came across this https://github.com/brianfrankcooper/YCSB/pull/178 and
> some related links but it seems this didnt seem to have moved forward.
>
> Could someone help me know where to look for the latest status on this ?
> Does there exist an easy way to test Phoenix via YCSB and if not what are
> the other existing options?
>
> Best,
> Gaurav
>
>
>


Re: Select by first part of composite primary key, is it effective?

2016-02-02 Thread Mujtaba Chohan
If you know your key space then you can use *SPLIT ON* in your table create
DDL. See http://phoenix.apache.org/language
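A sketch of a pre-split, non-salted version of the DDL from this thread (the split points are made-up examples; they should be chosen to divide the real id1 key space roughly evenly):

import java.sql.*;

public class PreSplitIdRefTable {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
             Statement stmt = conn.createStatement()) {
            // Same columns as before, but no SALT_BUCKETS; regions are pre-split
            // on leading-key boundaries instead.
            stmt.execute("CREATE TABLE IF NOT EXISTS id_ref ("
                    + " id1 VARCHAR NOT NULL,"
                    + " value1 VARCHAR,"
                    + " id2 VARCHAR NOT NULL,"
                    + " value2 VARCHAR,"
                    + " CONSTRAINT id_ref_pk PRIMARY KEY (id1, id2)"
                    + ") IMMUTABLE_ROWS=true, VERSIONS=1, TTL=691200"
                    + " SPLIT ON ('2', '4', '6', '8', 'a', 'c', 'e')");
        }
    }
}

With this layout a highly selective filter on id1 stays a small range scan over one or a few regions instead of 100 parallel salted scans.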

On Tue, Feb 2, 2016 at 11:54 AM, Serega Sheypak <serega.shey...@gmail.com>
wrote:

> Hm... and what is the right way to pre-split the table then?
>
> 2016-02-02 18:30 GMT+01:00 Mujtaba Chohan <mujt...@apache.org>:
>
>> If your filter matches few rows due to filter on leading part of PK then
>> your data might only reside in a single block which leads to less
>> overall disk reads for non-salted case vs need for multiple blocks reads for
>> salted one.
>>
>>
>> On Tuesday, February 2, 2016, Serega Sheypak <serega.shey...@gmail.com>
>> wrote:
>>
>>> > then you would be better off not using salt buckets all together
>>> rather than having 100 parallel scan and block reads in your case. I
>>> Didn't understand you correctly. What is difference between salted/not
>>> salted table in case of "primary key leading-part select"?
>>>
>>> 2016-02-02 1:18 GMT+01:00 Mujtaba Chohan <mujt...@apache.org>:
>>>
>>>> If you are filtering on leading part of row key which is highly
>>>> selective then you would be better off not using salt buckets all together
>>>> rather than having 100 parallel scan and block reads in your case. In our
>>>> test with billion+ row table, non-salted table offer much better
>>>> performance since it ends up reading fewer blocks from a single region.
>>>>
>>>> //mujtaba
>>>>
>>>> On Mon, Feb 1, 2016 at 1:16 PM, Serega Sheypak <
>>>> serega.shey...@gmail.com> wrote:
>>>>
>>>>> Hi, here is my table DDL:
>>>>> CREATE TABLE IF NOT EXISTS id_ref
>>>>> (
>>>>>id1 VARCHAR   NOT NULL,
>>>>>value1  VARCHAR,
>>>>>
>>>>>id2 VARCHAR NOT NULL,
>>>>>value2  VARCHAR
>>>>>CONSTRAINT id_ref_pk  PRIMARY KEY (id1, id2)
>>>>> )IMMUTABLE_ROWS=true,SALT_BUCKETS=100, VERSIONS=1, TTL=691200
>>>>>
>>>>> I'm trying to analyze result of explain:
>>>>>
>>>>> explain select id1, value1, id2, value2 from id_ref where id1 = 'xxx'
>>>>>
>>>>> . . . . . . . . . . . . . . . . . . . . . . .> ;
>>>>>
>>>>> *+--+*
>>>>>
>>>>> *| **  PLAN  ** |*
>>>>>
>>>>> *+--+*
>>>>>
>>>>> *| *CLIENT 100-CHUNK PARALLEL 100-WAY RANGE SCAN OVER ID_REF
>>>>> [0,'1fd5c44a75549162ca1602dda55f6d129cab61a6']* |*
>>>>>
>>>>> *| *CLIENT MERGE SORT   * |*
>>>>>
>>>>> *+--+*
>>>>>
>>>>>
>>>>> What happens? Client spawns 100 parallel scans (because of bucketing)
>>>>> and waits for 100 responses?
>>>>>
>>>>> Is it effective? What is the right way to optimize such query pattern:
>>>>> "select by first part of primary key"? Reduce the amount of buckets? I get
>>>>> exeption a while after restarting app:
>>>>>
>>>>>
>>>>> *Task org.apache.phoenix.job.JobManager$JobFutureTask@60a40644
>>>>> rejected from org.apache.phoenix.job.JobManager$1@58e3fe9aRunning, pool
>>>>> size = 128, active threads = 121, queued tasks = 5000, completed tasks =
>>>>> 2629565*
>>>>>
>>>>>
>>>>>
>>>>
>>>
>


Re: Select by first part of composite primary key, is it effective?

2016-02-02 Thread Mujtaba Chohan
If your filter matches few rows due to a filter on the leading part of the PK,
then your data might only reside in a single block, which leads to fewer
overall disk reads for the non-salted case vs. the need for multiple block
reads for the salted one.

On Tuesday, February 2, 2016, Serega Sheypak <serega.shey...@gmail.com>
wrote:

> > then you would be better off not using salt buckets all together rather
> than having 100 parallel scan and block reads in your case. I
> Didn't understand you correctly. What is difference between salted/not
> salted table in case of "primary key leading-part select"?
>
> 2016-02-02 1:18 GMT+01:00 Mujtaba Chohan <mujt...@apache.org
> <javascript:_e(%7B%7D,'cvml','mujt...@apache.org');>>:
>
>> If you are filtering on leading part of row key which is highly selective
>> then you would be better off not using salt buckets all together rather
>> than having 100 parallel scan and block reads in your case. In our test
>> with billion+ row table, non-salted table offer much better performance
>> since it ends up reading fewer blocks from a single region.
>>
>> //mujtaba
>>
>> On Mon, Feb 1, 2016 at 1:16 PM, Serega Sheypak <serega.shey...@gmail.com
>> <javascript:_e(%7B%7D,'cvml','serega.shey...@gmail.com');>> wrote:
>>
>>> Hi, here is my table DDL:
>>> CREATE TABLE IF NOT EXISTS id_ref
>>> (
>>>id1 VARCHAR   NOT NULL,
>>>value1  VARCHAR,
>>>
>>>id2 VARCHAR NOT NULL,
>>>value2  VARCHAR
>>>CONSTRAINT id_ref_pk  PRIMARY KEY (id1, id2)
>>> )IMMUTABLE_ROWS=true,SALT_BUCKETS=100, VERSIONS=1, TTL=691200
>>>
>>> I'm trying to analyze result of explain:
>>>
>>> explain select id1, value1, id2, value2 from id_ref where id1 = 'xxx'
>>>
>>> . . . . . . . . . . . . . . . . . . . . . . .> ;
>>>
>>> *+--+*
>>>
>>> *| **  PLAN  ** |*
>>>
>>> *+--+*
>>>
>>> *| *CLIENT 100-CHUNK PARALLEL 100-WAY RANGE SCAN OVER ID_REF
>>> [0,'1fd5c44a75549162ca1602dda55f6d129cab61a6']* |*
>>>
>>> *| *CLIENT MERGE SORT   * |*
>>>
>>> *+--+*
>>>
>>>
>>> What happens? Client spawns 100 parallel scans (because of bucketing)
>>> and waits for 100 responses?
>>>
>>> Is it effective? What is the right way to optimize such query pattern:
>>> "select by first part of primary key"? Reduce the amount of buckets? I get
>>> exeption a while after restarting app:
>>>
>>>
>>> *Task org.apache.phoenix.job.JobManager$JobFutureTask@60a40644 rejected
>>> from org.apache.phoenix.job.JobManager$1@58e3fe9aRunning, pool size = 128,
>>> active threads = 121, queued tasks = 5000, completed tasks = 2629565*
>>>
>>>
>>>
>>
>


Re: Select by first part of composite primary key, is it effective?

2016-02-01 Thread Mujtaba Chohan
If you are filtering on the leading part of the row key and it is highly
selective, then you would be better off not using salt buckets altogether,
rather than having 100 parallel scans and block reads in your case. In our
test with a billion+ row table, a non-salted table offers much better
performance since it ends up reading fewer blocks from a single region.

//mujtaba

On Mon, Feb 1, 2016 at 1:16 PM, Serega Sheypak 
wrote:

> Hi, here is my table DDL:
> CREATE TABLE IF NOT EXISTS id_ref
> (
>id1 VARCHAR   NOT NULL,
>value1  VARCHAR,
>
>id2 VARCHAR NOT NULL,
>value2  VARCHAR
>CONSTRAINT id_ref_pk  PRIMARY KEY (id1, id2)
> )IMMUTABLE_ROWS=true,SALT_BUCKETS=100, VERSIONS=1, TTL=691200
>
> I'm trying to analyze result of explain:
>
> explain select id1, value1, id2, value2 from id_ref where id1 = 'xxx'
>
> . . . . . . . . . . . . . . . . . . . . . . .> ;
>
> *+--+*
>
> *| **  PLAN  ** |*
>
> *+--+*
>
> *| *CLIENT 100-CHUNK PARALLEL 100-WAY RANGE SCAN OVER ID_REF
> [0,'1fd5c44a75549162ca1602dda55f6d129cab61a6']* |*
>
> *| *CLIENT MERGE SORT   * |*
>
> *+--+*
>
>
> What happens? Client spawns 100 parallel scans (because of bucketing) and
> waits for 100 responses?
>
> Is it effective? What is the right way to optimize such a query pattern:
> "select by first part of primary key"? Reduce the amount of buckets? I get
> an exception a while after restarting the app:
>
>
> *Task org.apache.phoenix.job.JobManager$JobFutureTask@60a40644 rejected
> from org.apache.phoenix.job.JobManager$1@58e3fe9aRunning, pool size = 128,
> active threads = 121, queued tasks = 5000, completed tasks = 2629565*
>
>
>


Re: phoenix-4.4.0-HBase-1.1-client.jar in maven?

2015-11-25 Thread Mujtaba Chohan
Kristoffer - If you use the *phoenix-core* dependency in your pom.xml as
described here  then it's
equivalent to having a project dependency on the phoenix-client jar, as Maven
would resolve all the dependencies needed by phoenix-core. Note that the
phoenix-client jar is just a bundle of phoenix-core and its associated
dependencies. HTH.

On Wed, Nov 25, 2015 at 7:35 AM, Kristoffer Sjögren 
wrote:

> That's fine, I can just take the one packaged inside the tar.gz file
> from the website.
>
>
> http://apache.mirrors.spacedump.net/phoenix/phoenix-4.4.0-HBase-1.1/bin/phoenix-4.4.0-HBase-1.1-bin.tar.gz
>
> On Wed, Nov 25, 2015 at 4:04 PM, Asher Devuyst  wrote:
> > You can try the phoenix-assembly/  and see if that works for you.  It
> seems
> > like it should have most of what you need.
> >
> > On Wed, Nov 25, 2015 at 4:22 AM, Kristoffer Sjögren 
> > wrote:
> >>
> >> Yes, I was looking for the jars in both these maven repositories but
> >> they are mostly the same AFAICT and neither have the jar i'm looking
> >> for. The link you sent are linux package repositories?
> >>
> >> I wanted to avoid to manually deploy the jar file in our Maven
> >> repository, but that might be the only option?
> >>
> >> http://search.maven.org/#search%7Cga%7C1%7Cg%3A%22org.apache.phoenix%22
> >>
> >>
> http://repo.hortonworks.com/content/repositories/releases/org/apache/phoenix/
> >>
> >> On Tue, Nov 24, 2015 at 10:55 PM, Asher Devuyst 
> wrote:
> >> >
> >> >
> http://docs.hortonworks.com/HDPDocuments/Ambari-2.1.2.1/bk_Installing_HDP_AMB/content/_hdp_stack_repositories.html
> >> >
> >> > On Nov 24, 2015 4:47 PM, "Asher Devuyst"  wrote:
> >> >>
> >> >> If you are doing development, most of what you need is in core
> >> >> (input/output formats utility for MR job setup, etc).  But yes, that
> >> >> also
> >> >> works.  You can also the jget the jars from the Horton Works hdp
> 2.3.2
> >> >> repos
> >> >> if the typical maven repos don't have them.
> >> >>
> >> >> On Nov 24, 2015 4:43 PM, "Kristoffer Sjögren" 
> wrote:
> >> >>>
> >> >>> That's for the query server right? Im not sure that's safe to use
> yet?
> >> >>>
> >> >>> I'm looking for the phoenix-4.4.0-HBase-1.1-client.jar which is in
> the
> >> >>> tar.gz file downloaded from here:
> >> >>>
> >> >>>
> >> >>>
> >> >>>
> http://apache.mirrors.spacedump.net/phoenix/phoenix-4.4.0-HBase-1.1/bin/phoenix-4.4.0-HBase-1.1-bin.tar.gz
> >> >>>
> >> >>> On Tue, Nov 24, 2015 at 7:23 PM, Asher Devuyst 
> >> >>> wrote:
> >> >>> > Try the phoenix-4.4.0-HBase-1.1-server-client.jar instead. You may
> >> >>> > also
> >> >>> > need
> >> >>> > the core jar as well.
> >> >>> >
> >> >>> > On Nov 24, 2015 9:51 AM, "Kristoffer Sjögren" 
> >> >>> > wrote:
> >> >>> >>
> >> >>> >> Hi
> >> >>> >>
> >> >>> >> I'm looking for phoenix-4.4.0-HBase-1.1-client.jar in Maven
> Central
> >> >>> >> but unable to find it.
> >> >>> >>
> >> >>> >> Is manual binary tar.gz unpacking the way to go? What's the
> reason
> >> >>> >> the
> >> >>> >> jar is not in Maven Central?
> >> >>> >>
> >> >>> >> Cheers,
> >> >>> >> -Kristoffer
> >
> >
>


Re: Apache Phoenix Tracing

2015-11-03 Thread Mujtaba Chohan
traceserver.py

is in Phoenix 4.6.0.

On Tue, Nov 3, 2015 at 12:42 AM, Nanda  wrote:

>
> Hi All,
>
> I am trying to enable the tracing app as mentioned in the wiki:
> https://phoenix.apache.org/tracing.html
>
> I followed all the steps but was not able to find the "traceserver.py"
> file in my phoenix/bin directory.
>
> I am using 4.4.x version of phoneix.
>
> TIA.
>
> Nanda
>


Re: Number of regions in SYSTEM.SEQUENCE

2015-09-22 Thread Mujtaba Chohan
Since Phoenix 4.5.x, the default for phoenix.sequence.saltBuckets has been
changed so that the sequence table is no longer split. See this

commit. For older versions, you can drop the sequence table and reconnect
with the client-side phoenix.sequence.saltBuckets property set.
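A sketch of the older-version workaround, assuming phoenix.sequence.saltBuckets can be passed as a client-side connection property and that 0 means no salting (as in the newer default); the ZooKeeper quorum in the URL is a placeholder:

import java.sql.Connection;
import java.sql.DriverManager;
import java.util.Properties;

public class RecreateSequenceTableUnsalted {
    public static void main(String[] args) throws Exception {
        // Run this only after dropping SYSTEM.SEQUENCE so that it is recreated
        // with the new setting on reconnect.
        Properties props = new Properties();
        props.setProperty("phoenix.sequence.saltBuckets", "0"); // assumed: 0 = unsalted
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host", props)) {
            System.out.println("Connected with phoenix.sequence.saltBuckets=0");
        }
    }
}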

On Tue, Sep 22, 2015 at 11:14 AM, Michael McAllister <
mmcallis...@homeaway.com> wrote:

> Hi
>
> By default SYSTEM.SEQUENCE is installed with 256 regions. In an
> environment where you don’t have a large number of tables and regions
> (yet), the end result of this seems to be that with hbase
> balance_switch=true, you end up with a lot of region servers with nothing
> but empty SYSTEM.SEQUENCE regions on them. That mans inefficient use of our
> cluster.
>
> Have there been any best practices developed as to how to deal with this
> situation?
>
> Michael McAllister
> Staff Data Warehouse Engineer | Decision Systems
> mmcallis...@homeaway.com | C: 512.423.7447 | skype: michael.mcallister.ha
>  | webex: https://h.a/mikewebex
>
>
> This electronic communication (including any attachment) is confidential.
> If you are not an intended recipient of this communication, please be
> advised that any disclosure, dissemination, distribution, copying or other
> use of this communication or any attachment is strictly prohibited.  If you
> have received this communication in error, please notify the sender
> immediately by reply e-mail and promptly destroy all electronic and printed
> copies of this communication and any attachment.
>
>


Re: [ANNOUNCE] Welcome our newest Committer Dumindu Buddhika

2015-09-18 Thread Mujtaba Chohan
Welcome onboard Dumindu!!

On Friday, September 18, 2015, Nick Dimiduk  wrote:

> Nice work Dumindu!
>
> On Thu, Sep 17, 2015 at 9:18 PM, Vasudevan, Ramkrishna S <
> ramkrishna.s.vasude...@intel.com > wrote:
>
> > Hi All
> >
> > Please welcome our newest committer Dumindu Buddhika to the Apache
> Phoenix
> > team.  Dumindu,  a student and an intern in the GSoC  program, has
> > contributed lot of new functionalities related to the PHOENIX ARRAY
> feature
> > and also has involved himself in lot of critical bug fixes even after the
> > GSoC period was over.
> > He is a quick learner and a very young blood eager to contribute to
> > Phoenix and its roadmap.
> >
> > All the best and congratulations, Dumindu  Welcome on board !!!
> >
> > Regards
> > Ram
> >
>


Re: NoClassDefFoundError: Could not initialize class org.apache.hadoop.hbase.protobuf.ProtobufUtil while trying to connect to HBase with Phoenix

2015-09-08 Thread Mujtaba Chohan
Can you try with phoenix-client jar instead of phoenix-client-*minimal* jar?

On Tue, Sep 8, 2015 at 10:42 AM, Dmitry Goldenberg  wrote:

> I'm getting this error while trying to connect to HBase in a clustered
> environment. The code seems to work fine in a single node environment.
>
> The full set of stack traces is below.
>
> I can see that ProtobufUtil class is in the
> hbase-client-0.98.9-hadoop2.jar on the classpath. Google's
> protobuf-java-2.5.0.jar is also on the classpath.
>
> Does anyone have an idea as to what might be going wrong here? Thanks.
>
> Caused by: java.sql.SQLException: ERROR 103 (08004): Unable to establish
> connection.
>
> at
> org.apache.phoenix.exception.SQLExceptionCode$Factory$1.newException(SQLExceptionCode.java:362)
> ~[phoenix-4.3.1-client-minimal-hbase0.98.9-hadoop2.4.0.jar:?]
>
> at
> org.apache.phoenix.exception.SQLExceptionInfo.buildException(SQLExceptionInfo.java:133)
> ~[phoenix-4.3.1-client-minimal-hbase0.98.9-hadoop2.4.0.jar:?]
>
> at
> org.apache.phoenix.query.ConnectionQueryServicesImpl.openConnection(ConnectionQueryServicesImpl.java:283)
> ~[phoenix-4.3.1-client-minimal-hbase0.98.9-hadoop2.4.0.jar:?]
>
> at
> org.apache.phoenix.query.ConnectionQueryServicesImpl.access$300(ConnectionQueryServicesImpl.java:166)
> ~[phoenix-4.3.1-client-minimal-hbase0.98.9-hadoop2.4.0.jar:?]
>
> at
> org.apache.phoenix.query.ConnectionQueryServicesImpl$11.call(ConnectionQueryServicesImpl.java:1831)
> ~[phoenix-4.3.1-client-minimal-hbase0.98.9-hadoop2.4.0.jar:?]
>
> at
> org.apache.phoenix.query.ConnectionQueryServicesImpl$11.call(ConnectionQueryServicesImpl.java:1810)
> ~[phoenix-4.3.1-client-minimal-hbase0.98.9-hadoop2.4.0.jar:?]
>
> at
> org.apache.phoenix.util.PhoenixContextExecutor.call(PhoenixContextExecutor.java:77)
> ~[phoenix-4.3.1-client-minimal-hbase0.98.9-hadoop2.4.0.jar:?]
>
> at
> org.apache.phoenix.query.ConnectionQueryServicesImpl.init(ConnectionQueryServicesImpl.java:1810)
> ~[phoenix-4.3.1-client-minimal-hbase0.98.9-hadoop2.4.0.jar:?]
>
> at
> org.apache.phoenix.jdbc.PhoenixDriver.getConnectionQueryServices(PhoenixDriver.java:162)
> ~[phoenix-4.3.1-client-minimal-hbase0.98.9-hadoop2.4.0.jar:?]
>
> at
> org.apache.phoenix.jdbc.PhoenixEmbeddedDriver.connect(PhoenixEmbeddedDriver.java:126)
> ~[phoenix-4.3.1-client-minimal-hbase0.98.9-hadoop2.4.0.jar:?]
>
> at org.apache.phoenix.jdbc.PhoenixDriver.connect(PhoenixDriver.java:133)
> ~[phoenix-4.3.1-client-minimal-hbase0.98.9-hadoop2.4.0.jar:?]
>
> at java.sql.DriverManager.getConnection(DriverManager.java:664)
> ~[?:1.8.0_60]
>
> at java.sql.DriverManager.getConnection(DriverManager.java:247)
> ~[?:1.8.0_60]
>
> at com.myco.util.SqlUtils.getDbConnection(SqlUtils.java:208)
> ~[core-model-0.0.1-SNAPSHOT.jar:?]
>
> ... 35 more
>
> Caused by: java.io.IOException: java.lang.reflect.InvocationTargetException
>
> at
> org.apache.hadoop.hbase.client.HConnectionManager.createConnection(HConnectionManager.java:457)
> ~[hbase-client-0.98.9-hadoop2.jar:0.98.9-hadoop2]
>
> at
> org.apache.hadoop.hbase.client.HConnectionManager.createConnection(HConnectionManager.java:350)
> ~[hbase-client-0.98.9-hadoop2.jar:0.98.9-hadoop2]
>
> at
> org.apache.phoenix.query.HConnectionFactory$HConnectionFactoryImpl.createConnection(HConnectionFactory.java:47)
> ~[phoenix-4.3.1-client-minimal-hbase0.98.9-hadoop2.4.0.jar:?]
>
> at
> org.apache.phoenix.query.ConnectionQueryServicesImpl.openConnection(ConnectionQueryServicesImpl.java:280)
> ~[phoenix-4.3.1-client-minimal-hbase0.98.9-hadoop2.4.0.jar:?]
>
> at
> org.apache.phoenix.query.ConnectionQueryServicesImpl.access$300(ConnectionQueryServicesImpl.java:166)
> ~[phoenix-4.3.1-client-minimal-hbase0.98.9-hadoop2.4.0.jar:?]
>
> at
> org.apache.phoenix.query.ConnectionQueryServicesImpl$11.call(ConnectionQueryServicesImpl.java:1831)
> ~[phoenix-4.3.1-client-minimal-hbase0.98.9-hadoop2.4.0.jar:?]
>
> at
> org.apache.phoenix.query.ConnectionQueryServicesImpl$11.call(ConnectionQueryServicesImpl.java:1810)
> ~[phoenix-4.3.1-client-minimal-hbase0.98.9-hadoop2.4.0.jar:?]
>
> at
> org.apache.phoenix.util.PhoenixContextExecutor.call(PhoenixContextExecutor.java:77)
> ~[phoenix-4.3.1-client-minimal-hbase0.98.9-hadoop2.4.0.jar:?]
>
> at
> org.apache.phoenix.query.ConnectionQueryServicesImpl.init(ConnectionQueryServicesImpl.java:1810)
> ~[phoenix-4.3.1-client-minimal-hbase0.98.9-hadoop2.4.0.jar:?]
>
> at
> org.apache.phoenix.jdbc.PhoenixDriver.getConnectionQueryServices(PhoenixDriver.java:162)
> ~[phoenix-4.3.1-client-minimal-hbase0.98.9-hadoop2.4.0.jar:?]
>
> at
> org.apache.phoenix.jdbc.PhoenixEmbeddedDriver.connect(PhoenixEmbeddedDriver.java:126)
> ~[phoenix-4.3.1-client-minimal-hbase0.98.9-hadoop2.4.0.jar:?]
>
> at org.apache.phoenix.jdbc.PhoenixDriver.connect(PhoenixDriver.java:133)
> ~[phoenix-4.3.1-client-minimal-hbase0.98.9-hadoop2.4.0.jar:?]
>
> at java.sql.DriverManager.getConnection(DriverManager.java:664)
> ~[?:1.8.0_60]
>
> at 

Re: missing rows after using performance.py

2015-09-08 Thread Mujtaba Chohan
Thanks James. Filed https://issues.apache.org/jira/browse/PHOENIX-2240.

On Tue, Sep 8, 2015 at 12:38 PM, James Heather 
wrote:

> Thanks.
>
> I've discovered that the cause is even simpler. With 100M rows, you get
> collisions in the primary key in the CSV file. An experiment (capturing the
> CSV file, and counting the rows with a unique primary key) reveals that the
> number of unique primary keys is about 500 short of the full 100M. So the
> upserting is working as it should!
>
> I don't know if there's a way round this, because it does produce rather
> suspicious-looking results. It might be worth having the program emit a
> warning to this effect if the parameter size is large, or finding a way to
> increase the entropy in the primary keys that are generated, to ensure that
> there won't be collisions.
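A rough way to sanity-check that explanation: if the generator draws n keys uniformly at random from a key space of N possible values, the expected number of duplicate keys is roughly n^2 / (2N). The key-space size below is an assumed placeholder, not a value taken from performance.py:

public class ExpectedKeyCollisions {
    public static void main(String[] args) {
        double n = 100_000_000d; // rows generated
        double N = 1e13;         // assumed size of the random key space (placeholder)
        double expectedDuplicates = (n * n) / (2 * N);
        System.out.printf("Expected duplicate keys: ~%.0f%n", expectedDuplicates);
    }
}

A key space around 1e13 would give on the order of the ~500 duplicates observed; a larger key space (more entropy per key) shrinks that number proportionally.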
>
> It's a bit surprising no one has run into this before! Hopefully this
> script has been run on that many rows before... it seems a reasonable
> number for testing performance of a scalable database... (in fact I was
> planning to increase the row count somewhat).
>
> James
>
>
> On 08/09/15 20:16, James Taylor wrote:
>
> Hi James,
> Looks like currently you'll get an error log message generated if a row is
> attempted to be imported but cannot be (usually due to the data not being
> compatible with the schema). For psql.py, this would be the client-side log
> and messages would look like this:
> LOG.error("Error upserting record {}: {}", csvRecord,
> errorMessage);
>
> FWIW, we have a "strict" option for CSV loading (using the -s or --strict
> option) which is meant to cause the load to abort if bad data is found, but
> it doesn't look like this is currently checked (when bad data is
> encountered). I've filed PHOENIX-2239 for this.
>
> Thanks,
> James
>
> On Tue, Sep 8, 2015 at 11:26 AM, James Heather  > wrote:
>
>> I've had another go running the performance.py script to upsert
>> 100,000,000 rows into a Phoenix table, and again I've ended up with around
>> 500 rows missing.
>>
>> Can anyone explain this, or reproduce it?
>>
>> It is rather concerning: I'm reluctant to use Phoenix if I'm not sure
>> whether rows will be silently dropped.
>>
>> James
>>
>
>
>


Re: Maven issue with version 4.5.0

2015-08-13 Thread Mujtaba Chohan
I'll take a look and will update.

Thanks,
Mujtaba

On Thu, Aug 13, 2015 at 8:33 AM, Yiannis Gkoufas johngou...@gmail.com
wrote:

 Hi there,

 When I try to include the following in my pom.xml:

 <dependency>
   <groupId>org.apache.phoenix</groupId>
   <artifactId>phoenix-core</artifactId>
   <version>4.5.0-HBase-0.98</version>
   <scope>provided</scope>
 </dependency>

 I get this error:

 Failed to collect dependencies at
 org.apache.phoenix:phoenix-core:jar:4.5.0-HBase-0.98: Failed to read
 artifact descriptor for
 org.apache.phoenix:phoenix-core:jar:4.5.0-HBase-0.98: Could not find
 artifact org.apache.phoenix:phoenix:pom:4.5.0-HBase-0.98 in apache release (
 https://repository.apache.org/content/repositories/releases/)

 However, when I switch to the previous version (4.4.0-HBase-0.98)
 everything works as expected
 Did something break with the new release?

 Thanks a lot,
 Yiannis



Re: Maven issue with version 4.5.0

2015-08-13 Thread Mujtaba Chohan
Hi Yiannis. Please retry now.

Thanks,
Mujtaba

On Thu, Aug 13, 2015 at 10:44 AM, Mujtaba Chohan mujt...@apache.org wrote:

 I'll take a look and will update.

 Thanks,
 Mujtaba

 On Thu, Aug 13, 2015 at 8:33 AM, Yiannis Gkoufas johngou...@gmail.com
 wrote:

 Hi there,

 When I try to include the following in my pom.xml:

 <dependency>
   <groupId>org.apache.phoenix</groupId>
   <artifactId>phoenix-core</artifactId>
   <version>4.5.0-HBase-0.98</version>
   <scope>provided</scope>
 </dependency>

 I get this error:

 Failed to collect dependencies at
 org.apache.phoenix:phoenix-core:jar:4.5.0-HBase-0.98: Failed to read
 artifact descriptor for
 org.apache.phoenix:phoenix-core:jar:4.5.0-HBase-0.98: Could not find
 artifact org.apache.phoenix:phoenix:pom:4.5.0-HBase-0.98 in apache release (
 https://repository.apache.org/content/repositories/releases/)

 However, when I switch to the previous version (4.4.0-HBase-0.98)
 everything works as expected
 Did something break with the new release?

 Thanks a lot,
 Yiannis





Re: Phoenix table scan performance

2015-03-09 Thread Mujtaba Chohan
During your scan with data on a single region server (RS), do you see the RS
blocked on disk I/O due to heavy reads, or 100% CPU utilization? If that is
the case, then having the data distributed on 2 RSs would effectively cut the
time in half.

On Mon, Mar 9, 2015 at 10:01 AM, Yohan Bismuth yohan.bismu...@gmail.com
wrote:

 Hello,
 we're currently using Phoenix 4.2 with Hbase 0.98.6 from CDH5.3.2 on our
 cluster and we're experiencing some perf issues.

 What we need to do is a full table scan over 1 billion rows. We've got 50
 regionservers and approximately 1000 regions of 1Gb equally distributed
 on these rs (which means ~20 regions per rs). Each node has 14 disks and 12
 cores.

 A simple Select count(1) from table is currently taking 400~500 sec.

 We noticed that a range scan over 2 regions located on 2 different rs
 seems to be done in parallel (taking 15~20 sec) but a range scan over 2
 regions of a single rs is taking twice this time (about 30~40 sec). We
 experience the same result with more than 2 regions.

 *Could this mean that parallelization is done at a regionserver level but
 not a region level *? in this case 400~500 seconds seems legit with 20~25
 regions per rs. We expected regions of a single rs to be scanned in
 parallel, is this a normal behavior or are we doing something wrong ?

 Thanks for your help



Re: Incompatible jars detected between client and server with CsvBulkloadTool

2015-02-26 Thread Mujtaba Chohan
Just tried connecting Sqlline using Phoenix 4.3 with clean HBase
0.98.4-hadoop2 and it worked fine. Any chance you are using hadoop1?

On Thu, Feb 26, 2015 at 10:38 AM, Naga Vijayapuram naga_vijayapu...@gap.com
 wrote:

  Hi Sun,

  See my comment in https://issues.apache.org/jira/browse/PHOENIX-1248 …

  ||
 With Phoenix 4.3.0, I had to move from HBase 0.98.4 to HBase 0.98.9 to
 overcome the issue ...
 ||

  Naga


  On Feb 26, 2015, at 1:57 AM, su...@certusnet.com.cn wrote:


 Hi, all

  With the latest 4.3 release, I got a strange error about incompatible jars
 between client and server, as follows:

Exception in thread main java.sql.SQLException: ERROR 2006 (INT08):
 Incompatible jars detected between client and server. Ensure that
 phoenix.jar is put on the classpath of HBase in every region server:
 org.apache.hadoop.hbase.protobuf.generated.ZooKeeperProtos$MetaRegionServer.hasState()Z

  Never had seen this kind of exception before. I am sure server jar and
 client jar are both phoenix-4.3.0

  Any hints are accepted and appreciated.
  Thanks,
  Sun.
 --
   --
  CertusNet





Re: Update statistics made query 2-3x slower

2015-02-13 Thread Mujtaba Chohan
Hi Constantin,

This is useful info. Just to clarify the slowdown that you see in the update
statistics case: was *update statistics* executed after the initial data
load, or after you upserted the data again?

2. How many regions/region servers (RS) did the data end up on?

3. 30/60 seconds for count(*) seems really high. Do you see lots of disk
I/O during your count query? How wide are your rows and how much memory is
available on your RS/HBase heap?

4. Can you also send the output of *explain select count(*) from tablex* for
this case?

Thanks,
Mujtaba

On Fri, Feb 13, 2015 at 12:34 AM, Ciureanu, Constantin (GfK) 
constantin.ciure...@gfk.com wrote:

 Hello Mujtaba,



 Don’t worry – it was just the *select count(*) from tableX* that was
 slowed down in a very visible way.

 I presume all the regular queries do actually benefit from using the STATS.



 Some other cases where I saw slowdown for “*select count(*) from tableX*”:

 -   First time after loading 6 M records – the time to obtain the
 count was ~30 sec

 -   After loading the *same* 6 M records again – the time almost
 doubled; I imagine the data is doubled, not yet compacted in HBase

 -   After deleting the 6M rows (delete from …. , not truncate) and
 loading the 6M rows again – the same double time – same comment as above

 -   After update statistics tableX – the time was around 2x the
 original time (~60 seconds) – this I couldn’t really explain (perhaps the
 fleet I use is undersized)



 I need to mention that I’m using 4.2.2, but I can’t wait for 4.3 to be
 released, as it will fix some issues I have. E.g. one with SKIP SCAN:

 [1st part of PK between A and B] or [first part of PK between C and D] or
 [….] was understood as a full table scan = painfully slow - but this
 worked today after I used the /*+ SKIP_SCAN */ hint in the SELECT, which
 shouldn’t be mandatory in my opinion.
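
 For illustration only (PK1 and the bounds are made-up names, not the actual
 schema), the hint usage described above looks roughly like:

   SELECT /*+ SKIP_SCAN */ COUNT(*)
   FROM tableX
   WHERE (PK1 BETWEEN 'A' AND 'B') OR (PK1 BETWEEN 'C' AND 'D');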



 Regards,

   Constantin



 *From:* Mujtaba Chohan [mailto:mujt...@apache.org]
 *Sent:* Thursday, February 12, 2015 9:20 PM

 *To:* user@phoenix.apache.org
 *Subject:* Re: Update statistics made query 2-3x slower



 Constantin - If possible, can you please share your schema, approx.
 row/column width, number of region servers in your cluster plus their heap
 size, HBase/Phoenix version and any default property overrides, so we can
 identify why stats are slowing things down in your case.



 Thanks,

 Mujtaba



 On Thu, Feb 12, 2015 at 12:56 AM, Ciureanu, Constantin (GfK) 
 constantin.ciure...@gfk.com wrote:

 It worked!

 Without stats it’s again faster (2-3x) – but I do understand that
 all other normal queries might benefit from the stats.



 Thank you Mujtaba for the info,

 Thank you Vasudevan for the explanations. I have used HBase before and I agree
 it’s hard to have a counter for the table rows (especially if the
 tombstones for deleted rows are still there – i.e. not compacted yet).



 Constantin







 *From:* Mujtaba Chohan [mailto:mujt...@apache.org]
 *Sent:* Wednesday, February 11, 2015 8:54 PM
 *To:* user@phoenix.apache.org
 *Subject:* Re: Update statistics made query 2-3x slower



 To compare performance without stats, try deleting related rows from
 SYSTEM.STATS or an easier way, just truncate SYSTEM.STATS table from HBase
 shell and restart your region servers.
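
 A sketch of the per-table variant of the first option, assuming the stats
 rows are keyed by the PHYSICAL_NAME column and using a hypothetical table
 name:

   DELETE FROM SYSTEM.STATS WHERE PHYSICAL_NAME = 'TABLEX';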

 //mujtaba



 On Wed, Feb 11, 2015 at 10:29 AM, Vasudevan, Ramkrishna S 
 ramkrishna.s.vasude...@intel.com wrote:

 Hi Constantin



 Before I explain the slowness part, let me answer your 2nd
 question.



 Phoenix is on top of HBase. HBase is a distributed NoSQL DB. So the data
 that resides inside logical entities called regions is spread across
 different nodes (region servers). There is nothing like a table that is in
 one location where you can keep updating the count of rows that is getting
 inserted.



 Which means that when you need count(*) you may have to aggregate the
 count from every region distributed across region servers. In other
 words, a table is not a single entity; it is a collection of regions.



 Coming to the slowness in your query: the update statistics command allows you
 to parallelize the query into logical chunks within a single region. Suppose
 there are 100K rows in a region; the statistics collected would allow you to
 run the query in parallel, for example executing in parallel on 10 equal
 chunks of 10,000 rows each within that region.



 Have you modified any of the parameters related to statistics, like
 ‘phoenix.stats.guidepost.width’?





 Regards

 Ram

 *From:* Ciureanu, Constantin (GfK) [mailto:constantin.ciure...@gfk.com]
 *Sent:* Wednesday, February 11, 2015 2:51 PM
 *To:* user@phoenix.apache.org
 *Subject:* Update statistics made query 2-3x slower



 Hello all,



 1. Is there a good explanation why updating the statistics:

 *update statistics tableX;*



 made this query 2-3x slower? (It was 27 seconds before; now it’s
 somewhere between 60 – 90 seconds.)

 *select count(*) from tableX

Re: Update statistics made query 2-3x slower

2015-02-11 Thread Mujtaba Chohan
To compare performance without stats, try deleting related rows from
SYSTEM.STATS or an easier way, just truncate SYSTEM.STATS table from HBase
shell and restart your region servers.

//mujtaba

On Wed, Feb 11, 2015 at 10:29 AM, Vasudevan, Ramkrishna S 
ramkrishna.s.vasude...@intel.com wrote:

  Hi Constantin



 Before I explain the slowness part, let me answer your 2nd
 question.



 Phoenix is on top of HBase. HBase is a distributed NoSQL DB. So the data
 that resides inside logical entities called regions is spread across
 different nodes (region servers). There is nothing like a table that is in
 one location where you can keep updating the count of rows that is getting
 inserted.



 Which means that when you need count(*) you may have to aggregate the
 count from every region distributed across region servers. In other
 words, a table is not a single entity; it is a collection of regions.



 Coming to the slowness in your query: the update statistics command allows you
 to parallelize the query into logical chunks within a single region. Suppose
 there are 100K rows in a region; the statistics collected would allow you to
 run the query in parallel, for example executing in parallel on 10 equal
 chunks of 10,000 rows each within that region.



 Have you modified any of the parameters related to statistics, like
 ‘phoenix.stats.guidepost.width’?





 Regards

 Ram

 *From:* Ciureanu, Constantin (GfK) [mailto:constantin.ciure...@gfk.com]
 *Sent:* Wednesday, February 11, 2015 2:51 PM
 *To:* user@phoenix.apache.org
 *Subject:* Update statistics made query 2-3x slower



 Hello all,



 1. Is there a good explanation why updating the statistics:

 *update statistics tableX;*



 made this query 2-3x slower? (It was 27 seconds before; now it’s
 somewhere between 60 – 90 seconds.)

 *select count(*) from tableX;*

 +----------+
 | COUNT(1) |
 +----------+
 | 5786227  |
 +----------+

 1 row selected (62.718 seconds)



 (If possible :-) ) how can I “drop” those statistics?



 2. Why is there nothing (like a counter / attribute for the table) to
 obtain the number of rows in a table quickly?



 Thank you,

Constantin



Re: Performance options for doing Phoenix full table scans to complete some data statistics and summary collection work

2015-01-08 Thread Mujtaba Chohan
With 100+ columns, using multiple column families will help a lot if your
full scan uses only a few columns.

Also, if columns are wide, then turning on compression would help if you are
seeing disk I/O contention on region servers.
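
A minimal sketch of both suggestions (table, column family, and column names
are hypothetical, not from your schema): keep the frequently scanned columns in
one family, the wide rarely-read columns in another, and pass compression
through to HBase:

    CREATE TABLE IF NOT EXISTS KPI_SAMPLE (
        ID VARCHAR NOT NULL PRIMARY KEY,
        A.METRIC1 DECIMAL,
        A.METRIC2 DECIMAL,
        B.WIDE_DETAIL1 VARCHAR,
        B.WIDE_DETAIL2 VARCHAR
    ) COMPRESSION='GZ';

    -- a full scan that selects only family A columns avoids reading family B
    SELECT METRIC1, METRIC2 FROM KPI_SAMPLE;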

On Wednesday, January 7, 2015, James Taylor jamestay...@apache.org wrote:

 Hi Sun,
 Can you give us a sample DDL and upsert/select query for #1? What's the
 approximate cluster size and what does the client look like? How much data
 are you scanning? Are you using multiple column families? We should be able
 to help tune things to improve #1.
 Thanks,
 James

 On Monday, January 5, 2015, su...@certusnet.com.cn su...@certusnet.com.cn
 wrote:

 We had first done the test using #1 and the result did not satisfy our
 expectations.
 Unfortunately I did not save a copy of the logs, but under the same conditions
 and datasets, #2 is better than #1.

 Thanks,
 Sun.

 --
 --


 *From:* Nick Dimiduk
 *Date:* 2015-01-06 14:03
 *To:* user@phoenix.apache.org
 *CC:* lars hofhansl
 *Subject:* Re: Performance options for doing Phoenix full table scans to
 complete some data statistics and summary collection work
 Region server fails consistently? Can you provide logs from the failing
 process?

 On Monday, January 5, 2015, su...@certusnet.com.cn 
 su...@certusnet.com.cn wrote:

 Hi, Lars
 Thanks for your reply and advice. You are right, we are considering
 that sort of aggregation work.
 Our requirements need to support a full scan over a table with approximately
 50 million rows and
 nearly 100+ columns. We are using the latest 4.2.2 release; actually we
 are using Spark to read and write to
 Phoenix tables. We apply the MapReduce-over-Phoenix-tables scheme to
 do the full table scan in Spark, and
 then we use the created RDD to write or bulkload to new Phoenix
 tables. That's just our production flow.

 Regarding #1 vs #2 performance, we found that #1 would always
 fail to complete, and we could see a region server
 falling down during the job. #2 would cause some kind of
 ScannerTimeoutException; then we configured parameters
 for our HBase cluster and such problems were gone. However, we are still
 hoping for more efficient approaches for doing
 such full table scans over Phoenix datasets.

 Thanks,
 Sun.

 --
 --

 CertusNet


 *From:* lars hofhansl
 *Date:* 2015-01-06 12:52
 *To:* d...@phoenix.apache.org; user
 *Subject:* Re: Performance options for doing Phoenix full table scans
 to complete some data statistics and summary collection work
 Hi Sun,

 assuming that you are mostly talking about aggregates (in the sense of
 scanning a lot of data, but the resulting set is small), it's interesting
 that option #1 would not satisfy your performance expectations,  but #2
 would.

 Which version of Phoenix are you using? From 4.2 Phoenix is well aware
 of the distribution of the data and will farm out full scans in parallel
 chunks.
 In option #2 you would make a copy of the entire dataset in order to be
 able to query it via Spark?

 What kind of performance do you see with option #1 vs #2?

 Thanks.

 -- Lars

   --
  *From:* su...@certusnet.com.cn su...@certusnet.com.cn
 *To:* user user@phoenix.apache.org; dev d...@phoenix.apache.org
 *Sent:* Monday, January 5, 2015 6:42 PM
 *Subject:* Performance options for doing Phoenix full table scans to
 complete some data statistics and summary collection work

 Hi, all
 Currently we are using Phoenix to store and query large KPI datasets
 for our projects. Note that we definitely need
 to do full table scans of Phoenix KPI tables for data statistics and
 summary collection, e.g. from the five-minute data table to
 an hour-based summary table, and on to day-based and week-based data
 tables, and so on.
 The approaches we currently use are as follows:
 1. using the Phoenix upsert into ... select ... grammar (a sketch follows
 after this list); however, the query
 performance would not satisfy our expectation.
 2. using Apache Spark with the phoenix_mr integration to read data from
 Phoenix tables and create an RDD, then we can transform
 these RDDs into a summary RDD, and bulkload to a new Phoenix data table.
 This approach can satisfy most of our application requirements, but
 in some cases we cannot complete the full scan job.
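
 A sketch of what #1 looks like in practice (table and column names, and the
 hour rollup, are hypothetical, not taken from the actual schema):

   UPSERT INTO KPI_HOURLY (HOST, PERIOD_START, TOTAL_VALUE)
   SELECT HOST, TRUNC(EVENT_TIME, 'HOUR'), SUM(KPI_VALUE)
   FROM KPI_FIVE_MIN
   GROUP BY HOST, TRUNC(EVENT_TIME, 'HOUR');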

 Here are my questions:
 1. Are there any more efficient approaches for improving the performance of
 Phoenix full table scans of large data sets? Anything kindly shared is greatly
 appreciated.
 2. Noting that full table scans are not really appropriate for HBase
 tables, are there any alternative options for doing such work under the current
 HDFS and
 HBase environments? Please kindly share any good points.

 Best regards,
 Sun.





 CertusNet






Re: problem about using tracing

2014-09-03 Thread Mujtaba Chohan
Phoenix connection URL should be of this form
jdbc:phoenix:zookeeper2,zookeeper1,zookeeper3:2181



On Wed, Sep 3, 2014 at 12:11 PM, Jesse Yates jesse.k.ya...@gmail.com
wrote:

 It looks like the connection string that the tracing module is using isn't
 configured correctly. Is 2181 the client port on which you are running
 zookeeper?

 @James Taylor - phoenix can connect to multiple ZK nodes this way, right?

 ---
 Jesse Yates
 @jesse_yates
 jyates.github.com


 On Wed, Sep 3, 2014 at 12:59 AM, su...@certusnet.com.cn 
 su...@certusnet.com.cn wrote:

 Hi all,
 I am trying to enable tracing according to the instructions here
 http://phoenix.apache.org/tracing.html. Here are the steps I took:
 1. copy the phoenix-hadoop2-compat/bin/ properties files into my HBase
 classpath ($HBASE_HOME/conf)
 2. modify hbase-site.xml, adding the following property:
   <property>
     <name>phoenix.trace.frequency</name>
     <value>always</value>
   </property>
 3. restart the HBase cluster and run Phoenix through the sqlline client:
   ./bin/sqlline.py zookeeper1,zookeeper2,zookeeper3
  (zookeeper1,zookeeper2,zookeeper3 are my ZooKeeper hosts)
 4. When I try to exercise the tracing feature through a sqlline query such as
 the following:
   select count (*) from mytable;
 I checked the regionserver log and found the following exception. Any
 available hints?

2014-09-03 15:40:53,218 ERROR [tracing] impl.MetricsSinkAdapter: Got
 sink exception and over retry limit, suppressing further error messages
 java.lang.RuntimeException: java.sql.SQLException: ERROR 102 (08001):
 Malformed connection url.
 jdbc:phoenix:zookeeper2:2181,zookeeper1:2181,zookeeper3:2181;
 at
 org.apache.phoenix.trace.PhoenixTableMetricsWriter.lazyInitialize(PhoenixTableMetricsWriter.java:110)

 at
 org.apache.phoenix.trace.PhoenixTableMetricsWriter.addMetrics(PhoenixTableMetricsWriter.java:185)

 at
 org.apache.phoenix.trace.PhoenixMetricsSink.putMetrics(PhoenixMetricsSink.java:92)

 at
 org.apache.hadoop.metrics2.impl.MetricsSinkAdapter.consume(MetricsSinkAdapter.java:173)

 at
 org.apache.hadoop.metrics2.impl.MetricsSinkAdapter.consume(MetricsSinkAdapter.java:41)

 at
 org.apache.hadoop.metrics2.impl.SinkQueue.consumeAll(SinkQueue.java:87)
 at
 org.apache.hadoop.metrics2.impl.MetricsSinkAdapter.publishMetricsFromQueue(MetricsSinkAdapter.java:127)

 at
 org.apache.hadoop.metrics2.impl.MetricsSinkAdapter$1.run(MetricsSinkAdapter.java:86)

 Caused by: java.sql.SQLException: ERROR 102 (08001): Malformed connection
 url. jdbc:phoenix:zookeeper2:2181,zookeeper1:2181,zookeeper3:2181;
 at
 org.apache.phoenix.exception.SQLExceptionCode$Factory$1.newException(SQLExceptionCode.java:333)

 at
 org.apache.phoenix.exception.SQLExceptionInfo.buildException(SQLExceptionInfo.java:133)

 at
 org.apache.phoenix.jdbc.PhoenixEmbeddedDriver$ConnectionInfo.getMalFormedUrlException(PhoenixEmbeddedDriver.java:183)

 at
 org.apache.phoenix.jdbc.PhoenixEmbeddedDriver$ConnectionInfo.create(PhoenixEmbeddedDriver.java:238)

 at
 org.apache.phoenix.jdbc.PhoenixDriver.getConnectionQueryServices(PhoenixDriver.java:144)

 at
 org.apache.phoenix.jdbc.PhoenixEmbeddedDriver.connect(PhoenixEmbeddedDriver.java:129)

 at org.apache.phoenix.jdbc.PhoenixDriver.connect(PhoenixDriver.java:133)
 at java.sql.DriverManager.getConnection(DriverManager.java:571)
 at java.sql.DriverManager.getConnection(DriverManager.java:187)
 at org.apache.phoenix.util.QueryUtil.getConnection(QueryUtil.java:213)
 at
 org.apache.phoenix.trace.PhoenixTableMetricsWriter.lazyInitialize(PhoenixTableMetricsWriter.java:100)

 ... 7 more

 --
 --

 CertusNet