Re: Index table empty

2020-01-27 Thread Tim Dolbeare
Hi Josh,

Thanks for the quick response!  I'm running Hadoop 2.8.5-amzn-5, HBase 1.4.10, 
and Phoenix 4.14.3.

A 'select *' from the table does indeed return the expected data, but selecting 
a column that's part of the index does not, unless I explicitly run it with the 
NO_INDEX hint.

Regarding the timeout error on 'select count(*)', I can do a select count(*) 
pretty quickly as long as I don't involve the problematic index--

0: jdbc:phoenix:> select /*+NO_INDEX*/ count(*) from meta_reads;
+---+
| COUNT(1)  |
+---+
| 100   |
+---+
1 row selected (1.748 seconds)


I'm not sure why the index would be implicated in a 'select count(*)' to begin 
with, but the first line of the query plan for it is "CLIENT 1-CHUNK PARALLEL 
1-WAY FULL SCAN OVER IDX_GC".  Since the index table is empty, I would expect a 
very fast response with "0" as the result.  So the timeout is puzzling, and I'm 
not a big believer in coincidences.

Regarding the logs files-- I get "Log aggregation has not completed or is not 
enabled."  However in the EMR console, I can find logs for the indexing 
mapreduce job, but there isn't any indication of failure.  The local hbase.log 
also showed no failures.

Thanks for your help.  I think you've reassured me that I'm not doing anything 
obviously wrong and this seems like a pretty ordinary use case, so unless 
anyone can suggest something else I think my next step is to rebuild the 
cluster and try again.



From: Josh Elser 
Sent: Monday, January 27, 2020 4:11 PM
To: user@phoenix.apache.org 
Subject: Re: Index table empty
 

Hi Tim,

It sounds like you're doing the right steps to build an index with the
async approach. Not having records after IndexTool runs successfully is
definitely unexpected :)

If you aren't getting any records in the index table after running the
IndexTool, my guess is that something is going awry there. Have you
looked at the logging produced by that MapReduce job? (e.g. `yarn logs
-applicationId `).

It's curious that the job runs successfully and sets the index state to
active, but you don't have any data loaded.

As a sanity check, does `select * from meta_reads limit 10` return you
expected data?

For the future, always a good idea to let us know what version of
Hadoop/HBase/Phoenix you're running when asking questions.

PS: If you're worried about the SocketTimeException, it's probably more
to do with the size of your data table and the way a `select count(*)`
runs. This is a full-table scan, and you'd have to increase
hbase.rpc.timeout at at minimum to a larger value. If this is a normal
query pattern you intend to service, it will be an exercise in tweaking
configs.

On 1/27/20 6:01 PM, Tim Dolbeare wrote:
> Hello All,
>
> I've run into a problem with a Phoenix index that no amount of googling
> is solving.  I hope someone might have run into this before and can
> offer some suggestions.  I'm a noob BTW, so please don't hesitate to
> point out the most obvious potential issues.  The problem is that after
> indexing a table already populated with 1M rows a) any query that uses
> the new index returns 0 results and b) the index table itself is empty.
>
> I have created a table via psql.py, populated it with 1M rows via
> CsvBulkLoadTool, created an async covered index on that table in
> sqlline.py, followed by a mapreduce index population with IndexTool.
> All of that completes without error, and the index is marked "ACTIVE".
>
> Here are my table and index definitions:
>
> DROP TABLE IF EXISTS meta_reads;
> CREATE IMMUTABLE TABLE IF NOT EXISTS meta_reads (
>cluster VARCHAR,
>subclass VARCHAR,
>class VARCHAR,
>sex VARCHAR,
>region VARCHAR,
>subregion VARCHAR,
>cell VARCHAR NOT NULL,
>gene VARCHAR NOT NULL,
>read FLOAT,
>CONSTRAINT my_pk PRIMARY KEY (cell, gene))
> IMMUTABLE_STORAGE_SCHEME = ONE_CELL_PER_COLUMN;
>
> create index idx_gc on meta_reads(gene, cluster) include(read) ASYNC;
>
>
> Almost any query that attempts to use the index returns 0 results,
> however 'select count(*) from meta_reads' throws a SocketTimeoutException.
>
>
> Any ideas?
>
> Thanks
>
> Tim
>
>
>
>
>


Re: Index table empty

2020-01-27 Thread Josh Elser

Hi Tim,

It sounds like you're doing the right steps to build an index with the 
async approach. Not having records after IndexTool runs successfully is 
definitely unexpected :)


If you aren't getting any records in the index table after running the 
IndexTool, my guess is that something is going awry there. Have you 
looked at the logging produced by that MapReduce job? (e.g. `yarn logs 
-applicationId `).


It's curious that the job runs successfully and sets the index state to 
active, but you don't have any data loaded.


As a sanity check, does `select * from meta_reads limit 10` return you 
expected data?


For the future, always a good idea to let us know what version of 
Hadoop/HBase/Phoenix you're running when asking questions.


PS: If you're worried about the SocketTimeException, it's probably more 
to do with the size of your data table and the way a `select count(*)` 
runs. This is a full-table scan, and you'd have to increase 
hbase.rpc.timeout at at minimum to a larger value. If this is a normal 
query pattern you intend to service, it will be an exercise in tweaking 
configs.


On 1/27/20 6:01 PM, Tim Dolbeare wrote:

Hello All,

I've run into a problem with a Phoenix index that no amount of googling 
is solving.  I hope someone might have run into this before and can 
offer some suggestions.  I'm a noob BTW, so please don't hesitate to 
point out the most obvious potential issues.  The problem is that after 
indexing a table already populated with 1M rows a) any query that uses 
the new index returns 0 results and b) the index table itself is empty.


I have created a table via psql.py, populated it with 1M rows via 
CsvBulkLoadTool, created an async covered index on that table in 
sqlline.py, followed by a mapreduce index population with IndexTool.  
All of that completes without error, and the index is marked "ACTIVE".


Here are my table and index definitions:

DROP TABLE IF EXISTS meta_reads;
CREATE IMMUTABLE TABLE IF NOT EXISTS meta_reads (
       cluster VARCHAR,
       subclass VARCHAR,
       class VARCHAR,
       sex VARCHAR,
       region VARCHAR,
       subregion VARCHAR,
       cell VARCHAR NOT NULL,
       gene VARCHAR NOT NULL,
       read FLOAT,
       CONSTRAINT my_pk PRIMARY KEY (cell, gene))
IMMUTABLE_STORAGE_SCHEME = ONE_CELL_PER_COLUMN;

create index idx_gc on meta_reads(gene, cluster) include(read) ASYNC;


Almost any query that attempts to use the index returns 0 results, 
however 'select count(*) from meta_reads' throws a SocketTimeoutException.



Any ideas?

Thanks

Tim