>> For 1, the check in HCM.isTableAvailable() is:
>>
>> return available.get() && (regionCount.get() > 0);
>>
>> This explains why some regions aren't available.
The javadoc says the function returns true if all regions are available. Going by the code, that is clearly not what it does. Also, some of the callers of this function may be broken as a result (for example in LoadIncrementalHFiles.java).

>> For 3, can you provide a unit test so that we can investigate further ?

The problem is I am unable to get the master to crash consistently. I can send you the split keys.

Thank you
Vidhya


On 5/17/11 4:59 PM, "Ted Yu" <[email protected]> wrote:

For 1, the check in HCM.isTableAvailable() is:

return available.get() && (regionCount.get() > 0);

This explains why some regions aren't available.

For 3, can you provide a unit test so that we can investigate further ?

Thanks

On Tue, May 17, 2011 at 4:25 PM, Vidhyashankar Venkataraman <[email protected]> wrote:

> (Running HBase 0.90.0 on 700+ nodes.)
>
> You may have seen many (or mostly all) of the following issues already:
>
> 1. HConnection.isTableAvailable: This doesn't seem to work all the
> time. In particular, I had this code after creating a table asynchronously:
>
> do {
>   LOG.info("Table " + tableName + " not yet available... Sleeping for " +
>       sleepTime + " milliseconds...");
>   Thread.sleep(sleepTime);
> } while (!conn.isTableAvailable(table.getTableName()));
> LOG.info("Table is available!! : " + tableName + " Available? " +
>     conn.isTableAvailable(table.getTableName()));
>
> It comes out of the loop, but then I see this:
>
> Table is available!! : <TABLE> Available? false
>
> And then I see that not all the regions are available yet.
>
> 2. The master getting stuck, unable to delete a WAL (I have seen this
> before on this forum, and there is a related JIRA): We had worked around
> it by manually deleting the WAL. But when the master crashed during table
> creation (with split key boundaries), the node that took over as master
> (failover) got stuck for around 25% of the cluster. I had to wipe out all
> the logs so that the master could start up cleanly.
>
> Even then, the regionservers that had suffered the log issue couldn't
> recognize the failed-over master. (Is this something that has been
> observed before?)
>
> 3. createTableAsync with incorrect split keys: By mistake, I had some
> duplicate keys in the split key byte array while calling
> createTableAsync. The master crashed throwing a KeeperException
> (because of the duplicate keys, I guess?)
>
> Also, can you let me know why createTableAsync blocks for some time and
> throws a socket timeout exception when I try creating a table with a
> large number of regions?
>
> Thank you
> Vidhya
>
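
As a rough sketch of a workaround for issue 1, assuming the 0.90-era HConnection/HTable client API: since the check quoted at the top only tells you that the regions already written to .META. are assigned, a caller that knows how many regions it asked for (number of split keys + 1) can poll until that many regions actually show up. WaitForTable and waitForAllRegions are hypothetical names used here for illustration, not anything in HBase itself:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.HConnection;
import org.apache.hadoop.hbase.client.HTable;

public class WaitForTable {
  /**
   * Blocks until the table is reported available AND the expected number of
   * regions can be seen, instead of trusting isTableAvailable() alone,
   * which only looks at the regions already present in .META.
   */
  public static void waitForAllRegions(Configuration conf, HConnection conn,
      byte[] tableName, int expectedRegions, long sleepMs)
      throws IOException, InterruptedException {
    HTable table = new HTable(conf, tableName);
    try {
      while (true) {
        boolean available = conn.isTableAvailable(tableName);
        // getStartEndKeys() returns one start key per region currently in .META.
        int regions = table.getStartEndKeys().getFirst().length;
        if (available && regions >= expectedRegions) {
          return;
        }
        Thread.sleep(sleepMs);
      }
    } finally {
      table.close();
    }
  }
}

For a table created with createTableAsync(desc, splitKeys), expectedRegions would be splitKeys.length + 1.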

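And for issue 3, a minimal sketch of guarding createTableAsync against duplicate or empty split keys before they ever reach the master. It assumes the 0.90-era HBaseAdmin API; SafeCreateTable, sanitizeSplitKeys, and the table/family names are made up for illustration, and this is a client-side guard rather than a fix for the KeeperException itself:

import java.util.TreeSet;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class SafeCreateTable {
  /** Sorts the split keys and drops duplicates and empty keys. */
  public static byte[][] sanitizeSplitKeys(byte[][] splitKeys) {
    // A TreeSet over the lexicographic byte comparator both sorts the keys
    // and silently collapses duplicates, so the master never sees two
    // identical region boundaries.
    TreeSet<byte[]> keys = new TreeSet<byte[]>(Bytes.BYTES_COMPARATOR);
    for (byte[] key : splitKeys) {
      // An empty split key would collide with the first region's start key.
      if (key != null && key.length > 0) {
        keys.add(key);
      }
    }
    return keys.toArray(new byte[keys.size()][]);
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    HTableDescriptor desc = new HTableDescriptor("mytable");
    desc.addFamily(new HColumnDescriptor("cf"));
    // "a" appears twice on purpose; sanitizeSplitKeys collapses it to one boundary.
    byte[][] rawSplits = { Bytes.toBytes("a"), Bytes.toBytes("a"), Bytes.toBytes("m") };
    admin.createTableAsync(desc, sanitizeSplitKeys(rawSplits));
  }
}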