Re: Help with bad errors on 4.6.1

2018-03-13 Thread Enrico Olivelli
2018-03-13 17:19 GMT+01:00 Ivan Kelly :

> > @Ivan
> > I wonder if some tests on Jepsen with bookie restarts may find this kind of
> > issues, given that it is not a network/OS problem
> If Jepsen can catch it, then a normal integration test can too. The readers in
> question, are they tailing with long poll, or just calling
> readLastAddConfirmed in a loop? What is the configuration in terms of
> ensemble/write/ack?
>

readLastAddConfirmed in a loop; see this code, which is mostly like the
tutorial:

https://github.com/diennea/majordodo/blob/1487dc85a79e64ac0624a320729f2ad425fe15dd/majordodo-core/src/main/java/majordodo/replication/ReplicatedCommitLog.java#L975
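
For readers of the archive, the "readLastAddConfirmed in a loop" pattern can be
sketched roughly as below. This is not the linked Majordodo code and it does not
use the BookKeeper client: `TailableLedger` and `InMemoryLedger` are hypothetical
stand-ins invented for illustration, so that the sketch runs without a cluster.
The idea is only that each poll asks for the last add confirmed (LAC) and then
reads every entry between the last one already consumed and the LAC.

```java
import java.util.ArrayList;
import java.util.List;

public class TailingReaderSketch {

    /** Hypothetical stand-in for a ledger opened for tailing; NOT the BookKeeper API. */
    interface TailableLedger {
        /** Highest entry id known to be acknowledged, or -1 if there is none yet. */
        long readLastAddConfirmed();

        byte[] readEntry(long entryId);
    }

    /** In-memory fake (assumption, for illustration only). */
    static final class InMemoryLedger implements TailableLedger {
        private final List<byte[]> entries = new ArrayList<>();

        synchronized void append(byte[] data) {
            entries.add(data);
        }

        @Override
        public synchronized long readLastAddConfirmed() {
            return entries.size() - 1;
        }

        @Override
        public synchronized byte[] readEntry(long entryId) {
            return entries.get((int) entryId);
        }
    }

    /** One iteration of the loop: returns the entries in (lastRead, LAC], in order. */
    static List<byte[]> pollOnce(TailableLedger ledger, long lastRead) {
        List<byte[]> out = new ArrayList<>();
        long lac = ledger.readLastAddConfirmed();
        for (long id = lastRead + 1; id <= lac; id++) {
            out.add(ledger.readEntry(id));
        }
        return out;
    }

    public static void main(String[] args) {
        InMemoryLedger ledger = new InMemoryLedger();
        ledger.append("a".getBytes());
        ledger.append("b".getBytes());

        long lastRead = -1;
        List<byte[]> batch = pollOnce(ledger, lastRead);
        lastRead += batch.size(); // advance the cursor past what was consumed
        System.out.println("read " + batch.size() + " entries, lastRead=" + lastRead);
    }
}
```

A real reader would call `pollOnce` repeatedly with a pause between polls; the
long-poll alternative Ivan mentions avoids that busy polling.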





>
> I can try to put together a repro too, using the integ test framework.
>

thank you


>
> -Ivan
>


Re: Help with bad errors on 4.6.1

2018-03-13 Thread Ivan Kelly
> @Ivan
> I wonder if some tests on Jepsen with bookie restarts may find this kind of
> issues, given that it is not a network/OS problem
If Jepsen can catch it, then a normal integration test can too. The readers in
question, are they tailing with long poll, or just calling
readLastAddConfirmed in a loop? What is the configuration in terms of
ensemble/write/ack?
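
As background on the ensemble/write/ack question, here is a small sketch of how
the three sizes relate. It assumes the usual constraint ensemble >= write quorum
>= ack quorum and a round-robin striping of entries over the ensemble; the class
and method names are made up for illustration and are not BookKeeper classes.

```java
import java.util.ArrayList;
import java.util.List;

public class QuorumSketch {

    /** Rejects combinations that violate ensemble >= writeQuorum >= ackQuorum >= 1. */
    static void validate(int ensembleSize, int writeQuorum, int ackQuorum) {
        if (ackQuorum < 1 || ackQuorum > writeQuorum || writeQuorum > ensembleSize) {
            throw new IllegalArgumentException(
                "need ensemble >= writeQuorum >= ackQuorum >= 1, got "
                    + ensembleSize + "/" + writeQuorum + "/" + ackQuorum);
        }
    }

    /** Indexes (within the ensemble) of the bookies that receive the given entry. */
    static List<Integer> writeSet(long entryId, int ensembleSize, int writeQuorum) {
        List<Integer> set = new ArrayList<>();
        for (int i = 0; i < writeQuorum; i++) {
            set.add((int) ((entryId + i) % ensembleSize));
        }
        return set;
    }

    public static void main(String[] args) {
        int ensemble = 3, write = 2, ack = 2;
        validate(ensemble, write, ack);
        for (long entry = 0; entry < 4; entry++) {
            // With 3/2/2 each entry lands on two of the three bookies.
            System.out.println("entry " + entry + " -> bookies " + writeSet(entry, ensemble, write));
        }
    }
}
```

With an ensemble size of 1, every entry maps to the single bookie at index 0, so
a restart of that bookie affects every read and write on the ledger.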

I can try to put together a repro too, using the integ test framework.

-Ivan


Re: Help with bad errors on 4.6.1

2018-03-13 Thread Enrico Olivelli
Findings of today:
A - the system fails even with BK 4.6.0
B - we have moved all the clients and the bookies to different machines
(keeping the same ZK cluster), same problem
C - I have copies of the application which are running on other similar
machines (on the same Blade/VMWare system)
D - I have tried to disable Netty polls on client side (Sijie's
suggestion): no effect
E - with ensemblesize = 1 the problem on readers does not occur, but the
writer seems unable to recover from a restart of the only bookie
(it seems stuck writing, with the PendingAddOp logger reporting "Failed to write entry 
Bookie Operation Timed Out")
F - the ZK cluster is working perfectly, as it is serving a lot of other
services of the application (Kafka, Majordodo, BlazingCache, HBase)
without errors
G - all of the other distributed components are running without issues
(Kafka, HDFS; see the list above about ZK), and other database connections
too (the application connects to several external machines)
H - bookkeeper bookiesanity runs OK on every bookie
I - my colleagues checked networking, VMware and the OS; we suspected
problems on the loopback interfaces, but the problem still occurs after moving
each part to a dedicated machine
L - I have tested with 4.6.2-SNAPSHOT...same as above
M - the problem starts when a bookie restarts and then joins the cluster
again (not when you kill it)

Given all of these facts:
1) It may be a network/OS problem (given points F and G, I doubt it)
2) It may be a bug in BK
3) It is not a regression in 4.6.1, but 4.6.2 has no fix
4) I will instrument the BK code in order to get better debug output for the error
5) I will create a reproducer without the full application (which is huge)

I have memory (hprof) dumps of a failing client and a failing bookie, if
someone has time to look at them. Honestly, I have already spent some time
looking for a leak or a bad recycler, without success (I am not sure this is
the right way to approach the problem).

I have no proof, but maybe there is a problem with pending reads: when the
bookie is down the read remains "pending"; then, when the channel is active
again (the bookie joins the cluster), that "old" pending read (which is not
needed anymore) reaches the bookie and crashes everything.

It is interesting that it seems to be the "other" bookies that break, not the
one which joins the cluster (this is how it seems to me).

@Ivan
I wonder if some tests on Jepsen with bookie restarts may find this kind of
issue, given that it is not a network/OS problem

Regards

Enrico





2018-03-12 20:51 GMT+01:00 Enrico Olivelli :

>
>
> On Mon, 12 Mar 2018, 20:40 Ivan Kelly  wrote:
>
>> > It is interesting that the problem is on 'readers' and it seems that the
>> > PCBC seems corrupted and even writes (if the broker is promoted to
>> > 'leader') are able to go on after the reads broke the client.
>> Are writes coming from the same clients? Or clients in the same process?
>>
>
> Same o.a.b.c.BookKeeper object
>
>>
>> -Ivan
>>
> --
>
>
> -- Enrico Olivelli
>


Re: [DISCUSS] Set `ENABLE_DIGEST_TYPE_AUTODETECTION` to true as default value

2018-03-13 Thread Sijie Guo
On Tue, Mar 13, 2018 at 12:42 AM, Enrico Olivelli 
wrote:

> Good idea
> I have already responded on the PR
>
> Summary of my response:
>
>    1. okay to make it the default
>    2. may this change break existing tests, or at least change the meaning
>    of what is tested?
>    3. we should add a test about this change, at least on the new API
>
>
Replied on the PR.


>
> Enrico
>
>
>
> 2018-03-13 8:41 GMT+01:00 Sijie Guo :
>
> > Hi all,
> >
> > I am raising a discussion to set `ENABLE_DIGEST_TYPE_AUTODETECTION` to true
> > to turn on this feature by default. Because the `digest type` has been
> > recorded in ledger metadata since 4.5, it is better for the client to use the
> > digest type recorded in ledger metadata.
> >
> > Here is a proposal of the change:
> > https://github.com/apache/bookkeeper/pull/1252
> >
> > - Sijie
> >
>


Re: [DISCUSS] Set `ENABLE_DIGEST_TYPE_AUTODETECTION` to true as default value

2018-03-13 Thread Enrico Olivelli
Good idea
I have already responded on the PR

Summary of my response:

   1. okay to make it the default
   2. could this change break existing tests, or at least change the meaning
   of what is tested?
   3. we should add a test about this change, at least on the new API


Enrico



2018-03-13 8:41 GMT+01:00 Sijie Guo :

> Hi all,
>
> I am raising a discussion to set `ENABLE_DIGEST_TYPE_AUTODETECTION` to true
> to turn on this feature by default. Because the `digest type` has been
> recorded in ledger metadata since 4.5, it is better for the client to use the
> digest type recorded in ledger metadata.
>
> Here is a proposal of the change:
> https://github.com/apache/bookkeeper/pull/1252
>
> - Sijie
>


[DISCUSS] Set `ENABLE_DIGEST_TYPE_AUTODETECTION` to true as default value

2018-03-13 Thread Sijie Guo
Hi all,

I am raising a discussion to set `ENABLE_DIGEST_TYPE_AUTODETECTION` to true
to turn on this feature by default. Because the `digest type` has been
recorded in ledger metadata since 4.5, it is better for the client to use the
digest type recorded in ledger metadata.
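
To make the intended behaviour concrete, here is a minimal sketch of the
selection logic as I understand it from the description above. It is not the
code in the PR: the `DigestType` enum and `effectiveDigest` method are
hypothetical names for illustration, and the fallback to the caller-supplied
type for ledgers without a recorded digest type is an assumption.

```java
import java.util.Optional;

public class DigestAutodetectSketch {

    /** Hypothetical enum for illustration; not the BookKeeper type. */
    enum DigestType { MAC, CRC32 }

    /**
     * Picks the digest type used when opening a ledger.
     *
     * @param autodetect   whether autodetection is enabled
     * @param fromMetadata digest type recorded in ledger metadata, if any
     * @param fromCaller   digest type passed by the application
     */
    static DigestType effectiveDigest(boolean autodetect,
                                      Optional<DigestType> fromMetadata,
                                      DigestType fromCaller) {
        if (autodetect && fromMetadata.isPresent()) {
            return fromMetadata.get(); // trust what was recorded at creation time
        }
        return fromCaller; // assumption: fall back when nothing was recorded
    }

    public static void main(String[] args) {
        // The application passes MAC, but the ledger was created with CRC32.
        System.out.println(effectiveDigest(true, Optional.of(DigestType.CRC32), DigestType.MAC));
        System.out.println(effectiveDigest(false, Optional.of(DigestType.CRC32), DigestType.MAC));
    }
}
```

With autodetection on, a mismatch between the type the application passes and
the type recorded in metadata no longer matters for ledgers that carry one.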

Here is a proposal of the change:
https://github.com/apache/bookkeeper/pull/1252

- Sijie