Re: Re[2]: Asynchronous registration of binary metadata

2019-08-15 Thread Sergey Chugunov
Denis,

Thanks for bringing this issue up. The decision to write binary metadata from
the discovery thread was indeed a tough one to make.
I don't think that moving metadata to the metastorage is a silver bullet here,
as that approach also has its drawbacks and is not an easy change.

In addition to the workarounds suggested by Alexei, we have two options for
offloading the write operation from the discovery thread:

   1. Your scheme with a separate writer thread and futures that are completed
   when the write operation is finished.
   2. A PME-like protocol, with obvious complications like failover and
   asynchronous waiting for replies over the communication layer.

Your suggestion looks easier from the code complexity perspective, but in my
view it increases the chances of running into starvation. Today, if a node faces
really long delays during the write operation, it gets kicked out of the topology
by the discovery protocol. With your approach it is possible that more and more
threads from other pools get stuck waiting on the operation future, which is not
good either.
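
Just to make sure we mean the same thing, here is a minimal sketch of option 1
as I understand it (class and method names are made up for illustration, this is
not the actual Ignite code):

// Hypothetical sketch of option 1: a dedicated writer thread plus per-type futures.
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class BinaryMetadataWriter {
    private final ExecutorService writer = Executors.newSingleThreadExecutor();
    private final Map<Integer, CompletableFuture<Void>> futs = new ConcurrentHashMap<>();

    /** Called from the discovery thread: schedule the disk write and return immediately. */
    CompletableFuture<Void> scheduleWrite(int typeId, byte[] meta) {
        CompletableFuture<Void> fut = futs.computeIfAbsent(typeId, id -> new CompletableFuture<>());
        writer.submit(() -> {
            try {
                writeToDisk(typeId, meta); // the slow, fsync'ed part happens here
                fut.complete(null);
            }
            catch (Exception e) {
                fut.completeExceptionally(e);
            }
        });
        return fut;
    }

    /** Called from threads that actually need the type; this is where waiters can pile up. */
    void awaitWritten(int typeId, long timeoutMs) throws Exception {
        CompletableFuture<Void> fut = futs.get(typeId);
        if (fut != null)
            fut.get(timeoutMs, TimeUnit.MILLISECONDS); // bounded wait instead of hanging forever
    }

    private void writeToDisk(int typeId, byte[] meta) {
        // Write the metadata file to the work directory and fsync it (omitted).
    }
}

The discovery thread only calls scheduleWrite() and moves on; the starvation risk
I am talking about is in awaitWritten(), where user threads from other pools may
pile up if the disk stays slow for a long time.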

What do you think?

I also think that if we want to approach this issue systematically, we need
to do a deep analysis of the metastorage option as well and finally choose
which road we want to take.

Thanks!

On Thu, Aug 15, 2019 at 9:28 AM Zhenya Stanilovsky
 wrote:

>
> >
> >> 1. Yes, only on OS failures. In such a case the data will be received
> >> from alive nodes later.
> What would the behavior be in the case of a single node? I suppose someone
> could obtain cache data without the schema needed to unmarshal it; what would
> happen to grid operability in this case?
>
> >
> >> 2. Yes, for walmode=FSYNC writes to the metastore will be slow. But such
> >> a mode should not be used if you have more than two nodes in the grid
> >> because it has a huge impact on performance.
> Does WAL mode affect the metadata store?
>
> >
> >>
> >> Wed, Aug 14, 2019 at 14:29, Denis Mekhanikov < dmekhani...@gmail.com
> >:
> >>
> >>> Folks,
> >>>
> >>> Thanks for showing interest in this issue!
> >>>
> >>> Alexey,
> >>>
>  I think removing fsync could help to mitigate performance issues with
> >>> current implementation
> >>>
> >>> Is my understanding correct that if we remove fsync, then discovery won’t
> >>> be blocked, data will be flushed to disk in the background, and loss of
> >>> information will be possible only on an OS failure? It sounds like an
> >>> acceptable workaround to me.
> >>>
> >>> Will moving metadata to the metastore actually resolve this issue? Please
> >>> correct me if I’m wrong, but we will still need to write the information
> >>> to the WAL before releasing the discovery thread. If the WAL mode is FSYNC,
> >>> then the issue will still be there. Or is it planned to abandon the
> >>> discovery-based protocol altogether?
> >>>
> >>> Evgeniy, Ivan,
> >>>
> >>> In my particular case the data wasn’t too big. It was a slow virtualised
> >>> disk with encryption that made operations slow. Given that there are 200
> >>> nodes in the cluster, each of which writes slowly, and this process is
> >>> sequential, a single piece of metadata is registered extremely slowly.
> >>>
> >>> Ivan, answering your other questions:
> >>>
 2. Do we need persistent metadata for in-memory caches? Or is it so
 by accident?
> >>>
> >>> It should be checked whether it’s safe to stop writing marshaller
> >>> mappings to disk without losing any guarantees.
> >>> But anyway, I would like to have a property that would control this. If
> >>> metadata registration is slow, then the initial cluster warmup may take a
> >>> while. So, if we preserve metadata on disk, we will need to warm it up
> >>> only once, and further restarts won’t be affected.
> >>>
>  Do we really need a fast fix here?
> >>>
> >>> I would like a fix that could be implemented now, since the activity of
> >>> moving metadata to the metastore doesn’t sound like a quick one. Having a
> >>> temporary solution would be nice.
> >>>
> >>> Denis
> >>>
>  On 14 Aug 2019, at 11:53, Павлухин Иван < vololo...@gmail.com >
> wrote:
> 
>  Denis,
> 
>  Several clarifying questions:
 1. Do you have an idea why metadata registration takes so long? Such
 poor disks? So much data to write? Contention with disk writes by
 other subsystems?
 2. Do we need persistent metadata for in-memory caches? Or is it so
 by accident?
> 
 Generally, I think that it is possible to move metadata-saving
 operations out of the discovery thread without losing the required
 consistency/integrity.

 As Alex mentioned, using the metastore looks like a better solution. Do we
 really need a fast fix here? (Are we talking about a fast fix?)
> 
 Wed, Aug 14, 2019 at 11:45, Zhenya Stanilovsky
> >>> < arzamas...@mail.ru.invalid >:
> >
> > Alexey, but in this case the customer needs to be informed that a whole-cluster
> > crash (power off, for example of a 1-node cluster) could lead to partial data
> > unavailability, and maybe to further index corruption.

Re[2]: Asynchronous registration of binary metadata

2019-08-15 Thread Zhenya Stanilovsky

>
>> 1. Yes, only on OS failures. In such a case the data will be received from
>> alive nodes later.
What would the behavior be in the case of a single node? I suppose someone could
obtain cache data without the schema needed to unmarshal it; what would happen to
grid operability in this case?

>
>> 2. Yes, for walmode=FSYNC writes to the metastore will be slow. But such a
>> mode should not be used if you have more than two nodes in the grid because
>> it has a huge impact on performance.
Does WAL mode affect the metadata store?
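
Just to be sure we mean the same setting, this is the WAL mode I am asking about
(a minimal sketch using the standard Ignite configuration API, as far as I
remember it; whether the binary metadata files obey this setting is exactly my
question):

// Sketch: configuring the WAL mode being discussed (FSYNC vs LOG_ONLY).
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.configuration.WALMode;

public class WalModeExample {
    public static void main(String[] args) {
        DataStorageConfiguration dsCfg = new DataStorageConfiguration();
        dsCfg.setWalMode(WALMode.FSYNC); // every WAL write is fsync'ed before the operation returns

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setDataStorageConfiguration(dsCfg);

        Ignition.start(cfg);
    }
}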

>
>> 
>> Wed, Aug 14, 2019 at 14:29, Denis Mekhanikov < dmekhani...@gmail.com >:
>> 
>>> Folks,
>>> 
>>> Thanks for showing interest in this issue!
>>> 
>>> Alexey,
>>> 
 I think removing fsync could help to mitigate performance issues with
>>> current implementation
>>> 
>>> Is my understanding correct that if we remove fsync, then discovery won’t
>>> be blocked, data will be flushed to disk in the background, and loss of
>>> information will be possible only on an OS failure? It sounds like an
>>> acceptable workaround to me.
>>> 
>>> Will moving metadata to the metastore actually resolve this issue? Please
>>> correct me if I’m wrong, but we will still need to write the information to
>>> the WAL before releasing the discovery thread. If the WAL mode is FSYNC, then
>>> the issue will still be there. Or is it planned to abandon the discovery-based
>>> protocol altogether?
>>> 
>>> Evgeniy, Ivan,
>>> 
>>> In my particular case the data wasn’t too big. It was a slow virtualised
>>> disk with encryption that made operations slow. Given that there are 200
>>> nodes in the cluster, each of which writes slowly, and this process is
>>> sequential, a single piece of metadata is registered extremely slowly.
>>> 
>>> Ivan, answering your other questions:
>>> 
 2. Do we need persistent metadata for in-memory caches? Or is it so
 by accident?
>>> 
>>> It should be checked whether it’s safe to stop writing marshaller mappings to
>>> disk without losing any guarantees.
>>> But anyway, I would like to have a property that would control this. If
>>> metadata registration is slow, then the initial cluster warmup may take a
>>> while. So, if we preserve metadata on disk, we will need to warm it up
>>> only once, and further restarts won’t be affected.
>>> 
 Do we really need a fast fix here?
>>> 
>>> I would like a fix that could be implemented now, since the activity of
>>> moving metadata to the metastore doesn’t sound like a quick one. Having a
>>> temporary solution would be nice.
>>> 
>>> Denis
>>> 
 On 14 Aug 2019, at 11:53, Павлухин Иван < vololo...@gmail.com > wrote:
 
 Denis,
 
 Several clarifying questions:
 1. Do you have an idea why metadata registration takes so long? Such
 poor disks? So much data to write? Contention with disk writes by
 other subsystems?
 2. Do we need persistent metadata for in-memory caches? Or is it so
 by accident?
 
 Generally, I think that it is possible to move metadata-saving
 operations out of the discovery thread without losing the required
 consistency/integrity.
 
 As Alex mentioned, using the metastore looks like a better solution. Do we
 really need a fast fix here? (Are we talking about a fast fix?)
 
 Wed, Aug 14, 2019 at 11:45, Zhenya Stanilovsky
>>> < arzamas...@mail.ru.invalid >:
> 
> Alexey, but in this case the customer needs to be informed that a whole-cluster
> crash (power off, for example of a 1-node cluster) could lead to partial data
> unavailability, and maybe to further index corruption.
> 1. Why does your meta take up a substantial size? Maybe a context leak?
> 2. Could the meta be compressed?
> 
> 
>> Wednesday, August 14, 2019, 11:22 +03:00 from Alexei Scherbakov <
>>>  alexey.scherbak...@gmail.com >:
>> 
>> Denis Mekhanikov,
>> 
>> Currently metadata is fsync'ed on write. This might be the cause of
>> slow-downs in the case of metadata burst writes.
>> I think removing fsync could help to mitigate performance issues with the
>> current implementation until a proper solution is implemented: moving
>> metadata to the metastore.
>> 
>> 
>> Tue, Aug 13, 2019 at 17:09, Denis Mekhanikov <  dmekhani...@gmail.com
 :
>> 
>>> I would also like to mention that marshaller mappings are written to disk
>>> even if persistence is disabled.
>>> So, this issue affects purely in-memory clusters as well.
>>> 
>>> Denis
>>> 
 On 13 Aug 2019, at 17:06, Denis Mekhanikov <  dmekhani...@gmail.com >
>>> wrote:
 
 Hi!
 
 When persistence is enabled, binary metadata is written to disk upon
>>> registration. Currently this happens in the discovery thread, which makes
>>> processing of related messages very slow.
 There are cases when a lot of nodes and slow disks can make the registration
>>> of every binary type take several minutes. Plus it blocks processing of
>>> other messages.
 

Re: Re[2]: Asynchronous registration of binary metadata

2019-08-14 Thread Павлухин Иван
Denis,

Several clarifying questions:
1. Do you have an idea why metadata registration takes so long? Such
poor disks? So much data to write? Contention with disk writes by
other subsystems?
2. Do we need persistent metadata for in-memory caches? Or is it so
by accident?

Generally, I think that it is possible to move metadata-saving
operations out of the discovery thread without losing the required
consistency/integrity.

As Alex mentioned, using the metastore looks like a better solution. Do we
really need a fast fix here? (Are we talking about a fast fix?)

Wed, Aug 14, 2019 at 11:45, Zhenya Stanilovsky :
>
> Alexey, but in this case the customer needs to be informed that a whole-cluster
> crash (power off, for example of a 1-node cluster) could lead to partial data
> unavailability, and maybe to further index corruption.
> 1. Why does your meta take up a substantial size? Maybe a context leak?
> 2. Could the meta be compressed?
>
>
> >Wednesday, August 14, 2019, 11:22 +03:00 from Alexei Scherbakov 
> >:
> >
> >Denis Mekhanikov,
> >
> >Currently metadata is fsync'ed on write. This might be the cause of
> >slow-downs in the case of metadata burst writes.
> >I think removing fsync could help to mitigate performance issues with the
> >current implementation until a proper solution is implemented: moving
> >metadata to the metastore.
> >
> >
> >Tue, Aug 13, 2019 at 17:09, Denis Mekhanikov < dmekhani...@gmail.com >:
> >
> >> I would also like to mention that marshaller mappings are written to disk
> >> even if persistence is disabled.
> >> So, this issue affects purely in-memory clusters as well.
> >>
> >> Denis
> >>
> >> > On 13 Aug 2019, at 17:06, Denis Mekhanikov < dmekhani...@gmail.com >
> >> wrote:
> >> >
> >> > Hi!
> >> >
> >> > When persistence is enabled, binary metadata is written to disk upon
> >> registration. Currently this happens in the discovery thread, which makes
> >> processing of related messages very slow.
> >> > There are cases when a lot of nodes and slow disks can make the registration
> >> of every binary type take several minutes. Plus it blocks processing of
> >> other messages.
> >> >
> >> > I propose starting a separate thread that will be responsible for
> >> writing binary metadata to disk. So, binary type registration will be
> >> considered finished before the information about it is written to disk on
> >> all nodes.
> >> >
> >> > The main concern here is data consistency in cases when a node
> >> acknowledges type registration and then fails before writing the metadata
> >> to disk.
> >> > I see two parts of this issue:
> >> > 1. Nodes will have different metadata after restarting.
> >> > 2. If we write some data into a persisted cache and shut down nodes faster
> >> than a new binary type is written to disk, then after a restart we won’t
> >> have a binary type to work with.
> >> >
> >> > The first case is similar to a situation when one node fails, and after
> >> that a new type is registered in the cluster. This issue is resolved by the
> >> discovery data exchange. All nodes receive information about all binary
> >> types in the initial discovery messages sent by other nodes. So, once you
> >> restart a node, it will receive the information that it failed to finish
> >> writing to disk from the other nodes.
> >> > If all nodes shut down before finishing writing the metadata to disk,
> >> then after a restart the type will be considered unregistered, so another
> >> registration will be required.
> >> >
> >> > The second case is a bit more complicated. But it can be resolved by
> >> making the discovery threads on every node create a future that will be
> >> completed when writing to disk is finished. So, every node will have such a
> >> future, which will reflect the current state of persisting the metadata to
> >> disk.
> >> > After that, if some operation needs this binary type, it will need to
> >> wait on that future until flushing to disk is finished.
> >> > This way the discovery threads won’t be blocked, but other threads that
> >> actually need this type will be.
> >> >
> >> > Please let me know what you think about that.
> >> >
> >> > Denis
> >>
> >>
> >
> >--
> >
> >Best regards,
> >Alexei Scherbakov
>
>
> --
> Zhenya Stanilovsky



-- 
Best regards,
Ivan Pavlukhin


Re[2]: Asynchronous registration of binary metadata

2019-08-14 Thread Zhenya Stanilovsky
Alexey, but in this case the customer needs to be informed that a whole-cluster
crash (power off, for example of a 1-node cluster) could lead to partial data
unavailability, and maybe to further index corruption.
1. Why does your meta take up a substantial size? Maybe a context leak?
2. Could the meta be compressed?


>Wednesday, August 14, 2019, 11:22 +03:00 from Alexei Scherbakov 
>:
>
>Denis Mekhanikov,
>
>Currently metadata is fsync'ed on write. This might be the cause of
>slow-downs in the case of metadata burst writes.
>I think removing fsync could help to mitigate performance issues with the
>current implementation until a proper solution is implemented: moving
>metadata to the metastore.
>
>
>Tue, Aug 13, 2019 at 17:09, Denis Mekhanikov < dmekhani...@gmail.com >:
>
>> I would also like to mention that marshaller mappings are written to disk
>> even if persistence is disabled.
>> So, this issue affects purely in-memory clusters as well.
>>
>> Denis
>>
>> > On 13 Aug 2019, at 17:06, Denis Mekhanikov < dmekhani...@gmail.com >
>> wrote:
>> >
>> > Hi!
>> >
>> > When persistence is enabled, binary metadata is written to disk upon
>> registration. Currently this happens in the discovery thread, which makes
>> processing of related messages very slow.
>> > There are cases when a lot of nodes and slow disks can make the registration
>> of every binary type take several minutes. Plus it blocks processing of
>> other messages.
>> >
>> > I propose starting a separate thread that will be responsible for
>> writing binary metadata to disk. So, binary type registration will be
>> considered finished before the information about it is written to disk on
>> all nodes.
>> >
>> > The main concern here is data consistency in cases when a node
>> acknowledges type registration and then fails before writing the metadata
>> to disk.
>> > I see two parts of this issue:
>> > 1. Nodes will have different metadata after restarting.
>> > 2. If we write some data into a persisted cache and shut down nodes faster
>> than a new binary type is written to disk, then after a restart we won’t
>> have a binary type to work with.
>> >
>> > The first case is similar to a situation when one node fails, and after
>> that a new type is registered in the cluster. This issue is resolved by the
>> discovery data exchange. All nodes receive information about all binary
>> types in the initial discovery messages sent by other nodes. So, once you
>> restart a node, it will receive the information that it failed to finish
>> writing to disk from the other nodes.
>> > If all nodes shut down before finishing writing the metadata to disk,
>> then after a restart the type will be considered unregistered, so another
>> registration will be required.
>> >
>> > The second case is a bit more complicated. But it can be resolved by
>> making the discovery threads on every node create a future that will be
>> completed when writing to disk is finished. So, every node will have such a
>> future, which will reflect the current state of persisting the metadata to
>> disk.
>> > After that, if some operation needs this binary type, it will need to
>> wait on that future until flushing to disk is finished.
>> > This way the discovery threads won’t be blocked, but other threads that
>> actually need this type will be.
>> >
>> > Please let me know what you think about that.
>> >
>> > Denis
>>
>>
>
>-- 
>
>Best regards,
>Alexei Scherbakov


-- 
Zhenya Stanilovsky