Re: [MAKEDUMPFILE PATCH] Add option to estimate the size of vmcore dump files

2020-10-29 Thread lijiang
On 2020/10/28 16:32, HAGIO KAZUHITO (萩尾 一仁) wrote:
> Hi Julien,
> 
> sorry for my delayed reply.
> 
>>> -----Original Message-----
>>> A user might want to know how much space a vmcore file will take on
>>> the system and how much space on their disk should be available to
>>> save it during a crash.
>>>
>>> The option --vmcore-size does not create the vmcore file but provides
>>> an estimation of the size of the final vmcore file created with the
>>> same makedumpfile options.
>>>
>>> Interesting.  Do you have any actual use case?  e.g. used by kdumpctl,
>>> or used in the kdump initramfs?
>>>
>>
>> Yes, the idea would be to use this in mkdumprd to have a more accurate
>> estimate of the dump size (currently it cannot take compression into
>> account and warns about potential lack of space, considering the system
>> memory size as a whole).
> 
> Hmm, I'm not sure how you are going to implement it in mkdumprd, but I do not
> recommend that you use it to determine how much disk space should be
> allocated for crash dump.  Because, I think that
> 
> - It cannot estimate the dump size when a real crash occurs, e.g. if slab
> explodes with non-zero data, almost all memory will be captured by
> makedumpfile

I agree with you, but wouldn't such cases be rare? If so, I'm not sure it is
worth thinking more about these rare situations.

> even with -d 31, and compression ratio varies with data in memory.

Indeed.

> Also, in most cases, mkdumprd runs at boot time or at image construction
> time, when memory usage is low, not while the usual applications are
> running.  So it can easily underestimate the needed size.
> 
> 
If an administrator can monitor the estimated size periodically, maybe it
won't be a problem?
 
> - The system might need a full vmcore and need to change makedumpfile's
> dump level for an issue in the future.  But many systems cannot change
> their disk space allocation easily.  So we should not encourage users to
> allocate only the minimum disk space for crash dumps.
> 
> So, the following is from mkdumprd on Fedora 32, personally I think this
> is good for now.
> 
> if [ $avail -lt $memtotal ]; then
>     echo "Warning: There might not be enough space to save a vmcore."
>     echo "         The size of $2 should be greater than $memtotal kilo bytes."
> fi
> 
Currently, some users are complaining that mkdumprd overestimates the needed
size, and most vmcores are significantly smaller than the size of system memory.

Furthermore, in most cases, the system memory will not be completely exhausted,
but that still depends on how the memory is used in the system, for example:
[1] running a stress test against memory
[2] a workload that always occupies a large amount of memory and never releases it.

The above two cases may be rare. Therefore, can we find a compromise between
the size of the vmcore and the size of system memory so that makedumpfile can
estimate the size of the vmcore more accurately?

And finally, mkdumprd can use the estimated size of the vmcore instead of the
system memory size (memtotal) to determine if the target disk has enough space
to store the vmcore.


Thanks.
Lianbo

> The patch's functionality itself might be useful and I don't reject it, though.
> 
>>> @@ -4643,6 +4706,8 @@ write_buffer(int fd, off_t offset, void *buf, size_t buf_size, char *file_name)
>>>   }
>>>   if (!write_and_check_space(fd, &fdh, sizeof(fdh), file_name))
>>>   return FALSE;
>>> +   } else if (info->flag_vmcore_size && fd == info->fd_dumpfile) {
>>> +   return write_buffer_update_size_info(offset, buf, buf_size);
>>>
>>> Why do we need this function?  makedumpfile actually writes zero-filled
>>> pages to the dumpfile with -d 0, and doesn't write them with -d 1.
>>> So isn't "write_bytes += buf_size" enough?  For example, with -d 30,
>>>
>>
>> The reason I went with this method was to make an estimate of the number
>> of blocks actually allocated on the disk (since depending on how the
>> data written is scattered in the file, there might be a significant
>> difference between bytes written vs actual size allocated on disk). But
>> I realize that there is some misunderstanding from my end, since writing
>> zeros does allocate blocks, as opposed to not writing at some offset
>> (skipping it with lseek()); I would need to fix that.
>>
>> To highlight the behaviour I'm talking about:
>> $ dd if=/dev/zero of=./testfile bs=4096 count=1 seek=1
>> 1+0 records in
>> 1+0 records out
>> 4096 bytes (4.1 kB, 4.0 KiB) copied, 0.000302719 s, 13.5 MB/s
>> $ du -h testfile
>> 4.0K testfile
>>
>> $ dd if=/dev/zero of=./testfile bs=4096 count=2
>> 2+0 records in
>> 2+0 records out
>> 8192 bytes (8.2 kB, 8.0 KiB) copied, 0.000373002 s, 22.0 MB/s
>> $ du -h testfile
>> 8.0K testfile
>>
>>
>> So, do you think it's not worth bothering estimating the number of
>> blocks allocated, and that I should only consider the number of bytes written?
> 
> Yes, makedumpfile 
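
As an aside, the sparse-file behaviour shown with dd above can also be
demonstrated in C (a minimal sketch, assuming Linux stat(2) semantics where
st_blocks counts 512-byte units; "testfile" is just an illustrative name):

/* Write one 4 KiB page after a 4 KiB hole, then compare the apparent
 * size (st_size) with the space actually allocated on disk. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
	char page[4096] = { 0 };
	struct stat st;
	int fd = open("testfile", O_CREAT | O_TRUNC | O_WRONLY, 0644);

	if (fd < 0)
		return 1;
	lseek(fd, 4096, SEEK_SET);	/* leave a hole: no blocks allocated */
	if (write(fd, page, sizeof(page)) != (ssize_t)sizeof(page))
		return 1;		/* writing (even zeros) allocates blocks */
	fstat(fd, &st);
	printf("st_size=%lld bytes, allocated=%lld bytes\n",
	       (long long)st.st_size, (long long)st.st_blocks * 512);
	close(fd);
	return 0;
}

On a filesystem with sparse-file support this prints st_size=8192 but
allocated=4096, which is exactly the gap between bytes written and blocks
allocated discussed above.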

Re: [PATCH 0/3] warn and suppress irqflood

2020-10-29 Thread Pingfan Liu
On Wed, Oct 28, 2020 at 12:58:41PM +0100, Thomas Gleixner wrote:
> On Wed, Oct 28 2020 at 14:02, Pingfan Liu wrote:
> > On Tue, Oct 27, 2020 at 3:59 AM Thomas Gleixner  wrote:
> >> Also Liu's patch only works if:
> >>
> >>   1) CONFIG_IRQ_TIME_ACCOUNTING is enabled
> >
> > I wonder whether it can be made a default option by the following method:
> >   DEFINE_STATIC_KEY_FALSE(irqtime_account), and enable it according to
> > a boot param.
> 
> How so?
> 
>   config IRQ_TIME_ACCOUNTING
>   depends on HAVE_IRQ_TIME_ACCOUNTING && !VIRT_CPU_ACCOUNTING_NATIVE
> 
Look closely at the two config values:
-1. HAVE_IRQ_TIME_ACCOUNTING: it is selected by most of the popular arches, and
can be further relaxed.
   It implies that sched_clock() is fast enough for sampling. With the current
code, the variable sched_clock_irqtime=0 can be used to turn off irqtime
accounting on arches with a slow sched_clock(). And it can be done even better
by using DEFINE_STATIC_KEY_FALSE(sched_clock_irqtime).
   So the prerequisite can be relaxed to "depends on !VIRT_CPU_ACCOUNTING_NATIVE".
In case I cannot express it clearly, could you have a look at the demo patch below?

   That patch _assumes_ that irqtime accounting is costly and is not turned on
by default. If turned on, it will cost one extra jmp compared with the current
implementation. And I think whether this assumption is reasonable is critical
to my [1/3].

-2. As for VIRT_CPU_ACCOUNTING_NATIVE, it can only be selected by powerpc and ia64.

In fact, I have a separate patch for powerpc with
CONFIG_VIRT_CPU_ACCOUNTING_NATIVE to utilize my [1/3].

---
diff --git a/init/Kconfig b/init/Kconfig
index c944691..16d168b 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -490,7 +490,7 @@ endchoice
 
 config IRQ_TIME_ACCOUNTING
bool "Fine granularity task level IRQ time accounting"
-   depends on HAVE_IRQ_TIME_ACCOUNTING && !VIRT_CPU_ACCOUNTING_NATIVE
+   depends on !VIRT_CPU_ACCOUNTING_NATIVE
help
  Select this option to enable fine granularity task irq time
  accounting. This is done by reading a timestamp on each
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 5a55d23..3ab7e1d 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -19,7 +19,7 @@
  */
 DEFINE_PER_CPU(struct irqtime, cpu_irqtime);
 
-static int sched_clock_irqtime;
+DEFINE_STATIC_KEY_FALSE(sched_clock_irqtime);
 
 void enable_sched_clock_irqtime(void)
 {
@@ -49,13 +49,14 @@ static void irqtime_account_delta(struct irqtime *irqtime, u64 delta,
  */
 void irqtime_account_irq(struct task_struct *curr)
 {
-   struct irqtime *irqtime = this_cpu_ptr(&cpu_irqtime);
+   struct irqtime *irqtime;
s64 delta;
int cpu;
 
-   if (!sched_clock_irqtime)
+   if (!static_branch_unlikely(&sched_clock_irqtime))
return;
 
+   irqtime = this_cpu_ptr(&cpu_irqtime);
cpu = smp_processor_id();
delta = sched_clock_cpu(cpu) - irqtime->irq_start_time;
irqtime->irq_start_time += delta;
@@ -84,16 +85,7 @@ static u64 irqtime_tick_accounted(u64 maxtime)
return delta;
 }
 
-#else /* CONFIG_IRQ_TIME_ACCOUNTING */
-
-#define sched_clock_irqtime	(0)
-
-static u64 irqtime_tick_accounted(u64 dummy)
-{
-   return 0;
-}
-
-#endif /* !CONFIG_IRQ_TIME_ACCOUNTING */
+#endif
 
 static inline void task_group_account_field(struct task_struct *p, int index,
u64 tmp)
@@ -475,7 +467,7 @@ void account_process_tick(struct task_struct *p, int user_tick)
if (vtime_accounting_enabled_this_cpu())
return;
 
-   if (sched_clock_irqtime) {
+   if (static_branch_unlikely(&sched_clock_irqtime)) {
irqtime_account_process_tick(p, user_tick, 1);
return;
}
@@ -504,7 +496,7 @@ void account_idle_ticks(unsigned long ticks)
 {
u64 cputime, steal;
 
-   if (sched_clock_irqtime) {
+   if (static_branch_unlikely(&sched_clock_irqtime)) {
irqtime_account_idle_ticks(ticks);
return;
}
-- 
2.7.5
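
Since the body of enable_sched_clock_irqtime() is cut off in the hunk above,
for clarity the conversion would presumably flip the key there, e.g. (a
sketch assuming the standard static-key API, not part of the posted patch):

void enable_sched_clock_irqtime(void)
{
	/* Turn the default-off key on; every static_branch_unlikely()
	 * site above then takes the irqtime-accounting path. */
	static_branch_enable(&sched_clock_irqtime);
}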

[...]
> +
> +static int __init irqstorm_setup(char *arg)
> +{
> + int res = kstrtoul(arg, 0, &irqstorm_limit);
> +
> + if (!res) {
> + pr_info("Interrupt storm detector enabled. Limit=%lu / s\n",
> + irqstorm_limit);
> + }
> + return !!res;
> +}
> +__setup("irqstorm_limit", irqstorm_setup);

This configuration-independent method looks appealing, and I am glad to give it
a try.

But irqstorm_limit may be a hard value to choose. Maybe by the formula:
instructions-per-cpu-per-second / number of instructions on the failed-irq
path? But it is hard to estimate "instructions-per-cpu-per-second".

Thanks,
Pingfan
