Hi Pranith,
i recently upgraded to version 3.12.14, still no change in
load/performance. Have you received any feedback?
At the moment i have 3 options:
- problem can be fixed within version 3.12
- upgrade to 4.1 and magically/hopefully "fix" the problem (might not
help when problem is within
On Fri, Aug 31, 2018 at 1:18 PM Hu Bert wrote:
> Hi Pranith,
>
> i just wanted to ask if you were able to get any feedback from your
> colleagues :-)
>
Sorry, I didn't get a chance to. I am working on a customer issue which is
taking away cycles from any other work. Let me get back to you once
Hi Pranith,
i just wanted to ask if you were able to get any feedback from your
colleagues :-)
btw.: we migrated some stuff (static resources, small files) to an nfs
server that we actually wanted to replace with glusterfs. Load and cpu
usage has gone down a bit, but still is asymmetric on the 3
Hm, i noticed that in the shared.log (volume log file) on gluster11
and gluster12 (but not on gluster13) i now see these warnings:
[2018-08-28 07:18:57.224367] W [MSGID: 109011]
[dht-layout.c:186:dht_layout_search] 0-shared-dht: no subvolume for
hash (value) = 3054593291
[2018-08-28
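To see how often such warnings show up per server, something like the following could count them per MSGID. This is only a sketch: it assumes every log entry sits on one line in the actual file (the lines above are wrapped by the mail client) and starts with a bracketed timestamp, the level letter and `[MSGID: n]`. `count_msgids` is a made-up helper name, not a gluster tool.

```python
import re
from collections import Counter

# Pattern modelled on the warning quoted above:
# "[timestamp] W [MSGID: 109011] [file:line:function] ..."
ENTRY = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+(?P<level>[A-Z])\s+\[MSGID:\s*(?P<msgid>\d+)\]"
)

def count_msgids(lines, level="W"):
    """Count log entries of the given level, grouped by MSGID."""
    counts = Counter()
    for line in lines:
        m = ENTRY.match(line)
        if m and m.group("level") == level:
            counts[m.group("msgid")] += 1
    return counts

# The warning from above, joined back into a single line.
sample = [
    "[2018-08-28 07:18:57.224367] W [MSGID: 109011] "
    "[dht-layout.c:186:dht_layout_search] 0-shared-dht: no subvolume for "
    "hash (value) = 3054593291",
]
print(count_msgids(sample))
```

Running it over shared.log on each of the three servers would show whether the warnings on gluster11 and gluster12 are occasional or constant.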
a little update after about 2 hours of uptime: still/again high cpu
usage by one of the brick processes. server load >30.
gluster11: high cpu; brick /gluster/bricksdd1/; no hdd exchange so far
gluster12: normal cpu; brick /gluster/bricksdd1_new/; hdd change /dev/sdd
gluster13: high cpu; brick
Good Morning,
today i updated + rebooted all gluster servers, kernel update to
4.9.0-8 and gluster to 3.12.13. Reboots went fine, but on one of the
gluster servers (gluster13) one of the bricks did come up at the
beginning but then lost connection.
OK:
Status of volume: shared
Gluster process
yeah, on debian xyz.log.1 is always the former logfile which has been
rotated by logrotate. Just checked the 3 servers: now it looks good, i
will check it again tomorrow. very strange, maybe logrotate hasn't
worked properly.
the performance problems remain :-)
2018-08-27 15:41 GMT+02:00 Milind
On Thu, Aug 23, 2018 at 5:28 PM, Pranith Kumar Karampuri <
pkara...@redhat.com> wrote:
> On Wed, Aug 22, 2018 at 12:01 PM Hu Bert wrote:
>
>> Just an addition: in general there are no log messages in
>> /var/log/glusterfs/ (if you don't call 'gluster volume ...'), but on
>> the node with the
On Wed, Aug 22, 2018 at 12:01 PM Hu Bert wrote:
> Just an addition: in general there are no log messages in
> /var/log/glusterfs/ (if you don't call 'gluster volume ...'), but on
> the node with the lowest load i see in cli.log.1:
>
> [2018-08-22 06:20:43.291055] I
Just an addition: in general there are no log messages in
/var/log/glusterfs/ (if you don't call 'gluster volume ...'), but on
the node with the lowest load i see in cli.log.1:
[2018-08-22 06:20:43.291055] I [socket.c:2474:socket_event_handler]
0-transport: EPOLLERR - disconnecting now
[2018-08-22
On Tue, Aug 21, 2018 at 11:40 AM Hu Bert wrote:
> Good morning :-)
>
> gluster11:
> ls -l /gluster/bricksdd1/shared/.glusterfs/indices/xattrop/
> total 0
> -- 1 root root 0 Aug 14 06:14
> xattrop-006b65d8-9e81-4886-b380-89168ea079bd
>
> gluster12:
> ls -l
Good morning :-)
gluster11:
ls -l /gluster/bricksdd1/shared/.glusterfs/indices/xattrop/
total 0
-- 1 root root 0 Aug 14 06:14
xattrop-006b65d8-9e81-4886-b380-89168ea079bd
gluster12:
ls -l /gluster/bricksdd1_new/shared/.glusterfs/indices/xattrop/
total 0
-- 1 root root 0 Jul 17
On Tue, Aug 21, 2018 at 10:13 AM Pranith Kumar Karampuri <
pkara...@redhat.com> wrote:
>
>
> On Mon, Aug 20, 2018 at 3:20 PM Hu Bert wrote:
>
>> Regarding hardware the machines are identical. Intel Xeon E5-1650 v3
>> Hexa-Core; 64 GB DDR4 ECC; Dell PERC H330 8 Port SAS/SATA 12 GBit/s
>> RAID
On Mon, Aug 20, 2018 at 3:20 PM Hu Bert wrote:
> Regarding hardware the machines are identical. Intel Xeon E5-1650 v3
> Hexa-Core; 64 GB DDR4 ECC; Dell PERC H330 8 Port SAS/SATA 12 GBit/s
> RAID Controller; operating system running on a raid1, then 4 disks
> (JBOD) as bricks.
>
> Ok, i ran perf
Regarding hardware the machines are identical. Intel Xeon E5-1650 v3
Hexa-Core; 64 GB DDR4 ECC; Dell PERC H330 8 Port SAS/SATA 12 GBit/s
RAID Controller; operating system running on a raid1, then 4 disks
(JBOD) as bricks.
Ok, i ran perf for a few seconds.
perf record
gluster volume heal shared info | grep -i number
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries: 0
Number of entries:
There are a lot of Lookup operations in the system. But I am not able to
find why. Could you check the output of
# gluster volume heal <volname> info | grep -i number
it should print all zeros.
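If this check is to be repeated regularly, the output could also be evaluated by a small script. A minimal sketch, assuming the output contains one `Number of entries: N` line per brick as in the reply in this thread; the function names are invented for illustration.

```python
import re

def pending_heal_entries(heal_info_output):
    """Return the 'Number of entries' value reported for each brick."""
    return [
        int(n)
        for n in re.findall(
            r"^Number of entries:[ \t]*(\d+)[ \t]*$",
            heal_info_output,
            flags=re.MULTILINE,
        )
    ]

def heal_is_clean(heal_info_output):
    """True only if at least one brick reported and every count is zero."""
    counts = pending_heal_entries(heal_info_output)
    return bool(counts) and all(c == 0 for c in counts)

sample = "Number of entries: 0\nNumber of entries: 0\nNumber of entries: 0\n"
print(pending_heal_entries(sample), heal_is_clean(sample))
```

Lines without a number are ignored, and an empty result is treated as "not clean" so that a failed command is not mistaken for a healthy volume.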
On Fri, Aug 17, 2018 at 1:49 PM Hu Bert wrote:
> I don't know what exactly you mean by workload, but the
I don't know what exactly you mean by workload, but the main
function of the volume is storing (incl. writing, reading) images
(from hundreds of bytes up to 30 MB, overall ~7TB). The work is done
by apache tomcat servers writing to / reading from the volume. Besides
images there are some text
There seem to be too many lookup operations compared to any other
operations. What is the workload on the volume?
On Fri, Aug 17, 2018 at 12:47 PM Hu Bert wrote:
> i hope i did get it right.
>
> gluster volume profile shared start
> wait 10 minutes
> gluster volume profile shared info
>
i hope i did get it right.
gluster volume profile shared start
wait 10 minutes
gluster volume profile shared info
gluster volume profile shared stop
If that's ok, i've attached the output of the info command.
2018-08-17 8:31 GMT+02:00 Pranith Kumar Karampuri :
> Please do volume profile also
Please do volume profile also for around 10 minutes when CPU% is high.
On Fri, Aug 17, 2018 at 11:56 AM Pranith Kumar Karampuri <
pkara...@redhat.com> wrote:
> As per the output, all io-threads are using a lot of CPU. It is better to
> check what the volume profile is to see what is leading to
As per the output, all io-threads are using a lot of CPU. It is better to
check what the volume profile is to see what is leading to so much work for
io-threads. Please follow the documentation at
https://gluster.readthedocs.io/en/latest/Administrator%20Guide/Monitoring%20Workload/
section:
Could you do the following on one of the nodes where you are observing high
CPU usage and attach that file to this thread? We can find what
threads/processes are leading to high usage. Do this for say 10 minutes
when you see the ~100% CPU.
top -bHd 5 > /tmp/top.${HOSTNAME}.txt
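The resulting file gets long quickly, so a rough summary per thread name may help when reading it. The sketch below assumes the default column layout of `top` in batch mode, i.e. a header line containing `%CPU` and `COMMAND`, with `COMMAND` as the last column. The sample rows and the helper name are invented for illustration and are not taken from the servers in this thread.

```python
from collections import defaultdict

def cpu_by_thread(top_output):
    """Sum %CPU per thread name across all iterations of `top -bH` output."""
    totals = defaultdict(float)
    cpu_idx = cmd_idx = None
    for line in top_output.splitlines():
        fields = line.split()
        if "%CPU" in fields and "COMMAND" in fields:
            # Column header: remember where %CPU and COMMAND are.
            cpu_idx = fields.index("%CPU")
            cmd_idx = fields.index("COMMAND")
            continue
        # Skip summary lines, blank lines and anything before the first header.
        if cpu_idx is None or len(fields) <= cmd_idx or not fields[0].isdigit():
            continue
        try:
            # Some locales print a decimal comma.
            cpu = float(fields[cpu_idx].replace(",", "."))
        except ValueError:
            continue
        totals[" ".join(fields[cmd_idx:])] += cpu
    return dict(totals)

# Invented sample in the usual top layout.
sample = """\
top - 08:00:00 up 2:03,  1 user,  load average: 30.12, 28.40, 25.11
Threads: 500 total,   8 running, 492 sleeping,   0 stopped,   0 zombie

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
 1234 root      20   0 2000000 100000   4000 R 95.0  0.2   1:00.00 glusterfsd
 1235 root      20   0 2000000 100000   4000 R 90.5  0.2   1:00.00 glusterfsd
"""
print(cpu_by_thread(sample))
```

The totals are summed over all iterations in the file, so they are only meaningful relative to each other.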
On Wed, Aug 15,
Hi,
well, as the situation doesn't get better, we're quite helpless and
mostly in the dark, so we're thinking about hiring some professional
support. Any hint? :-)
2018-08-15 11:07 GMT+02:00 Hu Bert :
> Hello again :-)
>
> The self heal must have finished as there are no log entries in
>
Hello again :-)
The self heal must have finished as there are no log entries in
glustershd.log files anymore. According to munin disk latency (average
io wait) has gone down to 100 ms, and disk utilization has gone down
to ~60% - both on all servers and hard disks.
But now system load on 2
Hi there,
well, it seems the heal has finally finished. Couldn't see/find any
related log message; is there such a message in a specific log file?
But i see the same behaviour as when the last heal finished: all CPU
cores are consumed by brick processes; not only by the formerly failed
bricksdd1,
Hello :-) Just wanted to give a short report...
>> It could be saturating in the day. But if enough self-heals are going on,
>> even in the night it should have been close to 100%.
>
> Lowest utilization was 70% over night, but i'll check this
> evening/weekend. Also that 'stat...' is running.
>> Btw.: i've seen in the munin stats that the disk utilization for
>> bricksdd1 on the healthy gluster servers is between 70% (night) and
>> almost 99% (daytime). So it looks like the basic problem is the
>> disk which seems not to be able to work faster? If so (heal)
>> performance won't
On Fri, Jul 27, 2018 at 1:32 PM, Hu Bert wrote:
> 2018-07-27 9:22 GMT+02:00 Pranith Kumar Karampuri :
> >
> >
> > On Fri, Jul 27, 2018 at 12:36 PM, Hu Bert
> wrote:
> >>
> >> 2018-07-27 8:52 GMT+02:00 Pranith Kumar Karampuri :
> >> >
> >> >
> >> > On Fri, Jul 27, 2018 at 11:53 AM, Hu Bert
>
2018-07-27 9:22 GMT+02:00 Pranith Kumar Karampuri :
>
>
> On Fri, Jul 27, 2018 at 12:36 PM, Hu Bert wrote:
>>
>> 2018-07-27 8:52 GMT+02:00 Pranith Kumar Karampuri :
>> >
>> >
>> > On Fri, Jul 27, 2018 at 11:53 AM, Hu Bert
>> > wrote:
>> >>
>> >> > Do you already have all the 19 directories
On Fri, Jul 27, 2018 at 12:36 PM, Hu Bert wrote:
> 2018-07-27 8:52 GMT+02:00 Pranith Kumar Karampuri :
> >
> >
> > On Fri, Jul 27, 2018 at 11:53 AM, Hu Bert
> wrote:
> >>
> >> > Do you already have all the 19 directories created? If not
> >> > could you find out which of the paths
2018-07-27 8:52 GMT+02:00 Pranith Kumar Karampuri :
>
>
> On Fri, Jul 27, 2018 at 11:53 AM, Hu Bert wrote:
>>
>> > Do you already have all the 19 directories created? If not
>> > could you find out which of the paths need it and do a stat directly
>> > instead
>> > of find?
>>
>>
On Fri, Jul 27, 2018 at 11:53 AM, Hu Bert wrote:
> > Do you already have all the 19 directories created? If not
> > could you find out which of the paths need it and do a stat directly
> > instead of find?
>
> Quite probably not all of them have been created (but counting how
> many
> Do you already have all the 19 directories created? If not could
> you find out which of the paths need it and do a stat directly instead of
> find?
Quite probably not all of them have been created (but counting how
many would take very long...). Hm, maybe running stat in a double
On Fri, Jul 27, 2018 at 11:11 AM, Hu Bert wrote:
> Good Morning :-)
>
> on server gluster11 about 1.25 million and on gluster13 about 1.35
> million log entries in glustershd.log file. About 70 GB got healed,
> overall ~700GB of 2.0TB. Doesn't seem to run faster. I'm calling
> 'find...' whenever
Good Morning :-)
on server gluster11 about 1.25 million and on gluster13 about 1.35
million log entries in glustershd.log file. About 70 GB got healed,
overall ~700GB of 2.0TB. Doesn't seem to run faster. I'm calling
'find...' whenever i notice that it has finished. Hmm... is it
possible and
On Thu, Jul 26, 2018 at 2:41 PM, Hu Bert wrote:
> > Sorry, bad copy/paste :-(.
>
> np :-)
>
> The question regarding version 4.1 was meant more generally: does
> gluster v4.0 etc. have a better performance than version 3.12 etc.?
> Just curious :-) Sooner or later we have to upgrade anyway.
> Sorry, bad copy/paste :-(.
np :-)
The question regarding version 4.1 was meant more generally: does
gluster v4.0 etc. have a better performance than version 3.12 etc.?
Just curious :-) Sooner or later we have to upgrade anyway.
btw.: gluster12 was the node with the failed brick, and i started
On Thu, Jul 26, 2018 at 12:59 PM, Hu Bert wrote:
> Hi Pranith,
>
> thanks a lot for your efforts and for tracking "my" problem with an issue.
> :-)
>
> I've set these params on the gluster volume and will start the
> 'find...' command within a short time. I'll probably add another
> answer to
Hi Pranith,
thanks a lot for your efforts and for tracking "my" problem with an issue. :-)
I've set these params on the gluster volume and will start the
'find...' command within a short time. I'll probably add another
answer to the list to document the progress.
btw. - you had some typos:
Thanks a lot for the detailed write-up, this helps find the bottlenecks easily.
On a high level, to handle this directory hierarchy i.e. lots of
directories with files, we need to improve healing
algorithms. Based on the data you provided, we need to make the following
enhancements:
1) At the moment
Hi Pranith,
Sry, it took a while to count the directories. I'll try to answer your
questions as well as possible.
> What kind of data do you have?
> How many directories in the filesystem?
> On average how many files per directory?
> What is the depth of your directory hierarchy on average?
>
On Mon, Jul 23, 2018 at 4:16 PM, Hu Bert wrote:
> Well, over the weekend about 200GB were copied, so now there are
> ~400GB copied to the brick. That's far below a speed of 10GB per
> hour. If I copied the 1.6 TB directly, that would be done within max 2
> days. But with the self heal this will
Well, over the weekend about 200GB were copied, so now there are
~400GB copied to the brick. That's far below a speed of 10GB per
hour. If I copied the 1.6 TB directly, that would be done within max 2
days. But with the self heal this will take at least 20 days.
Why is the performance
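The estimate can be reproduced with a few lines. This is a back-of-the-envelope sketch only: it assumes the heal rate stays constant and takes "the weekend" as roughly 72 hours, which is an assumption and not a measured value. The function name is made up.

```python
def remaining_heal_days(total_gb, healed_gb, recent_gb, recent_hours):
    """Estimate days left, assuming the recently observed rate stays constant."""
    rate = recent_gb / recent_hours  # GB per hour
    return (total_gb - healed_gb) / rate / 24.0

# Figures from the mail: 1.6 TB in total, ~400 GB done, ~200 GB over the weekend.
# Assumption: the weekend is counted as 72 hours.
days = remaining_heal_days(1600, 400, 200, 72)
print(f"about {days:.0f} days left at {200 / 72:.1f} GB/h")
```

Under these assumptions the remaining time is in the same range as the estimate above, and the rate is well under the 10GB per hour mentioned.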
hmm... no one any idea?
Additional question: the hdd on server gluster12 was changed, so far
~220 GB were copied. On the other 2 servers i see a lot of entries in
glustershd.log, about 312,000 and 336,000 entries respectively as of
yesterday, most of them (current log output) looking like this: