Re: Module vs Kernel main performance
I modified my module (m.c). It still receives a buffer from user space through ioctl, but instead of copying data out of the buffer provided by the user, I have allocated (kmalloc) a buffer and I copy from this buffer into another kernel buffer, which is allocated each time the module's ioctl is invoked. copy_from_user is now replaced with memcpy.

I still see the processor stall. This points to the buffer allocated per call as the cause.

Abu
___
Kernelnewbies mailing list
Kernelnewbies@kernelnewbies.org
http://lists.kernelnewbies.org/mailman/listinfo/kernelnewbies
Re: Module vs Kernel main performance
wrote:
> Hi again!

Hi

> How did you call from Kernel module?

In the original code the copied data is DMAed; in the experimental code the data is dropped.
Re: Module vs Kernel main performance
Hi again!

On Tue, May 29, 2012 at 8:50 PM, Abu Rasheda wrote:
> Hi,
>
> I am working on x86_64 arch. Profiled (oprofile) Linux kernel module
> and notice that whole lot of cycles are spent in copy_from_user call.
> I compared same flow from kernel proper and noticed that for more data
> throughput cycles spent in copy_from_user are much less. Kernel
> proper has 1/8 cycles compared to module. (There is a user process
> which keeps sending data, like iperf)
>
> Used perf tool to gather some statistics and found that call from kernel
> proper
>
>    185,719,857,837 cpu-cycles           #    3.318 GHz                  [90.01%]
>     99,886,030,243 instructions         #    0.54  insns per cycle      [95.00%]
>      1,696,072,702 cache-references     #   30.297 M/sec                [94.99%]
>        786,929,244 cache-misses         #   46.397 % of all cache refs  [95.00%]
>     16,867,747,688 branch-instructions  #  301.307 M/sec                [95.03%]
>         86,752,646 branch-misses        #    0.51% of all branches      [95.00%]
>      5,482,768,332 bus-cycles           #   97.938 M/sec                [20.08%]
>       55967.269801 cpu-clock
>       55981.842225 task-clock           #    0.933 CPUs utilized
>
> and call from kernel module
>
>      9,388,787,678 cpu-cycles           #    1.527 GHz                  [89.77%]
>      1,706,203,221 instructions         #    0.18  insns per cycle      [94.59%]
>        551,010,961 cache-references     #   89.588 M/sec                [94.73%]
>        369,632,492 cache-misses         #   67.083 % of all cache refs  [95.18%]
>        291,358,658 branch-instructions  #   47.372 M/sec                [94.68%]
>         10,291,678 branch-misses        #    3.53% of all branches      [95.01%]
>        582,651,999 bus-cycles           #   94.733 M/sec                [20.55%]
>        6112.471585 cpu-clock
>        6150.490210 task-clock           #    0.102 CPUs utilized
>                367 page-faults          #    0.000 M/sec
>                367 minor-faults         #    0.000 M/sec
>                  0 major-faults         #    0.000 M/sec
>             25,770 context-switches     #    0.004 M/sec
>                 23 cpu-migrations       #    0.000 M/sec

How did you call from Kernel module?

> So obviously, CPU is stalling when it is copying data and there are
> more cache misses. My question is, is there a difference calling
> copy_from_user from kernel proper compared to calling from LKM ?

[]'s

--
Peter Senna Tschudin
peter.se...@gmail.com
gpg id: 48274C36
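The derived columns in the perf output quoted above can be recomputed from the raw counters, which makes the comparison easier to check. A minimal sketch in Python, not part of the original thread: the dictionary layout and function names are illustrative assumptions, and only the counter values are taken from the quoted output.

```python
# Raw counters copied from the perf stat output quoted in this message.
# The dict layout and helper names are illustrative, not from the thread.
KERNEL_PROPER = {
    "cpu-cycles": 185_719_857_837,
    "instructions": 99_886_030_243,
    "cache-references": 1_696_072_702,
    "cache-misses": 786_929_244,
}
MODULE = {
    "cpu-cycles": 9_388_787_678,
    "instructions": 1_706_203_221,
    "cache-references": 551_010_961,
    "cache-misses": 369_632_492,
}


def insns_per_cycle(counters):
    """Instructions retired per CPU cycle (perf's 'insns per cycle' column)."""
    return counters["instructions"] / counters["cpu-cycles"]


def cache_miss_percent(counters):
    """Cache misses as a percentage of all cache references."""
    return 100.0 * counters["cache-misses"] / counters["cache-references"]


if __name__ == "__main__":
    for name, counters in (("kernel proper", KERNEL_PROPER), ("module", MODULE)):
        print(f"{name}: {insns_per_cycle(counters):.2f} insns per cycle, "
              f"{cache_miss_percent(counters):.3f} % cache misses")
```

On these numbers the module run retires roughly a third as many instructions per cycle as the kernel-proper run and misses the cache more often, which is the stall described in the question.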
Re: Module vs Kernel main performance
Hello Abu,

On Thu, Jun 7, 2012 at 2:47 PM, Abu Rasheda wrote:
>> Hello Abu,
>>
>> I had to include or an error was issued about
>> "THIS_MODULE".
>
> I am running this tool on Scientific Linux 6.0, which is 2.6.32 kernel. I
> know this is old but this is what I have for my product.
>
>> What Kernel version are you using? I'm trying to compile it and I'm
>> getting the error:
>> [...]
>> According to:
>> http://lxr.linux.no/linux+v3.4.1/include/linux/fs.h#L1609
>>
>> There is no .ioctl at struct file_operations...
>>
>> Can you share how you've used perf/oprofile on your module/Kernel code?
>
> for perf:
>
> perf stat -e [...] ./prog
>
> for oprofile:
>
> # opcontrol --reset
> # opcontrol --vmlinux=/boot/vmlinux.64
> # opcontrol --start
> # ./a.out
> # opcontrol --shutdown
> # opreport -l -p

Thanks! I'll try it now.

I've made changes to your code, so it "probably" will:
- Run on 3.4 Kernel
- Partially meet Kernel coding style (Try to run scripts/checkpatch.pl -f m.c)
- Stop working due to lack of locking at m_ioctl(). I'm working on this now... :-)

See it at:
http://pastebin.com/sibPrQJL

[]'s

Peter

--
Peter Senna Tschudin
peter.se...@gmail.com
gpg id: 48274C36
Re: Module vs Kernel main performance
> Hello Abu,
>
> I had to include or an error was issued about
> "THIS_MODULE".

I am running this tool on Scientific Linux 6.0, which is 2.6.32 kernel. I know this is old but this is what I have for my product.

> What Kernel version are you using? I'm trying to compile it and I'm
> getting the error:
> [...]
> According to:
> http://lxr.linux.no/linux+v3.4.1/include/linux/fs.h#L1609
>
> There is no .ioctl at struct file_operations...
>
> Can you share how you've used perf/oprofile on your module/Kernel code?
>
> []'s
>
> Peter

for perf:

perf stat -e cpu-cycles,stalled-cycles-frontend,stalled-cycles-backend,instructions,cache-references,cache-misses,branch-instructions,branch-misses,bus-cycles,cpu-clock,task-clock,page-faults,minor-faults,major-faults,context-switches,cpu-migrations,alignment-faults,emulation-faults,L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-store-misses,L1-dcache-prefetches,L1-dcache-prefetch-misses,L1-icache-loads,L1-icache-load-misses,L1-icache-prefetches,L1-icache-prefetch-misses,LLC-loads,LLC-load-misses,LLC-stores,LLC-store-misses,LLC-prefetches,LLC-prefetch-misses,dTLB-loads,dTLB-load-misses,dTLB-stores,dTLB-store-misses,dTLB-prefetches,dTLB-prefetch-misses,iTLB-loads,iTLB-load-misses,branch-loads,branch-load-misses,syscalls:sys_enter_sendmsg,syscalls:sys_exit_sendmsg,sched:sched_wakeup,sched:sched_stat_sleep ./prog

for oprofile:

# opcontrol --reset
# opcontrol --vmlinux=/boot/vmlinux.64
# opcontrol --start
# ./a.out
# opcontrol --shutdown
# opreport -l -p
Re: Module vs Kernel main performance
Hello Abu,

I had to include or an error was issued about "THIS_MODULE".

What Kernel version are you using? I'm trying to compile it and I'm getting the error:

[peter@ace m]$ make
make -C /lib/modules/3.3.7-1.fc17.x86_64/build SUBDIRS=`pwd` modules
make[1]: Entering directory `/usr/src/kernels/3.3.7-1.fc17.x86_64'
  CC [M]  /tmp/m/m.o
/tmp/m/m.c:36:2: error: unknown field ‘ioctl’ specified in initializer
/tmp/m/m.c:36:2: warning: initialization from incompatible pointer type [enabled by default]
/tmp/m/m.c:36:2: warning: (near initialization for ‘m_fops.llseek’) [enabled by default]
make[2]: *** [/tmp/m/m.o] Error 1
make[1]: *** [_module_/tmp/m] Error 2
make[1]: Leaving directory `/usr/src/kernels/3.3.7-1.fc17.x86_64'
make: *** [module] Error 2

According to:
http://lxr.linux.no/linux+v3.4.1/include/linux/fs.h#L1609

There is no .ioctl at struct file_operations...

Can you share how you've used perf/oprofile on your module/Kernel code?

[]'s

Peter

On Fri, Jun 1, 2012 at 3:52 PM, Abu Rasheda wrote:
>> If the buffer at user side is more then a page, then it may be that
>> complete user space buffer is not available in memory and kernel spend time
>> in processing page fault
>
> I have attached code for module and user program. If anyone is bored over
> the weekend they are welcome to try and explain the behavior.
>
> Abu Rasheda

--
Peter Senna Tschudin
peter.se...@gmail.com
gpg id: 48274C36
Re: Module vs Kernel main performance
> If the buffer at user side is more then a page, then it may be that
> complete user space buffer is not available in memory and kernel spend time
> in processing page fault

I have attached code for module and user program. If anyone is bored over the weekend they are welcome to try and explain the behavior.

Abu Rasheda

m.tgz
Description: GNU Zip compressed data
Re: Module vs Kernel main performance
On May 31, 2012 9:37 PM, "Abu Rasheda" wrote:
>
> On Wed, May 30, 2012 at 10:35 PM, Mulyadi Santosa wrote:
> > Hi...
> >
> > On Thu, May 31, 2012 at 4:44 AM, Abu Rasheda wrote:
> >> as I increase size of buffer, insns per cycle keep decreasing. Here is the data:
> >>
> >>   1k  0.90 insns per cycle
> >>   8k  0.43 insns per cycle
> >>  43k  0.18 insns per cycle
> >> 100k  0.08 insns per cycle
> >>
> >> Showing that copy_from_user is more efficient when copy data is small,
> >> why it is so ?
> >
> > you meant, the bigger the buffer, the fewer the instructions, right?
>
> yes

If the buffer on the user side is larger than a page, the complete user-space buffer may not be resident in memory, and the kernel then spends time handling page faults.

> > Not sure why, but I am sure it will reach some peak point.
> >
> > Anyway, you did kmalloc and then kfree()? I think that's why...bigger
> > buffer will grab large chunk from slab...and again likely it's
> > physically contiguous. Also, it will be placed in the same cache line.
> >
> > Whereas the smaller one will hit allocate/free cycle more...thus
> > flushing the L1/L2 cache even more.
>
> It seems to be doing opposite, bigger the allocation / copy longer stall is.
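To make the page argument concrete: a buffer can touch more pages than its size alone suggests, depending on where it starts. A minimal sketch in Python, assuming 4096-byte pages (the x86_64 base page size, no huge pages); the function name and the example addresses are made up for illustration.

```python
PAGE_SIZE = 4096  # assumed base page size (x86_64 default, no huge pages)


def pages_spanned(start_addr, length, page_size=PAGE_SIZE):
    """Number of distinct pages touched by [start_addr, start_addr + length)."""
    if length <= 0:
        return 0
    first_page = start_addr // page_size
    last_page = (start_addr + length - 1) // page_size
    return last_page - first_page + 1


if __name__ == "__main__":
    # Buffer sizes from the experiments in this thread (1k, 8k, 43k, 100k),
    # for a page-aligned buffer and for one starting 16 bytes before a boundary.
    # Both start addresses are arbitrary example values.
    for size_kib in (1, 8, 43, 100):
        length = size_kib * 1024
        aligned = pages_spanned(0x10000, length)
        unaligned = pages_spanned(0x10000 - 16, length)
        print(f"{size_kib:>4}k: {aligned} pages aligned, {unaligned} pages unaligned")
```

Each of those pages can fault the first time it is touched, so a 100k copy may cross 25 or 26 pages where a 1k copy crosses at most 2. Whether that explains the stall here is a separate question: the perf output quoted earlier in the thread reports only 367 page faults for the whole module run.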
Re: Module vs Kernel main performance
On Wed, May 30, 2012 at 10:35 PM, Mulyadi Santosa wrote:
> Hi...
>
> On Thu, May 31, 2012 at 4:44 AM, Abu Rasheda wrote:
>> as I increase size of buffer, insns per cycle keep decreasing. Here is the
>> data:
>>
>>   1k  0.90 insns per cycle
>>   8k  0.43 insns per cycle
>>  43k  0.18 insns per cycle
>> 100k  0.08 insns per cycle
>>
>> Showing that copy_from_user is more efficient when copy data is small,
>> why it is so ?
>
> you meant, the bigger the buffer, the fewer the instructions, right?

yes

> Not sure why, but I am sure it will reach some peak point.
>
> Anyway, you did kmalloc and then kfree()? I think that's why...bigger
> buffer will grab large chunk from slab...and again likely it's
> physically contiguous. Also, it will be placed in the same cache line.
>
> Whereas the smaller one will hit allocate/free cycle more...thus
> flushing the L1/L2 cache even more.

It seems to be doing the opposite: the bigger the allocation / copy, the longer the stall is.
Re: Module vs Kernel main performance
Hi...

On Thu, May 31, 2012 at 4:44 AM, Abu Rasheda wrote:
> as I increase size of buffer, insns per cycle keep decreasing. Here is the
> data:
>
>   1k  0.90 insns per cycle
>   8k  0.43 insns per cycle
>  43k  0.18 insns per cycle
> 100k  0.08 insns per cycle
>
> Showing that copy_from_user is more efficient when copy data is small,
> why it is so ?

you meant, the bigger the buffer, the fewer the instructions, right?

Not sure why, but I am sure it will reach some peak point.

Anyway, you did kmalloc and then kfree()? I think that's why...bigger buffer will grab large chunk from slab...and again likely it's physically contiguous. Also, it will be placed in the same cache line.

Whereas the smaller one will hit allocate/free cycle more...thus flushing the L1/L2 cache even more.

CMIIW people...

--
regards,

Mulyadi Santosa
Freelance Linux trainer and consultant

blog: the-hydra.blogspot.com
training: mulyaditraining.blogspot.com
Re: Module vs Kernel main performance
On Wed, May 30, 2012 at 2:44 PM, Abu Rasheda wrote:
> I did another experiment.
>
> Wrote a stand alone module and user program which does ioctl and pass
> buffer to kernel module.
>
> User program passes a buffer through ioctl and kernel module does
> kmalloc on it and calls copy_from_user, kfree and return. Test program
> send 120 gigabyte data to module.
>
> If I pass 1k buffer per call, I get
>
> 115,396,349,819 instructions # 0.90 insns per cycle [95.00%]
>
> as I increase size of buffer, insns per cycle keep decreasing. Here is the
> data:
>
>   1k  0.90 insns per cycle
>   8k  0.43 insns per cycle
>  43k  0.18 insns per cycle
> 100k  0.08 insns per cycle
>
> Showing that copy_from_user is more efficient when copy data is small,
> why it is so ?

Did another experiment:

User program sends 43k; the module allocates 43k after entering ioctl and copies a smaller portion in each call to copy_from_user:
--
copy_from_user size per call   insns per cycle
 0.25k                         0.56
 0.50k                         0.42
 1.00k                         0.36
 2.00k                         0.29
 3.00k                         0.26
 4.00k                         0.23
 8.00k                         0.21
16.00k                         0.19

User program sends 43k; the module allocates a smaller chunk and passes that chunk to each call to copy_from_user:
--
Allocated and copied per call  insns per cycle
 0.25k                         1.04
 0.50k                         0.90
 1.00k                         0.79
 2.00k                         0.67
 4.00k                         0.53
 8.00k                         0.42
16.00k                         0.33
32.00k                         0.22
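The first experiment above (one 43k allocation, filled in smaller pieces) has the loop structure sketched below. This is a user-space stand-in in Python, not the module's actual code: the function names are hypothetical, and a slice assignment takes the place of each copy_from_user() call.

```python
def chunk_ranges(total, chunk):
    """Yield (offset, length) pairs covering `total` bytes in `chunk`-sized steps."""
    offset = 0
    while offset < total:
        length = min(chunk, total - offset)
        yield offset, length
        offset += length


def chunked_copy(src, chunk):
    """Copy `src` piece by piece; return the copy and the number of pieces."""
    dst = bytearray(len(src))
    calls = 0
    for offset, length in chunk_ranges(len(src), chunk):
        # In the module, this step would be one copy_from_user() call.
        dst[offset:offset + length] = src[offset:offset + length]
        calls += 1
    return bytes(dst), calls


if __name__ == "__main__":
    src = bytes(43 * 1024)  # one 43k buffer, as in the experiment
    for chunk in (256, 512, 1024, 2048, 4096, 8192, 16384):
        _, calls = chunked_copy(src, chunk)
        print(f"{chunk / 1024:5.2f}k per call -> {calls} calls per 43k buffer")
```

Smaller chunks mean many more calls per ioctl (172 at 0.25k against 3 at 16k), so part of the higher insns per cycle at small chunk sizes may come from per-call bookkeeping instructions that do not stall, rather than from a faster copy.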
Re: Module vs Kernel main performance
I did another experiment.

I wrote a stand-alone module and a user program that does an ioctl and passes a buffer to the kernel module.

The user program passes a buffer through ioctl; the kernel module does a kmalloc for it, calls copy_from_user, then kfree, and returns. The test program sends 120 gigabytes of data to the module.

If I pass a 1k buffer per call, I get

115,396,349,819 instructions # 0.90 insns per cycle [95.00%]

As I increase the size of the buffer, insns per cycle keeps decreasing. Here is the data:

  1k  0.90 insns per cycle
  8k  0.43 insns per cycle
 43k  0.18 insns per cycle
100k  0.08 insns per cycle

This shows that copy_from_user is more efficient when the amount of data copied is small. Why is that so?
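For scale, the number of ioctl calls behind each row of the table above can be worked out from the total data sent. A rough sketch in Python under two assumptions that the message does not state: "120 gigabyte" is read as 120 GiB and "1k" as 1024 bytes. The function and constant names are illustrative.

```python
TOTAL_BYTES = 120 * 2**30  # assumption: "120 gigabyte" read as 120 GiB
KIB = 1024                 # assumption: "1k" read as 1024 bytes

# Buffer size in KiB -> measured insns per cycle, copied from the table above.
MEASURED_IPC = {1: 0.90, 8: 0.43, 43: 0.18, 100: 0.08}

# perf "instructions" count reported for the 1k run (whole test program).
INSTRUCTIONS_AT_1K = 115_396_349_819


def ioctl_calls(total_bytes, buf_kib):
    """Calls needed to send total_bytes using buffers of buf_kib KiB each."""
    buf_bytes = buf_kib * KIB
    return -(-total_bytes // buf_bytes)  # ceiling division


if __name__ == "__main__":
    for buf_kib, ipc in MEASURED_IPC.items():
        calls = ioctl_calls(TOTAL_BYTES, buf_kib)
        print(f"{buf_kib:>4}k: {calls:>11,} calls, {ipc:.2f} insns per cycle")
    per_call = INSTRUCTIONS_AT_1K / ioctl_calls(TOTAL_BYTES, 1)
    print(f"about {per_call:.0f} instructions per call at 1k")
```

Under these assumptions the 1k run makes about 126 million calls and retires roughly 917 instructions per call, user side included. Larger buffers need far fewer calls, so fixed per-call instructions make up a smaller share of the instruction stream as the buffer grows.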
Re: Module vs Kernel main performance
Hi...

On Wed, May 30, 2012 at 11:51 AM, Abu Rasheda wrote:
> When you say, LKM area is prepared with vmalloc is it for code /
> executable you referring to ?

Yes, AFAIK the memory area for code and static data in a Linux kernel module is allocated via vmalloc().

> if so will it matter for data copy ?

see my previous reply :)

> Point # 2. Some one was saying that on at least MIPS it takes more
> cycles to call kernel main function from module because of long jump.
> Does it apply to x86_64 too ?

IIRC long jump means jumping more than 64 KB...but that's in real mode in 32 bit...so I am not sure whether it still applies in protected mode.

> To test above two should I make my module part of static kernel ?

good idea... I think you can try that... :)

--
regards,

Mulyadi Santosa
Freelance Linux trainer and consultant

blog: the-hydra.blogspot.com
training: mulyaditraining.blogspot.com
Re: Module vs Kernel main performance
> What I meant here is, there must be difference speed when you copy
> onto something contiguous vs non contiguous. IIRC at least it will waste
> some portion of L1/L2 cache.

When you say the LKM area is prepared with vmalloc, are you referring to the code / executable? If so, will it matter for data copy?

Point # 2. Someone was saying that, at least on MIPS, it takes more cycles to call a kernel main function from a module because of the long jump. Does that apply to x86_64 too?

To test the above two, should I make my module part of the static kernel?
Re: Module vs Kernel main performance
Hi...

On Wed, May 30, 2012 at 6:50 AM, Abu Rasheda wrote:
> So obviously, CPU is stalling when it is copying data and there are
> more cache misses. My question is, is there a difference calling
> copy_from_user from kernel proper compared to calling from LKM ?

Theoretically, it should be the same. However, one thing that might interest you is that the Linux kernel module memory area is prepared through vmalloc(), so there is a chance it is not physically contiguous... whereas the main kernel image uses page_alloc() IIRC and is thus physically contiguous.

What I meant here is, there must be a speed difference when you copy onto something contiguous vs non-contiguous. IIRC at least it will waste some portion of L1/L2 cache.

Just my 2 cents, maybe I am wrong somewhere...

--
regards,

Mulyadi Santosa
Freelance Linux trainer and consultant

blog: the-hydra.blogspot.com
training: mulyaditraining.blogspot.com