Re: Module vs Kernel main performance

2012-06-08 Thread Abu Rasheda
I modified my module (m.c). I am still sending the buffer from user space via
ioctl, but instead of copying from the user-supplied buffer, I now kmalloc a
source buffer in the kernel and copy from it to another kernel buffer that is
allocated on every invocation of the module's ioctl.

copy_from_user is thus replaced with memcpy, and I still see the processor
stall. This suggests the buffer allocated per call is the cause.

Abu
___
Kernelnewbies mailing list
Kernelnewbies@kernelnewbies.org
http://lists.kernelnewbies.org/mailman/listinfo/kernelnewbies


Re: Module vs Kernel main performance

2012-06-07 Thread Abu Rasheda
Peter Senna Tschudin wrote:

> Hi again!
>

Hi


> How did you call it from the kernel module?


In the original code the copied data is DMAed; in the experimental code the
data is simply dropped.


Re: Module vs Kernel main performance

2012-06-07 Thread Peter Senna Tschudin
Hi again!

On Tue, May 29, 2012 at 8:50 PM, Abu Rasheda wrote:
> Hi,
>
> I am working on the x86_64 arch. I profiled (with oprofile) a Linux kernel
> module and noticed that a whole lot of cycles are spent in the
> copy_from_user call. I compared the same flow in the kernel proper and
> noticed that, for more data throughput, the cycles spent in copy_from_user
> are much lower. The kernel proper uses 1/8 of the cycles compared to the
> module. (There is a user process which keeps sending data, like iperf.)
>
> I used the perf tool to gather some statistics and found, for the call from
> the kernel proper:
>
> 185,719,857,837 cpu-cycles                #    3.318 GHz                  [90.01%]
>  99,886,030,243 instructions              #    0.54  insns per cycle      [95.00%]
>   1,696,072,702 cache-references          #   30.297 M/sec                [94.99%]
>     786,929,244 cache-misses              #   46.397 % of all cache refs  [95.00%]
>  16,867,747,688 branch-instructions       #  301.307 M/sec                [95.03%]
>      86,752,646 branch-misses             #    0.51% of all branches      [95.00%]
>   5,482,768,332 bus-cycles                #   97.938 M/sec                [20.08%]
>    55967.269801 cpu-clock
>    55981.842225 task-clock                #    0.933 CPUs utilized
>
> and for the call from the kernel module:
>
>  9,388,787,678 cpu-cycles                #    1.527 GHz                  [89.77%]
>  1,706,203,221 instructions              #    0.18  insns per cycle      [94.59%]
>    551,010,961 cache-references          #   89.588 M/sec                [94.73%]
>    369,632,492 cache-misses              #   67.083 % of all cache refs  [95.18%]
>    291,358,658 branch-instructions       #   47.372 M/sec                [94.68%]
>     10,291,678 branch-misses             #    3.53% of all branches      [95.01%]
>    582,651,999 bus-cycles                #   94.733 M/sec                [20.55%]
>   6112.471585 cpu-clock
>   6150.490210 task-clock                #    0.102 CPUs utilized
>            367 page-faults               #    0.000 M/sec
>            367 minor-faults              #    0.000 M/sec
>              0 major-faults              #    0.000 M/sec
>         25,770 context-switches          #    0.004 M/sec
>             23 cpu-migrations            #    0.000 M/sec

How did you call it from the kernel module?

>
>
> So obviously the CPU is stalling while it is copying data, and there are
> more cache misses. My question is: is there a difference between calling
> copy_from_user from the kernel proper and calling it from an LKM?
>

[]'s

-- 
Peter Senna Tschudin
peter.se...@gmail.com
gpg id: 48274C36



Re: Module vs Kernel main performance

2012-06-07 Thread Peter Senna Tschudin
Hello Abu,

On Thu, Jun 7, 2012 at 2:47 PM, Abu Rasheda wrote:
>> Hello Abu,
>>
>> I had to include  or an error was issued about
>> "THIS_MODULE".
>
>
> I am running this tool on Scientific Linux 6.0, which uses a 2.6.32 kernel.
> I know this is old, but it is what I have for my product.
>
>
>>
>> What Kernel version are you using? I'm trying to compile it and I'm
>> getting the error:
>>
>> [peter@ace m]$ make
>> make -C /lib/modules/3.3.7-1.fc17.x86_64/build SUBDIRS=`pwd` modules
>> make[1]: Entering directory `/usr/src/kernels/3.3.7-1.fc17.x86_64'
>>  CC [M]  /tmp/m/m.o
>> /tmp/m/m.c:36:2: error: unknown field ‘ioctl’ specified in initializer
>> /tmp/m/m.c:36:2: warning: initialization from incompatible pointer
>> type [enabled by default]
>> /tmp/m/m.c:36:2: warning: (near initialization for ‘m_fops.llseek’)
>> [enabled by default]
>> make[2]: *** [/tmp/m/m.o] Error 1
>> make[1]: *** [_module_/tmp/m] Error 2
>> make[1]: Leaving directory `/usr/src/kernels/3.3.7-1.fc17.x86_64'
>> make: *** [module] Error 2
>>
>> According to:
>> http://lxr.linux.no/linux+v3.4.1/include/linux/fs.h#L1609
>>
>> There is no .ioctl at struct file_operations...
>>
>> Can you share how you've used perf/oprofile on your module/Kernel code?
>>
>> []'s
>>
>> Peter
>
>
> for perf:
>
> perf stat -e
> cpu-cycles,stalled-cycles-frontend,stalled-cycles-backend,instructions,cache-references,cache-misses,branch-instructions,branch-misses,bus-cycles,cpu-clock,task-clock,page-faults,minor-faults,major-faults,context-switches,cpu-migrations,alignment-faults,emulation-faults,L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-store-misses,L1-dcache-prefetches,L1-dcache-prefetch-misses,L1-icache-loads,L1-icache-load-misses,L1-icache-prefetches,L1-icache-prefetch-misses,LLC-loads,LLC-load-misses,LLC-stores,LLC-store-misses,LLC-prefetches,LLC-prefetch-misses,dTLB-loads,dTLB-load-misses,dTLB-stores,dTLB-store-misses,dTLB-prefetches,dTLB-prefetch-misses,iTLB-loads,iTLB-load-misses,branch-loads,branch-load-misses,syscalls:sys_enter_sendmsg,syscalls:sys_exit_sendmsg,sched:sched_wakeup,sched:sched_stat_sleep
> ./prog
>
> for oprofile:
>
> # opcontrol --reset
> # opcontrol --vmlinux=/boot/vmlinux.64
> # opcontrol --start
> # ./a.out
> # opcontrol --shutdown
> # opreport -l -p

Thanks! I'll try it now.

I've made changes to your code, so it "probably" will:
 - Run on a 3.4 kernel
 - Partially meet the kernel coding style (try running scripts/checkpatch.pl -f m.c)
 - Stop working due to lack of locking in m_ioctl(). I'm working on this now... :-)

See it at: http://pastebin.com/sibPrQJL

[]'s

Peter


-- 
Peter Senna Tschudin
peter.se...@gmail.com
gpg id: 48274C36



Re: Module vs Kernel main performance

2012-06-07 Thread Abu Rasheda
>
> Hello Abu,
>
> I had to include  or an error was issued about
> "THIS_MODULE".
>

I am running this tool on Scientific Linux 6.0, which uses a 2.6.32 kernel.
I know this is old, but it is what I have for my product.


> What Kernel version are you using? I'm trying to compile it and I'm
> getting the error:
>
> [peter@ace m]$ make
> make -C /lib/modules/3.3.7-1.fc17.x86_64/build SUBDIRS=`pwd` modules
> make[1]: Entering directory `/usr/src/kernels/3.3.7-1.fc17.x86_64'
>  CC [M]  /tmp/m/m.o
> /tmp/m/m.c:36:2: error: unknown field ‘ioctl’ specified in initializer
> /tmp/m/m.c:36:2: warning: initialization from incompatible pointer
> type [enabled by default]
> /tmp/m/m.c:36:2: warning: (near initialization for ‘m_fops.llseek’)
> [enabled by default]
> make[2]: *** [/tmp/m/m.o] Error 1
> make[1]: *** [_module_/tmp/m] Error 2
> make[1]: Leaving directory `/usr/src/kernels/3.3.7-1.fc17.x86_64'
> make: *** [module] Error 2
>
> According to:
> http://lxr.linux.no/linux+v3.4.1/include/linux/fs.h#L1609
>
> There is no .ioctl at struct file_operations...
>
> Can you share how you've used perf/oprofile on your module/Kernel code?
>
> []'s
>
> Peter


for perf:

perf stat -e
cpu-cycles,stalled-cycles-frontend,stalled-cycles-backend,instructions,cache-references,cache-misses,branch-instructions,branch-misses,bus-cycles,cpu-clock,task-clock,page-faults,minor-faults,major-faults,context-switches,cpu-migrations,alignment-faults,emulation-faults,L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-store-misses,L1-dcache-prefetches,L1-dcache-prefetch-misses,L1-icache-loads,L1-icache-load-misses,L1-icache-prefetches,L1-icache-prefetch-misses,LLC-loads,LLC-load-misses,LLC-stores,LLC-store-misses,LLC-prefetches,LLC-prefetch-misses,dTLB-loads,dTLB-load-misses,dTLB-stores,dTLB-store-misses,dTLB-prefetches,dTLB-prefetch-misses,iTLB-loads,iTLB-load-misses,branch-loads,branch-load-misses,syscalls:sys_enter_sendmsg,syscalls:sys_exit_sendmsg,sched:sched_wakeup,sched:sched_stat_sleep
./prog

for oprofile:

# opcontrol --reset
# opcontrol --vmlinux=/boot/vmlinux.64
# opcontrol --start
# ./a.out
# opcontrol --shutdown
# opreport -l -p


Re: Module vs Kernel main performance

2012-06-07 Thread Peter Senna Tschudin
Hello Abu,

I had to include  or an error was issued about "THIS_MODULE".

What Kernel version are you using? I'm trying to compile it and I'm
getting the error:

[peter@ace m]$ make
make -C /lib/modules/3.3.7-1.fc17.x86_64/build SUBDIRS=`pwd` modules
make[1]: Entering directory `/usr/src/kernels/3.3.7-1.fc17.x86_64'
  CC [M]  /tmp/m/m.o
/tmp/m/m.c:36:2: error: unknown field ‘ioctl’ specified in initializer
/tmp/m/m.c:36:2: warning: initialization from incompatible pointer
type [enabled by default]
/tmp/m/m.c:36:2: warning: (near initialization for ‘m_fops.llseek’)
[enabled by default]
make[2]: *** [/tmp/m/m.o] Error 1
make[1]: *** [_module_/tmp/m] Error 2
make[1]: Leaving directory `/usr/src/kernels/3.3.7-1.fc17.x86_64'
make: *** [module] Error 2

According to:
http://lxr.linux.no/linux+v3.4.1/include/linux/fs.h#L1609

There is no .ioctl at struct file_operations...

Can you share how you've used perf/oprofile on your module/Kernel code?

[]'s

Peter


On Fri, Jun 1, 2012 at 3:52 PM, Abu Rasheda wrote:
>> If the buffer on the user side is more than a page, then it may be that the
>> complete user-space buffer is not resident in memory and the kernel spends
>> time processing page faults.
>
>
> I have attached the code for the module and the user program. If anyone is
> bored over the weekend, they are welcome to try to explain the behavior.
>
> Abu Rasheda
>



-- 
Peter Senna Tschudin
peter.se...@gmail.com
gpg id: 48274C36



Re: Module vs Kernel main performance

2012-06-01 Thread Abu Rasheda
>
> If the buffer on the user side is more than a page, then it may be that the
> complete user-space buffer is not resident in memory and the kernel spends
> time processing page faults.
>

I have attached the code for the module and the user program. If anyone is
bored over the weekend, they are welcome to try to explain the behavior.

Abu Rasheda


[Attachment: m.tgz (GNU zip compressed data)]


Re: Module vs Kernel main performance

2012-05-31 Thread Chetan Nanda
On May 31, 2012 9:37 PM, "Abu Rasheda" wrote:
>
> On Wed, May 30, 2012 at 10:35 PM, Mulyadi Santosa wrote:
> > Hi...
> >
> > On Thu, May 31, 2012 at 4:44 AM, Abu Rasheda wrote:
> >> as I increase the size of the buffer, insns per cycle keeps decreasing.
> >> Here is the data:
> >>
> >>    1k 0.90  insns per cycle
> >>    8k 0.43  insns per cycle
> >>   43k 0.18  insns per cycle
> >> 100k 0.08  insns per cycle
> >>
> >> showing that copy_from_user is more efficient when the copied data is
> >> small. Why is that so?
> >
> > you meant, the bigger the buffer, the fewer the instructions, right?
>
> yes
>
If the buffer on the user side is more than a page, then it may be that the
complete user-space buffer is not resident in memory and the kernel spends
time processing page faults.
> >
> > Not sure why, but I am sure it will reach some peak point.
> >
> > Anyway, you did kmalloc and then kfree()? I think that's why... a bigger
> > buffer will grab a large chunk from the slab... and again it is likely
> > physically contiguous. Also, it will be placed in the same cache line.
> >
> > Whereas the smaller one will hit the allocate/free cycle more... thus
> > flushing the L1/L2 cache even more.
>
> It seems to be doing the opposite: the bigger the allocation/copy, the
> longer the stall is.


Re: Module vs Kernel main performance

2012-05-31 Thread Abu Rasheda
On Wed, May 30, 2012 at 10:35 PM, Mulyadi Santosa wrote:
> Hi...
>
> On Thu, May 31, 2012 at 4:44 AM, Abu Rasheda wrote:
>> as I increase the size of the buffer, insns per cycle keeps decreasing.
>> Here is the data:
>>
>>    1k 0.90  insns per cycle
>>    8k 0.43  insns per cycle
>>   43k 0.18  insns per cycle
>> 100k 0.08  insns per cycle
>>
>> showing that copy_from_user is more efficient when the copied data is
>> small. Why is that so?
>
> you meant, the bigger the buffer, the fewer the instructions, right?

yes

>
> Not sure why, but I am sure it will reach some peak point.
>
> Anyway, you did kmalloc and then kfree()? I think that's why... a bigger
> buffer will grab a large chunk from the slab... and again it is likely
> physically contiguous. Also, it will be placed in the same cache line.
>
> Whereas the smaller one will hit the allocate/free cycle more... thus
> flushing the L1/L2 cache even more.

It seems to be doing the opposite: the bigger the allocation/copy, the longer
the stall is.



Re: Module vs Kernel main performance

2012-05-30 Thread Mulyadi Santosa
Hi...

On Thu, May 31, 2012 at 4:44 AM, Abu Rasheda wrote:
> as I increase the size of the buffer, insns per cycle keeps decreasing.
> Here is the data:
>
>    1k 0.90  insns per cycle
>    8k 0.43  insns per cycle
>   43k 0.18  insns per cycle
> 100k 0.08  insns per cycle
>
> showing that copy_from_user is more efficient when the copied data is
> small. Why is that so?

you meant, the bigger the buffer, the fewer the instructions, right?

Not sure why, but I am sure it will reach some peak point.

Anyway, you did kmalloc and then kfree()? I think that's why... a bigger
buffer will grab a large chunk from the slab... and again it is likely
physically contiguous. Also, it will be placed in the same cache line.

Whereas the smaller one will hit the allocate/free cycle more... thus
flushing the L1/L2 cache even more.

CMIIW people...

-- 
regards,

Mulyadi Santosa
Freelance Linux trainer and consultant

blog: the-hydra.blogspot.com
training: mulyaditraining.blogspot.com



Re: Module vs Kernel main performance

2012-05-30 Thread Abu Rasheda
On Wed, May 30, 2012 at 2:44 PM, Abu Rasheda wrote:
> I did another experiment.
>
> I wrote a stand-alone module and a user program that does an ioctl and
> passes a buffer to the kernel module.
>
> The user program passes a buffer through ioctl; the kernel module does a
> kmalloc, calls copy_from_user, does a kfree, and returns. The test program
> sends 120 gigabytes of data to the module.
>
> If I pass a 1k buffer per call, I get
>
> 115,396,349,819 instructions              #    0.90  insns per cycle      [95.00%]
>
> as I increase the size of the buffer, insns per cycle keeps decreasing.
> Here is the data:
>
>    1k 0.90  insns per cycle
>    8k 0.43  insns per cycle
>   43k 0.18  insns per cycle
> 100k 0.08  insns per cycle
>
> showing that copy_from_user is more efficient when the copied data is
> small. Why is that so?

I did another experiment.

The user program sends 43k; 43k is allocated after entering the ioctl, and a
smaller portion is copied in each call to copy_from_user:
--
copy_from_user  0.25k at a time 0.56  insns per cycle
copy_from_user  0.50k at a time 0.42  insns per cycle
copy_from_user  1.00k at a time 0.36  insns per cycle
copy_from_user  2.00k at a time 0.29  insns per cycle
copy_from_user  3.00k at a time 0.26  insns per cycle
copy_from_user  4.00k at a time 0.23  insns per cycle
copy_from_user  8.00k at a time 0.21  insns per cycle
copy_from_user 16.00k at a time 0.19  insns per cycle


The user program sends 43k; a smaller chunk is allocated, and that chunk is
passed to each call to copy_from_user:
--
Allocated 0.25k and copy_from_user  0.25k at a time 1.04 insns per cycle
Allocated 0.50k and copy_from_user  0.50k at a time 0.90 insns per cycle
Allocated 1.00k and copy_from_user  1.00k at a time 0.79 insns per cycle
Allocated 2.00k and copy_from_user  2.00k at a time 0.67 insns per cycle
Allocated 4.00k and copy_from_user  4.00k at a time 0.53 insns per cycle
Allocated 8.00k and copy_from_user  8.00k at a time 0.42 insns per cycle
Allocated 16.00k and copy_from_user 16.00k at a time 0.33 insns per cycle
Allocated 32.00k and copy_from_user 32.00k at a time 0.22 insns per cycle



Re: Module vs Kernel main performance

2012-05-30 Thread Abu Rasheda
I did another experiment.

I wrote a stand-alone module and a user program that does an ioctl and passes
a buffer to the kernel module.

The user program passes a buffer through ioctl; the kernel module does a
kmalloc, calls copy_from_user, does a kfree, and returns. The test program
sends 120 gigabytes of data to the module.

If I pass a 1k buffer per call, I get

115,396,349,819 instructions              #    0.90  insns per cycle      [95.00%]

as I increase the size of the buffer, insns per cycle keeps decreasing. Here
is the data:

   1k 0.90  insns per cycle
   8k 0.43  insns per cycle
  43k 0.18  insns per cycle
100k 0.08  insns per cycle

showing that copy_from_user is more efficient when the copied data is small.
Why is that so?



Re: Module vs Kernel main performance

2012-05-30 Thread Mulyadi Santosa
Hi...

On Wed, May 30, 2012 at 11:51 AM, Abu Rasheda wrote:
> When you say the LKM area is prepared with vmalloc, is it the code /
> executable you are referring to?

Yes, AFAIK the memory for code and static data in a Linux kernel module is
allocated via vmalloc().

> if so, will it matter for a data copy?

see my previous reply :)

>
> Point #2: someone was saying that, at least on MIPS, it takes more cycles
> to call a kernel-proper function from a module because of the long jump.
> Does it apply to x86_64 too?

IIRC a long jump means jumping more than 64 KB... but that's in real mode in
32 bit... so I am not sure whether it still applies in protected mode.

> To test the above two, should I make my module part of the static kernel?

good idea, I think you can try that... :)
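For reference, the usual way to try that is to drop the source into the kernel tree and switch the object from obj-m to obj-y in the kbuild Makefile. A sketch, assuming the module source is m.c and using drivers/misc as an arbitrary example location:

```makefile
# drivers/misc/Makefile (location is illustrative)
obj-y += m.o      # linked into the kernel image proper
# obj-m += m.o    # the same source built as a loadable module
```

With obj-y the code lives in the main kernel image (and thus in directly mapped, physically contiguous memory) rather than in the vmalloc area.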

-- 
regards,

Mulyadi Santosa
Freelance Linux trainer and consultant

blog: the-hydra.blogspot.com
training: mulyaditraining.blogspot.com



Re: Module vs Kernel main performance

2012-05-29 Thread Abu Rasheda
> What I meant here is, there must be a difference in speed when you copy
> onto something contiguous vs non-contiguous. IIRC at least it will waste
> some portion of the L1/L2 cache.

When you say the LKM area is prepared with vmalloc, is it the code /
executable you are referring to? If so, will it matter for a data copy?

Point #2: someone was saying that, at least on MIPS, it takes more cycles to
call a kernel-proper function from a module because of the long jump. Does
it apply to x86_64 too?

To test the above two, should I make my module part of the static kernel?



Re: Module vs Kernel main performance

2012-05-29 Thread Mulyadi Santosa
Hi...

On Wed, May 30, 2012 at 6:50 AM, Abu Rasheda wrote:
> So obviously the CPU is stalling while it is copying data, and there are
> more cache misses. My question is: is there a difference between calling
> copy_from_user from the kernel proper and calling it from an LKM?

Theoretically, it should be the same. However, one thing that might interest
you is the fact that Linux kernel module memory is prepared through
vmalloc(), thus there is a chance it is not physically contiguous... whereas
the main kernel image uses page_alloc() IIRC and is thus physically
contiguous.

What I meant here is, there must be a difference in speed when you copy onto
something contiguous vs non-contiguous. IIRC at least it will waste some
portion of the L1/L2 cache.

Just my 2 cents, maybe I am wrong somewhere...


-- 
regards,

Mulyadi Santosa
Freelance Linux trainer and consultant

blog: the-hydra.blogspot.com
training: mulyaditraining.blogspot.com
