Re: Observed an eCryptfs crash

2016-10-09 Thread xiakaixu

ping...



Hi Tyhicks,

We occasionally observe an eCryptfs crash on Linux kernel 4.1.18. The call 
trace is attached below. Is this a known issue? We look forward to hearing from you. 
Thanks in advance!

[19314.529479s][pid:2694,cpu3,GAC_Executor[0]]Call trace:
[19314.529510s][pid:2694,cpu3,GAC_Executor[0]][] 
do_raw_spin_lock+0x20/0x200
[19314.529510s][pid:2694,cpu3,GAC_Executor[0]][] 
_raw_spin_lock+0x28/0x34
[19314.529541s][pid:2694,cpu3,GAC_Executor[0]][] 
selinux_inode_free_security+0x3c/0x94
[19314.529541s][pid:2694,cpu3,GAC_Executor[0]][] 
security_inode_free+0x2c/0x38
[19314.529541s][pid:2694,cpu3,GAC_Executor[0]][] 
__destroy_inode+0x2c/0x180
[19314.529571s][pid:2694,cpu3,GAC_Executor[0]][] 
destroy_inode+0x30/0xa0
[19314.529571s][pid:2694,cpu3,GAC_Executor[0]][] 
evict+0x108/0x1c0
[19314.529571s][pid:2694,cpu3,GAC_Executor[0]][] 
iput+0x184/0x258
[19314.529602s][pid:2694,cpu3,GAC_Executor[0]][] 
ecryptfs_evict_inode+0x30/0x3c
[19314.529602s][pid:2694,cpu3,GAC_Executor[0]][] 
evict+0xac/0x1c0
[19314.529602s][pid:2694,cpu3,GAC_Executor[0]][] 
dispose_list+0x44/0x5c
[19314.529632s][pid:2694,cpu3,GAC_Executor[0]][] 
evict_inodes+0xcc/0x12c
[19314.529632s][pid:2694,cpu3,GAC_Executor[0]][] 
generic_shutdown_super+0x58/0xe4
[19314.529632s][pid:2694,cpu3,GAC_Executor[0]][] 
kill_anon_super+0x30/0x74
[19314.529663s][pid:2694,cpu3,GAC_Executor[0]][] 
ecryptfs_kill_block_super+0x24/0x54
[19314.529663s][pid:2694,cpu3,GAC_Executor[0]][] 
deactivate_locked_super+0x60/0x8c
[19314.529663s][pid:2694,cpu3,GAC_Executor[0]][] 
deactivate_super+0x98/0xa4
[19314.529693s][pid:2694,cpu3,GAC_Executor[0]][] 
cleanup_mnt+0x50/0xd0
[19314.529693s][pid:2694,cpu3,GAC_Executor[0]][] 
__cleanup_mnt+0x20/0x2c
[19314.529693s][pid:2694,cpu3,GAC_Executor[0]][] 
task_work_run+0xbc/0xf8
[19314.529724s][pid:2694,cpu3,GAC_Executor[0]][] 
do_exit+0x2d4/0xa14
[19314.529724s][pid:2694,cpu3,GAC_Executor[0]][] 
do_group_exit+0x60/0xf8
[19314.529724s][pid:2694,cpu3,GAC_Executor[0]][] 
get_signal+0x284/0x598
[19314.529754s][pid:2694,cpu3,GAC_Executor[0]][] 
do_signal+0x170/0x5b8
[19314.529754s][pid:2694,cpu3,GAC_Executor[0]][] 
do_notify_resume+0x70/0x78
[19314.529785s][pid:2694,cpu3,GAC_Executor[0]]Code: aa0003f3 aa1e03e0 97fe7718 
5289d5a0 (b9400661)
[19314.529907s][pid:2694,cpu3,GAC_Executor[0]]---[ end trace 382e4b6264b035b5 
]---
[19314.529907s][pid:2694,cpu3,GAC_Executor[0]]Kernel panic - not syncing: Fatal 
exception

Regards,
Shuoran





there are unencrypted files in an encrypted directory in F2FS

2016-09-18 Thread xiakaixu

Hi Kim,

According to the encryption design policy, "all of the files or
subdirectories in an encrypted directory must be encrypted". But
the current f2fs code seems to allow unencrypted files to exist
in an encrypted directory. For example, the f2fs_create() and
f2fs_mknod() functions call f2fs_new_inode() to check the child inode:

	/* If the directory encrypted, then we should encrypt the inode. */
	if (f2fs_encrypted_inode(dir) && f2fs_may_encrypt(inode))
		f2fs_set_encrypted_inode(inode);

static inline bool f2fs_may_encrypt(struct inode *inode)
{
#ifdef CONFIG_F2FS_FS_ENCRYPTION
	umode_t mode = inode->i_mode;

	return (S_ISREG(mode) || S_ISDIR(mode) || S_ISLNK(mode));
#else
	return 0;
#endif
}

So even if the child inode is not REG/DIR/LNK, it can still be created
successfully as an unencrypted file. Instead, maybe we can return
-EACCES here. Not sure about it :)

--
Regards
Kaixu Xia



Re: Kernel panic - encryption/decryption failed when open file on Arm64

2016-09-12 Thread xiakaixu

On 12 September 2016 at 03:16, liushuoran  wrote:

Hi Ard,

Thanks for the prompt reply. With the patch, there is no panic anymore. But it 
seems that the encryption/decryption is not successful anyway.

As Herbert points out, "If the page allocation fails in blkcipher_walk_next it'll 
simply switch over to processing it block by block". So does that mean the 
encryption/decryption should be successful even if the page allocation fails? Please 
correct me if I misunderstand anything. Thanks in advance.



Perhaps Herbert can explain: I don't see how the 'n = 0' assignment
results in the correct path being taken; this chunk (blkcipher.c:252)

if (unlikely(n < bsize)) {
 err = blkcipher_next_slow(desc, walk, bsize, walk->alignmask);
 goto set_phys_lowmem;
}

is skipped due to the fact that n == 0 and therefore bsize == 0, and
so the condition is always false for n == 0

Therefore we end up here (blkcipher.c:257)

walk->nbytes = n;
if (walk->flags & BLKCIPHER_WALK_COPY) {
 err = blkcipher_next_copy(walk);
 goto set_phys_lowmem;
}

where blkcipher_next_copy() unconditionally calls memcpy() with
walk->page as destination (even though we ended up here due to the
fact that walk->page == NULL)

So to me, it seems like we should be taking the blkcipher_next_slow()
path, which does a kmalloc() and bails with -ENOMEM if that fails.


Hi Ard,

Thanks for such a detailed reply.

According to your reply, I made a little change to take the
blkcipher_next_slow() path. I tested it on an arm64 board; there is
no panic anymore and the encryption/decryption seems to be successful.

diff --git a/crypto/blkcipher.c b/crypto/blkcipher.c
index 0122bec..5389d40 100644
--- a/crypto/blkcipher.c
+++ b/crypto/blkcipher.c
@@ -240,12 +240,13 @@ static int blkcipher_walk_next(struct blkcipher_desc *desc,
 		walk->flags |= BLKCIPHER_WALK_COPY;
 		if (!walk->page) {
 			walk->page = (void *)__get_free_page(GFP_ATOMIC);
+			walk->page = NULL;
 			if (!walk->page)
 				n = 0;
 		}
 	}
 
-	bsize = min(walk->walk_blocksize, n);
+	bsize = walk->walk_blocksize;
 	n = scatterwalk_clamp(&walk->in, n);
 	n = scatterwalk_clamp(&walk->out, n);

It is just a trial and I am not sure it makes sense. But anyway, we can do
something here to fix the crash resulting from the page allocation failure.

What's your opinion, Herbert?
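The control-flow difference the trial exploits can be modeled in user space like this (a sketch with assumed, simplified names; the real logic is in crypto/blkcipher.c). With bsize = min(block_size, n), a failed page allocation forces n = 0 and hence bsize = 0, so the "n < bsize" slow path is never taken; keeping bsize = block_size makes the walk fall into the slow, kmalloc-backed path instead of the copy path that dereferences the NULL page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum path { PATH_SLOW, PATH_COPY };

/* Models the branch in blkcipher_walk_next(): the slow path allocates a
 * bounce buffer with kmalloc() (and can fail cleanly with -ENOMEM); the
 * copy path memcpy()s into walk->page, which is NULL after a failed
 * page allocation. */
static enum path next_path(size_t block_size, size_t n, bool clamp_bsize)
{
	size_t bsize = clamp_bsize ? (n < block_size ? n : block_size)
				   : block_size;

	if (n < bsize)
		return PATH_SLOW;
	return PATH_COPY;
}
```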

Regards
Kaixu Xia







Re: Kernel panic - encryption/decryption failed when open file on Arm64

2016-09-09 Thread xiakaixu

Hi,

After deeper research into this crash, it seems to be a specific
bug that only exists on armv8 boards. It occurs in this function
in arch/arm64/crypto/aes-glue.c.

static int ctr_encrypt(struct blkcipher_desc *desc, struct scatterlist *dst,
		       struct scatterlist *src, unsigned int nbytes)
{
	...

	desc->flags &= ~CRYPTO_TFM_REQ_MAY_SLEEP;
	blkcipher_walk_init(&walk, dst, src, nbytes);
	err = blkcipher_walk_virt_block(desc, &walk, AES_BLOCK_SIZE); ---> page allocation failed

	...

	while ((blocks = (walk.nbytes / AES_BLOCK_SIZE))) { ---> walk.nbytes = 0, so this loop is skipped
		aes_ctr_encrypt(walk.dst.virt.addr, walk.src.virt.addr,
				(u8 *)ctx->key_enc, rounds, blocks, walk.iv,
				first);
		...
		err = blkcipher_walk_done(desc, &walk,
					  walk.nbytes % AES_BLOCK_SIZE);
	}
	if (nbytes) { ---> we enter this if () statement
		u8 *tdst = walk.dst.virt.addr + blocks * AES_BLOCK_SIZE;
		u8 *tsrc = walk.src.virt.addr + blocks * AES_BLOCK_SIZE;
		...

		aes_ctr_encrypt(tail, tsrc, (u8 *)ctx->key_enc, rounds, ---> the second input parameter is NULL, so crash...
				blocks, walk.iv, first);
		...
	}
	...
}


If the page allocation fails in the function blkcipher_walk_virt_block(),
the variable walk.nbytes is 0, so the while() loop is skipped and we enter
the if (nbytes) statement. But here the variable tsrc is NULL, and it is also
the second input parameter of the function aes_ctr_encrypt()... Kernel panic...

I have also looked at the similar functions in other architectures, and
there if (walk.nbytes) is used, not the if (nbytes) statement found on armv8.
So I think this armv8 function ctr_encrypt() should handle the page
allocation failure situation.
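The guard change suggested above can be modeled in user space as follows (assumed, simplified names; the real tail handling lives in arch/arm64/crypto/aes-glue.c). After a failed allocation the walk reports nbytes == 0 and its mapped addresses are NULL, while the caller's own nbytes still holds the request size, so only the walk-based guard skips the tail safely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal stand-in for struct blkcipher_walk after a failed page
 * allocation: nbytes is 0 and the mapped source address is NULL. */
struct fake_walk {
	size_t nbytes;
	const unsigned char *src;
};

/* Returns true if the CTR tail block would be processed, i.e. the
 * source pointer would be dereferenced by aes_ctr_encrypt(). */
static bool tail_processed(const struct fake_walk *walk, size_t nbytes,
			   bool guard_on_walk_nbytes)
{
	size_t remaining = guard_on_walk_nbytes ? walk->nbytes : nbytes;

	return remaining != 0;
}
```

Guarding on walk->nbytes, as the other architectures do, leaves the tail unprocessed in the failure case instead of handing a NULL source to the cipher.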

Regards
Kaixu Xia



On Thu, Sep 08, 2016 at 08:38:43PM +0800, xiakaixu wrote:

Hi,

I am using the encryption/decryption feature on an arm64 board and a kernel
panic occurs just when opening a file. As the memory size of the board
is limited, there are some page allocation failures before the panic.

Seems it is a kernel bug from the call trace log.

 ...
 - fscrypt_get_encryption_info
   - get_crypt_info.part.1
     - validate_user_key.isra.0
       - derive_aes_gcm_key
         - crypto_gcm_decrypt
           - ablk_decrypt
             - ctr_encrypt
               - blkcipher_walk_done
                 - blkcipher_walk_next
                   - __get_free_pages
                     --> page allocation failure
   ...
 - aes_ctr_encrypt
   --> the input parameter is NULL pointer due to the page allocation failure


The input parameter of the function aes_ctr_encrypt() comes from the
struct blkcipher_walk variable walk, whose page is allocated by
__get_free_pages(). So if this page allocation failed, the input
parameter of aes_ctr_encrypt() will be NULL. The panic occurs
if we don't check the input parameter.

Not sure about this and wish to get your opinions!


If the page allocation fails in blkcipher_walk_next it'll simply
switch over to processing it block by block, so I don't think the
warning is related to the crash.

Cheers,





Re: Kernel panic - encryption/decryption failed when open file on Arm64

2016-09-08 Thread xiakaixu

Sorry for resending this email; I just added linux-cry...@vger.kernel.org
and linux-kernel@vger.kernel.org.


Hi,

Firstly, thanks for your reply!

To reproduce this kernel panic, I tested the encryption/decryption feature
on an arm64 board with more memory, just adding the following change:

diff --git a/crypto/blkcipher.c b/crypto/blkcipher.c
index 0122bec..10ef3f4 100644
--- a/crypto/blkcipher.c
+++ b/crypto/blkcipher.c
@@ -240,6 +240,7 @@ static int blkcipher_walk_next(struct blkcipher_desc *desc,
 		walk->flags |= BLKCIPHER_WALK_COPY;
 		if (!walk->page) {
 			walk->page = (void *)__get_free_page(GFP_ATOMIC);
+			walk->page = NULL;
 			if (!walk->page)
 				n = 0;
 		}


This change just sets walk->page to NULL manually to simulate the allocation
failure. I get the same crash when opening a file with the above change applied,
so I think this NULL-page failure is not handled correctly in the current code.


Regards
Kaixu Xia


On Thu, Sep 08, 2016 at 08:38:43PM +0800, xiakaixu wrote:

Hi,

I am using the encryption/decryption feature on an arm64 board and a kernel
panic occurs just when opening a file. As the memory size of the board
is limited, there are some page allocation failures before the panic.

Seems it is a kernel bug from the call trace log.

 ...
 - fscrypt_get_encryption_info
   - get_crypt_info.part.1
     - validate_user_key.isra.0
       - derive_aes_gcm_key
         - crypto_gcm_decrypt
           - ablk_decrypt
             - ctr_encrypt
               - blkcipher_walk_done
                 - blkcipher_walk_next
                   - __get_free_pages
                     --> page allocation failure
   ...
 - aes_ctr_encrypt
   --> the input parameter is NULL pointer due to the page allocation failure


The input parameter of the function aes_ctr_encrypt() comes from the
struct blkcipher_walk variable walk, whose page is allocated by
__get_free_pages(). So if this page allocation failed, the input
parameter of aes_ctr_encrypt() will be NULL. The panic occurs
if we don't check the input parameter.

Not sure about this and wish to get your opinions!


If the page allocation fails in blkcipher_walk_next it'll simply
switch over to processing it block by block, so I don't think the
warning is related to the crash.

Cheers,





test

2016-09-08 Thread xiakaixu

Sorry for noise!
--
Regards
Kaixu Xia



Re: [PATCHSET v5] Make background writeback great again for the first time

2016-04-27 Thread xiakaixu
On 2016/4/28 4:59, Jens Axboe wrote:
> On Wed, Apr 27 2016, Jens Axboe wrote:
>> On Wed, Apr 27 2016, Jens Axboe wrote:
>>> On 04/27/2016 12:01 PM, Jan Kara wrote:
 Hi,

 On Tue 26-04-16 09:55:23, Jens Axboe wrote:
> Since the dawn of time, our background buffered writeback has sucked.
> When we do background buffered writeback, it should have little impact
> on foreground activity. That's the definition of background activity...
> But for as long as I can remember, heavy buffered writers have not
> behaved like that. For instance, if I do something like this:
>
> $ dd if=/dev/zero of=foo bs=1M count=10k
>
> on my laptop, and then try and start chrome, it basically won't start
> before the buffered writeback is done. Or, for server oriented
> workloads, where installation of a big RPM (or similar) adversely
> impacts database reads or sync writes. When that happens, I get people
> yelling at me.
>
> I have posted plenty of results previously, I'll keep it shorter
> this time. Here's a run on my laptop, using read-to-pipe-async for
> reading a 5g file, and rewriting it. You can find this test program
> in the fio git repo.

 I have tested your patchset on my test system. Generally I have observed
 noticeable drop in average throughput for heavy background writes without
 any other disk activity and also somewhat increased variance in the
 runtimes. It is most visible on this simple testcases:

 dd if=/dev/zero of=/mnt/file bs=1M count=1

 and

 dd if=/dev/zero of=/mnt/file bs=1M count=1 conv=fsync

 The machine has 4GB of ram, /mnt is an ext3 filesystem that is freshly
 created before each dd run on a dedicated disk.

 Without your patches I get pretty stable dd runtimes for both cases:

 dd if=/dev/zero of=/mnt/file bs=1M count=1
 Runtimes: 87.9611 87.3279 87.2554

 dd if=/dev/zero of=/mnt/file bs=1M count=1 conv=fsync
 Runtimes: 93.3502 93.2086 93.541

 With your patches the numbers look like:

 dd if=/dev/zero of=/mnt/file bs=1M count=1
 Runtimes: 108.183, 97.184, 99.9587

 dd if=/dev/zero of=/mnt/file bs=1M count=1 conv=fsync
 Runtimes: 104.9, 102.775, 102.892

 I have checked whether the variance is due to some interaction with CFQ
 which is used for the disk. When I switched the disk to deadline, I still
 get some variance although, the throughput is still ~10% lower:

 dd if=/dev/zero of=/mnt/file bs=1M count=1
 Runtimes: 100.417 100.643 100.866

 dd if=/dev/zero of=/mnt/file bs=1M count=1 conv=fsync
 Runtimes: 104.208 106.341 105.483

 The disk is rotational SATA drive with writeback cache, queue depth of the
 disk reported in /sys/block/sdb/device/queue_depth is 1.

 So I think we still need some tweaking on the low end of the storage
 spectrum so that we don't lose 10% of throughput for simple cases like
 this.
>>>
>>> Thanks for testing, Jan! I haven't tried old QD=1 SATA. I wonder if
>>> you are seeing smaller requests, and that is why it both varies and
>>> you get lower throughput? I'll try and setup a test here similar to
>>> yours.
>>
>> Jan, care to try the below patch? I can't fully reproduce your issue on
>> a SCSI disk limited to QD=1, but I have a feeling this might help. It's
>> a bit of a hack, but the general idea is to allow one more request to
>> build up for QD=1 devices. That eliminates wait time between one request
>> finishing, and the next being submitted.
> 
> That accidentally added a potentially stall, this one is both cleaner
> and should have that fixed.
> 
> diff --git a/lib/wbt.c b/lib/wbt.c
> index 650da911f24f..322f5e04e994 100644
> --- a/lib/wbt.c
> +++ b/lib/wbt.c
> @@ -98,18 +98,23 @@ void __wbt_done(struct rq_wb *rwb)
>   else
>   limit = rwb->wb_normal;
Hi Jens,

This statement 'limit = rwb->wb_normal' is executed twice; maybe once is
enough. It is not a big deal anyway :)


Another question about this if branch:

	if (rwb->wc && !atomic_read(&rwb->bdi->wb.dirty_sleeping))
		limit = 0;

I can't follow the logic of this if branch. Why set limit to 0
when the device supports write-back caching and no tasks are being
throttled in balance_dirty_pages()? Could you please give more info
about this? Thanks!
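For reference, the limit selection being asked about can be modeled like this (a user-space sketch with assumed names; the real logic is in __wbt_done()). The apparent intent is that with a write-back cache and nobody sleeping in balance_dirty_pages(), the wake-up limit drops to 0 so in-flight writeback drains further before waiters are woken.

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of the limit selection in __wbt_done(): wc is whether the
 * device has a write-back cache, dirty_sleeping counts tasks throttled
 * in balance_dirty_pages(), wb_normal is the background limit. */
static int wbt_wake_limit(bool wc, int dirty_sleeping, int wb_normal)
{
	if (wc && dirty_sleeping == 0)
		return 0;	/* drain deeper before waking anyone */
	return wb_normal;	/* normal background writeback limit */
}
```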
>  
> 
> +	inflight = atomic_dec_return(&rwb->inflight);
> +
>  	/*
> -	 * Don't wake anyone up if we are above the normal limit. If
> -	 * throttling got disabled (limit == 0) with waiters, ensure
> -	 * that we wake them up.
> +	 * wbt got disabled with IO in flight. Wake up any potential
> +	 * waiters, we don't have to do more than that.
>  	 */
> -	inflight = atomic_dec_return(&rwb->inflight);
> -	if (limit && inflight >= limit) {
> -		if (!rwb->wb_max)
> -			wake_up_all(&rwb->wait);
> +

Re: [PATCH 7/8] wbt: add general throttling mechanism

2016-04-27 Thread xiakaixu
On 2016/4/27 23:21, Jens Axboe wrote:
> On 04/27/2016 06:06 AM, xiakaixu wrote:
>>> +void __wbt_done(struct rq_wb *rwb)
>>> +{
>>> +int inflight, limit = rwb->wb_normal;
>>> +
>>> +/*
>>> + * If the device does write back caching, drop further down
>>> + * before we wake people up.
>>> + */
>>> +if (rwb->wc && !atomic_read(&rwb->bdi->wb.dirty_sleeping))
>>> +limit = 0;
>>> +else
>>> +limit = rwb->wb_normal;
>>> +
>>> +/*
>>> + * Don't wake anyone up if we are above the normal limit. If
>>> + * throttling got disabled (limit == 0) with waiters, ensure
>>> + * that we wake them up.
>>> + */
>>> +inflight = atomic_dec_return(&rwb->inflight);
>>> +if (limit && inflight >= limit) {
>>> +if (!rwb->wb_max)
>>> +wake_up_all(&rwb->wait);
>>> +return;
>>> +}
>>> +
>> Hi Jens,
>>
>> Just a little confused about this: rwb->wb_max can't be 0 if 'limit'
>> is non-zero, so the if (!rwb->wb_max) branch can never be taken here.
> 
> You are right, it doesn't make a lot of sense. I think it suffers from code 
> shuffling. How about the attached? The important part is that we wake up 
> waiters, if wbt got disabled while we had tracked IO in flight.
>
Hi Jens,

The modified patch in the other mail looks better. Maybe there are still
some places that could be improved; you can find them in that mail.



-- 
Regards
Kaixu Xia



Re: [PATCH 7/8] wbt: add general throttling mechanism

2016-04-27 Thread xiakaixu

> + return rwb && rwb->wb_normal != 0;
> +}
> +
> +/*
> + * Increment 'v', if 'v' is below 'below'. Returns true if we succeeded,
> + * false if 'v' + 1 would be bigger than 'below'.
> + */
> +static bool atomic_inc_below(atomic_t *v, int below)
> +{
> + int cur = atomic_read(v);
> +
> + for (;;) {
> + int old;
> +
> + if (cur >= below)
> + return false;
> + old = atomic_cmpxchg(v, cur, cur + 1);
> + if (old == cur)
> + break;
> + cur = old;
> + }
> +
> + return true;
> +}
> +
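The quoted atomic_inc_below() is a standard compare-and-swap retry loop. As a
minimal userspace sketch of the same idea (C11 atomics with illustrative names,
not the kernel atomic_t API):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Increment *v only while the current value is below 'below'; return
 * false once the cap is reached, mirroring atomic_inc_below() above. */
static bool inc_below(atomic_int *v, int below)
{
    int cur = atomic_load(v);

    for (;;) {
        if (cur >= below)
            return false;
        /* On failure, compare_exchange reloads the current value into
         * 'cur' and we retry, exactly like the cmpxchg loop above. */
        if (atomic_compare_exchange_weak(v, &cur, cur + 1))
            return true;
    }
}
```

The weak compare-exchange may fail spuriously, but the retry loop absorbs that.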
> +static void wb_timestamp(struct rq_wb *rwb, unsigned long *var)
> +{
> + if (rwb_enabled(rwb)) {
> + const unsigned long cur = jiffies;
> +
> + if (cur != *var)
> + *var = cur;
> + }
> +}
> +
> +void __wbt_done(struct rq_wb *rwb)
> +{
> + int inflight, limit = rwb->wb_normal;
> +
> + /*
> +  * If the device does write back caching, drop further down
> +  * before we wake people up.
> +  */
> + if (rwb->wc && !atomic_read(&rwb->bdi->wb.dirty_sleeping))
> + limit = 0;
> + else
> + limit = rwb->wb_normal;
> +
> + /*
> +  * Don't wake anyone up if we are above the normal limit. If
> +  * throttling got disabled (limit == 0) with waiters, ensure
> +  * that we wake them up.
> +  */
> + inflight = atomic_dec_return(&rwb->inflight);
> + if (limit && inflight >= limit) {
> + if (!rwb->wb_max)
> + wake_up_all(&rwb->wait);
> + return;
> + }
> +
Hi Jens,

Just a little confused about this: rwb->wb_max can't be 0 if 'limit' is
non-zero, so the if (!rwb->wb_max) branch can never be taken here.


> + if (waitqueue_active(&rwb->wait)) {
> + int diff = limit - inflight;
> +
> + if (!inflight || diff >= rwb->wb_background / 2)
> + wake_up_nr(&rwb->wait, 1);
> + }
> +}
> +
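To make the control flow being discussed easier to follow, here is a
side-effect-free sketch of the wake-up decision in the quoted __wbt_done()
(an illustration only: the real function also decrements the inflight counter
and performs the actual wake-ups):

```c
#include <stdbool.h>

/* Decision table for the quoted __wbt_done() wake-up logic with the side
 * effects stripped out. Field names mirror struct rq_wb for readability. */
enum wb_wake { WB_NO_WAKE, WB_WAKE_ALL, WB_WAKE_ONE };

static enum wb_wake wbt_wake_decision(int limit, int inflight, int wb_max,
                                      int wb_background, bool have_waiters)
{
    /* At or above a non-zero limit: stay quiet, unless throttling was
     * disabled (wb_max == 0), in which case everyone must be released.
     * As noted above, wb_max == 0 cannot coincide with limit != 0. */
    if (limit && inflight >= limit)
        return wb_max ? WB_NO_WAKE : WB_WAKE_ALL;

    /* Below the limit: wake one waiter if we drained completely or
     * freed up a meaningful chunk of the background depth. */
    if (have_waiters &&
        (!inflight || limit - inflight >= wb_background / 2))
        return WB_WAKE_ONE;

    return WB_NO_WAKE;
}
```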
> +/*
> + * Called on completion of a request. Note that it's also called when
> + * a request is merged, when the request gets freed.
> + */
> +void wbt_done(struct rq_wb *rwb, struct wb_issue_stat *stat)
> +{
> + if (!rwb)
> + return;
> +
> + if (!wbt_tracked(stat)) {
> + if (rwb->sync_cookie == stat) {
> + rwb->sync_issue = 0;
> + rwb->sync_cookie = NULL;
> + }
> +
> + wb_timestamp(rwb, &rwb->last_comp);
> + } else {
> + WARN_ON_ONCE(stat == rwb->sync_cookie);
> + __wbt_done(rwb);
> + wbt_clear_tracked(stat);
> + }
> +}
> +
> +static void calc_wb_limits(struct rq_wb *rwb)
> +{
> + unsigned int depth;
> +
> + if (!rwb->min_lat_nsec) {
> + rwb->wb_max = rwb->wb_normal = rwb->wb_background = 0;
> + return;
> + }
> +
> + depth = min_t(unsigned int, RWB_MAX_DEPTH, rwb->queue_depth);
> +
> + /*
> +  * Reduce max depth by 50%, and re-calculate normal/bg based on that
> +  */
> + rwb->wb_max = 1 + ((depth - 1) >> min(31U, rwb->scale_step));
> + rwb->wb_normal = (rwb->wb_max + 1) / 2;
> + rwb->wb_background = (rwb->wb_max + 3) / 4;
> +}
> +
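The scaling in calc_wb_limits() is simple integer arithmetic: each scale_step
halves the effective depth, wb_normal is about half of wb_max and
wb_background about a quarter. A standalone sketch (RWB_MAX_DEPTH = 64 is an
assumption here, since the constant is not visible in the quoted hunk, and the
min_lat_nsec early-out is omitted):

```c
struct wb_limits { unsigned int max, normal, background; };

/* Mirror of the quoted calc_wb_limits() arithmetic, for illustration. */
static struct wb_limits calc_limits(unsigned int queue_depth,
                                    unsigned int scale_step)
{
    const unsigned int max_depth = 64;  /* assumed value of RWB_MAX_DEPTH */
    unsigned int depth = queue_depth < max_depth ? queue_depth : max_depth;
    unsigned int shift = scale_step < 31u ? scale_step : 31u;
    struct wb_limits l;

    l.max = 1 + ((depth - 1) >> shift); /* each step halves the max depth */
    l.normal = (l.max + 1) / 2;
    l.background = (l.max + 3) / 4;
    return l;
}
```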
> +static bool inline stat_sample_valid(struct blk_rq_stat *stat)
> +{
> + /*
> +  * We need at least one read sample, and a minimum of
> +  * RWB_MIN_WRITE_SAMPLES. We require some write samples to know
> +  * that it's writes impacting us, and not just some sole read on
> +  * a device that is in a lower power state.
> +  */
> + return stat[0].nr_samples >= 1 &&
> + stat[1].nr_samples >= RWB_MIN_WRITE_SAMPLES;
> +}
> +



Re: [PATCH 8/8] writeback: throttle buffered writeback

2016-04-25 Thread xiakaixu
On 2016/4/24 5:37, Jens Axboe wrote:
> On 04/23/2016 02:21 AM, xiakaixu wrote:
>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>> index 40b57bf4852c..d941f69dfb4b 100644
>>> --- a/block/blk-core.c
>>> +++ b/block/blk-core.c
>>> @@ -39,6 +39,7 @@
>>>
>>>   #include "blk.h"
>>>   #include "blk-mq.h"
>>> +#include "blk-wb.h"
>>>
>>>   EXPORT_TRACEPOINT_SYMBOL_GPL(block_bio_remap);
>>>   EXPORT_TRACEPOINT_SYMBOL_GPL(block_rq_remap);
>>> @@ -880,6 +881,7 @@ blk_init_allocated_queue(struct request_queue *q, 
>>> request_fn_proc *rfn,
>>>
>>>   fail:
>>>   blk_free_flush_queue(q->fq);
>>> +blk_wb_exit(q);
>>>   return NULL;
>>>   }
>>>   EXPORT_SYMBOL(blk_init_allocated_queue);
>>> @@ -1395,6 +1397,7 @@ void blk_requeue_request(struct request_queue *q, 
>>> struct request *rq)
>>>   blk_delete_timer(rq);
>>>   blk_clear_rq_complete(rq);
>>>   trace_block_rq_requeue(q, rq);
>>> +blk_wb_requeue(q->rq_wb, rq);
>>>
>>>   if (rq->cmd_flags & REQ_QUEUED)
>>>   blk_queue_end_tag(q, rq);
>>> @@ -1485,6 +1488,8 @@ void __blk_put_request(struct request_queue *q, 
>>> struct request *req)
>>>   /* this is a bio leak */
>>>   WARN_ON(req->bio != NULL);
>>>
>>> +blk_wb_done(q->rq_wb, req);
>>> +
>>>   /*
>>>* Request may not have originated from ll_rw_blk. if not,
>>>* it didn't come out of our reserved rq pools
>>> @@ -1714,6 +1719,7 @@ static blk_qc_t blk_queue_bio(struct request_queue 
>>> *q, struct bio *bio)
>>>   int el_ret, rw_flags, where = ELEVATOR_INSERT_SORT;
>>>   struct request *req;
>>>   unsigned int request_count = 0;
>>> +bool wb_acct;
>>>
>>>   /*
>>>* low level driver can indicate that it wants pages above a
>>> @@ -1766,6 +1772,8 @@ static blk_qc_t blk_queue_bio(struct request_queue 
>>> *q, struct bio *bio)
>>>   }
>>>
>>>   get_rq:
>>> +wb_acct = blk_wb_wait(q->rq_wb, bio, q->queue_lock);
>>> +
>>>   /*
>>>* This sync check and mask will be re-done in 
>>> init_request_from_bio(),
>>>* but we need to set it earlier to expose the sync flag to the
>>> @@ -1781,11 +1789,16 @@ get_rq:
>>>*/
>>>   req = get_request(q, rw_flags, bio, GFP_NOIO);
>>>   if (IS_ERR(req)) {
>>> +if (wb_acct)
>>> +__blk_wb_done(q->rq_wb);
>>>   bio->bi_error = PTR_ERR(req);
>>>   bio_endio(bio);
>>>   goto out_unlock;
>>>   }
>>>
>>> +if (wb_acct)
>>> +req->cmd_flags |= REQ_BUF_INFLIGHT;
>>> +
>>>   /*
>>>* After dropping the lock and possibly sleeping here, our request
>>>* may now be mergeable after it had proven unmergeable (above).
>>> @@ -2515,6 +2528,7 @@ void blk_start_request(struct request *req)
>>>   blk_dequeue_request(req);
>>>
>>>   req->issue_time = ktime_to_ns(ktime_get());
>>> +blk_wb_issue(req->q->rq_wb, req);
>>>
>>>   /*
>>>* We are now handing the request to the hardware, initialize
>>> @@ -2751,6 +2765,7 @@ void blk_finish_request(struct request *req, int 
>>> error)
>>>   blk_unprep_request(req);
>>>
>>>   blk_account_io_done(req);
>>> +blk_wb_done(req->q->rq_wb, req);
>>
>> Hi Jens,
>>
>> Seems the function blk_wb_done() will be executed twice even if the end_io
>> callback is set.
>> Maybe the same thing would happen in blk-mq.c.
> 
> Yeah, that was a mistake, the current version has it fixed. It was 
> inadvertently added when I discovered that the flush request didn't work 
> properly. Now it just duplicates the call inside the check for if it has an 
> ->end_io() defined, since we don't use the normal path for that.
>
Hi Jens,

I have checked the wb-buf-throttle branch in your block git repo; I am not
sure it is the complete version.
It seems only the blk-mq.c side is fixed: blk_wb_done() would still be
executed twice in blk-core.c, via both blk_finish_request() and
__blk_put_request().
Maybe we can add a flag to mark whether blk_wb_done() has already been done.
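Such a flag could be as simple as a per-request bit that makes the completion
accounting idempotent, so a second call (e.g. from __blk_put_request() after
blk_finish_request()) becomes a no-op. A sketch of the idea, with hypothetical
names rather than the real blk-wb API:

```c
#include <stdbool.h>

struct req { bool wb_done; };           /* hypothetical per-request flag */

static int wb_done_calls;               /* counts real accounting work */

/* Idempotent wrapper: only the first call per request does the work. */
static void blk_wb_done_once(struct req *rq)
{
    if (rq->wb_done)
        return;                         /* already accounted for this request */
    rq->wb_done = true;
    wb_done_calls++;                    /* stands in for the real completion work */
}
```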



-- 
Regards
Kaixu Xia



Re: [PATCH 8/8] writeback: throttle buffered writeback

2016-04-23 Thread xiakaixu
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 40b57bf4852c..d941f69dfb4b 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -39,6 +39,7 @@
>  
>  #include "blk.h"
>  #include "blk-mq.h"
> +#include "blk-wb.h"
>  
>  EXPORT_TRACEPOINT_SYMBOL_GPL(block_bio_remap);
>  EXPORT_TRACEPOINT_SYMBOL_GPL(block_rq_remap);
> @@ -880,6 +881,7 @@ blk_init_allocated_queue(struct request_queue *q, 
> request_fn_proc *rfn,
>  
>  fail:
>   blk_free_flush_queue(q->fq);
> + blk_wb_exit(q);
>   return NULL;
>  }
>  EXPORT_SYMBOL(blk_init_allocated_queue);
> @@ -1395,6 +1397,7 @@ void blk_requeue_request(struct request_queue *q, 
> struct request *rq)
>   blk_delete_timer(rq);
>   blk_clear_rq_complete(rq);
>   trace_block_rq_requeue(q, rq);
> + blk_wb_requeue(q->rq_wb, rq);
>  
>   if (rq->cmd_flags & REQ_QUEUED)
>   blk_queue_end_tag(q, rq);
> @@ -1485,6 +1488,8 @@ void __blk_put_request(struct request_queue *q, struct 
> request *req)
>   /* this is a bio leak */
>   WARN_ON(req->bio != NULL);
>  
> + blk_wb_done(q->rq_wb, req);
> +
>   /*
>* Request may not have originated from ll_rw_blk. if not,
>* it didn't come out of our reserved rq pools
> @@ -1714,6 +1719,7 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, 
> struct bio *bio)
>   int el_ret, rw_flags, where = ELEVATOR_INSERT_SORT;
>   struct request *req;
>   unsigned int request_count = 0;
> + bool wb_acct;
>  
>   /*
>* low level driver can indicate that it wants pages above a
> @@ -1766,6 +1772,8 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, 
> struct bio *bio)
>   }
>  
>  get_rq:
> + wb_acct = blk_wb_wait(q->rq_wb, bio, q->queue_lock);
> +
>   /*
>* This sync check and mask will be re-done in init_request_from_bio(),
>* but we need to set it earlier to expose the sync flag to the
> @@ -1781,11 +1789,16 @@ get_rq:
>*/
>   req = get_request(q, rw_flags, bio, GFP_NOIO);
>   if (IS_ERR(req)) {
> + if (wb_acct)
> + __blk_wb_done(q->rq_wb);
>   bio->bi_error = PTR_ERR(req);
>   bio_endio(bio);
>   goto out_unlock;
>   }
>  
> + if (wb_acct)
> + req->cmd_flags |= REQ_BUF_INFLIGHT;
> +
>   /*
>* After dropping the lock and possibly sleeping here, our request
>* may now be mergeable after it had proven unmergeable (above).
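The hunk above follows a charge-first pattern: blk_wb_wait() charges the
throttle accounting before the request is allocated, the error path refunds
it via __blk_wb_done(), and REQ_BUF_INFLIGHT records on the request that the
charge must be paired at completion. A self-contained sketch of that pattern
(all names here are illustrative stand-ins, not the blk-wb API):

```c
#include <stdbool.h>
#include <stddef.h>

struct req { bool charged; };

static int inflight;                    /* throttle accounting */
static bool alloc_fails;                /* test knob: force allocation failure */
static struct req req_pool;

static bool throttle_wait(void)  { inflight++; return true; }  /* ~blk_wb_wait() */
static void throttle_undo(void)  { inflight--; }               /* ~__blk_wb_done() */
static struct req *alloc_request(void) { return alloc_fails ? NULL : &req_pool; }

static struct req *get_request_throttled(void)
{
    bool charged = throttle_wait();     /* charge before allocating */
    struct req *r = alloc_request();

    if (r == NULL) {
        if (charged)
            throttle_undo();            /* refund: no request materialized */
        return NULL;
    }
    r->charged = charged;               /* ~REQ_BUF_INFLIGHT: pair at completion */
    return r;
}
```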
> @@ -2515,6 +2528,7 @@ void blk_start_request(struct request *req)
>   blk_dequeue_request(req);
>  
>   req->issue_time = ktime_to_ns(ktime_get());
> + blk_wb_issue(req->q->rq_wb, req);
>  
>   /*
>* We are now handing the request to the hardware, initialize
> @@ -2751,6 +2765,7 @@ void blk_finish_request(struct request *req, int error)
>   blk_unprep_request(req);
>  
>   blk_account_io_done(req);
> + blk_wb_done(req->q->rq_wb, req);

Hi Jens,

Seems the function blk_wb_done() will be executed twice even if the end_io
callback is set.
Maybe the same thing would happen in blk-mq.c.

>  
>   if (req->end_io)
>   req->end_io(req, error);
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 71b4a13fbf94..c0c5207fe7fd 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -30,6 +30,7 @@
>  #include "blk-mq.h"
>  #include "blk-mq-tag.h"
>  #include "blk-stat.h"
> +#include "blk-wb.h"
>  
>  static DEFINE_MUTEX(all_q_mutex);
>  static LIST_HEAD(all_q_list);
> @@ -275,6 +276,9 @@ static void __blk_mq_free_request(struct blk_mq_hw_ctx 
> *hctx,
>  
>   if (rq->cmd_flags & REQ_MQ_INFLIGHT)
> + atomic_dec(&hctx->nr_active);
> +
> + blk_wb_done(q->rq_wb, rq);
> +
>   rq->cmd_flags = 0;
>  
>   clear_bit(REQ_ATOM_STARTED, >atomic_flags);
> @@ -305,6 +309,7 @@ EXPORT_SYMBOL_GPL(blk_mq_free_request);
>  inline void __blk_mq_end_request(struct request *rq, int error)
>  {
>   blk_account_io_done(rq);
> + blk_wb_done(rq->q->rq_wb, rq);
>  
>   if (rq->end_io) {
>   rq->end_io(rq, error);
> @@ -414,6 +419,7 @@ void blk_mq_start_request(struct request *rq)
>   rq->next_rq->resid_len = blk_rq_bytes(rq->next_rq);
>  
>   rq->issue_time = ktime_to_ns(ktime_get());
> + blk_wb_issue(q->rq_wb, rq);
>  
>   blk_add_timer(rq);
>  
> @@ -450,6 +456,7 @@ static void __blk_mq_requeue_request(struct request *rq)
>   struct request_queue *q = rq->q;
>  
>   trace_block_rq_requeue(q, rq);
> + blk_wb_requeue(q->rq_wb, rq);
>  
>   if (test_and_clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags)) {
>   if (q->dma_drain_size && blk_rq_bytes(rq))
> @@ -1265,6 +1272,7 @@ static blk_qc_t blk_mq_make_request(struct 
> request_queue *q, struct bio *bio)
>   struct blk_plug *plug;
>   struct request *same_queue_rq = NULL;
>   blk_qc_t cookie;
> + bool wb_acct;
>  
>   blk_queue_bounce(q, &bio);
>  
> @@ -1282,9 +1290,17 @@ static 

Re: [PATCH v2] arm64: Store breakpoint single step state into pstate

2016-01-21 Thread xiakaixu
ping...
On 2016/1/15 16:20, xiakaixu wrote:
> On 2016/1/13 1:06, Will Deacon wrote:
>> On Tue, Jan 05, 2016 at 01:06:15PM +0800, Wangnan (F) wrote:
>>> On 2016/1/5 0:55, Will Deacon wrote:
>>>> The problem seems to be that we take the debug exception before the
>>>> breakpointed instruction has been executed and call perf_bp_event at
>>>> that moment, so when we single-step the faulting instruction we actually
>>>> step into the SIGIO handler and end up getting stuck.
>>>>
>>>> Your fix doesn't really address this afaict, in that you don't (can't?)
>>>> handle:
>>>>
>>>>   * A longjmp out of a signal handler
>>>>   * A watchpoint and a breakpoint that fire on the same instruction
>>>>   * User-controlled single-step from a signal handler that enables a
>>>> breakpoint explicitly
>>>>   * Nested signals
>>>
>>> Please have a look at [1], which I improve test__bp_signal() to
>>> check bullet 2 and 4 you mentioned above. Seems my fix is correct.
>>>
>>> [1] 
>>> http://lkml.kernel.org/g/1451969880-14877-1-git-send-email-wangn...@huawei.com
>>
>> I'm still really uneasy about this change. Pairing up the signal delivery
>> with the sigreturn to keep track of the debug state is extremely fragile
>> and I'm not keen on adding this logic there. I also think we need to
>> track the address that the breakpoint is originally taken on so that we
>> can only perform the extra sigreturn work if we're returning to the same
>> instruction. Furthermore, I wouldn't want to do this for signals other
>> than those generated directly by a breakpoint.
>>
>> An alternative would be to postpone the signal delivery until after the
>> stepping has been taken care of, but that's a change in ABI and I worry
>> we'll break somebody relying on the current behaviour.
>>
>> What exactly does x86 do? I couldn't figure it out from the code.
> 
> Hi Will,
> 
> I changed the signal SIGIO to SIGUSR2 according to the patch that Wang Nan
> sent out, which improves test__bp_signal() to check bullets 2 and 4 you
> mentioned.
> I tested it with arm64 qemu and gdb. Single-stepping on qemu shows that
> the result is the same as the processing described in Wang Nan's patch[2].
> 
> I also tested the patch on x86 qemu and found that the result is the same
> as on arm64 qemu.
> 
> [1]
>  tools/perf/tests/bp_signal.c |6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/tools/perf/tests/bp_signal.c b/tools/perf/tests/bp_signal.c
> index 1d1bb48..3046cba 100644
> --- a/tools/perf/tests/bp_signal.c
> +++ b/tools/perf/tests/bp_signal.c
> @@ -175,7 +175,7 @@ int test__bp_signal(int subtest __maybe_unused)
> sa.sa_sigaction = (void *) sig_handler;
> sa.sa_flags = SA_SIGINFO;
> 
> -   if (sigaction(SIGIO, , NULL) < 0) {
> +   if (sigaction(SIGUSR2, , NULL) < 0) {
> pr_debug("failed setting up signal handler\n");
> return TEST_FAIL;
> }
> @@ -237,9 +237,9 @@ int test__bp_signal(int subtest __maybe_unused)
>  *
>  */
> 
> -   fd1 = bp_event(__test_function, SIGIO);
> +   fd1 = bp_event(__test_function, SIGUSR2);
> fd2 = bp_event(sig_handler, SIGUSR1);
> -   fd3 = wp_event((void *)&the_var, SIGIO);
> +   fd3 = wp_event((void *)&the_var, SIGUSR2);
>
> ioctl(fd1, PERF_EVENT_IOC_ENABLE, 0);
> ioctl(fd2, PERF_EVENT_IOC_ENABLE, 0);
> 
> [2]
>  * Following processing should happen:
>  *   Exec:   Action:   Result:
>  *   incq (%rdi)   - fd1 event breakpoint hit   -> count1 == 1
>  * - SIGIO is delivered
>  *   sig_handler   - fd2 event breakpoint hit   -> count2 == 1
>  * - SIGUSR1 is delivered
>  *   sig_handler_2  -> overflows_2 == 1  (nested signal)
>  *   sys_rt_sigreturn  - return from sig_handler_2
>  *   overflows++   -> overflows == 1
>  *   sys_rt_sigreturn  - return from sig_handler
>  *   incq (%rdi)   - fd3 event watchpoint hit   -> count3 == 1
>(wp and bp in one insn)
>  * - SIGIO is delivered
>  *   sig_handler   - fd2 event breakpoint hit   -> count2 == 2
>  * - SIGUSR1 is delivered
>  *   sig_handler_2  -> 

Re: [PATCH v2] arm64: Store breakpoint single step state into pstate

2016-01-21 Thread xiakaixu
ping...
于 2016/1/15 16:20, xiakaixu 写道:
> 于 2016/1/13 1:06, Will Deacon 写道:
>> On Tue, Jan 05, 2016 at 01:06:15PM +0800, Wangnan (F) wrote:
>>> On 2016/1/5 0:55, Will Deacon wrote:
>>>> The problem seems to be that we take the debug exception before the
>>>> breakpointed instruction has been executed and call perf_bp_event at
>>>> that moment, so when we single-step the faulting instruction we actually
>>>> step into the SIGIO handler and end up getting stuck.
>>>>
>>>> Your fix doesn't really address this afaict, in that you don't (can't?)
>>>> handle:
>>>>
>>>>   * A longjmp out of a signal handler
>>>>   * A watchpoint and a breakpoint that fire on the same instruction
>>>>   * User-controlled single-step from a signal handler that enables a
>>>> breakpoint explicitly
>>>>   * Nested signals
>>>
>>> Please have a look at [1], which I improve test__bp_signal() to
>>> check bullet 2 and 4 you mentioned above. Seems my fix is correct.
>>>
>>> [1] 
>>> http://lkml.kernel.org/g/1451969880-14877-1-git-send-email-wangn...@huawei.com
>>
>> I'm still really uneasy about this change. Pairing up the signal delivery
>> with the sigreturn to keep track of the debug state is extremely fragile
>> and I'm not keen on adding this logic there. I also think we need to
>> track the address that the breakpoint is originally taken on so that we
>> can only perform the extra sigreturn work if we're returning to the same
>> instruction. Furthermore, I wouldn't want to do this for signals other
>> than those generated directly by a breakpoint.
>>
>> An alternative would be to postpone the signal delivery until after the
>> stepping has been taken care of, but that's a change in ABI and I worry
>> we'll break somebody relying on the current behaviour.
>>
>> What exactly does x86 do? I couldn't figure it out from the code.
> 
> Hi Will,
> 
> I changed the signal SIGIO to SIGUSR2 according to the patch Wang Nan
> sent out, which improves test__bp_signal() to check bullets 2 and 4 that
> you mentioned.
> I tested it with arm64 qemu and gdb. Single-stepping the instructions on qemu
> shows that the result matches the processing described in Wang Nan's
> patch [2].
> 
> I also tested the patch on x86 qemu and found that the result is the same as
> on arm64 qemu.
> 
> [1]
>  tools/perf/tests/bp_signal.c |6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/tools/perf/tests/bp_signal.c b/tools/perf/tests/bp_signal.c
> index 1d1bb48..3046cba 100644
> --- a/tools/perf/tests/bp_signal.c
> +++ b/tools/perf/tests/bp_signal.c
> @@ -175,7 +175,7 @@ int test__bp_signal(int subtest __maybe_unused)
> sa.sa_sigaction = (void *) sig_handler;
> sa.sa_flags = SA_SIGINFO;
> 
> -   if (sigaction(SIGIO, &sa, NULL) < 0) {
> +   if (sigaction(SIGUSR2, &sa, NULL) < 0) {
> pr_debug("failed setting up signal handler\n");
> return TEST_FAIL;
> }
> @@ -237,9 +237,9 @@ int test__bp_signal(int subtest __maybe_unused)
>  *
>  */
> 
> -   fd1 = bp_event(__test_function, SIGIO);
> +   fd1 = bp_event(__test_function, SIGUSR2);
> fd2 = bp_event(sig_handler, SIGUSR1);
> -   fd3 = wp_event((void *)&the_var, SIGIO);
> +   fd3 = wp_event((void *)&the_var, SIGUSR2);
>
> ioctl(fd1, PERF_EVENT_IOC_ENABLE, 0);
> ioctl(fd2, PERF_EVENT_IOC_ENABLE, 0);
> 
> [2]
>  * Following processing should happen:
>  *   Exec:   Action:   Result:
>  *   incq (%rdi)   - fd1 event breakpoint hit   -> count1 == 1
>  * - SIGIO is delivered
>  *   sig_handler   - fd2 event breakpoint hit   -> count2 == 1
>  * - SIGUSR1 is delivered
>  *   sig_handler_2-> overflows_2 == 1  (nested signal)
>  *   sys_rt_sigreturn  - return from sig_handler_2
>  *   overflows++-> overflows = 1
>  *   sys_rt_sigreturn  - return from sig_handler
>  *   incq (%rdi)   - fd3 event watchpoint hit   -> count3 == 1
>(wp and bp in one insn)
>  * - SIGIO is delivered
>  *   sig_handler   - fd2 event breakpoint hit   -> count2 == 2
>  * - SIGUSR1 is delivered
>  *   sig_handler_2  -> 

Re: [BUG] perf test 21("Test object code reading") failure on ARM64

2015-12-20 Thread xiakaixu
On 2015/12/20 8:25, Jan Stancek wrote:
> On Sat, Dec 19, 2015 at 11:04:21AM +0800, xiakaixu wrote:
>>
>>>>>...
>>>
>>> Hi,
>>>
>>> What is your objdump version?
>>
>> Hi,
>>
>> Sorry for the late reply.
>>
>> # objdump --version
>> GNU objdump (GNU Binutils) 2.25.
>>
>> I am sure that the system is Little endian.
>>>
> 
> I have attached a patch if you care to try it with your setup.
> If it still fails, output from -v and last objdump command output
> would be helpful.

Hi,

After applying this patch, the perf test case passed.

# perf test 21
21: Test object code reading : (no vmlinux) Ok

Thanks!
> 
> Regards,
> Jan
> 


-- 
Regards
Kaixu Xia

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [BUG] perf test 21("Test object code reading") failure on ARM64

2015-12-18 Thread xiakaixu

>>>...
> 
> Hi,
> 
> What is your objdump version?

Hi,

Sorry for the late reply.

# objdump --version
GNU objdump (GNU Binutils) 2.25.

I am sure that the system is little-endian.
> 
>>>
>>> So the following patch is needed.
>>> diff --git a/tools/perf/tests/code-reading.c
>>> b/tools/perf/tests/code-reading.c
>>> index a767a64..1b55fa0 100644
>>> --- a/tools/perf/tests/code-reading.c
>>> +++ b/tools/perf/tests/code-reading.c
>>> @@ -61,9 +61,6 @@ static size_t read_objdump_line(const char *line, size_t
>>> line_len, void *buf,
>>> if (i >= line_len || !isxdigit(line[i]))
>>> break;
>>> c2 = line[i++];
>>> -   /* Followed by a space */
>>> -   if (i < line_len && line[i] && !isspace(line[i]))
>>> -   break;
>>> /* Store byte */
>>> *(unsigned char *)buf = (hex(c1) << 4) | hex(c2);
>>>buf += 1;
>>>
>>> After applying this patch, the test still failed.
>>> ##
>>>   ...
>>>   Objdump command is: objdump -z -d --start-address=0x7c4c4
>>>   --stop-address=0x7c544 /tmp/oxygen_root-root/lib64/libc-2.19-2014.08.so
>>>   Bytes read differ from those read by objdump
>>>   buf1 (dso):
>>>   0x00 0x00 0x80 0xd2 0xd5 0xff 0xff 0x17 0xe0 0x03 0x19 0xaa 0xd3 0xff
>>>   0xff 0x17
>>>   0xe1 0x03 0x14 0xaa 0xa2 0x63 0x02 0x91 0xe0 0x03 0x13 0xaa 0x66 0xfe
>>>   0xff 0x97
>>>   0xfc 0x03 0x00 0xaa 0xa0 0x4f 0x40 0xf9 0xe2 0x03 0x1c 0xaa 0xe1 0x03
>>>   0x00 0xaa
>>>   0x08 0x00 0x67 0x9e 0x61 0x02 0x01 0x8b 0xe0 0x03 0x13 0xaa 0x60 0x01
>>>   0x00 0x94
>>>   0xe0 0xf9 0xff 0x35 0x95 0x07 0x00 0xd1 0x1b 0x00 0x80 0xd2 0x01 0x01
>>>   0x66 0x9e
>>>   0x60 0x02 0x15 0x8b 0x17 0x00 0x1c 0xcb 0xf8 0x03 0x1b 0xaa 0x0a 0x00
>>>   0x67 0x9e
>>>   0x20 0x00 0x80 0xd2 0x00 0x00 0x1c 0xcb 0x81 0x02 0x01 0xcb 0x09 0x00
>>>   0x67 0x9e
>>>   0x2b 0x00 0x67 0x9e 0x16 0x03 0x14 0x8b 0x20 0x03 0x1a 0x8b 0x01 0x00
>>>   0x80 0x52
>>>
>>>   buf2 (objdump):
>>>   0xd2 0x80 0x00 0x00 0x17 0xff 0xff 0xd5 0xaa 0x19 0x03 0xe0 0x17 0xff
>>>   0xff 0xd3
>>>   0xaa 0x14 0x03 0xe1 0x91 0x02 0x63 0xa2 0xaa 0x13 0x03 0xe0 0x97 0xff
>>>   0xfe 0x66
>>>   0xaa 0x00 0x03 0xfc 0xf9 0x40 0x4f 0xa0 0xaa 0x1c 0x03 0xe2 0xaa 0x00
>>>   0x03 0xe1
>>>   0x9e 0x67 0x00 0x08 0x8b 0x01 0x02 0x61 0xaa 0x13 0x03 0xe0 0x94 0x00
>>>   0x01 0x60
>>>   0x35 0xff 0xf9 0xe0 0xd1 0x00 0x07 0x95 0xd2 0x80 0x00 0x1b 0x9e 0x66
>>>   0x01 0x01
>>>   0x8b 0x15 0x02 0x60 0xcb 0x1c 0x00 0x17 0xaa 0x1b 0x03 0xf8 0x9e 0x67
>>>   0x00 0x0a
>>>   0xd2 0x80 0x00 0x20 0xcb 0x1c 0x00 0x00 0xcb 0x01 0x02 0x81 0x9e 0x67
>>>   0x00 0x09
>>>   0x9e 0x67 0x00 0x2b 0x8b 0x14 0x03 0x16 0x8b 0x1a 0x03 0x20 0x52 0x80
>>>   0x00 0x01
> 
> The data appears to match, but the endian is different.
> 
> Regards,
> Jan
> 
>>>
>>>   test child finished with -1
>>>    end 
>>>   Test object code reading: FAILED!
>>> ##
>>>
>>> It seems the DSO file format differs between x86 and ARM64.
>>> Maybe this test case only works on x86.
>>> --
>>> Regards
>>> Kaixu Xia
>>>
>>
> 
> .
> 


-- 
Regards
Kaixu Xia




[BUG] perf test 21("Test object code reading") failure on ARM64

2015-12-17 Thread xiakaixu
Hi,

Perf test "Test object code reading" failed on an ARM64 board; the test log is
below.

##
# perf test -v 21
  ...
  File is: /tmp/oxygen_root-root/lib64/libc-2.19-2014.08.so
  On file address is: 0x70c3c
  Objdump command is: objdump -z -d --start-address=0x70c3c 
--stop-address=0x70cbc /tmp/oxygen_root-root/lib64/libc-2.19-2014.08.so
  objdump read too few bytes
  Bytes read differ from those read by objdump
  buf1 (dso):
  0x16 0x58 0x41 0xb9 0xf6 0x0e 0x00 0x35 0x00 0x00 0x00 0x90 0xbf 0x27 0x00 
0xf9
  0x00 0x40 0x22 0x91 0xa0 0x23 0x00 0xf9 0xb4 0x06 0x00 0xf0 0x55 0xd0 0x3b 
0xd5
  0x80 0xe2 0x14 0x91 0xb5 0xc2 0x1b 0xd1 0x00 0x60 0x00 0x91 0x01 0x04 0x40 
0xf9
  0x3f 0x00 0x15 0xeb 0xc0 0x01 0x00 0x54 0xbf 0x3f 0x00 0xb9 0x21 0x00 0x80 
0x52
  0x02 0xfc 0x5f 0x88 0x5f 0x00 0x1f 0x6b 0x61 0x00 0x00 0x54 0x01 0x7c 0x03 
0x88
  0x83 0xff 0xff 0x35 0x40 0x00 0x00 0x54 0xa2 0x3f 0x00 0xb9 0xa0 0x3f 0x40 
0xb9
  0xc0 0x0f 0x00 0x35 0x80 0xe2 0x14 0x91 0x15 0x10 0x00 0xf9 0x80 0xe2 0x14 
0x91
  0x62 0x02 0x40 0xb9 0x44 0x00 0x11 0x12 0x01 0x1c 0x40 0xb9 0x13 0x08 0x00 
0xf9

  buf2 (objdump):
  0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 
0x00
  0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 
0x00
  0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 
0x00
  0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 
0x00
  0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 
0x00
  0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 
0x00
  0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 
0x00
  0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 
0x00

  test child finished with -1
   end 
  Test object code reading: FAILED!
##

The output of objdump on ARM64 looks like this:
   ...
   70c40:   35000ef6    cbnz    w22, 70e1c <_IO_link_in+0x208>
   70c44:   9000        adrp    x0, 7 <_IO_file_fopen+0x164>
   70c48:   f90027bf    str     xzr, [x29,#72]
   70c4c:   91224000    add     x0, x0, #0x890
   70c50:   f90023a0    str     x0, [x29,#64]
   70c54:   f6b4        adrp    x20, 147000 <__abort_msg+0x588>
   ...

So the following patch is needed.
diff --git a/tools/perf/tests/code-reading.c b/tools/perf/tests/code-reading.c
index a767a64..1b55fa0 100644
--- a/tools/perf/tests/code-reading.c
+++ b/tools/perf/tests/code-reading.c
@@ -61,9 +61,6 @@ static size_t read_objdump_line(const char *line, size_t 
line_len, void *buf,
if (i >= line_len || !isxdigit(line[i]))
break;
c2 = line[i++];
-   /* Followed by a space */
-   if (i < line_len && line[i] && !isspace(line[i]))
-   break;
/* Store byte */
*(unsigned char *)buf = (hex(c1) << 4) | hex(c2);
   buf += 1;

After applying this patch, the test still failed.
##
  ...
  Objdump command is: objdump -z -d --start-address=0x7c4c4 
--stop-address=0x7c544 /tmp/oxygen_root-root/lib64/libc-2.19-2014.08.so
  Bytes read differ from those read by objdump
  buf1 (dso):
  0x00 0x00 0x80 0xd2 0xd5 0xff 0xff 0x17 0xe0 0x03 0x19 0xaa 0xd3 0xff 0xff 
0x17
  0xe1 0x03 0x14 0xaa 0xa2 0x63 0x02 0x91 0xe0 0x03 0x13 0xaa 0x66 0xfe 0xff 
0x97
  0xfc 0x03 0x00 0xaa 0xa0 0x4f 0x40 0xf9 0xe2 0x03 0x1c 0xaa 0xe1 0x03 0x00 
0xaa
  0x08 0x00 0x67 0x9e 0x61 0x02 0x01 0x8b 0xe0 0x03 0x13 0xaa 0x60 0x01 0x00 
0x94
  0xe0 0xf9 0xff 0x35 0x95 0x07 0x00 0xd1 0x1b 0x00 0x80 0xd2 0x01 0x01 0x66 
0x9e
  0x60 0x02 0x15 0x8b 0x17 0x00 0x1c 0xcb 0xf8 0x03 0x1b 0xaa 0x0a 0x00 0x67 
0x9e
  0x20 0x00 0x80 0xd2 0x00 0x00 0x1c 0xcb 0x81 0x02 0x01 0xcb 0x09 0x00 0x67 
0x9e
  0x2b 0x00 0x67 0x9e 0x16 0x03 0x14 0x8b 0x20 0x03 0x1a 0x8b 0x01 0x00 0x80 
0x52

  buf2 (objdump):
  0xd2 0x80 0x00 0x00 0x17 0xff 0xff 0xd5 0xaa 0x19 0x03 0xe0 0x17 0xff 0xff 
0xd3
  0xaa 0x14 0x03 0xe1 0x91 0x02 0x63 0xa2 0xaa 0x13 0x03 0xe0 0x97 0xff 0xfe 
0x66
  0xaa 0x00 0x03 0xfc 0xf9 0x40 0x4f 0xa0 0xaa 0x1c 0x03 0xe2 0xaa 0x00 0x03 
0xe1
  0x9e 0x67 0x00 0x08 0x8b 0x01 0x02 0x61 0xaa 0x13 0x03 0xe0 0x94 0x00 0x01 
0x60
  0x35 0xff 0xf9 0xe0 0xd1 0x00 0x07 0x95 0xd2 0x80 0x00 0x1b 0x9e 0x66 0x01 
0x01
  0x8b 0x15 0x02 0x60 0xcb 0x1c 0x00 0x17 0xaa 0x1b 0x03 0xf8 0x9e 0x67 0x00 
0x0a
  0xd2 0x80 0x00 0x20 0xcb 0x1c 0x00 0x00 0xcb 0x01 0x02 0x81 0x9e 0x67 0x00 
0x09
  0x9e 0x67 0x00 0x2b 0x8b 0x14 0x03 0x16 0x8b 0x1a 0x03 0x20 0x52 0x80 0x00 
0x01

  test child finished with -1
   end 
  Test object code reading: FAILED!

Re: [PATCH V5 1/1] bpf: control events stored in PERF_EVENT_ARRAY maps trace data output when perf sampling

2015-10-27 Thread xiakaixu
On 2015/10/23 23:12, Peter Zijlstra wrote:
> On Fri, Oct 23, 2015 at 02:52:11PM +0200, Peter Zijlstra wrote:
>> On Thu, Oct 22, 2015 at 06:28:22PM +0800, Wangnan (F) wrote:
>>> information to analyze when a glitch happens. Another way we are trying
>>> to implement now is to dynamically turn events on and off, or at least
>>> enable/disable sampling dynamically, because the overhead of copying those
>>> samples is a big part of perf's total overhead. After that we can trace as
>>> many events as possible, but only fetch data from them when we detect a
>>> glitch.
>>
>> So why don't you 'fix' the flight recorder mode and just leave the data
>> in memory and not bother copying it out until a glitch happens?
>>
>> Something like this:
>>
>> lkml.kernel.org/r/20130708121557.ga17...@twins.programming.kicks-ass.net
>>
>> it appears we never quite finished that.
> 
> Updated to current sources, compile tested only.
> 
> It obviously needs testing and performance numbers..  and some
> userspace.
> 
> ---
> Subject: perf: Update event buffer tail when overwriting old events
> From: Peter Zijlstra 
> 
>> From: "Yan, Zheng" 
>>
>> If the perf event buffer is in overwrite mode, the kernel only updates
>> the data head when it overwrites old samples. The program that owns
>> the buffer needs to periodically check the buffer and update a variable
>> that tracks the data tail. If the program fails to do this in time,
>> the data tail can be overwritten by new samples. The program then has to
>> rewind the buffer because it does not know where the first valid
>> sample is.
>>
>> This patch makes the kernel update the data tail when it overwrites
>> old events. So the program that owns the event buffer can always
>> read the latest samples. This is convenient for programs that use
>> perf to do branch tracing. One use case is GDB branch tracing:
>> (http://sourceware.org/ml/gdb-patches/2012-06/msg00172.html)
>> It uses the perf interface to read BTS, but only cares about the branches
>> before the ptrace event.
> 
> Original-patch-by: "Yan, Zheng" 
> Signed-off-by: Peter Zijlstra (Intel) 
> ---
>  arch/x86/kernel/cpu/perf_event_intel_ds.c |2 
>  include/linux/perf_event.h|6 --
>  kernel/events/core.c  |   56 +
>  kernel/events/internal.h  |2 
>  kernel/events/ring_buffer.c   |   77 
> +-
>  5 files changed, 107 insertions(+), 36 deletions(-)
> 
> --- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
> +++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
> @@ -1140,7 +1140,7 @@ static void __intel_pmu_pebs_event(struc
>  
>   while (count > 1) {
>   setup_pebs_sample_data(event, iregs, at, &data, &regs);
> - perf_event_output(event, &data, &regs);
> + event->overflow_handler(event, &data, &regs);
>   at += x86_pmu.pebs_record_size;
>   at = get_next_pebs_record_by_bit(at, top, bit);
>   count--;
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -828,10 +828,6 @@ extern int perf_event_overflow(struct pe
>struct perf_sample_data *data,
>struct pt_regs *regs);
>  
> -extern void perf_event_output(struct perf_event *event,
> - struct perf_sample_data *data,
> - struct pt_regs *regs);
> -
>  extern void
>  perf_event_header__init_id(struct perf_event_header *header,
>  struct perf_sample_data *data,
> @@ -1032,6 +1028,8 @@ static inline bool has_aux(struct perf_e
>  
>  extern int perf_output_begin(struct perf_output_handle *handle,
>struct perf_event *event, unsigned int size);
> +extern int perf_output_begin_overwrite(struct perf_output_handle *handle,
> +  struct perf_event *event, unsigned int size);
>  extern void perf_output_end(struct perf_output_handle *handle);
>  extern unsigned int perf_output_copy(struct perf_output_handle *handle,
>const void *buf, unsigned int len);
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -4515,6 +4515,8 @@ static int perf_mmap_fault(struct vm_are
>   return ret;
>  }
>  
> +static void perf_event_set_overflow(struct perf_event *event, struct 
> ring_buffer *rb);
> +
>  static void ring_buffer_attach(struct perf_event *event,
>  struct ring_buffer *rb)
>  {
> @@ -4546,6 +4548,8 @@ static void ring_buffer_attach(struct pe
>   spin_lock_irqsave(&rb->event_lock, flags);
>   list_add_rcu(&event->rb_entry, &rb->event_list);
>   spin_unlock_irqrestore(&rb->event_lock, flags);
> +
> + perf_event_set_overflow(event, rb);
>   }
>  
>   rcu_assign_pointer(event->rb, rb);
> @@ -5579,9 +5583,12 @@ void perf_prepare_sample(struct perf_eve
>   }
>  }
>  
> -void perf_event_output(struct perf_event *event,
> - struct 

Re: [PATCH V5 1/1] bpf: control events stored in PERF_EVENT_ARRAY maps trace data output when perf sampling

2015-10-21 Thread xiakaixu
On 2015/10/21 17:12, Peter Zijlstra wrote:
> On Tue, Oct 20, 2015 at 03:53:02PM -0700, Alexei Starovoitov wrote:
>> On 10/20/15 12:22 AM, Kaixu Xia wrote:
>>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>>> index b11756f..5219635 100644
>>> --- a/kernel/events/core.c
>>> +++ b/kernel/events/core.c
>>> @@ -6337,6 +6337,9 @@ static int __perf_event_overflow(struct perf_event 
>>> *event,
>>> irq_work_queue(&event->pending);
>>> }
>>>
>>> +   if (unlikely(!atomic_read(&event->soft_enable)))
>>> +   return 0;
>>> +
>>> if (event->overflow_handler)
>>> event->overflow_handler(event, data, regs);
>>> else
>>
>> Peter,
>> does this part look right or it should be moved right after
>> if (unlikely(!is_sampling_event(event)))
>> return 0;
>> or even to other function?
>>
>> It feels to me that it should be moved, since we probably don't
>> want to activate throttling, period adjustment and event_limit for events
>> that are in soft_disabled state.
> 
> Depends on what its meant to do. As long as you let the interrupt
> happen, I think we should in fact do those things (maybe not the
> event_limit), but period adjustment and esp. throttling are important
> when the event is enabled.
> 
> If you want to actually disable the event: pmu->stop() will make it
> stop, and you can restart using pmu->start().
> 
> And I suppose you can wrap that with a counter if you need nesting.
> 
> I'm not sure if any of that is a viable solution, because the patch
> description is somewhat short on the problem statement.
> 
> As is, I'm not too charmed with the patch, but lacking a better
> understanding of what exactly we're trying to achieve I'm struggling
> with proposing alternatives.

Thanks for your comments!
The RFC patch set contains the necessary commit log [1].

In some scenarios we don't want to output trace data during perf sampling,
in order to reduce overhead. For example, perf can run as a daemon and dump
trace data only when necessary, such as when system performance goes down.
Just like the example given in the cover letter, we only receive the
samples within the sys_write() syscall.

The helper bpf_perf_event_control() in this patch set can control the
data output process and capture the samples we are most interested in.
A cpu_function_call is probably too much to do from a bpf program, so
I chose the current 'soft_disable'-like design.

[1] https://lkml.org/lkml/2015/10/12/135
> 
> .
> 




Re: [PATCH V5 1/1] bpf: control events stored in PERF_EVENT_ARRAY maps trace data output when perf sampling

2015-10-21 Thread xiakaixu
于 2015/10/21 17:12, Peter Zijlstra 写道:
> On Tue, Oct 20, 2015 at 03:53:02PM -0700, Alexei Starovoitov wrote:
>> On 10/20/15 12:22 AM, Kaixu Xia wrote:
>>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>>> index b11756f..5219635 100644
>>> --- a/kernel/events/core.c
>>> +++ b/kernel/events/core.c
>>> @@ -6337,6 +6337,9 @@ static int __perf_event_overflow(struct perf_event *event,
>>> irq_work_queue(&event->pending);
>>> }
>>>
>>> +   if (unlikely(!atomic_read(&event->soft_enable)))
>>> +   return 0;
>>> +
>>> if (event->overflow_handler)
>>> event->overflow_handler(event, data, regs);
>>> else
>>
>> Peter,
>> does this part look right or it should be moved right after
>> if (unlikely(!is_sampling_event(event)))
>> return 0;
>> or even to other function?
>>
>> It feels to me that it should be moved, since we probably don't
>> want to active throttling, period adjust and event_limit for events
>> that are in soft_disabled state.
> 
> Depends on what its meant to do. As long as you let the interrupt
> happen, I think we should in fact do those things (maybe not the
> event_limit), but period adjustment and esp. throttling are important
> when the event is enabled.
> 
> If you want to actually disable the event: pmu->stop() will make it
> stop, and you can restart using pmu->start().
> 
> And I suppose you can wrap that with a counter if you need nesting.
> 
> I'm not sure if any of that is a viable solution, because the patch
> description is somewhat short on the problem statement.
> 
> As is, I'm not too charmed with the patch, but lacking a better
> understanding of what exactly we're trying to achieve I'm struggling
> with proposing alternatives.

Thanks for your comments!
The RFC patch set contains the necessary commit log [1].

In some scenarios we don't want to output trace data during perf sampling,
in order to reduce overhead. For example, perf can run as a daemon and dump
trace data only when necessary, such as when system performance goes down.
Just like the example given in the cover letter, we only receive the
samples within the sys_write() syscall.

The helper bpf_perf_event_control() in this patch set can control the
data output process so that we get the samples we are most interested in.
The cpu_function_call is probably too much to do from a bpf program, so
I chose the current 'soft_disable'-like design.

[1] https://lkml.org/lkml/2015/10/12/135
> 
> .
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH V4 1/1] bpf: control events stored in PERF_EVENT_ARRAY maps trace data output when perf sampling

2015-10-19 Thread xiakaixu
On 2015/10/20 10:14, Alexei Starovoitov wrote:
> On 10/19/15 3:37 AM, Kaixu Xia wrote:
>> +/* flags for PERF_EVENT_ARRAY maps */
>> +enum {
>> +BPF_EVENT_CTL_BIT_CUR = 0,
>> +BPF_EVENT_CTL_BIT_ALL = 1,
>> +__NR_BPF_EVENT_CTL_BITS,
>> +};
>> +
>> +#define BPF_CTL_BIT_FLAG_MASK \
>> +((1ULL << __NR_BPF_EVENT_CTL_BITS) - 1)
>> +#define BPF_CTL_BIT_DUMP_CUR \
>> +(1ULL << BPF_EVENT_CTL_BIT_CUR)
>> +#define BPF_CTL_BIT_DUMP_ALL \
>> +(1ULL << BPF_EVENT_CTL_BIT_ALL)
>> +
> 
> the above shouldn't be part of uapi header. It can stay in bpf_trace.c
> Just document these bits next to helper similar to skb_store_bytes()
> 
> The rest looks ok.
> It still needs an ack from Peter for perf_event bits

Thanks for your comments!
This part will be moved to bpf_trace.c in the next version.
> 
> 
> .
> 




Re: [PATCH V3 1/2] bpf: control the trace data output on current cpu when perf sampling

2015-10-18 Thread xiakaixu
On 2015/10/17 6:06, Alexei Starovoitov wrote:
> On 10/16/15 12:42 AM, Kaixu Xia wrote:
>> This patch adds the flag dump_enable to control the trace data
>> output process when perf sampling. By setting this flag and
>> integrating with ebpf, we can control the data output process and
>> get the samples we are most interested in.
>>
>> The bpf helper bpf_perf_event_dump_control() can control the
>> perf_event on current cpu.
>>
>> Signed-off-by: Kaixu Xia 
>> ---
>>   include/linux/perf_event.h  |  1 +
>>   include/uapi/linux/bpf.h|  5 +
>>   include/uapi/linux/perf_event.h |  3 ++-
>>   kernel/bpf/verifier.c   |  3 ++-
>>   kernel/events/core.c| 13 
>>   kernel/trace/bpf_trace.c| 44 
>> +
>>   6 files changed, 67 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>> index 092a0e8..2af527e 100644
>> --- a/include/linux/perf_event.h
>> +++ b/include/linux/perf_event.h
>> @@ -472,6 +472,7 @@ struct perf_event {
>>   struct irq_workpending;
>>
>>   atomic_t        event_limit;
>> + atomic_t        dump_enable;
> 
> The naming is the hardest...
> How about calling it 'soft_enable' instead?
> 
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -287,6 +287,11 @@ enum bpf_func_id {
>>* Return: realm if != 0
>>*/
>>   BPF_FUNC_get_route_realm,
>> +
>> +/**
>> + * u64 bpf_perf_event_dump_control(&map, index, flag)
>> + */
>> +BPF_FUNC_perf_event_dump_control,
> 
> and this one is too long.
> May be bpf_perf_event_control() ?
> 
> Daniel, any thoughts on naming?
> 
>> --- a/include/uapi/linux/perf_event.h
>> +++ b/include/uapi/linux/perf_event.h
>> @@ -331,7 +331,8 @@ struct perf_event_attr {
>>   comm_exec  :  1, /* flag comm events that are due to 
>> an exec */
>>   use_clockid:  1, /* use @clockid for time fields */
>>   context_switch :  1, /* context switch data */
>> -__reserved_1   : 37;
>> +dump_enable:  1, /* don't output data on samples */
> 
> either comment or name is wrong.
> how about calling this one 'soft_disable',
> since you want zero to be default and the event should be on.
> 
>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>> index b11756f..74a16af 100644
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
>> @@ -6337,6 +6337,9 @@ static int __perf_event_overflow(struct perf_event *event,
>>   irq_work_queue(&event->pending);
>>   }
>>
>> +if (!atomic_read(&event->dump_enable))
>> +return ret;
> 
> I'm not an expert in this piece of perf, but should it be 'return 0'
> instead ?
> and may be moved to is_sampling_event() check?
> Also please add unlikely().

The is_sampling_event() check appears in many other kernel places, not only
in the perf event interrupt overflow handler, so I'm not sure it is fine to
move it there. In addition, I think hwc->interrupts++ should be done in
__perf_event_overflow() before event->soft_enable is checked.
> 
>> +static void perf_event_check_dump_flag(struct perf_event *event)
>> +{
>> +if (event->attr.dump_enable == 1)
>> +        atomic_set(&event->dump_enable, 1);
>> +else
>> +        atomic_set(&event->dump_enable, 0);
> 
> that looks like it breaks perf, since default for bits is zero
> and all events will be soft-disabled?
> How did you test it?
> Please add a test to samples/bpf/ for this feature.
> 
> 
> .
> 




Re: [PATCH V3 1/2] bpf: control the trace data output on current cpu when perf sampling

2015-10-18 Thread xiakaixu
On 2015/10/17 6:06, Alexei Starovoitov wrote:
> On 10/16/15 12:42 AM, Kaixu Xia wrote:
>> This patch adds the flag dump_enable to control the trace data
>> output process when perf sampling. By setting this flag and
>> integrating with ebpf, we can control the data output process and
>> get the samples we are most interested in.
>>
>> The bpf helper bpf_perf_event_dump_control() can control the
>> perf_event on current cpu.
>>
>> Signed-off-by: Kaixu Xia 
>> ---
>>   include/linux/perf_event.h  |  1 +
>>   include/uapi/linux/bpf.h|  5 +
>>   include/uapi/linux/perf_event.h |  3 ++-
>>   kernel/bpf/verifier.c   |  3 ++-
>>   kernel/events/core.c| 13 
>>   kernel/trace/bpf_trace.c| 44 
>> +
>>   6 files changed, 67 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>> index 092a0e8..2af527e 100644
>> --- a/include/linux/perf_event.h
>> +++ b/include/linux/perf_event.h
>> @@ -472,6 +472,7 @@ struct perf_event {
>>   struct irq_workpending;
>>
>>   atomic_t        event_limit;
>> + atomic_t        dump_enable;
> 
> The naming is the hardest...
> How about calling it 'soft_enable' instead?
> 
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -287,6 +287,11 @@ enum bpf_func_id {
>>* Return: realm if != 0
>>*/
>>   BPF_FUNC_get_route_realm,
>> +
>> +/**
>> + * u64 bpf_perf_event_dump_control(&map, index, flag)
>> + */
>> +BPF_FUNC_perf_event_dump_control,
> 
> and this one is too long.
> May be bpf_perf_event_control() ?
> 
> Daniel, any thoughts on naming?
> 
>> --- a/include/uapi/linux/perf_event.h
>> +++ b/include/uapi/linux/perf_event.h
>> @@ -331,7 +331,8 @@ struct perf_event_attr {
>>   comm_exec  :  1, /* flag comm events that are due to 
>> an exec */
>>   use_clockid:  1, /* use @clockid for time fields */
>>   context_switch :  1, /* context switch data */
>> -__reserved_1   : 37;
>> +dump_enable:  1, /* don't output data on samples */
> 
> either comment or name is wrong.
> how about calling this one 'soft_disable',
> since you want zero to be default and the event should be on.
> 
>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>> index b11756f..74a16af 100644
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
>> @@ -6337,6 +6337,9 @@ static int __perf_event_overflow(struct perf_event *event,
>>   irq_work_queue(&event->pending);
>>   }
>>
>> +if (!atomic_read(&event->dump_enable))
>> +return ret;
> 
> I'm not an expert in this piece of perf, but should it be 'return 0'
> instead ?
> and may be moved to is_sampling_event() check?
> Also please add unlikely().
> 
>> +static void perf_event_check_dump_flag(struct perf_event *event)
>> +{
>> +if (event->attr.dump_enable == 1)
>> +        atomic_set(&event->dump_enable, 1);
>> +else
>> +        atomic_set(&event->dump_enable, 0);
> 
> that looks like it breaks perf, since default for bits is zero
> and all events will be soft-disabled?
> How did you test it?
> Please add a test to samples/bpf/ for this feature.

It is really hard to add a test to samples/bpf/. We would need to implement
most of the 'perf record/report' commands from tools/perf/, like mmap(),
dump trace, etc. The perf_event_open syscall alone is really not enough.

Actually, this patch set is only the kernel-space side; it still needs the
perf user-space side. You can find the necessary patches in Wang Nan's git
tree [1]. Based on that tree, we can configure BPF maps through the perf
cmdline. We also need to configure attr->soft_disable on the perf user side
based on tree [1], so that part was not included in this patch set. I will
send out the perf user-space part after this patch set is applied.

[1] git://git.kernel.org/pub/scm/linux/kernel/git/pi3orama/linux.git perf/ebpf
> 
> 
> .
> 




Re: [PATCH V2 2/2] bpf: control a set of perf events by creating a new ioctl PERF_EVENT_IOC_SET_ENABLER

2015-10-14 Thread xiakaixu
On 2015/10/15 5:28, Alexei Starovoitov wrote:
> On 10/14/15 5:37 AM, Kaixu Xia wrote:
>> +event->p_sample_disable = _event->sample_disable;
> 
> I don't like it as a concept and it's buggy implementation.
> What happens here when enabler is alive, but other event is destroyed?
> 
>> --- a/kernel/trace/bpf_trace.c
>> +++ b/kernel/trace/bpf_trace.c
>> @@ -221,9 +221,12 @@ static u64 bpf_perf_event_sample_control(u64 r1, u64 index, u64 flag, u64 r4, u6
>>   struct bpf_array *array = container_of(map, struct bpf_array, map);
>>   struct perf_event *event;
>>
>> -if (unlikely(index >= array->map.max_entries))
>> +if (unlikely(index > array->map.max_entries))
>>   return -E2BIG;
>>
>> +if (index == array->map.max_entries)
>> +index = 0;
> 
> what is this hack for ?
> 
> Either use notification and user space disable or
> call bpf_perf_event_sample_control() manually for each cpu.

I will discard the current implementation that controls a set of
perf events via the 'enabler' event. Calling bpf_perf_event_sample_control()
manually for each cpu is fine. Maybe we can add a loop to control all the
events stored in maps by judging the index, OK?
> 
> 
> 
> .
> 




Re: [RFC PATCH 1/2] perf: Add the flag sample_disable not to output data on samples

2015-10-12 Thread xiakaixu
On 2015/10/13 3:20, Alexei Starovoitov wrote:
> On 10/12/15 2:02 AM, Kaixu Xia wrote:
>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>> index f57d7fe..25e073d 100644
>> --- a/include/linux/bpf.h
>> +++ b/include/linux/bpf.h
>> @@ -39,6 +39,7 @@ struct bpf_map {
>>   u32 max_entries;
>>   const struct bpf_map_ops *ops;
>>   struct work_struct work;
>> +atomic_t perf_sample_disable;
>>   };
>>
>>   struct bpf_map_type_list {
>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>> index 092a0e8..0606d1d 100644
>> --- a/include/linux/perf_event.h
>> +++ b/include/linux/perf_event.h
>> @@ -483,6 +483,8 @@ struct perf_event {
>>   perf_overflow_handler_t overflow_handler;
>>   void                    *overflow_handler_context;
>>
>> + atomic_t                *sample_disable;
> 
> this looks fragile and unnecessary.
> Why add such field to generic bpf_map and carry its pointer into perf_event?
> Single extra field in perf_event would have been enough.
> Even better is to avoid adding any fields.
> There is already event->state why not to use that?
> The proper perf_event_enable/disable are so heavy that another
> mechanism needed? cpu_function_call is probably too much to do
> from bpf program, but that can be simplified?
> Based on the use case from cover letter, sounds like you want
> something like soft_disable?
> Then extending event->state would make the most sense.
> Also consider the case of re-entrant event enable/disable.
> So inc/dec of a flag may be needed?

Thanks for your comments!
I've tried perf_event_enable/disable, but there is a warning caused
by cpu_function_call. The main reason is as follows:
 int smp_call_function_single(...)
 {
...
WARN_ON_ONCE(cpu_online(this_cpu) && irqs_disabled()
 && !oops_in_progress);
...
}
So I added the extra atomic flag field in order to avoid this problem.
> 
> 
> .
> 




Re: [RFC PATCH 0/4] perf tools: Use the new ability of eBPF programs to access hardware PMU counter

2015-08-28 Thread xiakaixu
On 2015/8/29 9:28, Alexei Starovoitov wrote:
> On 8/27/15 3:42 AM, Kaixu Xia wrote:
>> An example is pasted at the bottom of this cover letter. In that example,
>> we can get the cpu_cycles and exception taken in sys_write.
>>
>>   $ cat /sys/kernel/debug/tracing/trace_pipe
>>   $ ./perf record --event perf-bpf.o ls
>> ...
>>   cat-1653  [003] d..1 88174.613854: : ente:  CPU-3
>> cyc:48746333exc:84
>>   cat-1653  [003] d..2 88174.613861: : exit:  CPU-3
>> cyc:48756041exc:84
> 
> nice. probably more complex example that computes the delta of the pmu
> counters on the kernel side would be even more interesting.

Right, this is just a little example. Actually, I have tested this
ability on the kernel side and the user-space side, that is, kprobe and
uprobe. The collected delta of the pmu counters from the kernel and glibc
is correct and meets the expected goals. I will give them in the next version.

At this time I wish to get your comments on the currently chosen
implementation. Now the struct perf_event_map_def is introduced and the
user can directly define the struct perf_event_attr, so we can skip the
parse_events process and call sys_perf_event_open on the events directly.
This is the simplest implementation, but I am not sure it is the most
appropriate.
> Do you think you can extend 'perf stat' with a flag that does
> stats collection for a given kernel or user function instead of the
> whole process ?
> Then we can use perf record/report to figure out hot functions and
> follow with 'perf stat -f my_hot_func my_process' to drill into
> particular function stats.

Good idea! I will consider it when this patchset is basically completed.
> 
> 
> .
> 




Re: [PATCH v2 net-next] bpf: s390: Fix build error caused by the struct bpf_array member name changed

2015-08-11 Thread xiakaixu
On 2015/8/11 16:24, Daniel Borkmann wrote:
> On 08/11/2015 08:53 AM, Kaixu Xia wrote:
>> There is a build error that "'struct bpf_array' has no member
>> named 'prog'" on s390. In commit 2a36f0b, the member 'prog' of
>> struct bpf_array is replaced by 'ptrs'. So this patch fixes it.
>>
>> Signed-off-by: Kaixu Xia 
> 
> You were also asked to add a proper Fixes tag :
> 
> Fixes: 2a36f0b92eb6 ("bpf: Make the bpf_prog_array_map more generic")
> 
> I guess this bug was reported by Fengguang as you have him in Cc ?
> 
> If that should be the case, then please also add a "Reported-by:" tag
> for his bug report.
> 
> Code looks good to me.

Thanks for your review. Will follow them in the resubmit version.


[RFC] perf ebpf: An example of how to access hardware PMU counters in eBPF programs by using perf

2015-08-07 Thread xiakaixu
By combining PMU, kprobe and eBPF programs together, many interesting things
can be done. For example, by probing at sched:sched_switch we can
measure IPC changes between different processes by watching the 'cycles' PMU
counter; by probing at the entry and exit points of a kernel function we are
able to compute the cache miss rate for that function by collecting the
'cache-misses' counter and comparing the differences. In summary, we can
define the begin and end points of a procedure, insert kprobes on them,
attach two BPF programs and let them collect specific PMU counters.
Further, by reading those PMU counters a BPF program can provide hints
to resource schedulers.

I am focusing on the work of giving eBPF programs the ability to
access hardware PMU counters and using them from perf. In recent weeks I have
submitted the kernel space code first, and the latest V7 version is here
(www.spinics.net/lists/netdev/msg338468.html). According to the design
plan, we still need the perf side code. I will do it based on Wang Nan's
patches (perf tools: filtering events using eBPF programs).

Here is a simple eBPF program example that is loaded by using perf.
It only illustrates the basic design principle; if it looks OK, we will
implement the perf side code following it.

Waiting for your comments.
Thanks.



struct bpf_map_def SEC("maps") my_cycles_map = {
.type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
.key_size = sizeof(int),
.value_size = sizeof(u32),
.max_entries = 32,
};
struct bpf_map_def SEC("maps") my_exception_map = {
.type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
.key_size = sizeof(int),
.value_size = sizeof(u32),
.max_entries = 32,
};

struct perf_event_map {
struct bpf_map_def *map_def;
char description[64];
};

struct perf_event_map SEC("perf_event_map") cycles = {
.map_def = &my_cycles_map,
.description = "cycles",
};

struct perf_event_map SEC("perf_event_map") exception = {
.map_def = &my_exception_map,
.description = "exception",
};

SEC("kprobe/sys_write")
int bpf_prog(struct pt_regs *ctx)
{
u64 count_cycles, count_exception;
u32 key = bpf_get_smp_processor_id();
char fmt[] = "CPU-%d   cyc:%llu  exc:%llu\n";

count_cycles = bpf_perf_event_read(&my_cycles_map, key);
count_exception = bpf_perf_event_read(&my_exception_map, key);
bpf_trace_printk(fmt, sizeof(fmt), key, count_cycles, count_exception);

return 0;
}





Re: [PATCH v6 3/4] bpf: Implement function bpf_perf_event_read() that gets the selected hardware PMU counter

2015-08-05 Thread xiakaixu
于 2015/8/5 21:53, Peter Zijlstra 写道:
> On Wed, Aug 05, 2015 at 12:04:25PM +0200, Peter Zijlstra wrote:
>> Also, you probably want a WARN_ON(in_nmi()) there, this function is
>> _NOT_ NMI safe.
> 
> I had a wee think about that, and I think the below is safe.
> 
> (with the obvious problem that WARN from NMI context is not safe)
> 
> It does not give you up-to-date overcommit times but your version didn't
> either so I'm assuming you don't need those, if you do need those it
> needs more but we can do that too.
> 
> ---
>  include/linux/perf_event.h |  1 +
>  kernel/events/core.c   | 53 
> ++
>  2 files changed, 54 insertions(+)
> 
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 2027809433b3..64e821dd64f0 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -659,6 +659,7 @@ perf_event_create_kernel_counter(struct perf_event_attr 
> *attr,
>   void *context);
>  extern void perf_pmu_migrate_context(struct pmu *pmu,
>   int src_cpu, int dst_cpu);
> +extern u64 perf_event_read_local(struct perf_event *event);
>  extern u64 perf_event_read_value(struct perf_event *event,
>u64 *enabled, u64 *running);
>  
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 39753bfd9520..7105d37763c1 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -3222,6 +3222,59 @@ static inline u64 perf_event_count(struct perf_event 
> *event)
>   return __perf_event_count(event);
>  }
>  
> +/*
> + * NMI-safe method to read a local event, that is an event that
> + * is:
> + *   - either for the current task, or for this CPU
> + *   - does not have inherit set, for inherited task events
> + * will not be local and we cannot read them atomically
> + *   - must not have a pmu::count method
> + */
> +u64 perf_event_read_local(struct perf_event *event)
> +{
> + unsigned long flags;
> + u64 val;
> +
> + /*
> +  * Disabling interrupts avoids all counter scheduling (context
> +  * switches, timer based rotation and IPIs).
> +  */
> + local_irq_safe(flags);

s/local_irq_safe/local_irq_save, and I have compiled and tested this function
and it is fine. Will use it in the next set.

Thanks.
> +
> + /* If this is a per-task event, it must be for current */
> + WARN_ON_ONCE((event->attach_state & PERF_ATTACH_TASK) &&
> +  event->hw.target != current);
> +
> + /* If this is a per-CPU event, it must be for this CPU */
> + WARN_ON_ONCE(!(event->attach_state & PERF_ATTACH_TASK) &&
> +  event->cpu != smp_processor_id());
> +
> + /*
> +  * It must not be an event with inherit set, we cannot read
> +  * all child counters from atomic context.
> +  */
> + WARN_ON_ONCE(event->attr.inherit);
> +
> + /*
> +  * It must not have a pmu::count method, those are not
> +  * NMI safe.
> +  */
> + WARN_ON_ONCE(event->pmu->count);
> +
> + /*
> +  * If the event is currently on this CPU, its either a per-task event,
> +  * or local to this CPU. Furthermore it means its ACTIVE (otherwise
> +  * oncpu == -1).
> +  */
> + if (event->oncpu == smp_processor_id())
> + event->pmu->read(event);
> +
> + val = local64_read(&event->count);
> + local_irq_restore(flags);
> +
> + return val;
> +}
> +
>  static u64 perf_event_read(struct perf_event *event)
>  {
>   /*
> 
> .
> 




Re: [PATCH v6 3/4] bpf: Implement function bpf_perf_event_read() that gets the selected hardware PMU counter

2015-08-05 Thread xiakaixu
于 2015/8/5 18:04, Peter Zijlstra 写道:
> On Tue, Aug 04, 2015 at 08:58:15AM +, Kaixu Xia wrote:
>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>> index 6251b53..726ca1b 100644
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
>> @@ -8599,6 +8599,25 @@ struct perf_event_attr *perf_event_attrs(struct 
>> perf_event *event)
>>  return &event->attr;
>>  }
>>  
>> +u64 perf_event_read_internal(struct perf_event *event)
> 
> Maybe: perf_event_read_local(), as this is this function only works for
> events active on the current CPU.
> 
>> +{
>> +if (!event)
>> +return -EINVAL;
>> +
>> +if (unlikely(event->state != PERF_EVENT_STATE_ACTIVE))
>> +return -EINVAL;
> 
> You can return perf_event_count() in that case.
> 
>> +
>> +if (event->oncpu != raw_smp_processor_id() &&
> 
> That _must_ be smp_processor_id(). If that gives a warning (ie.
> preemption is not disabled or we're not affine to this one cpu) then the
> warning is valid.
> 
>> +event->ctx->task != current)
> 
> Write it like:
> 
>   if (event->ctx->task != current &&
>   event->oncpu != smp_processor_id())
> 
> That way you'll not evaluate smp_processor_id() for current task events.
> 
>> +return -EINVAL;
>> +
>> +if (unlikely(event->attr.inherit))
>> +return -EINVAL;
> 
> This should be in your accept function, inherited events should never
> get this far.
> 
> You need IRQs disabled while calling __perf_event_read(), did you test
> with lockdep enabled?

Thanks for your review! I've not tested it, will do that from now on.
> 
>> +__perf_event_read(event);
>> +return perf_event_count(event);
>> +}
> 
> Also, you probably want a WARN_ON(in_nmi()) there, this function is
> _NOT_ NMI safe.
> 
> 
> 
> .
> 








Re: [PATCH v6 3/4] bpf: Implement function bpf_perf_event_read() that gets the selected hardware PMU counter

2015-08-04 Thread xiakaixu
于 2015/8/5 1:55, Alexei Starovoitov 写道:
> On 8/4/15 1:58 AM, Kaixu Xia wrote:
>> +static int check_func_limit(struct bpf_map **mapp, int func_id)
> 
> how about 'check_map_func_compatibility' or 'check_map_func_affinity' ?
> 
>> +{
>> +struct bpf_map *map = *mapp;
> 
> why pass a pointer to a pointer? a single pointer would be fine.
> 
>> +bool bool_map, bool_func;
>> +int i;
>> +
>> +if (!map)
>> +return 0;
>> +
>> +for (i = 0; i <= ARRAY_SIZE(func_limit); i++) {
>> +bool_map = (map->map_type == func_limit[i].map_type);
>> +bool_func = (func_id == func_limit[i].func_id);
>> +/* only when map & func pair match it can continue.
>> + * don't allow any other map type to be passed into
>> + * the special func;
>> + */
>> +if (bool_map != bool_func)
>> +return -EINVAL;
>> +}
> 
> nice simplification!
> 
> the rest of the changes look good.
> please respin your next set against net-next.

Thanks for your review! I will follow them in the next set.
> 
> 
> .
> 






Re: [PATCH v3 2/3] bpf: Implement function bpf_perf_event_read() that gets the selected hardware PMU counter

2015-08-03 Thread xiakaixu
于 2015/8/3 17:34, Peter Zijlstra 写道:
> On Thu, Jul 23, 2015 at 09:42:41AM +, Kaixu Xia wrote:
>> +static u64 bpf_perf_event_read(u64 r1, u64 index, u64 r3, u64 r4, u64 r5)
>> +{
>> +struct bpf_map *map = (struct bpf_map *) (unsigned long) r1;
>> +struct bpf_array *array = container_of(map, struct bpf_array, map);
>> +struct perf_event *event;
>> +
>> +if (index >= array->map.max_entries)
>> +return -E2BIG;
>> +
>> +event = array->events[index];
>> +if (!event)
>> +return -EBADF;
>> +
>> +if (event->state != PERF_EVENT_STATE_ACTIVE)
>> +return -ENOENT;
>> +
>> +if (event->oncpu != raw_smp_processor_id() &&
>> +event->ctx->task != current)
>> +return -EINVAL;
>> +
>> +if (event->attr.inherit)
>> +return -EINVAL;
>> +
>> +__perf_event_read(event);
>> +
>> +return perf_event_count(event);
>> +}
> 
> Please no poking of event internal state outside of perf code.

Thanks for your review. I will move it to kernel/events/core.c.
> 
> .
> 






Re: [PATCH v4 1/4] bpf: Make the bpf_prog_array_map more generic

2015-07-31 Thread xiakaixu
于 2015/7/30 9:44, Alexei Starovoitov 写道:
> On 7/29/15 4:17 PM, Daniel Borkmann wrote:
>>> -if (map->map_type == BPF_MAP_TYPE_PROG_ARRAY)
>>> +if (map->map_type >= BPF_MAP_TYPE_PROG_ARRAY)
>>>   /* prog_array stores refcnt-ed bpf_prog pointers
>>>* release them all when user space closes prog_array_fd
>>>*/
>>> -bpf_prog_array_map_clear(map);
>>> +bpf_fd_array_map_clear(map);
>>
>> When we are going to add a new map type to the eBPF framework that is not
>> an fd_array_map thing, this assumption of map->map_type >=
>> BPF_MAP_TYPE_PROG_ARRAY
>> might not hold then ...
> 
> Also I think here changing == to >= is probably unnecessary.
> prog_array needs to do it because of circular dependency
> whereas perf_event_array cannot have it.
> Even when we attach bpf prog to perf_event and then add it to
> perf_event_array used by the same prog, right?
> Please test such scenario just in case.

I'm not sure I completely understand what you mean. We can
attach a bpf_prog to kprobe events. For now, we limit the event
types: only PERF_TYPE_RAW & PERF_TYPE_HARDWARE events can
be accessed in bpf_perf_event_read(), so it seems the dependency
scenario won't happen.
I will add the event refcnt decrement to map_free in V5,
right?
> 
> 
> .
> 






Re: [PATCH v2 0/5] bpf: Introduce the new ability of eBPF programs to access hardware PMU counter

2015-07-24 Thread xiakaixu
于 2015/7/24 7:33, Daniel Borkmann 写道:
> On 07/22/2015 10:09 AM, Kaixu Xia wrote:
>> Previous patch v1 url:
>> https://lkml.org/lkml/2015/7/17/287
> 
> [ Sorry to chime in late, just noticed this series now as I wasn't in Cc for
>   the core BPF changes. More below ... ]

Sorry about this, will add you to the CC list:) Welcome your comments.
> 
>> This patchset allows user read PMU events in the following way:
>>   1. Open the PMU using perf_event_open() (for each CPUs or for
>>  each processes he/she'd like to watch);
>>   2. Create a BPF_MAP_TYPE_PERF_EVENT_ARRAY BPF map;
>>   3. Insert FDs into the map with some key-value mapping scheme
>>  (i.e. cpuid -> event on that CPU);
>>   4. Load and attach eBPF programs as usual;
>>   5. In eBPF program, get the perf_event_map_fd and key (i.e.
>>  cpuid get from bpf_get_smp_processor_id()) then use
>>  bpf_perf_event_read() to read from it.
>>   6. Do anything he/she wants.
>>
>> changes in V2:
>>   - put atomic_long_inc_not_zero() between fdget() and fdput();
>>   - limit the event type to PERF_TYPE_RAW and PERF_TYPE_HARDWARE;
>>   - Only read the event counter on current CPU or on current
>> process;
>>   - add new map type BPF_MAP_TYPE_PERF_EVENT_ARRAY to store the
>> pointer to the struct perf_event;
>>   - according to the perf_event_map_fd and key, the function
>> bpf_perf_event_read() can get the Hardware PMU counter value;
>>
>> Patch 5/5 is a simple example and shows how to use this new eBPF
>> programs ability. The PMU counter data can be found in
>> /sys/kernel/debug/tracing/trace(trace_pipe).(the cycles PMU
>> value when 'kprobe/sys_write' sampling)
>>
>>$ cat /sys/kernel/debug/tracing/trace_pipe
>>$ ./tracex6
>> ...
>>   cat-677   [002] d..1   210.299270: : bpf count: CPU-2  5316659
>>   cat-677   [002] d..1   210.299316: : bpf count: CPU-2  5378639
>>   cat-677   [002] d..1   210.299362: : bpf count: CPU-2  5440654
>>   cat-677   [002] d..1   210.299408: : bpf count: CPU-2  5503211
>>   cat-677   [002] d..1   210.299454: : bpf count: CPU-2  5565438
>>   cat-677   [002] d..1   210.299500: : bpf count: CPU-2  5627433
>>   cat-677   [002] d..1   210.299547: : bpf count: CPU-2  5690033
>>   cat-677   [002] d..1   210.299593: : bpf count: CPU-2  5752184
>>   cat-677   [002] d..1   210.299639: : bpf count: CPU-2  5814543
>> <...>-548   [009] d..1   210.299667: : bpf count: CPU-9  
>> 605418074
>> <...>-548   [009] d..1   210.299692: : bpf count: CPU-9  
>> 605452692
>>   cat-677   [002] d..1   210.299700: : bpf count: CPU-2  5896319
>> <...>-548   [009] d..1   210.299710: : bpf count: CPU-9  
>> 605477824
>> <...>-548   [009] d..1   210.299728: : bpf count: CPU-9  
>> 605501726
>> <...>-548   [009] d..1   210.299745: : bpf count: CPU-9  
>> 605525279
>> <...>-548   [009] d..1   210.299762: : bpf count: CPU-9  
>> 605547817
>> <...>-548   [009] d..1   210.299778: : bpf count: CPU-9  
>> 605570433
>> <...>-548   [009] d..1   210.299795: : bpf count: CPU-9  
>> 605592743
>> ...
>>
>> The detail of patches is as follow:
>>
>> Patch 1/5 introduces a new bpf map type. This map only stores the
>> pointer to struct perf_event;
>>
>> Patch 2/5 introduces a map_traverse_elem() function for further use;
>>
>> Patch 3/5 convets event file descriptors into perf_event structure when
>> add new element to the map;
> 
> So far all the map backends are of generic nature, knowing absolutely nothing
> about a particular consumer/subsystem of eBPF (tc, socket filters, etc). The
> tail call is a bit special, but nevertheless generic for each user and [very]
> useful, so it makes sense to inherit from the array map and move the code 
> there.
> 
> I don't really like that we start add new _special_-cased maps here into the
> eBPF core code, it seems quite hacky. :( From your rather terse commit 
> description
> where you introduce the maps, I failed to see a detailed elaboration on this 
> i.e.
> why it cannot be abstracted any different?

It will be very useful to give eBPF programs the ability to access
hardware PMU counters, just as I mentioned in the V1 commit message.
Of course, there is some special-cased code when creating the perf_event
map type in V2, but you will find less special code in the next version
(V3). I have reused most of the prog_array map implementation. We can make
the perf_event array map more generic in the future.

BR.
>  
>> Patch 4/5 implement function bpf_perf_event_read() that get the selected
>> hardware PMU conuter;
>>
>> Patch 5/5 give a simple example.
>>
>> Kaixu Xia (5):
>>bpf: Add new bpf map type to store the pointer to struct perf_event
>>bpf: Add function map->ops->map_traverse_elem() to traverse map elems
>>bpf: Save the pointer to struct perf_event to map
>>

Re: [PATCH v2 0/5] bpf: Introduce the new ability of eBPF programs to access hardware PMU counter

2015-07-24 Thread xiakaixu
On 2015/7/24 7:33, Daniel Borkmann wrote:
> On 07/22/2015 10:09 AM, Kaixu Xia wrote:
>> Previous patch v1 url:
>> https://lkml.org/lkml/2015/7/17/287
>
> [ Sorry to chime in late, just noticed this series now as I wasn't in Cc for
>   the core BPF changes. More below ... ]

Sorry about this, I will add you to the Cc list :) Your comments are welcome.
>
>> This patchset allows users to read PMU events in the following way:
>>   1. Open the PMU using perf_event_open() (for each CPU or for
>>  each process he/she'd like to watch);
>>   2. Create a BPF_MAP_TYPE_PERF_EVENT_ARRAY BPF map;
>>   3. Insert FDs into the map with some key-value mapping scheme
>>  (i.e. cpuid -> event on that CPU);
>>   4. Load and attach eBPF programs as usual;
>>   5. In the eBPF program, get the perf_event_map_fd and key (i.e.
>>  the cpuid from bpf_get_smp_processor_id()), then use
>>  bpf_perf_event_read() to read from it;
>>   6. Do anything he/she wants.
>>
>> Changes in V2:
>>   - put atomic_long_inc_not_zero() between fdget() and fdput();
>>   - limit the event type to PERF_TYPE_RAW and PERF_TYPE_HARDWARE;
>>   - only read the event counter on the current CPU or in the current
>> process;
>>   - add new map type BPF_MAP_TYPE_PERF_EVENT_ARRAY to store the
>> pointer to the struct perf_event;
>>   - according to the perf_event_map_fd and key, the function
>> bpf_perf_event_read() can get the hardware PMU counter value;
>>
>> Patch 5/5 is a simple example and shows how to use this new eBPF
>> program ability. The PMU counter data can be found in
>> /sys/kernel/debug/tracing/trace (trace_pipe) (the cycles PMU
>> value when sampling 'kprobe/sys_write'):
>>
>> $ cat /sys/kernel/debug/tracing/trace_pipe
>> $ ./tracex6
>> ...
>>   cat-677   [002] d..1   210.299270: : bpf count: CPU-2  5316659
>>   cat-677   [002] d..1   210.299316: : bpf count: CPU-2  5378639
>>   cat-677   [002] d..1   210.299362: : bpf count: CPU-2  5440654
>>   cat-677   [002] d..1   210.299408: : bpf count: CPU-2  5503211
>>   cat-677   [002] d..1   210.299454: : bpf count: CPU-2  5565438
>>   cat-677   [002] d..1   210.299500: : bpf count: CPU-2  5627433
>>   cat-677   [002] d..1   210.299547: : bpf count: CPU-2  5690033
>>   cat-677   [002] d..1   210.299593: : bpf count: CPU-2  5752184
>>   cat-677   [002] d..1   210.299639: : bpf count: CPU-2  5814543
>>   ...-548   [009] d..1   210.299667: : bpf count: CPU-9  605418074
>>   ...-548   [009] d..1   210.299692: : bpf count: CPU-9  605452692
>>   cat-677   [002] d..1   210.299700: : bpf count: CPU-2  5896319
>>   ...-548   [009] d..1   210.299710: : bpf count: CPU-9  605477824
>>   ...-548   [009] d..1   210.299728: : bpf count: CPU-9  605501726
>>   ...-548   [009] d..1   210.299745: : bpf count: CPU-9  605525279
>>   ...-548   [009] d..1   210.299762: : bpf count: CPU-9  605547817
>>   ...-548   [009] d..1   210.299778: : bpf count: CPU-9  605570433
>>   ...-548   [009] d..1   210.299795: : bpf count: CPU-9  605592743
>> ...
>>
>> The details of the patches are as follows:
>>
>> Patch 1/5 introduces a new bpf map type. This map only stores the
>> pointer to struct perf_event;
>>
>> Patch 2/5 introduces a map_traverse_elem() function for further use;
>>
>> Patch 3/5 converts event file descriptors into perf_event structures when
>> adding new elements to the map;
>
> So far all the map backends are of a generic nature, knowing absolutely nothing
> about a particular consumer/subsystem of eBPF (tc, socket filters, etc). The
> tail call is a bit special, but nevertheless generic for each user and [very]
> useful, so it makes sense to inherit from the array map and move the code
> there.
>
> I don't really like that we start adding new _special_-cased maps here into the
> eBPF core code, it seems quite hacky. :( From your rather terse commit
> description where you introduce the maps, I failed to see a detailed
> elaboration on this, i.e. why it cannot be abstracted any differently?

It will be very useful to give eBPF programs the ability to access
hardware PMU counters, just as I mentioned in the V1 commit message.
Of course, there is some special code when creating the perf_event type map
in V2, but you will find less special code in the next version (V3). I have
reused most of the prog_array map implementation. We can make the perf_event
array map more generic in the future.

BR.
  
 Patch 4/5 implement function bpf_perf_event_read() that get the selected
 hardware PMU conuter;

 Patch 5/5 give a simple example.

 Kaixu Xia (5):
bpf: Add new bpf map type to store the pointer to struct perf_event
bpf: Add function map-ops-map_traverse_elem() to traverse map elems
bpf: Save the pointer to struct perf_event to map
bpf: Implement function bpf_perf_event_read() that get the selected
  hardware PMU conuter
samples/bpf: example of get selected PMU counter value

   include/linux/bpf.h|   6 +++
 

Re: [PATCH v3 1/3] bpf: Add new bpf map type to store the pointer to struct perf_event

2015-07-23 Thread xiakaixu
On 2015/7/24 6:54, Alexei Starovoitov wrote:
> On 7/23/15 2:42 AM, Kaixu Xia wrote:
>> Introduce a new bpf map type 'BPF_MAP_TYPE_PERF_EVENT_ARRAY'.
>> This map only stores the pointer to struct perf_event. The
>> user space event FDs from perf_event_open() syscall are converted
>> to the pointer to struct perf_event and stored in map.
> ...
>> +static struct bpf_map *perf_event_array_map_alloc(union bpf_attr *attr)
>> +{
>> +/* only the pointer to struct perf_event can be stored in
>> + * perf_event_array map
>> + */
>> +if (attr->value_size != sizeof(u32))
>> +return ERR_PTR(-EINVAL);
>> +
>> +return array_map_alloc(attr);
>> +}
> 
> since it's exactly the same as prog_array_map_alloc(),
> just rename it to something like 'fd_array_map_alloc'
> and use for both types.
> 
>> +static int perf_event_array_map_get_next_key(struct bpf_map *map, void *key,
>> + void *next_key)
>> +{
>> +return -EINVAL;
>> +}
>> +
>> +static void *perf_event_array_map_lookup_elem(struct bpf_map *map, void 
>> *key)
>> +{
>> +return NULL;
>> +}
> 
> same for the above two.
> rename prog_array_map_* into fd_array_map_* and use for both map types.
> 
>> +static struct perf_event *convert_map_with_perf_event(void *value)
>> +{
>> +struct perf_event *event;
>> +u32 fd;
>> +
>> +fd = *(u32 *)value;
>> +
>> +event = perf_event_get(fd);
>> +if (IS_ERR(event))
>> +return NULL;
> 
> don't lose error code, do 'return event' instead.
> 
>> +
>> +/* limit the event type to PERF_TYPE_RAW
>> + * and PERF_TYPE_HARDWARE.
>> + */
>> +if (event->attr.type != PERF_TYPE_RAW &&
>> +event->attr.type != PERF_TYPE_HARDWARE)
>> +return NULL;
> 
> perf_event refcnt leak? need to do put_event.
> and return ERR_PTR(-EINVAL)
> 
>> +
>> +return event;
>> +}
>> +
>> +/* only called from syscall */
>> +static int perf_event_array_map_update_elem(struct bpf_map *map, void *key,
>> +void *value, u64 map_flags)
>> +{
>> +struct bpf_array *array = container_of(map, struct bpf_array, map);
>> +struct perf_event *event;
>> +u32 index = *(u32 *)key;
>> +
>> +if (map_flags != BPF_ANY)
>> +return -EINVAL;
>> +
>> +if (index >= array->map.max_entries)
>> +return -E2BIG;
>> +
>> +/* check if the value is already stored */
>> +if (array->events[index])
>> +return -EINVAL;
>> +
>> +/* convert the fd to the pointer to struct perf_event */
>> +event = convert_map_with_perf_event(value);
> 
> imo helper name is misleading and it's too short to be separate
> function. Just inline it and you can reuse 'index' variable.
> 
>> +if (!event)
>> +return -EBADF;
>> +
>> +xchg(array->events + index, event);
> 
> refcnt leak of old event! Please think it through.
> This type of bugs I shouldn't be finding.

Maybe the commit message is not detailed enough. Here I prevent
user space from updating an existing event, so the return
value of xchg() is always NULL and there is no refcnt leak of an
old event. I will do the same as prog_array in the next version.
> 
>> +static int perf_event_array_map_delete_elem(struct bpf_map *map, void *key)
>> +{
>> +return -EINVAL;
>> +}
> 
> no way to dec refcnt of perf_event from user space?
> why not to do the same as prog_array_delete?

Will follow them in V4.
> 
> 
> .
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 2/3] bpf: Implement function bpf_perf_event_read() that gets the selected hardware PMU counter

2015-07-23 Thread xiakaixu
On 2015/7/24 6:56, Alexei Starovoitov wrote:
> On 7/23/15 2:42 AM, Kaixu Xia wrote:
>> According to the perf_event_map_fd and index, the function
>> bpf_perf_event_read() can convert the corresponding map
>> value to the pointer to struct perf_event and return the
>> Hardware PMU counter value.
>>
>> Signed-off-by: Kaixu Xia 
> ...
>> +static u64 bpf_perf_event_read(u64 r1, u64 index, u64 r3, u64 r4, u64 r5)
>> +{
>> +struct bpf_map *map = (struct bpf_map *) (unsigned long) r1;
>> +struct bpf_array *array = container_of(map, struct bpf_array, map);
>> +struct perf_event *event;
>> +
>> +if (index >= array->map.max_entries)
>> +return -E2BIG;
>> +
>> +event = array->events[index];
>> +if (!event)
>> +return -EBADF;
> 
> probably ENOENT makes more sense here.
> 
>> +
>> +if (event->state != PERF_EVENT_STATE_ACTIVE)
>> +return -ENOENT;
> 
> and -EINVAL here?

Yeah, the errno is better.

Thanks!
> 
> 
> .
> 




Re: [PATCH v3 3/3] samples/bpf: example of getting the selected PMU counter value

2015-07-23 Thread xiakaixu
On 2015/7/24 6:59, Alexei Starovoitov wrote:
> On 7/23/15 2:42 AM, Kaixu Xia wrote:
>> This is a simple example and shows how to use the new ability
>> to get the selected Hardware PMU counter value.
>>
>> Signed-off-by: Kaixu Xia 
> ...
>> +struct bpf_map_def SEC("maps") my_map = {
>> +.type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
>> +.key_size = sizeof(int),
>> +.value_size = sizeof(unsigned long),
>> +.max_entries = 32,
>> +};
> 
> wait. how did it work here? value_size should be u32.

I tested the whole thing on an ARM board. You are right, it
should be u32.
When creating the array map, we choose array->elem_size as
round_up(attr->value_size, 8); why 8?

Thanks!
> 
> 
> 




Re: [PATCH v2 4/5] bpf: Implement function bpf_perf_event_read() that gets the selected hardware PMU counter

2015-07-22 Thread xiakaixu
On 2015/7/23 10:22, Alexei Starovoitov wrote:
> On 7/22/15 7:12 PM, xiakaixu wrote:
>> So you mean like this?
>>
>> u64 bpf_perf_event_read(u64 r1, u64 index,...)
>> {
>>struct bpf_perf_event_array *array = (void *) (long) r1;
>>struct perf_event *event;
>>...
>>event = array->events[index];
>>...
>> }
> 
> yes. the only thing needed is to add:
> if (index >= array->map.max_entries)
> return -E2BIG;
> before accessing array->events[index];
> 
>>>
>>>> +const struct bpf_func_proto bpf_perf_event_read_proto = {
>>>> +.func= bpf_perf_event_read,
>>>> +.gpl_only= false,
>>>> +.ret_type= RET_INTEGER,
>>>> +.arg1_type= ARG_CONST_MAP_PTR,
>>>> +.arg2_type= ARG_PTR_TO_MAP_KEY,
>>>
>>> make it arg2_type = ARG_ANYTHING then you'll just index
>>> into array the way prog_array does and similar to bpf_tail_call.
> 
>>   ARG_ANYTHING means any (initialized) argument is ok, but what we
> 
> correct.
> 
>>   really want here is the map key. So I'm not sure ARG_ANYTHING is suitable.
>>   You know ARG_ANYTHING is not checked enough in the verifier.
> 
> why? during perf_event_array creation time we check that key_size == u32
> so we can accept any integer.
> ARG_PTR_TO_MAP_KEY forces the program author to use the stack instead of
> passing the index directly. A direct index is obviously faster.

Copy that. We will follow them in V3.
> 
> 
> .
> 




Re: [PATCH v2 4/5] bpf: Implement function bpf_perf_event_read() that gets the selected hardware PMU counter

2015-07-22 Thread xiakaixu
On 2015/7/23 9:14, Alexei Starovoitov wrote:
> On 7/22/15 1:09 AM, Kaixu Xia wrote:
>> According to the perf_event_map_fd and key, the function
>> bpf_perf_event_read() can convert the corresponding map
>> value to the pointer to struct perf_event and return the
>> Hardware PMU counter value.
>>
>> The key can't be passed to bpf_perf_event_read() directly
>> because the function argument constraint is lacked.
> 
> I don't understand above sentence.
> 
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index 69a1f6b..e3bb181 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -250,6 +250,8 @@ enum bpf_func_id {
>>* Return: 0 on success
>>*/
>>   BPF_FUNC_get_current_comm,
>> +
>> +BPF_FUNC_perf_event_read,/* u64 bpf_perf_event_read(map, key) */
> 
> no need for extra empty line.
> 
>> +
>> +static u64 bpf_perf_event_read(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
>> +{
>> +struct bpf_map *map = (struct bpf_map *) (unsigned long) r1;
>> +void *key = (void *) (unsigned long) r2;
>> +struct perf_event *event;
>> +void *ptr;
>> +
>> +if (map->map_type != BPF_MAP_TYPE_PERF_EVENT_ARRAY)
>> +return -EINVAL;
> 
> please check this statically in verifier instead of in run-time.
> 
>> +
>> +rcu_read_lock();
> 
> unnecessary.
> 
>> +ptr = map->ops->map_lookup_elem(map, key);
>> +rcu_read_unlock();
>> +if (!ptr || !(*(unsigned long *)ptr))
>> +return -EBADF;
> 
> all these casts can be removed. First cast of 'r1' into
> perf_event_array will be enough.

So you mean like this?

u64 bpf_perf_event_read(u64 r1, u64 index,...)
{
  struct bpf_perf_event_array *array = (void *) (long) r1;
  struct perf_event *event;
  ...
  event = array->events[index];
  ...
}
> 
>> +const struct bpf_func_proto bpf_perf_event_read_proto = {
>> +.func= bpf_perf_event_read,
>> +.gpl_only= false,
>> +.ret_type= RET_INTEGER,
>> +.arg1_type= ARG_CONST_MAP_PTR,
>> +.arg2_type= ARG_PTR_TO_MAP_KEY,
> 
> make it arg2_type = ARG_ANYTHING then you'll just index
> into array the way prog_array does and similar to bpf_tail_call.

 ARG_ANYTHING means any (initialized) argument is ok, but what we
 really want here is the map key. So I'm not sure ARG_ANYTHING is suitable.
 You know ARG_ANYTHING is not checked enough in the verifier.

Thanks.
> 
> 
> .
> 




Re: [PATCH] perf: remove the extra validity check on nr_pages

2015-01-29 Thread xiakaixu
On 2015/1/27 17:55, Kaixu Xia wrote:
ping...

> The function is_power_of_2() also does the check on nr_pages,
> so the first check performed is unnecessary. On the other
> hand, the key point is to ensure @nr_pages is a power-of-two
> number, and @nr_pages is mostly a nonzero value, so in most
> cases the function is_power_of_2() will be called.
> 
> Signed-off-by: Kaixu Xia 
> ---
>  kernel/events/core.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 882f835..abb57c2 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -4268,7 +4268,7 @@ static int perf_mmap(struct file *file, struct 
> vm_area_struct *vma)
>* If we have rb pages ensure they're a power-of-two number, so we
>* can do bitmasks instead of modulo.
>*/
> - if (nr_pages != 0 && !is_power_of_2(nr_pages))
> + if (!is_power_of_2(nr_pages))
>   return -EINVAL;
>  
>   if (vma_size != PAGE_SIZE * (1 + nr_pages))
> 




Re: [PATCH] ARM: cns3xxx: fix allmodconfig panic in pci driver

2014-09-03 Thread xiakaixu
On 2014/9/3 21:27, Arnd Bergmann wrote:
> On Wednesday 03 September 2014 21:18:12 Xia Kaixu wrote:
>> From: Xia Kaixu 
>>
>> The kernel panic occurs when running an allmodconfig kernel on
>> OMAP4460. The initcall "cns3xxx_pcie_init" does not check which
>> hardware it's running on and just tries to access its specific
>> registers. Now call it from the .init_late callback of the two
>> machine descriptors.
>>
>> Signed-off-by: Xia Kaixu 
>> Signed-off-by: Arnd Bergmann 
>> Cc: Anton Vorontsov 
>> Cc: Felix Fietkau 
>> Cc: Imre Kaloz 
>> Cc: linaro-ker...@lists.linaro.org
>> Cc: linux-arm-ker...@lists.infradead.org
> 
> Hi Kaixu,
> 
> it seems this time all the Cc's worked. One thing I just noticed is that
> you confused the 'Signed-off-by' lines. I did look at the patch earlier
> before you sent it out but unfortunately missed that.
> 
> You can read again in Documentation/SubmittingPatches about how
> they work, but the short version is that you must never put someone
> else's name in Signed-off-by under a patch you write yourself.
> When we apply the patch, I (or whoever else does) will put that
> line below yours to document who handled the patch.

  Sure, you are right. Thanks for the reminder!
> 
> Otherwise, the patch looks good to me, thanks a lot for doing it.
> 
>   Arnd
> 
> ___
> linaro-kernel mailing list
> linaro-ker...@lists.linaro.org
> http://lists.linaro.org/mailman/listinfo/linaro-kernel
> 




Re: Does perf support different length of user-space hw_breakpoint?

2014-05-13 Thread xiakaixu
On 2014/5/13 23:06, Frederic Weisbecker wrote:
> On Tue, May 13, 2014 at 02:00:46PM +0200, Jiri Olsa wrote:
>> On Mon, May 12, 2014 at 03:52:54PM +0800, xiakaixu wrote:
>>> Hi guys,
>>>
>>> Does perf support different lengths of user-space hw_breakpoint,
>>> such as  HW_BREAKPOINT_LEN_1/HW_BREAKPOINT_LEN_2/HW_BREAKPOINT_LEN_4/
>>> HW_BREAKPOINT_LEN_8?
>>>
>>> Seems perf only supports HW_BREAKPOINT_LEN_4/sizeof(long)
>>> by default, from the source code and a simple test.
>>
>> right..
>>
>> /*
>>  * We should find a nice way to override the access length
>>  * Provide some defaults for now
>>  */
>> if (attr.bp_type == HW_BREAKPOINT_X)
>> attr.bp_len = sizeof(long);
>> else
>> attr.bp_len = HW_BREAKPOINT_LEN_4;
>>
>>>
>>> May I have your opinions if I want to trace a different number of
>>> bytes at a hw_breakpoint addr?
>>
>> I guess that depends on what you want to do ;-)
> 
> Ah, I have a patchset from Jacob Shin and Suravee Suthikulpanit that does
> that. Also it has been hanging around for too long by my fault. I'm posting
> it now.

Thanks for your reply!
Hopefully I can get it ASAP.
> .
> 




Re: Does perf support different length of user-space hw_breakpoint?

2014-05-13 Thread xiakaixu
ping
On 2014/5/12 17:00, xiakaixu wrote:
> On 2014/5/12 16:48, Peter Zijlstra wrote:
>> On Mon, May 12, 2014 at 04:20:44PM +0800, kaixu xia wrote:
>>> 2014-05-12 16:05 GMT+08:00 Peter Zijlstra :
>>>
>>>> On Mon, May 12, 2014 at 03:52:54PM +0800, xiakaixu wrote:
>>>>> Hi guys,
>>>>>
>>>>> Does perf support different length of user-space hw_breakpoint,
>>>>> such as  HW_BREAKPOINT_LEN_1/HW_BREAKPOINT_LEN_2/HW_BREAKPOINT_LEN_4/
>>>>> HW_BREAKPOINT_LEN_8?
>>>>>
>>>>> Seems perf only support HW_BREAKPOINT_LEN_4/sizeof(long)
>>>>> by default from the source code and simple test.
>>>>>
>>>>> May I have your opinions if I want to trace different bytes of
>>>>> hw_breakpoint addr?
>>>>
>>>> Frederic?
>>>>
>>>
>>> Sorry, can not fully understand it. Can you give more details on that?
>>
>> I asked Frederic to answer your question :-)
>>
> Thanks:-)
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-perf-users" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 




Re: Does perf support different length of user-space hw_breakpoint?

2014-05-12 Thread xiakaixu
On 2014/5/12 16:48, Peter Zijlstra wrote:
> On Mon, May 12, 2014 at 04:20:44PM +0800, kaixu xia wrote:
>> 2014-05-12 16:05 GMT+08:00 Peter Zijlstra :
>>
>>> On Mon, May 12, 2014 at 03:52:54PM +0800, xiakaixu wrote:
>>>> Hi guys,
>>>>
>>>> Does perf support different length of user-space hw_breakpoint,
>>>> such as  HW_BREAKPOINT_LEN_1/HW_BREAKPOINT_LEN_2/HW_BREAKPOINT_LEN_4/
>>>> HW_BREAKPOINT_LEN_8?
>>>>
>>>> Seems perf only support HW_BREAKPOINT_LEN_4/sizeof(long)
>>>> by default from the source code and simple test.
>>>>
>>>> May I have your opinions if I want to trace different bytes of
>>>> hw_breakpoint addr?
>>>
>>> Frederic?
>>>
>>
>> Sorry, can not fully understand it. Can you give more details on that?
> 
> I asked Frederic to answer your question :-)
> 
Thanks:-)



Does perf support different length of user-space hw_breakpoint?

2014-05-12 Thread xiakaixu
Hi guys,

Does perf support different lengths of user-space hw_breakpoint,
such as HW_BREAKPOINT_LEN_1/HW_BREAKPOINT_LEN_2/HW_BREAKPOINT_LEN_4/
HW_BREAKPOINT_LEN_8?

It seems perf only supports HW_BREAKPOINT_LEN_4/sizeof(long) by
default, judging from the source code and a simple test.

May I have your opinions if I want to trace a different number of
bytes at the hw_breakpoint address?

Thanks,

Xia Kaixu



Re: [PATCH] perf tools: Remove extra '/' character in events file path

2014-04-27 Thread xiakaixu
On 2014/4/28 8:14, Namhyung Kim wrote:
> Hi xiakaixu,
> 
> (Adding Jiri and Boris to CC)

   OK.
   thanks,
> 
>> The array debugfs_known_mountpoints[] will cause extra '/'
>> character output.
>> Remove it.
>>
>> pre:
>> $ perf probe -l
>> /sys/kernel/debug//tracing/uprobe_events file does not exist -
>> please rebuild kernel with CONFIG_UPROBE_EVENTS.
>>
>> post:
>> $ perf probe -l
>> /sys/kernel/debug/tracing/uprobe_events file does not exist -
>> please rebuild kernel with CONFIG_UPROBE_EVENTS.
> 
> Looks like all of its callers already provide a '/' after the debugfs
> mountpoint, so
> 
>   Acked-by: Namhyung Kim 
> 
> Thanks,
> Namhyung
> 
>>  
>> Signed-off-by: Xia Kaixu 
>> ---
>>  tools/lib/api/fs/debugfs.c | 4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/tools/lib/api/fs/debugfs.c b/tools/lib/api/fs/debugfs.c
>> index 7c43479..a74fba6 100644
>> --- a/tools/lib/api/fs/debugfs.c
>> +++ b/tools/lib/api/fs/debugfs.c
>> @@ -12,8 +12,8 @@
>>  char debugfs_mountpoint[PATH_MAX + 1] = "/sys/kernel/debug";
>>
>>  static const char * const debugfs_known_mountpoints[] = {
>> -"/sys/kernel/debug/",
>> -"/debug/",
>> +"/sys/kernel/debug",
>> +"/debug",
>>  0,
>>  };
>>
>> -- 1.8.5.5
> .
> 




[PATCH] perf tools: Remove extra '/' character in events file path

2014-04-26 Thread xiakaixu
The entries of the debugfs_known_mountpoints[] array end with an
extra '/' character, which shows up as a doubled '/' in the output
path. Remove the trailing '/'.

pre:
$ perf probe -l
/sys/kernel/debug//tracing/uprobe_events file does not exist -
please rebuild kernel with CONFIG_UPROBE_EVENTS.

post:
$ perf probe -l
/sys/kernel/debug/tracing/uprobe_events file does not exist -
please rebuild kernel with CONFIG_UPROBE_EVENTS.

Signed-off-by: Xia Kaixu 
---
 tools/lib/api/fs/debugfs.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/lib/api/fs/debugfs.c b/tools/lib/api/fs/debugfs.c
index 7c43479..a74fba6 100644
--- a/tools/lib/api/fs/debugfs.c
+++ b/tools/lib/api/fs/debugfs.c
@@ -12,8 +12,8 @@
 char debugfs_mountpoint[PATH_MAX + 1] = "/sys/kernel/debug";

 static const char * const debugfs_known_mountpoints[] = {
-   "/sys/kernel/debug/",
-   "/debug/",
+   "/sys/kernel/debug",
+   "/debug",
0,
 };

-- 1.8.5.5



[PATCH] perf tools: Remove extra '/' character in events file path

2014-04-21 Thread xiakaixu
The entries of the debugfs_known_mountpoints[] array end with an
extra '/' character, which shows up as a doubled '/' in the output
path. Remove the trailing '/'.

pre:
$ perf probe -l
/sys/kernel/debug//tracing/uprobe_events file does not exist -
please rebuild kernel with CONFIG_UPROBE_EVENTS.

post:
$ perf probe -l
/sys/kernel/debug/tracing/uprobe_events file does not exist -
please rebuild kernel with CONFIG_UPROBE_EVENTS.

Signed-off-by: Xia Kaixu 
---
 tools/lib/api/fs/debugfs.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/lib/api/fs/debugfs.c b/tools/lib/api/fs/debugfs.c
index 7c43479..a74fba6 100644
--- a/tools/lib/api/fs/debugfs.c
+++ b/tools/lib/api/fs/debugfs.c
@@ -12,8 +12,8 @@
 char debugfs_mountpoint[PATH_MAX + 1] = "/sys/kernel/debug";

 static const char * const debugfs_known_mountpoints[] = {
-   "/sys/kernel/debug/",
-   "/debug/",
+   "/sys/kernel/debug",
+   "/debug",
0,
 };

-- 
1.8.5.5

___
kernel.openeuler mailing list
kernel.openeu...@huawei.com
http://rnd-openeuler.huawei.com/mailman/listinfo/kernel.openeuler






Re: A Bug Inquiry in linux/tools/perf/builtin-record.c

2014-03-02 Thread xiakaixu
On 2014/2/27 10:53, xiakaixu wrote:
> Hi Namhyung,
> 
> On 2014/2/26 16:03, Namhyung Kim wrote:
>> Hi xiakaixu,
>>
>>> On 2014/2/19 9:48, xiakaixu wrote:
>>>> Hi all,
>>>>
>>>> There is a bug found in my work when running "perf record". The basic 
>>>> information
>>>> is here. As we know, perf record is a parent process and the programme 
>>>> traced is
>>>> a child process when running "perf record". Sometimes the child process 
>>>> become
>>>> zombie state and disappear until the parent process is killed. The bug 
>>>> stays in linux/
>>>> tools/perf/builtin-record.c.
>>>> *
>>>> static int __cmd_record(struct perf_record *rec, int argc, const char 
>>>> **argv)
>>>> ..
>>>>if (hits == rec->samples) {
>>>>  if (done)
>>>>  break;
>>>>  err = poll(evsel_list->pollfd, 
>>>> evsel_list->nr_fds, -1);
>>>>  waking++;
>>>>  }
>>>> ..
>>>> *
>>>> The parent process still call the function
>>>> poll(evsel_list->pollfd, evsel_list->nr_fds, -1) when the child process 
>>>> has exited
>>>> already, which caused a zombie process.
>>>>
>>>> May I have your opinion ?
>>>> Waiting for your reply!
>>
>> Do you have a real bug report based on this?

> yeah, of course we have it, I'll be glad to provide it if necessary.

>>
>> AFAIK perf record installed a signal handler for SIGCHLD so it'll set
>> the 'done' variable when child exits and then break the loop.

> yes, you are right. Though the 'done' varible will be set when child exits,
> there is time gap between "if(done)" statement and "poll(...)" function.
> The 'done' variable won't be judge when child exits in this time gap.
> You know this time gap is instruction level, so this bug is small probability.
> My solution is adding a while(...) statement outside poll(...) function.
>>

>> Thanks,
>> Namhyung
>> .
>>
> 




Re: A Bug Inquiry in linux/tools/perf/builtin-record.c

2014-02-26 Thread xiakaixu
Hi Namhyung,

On 2014/2/26 16:03, Namhyung Kim wrote:
> Hi xiakaixu,
> 
>> On 2014/2/19 9:48, xiakaixu wrote:
>>> Hi all,
>>>
>>> There is a bug found in my work when running "perf record". The basic 
>>> information
>>> is here. As we know, perf record is a parent process and the programme 
>>> traced is
>>> a child process when running "perf record". Sometimes the child process 
>>> become
>>> zombie state and disappear until the parent process is killed. The bug 
>>> stays in linux/
>>> tools/perf/builtin-record.c.
>>> *
>>> static int __cmd_record(struct perf_record *rec, int argc, const char 
>>> **argv)
>>> ..
>>> if (hits == rec->samples) {
>>>  if (done)
>>>  break;
>>>  err = poll(evsel_list->pollfd, evsel_list->nr_fds, 
>>> -1);
>>>  waking++;
>>>  }
>>> ..
>>> *
>>> The parent process still call the function
>>> poll(evsel_list->pollfd, evsel_list->nr_fds, -1) when the child process has 
>>> exited
>>> already, which caused a zombie process.
>>>
>>> May I have your opinion ?
>>> Waiting for your reply!
> 
> Do you have a real bug report based on this?
yeah, of course we have it, I'll be glad to provide it if necessary.
> 
> AFAIK perf record installed a signal handler for SIGCHLD so it'll set
> the 'done' variable when child exits and then break the loop.
yes, you are right. Though the 'done' variable will be set when the child
exits, there is a time gap between the "if (done)" check and the "poll(...)"
call. If the child exits inside this gap, 'done' is not checked again before
poll() blocks. The gap is only a few instructions wide, so the bug has a
small probability.
My solution is adding a while(...) statement around the poll(...) call.
>
> Thanks,
> Namhyung
> .
> 




Re: A Bug Inquiry in linux/tools/perf/builtin-record.c

2014-02-24 Thread xiakaixu
On 2014/2/19 9:48, xiakaixu wrote:
> Hi all,
> 
> There is a bug found in my work when running "perf record". The basic 
> information
> is here. As we know, perf record is a parent process and the programme traced 
> is
> a child process when running "perf record". Sometimes the child process become
> zombie state and disappear until the parent process is killed. The bug stays 
> in linux/
> tools/perf/builtin-record.c.
> *
> static int __cmd_record(struct perf_record *rec, int argc, const char **argv)
> ..
>   if (hits == rec->samples) {
>  if (done)
>  break;
>  err = poll(evsel_list->pollfd, evsel_list->nr_fds, 
> -1);
>  waking++;
>  }
> ..
> *
> The parent process still call the function
> poll(evsel_list->pollfd, evsel_list->nr_fds, -1) when the child process has 
> exited
> already, which caused a zombie process.
> 
> May I have your opinion ?
> Waiting for your reply!
> 
> Best Regards
> Kaixu Xia
> 




A Bug Inquiry in linux/tools/perf/builtin-record.c

2014-02-18 Thread xiakaixu
Hi all,

I found a bug in my work when running "perf record". The basic
information is here. As we know, when running "perf record", perf record
is the parent process and the traced programme is a child process.
Sometimes the child process becomes a zombie and does not disappear
until the parent process is killed. The bug is in
linux/tools/perf/builtin-record.c.
*
static int __cmd_record(struct perf_record *rec, int argc, const char **argv)
..
	if (hits == rec->samples) {
		if (done)
			break;
		err = poll(evsel_list->pollfd, evsel_list->nr_fds, -1);
		waking++;
	}
..
*
The parent process still calls
poll(evsel_list->pollfd, evsel_list->nr_fds, -1) when the child process
has exited already, which leaves a zombie process.

May I have your opinion?
Waiting for your reply!

Best Regards
Kaixu Xia


