Re: [PATCH v4 6/7] sched: add function nr_running_cpu to expose number of tasks running on cpu
On Mon, Jul 14, 2014 at 12:50:50PM -0700, Tim Chen wrote:
> There is a generic multi-buffer infrastructure portion that manages
> pulling and queuing jobs on the crypto workqueue, and it is separated
> out in patch 1 of the patchset.

There's one very weird multi-line comment in that patch.

> The other portions are algorithm specific: they define
> algorithm-specific data structures and do the crypto computation for a
> particular algorithm, mostly in assembly and C glue code. The
> infrastructure code is meant to be reused for other similar
> multi-buffer algorithms.

The flushing part that uses the sched thing is sha1 specific, even
though it strikes me as not being so. Flushing buffers on idle seems
like a 'generic' thing.

> We use nr_running_cpu to check whether there are other tasks running
> on the *current* cpu (not for another cpu),

And yet, the function allows you to do exactly that..

> to decide if we should flush and compute the accumulated crypto jobs.
> If there's nobody else running, we can take advantage of available cpu
> cycles on the cpu we are running on to do computation on the existing
> jobs in a SIMD manner. Waiting a bit longer may accumulate more jobs
> to process in parallel in a single SIMD instruction, but will add more
> delay.

So you already have an idle notifier (which is x86 only, we should fix
that I suppose), and you then double check there really isn't anything
else running. How much, if anything, does that second check buy you?
There's just not a single word on that.

Also, there is not a word on the latency vs throughput tradeoff you
make. I can imagine that for very short idle durations you lose, not
win, with this thing. So for now I still see no reason for doing this.

Also, I wonder about SMT: the point of this is to make best use of the
SIMD pipelines, so does it still make sense to use siblings at the same
time, even though you're running hand-crafted ASM to stuff the
pipelines to the brim? Should this thing be SMT aware and not gather
queues for both siblings?
Re: [PATCH v4 6/7] sched: add function nr_running_cpu to expose number of tasks running on cpu
On Tue, Jul 15, 2014 at 11:50:45AM +0200, Peter Zijlstra wrote:
> So you already have an idle notifier (which is x86 only, we should fix
> that I suppose), and you then double check there really isn't anything
> else running.

Note that we've already done a large part of the expense of going idle
by the time we call that idle notifier -- specifically, we've
reprogrammed the clock to stop the tick. It's really wasteful to then
generate work again, which means we have to reprogram the clock yet
again, etc.
Re: [PATCH v4 6/7] sched: add function nr_running_cpu to expose number of tasks running on cpu
On Tue, Jul 15, 2014 at 04:45:25PM +0200, Mike Galbraith wrote:
> 3.0.101-default   3.753363 usecs/loop -- avg 3.770737 530.4 KHz 1.000
> 3.14.10-default   4.145348 usecs/loop -- avg 4.139987 483.1 KHz  .910 1.000
> 3.15.4-default    4.355594 usecs/loop -- avg 4.351961 459.6 KHz  .866  .951 1.000
> 3.16.0-default    4.537279 usecs/loop -- avg 4.543532 440.2 KHz  .829  .911  .957
>
> 3.0.101-smp       3.692377 usecs/loop -- avg 3.690774 541.9 KHz 1.000
> 3.14.10-smp       4.010009 usecs/loop -- avg 4.009019 498.9 KHz  .920
> 3.15.4-smp        3.882398 usecs/loop -- avg 3.884095 514.9 KHz  .950
> 3.16.0-master     4.061003 usecs/loop -- avg 4.058244 492.8 KHz  .909

Urgh,.. I need to go fix that :/
Re: [PATCH v4 6/7] sched: add function nr_running_cpu to expose number of tasks running on cpu
On Tue, 2014-07-15 at 21:03 +0200, Peter Zijlstra wrote:
> On Tue, Jul 15, 2014 at 08:06:55PM +0200, Mike Galbraith wrote:
> > On Tue, 2014-07-15 at 16:53 +0200, Peter Zijlstra wrote:
> > > On Tue, Jul 15, 2014 at 04:45:25PM +0200, Mike Galbraith wrote:
> > > > [...]
> > >
> > > Urgh,.. I need to go fix that :/
> >
> > I'm poking about. It's not just one thing 'course, just lots of
> > change adding up to less than wonderful. Idle changes are costing
> > some, for obese config, avg goop. The select_next_task()
> > reorganization appears to be costing, but eyeballing it, I can see
> > no excuse for that at all.
>
> How is the idle stuff costing? A cpu-affine pipe-test should pretty
> much peg a cpu at 100%, right? Or did I misunderstand, and are you
> running a loose pipe-test?

Exactly, I'm measuring the cost of popping in and out of idle while
trying to reclaim overlap (which doesn't exist in the pipe-test case).

	-Mike
Re: [PATCH v4 6/7] sched: add function nr_running_cpu to expose number of tasks running on cpu
On Tue, 15 Jul 2014, Tim Chen wrote:
> On Tue, 2014-07-15 at 14:59 +0200, Thomas Gleixner wrote:
> > On Tue, 15 Jul 2014, Peter Zijlstra wrote:
> > > On Tue, Jul 15, 2014 at 11:50:45AM +0200, Peter Zijlstra wrote:
> > > > So you already have an idle notifier (which is x86 only, we
> > > > should fix that I suppose), and you then double check there
> > > > really isn't anything else running.
> > >
> > > Note that we've already done a large part of the expense of going
> > > idle by the time we call that idle notifier -- specifically, we've
> > > reprogrammed the clock to stop the tick. It's really wasteful to
> > > then generate work again, which means we have to reprogram the
> > > clock yet again, etc.
> >
> > Doing anything which is not related to idle itself in the idle
> > notifier is just plain wrong.
>
> I don't like kicking the multi-buffer job flush via the idle_notifier
> path either. I'll try another version of the patch that does this in
> the multi-buffer job handler path.
>
> > If that stuff wants to utilize idle slots, we really need to come up
> > with a generic and general solution. Otherwise we'll grow those
> > warts all over the architecture space, with slightly different ways
> > of wrecking the world, and then some.
> >
> > This whole attitude of people thinking that they need their own
> > specialized scheduling around the real scheduler is a PITA. All this
> > stuff is just damaging any sensible approach to power saving, load
> > balancing, etc.
> >
> > What we really want is infrastructure which allows the scheduler to
> > actively query the async work situation and, based on the results,
> > actively decide when to process it and where.
>
> I agree with you. It will be great if we have such infrastructure.

You are heartily invited to come up with that. :)
Re: [PATCH v4 6/7] sched: add function nr_running_cpu to expose number of tasks running on cpu
On Mon, 2014-07-14 at 12:16 +0200, Peter Zijlstra wrote:
> On Fri, Jul 11, 2014 at 01:33:04PM -0700, Tim Chen wrote:
> > This function will help a thread decide if it wants to do work that
> > can be delayed, to accumulate more tasks for more efficient batch
> > processing later. However, if no other tasks are running on the
> > cpu, it can take advantage of the available cpu cycles to complete
> > the tasks for immediate processing to minimize delay; otherwise it
> > will yield.
>
> Ugh.. and ignore topology and everything else. Yet another scheduler
> on top of the scheduler.
>
> We have the padata muck, also only ever used by crypto. We have the
> workqueue nonsense, used all over the place. And we have btrfs doing
> their own padata-like muck. And I'm sure there's at least one more
> out there, just because.
>
> Why do we want yet another thing? I'm inclined to go NAK and get
> people to reduce the amount of async queueing and processing crap.

The multi-buffer class of crypto algorithms is by nature asynchronous.
The algorithm gathers several crypto jobs and puts the buffer from
each job in a data lane of the SIMD register. This allows for parallel
processing and increases throughput. The gathering of the crypto jobs
is an async process, and queuing is necessary for this class of
algorithm.

Tim
Re: [PATCH v4 6/7] sched: add function nr_running_cpu to expose number of tasks running on cpu
On Mon, Jul 14, 2014 at 09:10:14AM -0700, Tim Chen wrote:
> On Mon, 2014-07-14 at 12:16 +0200, Peter Zijlstra wrote:
> > [...]
> >
> > Why do we want yet another thing? I'm inclined to go NAK and get
> > people to reduce the amount of async queueing and processing crap.
>
> The multi-buffer class of crypto algorithms is by nature
> asynchronous. The algorithm gathers several crypto jobs and puts the
> buffer from each job in a data lane of the SIMD register. This allows
> for parallel processing and increases throughput. The gathering of
> the crypto jobs is an async process, and queuing is necessary for
> this class of algorithm.

How is that related to me saying we've got too much of this crap
already?
Re: [PATCH v4 6/7] sched: add function nr_running_cpu to expose number of tasks running on cpu
On Mon, 2014-07-14 at 18:14 +0200, Peter Zijlstra wrote:
> On Mon, Jul 14, 2014 at 09:10:14AM -0700, Tim Chen wrote:
> > [...]
> > The gathering of the crypto jobs is an async process, and queuing
> > is necessary for this class of algorithm.
>
> How is that related to me saying we've got too much of this crap
> already?

I was trying to explain why the algorithm is implemented this way,
because of its batching nature. There is a whole class of async
algorithms that can provide substantial speedup by doing batch
processing, and they use workqueues. The multi-buffer sha1 version has
a 2.2x speedup over the existing AVX2 version, and can have even more
speedup when AVX3 comes around. A workqueue is a natural way to
implement this. I don't think a throughput speedup of 2.2x is crap.

We are not inventing anything new, but asking for a very simple helper
function that tells us whether something else is running on our cpu,
to help us make a better decision about whether we should flush the
batched jobs immediately. Also, the asynchronous crypto interface is
already used substantially in crypto and has a well-established
infrastructure.

Thanks.

Tim
Re: [PATCH v4 6/7] sched: add function nr_running_cpu to expose number of tasks running on cpu
On Sat, 2014-07-12 at 13:25 +0400, Kirill Tkhai wrote:
> > +unsigned long nr_running_cpu(int cpu)
> > +{
> > +	if (cpumask_test_cpu(cpu, cpu_online_mask))
> > +		return cpu_rq(cpu)->nr_running;
> > +	else
> > +		return 0;
> > +}
> > +
>
> Offline cpu should have nr_running equal to 0. We park the last
> enqueued thread (migration_thread) at the end of take_cpu_down(). So
> it's enough to return cpu_rq(cpu)->nr_running.

Thanks. This seems reasonable.

Tim
Re: [PATCH v4 6/7] sched: add function nr_running_cpu to expose number of tasks running on cpu
On Mon, Jul 14, 2014 at 10:05:34AM -0700, Tim Chen wrote:
> I was trying to explain why the algorithm is implemented this way,
> because of its batching nature. There is a whole class of async
> algorithms that can provide substantial speedup by doing batch
> processing, and they use workqueues.
> [...]
> Also, the asynchronous crypto interface is already used substantially
> in crypto and has a well-established infrastructure.

The crap I was talking about is that there's a metric ton of 'async'
interfaces, all different.

Your multi-buffer thing isn't generic either; it seems limited to
sha1.

It does not reuse padata, it does not extend workqueues, it does not
remove the btrfs nonsense, it adds yet another thing.
Re: [PATCH v4 6/7] sched: add function nr_running_cpu to expose number of tasks running on cpu
On Mon, 2014-07-14 at 20:17 +0200, Peter Zijlstra wrote:
> On Mon, Jul 14, 2014 at 10:05:34AM -0700, Tim Chen wrote:
> > [...]
>
> The crap I was talking about is that there's a metric ton of 'async'
> interfaces, all different.

Async interfaces, when used appropriately, actually speed things up
substantially for crypto. We actually have a case with ecryptfs not
using the async crypto interface, causing the cpu to stall and slowing
things down substantially with AES-NI. And the async interface with a
workqueue speeds things up (30% to 35% on encryption with SSD).

http://marc.info/?l=ecryptfs-users&m=136520541407248
http://www.spinics.net/lists/ecryptfs/msg00228.html

> Your multi-buffer thing isn't generic either; it seems limited to
> sha1.

We actually have many other multi-buffer crypto algorithms already
published, for encryption and other IPsec usages. So the multi-buffer
approach is not just limited to SHA1. We hope to port those to the
kernel crypto library eventually.

http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/fast-multi-buffer-ipsec-implementations-ia-processors-paper.pdf

> It does not reuse padata,

padata tries to speed things up by parallelizing jobs to *multiple*
cpus, whereas multi-buffer tries to speed things up by using multiple
data lanes in a SIMD register on a *single* cpu. These two usages are
complementary but not the same.

> it does not extend workqueues,

Why do I need to extend workqueues if the existing ones already meet my
needs?

> it does not remove the btrfs nonsense,

Not much I can do about btrfs, as I don't understand the issues there.

> it adds yet another thing.

Thanks.

Tim
Re: [PATCH v4 6/7] sched: add function nr_running_cpu to expose number of tasks running on cpu
On Mon, Jul 14, 2014 at 12:08:28PM -0700, Tim Chen wrote:
> On Mon, 2014-07-14 at 20:17 +0200, Peter Zijlstra wrote:
> > Your multi-buffer thing isn't generic either; it seems limited to
> > sha1.
>
> We actually have many other multi-buffer crypto algorithms already
> published, for encryption and other IPsec usages. So the multi-buffer
> approach is not just limited to SHA1. We hope to port those to the
> kernel crypto library eventually.
>
> http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/fast-multi-buffer-ipsec-implementations-ia-processors-paper.pdf

That's all nice and such, but the code as I've seen it in these patches
is very much sha1 specific. The mb part isn't separated out.

> > It does not reuse padata,
>
> padata tries to speed things up by parallelizing jobs to *multiple*
> cpus, whereas multi-buffer tries to speed things up by using multiple
> data lanes in a SIMD register on a *single* cpu. These two usages are
> complementary but not the same.

And if it's single cpu, wth do you need that nr_running thing for
another cpu for? Also, this difference wasn't clear to me.

I still loathe all the async work, because it makes a mockery of
accounting etc.. but that's a story for another day I suppose :-(
Re: [PATCH v4 6/7] sched: add function nr_running_cpu to expose number of tasks running on cpu
On Mon, 2014-07-14 at 21:15 +0200, Peter Zijlstra wrote:
> On Mon, Jul 14, 2014 at 12:08:28PM -0700, Tim Chen wrote:
> > [...]
>
> That's all nice and such, but the code as I've seen it in these
> patches is very much sha1 specific. The mb part isn't separated out.

There is a generic multi-buffer infrastructure portion that manages
pulling and queuing jobs on the crypto workqueue, and it is separated
out in patch 1 of the patchset. The other portions are algorithm
specific: they define algorithm-specific data structures and do the
crypto computation for a particular algorithm, mostly in assembly and
C glue code. The infrastructure code is meant to be reused for other
similar multi-buffer algorithms.

> And if it's single cpu, wth do you need that nr_running thing for
> another cpu for?

We use nr_running_cpu to check whether there are other tasks running
on the *current* cpu (not for another cpu), to decide if we should
flush and compute the accumulated crypto jobs. If there's nobody else
running, we can take advantage of available cpu cycles on the cpu we
are running on to do computation on the existing jobs in a SIMD
manner. Waiting a bit longer may accumulate more jobs to process in
parallel in a single SIMD instruction, but will add more delay.

> Also, this difference wasn't clear to me.
>
> I still loathe all the async work, because it makes a mockery of
> accounting etc.. but that's a story for another day I suppose :-(

Thanks.

Tim
Re: [PATCH v4 6/7] sched: add function nr_running_cpu to expose number of tasks running on cpu
On Sat, 2014-07-12 at 07:21 -0700, Tadeusz Struk wrote:
> On 07/11/2014 01:33 PM, Tim Chen wrote:
> > +unsigned long nr_running_cpu(int cpu)
> > +{
> > +	if (cpumask_test_cpu(cpu, cpu_online_mask))
> > +		return cpu_rq(cpu)->nr_running;
> > +	else
> > +		return 0;
> > +}
> > +
>
> EXPORT_SYMBOL?

Yes, thanks.

Tim
Re: [PATCH v4 6/7] sched: add function nr_running_cpu to expose number of tasks running on cpu
On 12.07.2014 00:33, Tim Chen wrote:
> This function will help a thread decide if it wants to do work that
> can be delayed, to accumulate more tasks for more efficient batch
> processing later. However, if no other tasks are running on the cpu,
> it can take advantage of the available cpu cycles to complete the
> tasks for immediate processing to minimize delay; otherwise it will
> yield.
>
> Signed-off-by: Tim Chen <tim.c.c...@linux.intel.com>
> ---
>  include/linux/sched.h | 1 +
>  kernel/sched/core.c   | 8 ++++++++
>  2 files changed, 9 insertions(+)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 7cb07fd..0884250 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -168,6 +168,7 @@ extern int nr_threads;
>  DECLARE_PER_CPU(unsigned long, process_counts);
>  extern int nr_processes(void);
>  extern unsigned long nr_running(void);
> +extern unsigned long nr_running_cpu(int cpu);
>  extern unsigned long nr_iowait(void);
>  extern unsigned long nr_iowait_cpu(int cpu);
>  extern unsigned long this_cpu_load(void);
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 9cae286..d5bb8e6 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2283,6 +2283,14 @@ unsigned long nr_running(void)
>  	return sum;
>  }
>
> +unsigned long nr_running_cpu(int cpu)
> +{
> +	if (cpumask_test_cpu(cpu, cpu_online_mask))
> +		return cpu_rq(cpu)->nr_running;
> +	else
> +		return 0;
> +}
> +

Offline cpu should have nr_running equal to 0. We park the last
enqueued thread (migration_thread) at the end of take_cpu_down(). So
it's enough to return cpu_rq(cpu)->nr_running.

>  unsigned long long nr_context_switches(void)
>  {
>  	int i;

Thanks,
Kirill