[PATCH RESEND v2 STABLE 4.4] futex: fix irq self-deadlock and satisfy assertion
From: Thomas Schoebel-Theuer

This patch and problem analysis is specific for 4.4 LTS, due to incomplete
backporting of other fixes. Later LTS series have different backports.

Since v4.4.257 when CONFIG_PROVE_LOCKING=y the following triggers
right after reboot of our pre-life systems which equal our production setup:

Mar 03 11:27:33 icpu-test-bap10 kernel: =
Mar 03 11:27:33 icpu-test-bap10 kernel: [ INFO: inconsistent lock state ]
Mar 03 11:27:33 icpu-test-bap10 kernel: 4.4.259-rc1-grsec+ #730 Not tainted
Mar 03 11:27:33 icpu-test-bap10 kernel: -
Mar 03 11:27:33 icpu-test-bap10 kernel: inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
Mar 03 11:27:33 icpu-test-bap10 kernel: apache2-ssl/9310 [HC0[0]:SC0[0]:HE1:SE1] takes:
Mar 03 11:27:33 icpu-test-bap10 kernel: (>pi_lock){?.-.-.}, at: [] pi_state_update_owner+0x51/0xd7
Mar 03 11:27:33 icpu-test-bap10 kernel: {IN-HARDIRQ-W} state was registered at:
Mar 03 11:27:33 icpu-test-bap10 kernel: [] __lock_acquire+0x3a7/0xe4a
Mar 03 11:27:33 icpu-test-bap10 kernel: [] lock_acquire+0x18d/0x1bc
Mar 03 11:27:33 icpu-test-bap10 kernel: [] _raw_spin_lock_irqsave+0x3e/0x50
Mar 03 11:27:33 icpu-test-bap10 kernel: [] try_to_wake_up+0x2c/0x210
Mar 03 11:27:33 icpu-test-bap10 kernel: [] default_wake_function+0xd/0xf
Mar 03 11:27:33 icpu-test-bap10 kernel: [] autoremove_wake_function+0x11/0x35
Mar 03 11:27:33 icpu-test-bap10 kernel: [] __wake_up_common+0x48/0x7c
Mar 03 11:27:33 icpu-test-bap10 kernel: [] __wake_up+0x34/0x46
Mar 03 11:27:33 icpu-test-bap10 kernel: [] megasas_complete_int_cmd+0x31/0x33
Mar 03 11:27:33 icpu-test-bap10 kernel: [] megasas_complete_cmd+0x570/0x57b
Mar 03 11:27:33 icpu-test-bap10 kernel: [] complete_cmd_fusion+0x23e/0x33d
Mar 03 11:27:33 icpu-test-bap10 kernel: [] megasas_isr_fusion+0x67/0x74
Mar 03 11:27:33 icpu-test-bap10 kernel: [] handle_irq_event_percpu+0x134/0x311
Mar 03 11:27:33 icpu-test-bap10 kernel: [] handle_irq_event+0x33/0x51
Mar 03 11:27:33 icpu-test-bap10 kernel: [] handle_edge_irq+0xa3/0xc2
Mar 03 11:27:33 icpu-test-bap10 kernel: [] handle_irq+0xf9/0x101
Mar 03 11:27:33 icpu-test-bap10 kernel: [] do_IRQ+0x80/0xf5
Mar 03 11:27:33 icpu-test-bap10 kernel: [] ret_from_intr+0x0/0x20
Mar 03 11:27:33 icpu-test-bap10 kernel: [] arch_cpu_idle+0xa/0xc
Mar 03 11:27:33 icpu-test-bap10 kernel: [] default_idle_call+0x1e/0x20
Mar 03 11:27:33 icpu-test-bap10 kernel: [] cpu_startup_entry+0x141/0x22f
Mar 03 11:27:33 icpu-test-bap10 kernel: [] rest_init+0x135/0x13b
Mar 03 11:27:33 icpu-test-bap10 kernel: [] start_kernel+0x3fa/0x40a
Mar 03 11:27:33 icpu-test-bap10 kernel: [] x86_64_start_reservations+0x2a/0x2c
Mar 03 11:27:33 icpu-test-bap10 kernel: [] x86_64_start_kernel+0x11f/0x12c
Mar 03 11:27:33 icpu-test-bap10 kernel: irq event stamp: 1457
Mar 03 11:27:33 icpu-test-bap10 kernel: hardirqs last enabled at (1457): [] get_user_pages_fast+0xeb/0x14f
Mar 03 11:27:33 icpu-test-bap10 kernel: hardirqs last disabled at (1456): [] get_user_pages_fast+0x5f/0x14f
Mar 03 11:27:33 icpu-test-bap10 kernel: softirqs last enabled at (1446): [] release_sock+0x142/0x14d
Mar 03 11:27:33 icpu-test-bap10 kernel: softirqs last disabled at (1444): [] release_sock+0x34/0x14d
Mar 03 11:27:33 icpu-test-bap10 kernel: other info that might help us debug this:
Mar 03 11:27:33 icpu-test-bap10 kernel: Possible unsafe locking scenario:
Mar 03 11:27:33 icpu-test-bap10 kernel:CPU0
Mar 03 11:27:33 icpu-test-bap10 kernel:
Mar 03 11:27:33 icpu-test-bap10 kernel: lock(>pi_lock);
Mar 03 11:27:33 icpu-test-bap10 kernel:
Mar 03 11:27:33 icpu-test-bap10 kernel: lock(>pi_lock);
Mar 03 11:27:33 icpu-test-bap10 kernel: *** DEADLOCK ***
Mar 03 11:27:33 icpu-test-bap10 kernel: 2 locks held by apache2-ssl/9310:
Mar 03 11:27:33 icpu-test-bap10 kernel: #0: (&(&(__futex_data.queues)[i].lock)->rlock){+.+...}, at: [] do
Mar 03 11:27:33 icpu-test-bap10 kernel: #1: (>wait_lock){+.+...}, at: [] do_futex+0x639/0x809
Mar 03 11:27:33 icpu-test-bap10 kernel: stack backtrace:
Mar 03 11:27:33 icpu-test-bap10 kernel: CPU: 13 PID: 9310 UID: 99 Comm: apache2-ssl Not tainted 4.4.259-rc1-grsec+ #730
Mar 03 11:27:33 icpu-test-bap10 kernel: Hardware name: Dell Inc. PowerEdge R630/02C2CP, BIOS 2.11.0 11/02/2019
Mar 03 11:27:33 icpu-test-bap10 kernel: 883fb79bfc00 816f8fc2 883ffa66d300
Mar 03 11:27:33 icpu-test-bap10 kernel: 8eaa71f0 883fb79bfc50 81088484
Mar 03 11:27:33 icpu-test-bap10 kernel: 0001 0001 0002 883ffa66db58
Mar 03 11:27:33 icpu-test-bap10 kernel: Call Trace:
Mar 03 11:27:33 icpu-test-bap10 kernel: [] dump_stack+0x94/0xca
Mar 03 11:27:33 icpu-test-bap10 ke
[PATCH RESEND v2 STABLE 4.4] futex: fix spin_lock() / spin_unlock_irq() imbalance
From: Thomas Schoebel-Theuer

This patch and problem analysis is specific for 4.4 LTS, due to incomplete
backporting of other fixes. Later LTS series have different backports.

The following is obviously incorrect:

static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this,
                         struct futex_hash_bucket *hb)
{
[...]
	raw_spin_lock(&pi_state->pi_mutex.wait_lock);
[...]
	raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
[...]
}

The 4.4-specific fix should probably go in the direction of b4abf91047c,
making everything irq-safe.

Probably, backporting of b4abf91047c to 4.4 LTS could thus be another good
idea.

However, this might involve some more 4.4-specific work and
require thorough testing:

> git log --oneline v4.4..b4abf91047c -- kernel/futex.c
> kernel/locking/rtmutex.c | wc -l
10

So this patch is just an obvious quickfix for now.

Hint: the lock order is documented in 4.9.y and later. A similar documenting
is missing in 4.4.y. Please somebody either backport also, or write a new
description, if there would be some differences I cannot easily see at the
moment. Without reliable docs, inspection of the locking correctness may
become a pain.

Signed-off-by: Thomas Schoebel-Theuer
Cc: Thomas Gleixner
Cc: Lee Jones
Cc: Greg Kroah-Hartman
Fixes: 394fc498142
Fixes: 6510e4a2d04
---
 kernel/futex.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 70ad21bbb1d5..4a707bc7cceb 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1406,7 +1406,7 @@ static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this,
 	if (pi_state->owner != current)
 		return -EINVAL;
 
-	raw_spin_lock(&pi_state->pi_mutex.wait_lock);
+	raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
 	new_owner = rt_mutex_next_owner(&pi_state->pi_mutex);
 
 	/*
-- 
2.26.2
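The lock/unlock imbalance described in the patch above can be illustrated with a small userspace toy model. This is only a sketch, not kernel code: `struct model_cpu` and the `model_*` helpers are hypothetical names invented for this illustration, and they merely mimic the documented behaviour of `raw_spin_lock()`, `raw_spin_lock_irq()` and `raw_spin_unlock_irq()` with a single "interrupts enabled" flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: one CPU-wide "interrupts enabled" flag and one lock. */
struct model_cpu {
	bool irqs_enabled;
	bool lock_held;
	bool held_with_irqs_on;	/* set if the lock was ever held with irqs on */
};

/* Models raw_spin_lock(): takes the lock, leaves the irq state untouched. */
static void model_lock(struct model_cpu *cpu)
{
	assert(!cpu->lock_held);
	cpu->lock_held = true;
	if (cpu->irqs_enabled)
		cpu->held_with_irqs_on = true;
}

/* Models raw_spin_lock_irq(): disables irqs first, then takes the lock. */
static void model_lock_irq(struct model_cpu *cpu)
{
	cpu->irqs_enabled = false;
	model_lock(cpu);
}

/* Models raw_spin_unlock_irq(): drops the lock, then enables irqs. */
static void model_unlock_irq(struct model_cpu *cpu)
{
	assert(cpu->lock_held);
	cpu->lock_held = false;
	cpu->irqs_enabled = true;
}

/* Returns true if the critical section ever ran with irqs enabled. */
static bool section_ran_with_irqs_on(bool use_irq_lock)
{
	struct model_cpu cpu = { .irqs_enabled = true };

	if (use_irq_lock)
		model_lock_irq(&cpu);
	else
		model_lock(&cpu);
	model_unlock_irq(&cpu);
	return cpu.held_with_irqs_on;
}
```

In this model, the unpaired variant (plain lock, irq-enabling unlock) runs its critical section with interrupts enabled, while the paired variant does not. Whether an interrupt handler can actually contend for a lock taken inside that section depends on the real kernel code paths, which the model does not cover.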
[PATCH STABLE 4.4] futex: fix irq self-deadlock and satisfy assertion
From: Thomas Schoebel-Theuer

Since v4.4.257 when CONFIG_PROVE_LOCKING=y the following triggers
right after reboot of our pre-life systems which equal our production setup:

Mar 03 11:27:33 icpu-test-bap10 kernel: =
Mar 03 11:27:33 icpu-test-bap10 kernel: [ INFO: inconsistent lock state ]
Mar 03 11:27:33 icpu-test-bap10 kernel: 4.4.259-rc1-grsec+ #730 Not tainted
Mar 03 11:27:33 icpu-test-bap10 kernel: -
Mar 03 11:27:33 icpu-test-bap10 kernel: inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
Mar 03 11:27:33 icpu-test-bap10 kernel: apache2-ssl/9310 [HC0[0]:SC0[0]:HE1:SE1] takes:
Mar 03 11:27:33 icpu-test-bap10 kernel: (>pi_lock){?.-.-.}, at: [] pi_state_update_owner+0x51/0xd7
Mar 03 11:27:33 icpu-test-bap10 kernel: {IN-HARDIRQ-W} state was registered at:
Mar 03 11:27:33 icpu-test-bap10 kernel: [] __lock_acquire+0x3a7/0xe4a
Mar 03 11:27:33 icpu-test-bap10 kernel: [] lock_acquire+0x18d/0x1bc
Mar 03 11:27:33 icpu-test-bap10 kernel: [] _raw_spin_lock_irqsave+0x3e/0x50
Mar 03 11:27:33 icpu-test-bap10 kernel: [] try_to_wake_up+0x2c/0x210
Mar 03 11:27:33 icpu-test-bap10 kernel: [] default_wake_function+0xd/0xf
Mar 03 11:27:33 icpu-test-bap10 kernel: [] autoremove_wake_function+0x11/0x35
Mar 03 11:27:33 icpu-test-bap10 kernel: [] __wake_up_common+0x48/0x7c
Mar 03 11:27:33 icpu-test-bap10 kernel: [] __wake_up+0x34/0x46
Mar 03 11:27:33 icpu-test-bap10 kernel: [] megasas_complete_int_cmd+0x31/0x33
Mar 03 11:27:33 icpu-test-bap10 kernel: [] megasas_complete_cmd+0x570/0x57b
Mar 03 11:27:33 icpu-test-bap10 kernel: [] complete_cmd_fusion+0x23e/0x33d
Mar 03 11:27:33 icpu-test-bap10 kernel: [] megasas_isr_fusion+0x67/0x74
Mar 03 11:27:33 icpu-test-bap10 kernel: [] handle_irq_event_percpu+0x134/0x311
Mar 03 11:27:33 icpu-test-bap10 kernel: [] handle_irq_event+0x33/0x51
Mar 03 11:27:33 icpu-test-bap10 kernel: [] handle_edge_irq+0xa3/0xc2
Mar 03 11:27:33 icpu-test-bap10 kernel: [] handle_irq+0xf9/0x101
Mar 03 11:27:33 icpu-test-bap10 kernel: [] do_IRQ+0x80/0xf5
Mar 03 11:27:33 icpu-test-bap10 kernel: [] ret_from_intr+0x0/0x20
Mar 03 11:27:33 icpu-test-bap10 kernel: [] arch_cpu_idle+0xa/0xc
Mar 03 11:27:33 icpu-test-bap10 kernel: [] default_idle_call+0x1e/0x20
Mar 03 11:27:33 icpu-test-bap10 kernel: [] cpu_startup_entry+0x141/0x22f
Mar 03 11:27:33 icpu-test-bap10 kernel: [] rest_init+0x135/0x13b
Mar 03 11:27:33 icpu-test-bap10 kernel: [] start_kernel+0x3fa/0x40a
Mar 03 11:27:33 icpu-test-bap10 kernel: [] x86_64_start_reservations+0x2a/0x2c
Mar 03 11:27:33 icpu-test-bap10 kernel: [] x86_64_start_kernel+0x11f/0x12c
Mar 03 11:27:33 icpu-test-bap10 kernel: irq event stamp: 1457
Mar 03 11:27:33 icpu-test-bap10 kernel: hardirqs last enabled at (1457): [] get_user_pages_fast+0xeb/0x14f
Mar 03 11:27:33 icpu-test-bap10 kernel: hardirqs last disabled at (1456): [] get_user_pages_fast+0x5f/0x14f
Mar 03 11:27:33 icpu-test-bap10 kernel: softirqs last enabled at (1446): [] release_sock+0x142/0x14d
Mar 03 11:27:33 icpu-test-bap10 kernel: softirqs last disabled at (1444): [] release_sock+0x34/0x14d
Mar 03 11:27:33 icpu-test-bap10 kernel: other info that might help us debug this:
Mar 03 11:27:33 icpu-test-bap10 kernel: Possible unsafe locking scenario:
Mar 03 11:27:33 icpu-test-bap10 kernel:CPU0
Mar 03 11:27:33 icpu-test-bap10 kernel:
Mar 03 11:27:33 icpu-test-bap10 kernel: lock(>pi_lock);
Mar 03 11:27:33 icpu-test-bap10 kernel:
Mar 03 11:27:33 icpu-test-bap10 kernel: lock(>pi_lock);
Mar 03 11:27:33 icpu-test-bap10 kernel: *** DEADLOCK ***
Mar 03 11:27:33 icpu-test-bap10 kernel: 2 locks held by apache2-ssl/9310:
Mar 03 11:27:33 icpu-test-bap10 kernel: #0: (&(&(__futex_data.queues)[i].lock)->rlock){+.+...}, at: [] do
Mar 03 11:27:33 icpu-test-bap10 kernel: #1: (>wait_lock){+.+...}, at: [] do_futex+0x639/0x809
Mar 03 11:27:33 icpu-test-bap10 kernel: stack backtrace:
Mar 03 11:27:33 icpu-test-bap10 kernel: CPU: 13 PID: 9310 UID: 99 Comm: apache2-ssl Not tainted 4.4.259-rc1-grsec+ #730
Mar 03 11:27:33 icpu-test-bap10 kernel: Hardware name: Dell Inc. PowerEdge R630/02C2CP, BIOS 2.11.0 11/02/2019
Mar 03 11:27:33 icpu-test-bap10 kernel: 883fb79bfc00 816f8fc2 883ffa66d300
Mar 03 11:27:33 icpu-test-bap10 kernel: 8eaa71f0 883fb79bfc50 81088484
Mar 03 11:27:33 icpu-test-bap10 kernel: 0001 0001 0002 883ffa66db58
Mar 03 11:27:33 icpu-test-bap10 kernel: Call Trace:
Mar 03 11:27:33 icpu-test-bap10 kernel: [] dump_stack+0x94/0xca
Mar 03 11:27:33 icpu-test-bap10 kernel: [] print_usage_bug+0x1bc/0x1d1
Mar 03 11:27:33 icpu-test-bap10 kernel: [] ? check_usage_forwards+0x98/0x98
Mar 03 11:27:33 icpu-test-ba
[PATCH STABLE 4.4] futex: fix spin_lock() / spin_unlock_irq() imbalance
From: Thomas Schoebel-Theuer

The following is obviously incorrect:

static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this,
                         struct futex_hash_bucket *hb)
{
[...]
	raw_spin_lock(&pi_state->pi_mutex.wait_lock);
[...]
	raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
[...]
}

The 4.4-specific fix should probably go into the direction of b4abf91047c.

Probably, backporting of b4abf91047c to 4.4 LTS could be another good idea.

However, this might involve some more 4.4-specific work and
require thorough testing:

> git log --oneline v4.4..b4abf91047c -- kernel/futex.c
> kernel/locking/rtmutex.c | wc -l
10

So this patch is just an obvious quickfix for now.

Signed-off-by: Thomas Schoebel-Theuer
Cc: Thomas Gleixner
Cc: Lee Jones
Cc: Greg Kroah-Hartman
Fixes: 394fc498142
Fixes: 6510e4a2d04
---
 kernel/futex.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 70ad21bbb1d5..4a707bc7cceb 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1406,7 +1406,7 @@ static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this,
 	if (pi_state->owner != current)
 		return -EINVAL;
 
-	raw_spin_lock(&pi_state->pi_mutex.wait_lock);
+	raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
 	new_owner = rt_mutex_next_owner(&pi_state->pi_mutex);
 
 	/*
-- 
2.26.2
[PATCH] sched/wait: fix endless kthread loop at timeout
From: Thomas Schoebel-Theuer

Scenario, possible since kernel 4.11.x and later:

1) kthread calls a waiting function with a timeout, and blocks.
2) kthread_stop() is called by somebody else.
3) The waiting condition does not change for a long time.
4) Nothing happens => normally the timeout would be reached by the kthread.

However, the && in wait_woken() now prevents any call to schedule_timeout().
As a consequence, the timeout value will never be decreased, resulting not
only in never reaching the timeout, but also in an endless loop, burning
the CPU in kernel mode.

This fix ensures the following semantics: kthread_should_stop() is treated
as equivalent to a timeout. This is beneficial because most users do not
want to wait for the timeout, but to stop the kthread as soon as possible.

It appears that these semantics were probably intended (otherwise the check
is_kthread_should_stop() would not make much sense), but just went wrong
due to the bug.

Here is an example, triggered by external kernel module MARS on a production
kernel. However, the problem can be triggered by other kthreads and on newer
kernels, and also in very different scenarios, not only during tcp_recvmsg().

In the following example, the kthread simply waits for network packets
to arrive, but in the test scenario the network had been blocked
underneath by a firewall rule in order to trigger the bug:

Mar 08 07:40:08 icpu5133 kernel: watchdog: BUG: soft lockup - CPU#29 stuck for 23s! [mars_receiver8.:8139]
Mar 08 07:40:08 icpu5133 kernel: Modules linked in: mars(-) ip6table_mangle ip6table_raw iptable_raw ip_set_bitmap_port xt_DSCP xt_multiport ip_set_hash_ip xt_own
Mar 08 07:40:08 icpu5133 kernel: irq event stamp: 300719885
Mar 08 07:40:08 icpu5133 kernel: hardirqs last enabled at (300719883): [] _raw_spin_unlock_irqrestore+0x3d/0x4f
Mar 08 07:40:08 icpu5133 kernel: hardirqs last disabled at (300719885): [] apic_timer_interrupt+0x82/0x90
Mar 08 07:40:08 icpu5133 kernel: softirqs last enabled at (300719878): [] lock_sock_nested+0x50/0x98
Mar 08 07:40:08 icpu5133 kernel: softirqs last disabled at (300719884): [] release_sock+0x16/0xda
Mar 08 07:40:08 icpu5133 kernel: CPU: 29 PID: 8139 Comm: mars_receiver8. Not tainted 4.14.104+ #121
Mar 08 07:40:08 icpu5133 kernel: Hardware name: Dell Inc. PowerEdge R630/02C2CP, BIOS 2.5.5 08/16/2017
Mar 08 07:40:08 icpu5133 kernel: task: 88bf82764fc0 task.stack: c9001243
Mar 08 07:40:08 icpu5133 kernel: RIP: 0010:arch_local_irq_restore+0x2/0x8
Mar 08 07:40:08 icpu5133 kernel: RSP: 0018:c90012433b78 EFLAGS: 0246 ORIG_RAX: ff10
Mar 08 07:40:08 icpu5133 kernel: RAX: RBX: 88bf82764fc0 RCX: fec792b4
Mar 08 07:40:08 icpu5133 kernel: RDX: c18b50d3 RSI: RDI: 0246
Mar 08 07:40:08 icpu5133 kernel: RBP: 0001 R08: 0001 R09:
Mar 08 07:40:08 icpu5133 kernel: R10: c90012433b08 R11: c90012433ba8 R12: 0246
Mar 08 07:40:08 icpu5133 kernel: R13: 819df735 R14: 0001 R15: 88bf82765818
Mar 08 07:40:08 icpu5133 kernel: FS: () GS:88c05fb8() knlGS:
Mar 08 07:40:08 icpu5133 kernel: CS: 0010 DS: ES: CR0: 80050033
Mar 08 07:40:08 icpu5133 kernel: CR2: 55abd12eb688 CR3: 0241e006 CR4: 001606e0
Mar 08 07:40:08 icpu5133 kernel: Call Trace:
Mar 08 07:40:08 icpu5133 kernel: lock_release+0x32f/0x33b
Mar 08 07:40:08 icpu5133 kernel: release_sock+0x90/0xda
Mar 08 07:40:08 icpu5133 kernel: sk_wait_data+0x7f/0x13f
Mar 08 07:40:08 icpu5133 kernel: ? prepare_to_wait_exclusive+0xc1/0xc1
Mar 08 07:40:08 icpu5133 kernel: tcp_recvmsg+0x4e6/0x91a
Mar 08 07:40:08 icpu5133 kernel: ? flush_signals+0x2b/0x6a
Mar 08 07:40:08 icpu5133 kernel: ? lock_acquire+0x20a/0x25a
Mar 08 07:40:08 icpu5133 kernel: inet_recvmsg+0x8d/0xc0
Mar 08 07:40:08 icpu5133 kernel: kernel_recvmsg+0x8f/0xaa
Mar 08 07:40:08 icpu5133 kernel: ? ___might_sleep+0xf2/0x256
Mar 08 07:40:08 icpu5133 kernel: mars_recv_raw+0x22a/0x4da [mars]
Mar 08 07:40:08 icpu5133 kernel: desc_recv_struct+0x40/0x375 [mars]
Mar 08 07:40:08 icpu5133 kernel: receiver_thread+0xa2/0x61a [mars]
Mar 08 07:40:08 icpu5133 kernel: ? _hash_insert+0x160/0x160 [mars]
Mar 08 07:40:08 icpu5133 kernel: ? kthread+0x1a6/0x1ae
Mar 08 07:40:08 icpu5133 kernel: kthread+0x1a6/0x1ae
Mar 08 07:40:08 icpu5133 kernel: ? __list_del_entry+0x60/0x60
Mar 08 07:40:08 icpu5133 kernel: ret_from_fork+0x3a/0x50
Mar 08 07:40:08 icpu5133 kernel: Code: ee e8 c5 17 00 00 48 85 db 75 0e 31 f6 48 c7 c7 c0 5f 53 82 e8 68 b9 58 00 48 89 5b 58 58 5b 5d c3 9c 58 0f 1f 44 00 00 c3

Signed-off-by: Thomas Schoebel-Theuer
---
 kernel/sched/wait.c | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index c1e566a114ca..08f121154a91 100644
--- a/kernel/sched/wait.c
+
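The endless loop described in the scenario above can be reproduced in a userspace toy model. This is a hedged sketch only: the diff of the actual patch is cut off in this posting, so the `fixed` branch below is an assumption derived from the stated semantics ("kthread_should_stop() is treated as equivalent to a timeout"), and all `model_*` names are hypothetical stand-ins, not kernel functions.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for schedule_timeout(): sleeps the whole timeout, returns 0. */
static long model_schedule_timeout(long timeout)
{
	(void)timeout;
	return 0;
}

/*
 * Toy model of the decision described above.
 * buggy: the sleep is skipped when a stop is requested, and the timeout
 *        is handed back unchanged.
 * fixed: a requested stop is reported like an expired timeout (assumed).
 */
static long model_wait_woken(long timeout, bool woken, bool should_stop,
			     bool fixed)
{
	if (fixed && should_stop)
		return 0;
	if (!woken && !should_stop)
		timeout = model_schedule_timeout(timeout);
	return timeout;
}

/*
 * Caller loop in the style of a socket wait: retry until the timeout is
 * used up. Returns the number of iterations, capped at max_iter so that
 * the model itself cannot spin forever.
 */
static int model_caller_loop(bool should_stop, bool fixed, int max_iter)
{
	long timeout = 100;
	int iter = 0;

	assert(max_iter > 0);
	while (timeout > 0 && iter < max_iter) {
		timeout = model_wait_woken(timeout, false, should_stop, fixed);
		iter++;
	}
	return iter;
}
```

With a stop requested, the buggy variant only leaves the loop because of the iteration cap, which corresponds to the soft lockup in the log; the assumed fixed variant leaves after a single pass.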
Re: Can we drop upstream Linux x32 support?
On 12/14/18 22:41, Thomas Schöbel-Theuer wrote:
On 12/14/18 22:24, Andy Lutomirski wrote:

I'm talking about x32, which is a different beast.

So from my viewpoint the mentioned roadmap / timing requirements will remain
the same, whatever you are dropping. Enterprise-critical use cases will
probably need to be migrated to KVM/qemu together with their old kernel
versions, anyway (because the original hardware will be no longer available
in a few decades).

Here is a systematic approach to the problem.

AFAICS legacy 32bit userspace code (which exists in some notable masses) can
be executed at least in the following ways:

1) natively on 32bit-capable hardware, under 32bit kernels. Besides legacy
   hardware, this also encompasses most current Intel / AMD 64bit hardware
   in 32bit compatibility mode.

2) under 64bit kernels, using the 32bit compat layer from practically any
   kernel version.

3) under KVM/qemu.

When you just drop 1), users have a fair chance by migrating to any of the
other two possibilities. As explained, a time frame of ~5 years should work
for the vast majority.

If you clearly explain the migration paths to your users (and to the press),
I think it will be acceptable.

[side note: I know of a single legacy instance which is now ~20 years old,
but makes a revenue of several millions per month. These guys have large
quantities of legacy hardware in stock. And they have enough money to hire a
downstream maintainer in case of emergency.]

Fatal problems would only arise if you would drop all three possibilities in
the very long term.

In ~100 years, possibility 3) should be sufficient for handling use cases
like preservation of historic documents. The latter is roughly equivalent to
running binary-only MSDOS, Windows NT, and similar, even in 100 years, and
even non-natively under future hardware architectures.
Re: [PATCH] acpi / apei: fix NULL deref during init
On 12/14/18 21:24, Borislav Petkov wrote:

Because apei_resources_fini() happens under the same condition check and if
arch_apei_filter_addr was false, it should not become true, all of a sudden.
Or?

Hi Borislav,

please take a look at the stack trace. For some reason, and only on that
specific hardware, the condition is false there, but later the indicated
error exit is taken, whose message you can see immediately before the stack
trace.

So this should document the one observed case where the NULL deref is
actually happening.

Of course, it would be possible to develop another solution, but this one
appears the simplest and safest to me (minimum changes to the logic).

I have tested the patch on that specific hardware: I have verified that the
NULL deref no longer triggers with the patch applied.

Of course, on any other hardware we have tested, the bug did not trigger at
all.

If you don't have that specific hardware, you probably cannot easily trigger
/ verify the problem. If you need access to the specific hardware, talk to me
in a private conversation.

Cheers,
Thomas
[PATCH] acpi / apei: fix NULL deref during init
Since commit d91525eb8ee6 ("ACPI, EINJ: Enhance error injection tolerance
level"), starting with kernel 4.0, the following happens during boot on
specific old hardware:

APEI: Can not request [mem 0x0009c2f2-0x0009c2fc] for APEI ERST registers
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [] __list_del_entry+0x5c/0x98
PGD 0
Oops: [#1] SMP
Modules linked in:
CPU: 0 PID: 1 UID: 0 Comm: swapper/0 Not tainted 4.4.0-ui18344.004-uiabi1-infong-amd64 #1
Hardware name: IBM IBM eServer BladeCenter HS12 -[8028Z5S]-/Server Blade, BIOS -[N1E150AUS-1.11]- 11/04/2010
task: 88021fe4e040 ti: 88021fe7c000 task.ti: 88021fe7c000
RIP: 0010:[] [] __list_del_entry+0x5c/0x98
RSP: :88021fe7fd18 EFLAGS: 00010207
RAX: RBX: 88021fe7fde0 RCX: 88021fe7fde0
RDX: 819bd040 RSI: dead0200 RDI: 88021fe7fde0
RBP: 88021fe7fd18 R08: R09:
R10: 816ce240 R11: 0001 R12: 819bd040
R13: 88021fe7fda0 R14: 88021d2cd840 R15:
FS: () GS:88022fc0() knlGS:
CS: 0010 DS: ES: CR0: 80050033
CR2: CR3: 019b6000 CR4: 00040670
Stack: 88021fe7fd30 81343dd7 88021fe7fde0 88021fe7fd58 813931c0 88021fe7fda0 88021fe7fe00 88021d2cd840 88021fe7fd70 813931e5 ffea 88021fe7fdf0
Call Trace:
[] list_del+0xd/0x25
[] apei_res_clean+0x1f/0x37
[] apei_resources_fini+0xd/0x19
[] apei_resources_request+0x24f/0x268
[] ? apei_exec_for_each_entry+0x77/0x8e
[] ? setup_erst_disable+0x12/0x12
[] erst_init+0xed/0x2ca
[] ? do_one_initcall+0x8c/0x174
[] ? setup_erst_disable+0x12/0x12
[] ? setup_erst_disable+0x12/0x12
[] do_one_initcall+0xe9/0x174
[] ? parse_args+0x161/0x296
[] kernel_init_freeable+0x169/0x1f6
[] ? do_early_param+0x88/0x88
[] ? rest_init+0x79/0x79
[] kernel_init+0x9/0xd5
[] ret_from_fork+0x55/0x80
[] ? rest_init+0x79/0x79
Code: 02 00 00 00 00 ad de 48 39 f0 75 1f 49 89 c0 48 c7 c2 38 de 8e 81 be 38 00 00 00 48 c7 c7 13 dd 8e 81 31 c0 e8 94 36 d0 ff eb 3a <48> 8b 30 48 39 fe 74 11 49 89 f0 48 c7 c2 6c de 8e 81 be 3b 00
RIP [] __list_del_entry+0x5c/0x98
RSP
CR2:
---[ end trace 3610e544cef27e81 ]---
Kernel panic - not syncing: Attempted to kill init! exitcode=0x0009

The reason is a conditional initialization of the variable arch_res, which
happens only under a specific precondition. When the condition is false, the
variable remains uninitialized. This may later trigger a splat, e.g. when
some error path is taken.

Solution: do the initialization unconditionally. This also acts as a
safeguard.

Fixes: d91525eb8ee6a622ce476955fe1a2530ade87c83
Signed-off-by: Thomas Schoebel-Theuer
---
 drivers/acpi/apei/apei-base.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/acpi/apei/apei-base.c b/drivers/acpi/apei/apei-base.c
index da370e1d31f4..ef931b8a0b11 100644
--- a/drivers/acpi/apei/apei-base.c
+++ b/drivers/acpi/apei/apei-base.c
@@ -494,8 +494,8 @@ int apei_resources_request(struct apei_resources *resources,
 	if (rc)
 		goto nvs_res_fini;
 
+	apei_resources_init(&arch_res);
 	if (arch_apei_filter_addr) {
-		apei_resources_init(&arch_res);
 		rc = apei_get_arch_resources(&arch_res);
 		if (rc)
 			goto arch_res_fini;
-- 
2.12.3
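The effect of moving the initialization out of the conditional can be sketched in plain userspace C. This is a simplified model under stated assumptions, not the APEI code: `struct node`, `struct resources` and the functions below are hypothetical, a NULL-filled list head stands in for uninitialized stack contents, and the model assumes (as the stack trace suggests) that an error path cleans up the arch resources even when the filter check was false.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal circular list head, in the spirit of the kernel's list_head. */
struct node {
	struct node *next, *prev;
};

struct resources {
	struct node head;
};

/* An initialized, empty list points back at its own head. */
static void resources_init(struct resources *r)
{
	assert(r != NULL);
	r->head.next = &r->head;
	r->head.prev = &r->head;
}

/* A cleanup routine may only walk the list if the head was initialized. */
static bool resources_fini_is_safe(const struct resources *r)
{
	return r->head.next != NULL && r->head.prev != NULL;
}

/*
 * Models the two orderings. filter_present plays the role of the
 * arch_apei_filter_addr check; the return value says whether a later
 * error path could safely clean up the arch resources.
 */
static bool error_path_is_safe(bool filter_present, bool init_unconditionally)
{
	/* NULL pointers stand in for uninitialized stack contents. */
	struct resources arch = { { NULL, NULL } };

	if (init_unconditionally)
		resources_init(&arch);
	if (filter_present) {
		if (!init_unconditionally)
			resources_init(&arch);
		/* arch-specific resources would be collected here */
	}
	return resources_fini_is_safe(&arch);
}
```

Only the combination "filter absent, conditional initialization" leaves the list head untouched in this model; with the unconditional initialization, every combination is safe to clean up.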
Re: [RFC 00/32] State of MARS Reo-Redundancy Module
Typo correction:

On 12/30/2016 11:57 PM, Thomas Schoebel-Theuer wrote:

standalone servers with local hardware RAIDs. They are hosting about 500
MARS resources (originally DRBD resources) just for the web servers;

This must read 2500. Somehow the leading "2" was eaten at wraparound.
[RFC 28/32] mars: add new module mars_proc
Signed-off-by: Thomas Schoebel-Theuer <t...@schoebel-theuer.de>
---
 drivers/staging/mars/mars/mars_proc.c | 389 ++
 drivers/staging/mars/mars/mars_proc.h | 34 +++
 2 files changed, 423 insertions(+)
 create mode 100644 drivers/staging/mars/mars/mars_proc.c
 create mode 100644 drivers/staging/mars/mars/mars_proc.h

diff --git a/drivers/staging/mars/mars/mars_proc.c b/drivers/staging/mars/mars/mars_proc.c
new file mode 100644
index ..84b4dfc82211
--- /dev/null
+++ b/drivers/staging/mars/mars/mars_proc.c
@@ -0,0 +1,389 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include
+#include
+#include
+
+#include
+#include
+
+#include "strategy.h"
+#include "mars_proc.h"
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+xio_info_fn xio_info;
+
+static
+int trigger_sysctl_handler(
+	struct ctl_table *table,
+	int write,
+	void __user *buffer,
+	size_t *length,
+	loff_t *ppos)
+{
+	ssize_t res = 0;
+	size_t len = *length;
+
+	XIO_DBG("write = %d len = %ld pos = %lld\n", write, len, *ppos);
+
+	if (!len || *ppos > 0)
+		goto done;
+
+	if (write) {
+		char tmp[8] = {};
+
+		res = len; /* fake consumption of all data */
+
+		if (len > 7)
+			len = 7;
+		if (!copy_from_user(tmp, buffer, len)) {
+			int code = 0;
+			int status = kstrtoint(tmp, 10, &code);
+
+			/* the return value from ssanf() does not matter */
+			(void)status;
+			if (code > 0)
+				local_trigger();
+			if (code > 1)
+				remote_trigger();
+		}
+	} else {
+		char *answer = "MARS module not operational\n";
+		char *tmp = NULL;
+		int mylen;
+
+		if (xio_info) {
+			answer = "internal error while determining xio_info\n";
+			tmp = xio_info();
+			if (tmp)
+				answer = tmp;
+		}
+
+		mylen = strlen(answer);
+		if (len > mylen)
+			len = mylen;
+		res = len;
+		if (copy_to_user(buffer, answer, len)) {
+			XIO_ERR("write %ld bytes at %p failed\n", len, buffer);
+			res = -EFAULT;
+		}
+		brick_string_free(tmp);
+	}
+
+done:
+	XIO_DBG("res = %ld\n", res);
+	*length = res;
+	if (res >= 0) {
+		*ppos += res;
+		return 0;
+	}
+	return res;
+}
+
+static
+int lamport_sysctl_handler(
+	struct ctl_table *table,
+	int write,
+	void __user *buffer,
+	size_t *length,
+	loff_t *ppos)
+{
+	ssize_t res = 0;
+	size_t len = *length;
+	int my_len = 128;
+	char *tmp = brick_string_alloc(my_len);
+	struct timespec know = CURRENT_TIME;
+	struct timespec lnow;
+
+	XIO_DBG("write = %d len = %ld pos = %lld\n", write, len, *ppos);
+
+	if (!len || *ppos > 0)
+		goto done;
+
+	if (write)
+		return -EINVAL;
+
+	get_lamport(&lnow);
+
+	res = scnprintf(
+		tmp,
+		my_len,
+		"CURRENT_TIME=%ld.%09ld\nlamport_now=%ld.%09ld\n",
+		know.tv_sec, know.tv_nsec,
+		lnow.tv_sec, lnow.tv_nsec
+	);
+
+	if (copy_to_user(buffer, tmp, res)) {
+		XIO_ERR("write %ld bytes at %p failed\n", res, buffer);
+		res = -EFAULT;
+	}
+	brick_string_free(tmp);
+
+done:
+	XIO_DBG("res = %ld\n", res);
+	*length = res;
+	if (res >= 0) {
+		*ppos += res;
+		return 0;
+	}
+	return res;
+}
+
+#ifdef CTL_UNNUMBERED
+#define _CTL_NAME .ctl_name = CTL_UNNUMBERED,
+#define _CTL_STRATEGY(handler) .strategy = &handler,
+#else
[RFC 02/32] mars: add new module brick_say
Signed-off-by: Thomas Schoebel-Theuer <t...@schoebel-theuer.de>
---
 drivers/staging/mars/brick_say.c | 920 +++
 include/linux/brick/brick_say.h | 89
 2 files changed, 1009 insertions(+)
 create mode 100644 drivers/staging/mars/brick_say.c
 create mode 100644 include/linux/brick/brick_say.h

diff --git a/drivers/staging/mars/brick_say.c b/drivers/staging/mars/brick_say.c
new file mode 100644
index ..f3bb49a0dfc3
--- /dev/null
+++ b/drivers/staging/mars/brick_say.c
@@ -0,0 +1,920 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include
+#include
+#include
+
+#include
+#include
+
+/***/
+
+/* messaging */
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#include
+
+#include
+
+#ifndef GFP_BRICK
+#define GFP_BRICK GFP_NOIO
+#endif
+
+#define SAY_ORDER 0
+#define SAY_BUFMAX (PAGE_SIZE << SAY_ORDER)
+#define SAY_BUF_LIMIT (SAY_BUFMAX - 1500)
+#define MAX_FILELEN	16
+#define MAX_IDS	1000
+
+const char *say_class[MAX_SAY_CLASS] = {
+	[SAY_DEBUG] = "debug",
+	[SAY_INFO] = "info",
+	[SAY_WARN] = "warn",
+	[SAY_ERROR] = "error",
+	[SAY_FATAL] = "fatal",
+	[SAY_TOTAL] = "total",
+};
+
+int brick_say_logging = 1;
+
+module_param_named(say_logging, brick_say_logging, int, 0);
+int brick_say_debug;
+
+module_param_named(say_debug, brick_say_debug, int, 0);
+
+int brick_say_syslog_min = 1;
+int brick_say_syslog_max = -1;
+int brick_say_syslog_flood_class = 3;
+int brick_say_syslog_flood_limit = 20;
+int brick_say_syslog_flood_recovery = 300;
+
+int delay_say_on_overflow =
+#ifdef CONFIG_MARS_DEBUG
+	1;
+#else
+	0;
+#endif
+
+static atomic_t say_alloc_channels = ATOMIC_INIT(0);
+static atomic_t say_alloc_names = ATOMIC_INIT(0);
+static atomic_t say_alloc_pages = ATOMIC_INIT(0);
+
+static unsigned long flood_start_jiffies;
+static int flood_count;
+
+struct say_channel {
+	char *ch_name;
+	struct say_channel *ch_next;
+
+	/* protect against concurrent writes */
+	spinlock_t ch_lock[MAX_SAY_CLASS];
+	char *ch_buf[MAX_SAY_CLASS][2];
+
+	short ch_index[MAX_SAY_CLASS];
+	struct file *ch_filp[MAX_SAY_CLASS][2];
+	int ch_overflow[MAX_SAY_CLASS];
+	bool ch_written[MAX_SAY_CLASS];
+	bool ch_rollover;
+	bool ch_must_exist;
+	bool ch_is_dir;
+	bool ch_delete;
+	int ch_status_written;
+	int ch_id_max;
+	void *ch_ids[MAX_IDS];
+
+	wait_queue_head_t ch_progress;
+};
+
+struct say_channel *default_channel;
+
+static struct say_channel *channel_list;
+
+static rwlock_t say_lock = __RW_LOCK_UNLOCKED(say_lock);
+
+static struct task_struct *say_thread;
+
+static DECLARE_WAIT_QUEUE_HEAD(say_event);
+
+bool say_dirty;
+
+#define use_atomic() \
+	((preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK | HARDIRQ_MASK | NMI_MASK)) != 0 || irqs_disabled())
+
+static
+void wait_channel(struct say_channel *ch, int class)
+{
+	if (delay_say_on_overflow && ch->ch_index[class] > SAY_BUF_LIMIT) {
+		if (!use_atomic()) {
+			say_dirty = true;
+			wake_up_interruptible(&say_event);
+			wait_event_interruptible_timeout(
+				ch->ch_progress, ch->ch_index[class] < SAY_BUF_LIMIT, HZ / 10);
+		}
+	}
+}
+
+static
+struct say_channel *find_channel(const void *id)
+{
+	struct say_channel *res = default_channel;
+	struct say_channel *ch;
+
+	read_lock(&say_lock);
+	for (ch = channel_list; ch; ch = ch->ch_next) {
+		int i;
+
+		for (i = 0; i < ch->ch_id_max; i++) {
+			if (ch->ch_ids[i] == id) {
+				res = ch;
+				goto found;
+			}
+		}
+	}
+found:
+	read_unlock(&say_lock);
+	return res;
+}
+
+static
+void _remove_binding(struct task_str
[RFC 28/32] mars: add new module mars_proc
Signed-off-by: Thomas Schoebel-Theuer
---
 drivers/staging/mars/mars/mars_proc.c | 389 ++
 drivers/staging/mars/mars/mars_proc.h |  34 +++
 2 files changed, 423 insertions(+)
 create mode 100644 drivers/staging/mars/mars/mars_proc.c
 create mode 100644 drivers/staging/mars/mars/mars_proc.h

diff --git a/drivers/staging/mars/mars/mars_proc.c b/drivers/staging/mars/mars/mars_proc.c
new file mode 100644
index ..84b4dfc82211
--- /dev/null
+++ b/drivers/staging/mars/mars/mars_proc.c
@@ -0,0 +1,389 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include
+#include
+#include
+
+#include
+#include
+
+#include "strategy.h"
+#include "mars_proc.h"
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+xio_info_fn xio_info;
+
+static
+int trigger_sysctl_handler(
+	struct ctl_table *table,
+	int write,
+	void __user *buffer,
+	size_t *length,
+	loff_t *ppos)
+{
+	ssize_t res = 0;
+	size_t len = *length;
+
+	XIO_DBG("write = %d len = %ld pos = %lld\n", write, len, *ppos);
+
+	if (!len || *ppos > 0)
+		goto done;
+
+	if (write) {
+		char tmp[8] = {};
+
+		res = len; /* fake consumption of all data */
+
+		if (len > 7)
+			len = 7;
+		if (!copy_from_user(tmp, buffer, len)) {
+			int code = 0;
+			int status = kstrtoint(tmp, 10, &code);
+
+			/* the return value from kstrtoint() does not matter */
+			(void)status;
+			if (code > 0)
+				local_trigger();
+			if (code > 1)
+				remote_trigger();
+		}
+	} else {
+		char *answer = "MARS module not operational\n";
+		char *tmp = NULL;
+		int mylen;
+
+		if (xio_info) {
+			answer = "internal error while determining xio_info\n";
+			tmp = xio_info();
+			if (tmp)
+				answer = tmp;
+		}
+
+		mylen = strlen(answer);
+		if (len > mylen)
+			len = mylen;
+		res = len;
+		if (copy_to_user(buffer, answer, len)) {
+			XIO_ERR("write %ld bytes at %p failed\n", len, buffer);
+			res = -EFAULT;
+		}
+		brick_string_free(tmp);
+	}
+
+done:
+	XIO_DBG("res = %ld\n", res);
+	*length = res;
+	if (res >= 0) {
+		*ppos += res;
+		return 0;
+	}
+	return res;
+}
+
+static
+int lamport_sysctl_handler(
+	struct ctl_table *table,
+	int write,
+	void __user *buffer,
+	size_t *length,
+	loff_t *ppos)
+{
+	ssize_t res = 0;
+	size_t len = *length;
+	int my_len = 128;
+	char *tmp = brick_string_alloc(my_len);
+	struct timespec know = CURRENT_TIME;
+	struct timespec lnow;
+
+	XIO_DBG("write = %d len = %ld pos = %lld\n", write, len, *ppos);
+
+	if (!len || *ppos > 0)
+		goto done;
+
+	if (write)
+		return -EINVAL;
+
+	get_lamport(&lnow);
+
+	res = scnprintf(
+		tmp,
+		my_len,
+		"CURRENT_TIME=%ld.%09ld\nlamport_now=%ld.%09ld\n",
+		know.tv_sec, know.tv_nsec,
+		lnow.tv_sec, lnow.tv_nsec
+	);
+
+	if (copy_to_user(buffer, tmp, res)) {
+		XIO_ERR("write %ld bytes at %p failed\n", res, buffer);
+		res = -EFAULT;
+	}
+	brick_string_free(tmp);
+
+done:
+	XIO_DBG("res = %ld\n", res);
+	*length = res;
+	if (res >= 0) {
+		*ppos += res;
+		return 0;
+	}
+	return res;
+}
+
+#ifdef CTL_UNNUMBERED
+#define _CTL_NAME .ctl_name = CTL_UNNUMBERED,
+#define _CTL_STRATEGY(handler) .strategy = &handler,
+#else
+#def
[RFC 02/32] mars: add new module brick_say
Signed-off-by: Thomas Schoebel-Theuer
---
 drivers/staging/mars/brick_say.c | 920 +++
 include/linux/brick/brick_say.h  |  89
 2 files changed, 1009 insertions(+)
 create mode 100644 drivers/staging/mars/brick_say.c
 create mode 100644 include/linux/brick/brick_say.h

diff --git a/drivers/staging/mars/brick_say.c b/drivers/staging/mars/brick_say.c
new file mode 100644
index ..f3bb49a0dfc3
--- /dev/null
+++ b/drivers/staging/mars/brick_say.c
@@ -0,0 +1,920 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include
+#include
+#include
+
+#include
+#include
+
+/***/
+
+/* messaging */
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#include
+
+#include
+
+#ifndef GFP_BRICK
+#define GFP_BRICK GFP_NOIO
+#endif
+
+#define SAY_ORDER 0
+#define SAY_BUFMAX (PAGE_SIZE << SAY_ORDER)
+#define SAY_BUF_LIMIT (SAY_BUFMAX - 1500)
+#define MAX_FILELEN 16
+#define MAX_IDS 1000
+
+const char *say_class[MAX_SAY_CLASS] = {
+	[SAY_DEBUG] = "debug",
+	[SAY_INFO] = "info",
+	[SAY_WARN] = "warn",
+	[SAY_ERROR] = "error",
+	[SAY_FATAL] = "fatal",
+	[SAY_TOTAL] = "total",
+};
+
+int brick_say_logging = 1;
+
+module_param_named(say_logging, brick_say_logging, int, 0);
+int brick_say_debug;
+
+module_param_named(say_debug, brick_say_debug, int, 0);
+
+int brick_say_syslog_min = 1;
+int brick_say_syslog_max = -1;
+int brick_say_syslog_flood_class = 3;
+int brick_say_syslog_flood_limit = 20;
+int brick_say_syslog_flood_recovery = 300;
+
+int delay_say_on_overflow =
+#ifdef CONFIG_MARS_DEBUG
+	1;
+#else
+	0;
+#endif
+
+static atomic_t say_alloc_channels = ATOMIC_INIT(0);
+static atomic_t say_alloc_names = ATOMIC_INIT(0);
+static atomic_t say_alloc_pages = ATOMIC_INIT(0);
+
+static unsigned long flood_start_jiffies;
+static int flood_count;
+
+struct say_channel {
+	char *ch_name;
+	struct say_channel *ch_next;
+
+	/* protect against concurrent writes */
+	spinlock_t ch_lock[MAX_SAY_CLASS];
+	char *ch_buf[MAX_SAY_CLASS][2];
+
+	short ch_index[MAX_SAY_CLASS];
+	struct file *ch_filp[MAX_SAY_CLASS][2];
+	int ch_overflow[MAX_SAY_CLASS];
+	bool ch_written[MAX_SAY_CLASS];
+	bool ch_rollover;
+	bool ch_must_exist;
+	bool ch_is_dir;
+	bool ch_delete;
+	int ch_status_written;
+	int ch_id_max;
+	void *ch_ids[MAX_IDS];
+
+	wait_queue_head_t ch_progress;
+};
+
+struct say_channel *default_channel;
+
+static struct say_channel *channel_list;
+
+static rwlock_t say_lock = __RW_LOCK_UNLOCKED(say_lock);
+
+static struct task_struct *say_thread;
+
+static DECLARE_WAIT_QUEUE_HEAD(say_event);
+
+bool say_dirty;
+
+#define use_atomic() \
+	((preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK | HARDIRQ_MASK | NMI_MASK)) != 0 || irqs_disabled())
+
+static
+void wait_channel(struct say_channel *ch, int class)
+{
+	if (delay_say_on_overflow && ch->ch_index[class] > SAY_BUF_LIMIT) {
+		if (!use_atomic()) {
+			say_dirty = true;
+			wake_up_interruptible(&say_event);
+			wait_event_interruptible_timeout(
+			ch->ch_progress, ch->ch_index[class] < SAY_BUF_LIMIT, HZ / 10);
+		}
+	}
+}
+
+static
+struct say_channel *find_channel(const void *id)
+{
+	struct say_channel *res = default_channel;
+	struct say_channel *ch;
+
+	read_lock(&say_lock);
+	for (ch = channel_list; ch; ch = ch->ch_next) {
+		int i;
+
+		for (i = 0; i < ch->ch_id_max; i++) {
+			if (ch->ch_ids[i] == id) {
+				res = ch;
+				goto found;
+			}
+		}
+	}
+found:
+	read_unlock(&say_lock);
+	return res;
+}
+
+static
+void _remove_binding(struct task_struct *whom)
+{
+	struct say_channe
[RFC 09/32] mars: add new module lib_rank
Signed-off-by: Thomas Schoebel-Theuer <t...@schoebel-theuer.de>
---
 drivers/staging/mars/lib/lib_rank.c |  87 +++
 include/linux/brick/lib_rank.h      | 136
 2 files changed, 223 insertions(+)
 create mode 100644 drivers/staging/mars/lib/lib_rank.c
 create mode 100644 include/linux/brick/lib_rank.h

diff --git a/drivers/staging/mars/lib/lib_rank.c b/drivers/staging/mars/lib/lib_rank.c
new file mode 100644
index ..6327479039b6
--- /dev/null
+++ b/drivers/staging/mars/lib/lib_rank.c
@@ -0,0 +1,87 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+/* (c) 2012 Thomas Schoebel-Theuer */
+
+#include
+#include
+
+#include
+
+void ranking_compute(struct rank_data *rkd, const struct rank_info rki[], int x)
+{
+	int points = 0;
+	int i;
+
+	for (i = 0; ; i++) {
+		int x0;
+		int x1;
+		int y0;
+		int y1;
+
+		x0 = rki[i].rki_x;
+		if (x < x0)
+			break;
+
+		x1 = rki[i + 1].rki_x;
+
+		if (unlikely(x1 == RKI_DUMMY)) {
+			points = rki[i].rki_y;
+			break;
+		}
+
+		if (x > x1)
+			continue;
+
+		y0 = rki[i].rki_y;
+		y1 = rki[i + 1].rki_y;
+
+		/* linear interpolation */
+		points = ((long long)(x - x0) * (long long)(y1 - y0)) / (x1 - x0) + y0;
+		break;
+	}
+	rkd->rkd_tmp += points;
+}
+
+int ranking_select(struct rank_data rkd[], int rkd_count)
+{
+	int res = -1;
+	long long max = LLONG_MIN / 2;
+	int i;
+
+	for (i = 0; i < rkd_count; i++) {
+		struct rank_data *tmp = &rkd[i];
+		long long rest = tmp->rkd_current_points;
+
+		if (rest <= 0)
+			continue;
+		/* rest -= tmp->rkd_got; */
+		if (rest > max) {
+			max = rest;
+			res = i;
+		}
+	}
+	/* Prevent underflow in the long term
+	 * and reset the "clocks" after each round of
+	 * weighted round-robin selection.
+	 */
+	if (max < 0 && res >= 0) {
+		for (i = 0; i < rkd_count; i++)
+			rkd[i].rkd_got += max;
+	}
+	return res;
+}
diff --git a/include/linux/brick/lib_rank.h b/include/linux/brick/lib_rank.h
new file mode 100644
index ..fa18fdf15597
--- /dev/null
+++ b/include/linux/brick/lib_rank.h
@@ -0,0 +1,136 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+/* (c) 2012 Thomas Schoebel-Theuer */
+
+#ifndef LIB_RANK_H
+#define LIB_RANK_H
+
+/* Generic round-robin scheduler based on ranking information.
+ */
+
+#define RKI_DUMMY INT_MIN
+
+struct rank_info {
+	int rki_x;
+	int rki_y;
+};
+
+struct rank_data {
+	/* public readonly */
+	long long rkd_current_points;
+
+	/* private */
+	long long rkd_tmp;
+	long long rkd_got;
+};
+
+/* Ranking phase.
+ *
+ * Calls should follow the following usage pattern:
+ *
+ * ranking_start(...);
+ * for (...) {
+ *	ranking_compute(&rkd[this_time], ...);
+ *	// usually you need at least 1 call for each rkd[] element,
+ *	// but you can call more often to include ranking information
+ *	// from many different sources.
+ *	// Note: instead / additionally, you may also use
+ *	// ranking_add() or ranking_override().
+ * }
+ * ranking_stop(...);
+ *
+ * => now the new ranking values are computed and already activ
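The piecewise-linear lookup performed by ranking_compute() in the lib_rank patch can be tried outside the kernel. The following is only a user-space sketch under stated assumptions: rank_points() is a hypothetical helper that returns the points instead of accumulating them into rkd_tmp, and the table is assumed to have strictly increasing rki_x values and an RKI_DUMMY terminator, as the patch appears to expect.

```c
#include <assert.h>
#include <limits.h>

#define RKI_DUMMY INT_MIN	/* terminator, same value as in the patch */

struct rank_info {
	int rki_x;
	int rki_y;
};

/* Hypothetical helper (not part of the patch): look up x in a table of
 * (x, y) support points and interpolate linearly between neighbours.
 * Below the first support point the result is 0; beyond the last one
 * the last y value is returned.
 */
static int rank_points(const struct rank_info rki[], int x)
{
	int i;

	for (i = 0; ; i++) {
		int x0 = rki[i].rki_x;
		int x1;
		int y0;
		int y1;

		if (x < x0)
			return 0;

		x1 = rki[i + 1].rki_x;
		if (x1 == RKI_DUMMY)
			return rki[i].rki_y;

		if (x > x1)
			continue;

		y0 = rki[i].rki_y;
		y1 = rki[i + 1].rki_y;
		assert(x1 > x0);	/* assumption: strictly increasing x */

		/* same formula as in ranking_compute() */
		return (int)(((long long)(x - x0) * (long long)(y1 - y0)) / (x1 - x0) + y0);
	}
}
```

With the table {0,0}, {10,100}, {20,50}, {RKI_DUMMY,0}, an input of 5 falls halfway up the first segment and an input of 15 halfway down the second one.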
[RFC 07/32] mars: add new module lib_pairing_heap
Signed-off-by: Thomas Schoebel-Theuer <t...@schoebel-theuer.de>
---
 include/linux/brick/lib_pairing_heap.h | 109 +
 1 file changed, 109 insertions(+)
 create mode 100644 include/linux/brick/lib_pairing_heap.h

diff --git a/include/linux/brick/lib_pairing_heap.h b/include/linux/brick/lib_pairing_heap.h
new file mode 100644
index ..9456e9ea348c
--- /dev/null
+++ b/include/linux/brick/lib_pairing_heap.h
@@ -0,0 +1,109 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#ifndef PAIRING_HEAP_H
+#define PAIRING_HEAP_H
+
+/* Algorithm: see http://en.wikipedia.org/wiki/Pairing_heap
+ * This is just an efficient translation from recursive to iterative form.
+ *
+ * Note: find_min() is so trivial that we don't implement it.
+ */
+
+/* generic version: KEYDEF is kept separate, allowing you to
+ * embed this structure into other container structures already
+ * possessing some key (just provide an empty KEYDEF in this case).
+ */
+#define _PAIRING_HEAP_TYPEDEF(KEYTYPE, KEYDEF) \
+ \
+struct pairing_heap_##KEYTYPE { \
+	KEYDEF \
+	struct pairing_heap_##KEYTYPE *next; \
+	struct pairing_heap_##KEYTYPE *subheaps; \
+}
+
+/* less generic version: define the key inside.
+ */
+#define PAIRING_HEAP_TYPEDEF(KEYTYPE) \
+	_PAIRING_HEAP_TYPEDEF(KEYTYPE, KEYTYPE key;)
+
+/* generic methods: allow arbitrary CMP() functions.
+ */
+#define _PAIRING_HEAP_FUNCTIONS(_STATIC, KEYTYPE, CMP) \
+ \
+_STATIC \
+struct pairing_heap_##KEYTYPE *_ph_merge_##KEYTYPE( \
+struct pairing_heap_##KEYTYPE *heap1, struct pairing_heap_##KEYTYPE *heap2) \
+{ \
+	if (!heap1) \
+		return heap2; \
+	if (!heap2) \
+		return heap1; \
+	if (CMP(heap1, heap2) < 0) { \
+		heap2->next = heap1->subheaps; \
+		heap1->subheaps = heap2; \
+		return heap1; \
+	} \
+	heap1->next = heap2->subheaps; \
+	heap2->subheaps = heap1; \
+	return heap2; \
+} \
+ \
+_STATIC \
+void ph_insert_##KEYTYPE(struct pairing_heap_##KEYTYPE **heap, struct pairing_heap_##KEYTYPE *new) \
+{ \
+	new->next = NULL; \
+	new->subheaps = NULL; \
+	*heap = _ph_merge_##KEYTYPE(*heap, new); \
+} \
+ \
+_STATIC \
+void ph_delete_min_##KEYTYPE(struct pairing_heap_##KEYTYPE **heap) \
+{ \
+	struct pairing_heap_##KEYTYPE *tmplist = NULL; \
+	struct pairing_heap_##KEYTYPE *ptr; \
+	struct pairing_heap_##KEYTYPE *next; \
+	struct pairing_heap_##KEYTYPE *res;
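To see the merge and insert steps of the pairing heap patch without the macro layer, here is a non-generic sketch with a plain int key. The names ph_node, ph_merge, ph_insert and ph_delete_min are hypothetical, chosen for this illustration. ph_merge and ph_insert mirror the macro bodies quoted above. The body of ph_delete_min is cut off in the posting, so the version here is the textbook iterative two-pass pairing and may differ from the author's code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical non-macro node type with an embedded int key */
struct ph_node {
	int key;
	struct ph_node *next;		/* next sibling */
	struct ph_node *subheaps;	/* first child */
};

/* Merge two heaps: the root with the larger key becomes the first
 * child of the other root (same logic as _ph_merge_##KEYTYPE).
 */
static struct ph_node *ph_merge(struct ph_node *heap1, struct ph_node *heap2)
{
	if (!heap1)
		return heap2;
	if (!heap2)
		return heap1;
	if (heap1->key < heap2->key) {
		heap2->next = heap1->subheaps;
		heap1->subheaps = heap2;
		return heap1;
	}
	heap1->next = heap2->subheaps;
	heap2->subheaps = heap1;
	return heap2;
}

static void ph_insert(struct ph_node **heap, struct ph_node *node)
{
	node->next = NULL;
	node->subheaps = NULL;
	*heap = ph_merge(*heap, node);
}

/* Assumption: standard two-pass delete-min, written iteratively. */
static void ph_delete_min(struct ph_node **heap)
{
	struct ph_node *tmplist = NULL;	/* merged pairs, in reverse order */
	struct ph_node *ptr = *heap ? (*heap)->subheaps : NULL;
	struct ph_node *res = NULL;

	/* pass 1: merge the children pairwise, left to right */
	while (ptr) {
		struct ph_node *a = ptr;
		struct ph_node *b = a->next;
		struct ph_node *merged;

		ptr = b ? b->next : NULL;
		a->next = NULL;
		if (b)
			b->next = NULL;
		merged = ph_merge(a, b);
		merged->next = tmplist;
		tmplist = merged;
	}
	/* pass 2: fold the pairs into one heap, right to left */
	while (tmplist) {
		struct ph_node *next = tmplist->next;

		tmplist->next = NULL;
		res = ph_merge(res, tmplist);
		tmplist = next;
	}
	*heap = res;
}
```

As the header comment in the patch notes, find_min() needs no function of its own: the minimum is always the root, so reading heap->key before calling ph_delete_min() is enough.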
[RFC 06/32] mars: add new module brick
Signed-off-by: Thomas Schoebel-Theuer <t...@schoebel-theuer.de>
---
 drivers/staging/mars/brick.c | 723 +++
 include/linux/brick/brick.h  | 620 +
 2 files changed, 1343 insertions(+)
 create mode 100644 drivers/staging/mars/brick.c
 create mode 100644 include/linux/brick/brick.h

diff --git a/drivers/staging/mars/brick.c b/drivers/staging/mars/brick.c
new file mode 100644
index ..be741e896fc9
--- /dev/null
+++ b/drivers/staging/mars/brick.c
@@ -0,0 +1,723 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include
+#include
+#include
+
+#define _STRATEGY
+
+#include
+#include
+
+//
+
+/* init / exit functions */
+
+void _generic_output_init(
+struct generic_brick *brick, const struct generic_output_type *type, struct generic_output *output)
+{
+	output->brick = brick;
+	output->type = type;
+	output->ops = type->master_ops;
+	output->nr_connected = 0;
+	INIT_LIST_HEAD(&output->output_head);
+}
+
+void _generic_output_exit(struct generic_output *output)
+{
+	list_del_init(&output->output_head);
+	output->brick = NULL;
+	output->type = NULL;
+	output->ops = NULL;
+	output->nr_connected = 0;
+}
+
+int generic_brick_init(const struct generic_brick_type *type, struct generic_brick *brick)
+{
+	brick->aspect_context.brick_index = get_brick_nr();
+	brick->type = type;
+	brick->ops = type->master_ops;
+	brick->nr_inputs = 0;
+	brick->nr_outputs = 0;
+	brick->power.off_led = true;
+	init_waitqueue_head(&brick->power.event);
+	INIT_LIST_HEAD(&brick->tmp_head);
+	return 0;
+}
+
+void generic_brick_exit(struct generic_brick *brick)
+{
+	list_del_init(&brick->tmp_head);
+	brick->type = NULL;
+	brick->ops = NULL;
+	brick->nr_inputs = 0;
+	brick->nr_outputs = 0;
+	put_brick_nr(brick->aspect_context.brick_index);
+}
+
+int generic_input_init(
+struct generic_brick *brick, int index, const struct generic_input_type *type, struct generic_input *input)
+{
+	if (index < 0 || index >= brick->type->max_inputs)
+		return -EINVAL;
+	if (brick->inputs[index])
+		return -EEXIST;
+	input->brick = brick;
+	input->type = type;
+	input->connect = NULL;
+	INIT_LIST_HEAD(&input->input_head);
+	brick->inputs[index] = input;
+	brick->nr_inputs++;
+	return 0;
+}
+
+void generic_input_exit(struct generic_input *input)
+{
+	list_del_init(&input->input_head);
+	input->brick = NULL;
+	input->type = NULL;
+	input->connect = NULL;
+}
+
+int generic_output_init(
+struct generic_brick *brick, int index, const struct generic_output_type *type, struct generic_output *output)
+{
+	if (index < 0 || index >= brick->type->max_outputs)
+		return -ENOMEM;
+	if (brick->outputs[index])
+		return -EEXIST;
+	_generic_output_init(brick, type, output);
+	brick->outputs[index] = output;
+	brick->nr_outputs++;
+	return 0;
+}
+
+int generic_size(const struct generic_brick_type *brick_type)
+{
+	int size = brick_type->brick_size;
+	int i;
+
+	size += brick_type->max_inputs * sizeof(void *);
+	for (i = 0; i < brick_type->max_inputs; i++)
+		size += brick_type->default_input_types[i]->input_size;
+	size += brick_type->max_outputs * sizeof(void *);
+	for (i = 0; i < brick_type->max_outputs; i++)
+		size += brick_type->default_output_types[i]->output_size;
+	return size;
+}
+
+int generic_connect(struct generic_input *input, struct generic_output *output)
+{
+	BRICK_DBG("generic_connect(input=%p, output=%p)\n", input, output);
+	if (unlikely(!input || !output))
+		return -EINVAL;
+	if (unlikely(input->connect))
+		return -EEXIST;
+	if (unlikely(!list_empty(&input->input_head)))
+		return -EINVAL;
+	/* helps only against the most common errors */
+	if (unlikely(input->brick == output->brick))
+		return -EDEADLK;
+
[RFC 01/32] mars: add new module lamport
Signed-off-by: Thomas Schoebel-Theuer <t...@schoebel-theuer.de>
---
 drivers/staging/mars/lamport.c | 61 ++
 include/linux/brick/lamport.h  | 26 ++
 2 files changed, 87 insertions(+)
 create mode 100644 drivers/staging/mars/lamport.c
 create mode 100644 include/linux/brick/lamport.h

diff --git a/drivers/staging/mars/lamport.c b/drivers/staging/mars/lamport.c
new file mode 100644
index ..373093f6e35f
--- /dev/null
+++ b/drivers/staging/mars/lamport.c
@@ -0,0 +1,61 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include
+#include
+#include
+
+#include
+
+/* TODO: replace with spinlock if possible (first check) */
+struct semaphore lamport_sem = __SEMAPHORE_INITIALIZER(lamport_sem, 1);
+struct timespec lamport_now = {};
+
+void get_lamport(struct timespec *now)
+{
+	int diff;
+
+	down(&lamport_sem);
+
+	*now = CURRENT_TIME;
+	diff = timespec_compare(now, &lamport_now);
+	if (diff >= 0) {
+		timespec_add_ns(now, 1);
+		memcpy(&lamport_now, now, sizeof(lamport_now));
+		timespec_add_ns(&lamport_now, 1);
+	} else {
+		timespec_add_ns(&lamport_now, 1);
+		memcpy(now, &lamport_now, sizeof(*now));
+	}
+
+	up(&lamport_sem);
+}
+
+void set_lamport(struct timespec *old)
+{
+	int diff;
+
+	down(&lamport_sem);
+
+	diff = timespec_compare(old, &lamport_now);
+	if (diff >= 0) {
+		memcpy(&lamport_now, old, sizeof(lamport_now));
+		timespec_add_ns(&lamport_now, 1);
+	}
+
+	up(&lamport_sem);
+}
diff --git a/include/linux/brick/lamport.h b/include/linux/brick/lamport.h
new file mode 100644
index ..9aac0ce01bb4
--- /dev/null
+++ b/include/linux/brick/lamport.h
@@ -0,0 +1,26 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#ifndef LAMPORT_H
+#define LAMPORT_H
+
+#include
+
+extern void get_lamport(struct timespec *now);
+extern void set_lamport(struct timespec *old);
+
+#endif
--
2.11.0
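The clock rules in get_lamport() and set_lamport() can be modelled in user space to check their ordering behaviour. This is a sketch under assumptions, not the patch's implementation: time is a plain nanosecond counter instead of struct timespec, the physical clock is passed in as a parameter so the example is deterministic, and locking is left out (the patch serializes both functions with lamport_sem). The names lamport_clock, lamport_get and lamport_set are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: timestamps as signed nanoseconds */
typedef int64_t ns_t;

struct lamport_clock {
	ns_t now;	/* plays the role of lamport_now */
};

/* Model of get_lamport(): return a stamp that is later than every
 * stamp handed out or received before, even if the physical clock
 * jumps backwards.
 */
static ns_t lamport_get(struct lamport_clock *clk, ns_t physical)
{
	ns_t stamp;

	if (physical >= clk->now) {
		/* physical time is ahead: follow it */
		stamp = physical + 1;
		clk->now = stamp + 1;
	} else {
		/* physical time lags behind: advance the logical clock */
		clk->now += 1;
		stamp = clk->now;
	}
	return stamp;
}

/* Model of set_lamport(): fold in a stamp received from a peer */
static void lamport_set(struct lamport_clock *clk, ns_t remote)
{
	if (remote >= clk->now)
		clk->now = remote + 1;
}
```

In this model a stamp received through lamport_set() pushes the local clock forward, so any later lamport_get() returns a value above it regardless of the local physical time. That is the happened-before ordering a Lamport clock is meant to preserve.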
[RFC 11/32] mars: add new module lib_timing
Signed-off-by: Thomas Schoebel-Theuer <t...@schoebel-theuer.de>
---
 drivers/staging/mars/lib/lib_timing.c | 68 +
 include/linux/brick/lib_timing.h | 182 ++
 2 files changed, 250 insertions(+)
 create mode 100644 drivers/staging/mars/lib/lib_timing.c
 create mode 100644 include/linux/brick/lib_timing.h

diff --git a/drivers/staging/mars/lib/lib_timing.c b/drivers/staging/mars/lib/lib_timing.c
new file mode 100644
index ..1996052cb647
--- /dev/null
+++ b/drivers/staging/mars/lib/lib_timing.c
@@ -0,0 +1,68 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include
+
+#include
+#include
+
+#ifdef CONFIG_DEBUG_KERNEL
+
+int report_timing(struct timing_stats *tim, char *str, int maxlen)
+{
+	int len = 0;
+	int time = 1;
+	int resol = 1;
+
+	static const char * const units[] = {
+		"us",
+		"ms",
+		"s",
+		"ERROR"
+	};
+	const char *unit = units[0];
+	int unit_index = 0;
+	int i;
+
+	for (i = 0; i < TIMING_MAX; i++) {
+		int this_len = scnprintf(
+
+			str, maxlen, "<%d%s = %d (%lld) ", resol, unit, tim->tim_count[i], (
+			long long)tim->tim_count[i] * time);
+
+		str += this_len;
+		len += this_len;
+		maxlen -= this_len;
+		if (maxlen <= 1)
+			break;
+		resol <<= 1;
+		time <<= 1;
+		if (resol >= 1000) {
+			resol = 1;
+			unit = units[++unit_index];
+		}
+	}
+	return len;
+}
+
+#endif /* CONFIG_DEBUG_KERNEL */
+
+struct threshold global_io_threshold = {
+	.thr_limit = 30 * 100, /* 30 seconds */
+	.thr_factor = 100,
+	.thr_plus = 0,
+};
diff --git a/include/linux/brick/lib_timing.h b/include/linux/brick/lib_timing.h
new file mode 100644
index ..7081d984a2ce
--- /dev/null
+++ b/include/linux/brick/lib_timing.h
@@ -0,0 +1,182 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#ifndef LIB_TIMING_H
+#define LIB_TIMING_H
+
+#include
+
+/* Simple infrastructure for timing of arbitrary operations and creation
+ * of some simple histogram statistics.
+ */
+
+#define TIMING_MAX 24
+
+struct timing_stats {
+#ifdef CONFIG_DEBUG_KERNEL
+	int tim_count[TIMING_MAX];
+
+#endif
+};
+
+#define _TIME_THIS(_stamp1, _stamp2, _CODE) \
+	({ \
+		(_stamp1) = cpu_clock(raw_smp_processor_id()); \
+ \
+		_CODE; \
+ \
+		(_stamp2) = cpu_clock(raw_smp_processor_id()); \
+		(_stamp2) - (_stamp1); \
+	})
+
+#define TIME_THIS(_CODE) \
+	({ \
+		unsigned long long _stamp1; \
+		unsigned long long _stamp2; \
+		_TIME_THIS(_stamp1, _stamp2, _CODE); \
+	})
+
+#ifdef CONFIG_DEBUG_KERNEL
+
+#define _TIME_STATS(_timing, _stamp1, _stamp2, _CODE) \
+	({ \
+
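The _TIME_STATS macro is cut off in this archive copy, so the exact bucketing rule is not visible here. Judging only from report_timing(), which prints one counter per power-of-two resolution starting at 1 us, a plausible reading is a log2 histogram over microseconds. The sketch below is written under that assumption; timing_bucket_sketch(), timing_record_sketch(), struct timing_stats_sketch and TIMING_MAX_SKETCH are illustrative names, not part of the patch.

```c
#include <assert.h>

#define TIMING_MAX_SKETCH 24	/* mirrors TIMING_MAX in the patch */

/* Hypothetical userspace counterpart of struct timing_stats. */
struct timing_stats_sketch {
	int tim_count[TIMING_MAX_SKETCH];
};

/*
 * Map an elapsed time in nanoseconds to a histogram slot.
 * Assumption: slot i collects durations below 2^i microseconds,
 * and the last slot collects everything that is longer.
 */
int timing_bucket_sketch(unsigned long long elapsed_ns)
{
	unsigned long long us = elapsed_ns / 1000;
	int i;

	for (i = 0; i < TIMING_MAX_SKETCH - 1; i++) {
		if (us < (1ULL << i))
			return i;
	}
	return TIMING_MAX_SKETCH - 1;
}

/* Count one measured operation in the histogram. */
void timing_record_sketch(struct timing_stats_sketch *tim,
			  unsigned long long elapsed_ns)
{
	int slot = timing_bucket_sketch(elapsed_ns);

	assert(slot >= 0 && slot < TIMING_MAX_SKETCH);
	tim->tim_count[slot]++;
}
```

In the kernel version the elapsed time would come from the difference of two cpu_clock() readings, as in _TIME_THIS; here the caller supplies it directly.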
[RFC 04/32] mars: add new module brick_checking
Signed-off-by: Thomas Schoebel-Theuer <t...@schoebel-theuer.de>
---
 include/linux/brick/brick_checking.h | 107 +++
 1 file changed, 107 insertions(+)
 create mode 100644 include/linux/brick/brick_checking.h

diff --git a/include/linux/brick/brick_checking.h b/include/linux/brick/brick_checking.h
new file mode 100644
index ..957bd5227db9
--- /dev/null
+++ b/include/linux/brick/brick_checking.h
@@ -0,0 +1,107 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#ifndef BRICK_CHECKING_H
+#define BRICK_CHECKING_H
+
+/***/
+
+/* checking */
+
+#if defined(CONFIG_MARS_DEBUG) || defined(CONFIG_MARS_CHECKS)
+#define BRICK_CHECKING true
+#else
+#define BRICK_CHECKING false
+#endif
+
+#define _CHECK_ATOMIC(atom, OP, minval) \
+do { \
+	if (BRICK_CHECKING) { \
+		int __test = atomic_read(atom); \
+		if (unlikely(__test OP(minval))) { \
+			atomic_set(atom, minval); \
+			BRICK_ERR("%d: atomic " #atom " " #OP " " #minval " (%d)\n", __LINE__, __test);\
+		} \
+	} \
+} while (0)
+
+#define CHECK_ATOMIC(atom, minval) \
+	_CHECK_ATOMIC(atom, <, minval)
+
+#define CHECK_HEAD_EMPTY(head) \
+do { \
+	if (BRICK_CHECKING && unlikely(!list_empty(head) && (head)->next)) {\
+		list_del_init(head);\
+		BRICK_ERR("%d: list_head " #head " (%p) not empty\n", __LINE__, head);\
+	} \
+} while (0)
+
+#ifdef CONFIG_MARS_DEBUG_MEM
+#define CHECK_PTR_DEAD(ptr, label) \
+do { \
+	if (BRICK_CHECKING && unlikely((ptr) == (void *)0x5a5a5a5a5a5a5a5a)) {\
+		BRICK_FAT("%d: pointer '" #ptr "' is DEAD\n", __LINE__);\
+		goto label; \
+	} \
+} while (0)
+#else
+#define CHECK_PTR_DEAD(ptr, label) /*empty*/
+#endif
+
+#define CHECK_PTR_NULL(ptr, label) \
+do { \
+	CHECK_PTR_DEAD(ptr, label); \
+	if (BRICK_CHECKING && unlikely(!(ptr))) { \
+		BRICK_FAT("%d: pointer '" #ptr "' is NULL\n", __LINE__);\
+		goto label; \
+	} \
+} while (0)
+
+#ifdef CONFIG_MARS_DEBUG
+#define CHECK_PTR(ptr, label) \
+do { \
+	CHECK_PTR_NULL(ptr, label); \
+	if (BRICK_CHECKING && unlikely(!virt_addr_valid(ptr))) {\
+		BRICK_FAT("%d: pointer '" #ptr "' (%p) is no valid virtual KERNEL address\n", __LINE__, ptr);\
+		goto label; \
+	} \
+} while (0)
+#else
+#define CHECK_PTR(ptr, label) CHECK_PTR_NULL(ptr, label)
+#endif
+
+#define CHECK_ASPECT(a_ptr, o_ptr, label) \
+do { \
+	if (BRICK_CHECKING && unlikely((a_ptr)->object != o_ptr)) { \
+
[RFC 05/32] mars: add new module meta
Signed-off-by: Thomas Schoebel-Theuer <t...@schoebel-theuer.de>
---
 include/linux/brick/meta.h | 106 +
 1 file changed, 106 insertions(+)
 create mode 100644 include/linux/brick/meta.h

diff --git a/include/linux/brick/meta.h b/include/linux/brick/meta.h
new file mode 100644
index ..a92b2b649c1f
--- /dev/null
+++ b/include/linux/brick/meta.h
@@ -0,0 +1,106 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#ifndef META_H
+#define META_H
+
+/***/
+
+/* metadata descriptions */
+
+/* The idea is to describe your C structures in such a way that
+ * transfers to disk or over a network become self-describing.
+ *
+ * In essence, this is a kind of version-independent marshalling.
+ *
+ * Advantage:
+ * When you extend your original C struct (and of course update the
+ * corresponding meta structure), old data on disk (or network peers
+ * running an old version of your program) will remain valid.
+ * Upon read, newly added fields missing in the old version will be simply
+ * not filled in and therefore remain zeroed (if you don't forget to
+ * initially clear your structures via memset() / initializers / etc).
+ * Note that this works only if you never rename or remove existing
+ * fields; you should only add new ones.
+ * [TODO: add macros for description of ignored / renamed fields to
+ * overcome this limitation]
+ * You may increase the size of integers, for example from 32bit to 64bit
+ * or even higher; sign extension will be automatically carried out
+ * when necessary.
+ * Also, you may change the order of fields, because the metadata interpreter
+ * will check each field individually; field offsets are automatically
+ * maintained.
+ *
+ * Disadvantage: this adds some (small) overhead.
+ */
+
+enum field_type {
+	FIELD_DONE,
+	FIELD_REF,
+	FIELD_SUB,
+	FIELD_STRING,
+	FIELD_RAW,
+	FIELD_INT,
+	FIELD_UINT,
+};
+
+struct meta {
+	/* char field_name[MAX_FIELD_LEN]; */
+	char *field_name;
+
+	short field_type;
+	short field_data_size;
+	short field_transfer_size;
+	int field_offset;
+	const struct meta *field_ref;
+};
+
+#define _META_INI(NAME, STRUCT, TYPE, TSIZE) \
+	.field_name = #NAME,\
+	.field_type = TYPE, \
+	.field_data_size = sizeof(((STRUCT *)NULL)->NAME), \
+	.field_transfer_size = (TSIZE), \
+	.field_offset = offsetof(STRUCT, NAME) \
+
+#define META_INI_TRANSFER(NAME, STRUCT, TYPE, TSIZE) \
+	{ _META_INI(NAME, STRUCT, TYPE, TSIZE) }
+
+#define META_INI(NAME, STRUCT, TYPE) \
+	{ _META_INI(NAME, STRUCT, TYPE, 0) }
+
+#define _META_INI_AIO(NAME, STRUCT, AIO) \
+	.field_name = #NAME,\
+	.field_type = FIELD_REF,\
+	.field_data_size = sizeof(*(((STRUCT *)NULL)->NAME)), \
+	.field_offset = offsetof(STRUCT, NAME), \
+	.field_ref = AIO
+
+#define META_INI_AIO(NAME, STRUCT, AIO) { _META_INI_AIO(NAME, STRUCT, AIO) }
+
+#define _META_INI_SUB(NAME, STRUCT, SUB) \
+	.field_name = #NAME,\
+	.field_type = FIELD_SUB,\
+	.field_data_size = sizeof(((STRUCT *)NULL)->NAME), \
+	.field_offset = offsetof(STRUCT, NAME), \
+	.field_ref = SUB
+
+#define META_INI_SUB(NAME, STRUCT, SUB) { _META_INI_SUB(NAME, STRUCT, SUB) }
+
+extern const struct meta *find_meta(const struct meta *meta, const char *field_name);
+/* extern void free_meta(void *data, const struct meta *meta); */
+
+#endif
-- 
2.11.0
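To make the self-describing idea in the header comment concrete, here is a reduced userspace sketch of a field table and a lookup by name. Only the declaration of find_meta() appears in this patch, so the lookup body below is an assumption, as are struct demo_record and all names ending in _sketch. The table is terminated by an entry whose field_name is NULL, matching the empty {} terminator used by the meta tables elsewhere in the series.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Reduced stand-ins for enum field_type and struct meta from the patch. */
enum field_type_sketch {
	FIELD_DONE_SKETCH,
	FIELD_INT_SKETCH,
};

struct meta_sketch {
	const char *field_name;
	short field_type;
	short field_data_size;
	int field_offset;
};

/* Same trick as _META_INI: size and offset are taken from the struct itself. */
#define META_INI_SKETCH(NAME, STRUCT, TYPE) \
	{ #NAME, TYPE, (short)sizeof(((STRUCT *)NULL)->NAME), \
	  (int)offsetof(STRUCT, NAME) }

/* Hypothetical payload structure, for illustration only. */
struct demo_record {
	int id;
	long long size;
};

static const struct meta_sketch demo_meta[] = {
	META_INI_SKETCH(id, struct demo_record, FIELD_INT_SKETCH),
	META_INI_SKETCH(size, struct demo_record, FIELD_INT_SKETCH),
	{ NULL, FIELD_DONE_SKETCH, 0, 0 }	/* terminator */
};

/* Linear search by field name; returns NULL if the field is unknown. */
const struct meta_sketch *find_meta_sketch(const struct meta_sketch *meta,
					   const char *field_name)
{
	assert(meta && field_name);
	for (; meta->field_name; meta++) {
		if (!strcmp(meta->field_name, field_name))
			return meta;
	}
	return NULL;
}
```

A receiver running an older version would simply not find a newly added field name in its own table and skip it, which is the behaviour the comment describes for extended structures.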
[RFC 00/32] State of MARS Geo-Redundancy Module
d start joining the MARS development in 2017, at least for helping me get it upstream.

I would be excited if I were invited to the next kernel summit or a similar meeting.

A happy new year from your devoted Thomas

[1] https://github.com/schoebel/mars/blob/master/docu/MARS_GUUG2016.pdf
[2] https://github.com/schoebel/mars

Thomas Schoebel-Theuer (32):
  mars: add new module lamport
  mars: add new module brick_say
  mars: add new module brick_mem
  mars: add new module brick_checking
  mars: add new module meta
  mars: add new module brick
  mars: add new module lib_pairing_heap
  mars: add new module lib_queue
  mars: add new module lib_rank
  mars: add new module lib_limiter
  mars: add new module lib_timing
  mars: add new module vfs_compat
  mars: add new module xio
  mars: add new module xio_net
  mars: add new module lib_mapfree
  mars: add new module lib_log
  mars: add new module xio_bio
  mars: add new module xio_sio
  mars: add new module xio_client
  mars: add new module xio_if
  mars: add new module xio_copy
  mars: add new module xio_trans_logger
  mars: add new module xio_server
  mars: add new module strategy
  mars: add new module main_strategy
  mars: add new module net
  mars: add new module server_strategy
  mars: add new module mars_proc
  mars: add new module mars_main
  mars: add new module Makefile
  mars: add new module Kconfig
  mars: activate build

 drivers/staging/Kconfig | 2 +
 drivers/staging/Makefile | 1 +
 drivers/staging/mars/Kconfig | 266 +
 drivers/staging/mars/Makefile | 96 +
 drivers/staging/mars/brick.c | 723 +++
 drivers/staging/mars/brick_mem.c | 1080
 drivers/staging/mars/brick_say.c | 920 +++
 drivers/staging/mars/lamport.c | 61 +
 drivers/staging/mars/lib/lib_limiter.c | 163 +
 drivers/staging/mars/lib/lib_rank.c | 87 +
 drivers/staging/mars/lib/lib_timing.c | 68 +
 drivers/staging/mars/mars/main_strategy.c | 2135 +++
 drivers/staging/mars/mars/mars_main.c | 6160
 drivers/staging/mars/mars/mars_proc.c | 389 ++
 drivers/staging/mars/mars/mars_proc.h | 34 +
 drivers/staging/mars/mars/net.c | 109 +
 drivers/staging/mars/mars/server_strategy.c | 436 ++
 drivers/staging/mars/mars/strategy.h | 239 +
 drivers/staging/mars/xio_bricks/lib_log.c | 506 ++
 drivers/staging/mars/xio_bricks/lib_mapfree.c | 382 ++
 drivers/staging/mars/xio_bricks/xio.c | 227 +
 drivers/staging/mars/xio_bricks/xio_bio.c | 845 +++
 drivers/staging/mars/xio_bricks/xio_client.c | 1083
 drivers/staging/mars/xio_bricks/xio_copy.c | 1005
 drivers/staging/mars/xio_bricks/xio_if.c | 892 +++
 drivers/staging/mars/xio_bricks/xio_net.c | 1849 ++
 drivers/staging/mars/xio_bricks/xio_server.c | 493 ++
 drivers/staging/mars/xio_bricks/xio_sio.c | 578 ++
 drivers/staging/mars/xio_bricks/xio_trans_logger.c | 3410 +++
 include/linux/brick/brick.h | 620 ++
 include/linux/brick/brick_checking.h | 107 +
 include/linux/brick/brick_mem.h | 218 +
 include/linux/brick/brick_say.h | 89 +
 include/linux/brick/lamport.h | 26 +
 include/linux/brick/lib_limiter.h | 52 +
 include/linux/brick/lib_pairing_heap.h | 109 +
 include/linux/brick/lib_queue.h | 165 +
 include/linux/brick/lib_rank.h | 136 +
 include/linux/brick/lib_timing.h | 182 +
 include/linux/brick/meta.h | 106 +
 include/linux/brick/vfs_compat.h | 48 +
 include/linux/xio/lib_log.h | 333 ++
 include/linux/xio/lib_mapfree.h | 84 +
 include/linux/xio/xio.h | 319 +
 include/linux/xio/xio_bio.h | 85 +
 include/linux/xio/xio_client.h | 105 +
 include/linux/xio/xio_copy.h | 115 +
 include/linux/xio/xio_if.h | 109 +
 include/linux/xio/xio_net.h | 177 +
 include/linux/xio/xio_server.h | 91 +
 include/linux/xio/xio_sio.h | 68 +
 include/linux/xio/xio_trans_logger.h | 271 +
 52 files changed, 27854 insertions(+)
 create mode 100644 drivers/staging/mars/Kconfig
 create mode 100644 drivers/staging/mars/Makefile
 create mode 100644 drivers/staging/mars/brick.c
 create mode 100644 drivers/staging/mars/brick_mem.c
 create mode 100644 drivers/staging/mars/brick_say.c
 create mode 100644 drivers/staging/mars/lamport.c
 create m
[RFC 14/32] mars: add new module xio_net
Signed-off-by: Thomas Schoebel-Theuer <t...@schoebel-theuer.de>
---
 drivers/staging/mars/xio_bricks/xio_net.c | 1849 +
 include/linux/xio/xio_net.h | 177 +++
 2 files changed, 2026 insertions(+)
 create mode 100644 drivers/staging/mars/xio_bricks/xio_net.c
 create mode 100644 include/linux/xio/xio_net.h

diff --git a/drivers/staging/mars/xio_bricks/xio_net.c b/drivers/staging/mars/xio_bricks/xio_net.c
new file mode 100644
index ..441eee1f3912
--- /dev/null
+++ b/drivers/staging/mars/xio_bricks/xio_net.c
@@ -0,0 +1,1849 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include
+#include
+#include
+#include
+#include
+#include
+
+#include
+#include
+
+/**/
+
+/* provisionary version detection */
+
+#ifndef TCP_MAX_REORDERING
+#define __HAS_IOV_ITER
+#endif
+
+#ifdef sk_net_refcnt
+/* see eeb1bd5c40edb0e2fd925c8535e2fdebdbc5cef2 */
+#define __HAS_STRUCT_NET
+#endif
+
+/**/
+
+#define USE_BUFFERING
+
+#define SEND_PROTO_VERSION 2
+
+enum COMPRESS_TYPES {
+	COMPRESS_NONE = 0,
+	COMPRESS_LZO = 1,
+	/* insert further methods here */
+};
+
+int xio_net_compress_data;
+
+const u16 net_global_flags = 0
+#ifdef __HAVE_LZO
+	| COMPRESS_LZO
+#endif
+	;
+
+/**/
+
+/* Internal data structures for low-level transfer of C structures
+ * described by struct meta.
+ * Only these low-level fields need to have a fixed size like s64.
+ * The size and bytesex of the higher-level C structures is converted
+ * automatically; therefore classical "int" or "long long" etc is viable.
+ */
+
+#define MAX_FIELD_LEN (32 + 16)
+
+/* Please keep this at a size of 64 bytes by
+ * reuse of *spare* fields.
+ */
+struct xio_desc_cache {
+	u8 cache_sender_proto;
+	u8 cache_recver_proto;
+	s8 cache_is_bigendian;
+	u8 cache_spare0;
+	s16 cache_items;
+	u16 cache_spare1;
+	u32 cache_spare2;
+	u32 cache_spare3;
+	u64 cache_spare4[4];
+	u64 cache_sender_cookie;
+	u64 cache_recver_cookie;
+};
+
+/* Please keep this also at a size of 64 bytes by
+ * reuse of *spare* fields.
+ */
+struct xio_desc_item {
+	s8 field_type;
+	s8 field_spare0;
+	s16 field_data_size;
+	s16 field_sender_size;
+	s16 field_sender_offset;
+	s16 field_recver_size;
+	s16 field_recver_offset;
+	s32 field_spare;
+	char field_name[MAX_FIELD_LEN];
+};
+
+/* This must not be mirror symmetric between big and little endian
+ */
+#define XIO_DESC_MAGIC 0x73D0A2EC6148F48Ell
+
+struct xio_desc_header {
+	u64 h_magic;
+	u64 h_cookie;
+	s16 h_meta_len;
+	s16 h_index;
+	u32 h_spare1;
+	u64 h_spare2;
+};
+
+#define MAX_INT_TRANSFER 16
+
+/**/
+
+/* Bytesex conversion / sign extension
+ */
+
+#ifdef __LITTLE_ENDIAN
+static const bool myself_is_bigendian;
+
+#endif
+#ifdef __BIG_ENDIAN
+static const bool myself_is_bigendian = true;
+
+#endif
+
+static inline
+void swap_bytes(void *data, int len)
+{
+	char *a = data;
+	char *b = data + len - 1;
+
+	while (a < b) {
+		char tmp = *a;
+
+		*a = *b;
+		*b = tmp;
+		a++;
+		b--;
+	}
+}
+
+#define SWAP_FIELD(x) swap_bytes(&(x), sizeof(x))
+
+static inline
+void swap_mc(struct xio_desc_cache *mc, int len)
+{
+	struct xio_desc_item *mi;
+
+	SWAP_FIELD(mc->cache_sender_cookie);
+	SWAP_FIELD(mc->cache_recver_cookie);
+	SWAP_FIELD(mc->cache_items);
+
+	len -= sizeof(*mc);
+
+	for (mi = (void *)(mc + 1); len > 0; mi++, len -= sizeof(*mi)) {
+		SWAP_FIELD(mi->field_data_size);
+		SWAP_FIELD(mi->field_sender_size);
+		SWAP_FIELD(mi->field_sender_offset);
+		SWAP_FIELD(mi->field_recver_size);
+		SWAP_FIELD(mi->field_recver_offset);
+	}
+}
+
+static inline
+char get_sign(const void *data, int
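The byte reversal used for the descriptor fields can be tried in userspace with the following sketch. It follows swap_bytes() and SWAP_FIELD from the patch closely, except that the pointer arithmetic is done on an unsigned char pointer instead of void * (the latter is a GNU C extension). The _sketch names are illustrative. The same helper also shows why the magic value must not be "mirror symmetric": a receiver of the opposite byte order sees the reversed constant, and that only helps to detect the mismatch if it differs from the original.

```c
#include <assert.h>
#include <stdint.h>

/* In-place reversal of len bytes, as in swap_bytes() of the patch. */
static void swap_bytes_sketch(void *data, int len)
{
	unsigned char *a = data;
	unsigned char *b = a + len - 1;

	assert(len > 0);
	while (a < b) {
		unsigned char tmp = *a;

		*a = *b;
		*b = tmp;
		a++;
		b--;
	}
}

/* Reverse a whole lvalue of any fixed-size integer type. */
#define SWAP_FIELD_SKETCH(x) swap_bytes_sketch(&(x), (int)sizeof(x))

/* Value of the patch's XIO_DESC_MAGIC as seen by an opposite-endian peer. */
uint64_t mirrored_magic_sketch(void)
{
	uint64_t magic = 0x73D0A2EC6148F48EULL;

	SWAP_FIELD_SKETCH(magic);
	return magic;
}
```

Because the reversal works on the in-memory representation, applying it to a value yields the same numeric result on big-endian and little-endian hosts.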
[RFC 13/32] mars: add new module xio
Signed-off-by: Thomas Schoebel-Theuer <t...@schoebel-theuer.de>
---
 drivers/staging/mars/xio_bricks/xio.c | 227
 include/linux/xio/xio.h | 319 ++
 2 files changed, 546 insertions(+)
 create mode 100644 drivers/staging/mars/xio_bricks/xio.c
 create mode 100644 include/linux/xio/xio.h

diff --git a/drivers/staging/mars/xio_bricks/xio.c b/drivers/staging/mars/xio_bricks/xio.c
new file mode 100644
index ..e58f11f497f9
--- /dev/null
+++ b/drivers/staging/mars/xio_bricks/xio.c
@@ -0,0 +1,227 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include
+#include
+#include
+#include
+#include
+#include
+
+#include
+
+//
+
+/* infrastructure */
+
+struct banning xio_global_ban = {};
+atomic_t xio_global_io_flying = ATOMIC_INIT(0);
+
+//
+
+/* object stuff */
+
+const struct generic_object_type aio_type = {
+	.object_type_name = "aio",
+	.default_size = sizeof(struct aio_object),
+	.object_type_nr = OBJ_TYPE_AIO,
+};
+
+//
+
+/* brick stuff */
+
+/***/
+
+/* meta descriptions */
+
+const struct meta xio_info_meta[] = {
+	META_INI(current_size, struct xio_info, FIELD_INT),
+	META_INI(tf_align, struct xio_info, FIELD_INT),
+	META_INI(tf_min_size, struct xio_info, FIELD_INT),
+	{}
+};
+
+const struct meta xio_aio_user_meta[] = {
+	META_INI(_object_cb.cb_error, struct aio_object, FIELD_INT),
+	META_INI(io_pos, struct aio_object, FIELD_INT),
+	META_INI(io_len, struct aio_object, FIELD_INT),
+	META_INI(io_may_write, struct aio_object, FIELD_INT),
+	META_INI(io_prio, struct aio_object, FIELD_INT),
+	META_INI(io_cs_mode, struct aio_object, FIELD_INT),
+	META_INI(io_timeout, struct aio_object, FIELD_INT),
+	META_INI(io_total_size, struct aio_object, FIELD_INT),
+	META_INI(io_checksum, struct aio_object, FIELD_RAW),
+	META_INI(io_flags, struct aio_object, FIELD_INT),
+	META_INI(io_rw, struct aio_object, FIELD_INT),
+	META_INI(io_id, struct aio_object, FIELD_INT),
+	META_INI(io_skip_sync, struct aio_object, FIELD_INT),
+	{}
+};
+
+const struct meta xio_timespec_meta[] = {
+	META_INI_TRANSFER(tv_sec, struct timespec, FIELD_UINT, 8),
+	META_INI_TRANSFER(tv_nsec, struct timespec, FIELD_UINT, 4),
+	{}
+};
+
+//
+
+/* crypto stuff */
+
+#include
+#include
+
+/* 896545098777564212b9e91af4c973f094649aa7 */
+#ifndef crt_hash
+#define HAS_NEW_CRYPTO
+#endif
+
+#ifdef HAS_NEW_CRYPTO
+
+/* For now, use shash.
+ * Later, asynchronous support should be added for full exploitation
+ * of crypto hardware.
+ */
+#include
+
+static struct crypto_shash *xio_tfm;
+int xio_digest_size;
+
+struct mars_sdesc {
+	struct shash_desc shash;
+	char ctx[];
+};
+
+void xio_digest(unsigned char *digest, void *data, int len)
+{
+	int size = sizeof(struct mars_sdesc) + crypto_shash_descsize(xio_tfm);
+	struct mars_sdesc *sdesc = brick_mem_alloc(size);
+	int status;
+
+	sdesc->shash.tfm = xio_tfm;
+	sdesc->shash.flags = 0;
+
+	memset(digest, 0, xio_digest_size);
+	status = crypto_shash_digest(&sdesc->shash, data, len, digest);
+	if (unlikely(status < 0))
+		XIO_ERR(
+		"cannot calculate cksum on %p len=%d, status=%d\n",
+			data, len,
+			status);
+
+	brick_mem_free(sdesc);
+}
+
+#else /* HAS_NEW_CRYPTO */
+
+/* Old implementation, to disappear.
+ * Was a quick'n dirty lab prototype with unnecessary
+ * global variables and locking.
+ */
+
+static struct crypto_hash *xio_tfm;
+static struct semaphore tfm_sem;
+int xio_digest_size;
+
+void xio_digest(unsigned char *digest, void *data, int len)
+{
+	struct hash_desc desc = {
+		.tfm = xio_tfm,
+		.flags = 0,
+	};
+	struct scatterlist sg;
+
+	memset(digest, 0, xio
[RFC 10/32] mars: add new module lib_limiter
Signed-off-by: Thomas Schoebel-Theuer <t...@schoebel-theuer.de> --- drivers/staging/mars/lib/lib_limiter.c | 163 + include/linux/brick/lib_limiter.h | 52 +++ 2 files changed, 215 insertions(+) create mode 100644 drivers/staging/mars/lib/lib_limiter.c create mode 100644 include/linux/brick/lib_limiter.h diff --git a/drivers/staging/mars/lib/lib_limiter.c b/drivers/staging/mars/lib/lib_limiter.c new file mode 100644 index ..e77b74a0eae7 --- /dev/null +++ b/drivers/staging/mars/lib/lib_limiter.c @@ -0,0 +1,163 @@ +/* + * MARS Long Distance Replication Software + * + * Copyright (C) 2010-2014 Thomas Schoebel-Theuer + * Copyright (C) 2011-2014 1&1 Internet AG + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + */ + +#include + +#include +#include +#include +#include + +#define LIMITER_TIME_RESOLUTION NSEC_PER_SEC + +int rate_limit(struct rate_limiter *lim, int amount) +{ + int delay = 0; + long long now; + + now = cpu_clock(raw_smp_processor_id()); + + /* Compute the maximum delay along the path + * down to the root of the hierarchy tree. + */ + while (lim) { + long long window = now - lim->lim_stamp; + + /* Sometimes, raw CPU clocks may do weird things... + * Smaller windows in the denominator than 1s could fake unrealistic rates. 
+*/ + if (unlikely(lim->lim_min_window <= 0)) + lim->lim_min_window = 1000; + if (unlikely(lim->lim_max_window <= lim->lim_min_window)) + lim->lim_max_window = lim->lim_min_window + 8000; + if (unlikely(window < (long long)lim->lim_min_window * (LIMITER_TIME_RESOLUTION / 1000))) + window = (long long)lim->lim_min_window * (LIMITER_TIME_RESOLUTION / 1000); + + /* Update total statistics. +* They will intentionally wrap around. +* Userspace must take care of that. +*/ + if (likely(amount > 0)) { + lim->lim_total_amount += amount; + lim->lim_total_ops++; + } + + /* Only use incremental accumulation at repeated calls, but +* never after longer pauses. +*/ + if (likely(lim->lim_stamp && + window < (long long)lim->lim_max_window * (LIMITER_TIME_RESOLUTION / 1000))) { + long long rate_raw; + int rate; + int max_rate; + + /* Races are possible, but taken into account. +* There is no real harm from rarely lost updates. +*/ + if (likely(amount > 0)) { + lim->lim_amount_accu += amount; + lim->lim_amount_cumul += amount; + lim->lim_ops_accu++; + lim->lim_ops_cumul++; + } + + /* compute amount values */ + rate_raw = lim->lim_amount_accu * LIMITER_TIME_RESOLUTION / window; + rate = rate_raw; + if (unlikely(rate_raw > INT_MAX)) + rate = INT_MAX; + lim->lim_amount_rate = rate; + + /* amount limit exceeded? */ + max_rate = lim->lim_max_amount_rate; + if (max_rate > 0 && rate > max_rate) { + int this_delay = ( + + window * rate / max_rate - window) / (LIMITER_TIME_RESOLUTION / 1000); + /* compute maximum */ + if (this_delay > delay && this_delay > 0) + delay = this_delay; + } + + /* compute ops values */ + rate_raw = lim->lim_ops_accu * LIMITER_TIME_RESOLUTION / window; + rate = rate_raw; + if (unlikely(rate_raw > INT_MAX)) + rate = INT_MAX; + lim->lim_ops_rate = rate; + +
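The delay arithmetic in rate_limit() above can be easier to follow in isolation. The following is a minimal userspace sketch, not the module's code: the names `demo_rate()` and `demo_delay_ms()` and the constant `DEMO_TIME_RESOLUTION` are hypothetical, the time resolution is assumed to be nanoseconds per second (as `NSEC_PER_SEC` suggests), and the hierarchy walk, window clamping and statistics are left out.

```c
#include <limits.h>

/* Assumed to correspond to LIMITER_TIME_RESOLUTION (ns per second). */
#define DEMO_TIME_RESOLUTION 1000000000LL

/* Observed rate in units per second over a window given in ns,
 * saturated at INT_MAX like the patch does.
 * Hypothetical helper; window_ns must be > 0.
 */
static int demo_rate(long long amount_accu, long long window_ns)
{
	long long rate_raw = amount_accu * DEMO_TIME_RESOLUTION / window_ns;

	return rate_raw > INT_MAX ? INT_MAX : (int)rate_raw;
}

/* Delay in milliseconds that would stretch the window far enough
 * to bring the observed rate back down to max_rate.
 * Returns 0 when no limit is set or the limit is not exceeded.
 */
static int demo_delay_ms(int rate, int max_rate, long long window_ns)
{
	long long delay;

	if (max_rate <= 0 || rate <= max_rate)
		return 0;

	/* stretched window minus current window, converted from ns to ms */
	delay = (window_ns * rate / max_rate - window_ns) /
		(DEMO_TIME_RESOLUTION / 1000);

	return delay > 0 ? (int)delay : 0;
}
```

For example, 2000 units accumulated over a one-second window against a limit of 1000 units/s give a rate of 2000 and a delay of 1000 ms in this sketch, i.e. the caller would have to wait one more second for the average to fall back to the limit.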
[RFC 08/32] mars: add new module lib_queue
Signed-off-by: Thomas Schoebel-Theuer <t...@schoebel-theuer.de> --- include/linux/brick/lib_queue.h | 165 1 file changed, 165 insertions(+) create mode 100644 include/linux/brick/lib_queue.h diff --git a/include/linux/brick/lib_queue.h b/include/linux/brick/lib_queue.h new file mode 100644 index ..72cd0a2710c2 --- /dev/null +++ b/include/linux/brick/lib_queue.h @@ -0,0 +1,165 @@ +/* + * MARS Long Distance Replication Software + * + * Copyright (C) 2010-2014 Thomas Schoebel-Theuer + * Copyright (C) 2011-2014 1&1 Internet AG + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ */ + +#ifndef LIB_QUEUE_H +#define LIB_QUEUE_H + +#define QUEUE_ANCHOR(PREFIX, KEYTYPE, HEAPTYPE) \ + /* parameters */ \ + /* readonly from outside */ \ + atomic_t q_queued; \ + atomic_t q_flying; \ + atomic_t q_total; \ + /* tunables */ \ + int q_batchlen; \ + int q_io_prio; \ + bool q_ordering; \ + /* private */ \ + wait_queue_head_t *q_event; \ + spinlock_t q_lock; \ + struct list_head q_anchor; \ + struct pairing_heap_##HEAPTYPE *heap_high; \ + struct pairing_heap_##HEAPTYPE *heap_low; \ + long long q_last_insert; /* jiffies */ \ + KEYTYPE heap_margin; \ + KEYTYPE last_pos + +#define QUEUE_FUNCTIONS(PREFIX, ELEM_TYPE, HEAD, KEYFN, KEYCMP, HEAPTYPE) \ + \ +static inline \ +void q_##PREFIX##_trigger(struct PREFIX##_queue *q) \ +{ \ + if (q->q_event) { \ + wake_up_interruptible(q->q_event); \ + } \ +} \ + \ +static inline \ +void q_##PREFIX##_init(struct PREFIX##_queue *q) \ +{ \ + INIT_LIST_HEAD(&q->q_anchor); \ + q->heap_low = NULL; \ + q->heap_high = NULL; \ + spin_lock_init(&q->q_lock); \ + atomic_set(&q->q_queued, 0); \ + atomic_set(&q->q_flying, 0); \ +} \ + \ +static inline \ +void q_##PREFIX##_insert(struct PREFIX##_queue *q, ELEM_TYPE * elem) \ +{ \ + unsigned long flags; \ + \ + spin_lock_irqsave(&q->q_lock, flags); \ + \ + if (q->q_ordering) { \ + struct pairing_heap_##HEAPTYPE **use = &q->heap_high; \ + if (KEYCMP(KEYFN(elem), &q->heap_margin) <= 0) { \ + use = &q->heap_low; \ + }
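The QUEUE_ANCHOR / QUEUE_FUNCTIONS pair above is a macro template: one macro contributes the fields, the other stamps out type-specific inline functions via token pasting. A much reduced userspace sketch of the same pattern follows. Everything named `demo_*` or `DEMO_*` is hypothetical, the `_fetch` function is an assumption (it is not visible in the excerpt), and locking, atomics, the wait queue and the pairing heaps are deliberately omitted.

```c
#include <stddef.h>

/* Intrusive link embedded in each element (stand-in for list_head). */
struct demo_node {
	struct demo_node *next;
};

/* Fields every generated queue type carries. */
#define DEMO_QUEUE_ANCHOR \
	int q_queued; \
	struct demo_node *q_head; \
	struct demo_node **q_tail

/* Generate q_<PREFIX>_init/insert/fetch for one element type.
 * HEAD names the struct demo_node member inside ELEM_TYPE.
 */
#define DEMO_QUEUE_FUNCTIONS(PREFIX, ELEM_TYPE, HEAD) \
static inline \
void q_##PREFIX##_init(struct PREFIX##_queue *q) \
{ \
	q->q_queued = 0; \
	q->q_head = NULL; \
	q->q_tail = &q->q_head; \
} \
 \
static inline \
void q_##PREFIX##_insert(struct PREFIX##_queue *q, ELEM_TYPE *elem) \
{ \
	elem->HEAD.next = NULL; \
	*q->q_tail = &elem->HEAD; /* append at the tail (FIFO) */ \
	q->q_tail = &elem->HEAD.next; \
	q->q_queued++; \
} \
 \
static inline \
ELEM_TYPE *q_##PREFIX##_fetch(struct PREFIX##_queue *q) \
{ \
	struct demo_node *n = q->q_head; \
 \
	if (!n) \
		return NULL; \
	q->q_head = n->next; \
	if (!q->q_head) \
		q->q_tail = &q->q_head; \
	q->q_queued--; \
	/* container_of by hand: step back from the link to the element */ \
	return (ELEM_TYPE *)((char *)n - offsetof(ELEM_TYPE, HEAD)); \
}

/* One instantiation: a queue of struct demo_req linked through .link. */
struct demo_req {
	int id;
	struct demo_node link;
};

struct req_queue {
	DEMO_QUEUE_ANCHOR;
};

DEMO_QUEUE_FUNCTIONS(req, struct demo_req, link)
```

After this instantiation, `q_req_init()`, `q_req_insert()` and `q_req_fetch()` exist as ordinary typed functions, which is the main attraction of the approach: no `void *` casts at the call sites, at the price of harder-to-read macro bodies.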
[RFC 05/32] mars: add new module meta
Signed-off-by: Thomas Schoebel-Theuer
---
 include/linux/brick/meta.h | 106 +
 1 file changed, 106 insertions(+)
 create mode 100644 include/linux/brick/meta.h

diff --git a/include/linux/brick/meta.h b/include/linux/brick/meta.h
new file mode 100644
index ..a92b2b649c1f
--- /dev/null
+++ b/include/linux/brick/meta.h
@@ -0,0 +1,106 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#ifndef META_H
+#define META_H
+
+/***/
+
+/* metadata descriptions */
+
+/* The idea is to describe your C structures in such a way that
+ * transfers to disk or over a network become self-describing.
+ *
+ * In essence, this is a kind of version-independent marshalling.
+ *
+ * Advantage:
+ * When you extend your original C struct (and of course update the
+ * corresponding meta structure), old data on disk (or network peers
+ * running an old version of your program) will remain valid.
+ * Upon read, newly added fields missing in the old version will be simply
+ * not filled in and therefore remain zeroed (if you don't forget to
+ * initially clear your structures via memset() / initializers / etc).
+ * Note that this works only if you never rename or remove existing
+ * fields; you should only add new ones.
+ * [TODO: add macros for description of ignored / renamed fields to
+ * overcome this limitation]
+ * You may increase the size of integers, for example from 32bit to 64bit
+ * or even higher; sign extension will be automatically carried out
+ * when necessary.
+ * Also, you may change the order of fields, because the metadata interpreter
+ * will check each field individually; field offsets are automatically
+ * maintained.
+ *
+ * Disadvantage: this adds some (small) overhead.
+ */
+
+enum field_type {
+	FIELD_DONE,
+	FIELD_REF,
+	FIELD_SUB,
+	FIELD_STRING,
+	FIELD_RAW,
+	FIELD_INT,
+	FIELD_UINT,
+};
+
+struct meta {
+	/* char field_name[MAX_FIELD_LEN]; */
+	char *field_name;
+
+	short field_type;
+	short field_data_size;
+	short field_transfer_size;
+	int field_offset;
+	const struct meta *field_ref;
+};
+
+#define _META_INI(NAME, STRUCT, TYPE, TSIZE) \
+	.field_name = #NAME, \
+	.field_type = TYPE, \
+	.field_data_size = sizeof(((STRUCT *)NULL)->NAME), \
+	.field_transfer_size = (TSIZE), \
+	.field_offset = offsetof(STRUCT, NAME) \
+
+#define META_INI_TRANSFER(NAME, STRUCT, TYPE, TSIZE) \
+	{ _META_INI(NAME, STRUCT, TYPE, TSIZE) }
+
+#define META_INI(NAME, STRUCT, TYPE) \
+	{ _META_INI(NAME, STRUCT, TYPE, 0) }
+
+#define _META_INI_AIO(NAME, STRUCT, AIO) \
+	.field_name = #NAME, \
+	.field_type = FIELD_REF, \
+	.field_data_size = sizeof(*(((STRUCT *)NULL)->NAME)), \
+	.field_offset = offsetof(STRUCT, NAME), \
+	.field_ref = AIO
+
+#define META_INI_AIO(NAME, STRUCT, AIO) { _META_INI_AIO(NAME, STRUCT, AIO) }
+
+#define _META_INI_SUB(NAME, STRUCT, SUB) \
+	.field_name = #NAME, \
+	.field_type = FIELD_SUB, \
+	.field_data_size = sizeof(((STRUCT *)NULL)->NAME), \
+	.field_offset = offsetof(STRUCT, NAME), \
+	.field_ref = SUB
+
+#define META_INI_SUB(NAME, STRUCT, SUB) { _META_INI_SUB(NAME, STRUCT, SUB) }
+
+extern const struct meta *find_meta(const struct meta *meta, const char *field_name);
+/* extern void free_meta(void *data, const struct meta *meta); */
+
+#endif
-- 
2.11.0
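To illustrate the self-describing idea from the meta.h comment, here is a small userspace sketch of a meta table and a lookup by field name. It is only an approximation under stated assumptions: `struct demo_info`, `demo_info_meta` and `demo_find_meta()` are hypothetical names, the macro is a trimmed copy of META_INI, and the lookup loop is a guess at what find_meta() might do (the patch only declares it).

```c
#include <stddef.h>
#include <string.h>

enum field_type {
	FIELD_DONE,	/* 0: also what a zeroed terminator entry carries */
	FIELD_REF,
	FIELD_SUB,
	FIELD_STRING,
	FIELD_RAW,
	FIELD_INT,
	FIELD_UINT,
};

struct meta {
	const char *field_name;
	short field_type;
	short field_data_size;
	short field_transfer_size;
	int field_offset;
	const struct meta *field_ref;
};

/* Trimmed variant of META_INI: name, type, size and offset are all
 * derived from the C struct itself, so the table cannot drift from it.
 */
#define DEMO_META_INI(NAME, STRUCT, TYPE) \
	{ \
		.field_name = #NAME, \
		.field_type = TYPE, \
		.field_data_size = sizeof(((STRUCT *)NULL)->NAME), \
		.field_transfer_size = 0, \
		.field_offset = offsetof(STRUCT, NAME), \
	}

/* Hypothetical structure to be described. */
struct demo_info {
	long long current_size;
	int tf_align;
	int tf_min_size;
};

static const struct meta demo_info_meta[] = {
	DEMO_META_INI(current_size, struct demo_info, FIELD_INT),
	DEMO_META_INI(tf_align, struct demo_info, FIELD_INT),
	DEMO_META_INI(tf_min_size, struct demo_info, FIELD_INT),
	{ 0 }	/* terminator: field_name == NULL */
};

/* Assumed lookup: linear scan until the zeroed terminator. */
static const struct meta *demo_find_meta(const struct meta *meta,
					 const char *field_name)
{
	for (; meta->field_name; meta++) {
		if (!strcmp(meta->field_name, field_name))
			return meta;
	}
	return NULL;
}
```

A receiver that gets a field name from an older or newer peer can look it up this way and copy the value to `field_offset` in its own struct; names it does not know return NULL and can be skipped, which is what makes reordering and adding fields harmless.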
[RFC 00/32] State of MARS Geo-Redundancy Module
d start joining the MARS development in 2017, at least for helping me getting it upstream. I would be excited if I would be invited to the next kernel summit or a similar meeting. A happy new year from your devoted Thomas [1] https://github.com/schoebel/mars/blob/master/docu/MARS_GUUG2016.pdf [2] https://github.com/schoebel/mars Thomas Schoebel-Theuer (32): mars: add new module lamport mars: add new module brick_say mars: add new module brick_mem mars: add new module brick_checking mars: add new module meta mars: add new module brick mars: add new module lib_pairing_heap mars: add new module lib_queue mars: add new module lib_rank mars: add new module lib_limiter mars: add new module lib_timing mars: add new module vfs_compat mars: add new module xio mars: add new module xio_net mars: add new module lib_mapfree mars: add new module lib_log mars: add new module xio_bio mars: add new module xio_sio mars: add new module xio_client mars: add new module xio_if mars: add new module xio_copy mars: add new module xio_trans_logger mars: add new module xio_server mars: add new module strategy mars: add new module main_strategy mars: add new module net mars: add new module server_strategy mars: add new module mars_proc mars: add new module mars_main mars: add new module Makefile mars: add new module Kconfig mars: activate build drivers/staging/Kconfig|2 + drivers/staging/Makefile |1 + drivers/staging/mars/Kconfig | 266 + drivers/staging/mars/Makefile | 96 + drivers/staging/mars/brick.c | 723 +++ drivers/staging/mars/brick_mem.c | 1080 drivers/staging/mars/brick_say.c | 920 +++ drivers/staging/mars/lamport.c | 61 + drivers/staging/mars/lib/lib_limiter.c | 163 + drivers/staging/mars/lib/lib_rank.c| 87 + drivers/staging/mars/lib/lib_timing.c | 68 + drivers/staging/mars/mars/main_strategy.c | 2135 +++ drivers/staging/mars/mars/mars_main.c | 6160 drivers/staging/mars/mars/mars_proc.c | 389 ++ drivers/staging/mars/mars/mars_proc.h | 34 + drivers/staging/mars/mars/net.c| 109 + 
drivers/staging/mars/mars/server_strategy.c| 436 ++ drivers/staging/mars/mars/strategy.h | 239 + drivers/staging/mars/xio_bricks/lib_log.c | 506 ++ drivers/staging/mars/xio_bricks/lib_mapfree.c | 382 ++ drivers/staging/mars/xio_bricks/xio.c | 227 + drivers/staging/mars/xio_bricks/xio_bio.c | 845 +++ drivers/staging/mars/xio_bricks/xio_client.c | 1083 drivers/staging/mars/xio_bricks/xio_copy.c | 1005 drivers/staging/mars/xio_bricks/xio_if.c | 892 +++ drivers/staging/mars/xio_bricks/xio_net.c | 1849 ++ drivers/staging/mars/xio_bricks/xio_server.c | 493 ++ drivers/staging/mars/xio_bricks/xio_sio.c | 578 ++ drivers/staging/mars/xio_bricks/xio_trans_logger.c | 3410 +++ include/linux/brick/brick.h| 620 ++ include/linux/brick/brick_checking.h | 107 + include/linux/brick/brick_mem.h| 218 + include/linux/brick/brick_say.h| 89 + include/linux/brick/lamport.h | 26 + include/linux/brick/lib_limiter.h | 52 + include/linux/brick/lib_pairing_heap.h | 109 + include/linux/brick/lib_queue.h| 165 + include/linux/brick/lib_rank.h | 136 + include/linux/brick/lib_timing.h | 182 + include/linux/brick/meta.h | 106 + include/linux/brick/vfs_compat.h | 48 + include/linux/xio/lib_log.h| 333 ++ include/linux/xio/lib_mapfree.h| 84 + include/linux/xio/xio.h| 319 + include/linux/xio/xio_bio.h| 85 + include/linux/xio/xio_client.h | 105 + include/linux/xio/xio_copy.h | 115 + include/linux/xio/xio_if.h | 109 + include/linux/xio/xio_net.h| 177 + include/linux/xio/xio_server.h | 91 + include/linux/xio/xio_sio.h| 68 + include/linux/xio/xio_trans_logger.h | 271 + 52 files changed, 27854 insertions(+) create mode 100644 drivers/staging/mars/Kconfig create mode 100644 drivers/staging/mars/Makefile create mode 100644 drivers/staging/mars/brick.c create mode 100644 drivers/staging/mars/brick_mem.c create mode 100644 drivers/staging/mars/brick_say.c create mode 100644 drivers/staging/mars/lamport.c create m
[RFC 14/32] mars: add new module xio_net
Signed-off-by: Thomas Schoebel-Theuer --- drivers/staging/mars/xio_bricks/xio_net.c | 1849 + include/linux/xio/xio_net.h | 177 +++ 2 files changed, 2026 insertions(+) create mode 100644 drivers/staging/mars/xio_bricks/xio_net.c create mode 100644 include/linux/xio/xio_net.h diff --git a/drivers/staging/mars/xio_bricks/xio_net.c b/drivers/staging/mars/xio_bricks/xio_net.c new file mode 100644 index ..441eee1f3912 --- /dev/null +++ b/drivers/staging/mars/xio_bricks/xio_net.c @@ -0,0 +1,1849 @@ +/* + * MARS Long Distance Replication Software + * + * Copyright (C) 2010-2014 Thomas Schoebel-Theuer + * Copyright (C) 2011-2014 1&1 Internet AG + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + */ + +#include +#include +#include +#include +#include +#include + +#include +#include + +/**/ + +/* provisionary version detection */ + +#ifndef TCP_MAX_REORDERING +#define __HAS_IOV_ITER +#endif + +#ifdef sk_net_refcnt +/* see eeb1bd5c40edb0e2fd925c8535e2fdebdbc5cef2 */ +#define __HAS_STRUCT_NET +#endif + +/**/ + +#define USE_BUFFERING + +#define SEND_PROTO_VERSION 2 + +enum COMPRESS_TYPES { + COMPRESS_NONE = 0, + COMPRESS_LZO = 1, + /* insert further methods here */ +}; + +int xio_net_compress_data; + +const u16 net_global_flags = 0 +#ifdef __HAVE_LZO + | COMPRESS_LZO +#endif + ; + +/**/ + +/* Internal data structures for low-level transfer of C structures + * described by struct meta. + * Only these low-level fields need to have a fixed size like s64. 
+ * The size and bytesex of the higher-level C structures are converted + * automatically; therefore classical "int" or "long long" etc is viable. + */ + +#define MAX_FIELD_LEN (32 + 16) + +/* Please keep this at a size of 64 bytes by + * reuse of *spare* fields. + */ +struct xio_desc_cache { + u8 cache_sender_proto; + u8 cache_recver_proto; + s8 cache_is_bigendian; + u8 cache_spare0; + s16 cache_items; + u16 cache_spare1; + u32 cache_spare2; + u32 cache_spare3; + u64 cache_spare4[4]; + u64 cache_sender_cookie; + u64 cache_recver_cookie; +}; + +/* Please keep this also at a size of 64 bytes by + * reuse of *spare* fields. + */ +struct xio_desc_item { + s8 field_type; + s8 field_spare0; + s16 field_data_size; + s16 field_sender_size; + s16 field_sender_offset; + s16 field_recver_size; + s16 field_recver_offset; + s32 field_spare; + char field_name[MAX_FIELD_LEN]; +}; + +/* This must not be mirror symmetric between big and little endian + */ +#define XIO_DESC_MAGIC 0x73D0A2EC6148F48Ell + +struct xio_desc_header { + u64 h_magic; + u64 h_cookie; + s16 h_meta_len; + s16 h_index; + u32 h_spare1; + u64 h_spare2; +}; + +#define MAX_INT_TRANSFER 16 + +/**/ + +/* Bytesex conversion / sign extension + */ + +#ifdef __LITTLE_ENDIAN +static const bool myself_is_bigendian; + +#endif +#ifdef __BIG_ENDIAN +static const bool myself_is_bigendian = true; + +#endif + +static inline +void swap_bytes(void *data, int len) +{ + char *a = data; + char *b = data + len - 1; + + while (a < b) { + char tmp = *a; + + *a = *b; + *b = tmp; + a++; + b--; + } +} + +#define SWAP_FIELD(x) swap_bytes(&(x), sizeof(x)) + +static inline +void swap_mc(struct xio_desc_cache *mc, int len) +{ + struct xio_desc_item *mi; + + SWAP_FIELD(mc->cache_sender_cookie); + SWAP_FIELD(mc->cache_recver_cookie); + SWAP_FIELD(mc->cache_items); + + len -= sizeof(*mc); + + for (mi = (void *)(mc + 1); len > 0; mi++, len -= sizeof(*mi)) { + SWAP_FIELD(mi->field_data_size); + SWAP_FIELD(mi->field_sender_size); + 
SWAP_FIELD(mi->field_sender_offset); + SWAP_FIELD(mi->field_recver_size); + SWAP_FIELD(mi->field_recver_offset); + } +} + +static inline +char get_sign(const void *data, int len, bool is_bigendian, bool is_
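The comment on XIO_DESC_MAGIC asks for a value that is not mirror symmetric under byte reversal. One plausible reason, which the excerpt itself does not confirm, is that a receiver can then tell from the magic alone whether the peer uses the opposite byte order. The userspace sketch below shows that idea together with an in-place byte swap modelled on swap_bytes(); `demo_swap_bytes()`, `demo_peer_is_swapped()` and `DEMO_DESC_MAGIC` are hypothetical names.

```c
#include <stdint.h>

/* Same numeric value as XIO_DESC_MAGIC in the patch. */
#define DEMO_DESC_MAGIC 0x73D0A2EC6148F48EULL

/* Reverse len bytes in place (same loop as swap_bytes(), but with the
 * pointer arithmetic done on char * so it is portable ISO C).
 */
static void demo_swap_bytes(void *data, int len)
{
	char *a = data;
	char *b = a + len - 1;

	while (a < b) {
		char tmp = *a;

		*a = *b;
		*b = tmp;
		a++;
		b--;
	}
}

/* Classify a magic value as read from the wire into a local u64:
 *   0  -> peer has our byte order
 *   1  -> peer has the opposite byte order, fields need swapping
 *  -1  -> not a valid header at all
 * This only works because the magic differs from its own mirror image.
 */
static int demo_peer_is_swapped(uint64_t wire_magic)
{
	uint64_t mirrored = wire_magic;

	if (wire_magic == DEMO_DESC_MAGIC)
		return 0;

	demo_swap_bytes(&mirrored, sizeof(mirrored));
	if (mirrored == DEMO_DESC_MAGIC)
		return 1;

	return -1;
}
```

A palindromic magic such as 0x1122334444332211 would return 0 for both byte orders here, which is presumably the situation the comment wants to rule out.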
[RFC 19/32] mars: add new module xio_client
Signed-off-by: Thomas Schoebel-Theuer <t...@schoebel-theuer.de> --- drivers/staging/mars/xio_bricks/xio_client.c | 1083 ++ include/linux/xio/xio_client.h | 105 +++ 2 files changed, 1188 insertions(+) create mode 100644 drivers/staging/mars/xio_bricks/xio_client.c create mode 100644 include/linux/xio/xio_client.h diff --git a/drivers/staging/mars/xio_bricks/xio_client.c b/drivers/staging/mars/xio_bricks/xio_client.c new file mode 100644 index ..209523378660 --- /dev/null +++ b/drivers/staging/mars/xio_bricks/xio_client.c @@ -0,0 +1,1083 @@ +/* + * MARS Long Distance Replication Software + * + * Copyright (C) 2010-2014 Thomas Schoebel-Theuer + * Copyright (C) 2011-2014 1&1 Internet AG + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ */ + +#include +#include +#include +#include + +#include + +/ own type definitions ***/ + +#include + +#define CLIENT_HASH_MAX(PAGE_SIZE / sizeof(struct list_head)) + +int xio_client_abort = 10; + +int max_client_channels = 1; + +int max_client_bulk = 16; + +/ own helper functions ***/ + +static int thread_count; + +static +void _do_resubmit(struct client_channel *ch) +{ + struct client_output *output = ch->output; + unsigned long flags; + + spin_lock_irqsave(>lock, flags); + if (!list_empty(>wait_list)) { + struct list_head *first = ch->wait_list.next; + struct list_head *last = ch->wait_list.prev; + struct list_head *old_start = output->aio_list.next; + +#define list_connect __list_del /* the original routine has a misleading name: in reality it is more general */ + list_connect(>aio_list, first); + list_connect(last, old_start); + INIT_LIST_HEAD(>wait_list); + } + spin_unlock_irqrestore(>lock, flags); +} + +static +void _kill_thread(struct client_threadinfo *ti, const char *name) +{ + struct task_struct *thread = ti->thread; + + if (thread) { + XIO_DBG("stopping %s thread\n", name); + ti->thread = NULL; + brick_thread_stop(thread); + } +} + +static +void _kill_channel(struct client_channel *ch) +{ + XIO_DBG("channel = %p\n", ch); + if (xio_socket_is_alive(>socket)) { + XIO_DBG("shutdown socket\n"); + xio_shutdown_socket(>socket); + } + _kill_thread(>receiver, "receiver"); + if (ch->is_open) { + XIO_DBG("close socket\n"); + xio_put_socket(>socket); + } + ch->recv_error = 0; + ch->is_used = false; + ch->is_open = false; + ch->is_connected = false; + /* Re-Submit any waiting requests +*/ + _do_resubmit(ch); +} + +static inline +void _kill_all_channels(struct client_bundle *bundle) +{ + int i; + + /* first pass: shutdown in parallel without waiting */ + for (i = 0; i < MAX_CLIENT_CHANNELS; i++) { + struct client_channel *ch = >channel[i]; + + if (xio_socket_is_alive(>socket)) { + XIO_DBG("shutdown socket %d\n", i); + xio_shutdown_socket(>socket); + } + } + /* 
separate pass (may wait) */ + for (i = 0; i < MAX_CLIENT_CHANNELS; i++) + _kill_channel(>channel[i]); +} + +static int receiver_thread(void *data); + +static +int _setup_channel(struct client_bundle *bundle, int ch_nr) +{ + struct client_channel *ch = >channel[ch_nr]; + struct sockaddr_storage src_sockaddr; + struct sockaddr_storage dst_sockaddr; + int status; + + ch->ch_nr = ch_nr; + if (unlikely(ch->receiver.thread)) { + XIO_WRN("receiver thread %d unexpectedly not dead\n", ch_nr); + _kill_thread(>receiver, "receiver"); + } + + status = xio_create_sockaddr(_sockaddr, my_id()); + if (unlikely(status < 0)) { + XIO_DBG("no src sockaddr, status = %d\n", status); + goto done; + } + + status = xio_create_sockaddr(_sockaddr, bundle->host); + if (unlikely(status < 0)) { + XIO_DBG("no dst sockaddr, status = %d
[RFC 03/32] mars: add new module brick_mem
Signed-off-by: Thomas Schoebel-Theuer --- drivers/staging/mars/brick_mem.c | 1080 ++ include/linux/brick/brick_mem.h | 218 2 files changed, 1298 insertions(+) create mode 100644 drivers/staging/mars/brick_mem.c create mode 100644 include/linux/brick/brick_mem.h diff --git a/drivers/staging/mars/brick_mem.c b/drivers/staging/mars/brick_mem.c new file mode 100644 index ..232dbf6cb0ca --- /dev/null +++ b/drivers/staging/mars/brick_mem.c @@ -0,0 +1,1080 @@ +/* + * MARS Long Distance Replication Software + * + * Copyright (C) 2010-2014 Thomas Schoebel-Theuer + * Copyright (C) 2011-2014 1&1 Internet AG + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + */ + +#include +#include +#include +#include +#include +#include + +#include + +#include +#include +#include + +#define USE_KERNEL_PAGES /* currently mandatory (vmalloc does not work) */ + +#define MAGIC_BLOCK0x8B395D7B +#define MAGIC_BEND 0x8B395D7C +#define MAGIC_MEM1 0x8B395D7D +#define MAGIC_MEM2 0x9B395D8D +#define MAGIC_MEND10x8B395D7E +#define MAGIC_MEND20x9B395D8E +#define MAGIC_STR 0x8B395D7F +#define MAGIC_SEND 0x9B395D8F + +#define INT_ACCESS(ptr, offset) (*(int *)(((char *)(ptr)) + (offset))) + +#define _BRICK_FMT(_fmt, _class) \ + "%ld.%09ld %ld.%09ld MEM_%-5s %s[%d] %s:%d %s(): " \ + _fmt, \ + _s_now.tv_sec, _s_now.tv_nsec, \ + _l_now.tv_sec, _l_now.tv_nsec, \ + say_class[_class], \ + current->comm, (int)smp_processor_id(), \ + __BASE_FILE__, \ + __LINE__, \ + __func__ + +#define _BRICK_MSG(_class, _dump, _fmt, _args...) 
\
+	do { \
+		struct timespec _s_now = CURRENT_TIME; \
+		struct timespec _l_now; \
+		get_lamport(&_l_now); \
+		say(_class, _BRICK_FMT(_fmt, _class), ##_args); \
+		if (_dump) \
+			dump_stack(); \
+	} while (0)
+
+#define BRICK_ERR(_fmt, _args...) _BRICK_MSG(SAY_ERROR, true, _fmt, ##_args)
+#define BRICK_WRN(_fmt, _args...) _BRICK_MSG(SAY_WARN, false, _fmt, ##_args)
+#define BRICK_INF(_fmt, _args...) _BRICK_MSG(SAY_INFO, false, _fmt, ##_args)
+
+/***/
+
+/* limit handling */
+
+#include
+
+long long brick_global_memavail;
+long long brick_global_memlimit;
+
+atomic64_t brick_global_block_used = ATOMIC64_INIT(0);
+
+void get_total_ram(void)
+{
+	struct sysinfo i = {};
+
+	si_meminfo(&i);
+	/* si_swapinfo(&i); */
+	brick_global_memavail = (long long)i.totalram * (PAGE_SIZE / 1024);
+	BRICK_INF("total RAM = %lld [KiB]\n", brick_global_memavail);
+}
+
+/***/
+
+/* small memory allocation (use this only for len < PAGE_SIZE) */
+
+#ifdef BRICK_DEBUG_MEM
+static atomic_t phys_mem_alloc = ATOMIC_INIT(0);
+static atomic_t mem_redirect_alloc = ATOMIC_INIT(0);
+static atomic_t mem_count[BRICK_DEBUG_MEM];
+static atomic_t mem_free[BRICK_DEBUG_MEM];
+static int mem_len[BRICK_DEBUG_MEM];
+
+#define PLUS_SIZE (6 * sizeof(int))
+#else
+#define PLUS_SIZE (2 * sizeof(int))
+#endif
+
+static inline
+void *__brick_mem_alloc(int len)
+{
+	void *res;
+
+	if (len >= PAGE_SIZE) {
+#ifdef BRICK_DEBUG_MEM
+		atomic_inc(&mem_redirect_alloc);
+#endif
+		res = _brick_block_alloc(0, len, 0);
+	} else {
+		for (;;) {
+			res = kmalloc(len, GFP_BRICK);
+			if (likely(res))
+				break;
+			msleep(1000);
+		}
+#ifdef BRICK
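
The excerpt defines magic constants and a PLUS_SIZE of two ints per small allocation, but the part that actually lays out the extra words is cut off. The following userspace sketch shows one way such guard words can work. The layout [length][payload][magic] is an assumption made for illustration, not necessarily the layout brick_mem.c uses, and guarded_alloc(), guarded_check() and guarded_free() are hypothetical names. Only the MAGIC_MEM1 value and the PLUS_SIZE expression are taken from the patch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAGIC_MEM1 0x8B395D7Du		/* value from the patch */
#define PLUS_SIZE (2 * sizeof(int))	/* non-debug overhead from the patch */

/*
 * Assumed layout (illustration only): [int len][payload ...][unsigned magic]
 * memcpy() is used for the guard words because the trailing word may be
 * unaligned. The payload itself is only int-aligned in this sketch.
 */
static void *guarded_alloc(int len)
{
	unsigned int magic = MAGIC_MEM1;
	char *raw;

	if (len < 0)
		return NULL;
	raw = malloc((size_t)len + PLUS_SIZE);
	if (!raw)
		return NULL;
	memcpy(raw, &len, sizeof(len));
	memcpy(raw + sizeof(int) + len, &magic, sizeof(magic));
	return raw + sizeof(int);
}

/* Returns 1 when the trailing guard word is intact, 0 when it was overwritten. */
static int guarded_check(const void *data)
{
	const char *raw = (const char *)data - sizeof(int);
	unsigned int magic;
	int len;

	memcpy(&len, raw, sizeof(len));
	memcpy(&magic, raw + sizeof(int) + len, sizeof(magic));
	return magic == MAGIC_MEM1;
}

static void guarded_free(void *data)
{
	assert(guarded_check(data));	/* catch overruns at free time */
	free((char *)data - sizeof(int));
}
```

Checking the guard at free time only detects an overrun after the fact; it does not say which writer caused it.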
[RFC 22/32] mars: add new module xio_trans_logger
Signed-off-by: Thomas Schoebel-Theuer --- drivers/staging/mars/xio_bricks/xio_trans_logger.c | 3410 include/linux/xio/xio_trans_logger.h | 271 ++ 2 files changed, 3681 insertions(+) create mode 100644 drivers/staging/mars/xio_bricks/xio_trans_logger.c create mode 100644 include/linux/xio/xio_trans_logger.h diff --git a/drivers/staging/mars/xio_bricks/xio_trans_logger.c b/drivers/staging/mars/xio_bricks/xio_trans_logger.c new file mode 100644 index ..f82e9075ac5a --- /dev/null +++ b/drivers/staging/mars/xio_bricks/xio_trans_logger.c @@ -0,0 +1,3410 @@ +/* + * MARS Long Distance Replication Software + * + * Copyright (C) 2010-2014 Thomas Schoebel-Theuer + * Copyright (C) 2011-2014 1&1 Internet AG + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + */ + +/* Trans_Logger brick */ + +#define XIO_DEBUGGING + +#include +#include +#include +#include + +#include +#include + +#include + +/* variants */ +#define KEEP_UNIQUE +#define DELAY_CALLERS /* this is _needed_ for production systems */ +/* When possible, queue 1 executes phase3_startio() directly without + * intermediate queueing into queue 3 = > may be irritating, but has better + * performance. NOTICE: when some day the IO scheduling should be + * different between queue 1 and 3, you MUST disable this in order + * to distinguish between them! + */ +#define SHORTCUT_1_to_3 + +/* commenting this out is dangerous for data integrity! use only for testing! 
*/ +#define USE_MEMCPY +#define DO_WRITEBACK /* otherwise FAKE IO */ +#define REPLAY_DATA + +/* tuning */ +#ifdef BRICK_DEBUG_MEM +#define CONF_TRANS_CHUNKSIZE (128 * 1024 - PAGE_SIZE * 2) +#else +#define CONF_TRANS_CHUNKSIZE (128 * 1024) +#endif +#define CONF_TRANS_MAX_AIO_SIZEPAGE_SIZE +#define CONF_TRANS_ALIGN 0 + +#define XIO_RPL(_args...) /*empty*/ + +struct trans_logger_hash_anchor { + struct rw_semaphore hash_mutex; + struct list_head hash_anchor; +}; + +#define NR_HASH_PAGES 64 + +#define MAX_HASH_PAGES (PAGE_SIZE / sizeof(struct trans_logger_hash_anchor *)) +#define HASH_PER_PAGE (PAGE_SIZE / sizeof(struct trans_logger_hash_anchor)) +#define HASH_TOTAL (NR_HASH_PAGES * HASH_PER_PAGE) + +#define STATIST_SIZE 2048 + +/ global tuning ***/ + +int trans_logger_completion_semantics = 1; + +int trans_logger_do_crc = +#ifdef CONFIG_MARS_DEBUG + true; +#else + false; +#endif + +int trans_logger_mem_usage; /* in KB */ + +int trans_logger_max_interleave = -1; + +int trans_logger_resume = 1; + +int trans_logger_replay_timeout = 1; /* in s */ + +struct writeback_group global_writeback = { + .lock = __RW_LOCK_UNLOCKED(global_writeback.lock), + .group_anchor = LIST_HEAD_INIT(global_writeback.group_anchor), + .until_percent = 30, +}; + +static +void add_to_group(struct writeback_group *gr, struct trans_logger_brick *brick) +{ + unsigned long flags; + + write_lock_irqsave(>lock, flags); + list_add_tail(>group_head, >group_anchor); + write_unlock_irqrestore(>lock, flags); +} + +static +void remove_from_group(struct writeback_group *gr, struct trans_logger_brick *brick) +{ + unsigned long flags; + + write_lock_irqsave(>lock, flags); + list_del_init(>group_head); + gr->leader = NULL; + write_unlock_irqrestore(>lock, flags); +} + +static +struct trans_logger_brick *elect_leader(struct writeback_group *gr) +{ + struct trans_logger_brick *res = gr->leader; + struct list_head *tmp; + unsigned long flags; + + if (res && gr->until_percent >= 0) { + loff_t used = 
atomic64_read(&res->shadow_mem_used);
+
+		if (used > gr->biggest * gr->until_percent / 100)
+			goto done;
+	}
+
+	read_lock_irqsave(&gr->lock, flags);
+	for (tmp = gr->group_anchor.next; tmp != &gr->group_anchor; tmp = tmp->next) {
+		struct trans_logger_brick *test = container_of(tmp, struct trans_logger_brick, group_head);
+		loff_t new_used = atomic64_read(&test->shadow_mem_used);
+
+		if (!res || new_used > atomic64_read(&res->shadow_mem_used)) {
+			res = test;
+			gr->biggest = new_used;
+
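
The visible part of elect_leader() combines two ideas: the current leader is kept as long as its shadow memory usage stays above until_percent of the recorded peak, and otherwise the brick with the largest usage is chosen. A minimal userspace sketch of that selection follows. It assumes an array instead of the kernel list, drops the rwlock, and uses hypothetical type names. What the real function does after the loop is not visible in the excerpt, so storing the winner back into the leader field is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced stand-ins for trans_logger_brick and writeback_group. */
struct brick_sketch {
	long long shadow_mem_used;
};

struct group_sketch {
	struct brick_sketch *leader;
	long long biggest;	/* usage recorded when the leader was elected */
	int until_percent;	/* negative disables the hysteresis */
};

static struct brick_sketch *elect_leader_sketch(struct group_sketch *gr,
						struct brick_sketch *bricks, size_t n)
{
	struct brick_sketch *res = gr->leader;
	size_t i;

	/* Hysteresis: keep the leader while it is above until_percent of the peak. */
	if (res && gr->until_percent >= 0 &&
	    res->shadow_mem_used > gr->biggest * gr->until_percent / 100)
		return res;

	/* Otherwise pick the brick with the largest usage. */
	for (i = 0; i < n; i++) {
		if (!res || bricks[i].shadow_mem_used > res->shadow_mem_used) {
			res = &bricks[i];
			gr->biggest = res->shadow_mem_used;
		}
	}
	gr->leader = res;	/* assumption: not shown in the excerpt */
	return res;
}
```

With until_percent = 30 as in global_writeback, a leader elected at a usage of 50 keeps its role until its usage drops to 15 or below, which avoids switching leaders on every small fluctuation.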
[RFC 18/32] mars: add new module xio_sio
Signed-off-by: Thomas Schoebel-Theuer <t...@schoebel-theuer.de> --- drivers/staging/mars/xio_bricks/xio_sio.c | 578 ++ include/linux/xio/xio_sio.h | 68 2 files changed, 646 insertions(+) create mode 100644 drivers/staging/mars/xio_bricks/xio_sio.c create mode 100644 include/linux/xio/xio_sio.h diff --git a/drivers/staging/mars/xio_bricks/xio_sio.c b/drivers/staging/mars/xio_bricks/xio_sio.c new file mode 100644 index ..c910cbda2ae5 --- /dev/null +++ b/drivers/staging/mars/xio_bricks/xio_sio.c @@ -0,0 +1,578 @@ +/* + * MARS Long Distance Replication Software + * + * Copyright (C) 2010-2014 Thomas Schoebel-Theuer + * Copyright (C) 2011-2014 1&1 Internet AG + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +/ own type definitions ***/ + +#include + +/* own brick * input * output operations */ + +static int sio_io_get(struct sio_output *output, struct aio_object *aio) +{ + struct file *file; + + if (unlikely(!output->brick->power.on_led)) + return -EBADFD; + + if (aio->obj_initialized) { + obj_get(aio); + return aio->io_len; + } + + file = output->mf->mf_filp; + if (file) { + loff_t total_size = i_size_read(file->f_mapping->host); + + aio->io_total_size = total_size; + /* Only check reads. 
+		 * Writes behind EOF are always allowed (sparse files)
+		 */
+		if (!aio->io_may_write) {
+			loff_t len = total_size - aio->io_pos;
+
+			if (unlikely(len <= 0)) {
+				/* Special case: allow reads starting _exactly_ at EOF when a timeout is specified.
+				 */
+				if (len < 0 || aio->io_timeout <= 0) {
+					XIO_DBG("ENODATA %lld\n", len);
+					return -ENODATA;
+				}
+			}
+			/* Shorten below EOF, but allow special case */
+			if (aio->io_len > len && len > 0)
+				aio->io_len = len;
+		}
+	}
+
+	/* Buffered IO.
+	 */
+	if (!aio->io_data) {
+		struct sio_aio_aspect *aio_a = sio_aio_get_aspect(output->brick, aio);
+
+		if (unlikely(!aio_a))
+			return -EILSEQ;
+		if (unlikely(aio->io_len <= 0)) {
+			XIO_ERR("bad io_len = %d\n", aio->io_len);
+			return -ENOMEM;
+		}
+		aio->io_data = brick_block_alloc(aio->io_pos, (aio_a->alloc_len = aio->io_len));
+		aio_a->do_dealloc = true;
+		/* atomic_inc(>total_alloc_count); */
+		/* atomic_inc(>alloc_count); */
+	}
+
+	obj_get_first(aio);
+	return aio->io_len;
+}
+
+static void sio_io_put(struct sio_output *output, struct aio_object *aio)
+{
+	struct file *file;
+	struct sio_aio_aspect *aio_a;
+
+	if (!obj_put(aio))
+		goto out_return;
+	file = output->mf->mf_filp;
+	aio->io_total_size = i_size_read(file->f_mapping->host);
+
+	aio_a = sio_aio_get_aspect(output->brick, aio);
+	if (aio_a && aio_a->do_dealloc) {
+		brick_block_free(aio->io_data, aio_a->alloc_len);
+		/* atomic_dec(>alloc_count); */
+	}
+
+	obj_free(aio);
+out_return:;
+}
+
+static
+int write_aops(struct sio_output *output, struct aio_object *aio)
+{
+	struct file *file = output->mf->mf_filp;
+	loff_t pos = aio->io_pos;
+	void *data = aio->io_data;
+	int len = aio->io_len;
+	int ret = 0;
+
+	mm_segment_t oldfs;
+
+	oldfs = get_fs();
+	set_fs(get_ds());
+	ret = vfs_write(file, data, len, &pos);
+	set_fs(oldfs);
+	return ret;
+}
+
+static
+int read_aops(struct sio_output *output, struct aio_object *aio)
+{
+	loff_t pos = aio->io_pos;
+	int len = aio->io
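
The EOF handling for reads in sio_io_get() is compact enough to be easy to misread. A minimal sketch of the same decision as a pure function may help. clamp_read_len() is a hypothetical name, and the parameters stand in for aio->io_pos, aio->io_len and aio->io_timeout; the conditions are copied from the patch.

```c
#include <assert.h>
#include <errno.h>

/*
 * Returns the possibly shortened read length, or -ENODATA.
 * total_size: current file size
 * pos, io_len: requested range
 * timeout:     > 0 means the caller accepts a read starting exactly at EOF
 */
static int clamp_read_len(long long total_size, long long pos, int io_len, int timeout)
{
	long long avail = total_size - pos;

	if (avail <= 0) {
		/* Behind EOF always fails; exactly at EOF fails without a timeout. */
		if (avail < 0 || timeout <= 0)
			return -ENODATA;
	}
	/* Shorten to the available bytes; at EOF (avail == 0) keep the length. */
	if (io_len > avail && avail > 0)
		io_len = (int)avail;
	return io_len;
}
```

For a 100-byte file, a 50-byte read at offset 80 is shortened to 20 bytes, while the same read at offset 100 fails with -ENODATA unless a timeout is given, in which case the full 50 bytes remain requested.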
[RFC 20/32] mars: add new module xio_if
Signed-off-by: Thomas Schoebel-Theuer <t...@schoebel-theuer.de> --- drivers/staging/mars/xio_bricks/xio_if.c | 892 +++ include/linux/xio/xio_if.h | 109 2 files changed, 1001 insertions(+) create mode 100644 drivers/staging/mars/xio_bricks/xio_if.c create mode 100644 include/linux/xio/xio_if.h diff --git a/drivers/staging/mars/xio_bricks/xio_if.c b/drivers/staging/mars/xio_bricks/xio_if.c new file mode 100644 index ..97e0cd541c5c --- /dev/null +++ b/drivers/staging/mars/xio_bricks/xio_if.c @@ -0,0 +1,892 @@ +/* + * MARS Long Distance Replication Software + * + * Copyright (C) 2010-2014 Thomas Schoebel-Theuer + * Copyright (C) 2011-2014 1&1 Internet AG + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + */ + +/* Interface to a Linux device. + * 1 Input, 0 Outputs. 
+ */ + +#define ALWAYS_UNPLUG true +#define PREFETCH_LEN PAGE_SIZE + +/* low-level device parameters */ +#define USE_MAX_SECTORS(PAGE_SIZE >> 9) +#define USE_MAX_PHYS_SEGMENTS (PAGE_SIZE >> 9) +#define USE_MAX_SEGMENT_SIZE PAGE_SIZE +#define USE_LOGICAL_BLOCK_SIZE 512 +#define USE_SEGMENT_BOUNDARY (PAGE_SIZE - 1) + +#include +#include +#include + +#include +#include +#include +#include + +#include +#include + +#ifndef XIO_MAJOR +#define XIO_MAJOR (DRBD_MAJOR + 1) +#endif + +/ global tuning ***/ + +int if_throttle_start_size; + +struct rate_limiter if_throttle = { + .lim_max_amount_rate = 5000, +}; + +/ own type definitions ***/ + +/ own static definitions ***/ + +/* TODO: check bounds, ensure that free minor numbers are recycled */ +static int device_minor; + +/*** object * aspect constructors * destructors **/ + +/ linux operations ***/ + +static +void _if_start_io_acct(struct if_input *input, struct bio_wrapper *biow) +{ + struct bio *bio = biow->bio; + const int rw = bio_data_dir(bio); + const int cpu = part_stat_lock(); + + (void)cpu; + part_round_stats(cpu, >disk->part0); + part_stat_inc(cpu, >disk->part0, ios[rw]); + part_stat_add(cpu, >disk->part0, sectors[rw], bio->bi_iter.bi_size >> 9); + part_inc_in_flight(>disk->part0, rw); + part_stat_unlock(); + biow->start_time = jiffies; +} + +static +void _if_end_io_acct(struct if_input *input, struct bio_wrapper *biow) +{ + unsigned long duration = jiffies - biow->start_time; + struct bio *bio = biow->bio; + const int rw = bio_data_dir(bio); + const int cpu = part_stat_lock(); + + (void)cpu; + part_stat_add(cpu, >disk->part0, ticks[rw], duration); + part_round_stats(cpu, >disk->part0); + part_dec_in_flight(>disk->part0, rw); + part_stat_unlock(); +} + +/* callback + */ +static +void if_endio(struct generic_callback *cb) +{ + struct if_aio_aspect *aio_a = cb->cb_private; + struct if_input *input; + int k; + int rw; + int error; + + LAST_CALLBACK(cb); + if (unlikely(!aio_a || !aio_a->object)) { + XIO_FAT("aio_a = %p 
aio = %p, something is very wrong here!\n", aio_a, aio_a->object);
+		goto out_return;
+	}
+	input = aio_a->input;
+	CHECK_PTR(input, err);
+
+	rw = aio_a->object->io_rw;
+
+	for (k = 0; k < aio_a->bio_count; k++) {
+		struct bio_wrapper *biow;
+		struct bio *bio;
+
+		biow = aio_a->orig_biow[k];
+		aio_a->orig_biow[k] = NULL;
+		CHECK_PTR(biow, err);
+
+		CHECK_ATOMIC(&biow->bi_comp_cnt, 1);
+		if (!atomic_dec_and_test(&biow->bi_comp_cnt))
+			continue;
+
+		bio = biow->bio;
+		CHECK_PTR_NULL(bio, err);
+
+		_if_end_io_acct(input, biow);
+
+		error = CALLBACK_ERROR(aio_a->object);
+		if (unlikely(error < 0)) {
+			int bi_size = bio->bi_iter.bi_size;
+
+			XIO_ERR("NYI: error=%d RETRY LOGIC %u\n", error, bi_size);
+		} else { /* bio conventions are slightly dif
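
In if_endio() a bio may be covered by several aio requests, and only the callback that drops bi_comp_cnt to zero goes on to finish the bio. The counting pattern can be sketched in userspace with C11 atomics. The struct and function names below are hypothetical, and atomic_fetch_sub() stands in for the kernel's atomic_dec_and_test().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical, reduced stand-in for struct bio_wrapper. */
struct bio_wrapper_sketch {
	atomic_int comp_cnt;	/* sub-requests still outstanding */
	int completed;		/* how often the final completion ran */
};

/* Like atomic_dec_and_test(): true only for the caller that reaches zero. */
static bool dec_and_test(atomic_int *cnt)
{
	return atomic_fetch_sub(cnt, 1) == 1;
}

static void sub_request_done(struct bio_wrapper_sketch *w)
{
	if (!dec_and_test(&w->comp_cnt))
		return;		/* other sub-requests are still pending */
	w->completed++;		/* last one: finish the original request once */
}
```

Because the decrement and the zero test are a single atomic operation, exactly one caller sees the transition to zero even when completions race on different CPUs.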
[RFC 23/32] mars: add new module xio_server
Signed-off-by: Thomas Schoebel-Theuer <t...@schoebel-theuer.de> --- drivers/staging/mars/xio_bricks/xio_server.c | 493 +++ include/linux/xio/xio_server.h | 91 + 2 files changed, 584 insertions(+) create mode 100644 drivers/staging/mars/xio_bricks/xio_server.c create mode 100644 include/linux/xio/xio_server.h diff --git a/drivers/staging/mars/xio_bricks/xio_server.c b/drivers/staging/mars/xio_bricks/xio_server.c new file mode 100644 index ..28944d15a7bf --- /dev/null +++ b/drivers/staging/mars/xio_bricks/xio_server.c @@ -0,0 +1,493 @@ +/* + * MARS Long Distance Replication Software + * + * Copyright (C) 2010-2014 Thomas Schoebel-Theuer + * Copyright (C) 2011-2014 1&1 Internet AG + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ */ + +/* Server brick (just for demonstration) */ + +#include +#include +#include + +#include +#include +#include +#include +#include + +/ own type definitions ***/ + +#include + +static struct xio_socket server_socket[NR_SERVER_SOCKETS]; +static struct task_struct *server_threads[NR_SERVER_SOCKETS]; + +/ own helper functions ***/ + +int cb_thread(void *data) +{ + struct server_brick *brick = data; + struct xio_socket *sock = >handler_socket; + bool aborted = false; + bool ok = xio_get_socket(sock); + int status = -EINVAL; + + XIO_DBG("--- cb_thread starting on socket #%d, ok = %d\n", sock->s_debug_nr, ok); + if (!ok) + goto done; + + brick->cb_running = true; + wake_up_interruptible(>startup_event); + + while (!brick_thread_should_stop() || + !list_empty(>cb_read_list) || + !list_empty(>cb_write_list) || + atomic_read(>in_flight) > 0) { + struct server_aio_aspect *aio_a; + struct aio_object *aio; + struct list_head *tmp; + unsigned long flags; + + wait_event_interruptible_timeout( + brick->cb_event, + !list_empty(>cb_read_list) || + !list_empty(>cb_write_list), + 1 * HZ); + + spin_lock_irqsave(>cb_lock, flags); + tmp = brick->cb_write_list.next; + if (tmp == >cb_write_list) { + tmp = brick->cb_read_list.next; + if (tmp == >cb_read_list) { + spin_unlock_irqrestore(>cb_lock, flags); + brick_msleep(1000 / HZ); + continue; + } + } + list_del_init(tmp); + spin_unlock_irqrestore(>cb_lock, flags); + + aio_a = container_of(tmp, struct server_aio_aspect, cb_head); + aio = aio_a->object; + status = -EINVAL; + CHECK_PTR(aio, err); + + status = 0; + /* Report a remote error when consistency cannot be guaranteed, +* e.g. emergency mode during sync. 
+		 */
+		if (brick->conn_brick &&
+		    brick->conn_brick->mode_ptr &&
+		    *brick->conn_brick->mode_ptr < 0 &&
+		    aio->object_cb)
+			aio->object_cb->cb_error = *brick->conn_brick->mode_ptr;
+		if (!aborted) {
+			down(&brick->socket_sem);
+			status = xio_send_cb(sock, aio);
+			up(&brick->socket_sem);
+		}
+
+err:
+		if (unlikely(status < 0) && !aborted) {
+			aborted = true;
+			XIO_WRN("cannot send response, status = %d\n", status);
+			/* Just shutdown the socket and forget all pending
+			 * requests.
+			 * The _client_ is responsible for resending
+			 * any lost operations.
+			 */
+			xio_shutdown_socket(sock);
+		}
+
+		if (aio_a->data) {
+			brick_block_free(aio_a->data, aio_a->len);
+			aio->io_data = NULL;
+		}
+
[RFC 21/32] mars: add new module xio_copy
Signed-off-by: Thomas Schoebel-Theuer <t...@schoebel-theuer.de> --- drivers/staging/mars/xio_bricks/xio_copy.c | 1005 include/linux/xio/xio_copy.h | 115 2 files changed, 1120 insertions(+) create mode 100644 drivers/staging/mars/xio_bricks/xio_copy.c create mode 100644 include/linux/xio/xio_copy.h diff --git a/drivers/staging/mars/xio_bricks/xio_copy.c b/drivers/staging/mars/xio_bricks/xio_copy.c new file mode 100644 index ..56b60f2f837e --- /dev/null +++ b/drivers/staging/mars/xio_bricks/xio_copy.c @@ -0,0 +1,1005 @@ +/* + * MARS Long Distance Replication Software + * + * Copyright (C) 2010-2014 Thomas Schoebel-Theuer + * Copyright (C) 2011-2014 1&1 Internet AG + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + */ + +/* Copy brick (just for demonstration) */ + +#include +#include +#include + +#include +#include + +#ifndef READ +#define READ 0 +#define WRITE 1 +#endif + +#define COPY_CHUNK (PAGE_SIZE) +#define NR_COPY_REQUESTS (32 * 1024 * 1024 / COPY_CHUNK) + +#define STATES_PER_PAGE(PAGE_SIZE / sizeof(struct copy_state)) +#define MAX_SUB_TABLES (NR_COPY_REQUESTS / STATES_PER_PAGE + (NR_COPY_REQUESTS % STATES_PER_PAGE ? 
1 : 0)\ + \ +) +#define MAX_COPY_REQUESTS (PAGE_SIZE / sizeof(struct copy_state *) * STATES_PER_PAGE) + +#define GET_STATE(brick, index) \ + ((brick)->st[(index) / STATES_PER_PAGE][(index) % STATES_PER_PAGE]) + +/*** own type definitions ***/ + +#include + +int xio_copy_overlap = 1; + +int xio_copy_read_prio = XIO_PRIO_NORMAL; + +int xio_copy_write_prio = XIO_PRIO_NORMAL; + +int xio_copy_read_max_fly; + +int xio_copy_write_max_fly; + +#define is_read_limited(brick) \ + (xio_copy_read_max_fly > 0 && atomic_read(&(brick)->copy_read_flight) >= xio_copy_read_max_fly) + +#define is_write_limited(brick) \ + (xio_copy_write_max_fly > 0 && atomic_read(&(brick)->copy_write_flight) >= xio_copy_write_max_fly) + +/*** own helper functions ***/ + +/* TODO: + * The clash logic is untested / alpha stage (Feb. 2011). + * + * For now, the output is never used, so this cannot do harm. + * + * In order to get the output really working / enterprise grade, + * some larger test effort should be invested. + */ +static inline +void _clash(struct copy_brick *brick) +{ + brick->trigger = true; + set_bit(0, &brick->clash); + atomic_inc(&brick->total_clash_count); + wake_up_interruptible(&brick->event); +} + +static inline +int _clear_clash(struct copy_brick *brick) +{ + int old; + + old = test_and_clear_bit(0, &brick->clash); + return old; +} + +/* Current semantics: + * + * All writes are always going to the original input A. They are _not_ + * replicated to B. + * + * In order to get B really uptodate, you have to replay the right + * transaction logs there (at the right time). + * [If you had no writes on A at all during the copy, of course + * this is not necessary] + * + * When utilize_mode is on, reads can utilize the already copied + * region from B, but only as long as this region has not been + * invalidated by writes (indicated by low_dirty).
+ * + * TODO: implement replicated writes, together with some transaction + * replay logic applying the transaction logs _only_ after + * crashes during inconsistency caused by partial replication of writes. + */ +static +int _determine_input(struct copy_brick *brick, struct aio_object *aio) +{ + int rw; + int below; + int behind; + loff_t io_end; + + if (!brick->utilize_mode || brick->low_dirty) + return INPUT_A_IO; + + io_end = aio->io_pos + aio->io_len; + below = io_end <= brick->copy_start; + behind = !brick->copy_end || aio->io_pos >= brick->copy_end; + rw = aio->io_may_write | aio->io_rw; + if (rw) { + if (!behind) { + brick->low_dirty = true; + if (!below) { + _clash(brick); +
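Editor's sketch (not part of the patch): the GET_STATE() and MAX_SUB_TABLES macros above describe a two-level table, where each sub-table fills one page and an index is split into a sub-table number and a slot. The userspace code below illustrates that indexing under stated assumptions. All demo_* and DEMO_* names, the 4096-byte page size and the layout of struct demo_state are hypothetical stand-ins, not the definitions from xio_copy.h.

```c
#include <stddef.h>
#include <stdlib.h>

/* Assumed page size; the kernel code uses PAGE_SIZE instead. */
#define DEMO_PAGE_SIZE 4096u

/* Hypothetical stand-in for struct copy_state. */
struct demo_state {
	long long pos;
	int len;
	int state;
};

#define DEMO_STATES_PER_PAGE (DEMO_PAGE_SIZE / sizeof(struct demo_state))

/* Two-level lookup with the same shape as GET_STATE(brick, index). */
#define DEMO_GET_STATE(st, index) \
	((st)[(index) / DEMO_STATES_PER_PAGE][(index) % DEMO_STATES_PER_PAGE])

/* Number of page-sized sub-tables needed for nr entries, rounded up. */
static size_t demo_sub_tables(size_t nr)
{
	return nr / DEMO_STATES_PER_PAGE + (nr % DEMO_STATES_PER_PAGE ? 1 : 0);
}

/* Release a table created by demo_alloc_table() for nr entries. */
static void demo_free_table(struct demo_state **st, size_t nr)
{
	size_t i, n = demo_sub_tables(nr);

	if (!st)
		return;
	for (i = 0; i < n; i++)
		free(st[i]);
	free(st);
}

/* Allocate one pointer per sub-table, then one zeroed page per sub-table. */
static struct demo_state **demo_alloc_table(size_t nr)
{
	size_t i, n = demo_sub_tables(nr);
	struct demo_state **st = calloc(n ? n : 1, sizeof(*st));

	if (!st)
		return NULL;
	for (i = 0; i < n; i++) {
		st[i] = calloc(DEMO_STATES_PER_PAGE, sizeof(**st));
		if (!st[i]) {
			demo_free_table(st, nr);
			return NULL;
		}
	}
	return st;
}
```

Splitting the table this way keeps every allocation at page size, which is presumably why the patch avoids one large contiguous array.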
[RFC 18/32] mars: add new module xio_sio
Signed-off-by: Thomas Schoebel-Theuer --- drivers/staging/mars/xio_bricks/xio_sio.c | 578 ++ include/linux/xio/xio_sio.h | 68 2 files changed, 646 insertions(+) create mode 100644 drivers/staging/mars/xio_bricks/xio_sio.c create mode 100644 include/linux/xio/xio_sio.h diff --git a/drivers/staging/mars/xio_bricks/xio_sio.c b/drivers/staging/mars/xio_bricks/xio_sio.c new file mode 100644 index ..c910cbda2ae5 --- /dev/null +++ b/drivers/staging/mars/xio_bricks/xio_sio.c @@ -0,0 +1,578 @@ +/* + * MARS Long Distance Replication Software + * + * Copyright (C) 2010-2014 Thomas Schoebel-Theuer + * Copyright (C) 2011-2014 1&1 Internet AG + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +/ own type definitions ***/ + +#include + +/* own brick * input * output operations */ + +static int sio_io_get(struct sio_output *output, struct aio_object *aio) +{ + struct file *file; + + if (unlikely(!output->brick->power.on_led)) + return -EBADFD; + + if (aio->obj_initialized) { + obj_get(aio); + return aio->io_len; + } + + file = output->mf->mf_filp; + if (file) { + loff_t total_size = i_size_read(file->f_mapping->host); + + aio->io_total_size = total_size; + /* Only check reads. 
+* Writes behind EOF are always allowed (sparse files) +*/ + if (!aio->io_may_write) { + loff_t len = total_size - aio->io_pos; + + if (unlikely(len <= 0)) { + /* Special case: allow reads starting _exactly_ at EOF when a timeout is specified. +*/ + if (len < 0 || aio->io_timeout <= 0) { + XIO_DBG("ENODATA %lld\n", len); + return -ENODATA; + } + } + /* Shorten below EOF, but allow special case */ + if (aio->io_len > len && len > 0) + aio->io_len = len; + } + } + + /* Buffered IO. +*/ + if (!aio->io_data) { + struct sio_aio_aspect *aio_a = sio_aio_get_aspect(output->brick, aio); + + if (unlikely(!aio_a)) + return -EILSEQ; + if (unlikely(aio->io_len <= 0)) { + XIO_ERR("bad io_len = %d\n", aio->io_len); + return -ENOMEM; + } + aio->io_data = brick_block_alloc(aio->io_pos, (aio_a->alloc_len = aio->io_len)); + aio_a->do_dealloc = true; + /* atomic_inc(>total_alloc_count); */ + /* atomic_inc(>alloc_count); */ + } + + obj_get_first(aio); + return aio->io_len; +} + +static void sio_io_put(struct sio_output *output, struct aio_object *aio) +{ + struct file *file; + struct sio_aio_aspect *aio_a; + + if (!obj_put(aio)) + goto out_return; + file = output->mf->mf_filp; + aio->io_total_size = i_size_read(file->f_mapping->host); + + aio_a = sio_aio_get_aspect(output->brick, aio); + if (aio_a && aio_a->do_dealloc) { + brick_block_free(aio->io_data, aio_a->alloc_len); + /* atomic_dec(>alloc_count); */ + } + + obj_free(aio); +out_return:; +} + +static +int write_aops(struct sio_output *output, struct aio_object *aio) +{ + struct file *file = output->mf->mf_filp; + loff_t pos = aio->io_pos; + void *data = aio->io_data; + int len = aio->io_len; + int ret = 0; + + mm_segment_t oldfs; + + oldfs = get_fs(); + set_fs(get_ds()); + ret = vfs_write(file, data, len, &pos); + set_fs(oldfs); + return ret; +} + +static +int read_aops(struct sio_output *output, struct aio_object *aio) +{ + loff_t pos = aio->io_pos; + int len = aio->io_len; + int ret; + + mm_seg
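Editor's sketch (not part of the patch): the read-side EOF handling in sio_io_get() is easier to follow as a pure function. The code below reproduces only the length arithmetic quoted above. demo_clamp_read() and its parameters are hypothetical names, and nothing here touches files or the aio_object.

```c
#include <errno.h>

/*
 * Userspace sketch of the read-side EOF checks in sio_io_get().
 * Returns the (possibly shortened) request length, or -ENODATA.
 */
static long long demo_clamp_read(long long total_size, long long pos,
				 long long io_len, int timeout)
{
	long long len = total_size - pos;	/* bytes left before EOF */

	if (len <= 0) {
		/* A read starting exactly at EOF passes only with a timeout. */
		if (len < 0 || timeout <= 0)
			return -ENODATA;
	}
	/* Shorten below EOF; at EOF (len == 0) the length is left alone. */
	if (io_len > len && len > 0)
		io_len = len;
	return io_len;
}
```

For example, a 20-byte read at offset 90 of a 100-byte file is shortened to 10 bytes, while a read at offset 100 is rejected unless a timeout is set. The timeout case presumably exists so that a reader can wait for data appended later.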
[RFC 20/32] mars: add new module xio_if
Signed-off-by: Thomas Schoebel-Theuer --- drivers/staging/mars/xio_bricks/xio_if.c | 892 +++ include/linux/xio/xio_if.h | 109 2 files changed, 1001 insertions(+) create mode 100644 drivers/staging/mars/xio_bricks/xio_if.c create mode 100644 include/linux/xio/xio_if.h diff --git a/drivers/staging/mars/xio_bricks/xio_if.c b/drivers/staging/mars/xio_bricks/xio_if.c new file mode 100644 index ..97e0cd541c5c --- /dev/null +++ b/drivers/staging/mars/xio_bricks/xio_if.c @@ -0,0 +1,892 @@ +/* + * MARS Long Distance Replication Software + * + * Copyright (C) 2010-2014 Thomas Schoebel-Theuer + * Copyright (C) 2011-2014 1&1 Internet AG + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + */ + +/* Interface to a Linux device. + * 1 Input, 0 Outputs. 
+ */ + +#define ALWAYS_UNPLUG true +#define PREFETCH_LEN PAGE_SIZE + +/* low-level device parameters */ +#define USE_MAX_SECTORS (PAGE_SIZE >> 9) +#define USE_MAX_PHYS_SEGMENTS (PAGE_SIZE >> 9) +#define USE_MAX_SEGMENT_SIZE PAGE_SIZE +#define USE_LOGICAL_BLOCK_SIZE 512 +#define USE_SEGMENT_BOUNDARY (PAGE_SIZE - 1) + +#include +#include +#include + +#include +#include +#include +#include + +#include +#include + +#ifndef XIO_MAJOR +#define XIO_MAJOR (DRBD_MAJOR + 1) +#endif + +/*** global tuning ***/ + +int if_throttle_start_size; + +struct rate_limiter if_throttle = { + .lim_max_amount_rate = 5000, +}; + +/*** own type definitions ***/ + +/*** own static definitions ***/ + +/* TODO: check bounds, ensure that free minor numbers are recycled */ +static int device_minor; + +/*** object * aspect constructors * destructors ***/ + +/*** linux operations ***/ + +static +void _if_start_io_acct(struct if_input *input, struct bio_wrapper *biow) +{ + struct bio *bio = biow->bio; + const int rw = bio_data_dir(bio); + const int cpu = part_stat_lock(); + + (void)cpu; + part_round_stats(cpu, &input->disk->part0); + part_stat_inc(cpu, &input->disk->part0, ios[rw]); + part_stat_add(cpu, &input->disk->part0, sectors[rw], bio->bi_iter.bi_size >> 9); + part_inc_in_flight(&input->disk->part0, rw); + part_stat_unlock(); + biow->start_time = jiffies; +} + +static +void _if_end_io_acct(struct if_input *input, struct bio_wrapper *biow) +{ + unsigned long duration = jiffies - biow->start_time; + struct bio *bio = biow->bio; + const int rw = bio_data_dir(bio); + const int cpu = part_stat_lock(); + + (void)cpu; + part_stat_add(cpu, &input->disk->part0, ticks[rw], duration); + part_round_stats(cpu, &input->disk->part0); + part_dec_in_flight(&input->disk->part0, rw); + part_stat_unlock(); +} + +/* callback + */ +static +void if_endio(struct generic_callback *cb) +{ + struct if_aio_aspect *aio_a = cb->cb_private; + struct if_input *input; + int k; + int rw; + int error; + + LAST_CALLBACK(cb); + if (unlikely(!aio_a || !aio_a->object)) { + XIO_FAT("aio_a = %p 
aio = %p, something is very wrong here!\n", aio_a, aio_a->object); + goto out_return; + } + input = aio_a->input; + CHECK_PTR(input, err); + + rw = aio_a->object->io_rw; + + for (k = 0; k < aio_a->bio_count; k++) { + struct bio_wrapper *biow; + struct bio *bio; + + biow = aio_a->orig_biow[k]; + aio_a->orig_biow[k] = NULL; + CHECK_PTR(biow, err); + + CHECK_ATOMIC(&biow->bi_comp_cnt, 1); + if (!atomic_dec_and_test(&biow->bi_comp_cnt)) + continue; + + bio = biow->bio; + CHECK_PTR_NULL(bio, err); + + _if_end_io_acct(input, biow); + + error = CALLBACK_ERROR(aio_a->object); + if (unlikely(error < 0)) { + int bi_size = bio->bi_iter.bi_size; + + XIO_ERR("NYI: error=%d RETRY LOGIC %u\n", error, bi_size); + } else { /* bio conventions are slightly different... */ +
[RFC 23/32] mars: add new module xio_server
Signed-off-by: Thomas Schoebel-Theuer --- drivers/staging/mars/xio_bricks/xio_server.c | 493 +++ include/linux/xio/xio_server.h | 91 + 2 files changed, 584 insertions(+) create mode 100644 drivers/staging/mars/xio_bricks/xio_server.c create mode 100644 include/linux/xio/xio_server.h diff --git a/drivers/staging/mars/xio_bricks/xio_server.c b/drivers/staging/mars/xio_bricks/xio_server.c new file mode 100644 index ..28944d15a7bf --- /dev/null +++ b/drivers/staging/mars/xio_bricks/xio_server.c @@ -0,0 +1,493 @@ +/* + * MARS Long Distance Replication Software + * + * Copyright (C) 2010-2014 Thomas Schoebel-Theuer + * Copyright (C) 2011-2014 1&1 Internet AG + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ */ + +/* Server brick (just for demonstration) */ + +#include +#include +#include + +#include +#include +#include +#include +#include + +/*** own type definitions ***/ + +#include + +static struct xio_socket server_socket[NR_SERVER_SOCKETS]; +static struct task_struct *server_threads[NR_SERVER_SOCKETS]; + +/*** own helper functions ***/ + +int cb_thread(void *data) +{ + struct server_brick *brick = data; + struct xio_socket *sock = &brick->handler_socket; + bool aborted = false; + bool ok = xio_get_socket(sock); + int status = -EINVAL; + + XIO_DBG("--- cb_thread starting on socket #%d, ok = %d\n", sock->s_debug_nr, ok); + if (!ok) + goto done; + + brick->cb_running = true; + wake_up_interruptible(&brick->startup_event); + + while (!brick_thread_should_stop() || + !list_empty(&brick->cb_read_list) || + !list_empty(&brick->cb_write_list) || + atomic_read(&brick->in_flight) > 0) { + struct server_aio_aspect *aio_a; + struct aio_object *aio; + struct list_head *tmp; + unsigned long flags; + + wait_event_interruptible_timeout( + brick->cb_event, + !list_empty(&brick->cb_read_list) || + !list_empty(&brick->cb_write_list), + 1 * HZ); + + spin_lock_irqsave(&brick->cb_lock, flags); + tmp = brick->cb_write_list.next; + if (tmp == &brick->cb_write_list) { + tmp = brick->cb_read_list.next; + if (tmp == &brick->cb_read_list) { + spin_unlock_irqrestore(&brick->cb_lock, flags); + brick_msleep(1000 / HZ); + continue; + } + } + list_del_init(tmp); + spin_unlock_irqrestore(&brick->cb_lock, flags); + + aio_a = container_of(tmp, struct server_aio_aspect, cb_head); + aio = aio_a->object; + status = -EINVAL; + CHECK_PTR(aio, err); + + status = 0; + /* Report a remote error when consistency cannot be guaranteed, +* e.g. emergency mode during sync. 
+*/ + if (brick->conn_brick && + brick->conn_brick->mode_ptr && + *brick->conn_brick->mode_ptr < 0 && + aio->object_cb) + aio->object_cb->cb_error = *brick->conn_brick->mode_ptr; + if (!aborted) { + down(>socket_sem); + status = xio_send_cb(sock, aio); + up(>socket_sem); + } + +err: + if (unlikely(status < 0) && !aborted) { + aborted = true; + XIO_WRN("cannot send response, status = %d\n", status); + /* Just shutdown the socket and forget all pending +* requests. +* The _client_ is responsible for resending +* any lost operations. +*/ + xio_shutdown_socket(sock); + } + + if (aio_a->data) { + brick_block_free(aio_a->data, aio_a->len); + aio->io_data = NULL; + } + if (aio_a->do_put) { +
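Editor's sketch (not part of the patch): cb_thread() above takes the next completion from cb_write_list and falls back to cb_read_list only when no write completion is pending. The userspace code below shows that selection order with plain singly linked queues. The demo_* names are hypothetical, and the spinlock, the wait queue and the kernel's struct list_head are deliberately left out.

```c
#include <stddef.h>

/* Hypothetical request and queue types; the patch uses struct list_head. */
struct demo_req {
	struct demo_req *next;
	int id;
};

struct demo_queue {
	struct demo_req *head;
	struct demo_req *tail;
};

/* Append a request at the tail of a queue. */
static void demo_push(struct demo_queue *q, struct demo_req *r)
{
	r->next = NULL;
	if (q->tail)
		q->tail->next = r;
	else
		q->head = r;
	q->tail = r;
}

/* Remove and return the oldest request, or NULL if the queue is empty. */
static struct demo_req *demo_pop(struct demo_queue *q)
{
	struct demo_req *r = q->head;

	if (!r)
		return NULL;
	q->head = r->next;
	if (!q->head)
		q->tail = NULL;
	r->next = NULL;
	return r;
}

/* Write completions first; reads only when no write is pending. */
static struct demo_req *demo_next_cb(struct demo_queue *writes,
				     struct demo_queue *reads)
{
	struct demo_req *r = demo_pop(writes);

	return r ? r : demo_pop(reads);
}
```

With this ordering a steady stream of write completions can delay read completions indefinitely. Whether that matters for the server brick is not stated in the patch.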
[RFC 31/32] mars: add new module Kconfig
Signed-off-by: Thomas Schoebel-Theuer <t...@schoebel-theuer.de> --- drivers/staging/mars/Kconfig | 266 +++ 1 file changed, 266 insertions(+) create mode 100644 drivers/staging/mars/Kconfig diff --git a/drivers/staging/mars/Kconfig b/drivers/staging/mars/Kconfig new file mode 100644 index ..836185e9509c --- /dev/null +++ b/drivers/staging/mars/Kconfig @@ -0,0 +1,266 @@ +# +# MARS configuration +# + +config MARS + tristate "storage system MARS (EXPERIMENTAL)" + depends on BLOCK && PROC_SYSCTL && HIGH_RES_TIMERS && !DEBUG_SLAB && !DEBUG_SG + default n + ---help--- + MARS is a long-distance replication of generic block devices. + It works asynchronously and tolerates network bottlenecks. + Please read the full documentation at + https://github.com/schoebel/mars/blob/master/docu/mars-manual.pdf?raw=true + Always compile MARS as a module! + +config MARS_CHECKS + bool "enable simple runtime checks in MARS" + depends on MARS + default y + ---help--- + These checks should be rather lightweight. Use them + for beta testing and for production systems where + safety is more important than performance. + In case of bugs in the reference counting, an automatic repair + is attempted, which lowers the risk of memory corruptions. + Disable only if you need the absolutely last grain of + performance. + If unsure, say Y here. + +config MARS_DEBUG + bool "enable full runtime checks and some tracing in MARS" + depends on MARS + default n + ---help--- + Some of these checks and some additional error tracing may + consume noticeable amounts of memory. However, this is extremely + valuable for finding bugs, even in production systems. + + OFF for production systems. ON for testing! + + If you encounter bugs in production systems, you + may / should use this also in production if you carefully + monitor your systems. 
+ +config MARS_DEBUG_MEM + bool "debug memory operations" + depends on MARS_DEBUG + default n + ---help--- + This adds considerable space and time overhead, but catches + many errors (including some that are not caught by kmemleak). + + OFF for production systems. ON for testing! + Use only for development and thorough testing! + +config MARS_DEBUG_MEM_STRONG + bool "intensified debugging of memory operations" + depends on MARS_DEBUG_MEM + default y + ---help--- + Trace all block allocations, find more errors. + Adds some overhead. + + Use for debugging of new bricks or for intensified + regression testing. + +config MARS_DEBUG_ORDER0 + bool "also debug order0 operations" + depends on MARS_DEBUG_MEM + default n + ---help--- + Turn even order 0 allocations into order 1 ones and provoke + heavy memory fragmentation problems from the buddy allocator, + but catch some additional memory problems. + Use only if you know what you are doing! + Normally OFF. + +config MARS_DEFAULT_PORT + int "port number where MARS is listening" + depends on MARS + default + ---help--- + Best practice is to uniformly use the same port number + in a cluster. Therefore, this is a compile-time constant. + You may override this at insmod time via the mars_port= parameter. + +config MARS_NET_COMPAT + bool "compatibility to 0.1 series network protocol" + depends on MARS + default y + ---help--- + TRANSITIONAL: this is only needed for _mixed_ operations of the + MARS Light 0.1 kernel modules and 0.2 module. + Typically, you will need this only during upgrade for minimizing + downtime (e.g. first upgrade secondary side, then handover, + and finally upgrade the former primary side). + This option will be removed for 0.3 and later stable + series, since you will no longer need it. + +config MARS_LOGDIR + string "absolute path to the logging directory" + depends on MARS + default "/mars" + ---help--- + Path to the directory where all MARS messages will reside. 
+ Usually this is equal to the global /mars directory. + + Logfiles and status files obey the following naming conventions: + 0.debug.log + 1.info.log + 2.warn.log + 3.error.log + 4.fatal.log + 5.total.log + Logfiles must already exist in order to be appended. + Logfiles can be rotated by renaming them and creating +
[RFC 17/32] mars: add new module xio_bio
Signed-off-by: Thomas Schoebel-Theuer <t...@schoebel-theuer.de> --- drivers/staging/mars/xio_bricks/xio_bio.c | 845 ++ include/linux/xio/xio_bio.h | 85 +++ 2 files changed, 930 insertions(+) create mode 100644 drivers/staging/mars/xio_bricks/xio_bio.c create mode 100644 include/linux/xio/xio_bio.h diff --git a/drivers/staging/mars/xio_bricks/xio_bio.c b/drivers/staging/mars/xio_bricks/xio_bio.c new file mode 100644 index ..97bc4fc46f3e --- /dev/null +++ b/drivers/staging/mars/xio_bricks/xio_bio.c @@ -0,0 +1,845 @@ +/* + * MARS Long Distance Replication Software + * + * Copyright (C) 2010-2014 Thomas Schoebel-Theuer + * Copyright (C) 2011-2014 1&1 Internet AG + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ */ + +/* Bio brick (interface to blkdev IO via kernel bios) */ + +#include +#include +#include +#include + +#include +#include +#include + +#include +static struct timing_stats timings[2]; + +struct threshold bio_submit_threshold = { + .thr_ban = _global_ban, + .thr_parent = _io_threshold, + .thr_limit = BIO_SUBMIT_MAX_LATENCY, + .thr_factor = 100, + .thr_plus = 0, +}; + +struct threshold bio_io_threshold[2] = { + [0] = { + .thr_ban = _global_ban, + .thr_parent = _io_threshold, + .thr_limit = BIO_IO_R_MAX_LATENCY, + .thr_factor = 10, + .thr_plus = 1, + }, + [1] = { + .thr_ban = _global_ban, + .thr_parent = _io_threshold, + .thr_limit = BIO_IO_W_MAX_LATENCY, + .thr_factor = 10, + .thr_plus = 1, + }, +}; + +/ own type definitions ***/ + +/ own helper functions ***/ + +/* This is called from the kernel bio layer. + */ +static +void bio_callback(struct bio *bio) +{ + struct bio_aio_aspect *aio_a = bio->bi_private; + struct bio_brick *brick; + unsigned long flags; + + CHECK_PTR(aio_a, err); + CHECK_PTR(aio_a->output, err); + brick = aio_a->output->brick; + CHECK_PTR(brick, err); + + aio_a->status_code = bio->bi_error; + + spin_lock_irqsave(>lock, flags); + list_del(_a->io_head); + list_add_tail(_a->io_head, >completed_list); + atomic_inc(>completed_count); + spin_unlock_irqrestore(>lock, flags); + + wake_up_interruptible(>response_event); + goto out_return; +err: + XIO_FAT("cannot handle bio callback\n"); +out_return:; +} + +/* Map from kernel address/length to struct page (if not already known), + * check alignment constraints, create bio from it. + * Return the length (may be smaller than requested). 
+ */ +static +int make_bio( +struct bio_brick *brick, void *data, int len, loff_t pos, struct bio_aio_aspect *private, struct bio **_bio) +{ + unsigned long long sector; + int sector_offset; + int data_offset; + int page_offset; + int page_len; + int bvec_count; + int rest_len = len; + int result_len = 0; + int status; + int i; + struct bio *bio = NULL; + struct block_device *bdev; + + status = -EINVAL; + CHECK_PTR(brick, out); + bdev = brick->bdev; + CHECK_PTR(bdev, out); + + if (unlikely(rest_len <= 0)) { + XIO_ERR("bad bio len %d\n", rest_len); + goto out; + } + + sector = pos >> 9; /* TODO: make dynamic */ + sector_offset = pos & ((1 << 9) - 1); /* TODO: make dynamic */ + data_offset = ((unsigned long)data) & ((1 << 9) - 1); /* TODO: make dynamic */ + + if (unlikely(sector_offset > 0)) { + XIO_ERR("odd sector offset %d\n", sector_offset); + goto out; + } + if (unlikely(sector_offset != data_offset)) { + XIO_ERR("bad alignment: sector_offset %d != data_offset %d\n", sector_offset, data_offset); + goto out; + } + if (unlikely(rest_len & ((1 << 9) - 1))) { + XIO_ERR("odd length %d\n", rest_len); + goto out; + } + + page_offset = ((unsigned long)data) & (PAGE_SIZE - 1); + page_len = rest_len + page_offset; + bvec_count = (page_len - 1) /
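Editor's sketch (not part of the patch): make_bio() above rejects any request whose position, buffer address or length is not a multiple of the hard-coded 512-byte sector (the `>> 9` and `(1 << 9) - 1` expressions marked "TODO: make dynamic"). The userspace code below reproduces only those checks. demo_check_alignment() and the DEMO_* constants are hypothetical names, and the buffer pointer is inspected but never dereferenced.

```c
#include <errno.h>
#include <stdint.h>

#define DEMO_SECTOR_SHIFT 9
#define DEMO_SECTOR_MASK ((1u << DEMO_SECTOR_SHIFT) - 1)

/*
 * Userspace sketch of the alignment checks at the top of make_bio().
 * On success, stores the start sector in *sector and returns 0.
 */
static int demo_check_alignment(long long pos, const void *data, int len,
				unsigned long long *sector)
{
	int sector_offset = (int)(pos & DEMO_SECTOR_MASK);
	int data_offset = (int)((uintptr_t)data & DEMO_SECTOR_MASK);

	if (len <= 0)
		return -EINVAL;		/* "bad bio len" */
	if (sector_offset > 0)
		return -EINVAL;		/* "odd sector offset" */
	if (sector_offset != data_offset)
		return -EINVAL;		/* "bad alignment" */
	if (len & DEMO_SECTOR_MASK)
		return -EINVAL;		/* "odd length" */

	*sector = (unsigned long long)pos >> DEMO_SECTOR_SHIFT;
	return 0;
}
```

Since sector_offset must already be zero when the third check runs, that check effectively requires the buffer address to be 512-byte aligned as well.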
[RFC 17/32] mars: add new module xio_bio
Signed-off-by: Thomas Schoebel-Theuer
---
 drivers/staging/mars/xio_bricks/xio_bio.c | 845 ++
 include/linux/xio/xio_bio.h               |  85 +++
 2 files changed, 930 insertions(+)
 create mode 100644 drivers/staging/mars/xio_bricks/xio_bio.c
 create mode 100644 include/linux/xio/xio_bio.h

diff --git a/drivers/staging/mars/xio_bricks/xio_bio.c b/drivers/staging/mars/xio_bricks/xio_bio.c
new file mode 100644
index ..97bc4fc46f3e
--- /dev/null
+++ b/drivers/staging/mars/xio_bricks/xio_bio.c
@@ -0,0 +1,845 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+/* Bio brick (interface to blkdev IO via kernel bios) */
+
+#include
+#include
+#include
+#include
+
+#include
+#include
+#include
+
+#include
+static struct timing_stats timings[2];
+
+struct threshold bio_submit_threshold = {
+	.thr_ban = _global_ban,
+	.thr_parent = _io_threshold,
+	.thr_limit = BIO_SUBMIT_MAX_LATENCY,
+	.thr_factor = 100,
+	.thr_plus = 0,
+};
+
+struct threshold bio_io_threshold[2] = {
+	[0] = {
+		.thr_ban = _global_ban,
+		.thr_parent = _io_threshold,
+		.thr_limit = BIO_IO_R_MAX_LATENCY,
+		.thr_factor = 10,
+		.thr_plus = 1,
+	},
+	[1] = {
+		.thr_ban = _global_ban,
+		.thr_parent = _io_threshold,
+		.thr_limit = BIO_IO_W_MAX_LATENCY,
+		.thr_factor = 10,
+		.thr_plus = 1,
+	},
+};
+
+/*** own type definitions ***/
+
+/*** own helper functions ***/
+
+/* This is called from the kernel bio layer.
+ */
+static
+void bio_callback(struct bio *bio)
+{
+	struct bio_aio_aspect *aio_a = bio->bi_private;
+	struct bio_brick *brick;
+	unsigned long flags;
+
+	CHECK_PTR(aio_a, err);
+	CHECK_PTR(aio_a->output, err);
+	brick = aio_a->output->brick;
+	CHECK_PTR(brick, err);
+
+	aio_a->status_code = bio->bi_error;
+
+	spin_lock_irqsave(&brick->lock, flags);
+	list_del(&aio_a->io_head);
+	list_add_tail(&aio_a->io_head, &brick->completed_list);
+	atomic_inc(&brick->completed_count);
+	spin_unlock_irqrestore(&brick->lock, flags);
+
+	wake_up_interruptible(&brick->response_event);
+	goto out_return;
+err:
+	XIO_FAT("cannot handle bio callback\n");
+out_return:;
+}
+
+/* Map from kernel address/length to struct page (if not already known),
+ * check alignment constraints, create bio from it.
+ * Return the length (may be smaller than requested).
+ */
+static
+int make_bio(
+struct bio_brick *brick, void *data, int len, loff_t pos, struct bio_aio_aspect *private, struct bio **_bio)
+{
+	unsigned long long sector;
+	int sector_offset;
+	int data_offset;
+	int page_offset;
+	int page_len;
+	int bvec_count;
+	int rest_len = len;
+	int result_len = 0;
+	int status;
+	int i;
+	struct bio *bio = NULL;
+	struct block_device *bdev;
+
+	status = -EINVAL;
+	CHECK_PTR(brick, out);
+	bdev = brick->bdev;
+	CHECK_PTR(bdev, out);
+
+	if (unlikely(rest_len <= 0)) {
+		XIO_ERR("bad bio len %d\n", rest_len);
+		goto out;
+	}
+
+	sector = pos >> 9; /* TODO: make dynamic */
+	sector_offset = pos & ((1 << 9) - 1); /* TODO: make dynamic */
+	data_offset = ((unsigned long)data) & ((1 << 9) - 1); /* TODO: make dynamic */
+
+	if (unlikely(sector_offset > 0)) {
+		XIO_ERR("odd sector offset %d\n", sector_offset);
+		goto out;
+	}
+	if (unlikely(sector_offset != data_offset)) {
+		XIO_ERR("bad alignment: sector_offset %d != data_offset %d\n", sector_offset, data_offset);
+		goto out;
+	}
+	if (unlikely(rest_len & ((1 << 9) - 1))) {
+		XIO_ERR("odd length %d\n", rest_len);
+		goto out;
+	}
+
+	page_offset = ((unsigned long)data) & (PAGE_SIZE - 1);
+	page_len = rest_len + page_offset;
+	bvec_count = (page_len - 1) / PAGE_SIZE + 1;
+	if (bvec_co
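The alignment checks at the top of make_bio() can be tried out in user space. The following is only a minimal sketch under stated assumptions, not the code from the patch: the names check_bio_alignment(), bvec_count_for() and the SKETCH_* constants are hypothetical, the sector size is fixed at 512 bytes as in the patch, and a 4 KiB page size is assumed.

```c
#include <errno.h>
#include <stdint.h>

#define SKETCH_SECTOR_SHIFT 9	/* assumption: 512-byte sectors, hard-coded as in the patch */
#define SKETCH_SECTOR_MASK  ((1 << SKETCH_SECTOR_SHIFT) - 1)
#define SKETCH_PAGE_SIZE    4096	/* assumption: 4 KiB pages */

/* Hypothetical helper mirroring the rejections made by make_bio():
 * non-positive length, position not on a sector boundary, buffer
 * address not aligned like the position, length not a sector multiple.
 */
static int check_bio_alignment(uintptr_t data, int len, long long pos)
{
	int sector_offset = (int)(pos & SKETCH_SECTOR_MASK);
	int data_offset = (int)(data & SKETCH_SECTOR_MASK);

	if (len <= 0)
		return -EINVAL;
	if (sector_offset > 0)
		return -EINVAL;
	if (sector_offset != data_offset)
		return -EINVAL;
	if (len & SKETCH_SECTOR_MASK)
		return -EINVAL;
	return 0;
}

/* Hypothetical helper: number of pages (bio vectors) touched by a
 * buffer of len bytes starting at address data.
 */
static int bvec_count_for(uintptr_t data, int len)
{
	int page_offset = (int)(data & (SKETCH_PAGE_SIZE - 1));
	int page_len = len + page_offset;

	return (page_len - 1) / SKETCH_PAGE_SIZE + 1;
}
```

A buffer that starts 512 bytes into a page and is 4096 bytes long spans two pages, so bvec_count_for() returns 2 for it even though the length alone would fit into one page.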
[RFC 12/32] mars: add new module vfs_compat
Signed-off-by: Thomas Schoebel-Theuer <t...@schoebel-theuer.de>
---
 include/linux/brick/vfs_compat.h | 48
 1 file changed, 48 insertions(+)
 create mode 100644 include/linux/brick/vfs_compat.h

diff --git a/include/linux/brick/vfs_compat.h b/include/linux/brick/vfs_compat.h
new file mode 100644
index ..68d082b70b43
--- /dev/null
+++ b/include/linux/brick/vfs_compat.h
@@ -0,0 +1,48 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#ifndef _MARS_COMPAT
+#define _MARS_COMPAT
+
+/* TRANSITIONAL compatibility to BOTH the old prepatch
+ * and the new wrappers around vfs_*().
+ */
+#ifndef MARS_MAJOR
+#define __USE_COMPAT
+#endif
+
+#ifdef __USE_COMPAT
+
+int _compat_symlink(
+const char __user *oldname,
+	const char __user *newname,
+	struct timespec *mtime);
+
+int _compat_mkdir(
+const char __user *pathname,
+	int mode);
+
+int _compat_rename(
+const char __user *oldname,
+	const char __user *newname);
+
+int _compat_unlink(const char __user *pathname);
+
+#else
+#include
+#endif
+#endif
--
2.11.0
[RFC 26/32] mars: add new module net
Signed-off-by: Thomas Schoebel-Theuer <t...@schoebel-theuer.de>
---
 drivers/staging/mars/mars/net.c | 109
 1 file changed, 109 insertions(+)
 create mode 100644 drivers/staging/mars/mars/net.c

diff --git a/drivers/staging/mars/mars/net.c b/drivers/staging/mars/mars/net.c
new file mode 100644
index ..d1b9715c0a93
--- /dev/null
+++ b/drivers/staging/mars/mars/net.c
@@ -0,0 +1,109 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+#include
+#include
+#include
+
+#include "strategy.h"
+#include
+
+static
+char *_xio_translate_hostname(const char *name)
+{
+	char *res = brick_strdup(name);
+	char *test;
+	char *tmp;
+
+	for (tmp = res; *tmp; tmp++) {
+		if (*tmp == ':') {
+			*tmp = '\0';
+			break;
+		}
+	}
+
+	tmp = path_make("/mars/ips/ip-%s", res);
+	if (unlikely(!tmp))
+		goto done;
+
+	test = mars_readlink(tmp);
+	if (test && test[0]) {
+		XIO_DBG("'%s' => '%s'\n", tmp, test);
+		brick_string_free(res);
+		res = test;
+	} else {
+		brick_string_free(test);
+		XIO_WRN("no hostname translation for '%s'\n", tmp);
+	}
+	brick_string_free(tmp);
+
+done:
+	return res;
+}
+
+int xio_send_dent_list(struct xio_socket *sock, struct list_head *anchor)
+{
+	struct list_head *tmp;
+	struct mars_dent *dent;
+	int status = 0;
+
+	for (tmp = anchor->next; tmp != anchor; tmp = tmp->next) {
+		dent = container_of(tmp, struct mars_dent, dent_link);
+		status = xio_send_struct(sock, dent, mars_dent_meta);
+		if (status < 0)
+			break;
+	}
+	if (status >= 0) { /* send EOR */
+		status = xio_send_struct(sock, NULL, mars_dent_meta);
+	}
+	return status;
+}
+
+int xio_recv_dent_list(struct xio_socket *sock, struct list_head *anchor)
+{
+	int status;
+
+	for (;;) {
+		struct mars_dent *dent = brick_zmem_alloc(sizeof(struct mars_dent));
+
+		INIT_LIST_HEAD(&dent->dent_link);
+		INIT_LIST_HEAD(&dent->brick_list);
+
+		status = xio_recv_struct(sock, dent, mars_dent_meta);
+		if (status <= 0) {
+			xio_free_dent(dent);
+			goto done;
+		}
+		list_add_tail(&dent->dent_link, anchor);
+	}
+done:
+	return status;
+}
+
+/* module init stuff */
+
+int __init init_sy_net(void)
+{
+	XIO_INF("init_sy_net()\n");
+	xio_translate_hostname = _xio_translate_hostname;
+	return 0;
+}
+
+void exit_sy_net(void)
+{
+	XIO_INF("exit_sy_net()\n");
+}
--
2.11.0
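The first step of _xio_translate_hostname() cuts the name at the first ':' before the symlink lookup under /mars/ips/. A user-space sketch of just that step might look as follows. The function name strip_port_suffix() is hypothetical, malloc() stands in for brick_strdup(), and reading the part after the colon as a port suffix is an interpretation, not something the patch states.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical user-space analogue of the truncation loop in
 * _xio_translate_hostname(): returns a freshly allocated copy of
 * name, cut at the first ':' if there is one.
 * The caller owns the result and must free() it.
 * Returns NULL if the allocation fails.
 */
static char *strip_port_suffix(const char *name)
{
	size_t len = strcspn(name, ":");	/* length up to ':' or end of string */
	char *res = malloc(len + 1);

	if (!res)
		return NULL;
	memcpy(res, name, len);
	res[len] = '\0';
	return res;
}
```

Unlike the patch, which copies the whole string and then overwrites the colon in place, the sketch copies only the part that is kept; the resulting string is the same.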
[RFC 30/32] mars: add new module Makefile
Signed-off-by: Thomas Schoebel-Theuer <t...@schoebel-theuer.de>
---
 drivers/staging/mars/Makefile | 96 +++
 1 file changed, 96 insertions(+)
 create mode 100644 drivers/staging/mars/Makefile

diff --git a/drivers/staging/mars/Makefile b/drivers/staging/mars/Makefile
new file mode 100644
index ..5e94c3c692c2
--- /dev/null
+++ b/drivers/staging/mars/Makefile
@@ -0,0 +1,96 @@
+#
+# Makefile for MARS
+#
+
+# remove_this
+#
+# TST: this was required by some sysadmins some years ago for
+# very 1&1-specific OOT Debian build methods.
+# Not tested in other environments. Might need some tweaks, or could
+# be removed in the long term.
+#
+ifndef CONFIG_MARS
+# mars_config.h is generated by a simple Kconfig parser (gen_config.pl)
+# at build time.
+# It does not respect any Kconfig dependencies.
+# Therefore, it is unsafe. Use at your own risk!
+# It is ONLY used for out-of-tree builds.
+#
+CONFIG_MARS_BIGMODULE := m
+CONFIG_MARS_NET_COMPAT := y
+obj-$(CONFIG_MARS_BIGMODULE) += mars.o
+extra-y += mars_config.h
+GEN_CONFIG_SCRIPT := $(src)/../scripts/gen_config.pl
+$(obj)/mars_config.h: $(obj)/buildtag.h
+$(obj)/mars_config.h: $(src)/Kconfig $(GEN_CONFIG_SCRIPT)
+	$(Q)$(kecho) "MARS: using compiler $($(CC) --version | head -1)"
+	$(CC) -v
+	$(Q)$(kecho) "MARS: Generating $@"
+	$(Q)set -e; \
+	if [ ! -x $(GEN_CONFIG_SCRIPT) ]; then \
+		$(kecho) "MARS: cannot execute script $(GEN_CONFIG_SCRIPT)"; \
+		/bin/false; \
+	fi; \
+	cat $< | $(GEN_CONFIG_SCRIPT) > $@;
+	cat $@;
+endif
+# end_remove_this
+
+obj-$(CONFIG_MARS) += mars.o
+
+KBUILD_CFLAGS += -fdelete-null-pointer-checks
+
+# remove_this
+# The following is 1&1 specific. Don't use anywhere else.
+ifneq ($(KBUILD_EXTMOD),)
+ CONFIG_MARS := m
+# mars_config.h is generated by a simple Kconfig parser (gen_config.pl)
+# at build time.
+# It does not respect any Kconfig dependencies.
+# Therefore, it is unsafe. Use at your own risk!
+# It is ONLY used for out-of-tree builds.
+#
+extra-y += mars_config.h
+GEN_CONFIG_SCRIPT := $(src)/../scripts/gen_config.pl
+$(obj)/mars_config.h: $(obj)/buildtag.h
+$(obj)/mars_config.h: $(src)/Kconfig $(GEN_CONFIG_SCRIPT)
+	$(Q)$(kecho) "MARS: using compiler $($(CC) --version | head -1)"
+	$(CC) -v
+	$(Q)$(kecho) "MARS: Generating $@"
+	$(Q)set -e; \
+	if [ ! -x $(GEN_CONFIG_SCRIPT) ]; then \
+		$(kecho) "MARS: cannot execute script $(GEN_CONFIG_SCRIPT)"; \
+		/bin/false; \
+	fi; \
+	cat $< | $(GEN_CONFIG_SCRIPT) > $@;
+	cat $@;
+endif
+# end_remove_this
+
+obj-$(CONFIG_MARS) += mars.o
+
+mars-objs := \
+	lamport.o \
+	brick_say.o \
+	brick_mem.o \
+	brick.o \
+	xio_bricks/xio.o \
+	xio_bricks/lib_log.o \
+	lib/lib_rank.o \
+	lib/lib_limiter.o \
+	lib/lib_timing.o \
+	xio_bricks/lib_mapfree.o \
+	xio_bricks/xio_net.o \
+	mars/server_strategy.o \
+	xio_bricks/xio_server.o \
+	xio_bricks/xio_client.o \
+	xio_bricks/xio_sio.o \
+	xio_bricks/xio_bio.o \
+	xio_bricks/xio_if.o \
+	xio_bricks/xio_copy.o \
+	xio_bricks/xio_trans_logger.o \
+	mars/main_strategy.o \
+	mars/net.o \
+	mars/mars_proc.o \
+	mars/mars_main.o
+
--
2.11.0
[RFC 15/32] mars: add new module lib_mapfree
Signed-off-by: Thomas Schoebel-Theuer <t...@schoebel-theuer.de>
---
 drivers/staging/mars/xio_bricks/lib_mapfree.c | 382 ++
 include/linux/xio/lib_mapfree.h               |  84 ++
 2 files changed, 466 insertions(+)
 create mode 100644 drivers/staging/mars/xio_bricks/lib_mapfree.c
 create mode 100644 include/linux/xio/lib_mapfree.h

diff --git a/drivers/staging/mars/xio_bricks/lib_mapfree.c b/drivers/staging/mars/xio_bricks/lib_mapfree.c
new file mode 100644
index ..fc7c057fc993
--- /dev/null
+++ b/drivers/staging/mars/xio_bricks/lib_mapfree.c
@@ -0,0 +1,382 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+#include
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+/* time to wait between background mapfree operations */
+int mapfree_period_sec = 10;
+
+/* some grace space where no regular cleanup should occur */
+int mapfree_grace_keep_mb = 16;
+
+static
+DECLARE_RWSEM(mapfree_mutex);
+
+static
+LIST_HEAD(mapfree_list);
+
+void mapfree_pages(struct mapfree_info *mf, int grace_keep)
+{
+	struct address_space *mapping;
+	pgoff_t start;
+	pgoff_t end;
+
+	if (unlikely(!mf))
+		goto done;
+	if (unlikely(!mf->mf_filp))
+		goto done;
+
+	mapping = mf->mf_filp->f_mapping;
+	if (unlikely(!mapping))
+		goto done;
+
+	if (grace_keep < 0) { /* force full flush */
+		start = 0;
+		end = -1;
+	} else {
+		unsigned long flags;
+		loff_t tmp;
+		loff_t min;
+
+		spin_lock_irqsave(&mf->mf_lock, flags);
+
+		tmp = mf->mf_min[0];
+		min = tmp;
+		if (likely(mf->mf_min[1] < min))
+			min = mf->mf_min[1];
+		if (tmp) {
+			mf->mf_min[1] = tmp;
+			mf->mf_min[0] = 0;
+		}
+
+		spin_unlock_irqrestore(&mf->mf_lock, flags);
+
+		min -= (loff_t)grace_keep * (1024 * 1024); /* megabytes */
+		end = 0;
+
+		if (min > 0 || mf->mf_last) {
+			start = mf->mf_last / PAGE_SIZE;
+			/* add some grace overlapping */
+			if (likely(start > 0))
+				start--;
+			mf->mf_last = min;
+			end = min / PAGE_SIZE;
+		} else { /* there was no progress for at least 2 rounds */
+			start = 0;
+			if (!grace_keep) /* also flush thoroughly */
+				end = -1;
+		}
+
+		XIO_DBG("file = '%s' start = %lu end = %lu\n", mf->mf_name, start, end);
+	}
+
+	if (end > start || end == -1)
+		invalidate_mapping_pages(mapping, start, end);
+
+done:;
+}
+
+static
+void _mapfree_put(struct mapfree_info *mf)
+{
+	if (atomic_dec_and_test(&mf->mf_count)) {
+		XIO_DBG("closing file '%s' filp = %p\n", mf->mf_name, mf->mf_filp);
+		list_del_init(&mf->mf_head);
+		CHECK_HEAD_EMPTY(&mf->mf_dirty_anchor);
+		if (likely(mf->mf_filp)) {
+			mapfree_pages(mf, -1);
+			filp_close(mf->mf_filp, NULL);
+		}
+		brick_string_free(mf->mf_name);
+		brick_mem_free(mf);
+	}
+}
+
+void mapfree_put(struct mapfree_info *mf)
+{
+	if (likely(mf)) {
+		down_write(&mapfree_mutex);
+		_mapfree_put(mf);
+		up_write(&mapfree_mutex);
+	}
+}
+
+struct mapfree_info *mapfree_get(const char *name, int flags)
+{
+	struct mapfree_info *mf = NULL;
+	struct list_head *tmp;
+
+	if (!(flags & O_DIRECT)) {
+		down_read(&mapfree_mutex);
+		for (tmp = mapfree_list.next; tmp != &mapfree_list; tmp = tmp->next) {
+			struct mapfree_info *_mf = container_of(tmp, struct mapfree_info, mf_head);
+
+			if (_mf->mf_flags == flags && !strcmp(_mf->mf_name, name)) {
+				mf = _mf;
+				atomic_inc(&mf->mf_count);
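The page window that mapfree_pages() hands to invalidate_mapping_pages() in the non-forced case can be followed with a small user-space model. This is a simplified sketch under stated assumptions, not the kernel code: struct page_window, mapfree_window() and SKETCH_PAGE_SIZE (4096) are hypothetical names, the locking and the mf_min[] double buffering are left out, and the sketch clamps a negative offset to zero, which the patch does not do.

```c
#include <limits.h>
#include <stdbool.h>

#define SKETCH_PAGE_SIZE 4096L	/* assumption: 4 KiB pages */

/* Hypothetical result type: page range [start, end] plus the decision
 * whether the range would be invalidated at all.
 */
struct page_window {
	unsigned long start;
	unsigned long end;
	bool invalidate;
};

/* Hypothetical model of the grace_keep >= 0 branch.
 * *last plays the role of mf->mf_last, min is the smallest recently
 * used file offset, grace_keep_mb is the area kept below it.
 */
static struct page_window mapfree_window(long long *last, long long min, int grace_keep_mb)
{
	struct page_window w = { 0, 0, false };

	min -= (long long)grace_keep_mb * 1024 * 1024;
	if (min < 0)
		min = 0;	/* sketch-only clamp, not present in the patch */

	if (min > 0 || *last) {
		w.start = (unsigned long)(*last / SKETCH_PAGE_SIZE);
		if (w.start > 0)
			w.start--;	/* overlap the previous window by one page */
		*last = min;
		w.end = (unsigned long)(min / SKETCH_PAGE_SIZE);
	} else {
		/* no progress: flush everything only if no grace area is requested */
		if (!grace_keep_mb)
			w.end = ULONG_MAX;
	}
	w.invalidate = (w.end > w.start || w.end == ULONG_MAX);
	return w;
}
```

With a 16 MiB grace area and a minimum offset of 100 MiB, the first call yields pages 0 to 21504 (84 MiB / 4 KiB); a later call with a minimum of 200 MiB starts one page before the previous end, at page 21503.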
[RFC 32/32] mars: activate build
From: Thomas Schoebel-Theuer <t...@1und1.de>
---
 drivers/staging/Kconfig  | 2 ++
 drivers/staging/Makefile | 1 +
 2 files changed, 3 insertions(+)

diff --git a/drivers/staging/Kconfig b/drivers/staging/Kconfig
index 5d3b86a33857..bbccc4f0ebbe 100644
--- a/drivers/staging/Kconfig
+++ b/drivers/staging/Kconfig
@@ -56,6 +56,8 @@ source "drivers/staging/vt6656/Kconfig"
 
 source "drivers/staging/iio/Kconfig"
 
+source "drivers/staging/mars/Kconfig"
+
 source "drivers/staging/sm750fb/Kconfig"
 
 source "drivers/staging/xgifb/Kconfig"
diff --git a/drivers/staging/Makefile b/drivers/staging/Makefile
index 30918edef5e3..01732bd65542 100644
--- a/drivers/staging/Makefile
+++ b/drivers/staging/Makefile
@@ -22,6 +22,7 @@ obj-$(CONFIG_VT6655) += vt6655/
 obj-$(CONFIG_VT6656) += vt6656/
 obj-$(CONFIG_VME_BUS) += vme/
 obj-$(CONFIG_IIO) += iio/
+obj-$(CONFIG_MARS) += mars/
 obj-$(CONFIG_FB_SM750) += sm750fb/
 obj-$(CONFIG_FB_XGI) += xgifb/
 obj-$(CONFIG_USB_EMXX) += emxx_udc/
--
2.11.0
[RFC 24/32] mars: add new module strategy
Signed-off-by: Thomas Schoebel-Theuer <t...@schoebel-theuer.de>
---
 drivers/staging/mars/mars/strategy.h | 239 +++
 1 file changed, 239 insertions(+)
 create mode 100644 drivers/staging/mars/mars/strategy.h

diff --git a/drivers/staging/mars/mars/strategy.h b/drivers/staging/mars/mars/strategy.h
new file mode 100644
index ..d570772847c2
--- /dev/null
+++ b/drivers/staging/mars/mars/strategy.h
@@ -0,0 +1,239 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+/* OLD CODE => will disappear!
+ */
+#ifndef _OLD_STRATEGY
+#define _OLD_STRATEGY
+
+#define _STRATEGY /* call this only in strategy bricks, never in ordinary bricks */
+
+#include
+
+#define MARS_ARGV_MAX 4
+
+extern loff_t global_total_space;
+extern loff_t global_remaining_space;
+
+extern int global_logrot_auto;
+extern int global_free_space_0;
+extern int global_free_space_1;
+extern int global_free_space_2;
+extern int global_free_space_3;
+extern int global_free_space_4;
+extern int global_sync_want;
+extern int global_sync_nr;
+extern int global_sync_limit;
+extern int mars_rollover_interval;
+extern int mars_scan_interval;
+extern int mars_propagate_interval;
+extern int mars_sync_flip_interval;
+extern int mars_peer_abort;
+extern int mars_emergency_mode;
+extern int mars_reset_emergency;
+extern int mars_keep_msg;
+
+extern int mars_fast_fullsync;
+
+#define MARS_DENT(TYPE) \
+	struct list_head dent_link; \
+	struct list_head brick_list; \
+	struct TYPE *d_parent; \
+	char *d_argv[MARS_ARGV_MAX]; /* for internal use, will be automatically deallocated*/ \
+	char *d_args; /* ditto uninterpreted */ \
+	char *d_name; /* current path component */ \
+	char *d_rest; /* some "meaningful" rest of d_name*/ \
+	char *d_path; /* full absolute path */ \
+	struct say_channel *d_say_channel; /* for messages */ \
+	loff_t d_corr_A; /* logical size correction */ \
+	loff_t d_corr_B; /* logical size correction */ \
+	int d_depth; \
+	/* from readdir() => often DT_UNKNOWN */ \
+	/* don't rely on it - use stat_val.mode instead */ \
+	unsigned int d_type; \
+	int d_class; /* for pre-grouping order */ \
+	int d_serial; /* for pre-grouping order */ \
+	int d_version; /* dynamic programming per call of mars_ent_work() */ \
+	int d_child_count; \
+	bool d_killme; \
+	bool d_use_channel; \
+	struct kstat stat_val; \
+	char *link_val; \
+	struct mars_global *d_global; \
+	void (*d_private_destruct)(void *private); \
+	void *d_private
+
+struct mars_dent {
+	MARS_DENT(mars_dent);
+};
+
+extern const struct meta mars_kstat_meta[];
+extern const struct meta mars_dent_meta[];
+
+struct mars_global {
+	struct rw_semaphore dent_mutex;
+	struct rw_semaphore brick_mutex;
+	struct generic_switch global_power;
+	struct list_head dent_anchor;
+	struct list_head brick_anchor;
+
+	wait_queue_head_t main_event;
+	int global_version;
+	int deleted_my_border;
+	int deleted_border;
+	int deleted_min;
+	bool main_trigger;
+};
+
+extern void bind_to_dent(struct mars_dent *dent, struct say_channel **ch);
+
+typedef int (
+*mars_dent_checker_fn)(
+struct mars_dent *parent,
+const char *name,
+int namlen,
+unsigned int d_type,
+int *prefix,
+int *serial,
+bool *use_channel);
+
+typedef int (*mars_dent_worker_fn)(struct mars_global *global, struct mars_dent *dent, bool prepare, bool direction);
+
+extern i
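MARS_DENT(TYPE) pastes a common set of fields into a structure and uses the TYPE argument to give d_parent the type of the embedding structure, presumably so that other structures can reuse the same fields. A much smaller stand-alone sketch of that technique is shown here; SKETCH_DENT, plain_dent, rich_dent and rich_dent_chain_length() are hypothetical names and not part of the patch.

```c
#include <stddef.h>

/* Hypothetical, much smaller analogue of MARS_DENT(TYPE):
 * the macro expands to a list of member declarations, and the
 * parent pointer gets the type of the structure that embeds it.
 */
#define SKETCH_DENT(TYPE)		\
	struct TYPE *d_parent;		\
	const char *d_name;		\
	int d_depth

/* A structure consisting of the common fields only. */
struct plain_dent {
	SKETCH_DENT(plain_dent);
};

/* A structure that extends the common fields with its own state. */
struct rich_dent {
	SKETCH_DENT(rich_dent);
	int extra_state;	/* field private to this type */
};

/* Walks the parent chain and counts its elements.
 * No casts are needed because d_parent already has the right type.
 */
static int rich_dent_chain_length(const struct rich_dent *dent)
{
	int n = 0;

	for (; dent; dent = dent->d_parent)
		n++;
	return n;
}
```

Because the common members always come first and in the same order, both structures lay them out identically, which is what makes generic handling of the shared part possible.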
[RFC 16/32] mars: add new module lib_log
Signed-off-by: Thomas Schoebel-Theuer <t...@schoebel-theuer.de>
---
 drivers/staging/mars/xio_bricks/lib_log.c | 506 ++
 include/linux/xio/lib_log.h               | 333
 2 files changed, 839 insertions(+)
 create mode 100644 drivers/staging/mars/xio_bricks/lib_log.c
 create mode 100644 include/linux/xio/lib_log.h

diff --git a/drivers/staging/mars/xio_bricks/lib_log.c b/drivers/staging/mars/xio_bricks/lib_log.c
new file mode 100644
index ..e0d086a0981f
--- /dev/null
+++ b/drivers/staging/mars/xio_bricks/lib_log.c
@@ -0,0 +1,506 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+#include
+#include
+#include
+
+#include
+
+atomic_t global_aio_flying = ATOMIC_INIT(0);
+
+void exit_logst(struct log_status *logst)
+{
+	int count;
+
+	log_flush(logst);
+
+	/* TODO: replace by event */
+	count = 0;
+	while (atomic_read(&logst->aio_flying) > 0) {
+		if (!count++)
+			XIO_DBG("waiting for IO terminating...");
+		brick_msleep(500);
+	}
+	if (logst->read_aio) {
+		XIO_DBG("putting read_aio\n");
+		GENERIC_INPUT_CALL(logst->input, aio_put, logst->read_aio);
+		logst->read_aio = NULL;
+	}
+	if (logst->log_aio) {
+		XIO_DBG("putting log_aio\n");
+		GENERIC_INPUT_CALL(logst->input, aio_put, logst->log_aio);
+		logst->log_aio = NULL;
+	}
+}
+
+void init_logst(struct log_status *logst, struct xio_input *input, loff_t start_pos, loff_t end_pos)
+{
+	exit_logst(logst);
+
+	memset(logst, 0, sizeof(struct log_status));
+
+	logst->input = input;
+	logst->brick = input->brick;
+	logst->start_pos = start_pos;
+	logst->log_pos = start_pos;
+	logst->end_pos = end_pos;
+	init_waitqueue_head(&logst->event);
+}
+
+#define XIO_LOG_CB_MAX 32
+
+struct log_cb_info {
+	struct aio_object *aio;
+	struct log_status *logst;
+	struct semaphore mutex;
+	atomic_t refcount;
+	int nr_cb;
+	void (*endios[XIO_LOG_CB_MAX])(void *private, int error);
+	void *privates[XIO_LOG_CB_MAX];
+};
+
+static
+void put_log_cb_info(struct log_cb_info *cb_info)
+{
+	if (atomic_dec_and_test(&cb_info->refcount))
+		brick_mem_free(cb_info);
+}
+
+static
+void _do_callbacks(struct log_cb_info *cb_info, int error)
+{
+	int i;
+
+	down(&cb_info->mutex);
+	for (i = 0; i < cb_info->nr_cb; i++) {
+		void (*end_fn)(void *private, int error);
+
+		end_fn = cb_info->endios[i];
+		cb_info->endios[i] = NULL;
+		if (end_fn)
+			end_fn(cb_info->privates[i], error);
+	}
+	up(&cb_info->mutex);
+}
+
+static
+void log_write_endio(struct generic_callback *cb)
+{
+	struct log_cb_info *cb_info = cb->cb_private;
+	struct log_status *logst;
+
+	LAST_CALLBACK(cb);
+	CHECK_PTR(cb_info, err);
+
+	logst = cb_info->logst;
+	CHECK_PTR(logst, done);
+
+	_do_callbacks(cb_info, cb->cb_error);
+
+done:
+	put_log_cb_info(cb_info);
+	atomic_dec(&logst->aio_flying);
+	atomic_dec(&global_aio_flying);
+	if (logst->signal_event)
+		wake_up_interruptible(logst->signal_event);
+
+	goto out_return;
+err:
+	XIO_FAT("internal pointer corruption\n");
+out_return:;
+}
+
+void log_flush(struct log_status *logst)
+{
+	struct aio_object *aio = logst->log_aio;
+	struct log_cb_info *cb_info;
+	int align_size;
+	int gap;
+
+	if (!aio || !logst->count)
+		goto out_return;
+	gap = 0;
+	align_size = (logst->align_size / PAGE_SIZE) * PAGE_SIZE;
+	if (align_size > 0) {
+		/* round up to next alignment border */
+		int align_offset = logst->offset & (align_size - 1);
+
+		if (align_offset > 0) {
+			int restlen = aio->io_len - logst->offset;
+
+			gap = align_size - align_offset;
+			if (unlikely(gap > restlen))
+				gap = restlen;
[RFC 25/32] mars: add new module main_strategy
Signed-off-by: Thomas Schoebel-Theuer <t...@schoebel-theuer.de>
---
 drivers/staging/mars/mars/main_strategy.c | 2135 +
 1 file changed, 2135 insertions(+)
 create mode 100644 drivers/staging/mars/mars/main_strategy.c

diff --git a/drivers/staging/mars/mars/main_strategy.c b/drivers/staging/mars/mars/main_strategy.c
new file mode 100644
index ..7929b566d645
--- /dev/null
+++ b/drivers/staging/mars/mars/main_strategy.c
@@ -0,0 +1,2135 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#define XIO_DEBUGGING
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#include "strategy.h"
+
+#include
+#include
+
+#include
+#include
+#include
+#include
+
+#define SKIP_BIO false
+
+/***/
+
+/* meta descriptions */
+
+const struct meta mars_kstat_meta[] = {
+	META_INI(ino, struct kstat, FIELD_UINT),
+	META_INI(mode, struct kstat, FIELD_UINT),
+	META_INI(size, struct kstat, FIELD_INT),
+	META_INI_SUB(atime, struct kstat, xio_timespec_meta),
+	META_INI_SUB(mtime, struct kstat, xio_timespec_meta),
+	META_INI_SUB(ctime, struct kstat, xio_timespec_meta),
+	META_INI_TRANSFER(blksize, struct kstat, FIELD_UINT, 4),
+	{}
+};
+
+const struct meta mars_dent_meta[] = {
+	META_INI(d_name, struct mars_dent, FIELD_STRING),
+	META_INI(d_rest, struct mars_dent, FIELD_STRING),
+	META_INI(d_path, struct mars_dent, FIELD_STRING),
+	META_INI(d_type, struct mars_dent, FIELD_UINT),
+	META_INI(d_class, struct mars_dent, FIELD_INT),
+	META_INI(d_serial, struct mars_dent, FIELD_INT),
+	META_INI(d_corr_A, struct mars_dent, FIELD_INT),
+	META_INI(d_corr_B, struct mars_dent, FIELD_INT),
+	META_INI_SUB(stat_val, struct mars_dent, mars_kstat_meta),
+	META_INI(link_val, struct mars_dent, FIELD_STRING),
+	META_INI(d_args, struct mars_dent, FIELD_STRING),
+	META_INI(d_argv[0], struct mars_dent, FIELD_STRING),
+	META_INI(d_argv[1], struct mars_dent, FIELD_STRING),
+	META_INI(d_argv[2], struct mars_dent, FIELD_STRING),
+	META_INI(d_argv[3], struct mars_dent, FIELD_STRING),
+	{}
+};
+
+/***/
+
+/* The _compat_*() functions are needed for the out-of-tree version
+ * of MARS for adaptation to different kernel versions.
+ */
+
+/* Hack because of 8bcb77fabd7cbabcad49f58750be8683febee92b
+ */
+static int __path_parent(const char *name, struct path *path, unsigned flags)
+{
+	char *tmp;
+	int len;
+	int error;
+
+	len = strlen(name);
+	while (len > 0 && name[len] != '/')
+		len--;
+	if (unlikely(!len))
+		return -EINVAL;
+
+	tmp = brick_string_alloc(len + 1);
+	strncpy(tmp, name, len);
+	tmp[len] = '\0';
+
+	error = kern_path(tmp, flags | LOOKUP_DIRECTORY | LOOKUP_FOLLOW, path);
+
+	brick_string_free(tmp);
+	return error;
+}
+
+/* code is blindly stolen from symlinkat()
+ * and later adapted to various kernels
+ */
+int _compat_symlink(
+const char __user *oldname,
+	const char __user *newname,
+	struct timespec *mtime)
+{
+	const int newdfd = AT_FDCWD;
+	int error;
+	char *from;
+	struct dentry *dentry;
+	struct path path;
+	unsigned int lookup_flags = 0;
+
+	from = (char *)oldname;
+
+retry:
+	dentry = user_path_create(newdfd, newname, &path, lookup_flags);
+	error = PTR_ERR(dentry);
+	if (IS_ERR(dentry))
+		goto out_putname;
+
+	error = vfs_symlink(path.dentry->d_inode, dentry, from);
+	if (error >= 0 && mtime) {
+		struct iattr iattr = {
+			.ia_valid = ATTR_MTIME | ATTR_MTIME_SET | ATTR_TIMES_SET,
+			.ia_mtime.tv_sec = mtime->tv_sec,
+			.ia_mtime.tv_nsec = mtime->tv_nsec,
+		};
+
+		mutex_lock(&dentry->d_inode->i_mutex);
+		error = notify_change(dentry, &iattr, NULL);
+		mutex_unlock(&dentry->d_inode->i_mutex);
+	}
+	done_path_create(&path, dentry);
[RFC 27/32] mars: add new module server_strategy
Signed-off-by: Thomas Schoebel-Theuer <t...@schoebel-theuer.de>
---
 drivers/staging/mars/mars/server_strategy.c | 436
 1 file changed, 436 insertions(+)
 create mode 100644 drivers/staging/mars/mars/server_strategy.c

diff --git a/drivers/staging/mars/mars/server_strategy.c b/drivers/staging/mars/mars/server_strategy.c
new file mode 100644
index ..3b880c10be49
--- /dev/null
+++ b/drivers/staging/mars/mars/server_strategy.c
@@ -0,0 +1,436 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2016 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2016 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+/* MARS Light specific parts of xio_server
+ */
+
+#include
+#include
+#include
+
+#define _STRATEGY
+#include
+#include
+#include
+#include
+
+#include "strategy.h"
+
+#include
+#include
+
+static
+int dummy_worker(struct mars_global *global, struct mars_dent *dent, bool prepare, bool direction)
+{
+	return 0;
+}
+
+static
+int _set_server_sio_params(struct xio_brick *_brick, void *private)
+{
+	struct sio_brick *sio_brick = (void *)_brick;
+
+	if (_brick->type != (void *)_sio_brick_type) {
+		XIO_ERR("bad brick type\n");
+		return -EINVAL;
+	}
+	sio_brick->o_direct = false;
+	sio_brick->o_fdsync = false;
+	XIO_INF("name = '%s' path = '%s'\n", _brick->brick_name, _brick->brick_path);
+	return 1;
+}
+
+static
+int _set_server_bio_params(struct xio_brick *_brick, void *private)
+{
+	struct bio_brick *bio_brick;
+
+	if (_brick->type == (void *)_sio_brick_type)
+		return _set_server_sio_params(_brick, private);
+	if (_brick->type != (void *)_bio_brick_type) {
+		XIO_ERR("bad brick type\n");
+		return -EINVAL;
+	}
+	bio_brick = (void *)_brick;
+	bio_brick->ra_pages = 0;
+	bio_brick->do_noidle = true;
+	bio_brick->do_sync = true;
+	bio_brick->do_unplug = true;
+	XIO_INF("name = '%s' path = '%s'\n", _brick->brick_name, _brick->brick_path);
+	return 1;
+}
+
+int handler_thread(void *data)
+{
+	struct mars_global handler_global = {
+		.dent_anchor = LIST_HEAD_INIT(handler_global.dent_anchor),
+		.brick_anchor = LIST_HEAD_INIT(handler_global.brick_anchor),
+		.global_power = {
+			.button = true,
+		},
+		.main_event = __WAIT_QUEUE_HEAD_INITIALIZER(handler_global.main_event),
+	};
+	struct task_struct *thread = NULL;
+	struct server_brick *brick = data;
+	struct xio_socket *sock = &brick->handler_socket;
+	bool ok = xio_get_socket(sock);
+	unsigned long statist_jiffies = jiffies;
+	int debug_nr;
+	int status = -EINVAL;
+
+	init_rwsem(&handler_global.dent_mutex);
+	init_rwsem(&handler_global.brick_mutex);
+
+	XIO_DBG("#%d --- handler_thread starting on socket %p\n", sock->s_debug_nr, sock);
+	if (!ok)
+		goto done;
+
+	thread = brick_thread_create(cb_thread, brick, "xio_cb%d", brick->version);
+	if (unlikely(!thread)) {
+		XIO_ERR("cannot create cb thread\n");
+		status = -ENOENT;
+		goto done;
+	}
+	brick->cb_thread = thread;
+
+	brick->handler_running = true;
+	wake_up_interruptible(&brick->startup_event);
+
+	while (!list_empty(&handler_global.brick_anchor) ||
+	       xio_socket_is_alive(sock)) {
+		struct xio_cmd cmd = {};
+
+		handler_global.global_version++;
+
+		if (!list_empty(&handler_global.brick_anchor)) {
+			if (server_show_statist && !time_is_before_jiffies(statist_jiffies + 10 * HZ)) {
+				show_statistics(&handler_global, "handler");
+				statist_jiffies = jiffies;
+			}
+			if (!xio_socket_is_alive(sock) &&
+			    atomic_read(&brick->in_flight) <= 0 &&
+			    brick->conn_brick) {
+				if (generic_disconnect((void *)brick->inputs[0]) >= 0)
+					brick->conn_brick = NULL;
+			}
+
[RFC 24/32] mars: add new module strategy
Signed-off-by: Thomas Schoebel-Theuer
---
 drivers/staging/mars/mars/strategy.h | 239 +++
 1 file changed, 239 insertions(+)
 create mode 100644 drivers/staging/mars/mars/strategy.h

diff --git a/drivers/staging/mars/mars/strategy.h b/drivers/staging/mars/mars/strategy.h
new file mode 100644
index ..d570772847c2
--- /dev/null
+++ b/drivers/staging/mars/mars/strategy.h
@@ -0,0 +1,239 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+/* OLD CODE => will disappear!
+ */
+#ifndef _OLD_STRATEGY
+#define _OLD_STRATEGY
+
+#define _STRATEGY /* call this only in strategy bricks, never in ordinary bricks */
+
+#include
+
+#define MARS_ARGV_MAX 4
+
+extern loff_t global_total_space;
+extern loff_t global_remaining_space;
+
+extern int global_logrot_auto;
+extern int global_free_space_0;
+extern int global_free_space_1;
+extern int global_free_space_2;
+extern int global_free_space_3;
+extern int global_free_space_4;
+extern int global_sync_want;
+extern int global_sync_nr;
+extern int global_sync_limit;
+extern int mars_rollover_interval;
+extern int mars_scan_interval;
+extern int mars_propagate_interval;
+extern int mars_sync_flip_interval;
+extern int mars_peer_abort;
+extern int mars_emergency_mode;
+extern int mars_reset_emergency;
+extern int mars_keep_msg;
+
+extern int mars_fast_fullsync;
+
+#define MARS_DENT(TYPE) \
+	struct list_head dent_link; \
+	struct list_head brick_list; \
+	struct TYPE *d_parent; \
+	char *d_argv[MARS_ARGV_MAX]; /* for internal use, will be automatically deallocated*/ \
+	char *d_args; /* ditto uninterpreted */ \
+	char *d_name; /* current path component */ \
+	char *d_rest; /* some "meaningful" rest of d_name*/ \
+	char *d_path; /* full absolute path */ \
+	struct say_channel *d_say_channel; /* for messages */ \
+	loff_t d_corr_A; /* logical size correction */ \
+	loff_t d_corr_B; /* logical size correction */ \
+	int d_depth; \
+	/* from readdir() => often DT_UNKNOWN */ \
+	/* don't rely on it - use stat_val.mode instead */ \
+	unsigned int d_type; \
+	int d_class; /* for pre-grouping order */ \
+	int d_serial; /* for pre-grouping order */ \
+	int d_version; /* dynamic programming per call of mars_ent_work() */ \
+	int d_child_count; \
+	bool d_killme; \
+	bool d_use_channel; \
+	struct kstat stat_val; \
+	char *link_val; \
+	struct mars_global *d_global; \
+	void (*d_private_destruct)(void *private); \
+	void *d_private
+
+struct mars_dent {
+	MARS_DENT(mars_dent);
+};
+
+extern const struct meta mars_kstat_meta[];
+extern const struct meta mars_dent_meta[];
+
+struct mars_global {
+	struct rw_semaphore dent_mutex;
+	struct rw_semaphore brick_mutex;
+	struct generic_switch global_power;
+	struct list_head dent_anchor;
+	struct list_head brick_anchor;
+
+	wait_queue_head_t main_event;
+	int global_version;
+	int deleted_my_border;
+	int deleted_border;
+	int deleted_min;
+	bool main_trigger;
+};
+
+extern void bind_to_dent(struct mars_dent *dent, struct say_channel **ch);
+
+typedef int (
+*mars_dent_checker_fn)(
+struct mars_dent *parent,
+const char *name,
+int namlen,
+unsigned int d_type,
+int *prefix,
+int *serial,
+bool *use_channel);
+
+typedef int (*mars_dent_worker_fn)(struct mars_global *global, struct mars_dent *dent, bool prepare, bool direction);
+
+extern int mars_dent_work(
+struct mars_g
Re: [PATCH 2/2] block: create ioctl to discard-or-zeroout a range of blocks
On 03/12/2016 08:19 AM, Theodore Ts'o wrote:
> On Fri, Mar 11, 2016 at 04:44:16PM -0800, Linus Torvalds wrote:
>> There's a big difference between "give the user rope", and "tie the
>> rope in a noose and put a banana peel so that the user might stumble
>> into the rope and hang himself", though.
>
> [...]
>
> And then the application has to run setgid with that group's privileges.

Your concept of hierarchically nesting containers via filesystem instances looks nice to me.

A potential concern could be whether gids are the right implementation for expressing hierarchically nested access permissions in a persistent way. Your permissions attached to gids are nested (because inside of your containers you may have another instance of a completely different gid namespace). They are also persistent, provided that your mount flags etc. are restored properly after a crash (by some scripts). But using gids for this might look like a kind of "misuse" of the original gid concept from the 1970s.

Maybe you currently don't have a better /persistent/ concept for expressing your needs, so your solution could be just fine under the current circumstances. Introducing a new concept for overcoming the current limitations must be done very carefully.

The concerns about information leaks caused by bad discard semantics could be /hypothetically/ solved at /concept level/ in the following way. Please note that by "concept level" I don't want to imply any particular implementation; this is just a thought experiment for discussing the problems, just a "model of thinking":

a) Use a hierarchical namespace for naming subjects, e.g. hypervisorA.containerB.subcontainerC.user9 instead of gid=9.

b) Attach actual permissions to each block of the underlying block device (fine-grained object model).

c) Correctly maintain access rights at each hierarchical layer, and for all operations (including discard with whatever semantics). In case some inner instance is untrusted and may do evil things, this will be intercepted / corrected at outer layers (which are more trusted). In essence, the nesting hierarchy is also a hierarchy of trust.

With this model, information leaks caused by bad discard semantics etc. should be solved at any level, even regarding completely unrelated containers or users, as long as no physical access to the disk is possible. In addition, encryption may be used to overcome even that.

Of course, a direct implementation of such extremely fine-grained access permissions would carry way too much overhead. Both the number of subjects and the number of objects must be reduced to some reasonable order of magnitude, at least at outer levels.

Thus the question is: how can we achieve almost the same effect with much less overhead?

Hmm, in my old Athomux research prototype, I proposed some solutions for this, in an academic greenfield setting. But I am unsure what is transferable to a system with standard POSIX semantics, and what is not. Rethinking these concepts as well as checking them may take some time.

Here is a first alpha-stage attempt:

1) Give up the hierarchical subject namespace a), but maybe not fully. Access checking will continue /locally/ at each layer, by treating each subsystem as a (grey) blackbox. This is already the default implementation strategy. The total system may be less secure than an idealized fine-grained system, because outer levels can no longer detect bad guys inside of their subsystem instances. The question is: how to get a "more secure" system than we have currently, with some reasonable effort.

2) Some /coarse/ access permission checks at the block layer b), but finer than today. Currently there is almost no checking at all (except when accessing a huge block device as a whole during open() => at 1&1 we have very large ones, and they may continue running for years). I am unsure how to achieve this in detail. An idea for a long-term solution would be to offload "allocation groups" to the block layer (if their size is coarsely dynamic in general, e.g. in steps of gigabytes), and to implement some coarse permission checks there. These could then be related to "containers" or "container groups". One of the problems is that some widespread network protocols like iSCSI have no clue about this, so this can only be an optional new feature.

Further ideas sought.

Cheers,
Thomas

P.S. The concept of a "nest" in Athomux was already some kind of "recursively nested block device".
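
To make idea 2) a bit more concrete, here is a minimal userspace sketch of a coarse permission check per allocation group. Everything in it is an assumption made for illustration only: the 1 GiB group size, the `ag_owner_table` structure, the numeric container ids and the function name `ag_range_permitted` are hypothetical and do not correspond to any existing kernel interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: one allocation group covers 1 GiB of the block device. */
#define AG_SHIFT 30

/* Hypothetical ownership table: owner[i] is the container id owning group i. */
struct ag_owner_table {
	const uint32_t *owner;
	size_t nr_groups;
};

/*
 * Coarse check: container `cid` may touch the byte range
 * [start, start + len) only if it owns every allocation group
 * that the range overlaps. This would also apply to discard.
 */
static bool ag_range_permitted(const struct ag_owner_table *t,
			       uint32_t cid, uint64_t start, uint64_t len)
{
	uint64_t first, last, i;

	assert(t && t->owner);
	if (len == 0)
		return true;		/* empty range touches nothing */
	if (start + len < start)
		return false;		/* arithmetic overflow */
	first = start >> AG_SHIFT;
	last = (start + len - 1) >> AG_SHIFT;
	if (last >= t->nr_groups)
		return false;		/* beyond the end of the device */
	for (i = first; i <= last; i++)
		if (t->owner[i] != cid)
			return false;
	return true;
}
```

The cost of such a check is proportional to the number of gigabyte-sized groups a request spans, not to the number of blocks, which is the kind of overhead reduction asked for above. How the ownership table would be maintained persistently is left open here.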
Re: [PATCH 2/2] block: create ioctl to discard-or-zeroout a range of blocks
On 03/12/2016 08:19 AM, Theodore Ts'o wrote:
> On Fri, Mar 11, 2016 at 04:44:16PM -0800, Linus Torvalds wrote:
> > There's a big difference between "give the user rope", and "tie the
> > rope in a noose and put a banana peel so that the user might stumble
> > into the rope and hang himself", though.
> [...]
> And then the application has to run setgid with that group's privileges.

Your concept of hierarchically nesting containers via filesystem instances looks nice to me.

A potential concern could be whether gids are the right implementation for expressing hierarchically nested access permissions in a persistent way. Your permissions attached to gids are nested (because inside of your containers you may have another instance of a completely different gid namespace), and they are also persistent when your mount flags etc. are restored properly after a crash (by some scripts). However, using gids for this might look like a kind of "misuse" of the original gid concept from the 1970s.

Maybe you currently don't have a better /persistent/ concept for expressing your needs, so your solution could be just fine under the currently given circumstances. Introducing a new concept for overcoming the current limitations must be done very carefully.

The concerns about information leaks caused by bad discard semantics could /hypothetically/ be solved at /concept level/ in the following way. Please note that by "concept level" I don't want to imply any particular implementation; this is just a thought experiment for discussing the problems, just a "model of thinking":

a) Use a hierarchical namespace for naming subjects, e.g. hypervisorA.containerB.subcontainerC.user9 instead of gid=9

b) Attach actual permissions to each block of the underlying block device (fine-grained object model).

c) Correctly maintain access rights at each hierarchical layer, and for all operations (including discard with whatever semantics).

In case some inner instance is untrusted and may do evil things, this will be intercepted / corrected at outer layers (which are more trusted). In essence, the nesting hierarchy is also a hierarchy of trust.

Now information leaks by bad discard semantics etc. should be solved at any level, even regarding completely unrelated containers or users, as long as no physical access to the disk is possible. In addition, encryption may be used to overcome even this.

Of course, a direct implementation of such extremely fine-grained access permissions would carry way too much overhead. Both the number of subjects and the number of objects must be reduced to some reasonable order of magnitude, at least at outer levels.

Thus the question is: how can we achieve almost the same effect with much less overhead?

Hmm, in my old Athomux research prototype, I proposed some solutions for this, in an academic greenfield setting. But I am unsure what is transferable to a system with standard POSIX semantics, and what is not. Rethinking these concepts as well as checking them may take some time.

Here is a first alpha-stage attempt:

1) Give up the hierarchical subject namespace a), but maybe not fully. Access checking will continue /locally/ at each layer, by treating each subsystem as a (grey) blackbox. This is already the default implementation strategy. The total system may be less secure than an idealized fine-grained system, because outer levels can no longer detect bad guys inside of their subsystem instances. The question is: how to get a "more secure" system than currently, with some reasonable effort.

2) Some /coarse/ access permission checks at the block layer b), but finer than today. Currently there is almost no checking at all (except when accessing a huge block device as a whole during open() => at 1&1 we have very large ones, and they may continue running for years). I am unsure how to achieve this in detail. An idea for a long-term solution would be offloading of "allocation groups" to the block layer (if their size is coarsely dynamic in general, e.g. in steps of gigabytes), and implementing some coarse permission checks there. These could then be related to "containers" or "container groups". One of the problems is that some widespread network protocols like iSCSI have no clue about this, so this can only be an optional new feature.

Further ideas sought.

Cheers,
Thomas

P.S. The concept of a "nest" in Athomux was already some kind of "recursively nested block device".
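To make the thought experiment from a) to c) a bit more tangible, here is a minimal userspace sketch in C of what a coarse outer-layer check might look like. Everything in it is an assumption for illustration only: the names `extent_perm`, `subject_within()` and `may_access()` are hypothetical, and the rule "a subject may touch an extent if it lies in the owner's subtree" is just one possible reading of the hierarchy of trust, not an existing kernel interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical coarse permission record: one extent of blocks, one owner. */
struct extent_perm {
	unsigned long long start;	/* first block of the extent */
	unsigned long long len;		/* number of blocks in the extent */
	const char *owner;		/* e.g. "hypervisorA.containerB" */
};

/*
 * True if 'subject' names the owner itself or something nested below it.
 * The match must end at a '.' boundary, so "containerB" does not match
 * "containerBad".
 */
static bool subject_within(const char *owner, const char *subject)
{
	size_t n = strlen(owner);

	if (strncmp(owner, subject, n) != 0)
		return false;
	return subject[n] == '\0' || subject[n] == '.';
}

/*
 * Outer-layer check for any operation (read, write, discard): the range
 * [pos, pos + len) must lie completely inside one extent whose owner
 * subtree contains the subject. Finer checks are left to inner layers.
 */
static bool may_access(const struct extent_perm *tab, size_t nr,
		       const char *subject,
		       unsigned long long pos, unsigned long long len)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		const struct extent_perm *e = &tab[i];

		if (pos < e->start || pos - e->start > e->len)
			continue;
		if (len > e->len - (pos - e->start))
			continue;
		if (subject_within(e->owner, subject))
			return true;
	}
	return false;
}
```

With a table that assigns blocks 0..1023 to hypervisorA.containerB, a discard issued by hypervisorA.containerB.subcontainerC.user9 on those blocks would pass the outer check, while the same request on a neighbouring container's extent would be rejected, regardless of what the inner layers believe.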
Re: [PATCH 2/2] block: create ioctl to discard-or-zeroout a range of blocks
On 03/03/2016 11:56 PM, Dave Chinner wrote:
> That "new kind of write command" would enable delayed allocation
> algorithms to continue to work at the filesystem level on block
> devices that freespace management completely is offloaded to...
>
> Cheers, Dave.

This would argue for a uniform /internal/ interface (family) across both fs and block layers, similar in spirit to my old Athomux research prototype from long ago (see www.athomux.net). This allows for recursive nesting in complex (distributed) storage/fs hierarchies.

It would be nice if that internal interface (family) were (partly / fully) asynchronous with callbacks. In the ideal case, it should be compatible with workqueues (no need for blocking threads anymore).

Uniformity is only needed at concept level. Different flavours of concrete interfaces might remain at different subsystems, provided that the number of subsystems remains as small as possible and interfacing between them is close to trivial.

I would like to support this also in future versions of MARS (see github.com/schoebel/mars).

Cheers,
Thomas
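As a rough illustration of what "uniform at concept level, asynchronous with callbacks" could mean, here is a small userspace sketch in C. It is not the Athomux or MARS interface; the names `io_req`, `layer`, `ram_submit()`, `remap_submit()` and `store_status()` are hypothetical. Every layer exposes the same `submit()` signature, so layers can be stacked recursively, and the result is reported only through the completion callback.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

struct io_req;
typedef void (*io_done_fn)(struct io_req *req, int error);

/* Hypothetical request object shared by all layers. */
struct io_req {
	unsigned long long pos;	/* byte offset within the addressed layer */
	size_t len;		/* number of bytes */
	void *data;		/* caller-supplied buffer */
	int write;		/* nonzero for write, zero for read */
	io_done_fn done;	/* completion callback */
	void *private;		/* opaque pointer for the callback */
};

/* Hypothetical layer: the same submit() signature at every level. */
struct layer {
	int (*submit)(struct layer *self, struct io_req *req);
	struct layer *lower;		/* next layer down, NULL at the bottom */
	unsigned long long offset;	/* used by remap_submit() only */
	unsigned char *mem;		/* used by ram_submit() only */
	size_t size;			/* used by ram_submit() only */
};

/* Bottom layer: a RAM-backed "device" that completes immediately. */
static int ram_submit(struct layer *self, struct io_req *req)
{
	int error = 0;

	if (req->pos > self->size || req->len > self->size - req->pos)
		error = -EINVAL;
	else if (req->write)
		memcpy(self->mem + req->pos, req->data, req->len);
	else
		memcpy(req->data, self->mem + req->pos, req->len);
	req->done(req, error);
	return 0;
}

/* Stackable layer: shifts the position and passes the request down. */
static int remap_submit(struct layer *self, struct io_req *req)
{
	req->pos += self->offset;
	return self->lower->submit(self->lower, req);
}

/* Example completion callback: stores the result where private points. */
static void store_status(struct io_req *req, int error)
{
	*(int *)req->private = error;
}
```

In this sketch the bottom layer calls `done()` synchronously to keep the example short. In a kernel setting the callback would more likely be invoked later, for example from a workqueue, which is why callers must not assume the request has finished when `submit()` returns.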
[RFC 04/31] mars: add new module brick_checking
Signed-off-by: Thomas Schoebel-Theuer
---
 include/linux/brick/brick_checking.h | 104 +++
 1 file changed, 104 insertions(+)
 create mode 100644 include/linux/brick/brick_checking.h

diff --git a/include/linux/brick/brick_checking.h b/include/linux/brick/brick_checking.h
new file mode 100644
index 000..a02f1bf
--- /dev/null
+++ b/include/linux/brick/brick_checking.h
@@ -0,0 +1,104 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#ifndef BRICK_CHECKING_H
+#define BRICK_CHECKING_H
+
+/***/
+
+/* checking */
+
+#if defined(CONFIG_MARS_DEBUG) || defined(CONFIG_MARS_CHECKS)
+#define BRICK_CHECKING true
+#else
+#define BRICK_CHECKING false
+#endif
+
+#define _CHECK_ATOMIC(atom, OP, minval) \
+do { \
+	if (BRICK_CHECKING) { \
+		int __test = atomic_read(atom); \
+		if (unlikely(__test OP(minval))) { \
+			atomic_set(atom, minval); \
+			BRICK_ERR("%d: atomic " #atom " " #OP " " #minval " (%d)\n", __LINE__, __test);\
+		} \
+	} \
+} while (0)
+
+#define CHECK_ATOMIC(atom, minval) \
+	_CHECK_ATOMIC(atom, <, minval)
+
+#define CHECK_HEAD_EMPTY(head) \
+do { \
+	if (BRICK_CHECKING && unlikely(!list_empty(head) && (head)->next)) {\
+		list_del_init(head);\
+		BRICK_ERR("%d: list_head " #head " (%p) not empty\n", __LINE__, head);\
+	} \
+} while (0)
+
+#ifdef CONFIG_MARS_DEBUG_MEM
+#define CHECK_PTR_DEAD(ptr, label) \
+do { \
+	if (BRICK_CHECKING && unlikely((ptr) == (void *)0x5a5a5a5a5a5a5a5a)) {\
+		BRICK_FAT("%d: pointer '" #ptr "' is DEAD\n", __LINE__);\
+		goto label; \
+	} \
+} while (0)
+#else
+#define CHECK_PTR_DEAD(ptr, label) /*empty*/
+#endif
+
+#define CHECK_PTR_NULL(ptr, label) \
+do { \
+	CHECK_PTR_DEAD(ptr, label); \
+	if (BRICK_CHECKING && unlikely(!(ptr))) { \
+		BRICK_FAT("%d: pointer '" #ptr "' is NULL\n", __LINE__);\
+		goto label; \
+	} \
+} while (0)
+
+#ifdef CONFIG_MARS_DEBUG
+#define CHECK_PTR(ptr, label) \
+do { \
+	CHECK_PTR_NULL(ptr, label); \
+	if (BRICK_CHECKING && unlikely(!virt_addr_valid(ptr))) {\
+		BRICK_FAT("%d: pointer '" #ptr "' (%p) is no valid virtual KERNEL address\n", __LINE__, ptr);\
+		goto label; \
+	} \
+} while (0)
+#else
+#define CHECK_PTR(ptr, label) CHECK_PTR_NULL(ptr, label)
+#endif
+
+#define CHECK_ASPECT(a_ptr, o_ptr, label) \
+do { \
+	if (BRICK_CHECKING && unlikely((a_ptr)->object != o_ptr)) { \
+		BRICK_FAT("%d
[RFC 11/31] mars: add new module lib_timing
Signed-off-by: Thomas Schoebel-Theuer
---
 drivers/staging/mars/lib/lib_timing.c |  71 +
 include/linux/brick/lib_timing.h      | 181 ++
 2 files changed, 252 insertions(+)
 create mode 100644 drivers/staging/mars/lib/lib_timing.c
 create mode 100644 include/linux/brick/lib_timing.h

diff --git a/drivers/staging/mars/lib/lib_timing.c b/drivers/staging/mars/lib/lib_timing.c
new file mode 100644
index 000..7421dc4
--- /dev/null
+++ b/drivers/staging/mars/lib/lib_timing.c
@@ -0,0 +1,71 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include
+
+#include
+#include
+
+#ifdef CONFIG_DEBUG_KERNEL
+
+int report_timing(struct timing_stats *tim, char *str, int maxlen)
+{
+	int len = 0;
+	int time = 1;
+	int resol = 1;
+
+	static const char * const units[] = {
+		"us",
+		"ms",
+		"s",
+		"ERROR"
+	};
+	const char *unit = units[0];
+	int unit_index = 0;
+	int i;
+
+	for (i = 0; i < TIMING_MAX; i++) {
+		int this_len = scnprintf(str,
+
+			maxlen,
+			"<%d%s = %d (%lld) ",
+			resol,
+			unit,
+			tim->tim_count[i],
+			(long long)tim->tim_count[i] * time);
+		str += this_len;
+		len += this_len;
+		maxlen -= this_len;
+		if (maxlen <= 1)
+			break;
+		resol <<= 1;
+		time <<= 1;
+		if (resol >= 1000) {
+			resol = 1;
+			unit = units[++unit_index];
+		}
+	}
+	return len;
+}
+
+#endif /* CONFIG_DEBUG_KERNEL */
+
+struct threshold global_io_threshold = {
+	.thr_limit = 30 * 100, /* 30 seconds */
+	.thr_factor = 100,
+	.thr_plus = 0,
+};
diff --git a/include/linux/brick/lib_timing.h b/include/linux/brick/lib_timing.h
new file mode 100644
index 000..8a7a1e9
--- /dev/null
+++ b/include/linux/brick/lib_timing.h
@@ -0,0 +1,181 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#ifndef LIB_TIMING_H
+#define LIB_TIMING_H
+
+#include
+
+/* Simple infrastructure for timing of arbitrary operations and creation
+ * of some simple histogram statistics.
+ */
+
+#define TIMING_MAX 24
+
+struct timing_stats {
+#ifdef CONFIG_DEBUG_KERNEL
+	int tim_count[TIMING_MAX];
+
+#endif
+};
+
+#define _TIME_THIS(_stamp1, _stamp2, _CODE) \
+	({ \
+		(_stamp1) = cpu_clock(raw_smp_processor_id()); \
+		\
+		_CODE; \
+		\
+		(_stamp2) = cpu_clock(raw_smp_processor_id()); \
+		(_stamp2) - (_stamp1); \
+	})
+
+#define TIME_THIS(_CODE) \
+	({ \
+		unsigned long long _stamp1; \
+		unsigned long long _stamp2; \
+		_TIME_THIS(_stamp1, _stamp2, _CODE);\
+	})
+
+#ifdef CONFIG_DEBUG_KERNEL
+
+#define _TIME_STATS(_timing, _stamp1, _stamp2, _CODE) \
+	({
[RFC 08/31] mars: add new module lib_queue
Signed-off-by: Thomas Schoebel-Theuer
---
 include/linux/brick/lib_queue.h | 166
 1 file changed, 166 insertions(+)
 create mode 100644 include/linux/brick/lib_queue.h

diff --git a/include/linux/brick/lib_queue.h b/include/linux/brick/lib_queue.h
new file mode 100644
index 000..f1b1a9e
--- /dev/null
+++ b/include/linux/brick/lib_queue.h
@@ -0,0 +1,166 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#ifndef LIB_QUEUE_H
+#define LIB_QUEUE_H
+
+#define QUEUE_ANCHOR(PREFIX, KEYTYPE, HEAPTYPE) \
+	/* parameters */ \
+	/* readonly from outside */ \
+	atomic_t q_queued; \
+	atomic_t q_flying; \
+	atomic_t q_total; \
+	/* tunables */ \
+	int q_batchlen; \
+	int q_io_prio; \
+	bool q_ordering; \
+	/* private */ \
+	wait_queue_head_t *q_event; \
+	spinlock_t q_lock; \
+	struct list_head q_anchor; \
+	struct pairing_heap_##HEAPTYPE *heap_high; \
+	struct pairing_heap_##HEAPTYPE *heap_low; \
+	long long q_last_insert; /* jiffies */ \
+	KEYTYPE heap_margin; \
+	KEYTYPE last_pos; \
+	/* this comment is for keeping TRAILING_SEMICOLON happy */
+
+#define QUEUE_FUNCTIONS(PREFIX, ELEM_TYPE, HEAD, KEYFN, KEYCMP, HEAPTYPE) \
+	\
+static inline \
+void q_##PREFIX##_trigger(struct PREFIX##_queue *q) \
+{ \
+	if (q->q_event) { \
+		wake_up_interruptible(q->q_event); \
+	} \
+} \
+	\
+static inline \
+void q_##PREFIX##_init(struct PREFIX##_queue *q) \
+{ \
+	INIT_LIST_HEAD(&q->q_anchor); \
+	q->heap_low = NULL; \
+	q->heap_high = NULL; \
+	spin_lock_init(&q->q_lock); \
+	atomic_set(&q->q_queued, 0); \
+	atomic_set(&q->q_flying, 0); \
+} \
+	\
+static inline \
+void q_##PREFIX##_insert(struct PREFIX##_queue *q, ELEM_TYPE * elem) \
+{ \
+	unsigned long flags; \
+	\
+	spin_lock_irqsave(&q->q_lock, flags); \
+	\
+	if (q->q_ordering) { \
+		struct pairing_heap_##HEAPTYPE **use = &q->heap_high; \
+		if (KEYCMP(KEYFN(elem), &q->heap_margin) <= 0) {
[RFC 13/31] mars: add new module xio
Signed-off-by: Thomas Schoebel-Theuer
---
 drivers/staging/mars/xio_bricks/xio.c | 161 +
 include/linux/xio/xio.h               | 313 ++
 2 files changed, 474 insertions(+)
 create mode 100644 drivers/staging/mars/xio_bricks/xio.c
 create mode 100644 include/linux/xio/xio.h

diff --git a/drivers/staging/mars/xio_bricks/xio.c b/drivers/staging/mars/xio_bricks/xio.c
new file mode 100644
index 000..94aeb60
--- /dev/null
+++ b/drivers/staging/mars/xio_bricks/xio.c
@@ -0,0 +1,161 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include
+#include
+#include
+#include
+#include
+#include
+
+#include
+
+//
+
+/* infrastructure */
+
+struct banning xio_global_ban = {};
+atomic_t xio_global_io_flying = ATOMIC_INIT(0);
+
+//
+
+/* object stuff */
+
+const struct generic_object_type aio_type = {
+	.object_type_name = "aio",
+	.default_size = sizeof(struct aio_object),
+	.object_type_nr = OBJ_TYPE_AIO,
+};
+
+//
+
+/* brick stuff */
+
+/***/
+
+/* meta descriptions */
+
+const struct meta xio_info_meta[] = {
+	META_INI(current_size, struct xio_info, FIELD_INT),
+	META_INI(tf_align, struct xio_info, FIELD_INT),
+	META_INI(tf_min_size, struct xio_info, FIELD_INT),
+	{}
+};
+
+const struct meta xio_aio_user_meta[] = {
+	META_INI(_object_cb.cb_error, struct aio_object, FIELD_INT),
+	META_INI(io_pos, struct aio_object, FIELD_INT),
+	META_INI(io_len, struct aio_object, FIELD_INT),
+	META_INI(io_may_write, struct aio_object, FIELD_INT),
+	META_INI(io_prio, struct aio_object, FIELD_INT),
+	META_INI(io_cs_mode, struct aio_object, FIELD_INT),
+	META_INI(io_timeout, struct aio_object, FIELD_INT),
+	META_INI(io_total_size, struct aio_object, FIELD_INT),
+	META_INI(io_checksum, struct aio_object, FIELD_RAW),
+	META_INI(io_flags, struct aio_object, FIELD_INT),
+	META_INI(io_rw, struct aio_object, FIELD_INT),
+	META_INI(io_id, struct aio_object, FIELD_INT),
+	META_INI(io_skip_sync, struct aio_object, FIELD_INT),
+	{}
+};
+
+const struct meta xio_timespec_meta[] = {
+	META_INI_TRANSFER(tv_sec, struct timespec, FIELD_UINT, 8),
+	META_INI_TRANSFER(tv_nsec, struct timespec, FIELD_UINT, 4),
+	{}
+};
+
+//
+
+/* crypto stuff */
+
+#include
+#include
+
+static struct crypto_hash *xio_tfm;
+static struct semaphore tfm_sem;
+int xio_digest_size;
+
+void xio_digest(unsigned char *digest, void *data, int len)
+{
+	struct hash_desc desc = {
+		.tfm = xio_tfm,
+		.flags = 0,
+	};
+	struct scatterlist sg;
+
+	memset(digest, 0, xio_digest_size);
+
+	/* TODO: use per-thread instance, omit locking */
+	down(&tfm_sem);
+
+	crypto_hash_init(&desc);
+	sg_init_table(&sg, 1);
+	sg_set_buf(&sg, data, len);
+	crypto_hash_update(&desc, &sg, sg.length);
+	crypto_hash_final(&desc, digest);
+	up(&tfm_sem);
+}
+
+void aio_checksum(struct aio_object *aio)
+{
+	unsigned char checksum[xio_digest_size];
+	int len;
+
+	if (aio->io_cs_mode <= 0 || !aio->io_data)
+		goto out_return;
+	xio_digest(checksum, aio->io_data, aio->io_len);
+
+	len = sizeof(aio->io_checksum);
+	if (len > xio_digest_size)
+		len = xio_digest_size;
+	memcpy(&aio->io_checksum, checksum, len);
+out_return:;
+}
+
+/***/
+
+/* init stuff */
+
+int __init init_xio(void)
+{
+	XIO_INF("init_xio()\n");
+
+	sema_init(&tfm_sem, 1);
+
+	xio_tfm = crypto_alloc_hash("md5", 0, CRYPTO_ALG_ASYNC);
+	if (!xio_tfm) {
+		XIO_ERR("cannot alloc crypto hash\n");
+		return -ENOMEM;
+	}
+	if (IS_ERR(xio_tfm)) {
+		XIO_ERR("alloc crypto hash failed, status
[RFC 19/31] mars: add new module xio_client
Signed-off-by: Thomas Schoebel-Theuer
---
 drivers/staging/mars/xio_bricks/xio_client.c | 1055 ++
 include/linux/xio/xio_client.h               |  105 +++
 2 files changed, 1160 insertions(+)
 create mode 100644 drivers/staging/mars/xio_bricks/xio_client.c
 create mode 100644 include/linux/xio/xio_client.h

diff --git a/drivers/staging/mars/xio_bricks/xio_client.c b/drivers/staging/mars/xio_bricks/xio_client.c
new file mode 100644
index 000..6fdc261
--- /dev/null
+++ b/drivers/staging/mars/xio_bricks/xio_client.c
@@ -0,0 +1,1055 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include
+#include
+#include
+#include
+
+#include
+
+/*** own type definitions ***/
+
+#include
+
+#define CLIENT_HASH_MAX (PAGE_SIZE / sizeof(struct list_head))
+
+int xio_client_abort = 10;
+
+int max_client_channels = 1;
+
+int max_client_bulk = 16;
+
+/*** own helper functions ***/
+
+static int thread_count;
+
+static
+void _do_resubmit(struct client_channel *ch)
+{
+	struct client_output *output = ch->output;
+	unsigned long flags;
+
+	spin_lock_irqsave(&output->lock, flags);
+	if (!list_empty(&ch->wait_list)) {
+		struct list_head *first = ch->wait_list.next;
+		struct list_head *last = ch->wait_list.prev;
+		struct list_head *old_start = output->aio_list.next;
+
+#define list_connect __list_del /* the original routine has a misleading name: in reality it is more general */
+		list_connect(&output->aio_list, first);
+		list_connect(last, old_start);
+		INIT_LIST_HEAD(&ch->wait_list);
+	}
+	spin_unlock_irqrestore(&output->lock, flags);
+}
+
+static
+void _kill_thread(struct client_threadinfo *ti, const char *name)
+{
+	struct task_struct *thread = ti->thread;
+
+	if (thread) {
+		XIO_DBG("stopping %s thread\n", name);
+		ti->thread = NULL;
+		brick_thread_stop(thread);
+	}
+}
+
+static
+void _kill_channel(struct client_channel *ch)
+{
+	XIO_DBG("channel = %p\n", ch);
+	if (xio_socket_is_alive(&ch->socket)) {
+		XIO_DBG("shutdown socket\n");
+		xio_shutdown_socket(&ch->socket);
+	}
+	_kill_thread(&ch->receiver, "receiver");
+	if (ch->is_open) {
+		XIO_DBG("close socket\n");
+		xio_put_socket(&ch->socket);
+	}
+	ch->recv_error = 0;
+	ch->is_used = false;
+	ch->is_open = false;
+	ch->is_connected = false;
+	/* Re-Submit any waiting requests
+	 */
+	_do_resubmit(ch);
+}
+
+static inline
+void _kill_all_channels(struct client_bundle *bundle)
+{
+	int i;
+
+	/* first pass: shutdown in parallel without waiting */
+	for (i = 0; i < MAX_CLIENT_CHANNELS; i++) {
+		struct client_channel *ch = &bundle->channel[i];
+
+		if (xio_socket_is_alive(&ch->socket)) {
+			XIO_DBG("shutdown socket %d\n", i);
+			xio_shutdown_socket(&ch->socket);
+		}
+	}
+	/* separate pass (may wait) */
+	for (i = 0; i < MAX_CLIENT_CHANNELS; i++)
+		_kill_channel(&bundle->channel[i]);
+}
+
+static int receiver_thread(void *data);
+
+static
+int _setup_channel(struct client_bundle *bundle, int ch_nr)
+{
+	struct client_channel *ch = &bundle->channel[ch_nr];
+	struct sockaddr_storage src_sockaddr;
+	struct sockaddr_storage dst_sockaddr;
+	int status;
+
+	ch->ch_nr = ch_nr;
+	if (unlikely(ch->receiver.thread)) {
+		XIO_WRN("receiver thread %d unexpectedly not dead\n", ch_nr);
+		_kill_thread(&ch->receiver, "receiver");
+	}
+
+	status = xio_create_sockaddr(&src_sockaddr, my_id());
+	if (unlikely(status < 0)) {
+		XIO_DBG("no src sockaddr, status = %d\n", status);
+		goto done;
+	}
+
+	status = xio_create_sockaddr(&dst_sockaddr, bundle->host);
+	if (unlikely(status < 0)) {
+		XIO_DBG("no dst sockaddr, status = %d\n", status);
+		goto
[RFC 27/31] mars: add new module mars_proc
Signed-off-by: Thomas Schoebel-Theuer
---
 drivers/staging/mars/mars_light/mars_proc.c | 369
 include/linux/mars_light/mars_proc.h        |  34 +++
 2 files changed, 403 insertions(+)
 create mode 100644 drivers/staging/mars/mars_light/mars_proc.c
 create mode 100644 include/linux/mars_light/mars_proc.h

diff --git a/drivers/staging/mars/mars_light/mars_proc.c b/drivers/staging/mars/mars_light/mars_proc.c
new file mode 100644
index 000..2a96614
--- /dev/null
+++ b/drivers/staging/mars/mars_light/mars_proc.c
@@ -0,0 +1,369 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include
+#include
+#include
+
+#include
+#include
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+xio_info_fn xio_info;
+
+static
+int trigger_sysctl_handler(
+	struct ctl_table *table,
+	int write,
+	void __user *buffer,
+	size_t *length,
+	loff_t *ppos)
+{
+	ssize_t res = 0;
+	size_t len = *length;
+
+	XIO_DBG("write = %d len = %ld pos = %lld\n", write, len, *ppos);
+
+	if (!len || *ppos > 0)
+		goto done;
+
+	if (write) {
+		char tmp[8] = {};
+
+		res = len; /* fake consumption of all data */
+
+		if (len > 7)
+			len = 7;
+		if (!copy_from_user(tmp, buffer, len)) {
+			int code = 0;
+			int status = kstrtoint(tmp, 10, &code);
+
+			/* the return value from ssanf() does not matter */
+			(void)status;
+			if (code > 0)
+				local_trigger();
+			if (code > 1)
+				remote_trigger();
+		}
+	} else {
+		char *answer = "MARS module not operational\n";
+		char *tmp = NULL;
+		int mylen;
+
+		if (xio_info) {
+			answer = "internal error while determining xio_info\n";
+			tmp = xio_info();
+			if (tmp)
+				answer = tmp;
+		}
+
+		mylen = strlen(answer);
+		if (len > mylen)
+			len = mylen;
+		res = len;
+		if (copy_to_user(buffer, answer, len)) {
+			XIO_ERR("write %ld bytes at %p failed\n", len, buffer);
+			res = -EFAULT;
+		}
+		brick_string_free(tmp);
+	}
+
+done:
+	XIO_DBG("res = %ld\n", res);
+	*length = res;
+	if (res >= 0) {
+		*ppos += res;
+		return 0;
+	}
+	return res;
+}
+
+static
+int lamport_sysctl_handler(
+	struct ctl_table *table,
+	int write,
+	void __user *buffer,
+	size_t *length,
+	loff_t *ppos)
+{
+	ssize_t res = 0;
+	size_t len = *length;
+
+	XIO_DBG("write = %d len = %ld pos = %lld\n", write, len, *ppos);
+
+	if (!len || *ppos > 0)
+		goto done;
+
+	if (write) {
+		return -EINVAL;
+	} else {
+		int my_len = 128;
+		char *tmp = brick_string_alloc(my_len);
+		struct timespec know = CURRENT_TIME;
+		struct timespec lnow;
+
+		get_lamport(&lnow);
+
+		res = scnprintf(tmp, my_len,
+			"CURRENT_TIME=%ld.%09ld\nlamport_now=%ld.%09ld\n",
+			know.tv_sec, know.tv_nsec,
+			lnow.tv_sec, lnow.tv_nsec
+			);
+
+		if (copy_to_user(buffer, tmp, res)) {
+			XIO_ERR("write %ld bytes at %p failed\n", res, buffer);
+			res = -EFAULT;
+		}
+		brick_string_free(tmp);
+	}
+
+done:
+	XIO_DBG("res = %ld\n", res);
+	*length = res;
+	if (res >= 0) {
+		*ppos += res;
+		return 0;
+	}
+	return res;
+}
+
+#ifdef CTL_UNNUMBERED
+#define _CTL_NAME .ctl_name = CTL_UNNUMBERED,
+#define _CTL_STRATEGY(handler) .strategy = &handler,
+#else
+#defin
[RFC 29/31] mars: add new module Makefile
Signed-off-by: Thomas Schoebel-Theuer
---
 drivers/staging/mars/Makefile | 61 +++
 1 file changed, 61 insertions(+)
 create mode 100644 drivers/staging/mars/Makefile

diff --git a/drivers/staging/mars/Makefile b/drivers/staging/mars/Makefile
new file mode 100644
index 000..13d68cc
--- /dev/null
+++ b/drivers/staging/mars/Makefile
@@ -0,0 +1,61 @@
+#
+# Makefile for MARS
+#
+
+# remove_this
+ifndef CONFIG_MARS
+# mars_config.h is generated by a simple Kconfig parser (gen_config.pl)
+# at build time.
+# It does not respect any Kconfig dependencies.
+# Therefore, it is unsafe. Use at your own risk!
+# It is ONLY used for out-of-tree builds.
+#
+CONFIG_MARS_BIGMODULE := m
+CONFIG_MARS_NET_COMPAT := y
+obj-$(CONFIG_MARS_BIGMODULE) += mars.o
+extra-y += mars_config.h
+GEN_CONFIG_SCRIPT := $(src)/../scripts/gen_config.pl
+$(obj)/mars_config.h: $(obj)/buildtag.h
+$(obj)/mars_config.h: $(src)/Kconfig $(GEN_CONFIG_SCRIPT)
+	$(Q)$(kecho) "MARS: using compiler $($(CC) --version | head -1)"
+	$(CC) -v
+	$(Q)$(kecho) "MARS: Generating $@"
+	$(Q)set -e; \
+	if [ ! -x $(GEN_CONFIG_SCRIPT) ]; then \
+		$(kecho) "MARS: cannot execute script $(GEN_CONFIG_SCRIPT)"; \
+		/bin/false; \
+	fi; \
+	cat $< | $(GEN_CONFIG_SCRIPT) > $@;
+	cat $@;
+endif
+# end_remove_this
+
+obj-$(CONFIG_MARS) += mars.o
+
+KBUILD_CFLAGS += -fdelete-null-pointer-checks
+
+mars-objs := \
+	lamport.o \
+	brick_say.o \
+	brick_mem.o \
+	brick.o \
+	xio_bricks/xio.o \
+	xio_bricks/lib_log.o \
+	lib/lib_rank.o \
+	lib/lib_limiter.o \
+	lib/lib_timing.o \
+	xio_bricks/lib_mapfree.o \
+	xio_bricks/xio_net.o \
+	mars_light/light_server_strategy.o \
+	xio_bricks/xio_server.o \
+	xio_bricks/xio_client.o \
+	xio_bricks/xio_sio.o \
+	xio_bricks/xio_bio.o \
+	xio_bricks/xio_if.o \
+	xio_bricks/xio_copy.o \
+	xio_bricks/xio_trans_logger.o \
+	mars_light/light_strategy.o \
+	mars_light/light_net.o \
+	mars_light/mars_proc.o \
+	mars_light/mars_light.o
+
--
2.6.4
[RFC 06/31] mars: add new module brick
Signed-off-by: Thomas Schoebel-Theuer
---
 drivers/staging/mars/brick.c | 728 +++
 include/linux/brick/brick.h  | 642 ++
 2 files changed, 1370 insertions(+)
 create mode 100644 drivers/staging/mars/brick.c
 create mode 100644 include/linux/brick/brick.h

diff --git a/drivers/staging/mars/brick.c b/drivers/staging/mars/brick.c
new file mode 100644
index 000..9c3d5b9
--- /dev/null
+++ b/drivers/staging/mars/brick.c
@@ -0,0 +1,728 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include
+#include
+#include
+
+#define _STRATEGY
+
+#include
+#include
+
+//
+
+/* init / exit functions */
+
+void _generic_output_init(struct generic_brick *brick,
+	const struct generic_output_type *type,
+	struct generic_output *output)
+{
+	output->brick = brick;
+	output->type = type;
+	output->ops = type->master_ops;
+	output->nr_connected = 0;
+	INIT_LIST_HEAD(&output->output_head);
+}
+
+void _generic_output_exit(struct generic_output *output)
+{
+	list_del_init(&output->output_head);
+	output->brick = NULL;
+	output->type = NULL;
+	output->ops = NULL;
+	output->nr_connected = 0;
+}
+
+int generic_brick_init(const struct generic_brick_type *type, struct generic_brick *brick)
+{
+	brick->aspect_context.brick_index = get_brick_nr();
+	brick->type = type;
+	brick->ops = type->master_ops;
+	brick->nr_inputs = 0;
+	brick->nr_outputs = 0;
+	brick->power.off_led = true;
+	init_waitqueue_head(&brick->power.event);
+	INIT_LIST_HEAD(&brick->tmp_head);
+	return 0;
+}
+
+void generic_brick_exit(struct generic_brick *brick)
+{
+	list_del_init(&brick->tmp_head);
+	brick->type = NULL;
+	brick->ops = NULL;
+	brick->nr_inputs = 0;
+	brick->nr_outputs = 0;
+	put_brick_nr(brick->aspect_context.brick_index);
+}
+
+int generic_input_init(struct generic_brick *brick,
+	int index,
+	const struct generic_input_type *type,
+	struct generic_input *input)
+{
+	if (index < 0 || index >= brick->type->max_inputs)
+		return -EINVAL;
+	if (brick->inputs[index])
+		return -EEXIST;
+	input->brick = brick;
+	input->type = type;
+	input->connect = NULL;
+	INIT_LIST_HEAD(&input->input_head);
+	brick->inputs[index] = input;
+	brick->nr_inputs++;
+	return 0;
+}
+
+void generic_input_exit(struct generic_input *input)
+{
+	list_del_init(&input->input_head);
+	input->brick = NULL;
+	input->type = NULL;
+	input->connect = NULL;
+}
+
+int generic_output_init(struct generic_brick *brick,
+	int index,
+	const struct generic_output_type *type,
+	struct generic_output *output)
+{
+	if (index < 0 || index >= brick->type->max_outputs)
+		return -ENOMEM;
+	if (brick->outputs[index])
+		return -EEXIST;
+	_generic_output_init(brick, type, output);
+	brick->outputs[index] = output;
+	brick->nr_outputs++;
+	return 0;
+}
+
+int generic_size(const struct generic_brick_type *brick_type)
+{
+	int size = brick_type->brick_size;
+	int i;
+
+	size += brick_type->max_inputs * sizeof(void *);
+	for (i = 0; i < brick_type->max_inputs; i++)
+		size += brick_type->default_input_types[i]->input_size;
+	size += brick_type->max_outputs * sizeof(void *);
+	for (i = 0; i < brick_type->max_outputs; i++)
+		size += brick_type->default_output_types[i]->output_size;
+	return size;
+}
+
+int generic_connect(struct generic_input *input, struct generic_output *output)
+{
+	BRICK_DBG("generic_connect(input=%p, output=%p)\n", input, output);
+	if (unlikely(!input || !output))
+		return -EINVAL;
+	if (unlikely(input->connect))
+		return -EEXIST;
+	if (unlikely(!list_empty(&input->input_head)))
+		return -EINVAL;
+	/* helps only against the most common errors */
+	if (unlikely(input->brick == output->bri
[RFC 26/31] mars: add new module light_server_strategy
Signed-off-by: Thomas Schoebel-Theuer
---
 .../mars/mars_light/light_server_strategy.c | 403 +
 1 file changed, 403 insertions(+)
 create mode 100644 drivers/staging/mars/mars_light/light_server_strategy.c

diff --git a/drivers/staging/mars/mars_light/light_server_strategy.c b/drivers/staging/mars/mars_light/light_server_strategy.c
new file mode 100644
index 000..6bb5cd7
--- /dev/null
+++ b/drivers/staging/mars/mars_light/light_server_strategy.c
@@ -0,0 +1,403 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+/* MARS Light specific parts of xio_server
+ */
+
+#include
+#include
+#include
+
+#define _STRATEGY
+#include
+#include
+#include
+#include
+
+#include
+
+#include
+
+static
+int dummy_worker(struct mars_global *global, struct mars_dent *dent, bool prepare, bool direction)
+{
+	return 0;
+}
+
+static
+int _set_server_sio_params(struct xio_brick *_brick, void *private)
+{
+	struct sio_brick *sio_brick = (void *)_brick;
+
+	if (_brick->type != (void *)_sio_brick_type) {
+		XIO_ERR("bad brick type\n");
+		return -EINVAL;
+	}
+	sio_brick->o_direct = false;
+	sio_brick->o_fdsync = false;
+	XIO_INF("name = '%s' path = '%s'\n", _brick->brick_name, _brick->brick_path);
+	return 1;
+}
+
+static
+int _set_server_bio_params(struct xio_brick *_brick, void *private)
+{
+	struct bio_brick *bio_brick;
+
+	if (_brick->type == (void *)_sio_brick_type)
+		return _set_server_sio_params(_brick, private);
+	if (_brick->type != (void *)_bio_brick_type) {
+		XIO_ERR("bad brick type\n");
+		return -EINVAL;
+	}
+	bio_brick = (void *)_brick;
+	bio_brick->ra_pages = 0;
+	bio_brick->do_noidle = true;
+	bio_brick->do_sync = true;
+	bio_brick->do_unplug = true;
+	XIO_INF("name = '%s' path = '%s'\n", _brick->brick_name, _brick->brick_path);
+	return 1;
+}
+
+int handler_thread(void *data)
+{
+	struct mars_global handler_global = {
+		.dent_anchor = LIST_HEAD_INIT(handler_global.dent_anchor),
+		.brick_anchor = LIST_HEAD_INIT(handler_global.brick_anchor),
+		.global_power = {
+			.button = true,
+		},
+		.main_event = __WAIT_QUEUE_HEAD_INITIALIZER(handler_global.main_event),
+	};
+	struct task_struct *thread = NULL;
+	struct server_brick *brick = data;
+	struct xio_socket *sock = &brick->handler_socket;
+	bool ok = xio_get_socket(sock);
+	unsigned long statist_jiffies = jiffies;
+	int debug_nr;
+	int status = -EINVAL;
+
+	init_rwsem(&handler_global.dent_mutex);
+	init_rwsem(&handler_global.brick_mutex);
+
+	XIO_DBG("#%d --- handler_thread starting on socket %p\n", sock->s_debug_nr, sock);
+	if (!ok)
+		goto done;
+
+	thread = brick_thread_create(cb_thread, brick, "xio_cb%d", brick->version);
+	if (unlikely(!thread)) {
+		XIO_ERR("cannot create cb thread\n");
+		status = -ENOENT;
+		goto done;
+	}
+	brick->cb_thread = thread;
+
+	brick->handler_running = true;
+	wake_up_interruptible(&brick->startup_event);
+
+	while (!list_empty(&handler_global.brick_anchor) ||
+	       xio_socket_is_alive(sock)) {
+		struct xio_cmd cmd = {};
+
+		handler_global.global_version++;
+
+		if (!list_empty(&handler_global.brick_anchor)) {
+			if (server_show_statist && !time_is_before_jiffies(statist_jiffies + 10 * HZ)) {
+				show_statistics(&handler_global, "handler");
+				statist_jiffies = jiffies;
+			}
+			if (!xio_socket_is_alive(sock) &&
+			    atomic_read(&brick->in_flight) <= 0 &&
+			    brick->conn_brick) {
+				if (generic_disconnect((void *)brick->inputs[0]) >= 0)
+					brick->conn_brick = NULL;
+			}
+
+
[RFC 15/31] mars: add new module lib_mapfree
Signed-off-by: Thomas Schoebel-Theuer
---
 drivers/staging/mars/xio_bricks/lib_mapfree.c | 380 ++
 include/linux/xio/lib_mapfree.h               |  84 ++
 2 files changed, 464 insertions(+)
 create mode 100644 drivers/staging/mars/xio_bricks/lib_mapfree.c
 create mode 100644 include/linux/xio/lib_mapfree.h

diff --git a/drivers/staging/mars/xio_bricks/lib_mapfree.c b/drivers/staging/mars/xio_bricks/lib_mapfree.c
new file mode 100644
index 000..6b464d7
--- /dev/null
+++ b/drivers/staging/mars/xio_bricks/lib_mapfree.c
@@ -0,0 +1,380 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+/* time to wait between background mapfree operations */
+int mapfree_period_sec = 10;
+
+/* some grace space where no regular cleanup should occur */
+int mapfree_grace_keep_mb = 16;
+
+static
+DECLARE_RWSEM(mapfree_mutex);
+
+static
+LIST_HEAD(mapfree_list);
+
+void mapfree_pages(struct mapfree_info *mf, int grace_keep)
+{
+	struct address_space *mapping;
+	pgoff_t start;
+	pgoff_t end;
+
+	if (unlikely(!mf))
+		goto done;
+	if (unlikely(!mf->mf_filp))
+		goto done;
+
+	mapping = mf->mf_filp->f_mapping;
+	if (unlikely(!mapping))
+		goto done;
+
+	if (grace_keep < 0) { /* force full flush */
+		start = 0;
+		end = -1;
+	} else {
+		unsigned long flags;
+		loff_t tmp;
+		loff_t min;
+
+		spin_lock_irqsave(&mf->mf_lock, flags);
+
+		min = tmp = mf->mf_min[0];
+		if (likely(mf->mf_min[1] < min))
+			min = mf->mf_min[1];
+		if (tmp) {
+			mf->mf_min[1] = tmp;
+			mf->mf_min[0] = 0;
+		}
+
+		spin_unlock_irqrestore(&mf->mf_lock, flags);
+
+		min -= (loff_t)grace_keep * (1024 * 1024); /* megabytes */
+		end = 0;
+
+		if (min > 0 || mf->mf_last) {
+			start = mf->mf_last / PAGE_SIZE;
+			/* add some grace overlapping */
+			if (likely(start > 0))
+				start--;
+			mf->mf_last = min;
+			end = min / PAGE_SIZE;
+		} else { /* there was no progress for at least 2 rounds */
+			start = 0;
+			if (!grace_keep) /* also flush thoroughly */
+				end = -1;
+		}
+
+		XIO_DBG("file = '%s' start = %lu end = %lu\n", mf->mf_name, start, end);
+	}
+
+	if (end > start || end == -1)
+		invalidate_mapping_pages(mapping, start, end);
+
+done:;
+}
+
+static
+void _mapfree_put(struct mapfree_info *mf)
+{
+	if (atomic_dec_and_test(&mf->mf_count)) {
+		XIO_DBG("closing file '%s' filp = %p\n", mf->mf_name, mf->mf_filp);
+		list_del_init(&mf->mf_head);
+		CHECK_HEAD_EMPTY(&mf->mf_dirty_anchor);
+		if (likely(mf->mf_filp)) {
+			mapfree_pages(mf, -1);
+			filp_close(mf->mf_filp, NULL);
+		}
+		brick_string_free(mf->mf_name);
+		brick_mem_free(mf);
+	}
+}
+
+void mapfree_put(struct mapfree_info *mf)
+{
+	if (likely(mf)) {
+		down_write(&mapfree_mutex);
+		_mapfree_put(mf);
+		up_write(&mapfree_mutex);
+	}
+}
+
+struct mapfree_info *mapfree_get(const char *name, int flags)
+{
+	struct mapfree_info *mf = NULL;
+	struct list_head *tmp;
+
+	if (!(flags & O_DIRECT)) {
+		down_read(&mapfree_mutex);
+		for (tmp = mapfree_list.next; tmp != &mapfree_list; tmp = tmp->next) {
+			struct mapfree_info *_mf = container_of(tmp, struct mapfree_info, mf_head);
+
+			if (_mf->mf_flags == flags && !strcmp(_mf->mf_name, name)) {
+				mf = _mf;
+				atomic_inc(&mf->mf_count);
+				break;
+
[RFC 17/31] mars: add new module xio_bio
Signed-off-by: Thomas Schoebel-Theuer
---
 drivers/staging/mars/xio_bricks/xio_bio.c | 845 ++
 include/linux/xio/xio_bio.h               |  85 +++
 2 files changed, 930 insertions(+)
 create mode 100644 drivers/staging/mars/xio_bricks/xio_bio.c
 create mode 100644 include/linux/xio/xio_bio.h

diff --git a/drivers/staging/mars/xio_bricks/xio_bio.c b/drivers/staging/mars/xio_bricks/xio_bio.c
new file mode 100644
index 000..ef18325
--- /dev/null
+++ b/drivers/staging/mars/xio_bricks/xio_bio.c
@@ -0,0 +1,845 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+/* Bio brick (interface to blkdev IO via kernel bios) */
+
+#include
+#include
+#include
+#include
+
+#include
+#include
+#include
+
+#include
+static struct timing_stats timings[2];
+
+struct threshold bio_submit_threshold = {
+	.thr_ban = _global_ban,
+	.thr_parent = _io_threshold,
+	.thr_limit = BIO_SUBMIT_MAX_LATENCY,
+	.thr_factor = 100,
+	.thr_plus = 0,
+};
+
+struct threshold bio_io_threshold[2] = {
+	[0] = {
+		.thr_ban = _global_ban,
+		.thr_parent = _io_threshold,
+		.thr_limit = BIO_IO_R_MAX_LATENCY,
+		.thr_factor = 10,
+		.thr_plus = 1,
+	},
+	[1] = {
+		.thr_ban = _global_ban,
+		.thr_parent = _io_threshold,
+		.thr_limit = BIO_IO_W_MAX_LATENCY,
+		.thr_factor = 10,
+		.thr_plus = 1,
+	},
+};
+
+/*** own type definitions ***/
+
+/*** own helper functions ***/
+
+/* This is called from the kernel bio layer.
+ */
+static
+void bio_callback(struct bio *bio)
+{
+	struct bio_aio_aspect *aio_a = bio->bi_private;
+	struct bio_brick *brick;
+	unsigned long flags;
+
+	CHECK_PTR(aio_a, err);
+	CHECK_PTR(aio_a->output, err);
+	brick = aio_a->output->brick;
+	CHECK_PTR(brick, err);
+
+	aio_a->status_code = bio->bi_error;
+
+	spin_lock_irqsave(&brick->lock, flags);
+	list_del(&aio_a->io_head);
+	list_add_tail(&aio_a->io_head, &brick->completed_list);
+	atomic_inc(&brick->completed_count);
+	spin_unlock_irqrestore(&brick->lock, flags);
+
+	wake_up_interruptible(&brick->response_event);
+	goto out_return;
+err:
+	XIO_FAT("cannot handle bio callback\n");
+out_return:;
+}
+
+/* Map from kernel address/length to struct page (if not already known),
+ * check alignment constraints, create bio from it.
+ * Return the length (may be smaller than requested).
+ */
+static
+int make_bio(struct bio_brick *brick,
+	void *data,
+	int len,
+	loff_t pos,
+	struct bio_aio_aspect *private,
+	struct bio **_bio)
+{
+	unsigned long long sector;
+	int sector_offset;
+	int data_offset;
+	int page_offset;
+	int page_len;
+	int bvec_count;
+	int rest_len = len;
+	int result_len = 0;
+	int status;
+	int i;
+	struct bio *bio = NULL;
+	struct block_device *bdev;
+
+	status = -EINVAL;
+	CHECK_PTR(brick, out);
+	bdev = brick->bdev;
+	CHECK_PTR(bdev, out);
+
+	if (unlikely(rest_len <= 0)) {
+		XIO_ERR("bad bio len %d\n", rest_len);
+		goto out;
+	}
+
+	sector = pos >> 9; /* TODO: make dynamic */
+	sector_offset = pos & ((1 << 9) - 1); /* TODO: make dynamic */
+	data_offset = ((unsigned long)data) & ((1 << 9) - 1); /* TODO: make dynamic */
+
+	if (unlikely(sector_offset > 0)) {
+		XIO_ERR("odd sector offset %d\n", sector_offset);
+		goto out;
+	}
+	if (unlikely(sector_offset != data_offset)) {
+		XIO_ERR("bad alignment: sector_offset %d != data_offset %d\n", sector_offset, data_offset);
+		goto out;
+	}
+	if (unlikely(rest_len & ((1 << 9) - 1))) {
+		XIO_ERR("odd length %d\n", rest_len);
+		goto out;
+	}
+
+	page_offset = ((unsigned long)data) & (PAGE_SIZE-1);
+	page_len = rest_len + page_offset;
+	bvec_count = (page_len - 1) / PAGE_S
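For review purposes, the alignment checks at the top of make_bio() can be exercised in user space. The following is only a sketch under stated assumptions: check_bio_alignment(), SECTOR_SHIFT and SECTOR_MASK are hypothetical names that do not appear in the patch, the 512-byte sector size mirrors the hard-coded shifts marked "TODO: make dynamic", and rejecting a negative position is an addition of the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hard-coded 512-byte sectors, mirroring the "TODO: make dynamic" shifts. */
#define SECTOR_SHIFT	9
#define SECTOR_MASK	((1 << SECTOR_SHIFT) - 1)

/*
 * Hypothetical helper, not part of the patch: validate one request the
 * way make_bio() does and report the starting sector.
 * Returns 0 on success, -EINVAL on any violated constraint.
 */
static int check_bio_alignment(const void *data, int len, long long pos,
			       unsigned long long *sector)
{
	int sector_offset;
	int data_offset;

	assert(sector != NULL);

	/* Negative positions are rejected only in this sketch. */
	if (len <= 0 || pos < 0)
		return -EINVAL;

	sector_offset = (int)(pos & SECTOR_MASK);
	data_offset = (int)((uintptr_t)data & SECTOR_MASK);

	if (sector_offset > 0)			/* "odd sector offset" */
		return -EINVAL;
	if (sector_offset != data_offset)	/* "bad alignment" */
		return -EINVAL;
	if (len & SECTOR_MASK)			/* "odd length" */
		return -EINVAL;

	*sector = (unsigned long long)pos >> SECTOR_SHIFT;
	return 0;
}
```

Since sector_offset must be zero and must equal data_offset, the checks together require the buffer address, the device position and the length to all be multiples of 512.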
[RFC 16/31] mars: add new module lib_log
Signed-off-by: Thomas Schoebel-Theuer
---
 drivers/staging/mars/xio_bricks/lib_log.c | 505 ++
 include/linux/xio/lib_log.h               | 329 +++
 2 files changed, 834 insertions(+)
 create mode 100644 drivers/staging/mars/xio_bricks/lib_log.c
 create mode 100644 include/linux/xio/lib_log.h

diff --git a/drivers/staging/mars/xio_bricks/lib_log.c b/drivers/staging/mars/xio_bricks/lib_log.c
new file mode 100644
index 000..a8382e5
--- /dev/null
+++ b/drivers/staging/mars/xio_bricks/lib_log.c
@@ -0,0 +1,505 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include
+#include
+#include
+
+#include
+
+atomic_t global_aio_flying = ATOMIC_INIT(0);
+
+void exit_logst(struct log_status *logst)
+{
+	int count;
+
+	log_flush(logst);
+
+	/* TODO: replace by event */
+	count = 0;
+	while (atomic_read(&logst->aio_flying) > 0) {
+		if (!count++)
+			XIO_DBG("waiting for IO terminating...");
+		brick_msleep(500);
+	}
+	if (logst->read_aio) {
+		XIO_DBG("putting read_aio\n");
+		GENERIC_INPUT_CALL(logst->input, aio_put, logst->read_aio);
+		logst->read_aio = NULL;
+	}
+	if (logst->log_aio) {
+		XIO_DBG("putting log_aio\n");
+		GENERIC_INPUT_CALL(logst->input, aio_put, logst->log_aio);
+		logst->log_aio = NULL;
+	}
+}
+
+void init_logst(struct log_status *logst, struct xio_input *input, loff_t start_pos, loff_t end_pos)
+{
+	exit_logst(logst);
+
+	memset(logst, 0, sizeof(struct log_status));
+
+	logst->input = input;
+	logst->brick = input->brick;
+	logst->start_pos = start_pos;
+	logst->log_pos = start_pos;
+	logst->end_pos = end_pos;
+	init_waitqueue_head(&logst->event);
+}
+
+#define XIO_LOG_CB_MAX 32
+
+struct log_cb_info {
+	struct aio_object *aio;
+	struct log_status *logst;
+	struct semaphore mutex;
+	atomic_t refcount;
+	int nr_cb;
+	void (*endios[XIO_LOG_CB_MAX])(void *private, int error);
+	void *privates[XIO_LOG_CB_MAX];
+};
+
+static
+void put_log_cb_info(struct log_cb_info *cb_info)
+{
+	if (atomic_dec_and_test(&cb_info->refcount))
+		brick_mem_free(cb_info);
+}
+
+static
+void _do_callbacks(struct log_cb_info *cb_info, int error)
+{
+	int i;
+
+	down(&cb_info->mutex);
+	for (i = 0; i < cb_info->nr_cb; i++) {
+		void (*end_fn)(void *private, int error);
+
+		end_fn = cb_info->endios[i];
+		cb_info->endios[i] = NULL;
+		if (end_fn)
+			end_fn(cb_info->privates[i], error);
+	}
+	up(&cb_info->mutex);
+}
+
+static
+void log_write_endio(struct generic_callback *cb)
+{
+	struct log_cb_info *cb_info = cb->cb_private;
+	struct log_status *logst;
+
+	LAST_CALLBACK(cb);
+	CHECK_PTR(cb_info, err);
+
+	logst = cb_info->logst;
+	CHECK_PTR(logst, done);
+
+	_do_callbacks(cb_info, cb->cb_error);
+
+done:
+	put_log_cb_info(cb_info);
+	atomic_dec(&logst->aio_flying);
+	atomic_dec(&global_aio_flying);
+	if (logst->signal_event)
+		wake_up_interruptible(logst->signal_event);
+
+	goto out_return;
+err:
+	XIO_FAT("internal pointer corruption\n");
+out_return:;
+}
+
+void log_flush(struct log_status *logst)
+{
+	struct aio_object *aio = logst->log_aio;
+	struct log_cb_info *cb_info;
+	int align_size;
+	int gap;
+
+	if (!aio || !logst->count)
+		goto out_return;
+	gap = 0;
+	align_size = (logst->align_size / PAGE_SIZE) * PAGE_SIZE;
+	if (align_size > 0) {
+		/* round up to next alignment border */
+		int align_offset = logst->offset & (align_size-1);
+
+		if (align_offset > 0) {
+			int restlen = aio->io_len - logst->offset;
+
+			gap = align_size - align_offset;
+			if (unlikely(gap > restlen))
+				gap = restlen;
+		}
+	}
+	if (gap
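The "round up to next alignment border" arithmetic in log_flush() is easy to check in isolation. The sketch below lifts it into a plain function; flush_gap() and its parameter list are hypothetical, with page_size standing in for PAGE_SIZE. Note that the mask `align_size - 1` only behaves as a modulo when the rounded align_size is a power of two, which the sketch assumes just as the quoted code appears to.

```c
#include <assert.h>

/*
 * Hypothetical stand-alone version of the gap computation in log_flush().
 * Returns the number of padding bytes needed to reach the next alignment
 * border, clamped to the space left in the current buffer.
 */
static int flush_gap(int align_size, int page_size, int offset, int io_len)
{
	int gap = 0;

	assert(page_size > 0 && offset >= 0 && offset <= io_len);

	/* Alignments below one page are rounded down to 0, i.e. disabled. */
	align_size = (align_size / page_size) * page_size;
	if (align_size > 0) {
		/* Assumes align_size is a power of two. */
		int align_offset = offset & (align_size - 1);

		if (align_offset > 0) {
			int restlen = io_len - offset;

			gap = align_size - align_offset;
			if (gap > restlen)
				gap = restlen;
		}
	}
	return gap;
}
```

With a 4096-byte alignment and an offset of 5000, the offset within the border is 904, so 3192 bytes of padding are needed unless fewer bytes remain in the buffer.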
[RFC 14/31] mars: add new module xio_net
Signed-off-by: Thomas Schoebel-Theuer
---
 drivers/staging/mars/xio_bricks/xio_net.c | 1830 +
 include/linux/xio/xio_net.h               |  171 +++
 2 files changed, 2001 insertions(+)
 create mode 100644 drivers/staging/mars/xio_bricks/xio_net.c
 create mode 100644 include/linux/xio/xio_net.h

diff --git a/drivers/staging/mars/xio_bricks/xio_net.c b/drivers/staging/mars/xio_bricks/xio_net.c
new file mode 100644
index 000..dcc443c
--- /dev/null
+++ b/drivers/staging/mars/xio_bricks/xio_net.c
@@ -0,0 +1,1830 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include
+#include
+#include
+#include
+#include
+#include
+
+#include
+#include
+
+/**/
+
+/* provisionary version detection */
+
+#ifndef TCP_MAX_REORDERING
+#define __HAS_IOV_ITER
+#endif
+
+#ifdef sk_net_refcnt
+/* see eeb1bd5c40edb0e2fd925c8535e2fdebdbc5cef2 */
+#define __HAS_STRUCT_NET
+#endif
+
+/**/
+
+#define USE_BUFFERING
+
+#define SEND_PROTO_VERSION 2
+
+enum COMPRESS_TYPES {
+	COMPRESS_NONE = 0,
+	COMPRESS_LZO = 1,
+	/* insert further methods here */
+};
+
+int xio_net_compress_data;
+
+const u16 net_global_flags = 0
+#ifdef __HAVE_LZO
+	| COMPRESS_LZO
+#endif
+	;
+
+/**/
+
+/* Internal data structures for low-level transfer of C structures
+ * described by struct meta.
+ * Only these low-level fields need to have a fixed size like s64.
+ * The size and bytesex of the higher-level C structures is converted
+ * automatically; therefore classical "int" or "long long" etc is viable.
+ */
+
+#define MAX_FIELD_LEN (32 + 16)
+
+/* Please keep this at a size of 64 bytes by
+ * reuse of *spare* fields.
+ */
+struct xio_desc_cache {
+	u8 cache_sender_proto;
+	u8 cache_recver_proto;
+	s8 cache_is_bigendian;
+	u8 cache_spare0;
+	s16 cache_items;
+	u16 cache_spare1;
+	u32 cache_spare2;
+	u32 cache_spare3;
+	u64 cache_spare4[4];
+	u64 cache_sender_cookie;
+	u64 cache_recver_cookie;
+};
+
+/* Please keep this also at a size of 64 bytes by
+ * reuse of *spare* fields.
+ */
+struct xio_desc_item {
+	s8 field_type;
+	s8 field_spare0;
+	s16 field_data_size;
+	s16 field_sender_size;
+	s16 field_sender_offset;
+	s16 field_recver_size;
+	s16 field_recver_offset;
+	s32 field_spare;
+	char field_name[MAX_FIELD_LEN];
+};
+
+/* This must not be mirror symmetric between big and little endian
+ */
+#define XIO_DESC_MAGIC 0x73D0A2EC6148F48Ell
+
+struct xio_desc_header {
+	u64 h_magic;
+	u64 h_cookie;
+	s16 h_meta_len;
+	s16 h_index;
+	u32 h_spare1;
+	u64 h_spare2;
+};
+
+#define MAX_INT_TRANSFER 16
+
+/**/
+
+/* Bytesex conversion / sign extension
+ */
+
+#ifdef __LITTLE_ENDIAN
+static const bool myself_is_bigendian;
+
+#endif
+#ifdef __BIG_ENDIAN
+static const bool myself_is_bigendian = true;
+
+#endif
+
+static inline
+void swap_bytes(void *data, int len)
+{
+	char *a = data;
+	char *b = data + len - 1;
+
+	while (a < b) {
+		char tmp = *a;
+
+		*a = *b;
+		*b = tmp;
+		a++;
+		b--;
+	}
+}
+
+#define SWAP_FIELD(x) swap_bytes(&(x), sizeof(x))
+
+static inline
+void swap_mc(struct xio_desc_cache *mc, int len)
+{
+	struct xio_desc_item *mi;
+
+	SWAP_FIELD(mc->cache_sender_cookie);
+	SWAP_FIELD(mc->cache_recver_cookie);
+	SWAP_FIELD(mc->cache_items);
+
+	len -= sizeof(*mc);
+
+	for (mi = (void *)(mc + 1); len > 0; mi++, len -= sizeof(*mi)) {
+		SWAP_FIELD(mi->field_data_size);
+		SWAP_FIELD(mi->field_sender_size);
+		SWAP_FIELD(mi->field_sender_offset);
+		SWAP_FIELD(mi->field_recver_size);
+		SWAP_FIELD(mi->field_recver_offset);
+	}
+}
+
+static inline
+char get_sign(const void *data, int len, bool is_bigendian, bool is_
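The in-place byte reversal and the remark that the magic "must not be mirror symmetric" can be tried out in user space. What follows is a sketch, not the driver's receive path: magic_needs_swap() and DEMO_MAGIC are hypothetical, and using the magic to detect a foreign byte order is an assumption about how such a constant is typically used; the truncated patch text does not show this. swap_bytes() here uses unsigned char pointers instead of arithmetic on void *, which is a GNU extension.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reverse len bytes in place (portable variant of the quoted helper). */
static void swap_bytes(void *data, size_t len)
{
	unsigned char *a = data;
	unsigned char *b;

	assert(data != NULL);
	if (len < 2)
		return;
	b = a + len - 1;
	while (a < b) {
		unsigned char tmp = *a;

		*a++ = *b;
		*b-- = tmp;
	}
}

#define SWAP_FIELD(x) swap_bytes(&(x), sizeof(x))

/* Same value as XIO_DESC_MAGIC in the patch; the name is local to this sketch. */
#define DEMO_MAGIC 0x73D0A2EC6148F48EULL

/*
 * Hypothetical check: 0 if the peer has our byte order, 1 if every field
 * needs swapping, -1 if the value is not the magic in either order.
 * This only works because the magic differs from its own byte reversal.
 */
static int magic_needs_swap(uint64_t wire_magic)
{
	if (wire_magic == DEMO_MAGIC)
		return 0;
	SWAP_FIELD(wire_magic);
	if (wire_magic == DEMO_MAGIC)
		return 1;
	return -1;
}
```

A palindromic magic such as 0x0102030404030201 would compare equal in both byte orders and could not tell the two cases apart, which is presumably what the comment in the patch guards against.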
[RFC 01/31] mars: add new module lamport
Signed-off-by: Thomas Schoebel-Theuer
---
 drivers/staging/mars/lamport.c | 61 ++
 include/linux/brick/lamport.h  | 26 ++
 2 files changed, 87 insertions(+)
 create mode 100644 drivers/staging/mars/lamport.c
 create mode 100644 include/linux/brick/lamport.h

diff --git a/drivers/staging/mars/lamport.c b/drivers/staging/mars/lamport.c
new file mode 100644
index 000..373093f
--- /dev/null
+++ b/drivers/staging/mars/lamport.c
@@ -0,0 +1,61 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include
+#include
+#include
+
+#include
+
+/* TODO: replace with spinlock if possible (first check) */
+struct semaphore lamport_sem = __SEMAPHORE_INITIALIZER(lamport_sem, 1);
+struct timespec lamport_now = {};
+
+void get_lamport(struct timespec *now)
+{
+	int diff;
+
+	down(&lamport_sem);
+
+	*now = CURRENT_TIME;
+	diff = timespec_compare(now, &lamport_now);
+	if (diff >= 0) {
+		timespec_add_ns(now, 1);
+		memcpy(&lamport_now, now, sizeof(lamport_now));
+		timespec_add_ns(&lamport_now, 1);
+	} else {
+		timespec_add_ns(&lamport_now, 1);
+		memcpy(now, &lamport_now, sizeof(*now));
+	}
+
+	up(&lamport_sem);
+}
+
+void set_lamport(struct timespec *old)
+{
+	int diff;
+
+	down(&lamport_sem);
+
+	diff = timespec_compare(old, &lamport_now);
+	if (diff >= 0) {
+		memcpy(&lamport_now, old, sizeof(lamport_now));
+		timespec_add_ns(&lamport_now, 1);
+	}
+
+	up(&lamport_sem);
+}
diff --git a/include/linux/brick/lamport.h b/include/linux/brick/lamport.h
new file mode 100644
index 000..9aac0ce
--- /dev/null
+++ b/include/linux/brick/lamport.h
@@ -0,0 +1,26 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#ifndef LAMPORT_H
+#define LAMPORT_H
+
+#include
+
+extern void get_lamport(struct timespec *now);
+extern void set_lamport(struct timespec *old);
+
+#endif
-- 
2.6.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
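To make the intended semantics of get_lamport() and set_lamport() easier to review, here is a user-space sketch of the same logic. It is an approximation under stated assumptions: a pthread mutex stands in for the kernel semaphore, C11 timespec_get() stands in for CURRENT_TIME, and the helpers ts_compare() and ts_add_ns() are hypothetical replacements for timespec_compare() and timespec_add_ns().

```c
#include <assert.h>
#include <pthread.h>
#include <time.h>

static pthread_mutex_t lamport_lock = PTHREAD_MUTEX_INITIALIZER;
static struct timespec lamport_now;	/* highest stamp handed out, plus 1ns */

/* Hypothetical stand-in for the kernel's timespec_compare(). */
static int ts_compare(const struct timespec *a, const struct timespec *b)
{
	if (a->tv_sec != b->tv_sec)
		return a->tv_sec < b->tv_sec ? -1 : 1;
	if (a->tv_nsec != b->tv_nsec)
		return a->tv_nsec < b->tv_nsec ? -1 : 1;
	return 0;
}

/* Hypothetical stand-in for timespec_add_ns(); handles the carry. */
static void ts_add_ns(struct timespec *ts, long ns)
{
	assert(ns >= 0 && ns < 1000000000L);
	ts->tv_nsec += ns;
	if (ts->tv_nsec >= 1000000000L) {
		ts->tv_nsec -= 1000000000L;
		ts->tv_sec++;
	}
}

/* Return a stamp strictly greater than every stamp returned before. */
void get_lamport(struct timespec *now)
{
	pthread_mutex_lock(&lamport_lock);
	timespec_get(now, TIME_UTC);
	if (ts_compare(now, &lamport_now) >= 0) {
		/* Wall clock is ahead: follow it. */
		ts_add_ns(now, 1);
		lamport_now = *now;
		ts_add_ns(&lamport_now, 1);
	} else {
		/* Wall clock lags behind: advance the logical clock instead. */
		ts_add_ns(&lamport_now, 1);
		*now = lamport_now;
	}
	pthread_mutex_unlock(&lamport_lock);
}

/* Fold in a stamp received from a peer; the clock never moves backwards. */
void set_lamport(const struct timespec *old)
{
	pthread_mutex_lock(&lamport_lock);
	if (ts_compare(old, &lamport_now) >= 0) {
		lamport_now = *old;
		ts_add_ns(&lamport_now, 1);
	}
	pthread_mutex_unlock(&lamport_lock);
}
```

Two consecutive get_lamport() calls therefore yield strictly increasing stamps even if the wall clock does not advance between them, and a set_lamport() with a stamp from the future pushes all later local stamps beyond it.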
[RFC 18/31] mars: add new module xio_sio
Signed-off-by: Thomas Schoebel-Theuer
---
 drivers/staging/mars/xio_bricks/xio_sio.c | 571 ++
 include/linux/xio/xio_sio.h               |  68
 2 files changed, 639 insertions(+)
 create mode 100644 drivers/staging/mars/xio_bricks/xio_sio.c
 create mode 100644 include/linux/xio/xio_sio.h

diff --git a/drivers/staging/mars/xio_bricks/xio_sio.c b/drivers/staging/mars/xio_bricks/xio_sio.c
new file mode 100644
index 000..5822847
--- /dev/null
+++ b/drivers/staging/mars/xio_bricks/xio_sio.c
@@ -0,0 +1,571 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#include
+
+/*** own type definitions ***/
+
+#include
+
+/* own brick * input * output operations */
+
+static int sio_io_get(struct sio_output *output, struct aio_object *aio)
+{
+	struct file *file;
+
+	if (unlikely(!output->brick->power.on_led))
+		return -EBADFD;
+
+	if (aio->obj_initialized) {
+		obj_get(aio);
+		return aio->io_len;
+	}
+
+	file = output->mf->mf_filp;
+	if (file) {
+		loff_t total_size = i_size_read(file->f_mapping->host);
+
+		aio->io_total_size = total_size;
+		/* Only check reads.
+		 * Writes behind EOF are always allowed (sparse files)
+		 */
+		if (!aio->io_may_write) {
+			loff_t len = total_size - aio->io_pos;
+
+			if (unlikely(len <= 0)) {
+				/* Special case: allow reads starting _exactly_ at EOF when a timeout is specified.
+				 */
+				if (len < 0 || aio->io_timeout <= 0) {
+					XIO_DBG("ENODATA %lld\n", len);
+					return -ENODATA;
+				}
+			}
+			/* Shorten below EOF, but allow special case */
+			if (aio->io_len > len && len > 0)
+				aio->io_len = len;
+		}
+	}
+
+	/* Buffered IO.
+	 */
+	if (!aio->io_data) {
+		struct sio_aio_aspect *aio_a = sio_aio_get_aspect(output->brick, aio);
+
+		if (unlikely(!aio_a))
+			return -EILSEQ;
+		if (unlikely(aio->io_len <= 0)) {
+			XIO_ERR("bad io_len = %d\n", aio->io_len);
+			return -ENOMEM;
+		}
+		aio->io_data = brick_block_alloc(aio->io_pos, (aio_a->alloc_len = aio->io_len));
+		aio_a->do_dealloc = true;
+		/* atomic_inc(>total_alloc_count); */
+		/* atomic_inc(>alloc_count); */
+	}
+
+	obj_get_first(aio);
+	return aio->io_len;
+}
+
+static void sio_io_put(struct sio_output *output, struct aio_object *aio)
+{
+	struct file *file;
+	struct sio_aio_aspect *aio_a;
+
+	if (!obj_put(aio))
+		goto out_return;
+	file = output->mf->mf_filp;
+	aio->io_total_size = i_size_read(file->f_mapping->host);
+
+	aio_a = sio_aio_get_aspect(output->brick, aio);
+	if (aio_a && aio_a->do_dealloc) {
+		brick_block_free(aio->io_data, aio_a->alloc_len);
+		/* atomic_dec(>alloc_count); */
+	}
+
+	obj_free(aio);
+out_return:;
+}
+
+static
+int write_aops(struct sio_output *output, struct aio_object *aio)
+{
+	struct file *file = output->mf->mf_filp;
+	loff_t pos = aio->io_pos;
+	void *data = aio->io_data;
+	int len = aio->io_len;
+	int ret = 0;
+
+	mm_segment_t oldfs;
+
+	oldfs = get_fs();
+	set_fs(get_ds());
+	ret = vfs_write(file, data, len, &pos);
+	set_fs(oldfs);
+	return ret;
+}
+
+static
+int read_aops(struct sio_output *output, struct aio_object *aio)
+{
+	loff_t pos = aio->io_pos;
+	int len = aio->io_len;
+	int ret;
+
+	mm_seg
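The read-side EOF handling in sio_io_get() has three distinct outcomes that are easy to mix up, so here is a sketch of just that decision in user space. clamp_read_len() is a hypothetical name and the parameters are plain values standing in for aio->io_pos, aio->io_len and aio->io_timeout.

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical stand-alone version of the read check in sio_io_get().
 * Returns the (possibly shortened) request length, or -ENODATA.
 */
static long long clamp_read_len(long long total_size, long long pos,
				long long io_len, int timeout)
{
	long long avail = total_size - pos;	/* bytes left before EOF */

	assert(io_len > 0);

	if (avail <= 0) {
		/*
		 * Behind EOF is always an error. Exactly at EOF is only
		 * tolerated when the caller is willing to wait (timeout > 0).
		 */
		if (avail < 0 || timeout <= 0)
			return -ENODATA;
	}
	/* Shorten to EOF; at avail == 0 the length is left untouched. */
	if (io_len > avail && avail > 0)
		io_len = avail;
	return io_len;
}
```

The untouched length in the exactly-at-EOF case presumably lets a waiting reader pick up data that is appended later, which would fit the timeout condition in the quoted comment.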
[RFC 02/31] mars: add new module brick_say
Signed-off-by: Thomas Schoebel-Theuer
---
 drivers/staging/mars/brick_say.c | 916 +++
 include/linux/brick/brick_say.h  |  96
 2 files changed, 1012 insertions(+)
 create mode 100644 drivers/staging/mars/brick_say.c
 create mode 100644 include/linux/brick/brick_say.h

diff --git a/drivers/staging/mars/brick_say.c b/drivers/staging/mars/brick_say.c
new file mode 100644
index 000..7a51273
--- /dev/null
+++ b/drivers/staging/mars/brick_say.c
@@ -0,0 +1,916 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include
+#include
+#include
+
+#include
+#include
+
+/***/
+
+/* messaging */
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#include
+
+#include
+
+#ifndef GFP_BRICK
+#define GFP_BRICK GFP_NOIO
+#endif
+
+#define SAY_ORDER 0
+#define SAY_BUFMAX (PAGE_SIZE << SAY_ORDER)
+#define SAY_BUF_LIMIT (SAY_BUFMAX - 1500)
+#define MAX_FILELEN 16
+#define MAX_IDS 1000
+
+const char *say_class[MAX_SAY_CLASS] = {
+	[SAY_DEBUG] = "debug",
+	[SAY_INFO] = "info",
+	[SAY_WARN] = "warn",
+	[SAY_ERROR] = "error",
+	[SAY_FATAL] = "fatal",
+	[SAY_TOTAL] = "total",
+};
+
+int brick_say_logging = 1;
+
+module_param_named(say_logging, brick_say_logging, int, 0);
+int brick_say_debug;
+
+module_param_named(say_debug, brick_say_debug, int, 0);
+
+int brick_say_syslog_min = 1;
+int brick_say_syslog_max = -1;
+int brick_say_syslog_flood_class = 3;
+int brick_say_syslog_flood_limit = 20;
+int brick_say_syslog_flood_recovery = 300;
+
+int delay_say_on_overflow =
+#ifdef CONFIG_MARS_DEBUG
+	1;
+#else
+	0;
+#endif
+
+static atomic_t say_alloc_channels = ATOMIC_INIT(0);
+static atomic_t say_alloc_names = ATOMIC_INIT(0);
+static atomic_t say_alloc_pages = ATOMIC_INIT(0);
+
+static unsigned long flood_start_jiffies;
+static int flood_count;
+
+struct say_channel {
+	char *ch_name;
+	struct say_channel *ch_next;
+	spinlock_t ch_lock[MAX_SAY_CLASS];
+	char *ch_buf[MAX_SAY_CLASS][2];
+
+	short ch_index[MAX_SAY_CLASS];
+	struct file *ch_filp[MAX_SAY_CLASS][2];
+	int ch_overflow[MAX_SAY_CLASS];
+	bool ch_written[MAX_SAY_CLASS];
+	bool ch_rollover;
+	bool ch_must_exist;
+	bool ch_is_dir;
+	bool ch_delete;
+	int ch_status_written;
+	int ch_id_max;
+	void *ch_ids[MAX_IDS];
+
+	wait_queue_head_t ch_progress;
+};
+
+struct say_channel *default_channel;
+
+static struct say_channel *channel_list;
+
+static rwlock_t say_lock = __RW_LOCK_UNLOCKED(say_lock);
+
+static struct task_struct *say_thread;
+
+static DECLARE_WAIT_QUEUE_HEAD(say_event);
+
+bool say_dirty;
+
+#define use_atomic() \
+	((preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK | HARDIRQ_MASK | NMI_MASK)) != 0 || irqs_disabled())
+
+static
+void wait_channel(struct say_channel *ch, int class)
+{
+	if (delay_say_on_overflow && ch->ch_index[class] > SAY_BUF_LIMIT) {
+		if (!use_atomic()) {
+			say_dirty = true;
+			wake_up_interruptible(&say_event);
+			wait_event_interruptible_timeout(ch->ch_progress,
+							 ch->ch_index[class] < SAY_BUF_LIMIT,
+							 HZ / 10);
+		}
+	}
+}
+
+static
+struct say_channel *find_channel(const void *id)
+{
+	struct say_channel *res = default_channel;
+	struct say_channel *ch;
+
+	read_lock(&say_lock);
+	for (ch = channel_list; ch; ch = ch->ch_next) {
+		int i;
+
+		for (i = 0; i < ch->ch_id_max; i++) {
+			if (ch->ch_ids[i] == id) {
+				res = ch;
+				goto found;
+			}
+		}
+	}
+found:
+	read_unlock(&say_lock);
+	return res;
+}
+
+static
+void _remove_binding(struct task_struct *whom)
+{
+	struct say_channel *ch;
+	int
[RFC 23/31] mars: add new module xio_server
Signed-off-by: Thomas Schoebel-Theuer
---
 drivers/staging/mars/xio_bricks/xio_server.c | 486 +++
 include/linux/xio/xio_server.h               |  91 +
 2 files changed, 577 insertions(+)
 create mode 100644 drivers/staging/mars/xio_bricks/xio_server.c
 create mode 100644 include/linux/xio/xio_server.h

diff --git a/drivers/staging/mars/xio_bricks/xio_server.c b/drivers/staging/mars/xio_bricks/xio_server.c
new file mode 100644
index 000..95a3327
--- /dev/null
+++ b/drivers/staging/mars/xio_bricks/xio_server.c
@@ -0,0 +1,486 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+/* Server brick (just for demonstration) */
+
+#include
+#include
+#include
+
+#include
+#include
+#include
+#include
+
+/*** own type definitions ***/
+
+#include
+
+static struct xio_socket server_socket[NR_SERVER_SOCKETS];
+static struct task_struct *server_threads[NR_SERVER_SOCKETS];
+
+/*** own helper functions ***/
+
+int cb_thread(void *data)
+{
+	struct server_brick *brick = data;
+	struct xio_socket *sock = &brick->handler_socket;
+	bool aborted = false;
+	bool ok = xio_get_socket(sock);
+	int status = -EINVAL;
+
+	XIO_DBG("--- cb_thread starting on socket #%d, ok = %d\n", sock->s_debug_nr, ok);
+	if (!ok)
+		goto done;
+
+	brick->cb_running = true;
+	wake_up_interruptible(&brick->startup_event);
+
+	while (!brick_thread_should_stop() || !list_empty(&brick->cb_read_list) || !list_empty(&brick->cb_write_list) || atomic_read(&brick->in_flight) > 0) {
+		struct server_aio_aspect *aio_a;
+		struct aio_object *aio;
+		struct list_head *tmp;
+		unsigned long flags;
+
+		wait_event_interruptible_timeout(
+			brick->cb_event,
+			!list_empty(&brick->cb_read_list) ||
+			!list_empty(&brick->cb_write_list),
+			1 * HZ);
+
+		spin_lock_irqsave(&brick->cb_lock, flags);
+		tmp = brick->cb_write_list.next;
+		if (tmp == &brick->cb_write_list) {
+			tmp = brick->cb_read_list.next;
+			if (tmp == &brick->cb_read_list) {
+				spin_unlock_irqrestore(&brick->cb_lock, flags);
+				brick_msleep(1000 / HZ);
+				continue;
+			}
+		}
+		list_del_init(tmp);
+		spin_unlock_irqrestore(&brick->cb_lock, flags);
+
+		aio_a = container_of(tmp, struct server_aio_aspect, cb_head);
+		aio = aio_a->object;
+		status = -EINVAL;
+		CHECK_PTR(aio, err);
+
+		status = 0;
+		/* Report a remote error when consistency cannot be guaranteed,
+		 * e.g. emergency mode during sync.
+		 */
+		if (brick->conn_brick && brick->conn_brick->mode_ptr && *brick->conn_brick->mode_ptr < 0
+		    && aio->object_cb)
+			aio->object_cb->cb_error = *brick->conn_brick->mode_ptr;
+		if (!aborted) {
+			down(&brick->socket_sem);
+			status = xio_send_cb(sock, aio);
+			up(&brick->socket_sem);
+		}
+
+err:
+		if (unlikely(status < 0) && !aborted) {
+			aborted = true;
+			XIO_WRN("cannot send response, status = %d\n", status);
+			/* Just shutdown the socket and forget all pending
+			 * requests.
+			 * The _client_ is responsible for resending
+			 * any lost operations.
+			 */
+			xio_shutdown_socket(sock);
+		}
+
+		if (aio_a->data) {
+			brick_block_free(aio_a->data, aio_a->len);
+			aio->io_data = NULL;
+		}
+		if (aio_a->do_put) {
+			GENERIC_INPUT_CALL(brick->inputs[0], aio_put, aio);
+			atomic_
[RFC 25/31] mars: add new module light_net
Signed-off-by: Thomas Schoebel-Theuer
---
 drivers/staging/mars/mars_light/light_net.c | 109
 1 file changed, 109 insertions(+)
 create mode 100644 drivers/staging/mars/mars_light/light_net.c

diff --git a/drivers/staging/mars/mars_light/light_net.c b/drivers/staging/mars/mars_light/light_net.c
new file mode 100644
index 000..9890edd
--- /dev/null
+++ b/drivers/staging/mars/mars_light/light_net.c
@@ -0,0 +1,109 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include
+#include
+#include
+
+#include
+#include
+
+static
+char *_xio_translate_hostname(const char *name)
+{
+	char *res = brick_strdup(name);
+	char *test;
+	char *tmp;
+
+	for (tmp = res; *tmp; tmp++) {
+		if (*tmp == ':') {
+			*tmp = '\0';
+			break;
+		}
+	}
+
+	tmp = path_make("/mars/ips/ip-%s", res);
+	if (unlikely(!tmp))
+		goto done;
+
+	test = mars_readlink(tmp);
+	if (test && test[0]) {
+		XIO_DBG("'%s' => '%s'\n", tmp, test);
+		brick_string_free(res);
+		res = test;
+	} else {
+		brick_string_free(test);
+		XIO_WRN("no hostname translation for '%s'\n", tmp);
+	}
+	brick_string_free(tmp);
+
+done:
+	return res;
+}
+
+int xio_send_dent_list(struct xio_socket *sock, struct list_head *anchor)
+{
+	struct list_head *tmp;
+	struct mars_dent *dent;
+	int status = 0;
+
+	for (tmp = anchor->next; tmp != anchor; tmp = tmp->next) {
+		dent = container_of(tmp, struct mars_dent, dent_link);
+		status = xio_send_struct(sock, dent, mars_dent_meta);
+		if (status < 0)
+			break;
+	}
+	if (status >= 0) { /*  send EOR */
+		status = xio_send_struct(sock, NULL, mars_dent_meta);
+	}
+	return status;
+}
+
+int xio_recv_dent_list(struct xio_socket *sock, struct list_head *anchor)
+{
+	int status;
+
+	for (;;) {
+		struct mars_dent *dent = brick_zmem_alloc(sizeof(struct mars_dent));
+
+		INIT_LIST_HEAD(&dent->dent_link);
+		INIT_LIST_HEAD(&dent->brick_list);
+
+		status = xio_recv_struct(sock, dent, mars_dent_meta);
+		if (status <= 0) {
+			xio_free_dent(dent);
+			goto done;
+		}
+		list_add_tail(&dent->dent_link, anchor);
+	}
+done:
+	return status;
+}
+
+/* module init stuff */
+
+int __init init_sy_net(void)
+{
+	XIO_INF("init_sy_net()\n");
+	xio_translate_hostname = _xio_translate_hostname;
+	return 0;
+}
+
+void exit_sy_net(void)
+{
+	XIO_INF("exit_sy_net()\n");
+}
-- 
2.6.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
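For readers following the dent list transfer in this patch: the sender walks the list, sends one record per entry, and finishes with an end-of-record (EOR) marker, while the receiver loops until the receive call reports a status <= 0. The following user-space sketch illustrates only that framing idea. Everything in it (struct wire, wire_send(), wire_recv(), WIRE_EOR, send_list(), recv_list()) is hypothetical and not part of MARS, and the assumption that a return value of 0 means EOR while a negative value means error is inferred from the `status <= 0` check, not confirmed by the patch.

```c
#include <assert.h>
#include <stddef.h>

#define WIRE_MAX 8
#define WIRE_EOR (-1)	/* hypothetical EOR marker; payload values must be >= 0 */

/* A tiny in-memory stand-in for a socket (assumption, for illustration only). */
struct wire {
	int slots[WIRE_MAX];
	int wr;		/* next slot to write */
	int rd;		/* next slot to read */
};

/* Returns 1 on success, -1 when the wire is full. */
static int wire_send(struct wire *w, int value)
{
	if (w->wr >= WIRE_MAX)
		return -1;
	w->slots[w->wr++] = value;
	return 1;
}

/* Returns 1 for a record, 0 for EOR, -1 when the stream ends without EOR. */
static int wire_recv(struct wire *w, int *value)
{
	if (w->rd >= w->wr)
		return -1;
	if (w->slots[w->rd] == WIRE_EOR) {
		w->rd++;
		return 0;
	}
	*value = w->slots[w->rd++];
	return 1;
}

/* Same shape as the sender in the patch: all records, then EOR on success. */
static int send_list(struct wire *w, const int *items, int n)
{
	int status = 0;
	int i;

	for (i = 0; i < n; i++) {
		status = wire_send(w, items[i]);
		if (status < 0)
			break;
	}
	if (status >= 0)
		status = wire_send(w, WIRE_EOR);
	return status;
}

/* Same shape as the receiver in the patch: loop until status <= 0. */
static int recv_list(struct wire *w, int *out, int max, int *count)
{
	int status;

	*count = 0;
	for (;;) {
		int v;

		status = wire_recv(w, &v);
		if (status <= 0)
			break;
		if (*count >= max) {
			status = -1;
			break;
		}
		out[(*count)++] = v;
	}
	return status;
}
```

With this framing the receiver needs no length prefix; a final status of 0 tells it that the list arrived completely, whereas a negative status means the stream broke off before the EOR marker.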
[RFC 05/31] mars: add new module meta
Signed-off-by: Thomas Schoebel-Theuer
---
 include/linux/brick/meta.h | 106 +
 1 file changed, 106 insertions(+)
 create mode 100644 include/linux/brick/meta.h

diff --git a/include/linux/brick/meta.h b/include/linux/brick/meta.h
new file mode 100644
index 000..a92b2b6
--- /dev/null
+++ b/include/linux/brick/meta.h
@@ -0,0 +1,106 @@
+/*
+ * MARS Long Distance Replication Software
+ *
+ * Copyright (C) 2010-2014 Thomas Schoebel-Theuer
+ * Copyright (C) 2011-2014 1&1 Internet AG
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#ifndef META_H
+#define META_H
+
+/***/
+
+/*  metadata descriptions */
+
+/* The idea is to describe your C structures in such a way that
+ * transfers to disk or over a network become self-describing.
+ *
+ * In essence, this is a kind of version-independent marshalling.
+ *
+ * Advantage:
+ * When you extend your original C struct (and of course update the
+ * corresponding meta structure), old data on disk (or network peers
+ * running an old version of your program) will remain valid.
+ * Upon read, newly added fields missing in the old version will be simply
+ * not filled in and therefore remain zeroed (if you don't forget to
+ * initially clear your structures via memset() / initializers / etc).
+ * Note that this works only if you never rename or remove existing
+ * fields; you should only add new ones.
+ * [TODO: add macros for description of ignored / renamed fields to
+ *  overcome this limitation]
+ * You may increase the size of integers, for example from 32bit to 64bit
+ * or even higher; sign extension will be automatically carried out
+ * when necessary.
+ * Also, you may change the order of fields, because the metadata interpreter
+ * will check each field individually; field offsets are automatically
+ * maintained.
+ *
+ * Disadvantage: this adds some (small) overhead.
+ */
+
+enum field_type {
+	FIELD_DONE,
+	FIELD_REF,
+	FIELD_SUB,
+	FIELD_STRING,
+	FIELD_RAW,
+	FIELD_INT,
+	FIELD_UINT,
+};
+
+struct meta {
+	/* char  field_name[MAX_FIELD_LEN]; */
+	char *field_name;
+
+	short field_type;
+	short field_data_size;
+	short field_transfer_size;
+	int   field_offset;
+	const struct meta *field_ref;
+};
+
+#define _META_INI(NAME, STRUCT, TYPE, TSIZE)				\
+	.field_name = #NAME,						\
+	.field_type = TYPE,						\
+	.field_data_size = sizeof(((STRUCT *)NULL)->NAME),		\
+	.field_transfer_size = (TSIZE),					\
+	.field_offset = offsetof(STRUCT, NAME)				\
+
+#define META_INI_TRANSFER(NAME, STRUCT, TYPE, TSIZE)			\
+	{ _META_INI(NAME, STRUCT, TYPE, TSIZE) }
+
+#define META_INI(NAME, STRUCT, TYPE)					\
+	{ _META_INI(NAME, STRUCT, TYPE, 0) }
+
+#define _META_INI_AIO(NAME, STRUCT, AIO)				\
+	.field_name = #NAME,						\
+	.field_type = FIELD_REF,					\
+	.field_data_size = sizeof(*(((STRUCT *)NULL)->NAME)),		\
+	.field_offset = offsetof(STRUCT, NAME),				\
+	.field_ref = AIO
+
+#define META_INI_AIO(NAME, STRUCT, AIO) { _META_INI_AIO(NAME, STRUCT, AIO) }
+
+#define _META_INI_SUB(NAME, STRUCT, SUB)				\
+	.field_name = #NAME,						\
+	.field_type = FIELD_SUB,					\
+	.field_data_size = sizeof(((STRUCT *)NULL)->NAME),		\
+	.field_offset = offsetof(STRUCT, NAME),				\
+	.field_ref = SUB
+
+#define META_INI_SUB(NAME, STRUCT, SUB) { _META_INI_SUB(NAME, STRUCT, SUB) }
+
+extern const struct meta *find_meta(const struct meta *meta, const char *field_name);
+/* extern void free_meta(void *data, const struct meta *meta); */
+
+#endif
-- 
2.6.4
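To make the self-describing idea from the meta.h comment more concrete, here is a minimal user-space sketch of how a meta table might be declared and searched by field name. The enum, struct meta and the shape of META_INI follow the header above (collapsed into a single macro here); struct demo_rec, demo_rec_meta and find_meta_sketch() are hypothetical names introduced only for illustration. The patch merely declares find_meta(), so the linear scan that stops at an entry with a NULL field_name is an assumption about how such a lookup could work, not the actual MARS implementation.

```c
#include <assert.h>
#include <stddef.h>	/* offsetof, NULL */
#include <string.h>	/* strcmp */

enum field_type {
	FIELD_DONE,
	FIELD_REF,
	FIELD_SUB,
	FIELD_STRING,
	FIELD_RAW,
	FIELD_INT,
	FIELD_UINT,
};

struct meta {
	const char *field_name;
	short field_type;
	short field_data_size;
	short field_transfer_size;
	int field_offset;
	const struct meta *field_ref;
};

/* Same idea as META_INI in the header, written as one macro for brevity. */
#define META_INI(NAME, STRUCT, TYPE)					\
	{								\
		.field_name = #NAME,					\
		.field_type = TYPE,					\
		.field_data_size = sizeof(((STRUCT *)NULL)->NAME),	\
		.field_transfer_size = 0,				\
		.field_offset = offsetof(STRUCT, NAME),			\
	}

/* Hypothetical payload structure, not taken from MARS. */
struct demo_rec {
	int demo_id;
	unsigned long long demo_size;
	char *demo_name;
};

/* One table entry per field; the zeroed last entry terminates the table. */
static const struct meta demo_rec_meta[] = {
	META_INI(demo_id, struct demo_rec, FIELD_INT),
	META_INI(demo_size, struct demo_rec, FIELD_UINT),
	META_INI(demo_name, struct demo_rec, FIELD_STRING),
	{ .field_type = FIELD_DONE }
};

/* Assumed lookup: linear scan until the terminating entry is reached. */
static const struct meta *find_meta_sketch(const struct meta *meta,
					   const char *field_name)
{
	const struct meta *tmp;

	for (tmp = meta; tmp->field_name; tmp++) {
		if (!strcmp(field_name, tmp->field_name))
			return tmp;
	}
	return NULL;
}
```

Because each entry records the field's name, size and offset, a reader built against a newer struct layout can look up every incoming field by name and simply leave fields it does not receive zeroed, which is the version-independence property the header comment describes.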
[RFC 21/31] mars: add new module xio_copy
Signed-off-by: Thomas Schoebel-Theuer --- drivers/staging/mars/xio_bricks/xio_copy.c | 1005 include/linux/xio/xio_copy.h | 115 2 files changed, 1120 insertions(+) create mode 100644 drivers/staging/mars/xio_bricks/xio_copy.c create mode 100644 include/linux/xio/xio_copy.h diff --git a/drivers/staging/mars/xio_bricks/xio_copy.c b/drivers/staging/mars/xio_bricks/xio_copy.c new file mode 100644 index 000..aa5bc56 --- /dev/null +++ b/drivers/staging/mars/xio_bricks/xio_copy.c @@ -0,0 +1,1005 @@ +/* + * MARS Long Distance Replication Software + * + * Copyright (C) 2010-2014 Thomas Schoebel-Theuer + * Copyright (C) 2011-2014 1&1 Internet AG + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + */ + +/* Copy brick (just for demonstration) */ + +#include +#include +#include + +#include +#include + +#ifndef READ +#define READ 0 +#define WRITE 1 +#endif + +#define COPY_CHUNK (PAGE_SIZE) +#define NR_COPY_REQUESTS (32 * 1024 * 1024 / COPY_CHUNK) + +#define STATES_PER_PAGE(PAGE_SIZE / sizeof(struct copy_state)) +#define MAX_SUB_TABLES (NR_COPY_REQUESTS / STATES_PER_PAGE + (NR_COPY_REQUESTS % STATES_PER_PAGE ? 
1 : 0)\ + \ +) +#define MAX_COPY_REQUESTS (PAGE_SIZE / sizeof(struct copy_state *) * STATES_PER_PAGE) + +#define GET_STATE(brick, index) \ + ((brick)->st[(index) / STATES_PER_PAGE][(index) % STATES_PER_PAGE]) + +/ own type definitions ***/ + +#include + +int xio_copy_overlap = 1; + +int xio_copy_read_prio = XIO_PRIO_NORMAL; + +int xio_copy_write_prio = XIO_PRIO_NORMAL; + +int xio_copy_read_max_fly; + +int xio_copy_write_max_fly; + +#define is_read_limited(brick) \ + (xio_copy_read_max_fly > 0 && atomic_read(&(brick)->copy_read_flight) >= xio_copy_read_max_fly) + +#define is_write_limited(brick) \ + (xio_copy_write_max_fly > 0 && atomic_read(&(brick)->copy_write_flight) >= xio_copy_write_max_fly) + +/ own helper functions ***/ + +/* TODO: + * The clash logic is untested / alpha stage (Feb. 2011). + * + * For now, the output is never used, so this cannot do harm. + * + * In order to get the output really working / enterprise grade, + * some larger test effort should be invested. + */ +static inline +void _clash(struct copy_brick *brick) +{ + brick->trigger = true; + set_bit(0, >clash); + atomic_inc(>total_clash_count); + wake_up_interruptible(>event); +} + +static inline +int _clear_clash(struct copy_brick *brick) +{ + int old; + + old = test_and_clear_bit(0, >clash); + return old; +} + +/* Current semantics: + * + * All writes are always going to the original input A. They are _not_ + * replicated to B. + * + * In order to get B really uptodate, you have to replay the right + * transaction logs there (at the right time). + * [If you had no writes on A at all during the copy, of course + * this is not necessary] + * + * When utilize_mode is on, reads can utilize the already copied + * region from B, but only as long as this region has not been + * invalidated by writes (indicated by low_dirty). 
+ * + * TODO: implement replicated writes, together with some transaction + * replay logic applying the transaction logs _only_ after + * crashes during inconsistency caused by partial replication of writes. + */ +static +int _determine_input(struct copy_brick *brick, struct aio_object *aio) +{ + int rw; + int below; + int behind; + loff_t io_end; + + if (!brick->utilize_mode || brick->low_dirty) + return INPUT_A_IO; + + io_end = aio->io_pos + aio->io_len; + below = io_end <= brick->copy_start; + behind = !brick->copy_end || aio->io_pos >= brick->copy_end; + rw = aio->io_may_write | aio->io_rw; + if (rw) { + if (!behind) { + brick->low_dirty = true; + if (!below) { + _clash(brick); + wake_up_interruptible(>event); +
[RFC 20/31] mars: add new module xio_if
Signed-off-by: Thomas Schoebel-Theuer --- drivers/staging/mars/xio_bricks/xio_if.c | 961 +++ include/linux/xio/xio_if.h | 108 2 files changed, 1069 insertions(+) create mode 100644 drivers/staging/mars/xio_bricks/xio_if.c create mode 100644 include/linux/xio/xio_if.h diff --git a/drivers/staging/mars/xio_bricks/xio_if.c b/drivers/staging/mars/xio_bricks/xio_if.c new file mode 100644 index 000..65e023c --- /dev/null +++ b/drivers/staging/mars/xio_bricks/xio_if.c @@ -0,0 +1,961 @@ +/* + * MARS Long Distance Replication Software + * + * Copyright (C) 2010-2014 Thomas Schoebel-Theuer + * Copyright (C) 2011-2014 1&1 Internet AG + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + */ + +/* Interface to a Linux device. + * 1 Input, 0 Outputs. 
+ */ + +#define REQUEST_MERGING +#define ALWAYS_UNPLUG true +#define PREFETCH_LEN PAGE_SIZE + +/* low-level device parameters */ +#define IF_MAX_SEGMENT_SIZEPAGE_SIZE +#define USE_MAX_SECTORS(IF_MAX_SEGMENT_SIZE >> 9) +#define USE_MAX_PHYS_SEGMENTS (IF_MAX_SEGMENT_SIZE >> 9) +#define USE_MAX_SEGMENT_SIZE IF_MAX_SEGMENT_SIZE +#define USE_LOGICAL_BLOCK_SIZE 512 +#define USE_SEGMENT_BOUNDARY (PAGE_SIZE-1) + +#include +#include +#include + +#include +#include +#include +#include + +#include +#include + +#ifndef XIO_MAJOR +#define XIO_MAJOR (DRBD_MAJOR + 1) +#endif + +/ global tuning ***/ + +int if_throttle_start_size; + +struct rate_limiter if_throttle = { + .lim_max_rate = 5000, +}; + +/ own type definitions ***/ + +#include + +#define IF_HASH_MAX(PAGE_SIZE / sizeof(struct if_hash_anchor)) +#define IF_HASH_CHUNK (PAGE_SIZE * 32) + +struct if_hash_anchor { + spinlock_t hash_lock; + struct list_head hash_anchor; +}; + +/ own static definitions ***/ + +/* TODO: check bounds, ensure that free minor numbers are recycled */ +static int device_minor; + +/*** object * aspect constructors * destructors **/ + +/ linux operations ***/ + +static +void _if_start_io_acct(struct if_input *input, struct bio_wrapper *biow) +{ + struct bio *bio = biow->bio; + const int rw = bio_data_dir(bio); + const int cpu = part_stat_lock(); + + (void)cpu; + part_round_stats(cpu, >disk->part0); + part_stat_inc(cpu, >disk->part0, ios[rw]); + part_stat_add(cpu, >disk->part0, sectors[rw], bio->bi_iter.bi_size >> 9); + part_inc_in_flight(>disk->part0, rw); + part_stat_unlock(); + biow->start_time = jiffies; +} + +static +void _if_end_io_acct(struct if_input *input, struct bio_wrapper *biow) +{ + unsigned long duration = jiffies - biow->start_time; + struct bio *bio = biow->bio; + const int rw = bio_data_dir(bio); + const int cpu = part_stat_lock(); + + (void)cpu; + part_stat_add(cpu, >disk->part0, ticks[rw], duration); + part_round_stats(cpu, >disk->part0); + part_dec_in_flight(>disk->part0, rw); + 
part_stat_unlock(); +} + +/* callback + */ +static +void if_endio(struct generic_callback *cb) +{ + struct if_aio_aspect *aio_a = cb->cb_private; + struct if_input *input; + int k; + int rw; + int error; + + LAST_CALLBACK(cb); + if (unlikely(!aio_a || !aio_a->object)) { + XIO_FAT("aio_a = %p aio = %p, something is very wrong here!\n", aio_a, aio_a->object); + goto out_return; + } + input = aio_a->input; + CHECK_PTR(input, err); + + rw = aio_a->object->io_rw; + + for (k = 0; k < aio_a->bio_count; k++) { + struct bio_wrapper *biow; + struct bio *bio; + + biow = aio_a->orig_biow[k]; + aio_a->orig_biow[k] = NULL; + CHECK_PTR(biow, err); + + CHECK_ATOMIC(>bi_comp_cnt, 1); + if (!atomic_dec_and_test(>bi_comp_cnt)) + continue; + + bio = biow->bio; + CHECK_PTR_NULL(bio, err); + + _if_end_io_acct(input, biow); + +
[RFC 22/31] mars: add new module xio_trans_logger
Signed-off-by: Thomas Schoebel-Theuer --- drivers/staging/mars/xio_bricks/xio_trans_logger.c | 3309 include/linux/xio/xio_trans_logger.h | 263 ++ 2 files changed, 3572 insertions(+) create mode 100644 drivers/staging/mars/xio_bricks/xio_trans_logger.c create mode 100644 include/linux/xio/xio_trans_logger.h diff --git a/drivers/staging/mars/xio_bricks/xio_trans_logger.c b/drivers/staging/mars/xio_bricks/xio_trans_logger.c new file mode 100644 index 000..04d4c63 --- /dev/null +++ b/drivers/staging/mars/xio_bricks/xio_trans_logger.c @@ -0,0 +1,3309 @@ +/* + * MARS Long Distance Replication Software + * + * Copyright (C) 2010-2014 Thomas Schoebel-Theuer + * Copyright (C) 2011-2014 1&1 Internet AG + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + */ + +/* Trans_Logger brick */ + +#define XIO_DEBUGGING + +#include +#include +#include +#include + +#include +#include +#include + +#include + +/* variants */ +#define KEEP_UNIQUE +#define DELAY_CALLERS /* this is _needed_ for production systems */ +/* When possible, queue 1 executes phase3_startio() directly without + * intermediate queueing into queue 3 = > may be irritating, but has better + * performance. NOTICE: when some day the IO scheduling should be + * different between queue 1 and 3, you MUST disable this in order + * to distinguish between them! + */ +#define SHORTCUT_1_to_3 + +/* commenting this out is dangerous for data integrity! use only for testing! 
*/ +#define USE_MEMCPY +#define DO_WRITEBACK /* otherwise FAKE IO */ +#define REPLAY_DATA + +/* tuning */ +#ifdef BRICK_DEBUG_MEM +#define CONF_TRANS_CHUNKSIZE (128 * 1024 - PAGE_SIZE * 2) +#else +#define CONF_TRANS_CHUNKSIZE (128 * 1024) +#endif +#define CONF_TRANS_MAX_AIO_SIZEPAGE_SIZE +#define CONF_TRANS_ALIGN 0 + +#define XIO_RPL(_args...) /*empty*/ + +struct trans_logger_hash_anchor { + struct rw_semaphore hash_mutex; + struct list_head hash_anchor; +}; + +#define NR_HASH_PAGES 64 + +#define MAX_HASH_PAGES (PAGE_SIZE / sizeof(struct trans_logger_hash_anchor *)) +#define HASH_PER_PAGE (PAGE_SIZE / sizeof(struct trans_logger_hash_anchor)) +#define HASH_TOTAL (NR_HASH_PAGES * HASH_PER_PAGE) + +/ global tuning ***/ + +int trans_logger_completion_semantics = 1; + +int trans_logger_do_crc = +#ifdef CONFIG_MARS_DEBUG + true; +#else + false; +#endif + +int trans_logger_mem_usage; /* in KB */ + +int trans_logger_max_interleave = -1; + +int trans_logger_resume = 1; + +int trans_logger_replay_timeout = 1; /* in s */ + +struct writeback_group global_writeback = { + .lock = __RW_LOCK_UNLOCKED(global_writeback.lock), + .group_anchor = LIST_HEAD_INIT(global_writeback.group_anchor), + .until_percent = 30, +}; + +static +void add_to_group(struct writeback_group *gr, struct trans_logger_brick *brick) +{ + unsigned long flags; + + write_lock_irqsave(>lock, flags); + list_add_tail(>group_head, >group_anchor); + write_unlock_irqrestore(>lock, flags); +} + +static +void remove_from_group(struct writeback_group *gr, struct trans_logger_brick *brick) +{ + unsigned long flags; + + write_lock_irqsave(>lock, flags); + list_del_init(>group_head); + gr->leader = NULL; + write_unlock_irqrestore(>lock, flags); +} + +static +struct trans_logger_brick *elect_leader(struct writeback_group *gr) +{ + struct trans_logger_brick *res = gr->leader; + struct list_head *tmp; + unsigned long flags; + + if (res && gr->until_percent >= 0) { + loff_t used = atomic64_read(>shadow_mem_used); + + if (used > 
gr->biggest * gr->until_percent / 100) + goto done; + } + + read_lock_irqsave(>lock, flags); + for (tmp = gr->group_anchor.next; tmp != >group_anchor; tmp = tmp->next) { + struct trans_logger_brick *test = container_of(tmp, struct trans_logger_brick, group_head); + loff_t new_used = atomic64_read(>shadow_mem_used); + + if (!res || new_used > atomic64_read(>shadow_mem_used)) { + res = test; + gr->biggest = new_used; + } + } + read_unlo