Re: [PATCH v3] RISC-V: Implement ASID allocator

2019-04-24 Thread Michael Clark



>> On 25/04/2019, at 11:36 AM, Palmer Dabbelt  wrote:
>> 
>> On Thu, 28 Mar 2019 21:51:38 PDT (-0700), anup.pa...@wdc.com wrote:
>> Currently, we do local TLB flush on every MM switch. This is very harsh on
>> performance because we are forcing page table walks after every MM switch.
>> 
>> This patch implements ASID allocator for assigning an ASID to a MM context.
>> The number of ASIDs are limited in HW so we create a logical entity named
>> CONTEXTID for assigning to MM context. The lower bits of CONTEXTID are ASID
>> and upper bits are VERSION number. The number of usable ASID bits supported
>> by HW are detected at boot-time by writing 1s to ASID bits in SATP CSR.
>> 
>> We allocate new CONTEXTID on first MM switch for a MM context where the
>> ASID is allocated from an ASID bitmap and VERSION is provided by an atomic
>> counter. At time of allocating new CONTEXTID, if we run out of available
>> ASIDs then:
>> 1. We flush the ASID bitmap
>> 2. Increment current VERSION atomic counter
>> 3. Re-allocate ASID from ASID bitmap
>> 4. Flush TLB on all CPUs
>> 5. Try CONTEXTID re-assignment on all CPUs
>> 
>> Please note that we don't use ASID #0 because it is used at boot-time by
>> all CPUs for initial MM context. Also, newly created context is always
>> assigned CONTEXTID #0 (i.e. VERSION #0 and ASID #0) which is an invalid
>> context in our implementation.
>> 
>> Using above approach, we have virtually infinite CONTEXTIDs on-top-of
>> limited number of HW ASIDs. This approach is inspired from ASID allocator
>> used for Linux ARM/ARM64 but we have adapted it for RISC-V. Overall, this
>> ASID allocator helps us reduce rate of local TLB flushes on every CPU
>> thereby increasing performance.
>> 
>> This patch is tested on QEMU/virt machine and SiFive Unleashed board. On
>> QEMU/virt machine, we see 10% (approx) performance improvement with SW
>> emulated TLBs provided by QEMU. Unfortunately, ASID bits of SATP CSR are
>> not implemented on SiFive Unleashed board so we don't see any change in
>> performance.
> 
> My worry here is the testing: I don't trust QEMU to be a good enough test of
> ASID handling to shake out the bugs in this sort of code -- unless I'm missing
> something, we're currently ignoring ASIDs in QEMU entirely.  As a result I'd
> consider this to be essentially untestable until someone comes up with an
> implementation that takes advantage of ASIDs.  Given that bugs here would be
> super hard to find, I'd prefer to avoid merging this until we're sure it's
> solid.

I agree. Not merging code until there are proofs in the form of independently 
verifiable tests is a good idea and can “cause no harm”. As long as folk know 
where the branch is, they can try it out from a testing or experimental branch. 
This sounds “experimental”. If it breaks mm on hardware supporting ASID, as 
mentioned in another email on this thread, then it’s perhaps better described as 
“very experimental”.

QEMU has no support for ASID and will just unconditionally flush the soft-tlb, 
which is a valid behavior but not helpful for testing ASID. It could hide 
serious bugs where the TLB is left in an inconsistent state, potentially 
leaking privileged data. I would want Linux user-space context switch tests 
with two or more processes created using clone from init pid 1 so there can be 
no interference from daemons that exist on a full system. We can launch the 
test cases as pid 1 using init= in hardware or a simulator.
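
A minimal sketch of the kind of two-process test described above, under stated assumptions: fork() stands in for a direct clone call, the name asid_pingpong_test and the tag constants are hypothetical, and pinning both processes to one hart and launching as pid 1 via init= are left out. Both processes keep different data at the same virtual address and hand control back and forth over pipes, so a stale translation that survives a context switch would show up as the other process's value.

```c
#define _DEFAULT_SOURCE
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define PARENT_TAG 0xFACE0000ULL   /* hypothetical marker values */
#define CHILD_TAG  0xC0DE0000ULL

/* Returns 0 if neither process ever read a wrong value, -1 otherwise. */
int asid_pingpong_test(int rounds)
{
    int to_child[2], to_parent[2], status, bad = 0;
    char token = 0;
    pid_t pid;

    /* Mapped before fork, so both processes use the same virtual address;
     * the first write after fork gives each its own physical page. */
    volatile uint64_t *page = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED)
        return -1;
    if (pipe(to_child) != 0 || pipe(to_parent) != 0)
        return -1;

    pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        close(to_child[1]);
        close(to_parent[0]);
        for (int i = 0; i < rounds; i++) {
            page[0] = CHILD_TAG + (uint64_t)i;
            if (read(to_child[0], &token, 1) != 1)   /* sleep until parent ran */
                _exit(2);
            if (page[0] != CHILD_TAG + (uint64_t)i)  /* must still be our data */
                _exit(1);
            if (write(to_parent[1], &token, 1) != 1)
                _exit(2);
        }
        _exit(0);
    }

    close(to_child[0]);
    close(to_parent[1]);
    for (int i = 0; i < rounds && !bad; i++) {
        page[0] = PARENT_TAG + (uint64_t)i;
        if (write(to_child[1], &token, 1) != 1 ||
            read(to_parent[0], &token, 1) != 1 ||    /* sleep until child ran */
            page[0] != PARENT_TAG + (uint64_t)i)
            bad = 1;
    }
    close(to_child[1]);
    close(to_parent[0]);
    if (waitpid(pid, &status, 0) != pid)
        bad = 1;
    munmap((void *)page, 4096);

    return (!bad && WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

A passing run proves little on a target that flushes the TLB on every switch; the test only becomes meaningful on hardware or a simulator that actually tags TLB entries with ASIDs.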

I think QEMU could potentially map some ASID bits to its soft-tlb. The soft-tlb 
tag bits are limited but it’s possible to customize. That said, it should run 
just fine in spike using CLINT and HTIF. mm tests don’t need PLIC.

I know SiFive is in the business of selling formally verified RISC-V hardware 
and they have sophisticated in-house verification for their cores, but, … from 
Googling a bit, one can quickly see there is demand for more open source 
verification. Trust is important but we can’t really encourage a “trust us” as 
the basis for merging code when the current software-engineering norm is to 
have automated tests in CI.

My 2c.

>> Co-developed-by: Gary Guo 
>> Signed-off-by: Anup Patel 
>> ---
>> Changes since v2:
>> - Move to lazy TLB flushing because we get slow path warnings if we
>> use flush_tlb_all()
>> - Don't set ASID bits to all 1s in head.s. Instead just do it on
>> boot CPU calling asids_init() for determining number of HW ASID bits
>> - Make CONTEXT version comparison more readable in set_mm_asid()
>> - Fix typo in __flush_context()
>> 
>> Changes since v1:
>> - We adapt good aspects from Gary Guo's ASID allocator implementation
>> and provide due credit to him by adding his SoB.
>> - Track ASIDs active during context flush and mark them as reserved
>> - Set ASID bits to all 1s to simplify number of ASID bit detection
>> - Use atomic_long_t instead of atomic64_t for being 32bit friendly
>> - Use unsigned long instead of u64 for being 32bit friendly
>> - Use flush_tlb_all() 
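
The CONTEXTID scheme from the patch description (ASID in the lower bits, VERSION in the upper bits, rollover when the bitmap is exhausted) can be sketched as a small single-threaded model. This is not the patch's code: the names asid_model, asid_model_init and asid_model_get are hypothetical, ASID_BITS is fixed at 4 instead of being probed from the SATP CSR, and the atomics, per-CPU active ASIDs and reserved-ASID tracking mentioned in the changelog are omitted.

```c
#include <stdbool.h>
#include <string.h>

#define ASID_BITS  4UL                  /* assumption; real width is detected at boot */
#define NUM_ASIDS  (1UL << ASID_BITS)
#define ASID_MASK  (NUM_ASIDS - 1)

struct asid_model {
    unsigned long version;              /* current VERSION, already shifted above the ASID bits */
    bool used[NUM_ASIDS];               /* ASID bitmap */
    bool flush_pending;                 /* stands in for "flush TLB on all CPUs" */
};

void asid_model_init(struct asid_model *m)
{
    memset(m, 0, sizeof(*m));
    m->version = NUM_ASIDS;             /* VERSION #1; VERSION #0 marks an invalid context */
    m->used[0] = true;                  /* ASID #0 is reserved for the boot-time context */
}

/* Return a valid CONTEXTID for an MM context whose stored id is cntx.
 * cntx == 0 means "newly created context, nothing allocated yet". */
unsigned long asid_model_get(struct asid_model *m, unsigned long cntx)
{
    unsigned long asid;

    /* Fast path: an id from the current VERSION is still valid. */
    if (cntx != 0 && (cntx & ~ASID_MASK) == m->version)
        return cntx;

    /* Look for a free ASID, skipping the reserved ASID #0. */
    for (asid = 1; asid < NUM_ASIDS && m->used[asid]; asid++)
        ;

    if (asid == NUM_ASIDS) {
        /* Out of ASIDs: clear the bitmap, bump VERSION, request a TLB flush. */
        memset(m->used, 0, sizeof(m->used));
        m->used[0] = true;
        m->version += NUM_ASIDS;
        m->flush_pending = true;
        asid = 1;
    }

    m->used[asid] = true;
    return m->version | asid;
}
```

In this model every context holding an id from an older VERSION falls off the fast path after a rollover and is given a fresh ASID the next time it is switched in, which corresponds to the re-assignment step in the list above.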

Re: [PATCH] of: Respect CONFIG_CMDLINE{,_EXTEND,_FORCE} with no chosen node

2018-03-18 Thread Michael Clark


> On 18/03/2018, at 5:51 AM, Rob Herring  wrote:
> 
> On Wed, Mar 14, 2018 at 09:31:05AM -0700, Palmer Dabbelt wrote:
>> Systems that boot without a chosen node in the device tree should still
>> respect the command lines that are built into the kernel.  This patch
>> avoids bailing out of the command line argument parsing code when there
>> is no chosen node in the device tree.  This is necessary to boot
>> straight to a root file system (ie, without an initramfs)
>> 
>> The intent here is that the only functional change is to copy
>> CONFIG_CMDLINE into data if both of them are valid (ie, CONFIG_CMDLINE
>> is defined and data is non-null) on systems where there is no chosen
>> device tree node.  I don't actually know what the return values do so I
>> just preserved them.
>> 
>> Thanks to Trung and Moritz for finding the bug during our ELC hackathon
>> (and providing the first fix), and Michal for suggesting this fix (which
>> is cleaner than what I was doing).  I've given this very minimal
>> testing: it fixes the RISC-V bug (in conjunction with a patch to move
>> from COMMANDLINE_OVERRIDE to COMMANDLINE_FORCE), still works on systems
>> with and without the chosen node, and builds on ARM64.
>> 
>> CC: Michael J Clark 
>> CC: Trung Tran 
>> CC: Moritz Fischer 
>> Signed-off-by: Palmer Dabbelt 
>> ---
>> drivers/of/fdt.c | 14 ++++++++++++--
>> 1 file changed, 12 insertions(+), 2 deletions(-)
>> 
>> diff --git a/drivers/of/fdt.c b/drivers/of/fdt.c
>> index 84aa9d676375..60241b1cb024 100644
>> --- a/drivers/of/fdt.c
>> +++ b/drivers/of/fdt.c
>> @@ -1084,9 +1084,12 @@ int __init early_init_dt_scan_chosen(unsigned long 
>> node, const char *uname,
>> 
>>  pr_debug("search \"chosen\", depth: %d, uname: %s\n", depth, uname);
>> 
>> -if (depth != 1 || !data ||
>> +if (!data)
>> +goto no_data;
> 
> Just "return 0" here.
> 
>> +
>> +if (depth != 1 ||
>>  (strcmp(uname, "chosen") != 0 && strcmp(uname, "chosen@0") != 0))
>> -return 0;
>> +goto no_chosen;
>> 
>>  early_init_dt_check_for_initrd(node);
>> 
>> @@ -1117,6 +1120,13 @@ int __init early_init_dt_scan_chosen(unsigned long 
>> node, const char *uname,
>> 
>>  /* break now */
>>  return 1;
>> +
>> +no_chosen:
>> +#ifdef CONFIG_CMDLINE
>> +strlcpy(data, CONFIG_CMDLINE, COMMAND_LINE_SIZE);
> 
> Best case, this is going to needlessly copy the string on every single 
> node that is not /chosen.
> 
> Worst case, I think this changes behavior. For example, first you copy 
> CONFIG_CMDLINE into data, then on a later iteration, you strlcat 
> CONFIG_CMDLINE into data if CONFIG_CMDLINE_EXTEND is enable. There could 
> also be some unintended behavior if data has a string to start with. 
> 
> I'd really like to see this function re-written to just find the /chosen 
> node and then handle each property one by one. The iterating approach is 
> silly. I assume it predates libfdt and we didn't have nice functions to 
> find nodes by path (or any other way).

Thanks very much for the feedback.

I think we had the right intent, which was to fix the issue in the generic 
device tree code rather than add an arch-specific hack. Previously we had code 
in arch/riscv/kernel/setup.c which would copy the built-in command line but 
this clashed with the architecture neutral code, which resulted in appending 
the compiled-in command line twice if the chosen node was present in device-tree. 
I’m adding Wes to the Cc as he has recently changed the HiFive Unleashed 
device-tree to include an empty chosen node which is another workaround for the 
issue.

We weren’t aware that this code was called in a loop for each device-tree node. 
I guess one possibility is to hoist the CONFIG_CMDLINE code out of 
early_init_dt_scan_chosen into early_init_dt_scan and have 
early_init_dt_scan_chosen append bootargs or optionally ignore it if 
CONFIG_CMDLINE_OVERRIDE is set.

> I'm working on a patch to re-structure this function. Will send it out 
> in the next day.

Thanks!

It seems logical that the architecture neutral code should set the compiled in 
command-line if CONFIG_CMDLINE_OVERRIDE is set, whether or not a “chosen” node 
is present. So we would prefer a fix in the device-tree code over an 
architecture specific workaround. The problem with an architecture specific 
workaround outside of device-tree code is that it doesn’t know whether the 
chosen node is present. I had previously submitted a patch to remove the 
architecture specific command line code because it duplicated code in 
drivers/of/fdt.c and resulted in the built-in command being appended twice if 
the chosen node was present. It was only later that we became aware that it was 
not possible to use the built-in command line if the “chosen” node was not 
present.
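
As a rough illustration of the behaviour argued for above, here is a userspace model in which the built-in command line is applied exactly once, whether or not the “chosen” node supplied bootargs. It is a sketch, not the drivers/of/fdt.c code: resolve_cmdline and the builtin_mode enum are hypothetical names, the three modes loosely mirror CONFIG_CMDLINE{,_EXTEND,_FORCE}, and a NULL bootargs pointer stands for "no chosen node or no bootargs property".

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

enum builtin_mode {
    BUILTIN_DEFAULT,   /* use the built-in line only if the boot loader gave none */
    BUILTIN_EXTEND,    /* append the built-in line to the boot loader's */
    BUILTIN_FORCE      /* always use the built-in line */
};

/* Write the effective command line into out (at most size bytes, always
 * NUL-terminated when size > 0). */
void resolve_cmdline(char *out, size_t size, const char *bootargs,
                     const char *builtin, enum builtin_mode mode)
{
    bool have_boot = bootargs != NULL && bootargs[0] != '\0';
    bool have_builtin = builtin != NULL && builtin[0] != '\0';

    if (size == 0)
        return;
    out[0] = '\0';

    if (have_builtin && mode == BUILTIN_FORCE)
        snprintf(out, size, "%s", builtin);
    else if (have_builtin && have_boot && mode == BUILTIN_EXTEND)
        snprintf(out, size, "%s %s", bootargs, builtin);
    else if (have_boot)
        snprintf(out, size, "%s", bootargs);
    else if (have_builtin)
        snprintf(out, size, "%s", builtin);   /* no chosen node: fall back */
}
```

Because the decision is made once from the final inputs rather than once per visited node, the built-in string cannot be copied or appended twice in this model.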

>> +#endif
>> +no_data:
>> +return 0;
>> }
>> 
>> #ifdef CONFIG_HAVE_MEMBLOCK
>> -- 
>> 2.16.1



Re: hi-res mtime userspace interface

2008-01-20 Thread Michael Clark

Stephen Hemminger wrote:

> Look at stat.


Thanks. OK that was what I wanted. I hadn't looked further than man 2 
stat - I think the stat man page needs an update.



In /usr/include/bits/stat.h:
struct stat
  {
__dev_t st_dev; /* Device.  */
...
#ifdef __USE_MISC
/* Nanosecond resolution timestamps are stored in a format
   equivalent to 'struct timespec'.  This is the type used
   whenever possible but the Unix namespace rules do not allow the
   identifier 'timespec' to appear in the <sys/stat.h> header.
   Therefore we have to handle the use of this header in strictly
   standard-compliant sources special.  */
struct timespec st_atim;/* Time of last access.  */
struct timespec st_mtim;/* Time of last modification.  */
struct timespec st_ctim;/* Time of last status change.  */
# define st_atime st_atim.tv_sec/* Backward compatibility.  */
# define st_mtime st_mtim.tv_sec
# define st_ctime st_ctim.tv_sec
#else
__time_t st_atime;  /* Time of last access.  */
unsigned long int st_atimensec; /* Nsecs of last access.  */
__time_t st_mtime;  /* Time of last modification.  */
unsigned long int st_mtimensec; /* Nsecs of last modification.  */
__time_t st_ctime;  /* Time of last status change.  */
unsigned long int st_ctimensec; /* Nsecs of last status change.  */
#endif
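
For reference, a small sketch of reading the nanosecond field through the struct timespec members shown in the header excerpt. The helper name file_mtime_ns is made up for illustration, and the code assumes st_mtim is visible to the program (the __USE_MISC case above, or a POSIX.1-2008 environment); whether the sub-second part is non-zero depends on the filesystem.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/stat.h>
#include <time.h>

/* Copy the last-modification time of path, including nanoseconds, into *out.
 * Returns 0 on success and -1 if stat() fails (errno is left as stat set it). */
int file_mtime_ns(const char *path, struct timespec *out)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return -1;

    *out = st.st_mtim;   /* tv_sec is what st_mtime aliases; tv_nsec is the extra precision */
    return 0;
}
```

The same pattern applies to st_atim and st_ctim.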
  


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


hi-res mtime userspace interface

2008-01-20 Thread Michael Clark
Is there an existing linux userspace interface for accessing the 
microsecond or nanosecond level (a|m|c)times of filesystems that support 
them (e.g. ext4, xfs)? and possibly also the generation counters used by 
NFS.


I notice sys_utimes is able to set microsecond (c|a|m)times but I can't 
find an associated interface to read them (I've googled to no avail).


(also noticing freebsd and darwin's struct stat contains struct timespec 
for these as well as an st_gen field - although I do realise how 
incredibly difficult it would be to add these to linux's struct stat).


~mc



Re: HELP: NFS mount hangs when attempting to copy file

2005-07-24 Thread Michael Clark
Timothy Miller wrote:

>On 7/23/05, Trond Myklebust <[EMAIL PROTECTED]> wrote:
>
>  
>
>>I beg to disagree. A lot of these VPN solutions are unfriendly to MTU
>>path discovery over UDP. Sun uses TCP by default when mounting NFS
>>partitions. Have you tried this on your Linux box?
>>
>>
>
>I changed the protocol to TCP and changed rsize and wsize to 1024.  I
>don't know which of those fixed it, but I'm going to leave it for now.
>
>As for MTU, yeah, the Watchguard box seems to have some hard-coded
>limits, and for whatever reason KDE and GNOME graphical logins do
>something that exceeds those limits, completely independent of NFS,
>and hang up hard.
>  
>

If possible it would also be good to fix the misconfigured VPN box
that's breaking the PMTU discovery (usually it's too
aggressive blocking of ICMP messages). That said, an rsize/wsize of 1024
and using TCP probably makes sense for NFS over a VPN (avoid
fragmentation overhead and let TCP handle the retransmission).

My guess is the Watchguard is blocking ICMP "fragmentation needed"
messages (which is resulting in the PMTU discovery breakage).

Enabling ICMP "fragmentation needed" messages to pass through the VPN's
firewall, if you can, should fix your other problems as well.

~mc


Re: I have centrino laptop with no freq/voltage tables in BIOS

2005-07-12 Thread Michael Clark
Mariusz Gniazdowski wrote:

>Hi.
>I have centrino laptop with no built-in frequency/voltage pairs in
>BIOS/ACPI. I have found this thread:
>
>http://lkml.org/lkml/2005/7/6/101
>
>  
>
If you read the thread more closely you'll become aware that the static
table approach is not really practical. There is no way to find out at
runtime what voltage variant of Dothan chip your machine has
(VID#A, VID#B, VID#C or VID#D). I became aware of that myself after
creating a static table patch (which I can send you offlist if you wish,
although you risk running your chip at the wrong voltage unless you know
which variant of Dothan chip your manufacturer has used).

~mc


Re: CacheFS

2001-06-08 Thread Michael Clark

"Albert D. Cahalan" wrote:
> 
> Jan Kasprzak writes:
> 
> > Another goal is to use the Linux filesystem
> > as a backing store (as opposed to the block device or single large file
> > used by CODA).
> ...
> > - kernel module, implementing the filesystem of the type "cachefs"
> >   and a character device /dev/cachefs
> > - user-space daemon, which would communicate with the kernel
> >   over /dev/cachefs and which would manage the backing store
> >   in a given directory.
> >
> >   Every file on the front filesystem (NFS or so) volume will be cached
> > in two local files by cachefsd: The first one would contain the (parts of)
> ...
> > * Should the cachefsd be in user space (as it is in the prototype
> > implementation) or should it be moved to the kernel space? The
> > former allows probably better configuration (maybe a deeper
> > directory structure in the backing store), but the later is
> > faster as it avoids copying data between the user and kernel spaces.
> 
> I think that, if speed is your goal, you should have the kernel
> code use swap space for the cache. Look at what tmpfs does, but
> running over top of tmpfs leaves you with the overhead of running
> two filesystems and a daemon. It is better to be direct.

So how would you get persistent caching across reboots, which is one of
the major advantages of a cachefs type filesystem? I guess you could tar
the cache on startup and shutdown, although that would be a little slow :).

I think 'speed' here means faster than NFS or other network filesystems
- you obviously have the overhead of network traffic for cache-coherency
but can avoid a lot of data transfer (even after a reboot).

~mc



RE: Real Time Traffic Flow Measurement - anybody working on it?

2001-04-19 Thread Michael Clark

Can't say I'm actively working on it but I've emailed Nevil to see if
he knows of any RTFM work that is being done on Linux.

Although here are some observations:

Userspace pcap meters (such as NeTraMet) can measure traffic the IP stack
doesn't even see (useful for a probe on a span port for instance) -
can also meter other protocols.

An obvious kernel improvement for userspace meters like NeTraMet would
be to give libpcap's pcap_read a kernel interface that can return more
than one packet at a time (the libpcap interface has this capability).
The current linux libpcap implementation incurs a read syscall per
packet.

An additional feature for network devices that could support it (not
sure if this is feasible) would be to switch to an 'interrupt when
packet buffer full' when in promiscuous mode.

A libpcap comment lists a bug in packet socket that means you need to
read the entire packet to get the packet length (which I assume means
meters like NeTraMet are forced to set the snaplen really high) -
which is a performance penalty if you just want to read headers but
also need the packet length.

from libpcap-0.6.2/pcap-linux.c

/*
 * XXX: According to the kernel source we should get the real
 * packet len if calling recvfrom with MSG_TRUNC set. It does
 * not seem to work here :(, but it is supported by this code
 * anyway.
 * To be honest the code RELIES on that feature so this is really
 * broken with 2.2.x kernels.
 * I spend a day to figure out what's going on and I found out
 * that the following is happening:
 *
 * The packet comes from a random interface and the packet_rcv
 * hook is called with a clone of the packet. That code inserts
 * the packet into the receive queue of the packet socket.
 * If a filter is attached to that socket that filter is run
 * first - and there lies the problem. The default filter always
 * cuts the packet at the snaplen:
 *
 * # tcpdump -d
 * (000) ret  #68
 *
 * So the packet filter cuts down the packet. The recvfrom call
 * says "hey, it's only 68 bytes, it fits into the buffer" with
 * the result that we don't get the real packet length. This
 * is valid at least until kernel 2.2.17pre6.
 *
 * We currently handle this by making a copy of the filter
 * program, fixing all "ret" instructions with non-zero
 * operands to have an operand of 65535 so that the filter
 * doesn't truncate the packet, and supplying that modified
 * filter to the kernel.
 */
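
The workaround in the last paragraph of that comment can be sketched as follows. This is an illustration rather than libpcap's own routine: the struct and the SK_* macros are local stand-ins that mirror the classic BPF instruction layout and opcode encoding, widen_ret_insns is a hypothetical name, and the caller is assumed to pass in its own copy of the filter program.

```c
#include <stddef.h>
#include <stdint.h>

/* Local mirror of a classic BPF instruction (assumed to match the kernel's
 * 16-bit opcode, two 8-bit jump offsets and 32-bit operand). */
struct bpf_insn_sketch {
    uint16_t code;
    uint8_t  jt;
    uint8_t  jf;
    uint32_t k;
};

#define SK_BPF_CLASS(code) ((code) & 0x07)   /* instruction class bits */
#define SK_BPF_RVAL(code)  ((code) & 0x18)   /* return-value source bits */
#define SK_BPF_RET         0x06              /* "ret" class */
#define SK_BPF_K           0x00              /* return the constant in k */
#define SK_MAX_SNAPLEN     65535u

/* Rewrite every "ret #k" with non-zero k to "ret #65535" so the filter still
 * accepts or rejects the same packets but no longer truncates them.
 * Returns the number of instructions rewritten. */
size_t widen_ret_insns(struct bpf_insn_sketch *prog, size_t len)
{
    size_t patched = 0;

    for (size_t i = 0; i < len; i++) {
        struct bpf_insn_sketch *p = &prog[i];

        if (SK_BPF_CLASS(p->code) != SK_BPF_RET)
            continue;
        if (SK_BPF_RVAL(p->code) != SK_BPF_K)
            continue;                 /* "ret a" and similar carry no constant */
        if (p->k == 0)
            continue;                 /* "ret #0" means drop; keep it */

        p->k = SK_MAX_SNAPLEN;
        patched++;
    }
    return patched;
}
```

Applied to the one-instruction default filter shown in the comment, `ret #68` becomes `ret #65535`, so the length reported for the packet is no longer capped by the filter.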

~mc

> -Original Message-
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED]]On Behalf Of
> Manfred Bartz
> Sent: Thursday, 19 April 2001 12:16 p.m.
> To: [EMAIL PROTECTED]
> Subject: Real Time Traffic Flow Measurement - anybody working on it?
>
>
> Through the stimulating discussion we had under ``IP Acounting
> Idea for 2.5'', it appears that a separate Traffic Flow Measure-
> ment and Accounting sub-system would be useful. See:
> 
>
> If anybody is working on Real Time Traffic Flow Measurement (RTFM)
> please reply.
>
> I would also like to know if there are any objections to providing
> a RTFM interface in the kernel (as an optional module).
>
> FYI:
>
> Considerable work has already been done on RTFM in general and
> for other systems:
> 
> 
>
> Relevant RFCs include: 2720 ... 2724
>
> Thanks
> --
> Manfred Bartz




RE: IP Acounting Idea for 2.5

2001-04-18 Thread Michael Clark


> I repeat myself, fighting is apparently so pleasant that
> you are stuck on
> fighting over dead-end technology:
>
>   I seriously suggest that for the primary (subject given) topic
>   you are SERIOUSLY OFF TARGET.  Look around, counting hits on
>   some fw rules is waste of time!  (And mightly inaccurate!)

I agree. We could all stop re-inventing the wheel and use a
RFC2724/RFC2722/RFC2720 compliant traffic meter such as NeTraMet -
which has already solved most of the mentioned problems - has a
flexible rule language for matching flows and managing counters -
support for multiple protocols; not just IP - a distributed
architecture - SNMP accessible meter and remote Manager and Controller
(NeMaC) which can concurrently read from multiple meters (including
NetFlow meters).

>   You absolutely don't want to do any sort of counting
> aggeration policy
>   control within kernel ( = FW rules ).   You want to
> collect accounting
>   per flow, and send those data records to offline analysis.

Yes, the IP accounting effort could do well by creating a fastpath for
feeding packet headers (can't say I know how optimal libpcap is
currently on Linux) to a userspace meter (like NeTraMet) letting it
deal with all of the policy.

I remember the DOS version of NeTraMet performed much better than the
Linux version (some years ago) due to custom ethernet drivers (for
some cards) that only generate interrupts when a ring buffer is full of
packet headers - maybe the same sort of infrastructure (some of the
Linux GigE drivers also avoid the interrupt per packet performance hit)
could be added to Linux and integrated with libpcap and leave the rest
up to a userspace meter application.

I'm sure Nevil (traffic metering god - Hi Nevil) would be pleased
to have optimized support for NeTraMet in the Linux kernel.

>   No more fighting of when to clear counters, and when not.
>
>   Having used (with own custom analyzers) cisco netflow, I can say
>   that any sort of "count hits on access-list elements" things are
>   from stone-age:

NetFlow really sucks a lot, doesn't it? I remember having bad aliasing
problems (trying to generate 5min averages) due to its minimum flow
export interval of 1 minute. Is this still the case?

Michael Clark.




RE: IP Acounting Idea for 2.5

2001-04-15 Thread Michael Clark

> In the 2.5 series of kernels, working towards 2.6, could you please make the
> IP Accounting so that I can set a single rule that will make it watch all IP
> traffic going from the local network, through the masquerading service to the
> internet, and log local IP Addresses using it? This would allow me to set 1
> rule, but have the information I want on a per IP address system.

You could try using a mature userspace traffic meter like 'NeTraMet' (uses
libpcap).

ftp://ftp.auckland.ac.nz/pub/iawg/NeTraMet/

> One other person I have talked to would like to see this too, and he
> basically says we need a software version of the Cisco IP Accounting
> server/router.

NeTraMet can also account using Cisco Netflow accounting records.

> Could you please add this to the next kernel? Please CC me your responses as
> I am not a member of the kernel mailing list. Thanks,
>
> David
