Re: question about parallelism in cp command

2019-06-28 Thread L A Walsh



On 2019/06/28 04:52, Marc Roos wrote:
>  
> There are always exceptions, such as clustered filesystems. That is
> why I wrote 'most used'. If you take all the 'cp' commands issued in
> the world today, I would bet 80%-95% of them would not benefit from
> any sort of parallel processing.

Single disks already benefit from some parallel processing, and
could benefit more as the write path and the on-disk cache
are scaled up.  That's why many hard disks are moving to an SSD+HD combo,
with an NVMe SSD being able to handle near-memory speeds of up to
64K/microsecond.

The benefits of parallelism involve being able to order
reads and writes to minimize the need for disk seeking and start/stop
overhead and moving to disk streaming of tracks where disk speeds can
begin to reach I/O transfer limits.  

With higher capacities comes a higher write speed, since the disks
run at the same linear speed / technology.  The idea is for "cp -r" to
take 1/10th the iops to copy the same data because the OS and the drivers
can reorganize the requests, but that can only be done if all of the data on
a track can be rewritten.  Since the data on a track rarely even comes from the
same file, you need multiple threads to 1) read all the separate files storing
data in a track, and 2) write all the separate files storing something in the
target.

The key is scaling memory usage to allow a thread to completely
fill its memory buffer between reads or writes to the device.  Unfortunately,
cp rarely uses the memory it could, due to concerns about evicting some cache that
may be used sometime in the future...someday.  A tunable might decide how
much memory to allocate to something like cp, so that it could write out entire
files in 1 iop (if the driver allows).
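
As a rough illustration of such a tunable, here is a minimal Python sketch
(not cp's actual code). The function name copy_with_buffer and the parameter
buffer_bytes are hypothetical, and the 64 MiB default is an arbitrary
assumption. A file smaller than the buffer is read with one read() call, but
whether that becomes a single device operation is up to the kernel and the
driver.

```python
import shutil


def copy_with_buffer(src, dst, buffer_bytes=64 * 1024 * 1024):
    """Copy src to dst, moving up to buffer_bytes per read/write call.

    buffer_bytes is the hypothetical tunable: a larger value means fewer
    read()/write() calls, at the cost of more memory held by the copy.
    """
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        # copyfileobj loops read(length)/write() until the source is exhausted
        shutil.copyfileobj(fin, fout, length=buffer_bytes)
```

Calling copy_with_buffer("a.bin", "b.bin", buffer_bytes=1 << 30) would let a
file of up to 1 GiB pass through a single buffer fill.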

This type of throughput might involve regular defragmenting of disks
to allow multiple files to be transferred to/from disk at once if they were all small
enough to fit on, say, 1 track.  But for that, a demand for all of those files
needs to be there, so that the underlying fs-drivers can r/w multiple full tracks
at a time while performing only 1 iop to write multiple tracks.


> 
> 
> -Original Message-
> From: L A Walsh [mailto:coreut...@tlinx.org] 
> Sent: Friday 28 June 2019 13:15
> To: Marc Roos
> Cc: aglo; coreutils
> Subject: Re: question about parallelism in cp command
> 
> On 2019/06/06 09:25, Marc Roos wrote:
>>  
>> Hmmm, without being a maintainer, I would say cp -r is mostly used on a
>> single disk, so one thread is already using the maximum disk iops, taking
>> y time to copy.
> ---
> not exactly true, if the 1 disk is a 20-disk raid10.
> 
> You can target 10 areas at a time and get considerable benefit if they 
> are spread across multiple disks in the raid.
> 
> 
> 



Re: question about parallelism in cp command

2019-06-28 Thread Michael Stone

On Fri, Jun 28, 2019 at 04:15:22AM -0700, L A Walsh wrote:

> You can target 10 areas at a time and get considerable benefit
> if they are spread across multiple disks in the raid.


Alternatively, the kernel can hide this behind readahead.
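
For reference, a single-threaded reader can also ask the kernel for more
aggressive readahead. The sketch below is only an illustration in Python:
the function name read_sequential and the 1 MiB chunk size are assumptions,
and os.posix_fadvise is only available on Unix-like systems. The hint is
advisory, so the kernel is free to ignore it.

```python
import os


def read_sequential(path, chunk_bytes=1024 * 1024):
    """Read a file front to back after hinting sequential access.

    Returns the number of bytes read.
    """
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        if hasattr(os, "posix_fadvise"):
            # offset 0 with length 0 applies the advice to the whole file
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
        while True:
            chunk = os.read(fd, chunk_bytes)
            if not chunk:
                break
            total += len(chunk)
    finally:
        os.close(fd)
    return total
```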



RE: question about parallelism in cp command

2019-06-28 Thread Marc Roos
 
There are always exceptions, such as clustered filesystems. That is
why I wrote 'most used'. If you take all the 'cp' commands issued in
the world today, I would bet 80%-95% of them would not benefit from
any sort of parallel processing.


-Original Message-
From: L A Walsh [mailto:coreut...@tlinx.org] 
Sent: Friday 28 June 2019 13:15
To: Marc Roos
Cc: aglo; coreutils
Subject: Re: question about parallelism in cp command

On 2019/06/06 09:25, Marc Roos wrote:
>  
> Hmmm, without being a maintainer, I would say cp -r is mostly used on a
> single disk, so one thread is already using the maximum disk iops, taking
> y time to copy.
---
not exactly true, if the 1 disk is a 20-disk raid10.

You can target 10 areas at a time and get considerable benefit if they 
are spread across multiple disks in the raid.






Re: question about parallelism in cp command

2019-06-28 Thread L A Walsh
On 2019/06/06 09:25, Marc Roos wrote:
>  
> Hmmm, without being a maintainer, I would say cp -r is mostly used on a
> single disk, so one thread is already using the maximum disk iops, taking
> y time to copy.
---
not exactly true, if the 1 disk is a 20-disk raid10.

You can target 10 areas at a time and get considerable benefit
if they are spread across multiple disks in the raid.
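
A minimal Python sketch of "targeting 10 areas at a time" for one large
file, assuming a Unix-like system (os.pread and os.pwrite). The function
name copy_in_regions, the 1 MiB transfer size and the default of 10 regions
are illustrative assumptions. Whether the regions actually land on different
disks depends on the raid layout, which this code knows nothing about.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def copy_in_regions(src, dst, regions=10, workers=10):
    """Copy src to dst by splitting it into regions copied concurrently."""
    size = os.path.getsize(src)
    src_fd = os.open(src, os.O_RDONLY)
    dst_fd = os.open(dst, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.ftruncate(dst_fd, size)            # size the target up front
        step = max(1, -(-size // regions))    # ceiling division

        def copy_region(start):
            end = min(start + step, size)
            offset = start
            while offset < end:
                # positional I/O: no shared file offset between threads
                data = os.pread(src_fd, min(1 << 20, end - offset), offset)
                if not data:
                    break
                written = 0
                while written < len(data):
                    written += os.pwrite(dst_fd, data[written:], offset + written)
                offset += len(data)

        with ThreadPoolExecutor(max_workers=workers) as pool:
            # list() waits for all regions and re-raises any I/O error
            list(pool.map(copy_region, range(0, size, step)))
    finally:
        os.close(src_fd)
        os.close(dst_fd)
    return size
```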




Re: question about parallelism in cp command

2019-06-06 Thread Olga Kornievskaia
On Thu, Jun 6, 2019 at 2:44 PM Assaf Gordon  wrote:
>
> > -Original Message-
> > From: Olga Kornievskaia [mailto:a...@umich.edu]
> >
> > Is there something philosophically incorrect in making a “cp”
> > multi-threaded and allow for parallel copies when “cp -r” is done? If
> > it’s something that’s possible, are there any plans in making a
> > multi-threaded cp?
>
> On Thu, Jun 06, 2019 at 02:17:40PM -0400, Olga Kornievskaia wrote:
> > The use case I'm considering is network file systems. So perhaps a
> > default can be a single threaded system for the local filesystems but
> > add an option to cp for the -r case that would enable network file
> > system to copy files in parallel.
>
> In an interesting coincidence, see recent post by Paul Kolano here:
> https://lists.gnu.org/archive/html/coreutils/2019-06/msg00011.html
>
> (Note that his suggestions have not been reviewed yet, so this is
> neither endorsement nor criticism of his code.)
>

Interesting! Thank you for the link (since I'm not on the mailing
list). I'm going to try out this code and see how it performs (Thank
you Paul Kolano). It would be great if the maintainers of the
coreutils would consider adding this multi-threaded cp functionality
in.



Re: question about parallelism in cp command

2019-06-06 Thread Assaf Gordon
> -Original Message-
> From: Olga Kornievskaia [mailto:a...@umich.edu]
>
> Is there something philosophically incorrect in making a “cp”
> multi-threaded and allow for parallel copies when “cp -r” is done? If
> it’s something that’s possible, are there any plans in making a
> multi-threaded cp?

On Thu, Jun 06, 2019 at 02:17:40PM -0400, Olga Kornievskaia wrote:
> The use case I'm considering is network file systems. So perhaps a
> default can be a single threaded system for the local filesystems but
> add an option to cp for the -r case that would enable network file
> system to copy files in parallel.

In an interesting coincidence, see recent post by Paul Kolano here:
https://lists.gnu.org/archive/html/coreutils/2019-06/msg00011.html

(Note that his suggestions have not been reviewed yet, so this is
neither endorsement nor criticism of his code.)

regards,
 - assaf



Re: question about parallelism in cp command

2019-06-06 Thread Olga Kornievskaia
The use case I'm considering is network file systems. So perhaps a
default can be a single threaded system for the local filesystems but
add an option to cp for the -r case that would enable network file
system to copy files in parallel.
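
To make the idea concrete, here is a minimal Python sketch of an opt-in
parallel recursive copy. It is not coreutils code and not the patch
mentioned elsewhere in this thread. The name parallel_copy_tree and the
workers parameter are hypothetical, and the sketch ignores symlinks, special
files and error recovery. With workers=1 it behaves like the single-threaded
default.

```python
import os
import shutil
from concurrent.futures import ThreadPoolExecutor


def parallel_copy_tree(src_root, dst_root, workers=4):
    """Recreate the directory tree serially, then copy files in parallel.

    Returns the number of files copied.
    """
    jobs = []
    for dirpath, _dirnames, filenames in os.walk(src_root):
        rel = os.path.relpath(dirpath, src_root)
        target_dir = os.path.normpath(os.path.join(dst_root, rel))
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            jobs.append((os.path.join(dirpath, name),
                         os.path.join(target_dir, name)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() waits for every copy and re-raises the first failure
        list(pool.map(lambda job: shutil.copy2(*job), jobs))
    return len(jobs)
```

On a network file system the threads mostly wait on round trips, so several
copies can be in flight at once.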

On Thu, Jun 6, 2019 at 12:25 PM Marc Roos  wrote:
>
>
> Hmmm, without being a maintainer, I would say cp -r is mostly used on a
> single disk, so one thread is already using the maximum disk iops, taking
> y time to copy. What would it solve to use multiple threads, each taking
> their share of the maximum disk iops and, because of the scheduling and
> other overhead, finishing later than y time?
>
>
>
> -Original Message-
> From: Olga Kornievskaia [mailto:a...@umich.edu]
> Sent: Thursday 6 June 2019 17:39
> To: coreutils@gnu.org
> Subject: question about parallelism in cp command
>
> Hi folks,
>
> Is there something philosophically incorrect in making a “cp”
> multi-threaded and allow for parallel copies when “cp -r” is done? If
> it’s something that’s possible, are there any plans in making a
> multi-threaded cp?
>
> I’m not a member of the list so I kindly request you cc me on the
> reply.
>
> Thank you.
>
>
>



RE: question about parallelism in cp command

2019-06-06 Thread Marc Roos
 
Hmmm, without being a maintainer, I would say cp -r is mostly used on a
single disk, so one thread is already using the maximum disk iops, taking
y time to copy. What would it solve to use multiple threads, each taking
their share of the maximum disk iops and, because of the scheduling and
other overhead, finishing later than y time?
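
The argument above can be written as a toy model. This Python sketch only
encodes the stated premise, namely that one thread already saturates a fixed
iops budget. The function name, the parameters and the linear overhead term
are assumptions for illustration, not measurements.

```python
def copy_time_seconds(total_ops, device_iops, threads=1, overhead_per_extra_thread=0.0):
    """Time to finish total_ops when threads split a fixed device_iops budget.

    Assumes the device is saturated by a single thread, so extra threads
    only divide the same budget and add scheduling overhead.
    """
    ops_per_thread = total_ops / threads
    iops_per_thread = device_iops / threads
    base = ops_per_thread / iops_per_thread   # equals total_ops / device_iops
    return base + overhead_per_extra_thread * (threads - 1)


y = copy_time_seconds(1000, 100)                         # one thread: y
slower = copy_time_seconds(1000, 100, threads=4,
                           overhead_per_extra_thread=0.5)
print(y, slower)
```

Under this premise the multi-threaded time can never drop below y. The model
says nothing about devices whose budget grows with the number of outstanding
requests.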



-Original Message-
From: Olga Kornievskaia [mailto:a...@umich.edu] 
Sent: Thursday 6 June 2019 17:39
To: coreutils@gnu.org
Subject: question about parallelism in cp command

Hi folks,

Is there something philosophically incorrect in making a “cp”
multi-threaded and allow for parallel copies when “cp -r” is done? If 
it’s something that’s possible, are there any plans in making a 
multi-threaded cp?

I’m not a member of the list so I kindly request you cc me on the 
reply.

Thank you.






question about parallelism in cp command

2019-06-06 Thread Olga Kornievskaia
Hi folks,

Is there something philosophically incorrect in making “cp”
multi-threaded and allowing parallel copies when “cp -r” is done? If
it’s something that’s possible, are there any plans to make a
multi-threaded cp?

I’m not a member of the list so I kindly request you cc me on the reply.

Thank you.