Re: [Unicon-group] [SPAM] Re: Walk of file directory

Jeffery, Clint ([email protected]) Fri, 23 Jan 2015 12:41:18 -0800

Are temp files persisting instead of being removed when the process terminates? 
That's a bad leak, if so. Or is it just that we have such large file systems 
now that the common prefix is killing us for the number of tempnames generated 
on a single run?

-------- Original message --------
From: Jafar Al-Gharaibeh <[email protected]>
Date: 01/23/2015 12:01 PM (GMT-05:00)
To: Wade <[email protected]>
Cc: Unicon group <[email protected]>
Subject: [SPAM] Re: [Unicon-group] Walk of file directory

Wade,

   _tempnam(dir, prefix) is provided by Windows; we just use it, and it turned 
out not to be smart at all, at least on my Windows 7 machine. However, our code 
that uses it could be made smarter by using a randomized prefix every time; 
that is one approach.
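
The randomized-name idea can be sketched at the Unicon level roughly as follows. This is only an illustration, not the runtime's C code: the procedure name randtemp, the "uni" prefix and the ".tmp" suffix are made up for the example, and it assumes that stat() fails for a file that does not exist. Depending on the version, &random may need seeding so that different runs do not produce the same sequence of names.

```unicon
# Hypothetical helper: build a temp file name with a random component
# and retry until stat() reports that no such file exists.
procedure randtemp(dir, prefix)
   local name
   repeat {
      # ?1000000 yields an integer in 1..1000000; pad it to seven digits
      name := dir || "/" || prefix || right(?1000000, 7, "0") || ".tmp"
      if not stat(name) then return name
      }
end

procedure main()
   local dir
   dir := getenv("TMP") | "."      # fall back to the current directory
   write(randtemp(dir, "uni"))
end
```

With random names the number of probes stays small even when the folder is crowded, whereas a fixed prefix forces a search through names that are already taken. There is still a window between the check and the actual creation of the file, so a real fix would have to create the file atomically.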

Thanks,
Jafar

On Fri, Jan 23, 2015 at 3:44 AM, Wade 
<[email protected]> wrote:
Sounds like the _tempnam() function could be a lot smarter in creating 
temporary filenames. Is that our function or is it provided by Windows?

Wade.

On Fri, 23 Jan 2015 20:13:17 +1100, Sergey Logichev 
<[email protected]> wrote:

Jafar,

I very much appreciate your investigation! Indeed, my Windows %TMP% folder 
contained ~135000 temporary files, and when I cleaned it my run time decreased 
from ~40 secs to ~20. The very first open() was instant, and then its time grew 
as the number of temporary files increased. My proposal is to purge all 
temporary files after the program finishes, or to use in-memory storage 
instead, since a single temporary file is created for every subdirectory 
searched. Otherwise the TMP folder will very quickly contain a myriad of such 
files.
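
The "purge after the program finishes" proposal might look like this when a program manages its own temporary files. It is only a sketch of the idea: the global tmpnames, the procedures newtemp and purgetemps, and the file name are all hypothetical, and the temporary files discussed in this thread are created inside the runtime, where the equivalent cleanup would have to live.

```unicon
global tmpnames   # hypothetical registry of temp files created by this run

# Remember a temp file name so that it can be removed later.
procedure newtemp(name)
   /tmpnames := []
   put(tmpnames, name)
   return name
end

# Remove every registered temp file; names that cannot be removed are skipped.
procedure purgetemps()
   every remove(!\tmpnames)
end

procedure main()
   local f, name
   name := newtemp("udir_example.tmp")
   if f := open(name, "w") then {
      write(f, "scratch data")
      close(f)
      }
   purgetemps()   # would not run if the program were killed before this point
end
```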

Nevertheless, I confirm that the number of threads has practically no influence 
on execution time. Probably it's the problem of "lazy cleanup", as you 
mentioned. I hope you can find a solution. Compared with Linux, Windows is 
quite a bag of assorted bugs that crawl out of every hole :-)

Thank you,
Sergey

23.01.2015, 10:19, "Jafar Al-Gharaibeh" 
<[email protected]>:
Sergey,

   Thanks for the report. I had in mind to look at why we don't get much speed 
up with more threads. I did look and found that the main thread was grabbing 
most "new thread tokens" and not recycling them fast enough. I have to tweak my 
algorithm to allow quick cleanup and reuse of threads. I will do that when I 
get a chance.

Now the second issue, and you've gotta love this! I was able to confirm the 
slow open(). With the help of gdb, and after spending a couple of hours digging 
into the C code and the Windows API calls, I found that the problem is in a 
call to _tempnam() to create a temporary file name. The call was taking a very 
long time to finish. It creates the tmp file under your system TMP folder 
(%TMP% on Windows). I looked in that folder and found that it has more than 
half a million files (~2.7GB)!

It turned out that every time my program runs, Windows was looping through that 
huge pile of tmp files to find a name that doesn't exist so that it can give it 
to the program. Of course I think most of those tmp files were generated by my 
program during previous runs the last couple of days.

As a bonus, I discovered a memory leak in the process of tracking the open() 
problem. I committed a fix for that leak. This is only affecting Windows.

Short term solution: flush your TMP folder.
Long term: we will look into ways to improve our tmp file strategy to overcome 
the shortcomings of the Windows API. This will come at a later date! :)

Cheers,
Jafar

On Thu, Jan 22, 2015 at 4:43 AM, Sergey Logichev 
<[email protected]> wrote:
Jafar,

You've provided a very interesting version of the directory walk algorithm. 
Communication with active threads is a great thing!
I have checked your program under Windows 7. I was puzzled by the fact that 
execution time depends only negligibly on the number of concurrent threads. I 
dug into it and discovered that the first open(s) operation takes nearly ALL of 
the execution time, 95% at least. Check it yourself by slightly editing 
getdirs():
...
if ( stat(s).mode ? ="d" ) & ( tm := &time, d := open(s) ) then {
      if n=1 then write(s," : ",&time-tm)
...

So if the first open() takes that long, all other enhancements make no sense. 
Please correct me if I am wrong.

Best regards,
Sergey

22.01.2015, 00:58, "Jafar Al-Gharaibeh" 
<[email protected]>:
Here is a slightly tweaked/reformatted version. By default it now auto-detects 
the number of available cores in the machine and launches twice as many threads.

--Jafar

On Wed, Jan 21, 2015 at 12:17 PM, Jafar Al-Gharaibeh 
<[email protected]> wrote:
David,

    I added a threaded solution @ 
http://rosettacode.org/wiki/Walk_a_directory/Recursively#Icon_and_Unicon
   Please review/edit as you see fit. (The source file is attached.) Combining 
recursion with threads might not be the best solution for this problem. If I 
were to put this to real use I'd go with an iterative approach using a 
master/workers model. Anyway, this is an excellent demonstration of how to use 
threads! The key features are:

   1- How to create threads, limit their number, and keep them self-load-balanced 
(new threads are spawned at the time/place where needed. Once they are done, 
they vanish, allowing new threads to pop up in new places in the directory 
structure)
   2- How to pass data to and collect results from the threads using the new 
language features.
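
The iterative approach mentioned above could start from an explicit work queue instead of recursion. The following is a single-threaded sketch only, not the attached program: the procedure name walkdirs is made up, and in a master/workers design the workers would take directories from such a queue. It assumes, as getdirs() does, that open() on a directory yields its entries through read().

```unicon
# Hypothetical iterative walk: visit directories from a queue, no recursion.
procedure walkdirs(root)
   local queue, dirs, s, d, f
   queue := [root]          # paths still to be examined
   dirs := []               # directories found so far
   while s := get(queue) do
      if (stat(s).mode ? ="d") & (d := open(s)) then {
         put(dirs, s)
         while f := read(d) do
            if not (f == ("." | "..")) then   # skip . and ..
               put(queue, s || "/" || f)
         close(d)
         }
   return dirs
end

procedure main(args)
   every write(!walkdirs(args[1] | "."))
end
```

Every entry is queued and only tested with stat() when it is taken off the queue, so plain files are dropped at that point rather than when they are read.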

Here is some sample output from my desktop machine (quad-core with mechanical 
HDD. I will try another machine with an SSD and see if more threads scale 
better).

The first argument to the program is the target directory. The second is the 
maximum number of concurrent threads to use at any given moment. (It is a soft 
limit: my counters are "unmutexed", so the actual number might deviate.) Note 
that this is different from the actual number of threads used during the run, 
which is reported at the end. The program can create/destroy threads as needed, 
but cannot use more than "max" threads at any given moment, and again "max" is 
"soft". :)

Cheers,
Jafar

c:\proj>tdir c:\ 1
39708 directories in 99867 ms using 1 threads

c:\proj>tdir c:\ 4
39708 directories in 62222 ms using 4 threads

c:\proj>tdir c:\ 4
39708 directories in 87650 ms using 4 threads

c:\proj>tdir c:\ 1
39708 directories in 92525 ms using 1 threads

c:\proj>tdir c:\ 4
39708 directories in 95655 ms using 4 threads

c:\proj>tdir c:\ 16
39708 directories in 66138 ms using 21 threads

c:\proj>tdir c:\ 8
39708 directories in 69307 ms using 8 threads

c:\proj>tdir c:\ 4
39708 directories in 70539 ms using 4 threads

c:\proj>tdir c:\ 16
39708 directories in 76392 ms using 32 threads

On Sun, Jan 11, 2015 at 1:25 PM, David Gamey 
<[email protected]> wrote:
Sergey,

I am responsible for much of the Rosetta code contributions (thanks also to 
Steve, Andrew, Matt, Peter, and about 4 others) and this one in particular 
dating from 2010. As I recall this was before the multi-threading versions were 
widely available. I think multi-threading is underrepresented in Rosetta/Unicon.

If you come up with a multi-threading version, we should add it to the post as 
an alternative version.  If you don't feel comfortable doing this, post the 
code and I can add it.

David

________________________________
From: Sergey Logichev <[email protected]>
To: Jafar Al-Gharaibeh <[email protected]>
Cc: Unicon group <[email protected]>
Sent: Sunday, January 11, 2015 1:16 AM
Subject: Re: [Unicon-group] Walk of file directory

Jafar,

Thank you for the whole bundle of advice and suggestions! Threads are worth 
trying. The idea of searching by file attributes is very useful too. Your 
suggestion about slow I/O is partly right. For UNIX I tried the program on a 
Raspberry Pi with a Class 6 microSD card as its disk (it's slow, I agree). But 
on Windows it was quite a fast HDD. It would be interesting to compare the 
performance of the program on Windows with the classic approach based on the 
Win32 _FINDFIRST, _FINDNEXT functions. I have threaded Delphi/Lazarus 
implementations of this algorithm. I feel they will be faster, but by how much?

Sergey

10.01.2015, 21:50, "Jafar Al-Gharaibeh" 
<[email protected]>:

Sergey,

  There are so many things that came to mind when I saw your program.

1- At the end of your email, the sourceforge ad says "Go Parallel", which is 
not a bad idea for this highly parallel application.

 There is a similar program, "wordcount", listed in my dissertation (available 
on unicon.org) that goes through directories and counts words in every file 
using threads (Chapter 7, page 107).

2- Unicon open() already supports pattern matching that would (I believe) 
greatly speed up your program. For example, you can do this:
    L := open("*.icn")

   to get a list of all of Unicon source files in the current directory.

  Note: It would be nice if there were a way to tell open() to return files not 
only based on a pattern, but also on file attributes, to allow something like 
"get me all directories in the current directory", or "get me all read-only 
files". There are a lot of situations where filtering directory names, for 
example, is very useful, like this program.
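
Until open() can filter by attribute, the "all directories in the current directory" case has to be done by hand, along these lines. The procedure name subdirs is hypothetical, and the sketch relies on the same stat(s).mode test that the Rosetta Code getdirs() uses.

```unicon
# Hypothetical helper: return a list of the subdirectories of dir.
procedure subdirs(dir)
   local d, f, L
   L := []
   if d := open(dir) then {
      while f := read(d) do
         # keep entries other than . and .. whose mode string starts with "d"
         if not (f == ("." | "..")) & (stat(dir || "/" || f).mode ? ="d") then
            put(L, f)
      close(d)
      }
   return L
end

procedure main()
   every write(!subdirs("."))
end
```

This costs one stat() call per entry, which is the overhead an attribute-aware open() could avoid.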

3- The program on Rosetta Code is not optimized for speed. You can minimize the 
number of lists created and put() by careful rewriting of the code.

4- Depending on how deep the directory tree is, there might be a lot of I/O 
going on. A slow disk might limit how fast you can go regardless of how 
optimized your code is.

I will share results if I get around to trying any of these options.

Cheers,
Jafar

On Sat, Jan 10, 2015 at 5:51 AM, Sergey Logichev 
<[email protected]> wrote:
Hello all!

I am investigating the best approach to getting a list of files in a specified 
directory and beneath it in Unicon.
I found an excellent example at rosettacode.org: 
http://rosettacode.org/wiki/Walk_a_directory/Recursively#Icon_and_Unicon

I reworked it to match filenames against a specified pattern (regular 
expression). My program recursively walks a directory and prints the matching 
filenames, the same as dir (ls) does. Everything works fine except performance. 
If the directory has a lot of subdirs, the search may take 10-20 seconds before 
output starts. Could you give me some advice on how to enhance the performance?

Some notes on how to build and use it. Unpack the content of udir.zip to your 
local directory. Define which environment you use in the env.icn file: 
uncomment the line "$define _UNIX 1" in the case of UNIX. There is nothing to 
do in the case of Windows.
Build the udir program:
unicon -c futils.icn
unicon -c options.icn
unicon -c regexp.icn
unicon udir.icn

Usage: udir -f<filemask>
For example: udir -f*.icn
lists the .icn files in the current dir and all its subdirectories.

Best regards,
Sergey Logichev

------------------------------------------------------------------------------
Dive into the World of Parallel Programming! The Go Parallel Website,
sponsored by Intel and developed in partnership with Slashdot Media, is your
hub for all things parallel software development, from weekly thought
leadership blogs to news, videos, case studies, tutorials and more. Take a
look and join the conversation now. 
http://goparallel.sourceforge.net
_______________________________________________
Unicon-group mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/unicon-group

------------------------------------------------------------------------------
New Year. New Location. New Benefits. New Data Center in Ashburn, VA.
GigeNET is offering a free month of service with a new server in Ashburn.
Choose from 2 high performing configs, both with 100TB of bandwidth.
Higher redundancy.Lower latency.Increased capacity.Completely compliant.
http://p.sf.net/sfu/gigenet
_______________________________________________
Unicon-group mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/unicon-group
