Re: [racket-users] Re: Places performance & channel capacity

2016-01-21 Thread Robby Findler
It may be that the overhead of communicating the data is dominating the
time spent working.

Would it work to have the main place open the file, count the number of
lines, and then just tell the worker places which chunks of the file
are theirs? Or maybe just do the counting at the byte level and then
have some way to clean up the partial lines that each place would get?

Robby
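
A rough sketch of the byte-level chunking idea (illustrative only; the
helper names and the exact boundary convention here are invented, not
taken from the thread):

;; Split the file into n byte ranges.
(define (chunk-boundaries path n)
  (define size (file-size path))
  (for/list ([i (in-range n)])
    (cons (quotient (* i size) n)
          (quotient (* (add1 i) size) n))))

;; Each worker backs up one byte from its start offset and discards
;; through the next newline (skipping the partial line owned by the
;; previous chunk), then claims every line that *starts* before its end
;; offset, so lines straddling a boundary are read exactly once.
(define (process-chunk path start end handle-line)
  (call-with-input-file path
    (lambda (in)
      (unless (zero? start)
        (file-position in (sub1 start))
        (read-bytes-line in))
      (let loop ()
        (when (< (file-position in) end)
          (let ([line (read-bytes-line in)])
            (unless (eof-object? line)
              (handle-line line)
              (loop))))))))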


On Wed, Jan 20, 2016 at 10:47 PM, Brian Adkins  wrote:
> On Wednesday, January 20, 2016 at 11:28:59 PM UTC-5, Brian Adkins wrote:
>> My initial experiment with places is a bit disappointing:
>>
>> Sequential version: cpu time: 2084 real time: 2091 gc time: 91
>>
>> Places version: cpu time: 16895 real time: 3988 gc time: 4244
>>
>> Using 8x the CPU time seems quite high.
>>
>> And more importantly, the places version only wrote 128,541 lines to the 
>> output file vs. the correct number of 198,480 (65%). I suspect the program 
>> is ending while some of the worker places are still active, but I think that 
>> implies that if I correctly waited for all activity to finish, the numbers 
>> would be even worse. Putting a sleep after the main input reading loop gets 
>> me all but 5 of the records.
>>
>> My simplistic design was to have the main place read lines from an input 
>> file and write the lines to the place channels of N workers for processing 
>> (round-robin, using the line index modulo the number of workers). The 
>> workers write the parsed lines to a single output place that writes them to 
>> the output file.
>>
>> Is it possible to limit the number of messages on a place channel? I suspect 
>> the input place is moving faster than the workers, so the workers' place 
>> channels may be getting huge.
>>
>> Someone mentioned buffered asynchronous channels on IRC since you can set a 
>> limit, but it appears you can't send them across a place channel, so I'm 
>> unsure how to make use of them with places.
>>
>> Sequential & parallel code is here:
>>
>> https://gist.github.com/lojic/283aa3eec777e4810efc
>>
>> Relevant lines are lines 44 to 105 of the parallel version.
>>
>> Are there any projects, papers, etc. on best practices with respect to 
>> places that I can look at?
>>
>> Thanks,
>> Brian
>
> FYI - after some trial and error, I was able to specify a sleep time of ~ 
> 500ms after the input file has been read to allow sufficient time for the 
> workers to write to the output place and the output place to populate the 
> file (minus ~ 5 records). The output from time for various numbers of workers 
> is:
>
> 2 workers => cpu time: 9396 real time: 3279 gc time: 2233
>
> 3 workers => cpu time: 11835 real time: 3543 gc time: 2137
>
> 4 workers => cpu time: 17078 real time: 4461 gc time: 4234
>
> 5 workers => cpu time: 21610 real time: 4988 gc time: 5366
>
> I have 4 cores and 8 hyperthreads, but adding workers just means more CPU 
> time and more elapsed time :(


Re: [racket-users] Re: Places performance & channel capacity

2016-01-21 Thread Brian Adkins
On Thursday, January 21, 2016 at 7:48:44 AM UTC-5, Robby Findler wrote:
> It may be that the overhead of communicating the data is dominating the
> time spent working.
> 
> Would it work to have the main place open the file, count the number of
> lines, and then just tell the worker places which chunks of the file
> are theirs? Or maybe just do the counting at the byte level and then
> have some way to clean up the partial lines that each place would get?
> 
> Robby

I think one of the problems might be that every mutable byte string needs to 
be converted to an immutable byte string before it can be sent across a place 
channel, right?

I have no idea how costly that is, but that could be a huge part of the 
overhead. Unfortunately, it seems that the result of most byte string 
operations is a mutable byte string (with no way to tell the system to place it 
in the shared area), so it might be unavoidable.
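
For what it's worth, if I'm reading the docs right, racket/place does
provide `make-shared-bytes`, which allocates a byte string in the shared
memory space; the catch is that ordinary operations like `bytes-append`
still return fresh unshared byte strings, so results would have to be
copied into a shared buffer by hand. A minimal sketch, not code from the
thread:

(require racket/place)

;; Allocate a byte string in the shared space; such values can cross
;; place channels without the usual copy.
(define buf (make-shared-bytes 4096 0))

;; Results of ordinary byte-string operations are unshared, so they
;; must be copied into the shared buffer explicitly.
(define line (bytes-append #"field1" #"\t" #"field2" #"\n"))
(bytes-copy! buf 0 line)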



Re: [racket-users] Re: Places performance & channel capacity

2016-01-21 Thread Brian Adkins
On Thursday, January 21, 2016 at 7:48:44 AM UTC-5, Robby Findler wrote:
> It may be the overhead of communicating the data is dominating the
> time spent working.
> 
> Would it work to the main place open the file, count the number of
> lines, and then just tell the worker places which chunks of the file
> are theirs? Or maybe just do the counting at the byte level and then
> have some way to clean up the partial lines that each place would get?
> 
> Robby

The sequential version takes 10 microseconds per line (including input, 
processing, and output), so the processing part is clearly less than that. 
Maybe that amount of work is too fine-grained? Can anyone with a lot of places 
experience comment on whether a 10 microsecond unit of work is suitable for 
places?
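
One common way to coarsen the unit of work would be to batch many lines
into each message. A sketch (the names batch-size and read-batches are
invented here, not from the gist):

(require racket/place)

;; Send vectors of up to batch-size lines per place-channel message
;; instead of one line per message.
(define batch-size 1000)

(define (read-batches in ch)
  (let loop ([batch '()] [n 0])
    (let ([line (read-bytes-line in)])
      (cond
        [(eof-object? line)
         (unless (null? batch)
           (place-channel-put ch (list->vector (reverse batch))))]
        [(= n batch-size)
         (place-channel-put ch (list->vector (reverse batch)))
         (loop (list line) 1)]
        [else (loop (cons line batch) (add1 n))]))))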

The overhead should be limited to copying a byte string to the worker place, 
and then copying a different byte string to the output place. The CPU time and 
GC time for the places version is much higher than I expected.

I think my approach is pretty natural - hand off input records to workers for 
processing. I think your suggestion might complicate the program quite a bit.

Although in this *particular* case, the input file is fixed length, I'm not 
sure I'd feel comfortable depending on all 45M records being exactly the same 
size. As an experiment, I may verify that they are, and then compute a file 
offset for each worker.
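
If they are all the same size, the offsets are just arithmetic (a
trivial sketch; the names are hypothetical):

;; Worker i of n starts at the byte offset of its first record.
(define (worker-start-offset i n record-size total-records)
  (* record-size (quotient (* i total-records) n)))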

*But*, that's not really what I had in mind for language-supported parallelism 
- that's not very different from just firing up N Racket processes via the 
operating system and using command-line cat to combine their individual output 
files.

The good news is that in preparation for creating the places version I switched 
from:

(fprintf e-out "~a\t~a\t..." field1 field2 ...)

to

(write-bytes (bytes-append field1 #"\t" field2 #"\t" ...) e-out)

so that the parse-case function could return (bytes-append ...) instead of 
doing the actual output. This shaved off about 20% from the runtime! So the 
sequential version is now only 4.23x the runtime of the C program, whereas 
Ruby is 15.7x. 

The sequential version would be fine in production, but after many years of 
seeing only one of my cores pegged at 100% by Ruby, I was really looking 
forward to seeing all four cores pegged :) Actually, the parallel version 
*does* in fact peg all the cores, but that's not exciting when it actually 
takes longer in total!

I was naively hoping for about a 3x speedup for the 4 core places version. I 
still wonder if causing the worker place channels to get huge is a problem - 
that might explain the very high GC time.

If anyone knows how to use buffered asynchronous channels with places, or 
simply how to block when attempting to put more than N messages in a place 
channel, I *think* that would help, but maybe not enough.
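
As far as I know place channels are unbounded, but a producer can
approximate a bound with a simple credit scheme over the same (two-way)
channel: workers acknowledge each message, and the producer blocks for
an ack once too many messages are outstanding. A hypothetical sketch,
not an existing facility:

(require racket/place)

(define max-in-flight 64)

;; Returns the new in-flight count after sending msg on ch.
(define (put/bounded ch msg in-flight)
  (let loop ([n in-flight])
    (if (>= n max-in-flight)
        (begin (place-channel-get ch)   ; block for a worker's 'ack
               (loop (sub1 n)))
        (begin (place-channel-put ch msg)
               (add1 n)))))

;; Worker side: acknowledge each message after processing it.
(define (worker-loop ch process)
  (let loop ()
    (let ([msg (place-channel-get ch)])
      (unless (eq? msg 'done)
        (process msg)
        (place-channel-put ch 'ack)
        (loop)))))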

Brian



Re: [racket-users] Re: Places performance & channel capacity

2016-01-21 Thread Neil Van Dyke
If I understand correctly, you're ultimately looking for a general way 
that you can write this kind of record processing code simply in the 
future.  And that, right now, you're investing some one-time 
experimental effort, to assess feasibility and to find an 
approach/guidelines that you can apply more simply in the future.


Regarding feasibility, unless I misunderstand this pilot application, I 
think that it could be done in Racket in a way that scales almost 
perfectly with each additional core added, limited only by the ultimate 
filesystem I/O.  That might involve Places, perhaps with larger work 
units or more efficient communication and coordination; or, as I think 
you said earlier, you could simply fork separate processes after 
partitioning (just via file positions) the input file.  These approaches 
could get towards the shallow end of what people do when writing really 
performance-sensitive systems, like some Internet or transaction 
processing servers (or database engines, or operating system kernels) -- 
they are not necessarily simple.  Though you can generalize the hard 
work for simple future use (e.g., 
`for-each-bytes-line-from-file/unordered/indexes`, 
`for-each-foo-format-record`, `query-from-foo-format-file-to-bar-file`, 
`define-foo-file-transformation`, etc.).  Sometimes people on this list 
will spend 5-30 minutes figuring out a code solution to a problem posted 
on the list, but harder performance-sensitive stuff can take days or 
longer to do well, and it has to be done well, by definition.
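
As a sketch of what the first of those helpers might look like (the
name is Neil's from the list above; the interface and this sequential
reference version are invented for illustration, with the real value
coming from a parallel drop-in with the same contract):

(require racket/future) ; for processor-count

;; Call proc with each line and its index. A real version would
;; dispatch byte-range chunks to `workers` places, in no fixed order.
(define (for-each-bytes-line-from-file/unordered/indexes
         path proc #:workers [workers (processor-count)])
  (call-with-input-file path
    (lambda (in)
      (let loop ([i 0])
        (let ([line (read-bytes-line in)])
          (unless (eof-object? line)
            (proc line i)
            (loop (add1 i))))))))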


Going back to "simply", rather than 
simply-after-upfront-hard-work-done-by-application-programmer, maybe 
there's opportunity for 
simply-after-further-hard-work-done-by-core-Racket-programmers... For 
example, perhaps some of the core Racket string routines could be 
optimized further (I imagine they already got a working-over when 
Unicode was added), so that even simple programs run faster. And maybe 
there are Places facilities that could be optimized further, or some new 
facility added.  And maybe there's a research project for better 
parallelizing support.


BTW, there might still be a few relatively simple efficiency tweaks in 
your current approach (e.g., while skimming, I think I saw a snippet of 
code doing something like `(write-bytes (bytes-append ...))`, perhaps to 
try to keep a chunk of bytes contiguous, for interleaving with other 
threads' writes).


> If the work was parsing HTML or JSON, then the places version would 
> probably be worth it on a 4 core machine.


For HTML and JSON parsing, unlike your records application, I think the 
parser itself has to be one thread, but you could probably put some 
expensive application-specific behavior that happens during the parse in 
other threads.  Neither my HTML parser nor my JSON parser was designed to be used 
that way, but my streaming JSON parser might be amenable to it. The HTML 
parser is intended to build a potentially-big AST in one shot, so no 
other threads while it's working, though it should be reasonably fast 
about it (it was written on a 166MHz Pentium laptop with 48MB RAM, 
usually on battery power).


Neil V.



Re: [racket-users] Re: Places performance & channel capacity

2016-01-21 Thread Brian Adkins
On Thursday, January 21, 2016 at 6:54:54 PM UTC-5, Neil Van Dyke wrote:
> If I understand correctly, you're ultimately looking for a general way 
> that you can write this kind of record processing code simply in the 
> future.  And that, right now, you're investing some one-time 
> experimental effort, to assess feasibility and to find an 
> approach/guidelines that you can apply more simply in the future.

I think I'm past the stage of assessing the feasibility of Racket for my work 
in general - I simply enjoy expressing code in Racket more than other 
languages. This particular project was really exploring the boundaries of 
Racket with respect to performance (and parallelization). In other words, which 
tasks are appropriate for Racket and which ones may be more suitable for 
something like C.

I primarily develop web applications, but most projects involve some sort of 
data import, whether from files, web services, or other sources. However, for 
most of these, performance isn't a huge factor, so a simple, straightforward, 
sequential solution in Racket would be fine.

This case study involves one atypical program that imports 45 million criminal 
records four times a year, so even for that, it's only mildly annoying that it 
takes a while, and the postgres index creation takes longer than creating the 
bulk import file. So, it's more of an excuse to use a real world program to 
explore than a bona fide need for absolute performance.

> Regarding feasibility, unless I misunderstand this pilot application, I 
> think that it could be done in Racket in a way that scales almost 
> perfectly with each additional core added, limited only by the ultimate 
> filesystem I/O.  That might involve Places, perhaps with larger work 
> units or more efficient communication and coordination; 

Yes, I think "more efficient communication and coordination" is the key, and I 
expect that's going to be a learning experience for me. I would probably get 
better parallel results by coding up an Othello or Chess game and parallelizing 
the game tree search.

> [...]
> Going back to "simply", rather than 
> simply-after-upfront-hard-work-done-by-application-programmer, maybe 
> there's opportunity for 
> simply-after-further-hard-work-done-by-core-Racket-programmers... For 
> example, perhaps some of the core Racket string routines could be 
> optimized further (I imagine they already got a working-over when 
> Unicode was added), so that even simple programs run faster. And maybe 
> there are Places facilities that could be optimized further, or some new 
> facility added.  And maybe there's a research project for better 
> parallelizing support.

I may be interested in further researching "places" performance later in a more 
organized and helpful manner vs. my current seat-of-the-pants, 
throw-up-some-mud-and-see-what-sticks approach :)

> BTW, there might still be a few relatively simple efficiency tweaks in 
> your current approach (e.g., while skimming, I think I saw a snippet of 
> code doing something like `(write-bytes (bytes-append ...))`, perhaps to 
> try to keep a chunk of bytes contiguous, for interleaving with other 
> threads' writes).

I used bytes-append to join fields with tabs plus a trailing newline to format 
the output record. I suppose I could use bytes-join instead and manually append 
the newline.
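
Something like this, with bytes-join from racket/bytes (a small sketch,
not code from the gist):

(require racket/bytes)

;; Join the fields with tabs, then append the trailing newline.
(define fields (list #"field1" #"field2" #"field3"))
(bytes-append (bytes-join fields #"\t") #"\n")
;; => #"field1\tfield2\tfield3\n"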

> > If the work was parsing HTML or JSON, then the places version would 
> > probably be worth it on a 4 core machine.
> 
> For HTML and JSON parsing, unlike your records application, I think the 
> parser itself has to be one thread, but you could probably put some 
> expensive application-specific behavior that happens during the parse in 
> other threads.  Neither my HTML parser nor my JSON parser was designed to be used 
> that way, but my streaming JSON parser might be amenable to it. The HTML 
> parser is intended to build a potentially-big AST in one shot, so no 
> other threads while it's working, though it should be reasonably fast 
> about it (it was written on a 166MHz Pentium laptop with 48MB RAM, 
> usually on battery power).

I simply meant that the unit of work for parsing HTML or JSON (i.e. one web 
page or one JSON document) is probably large enough to warrant the overhead of 
copying the bytes to a place, unlike my program, which copies individual lines.

A streaming parser would likely wind up with the same problem as my program - one 
line of HTML or JSON is too fine-grained to be worth copying to a place for 
parsing.

Brian
