Re: [racket-users] Re: Places performance & channel capacity
It may be that the overhead of communicating the data is dominating the time
spent working. Would it work to have the main place open the file, count the
number of lines, and then just tell the worker places which chunks of the
file are theirs? Or maybe just do the counting at the byte level and then
have some way to clean up the partial lines that each place would get?

Robby

On Wed, Jan 20, 2016 at 10:47 PM, Brian Adkins wrote:
> On Wednesday, January 20, 2016 at 11:28:59 PM UTC-5, Brian Adkins wrote:
>> My initial experiment with places is a bit disappointing:
>>
>> Sequential version: cpu time: 2084 real time: 2091 gc time: 91
>>
>> Places version: cpu time: 16895 real time: 3988 gc time: 4244
>>
>> Using 8x the CPU time seems quite high.
>>
>> And more importantly, the places version only wrote 128,541 lines to the
>> output file vs. the correct number of 198,480 (65%). I suspect the
>> program is ending while some of the worker places are still active, but
>> I think that implies that if I correctly waited for all activity to
>> finish, the numbers would be even worse. Putting a sleep after the main
>> input reading loop gets me all but 5 of the records.
>>
>> My simplistic design was to have the main place read lines from an input
>> file and write the lines to the place channels of N workers for
>> processing (in a round-robin manner using modulo # workers). The workers
>> write the parsed lines to a single output place for writing to the
>> output file.
>>
>> Is it possible to limit the number of messages on a place channel? I
>> suspect the input place is moving faster than the workers, so the
>> workers' place channels may be getting huge.
>>
>> Someone mentioned buffered asynchronous channels on IRC since you can
>> set a limit, but it appears you can't send them across a place channel,
>> so I'm unsure how to make use of them with places.
>>
>> Sequential & parallel code is here:
>>
>> https://gist.github.com/lojic/283aa3eec777e4810efc
>>
>> Relevant lines are lines 44 to 105 of the parallel version.
>>
>> Are there any projects, papers, etc. on best practices with respect to
>> places that I can look at?
>>
>> Thanks,
>> Brian
>
> FYI - after some trial and error, I was able to specify a sleep time of
> ~500ms after the input file has been read to allow sufficient time for
> the workers to write to the output place and the output place to populate
> the file (minus ~5 records). The output from time for various numbers of
> workers is:
>
> 2 workers => cpu time: 9396 real time: 3279 gc time: 2233
>
> 3 workers => cpu time: 11835 real time: 3543 gc time: 2137
>
> 4 workers => cpu time: 17078 real time: 4461 gc time: 4234
>
> 5 workers => cpu time: 21610 real time: 4988 gc time: 5366
>
> I have 4 cores and 8 hyperthreads, but adding workers just means more CPU
> time and more elapsed time :(

--
You received this message because you are subscribed to the Google Groups
"Racket Users" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to racket-users+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
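[Editor's note: Robby's byte-level suggestion could be sketched roughly as
below. This is a hedged illustration, not code from the thread; the function
name `chunk-ranges` is invented. It divides the file into N byte ranges and
snaps each boundary forward past the next newline, so every chunk contains
only whole lines.]

```racket
#lang racket

;; Hypothetical sketch: split a file into n byte ranges, each holding whole
;; lines, so each worker place can be told "your chunk is [start, end)".
(define (chunk-ranges path n)
  (define size (file-size path))
  (define approx (quotient size n))        ; rough chunk size in bytes
  (call-with-input-file path
    (lambda (in)
      (let loop ([start 0] [k 1] [acc '()])
        (cond
          [(= k n) (reverse (cons (cons start size) acc))]
          [else
           ;; jump near the k-th boundary, then skip the partial line
           (file-position in (* k approx))
           (read-line in 'any)
           (define end (file-position in))
           (loop end (add1 k) (cons (cons start end) acc))])))))
```

Each worker would then open the file itself, seek to its `start`, and read
lines until `end`, so no line data ever crosses a place channel.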
Re: [racket-users] Re: Places performance & channel capacity
On Thursday, January 21, 2016 at 7:48:44 AM UTC-5, Robby Findler wrote:
> It may be that the overhead of communicating the data is dominating the
> time spent working.
>
> Would it work to have the main place open the file, count the number of
> lines, and then just tell the worker places which chunks of the file are
> theirs? Or maybe just do the counting at the byte level and then have
> some way to clean up the partial lines that each place would get?
>
> Robby

I think one of the problems might be the fact that every mutable byte
string needs to be converted to an immutable byte string before it can be
sent across a place channel, right? I have no idea how costly that is, but
it could be a huge part of the overhead.

Unfortunately, it seems that the result of most byte string operations is a
mutable byte string (with no way to tell the system to place it in the
shared area), so it might be unavoidable.
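[Editor's note: a small illustration of the point above, not code from the
thread. `bytes-append` does return a fresh mutable byte string, so sending
its result over a place channel entails a copy; `make-shared-bytes` (from
`racket/place`) is one documented way to allocate a buffer in the shared
memory space instead. The names and sizes here are made up.]

```racket
#lang racket
(require racket/place)

;; The result of ordinary byte-string operations is mutable, so
;; place-channel-put must copy it into the destination place.
(define joined (bytes-append #"foo" #"\t" #"bar"))
(immutable? joined)   ; => #f: this value would be copied when sent

;; make-shared-bytes allocates in the shared space; other places can read
;; and mutate this buffer without a per-message copy.
(define buf (make-shared-bytes (bytes-length joined) 0))
(bytes-copy! buf 0 joined)
```

Whether a shared buffer actually helps depends on the access pattern: the
places then need some other way (e.g., an index message) to agree on which
bytes are ready to read.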
Re: [racket-users] Re: Places performance & channel capacity
On Thursday, January 21, 2016 at 7:48:44 AM UTC-5, Robby Findler wrote:
> It may be that the overhead of communicating the data is dominating the
> time spent working.
>
> Would it work to have the main place open the file, count the number of
> lines, and then just tell the worker places which chunks of the file are
> theirs? Or maybe just do the counting at the byte level and then have
> some way to clean up the partial lines that each place would get?
>
> Robby

The sequential version takes 10 microseconds per line (including input,
processing, and output), so the processing part is clearly less than that.
Maybe that amount of work is too fine grained? Can anyone with a lot of
places experience comment on the suitability of a 10 microsecond unit of
work for places? The overhead should be limited to copying a byte string to
the worker place, and then copying a different byte string to the output
place. The CPU time and GC time for the places version is much higher than
I expected.

I think my approach is pretty natural - hand off input records to workers
for processing. I think your suggestion might complicate the program quite
a bit. Although in this *particular* case, the input file is fixed length,
I'm not sure I'd feel comfortable depending on all 45M records being
exactly the same size. As an experiment, I may verify that they are, and
then compute a file offset for each worker. *But*, that's not really what I
had in mind for language-supported parallelism - it's not very different
from me just firing up N Racket processes via the operating system and
using command-line cat to combine their individual output files.

The good news is that in preparation for creating the places version I
switched from:

  (fprintf e-out "~a\t~a\t..." field1 field2 ...)

to:

  (write-bytes (bytes-append field1 #"\t" field2 #"\t" ...) e-out)

so that the parse-case function could return (bytes-append ...) instead of
doing the actual output. This shaved off about 20% from the runtime!
So the sequential version is now only 4.23x the C program, where Ruby is
15.7x the C program. The sequential version would be fine in production,
but after many years of seeing only one of my cores pegged at 100% by Ruby,
I was really looking forward to seeing all four cores pegged :) Actually,
the parallel version *does* in fact peg all the cores, but that's not
exciting when it takes longer in total! I was naively hoping for about a 3x
speedup for the 4 core places version.

I still wonder if letting the worker place channels get huge is a problem -
that might explain the very high GC time. If anyone knows how to use
buffered asynchronous channels with places, or simply how to block when
attempting to put more than N messages in a place channel, I *think* that
would help, but maybe not enough.

Brian
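[Editor's note: one way to get "block after N messages" behavior on top of
a raw place channel is a credit-based protocol, sketched below. This is a
hedged illustration, not code from Brian's gist: `WINDOW`, `feed`, and the
`'ack`/`'done` protocol are all invented here. The sender stops once
`WINDOW` lines are unacknowledged, so the worker's channel queue stays
bounded.]

```racket
#lang racket
(require racket/place)

(define WINDOW 1024)  ; assumed tuning knob: max unacknowledged messages

;; Read lines from `in` and send them over place channel `ch`, blocking
;; for an 'ack whenever WINDOW messages are in flight. Sends 'done at eof.
(define (feed ch in)
  (let loop ([in-flight 0])
    (cond
      [(>= in-flight WINDOW)
       (place-channel-get ch)               ; block until the worker acks one
       (loop (sub1 in-flight))]
      [else
       (define line (read-bytes-line in 'any))
       (cond
         [(eof-object? line)
          (for ([_ (in-range in-flight)])   ; drain outstanding acks
            (place-channel-get ch))
          (place-channel-put ch 'done)]
         [else
          (place-channel-put ch line)
          (loop (add1 in-flight))])])))
```

The worker side would `place-channel-put` an `'ack` after processing each
line. Conveniently, `(place-channel)` creates a connected pair of endpoints
within one place, so the protocol can be tested with plain threads before
spawning real places.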
Re: [racket-users] Re: Places performance & channel capacity
If I understand correctly, you're ultimately looking for a general way that
you can write this kind of record-processing code simply in the future. And
that, right now, you're investing some one-time experimental effort to
assess feasibility and to find an approach/guidelines that you can apply
more simply in the future.

Regarding feasibility, unless I misunderstand this pilot application, I
think that it could be done in Racket in a way that scales almost perfectly
with each additional core added, limited only by the ultimate filesystem
I/O. That might involve places, perhaps with larger work units or more
efficient communication and coordination; or, as I think you said earlier,
you could simply fork separate processes after partitioning (just via file
positions) the input file.

These approaches get towards the shallow end of what people do when writing
really performance-sensitive systems, like some Internet or
transaction-processing servers (or database engines, or operating system
kernels) -- they are not necessarily simple. Though you can generalize the
hard work for simple future use (e.g.,
`for-each-bytes-line-from-file/unordered/indexes`,
`for-each-foo-format-record`, `query-from-foo-format-file-to-bar-file`,
`define-foo-file-transformation`, etc.). Sometimes people on this list will
spend 5-30 minutes figuring out a code solution to a problem posted on the
list, but harder performance-sensitive stuff can take days or longer to do
well, and it has to be done well, by definition.

Going back to "simply", rather than
simply-after-upfront-hard-work-done-by-application-programmer, maybe
there's opportunity for
simply-after-further-hard-work-done-by-core-Racket-programmers... For
example, perhaps some of the core Racket string routines could be optimized
further (I imagine they already got a working-over when Unicode was added),
so that even simple programs run faster.
And maybe there are places facilities that could be optimized further, or
some new facility added. And maybe there's a research project for better
parallelizing support.

BTW, there might still be a few relatively simple efficiency tweaks in your
current approach (e.g., while skimming, I think I saw a snippet of code
doing something like `(write-bytes (bytes-append ...))`, perhaps to try to
keep a chunk of bytes contiguous, for interleaving with other threads'
writes).

> If the work was parsing HTML or JSON, then the places version would
> probably be worth it on a 4 core machine.

For HTML and JSON parsing, unlike your records application, I think the
parser itself has to be one thread, but you could probably put some
expensive application-specific behavior that happens during the parse in
other threads. Neither my HTML nor JSON parser was designed to be used that
way, but my streaming JSON parser might be amenable to it. The HTML parser
is intended to build a potentially-big AST in one shot, so no other threads
while it's working, though it should be reasonably fast about it (it was
written on a 166MHz Pentium laptop with 48MB RAM, usually on battery
power).

Neil V.
Re: [racket-users] Re: Places performance & channel capacity
On Thursday, January 21, 2016 at 6:54:54 PM UTC-5, Neil Van Dyke wrote:
> If I understand correctly, you're ultimately looking for a general way
> that you can write this kind of record processing code simply in the
> future. And that, right now, you're investing some one-time experimental
> effort, to assess feasibility and to find an approach/guidelines that
> you can apply more simply in the future.

I think I'm past the stage of assessing the feasibility of Racket for my
work in general - I simply enjoy expressing code in Racket more than in
other languages. This particular project was really exploring the
boundaries of Racket with respect to performance (and parallelization) - in
other words, which tasks are appropriate for Racket and which ones may be
more suitable for something like C.

I primarily develop web applications, but most projects involve some sort
of data import, whether from files, web services, etc. However, for most of
these, performance isn't a huge factor, so a simple, straightforward,
sequential solution in Racket would be fine. This case study involves one
atypical program that imports 45 million criminal records four times a
year, so even for that, it's only mildly annoying that it takes a while,
and the postgres index creation takes longer than creating the bulk import
file. So, it's more of an excuse to use a real-world program to explore
than a bona fide need for absolute performance.

> Regarding feasibility, unless I misunderstand this pilot application, I
> think that it could be done in Racket in a way that scales almost
> perfectly with each additional core added, limited only by the ultimate
> filesystem I/O. That might involve Places, perhaps with larger work
> units or more efficient communication and coordination;

Yes, I think "more efficient communication and coordination" is the key,
and I expect that's going to be a learning experience for me.
I would probably get better parallel results by coding up an Othello or
chess game and parallelizing the game tree search.

> [...]
> Going back to "simply", rather than
> simply-after-upfront-hard-work-done-by-application-programmer, maybe
> there's opportunity for
> simply-after-further-hard-work-done-by-core-Racket-programmers... For
> example, perhaps some of the core Racket string routines could be
> optimized further (I imagine they already got a working-over when
> Unicode was added), so that even simple programs run faster. And maybe
> there are Places facilities that could be optimized further, or some new
> facility added. And maybe there's a research project for better
> parallelizing support.

I may be interested in further researching places performance later in a
more organized and helpful manner vs. my current seat-of-the-pants,
throw-up-some-mud-and-see-what-sticks approach :)

> BTW, there might still be a few relatively simple efficiency tweaks in
> your current approach (e.g., while skimming, I think I saw a snippet of
> code doing something like `(write-bytes (bytes-append ...))`, perhaps to
> try to keep a chunk of bytes contiguous, for interleaving with other
> threads' writes).

I used bytes-append to join fields with tabs plus a trailing newline to
format the output record. I suppose I could use bytes-join instead and
manually append the newline.

> > If the work was parsing HTML or JSON, then the places version would
> > probably be worth it on a 4 core machine.
>
> For HTML and JSON parsing, unlike your records application, I think the
> parser itself has to be one thread, but you could probably put some
> expensive application-specific behavior that happens during the parse in
> other threads. Neither my HTML nor JSON parser was designed to be used
> that way, but my streaming JSON parser might be amenable to it.
> The HTML parser is intended to build a potentially-big AST in one shot,
> so no other threads while it's working, though it should be reasonably
> fast about it (it was written on a 166MHz Pentium laptop with 48MB RAM,
> usually on battery power).

I simply meant that the unit of work for parsing HTML or JSON (i.e. one web
page or one JSON document) is probably large enough to warrant the overhead
of copying the bytes to a place, vs. my program which is copying individual
lines. A streaming parser would likely wind up with the same problem as my
program - one line of HTML or JSON is too fine grained to be worth copying
to a place for parsing.

Brian
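[Editor's note: the bytes-join idea mentioned above could be sketched as
follows. The field values are made-up examples; `bytes-join` comes from
`racket/bytes`.]

```racket
#lang racket
(require racket/bytes)

;; Join fields with tab separators, then append a single trailing newline,
;; producing one contiguous byte string for write-bytes.
(define fields (list #"123" #"SMITH" #"2016-01-21"))
(define record (bytes-append (bytes-join fields #"\t") #"\n"))
;; record is #"123\tSMITH\t2016-01-21\n"
```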