RE the large file challenge
| Maybe we need a new name for what Transcript does.
|
| Transcript pre-processes scripts into pointer-based bytecode, which
| generally outperforms purely interpreted xTalk by anywhere from several
| times to a few orders of magnitude.

Maybe? This is an excellent clarification. If MC is seen as trying to compete with Java, and Sun has decided to redefine "compiled," then, hey! Why not. Come to think of it, they used to call UCSD Pascal compiled, but it was p-code; possibly similar?

There is an exception: when MC is used as a scripting language, as with CGI scripts or the tests I have been running, there is no preprocessing. In that case, I believe "interpreted" would be the correct description. The good news is, it _still_ compares in speed to the compiled languages.

For an interesting read on security and high level languages, this is fun:
http://m.bacarella.com/papers/secsoft/html

Sadhu

___
metacard mailing list
[EMAIL PROTECTED]
http://lists.runrev.com/mailman/listinfo/metacard
Re: RE the large file challenge
Sadhunathan Nadesan wrote:
| For an interesting read on security and high level languages, this is fun:
| http://m.bacarella.com/papers/secsoft/html

Great article -- thanks for posting that!

--
Richard Gaskin
Fourth World Media Corporation
Developer of WebMerge 2.0: Publish any database on any site
http://www.FourthWorld.com
[EMAIL PROTECTED]
Tel: 323-225-3717
AIM: FourthWorldInc
RE: the large file challenge
| Message: 1
| Date: Thu, 14 Nov 2002 10:39:01 -0700
| Subject: RE: the large file challenge
| From: John Vokey [EMAIL PROTECTED]
| To: [EMAIL PROTECTED]
| Reply-To: [EMAIL PROTECTED]
|
| To be fair: most of metacard is coded in metatalk; it is a
| boot-strapped language, much like many of the TILs (threaded
| interpreted languages) of yesteryears (e.g., Forth, APL).

John, I agree with you too. I take his point that a C program should not be slower than a bash script invoking two utilities written in C. If anyone cares to contribute a better C program, go for it! Right now I'm running Pierre's MC revision to see how it does.

This has been fun, but I think we've come to the end. I think it has come to light that MC holds its own with compiled languages. That was where this whole thing began: I was explaining to Swami that MC is not a compiled language, and then Scott kinda said, so what, there is not that much difference between compiled and interpreted languages these days. That would be supported by the results of the timing tests, so I'd have to agree with Scott.

However, I'm still sticking to my guns: MC is not a compiled language. Swami apparently thought it was, so I was trying to clarify it for him. And that led to all this fun! :-)

Sadhu
Re: the large file challenge
Sadhunathan Nadesan wrote:
| I think it has come to light that MC holds its own with compiled
| languages. That was where this whole thing began, I was explaining to
| Swami that MC is not a compiled language, then Scott kinda said, so
| what, there is not that much difference between compiled and
| interpreted languages these days. That would be supported by the
| results of the timing tests, so I'd have to agree with Scott. However,
| I'm still sticking to my guns - MC is not a compiled language.

Maybe we need a new name for what Transcript does.

Transcript pre-processes scripts into pointer-based bytecode, which generally outperforms purely interpreted xTalk by anywhere from several times to a few orders of magnitude. Sun calls their arguably less-efficient form of bytecode "compiled," and makes no bones that they're only compiling for a virtual machine. In that sense, "compiled" seems the more appropriate term.

Yet Transcript does not store its bytecode, so there is one pass of pure interpretation to create the bytecode when an object first loads. So in that sense, "interpreted" seems the more appropriate term. :\

Erring on the side of underselling, I prefer "tokenized." But that makes for useless marketing copy, since it takes several paragraphs to explain what "tokenized" means to the general public.

There's also a good case for just calling it "interpreted" without the unnecessary apology, given the benefits of scripting for the sorts of tasks one is likely to use Rev for. But communicating those benefits takes even more explanation; Ousterhout wrote the single clearest paper on the subject I've seen yet, but few have read http://dev.scriptics.com/doc/scripting.html, and you'd have to remind people to mentally replace "TCL" with "Rev" when reading it. Moreover, the strength of the argument there appeals primarily to those with experience in both 3GLs and 4GLs, and would be lost on most non-geeks. (I'm writing a version of Ousterhout's argument focusing on Transcript the way he focuses on TCL, but it'll be a little while before it's finished; gotta ship a few products first.)

Maybe we should just call Transcript "fast." :)

--
Richard Gaskin
Fourth World Media Corporation
Developer of WebMerge 2.0: Publish any database on any site
http://www.FourthWorld.com
Tel: 323-225-3717
AIM: FourthWorldInc
Re: the large file challenge
Sadhunathan Nadesan wrote:
| Ok, here are the results so far,
|
| bash               Sun Nov 10 13:01:59 PST 2002   17333   Sun Nov 10 13:03:43 PST 2002
| pascal             Sun Nov 10 13:03:43 PST 2002   17333   Sun Nov 10 13:05:47 PST 2002
| andu's metacard    Sun Nov 10 13:05:47 PST 2002   29623   Sun Nov 10 13:08:10 PST 2002
| pierre's metacard  Sun Nov 10 13:08:10 PST 2002   17338   Sun Nov 10 13:10:21 PST 2002
| bruce's metacard   Sun Nov 10 13:10:21 PST 2002   33351   Sun Nov 10 13:14:59 PST 2002
|
| That would be
|
| bash    1:44
| pascal  2:04
| Andu    2:23
| Pierre  2:11
| Bruce   4:38
|
| Now, why did we get different counts? I believe the count of 17333 is
| correct. Maybe someone can debug that. Here's the code
|
| [snip]
|
| Pierre --
|
| #!/usr/local/bin/mc
| on startup
|   put 0 into the_counter
|   put 1 into the_offset
|   put 333491183 into file_size
|   put 3 into the_increment
|   put "/gig/tmp/log/access_log" into the_file
|   put "mystic_mouse" into pattern
|   open file the_file for read
|   repeat until (the_offset = file_size)
|     read from file the_file at the_offset for the_increment
|     filter it with "*mystic_mouse*"
|     put it into tempo
|     add the num of lines in tempo to the_counter
|     # put it into the_text
|     # repeat until lineoffset("mystic_mouse", the_text) = 0
|     #   if (lineoffset("mystic_mouse", the_text) is not 0) then
|     #     add 1 to the_counter
|     #     delete line 1 to lineoffset("mystic_mouse", the_text) of the_text
|     #   end if
|     # end repeat
|     add the_increment to the_offset
|   end repeat
|   put the_counter
| end startup
|
| [snip]

Aloha,

What does it do when using the filter command instead of the lineoffset one? Faster, slower?

--
Regards,
Pierre Sahores
Inspection académique de Seine-Saint-Denis
WEB and VPN applications and databases
Qualifying and producing the competitive advantage
RE: the large file challenge
To be fair: most of metacard is coded in metatalk; it is a boot-strapped language, much like many of the TILs (threaded interpreted languages) of yesteryears (e.g., Forth, APL).

On Thursday, November 14, 2002, at 10:01 AM, [EMAIL PROTECTED] wrote:
| MC, as well, is also coded in C, so in many interpreted languages (bash,
| perl, MC), while the script itself is interpreted, much of the real work
| is done by compiled code.

--
John R. Vokey
RE: the large file challenge
| Actually, this says more about your specific implementation of the
| algorithm and/or the quality of your compiler than it does about the
| relative speed merits of any given language. As in your bash example,
| the bash shell actually calls functions from libraries of well written,
| highly optimized C code. So, all things being equal, straight C code
| could never be slower than a bash shell script.
|
| MC, as well, is also coded in C, so in many interpreted languages (bash,
| perl, MC), while the script itself is interpreted, much of the real work
| is done by compiled code.

Yes, I agree.
RE: the large file challenge
| Here's the latest round of times
|
| bash    1:44
| pascal  2:04
| C       2:28
| MC      2:10
|
| goodness, C is slowest of all?!?

Actually, this says more about your specific implementation of the algorithm and/or the quality of your compiler than it does about the relative speed merits of any given language. As in your bash example, the bash shell actually calls functions from libraries of well written, highly optimized C code. So, all things being equal, straight C code could never be slower than a bash shell script.

MC, as well, is also coded in C, so in many interpreted languages (bash, perl, MC), while the script itself is interpreted, much of the real work is done by compiled code.

-Glen Yates
Re: the large file challenge
Pierre Sahores wrote:
| Richard Gaskin wrote:
| | Pierre Sahores wrote:
| | | So! MC nearly as fast as Pascal! Isn't it great? And, thanks again
| | | to Scott, for that too!
| |
| | It's enough to make a Java programmer cry. ;)
|
| Java? Help me to remember... Are you speaking, Richard, about that dead
| marketed toy that crashes any time it goes looking for some more RAM to
| eat?

If you're thinking of the one with the slow development cycle and the even slower runtime speed, yep, that's the critter.

Anyone care to write this challenge algorithm in Java for laughs? Or would we need Raney to add a new time token in addition to seconds, ticks, and milliseconds: "eons." :)

--
Richard Gaskin
Fourth World Media Corporation
Developer of WebMerge 2.0: Publish any database on any site
http://www.FourthWorld.com
Tel: 323-225-3717
AIM: FourthWorldInc
Re: the large file challenge
| I'm confused: if the point is to avoid reading the entire file into
| memory, isn't that what line 8 does? And if it's already in memory, why
| is it read again inside the loop?
|
| I think I missed something from the original post

Hi,

Sorry, yes, you missed something, but not from the original post; the part you missed wasn't posted at all. It went like this:

1. (not posted) A conversation in progress regarding the difference between compiled programs, like C, and interpreted programs, like Metacard.

2. (not posted) An example sent of a shell script (bash or Bourne shell) on Unix (interpreted, of course) and a Pascal program doing the same thing (compiled, of course). Question asked: how would one do this in MC? I am not an experienced MC developer, and I had no idea.

3. (not posted) A code snippet was sent to me as an example, and I turned it into a working program. Yes, it starts out by reading the whole file to count the lines, which is not very efficient. In fact, it failed with an out-of-memory error when run on the large access file.

4. (where you came in) I sent a post inquiring, basically, isn't there a better way?

I got a lot of good responses, and it seems there are much better ways, so I am going to try them all.

Clear it up for you?

Sadhu
Re: the large file challenge
| If we're allowed to read the whole thing into RAM and the goal is to
| count the occurrences of the string "mystic_mouse", then to optimize
| speed we can just remove the redundant read commands and use offset to
| search for us:
|
| #!/usr/local/bin/mc
| on startup
|   put "/gig/tmp/log/xaa" into the_file
|   put url ("file:" & the_file) into the_text
|   put 0 into the_counter
|   put 1 into tPointer
|   --
|   repeat for each line this_line in the_text
|     get offset("mystic_mouse", the_text, tPointer)
|     if it = 0 then exit repeat
|     add 1 to the_counter
|     add it to tPointer
|   end repeat
|   put the_counter
| end startup
|
| This is off the top of my head. If it runs I'd be interested in how it
| compares.

Richard,

Thanks much for the code and suggestions. We aren't allowed to read the whole thing into memory, because the real access file is 300 MB and my poor little Linux box has only 128 MB of RAM. One of the great things about Linux, of course, is that it will run fine on minimal hardware.

Anyway, alas, the program failed with this message:

mc: out of memory
0

Ok, on to the next suggestion!

Sadhu
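[Editor's aside: for readers more at home in a mainstream language, the offset-with-skip loop above can be sketched in Python. This is an illustration of the technique, not the MetaTalk code itself; the file path in the comment is just the one from the thread.]

```python
def count_occurrences(text, pattern):
    """Count occurrences by repeatedly searching from just past the
    previous match, like the MetaTalk offset() loop above."""
    count, pos = 0, 0
    while True:
        pos = text.find(pattern, pos)
        if pos == -1:
            return count
        count += 1
        pos += len(pattern)  # skip past this match before searching again

# Reading the whole file first is fast but requires it to fit in RAM --
# exactly what failed on a 300 MB log with 128 MB of memory:
# with open("/gig/tmp/log/xaa") as f:
#     print(count_occurrences(f.read(), "mystic_mouse"))
```

The skip-by-pattern-length step is what keeps the loop linear: each character of the text is examined at most once per search pass.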
Re: the large file challenge
| I'm pretty sure the problem with speed here is from reading in the
| entire file. Unless of course you have enough free RAM -- but that's
| hard to imagine when the files are 300MB+.
|
| How about this, which you can adjust to read any given number of lines
| at a time. Try it with 10, 1000, 1, etc and see what gives you the best
| performance! Hasn't been tested but hopefully it'll run with a tweak or
| less.
|
| #!/usr/local/bin/mc
| on startup
|   ## initialize variables: try adjusting numLines
|   put "/gig/tmp/log/xaa" into the_file
|   put 1000 into numLines
|   put 0 into counter
|
|   open file the_file
|
|   repeat until (isEOF = TRUE)
|     ## read the specified number of lines, check if we are at the end
|     ## of the file
|     read from file the_file for numLines lines
|     put it into thisChunk
|     put (the result = "eof") into isEOF
|
|     ## count the number of matches in this chunk
|     put offset("mystic_mouse", thisChunk) into theOffset
|     repeat until (theOffset = 0)
|       add 1 to counter
|       put offset("mystic_mouse", thisChunk, theOffset) into tempOffset
|       if (tempOffset > 0) then add tempOffset to theOffset
|       else put 0 into theOffset
|     end repeat
|
|   end repeat
|
|   close file the_file
|
|   put counter
| end startup
|
| HTH,
| Brian

Hey Brian, thanks, excellent. I tried it with 10, 1000, 1, etc., and it got slightly faster (just a few seconds) with each increase, so I'll leave it at 1 and compare against other suggested algorithms, and let everyone know the results.

Sadhu
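[Editor's aside: a rough Python equivalent of Brian's read-N-lines-at-a-time approach, as a sketch; the path and pattern are placeholders standing in for the thread's access log and search string.]

```python
from itertools import islice

def count_by_line_chunks(path, pattern, num_lines=1000):
    """Read num_lines lines at a time so memory use stays bounded,
    counting pattern occurrences chunk by chunk."""
    count = 0
    with open(path) as f:
        while True:
            chunk = list(islice(f, num_lines))
            if not chunk:
                break  # end of file
            count += sum(line.count(pattern) for line in chunk)
    return count
```

Because every chunk ends on a line boundary, a match can never be cut in half by a read, which is the pitfall Brian warns about for byte-count reads.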
Re: the large file challenge
| One last note:
|
| Be careful of using "read from file xxx for yyy"
|
| If you do not read for lines, you run the risk of cutting a line in
| half on the spot where your magic string occurs.
|
| So always use "read from file xxx for yyy LINES"
|
| HTH.
| Brian

Good point. For this particular use of the program a close count is ok; no problem if it's not perfect. But clearly, that might matter in other instances. It is interesting that the different algorithms are varying slightly in the count, probably for reasons like you mention.
Re: the large file challenge
Sadhunathan Nadesan wrote:
| | Be careful of using "read from file xxx for yyy"
| |
| | If you do not read for lines, you run the risk of cutting a line in
| | half on the spot where your magic string occurs.
| |
| | So always use "read from file xxx for yyy LINES"
|
| Good point. For this particular use of the program a close count is ok.
| [snip] It is interesting that the different algorithms are varying
| slightly in the count, probably for reasons like you mention.

My hunch is that reading for lines is slower than reading a specified number of chars, since with lines it needs to evaluate each incoming character to determine if it's a return -- Scott, am I right, or should they be about the same?

--
Richard Gaskin
Fourth World Media Corporation
Developer of WebMerge 2.0: Publish any database on any site
http://www.FourthWorld.com
Tel: 323-225-3717
AIM: FourthWorldInc
Re: the large file challenge
| # repeat for each line this_line in the_text
| #   if (not eof) then
| #     if (this_line contains "mystic_mouse") then
| #       put the_counter + 1 into the_counter
| #     end if
| #   end if
| # end repeat
|
| close file the_file

| Allo Sadhu,
|
| Perhaps there is a way to speed up your script by using the lineoffset
| statement, as in the proposal above ;)

Allo! I'll try that. Merci!
Re: the large file challenge
| So that is 1:53 for bash, 2:04 for pascal, and 2:19 for MC. darn good!
|
| But golly, I thought an interpreted language like MetaTalk was supposed
| to be slow, certainly much slower than compiled Pascal.
|
| :)

By golly, that would be the conventional wisdom alright, I think! Another myth goes by the wayside? :-)

Of course, now the C programmers will probably come out of the closet. (They might want to know: what compiler, what flags set, etc.) Point might be, that is a non-issue with MC. Assembly language programmers need not apply.
Re: the large file challenge
Sadhunathan Nadesan wrote:
| By golly, that would be the conventional wisdom alright, I think!
| Another myth goes by the wayside? :-) Of course, now the C programmers
| will probably come out of the closet.

Not if Tom Pittman is around. I've never seen objective data on the subject, but he has the opinion that Pascal can and should be optimized to outperform C for most operations, if compilers are designed to do so.

--
Richard Gaskin
Fourth World Media Corporation
Developer of WebMerge 2.0: Publish any database on any site
http://www.FourthWorld.com
Tel: 323-225-3717
AIM: FourthWorldInc
Re: the large file challenge
I got this suggestion from Jeanne A. E. DeVoto ~ [EMAIL PROTECTED]:

repeat
  read from stdin until "mystic_mouse"
  if the result is not empty then add 1 to the_counter -- found it
  else exit repeat -- encountered end of file, no more occurrences
end repeat
put the_counter

But I was not able to make it actually run. Any suggestions?
Re: the large file challenge
Ok, here are the results so far,

bash               Sun Nov 10 13:01:59 PST 2002   17333   Sun Nov 10 13:03:43 PST 2002
pascal             Sun Nov 10 13:03:43 PST 2002   17333   Sun Nov 10 13:05:47 PST 2002
andu's metacard    Sun Nov 10 13:05:47 PST 2002   29623   Sun Nov 10 13:08:10 PST 2002
pierre's metacard  Sun Nov 10 13:08:10 PST 2002   17338   Sun Nov 10 13:10:21 PST 2002
bruce's metacard   Sun Nov 10 13:10:21 PST 2002   33351   Sun Nov 10 13:14:59 PST 2002

That would be

bash    1:44
pascal  2:04
Andu    2:23
Pierre  2:11
Bruce   4:38

Now, it is likely I have become confused and mixed up exactly what came from who, sorry about that! My apologies if your name is not associated with your contribution, or vice versa.

Now, why did we get different counts? I believe the count of 17333 is correct. Maybe someone can debug that. Here's the code:

Andu ---

#!/usr/local/bin/mc
on startup
  put 0 into the_counter
  put 1 into the_offset
  put 333491183 into file_size
  put 3 into the_increment
  put "/gig/tmp/log/access_log" into the_file
  put "mystic_mouse" into pattern
  open file the_file for read
  repeat until (the_offset = file_size)
    read from file the_file at the_offset for the_increment
    put it into the_text
    repeat for each line this_line in the_text
      get offset(pattern, this_line)
      if (it is not 0) then add 1 to the_counter
    end repeat
    add the_increment to the_offset
  end repeat
  put the_counter
end startup

Pierre --

#!/usr/local/bin/mc
on startup
  put 0 into the_counter
  put 1 into the_offset
  put 333491183 into file_size
  put 3 into the_increment
  put "/gig/tmp/log/access_log" into the_file
  put "mystic_mouse" into pattern
  open file the_file for read
  repeat until (the_offset = file_size)
    read from file the_file at the_offset for the_increment
    put it into the_text
    repeat until lineoffset("mystic_mouse", the_text) = 0
      if (lineoffset("mystic_mouse", the_text) is not 0) then
        add 1 to the_counter
        delete line 1 to lineoffset("mystic_mouse", the_text) of the_text
      end if
    end repeat
    add the_increment to the_offset
  end repeat
  put the_counter
end startup

Bruce -

#!/usr/local/bin/mc
on startup
  ## initialize variables: try adjusting numLines
  put "/gig/tmp/log/access_log" into the_file
  put $1 into numLines -- called with 1 as parameter
  put 0 into counter
  open file the_file
  repeat until (isEOF = TRUE)
    ## read the specified number of lines, check if we are at the end of the file
    read from file the_file for numLines lines
    put it into thisChunk
    put (the result = "eof") into isEOF
    ## count the number of matches in this chunk
    put offset("mystic_mouse", thisChunk) into theOffset
    repeat until (theOffset = 0)
      add 1 to counter
      put offset("mystic_mouse", thisChunk, theOffset) into tempOffset
      if (tempOffset > 0) then add tempOffset to theOffset
      else put 0 into theOffset
    end repeat
  end repeat
  close file the_file
  put counter
end startup
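[Editor's aside: one plausible source of the count drift above, sketched in Python with made-up data. Reading a file in fixed-size byte chunks can cut an occurrence of the search string in half at a chunk boundary; carrying a short overlap between chunks avoids that. The exact bug in each script above differs, so this is an illustration of the failure mode, not a diagnosis.]

```python
PATTERN = b"mystic_mouse"

def naive_chunk_count(data, chunk_size):
    """Count matches chunk by chunk with no overlap; matches that
    straddle a chunk boundary are silently lost."""
    return sum(data[i:i + chunk_size].count(PATTERN)
               for i in range(0, len(data), chunk_size))

def overlap_chunk_count(data, chunk_size):
    """Carry the last len(PATTERN)-1 bytes into the next chunk so a
    straddling match is still seen, and seen exactly once (the carry is
    shorter than the pattern, so no match fits entirely inside it)."""
    count, carry = 0, b""
    for i in range(0, len(data), chunk_size):
        buf = carry + data[i:i + chunk_size]
        count += buf.count(PATTERN)
        carry = buf[max(0, len(buf) - (len(PATTERN) - 1)):]
    return count
```

With a chunk size smaller than the pattern, the naive version finds nothing at all; more realistic chunk sizes just lose the occasional boundary-straddling match.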
Re: the large file challenge
Sadhunathan Nadesan wrote:
| Ok, here are the results so far,
|
| bash    1:44
| pascal  2:04
| Andu    2:23
| Pierre  2:11
| Bruce   4:38
|
| Now, why did we get different counts? I believe the count of 17333 is
| correct. Maybe someone can debug that. Here's the code
|
| [snip]
|
| Pierre --
|
| #!/usr/local/bin/mc
| on startup
|   put 0 into the_counter
|   put 1 into the_offset
|   put 333491183 into file_size
|   put 3 into the_increment
|   put "/gig/tmp/log/access_log" into the_file
|   put "mystic_mouse" into pattern
|   open file the_file for read
|   repeat until (the_offset = file_size)
|     read from file the_file at the_offset for the_increment
|     repeat until lineoffset("mystic_mouse", it) = 0
|       if (lineoffset("mystic_mouse", it) is not 0) then
|         add 1 to the_counter
|         delete line 1 to lineoffset("mystic_mouse", it) of it
|       end if
|     end repeat
|     # put it into the_text
|     # repeat until lineoffset("mystic_mouse", the_text) = 0
|     #   if (lineoffset("mystic_mouse", the_text) is not 0) then
|     #     add 1 to the_counter
|     #     delete line 1 to lineoffset("mystic_mouse", the_text) of the_text
|     #   end if
|     # end repeat
|     add the_increment to the_offset
|   end repeat
|   put the_counter
| end startup
|
| [snip]

Hi Friends,

Does the "it" improvement proposal above work as expected? Perhaps yes, perhaps not ;)

--
Regards,
Pierre Sahores
Inspection académique de Seine-Saint-Denis
WEB and VPN applications and databases
Qualifying and producing the competitive advantage
Re: the large file challenge
--On Sunday, November 10, 2002 13:21:04 -0800 Sadhunathan Nadesan [EMAIL PROTECTED] wrote:

Here's another try, for whatever it's worth. I tested it on a file of about 800k with 7000 lines, and it takes less than a sec:

on startup
  put 0 into tCount
  put "mystic_mouse" into tWord
  put empty into line 3000 of tChunk
  put "/gig/tmp/log/access_log" into tFile
  open file tFile for read
  put 0 into fOffset
  repeat
    read from file tFile at fOffset + 1 for 3000 lines # can play with that number for best results
    put it into tChunk
    put 0 into tSkip
    repeat
      get offset(tWord, tChunk, tSkip)
      if it is not 0 then
        add 1 to tCount
        add it + length(tWord) to tSkip
      else
        put 0 into tSkip
        exit repeat
      end if
    end repeat
    add length(tChunk) to fOffset
    if the num of lines of tChunk < 3000 then exit repeat
  end repeat
  put tCount
end startup

Regards, Andu Novac
Re: the large file challenge
On Sun, 10 Nov 2002 Richard Gaskin [EMAIL PROTECTED] wrote:
| My hunch is that reading for lines is slower than reading a specified
| number of chars, since with lines it needs to evaluate each incoming
| character to determine if it's a return -- Scott, am I right or should
| they be about the same?

You're right, though I wouldn't think it would make *that* much difference.

As for my guess as to the fastest way to do this, it'd probably be a hybrid approach, using both "read for x" and "repeat for each line". You'd start by opening the file for binary read (faster than other modes). Then read for X characters, where X would be some large number experimentally determined for each system (it'd probably be some large percentage of the free RAM, and so probably on the order of a few MB), and then use "repeat for each line l in it". The trick is that the last line will be incomplete in this case, so for the second and subsequent reads you subtract the length of the last line from X and do "read for X at Y", where Y is a running total of what's been read, after subtracting the partial lines of course. Some extra bookkeeping will be required in this case (e.g., if the tag you're looking for is in the partial last line, you need to subtract 1 from the count so you don't count it twice). Exactly how to do this part most efficiently is left as an exercise for the reader ;-)

Regards, Scott

Scott Raney [EMAIL PROTECTED] http://www.metacard.com
MetaCard: You know, there's an easier way to do that...
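[Editor's aside: Scott's hybrid approach can be sketched in Python, with one simplification: instead of re-reading the partial last line (and subtracting from the count as he describes), this version just carries the partial line forward into the next chunk, which sidesteps the double-count bookkeeping. Path and chunk size are placeholders; a text log with newline-terminated lines is assumed.]

```python
def count_hybrid(path, pattern, chunk_bytes=1 << 20):
    """Read large byte chunks, but hold back the trailing partial line
    and prepend it to the next chunk, so no line (and no match) is ever
    split across a read boundary."""
    count = 0
    carry = ""
    with open(path) as f:
        while True:
            data = f.read(chunk_bytes)
            if not data:
                count += carry.count(pattern)  # check the final partial line
                break
            buf = carry + data
            cut = buf.rfind("\n") + 1          # end of the last complete line
            count += buf[:cut].count(pattern)
            carry = buf[cut:]                  # partial line, carried forward
    return count
```

Large chunks keep the per-read overhead low, as Scott suggests, while the carry keeps the count exact regardless of where the chunk boundaries fall.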
Re: the large file challenge
All right... I tweaked a little more outside of email.

For accuracy in the case where "mystic_mouse" occurs multiple times on one line, uncomment the line "add offset(return, thisChunk, theOffset) to theOffset". This just skips to the next line whenever a match is found.

This should run faster than my previous attempts:

on startup
  ## initialize variables: try adjusting chunkSize
  put "/gig/tmp/log/access_log" into the_file
  put ($1 * 1024 * 1024) into chunkSize ## this is for MB
  put 0 into counter
  put FALSE into isEOF
  open file the_file
  repeat until (isEOF = TRUE)
    ## read the specified number of bytes, check if we are at the end of the file
    read from file the_file for chunkSize
    put it into thisChunk
    put (the result = "eof") into isEOF
    ## count the number of matches in this chunk
    put offset("mystic_mouse", thisChunk) into theOffset
    repeat while (theOffset > 0) ## guard so a chunk with no match adds nothing
      add 1 to counter
      get offset("mystic_mouse", thisChunk, theOffset)
      if (it = 0) then exit repeat
      put theOffset + it + 12 into theOffset
      ## add offset(return, thisChunk, theOffset) to theOffset
    end repeat
  end repeat
  close file the_file
  put counter
end startup

HTH.
Brian
Re: the large file challenge
Sadhunathan Nadesan a écrit : | Try something alike : | | on mouseup | put 1 into startread | open file thefile for read | read from file thefile until eof | put the num of lines of it in endtoread | close file thefile | repeat while startread endtoread | open file thefile for read | read from file thefile at startread for 99 lines | ... | do what you need with it | ... | close file thefile | add 100 to startread | end repeat | end mouseup Alors, Pierre, Many thanks. This turned out to be more efficient than I thought. I had to modify it slightly because the 'read from file at' command takes an offset in characters, not lines. (Code below). Anyway, on those 3 sample programs, here are the times on the last run, not my full access log, but a chopped (50,000 lines) snippet. Bash shell script (interpreted) 24 seconds Pascal (compiled) 7 seconds Metacard (interpreted) 2 minutes 50 seconds So, any takers on the speed challenge? Here's the code I used. #!/usr/local/bin/mc on startup put /gig/tmp/log/xaa into the_file put 1 into start_read put 0 into the_counter put 1 into the_offset open file the_file for read read from file the_file until eof put the num of lines of it into end_read close file the_file repeat while (start_read end_read) open file the_file for read read from file the_file at the_offset for 99 lines put it into the_text put the number of chars of it + the_offset into the_offset repeat until lineoffset(mystic_mouse,the_text) = 0 if lineoffset(mystic_mouse,the_text) is not 0 then put the_counter + 1 into the_counter delete line 1 to lineoffset(mystic_mouse,the_text) of the_text end if end repeat # repeat for each line this_line in the_text # if (not eof) then # if (this_line contains mystic_mouse) then # put the_counter + 1 into the_counter # end if # end if # end repeat close file the_file add 100 to start_read end repeat put the_counter end startup Now, I feel sure we could improve this, fix my errors, etc anyone? 
Sadhu

___
metacard mailing list
[EMAIL PROTECTED]
http://lists.runrev.com/mailman/listinfo/metacard

Allo Sadhu,

Perhaps there is a way to speed up your script by using the lineoffset
statement, as in the proposal above ;)

--
Cordialement, Pierre Sahores
Inspection académique de Seine-Saint-Denis.
Applications et bases de données WEB et VPN
Qualifier et produire l'avantage compétitif
Re: the large file challenge
Wow,

Just logged on to work and saw all the great responses. Thanks all, what
fun. Anyway, I will respond to each later and try your code too. I have to
run right now, appointment.

I did however have some code from Andu via Swami that I modified somewhat
and got enormous speed improvement. Here's the latest run (ran this several
times so the file would be in cache equally for all programs):

  bash      Sat Nov 9 16:48:12 PST 2002   17333   Sat Nov 9 16:50:05 PST 2002
  pascal    Sat Nov 9 16:50:05 PST 2002   17333   Sat Nov 9 16:52:09 PST 2002
  metacard  Sat Nov 9 16:52:09 PST 2002   17338   Sat Nov 9 16:54:28 PST 2002

So that is 1:53 for bash, 2:04 for pascal, and 2:19 for MC. Darn good!

Here's the code, gotta go...

#!/usr/local/bin/mc
on startup
  put 0 into the_counter
  put 1 into the_offset
  put 333491183 into file_size
  put 3 into the_increment
  put "/gig/tmp/log/access_log" into the_file
  open file the_file for read
  repeat until (the_offset >= file_size)
    read from file the_file at the_offset for the_increment
    put it into the_text
    repeat for each line this_line in the_text
      get offset("mystic_mouse", this_line)
      if (it is not 0) then add 1 to the_counter
    end repeat
    add the_increment to the_offset
  end repeat
  put the_counter
end startup
Re: the large file challenge
Sadhunathan Nadesan wrote:

 So that is 1:53 for bash, 2:04 for pascal, and 2:19 for MC. darn good!

But golly, I thought an interpreted language like MetaTalk was supposed to
be slow, certainly much slower than compiled Pascal. :)

--
Richard Gaskin
Fourth World Media Corporation
Developer of WebMerge 2.0: Publish any database on any site
[EMAIL PROTECTED]       http://www.FourthWorld.com
Tel: 323-225-3717                        AIM: FourthWorldInc
Re: the large file challenge
Om Sadhunathan:

Excellent! i had been thinking that we should probably save access logs
from our servers in honolulu, but then parsing those was a blind spot. This
will help immensely.

Now, do i read this to say that there were 17,338 attempts to look at
Mystic Mouse PDF's? and if so, over what period of time? A small addition
to the script and we could determine if the download was completed. (204, i
think... or may be 304?)

On Saturday, November 9, 2002, at 03:17 PM, Sadhunathan Nadesan wrote:

 metacard  Sat Nov 9 16:52:09 PST 2002   17338   Sat Nov 9 16:54:28 PST 2002

 So that is 1:53 for bash, 2:04 for pascal, and 2:19 for MC. darn good!

 Here's the code, gotta go...

 #!/usr/local/bin/mc
 on startup
   put 0 into the_counter
   put 1 into the_offset
   put 333491183 into file_size
   put 3 into the_increment
   put "/gig/tmp/log/access_log" into the_file
   open file the_file for read
   repeat until (the_offset >= file_size)
     read from file the_file at the_offset for the_increment
     put it into the_text
     repeat for each line this_line in the_text
       get offset("mystic_mouse", this_line)
       if (it is not 0) then add 1 to the_counter
     end repeat
     add the_increment to the_offset
   end repeat
   put the_counter
 end startup
Re: the large file challenge
Sannyasin Sivakatirswami wrote:

 Om Sadhunathan:
 Excellent! i had been thinking that we should probably save access logs
 from our servers in honolulu, but then parsing those was a blind spot.
 This will help immensely. Now, do i read this to say that there were
 17,338 attempts to look at Mystic Mouse PDF's? and if so, over what
 period of time? A small addition to the script and we could determine if
 the download was completed. (204, i think... or may be 304?)

 [quoted timings and script snipped]

Aloha Friends,

So! MC almost as fast as Pascal! Isn't it great? And thanks again to
Scott, for that too!

Just one more question: could you tell us, Scott, when MC will become
faster than C, or is it a secret? ;-)

--
Cordialement, Pierre Sahores
Inspection académique de Seine-Saint-Denis.
Applications et bases de données WEB et VPN
Qualifier et produire l'avantage compétitif
Re: the large file challenge
Sadhunathan Nadesan wrote:

 #!/usr/local/bin/mc
 on startup
   put "/gig/tmp/log/xaa" into the_file
   put 1 into start_read
   put 0 into the_counter
   put 1 into the_offset
   open file the_file for read
   read from file the_file until eof
   put the num of lines of it into end_read
   close file the_file
   repeat while (start_read <= end_read)
     open file the_file for read
     read from file the_file at the_offset for 99 lines
     put it into the_text
     put the number of chars of it + the_offset into the_offset
     repeat for each line this_line in the_text
       if (not eof) then
         if (this_line contains "mystic_mouse") then
           put the_counter + 1 into the_counter
         end if
       end if
     end repeat
     close file the_file
     add 100 to start_read
   end repeat
   put the_counter
 end startup

 Now, I feel sure we could improve this, fix my errors, etc. Anyone?

I'm confused: if the point is to avoid reading the entire file into memory,
isn't that what line 8 does? And if it's already in memory, why is it read
again inside the loop? I think I missed something from the original post.

--
Richard Gaskin
Fourth World Media Corporation
Re: the large file challenge
--On Friday, November 08, 2002 18:24:56 -0800 Richard Gaskin
[EMAIL PROTECTED] wrote:

 Sadhunathan Nadesan wrote:
 [quoted script snipped]
 Now, I feel sure we could improve this, fix my errors, etc. Anyone?

 I'm confused: if the point is to avoid reading the entire file into
 memory, isn't that what line 8 does? And if it's already in memory, why
 is it read again inside the loop? I think I missed something from the
 original post.

No, you got it right.

Regards, Andu Novac
Re: the large file challenge
andu wrote:

 I think I missed something from the original post

 No, you got it right.

Thanks, Andu. I thought I was losin' it.

If we're allowed to read the whole thing into RAM and the goal is to count
the occurrences of the string "mystic_mouse", then to optimize speed we can
just remove the redundant read commands and use offset to search for us:

#!/usr/local/bin/mc
on startup
  put "/gig/tmp/log/xaa" into the_file
  put url ("file:" & the_file) into the_text
  put 0 into the_counter
  put 1 into tPointer
  repeat -- for each line this_line in the_text
    get offset("mystic_mouse", the_text, tPointer)
    if it = 0 then exit repeat
    add 1 to the_counter
    add it to tPointer
  end repeat
  put the_counter
end startup

This is off the top of my head. If it runs I'd be interested in how it
compares.

--
Richard Gaskin
Fourth World Media Corporation
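The offset-with-skip loop above maps directly onto str.find in Python. A small sketch (illustrative names, not from the thread) of the same pointer-hopping search over an already-loaded string:

```python
def count_occurrences(text, needle):
    # Hop from match to match, resuming the search just past each hit;
    # str.find plays the role of offset() with a chars-to-skip argument.
    count = 0
    pos = text.find(needle)
    while pos != -1:
        count += 1
        pos = text.find(needle, pos + len(needle))
    return count
```

This is equivalent to text.count(needle), which also counts non-overlapping occurrences; the explicit loop just mirrors the MetaTalk structure.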
Re: the large file challenge
--On Friday, November 08, 2002 19:15:59 -0800 Richard Gaskin
[EMAIL PROTECTED] wrote:

 [quoted script snipped]

 This is off the top of my head. If it runs I'd be interested in how it
 compares.

Here's my take, considering "mystic_mouse" can occur only once on a line
and loading 300MB into RAM is not an issue:

on startup
  put url ("file:/gig/tmp/log/xaa") into the_text
  put 0 into the_counter
  repeat for each line this_line in the_text
    get offset("mystic_mouse", this_line)
    if it is not 0 then add 1 to the_counter
  end repeat
  put the_counter
end startup

...and I'm sure it could be improved.

Regards, Andu Novac
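The line-at-a-time pass above has an even terser Python analogue (a sketch, names mine). It accepts any iterable of lines, so handing it an open file object streams the log lazily instead of pulling all 300MB into RAM first:

```python
def count_lines_with(lines, needle):
    # One count per line containing needle, matching the assumption
    # that "mystic_mouse" occurs at most once per line.
    return sum(1 for line in lines if needle in line)

# e.g. with a file object (the path is illustrative):
#   with open("/gig/tmp/log/xaa") as f:
#       print(count_lines_with(f, "mystic_mouse"))
```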
Re: the large file challenge
I'm pretty sure the problem with speed here is from reading in the entire
file. Unless of course you have enough free RAM, but that's hard to imagine
when the files are 300MB+.

How about this, which you can adjust to read any given number of lines at a
time. Try it with 10, 1000, etc. and see what gives you the best
performance! Hasn't been tested, but hopefully it'll run with a tweak or
less.

#!/usr/local/bin/mc
on startup
  ## initialize variables: try adjusting numLines
  put "/gig/tmp/log/xaa" into the_file
  put 1000 into numLines
  put 0 into counter
  put FALSE into isEOF
  open file the_file
  repeat until (isEOF = TRUE)
    ## read the specified number of lines, check if we are at the end of the file
    read from file the_file for numLines lines
    put it into thisChunk
    put (the result = "eof") into isEOF
    ## count the number of matches in this chunk
    put offset("mystic_mouse", thisChunk) into theOffset
    repeat until (theOffset = 0)
      add 1 to counter
      put offset("mystic_mouse", thisChunk, theOffset) into tempOffset
      if (tempOffset > 0) then
        add tempOffset to theOffset
      else
        put 0 into theOffset
      end if
    end repeat
  end repeat
  close file the_file
  put counter
end startup

HTH,
Brian
Re: the large file challenge
One last note:

Be careful of using

  read from file xxx for yyy

If you do not read for "lines", you run the risk of cutting a line in half
right on the spot where your magic string occurs. So always use

  read from file xxx for yyy lines

HTH.
Brian
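Brian's warning is easy to demonstrate. A tiny Python sketch (hypothetical helper, not from the thread) of the naive fixed-size read, with no overlap between chunks, shows a match vanishing when a chunk boundary lands inside it:

```python
def count_fixed_chunks(text, needle, chunk_size):
    # Naive chunking: count within each fixed-size slice independently.
    # A match straddling a slice boundary is seen by neither slice.
    return sum(text[i:i + chunk_size].count(needle)
               for i in range(0, len(text), chunk_size))

# With "aa mystic_mouse bb" and chunk_size 8 the slices are
# "aa mysti" / "c_mouse " / "bb": the match is counted zero times,
# while a chunk_size of 15 or more finds it once.
```

Reading whole lines (or carrying an overlap between chunks) avoids this undercount.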