RE the large file challenge
| Maybe we need a new name for what Transcript does.
|
| Transcript pre-processes scripts into pointer-based bytecode, which
| generally outperforms purely interpreted xTalk by anywhere from several
| times to a few orders of magnitude.

Maybe? This is an excellent clarification. If MC is seen as trying to compete with Java, and Sun has decided to redefine "compiled," then, hey! Why not. Come to think of it, they used to call UCSD Pascal compiled, but it was p-code; possibly similar?

There is an exception: when MC is used as a scripting language, as with CGI scripts or the tests I have been running, there is no preprocessing. In that case, I believe "interpreted" would be the correct description. The good news is, it _still_ compares in speed to the compiled languages.

For an interesting read on security and high level languages, this is fun:
http://m.bacarella.com/papers/secsoft/html

Sadhu

___
metacard mailing list
[EMAIL PROTECTED]
http://lists.runrev.com/mailman/listinfo/metacard
Re: RE the large file challenge
Sadhunathan Nadesan wrote:
| For an interesting read on security and high level languages, this is fun:
| http://m.bacarella.com/papers/secsoft/html

Great article -- thanks for posting that!

--
Richard Gaskin
Fourth World Media Corporation
Developer of WebMerge 2.0: Publish any database on any site
http://www.FourthWorld.com
[EMAIL PROTECTED]
Tel: 323-225-3717
AIM: FourthWorldInc
RE: the large file challenge
| Message: 1
| Date: Thu, 14 Nov 2002 10:39:01 -0700
| Subject: RE: the large file challenge
| From: John Vokey [EMAIL PROTECTED]
| To: [EMAIL PROTECTED]
| Reply-To: [EMAIL PROTECTED]
|
| To be fair: most of metacard is coded in metatalk; it is a
| boot-strapped language, much like many of the TILs (threaded
| interpreted languages) of yesteryears (e.g., Forth, APL).

John, I agree with you too. I take his point that a C program should not be slower than a bash script invoking two utilities written in C. If anyone cares to contribute a better C program, go for it! Right now I'm running Pierre's MC revision to see how it does.

This has been fun, but I think we've come to the end. I think it has come to light that MC holds its own with compiled languages. That was where this whole thing began: I was explaining to Swami that MC is not a compiled language, and then Scott kinda said, so what, there is not that much difference between compiled and interpreted languages these days. That would be supported by the results of the timing tests, so I'd have to agree with Scott.

However, I'm still sticking to my guns: MC is not a compiled language. Swami apparently thought it was, so I was trying to clarify it for him. And that led to all this fun! :-)

Sadhu
Re: the large file challenge
Sadhunathan Nadesan wrote:
| I think it has come to light that MC holds its own with compiled
| languages. That was where this whole thing began, I was explaining to
| Swami that MC is not a compiled language, then Scott kinda said, so
| what, there is not that much difference between compiled and
| interpreted languages these days. That would be supported by the
| results of the timing tests, so I'd have to agree with Scott. However,
| I'm still sticking to my guns - MC is not a compiled language.

Maybe we need a new name for what Transcript does.

Transcript pre-processes scripts into pointer-based bytecode, which generally outperforms purely interpreted xTalk by anywhere from several times to a few orders of magnitude. Sun calls their arguably less-efficient form of bytecode "compiled," and makes no bones that they're only compiling for a virtual machine. In that sense, "compiled" seems the more appropriate term.

Yet Transcript does not store its bytecode, so there is one pass of pure interpretation to create the bytecode when an object first loads. So in that sense, "interpreted" seems the more appropriate term. :\

Erring on the side of underselling, I prefer "tokenized." But that makes for useless marketing copy, since it takes several paragraphs to explain what "tokenized" means to the general public.

There's also a good case for just calling it "interpreted" without the unnecessary apology, given the benefits of scripting for the sorts of tasks one is likely to use Rev for. But communicating those benefits takes even more explanation; Ousterhout wrote the single clearest paper on the subject I've seen yet, but few have read http://dev.scriptics.com/doc/scripting.html, and you'd have to remind people to mentally replace "TCL" with "Rev" when reading it. Moreover, the strength of the argument there appeals primarily to those with experience in both 3GLs and 4GLs, and would be lost on most non-geeks. (I'm writing a version of Ousterhout's argument focusing on Transcript the way he focuses on TCL, but it'll be a little while before it's finished; gotta ship a few products first.)

Maybe we should just call Transcript "fast." :)

--
Richard Gaskin
Fourth World Media Corporation
Developer of WebMerge 2.0: Publish any database on any site
http://www.FourthWorld.com
Tel: 323-225-3717
AIM: FourthWorldInc
Re: the large file challenge
Sadhunathan Nadesan wrote:
| Ok, here are the results so far,
|
| bash               Sun Nov 10 13:01:59 PST 2002   17333   Sun Nov 10 13:03:43 PST 2002
| pascal             Sun Nov 10 13:03:43 PST 2002   17333   Sun Nov 10 13:05:47 PST 2002
| andu's metacard    Sun Nov 10 13:05:47 PST 2002   29623   Sun Nov 10 13:08:10 PST 2002
| pierre's metacard  Sun Nov 10 13:08:10 PST 2002   17338   Sun Nov 10 13:10:21 PST 2002
| bruce's metacard   Sun Nov 10 13:10:21 PST 2002   33351   Sun Nov 10 13:14:59 PST 2002
|
| That would be
|
| bash    1:44
| pascal  2:04
| Andu    2:23
| Pierre  2:11
| Bruce   4:38
|
| Now, why did we get different counts? I believe the count of 17333 is
| correct. Maybe someone can debug that. Here's the code
|
| [snip]
|
| Pierre --
|
| #!/usr/local/bin/mc
| on startup
|   put 0 into the_counter
|   put 1 into the_offset
|   put 333491183 into file_size
|   put 3 into the_increment
|   put "/gig/tmp/log/access_log" into the_file
|   put "mystic_mouse" into pattern
|   open file the_file for read
|   repeat until (the_offset = file_size)
|     read from file the_file at the_offset for the_increment
|     filter it with "*mystic_mouse*"
|     put it into tempo
|     add the num of lines in tempo to the_counter
|     # put it into the_text
|     # repeat until lineoffset("mystic_mouse", the_text) = 0
|     #   if (lineoffset("mystic_mouse", the_text) is not 0) then
|     #     add 1 to the_counter
|     #     delete line 1 to lineoffset("mystic_mouse", the_text) of the_text
|     #   end if
|     # end repeat
|     add the_increment to the_offset
|   end repeat
|   put the_counter
| end startup
|
| [snip]

Aloha,

What does it do when using the filter command instead of the lineoffset one? Faster, slower?

--
Regards,
Pierre Sahores
Inspection académique de Seine-Saint-Denis
WEB and VPN applications and databases
Qualifying and producing the competitive advantage
RE: the large file challenge
To be fair: most of metacard is coded in metatalk; it is a boot-strapped language, much like many of the TILs (threaded interpreted languages) of yesteryears (e.g., Forth, APL).

On Thursday, November 14, 2002, at 10:01 AM, [EMAIL PROTECTED] wrote:
| MC, as well, is also coded in C, so in many interpreted languages (bash,
| perl, MC), while the script itself is interpreted, much of the real work
| is done by compiled code.

--
John R. Vokey
RE: the large file challenge
| Actually, this says more about your specific implementation of the
| algorithm and/or the quality of your compiler than it does about the
| relative speed merits of any given language. As in your bash example,
| the bash shell actually calls functions from libraries of well written,
| highly optimized C code. So, all things being equal, straight C code
| could never be slower than a bash shell script.
|
| MC, as well, is also coded in C, so in many interpreted languages (bash,
| perl, MC), while the script itself is interpreted, much of the real work
| is done by compiled code.

Yes, I agree.
RE: the large file challenge
| Here's the latest round of times
|
| bash    1:44
| pascal  2:04
| C       2:28
| MC      2:10
|
| goodness, C is slowest of all?!?

Actually, this says more about your specific implementation of the algorithm and/or the quality of your compiler than it does about the relative speed merits of any given language. As in your bash example, the bash shell actually calls functions from libraries of well written, highly optimized C code. So, all things being equal, straight C code could never be slower than a bash shell script.

MC, as well, is also coded in C, so in many interpreted languages (bash, perl, MC), while the script itself is interpreted, much of the real work is done by compiled code.

-Glen Yates
Re: the large file challenge
Pierre Sahores wrote:
| Richard Gaskin wrote:
| | Pierre Sahores wrote:
| | | So! MC nearly as fast as Pascal! Isn't it great? And, thanks again
| | | to Scott, for that too!
| |
| | It's enough to make a Java programmer cry. ;)
|
| Java? Help me to remember... Are you speaking, Richard, about that dead
| marketed toy that crashes any time it goes looking for some more RAM to
| eat?

If you're thinking of the one with the slow development cycle and the even slower runtime speed, yep, that's the critter.

Anyone care to write this challenge algorithm in Java for laughs? Or would we need Raney to add a new time token in addition to seconds, ticks, and milliseconds: "eons." :)

--
Richard Gaskin
Fourth World Media Corporation
Developer of WebMerge 2.0: Publish any database on any site
http://www.FourthWorld.com
Tel: 323-225-3717
AIM: FourthWorldInc
Re: the large file challenge
| I'm confused: if the point is to avoid reading the entire file into
| memory, isn't that what line 8 does? And if it's already in memory, why
| is it read again inside the loop?
|
| I think I missed something from the original post

Hi,

Sorry, yes, you missed something, but not from the original post; the part you missed wasn't posted at all. It went like this:

1. (not posted) A conversation in progress regarding the difference between compiled programs, like C, and interpreted programs, like Metacard.

2. (not posted) An example sent of a shell script (bash or Bourne shell) on Unix (interpreted, of course) and a Pascal program doing the same thing (compiled, of course). Question asked: how would one do this in MC? I am not an experienced MC developer, and I had no idea.

3. (not posted) A code snippet was sent to me as an example, and I turned it into a working program. Yes, it starts out by reading the whole file to count the lines, which is not very efficient. In fact, it failed with an out-of-memory error when run on the large access file.

4. (where you came in) I sent a post inquiring, basically, isn't there a better way?

I got a lot of good responses, and it seems there are much better ways, so I am going to try them all.

Clear it up for you?

Sadhu
Re: the large file challenge
| If we're allowed to read the whole thing into RAM and the goal is to
| count the occurrences of the string "mystic_mouse", then to optimize
| speed we can just remove the redundant read commands and use offset to
| search for us:
|
| #!/usr/local/bin/mc
| on startup
|   put "/gig/tmp/log/xaa" into the_file
|   put url ("file:" & the_file) into the_text
|   put 0 into the_counter
|   put 1 into tPointer
|   --
|   repeat for each line this_line in the_text
|     get offset("mystic_mouse", the_text, tPointer)
|     if it = 0 then exit repeat
|     add 1 to the_counter
|     add it to tPointer
|   end repeat
|   put the_counter
| end startup
|
| This is off the top of my head. If it runs I'd be interested in how it
| compares.

Richard,

Thanks much for the code and suggestions. We aren't allowed to read the whole thing into memory, because the real access file is 300 MB and my poor little Linux box has only 128 MB of RAM. One of the great things about Linux, of course, is that it will run fine on minimal hardware.

Anyway, alas, the program failed with this message:

mc: out of memory
0

Ok, on to the next suggestion!

Sadhu
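[Editor's aside: for readers more at home in a mainstream language, the offset-with-skip loop above can be sketched in Python. This is an illustration of the technique, not the MetaTalk code itself; the file path in the comment is just the one from the thread.]

```python
def count_occurrences(text, pattern):
    """Count occurrences by repeatedly searching from just past the
    previous match, like the MetaTalk offset() loop above."""
    count, pos = 0, 0
    while True:
        pos = text.find(pattern, pos)
        if pos == -1:
            return count
        count += 1
        pos += len(pattern)  # skip past this match before searching again

# Reading the whole file first is fast but requires it to fit in RAM --
# exactly what failed on a 300 MB log with 128 MB of memory:
# with open("/gig/tmp/log/xaa") as f:
#     print(count_occurrences(f.read(), "mystic_mouse"))
```

The skip-by-pattern-length step is what keeps the loop linear: each character of the text is examined at most once per search pass.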
Re: the large file challenge
| I'm pretty sure the problem with speed here is from reading in the
| entire file. Unless of course you have enough free RAM -- but that's
| hard to imagine when the files are 300MB+.
|
| How about this, which you can adjust to read any given number of lines
| at a time. Try it with 10, 1000, 1, etc and see what gives you the best
| performance! Hasn't been tested but hopefully it'll run with a tweak or
| less.
|
| #!/usr/local/bin/mc
| on startup
|   ## initialize variables: try adjusting numLines
|   put "/gig/tmp/log/xaa" into the_file
|   put 1000 into numLines
|   put 0 into counter
|
|   open file the_file
|
|   repeat until (isEOF = TRUE)
|     ## read the specified number of lines, check if we are at the end
|     ## of the file
|     read from file the_file for numLines lines
|     put it into thisChunk
|     put (the result = "eof") into isEOF
|
|     ## count the number of matches in this chunk
|     put offset("mystic_mouse", thisChunk) into theOffset
|     repeat until (theOffset = 0)
|       add 1 to counter
|       put offset("mystic_mouse", thisChunk, theOffset) into tempOffset
|       if (tempOffset > 0) then add tempOffset to theOffset
|       else put 0 into theOffset
|     end repeat
|
|   end repeat
|
|   close file the_file
|
|   put counter
| end startup
|
| HTH,
| Brian

Hey Brian, thanks, excellent. I tried it with 10, 1000, 1, etc., and it got slightly faster (just a few seconds) with each increase, so I'll leave it at 1 and compare against other suggested algorithms, and let everyone know the results.

Sadhu
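[Editor's aside: a rough Python equivalent of Brian's read-N-lines-at-a-time approach, as a sketch; the path and pattern are placeholders standing in for the thread's access log and search string.]

```python
from itertools import islice

def count_by_line_chunks(path, pattern, num_lines=1000):
    """Read num_lines lines at a time so memory use stays bounded,
    counting pattern occurrences chunk by chunk."""
    count = 0
    with open(path) as f:
        while True:
            chunk = list(islice(f, num_lines))
            if not chunk:
                break  # end of file
            count += sum(line.count(pattern) for line in chunk)
    return count
```

Because every chunk ends on a line boundary, a match can never be cut in half by a read, which is the pitfall Brian warns about for byte-count reads.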
Re: the large file challenge
| One last note:
|
| Be careful of using "read from file xxx for yyy"
|
| If you do not read for lines, you run the risk of cutting a line in
| half on the spot where your magic string occurs.
|
| So always use "read from file xxx for yyy LINES"
|
| HTH.
| Brian

Good point. For this particular use of the program a close count is ok; no problem if it's not perfect. But clearly, that might matter in other instances. It is interesting that the different algorithms are varying slightly in the count, probably for reasons like you mention.
Re: the large file challenge
Sadhunathan Nadesan wrote:
| | Be careful of using "read from file xxx for yyy"
| |
| | If you do not read for lines, you run the risk of cutting a line in
| | half on the spot where your magic string occurs.
| |
| | So always use "read from file xxx for yyy LINES"
|
| Good point. For this particular use of the program a close count is ok.
| [snip] It is interesting that the different algorithms are varying
| slightly in the count, probably for reasons like you mention.

My hunch is that reading for lines is slower than reading a specified number of chars, since with lines it needs to evaluate each incoming character to determine if it's a return -- Scott, am I right, or should they be about the same?

--
Richard Gaskin
Fourth World Media Corporation
Developer of WebMerge 2.0: Publish any database on any site
http://www.FourthWorld.com
Tel: 323-225-3717
AIM: FourthWorldInc
Re: the large file challenge
| # repeat for each line this_line in the_text
| #   if (not eof) then
| #     if (this_line contains "mystic_mouse") then
| #       put the_counter + 1 into the_counter
| #     end if
| #   end if
| # end repeat
|
| close file the_file

| Allo Sadhu,
|
| Perhaps there is a way to speed up your script by using the lineoffset
| statement, as in the proposal above ;)

Allo! I'll try that. Merci!
Re: the large file challenge
| So that is 1:53 for bash, 2:04 for pascal, and 2:19 for MC. darn good!
|
| But golly, I thought an interpreted language like MetaTalk was supposed
| to be slow, certainly much slower than compiled Pascal.
|
| :)

By golly, that would be the conventional wisdom alright, I think! Another myth goes by the wayside? :-)

Of course, now the C programmers will probably come out of the closet. (They might want to know: what compiler, what flags set, etc.) Point might be, that is a non-issue with MC. Assembly language programmers need not apply.
Re: the large file challenge
Sadhunathan Nadesan wrote:
| By golly, that would be the conventional wisdom alright, I think!
| Another myth goes by the wayside? :-) Of course, now the C programmers
| will probably come out of the closet.

Not if Tom Pittman is around. I've never seen objective data on the subject, but he has the opinion that Pascal can and should be optimized to outperform C for most operations, if compilers are designed to do so.

--
Richard Gaskin
Fourth World Media Corporation
Developer of WebMerge 2.0: Publish any database on any site
http://www.FourthWorld.com
Tel: 323-225-3717
AIM: FourthWorldInc
Re: the large file challenge
I got this suggestion from Jeanne A. E. DeVoto ~ [EMAIL PROTECTED]:

repeat
  read from stdin until "mystic_mouse"
  if the result is not empty then add 1 to the_counter -- found it
  else exit repeat -- encountered end of file, no more occurrences
end repeat
put the_counter

But I was not able to make it actually run. Any suggestions?
Re: the large file challenge
Ok, here are the results so far,

bash               Sun Nov 10 13:01:59 PST 2002   17333   Sun Nov 10 13:03:43 PST 2002
pascal             Sun Nov 10 13:03:43 PST 2002   17333   Sun Nov 10 13:05:47 PST 2002
andu's metacard    Sun Nov 10 13:05:47 PST 2002   29623   Sun Nov 10 13:08:10 PST 2002
pierre's metacard  Sun Nov 10 13:08:10 PST 2002   17338   Sun Nov 10 13:10:21 PST 2002
bruce's metacard   Sun Nov 10 13:10:21 PST 2002   33351   Sun Nov 10 13:14:59 PST 2002

That would be

bash    1:44
pascal  2:04
Andu    2:23
Pierre  2:11
Bruce   4:38

Now, it is likely I have become confused and mixed up exactly what came from who, sorry about that! My apologies if your name is not associated with your contribution, or vice versa.

Now, why did we get different counts? I believe the count of 17333 is correct. Maybe someone can debug that. Here's the code:

Andu ---

#!/usr/local/bin/mc
on startup
  put 0 into the_counter
  put 1 into the_offset
  put 333491183 into file_size
  put 3 into the_increment
  put "/gig/tmp/log/access_log" into the_file
  put "mystic_mouse" into pattern
  open file the_file for read
  repeat until (the_offset = file_size)
    read from file the_file at the_offset for the_increment
    put it into the_text
    repeat for each line this_line in the_text
      get offset(pattern, this_line)
      if (it is not 0) then add 1 to the_counter
    end repeat
    add the_increment to the_offset
  end repeat
  put the_counter
end startup

Pierre --

#!/usr/local/bin/mc
on startup
  put 0 into the_counter
  put 1 into the_offset
  put 333491183 into file_size
  put 3 into the_increment
  put "/gig/tmp/log/access_log" into the_file
  put "mystic_mouse" into pattern
  open file the_file for read
  repeat until (the_offset = file_size)
    read from file the_file at the_offset for the_increment
    put it into the_text
    repeat until lineoffset("mystic_mouse", the_text) = 0
      if (lineoffset("mystic_mouse", the_text) is not 0) then
        add 1 to the_counter
        delete line 1 to lineoffset("mystic_mouse", the_text) of the_text
      end if
    end repeat
    add the_increment to the_offset
  end repeat
  put the_counter
end startup

Bruce -

#!/usr/local/bin/mc
on startup
  ## initialize variables: try adjusting numLines
  put "/gig/tmp/log/access_log" into the_file
  put $1 into numLines -- called with 1 as parameter
  put 0 into counter
  open file the_file
  repeat until (isEOF = TRUE)
    ## read the specified number of lines, check if we are at the end of the file
    read from file the_file for numLines lines
    put it into thisChunk
    put (the result = "eof") into isEOF
    ## count the number of matches in this chunk
    put offset("mystic_mouse", thisChunk) into theOffset
    repeat until (theOffset = 0)
      add 1 to counter
      put offset("mystic_mouse", thisChunk, theOffset) into tempOffset
      if (tempOffset > 0) then add tempOffset to theOffset
      else put 0 into theOffset
    end repeat
  end repeat
  close file the_file
  put counter
end startup
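[Editor's aside: one plausible source of the count drift above, sketched in Python with made-up data. Reading a file in fixed-size byte chunks can cut an occurrence of the search string in half at a chunk boundary; carrying a short overlap between chunks avoids that. The exact bug in each script above differs, so this is an illustration of the failure mode, not a diagnosis.]

```python
PATTERN = b"mystic_mouse"

def naive_chunk_count(data, chunk_size):
    """Count matches chunk by chunk with no overlap; matches that
    straddle a chunk boundary are silently lost."""
    return sum(data[i:i + chunk_size].count(PATTERN)
               for i in range(0, len(data), chunk_size))

def overlap_chunk_count(data, chunk_size):
    """Carry the last len(PATTERN)-1 bytes into the next chunk so a
    straddling match is still seen, and seen exactly once (the carry is
    shorter than the pattern, so no match fits entirely inside it)."""
    count, carry = 0, b""
    for i in range(0, len(data), chunk_size):
        buf = carry + data[i:i + chunk_size]
        count += buf.count(PATTERN)
        carry = buf[max(0, len(buf) - (len(PATTERN) - 1)):]
    return count
```

With a chunk size smaller than the pattern, the naive version finds nothing at all; more realistic chunk sizes just lose the occasional boundary-straddling match.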
Re: the large file challenge
Sadhunathan Nadesan wrote:
| Ok, here are the results so far,
|
| bash    1:44
| pascal  2:04
| Andu    2:23
| Pierre  2:11
| Bruce   4:38
|
| Now, why did we get different counts? I believe the count of 17333 is
| correct. Maybe someone can debug that. Here's the code
|
| [snip]
|
| Pierre --
|
| #!/usr/local/bin/mc
| on startup
|   put 0 into the_counter
|   put 1 into the_offset
|   put 333491183 into file_size
|   put 3 into the_increment
|   put "/gig/tmp/log/access_log" into the_file
|   put "mystic_mouse" into pattern
|   open file the_file for read
|   repeat until (the_offset = file_size)
|     read from file the_file at the_offset for the_increment
|     repeat until lineoffset("mystic_mouse", it) = 0
|       if (lineoffset("mystic_mouse", it) is not 0) then
|         add 1 to the_counter
|         delete line 1 to lineoffset("mystic_mouse", it) of it
|       end if
|     end repeat
|     # put it into the_text
|     # repeat until lineoffset("mystic_mouse", the_text) = 0
|     #   if (lineoffset("mystic_mouse", the_text) is not 0) then
|     #     add 1 to the_counter
|     #     delete line 1 to lineoffset("mystic_mouse", the_text) of the_text
|     #   end if
|     # end repeat
|     add the_increment to the_offset
|   end repeat
|   put the_counter
| end startup
|
| [snip]

Hi Friends,

Does the "it" improvement proposal above work as expected? Perhaps yes, perhaps not ;)

--
Regards,
Pierre Sahores
Inspection académique de Seine-Saint-Denis
WEB and VPN applications and databases
Qualifying and producing the competitive advantage
Re: the large file challenge
--On Sunday, November 10, 2002 13:21:04 -0800 Sadhunathan Nadesan [EMAIL PROTECTED] wrote:

Here's another try, for whatever it's worth. I tested it on a file of about 800k with 7000 lines, and it takes less than a sec:

on startup
  put 0 into tCount
  put "mystic_mouse" into tWord
  put empty into line 3000 of tChunk
  put "/gig/tmp/log/access_log" into tFile
  open file tFile for read
  put 0 into fOffset
  repeat
    read from file tFile at fOffset + 1 for 3000 lines # can play with that number for best results
    put it into tChunk
    put 0 into tSkip
    repeat
      get offset(tWord, tChunk, tSkip)
      if it is not 0 then
        add 1 to tCount
        add it + length(tWord) to tSkip
      else
        put 0 into tSkip
        exit repeat
      end if
    end repeat
    add length(tChunk) to fOffset
    if the num of lines of tChunk < 3000 then exit repeat
  end repeat
  put tCount
end startup

Regards, Andu Novac
Re: the large file challenge
On Sun, 10 Nov 2002 Richard Gaskin [EMAIL PROTECTED] wrote:
| My hunch is that reading for lines is slower than reading a specified
| number of chars, since with lines it needs to evaluate each incoming
| character to determine if it's a return -- Scott, am I right or should
| they be about the same?

You're right, though I wouldn't think it would make *that* much difference.

As for my guess as to the fastest way to do this, it'd probably be a hybrid approach, using both "read for x" and "repeat for each line". You'd start by opening the file for binary read (faster than other modes). Then read for X characters, where X would be some large number experimentally determined for each system (it'd probably be some large percentage of the free RAM, and so probably on the order of a few MB), and then use "repeat for each line l in it". The trick is that the last line will be incomplete in this case, so for the second and subsequent reads you subtract the length of the last line from X and do "read for X at Y", where Y is a running total of what's been read, after subtracting the partial lines of course. Some extra bookkeeping will be required in this case (e.g., if the tag you're looking for is in the partial last line, you need to subtract 1 from the count so you don't count it twice). Exactly how to do this part most efficiently is left as an exercise for the reader ;-)

Regards, Scott

Scott Raney [EMAIL PROTECTED] http://www.metacard.com
MetaCard: You know, there's an easier way to do that...
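[Editor's aside: Scott's hybrid approach can be sketched in Python, with one simplification: instead of re-reading the partial last line (and subtracting from the count as he describes), this version just carries the partial line forward into the next chunk, which sidesteps the double-count bookkeeping. Path and chunk size are placeholders; a text log with newline-terminated lines is assumed.]

```python
def count_hybrid(path, pattern, chunk_bytes=1 << 20):
    """Read large byte chunks, but hold back the trailing partial line
    and prepend it to the next chunk, so no line (and no match) is ever
    split across a read boundary."""
    count = 0
    carry = ""
    with open(path) as f:
        while True:
            data = f.read(chunk_bytes)
            if not data:
                count += carry.count(pattern)  # check the final partial line
                break
            buf = carry + data
            cut = buf.rfind("\n") + 1          # end of the last complete line
            count += buf[:cut].count(pattern)
            carry = buf[cut:]                  # partial line, carried forward
    return count
```

Large chunks keep the per-read overhead low, as Scott suggests, while the carry keeps the count exact regardless of where the chunk boundaries fall.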
Re: the large file challenge
All right... I tweaked a little more outside of email.

For accuracy in the case where "mystic_mouse" occurs multiple times on one line, uncomment the line "add offset(return, thisChunk, theOffset) to theOffset". This just skips to the next line whenever a match is found.

This should run faster than my previous attempts:

on startup
  ## initialize variables: try adjusting chunkSize
  put "/gig/tmp/log/access_log" into the_file
  put ($1 * 1024 * 1024) into chunkSize ## this is for MB
  put 0 into counter
  put FALSE into isEOF
  open file the_file
  repeat until (isEOF = TRUE)
    ## read the specified number of bytes, check if we are at the end of the file
    read from file the_file for chunkSize
    put it into thisChunk
    put (the result = "eof") into isEOF
    ## count the number of matches in this chunk
    put offset("mystic_mouse", thisChunk) into theOffset
    repeat while (theOffset > 0) ## guard so a chunk with no match adds nothing
      add 1 to counter
      get offset("mystic_mouse", thisChunk, theOffset)
      if (it = 0) then exit repeat
      put theOffset + it + 12 into theOffset
      ## add offset(return, thisChunk, theOffset) to theOffset
    end repeat
  end repeat
  close file the_file
  put counter
end startup

HTH.
Brian
Re: the large file challenge
Sadhunathan Nadesan a écrit : | Try something alike : | | on mouseup | put 1 into startread | open file thefile for read | read from file thefile until eof | put the num of lines of it in endtoread | close file thefile | repeat while startread endtoread | open file thefile for read | read from file thefile at startread for 99 lines | ... | do what you need with it | ... | close file thefile | add 100 to startread | end repeat | end mouseup Alors, Pierre, Many thanks. This turned out to be more efficient than I thought. I had to modify it slightly because the 'read from file at' command takes an offset in characters, not lines. (Code below). Anyway, on those 3 sample programs, here are the times on the last run, not my full access log, but a chopped (50,000 lines) snippet. Bash shell script (interpreted) 24 seconds Pascal (compiled) 7 seconds Metacard (interpreted) 2 minutes 50 seconds So, any takers on the speed challenge? Here's the code I used. #!/usr/local/bin/mc on startup put /gig/tmp/log/xaa into the_file put 1 into start_read put 0 into the_counter put 1 into the_offset open file the_file for read read from file the_file until eof put the num of lines of it into end_read close file the_file repeat while (start_read end_read) open file the_file for read read from file the_file at the_offset for 99 lines put it into the_text put the number of chars of it + the_offset into the_offset repeat until lineoffset(mystic_mouse,the_text) = 0 if lineoffset(mystic_mouse,the_text) is not 0 then put the_counter + 1 into the_counter delete line 1 to lineoffset(mystic_mouse,the_text) of the_text end if end repeat # repeat for each line this_line in the_text # if (not eof) then # if (this_line contains mystic_mouse) then # put the_counter + 1 into the_counter # end if # end if # end repeat close file the_file add 100 to start_read end repeat put the_counter end startup Now, I feel sure we could improve this, fix my errors, etc anyone? 
Sadhu

___
metacard mailing list
[EMAIL PROTECTED]
http://lists.runrev.com/mailman/listinfo/metacard

Allo Sadhu,

Perhaps there is a way to speed up your script by using the lineoffset
statement, as in the proposal above ;)

--
Cordialement, Pierre Sahores
Inspection académique de Seine-Saint-Denis.
Applications et bases de données WEB et VPN
Qualifier et produire l'avantage compétitif
Re: the large file challenge
Wow,

Just logged on to work and saw all the great responses. Thanks all, what
fun. Anyway, I will respond to each later and try your code too. I have to
run right now, appointment.

I did however have some code from Andu via Swami that I modified somewhat
and got enormous speed improvement. Here's the latest run (ran this several
times so the file would be in cache equally for all programs):

  bash      Sat Nov 9 16:48:12 PST 2002   17333   Sat Nov 9 16:50:05 PST 2002
  pascal    Sat Nov 9 16:50:05 PST 2002   17333   Sat Nov 9 16:52:09 PST 2002
  metacard  Sat Nov 9 16:52:09 PST 2002   17338   Sat Nov 9 16:54:28 PST 2002

So that is 1:53 for bash, 2:04 for pascal, and 2:19 for MC. Darn good!

Here's the code, gotta go...

#!/usr/local/bin/mc
on startup
  put 0 into the_counter
  put 1 into the_offset
  put 333491183 into file_size
  put 3 into the_increment
  put "/gig/tmp/log/access_log" into the_file
  open file the_file for read
  repeat until (the_offset >= file_size)
    read from file the_file at the_offset for the_increment
    put it into the_text
    repeat for each line this_line in the_text
      get offset("mystic_mouse", this_line)
      if (it is not 0) then add 1 to the_counter
    end repeat
    add the_increment to the_offset
  end repeat
  put the_counter
end startup
Re: the large file challenge
Sadhunathan Nadesan wrote:

 So that is 1:53 for bash, 2:04 for pascal, and 2:19 for MC. darn good!

But golly, I thought an interpreted language like MetaTalk was supposed to
be slow, certainly much slower than compiled Pascal. :)

--
Richard Gaskin
Fourth World Media Corporation
Developer of WebMerge 2.0: Publish any database on any site
[EMAIL PROTECTED]       http://www.FourthWorld.com
Tel: 323-225-3717                        AIM: FourthWorldInc
Re: the large file challenge
Om Sadhunathan:

Excellent! i had been thinking that we should probably save access logs
from our servers in honolulu, but then parsing those was a blind spot. This
will help immensely.

Now, do i read this to say that there were 17,338 attempts to look at
Mystic Mouse PDF's? and if so, over what period of time? A small addition
to the script and we could determine if the download was completed. (204, i
think... or may be 304?)

On Saturday, November 9, 2002, at 03:17 PM, Sadhunathan Nadesan wrote:

 metacard  Sat Nov 9 16:52:09 PST 2002   17338   Sat Nov 9 16:54:28 PST 2002

 So that is 1:53 for bash, 2:04 for pascal, and 2:19 for MC. darn good!

 Here's the code, gotta go...

 #!/usr/local/bin/mc
 on startup
   put 0 into the_counter
   put 1 into the_offset
   put 333491183 into file_size
   put 3 into the_increment
   put "/gig/tmp/log/access_log" into the_file
   open file the_file for read
   repeat until (the_offset >= file_size)
     read from file the_file at the_offset for the_increment
     put it into the_text
     repeat for each line this_line in the_text
       get offset("mystic_mouse", this_line)
       if (it is not 0) then add 1 to the_counter
     end repeat
     add the_increment to the_offset
   end repeat
   put the_counter
 end startup
Re: the large file challenge
Sannyasin Sivakatirswami wrote:

 Om Sadhunathan:
 Excellent! i had been thinking that we should probably save access logs
 from our servers in honolulu, but then parsing those was a blind spot.
 This will help immensely. Now, do i read this to say that there were
 17,338 attempts to look at Mystic Mouse PDF's? and if so, over what
 period of time? A small addition to the script and we could determine if
 the download was completed. (204, i think... or may be 304?)

 [quoted timings and script snipped]

Aloha Friends,

So! MC almost as fast as Pascal! Isn't it great? And thanks again to
Scott, for that too!

Just one more question: could you tell us, Scott, when MC will become
faster than C, or is it a secret? ;-)

--
Cordialement, Pierre Sahores
Inspection académique de Seine-Saint-Denis.
Applications et bases de données WEB et VPN
Qualifier et produire l'avantage compétitif
Re: the large file challenge
Sadhunathan Nadesan wrote:

 #!/usr/local/bin/mc
 on startup
   put "/gig/tmp/log/xaa" into the_file
   put 1 into start_read
   put 0 into the_counter
   put 1 into the_offset
   open file the_file for read
   read from file the_file until eof
   put the num of lines of it into end_read
   close file the_file
   repeat while (start_read <= end_read)
     open file the_file for read
     read from file the_file at the_offset for 99 lines
     put it into the_text
     put the number of chars of it + the_offset into the_offset
     repeat for each line this_line in the_text
       if (not eof) then
         if (this_line contains "mystic_mouse") then
           put the_counter + 1 into the_counter
         end if
       end if
     end repeat
     close file the_file
     add 100 to start_read
   end repeat
   put the_counter
 end startup

 Now, I feel sure we could improve this, fix my errors, etc. Anyone?

I'm confused: if the point is to avoid reading the entire file into memory,
isn't that what line 8 does? And if it's already in memory, why is it read
again inside the loop? I think I missed something from the original post.

--
Richard Gaskin
Fourth World Media Corporation
Re: the large file challenge
--On Friday, November 08, 2002 18:24:56 -0800 Richard Gaskin
[EMAIL PROTECTED] wrote:

 Sadhunathan Nadesan wrote:
 [quoted script snipped]
 Now, I feel sure we could improve this, fix my errors, etc. Anyone?

 I'm confused: if the point is to avoid reading the entire file into
 memory, isn't that what line 8 does? And if it's already in memory, why
 is it read again inside the loop? I think I missed something from the
 original post.

No, you got it right.

Regards, Andu Novac
Re: the large file challenge
andu wrote:

 I think I missed something from the original post

 No, you got it right.

Thanks, Andu. I thought I was losin' it.

If we're allowed to read the whole thing into RAM and the goal is to count
the occurrences of the string "mystic_mouse", then to optimize speed we can
just remove the redundant read commands and use offset to search for us:

#!/usr/local/bin/mc
on startup
  put "/gig/tmp/log/xaa" into the_file
  put url ("file:" & the_file) into the_text
  put 0 into the_counter
  put 1 into tPointer
  repeat -- for each line this_line in the_text
    get offset("mystic_mouse", the_text, tPointer)
    if it = 0 then exit repeat
    add 1 to the_counter
    add it to tPointer
  end repeat
  put the_counter
end startup

This is off the top of my head. If it runs I'd be interested in how it
compares.

--
Richard Gaskin
Fourth World Media Corporation
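The offset-with-skip loop above maps directly onto str.find in Python. A small sketch (illustrative names, not from the thread) of the same pointer-hopping search over an already-loaded string:

```python
def count_occurrences(text, needle):
    # Hop from match to match, resuming the search just past each hit;
    # str.find plays the role of offset() with a chars-to-skip argument.
    count = 0
    pos = text.find(needle)
    while pos != -1:
        count += 1
        pos = text.find(needle, pos + len(needle))
    return count
```

This is equivalent to text.count(needle), which also counts non-overlapping occurrences; the explicit loop just mirrors the MetaTalk structure.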
Re: the large file challenge
--On Friday, November 08, 2002 19:15:59 -0800 Richard Gaskin
[EMAIL PROTECTED] wrote:

 [quoted script snipped]

 This is off the top of my head. If it runs I'd be interested in how it
 compares.

Here's my take, considering "mystic_mouse" can occur only once on a line
and loading 300MB into RAM is not an issue:

on startup
  put url ("file:/gig/tmp/log/xaa") into the_text
  put 0 into the_counter
  repeat for each line this_line in the_text
    get offset("mystic_mouse", this_line)
    if it is not 0 then add 1 to the_counter
  end repeat
  put the_counter
end startup

...and I'm sure it could be improved.

Regards, Andu Novac
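The line-at-a-time pass above has an even terser Python analogue (a sketch, names mine). It accepts any iterable of lines, so handing it an open file object streams the log lazily instead of pulling all 300MB into RAM first:

```python
def count_lines_with(lines, needle):
    # One count per line containing needle, matching the assumption
    # that "mystic_mouse" occurs at most once per line.
    return sum(1 for line in lines if needle in line)

# e.g. with a file object (the path is illustrative):
#   with open("/gig/tmp/log/xaa") as f:
#       print(count_lines_with(f, "mystic_mouse"))
```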
Re: the large file challenge
I'm pretty sure the problem with speed here is from reading in the entire
file. Unless of course you have enough free RAM, but that's hard to imagine
when the files are 300MB+.

How about this, which you can adjust to read any given number of lines at a
time. Try it with 10, 1000, etc. and see what gives you the best
performance! Hasn't been tested, but hopefully it'll run with a tweak or
less.

#!/usr/local/bin/mc
on startup
  ## initialize variables: try adjusting numLines
  put "/gig/tmp/log/xaa" into the_file
  put 1000 into numLines
  put 0 into counter
  put FALSE into isEOF
  open file the_file
  repeat until (isEOF = TRUE)
    ## read the specified number of lines, check if we are at the end of the file
    read from file the_file for numLines lines
    put it into thisChunk
    put (the result = "eof") into isEOF
    ## count the number of matches in this chunk
    put offset("mystic_mouse", thisChunk) into theOffset
    repeat until (theOffset = 0)
      add 1 to counter
      put offset("mystic_mouse", thisChunk, theOffset) into tempOffset
      if (tempOffset > 0) then
        add tempOffset to theOffset
      else
        put 0 into theOffset
      end if
    end repeat
  end repeat
  close file the_file
  put counter
end startup

HTH,
Brian
Re: the large file challenge
One last note:

Be careful of using

  read from file xxx for yyy

If you do not read for "lines", you run the risk of cutting a line in half
right on the spot where your magic string occurs. So always use

  read from file xxx for yyy lines

HTH.
Brian
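Brian's warning is easy to demonstrate. A tiny Python sketch (hypothetical helper, not from the thread) of the naive fixed-size read, with no overlap between chunks, shows a match vanishing when a chunk boundary lands inside it:

```python
def count_fixed_chunks(text, needle, chunk_size):
    # Naive chunking: count within each fixed-size slice independently.
    # A match straddling a slice boundary is seen by neither slice.
    return sum(text[i:i + chunk_size].count(needle)
               for i in range(0, len(text), chunk_size))

# With "aa mystic_mouse bb" and chunk_size 8 the slices are
# "aa mysti" / "c_mouse " / "bb": the match is counted zero times,
# while a chunk_size of 15 or more finds it once.
```

Reading whole lines (or carrying an overlap between chunks) avoids this undercount.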