>I have a script that takes a list of URL's into an array
>and gets the page, strips the anchors to an array, cleans
>it up and matches the contents of each page against a
>search variable, printing the hits to a web page.
>
>This works well with a smaller list of up to 12 or so URL's.
>After that it slows down way out of proportion to the number
>of URL's in the list.

you're probably iterating over something proportional to the size of the
list for each item in the list.   if the size of the list grows by a factor
of N, the processing time grows by a factor of N^2.   the entire discipline
of algorithmic analysis is devoted to finding out just how fast a given
algorithm can be expected to run for any amount of input.
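as a sketch of the usual fix: if some inner step scans the whole list
again for every item, replacing that scan with a hash lookup turns the
quadratic work into a single linear pass.   (the URLs below are made up
for illustration.)

```perl
use strict;
use warnings;

my @urls = ("http://a/", "http://b/", "http://a/", "http://c/");

# quadratic: for each item, scan the entire list again
my @dups_slow;
for my $i (0 .. $#urls) {
    for my $j (0 .. $#urls) {
        next if $i == $j;
        push @dups_slow, $urls[$i] if $urls[$i] eq $urls[$j];
    }
}

# linear: one pass, with a hash remembering what we've already seen
my %seen;
my @dups_fast;
for my $url (@urls) {
    push @dups_fast, $url if $seen{$url}++;
}
```

both loops find the duplicate, but the hash version touches each item
exactly once, no matter how long the list gets.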



> It also spits it back in 8K chunks. Any ideas?

this is probably just output buffering by the OS.   a standard filehandle
in perl is represented in the OS as an N-byte buffer.   data sits in that
buffer until something triggers a flush.. the two most common triggers
being that the filehandle has been closed, or that it's filled up and needs
to be emptied so it can hold more data.

to unbuffer a filehandle, you need to select() that filehandle, and set the
internal variable $| to one:

    $temp = select (FILEHANDLE);  $| = 1;  select ($temp);

where the temporary variable is used to restore whatever filehandle was
selected before you did the unbuffering.
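the same dance can be squeezed into one statement, and IO::Handle (which
ships with perl) gives you a method that does it for you.   a small
runnable sketch, with a made-up filename:

```perl
use strict;
use warnings;
use IO::Handle;

open (my $log, "> some_output.txt") or die "can't open: $!";

# the select dance from above, compressed into one statement:
# select() returns the previously selected handle, so the inner
# select/assignment happens and the outer select restores it
select ((select ($log), $| = 1)[0]);

# or, with IO::Handle loaded, let the method do the dance for you
$log->autoflush (1);

print $log "this line reaches the file immediately\n";
close ($log);
```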



>How do I run multiple sockets at the same time? I don't know
>how to leave one and start another. The Programming Perl book
>does not say much about it. I'd like to use one to go through
>the primary array creating secondary arrays and spawning a socket
>for each secondary array. Is this possible?

you do it by using a different filehandle for each socket, just like for
files.   since you're pulling URLs out of an array, and don't know ahead of
time how many items the array will hold, you'll probably want to use
indirect filehandles.

as a quick refresher, an indirect filehandle is simply a variable that
holds the name you want to use as the filehandle identifier..

the absolute filehandle:

    open (FILE, "some_file.txt");
    $line = <FILE>;
    close FILE;

and the indirect filehandle:

    $FH = "FILE";
    open ($FH, "some_file.txt");
    $line = <$FH>;
    close $FH;

are functionally identical.
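one aside, for newer perls: from 5.6 on, open() will fill in a plain
lexical scalar for you, so you never have to invent a name at all, and
it stays friendly with `use strict`.   a self-contained sketch (it writes
its own scratch file so it can run anywhere):

```perl
use strict;
use warnings;

# write a scratch file so the example is self-contained
open (my $out, "> scratch.txt") or die "can't write: $!";
print $out "hello\n";
close ($out);

# a lexical filehandle: no name to invent, and it closes itself
# when the variable goes out of scope
open (my $in, "scratch.txt") or die "can't read: $!";
my $line = <$in>;
close ($in);
unlink "scratch.txt";
```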


to open multiple filehandles, all you have to do is keep changing the
contents of the variable every time you open a new filehandle:


    for $i (0..9) {
        $FH = sprintf ("FILE_%02d", $i);
        open ($FH, $FILE_LIST[$i]) or warn "can't open $FILE_LIST[$i]: $!";
    }

will open ten separate and distinct filehandles named FILE_00, FILE_01, ...
FILE_09.   sockets are bound to filehandles for i/o, so the same trick can
be used to open multiple sockets.
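a sketch of that with sockets, using IO::Socket::INET (in the standard
distribution since 5.004): each call to new() hands back a fresh,
anonymous filehandle, so pushing them onto an array does all the
bookkeeping for you.   the host list below is made up.

```perl
use strict;
use warnings;
use IO::Socket::INET;

my @hosts = ("www.example.com", "www.example.org");   # made-up list
my @sockets;

for my $host (@hosts) {
    my $sock = IO::Socket::INET->new (
        PeerAddr => $host,
        PeerPort => 80,
        Timeout  => 5,
    ) or next;                    # skip hosts we can't reach
    push @sockets, $sock;         # each element is its own filehandle
}

# every handle in @sockets can be read and written independently
for my $sock (@sockets) {
    print $sock "GET / HTTP/1.0\r\n\r\n";
}
```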

one caveat:

there's usually a limit to the number of filehandles the OS will let a
process have open at any given time.. 32 is a common limit in unix.
sockets are bound to file descriptors too, so they count against the same
limit.   if you want to do things the official way and spawn a subprocess
for each new socket, you'll need to check that the socket could actually
be bound to the filehandle, and will probably want a delay loop that
retries when no filehandles are available at the moment:

    while ( ! open ($FH, "some_file.txt")) {
        sleep 1;
    }

substituting your socket-opening code for the open.   you can also keep a
counter inside the loop so it gives up after a certain number of tries,
which keeps you from spawning subprocesses that loop forever.. but then
you have to start worrying about fail-case handling, and things get
slightly more complex.
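the counter version looks something like this.   try_connect() here is a
made-up stand-in for whatever open or socket call might fail when
descriptors run out; the stub at the bottom just lets the sketch run.

```perl
use strict;
use warnings;

my $tries     = 0;
my $max_tries = 5;
my $ok        = 0;

until ($ok = try_connect ()) {
    last if ++$tries >= $max_tries;   # give up instead of looping forever
    sleep 1;                          # wait for a descriptor to free up
}

warn "gave up after $max_tries tries\n" unless $ok;

sub try_connect { return 1 }          # stub: pretend the connect succeeded
```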






mike stone  <[EMAIL PROTECTED]>



