> Excellent, thanks. I've built it, and installed a patched .deb on my
> debian machine. Works nice, especially for my case, only thing is I'd
> have preferred to keep the original files.

If you can use perl for the filtering, try something like 'perl -i.orig' -
this command-line switch instructs perl to save the original files with a
.orig extension.

> My original idea was that the
> output filter would be expected to receive a file on stdin and output

Yes, traditional filtering would be a better choice - but the 'system' call
was easier to implement :)

> some lines with new stuff to download on stdout. I hope you can get
> something similar included in the official distribution anyways.

As for future development, I see three filtering possibilities:

1) Output filter (close to my version) - after fetching, the page is passed
through a filtering process (via stdin/stdout).

2) Url parser/extractor - each page source is passed to an external filter,
which returns the urls to be fetched (one per line).

3) Url filter - after url extraction (internal or external), the url list is
passed to a filtering process (one url per line), which can modify, shrink
or extend the list.
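To make option 2 concrete, here is a hedged sketch of a url extractor:
plain-text page source on stdin, urls on stdout, one per line (the sample
text and the whitespace-delimited url assumption are mine):

```shell
# Extract whitespace-delimited http urls from plain text,
# printing one match per line (grep -o).
echo 'see http://example.com/a and http://example.com/b' \
  | grep -oE 'http://[^ ]+'
```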

The first filtering ability could be used for custom modification of pages
(e.g. converting html to text), the second for unusual url extraction (e.g.
extracting urls from plain text), and the third would be the most convenient
way to use traditional tools (like grep or sed) on urls.
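For the third case, a url filter could be as simple as a grep/sed pipeline:
url list on stdin, possibly modified list on stdout. A sketch with
illustrative urls (dropping images, rewriting http to https):

```shell
# Url filter: one url per line in, filtered/rewritten list out.
printf 'http://example.com/page.html\nhttp://example.com/logo.gif\n' \
  | grep -v '\.gif$' \
  | sed 's|^http://|https://|'
```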


-- 
Sergey Martynoff
