1) I am trying to query Google and recursively get the first ten
results (without images and other inline content) with this line:

wget --output-document=search.html --recursive --level=1 \
--span-hosts --exclude-domains="google.com" --convert-links \
-A "*.html,*.htm" "http://www.google.com/search?q=my+query"

Two problems arise:

- When wget tries to spider the search page, it does not find it
because it has been saved as search.html, while it is looking for a
file named search?q=my+query.  Is this a bug or a feature?  In any
case, it is easy to work around: drop --output-document and rename
search?q=my+query to search.html only after wget has finished.

- *.php*, *.asp*, *.shtml won't be fetched.  Yes, you could just add
these suffixes after -A, but still there is danger of losing something
(like *.pl?*, *.rhtml and who knows how many others; new extensions
seem to keep appearing).  Is there a way to tell
wget to get just ``pages'', i.e. everything except inline content?
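
To make the rename workaround from the first point concrete, here is
a rough sketch.  The function names fetch_results and
rename_start_page are made up for illustration, and I am assuming
that wget leaves the start page on disk under the name
search?q=my+query inside a host-named directory when it exits; I have
not checked this against every wget version.

```shell
# Run the recursive fetch WITHOUT --output-document, so wget keeps
# the file name it expects to find when it spiders the start page.
# (fetch_results is a hypothetical helper name.)
fetch_results() {
    query=$1
    wget --recursive --level=1 --span-hosts \
         --exclude-domains="google.com" --convert-links \
         -A "*.html,*.htm" "http://www.google.com/search?q=$query"
}

# Rename the start page only after wget has finished.
# dir is the directory wget saved it in (assumed: www.google.com).
rename_start_page() {
    dir=$1
    query=$2
    mv -- "$dir/search?q=$query" "$dir/search.html"
}
```

Usage would be something like: fetch_results "my+query" followed by
rename_start_page www.google.com "my+query".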

2) The same, but with deja (aka groups.google.com):

wget --recursive --level=1 --convert-links -A "*rnum*" \
"http://groups.google.com/search?q=my+query"

The only recurring pattern I have spotted in the links to messages on
the groups.google results page is "rnum".  However, the filter does
not seem to work.  What I *think* I am telling wget here is:
recursively get only those files linked from
groups.google.com/search?q=my+query that have "rnum" somewhere in
their name.  What am I missing?
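
To show what I expect "*rnum*" to mean, here is a small sketch using
ordinary shell globbing.  It only approximates wget's -A matching,
the link name below is made up (I am not quoting a real
groups.google link), and I do not know which part of the URL wget
actually compares the pattern against.

```shell
# Returns 0 if NAME matches the glob PATTERN, as a shell `case` would.
# This is an approximation of -A pattern matching, not wget's own code.
matches() {
    case $1 in
        $2) return 0 ;;
        *)  return 1 ;;
    esac
}

# Hypothetical link name, for illustration only.
name='groups?selm=abc&rnum=1'

# Against the whole name, the pattern matches.
matches "$name" "*rnum*" && echo "whole name matches"

# Against only the part before the '?', it does not.
matches "${name%%\?*}" "*rnum*" || echo "part before ? does not match"
```

If wget looked only at the part of the name before the "?", the
pattern would never match, which would explain what I am seeing; but
that is a guess.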


Thanks to anyone who's got a clue and is willing to share. :-)

Massimiliano




