http://www.google.se/robots.txt
google disallows it.
User-agent: *
Allow: /searchhistory/
Disallow: /search
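The effect of those two rules can be checked with Python's stdlib robots.txt parser (this is only an illustration of the rules, not how Nutch itself evaluates robots.txt):

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Allow: /searchhistory/
Disallow: /search
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# The search results page is disallowed for every agent...
print(rp.can_fetch("nutch", "http://www.google.se/search?q=site:se"))  # False
# ...while /searchhistory/ is explicitly allowed.
print(rp.can_fetch("nutch", "http://www.google.se/searchhistory/"))    # True
```

Changing the agent string makes no difference here, since the rules apply to `User-agent: *`.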
Larsson85 wrote:
Why isn't Nutch able to handle links from Google?
I tried to start a crawl from the following URL:
http://www.google.se/search?q=site:se&hl=sv&start=100&sa=N
it seems that google is blocking the user agent.
i get this reply with lwp-request:

Your client does not have permission to get URL
/search?q=site:se&hl=sv&start=100&sa=N from
this server. (Client IP address: XX.XX.XX.XX)

Please see Google's Terms of Service posted at ...
--
View this message in context:
http://www.nabble.com/Why-cant-I-inject-a-google-link-to-the-database--tp24533162p24533426.html
Sent from the Nutch - User mailing list archive at Nabble.com.
you can check the response of google by dumping the segment:

bin/nutch readseg -dump crawl/segments/... somedirectory
reinhard schwab wrote:
it seems that google is blocking the user agent
i get this reply with lwp-request
Your client does not have permission to get URL
--
Doğacan
this is because I had a thought that maybe I
could use google to generate my start list of urls by injecting pages of
search results.
Why won't this page be parsed and links extracted so the crawl can start?
identify nutch as a popular user agent such as firefox.
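If you do want to change the agent string, that is a Nutch configuration setting in conf/nutch-site.xml; the value below is only an example, and, as noted elsewhere in the thread, it does not get around robots.txt:

```xml
<property>
  <name>http.agent.name</name>
  <value>Mozilla/5.0 (compatible; Firefox)</value>
</property>
```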
Larsson85 wrote:
Any workaround for this? Making nutch identify as something else or something
similar?
reinhard schwab wrote:
http://www.google.se/robots.txt
google disallows it.
User-agent: *
Allow: /searchhistory/
Disallow: /search
This isn't a user agent problem. No matter what user agent you use,
Nutch is still not going to crawl this page, because Nutch is correctly
following robots.txt directives which block access. To change this
would be to make the crawler impolite. A well-behaved crawler should
follow the robots.txt rules.
--
View this message in context:
http://www.nabble.com/Why-cant-I-inject-a-google-link-to-the-database--tp24533162p24534522.html
Sent from the Nutch - User mailing list archive at Nabble.com.
you are right.
robots.txt clearly disallows this page.
this page will not be fetched.
i remember google has some APIs to access the search.
http://code.google.com/intl/de-DE/apis/soapsearch/index.html
http://code.google.com/intl/de-DE/apis/ajaxsearch/
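Either API returns structured results, which avoids scraping entirely. A sketch of turning such a response into a Nutch seed list, assuming the AJAX Search API's responseData.results[].url shape; the payload below is a made-up placeholder, not real data:

```python
import json

# Placeholder response shaped like the AJAX Search API; the URLs are
# invented for illustration, not real search results.
payload = """
{"responseData": {"results": [
    {"url": "http://example.se/page1"},
    {"url": "http://example.se/page2"}
]}}
"""

results = json.loads(payload)["responseData"]["results"]
seeds = [r["url"] for r in results]

# One URL per line is the format bin/nutch inject expects in the urls dir.
print("\n".join(seeds))
```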
reinhard
Dennis Kubes wrote:
This isn't a user agent problem.
Brian Ulicny wrote:
1. Save the results page.
2. Grep the links out of it.
3. Put the results in a doc in your urls directory
4. Do: bin/nutch crawl urls
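Steps 1-3 can be scripted with the stdlib alone. A minimal sketch of step 2; the regex is deliberately crude, and the file names in the comment are just examples:

```python
import re

def extract_links(html: str) -> list:
    """Grep absolute http(s) links out of a saved results page (step 2)."""
    return re.findall(r'href="(https?://[^"]+)"', html)

# Steps 1 and 3, by hand: save the results page (e.g. as results.html),
# then write one URL per line into the urls directory, e.g.:
#   open("urls/seed.txt", "w").write(
#       "\n".join(extract_links(open("results.html").read())))

print(extract_links('<a href="http://example.se/a">a</a> <a href="/rel">r</a>'))
# prints ['http://example.se/a']  -- relative links are skipped
```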
Please note, we are not saying it is impossible to do this with Nutch
(e.g. by setting the agent string to mimic a browser), but we