Re: Why cant I inject a google link to the database?

2009-07-17 Thread reinhard schwab
Google disallows it in http://www.google.se/robots.txt: User-agent: * Allow: /searchhistory/ Disallow: /search Larsson85 wrote: Why isn't Nutch able to handle links from Google? I tried to start a crawl from the following URL: http://www.google.se/search?q=site:se&hl=sv&start=100&sa=N And all

Re: Why cant I inject a google link to the database?

2009-07-17 Thread reinhard schwab
It seems that Google is blocking the user agent. I get this reply with lwp-request: Your client does not have permission to get URL /search?q=site:se&hl=sv&start=100&sa=N from this server. (Client IP address: XX.XX.XX.XX) Please see Google's Terms of Service posted at

Re: Why cant I inject a google link to the database?

2009-07-17 Thread Larsson85
pages of search results. Why won't this page be parsed and links extracted so the crawl can start? -- View this message in context: http://www.nabble.com/Why-cant-I-inject-a-google-link-to-the-database--tp24533162p24533426.html Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Why cant I inject a google link to the database?

2009-07-17 Thread reinhard schwab
You can check Google's response by dumping the segment: bin/nutch readseg -dump crawl/segments/... somedirectory reinhard schwab wrote: It seems that Google is blocking the user agent. I get this reply with lwp-request: Your client does not have permission to get URL

Re: Why cant I inject a google link to the database?

2009-07-17 Thread Doğacan Güney
result. Why won't this page be parsed and links extracted so the crawl can start? -- Doğacan

Re: Why cant I inject a google link to the database?

2009-07-17 Thread Doğacan Güney
this is because I had a thought that maybe I could use Google to generate my start list of URLs by injecting pages of search results. Why won't this page be parsed and links extracted so the crawl can start?

Re: Why cant I inject a google link to the database?

2009-07-17 Thread reinhard schwab
identify Nutch as a popular user agent such as Firefox. Larsson85 wrote: Any workaround for this? Making Nutch identify as something else or something similar? reinhard schwab wrote: http://www.google.se/robots.txt Google disallows it. User-agent: * Allow: /searchhistory/

Re: Why cant I inject a google link to the database?

2009-07-17 Thread Dennis Kubes
This isn't a user agent problem. No matter what user agent you use, Nutch is still not going to crawl this page because Nutch is correctly following robots.txt directives which block access. To change this would be to make the crawler impolite. A well-behaved crawler should follow the
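To illustrate the point that the block comes from robots.txt and not from the user agent string, here is a minimal sketch using Python's standard urllib.robotparser. It is not Nutch's own robots.txt handling, only a stand-in that applies the same three directives quoted earlier in the thread. The names ROBOTS_LINES, SEARCH_URL and is_allowed are made up for this example, and the search URL has its "&" separators restored, which is an assumption about what the original poster typed.

```python
from urllib.robotparser import RobotFileParser

# Only the directives quoted in this thread from http://www.google.se/robots.txt.
# The real file is longer and may have changed since 2009.
ROBOTS_LINES = [
    "User-agent: *",
    "Allow: /searchhistory/",
    "Disallow: /search",
]

# Assumed form of the URL from the original post, with "&" restored.
SEARCH_URL = "http://www.google.se/search?q=site:se&hl=sv&start=100&sa=N"


def is_allowed(user_agent: str, url: str) -> bool:
    """Return True if the quoted rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_LINES)
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # The rules are declared for "User-agent: *", so every agent gets the same answer.
    for agent in ("Nutch", "Mozilla/5.0 (Firefox)"):
        print(agent, "allowed:", is_allowed(agent, SEARCH_URL))
```

Because the rules apply to every agent, changing the agent name does not change what robots.txt says about /search. A crawler that fetched the page anyway would be ignoring the rules, not satisfying them.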

Re: Why cant I inject a google link to the database?

2009-07-17 Thread Larsson85
could use Google to generate my start list of URLs by injecting pages of search results. Why won't this page be parsed and links extracted so the crawl can start?

Re: Why cant I inject a google link to the database?

2009-07-17 Thread Jake Jacobson
? -- View this message in context: http://www.nabble.com/Why-cant-I-inject-a-google-link-to-the-database--tp24533162p24534522.html Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Why cant I inject a google link to the database?

2009-07-17 Thread reinhard schwab
You are right. robots.txt clearly disallows this page, so it will not be fetched. I remember Google has some APIs to access the search: http://code.google.com/intl/de-DE/apis/soapsearch/index.html http://code.google.com/intl/de-DE/apis/ajaxsearch/ reinhard Dennis Kubes wrote: This isn't

Re: Why cant I inject a google link to the database?

2009-07-17 Thread Brian Ulicny
start list of URLs by injecting pages of search results. Why won't this page be parsed and links extracted so the crawl can start?

Re: Why cant I inject a google link to the database?

2009-07-17 Thread Andrzej Bialecki
Brian Ulicny wrote:
1. Save the results page.
2. Grep the links out of it.
3. Put the results in a doc in your urls directory.
4. Do: bin/nutch crawl urls
Please note, we are not saying it is impossible to do this with Nutch (e.g. by setting the agent string to mimic a browser), but we
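Steps 2 and 3 of Brian Ulicny's list could be sketched in Python roughly as follows, assuming the results page has already been saved by hand (step 1). The file names results.html and urls/seed.txt, and the helpers extract_seed_urls and write_seed_file, are hypothetical and not part of Nutch. The sketch keeps only absolute http/https links and drops duplicates.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags that are absolute http(s) URLs."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and urlparse(value).scheme in ("http", "https"):
                self.links.append(value)


def extract_seed_urls(html_text: str) -> list:
    """Return the absolute links found in html_text, in order, without duplicates."""
    collector = LinkCollector()
    collector.feed(html_text)
    seen = set()
    ordered = []
    for link in collector.links:
        if link not in seen:
            seen.add(link)
            ordered.append(link)
    return ordered


def write_seed_file(urls, path: str) -> None:
    """Write one URL per line to a plain-text seed file, creating the directory if needed."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text("\n".join(urls) + "\n", encoding="utf-8")


if __name__ == "__main__":
    # Hypothetical file names: a manually saved results page and a seed file.
    saved_page = Path("results.html")
    if saved_page.exists():
        html_text = saved_page.read_text(encoding="utf-8", errors="replace")
        write_seed_file(extract_seed_urls(html_text), "urls/seed.txt")
```

The resulting urls directory can then be passed to the command from step 4. A saved results page will also contain links back to the search engine itself, so the seed file may need manual pruning before use.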

Re: Why cant I inject a google link to the database?

2009-07-17 Thread reinhard schwab