Re: Why can't I inject a google link to the database?
You can also use commons-httpclient or HtmlUnit to access Google search; these tools are not crawlers, and with HtmlUnit it would be easy to get the outlinks. I strongly advise you not to misuse Google search with too many requests; I assume Google will block you. By using a search API you are allowed to request it 1000 times per day, if I remember correctly; it is mentioned in the terms of use or elsewhere in the documentation. Google returns a maximum of 1000 results for a query and a maximum of 100 links on one page. If you set the search parameter &num=100 you will get 100 links per result page.

Brian Ulicny schrieb:
> 1. Save the results page.
> 2. Grep the links out of it.
> 3. Put the results in a doc in your urls directory.
> 4. Do: bin/nutch crawl urls
>
> On Fri, 17 Jul 2009 02:32 -0700, "Larsson85" wrote:
>> I think I need more help on how to do this.
>>
>> I tried using:
>>
>> <property>
>>   <name>http.robots.agents</name>
>>   <value>Mozilla/5.0*</value>
>>   <description>The agent strings we'll look for in robots.txt files,
>>   comma-separated, in decreasing order of precedence. You should
>>   put the value of http.agent.name as the first agent name, and keep the
>>   default * at the end of the list. E.g.: BlurflDev,Blurfl,*</description>
>> </property>
>>
>> If I don't have the star at the end I get the same as earlier, "No URLs to fetch". And if I do, I get "0 records selected for fetching, exiting".
>>
>> reinhard schwab wrote:
>>> Identify Nutch as a popular user agent such as Firefox.
>>>
>>> Larsson85 schrieb:
>>>> Any workaround for this? Making Nutch identify as something else or something similar?
>>>>
>>>> reinhard schwab wrote:
>>>>> http://www.google.se/robots.txt
>>>>>
>>>>> Google disallows it.
>>>>>
>>>>> User-agent: *
>>>>> Allow: /searchhistory/
>>>>> Disallow: /search
>>>>>
>>>>> Larsson85 schrieb:
>>>>>> Why isn't Nutch able to handle links from Google? I tried to start a crawl from the following URL: http://www.google.se/search?q=site:se&hl=sv&start=100&sa=N and all I get is "no more URLs to fetch". The reason I want to do this is that I thought maybe I could use Google to generate my start list of URLs by injecting pages of search results. Why won't this page be parsed and links extracted so the crawl can start?
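For illustration, a minimal sketch of the HtmlUnit approach suggested above: it loads one result page and prints the outgoing anchor hrefs. The query URL, the &num=100 parameter, and the HtmlUnit 2.x-era API calls are assumptions based on this thread; Google's robots.txt and terms of service still apply, and Google may block such requests.

import java.util.List;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class GoogleOutlinks {
    public static void main(String[] args) throws Exception {
        // Hypothetical query URL from this thread; &num=100 asks for 100 results per page.
        String url = "http://www.google.se/search?q=site:se&num=100";

        // HtmlUnit identifies itself as a browser (here Firefox), which is exactly
        // the "impolite" part being discussed in this thread.
        WebClient client = new WebClient(BrowserVersion.FIREFOX_3);
        client.setJavaScriptEnabled(false); // plain HTML is enough to get the links
        try {
            HtmlPage page = client.getPage(url);
            List<HtmlAnchor> anchors = page.getAnchors();
            for (HtmlAnchor anchor : anchors) {
                System.out.println(anchor.getHrefAttribute());
            }
        } finally {
            client.closeAllWindows();
        }
    }
}

The printed URLs could then be filtered and written into a seed file, but the politeness and blocking caveats raised later in this thread remain.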
Re: Why can't I inject a google link to the database?
Brian Ulicny wrote:
> 1. Save the results page.
> 2. Grep the links out of it.
> 3. Put the results in a doc in your urls directory.
> 4. Do: bin/nutch crawl urls

Please note, we are not saying this is impossible to do with Nutch (e.g. by setting the agent string to mimic a browser), but we insist that it's RUDE to do this. Anyway, Google monitors such attempts, and after you issue too many requests your IP will be blocked for a duration, so whether you go the polite way or the impolite way you won't be able to do this.

--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
Re: Why can't I inject a google link to the database?
1. Save the results page.
2. Grep the links out of it.
3. Put the results in a doc in your urls directory.
4. Do: bin/nutch crawl urls

On Fri, 17 Jul 2009 02:32 -0700, "Larsson85" wrote:
> I think I need more help on how to do this.
>
> I tried using:
>
> <property>
>   <name>http.robots.agents</name>
>   <value>Mozilla/5.0*</value>
>   <description>The agent strings we'll look for in robots.txt files,
>   comma-separated, in decreasing order of precedence. You should
>   put the value of http.agent.name as the first agent name, and keep the
>   default * at the end of the list. E.g.: BlurflDev,Blurfl,*</description>
> </property>
>
> If I don't have the star at the end I get the same as earlier, "No URLs to fetch". And if I do, I get "0 records selected for fetching, exiting".
>
> reinhard schwab wrote:
>> Identify Nutch as a popular user agent such as Firefox.
>>
>> Larsson85 schrieb:
>>> Any workaround for this? Making Nutch identify as something else or something similar?
>>>
>>> reinhard schwab wrote:
>>>> http://www.google.se/robots.txt
>>>>
>>>> Google disallows it.
>>>>
>>>> User-agent: *
>>>> Allow: /searchhistory/
>>>> Disallow: /search
>>>>
>>>> Larsson85 schrieb:
>>>>> Why isn't Nutch able to handle links from Google? I tried to start a crawl from the following URL: http://www.google.se/search?q=site:se&hl=sv&start=100&sa=N and all I get is "no more URLs to fetch". The reason I want to do this is that I thought maybe I could use Google to generate my start list of URLs by injecting pages of search results. Why won't this page be parsed and links extracted so the crawl can start?

--
Brian Ulicny
bulicny at alum dot mit dot edu
home: 781-721-5746
fax: 360-361-5746
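A minimal sketch of steps 2 and 3 from the list above, under the assumption that the results page was saved locally as results.html (a hypothetical file name), that the urls/ directory already exists, and that a crude href regex is good enough for a one-off seed list:

import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExtractSeeds {
    public static void main(String[] args) throws Exception {
        // Step 1 (done outside this program): save the results page as results.html.
        String html = new String(Files.readAllBytes(Paths.get("results.html")), StandardCharsets.UTF_8);

        // Step 2: grep the links out of it with a crude href regex.
        Pattern href = Pattern.compile("href=\"(http[^\"]+)\"");
        Matcher m = href.matcher(html);

        // Step 3: put the results in a doc in the urls directory (urls/seeds.txt here).
        try (PrintWriter out = new PrintWriter("urls/seeds.txt", "UTF-8")) {
            while (m.find()) {
                out.println(m.group(1));
            }
        }
        // Step 4 (outside this program): bin/nutch crawl urls
    }
}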
Re: Why can't I inject a google link to the database?
You are right. robots.txt clearly disallows this page, so this page will not be fetched. I remember Google has some APIs to access the search:

http://code.google.com/intl/de-DE/apis/soapsearch/index.html
http://code.google.com/intl/de-DE/apis/ajaxsearch/

reinhard

Dennis Kubes schrieb:
> This isn't a user agent problem. No matter what user agent you use, Nutch is still not going to crawl this page, because Nutch is correctly following the robots.txt directives which block access. To change this would be to make the crawler impolite. A well-behaved crawler should follow the robots.txt directives.
>
> Dennis
>
> reinhard schwab wrote:
>> Identify Nutch as a popular user agent such as Firefox.
>>
>> Larsson85 schrieb:
>>> Any workaround for this? Making Nutch identify as something else or something similar?
>>>
>>> reinhard schwab wrote:
>>>> http://www.google.se/robots.txt
>>>>
>>>> Google disallows it.
>>>>
>>>> User-agent: *
>>>> Allow: /searchhistory/
>>>> Disallow: /search
>>>>
>>>> Larsson85 schrieb:
>>>>> Why isn't Nutch able to handle links from Google? I tried to start a crawl from the following URL: http://www.google.se/search?q=site:se&hl=sv&start=100&sa=N and all I get is "no more URLs to fetch". The reason I want to do this is that I thought maybe I could use Google to generate my start list of URLs by injecting pages of search results. Why won't this page be parsed and links extracted so the crawl can start?
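As a rough illustration of the API route mentioned above, here is a sketch using commons-httpclient 3.x against the REST endpoint the AJAX Search API documentation described at the time. The endpoint, the parameters, and the quota mentioned earlier in the thread are assumptions taken from that documentation and this discussion, not verified here, so check the current terms before relying on them.

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.methods.GetMethod;

public class GoogleAjaxSearch {
    public static void main(String[] args) throws Exception {
        // Assumed REST endpoint of the AJAX Search API; the query is URL-encoded "site:se".
        String url = "http://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=site%3Ase";

        HttpClient client = new HttpClient();
        GetMethod get = new GetMethod(url);
        try {
            int status = client.executeMethod(get);
            if (status == HttpStatus.SC_OK) {
                // The response body is JSON; parse it with a JSON library of your choice
                // and feed the result URLs into your Nutch seed list.
                System.out.println(get.getResponseBodyAsString());
            } else {
                System.err.println("Request failed with status " + status);
            }
        } finally {
            get.releaseConnection();
        }
    }
}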
Re: Why can't I inject a google link to the database?
Larsson85,

Please read the earlier responses. Google is blocking all crawlers, not just yours, from indexing its search results. Because of the directives in their robots.txt file you will not be able to do this.

If you placed a sign on your house saying "DO NOT ENTER" and I entered, you would be very upset. That is what the robots.txt file does for a site: it tells visiting bots what they may enter and what they may not.

Jake Jacobson
http://www.linkedin.com/in/jakejacobson
http://www.facebook.com/jakecjacobson
http://twitter.com/jakejacobson

Our greatest fear should not be of failure, but of succeeding at something that doesn't really matter. -- ANONYMOUS

On Fri, Jul 17, 2009 at 9:32 AM, Larsson85 wrote:
> I think I need more help on how to do this.
>
> I tried using:
>
> <property>
>   <name>http.robots.agents</name>
>   <value>Mozilla/5.0*</value>
>   <description>The agent strings we'll look for in robots.txt files,
>   comma-separated, in decreasing order of precedence. You should
>   put the value of http.agent.name as the first agent name, and keep the
>   default * at the end of the list. E.g.: BlurflDev,Blurfl,*</description>
> </property>
>
> If I don't have the star at the end I get the same as earlier, "No URLs to fetch". And if I do, I get "0 records selected for fetching, exiting".
>
> reinhard schwab wrote:
>> Identify Nutch as a popular user agent such as Firefox.
>>
>> Larsson85 schrieb:
>>> Any workaround for this? Making Nutch identify as something else or something similar?
>>>
>>> reinhard schwab wrote:
>>>> http://www.google.se/robots.txt
>>>>
>>>> Google disallows it.
>>>>
>>>> User-agent: *
>>>> Allow: /searchhistory/
>>>> Disallow: /search
>>>>
>>>> Larsson85 schrieb:
>>>>> Why isn't Nutch able to handle links from Google? I tried to start a crawl from the following URL: http://www.google.se/search?q=site:se&hl=sv&start=100&sa=N and all I get is "no more URLs to fetch". The reason I want to do this is that I thought maybe I could use Google to generate my start list of URLs by injecting pages of search results. Why won't this page be parsed and links extracted so the crawl can start?
Re: Why can't I inject a google link to the database?
I think I need more help on how to do this.

I tried using:

<property>
  <name>http.robots.agents</name>
  <value>Mozilla/5.0*</value>
  <description>The agent strings we'll look for in robots.txt files,
  comma-separated, in decreasing order of precedence. You should
  put the value of http.agent.name as the first agent name, and keep the
  default * at the end of the list. E.g.: BlurflDev,Blurfl,*</description>
</property>

If I don't have the star at the end I get the same as earlier, "No URLs to fetch". And if I do, I get "0 records selected for fetching, exiting".

reinhard schwab wrote:
> Identify Nutch as a popular user agent such as Firefox.
>
> Larsson85 schrieb:
>> Any workaround for this? Making Nutch identify as something else or something similar?
>>
>> reinhard schwab wrote:
>>> http://www.google.se/robots.txt
>>>
>>> Google disallows it.
>>>
>>> User-agent: *
>>> Allow: /searchhistory/
>>> Disallow: /search
>>>
>>> Larsson85 schrieb:
>>>> Why isn't Nutch able to handle links from Google? I tried to start a crawl from the following URL: http://www.google.se/search?q=site:se&hl=sv&start=100&sa=N and all I get is "no more URLs to fetch". The reason I want to do this is that I thought maybe I could use Google to generate my start list of URLs by injecting pages of search results. Why won't this page be parsed and links extracted so the crawl can start?
Re: Why can't I inject a google link to the database?
This isn't a user agent problem. No matter what user agent you use, Nutch is still not going to crawl this page, because Nutch is correctly following the robots.txt directives which block access. To change this would be to make the crawler impolite. A well-behaved crawler should follow the robots.txt directives.

Dennis

reinhard schwab wrote:
> Identify Nutch as a popular user agent such as Firefox.
>
> Larsson85 schrieb:
>> Any workaround for this? Making Nutch identify as something else or something similar?
>>
>> reinhard schwab wrote:
>>> http://www.google.se/robots.txt
>>>
>>> Google disallows it.
>>>
>>> User-agent: *
>>> Allow: /searchhistory/
>>> Disallow: /search
>>>
>>> Larsson85 schrieb:
>>>> Why isn't Nutch able to handle links from Google? I tried to start a crawl from the following URL: http://www.google.se/search?q=site:se&hl=sv&start=100&sa=N and all I get is "no more URLs to fetch". The reason I want to do this is that I thought maybe I could use Google to generate my start list of URLs by injecting pages of search results. Why won't this page be parsed and links extracted so the crawl can start?
Re: Why can't I inject a google link to the database?
Identify Nutch as a popular user agent such as Firefox.

Larsson85 schrieb:
> Any workaround for this? Making Nutch identify as something else or something similar?
>
> reinhard schwab wrote:
>> http://www.google.se/robots.txt
>>
>> Google disallows it.
>>
>> User-agent: *
>> Allow: /searchhistory/
>> Disallow: /search
>>
>> Larsson85 schrieb:
>>> Why isn't Nutch able to handle links from Google? I tried to start a crawl from the following URL: http://www.google.se/search?q=site:se&hl=sv&start=100&sa=N and all I get is "no more URLs to fetch". The reason I want to do this is that I thought maybe I could use Google to generate my start list of URLs by injecting pages of search results. Why won't this page be parsed and links extracted so the crawl can start?
Re: Why can't I inject a google link to the database?
2009/7/17 Doğacan Güney:
> On Fri, Jul 17, 2009 at 15:23, Larsson85 wrote:
>> Any workaround for this? Making Nutch identify as something else or something similar?
>
> Also note that Nutch does not crawl anything with '?' or '&' in the URL. Check out

Oops. I mean Nutch does not crawl any such URL *by default*.

> crawl-urlfilter.txt or regex-urlfilter.txt (depending on whether you use the crawl command or the inject/generate/fetch/parse etc. commands).
>
>> reinhard schwab wrote:
>>> http://www.google.se/robots.txt
>>>
>>> Google disallows it.
>>>
>>> User-agent: *
>>> Allow: /searchhistory/
>>> Disallow: /search
>>>
>>> Larsson85 schrieb:
>>>> Why isn't Nutch able to handle links from Google? I tried to start a crawl from the following URL: http://www.google.se/search?q=site:se&hl=sv&start=100&sa=N and all I get is "no more URLs to fetch". The reason I want to do this is that I thought maybe I could use Google to generate my start list of URLs by injecting pages of search results. Why won't this page be parsed and links extracted so the crawl can start?
>
> --
> Doğacan Güney

--
Doğacan Güney
Re: Why can't I inject a google link to the database?
On Fri, Jul 17, 2009 at 15:23, Larsson85 wrote:
> Any workaround for this? Making Nutch identify as something else or something similar?

Also note that Nutch does not crawl anything with '?' or '&' in the URL. Check out crawl-urlfilter.txt or regex-urlfilter.txt (depending on whether you use the crawl command or the inject/generate/fetch/parse etc. commands).

> reinhard schwab wrote:
>> http://www.google.se/robots.txt
>>
>> Google disallows it.
>>
>> User-agent: *
>> Allow: /searchhistory/
>> Disallow: /search
>>
>> Larsson85 schrieb:
>>> Why isn't Nutch able to handle links from Google? I tried to start a crawl from the following URL: http://www.google.se/search?q=site:se&hl=sv&start=100&sa=N and all I get is "no more URLs to fetch". The reason I want to do this is that I thought maybe I could use Google to generate my start list of URLs by injecting pages of search results. Why won't this page be parsed and links extracted so the crawl can start?

--
Doğacan Güney
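For context, the stock filter rule behind this behaviour looks roughly like the following in conf/crawl-urlfilter.txt (and similarly in regex-urlfilter.txt); this is quoted from memory of the default Nutch configuration of that era, so check your own copy:

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

Since '?' and '=' are in that character class, any query-style URL is rejected before fetching. Relaxing or removing that line is what lets such URLs through the generator, at the cost of potentially crawling many dynamic pages.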
Re: Why can't I inject a google link to the database?
You can check Google's response by dumping the segment:

bin/nutch readseg -dump crawl/segments/... somedirectory

reinhard schwab schrieb:
> It seems that Google is blocking the user agent. I get this reply with lwp-request:
>
> Your client does not have permission to get URL /search?q=site:se&hl=sv&start=100&sa=N from this server. (Client IP address: XX.XX.XX.XX) Please see Google's Terms of Service posted at http://www.google.com/terms_of_service.html
>
> If you set the user agent properties to a client such as Firefox, Google will serve your request.
>
> reinhard schwab schrieb:
>> http://www.google.se/robots.txt
>>
>> Google disallows it.
>>
>> User-agent: *
>> Allow: /searchhistory/
>> Disallow: /search
>>
>> Larsson85 schrieb:
>>> Why isn't Nutch able to handle links from Google? I tried to start a crawl from the following URL: http://www.google.se/search?q=site:se&hl=sv&start=100&sa=N and all I get is "no more URLs to fetch". The reason I want to do this is that I thought maybe I could use Google to generate my start list of URLs by injecting pages of search results. Why won't this page be parsed and links extracted so the crawl can start?
Re: Why can't I inject a google link to the database?
Any workaround for this? Making Nutch identify as something else or something similar?

reinhard schwab wrote:
> http://www.google.se/robots.txt
>
> Google disallows it.
>
> User-agent: *
> Allow: /searchhistory/
> Disallow: /search
>
> Larsson85 schrieb:
>> Why isn't Nutch able to handle links from Google? I tried to start a crawl from the following URL: http://www.google.se/search?q=site:se&hl=sv&start=100&sa=N and all I get is "no more URLs to fetch". The reason I want to do this is that I thought maybe I could use Google to generate my start list of URLs by injecting pages of search results. Why won't this page be parsed and links extracted so the crawl can start?
Re: Why can't I inject a google link to the database?
It seems that Google is blocking the user agent. I get this reply with lwp-request:

Your client does not have permission to get URL /search?q=site:se&hl=sv&start=100&sa=N from this server. (Client IP address: XX.XX.XX.XX) Please see Google's Terms of Service posted at http://www.google.com/terms_of_service.html

If you set the user agent properties to a client such as Firefox, Google will serve your request.

reinhard schwab schrieb:
> http://www.google.se/robots.txt
>
> Google disallows it.
>
> User-agent: *
> Allow: /searchhistory/
> Disallow: /search
>
> Larsson85 schrieb:
>> Why isn't Nutch able to handle links from Google? I tried to start a crawl from the following URL: http://www.google.se/search?q=site:se&hl=sv&start=100&sa=N and all I get is "no more URLs to fetch". The reason I want to do this is that I thought maybe I could use Google to generate my start list of URLs by injecting pages of search results. Why won't this page be parsed and links extracted so the crawl can start?
Re: Why can't I inject a google link to the database?
http://www.google.se/robots.txt

Google disallows it:

User-agent: *
Allow: /searchhistory/
Disallow: /search

Larsson85 schrieb:
> Why isn't Nutch able to handle links from Google?
>
> I tried to start a crawl from the following URL: http://www.google.se/search?q=site:se&hl=sv&start=100&sa=N
>
> And all I get is "no more URLs to fetch".
>
> The reason I want to do this is that I thought maybe I could use Google to generate my start list of URLs by injecting pages of search results.
>
> Why won't this page be parsed and links extracted so the crawl can start?
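To make the robots.txt effect above concrete, here is a deliberately simplified sketch of how a prefix-based Disallow rule excludes the search URL. It is an illustration only, not Nutch's actual robots.txt parser; real parsers also handle Allow precedence, wildcards, and user-agent matching.

import java.net.URL;

public class RobotsCheck {
    // Simplified rule check: a URL path is blocked when it starts with a Disallow prefix.
    static boolean disallowed(String path, String disallowPrefix) {
        return path.startsWith(disallowPrefix);
    }

    public static void main(String[] args) throws Exception {
        URL url = new URL("http://www.google.se/search?q=site:se&hl=sv&start=100&sa=N");

        // "Disallow: /search" from http://www.google.se/robots.txt
        System.out.println(disallowed(url.getPath(), "/search"));     // true: a polite fetcher skips it

        // "/searchhistory/" also matches the prefix here; in a real parser the more
        // specific "Allow: /searchhistory/" rule takes precedence over the Disallow.
        System.out.println(disallowed("/searchhistory/", "/search"));
    }
}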