Re: Why cant I inject a google link to the database?

2009-07-17 Thread reinhard schwab
http://www.google.se/robots.txt

google disallows it.

User-agent: *
Allow: /searchhistory/
Disallow: /search
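Those three rules can be checked mechanically; a minimal sketch with Python's stdlib robots.txt parser, feeding it the rules quoted above (nothing is fetched from the network):

```python
import urllib.robotparser

# The relevant rules from http://www.google.se/robots.txt, as quoted above.
rules = [
    "User-agent: *",
    "Allow: /searchhistory/",
    "Disallow: /search",
]

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)

# Any generic crawler (which is what Nutch is, to Google) is refused /search:
print(rp.can_fetch("*", "http://www.google.se/search?q=site:se"))  # False
print(rp.can_fetch("*", "http://www.google.se/searchhistory/"))    # True
```

This is exactly the check Nutch performs before fetching, which is why the inject/fetch cycle ends with no URLs.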


Larsson85 schrieb:
 Why isn't Nutch able to handle links from Google?

 I tried to start a crawl from the following URL:
 http://www.google.se/search?q=site:se&hl=sv&start=100&sa=N

 And all I get is "no more URLs to fetch".

 The reason I want to do this is that I had a thought that maybe I
 could use Google to generate my start list of URLs by injecting pages of
 search results.

 Why won't this page be parsed and its links extracted so the crawl can start?



Re: Why cant I inject a google link to the database?

2009-07-17 Thread reinhard schwab
it seems that google is blocking the user agent.

i get this reply with lwp-request:

Your client does not have permission to get URL
/search?q=site:se&hl=sv&start=100&sa=N from
this server.  (Client IP address: XX.XX.XX.XX)

Please see Google's Terms of Service posted at
http://www.google.com/terms_of_service.html

if you set the user agent property to mimic a client such as firefox,
google will serve your request.
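For illustration, the agent override looks the same from any HTTP client; a small sketch with Python's stdlib that builds (but does not send) such a request. The Firefox agent string below is just an example value, and the later replies in this thread explain why overriding the agent to sidestep robots.txt is considered impolite:

```python
import urllib.request

# Build a request that presents a browser-like agent string
# instead of a crawler's default identifier.
url = "http://www.google.se/search?q=site:se&hl=sv&start=100&sa=N"
req = urllib.request.Request(url, headers={
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) Firefox/3.5",  # example value
})

print(req.get_header("User-agent"))  # the header a server would see
```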

reinhard schwab schrieb:
 http://www.google.se/robots.txt

 google disallows it.
 [...]



Re: Why cant I inject a google link to the database?

2009-07-17 Thread Larsson85

Any workaround for this? Making nutch identify as something else or something
similar?


reinhard schwab wrote:
 http://www.google.se/robots.txt

 google disallows it.
 [...]

-- 
View this message in context: 
http://www.nabble.com/Why-cant-I-inject-a-google-link-to-the-database--tp24533162p24533426.html
Sent from the Nutch - User mailing list archive at Nabble.com.



Re: Why cant I inject a google link to the database?

2009-07-17 Thread reinhard schwab
you can check the response of google by dumping the segment

bin/nutch readseg -dump crawl/segments/...   somedirectory


reinhard schwab schrieb:
 it seems that google is blocking the user agent
 [...]



Re: Why cant I inject a google link to the database?

2009-07-17 Thread Doğacan Güney
On Fri, Jul 17, 2009 at 15:23, Larsson85 kristian1...@hotmail.com wrote:

 Any workaround for this? Making nutch identify as something else or something
 similar?


Also note that nutch does not crawl anything with '?' or '&' in the URL. Check out
crawl-urlfilter.txt or regex-urlfilter.txt (depending on whether you
use the crawl command
or the inject/generate/fetch/parse etc. commands).
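As a sketch of what to look for: the stock regex-urlfilter.txt ships with a rule along these lines (the exact comment and character class may differ by Nutch version), and commenting it out is what lets query-string URLs through:

```
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
```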


 reinhard schwab wrote:
 http://www.google.se/robots.txt

 google disallows it.
 [...]





-- 
Doğacan Güney


Re: Why cant I inject a google link to the database?

2009-07-17 Thread Doğacan Güney
2009/7/17 Doğacan Güney doga...@gmail.com:
 On Fri, Jul 17, 2009 at 15:23, Larsson85 kristian1...@hotmail.com wrote:

 Any workaround for this? Making nutch identify as something else or something
 similar?


 Also note that nutch does not crawl anything with '?' or '&' in the URL. Check
 out


Oops. I mean nutch does not crawl any such URL *by default*.

 crawl-urlfilter.txt or regex-urlfilter.txt (depending on whether you
 use crawl command
 or inject/generate/fetch/parse etc. commands).
 [...]




-- 
Doğacan Güney


Re: Why cant I inject a google link to the database?

2009-07-17 Thread reinhard schwab
identify nutch as a popular user agent such as firefox.

Larsson85 schrieb:
 Any workaround for this? Making nutch identify as something else or something
 similar?
 [...]



Re: Why cant I inject a google link to the database?

2009-07-17 Thread Dennis Kubes
This isn't a user agent problem.  No matter what user agent you use, 
Nutch is still not going to crawl this page because Nutch is correctly 
following robots.txt directives which block access.  To change this 
would be to make the crawler impolite.  A well behaved crawler should 
follow the robots.txt directives.


Dennis

reinhard schwab wrote:

identify nutch as popular user agent such as firefox.
[...]



Re: Why cant I inject a google link to the database?

2009-07-17 Thread Larsson85

I think I need more help on how to do this.

I tried using

<property>
  <name>http.robots.agents</name>
  <value>Mozilla/5.0*</value>
  <description>The agent strings we'll look for in robots.txt files,
  comma-separated, in decreasing order of precedence. You should
  put the value of http.agent.name as the first agent name, and keep the
  default * at the end of the list. E.g.: BlurflDev,Blurfl,*
  </description>
</property>

If I don't have the star at the end I get the same as earlier, "No URLs to
fetch". And if I do, I get "0 records selected for fetching, exiting".



reinhard schwab wrote:
 identify nutch as popular user agent such as firefox.
 [...]

-- 
View this message in context: 
http://www.nabble.com/Why-cant-I-inject-a-google-link-to-the-database--tp24533162p24534522.html
Sent from the Nutch - User mailing list archive at Nabble.com.



Re: Why cant I inject a google link to the database?

2009-07-17 Thread Jake Jacobson
Larsson85,

Please read past responses.  Google is blocking all crawlers, not just
yours, from indexing their search results.  Because of their robots.txt
file directives you will not be able to do this.

If you placed a sign on your house saying DO NOT ENTER and I entered
anyway, you would be very upset.  That is what the robots.txt file does
for a site.  It tells visiting bots what they can enter and what they
can't enter.

Jake Jacobson

http://www.linkedin.com/in/jakejacobson
http://www.facebook.com/jakecjacobson
http://twitter.com/jakejacobson

Our greatest fear should not be of failure,
but of succeeding at something that doesn't really matter.
   -- ANONYMOUS



On Fri, Jul 17, 2009 at 9:32 AM, Larsson85 kristian1...@hotmail.com wrote:

 I think I need more help on how to do this.
 [...]




Re: Why cant I inject a google link to the database?

2009-07-17 Thread reinhard schwab
you are right.
robots.txt clearly disallows this page, so it will not be fetched.

i remember google has some APIs to access the search.
http://code.google.com/intl/de-DE/apis/soapsearch/index.html
http://code.google.com/intl/de-DE/apis/ajaxsearch/
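If you go the API route, requests were plain HTTP GETs; a hedged sketch of building such a request URL with Python's stdlib (the endpoint shape is from memory of the old AJAX Search API, so verify it against the docs linked above before relying on it):

```python
from urllib.parse import urlencode

# Assumed endpoint shape for the (since-deprecated) AJAX Search API;
# only the URL is constructed here, nothing is sent.
params = {"v": "1.0", "q": "site:se"}
url = "http://ajax.googleapis.com/ajax/services/search/web?" + urlencode(params)
print(url)
```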

reinhard

Dennis Kubes schrieb:
 This isn't a user agent problem.  No matter what user agent you use,
 Nutch is still not going to crawl this page because Nutch is correctly
 following robots.txt directives which block access.
 [...]





Re: Why cant I inject a google link to the database?

2009-07-17 Thread Brian Ulicny
1. Save the results page.
2. Grep the links out of it.
3. Put the results in a doc in your urls directory
4. Do: bin/nutch crawl urls 
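Step 2 can be done with a few lines of stdlib Python; a minimal sketch (the inline HTML snippet stands in for the saved results page, which you would normally read from disk):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect absolute href targets from anchor tags ('grep the links out')."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value and value.startswith("http"):
                    self.links.append(value)

# Stand-in for the saved results page.
saved_page = '<html><body><a href="http://example.se/">hit</a><a href="/cache">skip</a></body></html>'
parser = LinkExtractor()
parser.feed(saved_page)
print(parser.links)  # ['http://example.se/']
```

Writing one URL per line into a file under the urls directory then matches what `bin/nutch crawl urls` expects as a seed list.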


On Fri, 17 Jul 2009 02:32 -0700, Larsson85 kristian1...@hotmail.com
wrote:

 I think I need more help on how to do this.
 [...]
 
-- 
  Brian Ulicny
  bulicny at alum dot mit dot edu
  home: 781-721-5746
  fax: 360-361-5746




Re: Why cant I inject a google link to the database?

2009-07-17 Thread Andrzej Bialecki

Brian Ulicny wrote:

1. Save the results page.
2. Grep the links out of it.
3. Put the results in a doc in your urls directory
4. Do: bin/nutch crawl urls 


Please note, we are not saying it is impossible to do this with Nutch
(e.g. by setting the agent string to mimic a browser), but we insist on
saying that it's RUDE to do this.

Anyway, Google monitors such attempts, and after you issue too many
requests your IP will be blocked for a duration - so no matter if you go
the polite or the impolite way, you won't be able to do this.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Why cant I inject a google link to the database?

2009-07-17 Thread reinhard schwab
you can also use commons-httpclient or htmlunit to access google search.
these tools are not crawlers. with htmlunit it would be easy to get the
outlinks.
i strongly advise you not to misuse google search with too many requests.
google will block you, i assume.

by using a search api, you are allowed to request it 1000 times per day,
if i remember correctly;
it is mentioned in the terms of use or elsewhere in the documentation.

google returns a maximum of 1000 links in a search result and
a maximum of 100 links in one page.
if you set the search parameter num=100,
you will get 100 links per result page.
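A small sketch of building the result-page URL with that parameter (parameter names as given earlier in the thread; `urlencode` handles the escaping):

```python
from urllib.parse import urlencode

# num=100 asks for the per-page maximum of 100 links;
# start pages through the (at most 1000) results.
params = {"q": "site:se", "hl": "sv", "start": 0, "num": 100}
url = "http://www.google.se/search?" + urlencode(params)
print(url)  # http://www.google.se/search?q=site%3Ase&hl=sv&start=0&num=100
```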


Brian Ulicny schrieb:
 1. Save the results page.
 2. Grep the links out of it.
 3. Put the results in a doc in your urls directory
 4. Do: bin/nutch crawl urls
 [...]