Re: Please share your experience of using Nutch in production

2014-06-24 Thread Gora Mohanty
On 23 June 2014 01:44, Meraj A. Khan  wrote:
> Gora,
>
> Thanks for sharing your admin perspective. Rest assured, I am not trying
> to circumvent any politeness requirements in any way. As I mentioned
> earlier, I am within the crawl-delay limits set by the webmasters, if
> any. However, you have confirmed my hunch that I might have to reach out
> to individual webmasters to try to convince them not to block my IP
> address.
[...]

If you are taking the reasonable precautions that you mentioned earlier,
there is no reason that you should be getting banned by webmasters. Unless
a crawler is actually causing issues for the site performance, it might
not even come to the attention of the webmaster at all.

> By being at a disadvantage, I meant at a disadvantage compared to major
> players like the Google, Bing and Yahoo bots, which webmasters probably
> would not block, and by Nutch variant, I meant an instance of a
> customized crawler based on Nutch.

People are unlikely to ban Google et al., as there are clear benefits to
having them search one's site. If you would like special privileges, such
as being able to hit the site hard, you will have to convince the
webmaster that your crawler also brings some such benefit to them.

Regards,
Gora


Re: Please share your experience of using Nutch in production

2014-06-23 Thread Jorge Luis Betancourt Gonzalez
Why are you assuming that the webmasters are actually going to block you?
In my experience this is the least probable scenario.




Re: Please share your experience of using Nutch in production

2014-06-22 Thread Meraj A. Khan
Gora,

Thanks for sharing your admin perspective. Rest assured, I am not trying
to circumvent any politeness requirements in any way. As I mentioned
earlier, I am within the crawl-delay limits set by the webmasters, if
any. However, you have confirmed my hunch that I might have to reach out
to individual webmasters to try to convince them not to block my IP
address.

Even if I have as few as 100 websites to crawl, it would be a huge
challenge for us to communicate with each and every webmaster. How would
one go about doing that? Also, is there a standard way in which webmasters
list their contact info, so that we can pitch to them or persuade them to
allow us to crawl their websites at a reasonable frequency?

By being at a disadvantage, I meant at a disadvantage compared to major
players like the Google, Bing and Yahoo bots, which webmasters probably
would not block, and by Nutch variant, I meant an instance of a
customized crawler based on Nutch.

Thanks.




Re: Please share your experience of using Nutch in production

2014-06-22 Thread Gora Mohanty
On 22 June 2014 22:07, Meraj A. Khan  wrote:
>
> Hello Folks,
>
> I have noticed that Nutch resources and mailing lists are mostly geared
> towards the usage of Nutch in research-oriented projects. I would like to
> know from those of you who are using Nutch in production for large-scale
> crawling (vertical or non-vertical) what challenges to expect and how
> to overcome them.
>
> I will list a few challenges that I faced below, and would like to hear
> from you, if you faced them too, how you overcame them.
>
> 1. If I were to go for a vertical search engine for websites in a
>    particular domain and follow the crawl-delay directive for politeness
>    in the robots.txt, there is a possibility that the web master could
>    still block my IP address, and I would start getting HTTP 403
>    forbidden/access denied messages. How can I overcome this kind of
>    issue, other than by providing full contact info in the nutch-site.xml
>    so that the web master can get in touch with me before blocking me?

Er, providing full contact info is just basic politeness, and IMHO
should become a requirement for Nutch. If you are going to hit some
sites particularly hard, with good reasons, try contacting the website
administrators and explaining to them why you need such access. We
both administer and crawl sites, and as an administrator I am quite
willing to accept reasonable requests. After all, it is also our goal
to promote our websites, and already most traffic on the web is
through search engines.
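
For what it is worth, the two politeness measures under discussion here
(identifying the crawler with contact details, and honouring crawl-delay
from robots.txt) can be sketched in a few lines. This is only a rough
illustration using the Python standard library, not how Nutch does it
internally; Nutch takes the crawler identity from the http.agent.*
properties in nutch-site.xml. The agent name, URL, e-mail address and the
5-second fallback delay below are all made-up placeholders.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical crawler identity; replace with your own details.
AGENT_NAME = "ExampleVerticalBot"
CONTACT_URL = "http://crawler.example.com/about"
CONTACT_EMAIL = "crawler-admin@example.com"

# Assumed fallback delay in seconds; not a Nutch default.
DEFAULT_DELAY = 5.0


def user_agent():
    """Build a User-Agent string that tells the webmaster whom to contact."""
    return f"{AGENT_NAME} (+{CONTACT_URL}; {CONTACT_EMAIL})"


def polite_delay(robots_txt, agent=AGENT_NAME, default=DEFAULT_DELAY):
    """Return the number of seconds to wait between requests to one host."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(agent)
    # Never go faster than our own default, even if the site would allow it.
    return default if delay is None else max(float(delay), default)


sample = """User-agent: *
Crawl-delay: 10
Disallow: /private/
"""

print(user_agent())
print(polite_delay(sample))
```

Taking the larger of the site's crawl-delay and one's own default is a
deliberately conservative choice: a site that asks for less can still be
crawled slowly, and a site that asks for more gets what it asks for.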

> 2. The fact that you will be considered just another Nutch variant by
>    the web master puts you at a great disadvantage, where you could be
>    blocked from accessing the website at the whims of the web master.

Not sure what you mean by "just another Nutch variant", nor why you
think that it puts you at a disadvantage. Disadvantage compared to
whom? Also, "whims of the web master"? Really? After all, it is their
resources that you are using, and they are perfectly within their
rights to ban you if they feel, for whatever reason, that you are
abusing such resources.

> 3. Can anyone share how they overcame this issue when they were
>    starting out? Did you establish a relationship with each website
>    owner/master to allow unhindered access?
> 4. Any other tips and suggestions would also be greatly appreciated.

Sorry if I am misreading the above, but what you are asking for smells
like trying to circumvent reasonable requirements. Yes, do try talking
to website administrators. You might find them to be surprisingly
accommodating if you are reasonable in return.

Regards,
Gora