Re: Crawling localhost Webapps - regex- urfilter query

Lewis John Mcgibbney Wed, 19 Dec 2012 05:22:53 -0800

Additionally, wget is great for fetching pages on the fly, but a successful
wget does not necessarily mean that your Nutch server will and/or should be
able to fetch the page.
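
One reason for that gap, as a minimal sketch: a plain single-URL wget does
not consult robots.txt, whereas Nutch honours it. The Python below uses only
the standard-library urllib.robotparser. The robots.txt body, the agent name
"MyNutchBot" and the helper allowed() are made up for illustration; they are
not taken from the server discussed in this thread.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body (an assumption, not the real file on the
# Tomcat server from this thread).
ROBOTS_TXT = """\
User-agent: *
Disallow: /nutch-test-site/
"""


def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


url = "http://localhost:8080/nutch-test-site/child-1.html"

# "MyNutchBot" is a placeholder agent name.
print(allowed(ROBOTS_TXT, "MyNutchBot", url))  # rules above block this path
print(allowed("", "MyNutchBot", url))          # empty rule set blocks nothing
```

So a 200 OK from wget and a "robots denied" from a crawler can both be
correct for the same URL.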


I would always recommend using the parserchecker [0] tool for on-the-fly
fetching and parse checking. It can be run from the command line very
easily.

hth

Lewis

[0]
http://svn.apache.org/repos/asf/nutch/trunk/src/java/org/apache/nutch/parse/ParserChecker.java

On Wed, Dec 19, 2012 at 1:20 PM, Lewis John Mcgibbney <
[email protected]> wrote:

> This sounds most likely like a missing robots.txt on the webserver.
>
> Lewis
>
>
> On Wed, Dec 19, 2012 at 5:26 AM, Rajani Maski <[email protected]>wrote:
>
>> Hi Tejas,
>>
>> I found out why the blog was not getting crawled:
>> http://rajinimaski.blogspot.in/
>> The proxy has a filter that blocks blog sites. I used a different IP and
>> am now able to crawl the above blog site successfully.
>>
>> However, the HTML files that I have put on the local Tomcat webserver are
>> not getting crawled, and there are no errors either. Attached are the log
>> file and sample HTML pages. I will look at the robot rules for this and
>> get back.
>>
>> Thanks very much
>> Regards
>> Rajani
>>
>>
>>
>>
>>
>> On Wed, Dec 19, 2012 at 2:48 AM, Tejas Patil <[email protected]>wrote:
>>
>>> Hi Rajani,
>>>
>>> *Robot rules? I didn't get this check. Did you mean any setting in
>>> nutch-site.xml?*
>>> No. See this http://en.wikipedia.org/wiki/Robots_exclusion_standard
>>>
>>> I was able to crawl http://rajinimaski.blogspot.in/ successfully at my
>>> end.
>>> Without any error or exception it's hard to tell what the issue is. Set
>>> the logger to TRACE or DEBUG and look at the logs created for the fetch
>>> phase. There should be some message regarding the URL, like
>>> fetch of http://www.abcd.edu/~pqr/homework.html failed with: Http code=403, url=http://www.abcd.edu/~pqr/homework.html
>>> or
>>> 2012-12-18 11:24:58,436 TRACE http.Http - fetching http://www.ics.uci.edu/~dan/class/260/notes/
>>> 2012-12-18 11:24:58,481 TRACE http.Http - fetched 482 bytes from http://www.ics.uci.edu/~dan/class/260/notes/
>>> 2012-12-18 11:24:58,486 TRACE http.Http - 401 Authentication Required
>>>
>>> or something else that can shed light on the issue.
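
For scanning a long log for such messages, a rough Python sketch follows. It
assumes the failure line looks exactly like the "fetch of ... failed with:
Http code=..." example quoted above; real log lines may differ. The helper
name failed_fetches() is hypothetical, and the sample lines are copied from
this thread.

```python
import re

# Pattern for the failure message format quoted above (an assumption about
# what the real log contains).
FAILED = re.compile(
    r"fetch of (?P<url>\S+) failed with: Http code=(?P<code>\d+)"
)


def failed_fetches(log_lines):
    """Yield (url, http_code) for every line that matches FAILED."""
    for line in log_lines:
        match = FAILED.search(line)
        if match:
            yield match.group("url"), int(match.group("code"))


# Sample lines taken from the messages quoted in this thread.
sample = [
    "fetch of http://www.abcd.edu/~pqr/homework.html failed with: "
    "Http code=403, url=http://www.abcd.edu/~pqr/homework.html",
    "2012-12-18 11:24:58,436 TRACE http.Http - fetching "
    "http://www.ics.uci.edu/~dan/class/260/notes/",
]

for url, code in failed_fetches(sample):
    print(code, url)
```

To run it against a real log, pass an open file object instead of the sample
list.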
>>>
>>> Thanks,
>>> Tejas Patil
>>>
>>> On Tue, Dec 18, 2012 at 3:36 AM, Rajani Maski <[email protected]>
>>> wrote:
>>>
>>> > Hi Tejas,
>>> > Thank you for the detailed information. Regarding the checks:
>>> >
>>> > Check 1 - can the url be fetched via wget command:
>>> >
>>> > ubuntu@ubuntu-OptiPlex-390:~$ wget http://localhost:8080/nutch-test-site/child-1.html
>>> > --2012-12-18 16:07:34--  http://localhost:8080/nutch-test-site/child-1.html
>>> > Resolving localhost (localhost)... 127.0.0.1
>>> > Connecting to localhost (localhost)|127.0.0.1|:8080... connected.
>>> > HTTP request sent, awaiting response... 200 OK
>>> > Length: 102 [text/html]
>>> > Saving to: `child-1.html.1'
>>> >
>>> > 100%[======================================>] 102         --.-K/s   in 0s
>>> >
>>> >
>>> > 2012-12-18 16:07:34 (13.8 MB/s) - `child-1.html.1' saved [102/102]
>>> >
>>> > Check 2 : what are the robots rules defined for the host ? Do they allow
>>> > the crawler to crawl that url ? this will address #5.
>>> > Robot rules? I didn't get this check. Did you mean any setting in
>>> > nutch-site.xml?
>>> >
>>> > 3. After changing the parent page url from IP based to localhost and
>>> > running a *fresh* crawl, did you see any error or exception in the logs ?
>>> > Try running a fresh crawl in local mode; it helps in debugging things
>>> > quickly.
>>> >
>>> > Did a fresh crawl. There are no errors, only warnings. The stats are the
>>> > same as above.
>>> > Configuration: regex-urlfilter.txt has "+." and urls/seed.txt has
>>> > http://localhost:8080/nutch-test-site/child-1.html
>>> >
>>> > Another important observation: when I set other sites for crawling, like
>>> > http://viterbi.usc.edu/admission/, the crawl is successful and indexed to
>>> > Solr. But when I crawl the above HTML page, nothing is fetched. Also, when
>>> > I try to crawl the site http://rajinimaski.blogspot.in/ (this has 3
>>> > blogs) there is a 403 status - failed to fetch.
>>> >
>>> >
>>> > thanks & Regards
>>> > Rajani
>>> >
>>> >
>>> >
>>> >
>>> >
>>> > On Tue, Dec 18, 2012 at 1:59 PM, Tejas Patil <[email protected]> wrote:
>>> >
>>> > > Hi Rajani,
>>> > >
>>> > > A url is marked as "db_gone" when nutch receives one of the below HTTP
>>> > > error codes for the request:
>>> > > 1. Bad request (error code: 400)
>>> > > 2. Not found (error code: 404)
>>> > > 3. Access denied (error code: 401)
>>> > > 4. Permanently gone (error code: 410)
>>> > >
>>> > > Apart from this, a url can also be marked as "db_gone" if:
>>> > > 5. it's not getting crawled due to "Robots denied" or
>>> > > 6. some exception is triggered while fetching the content from the server
>>> > > (e.g. read timeout, broken socket, etc.)
>>> > >
>>> > > (NOTE: as we are dealing with an HTTP url here, it made sense to focus on
>>> > > HTTP codes only. For the FTP protocol, nutch has similar stuff. I preferred
>>> > > to avoid discussing that.)
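
As a rough sketch, the six cases listed above can be written down as a small
lookup. This only restates the list from this message; it is not Nutch's
actual code, and the names GONE_CODES and leads_to_db_gone() are invented
for illustration.

```python
# HTTP codes that, per the list above, lead to a url being marked db_gone.
GONE_CODES = {
    400: "Bad request",
    401: "Access denied",
    404: "Not found",
    410: "Permanently gone",
}


def leads_to_db_gone(http_code=None, robots_denied=False,
                     fetch_exception=False):
    """Return True if any of the six listed conditions holds.

    Hypothetical helper: cases 1-4 are the HTTP codes in GONE_CODES,
    case 5 is "Robots denied", case 6 is an exception during the fetch.
    """
    return robots_denied or fetch_exception or http_code in GONE_CODES


print(leads_to_db_gone(http_code=404))       # case 2
print(leads_to_db_gone(robots_denied=True))  # case 5
print(leads_to_db_gone(http_code=200))       # none of the listed cases
```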
>>> > >
>>> > > The reason why you could not see the child pages in the crawldb: the
>>> > > parent page has not been fetched successfully.
>>> > >
>>> > > Quick checks that you can try:
>>> > > 1. can the url be fetched via the wget command
>>> > > <http://linux.die.net/man/1/wget> on the terminal ? this will address
>>> > > cases 1-4
>>> > > 2. what are the robots rules defined for the host ? Do they allow the
>>> > > crawler to crawl that url ? this will address #5.
>>> > > 3. After changing the parent page url from IP based to localhost and
>>> > > running a *fresh* crawl, did you see any error or exception in the logs ?
>>> > > Try running a fresh crawl in local mode; it helps in debugging things
>>> > > quickly.
>>> > >
>>> > > Thanks,
>>> > > Tejas Patil
>>> > >
>>> > > On Mon, Dec 17, 2012 at 11:34 PM, Rajani Maski <[email protected]> wrote:
>>> > >
>>> > > > Can you please tell me what this means: Status: 3 (db_gone)
>>> > >
>>> >
>>>
>>
>>
>
>
> --
> *Lewis*
>



-- 
*Lewis*
