Re: Site being crawled even when the URL is removed from seed.txt

Rajani Maski Wed, 26 Dec 2012 20:55:00 -0800

Hi Tejas,

    "urls" is the directory /home/ubuntu/nutch_new_setup/urls. It contains
only one file, seed.txt, and that file has only one URL:
http://localhost:8080/nutch-test-site/chi.html . You can see the folder
structure in screen shot 1. I am sure that there is no other /urls/seed.txt
folder structure on disk. This is the command I ran:

ubuntu@ubuntu-OptiPlex-390:~/nutch_new_setup$ bin/nutch crawl urls -dir tomcatcrawl -solr http://localhost:8080/nutch_poc -depth 5
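
One way to double-check what a seed directory would feed into the crawl is to
print every URL from every file in it. The following is only a sketch:
list_seed_urls is a hypothetical helper (not a Nutch command), and it assumes
that every plain-text file under the directory passed to the crawl command is
read, one URL per line, with blank lines and lines starting with "#" skipped.

```shell
#!/bin/sh
# Hypothetical helper, not part of Nutch: print the unique URLs found in
# every regular file under a seed directory.
list_seed_urls() {
  seed_dir="$1"
  # Concatenate all files, drop comment lines and blank lines, de-duplicate.
  find "$seed_dir" -type f -exec cat {} + \
    | grep -v '^#' \
    | grep -v '^[[:space:]]*$' \
    | sort -u
}
```

Run from ~/nutch_new_setup as "list_seed_urls urls", this would be expected to
print a single line if the directory really holds only the one seed URL; any
extra lines would point to a leftover file in the directory.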


Thanks & Regards
Rajani


On Wed, Dec 26, 2012 at 11:36 PM, Tejas Patil <[email protected]> wrote:

> Hi Rajani,
>
> As per screen shot #1, the seed URL
> (http://localhost:8080/nutch-test-site/chi.html) was saved in the file
> named "seeds.txt". But when running the crawl (screen shot #3), this file
> is not passed as an argument to the crawl command. Instead, some other file
> named "urls" is passed as the argument. I suspect that it contains the
> links from sony.com and usc.edu.
> Please pass the correct seed file in the crawl command and run a fresh
> crawl again.
>
> Thanks,
> Tejas Patil
>
>
> On Wed, Dec 26, 2012 at 4:04 AM, Rajani Maski <[email protected]>
> wrote:
>
> > Hi Lewis,
> >
> >    I think there is something wrong in the configuration on my end.
> >
> > But I have yet to find the reason why the crawl still covers old links
> > that are not mentioned in the seed text file, even though the old crawl db
> > created in /home/ubuntu/nutch_new_setup/testcrawl/crawldb no longer exists.
> > Did you mean that same crawldb, or does Nutch create a tmp folder somewhere
> > else that needs to be cleared?
> >
> > Please find the screen shots at this link
> > <http://rajinimaski.blogspot.in/2012/12/nutch-learning.html>, taken
> > during setup and while the crawl was executing. They show the detailed
> > configuration steps followed.
> >
> >
> > Thanks & Regards,
> > Rajani Maski
> >
> >
> >
> > On Wed, Dec 19, 2012 at 6:35 PM, Lewis John Mcgibbney <
> > [email protected]> wrote:
> >
> > > Hi Rajani,
> > >
> > >
> > >
> > > On Wed, Dec 19, 2012 at 10:33 AM, Rajani Maski <[email protected]> wrote:
> > >
> > > >
> > > >    Now I wanted to do a fresh crawl, so after the above crawling
> > > > process completed, I followed the steps below:
> > > >
> > > >    - Changed the URL in seed.txt to service.sony.com.in,
> > > >
> > >
> > > Did you inject the above URL into the new crawl database?
> > >
> > >
> > > >    - In regex-urlfilter.txt I just gave "+." [I know that "+." means
> > > >    to accept anything, but does "anything" include URLs that are
> > > >    not in seed.txt too?]
> > > >
> > >
> > > Nutch will follow out/in links for any given URL (depending on your
> > > configuration). The crawler cannot magically jump to undiscovered URLs;
> > > there needs to be a graph linking the nodes.
> > >
> > >
> > > >
> > > > What I observe is that crawling of the site
> > > > http://viterbi.usc.edu/admission/
> > > > is still taking place even though the URL is not in seed.txt and the
> > > > old crawldb (nutchcrawldb) no longer exists.
> > > >
> > >
> > > If you have totally deleted the old crawl database, this should be
> > > impossible. The crawl database tracks URLs along with lots of metadata.
> > > Once it is deleted, this information is lost and you will need to create
> > > your crawl database from scratch.
> > >
> > > Lewis
> > >
> >
>
