Repeat steps 2 through 7 until your boss yells at you. You may or may not end up needing to write a plugin. I've only had to deal with extracting a website's text-based content, but I assume others have already solved the problem you're dealing with, since it sounds like a common use case - so you may not have to write a plugin at all.
Scott

On Wed, Jul 21, 2010 at 5:06 PM, Branden Makana <[email protected]> wrote:

> Scott,
>
> Thanks so much for your reply. Here's what I've done so far: I made a
> file, urls, with my initial url (the homepage of the site), and I'm
> using the directory crawl-test instead of crawl:
>
> 1. Inject the initial url
> bin/nutch inject crawl-test/crawldb urls
>
> 2. Generate segments
> bin/nutch generate crawl-test/crawldb crawl-test/segments
>
> 3. Make an environment variable for convenience
> export SEGMENT=crawl-test/segments/`ls -tr crawl-test/segments|tail -1`
> echo $SEGMENT
> crawl-test/segments/20100721165811
>
> 4. Fetch my one page
> bin/nutch fetch $SEGMENT -noParsing
>
> 5. Parse it
> bin/nutch parse $SEGMENT
>
> 6. Update the db
> bin/nutch updatedb crawl-test/crawldb/ $SEGMENT -filter -normalize
>
> 7. Invert links
> bin/nutch invertlinks crawl-test/linkdb -dir crawl-test/segments
>
> And then I guess I'm done for the one page (note I skipped the
> solr/index steps). I realize that the crawldb now has the links found
> from the homepage (well, I hope it does), so I need to crawl those
> pages (rinse & repeat), however I'm not sure how. For example, after
> doing the above:
>
> bin/nutch fetch $SEGMENT -noParsing
> Fetcher: starting
> Fetcher: segment: crawl-test/segments/20100721165811
> Fetcher: java.io.IOException: Segment already fetched!
>
> What should the next step be, and how do I repeat that next step until
> I've finished crawling the site? My ultimate goal here is to do this
> programmatically, but I realize the bin/nutch script just calls Java
> classes, so once I have the steps down with the script file I should
> be able to replicate the steps with code.
>
> Thanks again for the help!
>
> Branden Makana
>
> On Jul 21, 2010, at 4:53 PM, Scott Gonyea wrote:
>
> > Sure, I'll help you along with some starting bits and pieces.
> > Improving the docs would be great for everyone who comes along.
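(The "Segment already fetched!" error above happens because a segment can only be fetched once; each new round has to begin with a fresh `generate`, which creates a new segment directory. Here is a minimal sketch of one round as a reusable shell function, assuming the crawl-test layout from the steps above - `crawl_round`, `CRAWL_DIR`, and `NUTCH` are hypothetical names for illustration, not part of nutch:)

```shell
# One round of steps 2-6 above, as a function you can call once per depth.
# CRAWL_DIR and NUTCH are hypothetical convenience variables.
CRAWL_DIR=crawl-test
NUTCH=bin/nutch

crawl_round() {
  # generate creates a brand-new segment each time, so fetch never
  # hits "Segment already fetched!"
  "$NUTCH" generate "$CRAWL_DIR/crawldb" "$CRAWL_DIR/segments"
  # pick up the newest segment, as in step 3 above
  SEGMENT=$CRAWL_DIR/segments/$(ls -tr "$CRAWL_DIR/segments" | tail -1)
  "$NUTCH" fetch "$SEGMENT" -noParsing
  "$NUTCH" parse "$SEGMENT"
  "$NUTCH" updatedb "$CRAWL_DIR/crawldb" "$SEGMENT" -filter -normalize
}
```

(Calling `crawl_round` once per desired depth, then running `bin/nutch invertlinks crawl-test/linkdb -dir crawl-test/segments` at the end, should mirror what the `crawl` command does internally.)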
> >
> > In your nutch root folder, create a dir called "urls" (or whatever
> > you want to call it) and throw a few test URLs into a flat file
> > ("seed_urls.txt"), separated by new-lines.
> >
> > Here's a sample script that will help you break down all of the
> > phases of a nutch crawl. The "crawl" command is basically all of
> > these commands, except annoying.
> >
> > cat > stupid_script.sh <<'EOF'
> > echo "-------------"
> > echo "bin/nutch inject crawl/crawldb urls/seed_urls.txt"
> > bin/nutch inject crawl/crawldb urls/seed_urls.txt
> > echo "-------------"
> > echo "-------------"
> > bin/nutch generate crawl/crawldb crawl/segments
> > echo "-------------"
> > echo "-------------"
> > export SEGMENT=crawl/segments/`ls -tr crawl/segments|tail -1`
> > echo $SEGMENT
> > echo "-------------"
> > echo "-------------"
> > echo "bin/nutch fetch $SEGMENT -noParsing"
> > bin/nutch fetch $SEGMENT -noParsing
> > echo "-------------"
> > echo "-------------"
> > echo "bin/nutch parse $SEGMENT"
> > bin/nutch parse $SEGMENT
> > echo "-------------"
> > echo "-------------"
> > echo "bin/nutch updatedb crawl/crawldb $SEGMENT -filter -normalize"
> > bin/nutch updatedb crawl/crawldb $SEGMENT -filter -normalize
> > echo "-------------"
> > echo "-------------"
> > echo "bin/nutch invertlinks crawl/linkdb -dir crawl/segments"
> > bin/nutch invertlinks crawl/linkdb -dir crawl/segments
> > echo "-------------"
> > echo "-------------"
> > echo "bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*"
> > bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*
> > echo "-------------"
> > echo "-------------"
> > echo "bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*"
> > bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
> > echo "-------------"
> > EOF
> >
> > You can comment out the index bits and pieces.
> > Keep in mind that the "crawl" command will iterate the above
> > commands, for any given depth. So, you need to call all of the
> > pieces of the script (except for inject, and index if you plan to
> > use it one day) for however many depths you end up wanting to
> > crawl. Each depth will take more and more time/space, etc.
> >
> > Spend some time in the wiki, and dig through the
> > "conf/nutch-default.xml" file, to look for options you might care
> > about. Any changes you want to make should be done in
> > "conf/nutch-site.xml", which will override the defaults.
> >
> > Scott
> >
> > On Wed, Jul 21, 2010 at 4:28 PM, Branden Makana
> > <[email protected]> wrote:
> >
> >> Hi All,
> >>
> >> Just wanted to follow up my question with a polite request that
> >> perhaps the documentation for Nutch be updated? I'm trying to
> >> follow the Nutch Tutorial
> >> (http://wiki.apache.org/nutch/NutchTutorial) to see if I can crawl
> >> a site without indexing it, but the commands and examples shown
> >> are out of date: directories are named differently than in the
> >> examples (or don't exist at all), and even some of the commands
> >> appear to be different.
> >>
> >> Nutch being open source, I'd gladly volunteer to do the updating,
> >> assuming someone can give me the pertinent information...
> >>
> >> Thanks,
> >> Branden Makana
> >>
> >> On Jul 21, 2010, at 11:52 AM, Branden Makana wrote:
> >>
> >>> Hello,
> >>>
> >>> We're trying to crawl a very large site, but we really just want
> >>> all the html/image URLs on the site - we don't care to search it.
> >>> Therefore, what's the best way to have Nutch crawl the site, but
> >>> NOT index/store pages locally? Is it even possible?
> >>>
> >>> Thanks,
> >>> Branden Makana
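(As an illustration of the conf/nutch-site.xml override mentioned above: a minimal sketch, with made-up values - which properties actually matter depends on your crawl. http.agent.name and db.ignore.external.links are real properties from nutch-default.xml; "MyTestCrawler" is a placeholder:)

```xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- conf/nutch-site.xml: any property set here overrides conf/nutch-default.xml -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>MyTestCrawler</value>
    <description>Identifies your crawler; the fetcher refuses to run if
    this is left empty.</description>
  </property>
  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
    <description>Stay on the seed site instead of following outbound
    links - handy when you only want one site's URLs.</description>
  </property>
</configuration>
```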

