Sure, I'll help you get started with some bits and pieces. Improving the docs
would be great for everyone who comes along.
In your Nutch root folder, create a dir called "urls" (or whatever you want
to call it), then throw a few test URLs into a flat file ("urls/seed.txt",
which is what the inject step below reads), one per line.
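For instance, a quick way to set that up (the URLs here are just example seeds; the path matches what the inject step below expects):

```shell
# Run from the Nutch root. Creates the seed dir and a one-URL-per-line file.
mkdir -p urls
cat > urls/seed.txt <<'SEEDS'
http://nutch.apache.org/
http://lucene.apache.org/
SEEDS
```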
Here's a sample script that will help you break down all of the phases of a
Nutch crawl. The "crawl" command basically just runs all of these commands
for you, only more opaquely.
cat > stupid_script.sh <<'EOF'  # quoted EOF so $SEGMENT and `ls` expand at run time, not now
echo "-------------"
echo "bin/nutch inject crawl/crawldb urls"
bin/nutch inject crawl/crawldb urls/seed.txt
echo "-------------"
echo "-------------"
echo "bin/nutch generate crawl/crawldb crawl/segments"
bin/nutch generate crawl/crawldb crawl/segments
echo "-------------"
echo "-------------"
export SEGMENT=crawl/segments/$(ls -tr crawl/segments | tail -1)
echo $SEGMENT
echo "-------------"
echo "-------------"
echo "bin/nutch fetch $SEGMENT -noParsing"
bin/nutch fetch $SEGMENT -noParsing
echo "-------------"
echo "-------------"
echo "bin/nutch parse $SEGMENT"
bin/nutch parse $SEGMENT
echo "-------------"
echo "-------------"
echo "bin/nutch updatedb crawl/crawldb $SEGMENT -filter -normalize"
bin/nutch updatedb crawl/crawldb $SEGMENT -filter -normalize
echo "-------------"
echo "-------------"
echo "bin/nutch invertlinks crawl/linkdb -dir crawl/segments"
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
echo "-------------"
echo "-------------"
echo "bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb
crawl/linkdb crawl/segments/*"
bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb
crawl/segments/*
echo "-------------"
echo "-------------"
echo "bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb
crawl/segments/*"
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
echo "-------------"
EOF
You can comment out the index bits and pieces. Keep in mind that the
"crawl" command will iterate the above commands for any given depth. So
you need to call all of the pieces of the script (except for inject, and
index if you plan to use it one day) once per depth you want to crawl.
Each depth will take more and more time, space, etc.
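That iteration might look something like this sketch. DEPTH and the crawl/
paths are assumptions matching the script above, and it only makes sense run
from the Nutch root:

```shell
# Repeat the generate -> fetch -> parse -> updatedb cycle once per depth.
DEPTH=3   # assumed depth; adjust to taste
depth=1
while [ "$depth" -le "$DEPTH" ]; do
  # Bail out politely if we're not in a Nutch root.
  if [ ! -x bin/nutch ]; then
    echo "bin/nutch not found; run this from the Nutch root" >&2
    break
  fi
  bin/nutch generate crawl/crawldb crawl/segments
  # Pick up the segment that generate just created (newest dir).
  SEGMENT=crawl/segments/$(ls -tr crawl/segments | tail -1)
  bin/nutch fetch "$SEGMENT" -noParsing
  bin/nutch parse "$SEGMENT"
  bin/nutch updatedb crawl/crawldb "$SEGMENT" -filter -normalize
  depth=$((depth + 1))
done
```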
Spend some time in the wiki, and dig through the "conf/nutch-default.xml"
file, to look for options you might care about. Any changes you want to make
should be done in "conf/nutch-site.xml", which will override the defaults.
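For example, a minimal "conf/nutch-site.xml" might look like this. The
property name comes from nutch-default.xml; the value is just a placeholder
you'd swap for your own crawler name:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- http.agent.name is one property you'll almost certainly need to set
       before fetching works; the value below is a placeholder. -->
  <property>
    <name>http.agent.name</name>
    <value>MyCrawler</value>
  </property>
</configuration>
```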
Scott
On Wed, Jul 21, 2010 at 4:28 PM, Branden Makana <
[email protected]> wrote:
> Hi All,
>
>
> Just wanted to follow up my question with a polite request that
> perhaps the documentation for Nutch be updated? I'm trying to follow the
> Nutch Tutorial (http://wiki.apache.org/nutch/NutchTutorial) to see if I
> can crawl a site without indexing it, but the commands and examples shown
> are out of date: directories are named differently than in the examples (or
> don't exist at all), and even some of the commands appear to be different.
>
> Nutch being open source, I'd gladly volunteer to do the updating,
> assuming someone can give me the pertinent information...
>
> Thanks,
> Branden Makana
>
> On Jul 21, 2010, at 11:52 AM, Branden Makana wrote:
>
> > Hello,
> >
> >
> > We're trying to crawl a very large site, but we really just want
> all the html/image URLs on the site - we don't care to search it. Therefore,
> what's the best way to have Nutch crawl the site, but NOT index/store pages
> locally? Is it even possible?
> >
> >
> >
> > Thanks,
> > Branden Makana
>
>