Scott,
Thanks so much for your reply. Here's what I've done so far: I made a
file, urls, containing my initial URL (the site's homepage), and I'm using the
directory crawl-test instead of crawl:
1. Inject the initial URL
bin/nutch inject crawl-test/crawldb urls
2. Generate segments
bin/nutch generate crawl-test/crawldb crawl-test/segments
3. Set an environment variable for convenience
export SEGMENT=crawl-test/segments/`ls -tr crawl-test/segments|tail -1`
echo $SEGMENT
crawl-test/segments/20100721165811
4. Fetch my one page
bin/nutch fetch $SEGMENT -noParsing
5. Parse it
bin/nutch parse $SEGMENT
6. Update the db
bin/nutch updatedb crawl-test/crawldb/ $SEGMENT -filter -normalize
7. Invert links
bin/nutch invertlinks crawl-test/linkdb -dir crawl-test/segments
And then I guess I'm done for the one page (note I skipped the solr/index
steps). I realize that the crawldb now has the links found from the homepage
(well, I hope it does), so I need to crawl those pages (rinse and repeat), but
I'm not sure how. For example, after doing the above:
bin/nutch fetch $SEGMENT -noParsing
Fetcher: starting
Fetcher: segment: crawl-test/segments/20100721165811
Fetcher: java.io.IOException: Segment already fetched!
What should the next step be, and how do I repeat it until I've finished
crawling the site? My ultimate goal is to do this programmatically, but I
realize the bin/nutch script just calls Java classes, so once I have the steps
down with the script I should be able to replicate them in code.
Thanks again for the help!
Branden Makana
On Jul 21, 2010, at 4:53 PM, Scott Gonyea wrote:
> Sure, I'll help you along with some starting bits and pieces. Improving the
> docs would be great for everyone who comes along.
>
> In your nutch root folder, create a dir called "urls" (or whatever you want
> to call it), then throw a few test URLs into a flat file ("seed_urls.txt"),
> separated by newlines.
>
> Here's a sample script that will help you break down all of the phases of a
> nutch crawl. The "crawl" command is basically just all of these commands,
> except annoying.
>
>
> cat > stupid_script.sh <<'EOF'
> echo "-------------"
> echo "bin/nutch inject crawl/crawldb urls/seed.txt"
> bin/nutch inject crawl/crawldb urls/seed.txt
> echo "-------------"
> echo "-------------"
> bin/nutch generate crawl/crawldb crawl/segments
> echo "-------------"
> echo "-------------"
> export SEGMENT=crawl/segments/`ls -tr crawl/segments|tail -1`
> echo $SEGMENT
> echo "-------------"
> echo "-------------"
> echo "bin/nutch fetch $SEGMENT -noParsing"
> bin/nutch fetch $SEGMENT -noParsing
> echo "-------------"
> echo "-------------"
> echo "bin/nutch parse $SEGMENT"
> bin/nutch parse $SEGMENT
> echo "-------------"
> echo "-------------"
> echo "bin/nutch updatedb crawl/crawldb $SEGMENT -filter -normalize"
> bin/nutch updatedb crawl/crawldb $SEGMENT -filter -normalize
> echo "-------------"
> echo "-------------"
> echo "bin/nutch invertlinks crawl/linkdb -dir crawl/segments"
> bin/nutch invertlinks crawl/linkdb -dir crawl/segments
> echo "-------------"
> echo "-------------"
> echo "bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*"
> bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*
> echo "-------------"
> echo "-------------"
> echo "bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*"
> bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
> echo "-------------"
> EOF
>
> You can comment out the index bits and pieces. Keep in mind that the
> "crawl" command will iterate the above commands for any given depth. So
> you need to call all of the pieces of the script (except for inject, and
> index if you plan to use it one day) once for each depth you end up
> wanting to crawl. Each depth will take more and more time, space, etc.
>
> Spend some time in the wiki, and dig through the
> "conf/nutch-default.xml" file to look for options you might care about.
> Any changes you want to make should go in "conf/nutch-site.xml", which
> overrides the defaults.
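> For what it's worth, a minimal "conf/nutch-site.xml" override is just the
> usual Hadoop-style property list; "http.agent.name" (placeholder value
> below) is one property every crawl needs set:
>
> ```xml
> <configuration>
>   <property>
>     <name>http.agent.name</name>
>     <value>my-test-crawler</value>
>   </property>
> </configuration>
> ```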
>
> Scott
>
> On Wed, Jul 21, 2010 at 4:28 PM, Branden Makana <
> [email protected]> wrote:
>
>> Hi All,
>>
>>
>> Just wanted to follow up my question with a polite request:
>> could the documentation for Nutch perhaps be updated? I'm trying to follow
>> the Nutch Tutorial (http://wiki.apache.org/nutch/NutchTutorial) to see if I
>> can crawl a site without indexing it, but the commands and examples shown
>> are out of date: directories are named differently than in the examples (or
>> don't exist at all), and even some of the commands appear to be different.
>>
>> Nutch being open source, I'd gladly volunteer to do the updating,
>> assuming someone can give me the pertinent information...
>>
>> Thanks,
>> Branden Makana
>>
>> On Jul 21, 2010, at 11:52 AM, Branden Makana wrote:
>>
>>> Hello,
>>>
>>>
>>> We're trying to crawl a very large site, but we really just want
>> all the html/image URLs on the site - we don't care to search it. Therefore,
>> what's the best way to have Nutch crawl the site but NOT index/store pages
>> locally? Is it even possible?
>>>
>>>
>>>
>>> Thanks,
>>> Branden Makana
>>
>>