Scott,
Thanks so much for your reply. Here's what I've done so far: I made a
file, urls, containing my initial URL (the site's homepage), and I'm using the
directory crawl-test instead of crawl:
1. Inject the initial URL
bin/nutch inject crawl-test/crawldb urls
2. Generate segments
bin/nutch generate crawl-test/crawldb crawl-test/segments
3. Set an environment variable for convenience
export SEGMENT=crawl-test/segments/`ls -tr crawl-test/segments|tail -1`
echo $SEGMENT
crawl-test/segments/20100721165811
4. Fetch my one page
bin/nutch fetch $SEGMENT -noParsing
5. Parse it
bin/nutch parse $SEGMENT
6. Update the db
bin/nutch updatedb crawl-test/crawldb/ $SEGMENT -filter -normalize
7. Invert links
bin/nutch invertlinks crawl-test/linkdb -dir crawl-test/segments
And then I guess I'm done for the one page (note I skipped the solr/index
steps). I realize that the crawldb now has the links found from the homepage
(well, I hope it does), so I need to crawl those pages (rinse and repeat), but
I'm not sure how. For example, after doing the above:
bin/nutch fetch $SEGMENT -noParsing
Fetcher: starting
Fetcher: segment: crawl-test/segments/20100721165811
Fetcher: java.io.IOException: Segment already fetched!
What should the next step be, and how do I repeat it until I've finished
crawling the site? My ultimate goal is to do this programmatically, but I
realize the bin/nutch script just calls Java classes, so once I have the steps
down with the script I should be able to replicate them in code.
Thanks again for the help!
Branden Makana
On Jul 21, 2010, at 4:53 PM, Scott Gonyea wrote:
> Sure, I'll help you along with some starting bits and pieces. Improving the
> docs would be great for everyone who comes along.
>
> In your nutch root folder, create a dir called "urls" (or whatever you want
> to call it), then throw a few test URLs into a flat file ("seed_urls.txt"),
> separated by newlines.
>
> Here's a sample script that will help you break down all of the phases of a
> nutch crawl. The "crawl" command is basically just all of these commands,
> except annoying.
>
>
> cat > stupid_script.sh <<'EOF'
> echo "-------------"
> echo "bin/nutch inject crawl/crawldb urls/seed.txt"
> bin/nutch inject crawl/crawldb urls/seed.txt
> echo "-------------"
> echo "-------------"
> bin/nutch generate crawl/crawldb crawl/segments
> echo "-------------"
> echo "-------------"
> export SEGMENT=crawl/segments/`ls -tr crawl/segments|tail -1`
> echo $SEGMENT
> echo "-------------"
> echo "-------------"
> echo "bin/nutch fetch $SEGMENT -noParsing"
> bin/nutch fetch $SEGMENT -noParsing
> echo "-------------"
> echo "-------------"
> echo "bin/nutch parse $SEGMENT"
> bin/nutch parse $SEGMENT
> echo "-------------"
> echo "-------------"
> echo "bin/nutch updatedb crawl/crawldb $SEGMENT -filter -normalize"
> bin/nutch updatedb crawl/crawldb $SEGMENT -filter -normalize
> echo "-------------"
> echo "-------------"
> echo "bin/nutch invertlinks crawl/linkdb -dir crawl/segments"
> bin/nutch invertlinks crawl/linkdb -dir crawl/segments
> echo "-------------"
> echo "-------------"
> echo "bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*"
> bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*
> echo "-------------"
> echo "-------------"
> echo "bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*"
> bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
> echo "-------------"
> EOF
>
> You can comment out the index bits and pieces. Keep in mind that the
> "crawl" command will iterate the above commands for any given depth. So
> you need to call all of the pieces of the script (except for inject, and
> index if you plan to use it one day) once for each depth you end up
> wanting to crawl. Each depth will take more and more time, space, etc.
>
> Spend some time in the wiki, and dig through the
> "conf/nutch-default.xml" file to look for options you might care about.
> Any changes you want to make should go in "conf/nutch-site.xml", which
> overrides the defaults.
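> For what it's worth, a minimal "conf/nutch-site.xml" override is just the
> usual Hadoop-style property list; "http.agent.name" (placeholder value
> below) is one property every crawl needs set:
>
> ```xml
> <configuration>
>   <property>
>     <name>http.agent.name</name>
>     <value>my-test-crawler</value>
>   </property>
> </configuration>
> ```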
>
> Scott
>
> On Wed, Jul 21, 2010 at 4:28 PM, Branden Makana <
> [email protected]> wrote:
>
>> Hi All,
>>
>>
>> Just wanted to follow up my question with a polite request:
>> could the documentation for Nutch perhaps be updated? I'm trying to follow
>> the Nutch Tutorial (http://wiki.apache.org/nutch/NutchTutorial) to see if I
>> can crawl a site without indexing it, but the commands and examples shown
>> are out of date: directories are named differently than in the examples (or
>> don't exist at all), and even some of the commands appear to be different.
>>
>> Nutch being open source, I'd gladly volunteer to do the updating,
>> assuming someone can give me the pertinent information...
>>
>> Thanks,
>> Branden Makana
>>
>> On Jul 21, 2010, at 11:52 AM, Branden Makana wrote:
>>
>>> Hello,
>>>
>>>
>>> We're trying to crawl a very large site, but we really just want
>> all the html/image URLs on the site - we don't care to search it. Therefore,
>> what's the best way to have Nutch crawl the site but NOT index/store pages
>> locally? Is it even possible?
>>>
>>>
>>>
>>> Thanks,
>>> Branden Makana
>>
>>