Repeat steps 2 through 7 until your boss yells at you. You may or may not end up needing to write a plugin. I've only had to deal with extracting a website's text-based content, but I assume others have already solved the problem you're dealing with, since it sounds like a common use case - so you may not have to write a plugin at all.
Scott

On Wed, Jul 21, 2010 at 5:06 PM, Branden Makana <[email protected]> wrote:

> Scott,
>
> Thanks so much for your reply. Here's what I've done so far: I made a
> file, urls, with my initial url (the homepage of the site), and I'm
> using the directory crawl-test instead of crawl:
>
> 1. Inject the initial url
> bin/nutch inject crawl-test/crawldb urls
>
> 2. Generate segments
> bin/nutch generate crawl-test/crawldb crawl-test/segments
>
> 3. Make an environment variable for convenience
> export SEGMENT=crawl-test/segments/`ls -tr crawl-test/segments|tail -1`
> echo $SEGMENT
> crawl-test/segments/20100721165811
>
> 4. Fetch my one page
> bin/nutch fetch $SEGMENT -noParsing
>
> 5. Parse it
> bin/nutch parse $SEGMENT
>
> 6. Update the db
> bin/nutch updatedb crawl-test/crawldb/ $SEGMENT -filter -normalize
>
> 7. Invert links
> bin/nutch invertlinks crawl-test/linkdb -dir crawl-test/segments
>
> And then I guess I'm done for the one page (note I skipped the
> solr/index steps). I realize that the crawldb now has the links found
> from the homepage (well, I hope it does), so I need to crawl those
> pages (rinse & repeat), however I'm not sure how. For example, after
> doing the above:
>
> bin/nutch fetch $SEGMENT -noParsing
> Fetcher: starting
> Fetcher: segment: crawl-test/segments/20100721165811
> Fetcher: java.io.IOException: Segment already fetched!
>
> What should the next step be, and how do I repeat that next step until
> I've finished crawling the site? My ultimate goal here is to do this
> programmatically, but I realize the bin/nutch script just calls Java
> classes, so once I have the steps down with the script file I should
> be able to replicate the steps with code.
>
> Thanks again for the help!
>
> Branden Makana
>
> On Jul 21, 2010, at 4:53 PM, Scott Gonyea wrote:
>
> > Sure, I'll help you along with some starting bits and pieces.
> > Improving the docs would be great for everyone who comes along.
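(The "Segment already fetched!" error above happens because a segment can only be fetched once; each new round has to begin with a fresh `generate`, which creates a new segment directory. Here is a minimal sketch of one round as a reusable shell function, assuming the crawl-test layout from the steps above - `crawl_round`, `CRAWL_DIR`, and `NUTCH` are hypothetical names for illustration, not part of nutch:)

```shell
# One round of steps 2-6 above, as a function you can call once per depth.
# CRAWL_DIR and NUTCH are hypothetical convenience variables.
CRAWL_DIR=crawl-test
NUTCH=bin/nutch

crawl_round() {
  # generate creates a brand-new segment each time, so fetch never
  # hits "Segment already fetched!"
  "$NUTCH" generate "$CRAWL_DIR/crawldb" "$CRAWL_DIR/segments"
  # pick up the newest segment, as in step 3 above
  SEGMENT=$CRAWL_DIR/segments/$(ls -tr "$CRAWL_DIR/segments" | tail -1)
  "$NUTCH" fetch "$SEGMENT" -noParsing
  "$NUTCH" parse "$SEGMENT"
  "$NUTCH" updatedb "$CRAWL_DIR/crawldb" "$SEGMENT" -filter -normalize
}
```

(Calling `crawl_round` once per desired depth, then running `bin/nutch invertlinks crawl-test/linkdb -dir crawl-test/segments` at the end, should mirror what the `crawl` command does internally.)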
> >
> > In your nutch root folder, create a dir called "urls" (or whatever
> > you want to call it) and throw a few test URLs into a flat file
> > ("seed_urls.txt"), separated by new-lines.
> >
> > Here's a sample script that will help you break down all of the
> > phases of a nutch crawl. The "crawl" command is basically all of
> > these commands, except annoying.
> >
> > cat > stupid_script.sh <<'EOF'
> > echo "-------------"
> > echo "bin/nutch inject crawl/crawldb urls/seed_urls.txt"
> > bin/nutch inject crawl/crawldb urls/seed_urls.txt
> > echo "-------------"
> > echo "-------------"
> > bin/nutch generate crawl/crawldb crawl/segments
> > echo "-------------"
> > echo "-------------"
> > export SEGMENT=crawl/segments/`ls -tr crawl/segments|tail -1`
> > echo $SEGMENT
> > echo "-------------"
> > echo "-------------"
> > echo "bin/nutch fetch $SEGMENT -noParsing"
> > bin/nutch fetch $SEGMENT -noParsing
> > echo "-------------"
> > echo "-------------"
> > echo "bin/nutch parse $SEGMENT"
> > bin/nutch parse $SEGMENT
> > echo "-------------"
> > echo "-------------"
> > echo "bin/nutch updatedb crawl/crawldb $SEGMENT -filter -normalize"
> > bin/nutch updatedb crawl/crawldb $SEGMENT -filter -normalize
> > echo "-------------"
> > echo "-------------"
> > echo "bin/nutch invertlinks crawl/linkdb -dir crawl/segments"
> > bin/nutch invertlinks crawl/linkdb -dir crawl/segments
> > echo "-------------"
> > echo "-------------"
> > echo "bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*"
> > bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*
> > echo "-------------"
> > echo "-------------"
> > echo "bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*"
> > bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
> > echo "-------------"
> > EOF
> >
> > You can comment out the index bits and pieces.
> > Keep in mind that the "crawl" command will iterate the above
> > commands, for any given depth. So, you need to call all of the
> > pieces of the script (except for inject, and index if you plan to
> > use it one day) for however many depths you end up wanting to
> > crawl. Each depth will take more and more time/space, etc.
> >
> > Spend some time in the wiki, and dig through the
> > "conf/nutch-default.xml" file, to look for options you might care
> > about. Any changes you want to make should be done in
> > "conf/nutch-site.xml", which will override the defaults.
> >
> > Scott
> >
> > On Wed, Jul 21, 2010 at 4:28 PM, Branden Makana
> > <[email protected]> wrote:
> >
> >> Hi All,
> >>
> >> Just wanted to follow up my question with a polite request that
> >> perhaps the documentation for Nutch be updated? I'm trying to
> >> follow the Nutch Tutorial
> >> (http://wiki.apache.org/nutch/NutchTutorial) to see if I can crawl
> >> a site without indexing it, but the commands and examples shown
> >> are out of date: directories are named differently than in the
> >> examples (or don't exist at all), and even some of the commands
> >> appear to be different.
> >>
> >> Nutch being open source, I'd gladly volunteer to do the updating,
> >> assuming someone can give me the pertinent information...
> >>
> >> Thanks,
> >> Branden Makana
> >>
> >> On Jul 21, 2010, at 11:52 AM, Branden Makana wrote:
> >>
> >>> Hello,
> >>>
> >>> We're trying to crawl a very large site, but we really just want
> >>> all the html/image URLs on the site - we don't care to search it.
> >>> Therefore, what's the best way to have Nutch crawl the site, but
> >>> NOT index/store pages locally? Is it even possible?
> >>>
> >>> Thanks,
> >>> Branden Makana
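(As an illustration of the conf/nutch-site.xml override mentioned above: a minimal sketch, with made-up values - which properties actually matter depends on your crawl. http.agent.name and db.ignore.external.links are real properties from nutch-default.xml; "MyTestCrawler" is a placeholder:)

```xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- conf/nutch-site.xml: any property set here overrides conf/nutch-default.xml -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>MyTestCrawler</value>
    <description>Identifies your crawler; the fetcher refuses to run if
    this is left empty.</description>
  </property>
  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
    <description>Stay on the seed site instead of following outbound
    links - handy when you only want one site's URLs.</description>
  </property>
</configuration>
```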

