Ha! Well, that brings up a question - how will I ever know when to 
-stop- repeating the steps? Running the crawl command does stop, so I 
assume something tells it that it has finished crawling the site.
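My working assumption (not verified against the Nutch source) is that each generate run selects URLs that are due for fetching, and once nothing is due it produces no new segment, so a hand-rolled loop can stop at that point. A toy sketch of that stopping rule, with nutch_generate standing in for the real bin/nutch generate call:

```shell
# Toy stopping rule for the generate/fetch cycle.
# nutch_generate is a STAND-IN for `bin/nutch generate crawl/crawldb crawl/segments`;
# here it "finds work" three times, then reports nothing left to fetch.
rounds_left=3
nutch_generate() {
  [ "$rounds_left" -gt 0 ] || return 1  # real generate: no URLs due for fetching
  rounds_left=$((rounds_left - 1))
  return 0
}

depth=0
while nutch_generate; do
  depth=$((depth + 1))
  # ...fetch / parse / updatedb / invertlinks on the newest segment here...
done
echo "stopped after depth $depth"
```

In the real loop the body would be the fetch/parse/updatedb/invertlinks steps run against the newest segment; this sketch only shows the termination logic.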


        I know that I can export a CSV of crawled links, so I'm able to grab 
that. As far as grabbing image URLs, well, that I'm not so sure about. 
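One idea for the image URLs (an assumption on my part - I haven't tried it): dump a fetched segment as text with bin/nutch readseg -dump and grep the dump for image-looking outlink URLs. The sketch below uses a canned, made-up dump file so the grep itself is concrete; the real dump format may differ:

```shell
# Stand-in for the text that `bin/nutch readseg -dump $SEGMENT dumpdir`
# writes out (this format is illustrative, not the exact dump format).
cat > segment_dump.txt <<'EOF'
outlink: toUrl: http://example.com/img/logo.png anchor:
outlink: toUrl: http://example.com/about.html anchor: About
outlink: toUrl: http://example.com/img/team.jpg anchor:
EOF

# Pull out anything that looks like an image URL.
grep -Eo 'https?://[^ ]+\.(png|jpe?g|gif)' segment_dump.txt > image_urls.txt
cat image_urls.txt
```

The same grep could run over the exported CSV of crawled links, if image links show up there.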

Thanks,
Branden Makana

On Jul 21, 2010, at 5:16 PM, Scott Gonyea wrote:

> Repeat 2 through 7 until your boss yells at you.  You may or may not end up
> needing to write a plugin.  I've not had to deal with extracting anything
> other than a website's text-based content.  I assume others have had to
> solve the problems that you are dealing with, since it sounds like a common
> use-case, and so you may not even have to write a plugin.
> 
> Scott
> 
> On Wed, Jul 21, 2010 at 5:06 PM, Branden Makana <
> [email protected]> wrote:
> 
>> Scott,
>> 
>> 
>>      Thanks so much for your reply. Here's what I've done so far: I made
>> a file, urls, with my initial url (homepage of site), and I'm using
>> directory crawl-test instead of crawl:
>> 
>> 1. Inject initial url
>> bin/nutch inject crawl-test/crawldb urls
>> 
>> 2. generate segments
>> bin/nutch generate crawl-test/crawldb crawl-test/segments
>> 
>> 3. Make the environment variable for convenience
>> export SEGMENT=crawl-test/segments/`ls -tr crawl-test/segments|tail -1`
>> echo $SEGMENT
>> crawl-test/segments/20100721165811
>> 
>> 4. fetch my one page
>> bin/nutch fetch $SEGMENT -noParsing
>> 
>> 5. parse it
>> bin/nutch parse $SEGMENT
>> 
>> 6. update db
>> bin/nutch updatedb crawl-test/crawldb/ $SEGMENT -filter -normalize
>> 
>> 7. invert
>> bin/nutch invertlinks crawl-test/linkdb -dir crawl-test/segments
>> 
>> And then I guess I'm done for the one page (note that I skipped the
>> solr/index steps). I realize the crawldb now has the links found on the
>> homepage (well, I hope it does), so I need to crawl those pages (rinse &
>> repeat), but I'm not sure how. For example, after doing the above:
>> 
>> bin/nutch fetch $SEGMENT -noParsing
>> Fetcher: starting
>> Fetcher: segment: crawl-test/segments/20100721165811
>> Fetcher: java.io.IOException: Segment already fetched!
>> 
>> 
>> What should the next step be, and how do I repeat that next step until I've
>> finished crawling the site? My ultimate goal here is to do this
>> programmatically, but I realize the bin/nutch script just calls Java
>> classes, so once I have the steps down with the script file I should be able
>> to replicate the steps with code.
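My reading of the "Segment already fetched!" error above (an assumption): fetch refuses to re-run on a segment that has already been fetched, so each round needs a fresh generate followed by re-exporting $SEGMENT to pick up the newest directory. The segment-picking step, simulated here with plain directories standing in for real segments:

```shell
# Plain directories standing in for real timestamped segment dirs
# (each real `bin/nutch generate` run creates a new one).
mkdir -p sim-crawl/segments/20100721165811
sleep 1   # ensure distinct mtimes so ls -tr orders them reliably
mkdir -p sim-crawl/segments/20100721170302

# Re-run this after EVERY generate; otherwise fetch reuses the old,
# already-fetched segment and fails.
SEGMENT=sim-crawl/segments/`ls -tr sim-crawl/segments | tail -1`
echo "$SEGMENT"
```

In other words, the repeat cycle is generate, re-export $SEGMENT, fetch, parse, updatedb - reusing the old $SEGMENT is what triggers the error.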
>> 
>> Thanks again for the help!
>> 
>> Branden Makana
>> 
>> 
>> 
>> On Jul 21, 2010, at 4:53 PM, Scott Gonyea wrote:
>> 
>>> Sure, I'll help you along with some bits and pieces to start. Improving
>>> the docs would be great for everyone who comes along.
>>> 
>>> In your nutch root folder, create a dir called "urls" (or whatever you
>>> want to call it), then throw a few test URLs into a flat file
>>> ("seed.txt"), separated by newlines.
>>> 
>>> Here's a sample script that will help you break down all of the phases
>>> of a nutch crawl.  The "crawl" command is basically just all of these
>>> commands, except annoying.
>>> 
>>> 
>>> cat > stupid_script.sh <<'EOF'
>>> echo "-------------"
>>> echo "bin/nutch inject crawl/crawldb urls/seed.txt"
>>> bin/nutch inject crawl/crawldb urls/seed.txt
>>> echo "-------------"
>>> echo "-------------"
>>> echo "bin/nutch generate crawl/crawldb crawl/segments"
>>> bin/nutch generate crawl/crawldb crawl/segments
>>> echo "-------------"
>>> echo "-------------"
>>> export SEGMENT=crawl/segments/`ls -tr crawl/segments|tail -1`
>>> echo $SEGMENT
>>> echo "-------------"
>>> echo "-------------"
>>> echo "bin/nutch fetch $SEGMENT -noParsing"
>>> bin/nutch fetch $SEGMENT -noParsing
>>> echo "-------------"
>>> echo "-------------"
>>> echo "bin/nutch parse $SEGMENT"
>>> bin/nutch parse $SEGMENT
>>> echo "-------------"
>>> echo "-------------"
>>> echo "bin/nutch updatedb crawl/crawldb $SEGMENT -filter -normalize"
>>> bin/nutch updatedb crawl/crawldb $SEGMENT -filter -normalize
>>> echo "-------------"
>>> echo "-------------"
>>> echo "bin/nutch invertlinks crawl/linkdb -dir crawl/segments"
>>> bin/nutch invertlinks crawl/linkdb -dir crawl/segments
>>> echo "-------------"
>>> echo "-------------"
>>> echo "bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*"
>>> bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*
>>> echo "-------------"
>>> echo "-------------"
>>> echo "bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*"
>>> bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
>>> echo "-------------"
>>> EOF
>>> 
>>> You can comment out the index bits and pieces.  Keep in mind that the
>>> "crawl" command will iterate the above commands for any given depth.  So
>>> you need to call all of the pieces of the script (except for inject, and
>>> index if you plan to use it one day) once per depth you want to crawl.
>>> Each depth will take more and more time, space, etc.
>>> 
>>> Spend some time in the wiki, and dig through the
>>> "conf/nutch-default.xml" file, looking for options you might care about.
>>> Any changes you want to make should be done in "conf/nutch-site.xml",
>>> which will override the defaults.
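For example, a minimal "conf/nutch-site.xml" override might look like this (http.agent.name is a property from nutch-default.xml, and as far as I know the fetcher refuses to run without it; the value is just a placeholder):

```xml
<?xml version="1.0"?>
<configuration>
  <!-- http.agent.name identifies your crawler to the sites it fetches.
       The value here is only an example - use your own. -->
  <property>
    <name>http.agent.name</name>
    <value>my-test-crawler</value>
  </property>
</configuration>
```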
>>> 
>>> Scott
>>> 
>>> On Wed, Jul 21, 2010 at 4:28 PM, Branden Makana <
>>> [email protected]> wrote:
>>> 
>>>> Hi All,
>>>> 
>>>> 
>>>>     Just wanted to follow up my question with a polite request that
>>>> perhaps the documentation for Nutch be updated? I'm trying to follow the
>>>> Nutch Tutorial (http://wiki.apache.org/nutch/NutchTutorial) to see if I
>>>> can crawl a site without indexing it, but the commands and examples shown
>>>> are out of date: directories are named differently than in the examples
>>>> (or don't exist at all), and even some of the commands appear to be
>>>> different.
>>>> 
>>>>     Nutch being open source, I'd gladly volunteer to do the updating,
>>>> assuming someone can give me the pertinent information...
>>>> 
>>>> Thanks,
>>>> Branden Makana
>>>> 
>>>> On Jul 21, 2010, at 11:52 AM, Branden Makana wrote:
>>>> 
>>>>> Hello,
>>>>> 
>>>>> 
>>>>>    We're trying to crawl a very large site, but we really just want
>>>>> all the html/image URLs on the site - we don't care to search it.
>>>>> Therefore, what's the best way to have Nutch crawl the site, but NOT
>>>>> index/store pages locally? Is it even possible?
>>>>> 
>>>>> 
>>>>> 
>>>>> Thanks,
>>>>> Branden Makana
>>>> 
>>>> 
>> 
>> 
