> I figured that it would be better to do it that way -- but I'm having
> issues getting them to work, which may be caused by my configurations,
> but I'm not able to find good help trying to get them figured out
> either.  I've been trying to crawl a SharePoint site (anonymous access
> allowed, so not even bothering with authentication) for over a week
> and not really getting any farther along.  It can find the directories
> and aspx pages and some of the pdfs, but then it skips over other
> pdfs, as well as office docs (doc, docx, ppt, pptx, etc.)  I'm
> guessing that is Tika related, but I'm not seeing much on that either.

What do you mean by skipping over? Do you mean you don't want ppt, pptx and the
like? In all cases you need to set up URL filters specific to your scenario and wishes.
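
For example, a hypothetical conf/regex-urlfilter.txt for an anonymous SharePoint crawl might look like the sketch below. The patterns are illustrative, not taken from your setup; adjust them to your site.

```shell
# Hypothetical conf/regex-urlfilter.txt for an anonymous SharePoint crawl.
# Rules are applied top-down; the first matching pattern wins ('-' rejects,
# '+' accepts). Patterns here are illustrative, not taken from your setup.
cat > /tmp/regex-urlfilter.txt <<'EOF'
# reject URLs with query strings (SharePoint list views generate endless
# ?SortField=...&View=... variants of the same page)
-[?*!@=]
# accept pages and the document types we want indexed
+\.(aspx|pdf|docx?|pptx?|xlsx?)$
# accept everything else, e.g. directory listings
+.
EOF

# sanity-check one rule outside Nutch: a query-string URL trips the reject line
echo 'http://sp.example.com/Pages/view.aspx?SortField=Title' | grep -cE '[?*!@=]'
```

Note the catch-all `+.` at the end: anything not rejected earlier is accepted, which mirrors how the shipped default filter file is structured.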

> 
> I don't get what the plugin.includes value even does, and I can't find
> an explanation of that on the wiki. 

All configuration directives have a description, see nutch-default.xml. It's 
wise to go through all settings. You won't understand them all at once, but 
later on you will remember them when you encounter a new problem.
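
Since every directive carries its documentation inline in nutch-default.xml, grep with context is the quickest manual. The excerpt below is paraphrased from a typical 1.x distribution just to show the shape of a property block:

```shell
# Create a small stand-in excerpt so the lookup can be demonstrated here;
# the description text is paraphrased from a typical 1.x nutch-default.xml.
cat > /tmp/nutch-default-excerpt.xml <<'EOF'
<property>
  <name>http.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be
  truncated; otherwise, no truncation at all.</description>
</property>
EOF

# against a real install you would run, e.g.:
#   grep -B1 -A6 '<name>plugin.includes</name>' $NUTCH_HOME/conf/nutch-default.xml
grep -A1 '<name>http.content.limit</name>' /tmp/nutch-default-excerpt.xml
```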

> Is there a published book
> somewhere, or some more thorough documentation?

No. The documentation on the wiki and the tutorials on the internet will not 
tell you how crawling actually works or what's involved.
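
The core of it is a short loop, though. Here is a minimal sketch of the generate/fetch/parse/updatedb cycle, once per depth level; the nutch binary is stubbed out with echo so the control flow can be read (and dry-run) on its own. Swap NUTCH for $NUTCH_HOME/bin/nutch for a real crawl; paths and the -topN value are placeholders.

```shell
# Sketch of the whole-web crawl cycle with the nutch binary stubbed by echo.
NUTCH="echo nutch"
CRAWL_DB=/tmp/crawl/crawldb
SEGMENTS_DIR=/tmp/crawl/segments
mkdir -p $SEGMENTS_DIR

depth=2
for i in $(seq 1 $depth); do
  $NUTCH generate $CRAWL_DB $SEGMENTS_DIR -topN 1000
  mkdir -p $SEGMENTS_DIR/2011121900000$i   # stand-in for the dir generate creates
  # segment dirs are timestamps, so lexical order == chronological order
  segment=$(ls -d $SEGMENTS_DIR/* | tail -1)
  $NUTCH fetch $segment -threads 5
  $NUTCH parse $segment
  $NUTCH updatedb $CRAWL_DB $segment
done
```

Picking the newest segment with a plain ascending `ls … | tail -1` matters: with a reversed listing you get the oldest segment back on the second round and re-fetch it.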

> I'm just getting
> frustrated and running in circles trying to make sense of all this.
> 
> Can you give me a good starting point, or a good reference to use?

Well, yes: the nutch-default descriptions, the unit tests, and the rest of the 
sources.
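
One concrete thing to check for the skipped pdf/doc/ppt files: plugin.includes must keep parse-(html|tika), since Tika is the parser for those formats. A sketch of the override in conf/nutch-site.xml (which overrides nutch-default.xml); the value is copied from a typical 1.x default, so verify it against your install:

```shell
# Stand-in excerpt of a conf/nutch-site.xml override; the important part
# is that parse-(html|tika) stays in the plugin.includes expression.
cat > /tmp/nutch-site-excerpt.xml <<'EOF'
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
EOF

# Also check http.content.limit: the 65536-byte default truncates most
# PDFs and Office files, which then fail to parse; -1 disables the limit.
grep -c 'parse-(html|tika)' /tmp/nutch-site-excerpt.xml
```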

> 
> -- Chris
> 
> 
> 
> On Mon, Dec 19, 2011 at 1:59 PM, Markus Jelsma <[email protected]> wrote:
> > I prefer and encourage use of individual commands because: a) you learn
> > how the tools actually work and get a better understanding and b) the
> > crawl command does not expose all options of separate commands.
> > 
> >> I created a script based on:
> >> http://wiki.apache.org/nutch/NutchTutorial#A3.2_Using_Individual_Commands_for_Whole-Web_Crawling
> >> 
> >> The script is similar to what happens in the old runbot.sh script, but
> >> it isn't working for me.  The part that does the "s1" barely gets
> >> anything, but then the "s2" fails completely (it tries to make the
> >> same segment?).
> >> 
> >> The plain "crawl" command seems to work, but how practical is it to
> >> use that instead of making this script?  Is it better to have all the
> >> parts broken down like this, instead of passing more parts to the
> >> "crawl" command and letting that handle things?
> >> 
> >> Any help would be greatly appreciated!
> >> 
> >> Script:
> >> 
> >> # runbot script to run the Nutch bot for crawling and re-crawling.
> >> # Usage: bin/runbot [safe]
> >> #        If executed in 'safe' mode, it doesn't delete the temporary
> >> #        directories generated during crawl. This might be helpful for
> >> #        analysis and recovery in case a crawl fails.
> >> #
> >> # Author: Susam Pal
> >> 
> >> #Set up some local vars -- like NUTCH_HOME & CATALINA_HOME
> >> . /common/setenv.sh
> >> 
> >> CRAWL_DIR=$NUTCH_HOME/crawl
> >> SEGMENTS_DIR=$CRAWL_DIR/segments
> >> CRAWL_DB=$CRAWL_DIR/crawldb
> >> LINK_DB=$CRAWL_DIR/linkdb
> >> 
> >> depth=5
> >> threads=5
> >> adddays=5
> >> topN=15 #Comment this statement if you don't want to set topN value
> >> 
> >> # Arguments for rm and mv
> >> RMARGS="-rf"
> >> MVARGS="--verbose"
> >> 
> >> # Parse arguments
> >> if [ "$1" == "safe" ]
> >> then
> >>   safe=yes
> >> fi
> >> 
> >> if [ -z "$NUTCH_HOME" ]
> >> then
> >>   NUTCH_HOME=/cdda/nutch
> >>   echo runbot: $0 could not find environment variable NUTCH_HOME
> >>   echo runbot: NUTCH_HOME=$NUTCH_HOME has been set by the script
> >> else
> >>   echo runbot: $0 found environment variable NUTCH_HOME=$NUTCH_HOME
> >> fi
> >> 
> >> if [ -z "$CATALINA_HOME" ]
> >> then
> >>   CATALINA_HOME=/cdda/apache-tomcat
> >>   echo runbot: $0 could not find environment variable CATALINA_HOME
> >>   echo runbot: CATALINA_HOME=$CATALINA_HOME has been set by the script
> >> else
> >>   echo runbot: $0 found environment variable CATALINA_HOME=$CATALINA_HOME
> >> fi
> >> 
> >> if [ -n "$topN" ]
> >> then
> >>   topN="-topN $topN"
> >> else
> >>   topN=""
> >> fi
> >> 
> >> echo "crawl.sh: STARTING"
> >> echo $NUTCH_HOME/bin/nutch inject $CRAWL_DB $NUTCH_HOME/urls
> >> echo ""
> >> $NUTCH_HOME/bin/nutch inject $CRAWL_DB $NUTCH_HOME/urls
> >> 
> >> echo $NUTCH_HOME/bin/nutch generate $CRAWL_DB $SEGMENTS_DIR -numFetchers 5
> >> echo ""
> >> $NUTCH_HOME/bin/nutch generate $CRAWL_DB $SEGMENTS_DIR -numFetchers 5
> >> 
> >> s1=`ls -d $SEGMENTS_DIR/* | tail -1`
> >> echo Segments: $s1
> >> echo ""
> >> 
> >> echo $NUTCH_HOME/bin/nutch fetch $s1 -threads $threads
> >> echo ""
> >> $NUTCH_HOME/bin/nutch fetch $s1 -threads $threads
> >> 
> >> echo $NUTCH_HOME/bin/nutch parse $s1
> >> echo ""
> >> $NUTCH_HOME/bin/nutch parse $s1
> >> 
> >> echo $NUTCH_HOME/bin/nutch updatedb $CRAWL_DB $s1
> >> echo ""
> >> $NUTCH_HOME/bin/nutch updatedb $CRAWL_DB $s1
> >> 
> >> echo $NUTCH_HOME/bin/nutch generate $CRAWL_DB $SEGMENTS_DIR -numFetchers 5
> >> echo ""
> >> $NUTCH_HOME/bin/nutch generate $CRAWL_DB $SEGMENTS_DIR -numFetchers 5
> >> s2=`ls -d $SEGMENTS_DIR/* | tail -1`
> >> echo Segments: $s2
> >> echo ""
> >> 
> >> echo $NUTCH_HOME/bin/nutch fetch $s2 -threads $threads
> >> echo ""
> >> $NUTCH_HOME/bin/nutch fetch $s2 -threads $threads
> >> 
> >> echo $NUTCH_HOME/bin/nutch parse $s2
> >> echo ""
> >> $NUTCH_HOME/bin/nutch parse $s2
> >> 
> >> echo $NUTCH_HOME/bin/nutch updatedb $CRAWL_DB $s2
> >> echo ""
> >> $NUTCH_HOME/bin/nutch updatedb $CRAWL_DB $s2
> >> 
> >> echo $NUTCH_HOME/bin/nutch invertlinks $LINK_DB -dir $SEGMENTS_DIR
> >> echo ""
> >> $NUTCH_HOME/bin/nutch invertlinks $LINK_DB -dir $SEGMENTS_DIR
> >> 
> >> echo $NUTCH_HOME/bin/nutch solrindex http://localhost:8080/nutchsolr/ $CRAWL_DB $LINK_DB -dir $SEGMENTS_DIR
> >> echo ""
> >> $NUTCH_HOME/bin/nutch solrindex http://localhost:8080/nutchsolr/ $CRAWL_DB $LINK_DB -dir $SEGMENTS_DIR
> >> 
> >> echo "cleaning $SEGMENTS_DIR"
> >> #rm -rf $SEGMENTS_DIR/*
> >> 
> >> echo "crawl.sh: FINISHED: Crawl completed!"
> >> echo ""
> >> 
> >> 
> >> 
> >> -- Chris
