I set up Nutch as per this guide: http://nlp.solutions.asia/?p=362
I wiped the data within MySQL and re-indexed several times, and these
fields remain NULL:
modifiedTime, prevModifiedTime
MySQL version 5.6.16
Nutch version 2.2
./bin/nutch inject urls
./bin/nutch generate -topN 20
./bin/nutch fetch -all
./bin/nutch parse -all
./bin/nutch updatedb
I run these commands in this order each time I index.
I'm very new to Nutch but learning.
Cheers
Shane.
On 27/03/14 19:29, Lewis John Mcgibbney wrote:
Hi Shane,
It really helps users of this list and yourself if you are able to provide
more detailed questions.
Can you please state which version of Nutch, gora-core and gora-sql
artifacts and MySQL you are using?
It would seem that you've not made much progress to date, so I would
suggest wiping the data you have within your MySQL WebPage table and
starting again.
I would advise you to use the readdb tool to check the stats of the DB
after EVERY phase of the crawl.
https://wiki.apache.org/nutch/bin/nutch%20readdb
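As a rough sketch of checking the DB after every phase, a small wrapper can run one crawl step and then print the stats. This assumes the `-stats` option of readdb described on the wiki page above; the `NUTCH` variable and the `run_phase` function are hypothetical names used only for illustration.

```shell
#!/usr/bin/env bash
# Path to the nutch launcher; NUTCH is an assumed variable, override as needed.
NUTCH="${NUTCH:-./bin/nutch}"

# run_phase (hypothetical helper): run one crawl phase, then print DB stats.
# Stops with a non-zero status if the phase itself fails.
run_phase() {
  echo "== nutch $* =="
  "$NUTCH" "$@" || return 1
  "$NUTCH" readdb -stats
}

# Usage, following the command order from this thread:
#   run_phase inject urls
#   run_phase generate -topN 20
#   run_phase fetch -all
#   run_phase parse -all
#   run_phase updatedb
```

Comparing the stats output between phases should show at which step the counts stop changing.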
Please see below for more feedback.
On Thu, Mar 27, 2014 at 8:54 AM,<[email protected]> wrote:
mapred.FileOutputCommitter - Output path is null in cleanup
What does this mean?
The above WARN can be ignored. It occurs when we commit a job and
clean up a temporary directory. This is not a problem.
What would be the command line to index a single domain, say test.com?
The exact same as it would be to index multiple domains. Your configuration
however may need some tweaking. Have you looked over the wiki documentation
on URL filters? You'll have a better idea of where in the crawl things are
going wrong once you've analyzed the crawl progress as I've mentioned
above.
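For illustration only, restricting a crawl to test.com is usually done with two rules in conf/regex-urlfilter.txt: one accepting the domain and one rejecting everything else. The pattern below is an assumption, not taken from your setup, and `accepts` is a hypothetical helper that tries candidate URLs against it with grep before you crawl.

```shell
#!/usr/bin/env bash
# Assumed rules for conf/regex-urlfilter.txt (test.com is from the question):
#   +^https?://([a-z0-9-]+\.)*test\.com/
#   -.
PATTERN='^https?://([a-z0-9-]+\.)*test\.com/'

# accepts (hypothetical helper): succeeds if the URL matches the accept rule.
# This only exercises the regex locally; it does not call Nutch.
accepts() {
  printf '%s\n' "$1" | grep -Eq "$PATTERN"
}

# Usage:
#   accepts "http://www.test.com/page" && echo "would be kept"
#   accepts "http://example.org/"      || echo "would be dropped"
```

The final `-.` rule matters: without it, URLs outside the domain are not rejected by this filter.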
Why does generate give me the same fetch list every time?
Because it would appear that these URLs are considered good for
fetching. This is more likely a mistake in your crawler configuration as
opposed to Nutch itself.
I thought Nutch would only re-index the same page once every 30 days.
My setup fetches the same pages every time I index; this seems a waste of
resources.
As I originally stated, it helps if you describe in more detail whether you
have been able to index at all. Right now it is a mystery as to
what you've actually achieved.