Hi Shane,

It really helps both you and the other users of this list if you ask
more detailed questions.
Can you please state which versions of Nutch, the gora-core and gora-sql
artifacts, and MySQL you are using?
It would seem that you've not made much progress to date, so I would
suggest wiping the data you have within your MySQL WebPage table and
starting again.
I would advise you to use the readdb tool to check the stats of the DB
after EVERY phase of the crawl.
https://wiki.apache.org/nutch/bin/nutch%20readdb
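
As a rough sketch of what I mean by checking after every phase: this
assumes a Nutch 2.x install (which the gora-sql backend suggests) with the
script at bin/nutch. The run_phase helper name is hypothetical, and the
arguments each phase accepts depend on your version.

```shell
# Hypothetical helper: run one crawl phase, then print the DB stats.
# Assumes Nutch 2.x, where readdb takes -stats without a crawldb path;
# on 1.x the form is `readdb <crawldb> -stats` instead.
NUTCH="${NUTCH:-bin/nutch}"

run_phase() {
  echo "== $* =="
  # Only print stats if the phase itself succeeded.
  "$NUTCH" "$@" && "$NUTCH" readdb -stats
}
```

You would then call, for example, `run_phase inject urls` or
`run_phase generate -topN 50` (the directory name and the topN value are
placeholders) and compare the status counts from one phase to the next.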
Please see below for more feedback.

On Thu, Mar 27, 2014 at 8:54 AM, <[email protected]> wrote:

>
> mapred.FileOutputCommitter - Output path is null in cleanup
>
> What does this mean?


The above WARN can be ignored. It occurs when we commit a job and clean up
a temporary directory. This is not a problem.


> what would be the command line too index a single domain. say test.com
>

The exact same as it would be to index multiple domains. Your configuration,
however, may need some tweaking. Have you looked over the wiki documentation
on urlfilters? You'll have a better idea of where in the crawl things are
going wrong once you've analyzed the crawl progress as I've mentioned
above.


>
> Why does generate give me the same fetch list every time ?


Because it would appear that these URLs are still considered good for
fetching. This is more likely a mistake in your crawler configuration than
a problem in Nutch itself.


> i thought Nutch would only re indexed the same page once every 30 days
> my setup fetch the same pages every time i index, this seems a waist of
> resources.
>
>
As I originally stated, it helps if you describe in more detail whether you
have been able to index at all. Right now it is a mystery what you've
actually achieved.
