Hi Shane,

It really helps users of this list, and yourself, if you provide more detailed questions. Can you please state which versions of Nutch, the gora-core and gora-sql artifacts, and MySQL you are using?

It would seem that you've not made much progress to date, so I would suggest wiping the data in your MySQL WebPage table and starting again. I would advise you to use the readdb tool to check the stats of the DB after EVERY phase of the crawl.

https://wiki.apache.org/nutch/bin/nutch%20readdb

Please see below for more feedback.

On Thu, Mar 27, 2014 at 8:54 AM, <[email protected]> wrote:

> mapred.FileOutputCommitter - Output path is null in cleanup
>
> What does this mean?

The above WARN can be ignored. It occurs when a job is committed and its temporary directory is cleaned up. This is not a problem.

> what would be the command line too index a single domain. say test.com

Exactly the same as it would be to index multiple domains. Your configuration, however, may need some tweaking. Have you looked over the wiki documentation on urlfilters? You'll have a better idea of where in the crawl things are going wrong once you've analyzed the crawl progress as I've mentioned above.

> Why does generate give me the same fetch list every time ?

Because it would appear that these URLs are considered good for fetching. This is more likely a mistake in your crawler configuration as opposed to Nutch itself.

> i thought Nutch would only re indexed the same page once every 30 days
> my setup fetch the same pages every time i index, this seems a waist of
> resources.

As I originally stated, it helps if you describe in more detail whether you have been able to index at all. Right now it is a mystery as to what you've actually achieved.
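
On the single-domain question, it may help to see how urlfilter-style rules decide which URLs survive. Below is a rough Python sketch, not Nutch code: it assumes rules written in the style of regex-urlfilter.txt, where a leading + accepts, a leading - rejects, the first rule whose pattern is found in the URL wins, and a URL matching no rule is dropped. The three rules, the test.com patterns and the function names are made up for illustration, so please check them against your own configuration.

```python
import re

# Hypothetical rules in the style of regex-urlfilter.txt (illustration only).
RULES = [
    r"-\.(gif|jpg|png|css|js)$",               # skip common static assets
    r"+^https?://([a-z0-9-]+\.)*test\.com/",   # keep test.com and its subdomains
    r"-.",                                     # reject everything else
]


def compile_rules(lines):
    """Turn '+pattern' / '-pattern' lines into (accept, compiled_regex) pairs."""
    compiled = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # ignore blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        compiled.append((sign == "+", re.compile(pattern)))
    return compiled


def accepts(url, rules):
    """Return True if the first rule found in the URL is an accept rule."""
    for accept, regex in rules:
        if regex.search(url):
            return accept
    return False  # no rule matched: treat the URL as rejected


if __name__ == "__main__":
    rules = compile_rules(RULES)
    for url in (
        "http://test.com/",
        "http://www.test.com/page.html",
        "http://test.com/logo.png",
        "http://example.org/",
    ):
        print(url, "->", "keep" if accepts(url, rules) else "drop")
```

Because the first matching rule wins, the order matters: the catch-all reject has to come last, otherwise it would shadow the accept rule for test.com.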

