Usage Scenarios

John Lafitte Mon, 17 Mar 2014 19:45:27 -0700

We are just starting out using nutch and solr but I have a couple of issues
I can't find any answers for.


1. Some of the HTML files we index are UTF-8 and contain a BOM.  Nutch
seems to capture it and store it as the strange characters "ï»¿".  I can
fix it by removing the BOM, and indexchecker confirms the page is then
indexed without those strange characters.  Is there a way to prevent this
from happening without modifying all of the HTML files that contain it?

2. Often a URL gets updated and we want to recrawl/index a specific URL on
demand.  I see no way to do this currently without deleting the crawl
directory and starting over.  What is the proper way to handle this
situation?
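
To illustrate the first issue, here is a minimal Python sketch (not Nutch
code) of what I believe is happening: the three UTF-8 BOM bytes EF BB BF
show up as "ï»¿" when they are read as a single-byte encoding such as
Latin-1.  The helper names strip_bom and decode_html and the sample page
are made up for this example; the assumption is that dropping those bytes
before the content is decoded would avoid the problem.

```python
import codecs

def strip_bom(raw: bytes) -> bytes:
    """Return raw without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):]
    return raw

def decode_html(raw: bytes) -> str:
    """Decode UTF-8 bytes; the 'utf-8-sig' codec skips a leading BOM."""
    return raw.decode("utf-8-sig")

# Hypothetical sample page that starts with a BOM.
sample = codecs.BOM_UTF8 + b"<html><title>Test</title></html>"

# Reading the BOM bytes as Latin-1 yields the three strange characters.
mojibake = sample.decode("latin-1")[:3]

clean_bytes = strip_bom(sample)
clean_text = decode_html(sample)
```

Both helpers leave content without a BOM unchanged, so they are safe to
apply to every file.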

These are somewhat related because, even though I can go through the files
and manually remove the BOM, I can't figure out how to have nutch reindex
them.  We are using nutch 1.7 but I have patched a few things and would be
happy to upgrade if it fixes any of this.

Thanks in advance for help.
