We are just starting out with Nutch and Solr, and I have a couple of issues I can't find answers for.

1. Some of the HTML files we index are UTF-8 and contain a BOM. Nutch seems to capture it and store it as the strange characters "". If I remove the BOM from a file, indexchecker confirms that those characters are no longer indexed. Is there a way to prevent this without modifying every HTML file that contains a BOM?

2. Often a URL gets updated and we want to recrawl and reindex that specific URL on demand. I currently see no way to do this without deleting the crawl directory and starting over. What is the proper way to handle this situation?

These are somewhat related: even though I can go through the files and manually remove the BOM, I can't figure out how to have Nutch reindex them.

We are using Nutch 1.7, though I have patched a few things, and would be happy to upgrade if that fixes any of this.

Thanks in advance for any help.
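
P.S. In case it helps with diagnosing issue 1: my guess is that the strange characters are the three UTF-8 BOM bytes (EF BB BF) being decoded as ISO-8859-1 somewhere along the way. The standalone Java sketch below is not Nutch code; the class name BomDemo and its helper methods are made up for illustration. It only reproduces the symptom and shows the kind of stripping I would like to happen automatically.

```java
import java.nio.charset.StandardCharsets;

public class BomDemo {
    // The three bytes that make up a UTF-8 byte order mark.
    static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    // Decode raw bytes with the wrong charset (ISO-8859-1) to reproduce the symptom.
    static String decodeAsLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // Decode raw bytes as UTF-8; Java keeps the BOM as the character U+FEFF.
    static String decodeAsUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Drop a leading U+FEFF if present, otherwise return the text unchanged.
    static String stripBom(String text) {
        return text.startsWith("\uFEFF") ? text.substring(1) : text;
    }

    public static void main(String[] args) {
        // Build a tiny "HTML file" that starts with a BOM.
        byte[] html = "<html></html>".getBytes(StandardCharsets.UTF_8);
        byte[] withBom = new byte[UTF8_BOM.length + html.length];
        System.arraycopy(UTF8_BOM, 0, withBom, 0, UTF8_BOM.length);
        System.arraycopy(html, 0, withBom, UTF8_BOM.length, html.length);

        // Wrong charset: the BOM shows up as three visible Latin-1 characters.
        System.out.println(decodeAsLatin1(withBom));

        // Right charset plus stripping: only the markup remains.
        System.out.println(stripBom(decodeAsUtf8(withBom)));
    }
}
```

If that guess is right, the fix would belong wherever the fetched bytes are turned into text, before the content reaches the indexer.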

