Hi John,

Try freegen for the second question:
http://wiki.apache.org/nutch/bin/nutch_freegen

Remi

On Tuesday, March 18, 2014, John Lafitte <[email protected]> wrote:

> We are just starting out using nutch and solr but I have a couple of issues
> I can't find any answers for.
>
> 1. Some of the HTML files we index are UTF-8 and contain a BOM. Nutch
> seems to capture it and store it as some strange characters "". I can
> fix it by removing the BOM and indexchecker confirms it no longer will
> index it with those strange characters. Is there a way to prevent this
> from happening without modifying all of the HTML files that contain it?
>
> 2. Often a URL gets updated and we want to recrawl/index a specific URL on
> demand. I see no way to do this currently without deleting the crawl
> directory and starting over. What is the proper way to handle this
> situation?
>
> These are somewhat related because even though I can go through the files
> and manually remove the BOM I can't figure out how to have nutch reindex
> them. We are using nutch 1.7 but I have patched a few things and would be
> happy to upgrade if it fixes any of this.
>
> Thanks in advance for help.
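
For the first question, the manual workaround described above (removing the BOM from each file) could be scripted rather than done by hand. The following is only a minimal sketch under stated assumptions: it assumes the HTML files live on a local disk you can write to, and the helper names strip_utf8_bom and strip_tree and the "*.html" pattern are illustrative, not part of Nutch. It does not change how Nutch itself handles a BOM.

```python
import codecs
from pathlib import Path
from typing import List


def strip_utf8_bom(path: Path) -> bool:
    """Remove a leading UTF-8 BOM from the file at `path`.

    Returns True if a BOM was found and removed, False otherwise.
    (Hypothetical helper, not a Nutch API.)
    """
    data = path.read_bytes()
    if not data.startswith(codecs.BOM_UTF8):
        return False
    # Rewrite the file without the three BOM bytes (EF BB BF).
    path.write_bytes(data[len(codecs.BOM_UTF8):])
    return True


def strip_tree(root: str, pattern: str = "*.html") -> List[Path]:
    """Strip the BOM from every matching file under `root`.

    Returns the list of files that were actually modified.
    The default pattern is an assumption; adjust it to your site.
    """
    changed = []
    for candidate in sorted(Path(root).rglob(pattern)):
        if candidate.is_file() and strip_utf8_bom(candidate):
            changed.append(candidate)
    return changed
```

Calling strip_tree on the document root returns the paths that were rewritten, which could then serve as the list of URLs to recrawl. Files without a BOM are left untouched, so the script is safe to run repeatedly.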

