Thanks, Remi. I presume I just need my own version of the crawl script that calls freegen instead of generate?
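To make the question concrete, here is a rough, untested sketch of such a script. It assumes the usual Nutch 1.x layout (crawldb, linkdb and segments under one crawl directory), that `bin/nutch` is run from the Nutch runtime directory, and that the `freegen`, `fetch`, `parse`, `updatedb`, `invertlinks` and `solrindex` commands take the arguments shown, which should be checked against each command's usage output for the installed version. The `recrawl` and `newest_segment` names and the `run` parameter are made up for illustration.

```python
import subprocess
from pathlib import Path

NUTCH = "bin/nutch"  # assumption: invoked from the Nutch 1.x runtime directory


def newest_segment(segments_dir):
    """Return the most recent segment directory.

    Nutch names segments by timestamp (yyyyMMddHHmmss), so a lexical
    sort puts the segment that freegen just created last.
    """
    segments = sorted(p for p in Path(segments_dir).iterdir() if p.is_dir())
    if not segments:
        raise RuntimeError(f"no segments found in {segments_dir}")
    return segments[-1]


def recrawl(url_dir, crawl_dir, solr_url, run=subprocess.check_call):
    """Re-fetch and re-index only the URLs listed in the files under url_dir.

    `run` executes one command given as an argument list; it is a
    parameter so the sequence can be dry-run or logged instead.
    """
    crawl = Path(crawl_dir)
    crawldb = str(crawl / "crawldb")
    linkdb = str(crawl / "linkdb")
    segments = str(crawl / "segments")

    # freegen builds a segment straight from the URL list, bypassing
    # the crawldb scheduling that the normal generate step applies.
    run([NUTCH, "freegen", str(url_dir), segments])
    segment = str(newest_segment(segments))

    # The remaining steps mirror an ordinary crawl cycle for that segment.
    for step in (
        [NUTCH, "fetch", segment],
        [NUTCH, "parse", segment],
        [NUTCH, "updatedb", crawldb, segment],
        [NUTCH, "invertlinks", linkdb, segment],
        [NUTCH, "solrindex", solr_url, crawldb, "-linkdb", linkdb, segment],
    ):
        run(step)
```

Calling `recrawl("urls_to_refresh", "crawl", "http://localhost:8983/solr")` would run the whole cycle for the URLs listed in that directory; the directory name and Solr URL are placeholders. Passing `run=print` instead shows the later commands without executing them, although the segment lookup then picks whatever segment already exists.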
As for the BOM issue, I searched all over and only just now found that someone has already reported it, so I'll try that patch out:

https://issues.apache.org/jira/browse/NUTCH-1733

On Tue, Mar 18, 2014 at 8:18 AM, remi tassing <[email protected]> wrote:

> Hi John,
>
> Try freegen for the second question:
> http://wiki.apache.org/nutch/bin/nutch_freegen
>
> Remi
>
> On Tuesday, March 18, 2014, John Lafitte <[email protected]> wrote:
>
> > We are just starting out using nutch and solr but I have a couple of
> > issues I can't find any answers for.
> >
> > 1. Some of the HTML files we index are UTF-8 and contain a BOM. Nutch
> > seems to capture it and store it as some strange characters "". I can
> > fix it by removing the BOM and indexchecker confirms it no longer will
> > index it with those strange characters. Is there a way to prevent this
> > from happening without modifying all of the HTML files that contain it?
> >
> > 2. Often a URL gets updated and we want to recrawl/index a specific URL on
> > demand. I see no way to do this currently without deleting the crawl
> > directory and starting over. What is the proper way to handle this
> > situation?
> >
> > These are somewhat related because even though I can go through the files
> > and manually remove the BOM I can't figure out how to have nutch reindex
> > them. We are using nutch 1.7 but I have patched a few things and would be
> > happy to upgrade if it fixes any of this.
> >
> > Thanks in advance for help.

