Thanks Remi.  I presume I basically just need my own version of the crawl
script that uses freegen instead of generate?
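
A minimal sketch of what such a script might look like, assuming the usual
Nutch 1.x layout (crawl/crawldb and crawl/segments) and a directory of text
files listing the URLs to refetch, one per line. The function names and
paths are hypothetical; only the freegen, fetch, parse and updatedb
subcommands come from bin/nutch itself. The Solr indexing step is left out
because its invocation differs between Nutch versions.

```python
import subprocess
from pathlib import Path


def newest_segment(segments_dir):
    """Return the most recent segment directory.

    Segment names are timestamps, so the last name in sorted order
    is assumed to be the one freegen just created.
    """
    segments = sorted(
        (p for p in Path(segments_dir).iterdir() if p.is_dir()),
        key=lambda p: p.name,
    )
    if not segments:
        raise FileNotFoundError(f"no segments under {segments_dir}")
    return segments[-1]


def freegen_command(nutch, url_dir, crawl_dir):
    """Build the freegen call: bin/nutch freegen <inputDir> <segmentsDir>."""
    return [nutch, "freegen", str(url_dir), str(Path(crawl_dir) / "segments")]


def segment_commands(nutch, crawl_dir, segment):
    """Build the per-segment steps that follow freegen."""
    return [
        [nutch, "fetch", str(segment)],
        [nutch, "parse", str(segment)],
        [nutch, "updatedb", str(Path(crawl_dir) / "crawldb"), str(segment)],
    ]


def recrawl(nutch, url_dir, crawl_dir):
    """Run one freegen-based cycle and return the segment that was fetched."""
    subprocess.run(freegen_command(nutch, url_dir, crawl_dir), check=True)
    segment = newest_segment(Path(crawl_dir) / "segments")
    for cmd in segment_commands(nutch, crawl_dir, segment):
        subprocess.run(cmd, check=True)
    return segment
```

Calling recrawl("bin/nutch", "urls_to_refetch", "crawl") would then refetch
only the listed URLs, after which the returned segment can be indexed the
same way the regular crawl script does it.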

For the BOM issue, I searched all over for it, but just now found that
someone has already brought it up.  So I'll try that patch out.
https://issues.apache.org/jira/browse/NUTCH-1733
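
For reference, the UTF-8 BOM is the three bytes EF BB BF at the very start
of a file. As a stopgap until the patch is applied, a rough sketch of
stripping it from the raw bytes could look like the following. The helper
name is hypothetical and this is not what the NUTCH-1733 patch itself does.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM; other input is returned unchanged."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data
```

Only a BOM at the very beginning is removed, so the rest of the document
is left untouched.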


On Tue, Mar 18, 2014 at 8:18 AM, remi tassing <[email protected]> wrote:

> Hi John,
>
> Try freegen for the second question:
> http://wiki.apache.org/nutch/bin/nutch_freegen
>
> Remi
>
> On Tuesday, March 18, 2014, John Lafitte <[email protected]>
> wrote:
>
> > We are just starting out using nutch and solr but I have a couple of issues
> > I can't find any answers for.
> >
> > 1. Some of the HTML files we index are UTF-8 and contain a BOM.  Nutch
> > seems to capture it and store it as some strange characters "".  I can
> > fix it by removing the BOM and indexchecker confirms it no longer will
> > index it with those strange characters.  Is there a way to prevent this
> > from happening without modifying all of the HTML files that contain it?
> >
> > 2. Often a URL gets updated and we want to recrawl/index a specific URL on
> > demand.  I see no way to do this currently without deleting the crawl
> > directory and starting over.  What is the proper way to handle this
> > situation?
> >
> > These are somewhat related because even though I can go through the files
> > and manually remove the BOM I can't figure out how to have nutch reindex
> > them.  We are using nutch 1.7 but I have patched a few things and would be
> > happy to upgrade if it fixes any of this.
> >
> > Thanks in advance for help.
> >
>
