Re: Usage Scenarios

remi tassing Tue, 18 Mar 2014 06:20:16 -0700

Hi John,

Try freegen for the second question:
http://wiki.apache.org/nutch/bin/nutch_freegen
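
As a rough sketch of how that could look for a single updated URL: freegen reads a directory of plain-text URL lists and writes a new segment for just those URLs, so there is no need to delete the crawl directory. The helper name, the urls.txt file name, the example URL and the crawl/segments path below are all assumptions made for illustration, and the command is only assembled, not run:

```python
import tempfile
from pathlib import Path


def build_freegen_command(urls, seed_dir, segments_dir, nutch_bin="bin/nutch"):
    """Write urls to a seed file and return the freegen argument list.

    Hypothetical helper: the command is only built here, not executed.
    """
    seed_dir = Path(seed_dir)
    seed_dir.mkdir(parents=True, exist_ok=True)
    seed_file = seed_dir / "urls.txt"  # hypothetical file name
    seed_file.write_text("\n".join(urls) + "\n", encoding="utf-8")
    # freegen takes an input directory of URL lists and a segments directory.
    return [nutch_bin, "freegen", str(seed_dir), str(segments_dir)]


with tempfile.TemporaryDirectory() as tmp:
    seed_dir = Path(tmp) / "recrawl_urls"
    cmd = build_freegen_command(
        ["http://www.example.com/updated-page.html"],  # placeholder URL
        seed_dir,
        "crawl/segments",  # assumed segments location
    )
    seed_text = (seed_dir / "urls.txt").read_text(encoding="utf-8")

print(" ".join(cmd))
```

The segment that freegen produces would then presumably go through the usual fetch, parse, updatedb and indexing steps, the same as any other segment.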


Remi

On Tuesday, March 18, 2014, John Lafitte <[email protected]> wrote:

> We are just starting out using Nutch and Solr, but I have a couple of issues
> I can't find any answers for.
>
> 1. Some of the HTML files we index are UTF-8 and contain a BOM.  Nutch
> seems to capture it and store it as some strange characters "ï»¿".  I can
> fix it by removing the BOM and indexchecker confirms it no longer will
> index it with those strange characters.  Is there a way to prevent this
> from happening without modifying all of the HTML files that contain it?
>
> 2. Often a URL gets updated and we want to recrawl/index a specific URL on
> demand.  I see no way to do this currently without deleting the crawl
> directory and starting over.  What is the proper way to handle this
> situation?
>
> These are somewhat related because even though I can go through the files
> and manually remove the BOM I can't figure out how to have Nutch reindex
> them.  We are using Nutch 1.7 but I have patched a few things and would be
> happy to upgrade if it fixes any of this.
>
> Thanks in advance for help.
>
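
On the first question, the characters "ï»¿" are what the three UTF-8 BOM bytes (EF BB BF) look like when they are decoded as a single-byte encoding such as Windows-1252. The sketch below only illustrates that effect and one way to drop the BOM from raw bytes before decoding. It is plain Python, not a Nutch setting or plugin, and the function name and sample page are made up for the example:

```python
import codecs


def strip_utf8_bom(raw: bytes) -> bytes:
    """Return raw without a leading UTF-8 BOM (EF BB BF), if present."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):]
    return raw


# Hypothetical page content that starts with a UTF-8 BOM.
page = codecs.BOM_UTF8 + "<html><body>hello</body></html>".encode("utf-8")

# Decoding the BOM bytes as Windows-1252 gives the stray characters.
mojibake = codecs.BOM_UTF8.decode("cp1252")

# Option 1: remove the BOM bytes, then decode as UTF-8.
clean = strip_utf8_bom(page).decode("utf-8")

# Option 2: the "utf-8-sig" codec skips a leading BOM while decoding.
also_clean = page.decode("utf-8-sig")

print(repr(mojibake))
print(clean == also_clean)
```

Either option leaves the rest of the page untouched, so content without a BOM passes through unchanged.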
