Two questions have come up over the last week:

We use the XML output format found in the RSS tab to pipe results into
another process.  Due to volume constraints, we would like to avoid any
post-processing of the XML and just push it to its target container.
By default, Nutch wraps the search terms in the snippet with
<span class="highlight"></span> tags.  Is there a config file somewhere
to modify that output (we're looking to change it to <b></b>)?  If not,
is there somewhere else I might change it, such as the Java source for
the servlet?

Second, and I suspect I already know the answer: we need to be able to
delete offensive URLs.  In our crawl, adult or otherwise irrelevant
pages sometimes rank too high in the results, and we need to remove
them on a case-by-case basis.  Our plan is to add blacklisted URLs to
our crawl-urlfilter.txt file, so that they are effectively removed on
recrawls.  Is that the best we can do?  Is there a way to deindex them
manually, without needing to recrawl the whole URL list?
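
To make the blacklist plan concrete, the entries we have in mind would
look roughly like the sketch below.  The hosts and paths are made up for
illustration, and I'm assuming the usual rule format (a leading "-" to
exclude or "+" to include, followed by a regular expression, with the
first matching rule winning), so the exclusions would need to sit above
any accept rule.

```text
# Hypothetical blacklist entries (example hosts only).
# Exclude one specific page:
-^http://www\.example\.com/offensive-page\.html$
# Exclude an entire site, including subdomains:
-^http://([a-z0-9]*\.)*badsite\.example/

# Existing accept rule stays below the exclusions:
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
```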

Thanks,

Rob
