Hi
Your task is clear, but the solution is not simple. That is why so many
companies compete for users and try to show them relevant results.
Nutch is not clever enough to tell whether a page is an opinion or just an
advertisement, so you must teach it yourself.
First of all you
Hi
I would suggest trying the Tika parser if you are not using it already. It
might give you better-parsed HTML. Next, if you want to exclude some parts of
the text from the HTML, then you must rewrite the HTML parser and adapt it to
your needs.
To reduce the relevancy of a page, you can play with its boost value.
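
To illustrate what "excluding some parts of the text" could look like, here is
a minimal sketch. It is not Nutch or Tika code (both are Java projects); it
uses Python's standard html.parser only to keep the example short. The class
name, the function name, and the default list of skipped tags are assumptions
made up for this illustration.

```python
from html.parser import HTMLParser


class FilteringTextExtractor(HTMLParser):
    """Collects page text, ignoring everything inside the given tags."""

    def __init__(self, skip_tags=("script", "style", "nav")):
        super().__init__()
        self.skip_tags = set(skip_tags)  # hypothetical default list
        self.skip_depth = 0              # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.skip_tags:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.skip_tags and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped elements.
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html, skip_tags=("script", "style", "nav")):
    """Return the text of `html` with the skipped elements left out."""
    parser = FilteringTextExtractor(skip_tags)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

In a real Nutch setup the same idea would have to live in the HTML parser
plugin itself; which elements to skip depends entirely on your pages.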
Unfortunately, Tika currently has the same issue of not inserting
spaces between menu list items, which gives you these kinds of
concatenated results.
It's a trivial patch; I just need a few minutes of spare time :(
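
For readers unfamiliar with the problem, a rough sketch of the general idea
follows. This is not the Tika patch mentioned above, and it is written in
Python with the standard html.parser rather than in Java, purely for brevity.
The set of block-level tags and all names are assumptions for illustration.

```python
from html.parser import HTMLParser

# Assumed set of tags that should act as word boundaries in extracted text.
BLOCK_TAGS = {"li", "ul", "ol", "p", "div", "br", "tr", "td", "th",
              "h1", "h2", "h3", "h4", "h5", "h6"}


class SpacedTextExtractor(HTMLParser):
    """Extracts text, emitting a space at every block-level tag boundary."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def _boundary(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_starttag(self, tag, attrs):
        self._boundary(tag)

    def handle_endtag(self, tag):
        self._boundary(tag)

    def handle_data(self, data):
        self.parts.append(data)


def text_with_spaces(html):
    """Return the text of `html` with runs of whitespace collapsed."""
    parser = SpacedTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())
```

Without the boundary spaces, a menu such as
`<ul><li>Home</li><li>Products</li></ul>` comes out as "HomeProducts"; with
them, the items stay separate, while inline tags such as `<b>` do not split
words.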
-- Ken
On Apr 18, 2010, at 10:31pm, Alexander Aristov wrote:
Hi
I would s