Re: how to parse html files while crawling

2010-04-18 Thread Alexander Aristov
Hi, your task is clear, but the solution is not simple. That is why so many companies compete for users and try to show them relevant results. Nutch is not clever enough to work out whether a page is an opinion or just an advertisement, so you MUST teach it yourself. First of all you

Re: Weird crawl issue. Nutch picking up drop-down menu options.

2010-04-18 Thread Alexander Aristov
Hi, I would suggest you try the Tika parser if you are not using it now. It might give you better parsed HTML. Next, if you want to exclude some parts of the text from the HTML, then you must rewrite the HTML parser and adapt it to your needs. To reduce the relevancy of your page, you may play with the boost value

Re: Weird crawl issue. Nutch picking up drop-down menu options.

2010-04-18 Thread Ken Krugler
Unfortunately Tika currently has the same issue of not inserting spaces between menu list items, which gives you these types of concatenated results. It's a trivial patch, I just need a few minutes of spare time :( -- Ken On Apr 18, 2010, at 10:31pm, Alexander Aristov wrote: Hi I would s
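
For anyone following the thread, here is a rough sketch of the idea behind inserting spaces between menu list items. It is not the Tika patch and not Nutch code; it only uses Python's standard html.parser to illustrate the technique. The names SpacedTextExtractor, BREAKING_TAGS and extract_text are made up for this example, and the choice of which tags count as separators is an assumption.

```python
from html.parser import HTMLParser

# Assumption: tags that should act as word boundaries in extracted text.
# This list is illustrative, not taken from Tika or Nutch.
BREAKING_TAGS = {"option", "select", "li", "ul", "ol", "p", "div", "br", "td", "tr"}


class SpacedTextExtractor(HTMLParser):
    """Collects text content, emitting a space around block-like tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # A separator before the element keeps it apart from preceding text.
        if tag in BREAKING_TAGS:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        # A separator after the element keeps it apart from following text.
        if tag in BREAKING_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)

    def text(self):
        # Collapse runs of whitespace into single spaces and trim the ends.
        return " ".join("".join(self.parts).split())


def extract_text(html):
    """Return the visible text of an HTML fragment with items separated."""
    parser = SpacedTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    menu = "<select><option>Red</option><option>Green</option><option>Blue</option></select>"
    print(extract_text(menu))
```

Without the separators the three options would come out run together as one token, which is the kind of concatenated result described above. Inline tags such as b or a are deliberately left out of BREAKING_TAGS so that words split by inline markup are not broken apart.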