Re: Weird crawl issue. Nutch picking up drop-down menu options.

2010-04-18 Thread Ken Krugler
Unfortunately Tika currently has the same issue of not inserting
spaces between menu list items, which gives you these types of
concatenated results.

It's a trivial patch; I just need a few minutes of spare time :(
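
To illustrate the kind of fix involved, here is a minimal sketch in Python using only the standard library. It is not the Tika patch and does not use Tika's API. The class name, the helper `extract_text`, and the choice of boundary tags are all assumptions made for this example. The idea is simply to emit a space whenever certain elements (such as `option`) close, so adjacent labels stay separate tokens.

```python
from html.parser import HTMLParser

# Assumed set of tags whose end should act as a word boundary.
BOUNDARY_TAGS = {"option", "li", "td", "th", "p", "div"}


class SpacedTextExtractor(HTMLParser):
    """Collects text and adds a space after each boundary element closes."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_endtag(self, tag):
        # Without this, "<option>Ants</option><option>Bees</option>"
        # would be extracted as the single token "AntsBees".
        if tag in BOUNDARY_TAGS:
            self.parts.append(" ")

    def text(self):
        # Collapse runs of whitespace into single spaces.
        return " ".join("".join(self.parts).split())


def extract_text(html):
    parser = SpacedTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = (
    "<select><option>Ants</option><option>Bees</option>"
    "<option>Beetles</option></select>"
)
print(extract_text(sample))
```

Inline tags such as `b` are deliberately left out of the boundary set so that markup inside a word does not split it.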

-- Ken

Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






Re: Weird crawl issue. Nutch picking up drop-down menu options.

2010-04-18 Thread Alexander Aristov
Hi

I would suggest trying the Tika parser if you are not using it now. It
might give you better-parsed HTML. Next, if you want to exclude some parts
of the text from the HTML, then you must rewrite the HTML parser and adapt
it to your needs.
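
As a rough illustration of what "adapting the parser" could mean, the sketch below drops everything inside `select` elements during text extraction. It is written in Python with the standard library only and is not Nutch or Tika code. The class name, the `skip_tags` parameter, and the helper `strip_menus` are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser


class MenuSkippingExtractor(HTMLParser):
    """Collects text but ignores anything nested inside the skipped tags."""

    def __init__(self, skip_tags=("select",)):
        super().__init__()
        self.skip_tags = set(skip_tags)
        self.skip_depth = 0  # > 0 while inside a skipped element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.skip_tags:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.skip_tags and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside every skipped element.
        if self.skip_depth == 0:
            self.parts.append(data)

    def text(self):
        # Join the kept chunks and normalize whitespace.
        return " ".join(" ".join(self.parts).split())


def strip_menus(html):
    parser = MenuSkippingExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


page = (
    "<p>Site Map</p>"
    "<select><option>Ants</option><option>Bees</option></select>"
    "<p>Also see</p>"
)
print(strip_menus(page))
```

In a real Nutch setup the equivalent logic would have to live in the HTML parsing plugin; the sketch only shows the filtering idea.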

To reduce the relevancy of your page, you may play with the boost values
which are configured in nutch-site.xml. But they will apply to every crawled
page, so if you want to apply a boost to only one (or several) pages, then
you might want to look at this patch.

https://issues.apache.org/jira/browse/NUTCH-16
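
To make the "applies to every page" point concrete, here is a toy model in Python. It is not Nutch's or Lucene's scoring formula. The page names, hit counts, and boost numbers are invented for illustration. Each page's score is just the sum of per-field hit counts multiplied by a global per-field boost.

```python
def score(field_hits, boosts):
    """Toy relevance: sum of (hits in field) * (global boost for field)."""
    return sum(boosts.get(field, 1.0) * hits for field, hits in field_hits.items())


def rank(pages, boosts):
    """Return page names ordered from highest to lowest toy score."""
    return sorted(pages, key=lambda name: score(pages[name], boosts), reverse=True)


# Hypothetical hit counts for the query "map" on three pages.
pages = {
    "sitemap.html": {"title": 1, "url": 1, "content": 1},
    "range-maps.html": {"title": 1, "url": 0, "content": 4},
    "map-index.html": {"title": 0, "url": 1, "content": 1},
}

# Invented boost settings; only the URL boost differs between the two.
default_boosts = {"title": 1.5, "url": 4.0, "content": 1.0}
lowered_boosts = {"title": 1.5, "url": 0.5, "content": 1.0}

for label, boosts in (("default", default_boosts), ("lowered", lowered_boosts)):
    print(label, [(name, score(pages[name], boosts)) for name in rank(pages, boosts)])
```

Lowering the URL boost pushes the site map down, but it also lowers the score of every other page that matched on its URL. That side effect is why a per-page mechanism is the better fit when only one page should be demoted.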


Best Regards
Alexander Aristov


On 15 April 2010 21:09, tsmori  wrote:

>
> I have an old page on my site that Nutch is fetching. The results in the
> Nutch web app look like this:
>
> Site Map
> ... INSECT SYSTEMATIC RESOURCES Home : Site Map   search Resources by
> Scientific Name ... Common Name Select
>
> NameAlderfliesAntsAntlionsAphidsBarkliceBeesBeetlesBookliceBristletailsBugsButterfliesCaddisfliesCicadasCockroachesCricketsDamselfliesDobsonfliesDragonfliesEarwigsFleasFliesFishfliesFootspinnersGladiatorsGnatsGrasshoppersHangingfliesHoppersJumping_BristletailsFirebratsKatydidsLacewingsLeaf_InsectsLiceMantidsMayfliesMitesMothsMosquitoesOwlfliesPsyllidsRock_CrawlersSawfliesScale_InsectsScorpionfliesSilverfishSkippersSnakefliesSpidersSpringtailsStonefliesTermitesThripsTicksTwisted_Wing_ParasitesWalking_SticksWaspsWebspinnersWhiteflies
> Site Map Also see the Taxonomic ...
>
>
> The big block of text there is the set of option values in a drop-down menu.
> What's weird is that this page has 2 drop-down menus, but Nutch only grabs
> one of them and does this kind of thing. It's an ancient page and
> unfortunately each option value is actually a URL, which I'm guessing is
> why Nutch is indexing the values, but it's odd that it only does it for
> one menu and not the other.
>
> Another problem is that in the web app, this wall of text doesn't wrap, so
> it completely messes up the formatting of our custom search application.
>
> Lastly, this page is the first hit in the list of results for a search on
> the term "map" entirely due to the page being called "sitemap". I could
> probably fix this easily by just calling it something else, but I would like
> to know the best way to handle something like this in Nutch.
>
> First, how do I get Nutch to stop pulling all the menu option values into
> the results list? Second, how do I reduce the relevancy of this page
> without changing the page itself?
> --
> View this message in context:
> http://n3.nabble.com/Weird-crawl-issue-Nutch-picking-up-drop-down-menu-options-tp721751p721751.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>