Hi Markus,

My first question is why Nutch does not correctly parse the "ü" in the URL (this is the first example).

My second question is where Nutch found the URL "http://www.mydomain.com/LIGJE/601270007/11.1.1.e". I looked at the page where Nutch is supposed to have found this URL, and all I see there is the phrase (some text "11.1.1.e" some text) inside a <span> tag. It seems that Nutch treats it as a link.

Any idea how to correct this? Any help will be appreciated.

Marseld

-----Original Message-----
From: Markus Jelsma [mailto:[email protected]]
Sent: Monday, June 20, 2011 12:43 AM
To: [email protected]
Cc: Marseld Dedgjonaj
Subject: Re: Problem in nutch parsing.

What is your question? If you have any, provide more details of what you expect, what your configuration is (url filters?) and what you're getting.

> Hello everybody,
>
> I use nutch-1.2 to crawl my website.
>
> I see some URLs in the fetched list that don't exist on the website.
>
> Examples:
>
> http://www.mydomain.com/LAJME_GOSSIP_CATEGORY/1105130159/Article-Gisele-B├Æ
> ╞├ Γ¼├ÆΓ¼á├óΓ¼Γ‑ó├Æ╞├óΓ¼┬á├Æ┬ó├óΓ¼a┬¼├óΓ¼~┬ó├Æ╞├
> Γ¼"├Æ┬ó├óΓ¼a┬¼├&┬í├Æ╞├óΓ¼┼í├ÆΓ¼a├┬╝ndchen-eshte-ende-modelja-me-e-paguar-.
> aspx"
>
> http://www.mydomain.com/LIGJE/601270007/11.1.1.e
>
> I think nutch is not parsing correctly in this case.
>
> Thanks in advance.
>
> Best Regards,
>
> Marseld
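
For anyone reading this thread later: the garbage in the first example looks like what a single "ü" becomes when its UTF-8 bytes are decoded with the wrong character set more than once. The Python sketch below only illustrates that general effect. It is not Nutch code, and it is an assumption, not a confirmed diagnosis, that something similar happens between the page, the crawler and the fetched list. The function names `misdecode_rounds` and `percent_encode_segment` are made up for this example, and `latin-1` is just one possible wrong charset.

```python
from urllib.parse import quote


def misdecode_rounds(text, rounds, wrong_codec="latin-1"):
    """Return the text after 0..rounds cycles of
    'encode as UTF-8, then decode with the wrong codec'.

    Hypothetical helper for illustration only; the wrong codec is an assumption.
    """
    stages = [text]
    for _ in range(rounds):
        # Each cycle turns every non-ASCII character into two characters.
        text = text.encode("utf-8").decode(wrong_codec)
        stages.append(text)
    return stages


def percent_encode_segment(segment):
    """Percent-encode one URL path segment using its UTF-8 bytes."""
    return quote(segment, safe="")


if __name__ == "__main__":
    # Show how a single character grows with every wrong decode.
    for i, stage in enumerate(misdecode_rounds("ü", 3)):
        print(i, len(stage), repr(stage))
    # The unambiguous, ASCII-only form of the same path segment.
    print(percent_encode_segment("Article-Gisele-Bündchen"))
```

If the site emitted the link in its percent-encoded form (the last line printed), there would be no non-ASCII bytes left for any component to misinterpret.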

