Hmm, Nutch should parse that URL without a problem. I've crawled Wikipedia
and had no trouble downloading, parsing, and indexing non-Latin URLs.

You can test parsing with bin/nutch org.apache.nutch.parse.ParserChecker 
<URL> to see how Nutch parses the page and which outlinks it detects.
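
When comparing the outlinks ParserChecker reports against what you expect, it can help to know what a correctly encoded URL with "ü" looks like. The following is a rough Python sketch, not Nutch code. The path is a hypothetical one modelled on the example in this thread, the spelling "Bündchen" is my assumption, and the helper names are made up for illustration. It also shows one common way garbage like the quoted URL can arise (UTF-8 bytes repeatedly decoded with the wrong charset). Whether that is what happens in your crawl is only a guess.

```python
from urllib.parse import quote, unquote


def percent_encode_path(path: str) -> str:
    """Percent-encode a URL path as UTF-8, leaving '/' separators intact."""
    return quote(path, safe="/")


def misdecode(text: str, rounds: int = 1) -> str:
    """Simulate mojibake: encode as UTF-8, then wrongly decode as Latin-1.

    Each round roughly doubles the number of characters produced by every
    non-ASCII character, which is why the garbage grows so long.
    """
    for _ in range(rounds):
        text = text.encode("utf-8").decode("latin-1")
    return text


if __name__ == "__main__":
    # Hypothetical path; the real one on your site may differ.
    path = "/LAJME_GOSSIP_CATEGORY/1105130159/Article-Gisele-Bündchen.aspx"

    encoded = percent_encode_path(path)
    print(encoded)           # the "ü" becomes %C3%BC
    print(unquote(encoded))  # round-trips back to the original path

    # One wrong decode turns a single character into two;
    # further rounds keep multiplying the garbage.
    for rounds in (1, 2, 3):
        print(rounds, len(misdecode("ü", rounds)))
```

If the URL in your fetch list contains long runs of accented or box-drawing characters where a single "ü" should be, checking the charset your pages declare, and the charset of the tool you use to view the list, is a reasonable first step.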

On Monday 20 June 2011 13:02:15 Marseld Dedgjonaj wrote:
> Hi Markus,
> My first question is: why does Nutch not correctly parse the "ü" in the
> URL (this is the first example)? My second question is: where did Nutch
> find the URL "http://www.mydomain.com/LIGJE/601270007/11.1.1.e"? On the
> page where Nutch supposedly found this URL, I only see the phrase (some
> text "11.1.1.e" some text) inside a <span> tag. It seems that Nutch
> treats it as a link.
> 
> Any idea how to correct this?
> 
> Any help will be appreciated.
> Marseld
> 
> 
> -----Original Message-----
> From: Markus Jelsma [mailto:[email protected]]
> Sent: Monday, June 20, 2011 12:43 AM
> To: [email protected]
> Cc: Marseld Dedgjonaj
> Subject: Re: Problem in nutch parsing.
> 
> What is your question? If you have any, provide more details of what you
> expect, what your configuration is (url filters?) and what you're getting.
> 
> > Hello everybody,
> > 
> > I use nutch-1.2 to crawl my website.
> > 
> > I see some URLs in the fetched list that don't exist on the website.
> > 
> > 
> > 
> > Examples:
> > 
> > http://www.mydomain.com/LAJME_GOSSIP_CATEGORY/1105130159/Article-Gisele-B
> > ├Æ ╞├ Γ¼├ÆΓ¼á├óΓ¼Γ‑ó├Æ╞├óΓ¼┬á├Æ┬ó├óΓ¼a┬¼├óΓ¼~┬ó├Æ╞├
> > Γ¼"├Æ┬ó├óΓ¼a┬¼├&┬í├Æ╞├óΓ¼┼í├ÆΓ¼a├┬╝ndchen-eshte-ende-modelja-me-e-paguar-
> > . aspx"
> > 
> > http://www.mydomain.com/LIGJE/601270007/11.1.1.e
> > 
> > 
> > 
> > I think nutch is not parsing correctly in this case.
> > 
> > 
> > 
> > Thanks in advance.
> > 
> > Best Regards,
> > 
> > Marseld

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
