Thank you, Markus.
I tested my two examples with ParserChecker and, as far as I can see, I have two 
separate issues.

First issue: When I run ParserChecker on the page that contains the URL of the 
first example, it parses the URL correctly and encodes the "ü".
But the issue still remains. What causes the URL to be deformed into: " 
http://www.mydomain.com/LAJME_GOSSIP_CATEGORY/1105130159/Article-Gisele-B
├Æ ╞├ 
Γ¼├ÆΓ¼á├óΓ¼Γ‑ó├Æ╞├óΓ¼┬á├Æ┬ó├óΓ¼a┬¼├óΓ¼~┬ó├Æ╞├Γ¼"├Æ┬ó├óΓ¼a┬¼├&┬í├Æ╞├óΓ¼┼í├ÆΓ¼a├┬╝ndchen-eshte-ende-modelja-me-e-paguar-.
 aspx"?
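
For anyone reading this later, here is a minimal Python sketch of one common way a single "ü" can grow into a long run of garbage like the one above. It is only an illustration, not a diagnosis of what Nutch 1.2 does: it assumes the UTF-8 bytes of the URL are repeatedly re-read with the wrong charset (Latin-1 is used here as a stand-in), and the helper name misdecode is hypothetical.

```python
from urllib.parse import quote


def misdecode(text: str, rounds: int, wrong_codec: str = "latin-1") -> str:
    """Encode text as UTF-8 and decode it with the wrong codec, `rounds` times.

    Assumption: this only simulates a charset mismatch somewhere in a
    pipeline; it is not taken from the Nutch source code.
    """
    for _ in range(rounds):
        text = text.encode("utf-8").decode(wrong_codec)
    return text


# Correct handling: "ü" is two bytes in UTF-8 and is percent-encoded as such.
print(quote("ü"))

# Each bad round doubles the number of characters the "ü" has turned into.
for n in range(1, 5):
    print(n, len(misdecode("ü", n)))
```

If the deformed URL gets longer each time the page is re-crawled, repeated re-encoding of this kind would be one thing worth checking.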

Second issue: When I run ParserChecker on the page that contains the URL of the 
second example, it lists the URL of the second example under "outlink:". I 
checked the page source and this URL does not appear in it. Why does Nutch 
create this URL? Is there a way to fix it?
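
As a hedged illustration of how such a URL could come into existence without appearing in the page source: if a parser treats the bare text "11.1.1.e" as a relative link, standard relative-URL resolution against the page address produces exactly the reported outlink. The sketch below uses Python's urljoin; the page name page.aspx and the helper name resolve_candidate are hypothetical, and whether Nutch actually picks up the text this way is an assumption.

```python
from urllib.parse import urljoin


def resolve_candidate(page_url: str, candidate: str) -> str:
    """Resolve a string that was taken to be a relative link against the page URL."""
    return urljoin(page_url, candidate)


# Hypothetical page address; only the directory part comes from the thread.
page = "http://www.mydomain.com/LIGJE/601270007/page.aspx"

# The text found inside the <span> is resolved like any relative href.
print(resolve_candidate(page, "11.1.1.e"))
```

So the URL would never need to be written out in full anywhere in the HTML for it to show up as an outlink.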

Thank you again Markus.
Marseldi


-----Original Message-----
From: Markus Jelsma [mailto:[email protected]] 
Sent: Monday, June 20, 2011 1:17 PM
To: [email protected]
Cc: Marseld Dedgjonaj
Subject: Re: Problem in nutch parsing.

Hmm, Nutch should parse that URL without a problem. I've done crawls of 
Wikipedia and had no trouble downloading, parsing, and indexing non-Latin 
URLs.

You can test parsing using bin/nutch org.apache.nutch.parse.ParserChecker 
<URL> to see how it parses and how it detects outlinks.

On Monday 20 June 2011 13:02:15 Marseld Dedgjonaj wrote:
> Hi Markus,
> My question no. 1 is: why does Nutch not correctly parse the "ü" in the
> URL? (This is the first example.) My question no. 2 is: where did Nutch
> find this URL, "http://www.mydomain.com/LIGJE/601270007/11.1.1.e"? Looking
> at the page where Nutch supposedly found this URL, I see that the
> phrase (some text "11.1.1.e" some text) sits inside a <span> tag. It seems
> that Nutch treats it as a link.
> 
> Any idea how to correct this?
> 
> Any help will be appreciated.
> Marseld
> 
> 
> -----Original Message-----
> From: Markus Jelsma [mailto:[email protected]]
> Sent: Monday, June 20, 2011 12:43 AM
> To: [email protected]
> Cc: Marseld Dedgjonaj
> Subject: Re: Problem in nutch parsing.
> 
> What is your question? If you have any, provide more details of what you
> expect, what your configuration is (url filters?) and what you're getting.
> 
> > Hello everybody,
> > 
> > I use nutch-1.2 to crawl my website.
> > 
> > I see some URLs in the fetched list that don't exist on the website.
> > 
> > 
> > 
> > Examples:
> > 
> > http://www.mydomain.com/LAJME_GOSSIP_CATEGORY/1105130159/Article-Gisele-B
> > ├Æ ╞├ Γ¼├ÆΓ¼á├óΓ¼Γ‑ó├Æ╞├óΓ¼┬á├Æ┬ó├óΓ¼a┬¼├óΓ¼~┬ó├Æ╞├
> > Γ¼"├Æ┬ó├óΓ¼a┬¼├&┬í├Æ╞├óΓ¼┼í├ÆΓ¼a├┬╝ndchen-eshte-ende-modelja-me-e-paguar-
> > . aspx"
> > 
> > http://www.mydomain.com/LIGJE/601270007/11.1.1.e
> > 
> > 
> > 
> > I think nutch is not parsing correctly in this case.
> > 
> > 
> > 
> > Thanks in advance.
> > 
> > Best Regards,
> > 
> > Marseld
> > 
> > 
> > 
> > 
> > 
> > 
> > <p class="MsoNormal"><span style="color: rgb(31, 73, 125);">Gjeni
> > <b>Pun&euml; t&euml; Mir&euml;</b> dhe <b>t&euml; Mir&euml; p&euml;r
> > Pun&euml;</b>... Vizitoni: <a target="_blank"
> > href="http://www.punaime.al/">www.punaime.al</a></span></p> <p><a
> > target="_blank" href="http://www.punaime.al/"><span
> > style="text-decoration: none;"><img width="165" height="31" border="0"
> > alt="punaime" src="http://www.ikub.al/images/punaime.al_small.png"
> > /></span></a></p>
> 

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

