RE: Problem in nutch parsing.

Marseld Dedgjonaj Sun, 26 Jun 2011 08:31:05 -0700

Hi Markus,
Thank you very much for your time, and sorry for my late response. (I was out of
the office for a few days.)


The link below is the page where the problem occurs.

http://www.ikub.al/LIGJE/601270007/default.aspx

Open this link, view the page source, and search for the pattern "11.1.1.e". You
will see that no URL containing it is found.
But if you parse the page, the outlinks include a link like this:
"http://www.mydomain.com/LIGJE/601270007/11.1.1.e"
What I expect is that parsing this page does not produce links to pages that
don't exist and queue them for the next fetch.
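
The manual check described above (search the page source for the pattern and see whether it appears in any link) can also be scripted. This is a minimal sketch in Python, assuming the page HTML has already been saved into a string. The names `LinkCollector` and `locate_pattern` are hypothetical, used only for illustration, and are not part of Nutch. It only tells you where the pattern occurs in the markup; it does not reproduce how Nutch extracts outlinks.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href/src attribute values and text content separately."""

    def __init__(self):
        super().__init__()
        self.links = []        # values of href= and src= attributes
        self.text_chunks = []  # text between tags (includes <script> bodies)

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)

    def handle_data(self, data):
        self.text_chunks.append(data)


def locate_pattern(html, pattern):
    """Return (links containing pattern, whether pattern occurs in text)."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    in_links = [link for link in collector.links if pattern in link]
    in_text = any(pattern in chunk for chunk in collector.text_chunks)
    return in_links, in_text


# Hypothetical sample resembling the case described: the pattern sits in
# plain text inside a <span>, not in any link attribute.
sample = (
    '<p><span>some text 11.1.1.e some text</span> '
    '<a href="/LIGJE/other.aspx">x</a></p>'
)
in_links, in_text = locate_pattern(sample, "11.1.1.e")
print("links containing pattern:", in_links)
print("pattern found in text:", in_text)
```

If the list of links is empty while the pattern is found in text, the page itself does not link to that URL, which matches what I see in the page source.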

Marseldi



-----Original Message-----
From: Markus Jelsma [mailto:[email protected]] 
Sent: Monday, June 20, 2011 6:46 PM
To: Marseld Dedgjonaj
Cc: [email protected]
Subject: Re: Problem in nutch parsing.



On Monday 20 June 2011 18:24:39 Marseld Dedgjonaj wrote:
> Thank you Markus.
> I tested my two examples with ParserChecker and, as I see it, I have two
> separate issues.
> 
> First one: When I run ParserChecker on the page that contains the URL from the
> first example, it parses it correctly and encodes "ü". But the issue
> still remains. What causes the deformation of the URL to: "
> http://www.mydomain.com/LAJME_GOSSIP_CATEGORY/1105130159/Article-Gisele-B
> ├Æ ╞├
> Γ¼├ÆΓ¼á├óΓ¼Γ‑ó├Æ╞├óΓ¼┬á├Æ┬ó├óΓ¼a┬¼├óΓ¼~┬ó├Æ╞├Γ¼"├Æ┬ó├óΓ¼a┬¼├&┬í├Æ╞├óΓ¼┼í├Æ
> Γ¼a├┬╝ndchen-eshte-ende-modelja-me-e-paguar-. aspx"?
> 
> And the second issue: I ran ParserChecker on the page that contains the URL
> from the second example, and it lists that URL under "outlink:". I checked
> the page source and this URL is not in it. Why
> does Nutch create this URL? Is there a way to fix it?

I don't know about your document source. Perhaps you can publish a simple test 
page online where we can replicate this behaviour. Then please describe what 
you expect and what you get.

> 
> Thank you again Markus.
> Marseldi
> 
> 
> -----Original Message-----
> From: Markus Jelsma [mailto:[email protected]]
> Sent: Monday, June 20, 2011 1:17 PM
> To: [email protected]
> Cc: Marseld Dedgjonaj
> Subject: Re: Problem in nutch parsing.
> 
> Hmm, Nutch should parse that URL without a problem. I've done crawls of
> Wikipedia and had no trouble downloading, parsing and indexing
> non-Latin URLs.
> 
> You can test parsing using bin/nutch org.apache.nutch.parse.ParserChecker
> <URL> to see how it parses and how it detects outlinks.
> 
> On Monday 20 June 2011 13:02:15 Marseld Dedgjonaj wrote:
> > Hi Markus,
> > My question no. 1 is: why does Nutch not parse the "ü" in the URL
> > correctly? (This is the first example.) My question no. 2 is: where does
> > Nutch find this URL "http://www.mydomain.com/LIGJE/601270007/11.1.1.e"? I
> > looked at the page where Nutch supposedly found this URL and saw that
> > this phrase (some text "11.1.1.e" some text) is inside a <span> tag. It
> > seems that Nutch treats it as a link.
> > 
> > Any idea how to correct this?
> > 
> > Any help will be appreciated.
> > Marseld
> > 
> > 
> > -----Original Message-----
> > From: Markus Jelsma [mailto:[email protected]]
> > Sent: Monday, June 20, 2011 12:43 AM
> > To: [email protected]
> > Cc: Marseld Dedgjonaj
> > Subject: Re: Problem in nutch parsing.
> > 
> > What is your question? If you have any, provide more details of what you
> > expect, what your configuration is (url filters?) and what you're
> > getting.
> > 
> > > Hello everybody,
> > > 
> > > I use nutch-1.2 to crawl my website.
> > > 
> > > I see some URLs in the fetched list that don't exist on the website.
> > > 
> > > 
> > > 
> > > Examples:
> > > 
> > > http://www.mydomain.com/LAJME_GOSSIP_CATEGORY/1105130159/Article-Gisele
> > > -B ├Æ ╞├ Γ¼├ÆΓ¼á├óΓ¼Γ‑ó├Æ╞├óΓ¼┬á├Æ┬ó├óΓ¼a┬¼├óΓ¼~┬ó├Æ╞├
> > > Γ¼"├Æ┬ó├óΓ¼a┬¼├&┬í├Æ╞├óΓ¼┼í├ÆΓ¼a├┬╝ndchen-eshte-ende-modelja-me-e-pagua
> > > r- . aspx"
> > > 
> > > http://www.mydomain.com/LIGJE/601270007/11.1.1.e
> > > 
> > > 
> > > 
> > > I think nutch is not parsing correctly in this case.
> > > 
> > > 
> > > 
> > > Thanks in advance.
> > > 
> > > Best Regards,
> > > 
> > > Marseld
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > <p class="MsoNormal"><span style="color: rgb(31, 73, 125);">Gjeni
> > > <b>Pun&euml; t&euml; Mir&euml;</b> dhe <b>t&euml; Mir&euml; p&euml;r
> > > Pun&euml;</b>... Vizitoni: <a target="_blank"
> > > href="http://www.punaime.al/";>www.punaime.al</a></span></p> <p><a
> > > target="_blank" href="http://www.punaime.al/";><span
> > > style="text-decoration: none;"><img width="165" height="31" border="0"
> > > alt="punaime" src="http://www.ikub.al/images/punaime.al_small.png";
> > > /></span></a></p>
> > 

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


