Re: Problem in nutch parsing.

Markus Jelsma Tue, 28 Jun 2011 08:06:57 -0700

Hmmm, I'm not getting it with 1.4-dev. Not with parse-html or parse-tika.


bin/nutch org.apache.nutch.parse.ParserChecker http://www.ikub.al/LIGJE/601270007/default.aspx

I'm also not getting it with 1.2. The only difference is that 1.2 yields 188
outlinks, whereas 1.4-dev yields 111 with parse-html and 110 with parse-tika.
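
To see which outlinks account for a difference like that, the output of two
ParserChecker runs can be saved and compared. Below is a minimal sketch in
Python. It assumes that each outlink appears on a line containing
"outlink: toUrl: <url>"; adjust the pattern if your version prints something
else. The function names and the sample output are hypothetical and only for
illustration.

```python
import re

# Assumption: each outlink is printed on a line containing "toUrl: <url>".
OUTLINK_RE = re.compile(r"outlink:\s+toUrl:\s+(\S+)")


def extract_outlinks(parser_output):
    """Collect the set of outlink URLs found in saved ParserChecker output."""
    return set(OUTLINK_RE.findall(parser_output))


def only_in_first(first_output, second_output):
    """Return, sorted, the outlinks present in the first run but not the second."""
    return sorted(extract_outlinks(first_output) - extract_outlinks(second_output))


if __name__ == "__main__":
    # Hypothetical, shortened sample output (not taken from a real run).
    run_a = (
        "  outlink: toUrl: http://example.com/a anchor: A\n"
        "  outlink: toUrl: http://example.com/b anchor: B\n"
    )
    run_b = "  outlink: toUrl: http://example.com/a anchor: A\n"
    for url in only_in_first(run_a, run_b):
        print(url)
```

Any URL reported by only one of the runs can then be searched for in the page
source to check whether it really occurs in an href attribute.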



On Sunday 26 June 2011 17:30:28 Marseld Dedgjonaj wrote:
> Hi Markus,
> Thank you very much for your time, and sorry for my late response (I was out
> of the office for some days).
> 
> The link below is the page where the problem happens.
> 
> http://www.ikub.al/LIGJE/601270007/default.aspx
> 
> Open this link, view the page source and search for the pattern "11.1.1.e".
> You will see that no URL containing it is found. But if you parse the
> page, the outlinks include a link like this:
> "http://www.mydomain.com/LIGJE/601270007/11.1.1.e". What I expect is that
> parsing this page does not produce links to pages that don't exist for
> the next fetch.
> 
> Marseldi
> 
> 
> 
> -----Original Message-----
> From: Markus Jelsma [mailto:[email protected]]
> Sent: Monday, June 20, 2011 6:46 PM
> To: Marseld Dedgjonaj
> Cc: [email protected]
> Subject: Re: Problem in nutch parsing.
> 
> On Monday 20 June 2011 18:24:39 Marseld Dedgjonaj wrote:
> > Thank you Markus.
> > I tested my two examples with ParserChecker and, as I see it, I have two
> > separate issues.
> > 
> > First one: When I run ParserChecker on the page that contains the URL of
> > the first example, it parses it correctly and encodes "ü". But the issue
> > still remains. What causes the deformation of the URL to: "
> > http://www.mydomain.com/LAJME_GOSSIP_CATEGORY/1105130159/Article-Gisele-B
> > ├Æ ╞├
> > Γ¼├ÆΓ¼á├óΓ¼Γ‑ó├Æ╞├óΓ¼┬á├Æ┬ó├óΓ¼a┬¼├óΓ¼~┬ó├Æ╞├Γ¼"├Æ┬ó├óΓ¼a┬¼├&┬í├Æ╞├óΓ¼┼í├
> > Æ Γ¼a├┬╝ndchen-eshte-ende-modelja-me-e-paguar-. aspx"?
> > 
> > And the second issue: I ran ParserChecker on the page that contains the
> > URL of the second example, and it lists the URL of the second example
> > under "outlink:". I looked at the page source and this URL is not in it.
> > Why does Nutch create this URL? Is there a way to fix it?
> 
> I don't know about your document source. Perhaps you can publish a simple
> test page online where we can replicate this behaviour. Then please
> describe what you expect and what you get.
> 
> > Thank you again Markus.
> > Marseldi
> > 
> > 
> > -----Original Message-----
> > From: Markus Jelsma [mailto:[email protected]]
> > Sent: Monday, June 20, 2011 1:17 PM
> > To: [email protected]
> > Cc: Marseld Dedgjonaj
> > Subject: Re: Problem in nutch parsing.
> > 
> > > Hmm, Nutch should parse that URL without a problem. I've done crawls of
> > > Wikipedia and had no trouble downloading, parsing and indexing
> > > non-Latin URLs.
> > 
> > You can test parsing using bin/nutch org.apache.nutch.parse.ParserChecker
> > <URL> to see how it parses and how it detects outlinks.
> > 
> > On Monday 20 June 2011 13:02:15 Marseld Dedgjonaj wrote:
> > > Hi Markus,
> > > > My question nr. 1 is: why does Nutch not correctly parse the "ü" in
> > > > the URL (this is the first example)? My question nr. 2 is: where does
> > > > Nutch find this URL:
> > > > "http://www.mydomain.com/LIGJE/601270007/11.1.1.e"? I looked at the
> > > > page where Nutch is supposed to have found this URL and I see this
> > > > phrase (some text "11.1.1.e" some text) inside a <span> tag. It seems
> > > > that Nutch considers it a link.
> > > 
> > > Any idea how to correct this?
> > > 
> > > Any help will be appreciated.
> > > Marseld
> > > 
> > > 
> > > -----Original Message-----
> > > From: Markus Jelsma [mailto:[email protected]]
> > > Sent: Monday, June 20, 2011 12:43 AM
> > > To: [email protected]
> > > Cc: Marseld Dedgjonaj
> > > Subject: Re: Problem in nutch parsing.
> > > 
> > > What is your question? If you have any, provide more details of what
> > > you expect, what your configuration is (url filters?) and what you're
> > > getting.
> > > 
> > > > Hello everybody,
> > > > 
> > > > > I use nutch-1.2 to crawl my website.
> > > > > 
> > > > > I see some URLs in the fetched list that don't exist on the website.
> > > > 
> > > > 
> > > > 
> > > > Examples:
> > > > 
> > > > http://www.mydomain.com/LAJME_GOSSIP_CATEGORY/1105130159/Article-Gise
> > > > le -B ├Æ ╞├ Γ¼├ÆΓ¼á├óΓ¼Γ‑ó├Æ╞├óΓ¼┬á├Æ┬ó├óΓ¼a┬¼├óΓ¼~┬ó├Æ╞├
> > > > Γ¼"├Æ┬ó├óΓ¼a┬¼├&┬í├Æ╞├óΓ¼┼í├ÆΓ¼a├┬╝ndchen-eshte-ende-modelja-me-e-pag
> > > > ua r- . aspx"
> > > > 
> > > > http://www.mydomain.com/LIGJE/601270007/11.1.1.e
> > > > 
> > > > 
> > > > 
> > > > I think nutch is not parsing correctly in this case.
> > > > 
> > > > 
> > > > 
> > > > Thanks in advance.
> > > > 
> > > > Best Regards,
> > > > 
> > > > Marseld
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
