Hmm, I'm not getting it with 1.4-dev, neither with parse-html nor with parse-tika.
bin/nutch org.apache.nutch.parse.ParserChecker http://www.ikub.al/LIGJE/601270007/default.aspx

I'm also not getting it with 1.2. The only difference is that 1.2 yields 188 outlinks, whereas 1.4-dev with parse-html and parse-tika yields 111 and 110 outlinks respectively.

On Sunday 26 June 2011 17:30:28 Marseld Dedgjonaj wrote:
> Hi Markus,
> Thank you very much for your time and sorry for my late response.(I was out
> of office for some days.)
>
> Link below is the link where happens the problem.
>
> http://www.ikub.al/LIGJE/601270007/default.aspx
>
> Open this link, view page source and search for this pattern "11.1.1.e".
> You will see that no url which contains it are founded. But if you try to
> parse the link, you will get in outlinks, a link like this
> "http://www.mydomain.com/LIGJE/601270007/11.1.1.e" What I expect is: when
> I parse this link not to get such links for next fetch that doesn’t
> exists.
>
> Marseldi
>
> -----Original Message-----
> From: Markus Jelsma [mailto:[email protected]]
> Sent: Monday, June 20, 2011 6:46 PM
> To: Marseld Dedgjonaj
> Cc: [email protected]
> Subject: Re: Problem in nutch parsing.
>
> On Monday 20 June 2011 18:24:39 Marseld Dedgjonaj wrote:
> > Thank you Markus.
> > I tested my two examples with ParserChecker and as I see, I have to
> > separated issues.
> >
> > First one: When I run ParserChecker in the url that contains url of the
> > first example, it will parse it correctly and encode "ü". But the issue
> > still remains. What cause the deformation of the url to: "
> > http://www.mydomain.com/LAJME_GOSSIP_CATEGORY/1105130159/Article-Gisele-B
> > ├Æ ╞├
> > Γ¼├ÆΓ¼á├óΓ¼Γ‑ó├Æ╞├óΓ¼┬á├Æ┬ó├óΓ¼a┬¼├óΓ¼~┬ó├Æ╞├Γ¼"├Æ┬ó├óΓ¼a┬¼├&┬í├Æ╞├óΓ¼┼í├
> > Æ Γ¼a├┬╝ndchen-eshte-ende-modelja-me-e-paguar-. aspx"?
> >
> > And the second issue: I run the ParserChecker with url that contains url
> > of the second example, and it will list in "outlink:" the url of the
> > second example. I see the page source and this url is not in the page
> > source.
> > Why does nutch create this url? Is there a way to fix it?
>
> I don't know about your document source. Perhaps you can publish a simple
> test page online where we can replicate this behaviour. Then please
> describe what you expect and what you get.
>
> > Thank you again Markus.
> > Marseldi
> >
> > -----Original Message-----
> > From: Markus Jelsma [mailto:[email protected]]
> > Sent: Monday, June 20, 2011 1:17 PM
> > To: [email protected]
> > Cc: Marseld Dedgjonaj
> > Subject: Re: Problem in nutch parsing.
> >
> > Hmm, Nutch should parse that URL without a problem. I've done crawls of
> > wikipedia and had no trouble downloading and parsing and indexing of
> > non-latin URL's.
> >
> > You can test parsing using bin/nutch org.apache.nutch.parse.ParserChecker
> > <URL> to see how it parses and how it detects outlinks.
> >
> > On Monday 20 June 2011 13:02:15 Marseld Dedgjonaj wrote:
> > > Hi Markus,
> > > My question nr.1. is, why nutch does not parse correctly the "ü" in
> > > url.(this is in the first example) My question nr.2. is: Where does
> > > nutch found this url "
> > > http://www.mydomain.com/LIGJE/601270007/11.1.1.e". I see in the
> > > website where nutch supposed to found this url and I see that is this
> > > frase (some text "11.1.1.e" some text) inside a <span> tag. It seems
> > > that nutch consider it as a link.
> > >
> > > Any idea how to correct this?
> > >
> > > Any help will be appreciated.
> > > Marseld
> > >
> > > -----Original Message-----
> > > From: Markus Jelsma [mailto:[email protected]]
> > > Sent: Monday, June 20, 2011 12:43 AM
> > > To: [email protected]
> > > Cc: Marseld Dedgjonaj
> > > Subject: Re: Problem in nutch parsing.
> > >
> > > What is your question? If you have any, provide more details of what
> > > you expect, what your configuration is (url filters?) and what you're
> > > getting.
> > >
> > > > Hello everybody,
> > > >
> > > > I use nutch-1.2 and I use it to crawl my website.
> > > >
> > > > I see some url in fetched list that doesn’t exist in website.
> > > >
> > > > Examples:
> > > >
> > > > http://www.mydomain.com/LAJME_GOSSIP_CATEGORY/1105130159/Article-Gise
> > > > le -B ├Æ ╞├ Γ¼├ÆΓ¼á├óΓ¼Γ‑ó├Æ╞├óΓ¼┬á├Æ┬ó├óΓ¼a┬¼├óΓ¼~┬ó├Æ╞├
> > > > Γ¼"├Æ┬ó├óΓ¼a┬¼├&┬í├Æ╞├óΓ¼┼í├ÆΓ¼a├┬╝ndchen-eshte-ende-modelja-me-e-pag
> > > > ua r- . aspx"
> > > >
> > > > http://www.mydomain.com/LIGJE/601270007/11.1.1.e
> > > >
> > > > I think nutch is not parsing correctly in this case.
> > > >
> > > > Thanks in advance.
> > > >
> > > > Best Regards,
> > > >
> > > > Marseld

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
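
For anyone trying to replicate this, here is a minimal Python sketch (not part of Nutch) that flags outlinks which have no matching link in the page source. It assumes ParserChecker prints one line per outlink containing the text "outlink:", as mentioned in the thread; the class name, the function name and the sample data are hypothetical and only modelled on the report.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the raw href values of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def outlinks_without_href(parser_output, page_source, needle):
    """Return the outlink lines that contain `needle` although no
    <a href> in the page source contains it."""
    collector = HrefCollector()
    collector.feed(page_source)
    if any(needle in href for href in collector.hrefs):
        return []  # the page really links to it, nothing suspicious
    return [
        line.strip()
        for line in parser_output.splitlines()
        if "outlink:" in line and needle in line
    ]


# Hypothetical sample data: the pattern only occurs as text inside a <span>.
sample_source = (
    '<p><span>some text 11.1.1.e some text</span> '
    '<a href="/LIGJE/">Ligje</a></p>'
)
# Assumed shape of the ParserChecker output; only "outlink:" is relied on.
sample_output = (
    "outlink: toUrl: http://www.mydomain.com/LIGJE/ anchor: Ligje\n"
    "outlink: toUrl: http://www.mydomain.com/LIGJE/601270007/11.1.1.e anchor:\n"
)

print(outlinks_without_href(sample_output, sample_source, "11.1.1.e"))
```

Any line this returns is an outlink that the parser seems to have derived from something other than an ordinary anchor, which would be the place to start looking.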

