Hi,

I had the same problem and solved it with Hive: I mapped the HBase table to Hive and then wrote a small query. If you use Hive, I can help you. However, your actual problem is with the url-validation plugin. You should add it in your nutch-site.xml; it is not enabled by default.
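
A minimal sketch of what that change might look like, assuming the plugin in question is urlfilter-validator and that it is enabled through the plugin.includes property. The other plugin names in the value are only illustrative, so keep whatever your own configuration already lists:

```xml
<!-- nutch-site.xml: sketch only, not a complete configuration -->
<property>
  <name>plugin.includes</name>
  <!-- Assumption: "url-validation plugin" means urlfilter-validator. -->
  <!-- The remaining entries are placeholders; reuse your existing value. -->
  <value>protocol-http|urlfilter-(regex|validator)|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```

URL filters are applied when URLs are added or updated, so entries already in the crawldb would probably need to be re-filtered or re-crawled before the change has a visible effect.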



On 01-08-2013 13:57, A Laxmi wrote:
Is there any way to find an *inlink* of a crawled site?


On Thu, Aug 1, 2013 at 6:48 AM, A Laxmi <[email protected]> wrote:

Thanks for your help, Ahme! I would be interested in more than a
timestamp. I would like to understand how a particular URL was crawled,
or, to put it better, the sequence by which Nutch ended up with a
particular link in its crawldb.

My problem is that I found one site in the list of crawled URLs with a
horribly formatted URL, something like
"www.domainabc.com/level1/level2/level3\\\\\\\\\\\\/level4_viewid=1?".
As you can see, this link contains some backslashes for some reason. I
tried to reach that URL starting from the landing page
"www.domainabc.com/level1/level2/" but I could not find a URL with such
a bad format. So I want to know how Nutch reached that URL. Is there
some link Nutch crawled that contains the URL
"www.domainabc.com/level1/level2/level3\\\\\\\\\\\\/level4_viewid=1?"
somewhere? In what sequence did Nutch crawl, starting from a seed URL,
to arrive at such a URL? I hope I have made this clear. Please let me
know if you have any questions. Any help is much appreciated.


On Wed, Jul 31, 2013 at 9:27 PM, Ahme Emre Aladağ <[email protected]> wrote:

Hello,

Does the timestamp give you what you need? There should be a timestamp
indicating the time of the operation.




----- Original Message -----
From: "A Laxmi" <[email protected]>
To: [email protected]
Sent: Wednesday, 31 July 2013 17:55:45
Subject: Nutch 1.6 - sequence in which crawler works its way to a URL

Hello,

For example, I have a single *seed* URL, say "http://nutch.apache.org/",
and I am crawling it "n" times. At the end of the crawl, I have 1220 new
URLs generated/fetched/updated from that single seed URL. Looking at
these 1220 new URLs, I would like to know how a particular site, e.g.
"www.abc/xy.com", was crawled. A better question would be: in what
sequence did the crawler work its way to a particular URL such as
"www.abc/xy.com"?

Thanks for your help!


