How to crawl AJAX-populated pages

2012-02-28 Thread Grijesh
I need to crawl pages that are loaded using Ajax. The pagination on my pages works via Ajax, so when the site is crawled, only the landing page is crawled, not the other pages. Any help will be appreciated. - Thanx: Grijesh www.gettinhahead.co.in

Re: How to crawl AJAX-populated pages

2012-02-28 Thread remi tassing
Same question here... I have similar issues where (redirection) links are generated through JavaScript. I hope I haven't hijacked your post, as I see these issues as very similar. Remi

Re: How to crawl AJAX-populated pages

2012-02-28 Thread Lewis John Mcgibbney
Can you please provide one such URL so I can try? Thanks

Re: How to crawl AJAX-populated pages

2012-02-28 Thread Lewis John Mcgibbney
Tiny chunk of info on this topic: https://developers.google.com/webmasters/ajax-crawling/

Re: Large Shared Drive Crawl

2012-02-28 Thread webdev1977
Oh, forgot to say: no, I am not parsing while fetching. I had more problems with that, so I turned it off.

Re: Large Shared Drive Crawl

2012-02-28 Thread Markus Jelsma
"I guess I don't mind using topN as long as I can be assured that I will get ALL of the URLs crawled eventually. Do you know if that is a true statement?" That is true. The cycle will continue until all records are exhausted; you just need more cycles. Also consider using maxSegments to generate more than one segment per cycle.
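For reference, a minimal sketch of such a multi-cycle crawl loop, assuming a local 1.x layout under crawl/ (the paths, -topN value, and iteration count are placeholders, not recommendations):

    # Repeat generate/fetch/parse/updatedb until the crawldb is exhausted.
    # -topN caps each cycle; -maxNumSegments lets one generate call emit several segments.
    for i in 1 2 3; do
      bin/nutch generate crawl/crawldb crawl/segments -topN 50000 -maxNumSegments 2
      for seg in $(ls -d crawl/segments/* | tail -n 2); do
        bin/nutch fetch "$seg"
        bin/nutch parse "$seg"
        bin/nutch updatedb crawl/crawldb "$seg"
      done
    done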

Re: How to crawl AJAX-populated pages

2012-02-28 Thread Markus Jelsma
This is not implemented in Nutch and there are no tickets for it so far in Jira. Supporting this feature would need a two-way normalizer: one for normalizing incoming URLs to the _escaped_fragment_ form, and one for the other direction when indexing URLs; otherwise the non-AJAX URL is shown in search results.
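For illustration, the two directions of that mapping on a made-up URL; the sed one-liners below are only a sketch of the rewrite rule, not a Nutch normalizer plugin:

    # Crawl-time direction: turn the AJAX fragment into the crawlable form.
    echo 'http://www.example.com/products#!page=2' | sed 's/#!/?_escaped_fragment_=/'
    # -> http://www.example.com/products?_escaped_fragment_=page=2

    # Index-time direction: reverse the mapping so search results show the AJAX URL.
    echo 'http://www.example.com/products?_escaped_fragment_=page=2' | sed 's/?_escaped_fragment_=/#!/'
    # -> http://www.example.com/products#!page=2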

Re: crawldb modifications

2012-02-28 Thread Markus Jelsma
I may be missing something, but rm -r crawl/crawldb works fine here. On Tuesday 28 February 2012 07:03:39 remi tassing wrote: "What I do in this case is to erase the db, use the command mergesegs with the -filter option and then updatedb. I would love to know if there is a simpler way. Remi"
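For reference, remi's workflow corresponds roughly to the following sketch (the directory names are assumptions):

    # Drop the old crawldb, re-filter the existing segments, then rebuild the db from them.
    rm -r crawl/crawldb
    bin/nutch mergesegs crawl/segments_filtered -dir crawl/segments -filter
    bin/nutch updatedb crawl/crawldb -dir crawl/segments_filtered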

Re: Large Shared Drive Crawl

2012-02-28 Thread webdev1977
What is a reasonable number of threads? What about memory? Where is the best place to set that: in the nutch script or in one of the config files? I abandoned using distributed mode (10 slaves); it was taking WAY too long to crawl the web and share drives in my enterprise.

Re: crawldb modifications

2012-02-28 Thread remi tassing
I think he meant to remove some specific URLs, not everything.

Re: crawldb modifications

2012-02-28 Thread Markus Jelsma
In that case I suggest using the crawldbscanner tool or the new regex feature for the crawldbreader tool in trunk.
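That would look something along these lines with the crawldb reader; the exact flag depends on the trunk revision, so treat -regex here as an assumption:

    # Dump only the crawldb entries whose URL matches a regex, for inspection or selective rebuilds.
    bin/nutch readdb crawl/crawldb -dump crawldb_dump -regex 'http://www\.example\.com/.*'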

Query in nutch

2012-02-28 Thread Geetha Venu
Hi All, I have a specific requirement to crawl only specific content inside the body tag of a website. The Nutch crawler crawls all the content present in the body: the menu items, URLs, whatever data is present in the body tag. I couldn't find an option in Nutch to restrict this.

Re: Query in nutch

2012-02-28 Thread Lewis John Mcgibbney
As far as I know, Elisabeth Adler contributed a patch for exactly this on NUTCH-585 [0]. If you wish to get cracking with it, please check out the latest trunk code [1] and patch it using the blacklist_whitelist_plugin.patch Elisabeth attached to the issue. It would be excellent if you could provide feedback.
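A rough sketch of trying that patch against trunk; the repository URL and build target are assumptions about the then-current 1.x layout:

    # Check out trunk, apply Elisabeth's patch from NUTCH-585, and build a local runtime.
    svn checkout http://svn.apache.org/repos/asf/nutch/trunk nutch-trunk
    cd nutch-trunk
    patch -p0 < blacklist_whitelist_plugin.patch
    ant runtime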

Re: [nutchgora] - proposal to support distributed indexing

2012-02-28 Thread SUJIT PAL
I have updated a patch for NUTCH-945. It works locally as described in the JIRA. -sujit On Feb 23, 2012, at 10:35 PM, SUJIT PAL wrote: "Hi Lewis, OK, thanks, I will attach the patch to NUTCH-945 after I am done with it, and update this thread as well... -sujit"

[blog post] Accumulo, Nutch, and GORA

2012-02-28 Thread Jason Trost
Blog post for anyone who's interested: I cover a basic how-to for getting Nutch to use Apache Gora to store web crawl data in Accumulo. Let me know if you have any questions. Accumulo, Nutch, and GORA http://www.covert.io/post/18414889381/accumulo-nutch-and-gora --Jason

too few db_fetched

2012-02-28 Thread pepe3059
Hello, I'm Jose. I have one question and I hope you can help me. I have nutch-1.4 and I'm crawling the web of one country (mx), so I changed regex-urlfilter to add the correct regex. The second parameter I changed, in the nutch script, was the Java heap size, because of an out-of-memory error.
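For comparison, a minimal sketch of those two changes; the filter expression and heap value are illustrative, not Jose's actual settings:

    # 1) Restrict the crawl to .mx hosts in conf/regex-urlfilter.txt (add before the final catch-all rule):
    #      +^https?://([a-z0-9-]+\.)*[a-z0-9-]+\.mx(/|$)
    # 2) Raise the heap used by bin/nutch (value in MB) instead of editing the script itself:
    export NUTCH_HEAPSIZE=4000
    bin/nutch readdb crawl/crawldb -stats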

Re: [blog post] Accumulo, Nutch, and Gora

2012-02-28 Thread Enis Söztutar
Fabulous work! There are obviously a lot of local modifications needed for nutch + gora + accumulo to work, so feel free to propose these upstream to Nutch and Gora. It should feel good to run the web crawl and store the results in Accumulo. Cheers, Enis

Re: too few db_fetched

2012-02-28 Thread remi tassing
Hi Jose, we get this question very often, and the short answer, with regard to the 'stats' printout, is that everything is probably fine. For a more complete answer, please search the mailing list or Google. BTW, how did you change the heap size? I get an IOException when topN is 'too' high.

Re: How to crawl AJAX-populated pages

2012-02-28 Thread Grijesh
Thanks, Markus, for the quick reply. Currently I have to make our site crawlable by Google and other search engines. I am already looking at https://developers.google.com/webmasters/ajax-crawling/, but this is still in the development phase. - Thanx: Grijesh www.gettinhahead.co.in