Yeah, I'm currently learning Java (from scratch) and taking a crash course in Solr / Hadoop / Pig / Hive and Cloudera after reading your prior response. The result of my efforts must be a scraper and a data analysis pipeline (data munging) whose refined output ultimately populates a MySQL database (which is tied to the current site).
For hosting, I'm considering digitalocean.com for the Hadoop/Solr setup. Is this a good choice? Any recommendations? Any other tips or things I should be learning to accomplish this task?

On Wed, Mar 7, 2018 at 1:58 PM, Markus Jelsma <[email protected]> wrote:

> Hello,
>
> Yes, we have used headless browsers with and without Nutch. But I am unsure which of the mentioned challenges a headless browser is going to help solve, except for dealing with sites that serve only AJAXed web pages.
>
> Semyon is right, if you really want this, Nutch and Hadoop can be great tools for the job, but none of it is easy and you are going to need plenty of custom code. That is, of course, doable, but you also need to bring plenty of hardware, infrastructure and time to do the job.
>
> Regards,
> Markus
>
> -----Original message-----
> > From:Eric Valencia <[email protected]>
> > Sent: Wednesday 7th March 2018 21:51
> > To: [email protected]
> > Subject: Re: Need Tutorial on Nutch
> >
> > How about using nutch with a headless browser like CasperJS? Will this work? Have any of you tried this?
> >
> > On Tue, Mar 6, 2018 at 1:00 PM Markus Jelsma <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > Yes you are going to need code, and a lot more than just that, probably including dropping the 'every two hour' requirement.
> > >
> > > For your case you need either site-specific price extraction, which is easy but a lot of work for 500+ sites. Or you need a more complicated generic algorithm, which is a lot of work too. Both can be implemented as Nutch ParseFilter plugins and need Java code to run.
> > >
> > > Your next problem is daily volume, every product 12x per day for 500+ shops times many products. You can ignore bandwidth and processing, that is easy. But you are going to be blocked within a few days by at least a good amount of sites.
> > >
> > > We once built a price checker crawler too, but the client's requirement for very high interval checks could not be met easily without the use of costly proxies to avoid being blocked, hardware and network costs. They dropped the requirement.
> > >
> > > Good luck
> > > Markus
> > >
> > > -----Original message-----
> > > > From:Eric Valencia <[email protected]>
> > > > Sent: Tuesday 6th March 2018 21:17
> > > > To: [email protected]
> > > > Subject: Re: Need Tutorial on Nutch
> > > >
> > > > Yash, well, I want to monitor the price for every item in the top 500 retail websites every two hours, 24/7/365. Java is needed?
> > > >
> > > > On Tue, Mar 6, 2018 at 12:15 PM, Yash Thenuan Thenuan <[email protected]> wrote:
> > > >
> > > > > If you want simple crawling then not at all.
> > > > > But having experience with Java will help you to fulfil your personal requirements.
> > > > >
> > > > > On 7 Mar 2018 01:42, "Eric Valencia" <[email protected]> wrote:
> > > > >
> > > > > > Does this require knowing Java proficiently?
> > > > > >
> > > > > > On Tue, Mar 6, 2018 at 10:51 AM Semyon Semyonov <[email protected]> wrote:
> > > > > >
> > > > > > > Here is an unpleasant truth - there is no up to date tutorial for Nutch. To make it even more interesting, sometimes the tutorial can contradict real behavior of Nutch, because of lately introduced features/bugs. If you find such cases, please try to fix and contribute to the project.
> > > > > > >
> > > > > > > Welcome to the open source world.
> > > > > > >
> > > > > > > Though, my recommendations as a person who started with Nutch less than a year ago:
> > > > > > > 1) If you just need a simple crawl, you are in luck. Simply run crawl script or several steps according to the Nutch crawl tutorial.
> > > > > > > 2) If it is a bit more complex you start to face problems either with configuration or with bugs. Therefore, first have a look at Nutch List Archive http://nutch.apache.org/mailing_lists.html , if it doesn't work try to figure out yourself, if that doesn't work ask here or at developer list.
> > > > > > > 3) In most cases, you HAVE to open the code and fix/discover something. Nutch is a really complicated system and to understand it properly you can easily spend 2-3 months trying to get the full basic understanding of the system. It gets even worse if you don't know Hadoop. If you don't, I do recommend reading "Hadoop: The Definitive Guide", because, well, Nutch is Hadoop.
> > > > > > >
> > > > > > > Here we are, no pain, no gain.
> > > > > > >
> > > > > > > Sent: Tuesday, March 06, 2018 at 7:42 PM
> > > > > > > From: "Eric Valencia" <[email protected]>
> > > > > > > To: [email protected]
> > > > > > > Subject: Re: Need Tutorial on Nutch
> > > > > > > Thank you kindly Yash. Yes, I did try some of the tutorials actually but they seem to be missing the complete amount of steps required to successfully scrape in nutch.
> > > > > > >
> > > > > > > On Tue, Mar 6, 2018 at 10:37 AM Yash Thenuan Thenuan <[email protected]> wrote:
> > > > > > >
> > > > > > > > I would suggest starting with the documentation on nutch's website.
> > > > > > > > You can get an idea about how to start crawling and all.
> > > > > > > > Apart from that there are no proper tutorials as such.
> > > > > > > > Just start crawling; if you get stuck somewhere, try to find something related to that on Google and the nutch mailing list archives.
> > > > > > > > Ask questions if nothing helps.
> > > > > > > >
> > > > > > > > On 7 Mar 2018 00:01, "Eric Valencia" <[email protected]> wrote:
> > > > > > > >
> > > > > > > > I'm a beginner in Nutch and need the best tutorials to get started. Can you guys let me know how you would advise yourselves if starting today (like me)?
> > > > > > > >
> > > > > > > > Eric

