How about using Nutch with a headless browser like CasperJS? Would this
work? Have any of you tried it?

On Tue, Mar 6, 2018 at 1:00 PM Markus Jelsma <markus.jel...@openindex.io>
wrote:

> Hi,
>
> Yes, you are going to need code, and a lot more than just that, probably
> including dropping the 'every two hours' requirement.
>
> For your case you need either site-specific price extraction, which is
> easy but a lot of work for 500+ sites, or a more complicated generic
> algorithm, which is a lot of work too. Both can be implemented as
> Nutch ParseFilter plugins and need Java code to run.
>
> Your next problem is daily volume: every product fetched 12x per day,
> for 500+ shops, times many products per shop. You can ignore bandwidth
> and processing; that is easy. But you are going to be blocked within a
> few days by a good number of sites.
>
> We once built a price checker crawler too, but the client's requirement
> for very frequent checks could not be met easily: avoiding blocks would
> have required costly proxies, on top of hardware and network costs. They
> dropped the requirement.
>
> Good luck
> Markus
>
> -----Original message-----
> > From: Eric Valencia <ericlvalen...@gmail.com>
> > Sent: Tuesday 6th March 2018 21:17
> > To: user@nutch.apache.org
> > Subject: Re: Need Tutorial on Nutch
> >
> > Yash, well, I want to monitor the price of every item on the top 500
> > retail websites every two hours, 24/7/365. Is Java needed?
> >
> > On Tue, Mar 6, 2018 at 12:15 PM, Yash Thenuan Thenuan <
> > rit2014...@iiita.ac.in> wrote:
> >
> > > If you want simple crawling, then not at all.
> > > But experience with Java will help you to fulfil your own
> > > requirements.
> > >
> > > On 7 Mar 2018 01:42, "Eric Valencia" <ericlvalen...@gmail.com> wrote:
> > >
> > > > Does this require knowing Java proficiently?
> > > >
> > > > On Tue, Mar 6, 2018 at 10:51 AM Semyon Semyonov <
> > > > semyon.semyo...@mail.com> wrote:
> > > >
> > > > > Here is an unpleasant truth: there is no up-to-date tutorial for
> > > > > Nutch. To make it even more interesting, the tutorial can sometimes
> > > > > contradict the real behavior of Nutch because of recently introduced
> > > > > features or bugs. If you find such cases, please try to fix them and
> > > > > contribute to the project.
> > > > >
> > > > > Welcome to the open source world.
> > > > >
> > > > > Still, here are my recommendations as a person who started with
> > > > > Nutch less than a year ago:
> > > > > 1) If you just need a simple crawl, you are in luck. Simply run the
> > > > > crawl script, or the individual steps, according to the Nutch crawl
> > > > > tutorial.
> > > > > 2) If it is a bit more complex, you start to face problems with
> > > > > either configuration or bugs. So first have a look at the Nutch list
> > > > > archive, http://nutch.apache.org/mailing_lists.html . If that doesn't
> > > > > help, try to figure it out yourself; if that doesn't work, ask here
> > > > > or on the developer list.
> > > > > 3) In most cases, you HAVE to open the code and fix or discover
> > > > > something. Nutch is a really complicated system, and you can easily
> > > > > spend 2-3 months getting a full basic understanding of it. It gets
> > > > > even worse if you don't know Hadoop. If you don't, I recommend
> > > > > reading "Hadoop: The Definitive Guide", because, well, Nutch is
> > > > > Hadoop.
> > > > >
> > > > > Here we are, no pain, no gain.
> > > > >
> > > > >
> > > > >
> > > > > Sent: Tuesday, March 06, 2018 at 7:42 PM
> > > > > From: "Eric Valencia" <ericlvalen...@gmail.com>
> > > > > To: user@nutch.apache.org
> > > > > Subject: Re: Need Tutorial on Nutch
> > > > > Thank you kindly, Yash. Yes, I did actually try some of the
> > > > > tutorials, but they seem to be missing some of the steps required
> > > > > to scrape successfully in Nutch.
> > > > >
> > > > > On Tue, Mar 6, 2018 at 10:37 AM Yash Thenuan Thenuan <
> > > > > rit2014...@iiita.ac.in>
> > > > > wrote:
> > > > >
> > > > > > I would suggest starting with the documentation on Nutch's
> > > > > > website. It will give you an idea of how to start crawling.
> > > > > > Apart from that, there are no proper tutorials as such.
> > > > > > Just start crawling; if you get stuck somewhere, try to find
> > > > > > something related on Google and in the Nutch mailing list archives.
> > > > > > Ask questions if nothing helps.
> > > > > >
> > > > > > On 7 Mar 2018 00:01, "Eric Valencia" <ericlvalen...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > I'm a beginner in Nutch and need the best tutorials to get
> > > > > > started. Can you guys let me know how you would advise yourselves
> > > > > > if starting today (like me)?
> > > > > >
> > > > > > Eric
> > > > > >
> > > > >
> > > >
> > >
> >
>
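
For readers following the thread: the site-specific price extraction Markus describes could start from a per-host rule table. The sketch below is plain Java and only illustrates the idea. The host names, the regular expressions, and the class and method names are all hypothetical, and it does not use the Nutch plugin API; in a real setup this logic would presumably live inside a parse filter plugin, and a proper HTML parser would be more robust than regular expressions.

```java
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SitePriceExtractor {

    // Hypothetical per-site rules: host -> regex whose group 1 captures the price.
    // Every shop needs its own hand-written rule, which is why this approach is
    // "easy but a lot of work" for 500+ sites.
    private static final Map<String, Pattern> RULES = Map.of(
        "shop-a.example",
        Pattern.compile("<span class=\"price\">\\$([0-9]+\\.[0-9]{2})</span>"),
        "shop-b.example",
        Pattern.compile("data-price=\"([0-9]+\\.[0-9]{2})\"")
    );

    // Returns the price text for a known host, or empty if there is no rule
    // for the host or the rule does not match the page.
    public static Optional<String> extractPrice(String host, String html) {
        Pattern rule = RULES.get(host);
        if (rule == null) {
            return Optional.empty();
        }
        Matcher m = rule.matcher(html);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String html = "<div><span class=\"price\">$19.99</span></div>";
        System.out.println(
            extractPrice("shop-a.example", html).orElse("no price found"));
    }
}
```

A rule that stops matching after a site redesign returns an empty result here, which is the case a monitoring job would need to detect and report.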

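The daily volume Markus mentions can be estimated with a few lines of arithmetic. In the sketch below, the 500 shops and 12 checks per day come from the thread; the 10,000 products per shop is purely an assumed figure for illustration, since the thread only says "many products".

```java
public class CrawlVolume {

    // Total page fetches per day across all shops.
    public static long fetchesPerDay(long shops, long productsPerShop, long checksPerDay) {
        return shops * productsPerShop * checksPerDay;
    }

    // Average request rate a single shop would see, in requests per second.
    public static double fetchesPerSecondPerShop(long productsPerShop, long checksPerDay) {
        return productsPerShop * checksPerDay / 86_400.0; // 86,400 seconds per day
    }

    public static void main(String[] args) {
        long shops = 500;              // from the thread
        long checksPerDay = 12;        // every two hours
        long productsPerShop = 10_000; // ASSUMPTION: not stated in the thread

        System.out.printf("fetches per day: %d%n",
            fetchesPerDay(shops, productsPerShop, checksPerDay));
        System.out.printf("requests per second per shop: %.2f%n",
            fetchesPerSecondPerShop(productsPerShop, checksPerDay));
    }
}
```

Under that assumption each shop would see more than one request per second around the clock, which is well above the multi-second per-host delay polite crawlers typically use, so blocking is a plausible outcome.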