Re: Need Tutorial on Nutch

2018-03-07 Thread Eric Valencia
Yeah, I'm currently learning Java (from scratch) and taking a crash course in
Solr / Hadoop / Pig / Hive and Cloudera after hearing your prior response. The
result of my efforts must be the scraper and a data analysis pipeline (data
munging) that ultimately refines the output to populate a MySQL database
(which is tied to the current site).

For hosting, I'm considering digitalocean.com for the Hadoop/Solr setup.
Is this a good one?  Any recommendations?

Any other tips or things I should be learning to accomplish this task?
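
The pipeline described in this message (scrape, munge, load into a database) can be sketched roughly as follows. This is only an illustration under stated assumptions: the table layout, the sample URLs and the price strings are invented, and an in-memory SQLite database stands in for the MySQL database mentioned above so that the snippet runs without a server.

```python
import sqlite3
from decimal import Decimal


def normalize_price(raw: str) -> Decimal:
    """Turn a scraped string such as ' $1,299.00 ' into a Decimal."""
    cleaned = raw.strip().lstrip("$").replace(",", "")
    return Decimal(cleaned)


# In-memory SQLite stands in for MySQL; the schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE price_history ("
    " product_url TEXT NOT NULL,"
    " checked_at TEXT NOT NULL,"
    " price_cents INTEGER NOT NULL)"
)

# Invented sample of raw scraper output: (url, timestamp, price text).
scraped = [
    ("https://shop-a.example/item/1", "2018-03-07T14:00:00Z", " $1,299.00 "),
    ("https://shop-a.example/item/1", "2018-03-07T16:00:00Z", "$1,249.50"),
]

# Munging step: clean the price text and store it as integer cents.
rows = [(url, ts, int(normalize_price(raw) * 100)) for url, ts, raw in scraped]
conn.executemany("INSERT INTO price_history VALUES (?, ?, ?)", rows)
conn.commit()

# Most recent observation, as the site would read it.
latest = conn.execute(
    "SELECT price_cents FROM price_history ORDER BY checked_at DESC LIMIT 1"
).fetchone()[0]
print(latest)
```

Storing integer cents rather than floats is one common way to avoid rounding surprises when prices are compared later; other schemas would work too.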


RE: Need Tutorial on Nutch

2018-03-07 Thread Markus Jelsma
Hello,

Yes, we have used headless browsers with and without Nutch. But I am unsure
which of the mentioned challenges a headless browser is going to help solve,
except for dealing with sites that serve only AJAXed web pages.

Semyon is right: if you really want this, Nutch and Hadoop can be great tools
for the job, but none of it is easy and you are going to need plenty of custom
code. That is, of course, doable, but you also need to bring plenty of
hardware, infrastructure and time to do the job.

Regards,
Markus
 
 

Re: Need Tutorial on Nutch

2018-03-07 Thread Eric Valencia
How about using Nutch with a headless browser like CasperJS?  Will this
work? Have any of you tried this?


RE: Need Tutorial on Nutch

2018-03-06 Thread Markus Jelsma
Hi,

Yes, you are going to need code, and a lot more than just that, probably
including dropping the 'every two hours' requirement.

For your case you need either site-specific price extraction, which is easy
but a lot of work for 500+ sites, or a more complicated generic algorithm,
which is a lot of work too. Both can be implemented as Nutch ParseFilter
plugins and need Java code to run.
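
To make "site-specific price extraction" concrete, here is a minimal sketch of the idea: one hand-written rule per site, looked up by host name. It is written in Python only to keep it short and self-contained; inside Nutch the same logic would live in a Java parse filter plugin as described above. The host names and markup patterns are hypothetical, and real shops would each need their own rule, which is where the "a lot of work for 500+ sites" comes from.

```python
import re
from decimal import Decimal
from typing import Optional

# Hypothetical per-site rules: host name -> regex with one capture group
# for the price. Neither the hosts nor the markup come from the thread.
SITE_RULES = {
    "shop-a.example": re.compile(
        r'<span class="price">\s*\$([0-9][0-9,]*\.[0-9]{2})\s*</span>'
    ),
    "shop-b.example": re.compile(
        r'itemprop="price"\s+content="([0-9]+(?:\.[0-9]+)?)"'
    ),
}


def extract_price(host: str, html: str) -> Optional[Decimal]:
    """Return the price found by the rule for `host`, or None if the host
    has no rule or the rule does not match the page."""
    rule = SITE_RULES.get(host)
    if rule is None:
        return None
    match = rule.search(html)
    if match is None:
        return None
    return Decimal(match.group(1).replace(",", ""))


print(extract_price("shop-a.example", '<span class="price"> $1,299.00 </span>'))
print(extract_price("shop-b.example", '<meta itemprop="price" content="24.5">'))
print(extract_price("unknown.example", "<html></html>"))
```

A rule that stops matching after a site redesign returns None here instead of a wrong price, which makes broken rules easy to spot in the output.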

Your next problem is daily volume: every product 12x per day for 500+ shops
times many products. You can ignore bandwidth and processing, that is easy.
But you are going to be blocked within a few days by a good number of sites.
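
A back-of-the-envelope calculation shows why the volume leads to blocking. The figures for sites and checks per day come from the thread; the number of products per site is a made-up placeholder, since the thread only says "many products".

```python
# Rough crawl volume for "every product, every two hours, 500 sites".
SITES = 500                  # "top 500 retail websites"
CHECKS_PER_DAY = 24 // 2     # every two hours -> 12 checks per day
PRODUCTS_PER_SITE = 10_000   # hypothetical average; real catalogs vary widely

fetches_per_site_per_day = PRODUCTS_PER_SITE * CHECKS_PER_DAY
fetches_per_day = SITES * fetches_per_site_per_day

seconds_per_day = 24 * 60 * 60
requests_per_second_per_site = fetches_per_site_per_day / seconds_per_day

print(f"fetches per day, all sites: {fetches_per_day:,}")
print(f"sustained requests per second per site: {requests_per_second_per_site:.2f}")
```

With these assumed numbers the crawler would issue 60 million fetches per day and hit each shop about 1.4 times per second around the clock, which is the kind of traffic pattern a site operator is likely to notice and block.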

We once built a price checker crawler too, but the client's requirement for
very frequent checks could not be met easily without costly proxies to avoid
being blocked, plus hardware and network costs. They dropped the
requirement.

Good luck
Markus
 

Re: Need Tutorial on Nutch

2018-03-06 Thread Eric Valencia
Yash, well, I want to monitor the price for every item in the top 500
retail websites every two hours, 24/7/365.  Is Java needed?


Re: Need Tutorial on Nutch

2018-03-06 Thread Yash Thenuan Thenuan
If you want simple crawling, then not at all.
But experience with Java will help you meet your own
requirements.


Re: Need Tutorial on Nutch

2018-03-06 Thread Eric Valencia
Does this require knowing Java proficiently?


Re: Need Tutorial on Nutch

2018-03-06 Thread Semyon Semyonov
Here is an unpleasant truth: there is no up-to-date tutorial for Nutch. To
make it even more interesting, the tutorial can sometimes contradict the real
behavior of Nutch because of recently introduced features/bugs. If you find
such cases, please try to fix them and contribute to the project.

Welcome to the open source world.

Still, here are my recommendations as a person who started with Nutch less
than a year ago:
1) If you just need a simple crawl, you are in luck. Simply run the crawl
script or the individual steps according to the Nutch crawl tutorial.
2) If it is a bit more complex, you start to face problems either with
configuration or with bugs. Therefore, first have a look at the Nutch List
Archive http://nutch.apache.org/mailing_lists.html ; if that doesn't help, try
to figure it out yourself; if that doesn't work, ask here or on the developer
list.
3) In most cases, you HAVE to open the code and fix/discover something. Nutch
is a really complicated system, and you can easily spend 2-3 months just
getting a basic understanding of it. It gets even worse if you don't know
Hadoop. If you don't, I do recommend reading "Hadoop: The Definitive Guide",
because, well, Nutch is Hadoop.

Here we are, no pain, no gain.
 
 


Re: Need Tutorial on Nutch

2018-03-06 Thread Yash Thenuan Thenuan
Start with Nutch 1.x if you are having trouble. It's easier to
configure, and by following the Nutch 1.x tutorial you will be able to crawl
your first website easily.


Re: Need Tutorial on Nutch

2018-03-06 Thread Eric Valencia
Thank you kindly, Yash.  Yes, I did try some of the tutorials, but
they seem to be missing some of the steps required to
scrape successfully with Nutch.


Re: Need Tutorial on Nutch

2018-03-06 Thread Yash Thenuan Thenuan
I would suggest starting with the documentation on Nutch's website.
You can get an idea of how to start crawling and so on.
Apart from that, there are no proper tutorials as such.
Just start crawling; if you get stuck somewhere, try to find something
related to that on Google and in the Nutch mailing list archives.
Ask questions if nothing helps.

On 7 Mar 2018 00:01, "Eric Valencia"  wrote:

I'm a beginner in Nutch and need the best tutorials to get started.  Can
you guys let me know how you would advise yourselves if starting today
(like me)?

Eric