RE: get generated segments from step / fetch all empty segments
Thanks Markus, is that "enough" driven by the HDFS block size?

Edoardo, sorry for hijacking your thread. :(

On Sep 22, 2014 9:35 AM, "Markus Jelsma" wrote:
> Hi - It will only generate more segments when there are enough URLs to
> generate, combined with either topN or generate.count.mode and
> generate.max.count.
RE: get generated segments from step / fetch all empty segments
Hi - It will only generate more segments when there are enough URLs to generate, combined with either topN or generate.count.mode and generate.max.count.

-----Original message-----
> From: Meraj A. Khan
> Sent: Monday 22nd September 2014 15:33
> To: user@nutch.apache.org
> Subject: RE: get generated segments from step / fetch all empty segments
>
> Markus, I have used maxNumSegments but no luck. Is it driven by the
> size of the segment instead?
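One way to read the point about "enough URLs" is as a simple cap: a second segment only appears once the first one is full. The sketch below is a rough mental model, not Nutch's Generator code. It assumes each segment holds at most topN URLs and ignores the per-host or per-domain limits that generate.count.mode and generate.max.count add. The function name and parameters are hypothetical.

```python
import math


def expected_segments(eligible_urls: int, top_n: int, max_num_segments: int) -> int:
    """Rough estimate of how many segments one generate run can fill.

    Simplifying assumption (not taken from the Nutch source): every segment
    takes up to top_n URLs, and a new segment is started only when the
    previous one is full and max_num_segments has not been reached.
    """
    if top_n <= 0 or max_num_segments <= 0:
        raise ValueError("top_n and max_num_segments must be positive")
    if eligible_urls <= 0:
        return 0
    return min(max_num_segments, math.ceil(eligible_urls / top_n))
```

Under this model, a crawldb with fewer eligible URLs than topN yields a single segment no matter how large maxNumSegments is, which would match the "no luck" result reported earlier in the thread.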
RE: get generated segments from step / fetch all empty segments
Markus, I have used maxNumSegments but no luck. Is it driven by the size of the segment instead?

On Sep 22, 2014 9:28 AM, "Markus Jelsma" wrote:
> You can use maxNumSegments to generate more than one segment. And instead
> of passing a list of segment names around, why not just loop over the
> entire directory and move finished segments to another directory?
Re: get generated segments from step / fetch all empty segments
Thanks, I was thinking of exactly the same thing the other day, because the crawl script sometimes loses contact with the long-running fetch job and just keeps waiting for %complete updates even after the fetch job is complete, especially when the script is run in the background. That effectively terminates the script execution. I hope this can be avoided in the Oozie workflow.

On Sep 22, 2014 9:25 AM, "Edoardo Causarano" wrote:
> Hi Meraj,
>
> at the moment I’m not, but in the Generator job class the method
> “generate” does return a list of Paths, so the possibility is there
> (somehow). For now I’m concentrating on passing at least one segment name
> from one step to the other; then I’ll see if and how I can get more.
>
> Best,
> Edoardo
RE: get generated segments from step / fetch all empty segments
You can use maxNumSegments to generate more than one segment. And instead of passing a list of segment names around, why not just loop over the entire directory and move finished segments to another directory?

-----Original message-----
> From: Edoardo Causarano
> Sent: Monday 22nd September 2014 15:25
> To: user@nutch.apache.org
> Subject: Re: get generated segments from step / fetch all empty segments
>
> Hi Meraj,
>
> at the moment I’m not, but in the Generator job class the method
> “generate” does return a list of Paths, so the possibility is there
> (somehow). For now I’m concentrating on passing at least one segment name
> from one step to the other; then I’ll see if and how I can get more.
>
> Best,
> Edoardo
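The "loop over the directory and move finished segments" idea could look roughly like the sketch below. It uses the local filesystem as a stand-in for HDFS, and it assumes a segment counts as finished once it contains a parse_data subdirectory. That marker, the function name, and the directory layout are illustrative assumptions, not something stated in this thread.

```python
import shutil
from pathlib import Path
from typing import List


def archive_finished(segments_dir: Path, done_dir: Path,
                     marker: str = "parse_data") -> List[Path]:
    """Move every segment that contains `marker` from segments_dir to done_dir.

    Assumption: the presence of the `marker` subdirectory means the segment
    has been fully processed. Returns the new locations of the moved segments.
    """
    done_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    # sorted() materialises the listing first, so moving entries is safe.
    for seg in sorted(segments_dir.iterdir()):
        if seg.is_dir() and (seg / marker).is_dir():
            target = done_dir / seg.name
            shutil.move(str(seg), str(target))
            moved.append(target)
    return moved
```

After a call such as archive_finished(Path("crawl/segments"), Path("crawl/segments_done")), whatever remains in the segments directory is still pending, so the next step can process the whole directory without being handed a list of names.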
Re: get generated segments from step / fetch all empty segments
Hi Meraj,

at the moment I’m not, but in the Generator job class the method “generate” does return a list of Paths, so the possibility is there (somehow). For now I’m concentrating on passing at least one segment name from one step to the other; then I’ll see if and how I can get more.

Best,
Edoardo

On 22 september 2014 at 14:50:03, Meraj A. Khan (mera...@gmail.com) wrote:
> Hi Edoardo,
>
> How do you generate multiple segments during the generate phase?
Re: get generated segments from step / fetch all empty segments
Hi Edoardo,

How do you generate multiple segments during the generate phase?

On Sep 22, 2014 6:01 AM, "Edoardo Causarano" wrote:
> Hi all,
>
> I’m building an Oozie workflow to schedule the generate, fetch, etc.
> steps. Right now I'm trying to feed the list of generated segments into
> the following fetch stage.
>
> The “crawl” script assumes that the most recently added segment is
> un-fetched and does some hdfs shell scripting to determine its name and
> stuff this into a shell variable, but I’d like to avoid this and somehow
> feed the list of generated segments directly into the following step.
>
> I have the feeling that I could use the Oozie “capture data from action”
> option, but I think that will require fiddling with the Generator class
> source; that’s ok, but I’m a bit wary of adding custom code that may not be
> part of the core distribution. Has anyone already done something similar,
> preferably without touching the source? (e.g.
> http://qnalist.com/questions/2330221/nutch-oozie-and-elasticsearch but it
> now 404s on GitHub)
>
> Best,
> Edoardo
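For the "capture data from action" route, one possibility that leaves the Generator source untouched is a small helper script run as an Oozie shell action with capture-output enabled. Oozie then reads the action's stdout as Java-properties lines (key=value). The sketch below is only an illustration under assumptions: it scans a local directory as a stand-in for HDFS, it treats a segment as un-fetched when crawl_generate exists and crawl_fetch does not, and the key name "segments" is hypothetical.

```python
from pathlib import Path
from typing import List


def unfetched_segments(segments_dir: Path) -> List[str]:
    """List segments that were generated but not yet fetched, oldest first.

    Assumption: a segment is un-fetched when it has a crawl_generate
    subdirectory and no crawl_fetch subdirectory.
    """
    return sorted(
        str(seg)
        for seg in segments_dir.iterdir()
        if (seg / "crawl_generate").is_dir() and not (seg / "crawl_fetch").exists()
    )


def capture_output_line(key: str, values: List[str]) -> str:
    """Format values as one properties line, e.g. segments=a,b,c."""
    return key + "=" + ",".join(values)
```

A wrapper would print capture_output_line("segments", unfetched_segments(Path(segments_dir))) as its only output. A later action could then read the value with an EL expression such as ${wf:actionData('list-segments')['segments']}, where 'list-segments' is a made-up action name. Captured output is size-limited in Oozie, so a very long segment list may need a different hand-off.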