Thanks Markus, is that "enough" driven by the HDFS block size?

Edoardo, sorry for hijacking your thread. :(
On Sep 22, 2014 9:35 AM, "Markus Jelsma" <[email protected]> wrote:

> Hi - It will only generate more segments when there are enough URLs to
> generate combined with either topN or generate.count.mode and
> generate.max.count.
>
> -----Original message-----
> > From:Meraj A. Khan <[email protected]>
> > Sent: Monday 22nd September 2014 15:33
> > To: [email protected]
> > Subject: RE: get generated segments from step / fetch all empty segments
> >
> > Markus, I have used maxNumSegments but had no luck. Is it driven by the
> > size of the segment instead?
> > On Sep 22, 2014 9:28 AM, "Markus Jelsma" <[email protected]> wrote:
> >
> > > You can use maxNumSegments to generate more than one segment. And instead
> > > of passing a list of segment names around, why not just loop over the
> > > entire directory, and move finished segments to another.
> > >
> > >
> > >
> > > -----Original message-----
> > > > From:Edoardo Causarano <[email protected]>
> > > > Sent: Monday 22nd September 2014 15:25
> > > > To: [email protected]
> > > > Subject: Re: get generated segments from step / fetch all empty segments
> > > >
> > > > Hi Meraj,
> > > >
> > > > At the moment I’m not, but in the Generator job class the method
> > > > “generate” does return a list of Paths, so the possibility is there
> > > > (somehow). For now I’m concentrating on passing at least one segment
> > > > name from one step to the other; then I’ll see if and how I can get more.
> > > >
> > > >
> > > > Best,
> > > > Edoardo
> > > >
> > > >
> > > > On 22 September 2014 at 14:50:03, Meraj A. Khan ([email protected]) wrote:
> > > >
> > > > Hi Edoardo,
> > > >
> > > > How do you generate multiple segments during the generate phase?
> > > > On Sep 22, 2014 6:01 AM, "Edoardo Causarano" <[email protected]> wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > I’m building an Oozie workflow to schedule the generate, fetch, etc.
> > > > > steps. Right now I'm trying to feed the list of generated segments
> > > > > into the following fetch stage.
> > > > >
> > > > > The “crawl” script assumes that the most recently added segment is
> > > > > un-fetched and does some HDFS shell scripting to determine its name and
> > > > > stuff this into a shell variable, but I’d like to avoid this and somehow
> > > > > feed the list of generated segments directly into the following step.
> > > > >
> > > > > I have the feeling that I could use the Oozie “capture data from action”
> > > > > option, but I think that will require fiddling with the Generator class
> > > > > source; that’s OK, but I’m a bit wary of adding custom code that may not be
> > > > > part of the core distribution. Has anyone already done something similar,
> > > > > preferably without touching the source? (e.g.
> > > > > http://qnalist.com/questions/2330221/nutch-oozie-and-elasticsearch
> > > > > but it now 404s on GitHub)
> > > > >
> > > > >
> > > > > Best,
> > > > > Edoardo
> > > > >
> > > > > --
> > > > > Edoardo Causarano
> > > > > Sent with Airmail
> > > > --
> > > > Edoardo Causarano
> > > > Sent with Airmail
> > >
> >
>
