Re: Injection from webservice

2019-09-19 Thread lewis john mcgibbney
Hi Folks,
I've implemented what Dave suggested. It is clean and easy, but it may not
be quite as ad-hoc-capable as one would always want. For my use cases it
was acceptable.
More responses inline.

On Thu, Sep 19, 2019 at 2:47 PM  wrote:

> From: Jorge Betancourt 
> To: user@nutch.apache.org
[snip]


>
> My main concern is whether we want to put this additional complexity in
> Nutch. Is it really valuable to all of our users to have HTTP/DB/custom
> injectors available out of the box in a pluggable way?
>
> I would love to hear what other people have to say.
>

In all honesty, I would like to see as much of the REST logic and WebUI
extracted out of the core codebase as possible. I feel like we should have
done it this way around initially but didn't.
Considering 'separation of concerns' for Nutch is important and, Jorge,
you're spot on with your reservations.

Lewis


Re: [MASSMAIL]Re: Injection from webservice

2019-09-17 Thread Jorge Betancourt
TBH I'm not entirely sure. Downloading the file can be scripted without
much trouble. My feeling is that the Injector class has a good enough scope
already. There are valid reasons for having a custom injector (reading the
seed URLs from a DB comes to mind). When I needed a custom injector it was
for very specific requirements, and it made more sense to have a custom
injector instead of generating a seed file (this was before we had the REST
API, which now provides a nice API around the injector).

It is a valid point that we don't have an extension point for the Injector
logic which could allow for having different seed URL providers without
developers needing to worry about the specific injection logic.
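
To make the idea concrete, here is a minimal Python sketch of what pluggable
seed URL providers could look like. Nothing here is Nutch API: SeedProvider,
ListProvider, SqliteProvider and write_seed_file are hypothetical names, and
the seeds(url) table layout is an assumption. Each provider only yields URLs;
one shared function writes them, one per line, into a seed file that the
existing Injector could then read from a path.

```python
import sqlite3
from pathlib import Path
from typing import Iterable, Iterator, Protocol


class SeedProvider(Protocol):
    """Hypothetical extension point: anything that can yield seed URLs."""

    def urls(self) -> Iterator[str]: ...


class ListProvider:
    """Seeds held in memory, e.g. already fetched from a web service."""

    def __init__(self, seeds: Iterable[str]) -> None:
        self._seeds = list(seeds)

    def urls(self) -> Iterator[str]:
        return iter(self._seeds)


class SqliteProvider:
    """Seeds read from a database; assumes a table seeds(url TEXT)."""

    def __init__(self, db_path: str) -> None:
        self._db_path = db_path

    def urls(self) -> Iterator[str]:
        conn = sqlite3.connect(self._db_path)
        try:
            for (url,) in conn.execute("SELECT url FROM seeds ORDER BY url"):
                yield url
        finally:
            conn.close()


def write_seed_file(provider: SeedProvider, seed_dir: str) -> Path:
    """Write one URL per line to <seed_dir>/seed.txt, skipping blank entries."""
    out_dir = Path(seed_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    seed_file = out_dir / "seed.txt"
    with seed_file.open("w", encoding="utf-8") as fh:
        for url in provider.urls():
            url = url.strip()
            if url:
                fh.write(url + "\n")
    return seed_file
```

With this shape the injection logic stays untouched and only the source of
the URLs varies, which is the separation described above.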

My main concern is whether we want to put this additional complexity in
Nutch. Is it really valuable to all of our users to have HTTP/DB/custom
injectors available out of the box in a pluggable way?

I would love to hear what other people have to say.

Best Regards,
Jorge



Re: [MASSMAIL]Re: Injection from webservice

2019-09-16 Thread Roannel Fernandez Hernandez
Thanks Jorge for your answer. Do you think an injector that accepts local/HDFS
paths and, in addition, API endpoints could be a good improvement for Nutch?

Regards, Roannel

1519-2019: 500th anniversary of the Villa de San Cristóbal de La Habana
For Havana, the greatest. #Habana500 #UCIxHabana500



Re: Injection from webservice

2019-09-16 Thread Dave Beckstrom
Or use a scheduled wget job to pull them from the remote server and store
them on a path that Nutch can access locally.
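
As a rough illustration of this approach, the Python sketch below downloads
the remote seed list into a local directory and builds the inject command.
The function names, the "urls" directory and the "crawl/crawldb" path are
assumptions for illustration, not anything Nutch ships. It also assumes the
usual Nutch 1.x argument order, bin/nutch inject <crawldb> <url_dir>.

```python
import urllib.request
from pathlib import Path
from typing import List


def fetch_seed_file(seed_url: str, seed_dir: str, name: str = "seed.txt") -> Path:
    """Download seed_url and store it as <seed_dir>/<name>."""
    out_dir = Path(seed_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Read the whole response before touching the target, so a failed
    # download leaves the previous seed file in place. Seed lists are
    # assumed to be small enough to hold in memory.
    with urllib.request.urlopen(seed_url, timeout=30) as resp:
        data = resp.read()
    target = out_dir / name
    target.write_bytes(data)
    return target


def inject_command(crawldb: str, seed_dir: str, nutch: str = "bin/nutch") -> List[str]:
    """Build the inject command line as an argument list for subprocess."""
    return [nutch, "inject", crawldb, seed_dir]
```

Run from a scheduler such as cron, this has the same effect as the wget job:
call fetch_seed_file("http://example.org/seed", "urls"), then pass
inject_command("crawl/crawldb", "urls") to subprocess.run(..., check=True).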

Regards,

Dave Beckstrom
Technical Delivery Manager / Senior Developer
em: dbeckst...@collectivefls.com 
ph: 763.323.3499





Re: Injection from webservice

2019-09-16 Thread Jorge Betancourt
Hi Roannel,

The current implementation of the injector only accepts a path (actually an
org.apache.hadoop.fs.Path). This means that there is no way to feed it a URL
directly unless you download the content first.

If you use the REST API you can send the seed file using the API endpoint.
Otherwise, you could write your own injector with the proper logic to deal
with a list of URLs coming from a URL.

The REST API implementation just writes the content in the expected format:
https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/service/resources/SeedResource.java#L92-L113
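
For reference, my understanding is that the seed file the Injector reads is
plain text with one URL per line, optionally followed by tab-separated
key=value metadata such as nutch.score. The Python sketch below formats such
lines; format_seed_line and format_seed_file are hypothetical helper names,
not part of Nutch, and it is worth checking the linked SeedResource code for
the exact output.

```python
from typing import Dict, Iterable, Optional


def format_seed_line(url: str, metadata: Optional[Dict[str, str]] = None) -> str:
    """Return one seed-file line: the URL, then tab-separated key=value pairs."""
    url = url.strip()
    # Tabs and line breaks are the field and record separators, so a URL
    # containing them would corrupt the file.
    if not url or any(ch in url for ch in "\t\n\r"):
        raise ValueError(f"not a usable seed URL: {url!r}")
    fields = [url]
    for key, value in (metadata or {}).items():
        fields.append(f"{key}={value}")
    return "\t".join(fields)


def format_seed_file(urls: Iterable[str]) -> str:
    """Join seed lines into file content, one URL per line."""
    return "".join(format_seed_line(u) + "\n" for u in urls)
```

Writing the result of format_seed_file to a file inside a directory gives a
path that can be handed to the injector in the usual way.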

Best Regards,
Jorge

On Mon, Sep 16, 2019 at 4:59 PM Roannel Fernandez Hernandez wrote:

> Hi folks,
>
> Is there any way in Nutch 1.15 to inject a remote seed file (accessible
> via http or https)?
>
> I mean this, for instance:
>
> bin/nutch inject crawl http://example.org/seed
>
> Regards