Re: [MASSMAIL]Re: Injection from webservice

2019-09-17 Thread Jorge Betancourt
TBH I'm not entirely sure. Downloading the file can be scripted without
much trouble. My feeling is that the Injector class already has a good
enough scope. There are valid reasons for having a custom injector
(reading the seed URLs from a DB comes to mind). When I needed a custom
injector it was for very specific requirements, and it made more sense to
have a custom injector than to generate a seed file (this was before we had
a REST API, which now provides a nice API around the injector).
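
To illustrate the "script around it" option, here is a minimal sketch in
Python, not a definitive implementation. The function names, the seed.txt
file name and the temporary seed directory are all made up for this example.
It assumes Nutch runs in local mode (so a local directory is a valid
url_dir), that the remote file is UTF-8 text with one URL per line, and that
blank lines and lines starting with "#" can be dropped.

```python
import subprocess
import sys
import tempfile
import urllib.request
from pathlib import Path


def fetch_seed_text(url, timeout=30.0):
    """Download the remote seed list and return it as text (assumes UTF-8)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def write_seed_file(text, seed_dir, name="seed.txt"):
    """Write the non-empty, non-comment lines of `text` to seed_dir/name.

    The file name and the comment handling are assumptions of this sketch.
    """
    seed_dir.mkdir(parents=True, exist_ok=True)
    lines = [line.strip() for line in text.splitlines()]
    urls = [line for line in lines if line and not line.startswith("#")]
    seed_file = seed_dir / name
    seed_file.write_text("\n".join(urls) + "\n", encoding="utf-8")
    return seed_file


def build_inject_command(nutch_bin, crawldb, seed_dir):
    """Build the `inject <crawldb> <url_dir>` command line as a list."""
    return [nutch_bin, "inject", crawldb, str(seed_dir)]


def main(remote_url, crawldb, nutch_bin="bin/nutch"):
    """Download the seed list, write it locally, then run the injector."""
    with tempfile.TemporaryDirectory() as tmp:
        seed_dir = Path(tmp) / "seeds"
        write_seed_file(fetch_seed_text(remote_url), seed_dir)
        subprocess.run(build_inject_command(nutch_bin, crawldb, seed_dir),
                       check=True)


# Only run when called with exactly two arguments: <remote_url> <crawldb>
if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv[1], sys.argv[2])
```

Saved under a hypothetical name such as inject_remote.py and run from the
Nutch home directory, it would be called as
"python inject_remote.py http://example.org/seed crawl/crawldb". On a
cluster the seed directory would have to be copied to HDFS first, which this
sketch does not do.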

It is a valid point that we don't have an extension point for the Injector
logic which could allow for having different seed URL providers without
developers needing to worry about the specific injection logic.

My main concern is whether we want to put this additional complexity in
Nutch. Is it really valuable to all of our users to have HTTP/DB/custom
injectors available out of the box in a pluggable way?

I would love to hear what other people have to say.

Best Regards,
Jorge

On Mon, Sep 16, 2019 at 8:53 PM Roannel Fernandez Hernandez 
wrote:

> Thanks Jorge for your answer. Do you think an injector that accepts
> local/hdfs paths and, in addition, API endpoints could be a good
> improvement for Nutch?
>
> Regards, Roannel
>
> - Original Message -
> > From: "Jorge Betancourt" 
> > To: "user" 
> > Sent: Monday, 16 September 2019 13:14:36
> > Subject: [MASSMAIL]Re: Injection from webservice
>
> > Hi Roannel,
> >
> > The current implementation of the injector only accepts a path (actually
> > an org.apache.hadoop.fs.Path). This means that there is no way to feed a
> > URL directly unless you download the content first.
> >
> > If you use the REST API you can send the seed file using the API
> > endpoint. Otherwise, you could write your own injector with the proper
> > logic to deal with a list of URLs coming from a URL.
> >
> > The REST API implementation just writes the content in the expected
> > format (
> > https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/service/resources/SeedResource.java#L92-L113
> > )
> >
> > Best Regards,
> > Jorge
> >
> > On Mon, Sep 16, 2019 at 4:59 PM Roannel Fernandez Hernandez <roan...@uci.cu>
> > wrote:
> >
> >> Hi folks,
> >>
> >> Is there any way in Nutch 1.15 to inject a remote seed file (accessible
> >> via http or https)?
> >>
> >> I mean this, for instance:
> >>
> >> bin/nutch inject crawl http://example.org/seed
> >>
> >> Regards
> >> 1519-2019: 500th Anniversary of the Villa de San Cristóbal de La Habana
> >> For Havana, the greatest. #Habana500 #UCIxHabana500
> >>
>
>


Re: [MASSMAIL]Re: Injection from webservice

2019-09-16 Thread Roannel Fernandez Hernandez
Thanks Jorge for your answer. Do you think an injector that accepts local/hdfs
paths and, in addition, API endpoints could be a good improvement for Nutch?

Regards, Roannel

- Original Message -
> From: "Jorge Betancourt" 
> To: "user" 
> Sent: Monday, 16 September 2019 13:14:36
> Subject: [MASSMAIL]Re: Injection from webservice

> Hi Roannel,
> 
> The current implementation of the injector only accepts a path (actually an
> org.apache.hadoop.fs.Path). This means that there is no way to feed a URL
> directly unless you download the content first.
> 
> If you use the REST API you can send the seed file using the API endpoint.
> Otherwise, you could write your own injector with the proper logic to deal
> with a list of URLs coming from a URL.
> 
> The REST API implementation just writes the content in the expected format (
> https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/service/resources/SeedResource.java#L92-L113
> )
> 
> Best Regards,
> Jorge
> 
> On Mon, Sep 16, 2019 at 4:59 PM Roannel Fernandez Hernandez 
> wrote:
> 
>> Hi folks,
>>
>> Is there any way in Nutch 1.15 to inject a remote seed file (accessible
>> via http or https)?
>>
>> I mean this, for instance:
>>
>> bin/nutch inject crawl http://example.org/seed
>>
>> Regards
>>