Hi Ralph!

Although it's not an exact match for your use case, DistCp may be the closest
thing to what you want:
https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCp.java
The client builds a file list and then submits an MR job to copy over all the
files.
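
If you'd rather drive it from code than from the command line, a rough sketch
of kicking off DistCp programmatically could look like the following. Note
this is untested, the DistCpOptions API differs a bit between Hadoop versions
(newer releases use a builder), and the paths are only placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.tools.DistCp;
import org.apache.hadoop.tools.DistCpOptions;

import java.util.Collections;

public class DistCpLauncher {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder paths -- adjust to your cluster layout.
        Path source = new Path("hdfs://namenode:8020/staging/incoming");
        Path target = new Path("hdfs://namenode:8020/data/imported");

        // Hadoop 2.x style mutable options; newer versions use DistCpOptions.Builder.
        DistCpOptions options =
                new DistCpOptions(Collections.singletonList(source), target);
        options.setSyncFolder(true);   // only copy new or changed files

        // Submits the MR copy job and waits for it to finish.
        new DistCp(conf, options).execute();
    }
}

The same thing can of course be done from a cron-driven shell script with
"hadoop distcp -update <src> <dst>".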
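
For the check/compare/update loop you describe below, a bare-bones
single-process sketch against the HDFS FileSystem API might be a simpler
starting point than MapReduce. The REST-side helpers here (listNewFiles,
fetchFromRest, isUnchanged) are pure placeholders for whatever your client
actually looks like:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

import java.io.InputStream;
import java.io.OutputStream;
import java.util.Collections;
import java.util.List;

public class RestImporter {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path importDir = new Path("/data/imports");   // placeholder target directory

        for (String name : listNewFiles()) {
            Path target = new Path(importDir, name);
            if (fs.exists(target) && isUnchanged(fs, target, name)) {
                continue;   // already imported and unchanged, skip it
            }
            // create(path, true) overwrites, so it covers both "new" and "changed"
            try (InputStream in = fetchFromRest(name);
                 OutputStream out = fs.create(target, true)) {
                IOUtils.copyBytes(in, out, conf, false);
            }
        }
    }

    // --- placeholders for your REST client, not real APIs ---
    private static List<String> listNewFiles() {
        return Collections.emptyList();    // ask the REST API for new files here
    }
    private static InputStream fetchFromRest(String name) {
        throw new UnsupportedOperationException("wire up your REST client here");
    }
    private static boolean isUnchanged(FileSystem fs, Path target, String name) {
        return false;                      // e.g. compare length or a checksum
    }
}

Something like this could simply run from cron on an edge node; MR/DistCp only
really starts to pay off once the volume per run gets large.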

HTH
Ravi

On Sun, Jul 30, 2017 at 2:21 PM, Ralph Soika <ralph.so...@imixs.com> wrote:

> Hi,
>
> I want to ask: what is the best way to implement a job which imports files
> into HDFS?
>
> I have an external system offering data through a REST API. My goal is to
> have a job running in Hadoop which periodically (maybe started by cron?)
> checks the REST API for new data.
>
> It would be nice if this job could also run on multiple data nodes. But in
> contrast to all the MapReduce examples I found, my job looks for new or
> changed data from an external interface and compares it with the data
> already stored.
>
> This is a conceptual example of the job:
>
>    1. The job asks the REST API if there are new files
>    2. if so, the job imports the first file in the list
>    3. check if the file already exists
>       1. if not, the job imports the file
>       2. if yes, the job compares the data with the data already stored,
>          and if it has changed, the job updates the file
>    4. if more files exist, the job continues with step 2
>    5. otherwise it ends.
>
>
> Can anybody give me a little help on how to start (it's the first job I'm
> writing...)?
>
>
> ===
> Ralph
