Hi Ralph! Although it is not a perfect match for your use case, DistCp may be the closest thing to what you want: https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCp.java . The DistCp client builds a file list and then submits an MR job to copy all of the files over.
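From the shell that is just "hadoop distcp -update hdfs://source-nn/data hdfs://target-nn/data". If you want to drive it from Java instead, a minimal sketch looks roughly like this (this uses the Hadoop 2.x-era DistCpOptions constructor, which was replaced by a builder in Hadoop 3, and the source/target paths are placeholders):

    import java.util.Collections;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.tools.DistCp;
    import org.apache.hadoop.tools.DistCpOptions;

    public class DistCpLauncher {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Placeholder paths; DistCp builds its file list from the source.
            DistCpOptions options = new DistCpOptions(
                Collections.singletonList(new Path("hdfs://source-nn/data")),
                new Path("hdfs://target-nn/data"));
            options.setSyncFolder(true); // like -update: copy only new/changed files
            Job job = new DistCp(conf, options).execute(); // submits the MR copy job
        }
    }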
HTH
Ravi

On Sun, Jul 30, 2017 at 2:21 PM, Ralph Soika <ralph.so...@imixs.com> wrote:
> Hi,
>
> I want to ask: what's the best way to implement a job which imports
> files into HDFS?
>
> I have an external system offering data accessible through a REST API.
> My goal is to have a job running in Hadoop which periodically (maybe
> started by cron?) checks the REST API for new data.
>
> It would be nice if this job could also run on multiple data nodes. But
> unlike all the MapReduce examples I found, my job looks for new or
> changed data from an external interface and compares that data with the
> data already stored.
>
> This is a conceptual outline of the job:
>
> 1. The job asks the REST API if there are new files.
> 2. If so, the job takes the first file in the list.
> 3. It checks whether the file already exists:
>    1. If not, the job imports the file.
>    2. If yes, the job compares the data with the data already stored:
>       1. If changed, the job updates the file.
> 4. If more files exist, the job continues with 2.
> 5. Otherwise it ends.
>
> Can anybody give me a little help on how to start (it's the first job
> I'm writing...)?
>
> ===
> Ralph
>
> --
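For the conceptual steps 1-5 above, a minimal single-process sketch against the HDFS FileSystem API might look like the following. The REST endpoints (/files and /files/<name>) are made-up placeholders for whatever the external system offers, it assumes Java 9+ for readAllBytes(), and the length comparison is only a stand-in for real change detection such as checksums or timestamps:

    import java.io.InputStream;
    import java.net.URL;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class RestImporter {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // 1. Ask the REST API for the list of files (one name per line).
            String listing;
            try (InputStream in = new URL("http://api.example.com/files").openStream()) {
                listing = new String(in.readAllBytes());
            }
            // 2./4. Walk the list file by file.
            for (String name : listing.split("\n")) {
                Path target = new Path("/imports/" + name);
                URL source = new URL("http://api.example.com/files/" + name);
                long remoteLen = source.openConnection().getContentLengthLong();
                // 3.2. File exists and looks unchanged: skip it.
                if (fs.exists(target) && fs.getFileStatus(target).getLen() == remoteLen) {
                    continue;
                }
                // 3.1/3.2.1. Import the new file, or overwrite the changed one.
                try (InputStream in = source.openStream()) {
                    IOUtils.copyBytes(in, fs.create(target, true), 4096, true);
                }
            }
        }
    }

A cron-started driver like this covers the sequential case; once the file list gets large, the copy step is where a DistCp-style MR job spread across the data nodes would come in.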