Re: How to write a Job for importing Files from an external Rest API into Hadoop

2017-07-31 Thread Ralph Soika

Hi Ravi,

thanks a lot for your response and the code example!
I think this will help me a lot to get started. I am glad to see that my 
idea is not too exotic.

I will report back whether I can adapt the solution to my problem.

best regards
Ralph


On 31.07.2017 22:05, Ravi Prakash wrote:

Hi Ralph!

Although not totally similar to your use case, DistCp may be the 
closest thing to what you want: 
https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCp.java 
The client builds a file list and then submits an MR job to copy 
over all the files.


HTH
Ravi

-- 
*Imixs* ...extends the way people work together
We are an open source company, read more at: www.imixs.org

Imixs Software Solutions GmbH
Agnes-Pockels-Bogen 1, 80992 München
*Web:* www.imixs.com
*Office:* +49 (0)89-452136 16 *Mobil:* +49-177-4128245
Registergericht: Amtsgericht Muenchen, HRB 136045
Geschaeftsfuehrer: Gaby Heinle u. Ralph Soika

Re: How to write a Job for importing Files from an external Rest API into Hadoop

2017-07-31 Thread Ravi Prakash
Hi Ralph!

Although not totally similar to your use case, DistCp may be the closest
thing to what you want:
https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCp.java
The client builds a file list and then submits an MR job to copy over
all the files.
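
If you only need that pattern rather than DistCp itself, a very rough,
untested sketch of it could look like the class below. Every name in it
(RestImportJob, the rest.base.url setting, the /tmp/rest-import/filelist.txt
and /data/imported paths, the /files/ endpoint) is just a placeholder, and
the driver is expected to fetch the file list from your REST API itself,
which is left out here.

import java.io.InputStream;
import java.net.URL;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class RestImportJob {

    // Each mapper gets a handful of file names and copies them into HDFS.
    public static class FetchMapper
            extends Mapper<LongWritable, Text, NullWritable, NullWritable> {
        @Override
        protected void map(LongWritable offset, Text fileName, Context ctx)
                throws java.io.IOException, InterruptedException {
            Configuration conf = ctx.getConfiguration();
            FileSystem fs = FileSystem.get(conf);
            Path target = new Path("/data/imported/" + fileName);
            // Stream the file from the (hypothetical) REST endpoint into HDFS.
            URL src = new URL(conf.get("rest.base.url") + "/files/" + fileName);
            try (InputStream in = src.openStream();
                 FSDataOutputStream out = fs.create(target, true)) {
                IOUtils.copyBytes(in, out, conf, false);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("rest.base.url", args[0]);               // e.g. https://example.com/api
        conf.setInt(NLineInputFormat.LINES_PER_MAP, 10);  // 10 file names per map task

        // A real driver would call the REST API here and write one remote
        // file name per line into this HDFS file before submitting the job.
        Path fileList = new Path("/tmp/rest-import/filelist.txt");

        Job job = Job.getInstance(conf, "rest-import");
        job.setJarByClass(RestImportJob.class);
        job.setMapperClass(FetchMapper.class);
        job.setNumReduceTasks(0);
        job.setInputFormatClass(NLineInputFormat.class);
        job.setOutputFormatClass(NullOutputFormat.class);
        NLineInputFormat.addInputPath(job, fileList);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}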

HTH
Ravi



How to write a Job for importing Files from an external Rest API into Hadoop

2017-07-30 Thread Ralph Soika

Hi,

I want to ask: what is the best way to implement a job that imports files 
into HDFS?


I have an external system offering data through a REST API. 
My goal is to have a job running in Hadoop that periodically (maybe 
started by cron?) checks the REST API for new data.
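
Something like this crontab entry on an edge node is what I have in mind 
(the jar, class and log names are only placeholders):

# run the (hypothetical) import driver every 15 minutes
*/15 * * * * hadoop jar /opt/jobs/rest-import.jar com.example.RestImportJob https://example.com/api >> /var/log/rest-import.log 2>&1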


It would be nice if this job could also run on multiple data nodes. But 
in contrast to all the MapReduce examples I found, my job looks for new 
or changed data from an external interface and compares it with the data 
that is already stored.


This is a conceptual example of the job (a rough Java sketch of step 3 
follows the list):

1. The job asks the REST API whether there are new files.
2. If so, the job imports the first file in the list.
3. The job checks whether the file already exists:
   1. if not, the job imports the file;
   2. if yes, the job compares it with the data already stored:
      1. if it changed, the job updates the file.
4. If more files exist, the job continues with step 2;
5. otherwise it ends.
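
For a single file, step 3 with its substeps is roughly what the sketch 
below does. It only uses the plain HDFS FileSystem API; the target path is 
made up, and the MD5 comparison via commons-codec is just one idea for 
detecting changes, not a requirement:

import java.io.InputStream;

import org.apache.commons.codec.digest.DigestUtils;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FileSync {

    // Imports the remote file if it is new, re-imports it if it changed,
    // and does nothing if the stored copy is identical.
    static void syncFile(FileSystem fs, String name, byte[] remoteContent)
            throws java.io.IOException {
        Path target = new Path("/data/imported/" + name);

        if (fs.exists(target)) {
            // Compare the stored copy with the remote data, here by MD5 checksum.
            String storedMd5;
            try (InputStream in = fs.open(target)) {
                storedMd5 = DigestUtils.md5Hex(in);
            }
            if (storedMd5.equals(DigestUtils.md5Hex(remoteContent))) {
                return; // unchanged, nothing to do
            }
        }
        // New or changed: (over)write the file in HDFS.
        try (FSDataOutputStream out = fs.create(target, true)) {
            out.write(remoteContent);
        }
    }
}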


Can anybody give me some help on how to get started (it's the first job I 
am writing)?



===
Ralph




--