Thanks for your inputs, Ashish and Hari.
Ashish, I'm attempting something similar (using WebHDFS) to what you
mentioned inline in your 3rd point (whether to consider Flume for a daily
batch job).
Let me know if you have any idea about the error. I'll update the group on
whether setting the flag

dfs.webhdfs.enabled  = true

helps.
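(For reference, and as far as I understand the Hadoop docs, this flag is set
in hdfs-site.xml on the namenode and datanodes, roughly as below; HDFS needs
a restart for it to take effect.)

```xml
<property>
  <name>dfs.webhdfs.enabled</name>
  <value>true</value>
</property>
```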

regards
Sunita


---------- Forwarded message ----------
From: Sunita Arvind <[email protected]>
Date: Fri, Jul 19, 2013 at 7:30 PM
Subject: Re: Seeking advice over choice of language and implementation
To: [email protected]


Thank you, Israel,

I will attempt option 1 and share my experiences.
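For option 1, my understanding from the Flume HTTPSource docs is that the
default JSONHandler accepts a POST body that is a JSON array of events, each
with "headers" and "body" fields. A rough sketch of what I plan to try (the
host, port, and header values are placeholders, not from a working setup):

```python
import json
import urllib.request

def build_flume_payload(records):
    """Wrap raw JSON records in the event envelope that Flume's
    HTTPSource JSONHandler expects: a list of {"headers", "body"}."""
    return json.dumps([
        {"headers": {"source": "social-fetcher"},  # placeholder header
         "body": json.dumps(record)}
        for record in records
    ])

def post_to_flume(records, host="localhost", port=5140):
    """POST the wrapped events to a Flume agent's HTTP source."""
    req = urllib.request.Request(
        "http://%s:%d" % (host, port),
        data=build_flume_payload(records).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status  # HTTPSource answers 200 on success
```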

In the meantime I tried a workaround: using WebHDFS to write the files
directly to HDFS from a Python daemon (using this library -
https://github.com/carlosmarin/webhdfs-py/blob/master/webhdfs/webhdfs.py).

However, with this, I am getting an exception
07/19/2013 06:05:59 PM - webhdfs - DEBUG - HTTP Response: 404, Not Found

If I copy-paste the resulting URL into the browser address bar, I get
something like this:

{"RemoteException":{"exception":"IllegalArgumentException","javaClassName":"java.lang.IllegalArgumentException","message":"Invalid value for webhdfs parameter \"op\": No enum const class org.apache.hadoop.hdfs.web.resources.GetOpParam$Op.CREATE"}}

I have no idea what this means. I am wondering whether it means HDFS is not
configured with dfs.webhdfs.enabled = true. (I do not have permission to
check or change this; I am requesting access from the admin.) Let me know
your thoughts.
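One thing I did learn from the WebHDFS REST docs: op=CREATE is only defined
for HTTP PUT, while a browser address bar issues a GET, and the GET
operations are exactly the ones enumerated in GetOpParam$Op. So the browser
test will print that exception even on a correctly configured cluster; it
does not by itself prove the flag is off. A minimal stdlib sketch of the
two-step CREATE handshake (the host, default port 50070, and user name are
placeholders):

```python
import http.client
import urllib.parse

def webhdfs_create_url(host, path, user, port=50070):
    # CREATE is an HTTP PUT operation; requesting this same URL with
    # GET (as a browser does) triggers the GetOpParam$Op.CREATE error.
    return ("http://%s:%d/webhdfs/v1%s?op=CREATE&user.name=%s"
            % (host, port, path, user))

def webhdfs_create(host, path, user, data, port=50070):
    """Two-step WebHDFS CREATE: PUT to the namenode, then PUT the
    bytes to the datanode named in the 307 Location header."""
    url = urllib.parse.urlparse(webhdfs_create_url(host, path, user, port))
    # Step 1: the namenode answers 307 Temporary Redirect with the
    # datanode write address in the Location header; no body is sent.
    conn = http.client.HTTPConnection(url.netloc)
    conn.request("PUT", url.path + "?" + url.query)
    resp = conn.getresponse()
    location = resp.getheader("Location")
    resp.read()
    conn.close()
    # Step 2: PUT the actual file contents to the datanode URL.
    dn = urllib.parse.urlparse(location)
    conn = http.client.HTTPConnection(dn.netloc)
    conn.request("PUT", dn.path + "?" + dn.query, body=data)
    resp = conn.getresponse()
    status = resp.status  # 201 Created on success
    conn.close()
    return status
```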

regards
Sunita

On Fri, Jul 19, 2013 at 12:51 AM, Israel Ekpo <[email protected]> wrote:

> Sunita,
>
> Depending on your level of comfort, you can do one of the following:
>
> 1. Use Python to fetch your data and then send the events via HTTP to the
> Flume HTTP Source [1]
> 2. Use Java to create a custom source [6] in Flume that handles the data
> fetching and then puts it in a channel [3] so that it can be funneled into
> the sinks [4] and [5]
>
> Option 1 would be easier for you since you can get the data in Python and
> just stream it down via HTTP to Flume.
>
> Option 2 will be more involved since you need to write code that
> communicates with external endpoints.
>
> References
> [1] http://goo.gl/5lHlg
> [2] http://goo.gl/GnVbE
> [3] http://goo.gl/t31Xh
> [4] http://goo.gl/G9xS8
> [5] http://goo.gl/Wn4W5
> [6] http://goo.gl/Q0yyn
>
>
> *Author and Instructor for the Upcoming Book and Lecture Series*
> *Massive Log Data Aggregation, Processing, Searching and Visualization
> with Open Source Software*
> *http://massivelogdata.com*
>
>
> On 18 July 2013 13:38, Sunita Arvind <[email protected]> wrote:
>
>> Hello friends,
>>
>> I am new to flume and have written a python script to fetch some data
>> from social media. My response is JSON. I am seeking help on following
>> issues:
>> 1. I am finding it hard to make Python and Flume talk. Is it just my
>> ignorance, or is it indeed a long route? AFAIK, I need to understand the
>> Thrift API, Avro, etc. to achieve this. I also read about pipes. Would
>> that be a simple implementation?
>>
>> 2. I am equally comfortable (uncomfortable) in Java, hence I am wondering
>> if it is better to re-write my application in Java so that I can easily
>> integrate it with Flume. Are there any advantages to having a Java
>> application, given that all of Hadoop is Java?
>>
>> 3. I need to schedule the agent to run on a daily basis. Which of the
>> above approaches would help me achieve this easily?
>>
>> 4. Going by this thread -
>> http://mail-archives.apache.org/mod_mbox/flume-user/201306.mbox/%[email protected]%3E
>> - it looks like we need to manually clean up disk space even with Flume.
>> I am not clear on the advantages I would have with Flume over a simple
>> cron job that does the task. I could write statements like "hadoop fs
>> -put <location of output file on local> <location on hdfs>" in the cron
>> job instead.
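>> A sketch of the cron entry I have in mind (paths are made up, and the %
>> in date formats has to be escaped inside a crontab):
>>
>> ```
>> # m h dom mon dow  command -- run daily at 02:00
>> 0 2 * * * /usr/bin/hadoop fs -put /var/data/social/output-$(date +\%F).json /user/sunita/social/
>> ```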
>>
>> I appreciate your help and guidance.
>>
>> regards,
>> Sunita
>>
>
>
