Here you are: http://flume.apache.org/FlumeDeveloperGuide.html#client  
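Just to sketch it (this is only a rough, untested example; the host, port, and file path are placeholders matching the config in this thread), reading a whole Avro container file and shipping it as a single event with the RpcClient could look something like this:

import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class AvroFileSender {
  public static void main(String[] args) throws Exception {
    // Connect to the Avro source from agent.conf (host/port are placeholders)
    RpcClient client = RpcClientFactory.getDefaultInstance("10.10.10.10", 41414);
    try {
      // Read the entire Avro container file so it stays one valid unit
      byte[] body = Files.readAllBytes(Paths.get("/tmp/test.avro"));
      // One Avro container file -> one Flume event
      Event event = EventBuilder.withBody(body);
      client.append(event);
    } catch (EventDeliveryException e) {
      // Delivery failed; handle or retry as appropriate
      e.printStackTrace();
    } finally {
      client.close();
    }
  }
}

That keeps the Avro schema header and data blocks together in a single event, per the "one container per event" point below.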


Hari  

--  
Hari Shreedharan


On Wednesday, February 6, 2013 at 10:20 AM, Alan Miller wrote:

> Thanks Hari,
>   
> Are there any links to examples of how to use the RpcClient?
>   
> Alan
>   
> From: Hari Shreedharan [mailto:[email protected]]  
> Sent: Wednesday, February 06, 2013 7:16 PM
> To: [email protected] (mailto:[email protected])
> Subject: Re: streaming Avro to HDFS  
>   
> Alan,
>  
> I think this is probably because the AvroClient is not really very "smart."
> It is mainly useful for testing the AvroSource. The AvroClient reads the file
> passed in and sends one line per event (in 1.2.0; in 1.3.0+ there is also an
> option to send all the files in a directory). So the events are not really
> sent as Avro files, and since you are using the text serializer they are
> dumped as-is. Since events can arrive out of order, your data is likely to be
> invalid Avro. Also, the newline character used to split the events may
> actually have been part of the real Avro serialization, so removing it simply
> made it invalid Avro.
>  
> My advice would be to use the RpcClient to read the file and send the data
> in a valid format, by making sure one Avro "container" is in one event.
>  
> Hari
>  
> --  
> Hari Shreedharan
>  
> On Wednesday, February 6, 2013 at 9:58 AM, Alan Miller wrote:
> >  
> > Hi I’m just getting started with Flume and trying to understand the flow of things.
> >  
> > I have avro binary data files being generated on remote nodes and I want to use
> > Flume (1.2.0) to stream them to my HDFS cluster at a central location. It seems I can
> > stream the data but the resulting files on HDFS seem corrupt. Here’s what I did:
> >  
> > For my “master” (on the NameNode of my Hadoop cluster) I started this:
> >  
> > flume-ng agent -f agent.conf -Dflume.root.logger=DEBUG,console -n agent
> >  
> > With this config:
> >  
> > agent.channels = memory-channel
> > agent.sources = avro-source
> > agent.sinks = hdfs-sink
> >  
> > agent.channels.memory-channel.type = memory
> > agent.channels.memory-channel.capacity = 1000
> > agent.channels.memory-channel.transactionCapacity = 100
> >  
> > agent.sources.avro-source.channels = memory-channel
> > agent.sources.avro-source.type = avro
> > agent.sources.avro-source.bind = 10.10.10.10
> > agent.sources.avro-source.port = 41414
> >  
> > agent.sinks.hdfs-sink.type = hdfs
> > agent.sinks.hdfs-sink.channel = memory-channel
> > agent.sinks.hdfs-sink.hdfs.path = hdfs://namenode1:9000/flume
> >  
> > On a remote node I streamed a test file like this:
> >  
> > flume-ng avro-client -H 10.10.10.10 -p 41414 -F /tmp/test.avro
> >  
> > I can see the master is writing to HDFS
> >  
> >   ……
> >   13/02/06 09:37:55 INFO hdfs.BucketWriter: Creating hdfs://namenode1:9000/flume/FlumeData.1360172273684.tmp
> >   13/02/06 09:38:25 INFO hdfs.BucketWriter: Renaming hdfs://namenode1:9000/flume/FlumeData.1360172273684.tmp
> >   to hdfs://namenode1:9000/flume/FlumeData.1360172273684
> >  
> > But the data doesn’t seem right. The original file is 4551 bytes, the file
> > written to HDFS was only 219 bytes
> >  
> >   [localhost] $ ls -l FlumeData.1360172273684 /tmp/test.avro
> >   -rwxr-xr-x 1 amiller amiller  219 Feb  6 18:51 FlumeData.1360172273684
> >   -rwxr-xr-x 1 amiller amiller 4551 Feb  6 12:00 /tmp/test.avro
> >  
> >   [localhost] $ avro cat /tmp/test.avro
> >   {"system_model": null, "nfsv4": null, "ip": null, "site": null, "nfsv3": null, "export": null, "ifnet": [{"send_bps": 1234, "recv_bps": 5678, "name": "eth0"}, {"send_bps": 100, "recv_bps": 200, "name": "eth1"}, {"send_bps": 0, "recv_bps": 0, "name": "eth2"}], "disk": null, "hostname": "localhost", "total_mem": null, "ontapi_version": null, "serial_number": null, "cifs": null, "cpu_model": null, "volume": null, "time_stamp": 1357639723, "aggregate": null, "num_cpu": null, "cpu_speed_mhz": null, "hostid": null, "kernel_version": null, "qtree": null, "processor": null}
> >  
> >   [localhost] $ hadoop fs -copyToLocal /flume/FlumeData.1360172273684 .
> >   [localhost] $ avro cat FlumeData.1360172273684
> >   panic: ord() expected a character, but string of length 0 found
> >  
> >  
> > Alan
> >  

