RE: streaming Avro to HDFS

Alan Miller Wed, 06 Feb 2013 10:22:25 -0800

Thanks Hari,

Are there any links to examples of how to use the RpcClient?

Alan

From: Hari Shreedharan [mailto:[email protected]]
Sent: Wednesday, February 06, 2013 7:16 PM
To: [email protected]
Subject: Re: streaming Avro to HDFS

Alan,

I think this is probably because the AvroClient is not really very "smart." It
is mainly useful for testing the AvroSource. The AvroClient reads the file
passed in and sends one line per event (that is the 1.2.0 behavior; in 1.3.0+
there is also an option of sending all files in a directory). So the events are
not really sent as Avro files, and since you are using the text serializer they
are dumped as is. Since events can arrive out of order, your data is likely to
be invalid Avro. Also, the newline character that is used to split the events
may actually have been part of the real Avro serialization, so removing it
made the data invalid Avro.
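
To illustrate the newline point, here is a minimal, self-contained sketch. It
does not use Flume at all. It assumes the client decodes the file as UTF-8 text
and splits it with a line reader, which is an assumption about the AvroClient's
internals rather than something confirmed in this thread. The class name, the
method names, and the sample bytes are made up for the demonstration.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class LineSplitDemo {

    // Treats binary input the way a line-oriented client would: decode it as
    // text, split on line terminators, then write each line back plus '\n'.
    static byte[] roundTripThroughLines(byte[] input) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(input), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                out.write(line.getBytes(StandardCharsets.UTF_8));
                out.write('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Hypothetical sample: the Avro magic bytes "Obj" + 1, followed by
    // invented binary data that happens to contain line terminators.
    static byte[] sampleBytes() {
        return new byte[] {'O', 'b', 'j', 1, 0x0A, (byte) 0xFF, 0x0D, 0x0A, 2};
    }

    public static void main(String[] args) {
        byte[] original = sampleBytes();
        byte[] rebuilt = roundTripThroughLines(original);
        System.out.println("original: " + Arrays.toString(original));
        System.out.println("rebuilt:  " + Arrays.toString(rebuilt));
        System.out.println("identical: " + Arrays.equals(original, rebuilt));
    }
}
```

Plain ASCII text that ends in a newline survives this round trip, but binary
data generally does not: bytes that are not valid text get replaced during
decoding, and the original line terminators are not preserved.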

My advice would be to read the file yourself and use the RpcClient to send the
data in a valid format, making sure that one Avro "container" goes into one
event.
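
A rough sketch of that idea follows. To keep it compilable without the Flume
SDK on the classpath, the sender is a hypothetical EventSender interface
standing in for the real client. The comments show where the Flume SDK calls
(RpcClientFactory.getDefaultInstance, EventBuilder.withBody, RpcClient.append)
would go. The class and method names are invented for illustration, and the
host and port are the ones from the config quoted in this thread.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

final class WholeFileSender {

    // Hypothetical stand-in for the Flume client; not part of the Flume SDK.
    interface EventSender {
        void send(byte[] body);
    }

    // Reads the whole file as raw bytes (no text decoding, no line splitting)
    // and hands it to the sender as a single event body.
    static void sendFileAsOneEvent(Path file, EventSender sender) {
        try {
            sender.send(Files.readAllBytes(file));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // With the Flume SDK, the sender would wrap an RpcClient, roughly:
        //   RpcClient client = RpcClientFactory.getDefaultInstance("10.10.10.10", 41414);
        //   client.append(EventBuilder.withBody(body));
        //   client.close();
        // Here the event bodies are only collected in memory.
        List<byte[]> sent = new ArrayList<>();
        sendFileAsOneEvent(Paths.get(args[0]), sent::add);
        System.out.println("events: " + sent.size() + ", bytes: " + sent.get(0).length);
    }
}
```

This loads the whole file into memory, so it only suits small files such as the
4551-byte test file. Whether the HDFS sink then writes the event body byte for
byte depends on the sink's file type and serializer settings, which this sketch
does not cover.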

Hari

--
Hari Shreedharan

On Wednesday, February 6, 2013 at 9:58 AM, Alan Miller wrote:

Hi I’m just getting started with Flume and trying to understand the flow of
things.

I have avro binary data files being generated on remote nodes and I want to use
Flume (1.2.0) to stream them to my HDFS cluster at a central location. It seems
I can stream the data but the resulting files on HDFS seem corrupt.  Here’s what
I did:

For my “master” (on the NameNode of my Hadoop cluster)  I started this:

flume-ng agent -f agent.conf  -Dflume.root.logger=DEBUG,console -n agent

With this config:

agent.channels = memory-channel
agent.sources = avro-source
agent.sinks = hdfs-sink

agent.channels.memory-channel.type = memory
agent.channels.memory-channel.capacity = 1000
agent.channels.memory-channel.transactionCapacity = 100

agent.sources.avro-source.channels = memory-channel
agent.sources.avro-source.type = avro
agent.sources.avro-source.bind = 10.10.10.10
agent.sources.avro-source.port = 41414

agent.sinks.hdfs-sink.type = hdfs
agent.sinks.hdfs-sink.channel = memory-channel
agent.sinks.hdfs-sink.hdfs.path = hdfs://namenode1:9000/flume

On a remote node I streamed a test file like this:

flume-ng avro-client -H 10.10.10.10 -p 41414 -F /tmp/test.avro

I can see the master is writing to HDFS

  ……

  13/02/06 09:37:55 INFO hdfs.BucketWriter: Creating hdfs://namenode1:9000/flume/FlumeData.1360172273684.tmp
  13/02/06 09:38:25 INFO hdfs.BucketWriter: Renaming hdfs://namenode1:9000/flume/FlumeData.1360172273684.tmp to hdfs://namenode1:9000/flume/FlumeData.1360172273684

But the data doesn’t seem right. The original file is 4551 bytes, the file
written to HDFS was only 219 bytes:

  [localhost] $ ls -l FlumeData.1360172273684 /tmp/test.avro
  -rwxr-xr-x 1 amiller amiller  219 Feb  6 18:51 FlumeData.1360172273684
  -rwxr-xr-x 1 amiller amiller 4551 Feb 6 12:00 /tmp/test.avro

  [localhost] $ avro cat /tmp/test.avro

  {"system_model": null, "nfsv4": null, "ip": null, "site": null, "nfsv3": 
null, "export": null, "ifnet": [{"send_bps": 1234, "recv_bps": 5678, "name": 
"eth0"}, {"send_bps": 100, "recv_bps": 200, "name": "eth1"}, {"send_bps": 0, 
"recv_bps": 0, "name": "eth2"}], "disk": null, "hostname": "localhost", 
"total_mem": null, "ontapi_version": null, "serial_number": null, "cifs": null, 
"cpu_model": null, "volume": null, "time_stamp": 1357639723, "aggregate": null, 
"num_cpu": null, "cpu_speed_mhz": null, "hostid": null, "kernel_version": null, 
"qtree": null, "processor": null}

  [localhost] $ hadoop fs -copyToLocal /flume/FlumeData.1360172273684 .

  [localhost] $ avro cat FlumeData.1360172273684

  panic: ord() expected a character, but string of length 0 found

Alan
