I thought about that. But if I load it and then transform the schema, won't that fix the very issue that is giving the Python Avro library grief?

Any suggestions on how to make a "redacted" version of the schema?
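One idea I'm toying with (an untested sketch; the helper name and placeholder names are mine, and it treats the schema as plain JSON rather than going through the avro API, assuming the getschema output was saved to testdata.avsc):

    import json

    def redact_schema(schema_json):
        # Walk the schema JSON and replace every "name" with a generic
        # placeholder, keeping types and structure intact so whatever
        # shape trips the reader is preserved.
        counter = [0]

        def rename(node):
            if isinstance(node, dict):
                if 'name' in node:
                    counter[0] += 1
                    node['name'] = 'field_%d' % counter[0]
                for value in node.values():
                    rename(value)
            elif isinstance(node, list):
                for item in node:
                    rename(item)

        schema = json.loads(schema_json)
        rename(schema)
        return json.dumps(schema, indent=2)

    print(redact_schema(open('testdata.avsc').read()))

One caveat: if the schema reuses a named record as a type elsewhere, a blind rename like this breaks the reference, so the output may need touching up by hand.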
On Tue, Oct 27, 2015 at 2:34 PM, Sean Busbey <[email protected]> wrote:
> well, testing with the java avro-tools was my very next suggestion. :/
>
> Can you make a redacted version of the schema?
>
> On Tue, Oct 27, 2015 at 1:22 PM, web user <[email protected]> wrote:
> > Unfortunately the company I work at has a strict policy about sharing
> > data. Having said that, I don't think the file is corrupted.
> >
> > I ran the following command:
> >
> >     java -jar avro-tools-1.7.7.jar tojson testdata.avro
> >
> > and it generates a file of 1 byte.
> >
> > I also ran "java -jar avro-tools-1.7.7.jar getschema testdata.avro"
> > and it gets back the correct schema.
> >
> > Is there any way, when using the Python library, to keep it from
> > consuming all the memory on the entire box?
> >
> > Regards,
> >
> > WU
> >
> > On Tue, Oct 27, 2015 at 2:08 PM, Sean Busbey <[email protected]> wrote:
> >> It sounds like the file you are reading is malformed. Could you share
> >> the file or how it was written?
> >>
> >> On Tue, Oct 27, 2015 at 1:01 PM, web user <[email protected]> wrote:
> >> > I ran this in a VM with much less memory and it immediately failed
> >> > with a memory error:
> >> >
> >> > Traceback (most recent call last):
> >> >   File "testavro.py", line 31, in <module>
> >> >     for r in reader:
> >> >   File "/usr/local/lib/python2.7/dist-packages/avro/datafile.py", line 362, in next
> >> >     datum = self.datum_reader.read(self.datum_decoder)
> >> >   File "/usr/local/lib/python2.7/dist-packages/avro/io.py", line 445, in read
> >> >     return self.read_data(self.writers_schema, self.readers_schema, decoder)
> >> >   File "/usr/local/lib/python2.7/dist-packages/avro/io.py", line 490, in read_data
> >> >     return self.read_record(writers_schema, readers_schema, decoder)
> >> >   File "/usr/local/lib/python2.7/dist-packages/avro/io.py", line 690, in read_record
> >> >     field_val = self.read_data(field.type, readers_field.type, decoder)
> >> >   File "/usr/local/lib/python2.7/dist-packages/avro/io.py", line 484, in read_data
> >> >     return self.read_array(writers_schema, readers_schema, decoder)
> >> >   File "/usr/local/lib/python2.7/dist-packages/avro/io.py", line 582, in read_array
> >> >     for i in range(block_count):
> >> > MemoryError
> >> >
> >> > On Tue, Oct 27, 2015 at 1:36 PM, web user <[email protected]> wrote:
> >> >> Hi,
> >> >>
> >> >> I'm doing the following:
> >> >>
> >> >>     from avro.datafile import DataFileReader
> >> >>     from avro.datafile import DataFileWriter
> >> >>     from avro.io import DatumReader
> >> >>     from avro.io import DatumWriter
> >> >>
> >> >>     def OpenAvroFileToRead(avro_filename):
> >> >>         return DataFileReader(open(avro_filename, 'r'), DatumReader())
> >> >>
> >> >>     with OpenAvroFileToRead(avro_filename) as reader:
> >> >>         for r in reader:
> >> >>             ....
> >> >>
> >> >> I have an Avro file which is only 500 bytes. I think there is a
> >> >> data structure in there which is null or empty.
> >> >>
> >> >> I put in print statements before and after "for r in reader". On
> >> >> that instruction it consumes about 400 GB of memory before I have
> >> >> to kill the process.
> >> >>
> >> >> That is 400 GB! I have 1 TB on my server. I have tried this with
> >> >> 1.6.1 and 1.7.1 and 1.7.7 and get the same behavior on all three
> >> >> versions.
> >> >>
> >> >> Any ideas on what is causing this?
> >> >>
> >> >> Regards,
> >> >>
> >> >> WU
> >>
> >> --
> >> Sean
>
> --
> Sean
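P.S. For the archive: the traceback above bottoms out in avro/io.py's read_array, at "for i in range(block_count)". On Python 2, range() materializes the full list up front, so if the block count decoded from a corrupt file is garbage, the allocation happens before a single element is read, which would explain a 500-byte file eating hundreds of gigabytes. A standalone illustration (not avro library code; the count is made up):

    # In Python 2, range(n) builds a real list of n integers, so a bogus
    # block count exhausts memory before any array element is decoded.
    block_count = 2 ** 40            # hypothetical corrupted block count
    for i in range(block_count):     # allocates the whole list up front
        pass                         # xrange() would at least be lazy here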
