Hi,

Thank you for your replies. My solution is very similar to Maarten's suggestions. Using the method from my first email, I wrote a new file containing only the data I needed, in a fixed number of columns. After that, I just load the new optimised file with Apache Arrow.
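A minimal sketch of that kind of preprocessing step, for illustration only: it is not the exact script used here. The helper name `pad_csv` and its parameters are hypothetical, and it assumes a plain comma-separated file that Python's standard `csv` module can parse. It makes two passes, one to find the widest row and one to write every row padded to that width, reading row by row so memory use stays small.

```python
import csv


def pad_csv(src_path, dst_path, delimiter=","):
    """Rewrite src_path to dst_path so every row has the same number of fields.

    Returns the number of fields per row in the output file.
    """
    # Pass 1: find the widest row, reading one row at a time.
    with open(src_path, newline="") as src:
        reader = csv.reader(src, delimiter=delimiter)
        width = max((len(row) for row in reader), default=0)

    # Pass 2: copy each row, appending empty fields up to `width`.
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.reader(src, delimiter=delimiter)
        writer = csv.writer(dst, delimiter=delimiter, lineterminator="\n")
        for row in reader:
            writer.writerow(row + [""] * (width - len(row)))

    return width
```

Every line of the resulting file has the same number of fields, so it should be readable with `pyarrow.csv.read_csv` using the same `ReadOptions(column_names=...)` as in the first email.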
Thank you all,
Elisa

On Tue, 19 Nov 2019 at 20:09, Maarten Ballintijn <[email protected]> wrote:

> Hi Elisa,
>
> One option is to preprocess the file and add the missing columns.
> You can do this using two passes (reading once to determine the number of
> columns and once writing out the lines filled out to the right number of
> columns). This does not need to take a lot of memory, as you can read line
> by line (see
> http://docs.python.org/3/tutorial/inputoutput.html#methods-of-file-objects).
>
> Another option is to make an iterator to do this on the fly.
>
> Either way the resulting DataFrame is going to be large:
>
> 14e6 rows * 43 columns * 8 bytes/float ~ 5 Gbyte (in the best case)
>
> Cheers,
> Maarten
>
> On Nov 19, 2019, at 4:51 AM, Antoine Pitrou <[email protected]> wrote:
>
> > No, there is no way to load CSV files with irregular dimensions, and we
> > don't have any plans currently to support them. Sorry :-(
> >
> > Regards
> >
> > Antoine.
> >
> > On 19/11/2019 at 05:54, Micah Kornfield wrote:
> >
> > > +dev@arrow to see if there is a more definitive answer, but I don't
> > > believe this type of functionality is supported currently.
> > >
> > > On Fri, Nov 15, 2019 at 1:42 AM Elisa Scandellari <[email protected]> wrote:
> > >
> > > > Hi,
> > > > I'm trying to improve the performance of my program that loads csv
> > > > data and manipulates it.
> > > > My CSV file contains 14 million rows and has a variable number of
> > > > columns. The first 27 columns will always be available, and a row can
> > > > have up to 16 more columns for a total of 43.
> > > >
> > > > Using vanilla pandas I've found this workaround:
> > > > ```
> > > > import numpy as np
> > > > import pandas as pd
> > > >
> > > > # Find the widest row so pandas can be given enough column names
> > > > largest_column_count = 0
> > > > with open(data_file, 'r') as temp_f:
> > > >     lines = temp_f.readlines()
> > > >     for l in lines:
> > > >         column_count = len(l.split(',')) + 1
> > > >         largest_column_count = column_count if largest_column_count < column_count else largest_column_count
> > > > column_names = [i for i in range(0, largest_column_count)]
> > > > all_columns_df = pd.read_csv(data_file, header=None, delimiter=',', names=column_names, dtype='category').replace(np.nan, '', regex=True)
> > > > ```
> > > > This will create the table with all my data plus empty cells where
> > > > the data is not available.
> > > > With a smaller file, this works perfectly well. With the complete
> > > > file, my memory usage goes through the roof.
> > > >
> > > > I've been reading about Apache Arrow and, after a few attempts to
> > > > load a structured csv file (same number of columns for every row),
> > > > I'm extremely impressed.
> > > > I've tried to load my data file, using the same concept as above:
> > > > ```
> > > > fixed_column_names = [str(i) for i in range(0, 27)]
> > > > extra_column_names = [str(i) for i in range(len(fixed_column_names), largest_column_count)]
> > > > total_columns = fixed_column_names
> > > > total_columns.extend(extra_column_names)
> > > > read_options = csv.ReadOptions(column_names=total_columns)
> > > > convert_options = csv.ConvertOptions(include_columns=total_columns, include_missing_columns=True, strings_can_be_null=True)
> > > > table = csv.read_csv(edr_filename, read_options=read_options, convert_options=convert_options)
> > > > ```
> > > > but I get the following error:
> > > > Exception: CSV parse error: Expected 43 columns, got 32
> > > >
> > > > I need to use the csv module provided by pyarrow; otherwise I
> > > > wouldn't be able to create the pyarrow table to then convert to
> > > > pandas:
> > > > ```from pyarrow import csv```
> > > >
> > > > I guess that the csv library provided by pyarrow is more streamlined
> > > > than the complete one.
> > > >
> > > > Is there any way I can load this file? Maybe using some ReadOptions
> > > > and/or ConvertOptions?
> > > > I'd be using pandas to manipulate the data after it's been loaded.
> > > >
> > > > Thank you in advance
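
The "iterator on the fly" option mentioned in the thread could look roughly like the generator below. This is only a sketch under stated assumptions: `padded_rows` is a hypothetical helper name, the target width (43 in this thread) is assumed to be known in advance, and the file is assumed to be plain comma-separated text that Python's standard `csv` module can parse.

```python
import csv


def padded_rows(path, width, delimiter=","):
    """Yield each row of the CSV file at `path`, padded with '' to `width` fields."""
    with open(path, newline="") as src:
        for row in csv.reader(src, delimiter=delimiter):
            # Rows already at `width` get nothing appended ([""] * 0 is empty).
            yield row + [""] * (width - len(row))
```

Because rows are produced one at a time, no padded copy of the file is written to disk, and a consumer can process the rows in batches rather than holding all 14 million in memory at once.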
