Hi Keith,

I feel your pain. I spend a lot of time doing this kind of manipulation.

I generally avoid CSV, the format of the devil. If necessary I have a library that does a reasonable job of converting CSV to TSV, and run everything through that. Let's just assume for now that you've already done this....
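If you don't have such a library to hand, the idea is simple enough. Here's a minimal sketch of the CSV-to-TSV step in Python (not LiveCode, and not my library — just an illustration) using the standard csv module, which copes with quoted fields and embedded commas; it assumes no field contains a literal tab:

```python
import csv
import io

def csv_to_tsv(csv_text):
    """Convert CSV text to TSV. The csv module handles quoting and
    embedded commas; assumes no field contains a literal tab."""
    reader = csv.reader(io.StringIO(csv_text))
    return "\n".join("\t".join(row) for row in reader)
```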

My technique is this:

- given that you've got a routine to iterate over a folder or tree of files
- given that you can easily process each file a row at a time, and within each row work through a column at a time (e.g., it's TSV and you've set the itemDelimiter to tab)
- given that the first row in each file gives the column names

1. Maintain an ordered list of output column names

I'd probably keep it in three pieces: a string with tab-separated column names, an array mapping column name to index, and a variable giving the number of columns.

2. For each file, go through the list of column names (the 'items' of the first row). For each one, if it's not already in the master list of column names (i.e., it's not in the array), then append it, preceded by a tab, to the end of the string, and add it to the array with the next available index.

Also keep an array mapping column index in _this_ file to column index in the master file.

3. Then for each row after that, start with an empty array. For each non-empty item on the row, add it to this 'row array', with the key being the index in the master file corresponding to this item's index in this file.

4. When you've got to the end of the row, dump the data from this array; index from 1 to number-of-master-columns, adding a tab between each. Then add this to the master file accumulator.
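For comparison, the same four steps sketched in Python (a hypothetical merge_tsv helper, taking each file's contents as a TSV string). One wrinkle the sketch also handles: rows emitted before a later file introduces new columns would come out short, so it pads every row to the final width at the end:

```python
def merge_tsv(files_contents):
    """Merge TSV texts (each with a header row) into one superset table."""
    master_cols = []    # step 1: ordered list of output column names
    col_to_idx = {}     # column name -> 0-based master index
    out_rows = []

    for text in files_contents:
        lines = text.rstrip("\n").split("\n")
        # step 2: grow the master columns; map file index -> master index
        file_to_master = []
        for name in lines[0].split("\t"):
            if name not in col_to_idx:
                col_to_idx[name] = len(master_cols)
                master_cols.append(name)
            file_to_master.append(col_to_idx[name])
        # step 3: place each row's items at their master positions
        for line in lines[1:]:
            row = [""] * len(master_cols)
            for i, value in enumerate(line.split("\t")):
                row[file_to_master[i]] = value
            out_rows.append(row)

    # step 4: emit, padding earlier (shorter) rows to the final width
    n = len(master_cols)
    body = "\n".join("\t".join(r + [""] * (n - len(r))) for r in out_rows)
    return "\t".join(master_cols) + "\n" + body
```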


So the code would look something like this:


local tMasterColumns -- tab delimited column names for output 'master' file
local aColumnNameToMasterIndex -- array mapping column name to index in above
local iNumMasterColumns -- number of items in the above two

local tMasterFileData -- will accumulate the rows of data for the output file

local aFileIndexToMasterIndex -- for each file, maps index of column in file to index in master file
local aRowData -- for each row, we first move data into this array...
local tOutRow -- ...then output it into this string

local iFileNumCols -- number of columns in the current input file

local iFileColInx, iMasterColInx -- keep track of input and output col indices


repeat for each file... load its contents into tFileData... etc

  -- map the columns of this file to the (growing) columns of the masterfile
  put 0 into iFileColInx
  repeat for each item x in line 1 of tFileData
    add 1 to iFileColInx
    get aColumnNameToMasterIndex[x]
    if it is empty then
      -- new column: append its name to the header (tab-separated) and index it
      if tMasterColumns is not empty then put tab after tMasterColumns
      put x after tMasterColumns
      add 1 to iNumMasterColumns
      put iNumMasterColumns into aColumnNameToMasterIndex[x]
      get iNumMasterColumns
    end if
    -- now it is the index of this column in the master file
    put it into aFileIndexToMasterIndex[iFileColInx]
  end repeat
  delete line 1 of tFileData
  put iFileColInx into iFileNumCols

  repeat for each line tRowData in tFileData

    -- get data from the row into the proper columns
    put empty into aRowData
    repeat with i = 1 to iFileNumCols -- number of columns in this file
      put aFileIndexToMasterIndex[i] into iMasterColInx
      put item i of tRowData into aRowData[iMasterColInx]
    end repeat

    -- now dump the row
    put empty into tOutRow
    repeat with i = 1 to iNumMasterColumns
      put aRowData[i] & tab after tOutRow
    end repeat
    put (char 1 to -2 of tOutRow) \ -- delete last tab
      & return after tMasterFileData

  end repeat

end repeat -- for each file

-- finally save tMasterColumns & return & tMasterFileData

On 04/04/2022 18:03, Keith Clarke via use-livecode wrote:
Hi folks,
I need to consolidate a couple of hundred CSV files of varying sizes (dozens to 
hundreds of rows) and column structures (some shared columns but many unique), 
into a single superset dataset that comprises all data from all files.

There are too many files and columns to attempt a manual process, so I've 
been trying to achieve this with LiveCode - by iterating through the files to 
parse the column header rows into row template and column mapping arrays 
used to build the combined table. However, I'm struggling to both compile the 
superset and render the results.

I feel I may be designing a very complicated solution for a problem that has 
probably been solved before. I wonder if I’m missing a simpler method - with 
LiveCode or perhaps with Excel or ‘drag & drop’ into some kind of 
self-organising database GUI for SQLite, etc?

Thanks in advance for any ideas.
Best,
Keith
_______________________________________________
use-livecode mailing list
[email protected]
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode
