Hi All. I'm still fairly new here, and I have a question about how to efficiently process a file containing a bulk export of a database in which each row record is followed by the detail records that belong to it. In other words, a row from table A is exported to the file, then the records associated with that row's id are exported from table B into the file, and the process repeats for each row in table A.
For example:

15 ident1(1) ident1(2)                                       <--- defines identifying information for the successive "20" records
20 info1(ident1)(1) info1(ident1)(2) info1(ident1)(3)        <--- record for this "15" record
...
20 infoN(ident1)(1) infoN(ident1)(2) infoN(ident1)(3)        <--- record for this "15" record
...
15 ident2(1) ident2(2)                                       <--- defines a new id for the next group of "20" records
20 infoX(ident2)(1) infoX(ident2)(2) infoX(ident2)(3)        <--- record for this "15" record
20 infoX+1(ident2)(1) infoX+1(ident2)(2) infoX+1(ident2)(3)  <--- record for this "15" record
...

Each "15" record is followed by an arbitrary number of "20" records until the next "15" record appears, which is then followed by its own "20" records, and so on ad infinitum.

I was hoping to run various kinds of map-reduce jobs on this data. For example, I want to find the max info value in each column within each "15" section. Is there any way to handle that? I was hoping I wouldn't have to split the file myself; these files get to be 22GB each. I thought a strategy similar to the one used for processing XML files would be useful, but I don't think that applies here.
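To make the computation concrete, here is a rough sketch of the kind of mapper I had in mind (the class name, the whitespace delimiting, and the key format are just placeholders). It assumes a whole "15" section never crosses an input split boundary, which I know plain TextInputFormat doesn't guarantee:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SectionMapper extends Mapper<LongWritable, Text, Text, Text> {

    // Identity fields from the most recent "15" record seen in this split.
    private String currentSection = null;

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().trim().split("\\s+");

        if ("15".equals(fields[0]) && fields.length >= 3) {
            // A new section starts: remember its identifying fields as the key.
            currentSection = fields[1] + "_" + fields[2];
        } else if ("20".equals(fields[0]) && currentSection != null) {
            // Emit the whole "20" record keyed by the section it belongs to,
            // so the reducer can compute a per-column max for each section.
            context.write(new Text(currentSection), line);
        }
    }
}

The reducer would then just keep a running max per column for each section key. The part I can't see is how to deal with a "15" section that straddles a split boundary without pre-splitting the file myself.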
I would appreciate any help and insight.

Best Regards,
Mahesh