Hi All. I'm still kind of new, and I have a question about how to efficiently
process a file containing a bulk export of a database, where each row record
is followed by the detail records associated with it. In other words, a row
from table A is exported to the file, then the records associated with that
row's id are exported from table B into the file. The process is repeated for
each row in table A.

For example...
15 ident1(1) ident1(2)  <--- defines identifying information for the successive "20 records"
20 info1(ident1)(1) info1(ident1)(2) info1(ident1)(3)  <--- record for this "15 type record"
.
.
.
20 infoN(ident1)(1) infoN(ident1)(2) infoN(ident1)(3)  <--- record for this "15 type record"
.
.
.
15 ident2(1) ident2(2)  <--- defines a new id for the next group of 20 type records
20 infoX(ident2)(1) infoX(ident2)(2) infoX(ident2)(3)  <--- record for this "15 type record"
20 infoX+1(ident2)(1) infoX+1(ident2)(2) infoX+1(ident2)(3)  <--- record for this "15 type record"
.
.
.

This continues until the next 15 type record appears; that record is in turn
followed by an arbitrary number of 20 type records for the new id, then
another 15 type record, and so on, ad infinitum.
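
In plain Python, the grouping I have in mind looks roughly like this (just a
sketch, assuming whitespace-delimited fields with the record type in the first
column):

def sections(path):
    # Stream the export line by line, grouping each "20" record under the
    # most recent "15" header.
    current_header = None
    records = []
    with open(path) as f:
        for line in f:
            fields = line.split()
            if not fields:
                continue
            if fields[0] == "15":
                # A new section starts: emit the previous one, if any.
                if current_header is not None:
                    yield current_header, records
                current_header = fields[1:]
                records = []
            elif fields[0] == "20":
                records.append(fields[1:])
    if current_header is not None:
        yield current_header, records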

I was hoping to run various kinds of map-reduce jobs on this data. For
example, I want to find the max info value in each column for each 15
"section". Is there any way to handle that?
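
Just to make the computation concrete, here is roughly what I mean in plain
(non-map-reduce) Python; sections() is the generator from the sketch above,
and I'm assuming the info fields are numeric and that every 20 record in a
section has the same number of columns:

def max_per_column(path):
    for header, records in sections(path):
        if not records:
            continue
        # Max of each column across all "20" records in this "15" section.
        column_maxes = [
            max(float(rec[i]) for rec in records)
            for i in range(len(records[0]))
        ]
        print(header, column_maxes)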

I was hoping I wouldn't have to split the file myself… These files get to be 
22GB each.


I thought a strategy similar to the one used for processing XML files might be
useful, but I don't think that applies here.

I would appreciate any help and insight.

Best Regards,
Mahesh
