Hi There,

I have a lot of small (~0.5 MB to 3 MB) XML files that I would like to 
process using Apache Pig. Since dealing with a lot of small files is 
problematic, I was thinking of creating SequenceFiles such that each sequence 
file is between 60 and 64 MB and no XML file is split across two SequenceFiles. 
Is there any utility that handles storing and loading these files from Pig? I 
could, for example, create a Pig job that reads these XML files and generates 
a few large sequence files such that no XML file is split across two 
SequenceFiles. I would then write another Pig job that loads these sequence 
files and analyzes them. Each of these XML files contains a lot of information 
for a given entity, and the nesting can be quite deep. Any help with this would 
be great. 
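
To make the packing constraint concrete, here is a minimal sketch of the grouping step only, not of the SequenceFile writing itself. It assumes the file sizes are known up front; the names plan_batches and max_bytes are hypothetical and not part of Pig or Hadoop. Each resulting batch would then be written as one SequenceFile, for example with the file name as the key and the whole XML document as the value, so no document is ever split.

```python
from typing import Iterable, List, Tuple

MB = 1024 * 1024


def plan_batches(
    files: Iterable[Tuple[str, int]],
    max_bytes: int = 64 * MB,
) -> List[List[str]]:
    """Greedily group (path, size_in_bytes) pairs into batches.

    Hypothetical helper: a batch is closed as soon as adding the next
    file would push it past max_bytes, so every file lands whole in
    exactly one batch. A single file larger than max_bytes gets a
    batch of its own.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for path, size in files:
        # Start a new batch if this file would overflow the current one.
        if current and current_size + size > max_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

With files of 0.5 MB to 3 MB, a greedy pass like this should leave most batches within about 3 MB of the cap, which roughly matches the 60 to 64 MB target; the last batch may be smaller. The sizes here are raw XML sizes and ignore any SequenceFile overhead or compression.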
