Hi there,
I have a lot of small (~0.5 MB to 3 MB) XML files that I would like to
process using Apache Pig. Since dealing with a lot of small files is
problematic, I was thinking of creating SequenceFiles such that each sequence
file is between 60 and 64 MB and no XML file is split across two SequenceFiles.
Is there any utility that handles storing and loading these files from Pig? I
can, for example, create a Pig job that would read these XML files and generate
a few large sequence files such that no XML file is split across two
SequenceFiles. I
will then write another Pig job that will load these sequence files and then
analyze them. Each of these XML files contains a lot of information for a given
entity and the nesting can be quite deep. Any help with this would be great.
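
To make the packing rule concrete, here is a rough sketch in plain Python (not Pig, and not using any Hadoop API) of how whole XML files could be grouped into batches of at most 64 MB each. The function name plan_batches and the (name, size) input format are made up for illustration. It only plans the grouping; it does not write SequenceFiles, and it ignores SequenceFile overhead such as headers, keys, and sync markers.

```python
from typing import Iterable, List, Tuple

MB = 1024 * 1024


def plan_batches(
    files: Iterable[Tuple[str, int]],
    max_bytes: int = 64 * MB,
) -> List[List[str]]:
    """Greedily group whole files into batches of at most max_bytes.

    `files` is an iterable of (file_name, size_in_bytes) pairs.
    A file is never split: if it does not fit in the current batch,
    the current batch is closed and the file starts a new one.
    (A single file larger than max_bytes would get a batch of its own.)
    """
    batches: List[List[str]] = []
    current: List[str] = []
    current_size = 0

    for name, size in files:
        # Close the current batch if adding this file would exceed the cap.
        if current and current_size + size > max_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size

    # Keep the final, possibly smaller, batch.
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    # Hypothetical input: 50 files of 3 MB each.
    sample = [(f"entity_{i}.xml", 3 * MB) for i in range(50)]
    print([len(batch) for batch in plan_batches(sample)])  # [21, 21, 8]
```

With inputs no larger than 3 MB, every batch except the last ends up above 61 MB, because a batch is only closed when the next file would push it past 64 MB. The last batch can be smaller.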