Hi,

I am working with large chunks of XML, anywhere from 1–50 GB each.  I am 
running several different MapReduce jobs on the XML to pull out various pieces 
of data, do analytics, etc.  I am using an XML input format based on the 
WikipediaInputFormat from the examples.  What I have been doing is 1) loading 
the entire XML into HDFS as a single document, and 2) parsing the XML on some 
tag <foo> and storing each of these instances as the content of a new row in 
Accumulo, using the name of the instance as the row id.  I then run other MR 
jobs that scan this table, pull out and parse the XML, and do whatever I need 
to do with the data.
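
For reference, here is a minimal sketch of the kind of per-<foo> write I am 
describing, using the standard Accumulo (1.5-style) BatchWriter client API -- 
the instance, table, user, and column names below are just placeholders, not 
what I actually use:

// Sketch only: store one parsed <foo> chunk as a single Accumulo row,
// with the instance name as the row id and the raw XML in one column.
import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.io.Text;

public class XmlChunkWriter {
  public static void main(String[] args) throws Exception {
    // placeholder connection details
    Connector conn = new ZooKeeperInstance("myInstance", "zkhost:2181")
        .getConnector("user", new PasswordToken("secret"));

    String fooName = "foo-0001";                        // row id = name of the <foo> instance
    byte[] fooXml = "<foo>...</foo>".getBytes("UTF-8"); // the parsed chunk

    BatchWriter bw = conn.createBatchWriter("xmlTable", new BatchWriterConfig());
    Mutation m = new Mutation(new Text(fooName));
    m.put(new Text("xml"), new Text("content"), new Value(fooXml));
    bw.addMutation(m);
    bw.close();
  }
}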

My question is, is there any advantage to storing the XML in Accumulo versus 
just leaving it in HDFS and parsing it from there?  Either as a large block of 
XML or as individual chunks, perhaps using Hadoop Archive to handle the 
small-file problem?  The actual XML will not be queried in and of itself but is 
part of other analysis processes.
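
(By "Hadoop Archive" I mean packing the individual chunk files into a HAR with 
the standard archive tool, roughly along these lines; the paths are only 
illustrative:

  hadoop archive -archiveName chunks.har -p /data/xml chunks /data/xml/out

and then pointing the MR jobs at the archive through the har:// filesystem, 
e.g. har:///data/xml/out/chunks.har/chunks.)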

Thanks,
Ralph


__________________________________________________
Ralph Perko
Pacific Northwest National Laboratory
