If your XML documents are really just lists of elements/objects, and
what you want to run your analytics on are subsets of those elements
(even across XML documents), then it makes sense to take a document
store approach similar to what the Wikipedia example has done. This
allows you to index specific portions of elements, create graphs and
apply visibility labels to specific attributes in a given object tree.
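
To make that concrete, here is a minimal sketch of how one record could be
flattened into per-field entries. It is not the Wikipedia example's actual
schema and it does not call the Accumulo client API. The `Entry` tuple, the
`doc` and `idx` column families, the `to_entries` helper, and the `private`
label are all hypothetical names chosen for illustration. Only the `<foo>`
tag and the idea of using the instance name as the row id come from this
thread.

```python
import xml.etree.ElementTree as ET
from typing import Iterator, Mapping, NamedTuple


class Entry(NamedTuple):
    """A key/value pair shaped like an Accumulo entry (timestamp omitted)."""
    row: str
    family: str
    qualifier: str
    visibility: str
    value: str


def to_entries(xml_text: str, record_tag: str, id_attr: str,
               visibility: Mapping[str, str]) -> Iterator[Entry]:
    """Yield one document entry and one index entry per child field."""
    root = ET.fromstring(xml_text)
    for record in root.iter(record_tag):
        row = record.get(id_attr)
        for field in record:
            text = (field.text or "").strip()
            # Per-field label; fields not listed get an empty visibility.
            label = visibility.get(field.tag, "")
            # Document entry: the field lives under its record's row.
            yield Entry(row, "doc", field.tag, label, text)
            # Index entry: the field value becomes the row and points back
            # at the record, so a lookup by value needs no XML parsing.
            yield Entry(f"{field.tag}:{text}", "idx", row, label, "")


SAMPLE = """<feed>
  <foo name="a1"><title>Alpha</title><owner>ralph</owner></foo>
  <foo name="b2"><title>Beta</title><owner>dave</owner></foo>
</feed>"""

# Sorting mimics the lexicographic key order a table would keep.
entries = sorted(to_entries(SAMPLE, "foo", "name", {"owner": "private"}))
for entry in entries:
    print(entry)
```

With a layout along these lines, a job that only needs one field can scan a
single column instead of re-parsing every record, and a label attached to a
sensitive field hides just that field rather than the whole record.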

On Wed, Jun 6, 2012 at 10:06 PM, David Medinets
<[email protected]> wrote:
> I can't think of any advantage to storing XML inside Accumulo. I am
> interested to learn some details about your view. Storing the
> extracted information and the location of the HDFS file that sourced
> the information does make sense to me. In fact, it might be useful to
> store file positions in Accumulo so it's easy to get back to specific
> spots in the XML file. That would help, for example, if you had an XML
> file with many records in it and there was no reason to immediately
> decompose each record.
>
> On Wed, Jun 6, 2012 at 9:57 PM, William Slacum <[email protected]> wrote:
>> There are advantages to using Accumulo to store the contents of your
>> XML documents, depending on their structure and what you want to end
>> up taking out of them. Are you trying to emulate the document store
>> pattern that the Wikipedia example uses?
>>
>> On Wed, Jun 6, 2012 at 4:20 PM, Perko, Ralph J <[email protected]> wrote:
>>> Hi,  I am working with large chunks of XML, anywhere from 1 – 50 GB each.  
>>> I am running several different MapReduce jobs on the XML to pull out 
>>> various pieces of data, do analytics, etc.  I am using an XML input type 
>>> based on the WikipediaInputFormat from the examples.  What I have been 
>>> doing is 1) loading the entire XML into HDFS as a single document 2) 
>>> parsing the XML on some tag <foo> and storing each one of these instances 
>>> as the content of a new row in Accumulo, using the name of the instance as 
>>> the row id.  I then run other MR jobs that scan this table, pull out and 
>>> parse the XML and do whatever I need to do with the data.
>>>
>>> My question is, is there any advantage to storing the XML in Accumulo 
>>> versus just leaving it in HDFS and parsing it from there?  Either as a 
>>> large block of XML or as individual chunks, perhaps  using Hadoop Archive 
>>> to handle the small-file problem?  The actual XML will not be queried in 
>>> and of itself but is part of other analysis processes.
>>>
>>> Thanks,
>>> Ralph
>>>
>>>
>>> __________________________________________________
>>> Ralph Perko
>>> Pacific Northwest National Laboratory
>>>
>>>
