My use-case is very similar to the Wikipedia example. I'm not sure what
you mean by the inflated key. Can you expand on that? I am not really
pulling out individual elements/attributes to simply store them apart from
the XML. Any element I pull out is part of a larger analytic process and
it is this result I store. I am doing some graph work based on
relationships between elements.
Example:
<books>
<book>
<title>basket weaving</title>
<author>bob</author>
<toc>...</toc>
<chapter number="1">lots of text here</chapter>
<chapter number="2">even more text here</chapter>
<citation>another book</citation>
</book>
</books>
Each "book" is a record. The book title is the row id. The content is
the XML <book>..</book>
My table then has other columns such as "word count" or "character count"
stored in the table.
Table example:

Row             Col family  Col qual        Value
basket weaving  content     xml             <book>...</book>
basket weaving  metrics     word count      12345
basket weaving  cites       another book    -- nothing meaningful
another book    cited by    basket weaving  -- nothing meaningful
I use the "cites" and "cited by" families for graph work.
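For concreteness, the ingest side looks roughly like this. This is a
minimal sketch against the BatchWriter API; the instance name, zookeepers,
credentials and the table name "books" are placeholders:

import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.io.Text;

public class BookIngest {
  public static void main(String[] args) throws Exception {
    // Placeholder connection details
    Connector conn = new ZooKeeperInstance("instance", "zk1:2181")
        .getConnector("user", "password".getBytes());
    BatchWriter writer = conn.createBatchWriter("books", 1000000L, 60000L, 2);

    String bookXml = "<book>...</book>"; // the extracted <book> element

    // All columns for one book live in one row, keyed by title
    Mutation m = new Mutation(new Text("basket weaving"));
    m.put(new Text("content"), new Text("xml"), new Value(bookXml.getBytes()));
    m.put(new Text("metrics"), new Text("word count"), new Value("12345".getBytes()));
    // Graph edge: only the key matters, the value is empty
    m.put(new Text("cites"), new Text("another book"), new Value(new byte[0]));
    writer.addMutation(m);

    // Reverse edge written to the cited book's row
    Mutation rev = new Mutation(new Text("another book"));
    rev.put(new Text("cited by"), new Text("basket weaving"), new Value(new byte[0]));
    writer.addMutation(rev);

    writer.close();
  }
}

Writing both the forward and reverse edge means either direction of the
citation graph can be walked with a single row scan.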
On 6/6/12 7:50 PM, "Josh Elser" <[email protected]> wrote:
>+1, Bill. Assuming you aren't doing anything crazy in your XML files,
>the wikipedia example should get you pretty far. That being said, the
>structure used in the wikipedia example doesn't handle large lists of
>elements -- short explanation: an attribute of a document is stored as
>one key-value pair, so if you have a lot of large lists, you inflate the
>key, which does bad things. With that in mind, there are small changes you can
>make to the table structure to store those lists more efficiently and
>still maintain the semantic representation (Bill's graph comment).
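Just to check my understanding of the fix: rather than one key-value pair
holding a whole list, would each element get its own qualifier? A sketch,
reusing the imports and writer from my example above (the zero-padded
qualifier scheme is my guess at what you mean):

// One key-value pair per list element, so no single key or value balloons
Mutation m = new Mutation(new Text("basket weaving"));
java.util.List<String> chapters =
    java.util.Arrays.asList("lots of text here", "even more text here");
for (int i = 0; i < chapters.size(); i++) {
  // Zero-padded qualifiers keep chapters sorted numerically within the row
  m.put(new Text("chapter"), new Text(String.format("%04d", i + 1)),
        new Value(chapters.get(i).getBytes()));
}
writer.addMutation(m);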
>
>David, ignoring any issues of data locality of the blocks in your large
>XML files, storing byte offsets into a hierarchical data structure (XML)
>seems like a sub-optimal solution to me. Aside from losing the hierarchy
>knowledge, if you have a skewed distribution of elements in the XML
>document, you can't get good locality in your query/analytic. What was
>your idea behind storing the offsets?
>
>- Josh
>
>On 6/6/2012 10:19 PM, William Slacum wrote:
>> If your XML documents are really just lists of elements/objects, and
>> what you want to run your analytics on are subsets of those elements
>> (even across XML documents), then it makes sense to take a document
>> store approach similar to what the Wikipedia example has done. This
>> allows you to index specific portions of elements, create graphs and
>> apply visibility labels to specific attributes in a given object tree.
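The visibility point is worth calling out: a label attaches to an
individual key-value pair, so a single attribute in the object tree can
be restricted. A sketch, again reusing the setup from my example above;
the label expression "analyst&project-x" is made up:

import org.apache.accumulo.core.security.ColumnVisibility;

// Scanners only see this column if their authorizations satisfy the label
Mutation m = new Mutation(new Text("basket weaving"));
m.put(new Text("content"), new Text("author"),
      new ColumnVisibility("analyst&project-x"),
      new Value("bob".getBytes()));
writer.addMutation(m);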
>>
>> On Wed, Jun 6, 2012 at 10:06 PM, David Medinets
>> <[email protected]> wrote:
>>> I can't think of any advantage to storing XML inside Accumulo. I am
>>> interested to learn some details about your view. Storing the
>>> extracted information and the location of the HDFS file that sourced
>>> the information does make sense to me. In fact, it might be useful to
>>> store file positions in Accumulo so it's easy to get back to specific
>>> spots in the XML file. For example, if you had an XML file with many
>>> records in it and there was no reason to immediately decompose each
>>> record.
>>>
>>> On Wed, Jun 6, 2012 at 9:57 PM, William Slacum <[email protected]>
>>>wrote:
>>>> There are advantages to using Accumulo to store the contents of your
>>>> XML documents, depending on their structure and what you want to end
>>>> up taking out of them. Are you trying to emulate the document store
>>>> pattern that the Wikipedia example uses?
>>>>
>>>> On Wed, Jun 6, 2012 at 4:20 PM, Perko, Ralph J <[email protected]>
>>>>wrote:
>>>>> Hi, I am working with large chunks of XML, anywhere from 1-50 GB
>>>>>each. I am running several different MapReduce jobs on the XML to
>>>>>pull out various pieces of data, do analytics, etc. I am using an
>>>>>XML input type based on the WikipediaInputFormat from the examples.
>>>>>What I have been doing is 1) loading the entire XML into HDFS as a
>>>>>single document, 2) parsing the XML on some tag <foo> and storing each
>>>>>one of these instances as the content of a new row in Accumulo, using
>>>>>the name of the instance as the row id. I then run other MR jobs
>>>>>that scan this table, pull out and parse the XML and do whatever I
>>>>>need to do with the data.
>>>>>
>>>>> My question is, is there any advantage to storing the XML in
>>>>>Accumulo versus just leaving it in HDFS and parsing it from there?
>>>>>Either as a large block of XML or as individual chunks, perhaps
>>>>>using Hadoop Archive to handle the small-file problem? The actual
>>>>>XML will not be queried in and of itself but is part of other analysis
>>>>>processes.
>>>>>
>>>>> Thanks,
>>>>> Ralph
>>>>>
>>>>>
>>>>> __________________________________________________
>>>>> Ralph Perko
>>>>> Pacific Northwest National Laboratory
>>>>>
>>>>>
>