My "inflated key" comment, I'll pull from Eric Newton's comment on the "Table design" thread:

"Accumulo will accomodate keys that are very large (like 100K) but I don't recommend it. It makes indexes big and slows down just about every operation"

As applied to your example, you might generate the following keys if you took the wikisearch approach:

# Represent your document as such: the row "4" being an arbitrary bucket, and the CF "1234abcd" being some unique identifier for your document (a hash of <book> for example)

4   1234abcd:title\x00basket weaving
4   1234abcd:author\x00bob
4   1234abcd:toc\x00stuff
4   1234abcd:citation\x00another book

# Then some indices inside the same row (bucket), creating an in-partition index over the fields of your data. You could also shove the tokenized content from your chapters in here.
4   fi\x00title:basket weaving\x001234abcd
4   fi\x00author:bob\x001234abcd
4   fi\x00toc:stuff\x001234abcd
4   fi\x00citation:another book\x001234abcd
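
# As a rough sketch, those two kinds of entries could be written together with the Java client API, something like the snippet below. It assumes a BatchWriter is already open against the partitioned table; the bucket, document id, and field names are whatever your ingest produces, and the empty Values are just placeholders.

import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.MutationsRejectedException;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.io.Text;

public class PartitionedDocWriter {

    private static final String NUL = "\u0000";              // the \x00 separator shown above
    private static final Value EMPTY = new Value(new byte[0]);

    // Writes one field of a document plus its in-partition ("fi") index entry,
    // both inside the same row/bucket so a single tablet can answer field queries.
    static void writeField(BatchWriter bw, String bucket, String docId,
                           String field, String value) throws MutationsRejectedException {
        Mutation m = new Mutation(new Text(bucket));
        // document record:    row=bucket, cf=docId, cq=field\x00value
        m.put(new Text(docId), new Text(field + NUL + value), EMPTY);
        // in-partition index: row=bucket, cf=fi\x00field, cq=value\x00docId
        m.put(new Text("fi" + NUL + field), new Text(value + NUL + docId), EMPTY);
        bw.addMutation(m);
    }
}

# e.g. writeField(bw, "4", "1234abcd", "title", "basket weaving") produces the first pair of keys shown above.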

# For those big chapters, store them off to the side, perhaps in their own locality group, which will keep this data in separate files.
4 chapters:1234abcd\x001    Value:byte[chapter one data]
4 chapters:1234abcd\x002    Value:byte[chapter two data]
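
# If you go the locality group route, the table operations API is where that gets configured. A rough sketch, assuming a Connector is already in hand; the group name "chapterData" is just illustrative.

import java.util.Collections;
import java.util.Map;
import java.util.Set;

import org.apache.accumulo.core.client.Connector;
import org.apache.hadoop.io.Text;

public class ChapterLocalityGroup {

    // Put the "chapters" column family in its own locality group so the bulky
    // chapter text ends up in separate files from the small metadata/index keys.
    static void configure(Connector conn, String table) throws Exception {
        Map<String, Set<Text>> groups =
            Collections.singletonMap("chapterData", Collections.singleton(new Text("chapters")));
        conn.tableOperations().setLocalityGroups(table, groups);
        // Existing data is regrouped as files are rewritten; forcing a compaction
        // through tableOperations().compact(...) makes that happen right away.
    }
}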

# Then perhaps some records pointing to data you expect users to query on in a separate table (inverted index)
basket weaving    title:4\x001234abcd
bob    author:4\x001234abcd
another book    citation:4\x001234abcd
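
# The corresponding writes into that separate index table would look something like this (a rough sketch, same caveats as above; the term, field, bucket, and doc id are whatever your ingest extracted):

import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.MutationsRejectedException;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.io.Text;

public class GlobalIndexWriter {

    // Global index entry: row=term, cf=field, cq=bucket\x00docId. A query on a
    // term hits this table first to find out which buckets/documents to pull
    // from the partitioned table.
    static void writeEntry(BatchWriter indexWriter, String term, String field,
                           String bucket, String docId) throws MutationsRejectedException {
        Mutation m = new Mutation(new Text(term));
        m.put(new Text(field), new Text(bucket + "\u0000" + docId), new Value(new byte[0]));
        indexWriter.addMutation(m);
    }
}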

- Josh

On 6/7/2012 10:48 AM, Perko, Ralph J wrote:
My use-case is very similar to the Wikipedia example. I'm not sure what
you mean by the inflated key.  Can you expand on that?  I am not really
pulling out individual elements/attributes to simply store them apart from
the XML.  Any element I pull out is part of a larger analytic process and
it is this result I store.  I am doing some graph work based on
relationships between elements.

Example:

<books>
   <book>
     <title>basket weaving</title>
     <author>bob</author>
     <toc>…</toc>
     <chapter number="1">lots of text here</chapter>
     <chapter number="2">even more text here</chapter>
     <citation>another book</citation>
   </book>
</books>


Each "book" is a record.  The book title is the row id.  The content is
the XML <book>..</book>

My table then has other columns, such as "word count" or "character count",
stored alongside the content.

Table example:

Row: basket weaving
Col family: content
Col qual: xml
Value: <book>…</book>


Row: basket weaving
Col family: metrics
Col qual: word count
Value: 12345

Row: basket weaving
Col family: cites
Col qual: another book
Value: -- nothing meaningful


Row: another book
Col family: cited by
Col qual: basket weaving
Value: -- nothing meaningful

I use the "cites" and "cited by" column families for graph work.
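
With that layout a one-hop traversal is just a single-row scan. A rough sketch against the example table above; the table name and the empty Authorizations are placeholders:

import java.util.ArrayList;
import java.util.List;
import java.util.Map.Entry;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.client.TableNotFoundException;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.io.Text;

public class CitationGraph {

    // One hop of the citation graph: every "cites" qualifier in the book's row
    // names an outgoing edge, so scanning a single row returns the adjacency list.
    static List<String> booksCitedBy(Connector conn, String table, String title)
            throws TableNotFoundException {
        Scanner scanner = conn.createScanner(table, new Authorizations());
        scanner.setRange(new Range(new Text(title)));     // one row = one book
        scanner.fetchColumnFamily(new Text("cites"));     // only the edge columns
        List<String> cited = new ArrayList<String>();
        for (Entry<Key, Value> entry : scanner) {
            cited.add(entry.getKey().getColumnQualifier().toString());
        }
        return cited;
    }
}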



On 6/6/12 7:50 PM, "Josh Elser"<[email protected]>  wrote:

+1, Bill. Assuming you aren't doing anything crazy in your XML files,
the wikipedia example should get you pretty far. That being said, the
structure used in the wikipedia example doesn't handle large lists of
elements -- short explanation: an attribute of a document is stored as
one key-value pair, so if you have a lot of large lists, you inflate the
key, which does bad things. With that in mind, there are small changes you can
make to the table structure to store those lists more efficiently and
still maintain the semantic representation (Bill's graph comment).

David, ignoring any issues of data locality of the blocks in your large
XML files, storing byte offsets into a hierarchical data structure (XML)
seems like a sub-optimal solution to me. Aside from losing the hierarchy
knowledge, if you have a skewed distribution of elements in the XML
document, you can't get good locality in your query/analytic. What was
your idea behind storing the offsets?

- Josh

On 6/6/2012 10:19 PM, William Slacum wrote:
If your XML documents are really just lists of elements/objects, and
what you want to run your analytics on are subsets of those elements
(even across XML documents), then it makes sense to take a document
store approach similar to what the Wikipedia example has done. This
allows you to index specific portions of elements, create graphs and
apply visibility labels to specific attributes in a given object tree.

On Wed, Jun 6, 2012 at 10:06 PM, David Medinets
<[email protected]>   wrote:
I can't think of any advantage to storing XML inside Accumulo. I am
interested to learn some details about your view. Storing the
extracted information and the location of the HDFS file that sourced
the information does make sense to me. In fact, it might be useful to
store file positions in Accumulo so it's easy to get back to specific
spots in the XML file. This would be useful, for example, if you had an XML
file with many records in it and there was no reason to immediately decompose
each record.

On Wed, Jun 6, 2012 at 9:57 PM, William Slacum<[email protected]>
wrote:
There are advantages to using Accumulo to store the contents of your
XML documents, depending on their structure and what you want to end
up taking out of them. Are you trying to emulate the document store
pattern that the Wikipedia example uses?

On Wed, Jun 6, 2012 at 4:20 PM, Perko, Ralph J<[email protected]>
wrote:
Hi,  I am working with large chunks of XML, anywhere from 1–50 GB
each.  I am running several different MapReduce jobs on the XML to
pull out various pieces of data, do analytics, etc.  I am using an
XML input type based on the WikipediaInputFormat from the examples.
What I have been doing is 1) loading the entire XML into HDFS as a
single document, 2) parsing the XML on some tag <foo> and storing each
one of these instances as the content of a new row in Accumulo, using
the name of the instance as the row id.  I then run other MR jobs
that scan this table, pull out and parse the XML and do whatever I
need to do with the data.
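
For reference, the write side of that pipeline can be sketched as a mapper that emits (table name, Mutation) pairs for AccumuloOutputFormat. Everything here is illustrative: the table name, the <name> tag used for the row id, the extractName helper, and the assumption that the input format hands each <foo> chunk to the mapper as Text; the AccumuloOutputFormat job configuration is omitted since its setup calls differ between versions.

import java.io.IOException;

import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Assumes each map input value is one <foo>...</foo> chunk from the XML input
// format; the mapper emits (table name, Mutation) pairs for AccumuloOutputFormat.
public class XmlChunkMapper extends Mapper<LongWritable, Text, Text, Mutation> {

    private static final Text TABLE = new Text("xmlRecords");   // illustrative table name

    @Override
    protected void map(LongWritable offset, Text xmlChunk, Context context)
            throws IOException, InterruptedException {
        String xml = xmlChunk.toString();
        String rowId = extractName(xml);                         // hypothetical helper
        Mutation m = new Mutation(new Text(rowId));
        m.put(new Text("content"), new Text("xml"), new Value(xml.getBytes("UTF-8")));
        context.write(TABLE, m);
    }

    // Placeholder parse just for the sketch; a real job would use a proper XML parser.
    private String extractName(String xml) {
        int start = xml.indexOf("<name>") + "<name>".length();
        int end = xml.indexOf("</name>");
        return (start > 5 && end > start) ? xml.substring(start, end) : "unknown";
    }
}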

My question is, is there any advantage to storing the XML in
Accumulo versus just leaving it in HDFS and parsing it from there?
Either as a large block of XML or as individual chunks, perhaps
using Hadoop Archive to handle the small-file problem?  The actual
XML will not be queried in and of itself but is part of other analysis
processes.

Thanks,
Ralph


__________________________________________________
Ralph Perko
Pacific Northwest National Laboratory



