Re: Accumulo indexing social media data

mohit.kaushik Wed, 08 Jul 2015 05:12:26 -0700

Thanks Josh, I am testing the approach. I have one more consideration, which is "CONDITIONAL MUTATIONS". I have stored the fields in the CQ according to the following schema.


Row          <partition id>
               /        \
CF         <doc>      <index>
             |           |
CQ        <UUID>      <Term>
             |           |
          <field>     <UUID>
             |           |
             |        <Field>
Value     <value>

Documents have a url field. If the url already exists, I want the mutation to be skipped rather than added. But since I don't know the partition ID, how can I apply conditional mutations here to check for the existence of the url?


-Mohit kaushik

On 07/06/2015 02:04 AM, Josh Elser wrote:

If your primary search criteria is on a single-term, a term-based reverse index is going to serve you much better than a document-partitioned index.
Document partitioned indexes can better support concurrency since you have some amount of hash-partitioning involved in the partition ID (sometimes you can include other data in the partition ID to further restrict the "search space"). However, you always need to query each partition to get an answer for a single term. You'll have much higher latency using this approach than a term-partitioned index.
To answer your question about choosing a partition ID, it typically revolves around the number of TabletServers you want a single query to parallelize on. For example, if you can assume to have ~10 queries running at one time, you don't want each query to communicate with 90% of your TabletServers. If you only run one or two queries at a time, you would want to talk to as many TabletServers as you can.
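A minimal sketch of the hash-partitioning idea, assuming the partition ID is a stable hash of some document key modulo a fixed partition count. The class name `PartitionIds`, the CRC32 hash, and the zero-padded row format are illustrative assumptions, not anything prescribed by Accumulo.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.zip.CRC32;

// Hypothetical helper: derives a partition ID from a document key.
final class PartitionIds {

    // Maps a key (for example a document UUID or url) to a partition in [0, numPartitions).
    static int partitionFor(String key, int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        // getValue() is an unsigned 32-bit value held in a long, so the remainder is non-negative.
        return (int) (crc.getValue() % numPartitions);
    }

    // Zero-padded so that partition rows sort in numeric order as strings.
    static String rowFor(String key, int numPartitions) {
        return String.format(Locale.ROOT, "%04d", partitionFor(key, numPartitions));
    }
}
```

The partition count is the knob discussed above: fewer partitions means each query touches fewer TabletServers, more partitions means more parallelism per query. Because the mapping is deterministic, the partition for a given key can be recomputed later without a lookup.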
To further complicate things, you can also try to apply a partition ID as a suffix on term-based indexes to work around queries such as "the" or "and" which are prone to be extremely common terms. With a simple term-based index, all records for this term would be contained in a single Tablet on a single TabletServer. This ultimately comes down to the amount and distribution of data you're storing.
Come back with more information, and we can give some more recommendations. Honestly, you probably won't get this right the first time, but this is expected :). What you can do is:
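The partition-suffix idea can be sketched as follows, assuming the index row is the term, a null-byte separator, and a two-digit shard chosen from the document ID. The class name `ShardedTermRows`, the separator, and the use of `String.hashCode` are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: builds term-index rows that carry a shard suffix.
final class ShardedTermRows {

    // Row under which one (term, document) index entry would be written.
    static String indexRow(String term, String docId, int numShards) {
        int shard = Math.floorMod(docId.hashCode(), numShards);
        return term + '\u0000' + String.format(Locale.ROOT, "%02d", shard);
    }

    // Every row a query for the term has to read, one per shard.
    static List<String> rowsToScan(String term, int numShards) {
        List<String> rows = new ArrayList<>();
        for (int shard = 0; shard < numShards; shard++) {
            rows.add(term + '\u0000' + String.format(Locale.ROOT, "%02d", shard));
        }
        return rows;
    }
}
```

The trade-off is the one described above: entries for a very common term are spread over `numShards` rows that can live on different Tablets, at the cost of a query for any term reading `numShards` rows instead of one.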
* Set some expectations on performance
* Do some simple math on actual data (estimate parallelism, latency, etc)
* Prototype and test it

mohit.kaushik wrote:
Hi,

We have a social media application currently using MongoDB to serve
documents. We decided to shift it to Accumulo. I am designing the
schema and indexing approach but am having some difficulties in managing
indexes, and have a few concerns about generating UUIDs in Accumulo.

UUID: The data is being indexed in MongoDB 24 hours. MongoDB generates
a 12-byte UUID sorted on current time and good for a multi-user,
multi-process environment (<time> <MAC address> <process id> <client
counter>), which is perfect. But if I concatenate the time, MAC address,
process id and client counter, these are around 28 to 30 characters, which
means around 60 bytes. And if I store it in reverse order so that the
latest document shows on top, the size would be doubled (more than 120
bytes), as described by David Medinets. Is there any way to store this
UUID in less space, or any other efficient way to generate a UUID
reverse-sorted on current time?
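One possible approach, sketched under stated assumptions: pack the fields as fixed-width binary instead of decimal text, and store `Long.MAX_VALUE - timestamp` so that newer IDs sort first under Accumulo's lexicographic byte ordering. The class name `ReverseTimeId`, the 16-byte layout, and the `nodeId` parameter (a stand-in for the MAC address and process id) are hypothetical.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper: 16-byte IDs that sort newest-first as unsigned bytes.
final class ReverseTimeId {
    private static final AtomicInteger COUNTER = new AtomicInteger();

    // Layout (big-endian): 8 bytes reversed time | 4 bytes node id | 4 bytes counter.
    static byte[] newId(long epochMillis, int nodeId) {
        if (epochMillis < 0) {
            throw new IllegalArgumentException("epochMillis must be non-negative");
        }
        return ByteBuffer.allocate(16)
                .putLong(Long.MAX_VALUE - epochMillis) // larger time gives a smaller prefix
                .putInt(nodeId)
                .putInt(COUNTER.getAndIncrement())
                .array();
    }

    // Unsigned lexicographic comparison, the order in which sorted keys are stored.
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }
}
```

For example, `ReverseTimeId.newId(System.currentTimeMillis(), 7)` yields 16 bytes, and an ID created at a later millisecond compares lower than an earlier one. Uniqueness depends on `nodeId` being distinct per writer process, which this sketch simply assumes.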

Indexing: I need to retrieve documents from the index based on some query
on fields. I found two approaches to index documents in Accumulo:
(1) Term-based reverse indexing and
(2) Document-partitioned indexing

As Adam described in this video
https://www.youtube.com/watch?v=Ck70G6OuGT4, document-partitioned
indexing would look like this:

Row          <partition id>
               /        \
CF         <doc>      <index>
             |           |
CQ        <UUID>      <Term>
             |           |
          <field>     <UUID>
             |           |
             |        <Field>
Value     <value>

If I just want to serve documents based on a single-term query, would it
be better to store <term> in the column family so that I can restrict a
scan to a single term in the CF? It would reduce the data scanned by a
good factor. What are the other pros and cons of this approach?
And how should I decide on the partition_Id if I am storing tweets on a
3-node cluster?

Regards
Mohit Kaushik


