Thanks Josh, I am testing the approach. I have one more concern, about
conditional mutations. I have stored the fields in the CQ according to
the following schema.
Row            <partition id>
                /          \
CF          <doc>        <index>
              |             |
CQ         <UUID>        <Term>
              |             |
           <field>       <UUID>
              |             |
              |          <Field>
Value      <value>
Each document has a url field. If a document with the same url already
exists, I want the mutation to be skipped rather than added. But I don't
know the partition ID in advance. How can I apply conditional mutations
here to check whether the url already exists?
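One possible way to make this check feasible, sketched here under stated assumptions: Accumulo conditional mutations check conditions within a single row, so the row has to be known before writing. If the partition ID is derived from a hash of the url, the row can be recomputed from the url alone. The sketch below uses an in-memory `FakeTable` as a stand-in for a table; `FakeTable`, `put_if_absent`, `partition_for_url`, `add_document`, and `NUM_PARTITIONS` are hypothetical names, not Accumulo APIs. Against a real cluster the same idea would go through a ConditionalWriter with a condition that the url column is absent.

```python
import threading
import zlib

NUM_PARTITIONS = 16  # hypothetical partition count


def partition_for_url(url):
    """Derive the partition (row) from the url so it can be recomputed later."""
    return f"{zlib.crc32(url.encode('utf-8')) % NUM_PARTITIONS:04d}"


class FakeTable:
    """In-memory stand-in for a table; NOT an Accumulo API."""

    def __init__(self):
        self._cells = {}
        self._lock = threading.Lock()

    def put_if_absent(self, row, cf, cq, value):
        """Write the cell only if it does not exist yet; return True if written."""
        key = (row, cf, cq)
        with self._lock:
            if key in self._cells:
                return False
            self._cells[key] = value
            return True


def add_document(table, url, doc_id):
    """Skip the write when the url has already been stored in its partition."""
    row = partition_for_url(url)
    return table.put_if_absent(row, "url", url, doc_id)
```

With this layout, a second call to `add_document` with the same url returns `False` and writes nothing. The trade-off is that partitioning by url hash fixes how documents are spread over partitions, which may or may not suit the query side.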
-Mohit kaushik
On 07/06/2015 02:04 AM, Josh Elser wrote:
If your primary search criterion is a single term, a term-based
reverse index is going to serve you much better than a
document-partitioned index.
Document-partitioned indexes can better support concurrency since you
have some amount of hash-partitioning involved in the partition ID
(sometimes you can include other data in the partition ID to further
restrict the "search space"). However, you always need to query each
partition to get an answer for a single term. You'll have much higher
latency using this approach than a term-partitioned index.
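To illustrate the trade-off above, here is a minimal in-memory sketch (plain dictionaries, not Accumulo) that builds both index styles over three made-up documents and counts how many rows a single-term query has to visit. `NUM_PARTITIONS`, `DOCS`, and all function names are assumptions for illustration only.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 4  # hypothetical partition count

DOCS = {  # made-up sample documents
    "d1": "accumulo schema design",
    "d2": "mongodb schema",
    "d3": "accumulo index",
}


def partition_of(doc_id):
    """Hash-partition a document ID into one of NUM_PARTITIONS rows."""
    return zlib.crc32(doc_id.encode("utf-8")) % NUM_PARTITIONS


doc_partitioned = defaultdict(set)  # row = partition, entries = (term, doc_id)
term_index = defaultdict(set)       # row = term, entries = doc_id
for doc_id, text in DOCS.items():
    for term in text.split():
        doc_partitioned[partition_of(doc_id)].add((term, doc_id))
        term_index[term].add(doc_id)


def query_doc_partitioned(term):
    """Every partition must be visited, since any of them may hold the term."""
    hits = set()
    rows_scanned = 0
    for partition in range(NUM_PARTITIONS):
        rows_scanned += 1
        for indexed_term, doc_id in doc_partitioned.get(partition, ()):
            if indexed_term == term:
                hits.add(doc_id)
    return hits, rows_scanned


def query_term_index(term):
    """A single row (the term itself) holds all matching document IDs."""
    return set(term_index.get(term, ())), 1
```

Both queries return the same documents for a term, but the document-partitioned query visits `NUM_PARTITIONS` rows while the term index visits one.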
To answer your question about choosing a partition ID, it typically
revolves around the number of TabletServers you want a single query to
parallelize on. For example, if you expect ~10 queries
running at one time, you don't want each query to communicate with 90%
of your TabletServers. If you only run one or two queries at a time,
you would want to talk to as many TabletServers as you can.
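The sizing argument above can be turned into rough arithmetic. This is a back-of-the-envelope sketch, assuming each partition is one tablet and tablets are spread evenly over the TabletServers; both function names and the example numbers are hypothetical.

```python
def servers_per_query(num_partitions, num_tservers):
    """Upper bound on the TabletServers one query contacts."""
    return min(num_partitions, num_tservers)


def scans_per_server(concurrent_queries, num_partitions, num_tservers):
    """Average number of simultaneous partition scans landing on each server."""
    return concurrent_queries * num_partitions / num_tservers
```

For example, with 20 TabletServers and 18 partitions, one query touches 18 servers (90% of the cluster), and 10 concurrent queries put about 9 scans on every server at once. With a single query at a time, 20 partitions would use every server exactly once.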
To further complicate things, you can also try to apply a partition ID
as a suffix on term-based indexes to work around queries such as "the"
or "and" which are prone to be extremely common terms. With a simple
term-based index, all records for this term would be contained in a
single Tablet on a single TabletServer. This ultimately comes down to
the amount and distribution of data you're storing.
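A minimal sketch of the suffix idea, assuming the row is built as the term, a separator byte, and a zero-padded shard number derived from the document ID. `NUM_SHARDS`, `SEP`, `index_row`, and `rows_for_term` are hypothetical names chosen for this example.

```python
import zlib

NUM_SHARDS = 8  # hypothetical number of suffixes per term
SEP = "\x00"    # separator between the term and its shard suffix


def index_row(term, doc_id):
    """Row for one index entry: the term plus a shard chosen from the doc ID."""
    shard = zlib.crc32(doc_id.encode("utf-8")) % NUM_SHARDS
    return f"{term}{SEP}{shard:02d}"


def rows_for_term(term):
    """All rows a query must read to collect every entry for the term."""
    return [f"{term}{SEP}{shard:02d}" for shard in range(NUM_SHARDS)]
```

Entries for a very common term then spread over `NUM_SHARDS` rows instead of one, at the cost of every term query reading `NUM_SHARDS` rows, including queries for rare terms.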
Come back with more information, and we can give some more
recommendations. Honestly, you probably won't get this right the first
time, but this is expected :). What you can do is:
* Set some expectations on performance
* Do some simple math on actual data (estimate parallelism, latency, etc)
* Prototype and test it
mohit.kaushik wrote:
Hi,
We have a social media application currently using MongoDB to serve
documents. We decided to move it to Accumulo. I am designing the
schema and indexing approach but am having some difficulties managing
indexes, and I have a few concerns about generating UUIDs in Accumulo.
UUID: Data is being indexed in MongoDB 24 hours a day. MongoDB generates
a 12-byte UUID that sorts on current time and works well in a multi-user,
multi-process environment (<time> <MAC addr> <process id> <client
counter>), which is perfect. But if I concatenate the time, MAC address,
process ID, and client counter, the result is around 28 to 30 characters,
which means around 60 bytes. And if I store it in reverse order so that
the latest document shows on top, the size would be doubled (more than
120 bytes), as described by David Medinets. Is there any way to store
this UUID in a smaller size, or any other efficient way to generate a
UUID that is reverse-sorted on current time?
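One possible approach, sketched under assumptions: keep the ID as 12 raw bytes rather than text, and subtract the timestamp from its maximum value so that newer IDs compare lower byte-for-byte. The layout below (4-byte inverted seconds, 3-byte machine ID, 2-byte process ID, 3-byte inverted counter) is an illustration modeled on the fields listed above, not MongoDB's exact format, and `reverse_time_id` is a hypothetical name.

```python
import os
import struct
import threading
import time
from typing import Optional

_MAX_U32 = 0xFFFFFFFF
_MAX_U24 = 0xFFFFFF
_counter_lock = threading.Lock()
_counter = 0


def reverse_time_id(machine: bytes, now: Optional[float] = None) -> bytes:
    """Return a 12-byte ID whose lexicographic byte order is newest-first."""
    global _counter
    with _counter_lock:
        _counter = (_counter + 1) & _MAX_U24  # wraps after 2**24 IDs
        count = _counter
    seconds = int(time.time() if now is None else now)
    inverted_time = _MAX_U32 - (seconds & _MAX_U32)
    inverted_count = _MAX_U24 - count  # later IDs in the same second sort first
    pid = os.getpid() & 0xFFFF
    return (
        struct.pack(">I", inverted_time)   # 4 bytes, big-endian
        + machine[:3].ljust(3, b"\x00")    # 3 bytes
        + struct.pack(">H", pid)           # 2 bytes
        + inverted_count.to_bytes(3, "big")  # 3 bytes
    )
```

The result stays at 12 bytes, or 24 characters if hex-encoded for readability. Ordering within the same second is only reliable per process, and only until the counter wraps.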
Indexing: I need to retrieve documents from an index based on queries
on fields. I found two approaches to indexing documents in Accumulo:
(1) Term-based reverse indexing and
(2) Document-partitioned indexing
as Adam described in this video:
https://www.youtube.com/watch?v=Ck70G6OuGT4. If I use
document-partitioned indexing, the schema is:
Row            <partition id>
                /          \
CF          <doc>        <index>
              |             |
CQ         <UUID>        <Term>
              |             |
           <field>       <UUID>
              |             |
              |          <Field>
Value      <value>
If I just want to serve documents based on single-term queries, would it
be better to store <term> in the column family so that I can restrict
the scan to a single term in the CF? It would reduce the data read by a
good factor. What are the other pros and cons of this approach?
And how should I decide on the partition ID if I am storing tweets on a
3-node cluster?
Regards
Mohit Kaushik