Re: fastest way to bulk insert in geode

Michael Stolz Fri, 03 Mar 2017 11:26:07 -0800

And of course, it depends on your access patterns.
If all access is by primary key, then CacheLoaders are a viable option.
If access is by query on non-primary key fields, then ALL data needs to be
pre-loaded, otherwise you won't know if you got the right query result.
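
To make the difference concrete, here is a minimal plain-Java sketch. It
deliberately does not use the Geode API: LazyCacheSketch, its loader
Function, and its query method are hypothetical stand-ins for a Region with
a CacheLoader and for a query on a non-key field. It only illustrates that a
key lookup can fall through to the loader on a miss, while a query sees only
what is already in memory.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical stand-in for a lazily loaded region; not a Geode class.
class LazyCacheSketch<K, V> {
    private final Map<K, V> entries = new ConcurrentHashMap<>();
    private final Function<K, V> loader; // plays the role of a CacheLoader

    LazyCacheSketch(Function<K, V> loader) {
        this.loader = loader;
    }

    // Access by key: a miss falls through to the loader and caches the value.
    V get(K key) {
        return entries.computeIfAbsent(key, loader);
    }

    // Access by value: only entries that are already in memory are visible.
    List<V> query(Predicate<V> filter) {
        return entries.values().stream().filter(filter).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<Integer, String> backingStore = new HashMap<>();
        backingStore.put(1, "red");
        backingStore.put(2, "blue");
        backingStore.put(3, "red");

        LazyCacheSketch<Integer, String> cache = new LazyCacheSketch<>(backingStore::get);
        cache.get(1); // loads key 1 only

        // The backing store has two "red" values, but only one is loaded so far.
        System.out.println(cache.query(v -> v.equals("red")).size()); // prints 1

        cache.get(3); // now the second "red" value is in memory
        System.out.println(cache.query(v -> v.equals("red")).size()); // prints 2
    }
}
```

The first query result is incomplete rather than wrong-looking, which is
exactly why it is hard to notice without pre-loading.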


So for situations where pre-loading is either required or desirable, putAll
is probably the best tool, but don't try to put too much at once, because
that will bog down at the network layer. Keep each call to putAll down to a
couple of hundred objects, and tune that number to get the best overall
throughput.
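
A minimal sketch of that batching, assuming the data is already in a Map.
BatchedPutAll and putAllInBatches are hypothetical names, and the sink is a
plain Consumer so the sketch runs without Geode on the classpath; with a
real Region you would pass region::putAll as the sink. The batch size of 200
is only a starting point to tune.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical helper; not part of the Geode API.
class BatchedPutAll {

    // Splits source into chunks of at most batchSize entries and hands each
    // chunk to sink. Returns the number of sink calls that were made.
    static <K, V> int putAllInBatches(Map<K, V> source, int batchSize,
                                      Consumer<Map<K, V>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int calls = 0;
        Map<K, V> batch = new HashMap<>();
        for (Map.Entry<K, V> entry : source.entrySet()) {
            batch.put(entry.getKey(), entry.getValue());
            if (batch.size() == batchSize) {
                sink.accept(batch); // e.g. region.putAll(batch)
                calls++;
                batch = new HashMap<>();
            }
        }
        if (!batch.isEmpty()) {
            sink.accept(batch); // flush the final, partially filled batch
            calls++;
        }
        return calls;
    }

    public static void main(String[] args) {
        Map<Integer, String> data = new HashMap<>();
        for (int i = 0; i < 450; i++) {
            data.put(i, "value-" + i);
        }
        List<Integer> batchSizes = new ArrayList<>();
        // The sink here just records batch sizes; swap in region::putAll.
        int calls = putAllInBatches(data, 200, b -> batchSizes.add(b.size()));
        System.out.println(calls + " calls, sizes " + batchSizes);
    }
}
```

Timing the loop for a few different batch sizes against your own cluster is
the simplest way to find the number that gives the best throughput.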

--
Mike Stolz
Principal Engineer, GemFire Product Manager
Mobile: +1-631-835-4771

On Fri, Mar 3, 2017 at 1:10 PM, John Blum <[email protected]> wrote:

> SIMPLE ANSWER:
>
> Well, I am not certain about "fastest", but it is convenient, and maybe one
> of the few ways (perhaps the only way, other than individual Region puts,
> which I gather would be slower).
>
> If we are talking about a simple Map of data that is relatively small,
> then Region.putAll(:Map) is your best option.
>
> However...
>
>
> DETAILED ANSWER:
>
> I.e. don't equate loading a simple Map with bulk data loads in general.
>
> It really depends on many factors, distribution factors in particular:
> Region type (e.g. REPLICATE vs. PARTITION), Scope (as in DISTRIBUTED_ACK
> vs. DISTRIBUTED_NO_ACK; only applicable to REPLICATE Regions, since
> PARTITION Regions are DISTRIBUTED_ACK only), number of redundant
> copies (for PARTITION Regions), number of nodes in the cluster hosting the
> "target" Region, etc.  All of these can affect speed.
>
> But typically, bulk loading data (batch) is not so much about speed as it
> is about consistency/accuracy, or data availability.
>
> A more sophisticated approach in a distributed scenario, say if you were
> using PARTITION Regions with a fixed partitioning strategy would be to load
> the data in parallel from a Function, where the Function handles the data
> set for the individual nodes based on the partitioning strategy.  Of course
> redundant copies (along with Redundancy Zones) are still going to affect
> perf, even in this approach.
>
> So, again, it is a factor of your consistency and availability guarantees.
>
> See here
> <http://gemfire90.docs.pivotal.io/geode/developing/partitioned_regions/how_pr_ha_works.html>
>  [1]
> and here
> <http://gemfire90.docs.pivotal.io/geode/developing/partitioned_regions/configuring_ha_for_pr.html>
>  [2]
> for more details.
>
> I think the more pertinent question is where do you want to make your data
> available to best serve the needs of your application in a reliable fashion
> at runtime, rather than how it gets there.  You must be mindful of how much
> memory your data takes up in the first place.  Additionally, using a
> CacheLoader to lazily load the data in certain cases might make more
> sense.  I.e. w.r.t. bulk load, it is not about having all your data in
> memory, but having the right data in memory at the right time.  That is
> going to give your application the best responsiveness.
>
> Food for thought,
>
> -j
>
> [1] http://gemfire90.docs.pivotal.io/geode/developing/partitioned_regions/how_pr_ha_works.html
> [2] http://gemfire90.docs.pivotal.io/geode/developing/partitioned_regions/configuring_ha_for_pr.html
>
>
> On Fri, Mar 3, 2017 at 9:26 AM, Amit Pandey <[email protected]>
> wrote:
>
>> Hey John,
>>
>> Thanks. I am planning to use Spring XD. But my current use case is that I
>> am aggregating and doing some computations in a Function, and then I want
>> to populate the Region with the resulting values, which I have in a Map.
>> Is region.putAll the fastest way?
>>
>> Regards
>>
>> On Fri, Mar 3, 2017 at 10:52 PM, John Blum <[email protected]> wrote:
>>
>>> You might consider using the Snapshot service
>>> <http://gemfire90.docs.pivotal.io/geode/managing/cache_snapshots/chapter_overview.html>
>>>  [1]
>>> if you previously had data in a Region of another Cluster (for instance).
>>>
>>> If the data is coming from an external source, then Spring XD
>>> <http://projects.spring.io/spring-xd/> [2] is a great tool for moving
>>> (streaming) data from a source
>>> <http://docs.spring.io/spring-xd/docs/1.3.1.RELEASE/reference/html/#sources>
>>>  [3]
>>> to a sink
>>> <http://docs.spring.io/spring-xd/docs/1.3.1.RELEASE/reference/html/#sinks>
>>> [4].
>>> It also allows you to perform all manner of transformations/conversions,
>>> trigger events, and so on and so forth.
>>>
>>> -j
>>>
>>>
>>> [1] http://gemfire90.docs.pivotal.io/geode/managing/cache_snapshots/chapter_overview.html
>>> [2] http://projects.spring.io/spring-xd/
>>> [3] http://docs.spring.io/spring-xd/docs/1.3.1.RELEASE/reference/html/#sources
>>> [4] http://docs.spring.io/spring-xd/docs/1.3.1.RELEASE/reference/html/#sinks
>>>
>>>
>>> On Fri, Mar 3, 2017 at 9:13 AM, Amit Pandey <[email protected]>
>>> wrote:
>>>
>>>> Hey Guys,
>>>>
>>>> What's the fastest way to do a bulk insert into a region?
>>>>
>>>> I am using region.putAll; is there any alternative/faster API?
>>>>
>>>> regards
>>>>
>>>
>>>
>>>
>>> --
>>> -John
>>> john.blum10101 (skype)
>>>
>>
>>
>
>
> --
> -John
> john.blum10101 (skype)
>
