Hi folks,
thanks for your interest in my work.

I didn't try COPY FROM because I tried to work with Ignite SQL a couple of
years ago and didn't succeed, probably because the available examples aren't
complete/downloadable/compilable (the paper [1] includes a GitHub repo;
that is my two cents toward changing the status quo). My interest is in the
KV API.

I did try a data streamer, and that was my first attempt. I did not notice
a significant time reduction using the code from my paper [1] versus the
data streamer/receiver, though the streamer did use somewhat less memory.
I must say my experiment was run against a heavily loaded production MSSQL
server, where a filtered query with a 300K-row resultset takes about 15
sec. The story follows.

First of all, I tried to select the whole table at once; I got a network
timeout and the client node was dropped from the cluster (is the node still
alive?).
So I partitioned the table and executed a number of queries one by one on
the client node, each query covering a specific table partition. That
process took about 90 min. An unacceptable time.
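For illustration, splitting a table into per-range SQL queries (one per
partition) can be sketched in plain Java. The table and column names here
are hypothetical, not the ones from my setup:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: split a numeric key range into per-partition
// SELECT statements so each partition can be fetched with its own query.
public class PartitionSplitter {
    public static List<String> splitQueries(long minId, long maxId, int partitions) {
        List<String> queries = new ArrayList<>();
        long span = maxId - minId + 1;
        for (int p = 0; p < partitions; p++) {
            // Integer arithmetic spreads any remainder evenly across partitions.
            long lo = minId + span * p / partitions;
            long hi = minId + span * (p + 1) / partitions - 1;
            queries.add("SELECT id, val FROM src_table WHERE id BETWEEN "
                    + lo + " AND " + hi);
        }
        return queries;
    }

    public static void main(String[] args) {
        // Four partitions over ids 1..1000.
        for (String q : splitQueries(1, 1000, 4))
            System.out.println(q);
    }
}
```

Each generated query can then be executed on its own, instead of one huge
SELECT that times out.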

Then I tried to execute my queries in parallel on the client node, each
query feeding dataStreamer.addData() on a single data streamer. The timing
never dropped below 15 min; all attempts came out about the same, probably
because of the network throughput limit on the client node (the same
interface was used for the resultset and for cluster intercom). Again,
this was the production environment.

Final schema:
* ComputeTask.map() schedules ComputeJobs amongst cluster nodes, one job
for one table partition;
* each job executes an SQL query and constructs a map of binary-object
keys and values. The job then calls targetCache.invokeAll() with the
constructed map and a static EntryProcessor class. The EntryProcessor
contains the logic for updating a binary cache entry;
* ComputeTask.reduce() summarizes the row count reported by each job.

The schema described has proved network-error-free in my production
network and gives acceptable timing.
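As a rough illustration, the map/reduce control flow of this schema can be
sketched in plain Java with no Ignite dependency. The real ComputeJob body
(the per-partition SQL query, the binary-object map, and the
targetCache.invokeAll() call with a static EntryProcessor) is indicated
only in comments; the row counts are made up:

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;

// Plain-Java analogue of the ComputeTask.map()/reduce() flow described
// above: one "job" per table partition, each reporting its row count,
// and a reduce step that sums the counts.
public class LoadTaskSketch {
    // map(): schedule one job per partition (here: submit to a pool;
    // in Ignite, ComputeTask.map() assigns ComputeJobs to cluster nodes).
    static List<Future<Long>> map(ExecutorService pool, long[] partitionRows) {
        return Arrays.stream(partitionRows)
                // The real job would run the partition's SQL query, build a
                // map of binary keys/values and call targetCache.invokeAll();
                // here it just reports the partition's row count.
                .mapToObj(rows -> pool.submit(() -> rows))
                .collect(Collectors.toList());
    }

    // reduce(): sum the row counts reported by each job.
    static long reduce(List<Future<Long>> results) throws Exception {
        long total = 0;
        for (Future<Long> f : results)
            total += f.get();
        return total;
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        long total = reduce(map(pool, new long[]{100, 200, 300}));
        pool.shutdown();
        System.out.println("Total rows loaded: " + total);
    }
}
```

The point of the structure is that each job's invokeAll() runs against
locally computed data, so only the reduce step crosses node boundaries.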

Vladimir

[1]
https://www.gridgain.com/resources/blog/how-fast-load-large-datasets-apache-ignite-using-key-value-api

On Fri, 19 Feb 2021 at 16:41, Stephen Darlington <
[email protected]> wrote:

> I think it’s more that putAll is mostly atomic, so the more records
> you save in one chunk, the more locking, etc. happens. Distributing them
> as compute jobs means all the putAlls will be local, which is beneficial,
> and each individual put will be smaller (also beneficial).
>
> But that’s a lot of work that the data streamer already does for you and
> the data streamer also batches updates so would still be faster.
>
> On 19 Feb 2021, at 13:33, Maximiliano Gazquez <[email protected]>
> wrote:
>
> What would be the difference between doing cache.putAll(all rows) and
> separating them by affinity key + executing putAll inside a compute job?
> If I'm not mistaken, doing putAll should end up splitting those rows by
> affinity key on one of the servers, right?
> Is there a comparison of that?
>
> On Fri, Feb 19, 2021 at 9:51 AM Taras Ledkov <[email protected]> wrote:
>
>> Hi Vladimir,
>> Did you try the SQL command 'COPY FROM <csv_file>' via thin JDBC?
>> This command uses IgniteDataStreamer to write data into the cluster and
>> parses the CSV on the server node.
>>
>> PS. AFAIK IgniteDataStreamer is one of the fastest ways to load data.
>>
>> Hi Denis,
>>
>> Data space is 3.7 GB according to the MSSQL table properties
>>
>> Vladimir
>>
>> At 9:47 on 19 February 2021, Denis Magda <[email protected]> wrote:
>>
>> Hello Vladimir,
>>
>> Good to hear from you! How much is that in gigabytes?
>>
>> -
>> Denis
>>
>>
>> On Thu, Feb 18, 2021 at 10:06 PM <[email protected]> wrote:
>>
>> In Sep 2020 I published a paper about Loading Large Datasets into
>> Apache Ignite by Using a Key-Value API (English [1] and Russian [2]
>> versions). The approach described works in production but shows
>> unacceptable performance for very large tables.
>>
>> The story continues: yesterday I finished a proof of concept for
>> very fast loading of a very big table. A partitioned MSSQL table of about
>> 295 million rows was loaded by a 4-node Ignite cluster in 3 min 35 sec.
>> Each node executed its own SQL queries in parallel and then distributed
>> the loaded values across the other cluster nodes.
>>
>> Probably that result will be of interest to the community.
>>
>> Regards,
>> Vladimir Chernyi
>>
>> [1]
>> https://www.gridgain.com/resources/blog/how-fast-load-large-datasets-apache-ignite-using-key-value-api
>> [2] https://m.habr.com/ru/post/526708/
>>
>>
>>
>> --
>> Sent from the Yandex.Mail mobile app
>>
>> --
>> Taras Ledkov
>> Mail-To: [email protected]
>>
>>
>
>
