Vladimir,

Thanks for getting back to us. A full example that clarifies the situation would be great!

> Can you share your code as a GitHub project? Maybe with the script to reproduce 6 GB of data.

It is super trivial; I just wanted to get a sense of the throughput and check whether we have some kind of regression in recent versions (we don't) [1].

Also, I realised that the data size can be counted very differently - do we account for DB overhead, and how?

[1] https://gist.github.com/ptupitsyn/4f54230636178865fc93c97e4d419f15
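For reference, a single-threaded streaming check of this kind, with MSSQL taken out of the equation, can be sketched roughly as follows. The cache name, entry count and payload are illustrative, not the exact code from the gist, and a real test would point at an existing cluster and tune memory and streamer settings.

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteDataStreamer;
import org.apache.ignite.Ignition;

public class StreamerThroughputCheck {
    public static void main(String[] args) {
        // Start a node with the default configuration (illustrative only).
        try (Ignite ignite = Ignition.start()) {
            ignite.getOrCreateCache("test-cache");

            long started = System.currentTimeMillis();

            // Single-threaded streaming of dummy entries of roughly fixed size.
            try (IgniteDataStreamer<Long, String> streamer = ignite.dataStreamer("test-cache")) {
                for (long i = 0; i < 300_000_000L; i++)
                    streamer.addData(i, "value-" + i);
            } // close() flushes the remaining buffered entries.

            System.out.println("Streaming took " + (System.currentTimeMillis() - started) + " ms");
        }
    }
}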
On Thu, Feb 25, 2021 at 10:49 AM Vladimir Tchernyi <[email protected]> wrote:

> Hi,
>
> I've spent some time thinking about the community comments on my post. It seems that Ignite is really not a bottleneck here. The performance of my production MSSQL is a given restriction, and the problem is to ensure fast loading by executing multiple parallel queries. I'll test my code in production for a couple of months for possible problems. If it works out, a complete/downloadable/compilable GitHub example will probably be useful for the community.
>
> WDYT?
>
> On Fri, Feb 19, 2021 at 21:47, Vladimir Tchernyi <[email protected]> wrote:
>
>> Pavel,
>>
>> maybe it's time to put your five cents in. Can you share your code as a GitHub project? Maybe with the script to reproduce 6 GB of data.
>>
>> As for MSSQL data retrieval being the bottleneck - I don't think so: I got a 15 min load time for 1 node and 3.5 min for 4 nodes. Looks like a linear dependency (the table and the RDBMS server were the same).
>> --
>> Vladimir
>>
>> On Fri, Feb 19, 2021 at 19:47, Pavel Tupitsyn <[email protected]> wrote:
>>
>>> > First of all, I tried to select the whole table at once
>>>
>>> Hmm, it looks like MSSQL data retrieval may be the bottleneck here, not Ignite.
>>>
>>> Can you run a test where some dummy data of the same size as the real data is generated and inserted into Ignite, so that we test Ignite perf only, excluding MSSQL from the equation?
>>> For example, streaming 300 million entries (total size 6 GB) takes around 1 minute on my machine, with a simple single-threaded DataStreamer.
>>>
>>> On Fri, Feb 19, 2021 at 4:49 PM Vladimir Tchernyi <[email protected]> wrote:
>>>
>>>> Hi folks,
>>>> thanks for your interest in my work.
>>>>
>>>> I didn't try COPY FROM, since I tried to work with Ignite SQL a couple of years ago and didn't succeed - probably because the available examples aren't complete/downloadable/compilable (the paper [1] contains a GitHub repo; that is my five cents toward changing the status quo). My interest is in the KV API.
>>>>
>>>> I did try a data streamer, and that was my first try. I did not notice a significant time reduction using the code from my paper [1] versus a data streamer/receiver. There was some memory economy with the streamer, though. I must say my experiment was made on a heavily loaded production MSSQL server. A filtered query with a 300K-row resultset takes about 15 sec. The story follows.
>>>>
>>>> First of all, I tried to select the whole table at once; I got a network timeout and the client node was dropped off the cluster (is the node still alive?).
>>>> So I partitioned the table and executed a number of queries one-by-one on the client node, each query for a specific table partition. That process took about 90 min. Unacceptable time.
>>>>
>>>> Then I tried to execute my queries in parallel on the client node, each query calling dataStreamer.addData() on a single data streamer. The timing was not less than 15 min. All the attempts gave about the same timing; probably that was the network throughput limit on the client node (the same interface is used for the resultset and for cluster intercom). I'll say it again - that was the production environment.
>>>>
>>>> Final schema:
>>>> * ComputeTask.map() schedules ComputeJobs amongst the cluster nodes, one job for one table partition;
>>>> * each job executes an SQL query and constructs a map with binary object keys and values. Then the job executes targetCache.invokeAll(), specifying the constructed map and the static EntryProcessor class. The EntryProcessor contains the logic for the cache binary entry update;
>>>> * ComputeTask.reduce() sums up the row counts reported by each job.
>>>>
>>>> The schema described proved to be free of network errors in my production network and gives acceptable timing.
>>>>
>>>> Vladimir
>>>>
>>>> [1] https://www.gridgain.com/resources/blog/how-fast-load-large-datasets-apache-ignite-using-key-value-api
>>>>
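A rough sketch of such a compute-task loader is below. It assumes Long keys and String values in place of the real binary objects, a hypothetical cache name ("target-cache"), and a readPartition() stub instead of the actual MSSQL query, so it illustrates the shape of the schema rather than the author's production code.

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.cache.processor.MutableEntry;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.cache.CacheEntryProcessor;
import org.apache.ignite.cluster.ClusterNode;
import org.apache.ignite.compute.ComputeJob;
import org.apache.ignite.compute.ComputeJobAdapter;
import org.apache.ignite.compute.ComputeJobResult;
import org.apache.ignite.compute.ComputeTaskAdapter;
import org.apache.ignite.resources.IgniteInstanceResource;

/** One job per table partition; reduce() sums the row counts reported by the jobs. */
public class LoadTableTask extends ComputeTaskAdapter<List<Integer>, Long> {
    @Override public Map<? extends ComputeJob, ClusterNode> map(List<ClusterNode> nodes, List<Integer> partitions) {
        Map<ComputeJob, ClusterNode> jobs = new HashMap<>();
        int i = 0;
        for (Integer part : partitions)
            jobs.put(new LoadPartitionJob(part), nodes.get(i++ % nodes.size())); // round-robin over nodes
        return jobs;
    }

    @Override public Long reduce(List<ComputeJobResult> results) {
        return results.stream().mapToLong(r -> r.<Long>getData()).sum();
    }

    /** Reads one table partition from the RDBMS and applies it to the cache with invokeAll(). */
    private static class LoadPartitionJob extends ComputeJobAdapter {
        @IgniteInstanceResource
        private transient Ignite ignite;

        private final int partition;

        LoadPartitionJob(int partition) { this.partition = partition; }

        @Override public Object execute() {
            Map<Long, String> rows = readPartition(partition);

            IgniteCache<Long, String> cache = ignite.cache("target-cache");

            // The constructed map is passed as an argument; the processor writes
            // the value that belongs to each key.
            cache.invokeAll(rows.keySet(), new UpsertProcessor(), rows);

            return (long) rows.size();
        }

        /** Placeholder for the JDBC query against the source-table partition. */
        private Map<Long, String> readPartition(int part) {
            return new HashMap<>();
        }
    }

    /** Static entry processor holding the per-entry update logic. */
    private static class UpsertProcessor implements CacheEntryProcessor<Long, String, Void> {
        @Override public Void process(MutableEntry<Long, String> entry, Object... args) {
            @SuppressWarnings("unchecked")
            Map<Long, String> batch = (Map<Long, String>) args[0];
            entry.setValue(batch.get(entry.getKey()));
            return null;
        }
    }
}

A client could then submit it with something like ignite.compute().execute(new LoadTableTask(), partitionIds), where partitionIds is the list of source-table partitions.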
>>>> On Fri, Feb 19, 2021 at 16:41, Stephen Darlington <[email protected]> wrote:
>>>>
>>>>> I think it's more that putAll is mostly atomic, so the more records you save in one chunk, the more locking, etc. happens. Distributing as compute jobs means all the putAlls will be local, which is beneficial, and the size of each put is going to be smaller (also beneficial).
>>>>>
>>>>> But that's a lot of work that the data streamer already does for you, and the data streamer also batches updates, so it would still be faster.
>>>>>
>>>>> On 19 Feb 2021, at 13:33, Maximiliano Gazquez <[email protected]> wrote:
>>>>>
>>>>> What would be the difference between doing cache.putAll(all rows) and separating them by affinity key and executing putAll inside a compute job?
>>>>> If I'm not mistaken, doing putAll should end up splitting those rows by affinity key on one of the servers, right?
>>>>> Is there a comparison of that?
>>>>>
>>>>> On Fri, Feb 19, 2021 at 9:51 AM Taras Ledkov <[email protected]> wrote:
>>>>>
>>>>>> Hi Vladimir,
>>>>>> Did you try the SQL command 'COPY FROM <csv_file>' via thin JDBC? This command uses an 'IgniteDataStreamer' to write data into the cluster and parses the CSV on the server node.
>>>>>>
>>>>>> PS. AFAIK, IgniteDataStreamer is one of the fastest ways to load data.
>>>>>>
>>>>>> Hi Denis,
>>>>>>
>>>>>> Data space is 3.7 GB according to the MSSQL table properties.
>>>>>>
>>>>>> Vladimir
>>>>>>
>>>>>> At 9:47, Feb 19, 2021, Denis Magda <[email protected]> wrote:
>>>>>>
>>>>>> Hello Vladimir,
>>>>>>
>>>>>> Good to hear from you! How much is that in gigabytes?
>>>>>>
>>>>>> -
>>>>>> Denis
>>>>>>
>>>>>>
>>>>>> On Thu, Feb 18, 2021 at 10:06 PM <[email protected]> wrote:
>>>>>>
>>>>>> In Sep 2020 I published a paper about Loading Large Datasets into Apache Ignite by Using a Key-Value API (English [1] and Russian [2] versions). The approach described works in production, but shows unacceptable performance for very large tables.
>>>>>>
>>>>>> The story continues, and yesterday I finished a proof of concept for very fast loading of a very big table. A partitioned MSSQL table of about 295 million rows was loaded by a 4-node Ignite cluster in 3 min 35 sec. Each node executed its own SQL queries in parallel and then distributed the loaded values across the other cluster nodes.
>>>>>>
>>>>>> Probably that result will be of interest to the community.
>>>>>>
>>>>>> Regards,
>>>>>> Vladimir Chernyi
>>>>>>
>>>>>> [1] https://www.gridgain.com/resources/blog/how-fast-load-large-datasets-apache-ignite-using-key-value-api
>>>>>> [2] https://m.habr.com/ru/post/526708/
>>>>>>
>>>>>> --
>>>>>> Sent from the Yandex.Mail mobile app
>>>>>>
>>>>>> --
>>>>>> Taras Ledkov
>>>>>> Mail-To: [email protected]
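As a footnote to Taras's suggestion, a minimal thin-JDBC bulk load could be sketched as follows; the connection string, CSV path, table and column names are placeholders, and the sketch assumes the target table already exists in the cluster.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CsvBulkLoad {
    public static void main(String[] args) throws Exception {
        // Thin JDBC connection to a cluster node (host/port are placeholders).
        try (Connection conn = DriverManager.getConnection("jdbc:ignite:thin://127.0.0.1/");
             Statement stmt = conn.createStatement()) {
            // COPY streams the CSV contents into the target table via the bulk-load path.
            stmt.executeUpdate(
                "COPY FROM '/path/to/person.csv' INTO person (id, name, salary) FORMAT CSV");
        }
    }
}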
