Vladimir,

Thanks for getting back to us. A full example that clarifies the situation would be great!

> Can you share your code as a GitHub project? Maybe with the script to reproduce 6 GB of data.

It is super trivial; I just wanted to get a sense of the throughput and check whether we have some kind of regression in recent versions (we don't) [1].

Also, I realised that the data size can be counted very differently - do we account for DB overhead, and how?

[1] https://gist.github.com/ptupitsyn/4f54230636178865fc93c97e4d419f15
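For reference, a single-threaded streaming check of this kind, with MSSQL taken out of the equation, can be sketched roughly as follows. The cache name, entry count and payload are illustrative, not the exact code from the gist, and a real test would point at an existing cluster and tune memory and streamer settings.

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteDataStreamer;
import org.apache.ignite.Ignition;

public class StreamerThroughputCheck {
    public static void main(String[] args) {
        // Start a node with the default configuration (illustrative only).
        try (Ignite ignite = Ignition.start()) {
            ignite.getOrCreateCache("test-cache");

            long started = System.currentTimeMillis();

            // Single-threaded streaming of dummy entries of roughly fixed size.
            try (IgniteDataStreamer<Long, String> streamer = ignite.dataStreamer("test-cache")) {
                for (long i = 0; i < 300_000_000L; i++)
                    streamer.addData(i, "value-" + i);
            } // close() flushes the remaining buffered entries.

            System.out.println("Streaming took " + (System.currentTimeMillis() - started) + " ms");
        }
    }
}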
On Thu, Feb 25, 2021 at 10:49 AM Vladimir Tchernyi <[email protected]> wrote:

> Hi,
>
> I've spent some time thinking about the community comments on my post. It seems that Ignite is really not a bottleneck here. The performance of my production MSSQL is a given restriction, and the problem is to ensure fast loading by executing multiple parallel queries. I'll test my code in production for a couple of months for possible problems. If it works out, a complete/downloadable/compilable GitHub example will probably be useful for the community.
>
> WDYT?
>
> On Fri, Feb 19, 2021 at 21:47, Vladimir Tchernyi <[email protected]> wrote:
>
>> Pavel,
>>
>> maybe it's time to put your five cents in. Can you share your code as a GitHub project? Maybe with the script to reproduce 6 GB of data.
>>
>> As for MSSQL data retrieval being the bottleneck - I don't think so: I got a 15 min load time for 1 node and 3.5 min for 4 nodes. Looks like a linear dependency (the table and the RDBMS server were the same).
>> --
>> Vladimir
>>
>> On Fri, Feb 19, 2021 at 19:47, Pavel Tupitsyn <[email protected]> wrote:
>>
>>> > First of all, I tried to select the whole table at once
>>>
>>> Hmm, it looks like MSSQL data retrieval may be the bottleneck here, not Ignite.
>>>
>>> Can you run a test where some dummy data of the same size as the real data is generated and inserted into Ignite, so that we test Ignite perf only, excluding MSSQL from the equation?
>>> For example, streaming 300 million entries (total size 6 GB) takes around 1 minute on my machine, with a simple single-threaded DataStreamer.
>>>
>>> On Fri, Feb 19, 2021 at 4:49 PM Vladimir Tchernyi <[email protected]> wrote:
>>>
>>>> Hi folks,
>>>> thanks for your interest in my work.
>>>>
>>>> I didn't try COPY FROM, since I tried to work with Ignite SQL a couple of years ago and didn't succeed - probably because the available examples aren't complete/downloadable/compilable (the paper [1] contains a GitHub repo; that is my five cents toward changing the status quo). My interest is in the KV API.
>>>>
>>>> I did try a data streamer, and that was my first try. I did not notice a significant time reduction using the code from my paper [1] versus a data streamer/receiver. There was some memory economy with the streamer, though. I must say my experiment was made on a heavily loaded production MSSQL server. A filtered query with a 300K-row resultset takes about 15 sec. The story follows.
>>>>
>>>> First of all, I tried to select the whole table at once; I got a network timeout and the client node was dropped off the cluster (is the node still alive?).
>>>> So I partitioned the table and executed a number of queries one-by-one on the client node, each query for a specific table partition. That process took about 90 min. Unacceptable time.
>>>>
>>>> Then I tried to execute my queries in parallel on the client node, each query calling dataStreamer.addData() on a single data streamer. The timing was not less than 15 min. All the attempts gave about the same timing; probably that was the network throughput limit on the client node (the same interface is used for the resultset and for cluster intercom). I'll say it again - that was the production environment.
>>>>
>>>> Final schema:
>>>> * ComputeTask.map() schedules ComputeJobs amongst the cluster nodes, one job for one table partition;
>>>> * each job executes an SQL query and constructs a map with binary object keys and values. Then the job executes targetCache.invokeAll(), specifying the constructed map and the static EntryProcessor class. The EntryProcessor contains the logic for the cache binary entry update;
>>>> * ComputeTask.reduce() sums up the row counts reported by each job.
>>>>
>>>> The schema described proved to be free of network errors in my production network and gives acceptable timing.
>>>>
>>>> Vladimir
>>>>
>>>> [1] https://www.gridgain.com/resources/blog/how-fast-load-large-datasets-apache-ignite-using-key-value-api
>>>>
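A rough sketch of such a compute-task loader is below. It assumes Long keys and String values in place of the real binary objects, a hypothetical cache name ("target-cache"), and a readPartition() stub instead of the actual MSSQL query, so it illustrates the shape of the schema rather than the author's production code.

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.cache.processor.MutableEntry;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.cache.CacheEntryProcessor;
import org.apache.ignite.cluster.ClusterNode;
import org.apache.ignite.compute.ComputeJob;
import org.apache.ignite.compute.ComputeJobAdapter;
import org.apache.ignite.compute.ComputeJobResult;
import org.apache.ignite.compute.ComputeTaskAdapter;
import org.apache.ignite.resources.IgniteInstanceResource;

/** One job per table partition; reduce() sums the row counts reported by the jobs. */
public class LoadTableTask extends ComputeTaskAdapter<List<Integer>, Long> {
    @Override public Map<? extends ComputeJob, ClusterNode> map(List<ClusterNode> nodes, List<Integer> partitions) {
        Map<ComputeJob, ClusterNode> jobs = new HashMap<>();
        int i = 0;
        for (Integer part : partitions)
            jobs.put(new LoadPartitionJob(part), nodes.get(i++ % nodes.size())); // round-robin over nodes
        return jobs;
    }

    @Override public Long reduce(List<ComputeJobResult> results) {
        return results.stream().mapToLong(r -> r.<Long>getData()).sum();
    }

    /** Reads one table partition from the RDBMS and applies it to the cache with invokeAll(). */
    private static class LoadPartitionJob extends ComputeJobAdapter {
        @IgniteInstanceResource
        private transient Ignite ignite;

        private final int partition;

        LoadPartitionJob(int partition) { this.partition = partition; }

        @Override public Object execute() {
            Map<Long, String> rows = readPartition(partition);

            IgniteCache<Long, String> cache = ignite.cache("target-cache");

            // The constructed map is passed as an argument; the processor writes
            // the value that belongs to each key.
            cache.invokeAll(rows.keySet(), new UpsertProcessor(), rows);

            return (long) rows.size();
        }

        /** Placeholder for the JDBC query against the source-table partition. */
        private Map<Long, String> readPartition(int part) {
            return new HashMap<>();
        }
    }

    /** Static entry processor holding the per-entry update logic. */
    private static class UpsertProcessor implements CacheEntryProcessor<Long, String, Void> {
        @Override public Void process(MutableEntry<Long, String> entry, Object... args) {
            @SuppressWarnings("unchecked")
            Map<Long, String> batch = (Map<Long, String>) args[0];
            entry.setValue(batch.get(entry.getKey()));
            return null;
        }
    }
}

A client could then submit it with something like ignite.compute().execute(new LoadTableTask(), partitionIds), where partitionIds is the list of source-table partitions.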
>>>> On Fri, Feb 19, 2021 at 16:41, Stephen Darlington <[email protected]> wrote:
>>>>
>>>>> I think it's more that putAll is mostly atomic, so the more records you save in one chunk, the more locking, etc. happens. Distributing as compute jobs means all the putAlls will be local, which is beneficial, and the size of each put is going to be smaller (also beneficial).
>>>>>
>>>>> But that's a lot of work that the data streamer already does for you, and the data streamer also batches updates, so it would still be faster.
>>>>>
>>>>> On 19 Feb 2021, at 13:33, Maximiliano Gazquez <[email protected]> wrote:
>>>>>
>>>>> What would be the difference between doing cache.putAll(all rows) and separating them by affinity key and executing putAll inside a compute job?
>>>>> If I'm not mistaken, doing putAll should end up splitting those rows by affinity key on one of the servers, right?
>>>>> Is there a comparison of that?
>>>>>
>>>>> On Fri, Feb 19, 2021 at 9:51 AM Taras Ledkov <[email protected]> wrote:
>>>>>
>>>>>> Hi Vladimir,
>>>>>> Did you try the SQL command 'COPY FROM <csv_file>' via thin JDBC? This command uses an 'IgniteDataStreamer' to write data into the cluster and parses the CSV on the server node.
>>>>>>
>>>>>> PS. AFAIK, IgniteDataStreamer is one of the fastest ways to load data.
>>>>>>
>>>>>> Hi Denis,
>>>>>>
>>>>>> Data space is 3.7 GB according to the MSSQL table properties.
>>>>>>
>>>>>> Vladimir
>>>>>>
>>>>>> At 9:47, Feb 19, 2021, Denis Magda <[email protected]> wrote:
>>>>>>
>>>>>> Hello Vladimir,
>>>>>>
>>>>>> Good to hear from you! How much is that in gigabytes?
>>>>>>
>>>>>> -
>>>>>> Denis
>>>>>>
>>>>>>
>>>>>> On Thu, Feb 18, 2021 at 10:06 PM <[email protected]> wrote:
>>>>>>
>>>>>> In Sep 2020 I published a paper about Loading Large Datasets into Apache Ignite by Using a Key-Value API (English [1] and Russian [2] versions). The approach described works in production, but shows unacceptable performance for very large tables.
>>>>>>
>>>>>> The story continues, and yesterday I finished a proof of concept for very fast loading of a very big table. A partitioned MSSQL table of about 295 million rows was loaded by a 4-node Ignite cluster in 3 min 35 sec. Each node executed its own SQL queries in parallel and then distributed the loaded values across the other cluster nodes.
>>>>>>
>>>>>> Probably that result will be of interest to the community.
>>>>>>
>>>>>> Regards,
>>>>>> Vladimir Chernyi
>>>>>>
>>>>>> [1] https://www.gridgain.com/resources/blog/how-fast-load-large-datasets-apache-ignite-using-key-value-api
>>>>>> [2] https://m.habr.com/ru/post/526708/
>>>>>>
>>>>>> --
>>>>>> Sent from the Yandex.Mail mobile app
>>>>>>
>>>>>> --
>>>>>> Taras Ledkov
>>>>>> Mail-To: [email protected]
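As a footnote to Taras's suggestion, a minimal thin-JDBC bulk load could be sketched as follows; the connection string, CSV path, table and column names are placeholders, and the sketch assumes the target table already exists in the cluster.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CsvBulkLoad {
    public static void main(String[] args) throws Exception {
        // Thin JDBC connection to a cluster node (host/port are placeholders).
        try (Connection conn = DriverManager.getConnection("jdbc:ignite:thin://127.0.0.1/");
             Statement stmt = conn.createStatement()) {
            // COPY streams the CSV contents into the target table via the bulk-load path.
            stmt.executeUpdate(
                "COPY FROM '/path/to/person.csv' INTO person (id, name, salary) FORMAT CSV");
        }
    }
}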
