Parallel processing in any form will help, including Spark with a DFS like S3 or HDFS. Your three machines could end up being the bottleneck, and you may need more nodes.
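A minimal sketch of that approach, assuming SolrJ is on the executors' classpath; the HDFS path, load-balancer URL, and field names below are hypothetical:

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.spark.api.java.function.ForeachPartitionFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkCsvToSolr {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("csv-to-solr").getOrCreate();

        // Read the CSV from a distributed filesystem so each executor pulls its own split.
        Dataset<Row> rows = spark.read()
                .option("header", "true")
                .csv("hdfs:///data/export.csv");   // hypothetical path

        rows.foreachPartition((ForeachPartitionFunction<Row>) partition -> {
            // One Solr client per partition; partitions index in parallel across executors.
            try (HttpSolrClient client = new HttpSolrClient.Builder(
                    "http://solr-lb.example.com:8983/solr/mycollection").build()) {
                List<SolrInputDocument> batch = new ArrayList<>();
                while (partition.hasNext()) {
                    Row row = partition.next();
                    SolrInputDocument doc = new SolrInputDocument();
                    doc.addField("id", row.getString(0));    // hypothetical field mapping
                    doc.addField("name", row.getString(1));
                    batch.add(doc);
                    if (batch.size() == 1000) {              // send in batches of 1000
                        client.add(batch);
                        batch.clear();
                    }
                }
                if (!batch.isEmpty()) {
                    client.add(batch);
                }
            }
        });
        spark.stop();
    }
}

Each partition gets its own client, so indexing parallelism scales with the number of executors instead of being capped by a single JVM.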
On Mar 20, 2018, 2:36 AM -0500, LOPEZ-CORTES Mariano-ext <mariano.lopez-cortes-...@pole-emploi.fr>, wrote:
> The CSV file is approx. 5 GB for 29 million rows.
>
> As you say, Christopher, at the beginning we thought that reading chunk
> by chunk from Oracle and writing to Solr was the best strategy.
>
> But our tests showed:
>
> CSV creation via PL/SQL is really, really fast: 40 minutes for the full
> dataset (with BULK COLLECT).
> Multiple SELECT calls from Java slow down the process. I think Oracle
> is the bottleneck here.
>
> Any other ideas/alternatives?
>
> Some other points to note:
>
> We are going to enable autoCommit every 10 minutes / 10000 rows. No
> commits from the client.
> During indexing, we always call a front-end load balancer that
> redirects calls to the 3-node cluster.
>
> Thanks in advance!!
>
> ==> Great mailing list and a really awesome tool!!
>
> -----Original Message-----
> From: Christopher Schultz [mailto:ch...@christopherschultz.net]
> Sent: Monday, March 19, 2018 18:05
> To: solr-user@lucene.apache.org
> Subject: Re: Question liste solr
>
> Mariano,
>
> On 3/19/18 11:50 AM, LOPEZ-CORTES Mariano-ext wrote:
> > Hello
> >
> > We have a Solr index with 3 nodes, 1 shard and 2 replicas.
> >
> > Our goal is to index 42 million rows. Indexing time is important.
> > The data source is an Oracle database.
> >
> > Our indexing strategy is:
> >
> > * Reading from Oracle into a big CSV file.
> >
> > * Reading from 4 files (the big file, chunked) and injecting via
> > ConcurrentUpdateSolrClient.
> >
> > Is this the optimal way of injecting such a mass of data into Solr?
> >
> > For information, the estimated time for our solution is 6h.
>
> How big are the CSV files? If most of the time is taken performing the
> various SELECT operations, then it's probably a good strategy.
>
> However, you may find that using the disk as a buffer slows everything
> down, because disk writes can be very slow.
>
> Why not perform your SELECT(s) and write directly to Solr using one of
> the APIs (either a language-specific API, or through the HTTP API)?
>
> Hope that helps,
> -chris
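For reference, a minimal sketch of the chunked-CSV injection described above, using SolrJ's ConcurrentUpdateSolrClient, run once per chunk file. The load-balancer URL and field names are assumptions, and the naive comma split should be replaced by a real CSV parser if fields can contain commas:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class CsvChunkIndexer {
    public static void main(String[] args) throws IOException, SolrServerException {
        // Point at the front-end load balancer; hypothetical URL and collection name.
        String solrUrl = "http://solr-lb.example.com:8983/solr/mycollection";

        // ConcurrentUpdateSolrClient buffers documents in an internal queue and
        // streams them to Solr from background threads.
        try (ConcurrentUpdateSolrClient client =
                 new ConcurrentUpdateSolrClient.Builder(solrUrl)
                     .withQueueSize(10000)   // docs buffered before the caller blocks
                     .withThreadCount(4)     // parallel streams to Solr
                     .build();
             BufferedReader reader = Files.newBufferedReader(Paths.get(args[0]))) {

            String line;
            while ((line = reader.readLine()) != null) {
                // Naive split; use a real CSV parser if fields can contain commas.
                String[] fields = line.split(",");
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", fields[0]);      // hypothetical field mapping
                doc.addField("name", fields[1]);
                client.add(doc);
            }
            // Drain the queue; no client-side commit -- rely on server autoCommit.
            client.blockUntilFinished();
        }
    }
}

Note that blockUntilFinished() only drains the client's internal queue; since no commit is sent from the client, the planned 10 minutes / 10000 rows maps to <autoCommit><maxTime>600000</maxTime><maxDocs>10000</maxDocs><openSearcher>false</openSearcher></autoCommit> in solrconfig.xml (openSearcher=false keeps hard commits cheap during a bulk load).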