Re: Full import alternatives
Dear Furkan,

I did. What I am not able to understand correctly at the moment is how to run the Solr import in parallel. I figured out that we can run indexing with SolrJ using an XML file: http://lucene.472066.n3.nabble.com/Index-database-with-SolrJ-using-xml-file-directly-throws-an-error-td4426491.html

Now I would like to run the job in parallel for full-import (not delta-import) to index my documents, but I am not sure how to implement it. https://stackoverflow.com/questions/35690638/how-to-bulk-index-html-files-with-solr-cell does it with multi-threading, but how would that work with an XML file?

As far as I understand so far, I need to put the XML file in the conf directory, which already holds data-config.xml and solrconfig.xml. I am not sure whether one has to write several different files to override the existing ones, or how it works. I thought of writing a properties file, but then I am confused about how to implement it further:

    Properties prop = new Properties();
    InputStream input = null;
    try {
        String filename = "indexer.properties";
        input = App.class.getClassLoader().getResourceAsStream(filename);
        if (input == null) {
            System.out.println("Indexer properties file not found: " + filename);
            return;
        }
        prop.load(input);
        System.out.println(prop.getProperty("xmlpath"));
        System.out.println(prop.getProperty("solr-url"));
    } catch (IOException ex) {
        ex.printStackTrace();
    } finally {
        if (input != null) {
            try {
                input.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

    public void indexFiles() throws IOException, SolrServerException {
        // "solr" is a SolrClient instance pointing at the core
        ModifiableSolrParams params = new ModifiableSolrParams();
        params.set("qt", "/dataimport");
        params.set("command", "full-import");
        params.set("commit", "true");
        try {
            solr.query(params);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

I am a bit lost here.

--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
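Since the thread never shows how the properties file and the full-import trigger fit together, here is a minimal sketch of one possible wiring. It uses only the JDK's built-in HttpClient (so it compiles without SolrJ; DIH is just an HTTP endpoint, so a plain GET is equivalent to the query(params) call above). It assumes the `solr-url` key in `indexer.properties` points at the core itself (e.g. http://localhost:8983/solr/core1); the property names follow the snippet above, everything else is hypothetical.

```java
import java.io.InputStream;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Properties;

public class ImportTrigger {

    // Builds the DIH request URL for a core; kept as a pure function
    // so the URL construction is easy to test in isolation.
    static String buildImportUrl(String coreUrl, String command) {
        return coreUrl.replaceAll("/+$", "")
                + "/dataimport?command=" + command + "&commit=true";
    }

    // Fires the command against the core and returns Solr's raw response body.
    static String trigger(String coreUrl, String command) throws Exception {
        HttpRequest req = HttpRequest
                .newBuilder(URI.create(buildImportUrl(coreUrl, command)))
                .GET().build();
        return HttpClient.newHttpClient()
                .send(req, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) throws Exception {
        Properties prop = new Properties();
        try (InputStream in = ImportTrigger.class.getClassLoader()
                .getResourceAsStream("indexer.properties")) {
            if (in == null) {
                System.err.println("indexer.properties not found on classpath");
                return;
            }
            prop.load(in);
        }
        System.out.println(trigger(prop.getProperty("solr-url"), "full-import"));
    }
}
```

To run several imports in parallel, call trigger() for each core from its own thread; each core needs its own DIH config in its own conf directory, not several files overriding one another.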
Re: Full import alternatives
Hi Sami,

Did you check the delta-import documentation: https://wiki.apache.org/solr/DataImportHandler#Using_delta-import_command

Kind Regards,
Furkan KAMACI

On Thu, Feb 28, 2019 at 7:24 PM sami wrote:
> Hi Shawn, can you please suggest a small program, or at least a backbone
> of a program, which can give me hints how exactly to achieve this? I
> quote: "I send a full-import DIH command to all of the shards, and each
> one makes an SQL query to MySQL, all of them running in parallel."
Re: Full import alternatives
Hi Shawn, can you please suggest a small program, or at least a backbone of a program, which can give me hints how exactly to achieve this? I quote: "I send a full-import DIH command to all of the shards, and each one makes an SQL query to MySQL, all of them running in parallel."
Re: Full import alternatives
On 4/13/2018 11:34 AM, Jesus Olivan wrote:
> first of all, thanks for your answer.
>
> How do you import these 6 shards simultaneously?

I'm not running in SolrCloud mode, so Solr doesn't know that each shard is part of a larger index. What I'm doing would probably not work in SolrCloud mode without making some significant changes.

On each of the cores representing a shard, I have a DIH config. When I do a full rebuild, I send a full-import DIH command to all of the shards, and each one makes an SQL query to MySQL, all of them running in parallel.

Thanks,
Shawn
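A rough backbone of such a driver might look like the following. It uses only the JDK HttpClient; the base URL and shard core names are made-up placeholders, and a real version would also poll command=status on each core until it reports idle before declaring the rebuild finished.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

public class ParallelFullImport {

    static String fullImportUrl(String baseUrl, String core) {
        return baseUrl + "/" + core + "/dataimport?command=full-import&commit=true";
    }

    static String statusUrl(String baseUrl, String core) {
        return baseUrl + "/" + core + "/dataimport?command=status";
    }

    // Kicks off full-import on every shard core at once; each core's DIH
    // then runs its own SQL query against MySQL, so the imports overlap.
    static void rebuildAll(String baseUrl, List<String> cores) {
        HttpClient http = HttpClient.newHttpClient();
        List<CompletableFuture<String>> pending = cores.stream()
                .map(core -> http.sendAsync(
                                HttpRequest.newBuilder(
                                        URI.create(fullImportUrl(baseUrl, core)))
                                        .GET().build(),
                                HttpResponse.BodyHandlers.ofString())
                        .thenApply(HttpResponse::body))
                .collect(Collectors.toList());
        pending.forEach(f -> System.out.println(f.join()));
        // Next step (omitted): loop over statusUrl(baseUrl, core) until
        // every core's response reports "idle".
    }
}
```

Usage would be something like rebuildAll("http://localhost:8983/solr", List.of("shard1", ..., "shard6")); note this matches the non-SolrCloud, core-per-shard setup described above.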
Re: Full import alternatives
hi Shawn,

first of all, thanks for your answer.

How do you import these 6 shards simultaneously?

2018-04-13 19:30 GMT+02:00 Shawn Heisey:
> On 4/13/2018 11:03 AM, Jesus Olivan wrote:
> > thanks for your answer. It happens that when we launch the full import,
> > the process doesn't finish (we waited for more than 60 hours last time
> > and then cancelled it, because this is not an acceptable time for us).
> > There weren't any errors in the Solr logfile, simply because it was
> > working fine. The problem is that it lasted forever and didn't finish.
> > We also tried it on an Aurora cluster under AWS, and after 20 hours of
> > work it failed due to lack of space in the Aurora tmp folder.
>
> 375 million documents importing from MySQL with one DIH import is going
> to take quite a while.
>
> The last full rebuild I did of my main index took 21.61 hours. This is
> an index where six large shards build simultaneously, using DIH, each
> one having more than 30 million documents. If I were to build it as a
> single 180-million-document import, it would probably take 5 days, maybe
> longer.
>
> We had another index (since retired) that had more than 400 million
> total documents, built similarly with multiple shards at the same time.
> The last rebuild I can remember on that index took about two days.
>
> Thanks,
> Shawn
Re: Full import alternatives
On 4/13/2018 11:03 AM, Jesus Olivan wrote:
> thanks for your answer. It happens that when we launch the full import,
> the process doesn't finish (we waited for more than 60 hours last time
> and then cancelled it, because this is not an acceptable time for us).
> There weren't any errors in the Solr logfile, simply because it was
> working fine. The problem is that it lasted forever and didn't finish.
> We also tried it on an Aurora cluster under AWS, and after 20 hours of
> work it failed due to lack of space in the Aurora tmp folder.

375 million documents importing from MySQL with one DIH import is going to take quite a while.

The last full rebuild I did of my main index took 21.61 hours. This is an index where six large shards build simultaneously, using DIH, each one having more than 30 million documents. If I were to build it as a single 180-million-document import, it would probably take 5 days, maybe longer.

We had another index (since retired) that had more than 400 million total documents, built similarly with multiple shards at the same time. The last rebuild I can remember on that index took about two days.

Thanks,
Shawn
Re: Full import alternatives
Hi Shawn,

Thanks for your answer. It happens that when we launch the full import, the process doesn't finish (we waited for more than 60 hours last time and then cancelled it, because this is not an acceptable time for us). There weren't any errors in the Solr logfile, simply because it was working fine. The problem is that it lasted forever and didn't finish. We also tried it on an Aurora cluster under AWS, and after 20 hours of work it failed due to lack of space in the Aurora tmp folder.

2018-04-13 18:41 GMT+02:00 Shawn Heisey:
> On 4/13/2018 10:11 AM, Jesus Olivan wrote:
> > we're trying to launch a full import of approximately 375 million docs
> > from a MySQL database to our SolrCloud cluster. Until now, this full
> > import process takes around 24-27 hours to finish due to a huge import
> > query (several group bys, left joins, etc), but after another import
> > query modification (adding more complexity), we're unable to execute
> > this full import from MySQL.
> >
> > We've done some research about migrating to PostgreSQL, but this is
> > not a real option at this time, because it implies a big refactoring
> > for several dev teams.
> >
> > Are there alternative ways to perform this full import successfully?
>
> DIH is a capable tool, and for what it does, it's remarkably efficient.
>
> It can't really be made any faster, because it's single-threaded. To
> get increased index speed with Solr, you must index documents from
> several sources/processes/threads at the same time. Writing custom
> software that can retrieve information from your source, build the
> documents you require, and send several update requests simultaneously
> will yield the best results. The source itself may be a bottleneck,
> though -- this is frequently the case, and Solr is often MUCH faster
> than the information source.
>
> You said that you're unable to execute an updated import from MySQL.
> What exactly happens when you try? Are there any errors in your Solr
> logfile?
>
> I'm not going to debate whether MySQL or PostgreSQL is the better
> solution. For my indexes, my source data is in MySQL. It works well,
> but full rebuilds using DIH are slower than I would like -- because it's
> single-threaded. Our overall system architecture would probably be
> improved by a switch to PostgreSQL, but it would be an extremely
> time-consuming transition process. We aren't having any real issues
> with MySQL, so we have no incentive to spend the required effort.
>
> Thanks,
> Shawn
Re: Full import alternatives
Jesus,

Usually a zipper join (aka external merge, in the old ETL world) and explicit partitioning are able to speed up the import.
https://lucene.apache.org/solr/guide/6_6/uploading-structured-data-store-data-with-the-data-import-handler.html#entity-processors

On Fri, Apr 13, 2018 at 7:11 PM, Jesus Olivan wrote:
> Hi!
>
> we're trying to launch a full import of approximately 375 million docs
> from a MySQL database to our SolrCloud cluster. Until now, this full
> import process takes around 24-27 hours to finish due to a huge import
> query (several group bys, left joins, etc), but after another import
> query modification (adding more complexity), we're unable to execute
> this full import from MySQL.
>
> We've done some research about migrating to PostgreSQL, but this is
> not a real option at this time, because it implies a big refactoring
> for several dev teams.
>
> Are there alternative ways to perform this full import successfully?
>
> Any ideas are welcome :)
>
> Thanks in advance!

--
Sincerely yours
Mikhail Khludnev
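For readers unfamiliar with the zipper join: it replaces an N+1 sub-query (or an expensive SQL join) with a merge of two streams that are both sorted by the join key. A data-config.xml sketch might look like the one below; it is illustrative only, the driver settings, table and column names are invented, and the join="zipper" / where= syntax follows the entity-processor section of the ref guide linked above.

```xml
<dataConfig>
  <dataSource driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://dbhost/mydb" user="solr" password="secret"/>
  <document>
    <!-- Both queries MUST be ordered by the join key for a zipper join;
         DIH then merges the two sorted streams in a single pass. -->
    <entity name="parent" query="SELECT id, title FROM parent ORDER BY id">
      <entity name="detail" join="zipper"
              query="SELECT parent_id, body FROM detail ORDER BY parent_id"
              where="parent_id=parent.id"/>
    </entity>
  </document>
</dataConfig>
```

Explicit partitioning would mean running several such configs (or one config parameterized with a WHERE clause on an id range), one per core, in parallel.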
Re: Full import alternatives
_how_ are you importing? DIH? SolrJ? Here's an article about using SolrJ:
https://lucidworks.com/2012/02/14/indexing-with-solrj/

But without more details it's really impossible to say much. Things I've done in the past:

1> use SolrJ and partition the job up amongst a bunch of clients, each of which works on a subset of docs. This requires, of course, that there's a way to partition the import.

2> For joins and the like, I've sometimes been able to cache data in local storage (SolrJ) and use that rather than using the joins. May not be possible, of course, depending on the size of some of your tables.

3> with DIH, there are some caching capabilities, although I confess I don't know the pros and cons.

4> Work with your DB administrator to tune your query. Sometimes this means creating a view, sometimes adding indexes.

Best,
Erick

On Fri, Apr 13, 2018 at 9:11 AM, Jesus Olivan wrote:
> Hi!
>
> we're trying to launch a full import of approximately 375 million docs
> from a MySQL database to our SolrCloud cluster. Until now, this full
> import process takes around 24-27 hours to finish due to a huge import
> query (several group bys, left joins, etc), but after another import
> query modification (adding more complexity), we're unable to execute
> this full import from MySQL.
>
> We've done some research about migrating to PostgreSQL, but this is
> not a real option at this time, because it implies a big refactoring
> for several dev teams.
>
> Are there alternative ways to perform this full import successfully?
>
> Any ideas are welcome :)
>
> Thanks in advance!
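Option 1> above hinges on being able to split the import into subsets. One common way, assuming the documents have a numeric primary key, is to split the key space into contiguous ranges and give each client one range (e.g. WHERE id >= lo AND id < hi). A small sketch of that partitioning, with the 375M/6 numbers purely as an example:

```java
import java.util.ArrayList;
import java.util.List;

public class IdRangePartitioner {

    // A half-open primary-key range [start, end).
    record Range(long start, long end) {}

    // Splits [start, end) into n contiguous, near-equal ranges so that each
    // SolrJ client imports one slice: SELECT ... WHERE id >= lo AND id < hi.
    static List<Range> partition(long start, long end, int n) {
        List<Range> ranges = new ArrayList<>();
        long span = end - start;
        for (int i = 0; i < n; i++) {
            long lo = start + span * i / n;
            long hi = start + span * (i + 1) / n;
            ranges.add(new Range(lo, hi));
        }
        return ranges;
    }

    public static void main(String[] args) {
        // e.g. 375M ids split across 6 workers, one range per client/thread.
        partition(0, 375_000_000L, 6).forEach(System.out::println);
    }
}
```

If the keys are sparse or skewed, range splitting gives uneven work; a modulo split (WHERE id % n = i) is the usual alternative.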
Re: Full import alternatives
On 4/13/2018 10:11 AM, Jesus Olivan wrote:
> we're trying to launch a full import of approximately 375 million docs
> from a MySQL database to our SolrCloud cluster. Until now, this full
> import process takes around 24-27 hours to finish due to a huge import
> query (several group bys, left joins, etc), but after another import
> query modification (adding more complexity), we're unable to execute
> this full import from MySQL.
>
> We've done some research about migrating to PostgreSQL, but this is
> not a real option at this time, because it implies a big refactoring
> for several dev teams.
>
> Are there alternative ways to perform this full import successfully?

DIH is a capable tool, and for what it does, it's remarkably efficient.

It can't really be made any faster, because it's single-threaded. To get increased index speed with Solr, you must index documents from several sources/processes/threads at the same time. Writing custom software that can retrieve information from your source, build the documents you require, and send several update requests simultaneously will yield the best results. The source itself may be a bottleneck, though -- this is frequently the case, and Solr is often MUCH faster than the information source.

You said that you're unable to execute an updated import from MySQL. What exactly happens when you try? Are there any errors in your Solr logfile?

I'm not going to debate whether MySQL or PostgreSQL is the better solution. For my indexes, my source data is in MySQL. It works well, but full rebuilds using DIH are slower than I would like -- because it's single-threaded. Our overall system architecture would probably be improved by a switch to PostgreSQL, but it would be an extremely time-consuming transition process. We aren't having any real issues with MySQL, so we have no incentive to spend the required effort.

Thanks,
Shawn
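To make "send several update requests simultaneously" concrete: a custom indexer typically groups documents into fixed-size batches, where each batch becomes one update request, and several threads (or SolrJ's ConcurrentUpdateSolrClient, which does this internally) send batches at the same time. The batching helper below is a sketch; the batch size and generic element type are arbitrary choices for illustration:

```java
import java.util.ArrayList;
import java.util.List;

public class BatchIndexer {

    // Splits docs into consecutive batches of at most `size` elements.
    // Each batch maps to one Solr update request; sending batches from
    // several threads at once is what gives the parallelism DIH lacks.
    static <T> List<List<T>> batches(List<T> docs, int size) {
        List<List<T>> out = new ArrayList<>();
        for (int i = 0; i < docs.size(); i += size) {
            out.add(new ArrayList<>(docs.subList(i, Math.min(i + size, docs.size()))));
        }
        return out;
    }
}
```

In practice batch sizes of a few hundred to a few thousand documents per request are a common starting point; the right number depends on document size and has to be measured.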
Full import alternatives
Hi!

We're trying to launch a full import of approximately 375 million docs from a MySQL database to our SolrCloud cluster. Until now, this full import process takes around 24-27 hours to finish due to a huge import query (several group bys, left joins, etc), but after another import query modification (adding more complexity), we're unable to execute this full import from MySQL.

We've done some research about migrating to PostgreSQL, but this is not a real option at this time, because it implies a big refactoring for several dev teams.

Are there alternative ways to perform this full import successfully?

Any ideas are welcome :)

Thanks in advance!