Re: Full import alternatives

2019-03-04 Thread sami
Dear Furkan,

I did. What I am not able to understand correctly at the moment is how to run
the Solr imports in parallel.

So, I figured out that we can run indexing with SolrJ using an XML file.

http://lucene.472066.n3.nabble.com/Index-database-with-SolrJ-using-xml-file-directly-throws-an-error-td4426491.html

Now, I would like to run the job in parallel for full-import (not delta-import)
to index my documents to start with. What I'm not sure about is how to implement it.

https://stackoverflow.com/questions/35690638/how-to-bulk-index-html-files-with-solr-cell

Here it is done in a multi-threaded way, but how will that work with an XML
file? As far as I understand so far, I need to specify the XML file in the conf
directory.

This conf directory has the data-config.xml and solrconfig.xml files. Whether
one has to write several different files to override the existing ones, or how
it otherwise works, I'm not really sure. I thought of writing a properties
file, but then I'm confused about how to implement it further.

Properties prop = new Properties();
// Load indexer.properties from the classpath; try-with-resources closes the stream.
try (InputStream input = App.class.getClassLoader().getResourceAsStream("indexer.properties")) {
    if (input == null) {
        System.out.println("Indexer properties file not found: indexer.properties");
        return;
    }
    prop.load(input);
    System.out.println(prop.getProperty("xmlpath"));
    System.out.println(prop.getProperty("solr-url"));
} catch (IOException ex) {
    ex.printStackTrace();
}

public void indexFiles() throws IOException, SolrServerException {
    // Trigger a DIH full-import through the /dataimport request handler.
    ModifiableSolrParams params = new ModifiableSolrParams();
    params.set("qt", "/dataimport");
    params.set("command", "full-import");
    params.set("commit", "true");
    try {
        solr.query(params);
    } catch (Exception e) {
        e.printStackTrace();
    }
}
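For the parallel question, here is a minimal sketch that fires the DIH full-import command at several shard cores at once, using only the JDK's java.net.http client (Java 11+) rather than SolrJ. The shard URLs are hypothetical, and the /dataimport handler path is an assumption that must match your solrconfig.xml:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

public class ParallelFullImport {

    // Builds the DIH full-import URL for one core (handler path assumed to be /dataimport).
    static String dihUrl(String coreUrl) {
        return coreUrl + "/dataimport?command=full-import&commit=true";
    }

    // Fires the full-import request at every core concurrently and waits for the
    // HTTP responses; DIH itself keeps importing in the background on each core.
    static void runFullImports(List<String> coreUrls) {
        HttpClient client = HttpClient.newHttpClient();
        List<CompletableFuture<HttpResponse<String>>> futures = coreUrls.stream()
                .map(u -> HttpRequest.newBuilder(URI.create(dihUrl(u))).GET().build())
                .map(r -> client.sendAsync(r, HttpResponse.BodyHandlers.ofString()))
                .collect(Collectors.toList());
        futures.forEach(CompletableFuture::join);
    }
}
```

A call like `runFullImports(List.of("http://localhost:8983/solr/shard1", "http://localhost:8983/solr/shard2"))` would then start both imports at the same time.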

I am a bit lost here.



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Full import alternatives

2019-03-04 Thread Furkan KAMACI
Hi Sami,

Did you check delta import documentation:
https://wiki.apache.org/solr/DataImportHandler#Using_delta-import_command

Kind Regards,
Furkan KAMACI

On Thu, Feb 28, 2019 at 7:24 PM sami  wrote:

> Hi Shawn, can you please suggest a small program, or at least the backbone
> of a program, which can give me hints on how exactly to achieve, I quote: "I
> send a full-import DIH command to all of the shards, and each one makes an
> SQL query to MySQL, all of them running in parallel."
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


Re: Full import alternatives

2019-02-28 Thread sami
Hi Shawn, can you please suggest a small program, or at least the backbone of
a program, which can give me hints on how exactly to achieve, I quote: "I send
a full-import DIH command to all of the shards, and each one makes an SQL
query to MySQL, all of them running in parallel."





Re: Full import alternatives

2018-04-13 Thread Shawn Heisey
On 4/13/2018 11:34 AM, Jesus Olivan wrote:
> first of all, thanks for your answer.
>
> How you import simultaneously these 6 shards?

I'm not running in SolrCloud mode, so Solr doesn't know that each shard
is part of a larger index.  What I'm doing would probably not work in
SolrCloud mode without making some significant changes.

On each of the cores representing a shard, I have a DIH config.  When I
do a full rebuild, I send a full-import DIH command to all of the
shards, and each one makes an SQL query to MySQL, all of them running in
parallel.
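Sending the command to each shard leaves the question of knowing when each import has finished. A hedged sketch that polls DIH's status command; the JSON shape checked for ("status":"idle") is an assumption to verify against your Solr version's actual response:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class DihStatus {

    // Crude check on the raw status body; a real client would parse the JSON.
    static boolean isIdle(String statusBody) {
        return statusBody.contains("\"status\":\"idle\"");
    }

    // Blocks until the core's DIH reports idle, polling every five seconds.
    static void waitUntilIdle(String coreUrl) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest req = HttpRequest.newBuilder(
                URI.create(coreUrl + "/dataimport?command=status&wt=json")).build();
        while (true) {
            HttpResponse<String> resp = client.send(req, HttpResponse.BodyHandlers.ofString());
            if (isIdle(resp.body())) return;
            Thread.sleep(5_000);
        }
    }
}
```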

Thanks,
Shawn



Re: Full import alternatives

2018-04-13 Thread Jesus Olivan
hi Shawn,

first of all, thanks for your answer.

How you import simultaneously these 6 shards?

2018-04-13 19:30 GMT+02:00 Shawn Heisey :

> On 4/13/2018 11:03 AM, Jesus Olivan wrote:
> > thanks for your answer. It happens that when we launched the full import
> > process, it didn't finish (we waited more than 60 hours last time and
> > cancelled it, because this is not an acceptable time for us). There weren't
> > any errors in the Solr logfile, simply because it was working fine. The
> > problem is that it lasted forever and didn't finish. We tried it on an
> > Aurora cluster under AWS, and after 20 hours of work it failed due to lack
> > of space in Aurora's tmp folder.
>
> 375 million documents importing from MySQL with one DIH import is going
> to take quite a while.
>
> The last full rebuild I did of my main index took 21.61 hours.  This is
> an index where six large shards build simultaneously, using DIH, each
> one having more than 30 million documents.  If I were to build it as a
> single 180 million document import, it would probably take 5 days, maybe
> longer.
>
> We had another index (since retired) that had more than 400 million
> total documents, built similarly with multiple shards at the same time.
> The last rebuild I can remember on that index took about two days.
>
> Thanks,
> Shawn
>
>


Re: Full import alternatives

2018-04-13 Thread Shawn Heisey
On 4/13/2018 11:03 AM, Jesus Olivan wrote:
> thanks for your answer. It happens that when we launched the full import
> process, it didn't finish (we waited more than 60 hours last time and
> cancelled it, because this is not an acceptable time for us). There weren't
> any errors in the Solr logfile, simply because it was working fine. The
> problem is that it lasted forever and didn't finish. We tried it on an
> Aurora cluster under AWS, and after 20 hours of work it failed due to lack
> of space in Aurora's tmp folder.

375 million documents importing from MySQL with one DIH import is going
to take quite a while.

The last full rebuild I did of my main index took 21.61 hours.  This is
an index where six large shards build simultaneously, using DIH, each
one having more than 30 million documents.  If I were to build it as a
single 180 million document import, it would probably take 5 days, maybe
longer.

We had another index (since retired) that had more than 400 million
total documents, built similarly with multiple shards at the same time. 
The last rebuild I can remember on that index took about two days.

Thanks,
Shawn



Re: Full import alternatives

2018-04-13 Thread Jesus Olivan
Hi Shawn,

thanks for your answer. It happens that when we launched the full import
process, it didn't finish (we waited more than 60 hours last time and
cancelled it, because this is not an acceptable time for us). There weren't
any errors in the Solr logfile, simply because it was working fine. The
problem is that it lasted forever and didn't finish. We tried it on an Aurora
cluster under AWS, and after 20 hours of work it failed due to lack of space
in Aurora's tmp folder.



2018-04-13 18:41 GMT+02:00 Shawn Heisey :

> On 4/13/2018 10:11 AM, Jesus Olivan wrote:
> > we're trying to launch a full import of approx. 375 million docs from a
> > MySQL database to our SolrCloud cluster. So far, this full import process
> > has taken around 24-27 hours to finish due to a huge import query (several
> > GROUP BYs, LEFT JOINs, etc.), but after another import query modification
> > (adding more complexity), we're unable to execute this full import from
> > MySQL.
> >
> > We've done some research about migrating to PostgreSQL, but this is not a
> > real option at this time, because it implies a big refactoring for several
> > dev teams.
> >
> > Are there alternative ways to perform this full import successfully?
>
> DIH is a capable tool, and for what it does, it's remarkably efficient.
>
> It can't really be made any faster, because it's single threaded.  To
> get increased index speed with Solr, you must index documents from
> several sources/processes/threads at the same time.  Writing custom
> software that can retrieve information from your source, build the
> documents you require, and send several update requests simultaneously
> will yield the best results.  The source itself may be a bottleneck
> though -- this is frequently the case, and Solr is often MUCH faster
> than the information source.
>
> You said that you're unable to execute an updated import from MySQL.
> What exactly happens when you try?  Are there any errors in your solr
> logfile?
>
> I'm not going to debate whether MySQL or PostgreSQL is the better
> solution.  For my indexes, my source data is in MySQL.  It works well,
> but full rebuilds using DIH are slower than I would like -- because it's
> single-threaded.  Our overall system architecture would probably be
> improved by a switch to PostgreSQL, but it would be an extremely
> time-consuming transition process.  We aren't having any real issues
> with MySQL, so we have no incentive to spend the required effort.
>
> Thanks,
> Shawn
>
>


Re: Full import alternatives

2018-04-13 Thread Mikhail Khludnev
Jesus,
Usually a zipper join (aka external merge in the old ETL world) and explicit
partitioning are able to speed up the import:
https://lucene.apache.org/solr/guide/6_6/uploading-structured-data-store-data-with-the-data-import-handler.html#entity-processors
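A hedged sketch of what a zipper join might look like in data-config.xml; the table and column names are hypothetical, and both entities must emit rows sorted by the join key for the merge to work:

```xml
<entity name="parent" query="SELECT id, title FROM parent ORDER BY id">
  <!-- join="zipper" merges the two sorted streams instead of running
       one child query per parent row -->
  <entity name="child"
          join="zipper"
          query="SELECT parent_id, detail FROM child ORDER BY parent_id"
          where="parent_id=parent.id"/>
</entity>
```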

On Fri, Apr 13, 2018 at 7:11 PM, Jesus Olivan 
wrote:

> Hi!
>
> we're trying to launch a full import of approx. 375 million docs from a
> MySQL database to our SolrCloud cluster. So far, this full import process
> has taken around 24-27 hours to finish due to a huge import query (several
> GROUP BYs, LEFT JOINs, etc.), but after another import query modification
> (adding more complexity), we're unable to execute this full import from
> MySQL.
>
> We've done some research about migrating to PostgreSQL, but this is not a
> real option at this time, because it implies a big refactoring for several
> dev teams.
>
> Are there alternative ways to perform this full import successfully?
>
> Any ideas are welcome :)
>
> Thanks in advance!
>



-- 
Sincerely yours
Mikhail Khludnev


Re: Full import alternatives

2018-04-13 Thread Erick Erickson
_how_ are you importing? DIH? SolrJ?

Here's an article about using SolrJ
https://lucidworks.com/2012/02/14/indexing-with-solrj/

But without more details it's really impossible to say much. Things
I've done in the past:
1> Use SolrJ and partition the job amongst a bunch of clients, each
of which works on a subset of docs. This requires, of course, that
there's a way to partition the import.
2> For joins and the like, I've sometimes been able to cache data in
local storage (SolrJ) and use that rather than using the joins. May
not be possible, of course, depending on the size of some of your
tables.
3> With DIH, there are some caching capabilities, although I confess I
don't know the pros and cons.
4> Work with your DB administrator to tune your query. Sometimes this
means creating a view, sometimes adding indexes.
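Point 1> (partitioning the job among clients) can be sketched as below; the MOD-on-primary-key scheme is just one hypothetical way to split the rows, and assumes a numeric primary key named `id`:

```java
public class ImportPartitioner {

    // SQL WHERE clause selecting the slice of rows that worker `index`
    // (0-based) out of `workers` total should import.
    static String partitionClause(int workers, int index) {
        return "MOD(id, " + workers + ") = " + index;
    }

    public static void main(String[] args) {
        int workers = 4;
        for (int i = 0; i < workers; i++) {
            // Each worker would run: SELECT ... FROM docs WHERE <clause>
            System.out.println("worker " + i + ": " + partitionClause(workers, i));
        }
    }
}
```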

Best,
Erick

On Fri, Apr 13, 2018 at 9:11 AM, Jesus Olivan  wrote:
> Hi!
>
> we're trying to launch a full import of approx. 375 million docs from a
> MySQL database to our SolrCloud cluster. So far, this full import process
> has taken around 24-27 hours to finish due to a huge import query (several
> GROUP BYs, LEFT JOINs, etc.), but after another import query modification
> (adding more complexity), we're unable to execute this full import from
> MySQL.
>
> We've done some research about migrating to PostgreSQL, but this is not a
> real option at this time, because it implies a big refactoring for several
> dev teams.
>
> Are there alternative ways to perform this full import successfully?
>
> Any ideas are welcome :)
>
> Thanks in advance!


Re: Full import alternatives

2018-04-13 Thread Shawn Heisey
On 4/13/2018 10:11 AM, Jesus Olivan wrote:
> we're trying to launch a full import of approx. 375 million docs from a
> MySQL database to our SolrCloud cluster. So far, this full import process
> has taken around 24-27 hours to finish due to a huge import query (several
> GROUP BYs, LEFT JOINs, etc.), but after another import query modification
> (adding more complexity), we're unable to execute this full import from
> MySQL.
>
> We've done some research about migrating to PostgreSQL, but this is not a
> real option at this time, because it implies a big refactoring for several
> dev teams.
>
> Are there alternative ways to perform this full import successfully?

DIH is a capable tool, and for what it does, it's remarkably efficient.

It can't really be made any faster, because it's single threaded.  To
get increased index speed with Solr, you must index documents from
several sources/processes/threads at the same time.  Writing custom
software that can retrieve information from your source, build the
documents you require, and send several update requests simultaneously
will yield the best results.  The source itself may be a bottleneck
though -- this is frequently the case, and Solr is often MUCH faster
than the information source.
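Sending documents from several threads at once usually means splitting the doc stream into batches, one per update request. A small JDK-only sketch of that splitting step (batch size is an assumption; in SolrJ one might instead reach for ConcurrentUpdateSolrClient, which manages such a queue internally):

```java
import java.util.ArrayList;
import java.util.List;

public class Batcher {

    // Splits docs into fixed-size batches so several threads can each send
    // one batch per update request in parallel.
    static <T> List<List<T>> batches(List<T> docs, int size) {
        List<List<T>> out = new ArrayList<>();
        for (int i = 0; i < docs.size(); i += size) {
            out.add(docs.subList(i, Math.min(i + size, docs.size())));
        }
        return out;
    }
}
```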

You said that you're unable to execute an updated import from MySQL. 
What exactly happens when you try?  Are there any errors in your solr
logfile?

I'm not going to debate whether MySQL or PostgreSQL is the better
solution.  For my indexes, my source data is in MySQL.  It works well,
but full rebuilds using DIH are slower than I would like -- because it's
single-threaded.  Our overall system architecture would probably be
improved by a switch to PostgreSQL, but it would be an extremely
time-consuming transition process.  We aren't having any real issues
with MySQL, so we have no incentive to spend the required effort.

Thanks,
Shawn



Full import alternatives

2018-04-13 Thread Jesus Olivan
Hi!

we're trying to launch a full import of approx. 375 million docs from a
MySQL database to our SolrCloud cluster. So far, this full import process
has taken around 24-27 hours to finish due to a huge import query (several
GROUP BYs, LEFT JOINs, etc.), but after another import query modification
(adding more complexity), we're unable to execute this full import from
MySQL.

We've done some research about migrating to PostgreSQL, but this is not a
real option at this time, because it implies a big refactoring for several
dev teams.

Are there alternative ways to perform this full import successfully?

Any ideas are welcome :)

Thanks in advance!