Re: import efficiencies

2016-05-26 Thread John Blythe
all good ideas and recs, guys. erick, i'd thought of much the same after
reading through the SolrJ post and beginning to get a bit anxious at the
idea of implementing it (not a java dev here lol). we're already doing some
processing before the import: taking a few million records, rolling them up
/ flattening them down into single representative versions of the data, and
then running some processes on them to get even more insight out of them. i
think the lowest-hanging fruit at this point will be to simply add some
extra processing at that stage, further upstream, and grab the related data
there, rather than leaving the DIH to strain under the plethora of open
connections it currently needs to pull it in.

thanks for all the thoughts and sparks flying around on this one, guys!

best,


-- 
*John Blythe*
Product Manager & Lead Developer

251.605.3071 | j...@curvolabs.com
www.curvolabs.com

58 Adams Ave
Evansville, IN 47713

Re: import efficiencies

2016-05-26 Thread John Bickerstaff
Having more carefully read Erick's post, I see that it is essentially what he
said in a much more straightforward way.

I will also second Erick's suggestion of hammering on the SQL.  We found
that fruitful many times at the same gig.  I develop, but I am not a SQL
master.  In a similar situation I'll usually seek out a specialist to help
me make sure the query isn't wasteful.  It frequently was, and I learned a
lot.
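
For a quick first pass before calling in a specialist, it can be worth
looking at the query plan yourself.  Here's a minimal JDBC sketch, assuming
MySQL; the connection details, query, and class name are all made up:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.Statement;

public class ExplainDihQuery {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection details and query -- substitute your own.
        String url = "jdbc:mysql://localhost:3306/mydb";
        String dihQuery = "SELECT id, vendor_uid, product_uid FROM transactions";

        try (Connection conn = DriverManager.getConnection(url, "user", "pass");
             Statement stmt = conn.createStatement();
             // EXPLAIN shows whether each (sub-)query can use an index or
             // has to scan the whole table.
             ResultSet rs = stmt.executeQuery("EXPLAIN " + dihQuery)) {
            ResultSetMetaData md = rs.getMetaData();
            while (rs.next()) {
                StringBuilder row = new StringBuilder();
                for (int i = 1; i <= md.getColumnCount(); i++) {
                    row.append(md.getColumnName(i)).append('=')
                       .append(rs.getString(i)).append("  ");
                }
                System.out.println(row);
            }
        }
    }
}

If the plan shows full table scans on the lookup tables, an index on the UID
columns is usually the first thing to try.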

Re: import efficiencies

2016-05-26 Thread John Bickerstaff
It may or may not be helpful, but there's a similar class of problem that
is frequently solved either by stored procedures or by running the query on
a schedule and storing the results...  It doesn't matter whether the end-point
for the data is Solr or somewhere else.

The problem is long-running queries that are extremely complex and put too
much stress on the database.

The solution is to de-normalize the data you need... store it in that form
and then the query gets really fast... sort of like a data warehouse type
of thing.  (Don't shoot, I know this isn't data warehousing...)

Postgres even has something called an "automatically updatable view" that
might serve - if that's your back end.

Anyway - the underlying strategy is to find a way to flatten your data
before turning it into Solr documents, by some means - either by
getting it out all the time via shorter-running queries into some kind of
store (Kafka, text file, whatever) or by using some feature of the database
(stored procs writing to a summary table, an automatically updatable view,
or similar).

In this way, when you make your query, you make it against the "flattened"
data - which is, ideally, all in one table - and then all the complexity of
joins etc... is washed away and things ought to run pretty fast.

The cost, of course, is a huge table with tons of duplicated data...  Only
you can say if that's worth it.  I did this at my last gig and we truncated
the table every 2 weeks to prevent it growing forever.

In case it's helpful...

PS - if you have the resources, a duplicate database can really help here
too - again my experience is mostly with Postgres which allows a "warm"
backup to be live.  We frequently used this for executive queries that were
using the database like a data warehouse because they were so
time-consuming.  It kept the load off production.
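
As a rough sketch of the summary-table idea: a small job, run on a schedule,
rebuilds one denormalized table with a single join, and the import then reads
that table with a trivially flat query.  The table names, columns, and
connection details below are all hypothetical:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class RebuildFlatTable {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection details; substitute your own.
        String url = "jdbc:mysql://localhost:3306/mydb";
        try (Connection conn = DriverManager.getConnection(url, "user", "pass");
             Statement stmt = conn.createStatement()) {
            // Rebuild the denormalized table in one shot; run this on a schedule
            // (cron, a stored procedure, etc.) so the import query stays trivial.
            stmt.executeUpdate("TRUNCATE TABLE flat_transactions");
            stmt.executeUpdate(
                "INSERT INTO flat_transactions (txn_id, txn_date, vendor_name, product_desc) "
                + "SELECT t.id, t.txn_date, v.name, p.description "
                + "FROM transactions t "
                + "JOIN vendors v ON v.uid = t.vendor_uid "
                + "JOIN products p ON p.uid = t.product_uid");
        }
    }
}

DIH (or a SolrJ loop) can then select straight from flat_transactions with no
sub-entities at all.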

Re: import efficiencies

2016-05-26 Thread Erick Erickson
Forgot to add... sometimes really hammering at the SQL query in DIH
can be fruitful: can you make one huge, monster query that's faster than
the sub-queries?

I've also seen people run processes on the DB that move all the
data into a temporary place, making use of all the nifty stuff you
can do there, and then use DIH on _that_. Or on the view.

All that said, I generally prefer using SolrJ if DIH doesn't do the job
after a day or two of fiddling; it gives more control.

Good Luck!
Erick


Re: import efficiencies

2016-05-26 Thread John Blythe
oo gotcha. cool, will make sure to check it out and bounce any related
questions through here.

thanks!

best,


-- 
*John Blythe*
Product Manager & Lead Developer

251.605.3071 | j...@curvolabs.com
www.curvolabs.com

58 Adams Ave
Evansville, IN 47713

Re: import efficiencies

2016-05-26 Thread Erick Erickson
Solr commits aren't the issue, I'd guess. All the time is
probably being spent getting the data from MySQL.

I've had some luck writing to Solr from a DB through a
SolrJ program; here's a place to get started:
searchhub.org/2012/02/14/indexing-with-solrj/
You can peel out the Tika bits pretty easily, I should
think.

One technique I've used is to cache
some of the DB tables in Java's memory to keep
from having to do the secondary lookup(s). This only
really works if the "secondary table" is small enough to fit in
Java's memory, of course. You can do some creative
things with caching partial tables if you can sort appropriately.

Best,
Erick
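
A minimal sketch of that approach -- one flat main query plus the small
secondary table cached in a HashMap -- might look something like this with a
recent SolrJ (the JDBC URL, Solr collection, tables, and field names are all
hypothetical):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class DbToSolr {
    public static void main(String[] args) throws Exception {
        String jdbcUrl = "jdbc:mysql://localhost:3306/mydb";          // hypothetical
        String solrUrl = "http://localhost:8983/solr/mycollection";   // hypothetical

        try (SolrClient solr = new HttpSolrClient.Builder(solrUrl).build();
             Connection conn = DriverManager.getConnection(jdbcUrl, "user", "pass");
             Statement stmt = conn.createStatement()) {

            // Cache the small secondary table once, so there are no per-row lookups.
            Map<String, String> vendorNames = new HashMap<>();
            try (ResultSet rs = stmt.executeQuery("SELECT uid, name FROM vendors")) {
                while (rs.next()) {
                    vendorNames.put(rs.getString("uid"), rs.getString("name"));
                }
            }

            // Stream the main table and send documents to Solr in batches.
            List<SolrInputDocument> batch = new ArrayList<>();
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT id, vendor_uid, description FROM transactions")) {
                while (rs.next()) {
                    SolrInputDocument doc = new SolrInputDocument();
                    doc.addField("id", rs.getString("id"));
                    doc.addField("description", rs.getString("description"));
                    doc.addField("vendor_name", vendorNames.get(rs.getString("vendor_uid")));
                    batch.add(doc);
                    if (batch.size() >= 1000) {
                        solr.add(batch);
                        batch.clear();
                    }
                }
            }
            if (!batch.isEmpty()) {
                solr.add(batch);
            }
            solr.commit();  // one commit at the end, not per batch
        }
    }
}

Batching the adds and committing once at the end keeps the Solr side from
becoming the bottleneck; the fetch from MySQL stays the dominant cost.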

import efficiencies

2016-05-26 Thread John Blythe
hi all,

i've got layered entities in my solr import. it's pulling in some
transactional data from a MySQL instance. there are two fields that are
used to then look up other information from other tables via their related
UIDs, one of which has its own child entity with yet another select statement
to grab up more data.

it fetches at about 120/s but processes at ~50-60/s. we currently only have
close to 500k records, but that's growing quickly, so it's becoming
increasingly painful to make modifications because of the reimport that then
needs to occur.

i feel like i'd seen some threads regarding commits of new data,
master/slave, or solrcloud/sharding that could help in ways related to
this, but as of yet i can't scrounge them up with my searches (ironic :p).

can someone help by pointing me to some good material related to this sort
of thing?

thanks-