Re: Performance of cross join vs block join

Mikhail Khludnev Fri, 12 Jul 2013 11:43:26 -0700

Hello Roman,

Thanks for your interest. I briefly looked at your approach, and I'm really
interested in your numbers.


Here is the trivial code. I'd rather rely on your testing framework, and I
can provide you with a version of Solr 4.2 with SOLR-3076 applied. Do you
need it?
https://github.com/m-khl/join-tester

What you are saying about benchmark representativeness definitely makes
sense. I didn't try to establish a complete, fully representative
benchmark. I just wanted rough numbers relevant to my own use case.
I'm from eCommerce, and that volume was enough for me.

What I didn't get is 'not the block joins, because these cannot be used for
citation data - we cannot reasonably index them into one segment'. Usually
there is no problem with blocks in a multi-segment index; a single block just
can't span segments. Anyway, please elaborate.
One of the block join benefits is the ability to hit only the first matched
child in a group and jump over the following ones. It isn't applicable in
general, but it sometimes gives a huge gain.
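
To illustrate the 'first matched child' point, here is a minimal Python
sketch. It is not the Lucene or Solr code; it only assumes the block layout
where children are indexed right before their parent, and the names
block_join_parents, child_hits and parent_ids are made up for the
illustration.

```python
from bisect import bisect_right


def block_join_parents(child_hits, parent_ids):
    """Map matching children to their parents, one hit per block.

    child_hits: sorted doc IDs of children matching the child query.
    parent_ids: sorted doc IDs of parent docs; by assumption each parent
                is the last doc of its block, right after its children.
    Returns (parents, examined), where examined counts the child hits
    that were actually looked at.
    """
    parents = []
    examined = 0
    i = 0
    while i < len(child_hits):
        examined += 1
        # The parent of a child is the first parent doc ID after it.
        p = bisect_right(parent_ids, child_hits[i])
        if p == len(parent_ids):
            break  # orphan child with no parent after it
        parent = parent_ids[p]
        parents.append(parent)
        # Jump over the remaining matched children of the same block.
        i = bisect_right(child_hits, parent, i)
    return parents, examined


if __name__ == "__main__":
    # Three blocks: children 0-2 + parent 3, children 4-5 + parent 6,
    # children 7-9 + parent 10.
    print(block_join_parents([0, 1, 2, 7, 9], [3, 6, 10]))
```

With five matching children spread over two blocks, only two of them need to
be examined; the rest are skipped once their parent is known.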


On Fri, Jul 12, 2013 at 8:29 PM, Roman Chyla <roman.ch...@gmail.com> wrote:

> Hi Mikhail,
> I have commented on your blog, but it seems I have done something wrong, as
> the comment is not there. Would it be possible to share the test setup
> (script)?
>
> I have found that the crucial thing with joins is the number of 'joins'
> [hits returned], and it seems that the experiments I have seen so far were
> geared towards small collections. Even if Erick's index was 26M, the number
> of hits was probably small. You can see a very different story if you face
> some [other] real data. Here is a citation network, and I was comparing
> Lucene's join (i.e. not the block joins, because these cannot be used for
> citation data: we cannot reasonably index them into one segment):
>
>
> https://github.com/romanchyla/r-ranking-fun/blob/master/plots/raw/comparison-join-2nd.png
>
> Notice that the y axis is sqrt-scaled, so the running time for the Lucene
> join is growing, and growing very fast! It takes Lucene 30s to do a search
> that selects 1M hits.
>
> The comparison is against our own implementation of a similar search - but
> the main point I am making is that the join benchmarks should be showing
> the number of hits selected by the join operation. Otherwise, a very
> important detail is hidden.
>
> Best,
>
>   roman
>
>
> On Fri, Jul 12, 2013 at 4:57 AM, Mikhail Khludnev <
> mkhlud...@griddynamics.com> wrote:
>
> > On Fri, Jul 12, 2013 at 12:19 PM, mihaela olteanu <mihaela...@yahoo.com
> > >wrote:
> >
> > > Hi Mikhail,
> > >
> > > I used the term block join wrongly. When I said block join I was
> > > referring to a join performed on a single core, versus a cross join,
> > > which is performed on multiple cores.
> > > But I saw your benchmark (from cache) and it seems that block join has
> > > better performance. Is this functionality available in Solr 4.3.1?
> >
> > Nope, SOLR-3076 has been waiting for ages.
> >
> >
> > > I did not find such examples on Solr's wiki page.
> > > Does this functionality require a special schema, or special indexing?
> >
> > Special indexing - yes.
> >
> >
> > > How would I need to index the data from my tables? In my case all the
> > > indices have a common schema anyway, since I am using dynamic fields, so
> > > I can easily add all documents from all tables to one Solr core, adding
> > > a discriminator field to each document.
> > >
> > Correct, but the notion of a 'discriminator field' is a little bit
> > different for blockjoin.
> >
> >
> > >
> > > Could you point me to some more documentation?
> > >
> >
> > I can recommend only these:
> >
> > http://blog.mikemccandless.com/2012/01/searching-relational-content-with.html
> > http://www.youtube.com/watch?v=-OiIlIijWH0
> >
> >
> > > Thanks in advance,
> > > Mihaela
> > >
> > >
> > > ________________________________
> > >  From: Mikhail Khludnev <mkhlud...@griddynamics.com>
> > > To: solr-user <solr-user@lucene.apache.org>; mihaela olteanu <
> > > mihaela...@yahoo.com>
> > > Sent: Thursday, July 11, 2013 2:25 PM
> > > Subject: Re: Performance of cross join vs block join
> > >
> > >
> > > Mihaela,
> > >
> > > For me it's reasonable that a single-core join takes the same time as a
> > > cross-core one. I just can't see what gain could be obtained in the
> > > former case.
> > > I'm hardly able to comment on the join code; I looked into it, and it's
> > > not trivial, at least. Block join doesn't need to obtain parentId term
> > > values/numbers and look up parents by them. Both of these actions are
> > > expensive. Also, blockjoin works as an iterator, but join needs to
> > > allocate memory for the parents bitset and populate it out of order,
> > > which impacts scalability.
> > > Also, in None scoring mode BJQ doesn't need to walk through all children,
> > > but only hits the first. Another nice feature is 'both side leapfrog':
> > > if you have a highly restrictive filter/query intersecting with BJQ, it
> > > allows skipping many parents and children as well. That's not possible
> > > in Join, which has a fairly 'full-scan' nature.
> > > The main performance factor for Join is the number of child docs.
> > > I'm not sure I got all your questions; please specify them in more
> > > detail if something is still unclear.
> > > Have you seen my benchmark
> > > http://blog.griddynamics.com/2012/08/block-join-query-performs.html ?
> > >
> > >
> > >
> > > On Thu, Jul 11, 2013 at 1:52 PM, mihaela olteanu <mihaela...@yahoo.com
> > > >wrote:
> > >
> > > > Hello,
> > > >
> > > > Does anyone know of any performance measurements for cross joins
> > > > compared to joins inside a single index?
> > > >
> > > > Is a join inside a single index that stores all documents of various
> > > > types (from the parent table or from child tables) with a
> > > > discriminator field faster than a cross join (where each document
> > > > type resides in its own index)?
> > > >
> > > > I have performed some tests, but it seems to me that a join in a
> > > > single (bigger) index does not add much of a speed improvement
> > > > compared to cross joins.
> > > >
> > > > Why would a block join be faster than a cross join if this is the case?
> > > > What are the variables that count when trying to improve the query
> > > > execution time?
> > > >
> > > > Thanks!
> > > > Mihaela
> > >
> > >
> > >
> > >
> > > --
> > > Sincerely yours
> > > Mikhail Khludnev
> > > Principal Engineer,
> > > Grid Dynamics
> > >
> > > <http://www.griddynamics.com>
> > > <mkhlud...@griddynamics.com>
> >
> >
> >
> >
> > --
> > Sincerely yours
> > Mikhail Khludnev
> > Principal Engineer,
> > Grid Dynamics
> >
> >  <http://www.griddynamics.com>
> > <mkhlud...@griddynamics.com>
> >
>



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
 <mkhlud...@griddynamics.com>
