Lucene tends to omit full scans as possible with leap-frogging on skiplists. It will enumerate all *matching* docs O(m) and rank every result with O(log(page size)). ie O(m log p). Early I remember that BJQ enumerated all matching children even most time it's enough find only one, potentially it's a room for improvement but it never be a problem.
On Fri, Apr 6, 2018 at 4:15 PM, Arturas Mazeika <maze...@gmail.com> wrote: > Hi Mikhail et al, > > I must say that this complexity question is still bugging me, and I wonder > if it is possible to get even partial answers in Big-O notation.. > > Say that we have N (for example 10^6) documents, each having 10 SKUs and > each in turn having 10 storage as well as every product having 10 vendors. > Consider then answer to be 1% large (there are 10 000 documents satisfying > the query). What would be the complexity of answering it? > > Cheers, > Arturas > > On Thu, Apr 5, 2018 at 11:47 AM, Arturas Mazeika <maze...@gmail.com> > wrote: > > > Hi Mikhail et al, > > > > Thanks a lot for sharing the code snippet. I would not have been able to > > dig this Java file myself to investigate the complexity of the search > > query. Scanning the code I get a feeling that it is well structured and > > well thought of. There is a concept like advance (Parent Approximation) > as > > well as ParentPhaseTwo, matches, matchCost, BlockJoinScorer, Explanation, > > Query rewriting. Is there a documentation available how the architecture > > looks like and what school of thought/doctrine used here? > > > > W.r.t. to my complexity question, I expected to see an answer in the > Big-O > > notation (rather than as Java code). Typically one makes assumptions > there > > about the key parameters (e.g., number of Products to be N_P, number of > > SKUs to be N_Sk, number of storages to be N_St, number of vendors to be > > N_V, JOIN Selectivities (in terms of percentage) be p(P,SK), p(SK,ST), > > p(P,V) between the corresponding entities and computes a formula. > > > > What is the complexity of this query in big-O notation? > > > > Cheers, > > Arturas > > > > > > > > On Wed, Apr 4, 2018 at 6:16 PM, Mikhail Khludnev <m...@apache.org> > wrote: > > > >> > > >> > What's happening under the hood of > >> > solr in answering query [1] from [2]? > >> > >> https://github.com/apache/lucene-solr/blob/master/lucene/ > >> join/src/java/org/apache/lucene/search/join/ToParentBlo > >> ckJoinQuery.java#L178 > >> > >> On Wed, Apr 4, 2018 at 3:39 PM, Arturas Mazeika <maze...@gmail.com> > >> wrote: > >> > >> > Hi Mikhail et al, > >> > > >> > Thanks a lot for a very thorough answer. This is an impressive piece > of > >> > knowledge you just shared. > >> > > >> > Not surprisingly, I was caught unprepared by the 'v=...' part of the > >> > answer. This brought me to the links you posted (starts with http). > From > >> > those links I went to the more updated link (starts with https), which > >> > brought me to other very resourceful links. Combined with some > >> meditation > >> > session, it came into my mind that it is not possible to express block > >> > queries using mathematical logic only. The format of the input > document > >> is > >> > deeply built into the query expression and answering. Expressing these > >> > queries mathematically / logically may give an impression that solr is > >> > capable of answering (NP-?) hard problems. I have a feeling though > that > >> > solr answers to queries in polynomial (or even almost linear) times. > >> > > >> > Just to connect the remaining dots.. What's happening under the hood > of > >> > solr in answering query [1] from [2]? Is it really so that inverted > >> index > >> > is used to identify the vectors of ids, that are scanned linearly in a > >> hope > >> > to get matches on _root_ and other internal variables? > >> > > >> > [1] q=+{!parent which=type_s:product v=$skuq} +{!parent > >> > which=type_s:product v=$vendorq}&skuq=+COLOR_s:Blue +SIZE_s:XL > +{!parent > >> > which=type_s:sku v='+QTY_i:[10 TO *] +STATE_s:CA'}&vendorq=+NAME_s: > Bob > >> > +PRICE_i:[20 TO 25] > >> > [2] > >> > https://blog.griddynamics.com/searching-grandchildren-and- > >> > siblings-with-solr-block-join/ > >> > > >> > Thanks! > >> > Arturas > >> > > >> > On Wed, Apr 4, 2018 at 12:36 PM, Mikhail Khludnev <m...@apache.org> > >> wrote: > >> > > >> > > q=+{!parent which=ntype:p v='+msg:Hello +person:Arturas'} +{!parent > >> > which= > >> > > ntype:p v='+msg:ciao +person:Vai'} > >> > > > >> > > On Wed, Apr 4, 2018 at 12:19 PM, Arturas Mazeika <maze...@gmail.com > > > >> > > wrote: > >> > > > >> > > > Hi Mikhail et al, > >> > > > > >> > > > It seems to me that the nested documents must include nodes that > >> encode > >> > > the > >> > > > level of nodes (within the document). Therefore, the minimal > example > >> > must > >> > > > include the node type. Is the following structure sufficient? > >> > > > > >> > > > { > >> > > > "id":1, > >> > > > "ntype":"p", > >> > > > "_childDocuments_": > >> > > > [ > >> > > > {"id":"1_1", "ntype":"c", "person":"Vai", > "time":"3:14", > >> > > > "msg":"Hello"}, > >> > > > {"id":"1_2", "ntype":"c", "person":"Arturas", > "time":"3:14", > >> > > > "msg":"Hello"}, > >> > > > {"id":"1_3", "ntype":"c", "person":"Vai", > "time":"3:15", > >> > > > "msg":"Coz Mathias is working on another system- different > >> screen."}, > >> > > > {"id":"1_4", "ntype":"c", "person":"Vai", > "time":"3:15", > >> > > > "msg":"It can get annoying"}, > >> > > > {"id":"1_5", "ntype":"c", "person":"Arturas", > "time":"3:15", > >> > > > "msg":"Thank you. this is very nice of you"}, > >> > > > {"id":"1_6", "ntype":"c", "person":"Vai", > "time":"3:16", > >> > > > "msg":"ciao"}, > >> > > > {"id":"1_7", "ntype":"c", "person":"Arturas", > "time":"3:16", > >> > > > "msg":"ciao"} > >> > > > ] > >> > > > }, > >> > > > { > >> > > > "id":2, > >> > > > "ntype":"p", > >> > > > "_childDocuments_": > >> > > > [ > >> > > > {"id":"2_1", "ntype":"c", "person":"Vai", > "time":"4:14", > >> > > > "msg":"Hi"}, > >> > > > {"id":"2_2", "ntype":"c", "person":"Arturas", > "time":"4:14", > >> > > > "msg":"IBM Watson"}, > >> > > > {"id":"2_3", "ntype":"c", "person":"Vai", > "time":"4:15", > >> > > > "msg":"need to retain content"}, > >> > > > {"id":"2_4", "ntype":"c", "person":"Vai", > "time":"4:15", > >> > > > "msg":"It can get annoying"}, > >> > > > {"id":"2_5", "ntype":"c", "person":"Arturas", > "time":"4:15", > >> > > > "msg":"You can make all your meetings more access"}, > >> > > > {"id":"2_6", "ntype":"c", "person":"Vai", > "time":"4:16", > >> > > > "msg":"Make every meeting a Skype meeting"}, > >> > > > {"id":"2_7", "ntype":"c", "person":"Arturas", > "time":"4:16", > >> > > > "msg":"ciao"} > >> > > > ] > >> > > > } > >> > > > > >> > > > How would a query look like that has a Hello from Person Arturas > and > >> > ciao > >> > > > from Person Vai? > >> > > > > >> > > > Cheers, > >> > > > Arturas > >> > > > > >> > > > > >> > > > On Tue, Apr 3, 2018 at 5:21 PM, Arturas Mazeika < > maze...@gmail.com> > >> > > wrote: > >> > > > > >> > > > > Hi Mikhail, > >> > > > > > >> > > > > Thanks a lot for the reply. > >> > > > > > >> > > > > You mentioned that > >> > > > > > >> > > > > q=+{!parent which.. v='+text:hello +person:A'} +{!parent > >> > > > > which..v='+text:ciao +person:B'} > >> > > > > > >> > > > > is the way to go. How would it look like precisely for the > >> following > >> > > > > collection? > >> > > > > > >> > > > > { > >> > > > > "id":1, > >> > > > > "_childDocuments_": > >> > > > > [ > >> > > > > {"id":"1_1", "person":"Vai" , "time":"3:14", > >> > > > > "msg":"Hello"}, > >> > > > > {"id":"1_2", "person":"Arturas" , "time":"3:14", > >> > > > > "msg":"Hello"}, > >> > > > > {"id":"1_3", "person":"Vai" , "time":"3:15", > >> > "msg":"Coz > >> > > > > Mathias is working on another system- different screen."}, > >> > > > > {"id":"1_4", "person":"Vai" , "time":"3:15", > >> > "msg":"It > >> > > > can > >> > > > > get annoying"}, > >> > > > > {"id":"1_5", "person":"Arturas" , "time":"3:15", > >> > > "msg":"Thank > >> > > > > you. this is very nice of you"}, > >> > > > > {"id":"1_6", "person":"Vai" , "time":"3:16", > >> > > > "msg":"ciao"}, > >> > > > > {"id":"1_7", "person":"Arturas" , "time":"3:16", > >> > > > "msg":"ciao"} > >> > > > > ] > >> > > > > }, > >> > > > > { > >> > > > > "id":2, > >> > > > > "_childDocuments_": > >> > > > > [ > >> > > > > {"id":"2_1", "person":"Vai" , "time":"4:14", > >> > > > > "msg":"Hello"}, > >> > > > > {"id":"2_2", "person":"Arturas" , "time":"4:14", > >> > "msg":"IBM > >> > > > > Watson"}, > >> > > > > {"id":"2_3", "person":"Vai" , "time":"4:15", > >> > > "msg":"need > >> > > > > to retain content"}, > >> > > > > {"id":"2_4", "person":"Vai" , "time":"4:15", > >> > "msg":"It > >> > > > can > >> > > > > get annoying"}, > >> > > > > {"id":"2_5", "person":"Arturas" , "time":"4:15", > >> > "msg":"You > >> > > > > can make all your meetings more access"}, > >> > > > > {"id":"2_6", "person":"Vai" , "time":"4:16", > >> > > "msg":"Make > >> > > > > every meeting a Skype meeting"}, > >> > > > > {"id":"2_7", "person":"Arturas" , "time":"4:16", > >> > > > "msg":"ciao"} > >> > > > > ] > >> > > > > } > >> > > > > > >> > > > > Cheers, > >> > > > > Arturas > >> > > > > > >> > > > > > >> > > > > On Tue, Apr 3, 2018 at 4:33 PM, Mikhail Khludnev < > m...@apache.org > >> > > >> > > > wrote: > >> > > > > > >> > > > >> Hello, Arturas. > >> > > > >> > >> > > > >> TLDR; Please find inline below. > >> > > > >> > >> > > > >> On Tue, Apr 3, 2018 at 5:14 PM, Arturas Mazeika < > >> maze...@gmail.com> > >> > > > >> wrote: > >> > > > >> > >> > > > >> > Hi Solr Fans, > >> > > > >> > > >> > > > >> > I am trying to make sense of information retrieval using > >> > expressions > >> > > > >> like > >> > > > >> > "some parent", "*only parent*", " *all parent*". I am also > >> trying > >> > to > >> > > > >> > understand the syntax "!parent which" and "!child of". On the > >> > > > technical > >> > > > >> > level, I am reading the following documents: > >> > > > >> > > >> > > > >> > [1] > >> > > > >> > https://lucene.apache.org/solr/guide/7_2/other-parsers. > >> > > > >> > html#block-join-query-parsers > >> > > > >> > [2] > >> > > > >> > https://lucene.apache.org/solr/guide/7_2/uploading-data- > >> > > > >> > with-index-handlers.html#nested-child-documents > >> > > > >> > [3] http://yonik.com/solr-nested-objects/ > >> > > > >> > > >> > > > >> > and I am confused to read: > >> > > > >> > > >> > > > >> > This parser takes a query that matches some parent documents > >> and > >> > > > returns > >> > > > >> > their children. The syntax for this parser is: q={!child > >> > > > >> > of=<allParents>}<someParents>. The parameter allParents is a > >> > filter > >> > > > that > >> > > > >> > matches *only parent documents*; here you would define the > >> field > >> > and > >> > > > >> value > >> > > > >> > that you used to identify *all parent documents*. The > parameter > >> > > > >> someParents > >> > > > >> > identifies a query that will match some of the parent > >> documents. > >> > The > >> > > > >> output > >> > > > >> > is the children. > >> > > > >> > > >> > > > >> > The first sentence talks about "matching" but does not define > >> what > >> > > > that > >> > > > >> > means (and why it is only some parents matching?). The second > >> > > sentence > >> > > > >> > introduces a syntax of the parser, but blurs the > understanding > >> as > >> > > > "some" > >> > > > >> > and "all" of parents are combined into one sentence. My > >> > > understanding > >> > > > is > >> > > > >> > that all documents are retrieve that satisfy a query. The > query > >> > must > >> > > > >> > express some constraints on the parent node and some on the > >> child > >> > > > node. > >> > > > >> I > >> > > > >> > have a feeling that "only parent documents" reads "criteria > is > >> > > > >> formulated > >> > > > >> > over the parent part of {parent document}->{child document} > of > >> > > entity. > >> > > > >> > My simplified conceptual world of solr looks in the following > >> way: > >> > > > >> > > >> > > > >> > 1. Every document has an ID. > >> > > > >> > 2. Every document may have additional attributes > >> > > > >> > 3. Text attributes is what's at stake in solr. Sure we can > >> search > >> > > for > >> > > > >> > products that costs at most X, but this is the added > >> > functionality. > >> > > > For > >> > > > >> > simplicity I am neglecting those here. > >> > > > >> > 4. The user has an information need. She expresses it with > >> > > (key)words > >> > > > >> and > >> > > > >> > hopes to find matching documents. For simplicity, I am > skipping > >> > all > >> > > > >> issues > >> > > > >> > related to the information presentation of the documents > >> > > > >> > 5. Analysis chain (and inverse index) are the key > technologies > >> > solr > >> > > is > >> > > > >> > based upon. Once the chain-processing is applied, > mathematical > >> > logic > >> > > > >> kicks > >> > > > >> > in, retrieving the documents (that are a set of processed, > >> > > normalized, > >> > > > >> > enriched tokens) matching the query (processed, normalized > and > >> > > > enriched > >> > > > >> > tokens). Clearly, the logic function can be a fancy one (at > >> least > >> > > one > >> > > > of > >> > > > >> > query token is in the document set of tokens, etc.), ranking > is > >> > used > >> > > > to > >> > > > >> > sort the results. > >> > > > >> > 6. A nested document concept is introduced in solr. It needs > >> to be > >> > > > >> uploaded > >> > > > >> > into the index structure using a specific handlers [2]. A > >> nested > >> > > > >> documents > >> > > > >> > is a tree. A root may contain children documents, which may > be > >> > > parents > >> > > > >> of > >> > > > >> > grandchildren documents. > >> > > > >> > 7. Querying nested documents is supported in the following > >> manner: > >> > > > >> > 7.1 Child documents are return that satisfies {parent > >> > > > >> > document}->{document} > >> > > > >> > 7.2 Parent documents are return that satisfy > >> > {document}->{child > >> > > > >> > document} > >> > > > >> > > >> > > > >> > Would I be very wrong to have this conceptual picture? > >> > > > >> > > >> > > > >> > From this point, the situation is a bit bury in my head. At > the > >> > > core, > >> > > > I > >> > > > >> do > >> > > > >> > not really understand what "a document" is anymore (since the > >> > > complete > >> > > > >> json > >> > > > >> > or xml, so is a sub-json and sub-xml are documents, every > >> document > >> > > > must > >> > > > >> > have an ID, does that meant the the subdocuments must have > and > >> ID > >> > > too, > >> > > > >> or > >> > > > >> > sub-ids are also fine?), how to formulate mathematical > >> expressions > >> > > > over > >> > > > >> > documents and what it means that the document satisfies my > >> > (key)word > >> > > > >> query? > >> > > > >> > Can we define a document to be the largest entity of > >> information > >> > > that > >> > > > >> does > >> > > > >> > not contain any other nested documents [4]? If this is > defined > >> and > >> > > > >> > communicated like this already where can I find it? There is > a > >> use > >> > > of > >> > > > >> the > >> > > > >> > clarification, as the concept of the document means different > >> > things > >> > > > in > >> > > > >> > different contexts (e.g., you can update only the "complete > >> > > document" > >> > > > in > >> > > > >> > the index vs. parent document, etc.). > >> > > > >> > > >> > > > >> > Is it possible to formulate what's going on using > mathematical > >> > > logic? > >> > > > >> Can > >> > > > >> > one express something like > >> > > > >> > > >> > > > >> > { give documents d : d is a document, d is parent of document > >> c, d > >> > > > >> > satisfies logical criteria C1,....,CN, c satisfies logical > >> > criteria > >> > > > >> > C1',...,CM'} > >> > > > >> > { give documents c : c is a document, d is parent of document > >> c, d > >> > > > >> > satisfies logical criteria C1,....,CN, c satisfies logical > >> > criteria > >> > > > >> > C1',...,CM'} > >> > > > >> > > >> > > > >> > here the meaning of document is as in definition [4] above. > >> > > > >> > > >> > > > >> > 1. Is it possible to retrieve all parent documents that have > >> two > >> > > > >> children > >> > > > >> > c1 and c2? Consider a document that is a skype chat, and > >> children > >> > > are > >> > > > >> > individual lines of communication in the chat. I would be > >> looking > >> > > for > >> > > > >> the > >> > > > >> > (parent) documents that have "hello" said by person A and > >> "ciao" > >> > > said > >> > > > by > >> > > > >> > person B (as two different sub-documents). > >> > > > >> > > >> > > > >> > >> > > > >> q=+{!parent which.. v='+text:hello +person:A'} +{!parent > which.. > >> > > > >> v='+text:ciao +person:B'} > >> > > > >> The query syntax is really tricky and cumbersome. > >> > > > >> > >> > > > >> > >> > > > >> > > >> > > > >> > 2. Is it possible to search for documents such that they > have a > >> > > > >> grandchild > >> > > > >> > and the grandchild has the word "hello"? > >> > > > >> > > >> > > > >> > >> > > > >> http://blog-archive.griddynamics.com/2013/12/grandchildren- > >> > > > >> and-siblings-with-block.html > >> > > > >> > >> > > > >> > >> > > > >> > > >> > > > >> > 3. Is it possible to search for documents that do not have > >> > children? > >> > > > >> > > >> > > > >> q=-{!parent which..}type:child > >> > > > >> Beware that mixing parents and childfree products is not > >> supported > >> > and > >> > > > >> causes pain. as a workaround you need to put empty child > >> placeholder > >> > > > doc. > >> > > > >> Sic. Sorry. > >> > > > >> > >> > > > >> > >> > > > >> > Is this the right venue to discuss documentation of solr? > >> > > > >> > > >> > > > >> > Thanks! > >> > > > >> > Arturas > >> > > > >> > > >> > > > >> > >> > > > >> > >> > > > >> > >> > > > >> -- > >> > > > >> Sincerely yours > >> > > > >> Mikhail Khludnev > >> > > > >> > >> > > > > > >> > > > > > >> > > > > >> > > > >> > > > >> > > > >> > > -- > >> > > Sincerely yours > >> > > Mikhail Khludnev > >> > > > >> > > >> > >> > >> > >> -- > >> Sincerely yours > >> Mikhail Khludnev > >> > > > > > -- Sincerely yours Mikhail Khludnev