Re: [DISCUSS] supporting faceting in Oak query engine
Hi,

Davide’s proposal (let users specify a maximum number of entries per facet) is basically a generalisation of my proposal to return a facet if there is more than 1 entry in the facet. I think we can try either, but we might want to test the performance on cases with large result sets where only a few results are readable by the user. AFAIR Amit and Davide have been working on a “micro scalability test framework” (measuring how queries scale with content). We could maybe add these tests there.

On Ard’s suggestion of “possibly incorrect, fast counts”: I think this is only feasible if the “incorrect” count is guaranteed to always be lower than the exact amount. Otherwise facets would lead to information leakage, as users could find information about nodes they otherwise cannot read.

Cheers
Michael

On 10 Dec 2014, at 11:12, Tommaso Teofili tommaso.teof...@gmail.com wrote:

2014-12-10 10:17 GMT+01:00 Ard Schrijvers a.schrijv...@onehippo.com:

On Wed, Dec 10, 2014 at 9:32 AM, Davide Giannella dav...@apache.org wrote:

On 09/12/2014 17:10, Michael Marth wrote:

...

The problematic use case for counting facets that I have in mind is when a query returns millions of results. This is problematic when one wants to retrieve the exact size of the result set (taking ACLs into account, obviously). When facets are to be retrieved this will be an even harder problem (meaning when the exact number is to be calculated per facet). As an illustration, consider a digital asset management application that displays mime type as facets. A query could return 1 million images and, say, 10 videos.

Is there a way we could support such scenarios (while still counting results per facet) and have a performant implementation?

We can opt for ACL-checking/parsing at most X (let's say 1000) nodes. If we're done within that limit, then we can output the actual number.
In case after 1000 nodes checked we still have some left we can leave the number either empty or with something like many, +, or any other fancy way if we want. In the end is the same approach taken by Amazon (as Tommaso already pointed) or for example google. If you run a search, their facets (Searches related to...) are never with results. I don't think Amazon and Google have customers that can demand them to show correct facet counts...our customers typically do :). I see, however something along the lines of what Davide was proposing doesn't sound too bad to me even for such use cases (but I may be wrong). My take on on this would be to have a configurable option between 1) exact and possibly slow counts 2) unauthorized, possibly incorrect, fast counts Obviously, the second just uses the faceted navigation counts from the backing search implementation (with node by node access manager check), whether it is the internal lucene index, solr or Elastic Search. If you opt for the second option, then, depending on your authorization model you can get fast exact authorized counts as well : When the authorization model can be translated into a search query / filter that is AND-ed with every normal search. For ES this is briefly written at [1]. Most likely the filter is internally cached so even for very large authorization queries (like we have at Hippo because of fine grained ACL model) it will just perform. Obviously it depends quite heavily on your authorization model whether it can be translated to a query. If it relies on an external authorization check or has many hierarchical constraints, it will be very hard. If you choose to have it based on, say, nodetype, nodename, node properties and jcr:path (fake pseudo property) it can be easily translated to a query. 
Note that for the jcr:path hierarchical ACL (eg read everything below /foo) it is not possible to write a lucene query easily unless you index path information as wellthis results in that moves of large subtree's are slow because the entire subtree needs to be re-indexed. A different authorization model might be based on groups, where every node also gets the groups (the token of the group) indexed that can read that node. Although I never looked much into the code, I suspect [2] does something like this. that's what I had in mind in my proposal #4, the hurdles there relate to the fact that each index implementation aiming at providing facets would have to implement such an index and search with ACLs which is not trivial. One possibly good thing is that this is for sure not a new issue, as you pointed out Apache ManifoldCF has something like that for Solr (and I think for ES too). One the other hand this would differ quite a bit from the approach taken so far (indexes see just node and properties, the QueryEngine post filters results on ACLs, node types, etc.), so that'd be a significant change. So, instead of second guessing which might be acceptable (slow queries, wrong counts, etc) for which customers/users I'd try to
Re: [DISCUSS] supporting faceting in Oak query engine
Thanks, Michael.

FWIW, with the use cases I have in mind, getting back a count that is less than the actual number (and some indication that there is an unknown amount more) would be perfectly fine if it makes us go from potentially unacceptable performance to acceptable performance.

Laurie

On 12/12/14 12:41 AM, Michael Marth mma...@adobe.com wrote:

Hi,

Davide’s proposal (let users specify maximum number of entries per facet) is basically a generalisation of my proposal to return a facet if there is more than 1 entry in the facet. I think we can try either, but we might want to test the performance on cases with large result sets where only few results are readable by the user. AFAIR Amit and Davide have been working on a “micro scalability test framework” (measuring how queries scale with content). We could maybe add these tests there.

On Ard’s suggestion “possibly incorrect, fast counts”: I think this is only feasible if “incorrect” is guaranteed to always be lower than the exact amount. Otherwise facets would lead to information leakage as users could find information about nodes they otherwise cannot read.

Cheers
Michael

On 10 Dec 2014, at 11:12, Tommaso Teofili tommaso.teof...@gmail.com wrote:

2014-12-10 10:17 GMT+01:00 Ard Schrijvers a.schrijv...@onehippo.com:

On Wed, Dec 10, 2014 at 9:32 AM, Davide Giannella dav...@apache.org wrote:

On 09/12/2014 17:10, Michael Marth wrote:

...

The use cases problematic case for counting the facets I have in mind are when a query returns millions of results. This is problematic when one wants to retrieve the exact size of the result set (taking ACLs into account, obviously). When facets are to be retrieved this will be an even harder problem (meaning when the exact number is to be calculated per facet). As an illustration consider a digital asset management application that displays mime type as facets. A query could return 1 million images and, say, 10 video.
Re: [DISCUSS] supporting faceting in Oak query engine
On Wed, Dec 10, 2014 at 9:32 AM, Davide Giannella dav...@apache.org wrote:

On 09/12/2014 17:10, Michael Marth wrote:

...

The problematic use case for counting facets that I have in mind is when a query returns millions of results. This is problematic when one wants to retrieve the exact size of the result set (taking ACLs into account, obviously). When facets are to be retrieved this will be an even harder problem (meaning when the exact number is to be calculated per facet). As an illustration, consider a digital asset management application that displays mime type as facets. A query could return 1 million images and, say, 10 videos.

Is there a way we could support such scenarios (while still counting results per facet) and have a performant implementation?

We can opt for ACL-checking/parsing at most X (let's say 1000) nodes. If we're done within that limit, then we can output the actual number. In case we still have some nodes left after checking 1000, we can either leave the number empty or show something like "many", "+", or any other fancy marker if we want.

In the end it is the same approach taken by Amazon (as Tommaso already pointed out) or, for example, Google. If you run a search, their facets (Searches related to...) never come with result counts.

I don't think Amazon and Google have customers that can demand them to show correct facet counts... our customers typically do :).

My take on this would be to have a configurable option between

1) exact and possibly slow counts
2) unauthorized, possibly incorrect, fast counts

Obviously, the second just uses the faceted navigation counts from the backing search implementation (with node by node access manager check), whether it is the internal Lucene index, Solr or Elasticsearch.

If you opt for the second option, then, depending on your authorization model, you can get fast exact authorized counts as well: namely when the authorization model can be translated into a search query / filter that is AND-ed with every normal search. For ES this is briefly described at [1]. Most likely the filter is internally cached, so even for very large authorization queries (like we have at Hippo because of our fine-grained ACL model) it will just perform.

Obviously it depends quite heavily on your authorization model whether it can be translated to a query. If it relies on an external authorization check or has many hierarchical constraints, it will be very hard. If you choose to base it on, say, nodetype, nodename, node properties and jcr:path (a fake pseudo property), it can easily be translated to a query. Note that for a jcr:path hierarchical ACL (e.g. read everything below /foo) it is not possible to write a Lucene query easily unless you index path information as well; as a result, moves of large subtrees are slow because the entire subtree needs to be re-indexed.

A different authorization model might be based on groups, where every node also gets indexed with the groups (the token of the group) that can read that node. Although I never looked much into the code, I suspect [2] does something like this.

So, instead of second-guessing what might be acceptable (slow queries, wrong counts, etc.) for which customers/users, I'd try to keep the options open, have a default of correct (slow) counts, and make it easy to flip to 'counts from the indexes without accessmanager authorization', where, depending on the authorization model, the latter can return correct results.

For those who are interested, I will be listening to [3] this afternoon (5 pm GMT).

Regards Ard

[1] http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-aliases.html#filtered
[2] http://manifoldcf.apache.org/en_US/index.html
[3] http://www.elasticsearch.com/webinars/shield-securing-your-data-in-elasticsearch/

D.

--
Amsterdam - Oosteinde 11, 1017 WT Amsterdam
Boston - 1 Broadway, Cambridge, MA 02142
US +1 877 414 4776 (toll free)
Europe +31(0)20 522 4466
www.onehippo.com
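The group-based authorization model Ard describes (each node indexed together with the tokens of the groups that may read it, and the user's groups AND-ed into every search as a filter) could be sketched roughly as follows. This is only an in-memory illustration and not Oak, Lucene or ManifoldCF code: `IndexedNode`, `AuthorizedFacets` and the field names are hypothetical, and a real index would evaluate the filter inside the search engine rather than in a loop.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical index entry: a facet value plus the group tokens allowed to read the node. */
final class IndexedNode {
    final String path;
    final String mimeType;
    final Set<String> readerGroups;

    IndexedNode(String path, String mimeType, String... readerGroups) {
        this.path = path;
        this.mimeType = mimeType;
        this.readerGroups = new HashSet<>(Arrays.asList(readerGroups));
    }
}

final class AuthorizedFacets {
    /** Counts mime-type facets over the nodes that share at least one group token with the user. */
    static Map<String, Integer> countByMimeType(List<IndexedNode> matches, Set<String> userGroups) {
        Map<String, Integer> counts = new TreeMap<>();
        for (IndexedNode node : matches) {
            // Stand-in for the authorization filter that is AND-ed with the normal query.
            if (!Collections.disjoint(node.readerGroups, userGroups)) {
                counts.merge(node.mimeType, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

Because the group tokens live in the index, the counts are already authorized and no per-node access manager call is needed; the cost, as noted above, is that ACL changes require re-indexing the affected nodes.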
Re: [DISCUSS] supporting faceting in Oak query engine
2014-12-10 10:17 GMT+01:00 Ard Schrijvers a.schrijv...@onehippo.com: On Wed, Dec 10, 2014 at 9:32 AM, Davide Giannella dav...@apache.org wrote: On 09/12/2014 17:10, Michael Marth wrote: ... The use cases problematic case for counting the facets I have in mind are when a query returns millions of results. This is problematic when one wants to retrieve the exact size of the result set (taking ACLs into account, obviously). When facets are to be retrieved this will be an even harder problem (meaning when the exact number is to be calculated per facet). As an illustration consider a digital asset management application that displays mime type as facets. A query could return 1 million images and, say, 10 video. Is there a way we could support such scenarios (while still counting results per facet) and have a performant implementation? We can opt for ACL-Checking/Parsing at most X (let's say 1000) nodes. If we're done within it, then we can output the actual number. In case after 1000 nodes checked we still have some left we can leave the number either empty or with something like many, +, or any other fancy way if we want. In the end is the same approach taken by Amazon (as Tommaso already pointed) or for example google. If you run a search, their facets (Searches related to...) are never with results. I don't think Amazon and Google have customers that can demand them to show correct facet counts...our customers typically do :). I see, however something along the lines of what Davide was proposing doesn't sound too bad to me even for such use cases (but I may be wrong). My take on on this would be to have a configurable option between 1) exact and possibly slow counts 2) unauthorized, possibly incorrect, fast counts Obviously, the second just uses the faceted navigation counts from the backing search implementation (with node by node access manager check), whether it is the internal lucene index, solr or Elastic Search. 
If you opt for the second option, then, depending on your authorization model you can get fast exact authorized counts as well : When the authorization model can be translated into a search query / filter that is AND-ed with every normal search. For ES this is briefly written at [1]. Most likely the filter is internally cached so even for very large authorization queries (like we have at Hippo because of fine grained ACL model) it will just perform. Obviously it depends quite heavily on your authorization model whether it can be translated to a query. If it relies on an external authorization check or has many hierarchical constraints, it will be very hard. If you choose to have it based on, say, nodetype, nodename, node properties and jcr:path (fake pseudo property) it can be easily translated to a query. Note that for the jcr:path hierarchical ACL (eg read everything below /foo) it is not possible to write a lucene query easily unless you index path information as wellthis results in that moves of large subtree's are slow because the entire subtree needs to be re-indexed. A different authorization model might be based on groups, where every node also gets the groups (the token of the group) indexed that can read that node. Although I never looked much into the code, I suspect [2] does something like this. that's what I had in mind in my proposal #4, the hurdles there relate to the fact that each index implementation aiming at providing facets would have to implement such an index and search with ACLs which is not trivial. One possibly good thing is that this is for sure not a new issue, as you pointed out Apache ManifoldCF has something like that for Solr (and I think for ES too). One the other hand this would differ quite a bit from the approach taken so far (indexes see just node and properties, the QueryEngine post filters results on ACLs, node types, etc.), so that'd be a significant change. 
So, instead of second-guessing what might be acceptable (slow queries, wrong counts, etc.) for which customers/users, I'd try to keep the options open, have a default of correct (slow) counts, and make it easy to flip to 'counts from the indexes without accessmanager authorization', where, depending on the authorization model, the latter can return correct results.

I think the best way of addressing this is by prototyping (some of) the mentioned options and seeing where we get; I'll see what I can do there.

For those who are interested, I will be listening to [3] this afternoon (5 pm GMT).

cool, thanks for the pointer!

Regards,
Tommaso

Regards Ard

[1] http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-aliases.html#filtered
[2] http://manifoldcf.apache.org/en_US/index.html
[3] http://www.elasticsearch.com/webinars/shield-securing-your-data-in-elasticsearch/

D.

-- Amsterdam - Oosteinde 11, 1017 WT Amsterdam Boston - 1 Broadway, Cambridge, MA
Re: [DISCUSS] supporting faceting in Oak query engine
On Wed, Dec 10, 2014 at 10:17 AM, Ard Schrijvers a.schrijv...@onehippo.com wrote:

On Wed, Dec 10, 2014 at 9:32 AM, Davide Giannella dav...@apache.org wrote:

On 09/12/2014 17:10, Michael Marth wrote:

...

The use cases problematic case for counting the facets I have in mind are when a query returns millions of results. This is problematic when one wants to retrieve the exact size of the result set (taking ACLs into account, obviously). When facets are to be retrieved this will be an even harder problem (meaning when the exact number is to be calculated per facet). As an illustration consider a digital asset management application that displays mime type as facets. A query could return 1 million images and, say, 10 video.

Is there a way we could support such scenarios (while still counting results per facet) and have a performant implementation?

We can opt for ACL-Checking/Parsing at most X (let's say 1000) nodes. If we're done within it, then we can output the actual number. In case after 1000 nodes checked we still have some left we can leave the number either empty or with something like many, +, or any other fancy way if we want.

In the end is the same approach taken by Amazon (as Tommaso already pointed) or for example google. If you run a search, their facets (Searches related to...) are never with results.

I don't think Amazon and Google have customers that can demand them to show correct facet counts...our customers typically do :).

My take on this would be to have a configurable option between

1) exact and possibly slow counts
2) unauthorized, possibly incorrect, fast counts

Obviously, the second just uses the faceted navigation counts from the backing search implementation (with node by node access manager check), whether it is the internal lucene index, solr or Elastic Search.

Here of course I meant to write: '**without** node by node access manager check'
If you opt for the second option, then, depending on your authorization model you can get fast exact authorized counts as well : When the authorization model can be translated into a search query / filter that is AND-ed with every normal search. For ES this is briefly written at [1]. Most likely the filter is internally cached so even for very large authorization queries (like we have at Hippo because of fine grained ACL model) it will just perform. Obviously it depends quite heavily on your authorization model whether it can be translated to a query. If it relies on an external authorization check or has many hierarchical constraints, it will be very hard. If you choose to have it based on, say, nodetype, nodename, node properties and jcr:path (fake pseudo property) it can be easily translated to a query. Note that for the jcr:path hierarchical ACL (eg read everything below /foo) it is not possible to write a lucene query easily unless you index path information as wellthis results in that moves of large subtree's are slow because the entire subtree needs to be re-indexed. A different authorization model might be based on groups, where every node also gets the groups (the token of the group) indexed that can read that node. Although I never looked much into the code, I suspect [2] does something like this. So, instead of second guessing which might be acceptable (slow queries, wrong counts, etc) for which customers/users I'd try to keep the options open, have a default of correct (slow) counts, and make it easy to flip to 'counts from the indexes without accessmanager authorization', where depending on the authorization model, the latter can return correct results. For those who are interested, I will be listening to [3] this afternoon (5 pm GMT). 
Regards Ard

[1] http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-aliases.html#filtered
[2] http://manifoldcf.apache.org/en_US/index.html
[3] http://www.elasticsearch.com/webinars/shield-securing-your-data-in-elasticsearch/

D.

--
Amsterdam - Oosteinde 11, 1017 WT Amsterdam
Boston - 1 Broadway, Cambridge, MA 02142
US +1 877 414 4776 (toll free)
Europe +31(0)20 522 4466
www.onehippo.com
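The configurable choice between the two count modes (with the correction that the fast mode skips the node by node access manager check) might look roughly like the sketch below. All names here (`FacetCountMode`, `FacetCounter`, the `canRead` predicate standing in for the access manager, and the map from facet value to matching node paths) are assumptions made for illustration; none of them is an existing Oak API.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Predicate;

/** Hypothetical switch between the two options discussed in the thread. */
enum FacetCountMode {
    EXACT_AUTHORIZED,   // 1) exact and possibly slow: every node is access-checked
    INDEX_UNAUTHORIZED  // 2) fast: raw index counts, no per-node access check
}

final class FacetCounter {
    /**
     * @param pathsByFacet facet value mapped to the paths of the matching nodes (as the index sees them)
     * @param canRead      stand-in for the access manager's read check
     */
    static Map<String, Integer> count(Map<String, List<String>> pathsByFacet,
                                      Predicate<String> canRead,
                                      FacetCountMode mode) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<String>> entry : pathsByFacet.entrySet()) {
            int n = 0;
            if (mode == FacetCountMode.INDEX_UNAUTHORIZED) {
                n = entry.getValue().size();
            } else {
                for (String path : entry.getValue()) {
                    if (canRead.test(path)) {
                        n++;
                    }
                }
            }
            // Facets with no counted entries are omitted entirely.
            if (n > 0) {
                counts.put(entry.getKey(), n);
            }
        }
        return counts;
    }
}
```

In this sketch the unauthorized mode can report facets and counts for nodes the user cannot read, which is exactly the information leakage concern raised elsewhere in the thread; it only returns correct results when the index itself already reflects the authorization model.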
Re: [DISCUSS] supporting faceting in Oak query engine
2014-12-08 8:15 GMT+01:00 Thomas Mueller muel...@adobe.com:

Hi,

I think we should do:

1. conservative approach, do not touch JCR API

select [jcr:path], [facet(jcr:primaryType)] from [nt:base] where contains([text], 'oak');

The column facet(jcr:primaryType) would return the facet data.

I think that's a good approach.

The question is, which rows would return that data. I would prefer a solution where _each_ row returns the data (and not just the first row), because that's a bit easier to use, easier to document, and more closely matches the relational model. If just the first row returns the facet data, then we can't sort the result afterwards (otherwise the facet data ends up in another row, which would be weird).

Sure, I see this point. While the first implementation that Thomas and I discussed offline didn't do so, the current PoC does exactly that (it can return the facets via row.getColumnValue(facet(jcr:primaryType)) for each row).

Another approach is to extend the API (create a new interface, for example OakQuery). The JDBC API (but not the JCR API) has a concept of multiple result sets per query (Statement.getMoreResults). We could build a solution that more closely matches this model. But I don't think it's worth the trouble right now (we could still do that later on if really needed).

I think that, for an end user to leverage facets easily, what you propose would really make things nicer. Of course there's no hurry in defining that, at least until we have a satisfactory facets implementation.

About security, I wonder what the common configurations are. I think we should avoid a complex (but slow, and hard to implement) solution that can solve 100% of all possible _theoretical_ cases, but instead go for a (faster, simpler) solution that covers 99% of all _practical_ cases.
If I think of the simplest use cases, I see:

- a publicly available website where users can search without logging in
- a website where logged-in users can search on some content

Both would require the results and facets to be filtered on the content that a logged-in user or an anonymous user can see. Perhaps we may also have a use case where the website exposes content crawled from the Web (e.g. Google) where there's no filtering on content, maybe just a personalized ranking (but that's a different story that doesn't belong here).

@Michael, Laurie: as for filtering out the counts, as I said I'd prefer not to do that because it's an interesting piece of information we would lose. What we may do is make that inclusion/exclusion configurable, either in the query index definition node or at runtime somehow within the query, depending on the client's needs.

@Laurie: option #5 would mean that we would have query indexes which can index and query only the data a configured user can see. E.g. we have an 'anonymous-lucene' index, being a Lucene index that will only be able to index nodes the user anonymous can see (has the jcr:read privilege on), and that will be used only for queries issued by the user anonymous. However, as I said, I am not sure that's a good idea, because that may not scale (if you want to define 100 users, you would have 100 Lucene indexes dedicated to 100 different users).

Regards,
Tommaso

Regards, Thomas

On 05/12/14 12:13, Tommaso Teofili tommaso.teof...@gmail.com wrote:

Hi all,

I am resurrecting this thread as I've managed to find some time to start having a look at how to support faceting in the Oak query engine. One important thing is that I agree with Ard (and I've seen it like that from the beginning) that since we have Lucene and Solr Oak index implementations we should rely on them for such advanced features [1][2] instead of reinventing the wheel. Under the above assumption the implementation seems quite straightforward.
The not so obvious bits come when getting to:

- exposing facets within the JCR API
- correctly filtering facets depending on authorization / privileges

For the former, here is a quick list of options that came to my mind (originating also from talking to people f2f about this):

1. conservative approach, do not touch JCR API: facets are retrieved as custom columns (String values) of a Row (from QueryResult.getRows()), e.g. row.getValue("facet(jcr:primaryType)").
2. Oak-only approach, do not touch JCR API but provide utilities which can retrieve structured facets from the result, e.g. Iterable<Facet> facets = OakQueryUtils.extractFacets(QueryResult.getRows());
3. not JCR compliant approach, we add methods to the API similarly to what Ard and AlexK proposed
4. adapter pattern, similarly to what is done in Apache Sling's adaptTo, where QueryResult can be adapted to different things and therefore it's more extensible (but less controllable).

Of course other proposals are welcome on this.

For the latter, things seem less simple, as I foresee that we want the facets to be consistent with the result nodes and therefore to be filtered
Re: [DISCUSS] supporting faceting in Oak query engine
Hi,

I would like the counts.

I agree. I guess this feature doesn't make much sense without the counts.

1, 2, and 4 seem like bad ideas
1 undercuts the idea that we'd use lucene/solr to get decent performance.

Sorry, I don't understand... This is just about the API to retrieve the data. It still uses Lucene/Solr (the same as all other options). I'm not sure if you are talking about the performance overhead of converting the facet data to a string and back? This performance overhead is very, very small (I assume not measurable).

Regards,
Thomas
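To make the "facet data to a string and back" step concrete, here is a minimal sketch of what such a round trip could look like when facets travel as a String column value. The `label=count;label=count` encoding and the `FacetCodec` class are purely assumptions for illustration (the thread does not specify a format), and the sketch assumes facet labels contain no `;` character.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical codec for passing facet counts through a String-valued result column. */
final class FacetCodec {
    /** Encodes counts as "label=count;label=count", preserving the map's iteration order. */
    static String encode(Map<String, Integer> counts) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (sb.length() > 0) {
                sb.append(';');
            }
            sb.append(entry.getKey()).append('=').append(entry.getValue());
        }
        return sb.toString();
    }

    /** Parses the encoded form back into an ordered map. */
    static Map<String, Integer> decode(String encoded) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        if (encoded.isEmpty()) {
            return counts;
        }
        for (String part : encoded.split(";")) {
            // Split on the last '=' so labels such as "nt:file" pass through untouched.
            int eq = part.lastIndexOf('=');
            counts.put(part.substring(0, eq), Integer.parseInt(part.substring(eq + 1)));
        }
        return counts;
    }
}
```

Both directions are a single linear pass over a handful of facet labels, which is consistent with the expectation above that the conversion cost is negligible next to the query itself.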
Re: [DISCUSS] supporting faceting in Oak query engine
Hi,

I agree that facets *with* counts are better than without counts, but disagree that they are worthless without counts (see the Amazon link Tommaso posted earlier on this thread). There is value in providing the information that *some* results will appear when a user selects a facet.

The problematic use case for counting facets that I have in mind is when a query returns millions of results. This is problematic when one wants to retrieve the exact size of the result set (taking ACLs into account, obviously). When facets are to be retrieved this will be an even harder problem (meaning when the exact number is to be calculated per facet). As an illustration, consider a digital asset management application that displays mime type as facets. A query could return 1 million images and, say, 10 videos.

Is there a way we could support such scenarios (while still counting results per facet) and have a performant implementation? (I should note that I have not tested how long it takes to retrieve and ACL-check 1 million nodes - maybe my concern is invalid.)

Best regards
Michael

On 09 Dec 2014, at 09:57, Thomas Mueller muel...@adobe.com wrote:

Hi,

I would like the counts.

I agree. I guess this feature doesn't make much sense without the counts.

1, 2, and 4 seem like bad ideas
1 undercuts the idea that we'd use lucene/solr to get decent performance.

Sorry I don't understand... This is just about the API to retrieve the data. It still uses Lucene/Solr (the same as all other options). I'm not sure if you talk about the performance overhead of converting the facet data to a string and back? This performance overhead is very very small (I assume not measurable).

Regards,
Thomas
Re: [DISCUSS] supporting faceting in Oak query engine
On 09 Dec 2014, at 18:10, Michael Marth mma...@adobe.com wrote:

Hi,

I agree that facets *with* counts are better than without counts, but disagree that they are worthless without counts (see the Amazon link Tommaso posted earlier on this thread). There is value in providing the information that *some* results will appear when a user selects a facet.

The use cases problematic case for counting the facets I have in mind are when a query returns millions of results. This is problematic when one wants to retrieve the exact size of the result set (taking ACLs into account, obviously). When facets are to be retrieved this will be an even harder problem (meaning when the exact number is to be calculated per facet). As an illustration consider a digital asset management application that displays mime type as facets. A query could return 1 million images and, say, 10 video.

Is there a way we could support such scenarios (while still counting results per facet) and have a performant implementation? (I should note that I have not tested how long it takes to retrieve and ACL-check 1 million nodes - maybe my concern is invalid)

Yeah, such stuff can easily cause severe slowdowns. So making the count optional, or counting only up to some specified max value, would be nice, but it complicates the API.

regards,
Lukas Kahwe Smith
sm...@pooteeweet.org
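Counting "only up to some specified max value" could be sketched as below: at most `maxChecks` candidate nodes are access-checked, and the result records whether the number is exact or only a lower bound. The `CappedCount` class, its fields and the `canRead` predicate (a stand-in for the ACL check) are hypothetical names chosen for this illustration, not part of any proposed Oak API.

```java
import java.util.Iterator;
import java.util.function.Predicate;

/** Hypothetical result of a facet count that stops after a fixed number of access checks. */
final class CappedCount {
    final int readable;   // readable nodes seen so far (never more than the true count)
    final boolean exact;  // true if every candidate was checked

    private CappedCount(int readable, boolean exact) {
        this.readable = readable;
        this.exact = exact;
    }

    static CappedCount count(Iterator<String> candidates, Predicate<String> canRead, int maxChecks) {
        int checked = 0;
        int readable = 0;
        while (candidates.hasNext() && checked < maxChecks) {
            if (canRead.test(candidates.next())) {
                readable++;
            }
            checked++;
        }
        // If candidates remain, the number is only a lower bound.
        return new CappedCount(readable, !candidates.hasNext());
    }

    /** Display form: the plain number when exact, otherwise the number followed by "+". */
    String label() {
        return exact ? Integer.toString(readable) : readable + "+";
    }
}
```

The API cost mentioned above shows up in the return type: callers get a number plus an exactness flag instead of a plain integer. On the other hand, a capped count never exceeds the number of readable nodes, so it does not reveal anything about nodes the user cannot read.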
Re: [DISCUSS] supporting faceting in Oak query engine
I guess that returning the facets without the counts really weakens the story of facets. Yes, Amazon does it for some searches, but usually it does not. For the use case I have in mind, I would like the counts.

Options 3 or 6 seem like decent avenues to explore. 1, 2, and 4 seem like bad ideas (1 undercuts the idea that we'd use lucene/solr to get decent performance. 2 drops the counts. 4 feels like something we would regret, because of the complexity). I'll admit it: I didn't understand option 5.

Thanks,
Laurie

On 12/8/14 2:19 AM, Michael Marth mma...@adobe.com wrote:

Hi,

About security, I wonder what are the common configurations. I think we should avoid a complex (but slow, and hard to implement) solution that can solve 100% of all possible _theoretical_ cases, but instead go for a (faster, simpler) solution that covers 99% of all _pratical_ cases.

I am not sure if you are hinting towards one of the proposed approaches with that statement. IMO this simplification suggested by Tommaso makes sense:

only if there's at least one item (node) in the (filtered) results which falls under that facet. That would mean that we would not return the counts of the facets, but a facet would be returned if there's at least one item in the results belonging to it

Best regards
Michael
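The simplification quoted above (return a facet without a count as soon as at least one readable item falls under it) could be sketched like this. `FacetPresence`, the facet-to-paths map and the `canRead` predicate are hypothetical names used only for illustration; the point of the sketch is that the access checks for a facet can stop at the first readable node.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Predicate;

/** Hypothetical helper that reports which facets have at least one readable item. */
final class FacetPresence {
    static Set<String> visibleFacets(Map<String, List<String>> pathsByFacet, Predicate<String> canRead) {
        Set<String> visible = new TreeSet<>();
        for (Map.Entry<String, List<String>> entry : pathsByFacet.entrySet()) {
            // anyMatch short-circuits: checking stops at the first readable node.
            if (entry.getValue().stream().anyMatch(canRead)) {
                visible.add(entry.getKey());
            }
        }
        return visible;
    }
}
```

The best case is one access check per facet, but the worst case (a large facet in which the user can read almost nothing) still scans every entry, so this variant would also benefit from an upper limit on the number of checks.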
Re: [DISCUSS] supporting faceting in Oak query engine
Hi, I think we should do: 1. conservative approach, do not touch JCR API select [jcr:path], [facet(jcr:primaryType)] from [nt:base] where contains([text], 'oak'); The column facet(jcr:primaryType) would return the facet data. I think that's a good approach. The question is, which rows would return that data. I would prefer a solution where _each_ row returns the data (and not just the first row), because that's a bit easier to use, easier to document, and more closely matches the relational model. If just the first row returns the facet data, then we can't sort the result afterwards (otherwise the facet data ends up in another row, which would be weird). Another approach is to extend the API (create a new interface, for example OakQuery). The JDBC API (but not the JCR API) has a concept of multiple result sets per query (Statement.getMoreResults). We could build a solution that more closely matches this model. But I don't think it's worth the trouble right now (we could still do that later on if really needed). About security, I wonder what are the common configurations. I think we should avoid a complex (but slow, and hard to implement) solution that can solve 100% of all possible _theoretical_ cases, but instead go for a (faster, simpler) solution that covers 99% of all _practical_ cases. Regards, Thomas On 05/12/14 12:13, Tommaso Teofili tommaso.teof...@gmail.com wrote: Hi all, I am resurrecting this thread as I've managed to find some time to start having a look at how to support faceting in Oak query engine. One important thing is that I agree with Ard (and I've seen it like that from the beginning) that since we have Lucene and Solr Oak index implementations we should rely on them for such advanced features [1][2] instead of reinventing the wheel. With the above assumption the implementation seems quite straightforward. 
The not so obvious bits come when getting to: - exposing facets within the JCR API - correctly filtering facets depending on authorization / privileges For the former here is a quick list of options that came to my mind (originated also when talking to people f2f about this): 1. conservative approach, do not touch JCR API: facets are retrieved as custom columns (String values) of a Row (from QueryResult.getRows()), e.g. row.getValue("facet(jcr:primaryType)"). 2. Oak-only approach, do not touch JCR API but provide utilities which can retrieve structured facets from the result, e.g. Iterable<Facet> facets = OakQueryUtils.extractFacets(QueryResult.getRows()); 3. not JCR compliant approach, we add methods to the API similarly to what Ard and AlexK proposed 4. adapter pattern, similarly to what is done in Apache Sling's adaptTo, where QueryResult can be adapted to different things and therefore it's more extensible (but less controllable). Of course other proposals are welcome on this. For the latter the things seem less simple as I foresee that we want the facets to be consistent with the result nodes and therefore to be filtered according to the privileges of the user having issued the query. Here are the options I could think of so far, even though none looks satisfactory to me yet: 1. retrieve facets and then filter them afterwards seems to have an inherent issue because the facets do not include information about the documents (nodes) which generated them, therefore retrieving them unfiltered (as the index doesn't have information about ACLs) as they are, e.g. 
facet on jcr:primaryType: jcr:primaryType : { nt:unstructured : 100, nt:file : 20, oak:Unstructured : 10 } would require to: iterate over the results and filter counts as you iterate or do N further queries to filter the counts but then it would be useless to have the facets being returned from the index as we'd be retrieving them ourselves to do the ACL checks OR other such dummy methods. 2. 
retrieve the facets unfiltered from the index and then return them in the filtered results only if there's at least one item (node) in the (filtered) results which falls under that facet. That would mean that we would not return the counts of the facets, but a facet would be returned if there's at least one item in the results belonging to it. While it sounds not too nice (and a pity as we're losing some information we have along the way) Amazon does exactly that (see the 'Show results for' column on the left at [3]) :-) 3. use a slightly different mechanism for returning facets, called result grouping (or field collapsing) in Solr [5], in which results are returned grouped (and counted) by a certain field. The example of point 1 would look like: grouped:{ jcr:primaryType:{ matches: 130, groups:[{ groupValue:nt:unstructured, doclist:{numFound:100,start:0,docs:[ { path:/content/a/b }, ... ] }}, { groupValue:nt:file, doclist:{numFound:20,start:0,docs:[ { path:/content/d/e
Re: [DISCUSS] supporting faceting in Oak query engine
Hi all, I am resurrecting this thread as I've managed to find some time to start having a look at how to support faceting in Oak query engine. One important thing is that I agree with Ard (and I've seen it like that from the beginning) that since we have Lucene and Solr Oak index implementations we should rely on them for such advanced features [1][2] instead of reinventing the wheel. With the above assumption the implementation seems quite straightforward. The not so obvious bits come when getting to: - exposing facets within the JCR API - correctly filtering facets depending on authorization / privileges For the former here is a quick list of options that came to my mind (originated also when talking to people f2f about this): 1. conservative approach, do not touch JCR API: facets are retrieved as custom columns (String values) of a Row (from QueryResult.getRows()), e.g. row.getValue("facet(jcr:primaryType)"). 2. Oak-only approach, do not touch JCR API but provide utilities which can retrieve structured facets from the result, e.g. Iterable<Facet> facets = OakQueryUtils.extractFacets(QueryResult.getRows()); 3. not JCR compliant approach, we add methods to the API similarly to what Ard and AlexK proposed 4. adapter pattern, similarly to what is done in Apache Sling's adaptTo, where QueryResult can be adapted to different things and therefore it's more extensible (but less controllable). Of course other proposals are welcome on this. For the latter the things seem less simple as I foresee that we want the facets to be consistent with the result nodes and therefore to be filtered according to the privileges of the user having issued the query. Here are the options I could think of so far, even though none looks satisfactory to me yet: 1. 
retrieve facets and then filter them afterwards seems to have an inherent issue because the facets do not include information about the documents (nodes) which generated them, therefore retrieving them unfiltered (as the index doesn't have information about ACLs) as they are, e.g. facet on jcr:primaryType: jcr:primaryType : { nt:unstructured : 100, nt:file : 20, oak:Unstructured : 10 } would require to: iterate over the results and filter counts as you iterate or do N further queries to filter the counts but then it would be useless to have the facets being returned from the index as we'd be retrieving them ourselves to do the ACL checks OR other such dummy methods. 2. retrieve the facets unfiltered from the index and then return them in the filtered results only if there's at least one item (node) in the (filtered) results which falls under that facet. That would mean that we would not return the counts of the facets, but a facet would be returned if there's at least one item in the results belonging to it. While it sounds not too nice (and a pity as we're losing some information we have along the way) Amazon does exactly that (see the 'Show results for' column on the left at [3]) :-) 3. use a slightly different mechanism for returning facets, called result grouping (or field collapsing) in Solr [5], in which results are returned grouped (and counted) by a certain field. The example of point 1 would look like: grouped:{ jcr:primaryType:{ matches: 130, groups:[{ groupValue:nt:unstructured, doclist:{numFound:100,start:0,docs:[ { path:/content/a/b }, ... ] }}, { groupValue:nt:file, doclist:{numFound:20,start:0,docs:[ { path:/content/d/e }, ... ] }}, ... there the facets would also contain (some or all of) the docs (nodes) belonging to each group and therefore filtering the facets afterwards could be done without having to retrieve the paths of the nodes falling under each facet. 4. 
move towards the 'covering index' concept [5] Thomas mentioned in [6] and incorporate the ACLs in the index so that no further filtering has to be done once the underlying query index has returned its results. However this comes with a non trivial impact with regards to a) load of the indexing on the repo (each time some ACL changes a bunch of index updates happen) b) complexity in encoding ACLs in the indexed documents c) complexity in encoding the ACL check in the index-specific queries. Still this is probably something we may evaluate regardless of facets in the future as the lazy ACL check approach we have has, IIUTC, the following issue: userA searching for 'jcr:title=foo', the query engine selecting the Lucene property index which returns 100 docs, userA being only able to see 2 of them because of its ACLs, in this case we have wasted (approximately) 80% of the Lucene effort to match and return the documents. However this is most probably overkill for now... 5. another probably crazy idea is user filtered indexes, meaning that the NodeStates passed to such IndexEditors would be filtered according to what
Re: [DISCUSS] supporting faceting in Oak query engine
On Sat, Aug 30, 2014 at 4:25 AM, Alexander Klimetschek aklim...@adobe.com wrote: ...you can leverage some kind of caching though. In practice, if you have a public site with content that does not change permanently, the facet values are pretty much stable, and authorization shouldn't cost much Yes, I think it's very rare to require facets to be immediately up to date after content changes, updating them (or the related caches) asynchronously with low priority should be good enough for the large majority of cases. So maybe the facet indexes and caches can be handled differently than primary queries, with more lenient update latency requirements. -Bertrand
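One way to picture "more lenient update latency" for facet caches is the sketch below: reads always return the cached counts, content changes only mark a facet as dirty, and a background task recomputes dirty facets when it gets around to it. All names (`LenientFacetCache`, `markDirty`, `refreshDirty`, the `loader` function) are hypothetical and nothing here is existing Oak behaviour.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch, not Oak API: facet counts that are allowed to lag behind content changes.
final class LenientFacetCache {

    private final Map<String, Map<String, Integer>> countsByFacet = new ConcurrentHashMap<>();
    private final Set<String> dirty = ConcurrentHashMap.newKeySet();
    private final Function<String, Map<String, Integer>> loader;

    /** @param loader computes the (possibly expensive) counts for one facet name */
    LenientFacetCache(Function<String, Map<String, Integer>> loader) {
        this.loader = loader;
    }

    /** Returns cached counts, which may be stale; loads them on first access only. */
    Map<String, Integer> get(String facet) {
        return countsByFacet.computeIfAbsent(facet, loader);
    }

    /** Cheap call for the content-change path: no recomputation happens here. */
    void markDirty(String facet) {
        dirty.add(facet);
    }

    /** Meant to be called from a low-priority background task; returns how many facets were refreshed. */
    int refreshDirty() {
        int refreshed = 0;
        for (String facet : dirty) {
            if (dirty.remove(facet)) {
                countsByFacet.put(facet, loader.apply(facet));
                refreshed++;
            }
        }
        return refreshed;
    }
}
```

A scheduler (for example a single-threaded `ScheduledExecutorService`) could call `refreshDirty()` periodically. How authorization interacts with such a cache is left open here, as it is in the thread.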
Re: [DISCUSS] supporting faceting in Oak query engine
Hey Alex, On Sat, Aug 30, 2014 at 4:25 AM, Alexander Klimetschek aklim...@adobe.com wrote: On 29.08.2014, at 03:10, Ard Schrijvers a.schrijv...@onehippo.com wrote: 1) When exposing faceting from Jackrabbit, we wouldn't use virtual layers any more to expose them over pure JCR spec API's. Instead, we would extend the JCR QueryResult to expose, next to getRows/getNodes/etc, methods like public Map<String, Integer> getFacetValues(final String facet) { return result.getFacetValues(facet); } public QueryResult drilldown(final FacetValue facetValue) { // return current query result drilled down for facet value return ... } We actually have a similar API in our CQ/AEM product: Query = represents a query [1] SearchResult result = query.getResult(); Map<String, Facet> facets = result.getFacets(); A facet is a list of Buckets [2] - same as FacetValue above, I assume - an abstraction over different values. You could have distinctive values (e.g. red, green, blue), but also ranges (last year, last month etc.). Each bucket has a count, i.e. the number of times it occurs in the current result. Then on Query you have a method Query refine(Bucket bucket) which is the same as the drilldown above. So in the end it looks pretty much the same, and seems to be a good way to represent this as API. Doesn't say much about the implementation yet, though :) It looks very much the same, and I must admit that during typing my mail I didn't pay too much attention to things like how to name something (I reckon that #refine is a much nicer name than the drillDown I wrote :-) 2) Authorized counts: for faceting, it doesn't make sense to expose that there are 314 results if you can only read 54 of them. Accounting for authorization through access manager can be way too slow. ... 3) If you support faceting through Oak, will that be competitive enough to what Solr and Elasticsearch offer? 
Customers these days have some expectations on search result quality and faceting capabilities, performance included. ... So, my take would be to invest time in easy integration with solr/elasticsearch and focus in Oak on the parts (hierarchy, authorization, merging, versioning) that aren't covered by already existing frameworks. Perhaps provide an extended JCR API as described in (1) which under the hood can delegate to a solr or es java client. In the end, you'll still end up having the authorized counts issue, but if you make the integration pluggable enough, it might be possible to leverage domain specific solutions to this (solr/es doesn't do anything with authorization either, it is a tough nut to crack) Good points. When facets are used, the worst case (showing facets for all your content) might actually be the very first thing you see, when something like a product search/browse page is shown, before any actual search by the user is done. Optimizing for performance right from the start is a must, I agree. What I can imagine though, is if you can leverage some kind of caching though. In practice, if you have a public site with content that does not change permanently, the facet values are pretty much stable, and authorization shouldn't cost much. Certainly there are many use cases where you can cache a lot, or for example have a public site user that has read access to an entire content tree. It becomes however much more difficult when you want to for example expose faceted structure of documents to an editor in a cms environment, where the editor has read access to only 1% of the documents. If at the same time, her initial query without authorization results in, say, 10 million hits, then you'll have to authorize all of them to get correct counts. 
The only way we could make this perform with Hippo CMS against jackrabbit was by translating our authorization model directly to lucene queries and keeping cached (authorized) bitsets (slightly different in newer lucene versions) in memory for a user, see [1]. The difficulty was that even executing the authorization query (to AND with normal query) became slow because of very large queries, but fortunately due to the jackrabbit 2 index implementation, we could keep a cached bitset per indexreader, see [2]. Unfortunately, this solution can only be done for specific authorization models (which can be mapped to lucene queries) and might not be generic enough for oak. Anyway, apart from performance / authorization, I doubt whether oak will be able to keep up with what can be leveraged through ES or Solr. Regards Ard [1] http://svn.onehippo.org/repos/hippo/hippo-cms7/repository/trunk/engine/src/main/java/org/hippoecm/repository/query/lucene/AuthorizationQuery.java [2] http://www.onehippo.com/en/resources/blogs/2013/01/cms-7.8-nailed-down-authorization-combined-with-searches.html [1]
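The principle behind the cached authorization bitsets can be illustrated with plain `java.util.BitSet`. This is a simplified sketch under assumptions and not the Hippo or Lucene implementation: document ids are plain integers, `BitSetFacetCounts` and `authorizedCounts` are hypothetical names, and the `authorized` bitset is assumed to have been computed once from the authorization query and cached per user.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch using java.util.BitSet, not Lucene classes.
final class BitSetFacetCounts {

    /**
     * Computes authorized facet counts as cardinality(queryHits AND authorized AND docsForValue).
     *
     * @param queryHits        doc ids matching the user's query
     * @param authorized       doc ids the user may read (assumed to be cached per user)
     * @param docsByFacetValue doc ids carrying each facet value
     * @return facet value mapped to the number of readable hits; values without readable hits are omitted
     */
    static Map<String, Integer> authorizedCounts(BitSet queryHits,
                                                 BitSet authorized,
                                                 Map<String, BitSet> docsByFacetValue) {
        // Clone first: BitSet.and mutates the receiver, and the cached bitsets must stay intact.
        BitSet visibleHits = (BitSet) queryHits.clone();
        visibleHits.and(authorized);

        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, BitSet> entry : docsByFacetValue.entrySet()) {
            BitSet matching = (BitSet) visibleHits.clone();
            matching.and(entry.getValue());
            if (!matching.isEmpty()) {
                counts.put(entry.getKey(), matching.cardinality());
            }
        }
        return counts;
    }
}
```

The counts are exact and no per-node access manager call is needed at query time, but only because the authorization model was translated into a set of doc ids up front. That is the restriction described above: it works for models that can be mapped to index queries and not in general.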
Re: [DISCUSS] supporting faceting in Oak query engine
Hello, On Mon, Aug 25, 2014 at 7:02 PM, Lukas Smith sm...@pooteeweet.org wrote: Aloha, you should definitely talk to the HippoCMS developers. They forked Jackrabbit 2.x to add faceting as virtual nodes. They ran into some performance issues but I am sure they still have valuable feedback on this. Well, performance actually wasn't the biggest hurdle: exposing and integrating virtual nodes was quite a bit tougher. Indeed I think I might have quite some feedback, but honestly, I am also these days full of doubts what the best approach will be. I'll try to keep it short: 1) When exposing faceting from Jackrabbit, we wouldn't use virtual layers any more to expose them over pure JCR spec API's. Instead, we would extend the JCR QueryResult to expose, next to getRows/getNodes/etc, methods like public Map<String, Integer> getFacetValues(final String facet) { return result.getFacetValues(facet); } public QueryResult drilldown(final FacetValue facetValue) { // return current query result drilled down for facet value return ... } 2) Authorized counts: for faceting, it doesn't make sense to expose that there are 314 results if you can only read 54 of them. Accounting for authorization through access manager can be way too slow. The alternatives are to not show authorized counts, or try to translate the authorization model to a lucene query which is in general not possible unless you restrict your authorization model severely (which results in a domain specific solution unusable for JR) 3) If you support faceting through Oak, will that be competitive enough to what Solr and Elasticsearch offer? Customers these days have some expectations on search result quality and faceting capabilities, performance included. 
Oak's faceting support will be compared to dedicated search servers and is quite unlikely to be nearly as good and to keep up with what is being built: Aggregations is the new buzz, which is a very cool superset of faceting. 
You really don't wanna have to leverage that next from Oak. So, my take would be to invest time in easy integration with solr/elasticsearch and focus in Oak on the parts (hierarchy, authorization, merging, versioning) that aren't covered by already existing frameworks. Perhaps provide an extended JCR API as described in (1) which under the hood can delegate to a solr or es java client. In the end, you'll still end up having the authorized counts issue, but if you make the integration pluggable enough, it might be possible to leverage domain specific solutions to this (solr/es doesn't do anything with authorization either, it is a tough nut to crack) Regards Ard regards, Lukas Kahwe Smith On 25 Aug 2014, at 18:43, Laurie Byrum lby...@adobe.com wrote: Hi Tommaso, I am happy to see this thread! Questions: Do you expect to want to support hierarchical or pivoted facets soonish? If so, does that influence this decision? Do you know how ACLs will come into play with your facet implementation? If so, does that influence this decision? :-) Thanks! Laurie On 8/25/14 7:08 AM, Tommaso Teofili tommaso.teof...@gmail.com wrote: Hi all, since this has been asked every now and then [1] and since I think it's a pretty useful and common feature for search engine nowadays I'd like to discuss introduction of facets [2] for the Oak query engine. Pros: having facets in search results usually helps filtering (drill down) the results before browsing all of them, so the main usage would be for client code. Impact: probably change / addition in both the JCR and Oak APIs to support returning other than just nodes (a NodeIterator and a Cursor respectively). Right now a couple of ideas on how we could do that come to my mind, both based on the approach of having an Oak index for them: 1. a (multivalued) property index for facets, meaning we would store the facets in the repository, so that we would run a query against it to have the facets of an originating query. 2. 
a dedicated QueryIndex implementation, eventually leveraging Lucene faceting capabilities, which could use the Lucene index we already have, together with a sidecar index [3]. What do you think? Regards, Tommaso [1] : http://markmail.org/search/?q=oak%20faceting#query:oak%20faceting%20list%3Aorg.apache.jackrabbit.oak-dev+page:1+state:facets [2] : http://en.wikipedia.org/wiki/Faceted_search [3] : http://lucene.apache.org/core/4_0_0/facet/org/apache/lucene/facet/doc-files/userguide.html
Re: [DISCUSS] supporting faceting in Oak query engine
On 29.08.2014, at 03:10, Ard Schrijvers a.schrijv...@onehippo.com wrote: 1) When exposing faceting from Jackrabbit, we wouldn't use virtual layers any more to expose them over pure JCR spec API's. Instead, we would extend the JCR QueryResult to expose, next to getRows/getNodes/etc, methods like public Map<String, Integer> getFacetValues(final String facet) { return result.getFacetValues(facet); } public QueryResult drilldown(final FacetValue facetValue) { // return current query result drilled down for facet value return ... } We actually have a similar API in our CQ/AEM product: Query = represents a query [1] SearchResult result = query.getResult(); Map<String, Facet> facets = result.getFacets(); A facet is a list of Buckets [2] - same as FacetValue above, I assume - an abstraction over different values. You could have distinctive values (e.g. red, green, blue), but also ranges (last year, last month etc.). Each bucket has a count, i.e. the number of times it occurs in the current result. Then on Query you have a method Query refine(Bucket bucket) which is the same as the drilldown above. So in the end it looks pretty much the same, and seems to be a good way to represent this as API. Doesn't say much about the implementation yet, though :) 2) Authorized counts: for faceting, it doesn't make sense to expose that there are 314 results if you can only read 54 of them. Accounting for authorization through access manager can be way too slow. ... 3) If you support faceting through Oak, will that be competitive enough to what Solr and Elasticsearch offer? Customers these days have some expectations on search result quality and faceting capabilities, performance included. ... So, my take would be to invest time in easy integration with solr/elasticsearch and focus in Oak on the parts (hierarchy, authorization, merging, versioning) that aren't covered by already existing frameworks. 
Perhaps provide an extended JCR API as described in (1) which under the hood can delegate to a solr or es java client. In the end, you'll still end up having the authorized counts issue, but if you make the integration pluggable enough, it might be possible to leverage domain specific solutions to this (solr/es doesn't do anything with authorization either, it is a tough nut to crack) Good points. When facets are used, the worst case (showing facets for all your content) might actually be the very first thing you see, when something like a product search/browse page is shown, before any actual search by the user is done. Optimizing for performance right from the start is a must, I agree. What I can imagine though, is if you can leverage some kind of caching though. In practice, if you have a public site with content that does not change permanently, the facet values are pretty much stable, and authorization shouldn't cost much. [1] http://docs.adobe.com/docs/en/aem/6-0/develop/ref/javadoc/com/day/cq/search/Query.html [2] http://docs.adobe.com/docs/en/aem/6-0/develop/ref/javadoc/com/day/cq/search/facets/Bucket.html Cheers, Alex
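The facet / bucket / refine shape discussed in this mail can be sketched independently of any product API. The code below is only an illustration: `FacetSketch`, `Bucket`, `count` and `refine` are hypothetical names, they are not the CQ/AEM classes linked above, and the "result" is just an in-memory list.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch of the facet/bucket/refine idea, not the CQ/AEM or JCR API.
final class FacetSketch {

    /** A bucket abstracts over a distinct value or a range by means of a predicate. */
    static final class Bucket<T> {
        final String label;
        final Predicate<T> matches;

        Bucket(String label, Predicate<T> matches) {
            this.label = label;
            this.matches = matches;
        }
    }

    /** Counts how many items of the current result fall into each bucket (buckets may overlap). */
    static <T> Map<String, Integer> count(List<T> result, List<Bucket<T>> buckets) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Bucket<T> bucket : buckets) {
            int n = 0;
            for (T item : result) {
                if (bucket.matches.test(item)) {
                    n++;
                }
            }
            counts.put(bucket.label, n);
        }
        return counts;
    }

    /** The refine / drilldown step: keeps only the items that fall into the chosen bucket. */
    static <T> List<T> refine(List<T> result, Bucket<T> bucket) {
        List<T> refined = new ArrayList<>();
        for (T item : result) {
            if (bucket.matches.test(item)) {
                refined.add(item);
            }
        }
        return refined;
    }
}
```

Modelling a bucket as a predicate is one way to cover both distinct values and ranges such as "last month" and "last year" with a single type. A real implementation would push both the counting and the refinement down to the index instead of iterating in memory.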
Re: [DISCUSS] supporting faceting in Oak query engine
Hi Laurie, 2014-08-25 18:43 GMT+02:00 Laurie Byrum lby...@adobe.com: Hi Tommaso, I am happy to see this thread! ;-) Questions: Do you expect to want to support hierarchical or pivoted facets soonish? I would say 'why not' if we have a valid use case. If so, does that influence this decision? I think so, especially it would influence the way that may be implemented. Do you know how ACLs will come into play with your facet implementation? not yet, I think that's one of the open points (e.g. Lukas mentioned that HippoCMS did use 'virtual nodes' for them) we should take care of; each 'term' in the facet should be properly checked, but of course doing this kind of check at that fine grain would be costly so we need to come up with a solution which is both correct from the security point of view and performant. If so, does that influence this decision? :-) yes, I think so :) Any suggestions and / or feedback would be highly welcome, especially from potential users of this feature so that we properly tackle your requirements (if any). Thanks and regards, Tommaso Thanks! Laurie On 8/25/14 7:08 AM, Tommaso Teofili tommaso.teof...@gmail.com wrote: Hi all, since this has been asked every now and then [1] and since I think it's a pretty useful and common feature for search engine nowadays I'd like to discuss introduction of facets [2] for the Oak query engine. Pros: having facets in search results usually helps filtering (drill down) the results before browsing all of them, so the main usage would be for client code. Impact: probably change / addition in both the JCR and Oak APIs to support returning other than just nodes (a NodeIterator and a Cursor respectively). Right now a couple of ideas on how we could do that come to my mind, both based on the approach of having an Oak index for them: 1. a (multivalued) property index for facets, meaning we would store the facets in the repository, so that we would run a query against it to have the facets of an originating query. 2. 
a dedicated QueryIndex implementation, eventually leveraging Lucene faceting capabilities, which could use the Lucene index we already have, together with a sidecar index [3]. What do you think? Regards, Tommaso [1] : http://markmail.org/search/?q=oak%20faceting#query:oak%20faceting%20list%3Aorg.apache.jackrabbit.oak-dev+page:1+state:facets [2] : http://en.wikipedia.org/wiki/Faceted_search [3] : http://lucene.apache.org/core/4_0_0/facet/org/apache/lucene/facet/doc-files/userguide.html
Re: [DISCUSS] supporting faceting in Oak query engine
2014-08-25 19:02 GMT+02:00 Lukas Smith sm...@pooteeweet.org: Aloha, Aloha! you should definitely talk to the HippoCMS developers. They forked Jackrabbit 2.x to add facetting as virtual nodes. They ran into some performance issues but I am sure they still have value-able feedback on this. Cool, thanks for letting us know, if you or any other (from Hippo) would like to give some more insight on pros and cons of such an approach that'd be very good. Regards, Tommaso regards, Lukas Kahwe Smith On 25 Aug 2014, at 18:43, Laurie Byrum lby...@adobe.com wrote: Hi Tommaso, I am happy to see this thread! Questions: Do you expect to want to support hierarchical or pivoted facets soonish? If so, does that influence this decision? Do you know how ACLs will come into play with your facet implementation? If so, does that influence this decision? :-) Thanks! Laurie On 8/25/14 7:08 AM, Tommaso Teofili tommaso.teof...@gmail.com wrote: Hi all, since this has been asked every now and then [1] and since I think it's a pretty useful and common feature for search engine nowadays I'd like to discuss introduction of facets [2] for the Oak query engine. Pros: having facets in search results usually helps filtering (drill down) the results before browsing all of them, so the main usage would be for client code. Impact: probably change / addition in both the JCR and Oak APIs to support returning other than just nodes (a NodeIterator and a Cursor respectively). Right now a couple of ideas on how we could do that come to my mind, both based on the approach of having an Oak index for them: 1. a (multivalued) property index for facets, meaning we would store the facets in the repository, so that we would run a query against it to have the facets of an originating query. 2. a dedicated QueryIndex implementation, eventually leveraging Lucene faceting capabilities, which could use the Lucene index we already have, together with a sidecar index [3]. What do you think? 
Regards, Tommaso [1] : http://markmail.org/search/?q=oak%20faceting#query:oak%20faceting%20list%3Aorg.apache.jackrabbit.oak-dev+page:1+state:facets [2] : http://en.wikipedia.org/wiki/Faceted_search [3] : http://lucene.apache.org/core/4_0_0/facet/org/apache/lucene/facet/doc-files/userguide.html
Re: [DISCUSS] supporting faceting in Oak query engine
This looks useful Tommaso. With OAK-2005 we should be able to support multiple LuceneIndexes and manage them easily. If we can abstract all this out and just expose the facet information as virtual node that would simplify the stuff for end users. Probably we can have a read only NodeStore impl to expose the faceted data bound to a system path. Otherwise we would need to expose the Lucene API and OakDirectory Chetan Mehrotra On Tue, Aug 26, 2014 at 1:28 PM, Tommaso Teofili tommaso.teof...@gmail.com wrote: 2014-08-25 19:02 GMT+02:00 Lukas Smith sm...@pooteeweet.org: Aloha, Aloha! you should definitely talk to the HippoCMS developers. They forked Jackrabbit 2.x to add facetting as virtual nodes. They ran into some performance issues but I am sure they still have value-able feedback on this. Cool, thanks for letting us know, if you or any other (from Hippo) would like to give some more insight on pros and cons of such an approach that'd be very good. Regards, Tommaso regards, Lukas Kahwe Smith On 25 Aug 2014, at 18:43, Laurie Byrum lby...@adobe.com wrote: Hi Tommaso, I am happy to see this thread! Questions: Do you expect to want to support hierarchical or pivoted facets soonish? If so, does that influence this decision? Do you know how ACLs will come into play with your facet implementation? If so, does that influence this decision? :-) Thanks! Laurie On 8/25/14 7:08 AM, Tommaso Teofili tommaso.teof...@gmail.com wrote: Hi all, since this has been asked every now and then [1] and since I think it's a pretty useful and common feature for search engine nowadays I'd like to discuss introduction of facets [2] for the Oak query engine. Pros: having facets in search results usually helps filtering (drill down) the results before browsing all of them, so the main usage would be for client code. Impact: probably change / addition in both the JCR and Oak APIs to support returning other than just nodes (a NodeIterator and a Cursor respectively). 
Right now a couple of ideas on how we could do that come to my mind, both based on the approach of having an Oak index for them: 1. a (multivalued) property index for facets, meaning we would store the facets in the repository, so that we would run a query against it to have the facets of an originating query. 2. a dedicated QueryIndex implementation, eventually leveraging Lucene faceting capabilities, which could use the Lucene index we already have, together with a sidecar index [3]. What do you think? Regards, Tommaso [1] : http://markmail.org/search/?q=oak%20faceting#query:oak%20faceting%20list%3Aorg.apache.jackrabbit.oak-dev+page:1+state:facets [2] : http://en.wikipedia.org/wiki/Faceted_search [3] : http://lucene.apache.org/core/4_0_0/facet/org/apache/lucene/facet/doc-files/userguide.html
Re: [DISCUSS] supporting faceting in Oak query engine
Hi Tommaso, I am happy to see this thread! Questions: Do you expect to want to support hierarchical or pivoted facets soonish? If so, does that influence this decision? Do you know how ACLs will come into play with your facet implementation? If so, does that influence this decision? :-) Thanks! Laurie On 8/25/14 7:08 AM, Tommaso Teofili tommaso.teof...@gmail.com wrote: Hi all, since this has been asked every now and then [1] and since I think it's a pretty useful and common feature for search engine nowadays I'd like to discuss introduction of facets [2] for the Oak query engine. Pros: having facets in search results usually helps filtering (drill down) the results before browsing all of them, so the main usage would be for client code. Impact: probably change / addition in both the JCR and Oak APIs to support returning other than just nodes (a NodeIterator and a Cursor respectively). Right now a couple of ideas on how we could do that come to my mind, both based on the approach of having an Oak index for them: 1. a (multivalued) property index for facets, meaning we would store the facets in the repository, so that we would run a query against it to have the facets of an originating query. 2. a dedicated QueryIndex implementation, eventually leveraging Lucene faceting capabilities, which could use the Lucene index we already have, together with a sidecar index [3]. What do you think? Regards, Tommaso [1] : http://markmail.org/search/?q=oak%20faceting#query:oak%20faceting%20list%3 Aorg.apache.jackrabbit.oak-dev+page:1+state:facets [2] : http://en.wikipedia.org/wiki/Faceted_search [3] : http://lucene.apache.org/core/4_0_0/facet/org/apache/lucene/facet/doc-file s/userguide.html
Re: [DISCUSS] supporting faceting in Oak query engine
Aloha,

you should definitely talk to the HippoCMS developers. They forked Jackrabbit 2.x to add faceting as virtual nodes. They ran into some performance issues but I am sure they still have valuable feedback on this.

regards,
Lukas Kahwe Smith

On 25 Aug 2014, at 18:43, Laurie Byrum lby...@adobe.com wrote:

Hi Tommaso,

I am happy to see this thread! Questions:

Do you expect to want to support hierarchical or pivoted facets soonish? If so, does that influence this decision?

Do you know how ACLs will come into play with your facet implementation? If so, does that influence this decision? :-)

Thanks!
Laurie

On 8/25/14 7:08 AM, Tommaso Teofili tommaso.teof...@gmail.com wrote:

Hi all,

since this has been asked every now and then [1] and since I think it's a pretty useful and common feature for search engines nowadays I'd like to discuss the introduction of facets [2] for the Oak query engine.

Pros: having facets in search results usually helps filtering (drill down) the results before browsing all of them, so the main usage would be for client code.

Impact: probably change / addition in both the JCR and Oak APIs to support returning other than just nodes (a NodeIterator and a Cursor respectively).

Right now a couple of ideas on how we could do that come to my mind, both based on the approach of having an Oak index for them:

1. a (multivalued) property index for facets, meaning we would store the facets in the repository, so that we would run a query against it to have the facets of an originating query.
2. a dedicated QueryIndex implementation, eventually leveraging Lucene faceting capabilities, which could use the Lucene index we already have, together with a sidecar index [3].

What do you think?
Regards,
Tommaso

[1] : http://markmail.org/search/?q=oak%20faceting#query:oak%20faceting%20list%3Aorg.apache.jackrabbit.oak-dev+page:1+state:facets
[2] : http://en.wikipedia.org/wiki/Faceted_search
[3] : http://lucene.apache.org/core/4_0_0/facet/org/apache/lucene/facet/doc-files/userguide.html