Re: [DISCUSS] supporting faceting in Oak query engine

2014-12-12 Thread Michael Marth

Hi,

Davide’s proposal (let users specify maximum number of entries per facet) is 
basically a generalisation of my proposal to return a facet if there is more 
than 1 entry in the facet. I think we can try either, but we might want to test 
the performance on cases with large result sets where only few results are 
readable by the user.
AFAIR Amit and Davide have been working on a “micro scalability test framework” 
(measuring how queries scale with content). We could maybe add these tests 
there.

On Ard’s suggestion “possibly incorrect, fast counts”: I think this is only 
feasible if the “incorrect” count is guaranteed never to exceed the exact count. 
Otherwise facets would leak information, as users could learn about nodes 
they otherwise cannot read.
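
A minimal sketch of the capped approach discussed in this thread (access-check at most X hits, then report what was counted as a lower bound). All names here (CappedFacetCounter, Hit, canRead, maxChecks) are hypothetical and not part of Oak; the access check is a plain predicate standing in for the real access manager.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical sketch, not Oak code: lower-bound facet counts with a cap on access checks. */
final class CappedFacetCounter {

    /** One index hit: the node path and its facet value (e.g. a mime type). */
    static final class Hit {
        final String path;
        final String facetValue;

        Hit(String path, String facetValue) {
            this.path = path;
            this.facetValue = facetValue;
        }
    }

    /** Counts of readable hits per facet value, plus a flag set when the cap cut the scan short. */
    static final class Result {
        final Map<String, Integer> counts = new LinkedHashMap<>();
        boolean truncated;
    }

    /**
     * Access-checks at most maxChecks hits. Only readable hits are counted,
     * so a reported count can never exceed the exact authorized count.
     */
    static Result count(List<Hit> hits, Predicate<String> canRead, int maxChecks) {
        Result result = new Result();
        int checked = 0;
        for (Hit hit : hits) {
            if (checked == maxChecks) {
                result.truncated = true; // hits remain that were never checked
                break;
            }
            checked++;
            if (canRead.test(hit.path)) {
                result.counts.merge(hit.facetValue, 1, Integer::sum);
            }
        }
        return result;
    }
}
```

A client could render a truncated result as "at least N". Note that the truncated flag only says that more index hits exist; it does not say whether the user can read them, which is itself a small hint about hidden content.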

Cheers
Michael


On 10 Dec 2014, at 11:12, Tommaso Teofili tommaso.teof...@gmail.com wrote:

 2014-12-10 10:17 GMT+01:00 Ard Schrijvers a.schrijv...@onehippo.com:
 
 On Wed, Dec 10, 2014 at 9:32 AM, Davide Giannella dav...@apache.org
 wrote:
 On 09/12/2014 17:10, Michael Marth wrote:
 ...
 
 The use cases problematic case for counting the facets I have in mind
 are when a query returns millions of results. This is problematic when one
 wants to retrieve the exact size of the result set (taking ACLs into
 account, obviously). When facets are to be retrieved this will be an even
 harder problem (meaning when the exact number is to be calculated per
 facet).
 As an illustration consider a digital asset management application that
 displays mime type as facets. A query could return 1 million images and,
 say, 10 video.
 
 Is there a way we could support such scenarios (while still counting
 results per facet) and have a performant implementation?
 
 We can opt for ACL-Checking/Parsing at most X (let's say 1000) nodes. If
 we're done within it, then we can output the actual number. In case
 after 1000 nodes checked we still have some left we can leave the number
 either empty or with something like many, +, or any other fancy way
 if we want.
 
 In the end is the same approach taken by Amazon (as Tommaso already
 pointed) or for example google. If you run a search, their facets
 (Searches related to...) are never with results.
 
 I don't think Amazon and Google have customers that can demand them to
 show correct facet counts...our customers typically do :).
 
 
 I see, however something along the lines of what Davide was proposing
 doesn't sound too bad to me even for such use cases (but I may be wrong).
 
 
 My take on
 on this would be to have a configurable option between
 
 1) exact and possibly slow counts
 2) unauthorized, possibly incorrect, fast counts
 
 Obviously, the second just uses the faceted navigation counts from the
 backing search implementation (with node by node access manager
 check), whether it is the internal lucene index, solr or Elastic
 Search. If you opt for the second option, then, depending on your
 authorization model you can get fast exact authorized counts as well :
 When the authorization model can be translated into a search query /
 filter that is AND-ed with every normal search. For ES this is briefly
 written at [1]. Most likely the filter is internally cached so even
 for very large authorization queries (like we have at Hippo because of
 fine grained ACL model) it will just perform. Obviously it depends
 quite heavily on your authorization model whether it can be translated
 to a query. If  it relies on an external authorization check or has
 many hierarchical constraints, it will be very hard. If you choose to
 have it based on, say, nodetype, nodename, node properties and
 jcr:path (fake pseudo property) it can be easily translated to a
 query. Note that for the jcr:path hierarchical ACL (eg read everything
 below /foo) it is not possible to write a lucene query easily unless
 you index path information as well. As a result, moves of
 large subtrees are slow because the entire subtree needs to be
 re-indexed. A different authorization model might be based on groups,
 where every node also gets the groups (the token of the group) indexed
 that can read that node. Although I never looked much into the code, I
 suspect [2] does something like this.
 
 
 that's what I had in mind in my proposal #4, the hurdles there relate to
 the fact that each index implementation aiming at providing facets would
 have to implement such an index and search with ACLs which is not trivial.
 One possibly good thing is that this is for sure not a new issue, as you
 pointed out Apache ManifoldCF has something like that for Solr (and I think
 for ES too). One the other hand this would differ quite a bit from the
 approach taken so far (indexes see just node and properties, the
 QueryEngine post filters results on ACLs, node types, etc.), so that'd be a
 significant change.
 
 
 
 So, instead of second guessing which might be acceptable (slow
 queries, wrong counts, etc) for which customers/users I'd try to

Re: [DISCUSS] supporting faceting in Oak query engine

2014-12-12 Thread Laurie Byrum

Thanks, Michael. FWIW, with the use cases I have in mind, getting back a
count that is less than the actual number (and some indication that there
is an unknown amount more) would be perfectly fine if it makes us go from
potentially unacceptable performance to acceptable performance.

Laurie


On 12/12/14 12:41 AM, Michael Marth mma...@adobe.com wrote:

Hi,

Davide’s proposal (let users specify maximum number of entries per facet)
is basically a generalisation of my proposal to return a facet if there
is more than 1 entry in the facet. I think we can try either, but we
might want to test the performance on cases with large result sets where
only few results are readable by the user.
AFAIR Amit and Davide have been working on a “micro scalability test
framework” (measuring how queries scale with content). We could maybe add
these tests there.

On Ard’s suggestion “possibly incorrect, fast counts”: I think this is
only feasible if “incorrect” is guaranteed to always be lower than the
exact amount. Otherwise facets would lead to information leakage as users
could find information about nodes they otherwise cannot read.

Cheers
Michael



Re: [DISCUSS] supporting faceting in Oak query engine

2014-12-10 Thread Ard Schrijvers

On Wed, Dec 10, 2014 at 9:32 AM, Davide Giannella dav...@apache.org wrote:
On 09/12/2014 17:10, Michael Marth wrote:
...

The use cases problematic case for counting the facets I have in mind are
when a query returns millions of results. This is problematic when one wants
to retrieve the exact size of the result set (taking ACLs into account,
obviously). When facets are to be retrieved this will be an even harder
problem (meaning when the exact number is to be calculated per facet).
As an illustration consider a digital asset management application that
displays mime type as facets. A query could return 1 million images and,
say, 10 video.

Is there a way we could support such scenarios (while still counting results
per facet) and have a performant implementation?

We can opt for ACL-Checking/Parsing at most X (let's say 1000) nodes. If
we're done within it, then we can output the actual number. In case
after 1000 nodes checked we still have some left we can leave the number
either empty or with something like many, +, or any other fancy way
if we want.

In the end is the same approach taken by Amazon (as Tommaso already
pointed) or for example google. If you run a search, their facets
(Searches related to...) are never with results.

I don't think Amazon and Google have customers that can demand them to
show correct facet counts...our customers typically do :). My take on
this would be to have a configurable option between

1) exact and possibly slow counts
2) unauthorized, possibly incorrect, fast counts

Obviously, the second just uses the faceted navigation counts from the
backing search implementation (with node by node access manager
check), whether it is the internal lucene index, solr or Elastic
Search. If you opt for the second option, then, depending on your
authorization model you can get fast exact authorized counts as well :
When the authorization model can be translated into a search query /
filter that is AND-ed with every normal search. For ES this is briefly
written at [1]. Most likely the filter is internally cached so even
for very large authorization queries (like we have at Hippo because of
fine grained ACL model) it will just perform. Obviously it depends
quite heavily on your authorization model whether it can be translated
to a query. If it relies on an external authorization check or has
many hierarchical constraints, it will be very hard. If you choose to
have it based on, say, nodetype, nodename, node properties and
jcr:path (fake pseudo property) it can be easily translated to a
query. Note that for the jcr:path hierarchical ACL (eg read everything
below /foo) it is not possible to write a lucene query easily unless
you index path information as well. As a result, moves of
large subtrees are slow because the entire subtree needs to be
re-indexed. A different authorization model might be based on groups,
where every node also gets the groups (the token of the group) indexed
that can read that node. Although I never looked much into the code, I
suspect [2] does something like this.
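
A minimal sketch of the group-based model described above: the user's search is AND-ed with a filter over the group tokens indexed on each node. The output uses Lucene classic query parser syntax, where "+" marks a required clause. The field name readGroups and the class name are hypothetical, and the sketch assumes group tokens are plain alphanumeric strings that need no escaping.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical sketch: AND a user query with a filter on indexed group tokens. */
final class GroupFilterQuery {

    /** Assumed name of the index field holding the groups allowed to read a node. */
    static final String READ_GROUPS_FIELD = "readGroups";

    /**
     * Returns "+(userQuery) +(readGroups:g1 readGroups:g2 ...)".
     * Both clauses are required; the group terms inside the second clause are OR-ed.
     */
    static String withReadFilter(String userQuery, List<String> groups) {
        if (groups.isEmpty()) {
            // An empty filter clause would not restrict anything, so refuse it.
            throw new IllegalArgumentException("principal must belong to at least one group");
        }
        String filter = groups.stream()
                .map(group -> READ_GROUPS_FIELD + ":" + group)
                .collect(Collectors.joining(" "));
        return "+(" + userQuery + ") +(" + filter + ")";
    }
}
```

With such a filter in place, the facet counts returned by the index are already restricted to readable nodes, provided the indexed tokens are kept in sync with the ACLs.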

So, instead of second guessing which might be acceptable (slow
queries, wrong counts, etc) for which customers/users I'd try to keep
the options open, have a default of correct (slow) counts, and make it
easy to flip to 'counts from the indexes without accessmanager
authorization', where depending on the authorization model, the latter
can return correct results.
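
A minimal sketch of such a switch, assuming the mode is read from a string-valued configuration property. The enum, its constants and the accepted values are hypothetical; the only point taken from the proposal is that the default is the correct (slow) mode.

```java
import java.util.Locale;

/** Hypothetical sketch: configurable facet count mode with a safe default. */
enum FacetCountMode {
    /** Access-check every hit: exact, possibly slow counts. */
    EXACT_AUTHORIZED,
    /** Take counts straight from the index without access checks: fast, possibly incorrect. */
    INDEX_UNCHECKED;

    /** Parses a configuration value; a missing or blank value falls back to the exact mode. */
    static FacetCountMode fromConfig(String value) {
        if (value == null || value.trim().isEmpty()) {
            return EXACT_AUTHORIZED;
        }
        return valueOf(value.trim().toUpperCase(Locale.ROOT));
    }
}
```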

For those who are interested, I will be listening to [3] this
afternoon (5 pm GMT).

Regards Ard

[1]
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-aliases.html#filtered
[2] http://manifoldcf.apache.org/en_US/index.html
[3]
http://www.elasticsearch.com/webinars/shield-securing-your-data-in-elasticsearch/

--
Amsterdam - Oosteinde 11, 1017 WT Amsterdam
Boston - 1 Broadway, Cambridge, MA 02142

US +1 877 414 4776 (toll free)
Europe +31(0)20 522 4466
www.onehippo.com

Re: [DISCUSS] supporting faceting in Oak query engine

2014-12-10 Thread Tommaso Teofili

2014-12-10 10:17 GMT+01:00 Ard Schrijvers a.schrijv...@onehippo.com:

On Wed, Dec 10, 2014 at 9:32 AM, Davide Giannella dav...@apache.org
wrote:
On 09/12/2014 17:10, Michael Marth wrote:
...

The use cases problematic case for counting the facets I have in mind
are when a query returns millions of results. This is problematic when one
wants to retrieve the exact size of the result set (taking ACLs into
account, obviously). When facets are to be retrieved this will be an even
harder problem (meaning when the exact number is to be calculated per
facet).
As an illustration consider a digital asset management application that
displays mime type as facets. A query could return 1 million images and,
say, 10 video.

Is there a way we could support such scenarios (while still counting
results per facet) and have a performant implementation?

In the end is the same approach taken by Amazon (as Tommaso already
pointed) or for example google. If you run a search, their facets
(Searches related to...) are never with results.

I don't think Amazon and Google have customers that can demand them to
show correct facet counts...our customers typically do :).

I see, however something along the lines of what Davide was proposing
doesn't sound too bad to me even for such use cases (but I may be wrong).

My take on
on this would be to have a configurable option between

1) exact and possibly slow counts
2) unauthorized, possibly incorrect, fast counts

that's what I had in mind in my proposal #4, the hurdles there relate to
the fact that each index implementation aiming at providing facets would
have to implement such an index and search with ACLs which is not trivial.
One possibly good thing is that this is for sure not a new issue, as you
pointed out Apache ManifoldCF has something like that for Solr (and I think
for ES too). On the other hand this would differ quite a bit from the
approach taken so far (indexes see just node and properties, the
QueryEngine post filters results on ACLs, node types, etc.), so that'd be a
significant change.

I think the best way of addressing this is by prototyping (some of) the
mentioned options and seeing where we get; I'll see what I can do there.

For those who are interested, I will be listening to [3] this
afternoon (5 pm GMT).

cool, thanks for the pointer!

Regards,
Tommaso

Regards Ard

--
Amsterdam - Oosteinde 11, 1017 WT Amsterdam
Boston - 1 Broadway, Cambridge, MA

Re: [DISCUSS] supporting faceting in Oak query engine

2014-12-10 Thread Ard Schrijvers

On Wed, Dec 10, 2014 at 10:17 AM, Ard Schrijvers
a.schrijv...@onehippo.com wrote:
On Wed, Dec 10, 2014 at 9:32 AM, Davide Giannella dav...@apache.org wrote:
On 09/12/2014 17:10, Michael Marth wrote:
...

The use cases problematic case for counting the facets I have in mind are
when a query returns millions of results. This is problematic when one
wants to retrieve the exact size of the result set (taking ACLs into
account, obviously). When facets are to be retrieved this will be an even
harder problem (meaning when the exact number is to be calculated per
facet).
As an illustration consider a digital asset management application that
displays mime type as facets. A query could return 1 million images and,
say, 10 video.

Is there a way we could support such scenarios (while still counting
results per facet) and have a performant implementation?

In the end is the same approach taken by Amazon (as Tommaso already
pointed) or for example google. If you run a search, their facets
(Searches related to...) are never with results.

I don't think Amazon and Google have customers that can demand them to
show correct facet counts...our customers typically do :). My take on
on this would be to have a configurable option between

1) exact and possibly slow counts
2) unauthorized, possibly incorrect, fast counts

Obviously, the second just uses the faceted navigation counts from the
backing search implementation (with node by node access manager

Here of course I meant to write: '**without** node by node access manager check'

check), whether it is the internal lucene index, solr or Elastic
Search. If you opt for the second option, then, depending on your
authorization model you can get fast exact authorized counts as well :
When the authorization model can be translated into a search query /
filter that is AND-ed with every normal search. For ES this is briefly
written at [1]. Most likely the filter is internally cached so even
for very large authorization queries (like we have at Hippo because of
fine grained ACL model) it will just perform. Obviously it depends
quite heavily on your authorization model whether it can be translated
to a query. If it relies on an external authorization check or has
many hierarchical constraints, it will be very hard. If you choose to
have it based on, say, nodetype, nodename, node properties and
jcr:path (fake pseudo property) it can be easily translated to a
query. Note that for the jcr:path hierarchical ACL (eg read everything
below /foo) it is not possible to write a lucene query easily unless
you index path information as well. As a result, moves of
large subtrees are slow because the entire subtree needs to be
re-indexed. A different authorization model might be based on groups,
where every node also gets the groups (the token of the group) indexed
that can read that node. Although I never looked much into the code, I
suspect [2] does something like this.

For those who are interested, I will be listening to [3] this
afternoon (5 pm GMT).

Regards Ard

--
Amsterdam - Oosteinde 11, 1017 WT Amsterdam
Boston - 1 Broadway, Cambridge, MA 02142

US +1 877 414 4776 (toll free)
Europe +31(0)20 522 4466
www.onehippo.com


Re: [DISCUSS] supporting faceting in Oak query engine

2014-12-09 Thread Tommaso Teofili

2014-12-08 8:15 GMT+01:00 Thomas Mueller muel...@adobe.com:

 Hi,

 I think we should do:


  1. conservative approach, do not touch JCR API


  select [jcr:path], [facet(jcr:primaryType)] from [nt:base]
  where contains([text], 'oak');

 The column facet(jcr:primaryType) would return the facet data. I think
 that's a good approach. The question is, which rows would return that
 data. I would prefer a solution where _each_ row returns the data (and not
 just the first row), because that's a bit easier to use, easier to
 document, and more closely matches the relational model. If just the first
 row returns the facet data, then we can't sort the result afterwards
 (otherwise the facet data ends up in another row, which would be weird).


sure, I see this point. While the first implementation that Thomas and I
discussed offline didn't do that, the current PoC does exactly that (it can
return the facets via row.getColumnValue("facet(jcr:primaryType)") for each row).



 Another approach is to extend the API (create a new interface, for example
 OakQuery). The JDBC API (but not the JCR API) has a concept of multiple
 result sets per query (Statement.getMoreResults). We could build a
 solution that more closely matches this model. But I don't think it's
 worth the trouble right now (we could still do that later on if really
 needed).


I think that for an end user to leverage facets easily what you propose
would really make things nicer, of course there's no hurry in defining
that, at least until we have a satisfactory facets implementation.



 About security, I wonder what are the common configurations. I think we
 should avoid a complex (but slow, and hard to implement) solution that can
 solve 100% of all possible _theoretical_ cases, but instead go for a
 (faster, simpler) solution that covers 99% of all _pratical_ cases.


If I think of the simplest use cases, I see:
- a publicly available website where users can search without logging in
- a website where logged in users can search on some content

Both would require the results and facets to be filtered on the content a
logged in user or an anonymous user can see.

Perhaps we may also have a use case where the website exposes content
crawled from the Web (e.g. Google) where there's no filtering on content,
maybe just a personalized ranking (but that's a different story that
doesn't belong here).

@Michael, Laurie: for filtering out the counts, as I said I'd prefer not to
do that because it's an interesting piece of information we would lose;
what we may do is make that inclusion/exclusion configurable either in
the query index definition node or at runtime somehow within the query,
depending on the client needs.

@Laurie for the option #5 that would mean we would have query indexes which
can index and query only data a configured user can see, e.g. we have an
'anonymous-lucene' index being a Lucene index that will only be able to
index nodes the user anonymous can see (has jcr:read privilege on), and
that will be used only for queries issued by the user anonymous, however
as I said I am not sure that's a good idea, because that may not scale (if
you want to define 100 users, you would have 100 Lucene indexes dedicated
to 100 different users).

Regards,
Tommaso




 Regards,
 Thomas




 On 05/12/14 12:13, Tommaso Teofili tommaso.teof...@gmail.com wrote:

 Hi all,
 
 I am resurrecting this thread as I've managed to find some time to start
 having a look at how to support faceting in Oak query engine.
 
 One important thing is that I agree with Ard (and I've seen it like that
 from the beginning) that since we have Lucene and Solr Oak index
 implementations we should rely on them for such advanced features [1][2]
 instead of reinventing the wheel.
 
 Within the above assumption the implementation seems quite
 straightforward.
 The not so obvious bits comes when getting to:
 - exposing facets within the JCR API
 - correctly filtering facets depending on authorization / privileges
 
 For the former here are a quick list of options that came to my mind
 (originated also when talking to people f2f about this):
 1. conservative approach, do not touch JCR API: facets are retrieved as
 custom columns (String values) of a Row (from QueryResult.getRows()), e.g.
 row.getValue("facet(jcr:primaryType)").
 2. Oak-only approach, do not touch JCR API but provide utilities which can
 retrieve structured facets from the result, e.g. Iterable<Facet> facets =
 OakQueryUtils.extractFacets(QueryResult.getRows());
 3. not JCR compliant approach, we add methods to the API similarly to what
 Ard and AlexK proposed
 4. adapter pattern, similarly to what is done in Apache Sling's adaptTo,
 where QueryResult can be adapted to different things and therefore it's
 more extensible (but less controllable).
 Of course other proposals are welcome on this.
 
 For the latter the things seem less simple as I foresee that we want the
 facets to be consistent with the result nodes and therefore to be filtered

Re: [DISCUSS] supporting faceting in Oak query engine

2014-12-09 Thread Thomas Mueller

Hi,

I would like the counts.

I agree. I guess this feature doesn't make much sense without the counts.

1, 2, and 4 seem like
bad ideas

1 undercuts the idea that we'd use lucene/solr to get decent
performance. 

Sorry I don't understand... This is just about the API to retrieve the
data. It still uses Lucene/Solr (the same as all other options). I'm not
sure if you talk about the performance overhead of converting the facet
data to a string and back? This performance overhead is very very small (I
assume not measurable).
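
A minimal sketch of that string round trip, to show how little work it involves. The "value=count,value=count" encoding and the class name are hypothetical (the thread does not define a format), and the sketch assumes facet values contain no comma.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical sketch: carry facet counts in a single String column value. */
final class FacetColumnCodec {

    /** Encodes the counts as "value=count" pairs joined by commas, keeping map order. */
    static String encode(Map<String, Integer> counts) {
        return counts.entrySet().stream()
                .map(entry -> entry.getKey() + "=" + entry.getValue())
                .collect(Collectors.joining(","));
    }

    /** Decodes a string produced by encode(); an empty string yields an empty map. */
    static Map<String, Integer> decode(String encoded) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        if (encoded.isEmpty()) {
            return counts;
        }
        for (String pair : encoded.split(",")) {
            // Split on the last '=' so that values such as "a=b" still decode.
            int separator = pair.lastIndexOf('=');
            counts.put(pair.substring(0, separator),
                    Integer.parseInt(pair.substring(separator + 1)));
        }
        return counts;
    }
}
```

Both directions are linear in the number of facet values, which is typically far smaller than the number of result rows.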

Regards,
Thomas

Re: [DISCUSS] supporting faceting in Oak query engine

2014-12-09 Thread Michael Marth

Hi,

I agree that facets *with* counts are better than without counts, but disagree 
that they are worthless without counts (see the Amazon link Tommaso posted 
earlier on this thread). There is value in providing the information that 
*some* results will appear when a user selects a facet.

The problematic use case for counting the facets that I have in mind is when 
a query returns millions of results. This is problematic when one wants to 
retrieve the exact size of the result set (taking ACLs into account, 
obviously). When facets are to be retrieved this will be an even harder problem 
(meaning when the exact number is to be calculated per facet).
As an illustration consider a digital asset management application that 
displays mime type as facets. A query could return 1 million images and, say, 
10 video.

Is there a way we could support such scenarios (while still counting results 
per facet) and have a performant implementation?

(I should note that I have not tested how long it takes to retrieve and 
ACL-check 1 million nodes - maybe my concern is invalid)

Best regards
Michael


On 09 Dec 2014, at 09:57, Thomas Mueller muel...@adobe.com wrote:

 Hi,
 
 I would like the counts.
 
 I agree. I guess this feature doesn't make much sense without the counts.
 
 1, 2, and 4 seem like
 bad ideas
 
 1 undercuts the idea that we'd use lucene/solr to get decent
 performance. 
 
 Sorry I don't understand... This is just about the API to retrieve the
 data. It still uses Lucene/Solr (the same as all other options). I'm not
 sure if you talk about the performance overhead of converting the facet
 data to a string and back? This performance overhead is very very small (I
 assume not measurable).
 
 Regards,
 Thomas

Re: [DISCUSS] supporting faceting in Oak query engine

2014-12-09 Thread Lukas Kahwe Smith


 On 09 Dec 2014, at 18:10, Michael Marth mma...@adobe.com wrote:
 
 Hi,
 
 I agree that facets *with* counts are better than without counts, but 
 disagree that they are worthless without counts (see the Amazon link Tommaso 
 posted earlier on this thread). There is value in providing the information 
 that *some* results will appear when a user selects a facet .
 
 The use cases problematic case for counting the facets I have in mind are 
 when a query returns millions of results. This is problematic when one wants 
 to retrieve the exact size of the result set (taking ACLs into account, 
 obviously). When facets are to be retrieved this will be an even harder 
 problem (meaning when the exact number is to be calculated per facet).
 As an illustration consider a digital asset management application that 
 displays mime type as facets. A query could return 1 million images and, say, 
 10 video.
 
 Is there a way we could support such scenarios (while still counting results 
 per facet) and have a performant implementation?
 
 (I should note that I have not tested how long it takes to retrieve and 
 ACL-check 1 million nodes - maybe my concern is invalid)

Yeah, such stuff can easily cause severe slowdowns. So making the count
optional, or counting only up to some specified max value, is nice but
complicates the API.

regards,
Lukas Kahwe Smith
sm...@pooteeweet.org






Re: [DISCUSS] supporting faceting in Oak query engine

2014-12-08 Thread Laurie Byrum

I guess that returning the facets without the counts really weakens the
story of facets. Yes, amazon does it for some searches, but usually it
does not. For the use case I have in mind, I would like the counts.

Options 3 or 6 seem like decent avenues to explore. 1, 2, and 4 seem like
bad ideas (1 undercuts the idea that we'd use lucene/solr to get decent
performance. 2 drops the counts. 4 feels like something we would regret,
because of the complexity). I'll admit it: I didn't understand option 5.

Thanks,
Laurie


On 12/8/14 2:19 AM, Michael Marth mma...@adobe.com wrote:

Hi,

About security, I wonder what are the common configurations. I think we
should avoid a complex (but slow, and hard to implement) solution that can
solve 100% of all possible _theoretical_ cases, but instead go for a
(faster, simpler) solution that covers 99% of all _pratical_ cases.

I am not sure if you are hinting towards one of the proposed approaches
with that statement. IMO this simplification suggested by Tommaso makes
sense:

only if there's at least one item (node) in the
(filtered) results which falls under that facet. That would mean that we
would not return the counts of the facets, but a facet would be returned
if
there's at least one item in the results belonging to it
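
A minimal sketch of that simplification: a facet value is reported as soon as one readable node falls under it, and no further access checks are spent on that value. The names (FacetPresence, Hit, canRead) are hypothetical, and the predicate stands in for the real access check.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

/** Hypothetical sketch: report which facet values have at least one readable hit. */
final class FacetPresence {

    /** One index hit: the node path and its facet value. */
    static final class Hit {
        final String path;
        final String facetValue;

        Hit(String path, String facetValue) {
            this.path = path;
            this.facetValue = facetValue;
        }
    }

    /**
     * Returns the facet values with at least one readable hit, in first-seen order.
     * Hits whose facet value is already confirmed are skipped without an access check.
     */
    static Set<String> readableFacetValues(List<Hit> hits, Predicate<String> canRead) {
        Set<String> present = new LinkedHashSet<>();
        for (Hit hit : hits) {
            if (!present.contains(hit.facetValue) && canRead.test(hit.path)) {
                present.add(hit.facetValue);
            }
        }
        return present;
    }
}
```

The saving depends on the data: a facet value whose hits are all unreadable still costs one access check per hit.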

Best regards
Michael

Re: [DISCUSS] supporting faceting in Oak query engine

2014-12-07 Thread Thomas Mueller

Hi,

I think we should do:


 1. conservative approach, do not touch JCR API


 select [jcr:path], [facet(jcr:primaryType)] from [nt:base]
 where contains([text], 'oak');

The column facet(jcr:primaryType) would return the facet data. I think
that's a good approach. The question is, which rows would return that
data. I would prefer a solution where _each_ row returns the data (and not
just the first row), because that's a bit easier to use, easier to
document, and more closely matches the relational model. If just the first
row returns the facet data, then we can't sort the result afterwards
(otherwise the facet data ends up in another row, which would be weird).

Another approach is to extend the API (create a new interface, for example
OakQuery). The JDBC API (but not the JCR API) has a concept of multiple
result sets per query (Statement.getMoreResults). We could build a
solution that more closely matches this model. But I don't think it's
worth the trouble right now (we could still do that later on if really
needed).

About security, I wonder what are the common configurations. I think we
should avoid a complex (but slow, and hard to implement) solution that can
solve 100% of all possible _theoretical_ cases, but instead go for a
(faster, simpler) solution that covers 99% of all _practical_ cases.


Regards,
Thomas




On 05/12/14 12:13, Tommaso Teofili tommaso.teof...@gmail.com wrote:

Hi all,

I am resurrecting this thread as I've managed to find some time to start
having a look at how to support faceting in Oak query engine.

One important thing is that I agree with Ard (and I've seen it like that
from the beginning) that since we have Lucene and Solr Oak index
implementations we should rely on them for such advanced features [1][2]
instead of reinventing the wheel.

Under the above assumption the implementation seems quite straightforward.
The less obvious bits come when getting to:
- exposing facets within the JCR API
- correctly filtering facets depending on authorization / privileges

For the former here is a quick list of options that came to my mind
(originated also when talking to people f2f about this):
1. conservative approach, do not touch JCR API: facets are retrieved as
custom columns (String values) of a Row (from QueryResult.getRows()), e.g.
row.getValue("facet(jcr:primaryType)").
2. Oak-only approach, do not touch JCR API but provide utilities which can
retrieve structured facets from the result, e.g. Iterable<Facet> facets =
OakQueryUtils.extractFacets(QueryResult.getRows());
3. not JCR compliant approach, we add methods to the API similarly to what
Ard and AlexK proposed
4. adapter pattern, similarly to what is done in Apache Sling's adaptTo,
where QueryResult can be adapted to different things and therefore it's
more extensible (but less controllable).
Of course other proposals are welcome on this.

For the latter the things seem less simple as I foresee that we want the
facets to be consistent with the result nodes and therefore to be filtered
according to the privileges of the user having issued the query.
Here are the options I could think of so far, even though none looks
satisfactory to me yet:

1. retrieve facets and then filter them afterwards seems to have an
inherent issue because the facets do not include information about the
documents (nodes) which generated them, therefore retrieving them
unfiltered (as the index doesn't have information about ACLs) as they are,
e.g. facet on jcr:primaryType:

jcr:primaryType : {
nt:unstructured : 100,
nt:file : 20,
oak:Unstructured : 10
}

would require us either to iterate over the results and filter the counts as
we iterate, or to do N further queries to filter the counts. Either way it
would be useless to have the index return the facets, as we'd be retrieving
them ourselves to do the ACL checks (or resorting to other such dummy methods).

2. retrieve the facets unfiltered from the index and then return them in
the filtered results only if there's at least one item (node) in the
(filtered) results which falls under that facet. That would mean that we
would not return the counts of the facets, but a facet would be returned
if there's at least one item in the results belonging to it. While that
doesn't sound too nice (and it's a pity, as we're losing some information we
have along the way), Amazon does exactly that (see the Show results for column
on the left at [3]) :-)

3. use a slightly different mechanism for returning facets, called result
grouping (or field collapsing) in Solr [5], in which results are returned
grouped (and counted) by a certain field. The example of point 1 would
look
like:

grouped:{
  jcr:primaryType:{
matches: 130,
groups:[{
groupValue:nt:unstructured,
doclist:{numFound:100,start:0,docs:[
{
  path:/content/a/b
}, ...
  ]
}},
  {
groupValue:nt:file,
doclist:{numFound:20,start:0,docs:[
{
  path:/content/d/e

Re: [DISCUSS] supporting faceting in Oak query engine

2014-12-05 Thread Tommaso Teofili

Hi all,

I am resurrecting this thread as I've managed to find some time to start
having a look at how to support faceting in Oak query engine.

One important thing is that I agree with Ard (and I've seen it like that
from the beginning) that since we have Lucene and Solr Oak index
implementations we should rely on them for such advanced features [1][2]
instead of reinventing the wheel.

Under the above assumption the implementation seems quite straightforward.
The less obvious bits come when getting to:
- exposing facets within the JCR API
- correctly filtering facets depending on authorization / privileges

For the former here is a quick list of options that came to my mind
(originated also when talking to people f2f about this):
1. conservative approach, do not touch JCR API: facets are retrieved as
custom columns (String values) of a Row (from QueryResult.getRows()), e.g.
row.getValue("facet(jcr:primaryType)").
2. Oak-only approach, do not touch JCR API but provide utilities which can
retrieve structured facets from the result, e.g. Iterable<Facet> facets =
OakQueryUtils.extractFacets(QueryResult.getRows());
3. not JCR compliant approach, we add methods to the API similarly to what
Ard and AlexK proposed
4. adapter pattern, similarly to what is done in Apache Sling's adaptTo,
where QueryResult can be adapted to different things and therefore it's
more extensible (but less controllable).
Of course other proposals are welcome on this.

For the latter the things seem less simple as I foresee that we want the
facets to be consistent with the result nodes and therefore to be filtered
according to the privileges of the user having issued the query.
Here are the options I could think of so far, even though none looks
satisfactory to me yet:

1. retrieve facets and then filter them afterwards seems to have an
inherent issue because the facets do not include information about the
documents (nodes) which generated them, therefore retrieving them
unfiltered (as the index doesn't have information about ACLs) as they are,
e.g. facet on jcr:primaryType:

jcr:primaryType : {
nt:unstructured : 100,
nt:file : 20,
oak:Unstructured : 10
}

would require us either to iterate over the results and filter the counts as
we iterate, or to do N further queries to filter the counts. Either way it
would be useless to have the index return the facets, as we'd be retrieving
them ourselves to do the ACL checks (or resorting to other such dummy methods).

2. retrieve the facets unfiltered from the index and then return them in
the filtered results only if there's at least one item (node) in the
(filtered) results which falls under that facet. That would mean that we
would not return the counts of the facets, but a facet would be returned if
there's at least one item in the results belonging to it. While that doesn't
sound too nice (and it's a pity, as we're losing some information we have
along the way), Amazon does exactly that (see the Show results for column on
the left at [3]) :-)

3. use a slightly different mechanism for returning facets, called result
grouping (or field collapsing) in Solr [5], in which results are returned
grouped (and counted) by a certain field. The example of point 1 would look
like:

grouped:{
  jcr:primaryType:{
matches: 130,
groups:[{
groupValue:nt:unstructured,
doclist:{numFound:100,start:0,docs:[
{
  path:/content/a/b
}, ...
  ]
}},
  {
groupValue:nt:file,
doclist:{numFound:20,start:0,docs:[
{
  path:/content/d/e
}, ...
  ]
}},
...

there the facets would also contain (some or all of) the docs (nodes)
belonging to each group and therefore filtering the facets afterwards could
be done without having to retrieve the paths of the nodes falling under
each facet.

4. move towards the 'covering index' concept [5] Thomas mentioned in [6]
and incorporate the ACLs in the index so that no further filtering has to
be done once the underlying query index has returned its results. However
this comes with a non trivial impact with regards to a) load of the
indexing on the repo (each time some ACL changes a bunch of index updates
happen)  b) complexity in encoding ACLs in the indexed documents c)
complexity in encoding the ACL check in the index-specific queries. Still
this is probably something we may evaluate regardless of facets in the
future as the lazy ACL check approach we have has, IIUTC, the following
issue: userA searching for 'jcr:title=foo', the query engine selecting the
Lucene property index which returns 100 docs, userA being only able to see
2 of them because of its ACLs, in this case we have wasted (approximately)
98% of the Lucene effort to match and return the documents. However this is
most probably overkill for now...

5. another probably crazy idea is user filtered indexes, meaning that the
NodeStates passed to such IndexEditors would be filtered according to what

Re: [DISCUSS] supporting faceting in Oak query engine

2014-09-01 Thread Bertrand Delacretaz

On Sat, Aug 30, 2014 at 4:25 AM, Alexander Klimetschek
aklim...@adobe.com wrote:
 ...you can leverage some kind of caching though. In practice, if you have a 
 public site
 with content that does not change permanently, the facet values are pretty 
 much
 stable, and authorization shouldn't cost much

Yes, I think it's very rare to require facets to be immediately up to
date after content changes; updating them (or the related caches)
asynchronously with low priority should be good enough for the large
majority of cases.

So maybe the facet indexes and caches can be handled differently than
primary queries, with more lenient update latency requirements.
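
As a rough sketch of what "more lenient update latency" could look like, here is a tiny stale-while-refresh cache in plain Java. All names (`LenientFacetCache`, the loader, the injected clock and executor) are hypothetical and nothing here is an existing Oak component. A stale value is served immediately and the reload is handed to whatever executor the caller provides, e.g. a low-priority background thread.

```java
import java.util.Map;
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical sketch: facet counts may be up to maxAgeMillis (plus the
// refresh time) out of date; readers are never made to wait for a refresh
// as long as the executor runs tasks asynchronously.
final class LenientFacetCache {
    private final Supplier<Map<String, Integer>> loader; // computes fresh facet counts
    private final Executor refresher;                     // runs reloads, ideally in the background
    private final LongSupplier clock;                     // current time in millis, injectable for tests
    private final long maxAgeMillis;
    private final AtomicBoolean refreshing = new AtomicBoolean(false);
    private volatile Map<String, Integer> cached;
    private volatile long loadedAt;

    LenientFacetCache(Supplier<Map<String, Integer>> loader, Executor refresher,
                      LongSupplier clock, long maxAgeMillis) {
        this.loader = loader;
        this.refresher = refresher;
        this.clock = clock;
        this.maxAgeMillis = maxAgeMillis;
    }

    Map<String, Integer> get() {
        Map<String, Integer> current = cached;
        if (current == null) {
            reload(); // first access: nothing to serve yet, so load inline
            return cached;
        }
        boolean stale = clock.getAsLong() - loadedAt > maxAgeMillis;
        if (stale && refreshing.compareAndSet(false, true)) {
            refresher.execute(() -> {
                try {
                    reload();
                } finally {
                    refreshing.set(false);
                }
            });
        }
        return current; // possibly stale; the next call sees the refreshed value
    }

    private void reload() {
        cached = loader.get();
        loadedAt = clock.getAsLong();
    }
}
```

The trade-off is the one described above: the primary query stays exact, while facet data lags behind content changes by a bounded amount.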

-Bertrand

Re: [DISCUSS] supporting faceting in Oak query engine

2014-09-01 Thread Ard Schrijvers

Hey Alex,

On Sat, Aug 30, 2014 at 4:25 AM, Alexander Klimetschek
aklim...@adobe.com wrote:
 On 29.08.2014, at 03:10, Ard Schrijvers a.schrijv...@onehippo.com wrote:

 1) When exposing faceting from Jackrabbit, we wouldn't use virtual
 layers any more to expose them over pure JCR spec API's. Instead, we
 would extend the jcr QueryResult to have next to getRows/getNodes/etc
 also expose for example methods on the QueryResult like

 public Map<String, Integer> getFacetValues(final String facet) {
     return result.getFacetValues(facet);
 }

 public QueryResult drilldown(final FacetValue facetValue) {
     // return current query result drilled down for facet value
     return ...
 }

 We actually have a similar API in our CQ/AEM product:

 Query = represents a query [1]
 SearchResult result = query.getResult();
 Map<String, Facet> facets = result.getFacets();

 A facet is a list of Buckets [2] - same as FacetValue above, I assume - an 
 abstraction over different values. You could have distinctive values (e.g. 
 red, green, blue), but also ranges (last year, last month etc.). 
 Each bucket has a count, i.e. the number of times it occurs in the current 
 result.

 Then on Query you have a method

 Query refine(Bucket bucket)

 which is the same as the drilldown above.

 So in the end it looks pretty much the same, and seems to be a good way to 
 represent this as API. Doesn't say much about the implementation yet, though 
 :)

It looks very much the same, and I must admit that while typing my
mail I didn't pay too much attention to things like naming
(I reckon that #refine is a much nicer name than the
drillDown I wrote :-)


 2) Authorized counts: for faceting, it doesn't make sense to expose that
 there are 314 results if you can only read 54 of them. Accounting for
 authorization through access manager can be way too slow.
 ...
 3) If you support faceting through Oak, will that be competitive
 enough to what Solr and Elasticsearch offer? Customers these days have
 some expectations on search result quality and faceting capabilities,
 performance included.
 ...
 So, my take would be to invest time in easy integration with
 solr/elasticsearch and focus in Oak on the parts (hierarchy,
 authorization, merging, versioning) that aren't covered by already
 existing frameworks. Perhaps provide an extended JCR API as described
 in (1) which under the hood can delegate to a solr or es java client.
 In the end, you'll still end up having the authorized counts issue,
 but if you make the integration pluggable enough, it might be possible
 to leverage domain specific solutions to this (solr/es doesn't do
 anything with authorization either, it is a tough nut to crack)

 Good points. When facets are used, the worst case (showing facets for all 
 your content) might actually be the very first thing you see, when something 
 like a product search/browse page is shown, before any actual search by the 
 user is done. Optimizing for performance right from the start is a must, I 
 agree.

 What I can imagine, though, is that you can leverage some kind of caching.
 In practice, if you have a public site with content that does not change
 constantly, the facet values are pretty much stable, and
 authorization shouldn't cost much.

Certainly there are many use cases where you can cache a lot, or for
example have a public site user that has read access to an entire
content tree. It becomes however much more difficult when you want to
for example expose faceted structure of documents to an editor in a
cms environment, where the editor has read access to only 1% of the
documents. If at the same time, her initial query without
authorization results in, say, 10 million hits, then you'll have to
authorize all of them to get correct counts. The only way we could
make this perform with Hippo CMS against jackrabbit was by
translating our authorization model directly to lucene
queries and keep caching (authorized) bitsets (slightly different in
newer lucene versions) in memory for a user, see [1]. The difficulty
was that even executing the authorization query (to AND with normal
query) became slow because of very large queries, but fortunately due
to the jackrabbit 2 index implementation, we could keep a cached
bitset per indexreader, see [2]. Unfortunately, this solution can only
be done for specific authorization models (which can be mapped to
lucene queries) and might not be generic enough for oak.
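
The cached-bitset idea can be sketched with plain java.util.BitSet. This is only an illustration of the principle, not the Hippo code from [1]: `BitSetFacetCounter`, the per-user cache and the `authorizationQuery` function are all hypothetical, and document ids are simply bit positions.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch: authorization is evaluated once per user into a bit set
// of readable document ids, then ANDed with the query hits and each facet bucket.
final class BitSetFacetCounter {
    private final Map<String, BitSet> authorizedByUser = new ConcurrentHashMap<>();
    private final Function<String, BitSet> authorizationQuery; // the expensive part, run once per user

    BitSetFacetCounter(Function<String, BitSet> authorizationQuery) {
        this.authorizationQuery = authorizationQuery;
    }

    Map<String, Integer> count(String user, BitSet queryHits, Map<String, BitSet> docsByFacetValue) {
        BitSet readableHits = (BitSet) queryHits.clone(); // never mutate the caller's bit set
        readableHits.and(authorizedByUser.computeIfAbsent(user, authorizationQuery));

        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, BitSet> entry : docsByFacetValue.entrySet()) {
            BitSet bucket = (BitSet) entry.getValue().clone();
            bucket.and(readableHits);
            if (!bucket.isEmpty()) {
                counts.put(entry.getKey(), bucket.cardinality()); // authorized count for this value
            }
        }
        return counts;
    }
}
```

The counts are exact for the user, and after the first call the cost is a few bitwise ANDs rather than one access check per hit. The sketch ignores cache invalidation when ACLs or the index change, which is where the per-indexreader caching mentioned above comes in.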

Anyway, apart from performance / authorization, I doubt whether oak
will be able to keep up with what can be leveraged through ES or Solr.

Regards Ard

[1] 
http://svn.onehippo.org/repos/hippo/hippo-cms7/repository/trunk/engine/src/main/java/org/hippoecm/repository/query/lucene/AuthorizationQuery.java
[2] 
http://www.onehippo.com/en/resources/blogs/2013/01/cms-7.8-nailed-down-authorization-combined-with-searches.html


 [1]

Re: [DISCUSS] supporting faceting in Oak query engine

2014-08-29 Thread Ard Schrijvers

Hello,

On Mon, Aug 25, 2014 at 7:02 PM, Lukas Smith sm...@pooteeweet.org wrote:
 Aloha,

 you should definitely talk to the HippoCMS developers. They forked Jackrabbit 
 2.x to add faceting as virtual nodes. They ran into some performance issues
 but I am sure they still have valuable feedback on this.

Well, performance actually wasn't the biggest hurdle: exposing and
integrating virtual nodes was quite a bit tougher.

Indeed I think I might have quite some feedback, but honestly, these
days I am also full of doubts about what the best approach would be. I'll
try to keep it short:

1) When exposing faceting from Jackrabbit, we wouldn't use virtual
layers any more to expose them over pure JCR spec APIs. Instead, we
would extend the JCR QueryResult so that, next to getRows/getNodes/etc.,
it also exposes methods like

public Map<String, Integer> getFacetValues(final String facet) {
    return result.getFacetValues(facet);
}

public QueryResult drilldown(final FacetValue facetValue) {
    // return current query result drilled down for facet value
    return ...
}

2) Authorized counts: for faceting, it doesn't make sense to expose that
there are 314 results if you can only read 54 of them. Accounting for
authorization through access manager can be way too slow. The
alternatives are to not show authorized counts, or try to translate
the authorization model to a lucene query which is in general not
possible unless you restrict your authorization model severely (which
results in a domain specific solution unusable for JR)

3) If you support faceting through Oak, will that be competitive
enough to what Solr and Elasticsearch offer? Customers these days have
some expectations on search result quality and faceting capabilities,
performance included. Oak's faceting support will be compared to
dedicated search servers and is quite unlikely to be nearly as good
and to keep up with what is being built: aggregations are the new buzz,
a very cool superset of faceting. You really don't want to have
to leverage that next from Oak.
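
To illustrate the cost behind point 2, here is a naive sketch of authorized counts in plain Java. The names are hypothetical and `canRead` merely stands in for an access manager check. The point is that the number of checks equals the number of hits, however few of them the user can read.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Predicate;

// Hypothetical sketch of exact, authorized facet counts via one ACL check per hit.
final class AuthorizedFacetCounts {

    static final class Result {
        final Map<String, Integer> counts = new TreeMap<>();
        int aclChecks; // how many access checks were needed
    }

    // Each hit is a two-element array: {path, facetValue} (assumed shape).
    static Result count(List<String[]> hits, Predicate<String> canRead) {
        Result result = new Result();
        for (String[] hit : hits) {
            result.aclChecks++;
            if (canRead.test(hit[0])) {
                result.counts.merge(hit[1], 1, Integer::sum);
            }
        }
        return result;
    }
}
```

With 314 hits of which 54 are readable, this would report counts summing to 54 after 314 checks; with millions of hits the same loop is what makes the access manager route too slow.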

So, my take would be to invest time in easy integration with
solr/elasticsearch and focus in Oak on the parts (hierarchy,
authorization, merging, versioning) that aren't covered by already
existing frameworks. Perhaps provide an extended JCR API as described
in (1) which under the hood can delegate to a solr or es java client.
In the end, you'll still end up having the authorized counts issue,
but if you make the integration pluggable enough, it might be possible
to leverage domain specific solutions to this (solr/es doesn't do
anything with authorization either, it is a tough nut to crack)

Regards Ard


 regards,
 Lukas Kahwe Smith

 On 25 Aug 2014, at 18:43, Laurie Byrum lby...@adobe.com wrote:

 Hi Tommaso,
 I am happy to see this thread!

 Questions:
 Do you expect to want to support hierarchical or pivoted facets soonish?
 If so, does that influence this decision?
 Do you know how ACLs will come into play with your facet implementation?
 If so, does that influence this decision? :-)

 Thanks!
 Laurie



 On 8/25/14 7:08 AM, Tommaso Teofili tommaso.teof...@gmail.com wrote:

 Hi all,

 since this has been asked every now and then [1] and since I think it's a
 pretty useful and common feature for search engine nowadays I'd like to
 discuss introduction of facets [2] for the Oak query engine.

 Pros: having facets in search results usually helps filtering (drill down)
 the results before browsing all of them, so the main usage would be for
 client code.

 Impact: probably change / addition in both the JCR and Oak APIs to support
 returning other than just nodes (a NodeIterator and a Cursor
 respectively).

 Right now a couple of ideas on how we could do that come to my mind, both
 based on the approach of having an Oak index for them:
 1. a (multivalued) property index for facets, meaning we would store the
 facets in the repository, so that we would run a query against it to have
 the facets of an originating query.
 2. a dedicated QueryIndex implementation, eventually leveraging Lucene
 faceting capabilities, which could use the Lucene index we already have,
 together with a sidecar index [3].

 What do you think?
 Regards,
 Tommaso

 [1] :
 http://markmail.org/search/?q=oak%20faceting#query:oak%20faceting%20list%3Aorg.apache.jackrabbit.oak-dev+page:1+state:facets
 [2] : http://en.wikipedia.org/wiki/Faceted_search
 [3] :
 http://lucene.apache.org/core/4_0_0/facet/org/apache/lucene/facet/doc-files/userguide.html




-- 
Amsterdam - Oosteinde 11, 1017 WT Amsterdam
Boston - 1 Broadway, Cambridge, MA 02142

US +1 877 414 4776 (toll free)
Europe +31(0)20 522 4466
www.onehippo.com

Re: [DISCUSS] supporting faceting in Oak query engine

2014-08-29 Thread Alexander Klimetschek

On 29.08.2014, at 03:10, Ard Schrijvers a.schrijv...@onehippo.com wrote:

 1) When exposing faceting from Jackrabbit, we wouldn't use virtual
 layers any more to expose them over pure JCR spec API's. Instead, we
 would extend the jcr QueryResult to have next to getRows/getNodes/etc
 also expose for example methods on the QueryResult like
 
 public Map<String, Integer> getFacetValues(final String facet) {
     return result.getFacetValues(facet);
 }
 
 public QueryResult drilldown(final FacetValue facetValue) {
     // return current query result drilled down for facet value
     return ...
 }

We actually have a similar API in our CQ/AEM product:

Query = represents a query [1]
SearchResult result = query.getResult();
Map<String, Facet> facets = result.getFacets();

A facet is a list of Buckets [2] - same as FacetValue above, I assume - an 
abstraction over different values. You could have distinct values (e.g. 
red, green, blue), but also ranges (last year, last month etc.). Each 
bucket has a count, i.e. the number of times it occurs in the current result.

Then on Query you have a method

Query refine(Bucket bucket)

which is the same as the drilldown above.

So in the end it looks pretty much the same, and seems to be a good way to 
represent this as API. Doesn't say much about the implementation yet, though :)
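
For readers who want to see the shape of such an API, here is a toy in-memory model in plain Java. It is not the CQ/AEM API from [1] and [2]: `FacetedQuery` and its `Bucket` are hypothetical names, items are plain property maps, and only distinct-value buckets (no ranges) are covered.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy model of a query that exposes facets and can be refined by a bucket.
final class FacetedQuery {

    // One value of a facet plus the number of times it occurs in the current result.
    static final class Bucket {
        final String facet;
        final String value;
        final int count;

        Bucket(String facet, String value, int count) {
            this.facet = facet;
            this.value = value;
            this.count = count;
        }
    }

    private final List<Map<String, String>> items; // each item: property name -> value

    FacetedQuery(List<Map<String, String>> items) {
        this.items = items;
    }

    List<Map<String, String>> getResult() {
        return items;
    }

    // Buckets for one property, in first-seen order.
    List<Bucket> getFacet(String property) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map<String, String> item : items) {
            String value = item.get(property);
            if (value != null) {
                counts.merge(value, 1, Integer::sum);
            }
        }
        List<Bucket> buckets = new ArrayList<>();
        counts.forEach((value, count) -> buckets.add(new Bucket(property, value, count)));
        return buckets;
    }

    // Same idea as drilldown/refine: a new query restricted to the bucket's value.
    FacetedQuery refine(Bucket bucket) {
        List<Map<String, String>> narrowed = new ArrayList<>();
        for (Map<String, String> item : items) {
            if (bucket.value.equals(item.get(bucket.facet))) {
                narrowed.add(item);
            }
        }
        return new FacetedQuery(narrowed);
    }
}
```

Refining returns a new query object, so facets of the narrowed result can be requested again and drill-downs can be chained.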

 2) Authorized counts: for faceting, it doesn't make sense to expose that
 there are 314 results if you can only read 54 of them. Accounting for
 authorization through access manager can be way too slow.
 ...
 3) If you support faceting through Oak, will that be competitive
 enough to what Solr and Elasticsearch offer? Customers these days have
 some expectations on search result quality and faceting capabilities,
 performance included.
 ...
 So, my take would be to invest time in easy integration with
 solr/elasticsearch and focus in Oak on the parts (hierarchy,
 authorization, merging, versioning) that aren't covered by already
 existing frameworks. Perhaps provide an extended JCR API as described
 in (1) which under the hood can delegate to a solr or es java client.
 In the end, you'll still end up having the authorized counts issue,
 but if you make the integration pluggable enough, it might be possible
 to leverage domain specific solutions to this (solr/es doesn't do
 anything with authorization either, it is a tough nut to crack)

Good points. When facets are used, the worst case (showing facets for all your 
content) might actually be the very first thing you see, when something like a 
product search/browse page is shown, before any actual search by the user is 
done. Optimizing for performance right from the start is a must, I agree.

What I can imagine, though, is that you can leverage some kind of caching.
In practice, if you have a public site with content that does not change
constantly, the facet values are pretty much stable, and authorization
shouldn't cost much.

[1] 
http://docs.adobe.com/docs/en/aem/6-0/develop/ref/javadoc/com/day/cq/search/Query.html
[2] 
http://docs.adobe.com/docs/en/aem/6-0/develop/ref/javadoc/com/day/cq/search/facets/Bucket.html

Cheers,
Alex

Re: [DISCUSS] supporting faceting in Oak query engine

2014-08-26 Thread Tommaso Teofili

Hi Laurie,

2014-08-25 18:43 GMT+02:00 Laurie Byrum lby...@adobe.com:

 Hi Tommaso,
 I am happy to see this thread!


;-)



 Questions:
 Do you expect to want to support hierarchical or pivoted facets soonish?


I would say 'why not' if we have a valid use case.


 If so, does that influence this decision?


I think so, especially it would influence the way that may be implemented.


 Do you know how ACLs will come into play with your facet implementation?


not yet, I think that's one of the open points (e.g. Lukas mentioned that
HippoCMS did use 'virtual nodes' for them) we should take care of; each
'term' in the facet should be properly checked, but of course doing this
kind of check at that fine grain would be costly so we need to come up with
a solution which is both correct from the security point of view and
performant.


 If so, does that influence this decision? :-)


yes, I think so :)

Any suggestions and / or feedback would be highly welcome, especially from
potential users of this feature so that we properly tackle your
requirements (if any).

Thanks and regards,
Tommaso



 Thanks!
 Laurie



 On 8/25/14 7:08 AM, Tommaso Teofili tommaso.teof...@gmail.com wrote:

 Hi all,
 
 since this has been asked every now and then [1] and since I think it's a
 pretty useful and common feature for search engine nowadays I'd like to
 discuss introduction of facets [2] for the Oak query engine.
 
 Pros: having facets in search results usually helps filtering (drill down)
 the results before browsing all of them, so the main usage would be for
 client code.
 
 Impact: probably change / addition in both the JCR and Oak APIs to support
 returning other than just nodes (a NodeIterator and a Cursor
 respectively).
 
 Right now a couple of ideas on how we could do that come to my mind, both
 based on the approach of having an Oak index for them:
 1. a (multivalued) property index for facets, meaning we would store the
 facets in the repository, so that we would run a query against it to have
 the facets of an originating query.
 2. a dedicated QueryIndex implementation, eventually leveraging Lucene
 faceting capabilities, which could use the Lucene index we already have,
 together with a sidecar index [3].
 
 What do you think?
 Regards,
 Tommaso
 
 [1] :
 
 http://markmail.org/search/?q=oak%20faceting#query:oak%20faceting%20list%3
 Aorg.apache.jackrabbit.oak-dev+page:1+state:facets
 [2] : http://en.wikipedia.org/wiki/Faceted_search
 [3] :
 
 http://lucene.apache.org/core/4_0_0/facet/org/apache/lucene/facet/doc-file
 s/userguide.html

Re: [DISCUSS] supporting faceting in Oak query engine

2014-08-26 Thread Tommaso Teofili

2014-08-25 19:02 GMT+02:00 Lukas Smith sm...@pooteeweet.org:

 Aloha,


Aloha!



 you should definitely talk to the HippoCMS developers. They forked
 Jackrabbit 2.x to add faceting as virtual nodes. They ran into some
 performance issues but I am sure they still have valuable feedback on
 this.


Cool, thanks for letting us know. If you or anyone else (from Hippo) would
like to give some more insight on pros and cons of such an approach, that'd
be very good.

Regards,
Tommaso



 regards,
 Lukas Kahwe Smith

  On 25 Aug 2014, at 18:43, Laurie Byrum lby...@adobe.com wrote:
 
  Hi Tommaso,
  I am happy to see this thread!
 
  Questions:
  Do you expect to want to support hierarchical or pivoted facets soonish?
  If so, does that influence this decision?
  Do you know how ACLs will come into play with your facet implementation?
  If so, does that influence this decision? :-)
 
  Thanks!
  Laurie
 
 
 
  On 8/25/14 7:08 AM, Tommaso Teofili tommaso.teof...@gmail.com
 wrote:
 
  Hi all,
 
  since this has been asked every now and then [1] and since I think it's
 a
  pretty useful and common feature for search engine nowadays I'd like to
  discuss introduction of facets [2] for the Oak query engine.
 
  Pros: having facets in search results usually helps filtering (drill
 down)
  the results before browsing all of them, so the main usage would be for
  client code.
 
  Impact: probably change / addition in both the JCR and Oak APIs to
 support
  returning other than just nodes (a NodeIterator and a Cursor
  respectively).
 
  Right now a couple of ideas on how we could do that come to my mind,
 both
  based on the approach of having an Oak index for them:
  1. a (multivalued) property index for facets, meaning we would store the
  facets in the repository, so that we would run a query against it to
 have
  the facets of an originating query.
  2. a dedicated QueryIndex implementation, eventually leveraging Lucene
  faceting capabilities, which could use the Lucene index we already
 have,
  together with a sidecar index [3].
 
  What do you think?
  Regards,
  Tommaso
 
  [1] :
 
 http://markmail.org/search/?q=oak%20faceting#query:oak%20faceting%20list%3
  Aorg.apache.jackrabbit.oak-dev+page:1+state:facets
  [2] : http://en.wikipedia.org/wiki/Faceted_search
  [3] :
 
 http://lucene.apache.org/core/4_0_0/facet/org/apache/lucene/facet/doc-file
  s/userguide.html

Re: [DISCUSS] supporting faceting in Oak query engine

2014-08-26 Thread Chetan Mehrotra

This looks useful Tommaso. With OAK-2005 we should be able to support
multiple LuceneIndexes and manage them easily.

If we can abstract all this out and just expose the facet information
as a virtual node, that would simplify things for end users. Probably
we can have a read-only NodeStore impl to expose the faceted data
bound to a system path. Otherwise we would need to expose the Lucene
API and OakDirectory.
Chetan Mehrotra


On Tue, Aug 26, 2014 at 1:28 PM, Tommaso Teofili
tommaso.teof...@gmail.com wrote:
 2014-08-25 19:02 GMT+02:00 Lukas Smith sm...@pooteeweet.org:

 Aloha,


 Aloha!



 you should definitely talk to the HippoCMS developers. They forked
 Jackrabbit 2.x to add faceting as virtual nodes. They ran into some
 performance issues but I am sure they still have valuable feedback on
 this.


 Cool, thanks for letting us know, if you or any other (from Hippo) would
 like to give some more insight on pros and cons of such an approach that'd
 be very good.

 Regards,
 Tommaso



 regards,
 Lukas Kahwe Smith

  On 25 Aug 2014, at 18:43, Laurie Byrum lby...@adobe.com wrote:
 
  Hi Tommaso,
  I am happy to see this thread!
 
  Questions:
  Do you expect to want to support hierarchical or pivoted facets soonish?
  If so, does that influence this decision?
  Do you know how ACLs will come into play with your facet implementation?
  If so, does that influence this decision? :-)
 
  Thanks!
  Laurie
 
 
 
  On 8/25/14 7:08 AM, Tommaso Teofili tommaso.teof...@gmail.com
 wrote:
 
  Hi all,
 
  since this has been asked every now and then [1] and since I think it's
 a
  pretty useful and common feature for search engine nowadays I'd like to
  discuss introduction of facets [2] for the Oak query engine.
 
  Pros: having facets in search results usually helps filtering (drill
 down)
  the results before browsing all of them, so the main usage would be for
  client code.
 
  Impact: probably change / addition in both the JCR and Oak APIs to
 support
  returning other than just nodes (a NodeIterator and a Cursor
  respectively).
 
  Right now a couple of ideas on how we could do that come to my mind,
 both
  based on the approach of having an Oak index for them:
  1. a (multivalued) property index for facets, meaning we would store the
  facets in the repository, so that we would run a query against it to
 have
  the facets of an originating query.
  2. a dedicated QueryIndex implementation, eventually leveraging Lucene
  faceting capabilities, which could use the Lucene index we already
 have,
  together with a sidecar index [3].
 
  What do you think?
  Regards,
  Tommaso
 
  [1] :
 
 http://markmail.org/search/?q=oak%20faceting#query:oak%20faceting%20list%3
  Aorg.apache.jackrabbit.oak-dev+page:1+state:facets
  [2] : http://en.wikipedia.org/wiki/Faceted_search
  [3] :
 
 http://lucene.apache.org/core/4_0_0/facet/org/apache/lucene/facet/doc-file
  s/userguide.html

Re: [DISCUSS] supporting faceting in Oak query engine

2014-08-25 Thread Laurie Byrum

Hi Tommaso,
I am happy to see this thread!

Questions: 
Do you expect to want to support hierarchical or pivoted facets soonish?
If so, does that influence this decision?
Do you know how ACLs will come into play with your facet implementation?
If so, does that influence this decision? :-)

Thanks!
Laurie



On 8/25/14 7:08 AM, Tommaso Teofili tommaso.teof...@gmail.com wrote:

Hi all,

since this has been asked every now and then [1] and since I think it's a
pretty useful and common feature for search engines nowadays I'd like to
discuss the introduction of facets [2] for the Oak query engine.

Pros: having facets in search results usually helps filtering (drill down)
the results before browsing all of them, so the main usage would be for
client code.

Impact: probably a change / addition in both the JCR and Oak APIs to support
returning something other than just nodes (a NodeIterator and a Cursor
respectively).

Right now a couple of ideas on how we could do that come to my mind, both
based on the approach of having an Oak index for them:
1. a (multivalued) property index for facets, meaning we would store the
facets in the repository, so that we would run a query against it to have
the facets of an originating query.
2. a dedicated QueryIndex implementation, possibly leveraging Lucene
faceting capabilities, which could use the Lucene index we already have,
together with a sidecar index [3].

What do you think?
Regards,
Tommaso

[1] : http://markmail.org/search/?q=oak%20faceting#query:oak%20faceting%20list%3Aorg.apache.jackrabbit.oak-dev+page:1+state:facets
[2] : http://en.wikipedia.org/wiki/Faceted_search
[3] : http://lucene.apache.org/core/4_0_0/facet/org/apache/lucene/facet/doc-files/userguide.html

Re: [DISCUSS] supporting faceting in Oak query engine

2014-08-25 Thread Lukas Smith

Aloha,

you should definitely talk to the HippoCMS developers. They forked Jackrabbit
2.x to add faceting as virtual nodes. They ran into some performance issues,
but I am sure they still have valuable feedback on this.

regards,
Lukas Kahwe Smith

 On 25 Aug 2014, at 18:43, Laurie Byrum lby...@adobe.com wrote:
 
 Hi Tommaso,
 I am happy to see this thread!
 
 Questions: 
 Do you expect to want to support hierarchical or pivoted facets soonish?
 If so, does that influence this decision?
 Do you know how ACLs will come into play with your facet implementation?
 If so, does that influence this decision? :-)
 
 Thanks!
 Laurie
 
 
 
 On 8/25/14 7:08 AM, Tommaso Teofili tommaso.teof...@gmail.com wrote:
 
 Hi all,
 
 since this has been asked every now and then [1] and since I think it's a
 pretty useful and common feature for search engines nowadays, I'd like to
 discuss the introduction of facets [2] for the Oak query engine.
 
 Pros: having facets in search results usually helps filtering (drill down)
 the results before browsing all of them, so the main usage would be for
 client code.
 
 Impact: probably a change / addition in both the JCR and Oak APIs to support
 returning something other than just nodes (a NodeIterator and a Cursor
 respectively).
 
 Right now a couple of ideas on how we could do that come to my mind, both
 based on the approach of having an Oak index for them:
 1. a (multivalued) property index for facets, meaning we would store the
 facets in the repository, so that we would run a query against it to have
 the facets of an originating query.
 2. a dedicated QueryIndex implementation, eventually leveraging Lucene
 faceting capabilities, which could use the Lucene index we already have,
 together with a sidecar index [3].
 
 What do you think?
 Regards,
 Tommaso
 
 [1] : http://markmail.org/search/?q=oak%20faceting#query:oak%20faceting%20list%3Aorg.apache.jackrabbit.oak-dev+page:1+state:facets
 [2] : http://en.wikipedia.org/wiki/Faceted_search
 [3] : http://lucene.apache.org/core/4_0_0/facet/org/apache/lucene/facet/doc-files/userguide.html
