Re: Determine whether a MatchAllQuery or a Query with at least one Term

2015-11-29 Thread Sandeep Khanzode
Hi Uwe,
Thanks.
Actually, I do use that logic in another part of the code for some other 
functionality :).
However, I was wondering if we have some direct API to check for the presence 
of terms (Terms, NumericRanges, etc.) given an abstract query. My requirement 
is simple: irrespective of the Query subclass implementation (which may be 
extended or changed in the future), I want to check whether the net effect of 
this query (boolean or otherwise) is a MatchAllQuery (i.e. one without any 
terms) or a query with at least one term or numeric range.
The alternative of traversing the boolean hierarchy and checking instanceof on 
every clause for a Query subclass is involved, cumbersome, and error-prone. 
Your thoughts?
---Thanks n Regards,
Sandeep Ramesh Khanzode 


On Saturday, November 28, 2015 5:29 PM, Uwe Schindler <u...@thetaphi.de> 
wrote:
 

 Hi,

You can also traverse a BooleanQuery. Just do instanceof BooleanQuery checks, 
and if it is a BooleanQuery, recursively iterate over all clauses (you can use 
a BooleanQuery in a for-each Java loop, as it implements Iterable). For each 
clause, recurse and check types again. Then you should be able to detect all 
types of queries in the tree.
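
For example, a minimal sketch of that recursion (4.x sketch; here anything that
is neither a BooleanQuery nor a MatchAllDocsQuery is treated as a "real"
clause, which you can adapt to your needs):

=
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.Query;

// Returns true if the query tree contains any clause other than a
// MatchAllDocsQuery, i.e. something that actually constrains documents.
public static boolean hasRealClause(Query q) {
  if (q instanceof MatchAllDocsQuery) {
    return false;
  }
  if (q instanceof BooleanQuery) {
    // BooleanQuery implements Iterable<BooleanClause>, so for-each works.
    for (BooleanClause clause : (BooleanQuery) q) {
      if (hasRealClause(clause.getQuery())) {
        return true;
      }
    }
    return false;
  }
  // TermQuery, NumericRangeQuery, etc. all constrain documents.
  return true;
}
=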

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -Original Message-
> From: Sandeep Khanzode [mailto:sandeep_khanz...@yahoo.com.INVALID]
> Sent: Saturday, November 28, 2015 12:22 PM
> To: java-user@lucene.apache.org
> Subject: Re: Determine whether a MatchAllQuery or a Query with at least
> one Term
> 
> Hi,
> Actually, the MatchAllQuery, for all I know (since it is invoked by the
> client), can be wrapped in a BooleanQuery. Hence, it is difficult for me to
> traverse the BooleanQuery clauses and determine MatchAll, whereas there may
> be other clauses which do contain a TermQuery or a NumericRangeQuery, in
> which case a MatchAllQuery check is futile.
> Given any query, BooleanQuery or MatchAll, or a specific subclass of Query,
> what would be the safe way to determine that this is not a MatchAll query
> without any terms, or whether this is a query that contains at least one
> term or range?
> ---Thanks n Regards,
> Sandeep Ramesh Khanzode
> 
> 
>    On Saturday, November 28, 2015 12:30 PM, Michael Wilkowski
> <m...@silenteight.com> wrote:
> 
> 
>  Instanceof?
> 
> MW
> Sent from Mi phone
> On 28 Nov 2015 06:57, "Sandeep Khanzode"
> <sandeep_khanz...@yahoo.com.invalid>
> wrote:
> 
> > Hi,
> > I have a question.
> > In my program, I need to check whether the input query is a MatchAll
> Query
> > that contains no terms, or a Query (any variant) that has at least one
> > term. For typical Term queries, this seems reasonable to be done with
> > Query.extractTerms(Set<Term> terms), which gives the set of terms.
> > However, when there is a NumericRangeQuery, this method throws an
> > UnsupportedOperationException.
> > How can I determine that a NumericRangeQuery or any non-Term query
> exists
> > in the Input Query and differentiate it from the MatchAllQuery? -- SRK
> 
> 
> 




  

Re: Determine whether a MatchAllQuery or a Query with at least one Term

2015-11-28 Thread Sandeep Khanzode
Hi,
Actually, the MatchAllQuery, for all I know (since it is invoked by the client), 
can be wrapped in a BooleanQuery. Hence, it is difficult for me to traverse 
the BooleanQuery clauses and determine MatchAll, whereas there may be other 
clauses which do contain a TermQuery or a NumericRangeQuery, in which case a 
MatchAllQuery check is futile.
Given any query, BooleanQuery or MatchAll, or a specific subclass of Query, what 
would be the safe way to determine that this is not a MatchAll query without 
any terms, or whether this is a query that contains at least one term or range? 
---Thanks n Regards,
Sandeep Ramesh Khanzode 


On Saturday, November 28, 2015 12:30 PM, Michael Wilkowski 
<m...@silenteight.com> wrote:
 

 Instanceof?

MW
Sent from Mi phone
On 28 Nov 2015 06:57, "Sandeep Khanzode" <sandeep_khanz...@yahoo.com.invalid>
wrote:

> Hi,
> I have a question.
> In my program, I need to check whether the input query is a MatchAll Query
> that contains no terms, or a Query (any variant) that has at least one
> term. For typical Term queries, this seems reasonable to be done with
> Query.extractTerms(Set<Term> terms), which gives the set of terms.
> However, when there is a NumericRangeQuery, this method throws an
> UnsupportedOperationException.
> How can I determine that a NumericRangeQuery or any non-Term query exists
> in the Input Query and differentiate it from the MatchAllQuery? -- SRK


  

Determine whether a MatchAllQuery or a Query with at least one Term

2015-11-27 Thread Sandeep Khanzode
Hi,
I have a question.
In my program, I need to check whether the input query is a MatchAll Query that 
contains no terms, or a Query (any variant) that has at least one term. For 
typical Term queries, this seems reasonable to be done with 
Query.extractTerms(Set<Term> terms), which gives the set of terms. 
However, when there is a NumericRangeQuery, this method throws an 
UnsupportedOperationException.
How can I determine that a NumericRangeQuery or any non-Term query exists in 
the Input Query and differentiate it from the MatchAllQuery? -- SRK
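
For reference, the check I am attempting looks roughly like this (4.x sketch;
the catch block is where the NumericRangeQuery case lands):

=
import java.util.HashSet;
import java.util.Set;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;

// Returns true if the query enumerates at least one term. Queries that
// cannot enumerate terms (e.g. an unrewritten NumericRangeQuery) still
// constrain documents, so they are counted as "has terms" here.
public static boolean hasAtLeastOneTerm(Query query) {
  Set<Term> terms = new HashSet<Term>();
  try {
    query.extractTerms(terms);
    return !terms.isEmpty();
  } catch (UnsupportedOperationException e) {
    return true;
  }
}
=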

Upgrading Lucene Indices and maintaining same resultset

2015-05-27 Thread Sandeep Khanzode
Hi All,
We have a Lucene 3.6-based index set which is quite large and currently in use. 
What will be the upgrade path to (a) 4.x or (b) 5.x, with respect to the data 
migration, etc.? What are the steps, and is it technically possible? I read 
that upgrading directly from 3.x to 5.x is not possible and throws 
IndexFormatTooOldException. Can we do it in two hops, i.e. from 3.x to 4.x and 
then 4.x to 5.x?
If I have a set of documents that have already been indexed with Lucene 3.6 and 
somehow we are able to upgrade to Lucene 4.x (or maybe 5.x), how can we make 
sure that we will get the same set of results? I am not sure, but I will check 
the analyzers and tokenizers used in the 3.6 version. If we could somehow carry 
those over to 5.x, will we be guaranteed the same set of results? Or are there 
other considerations to get the same set of results? - SRK
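
For the two-hop idea, I am assuming something like Lucene's IndexUpgrader, run
once per major version: first with the 4.x jar against the 3.6 index, then with
the 5.x jar (sketch; the path is illustrative):

=
import java.io.File;
import org.apache.lucene.index.IndexUpgrader;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class UpgradeIndex {
  public static void main(String[] args) throws Exception {
    // 4.x variant: FSDirectory.open takes a java.io.File
    // (in 5.x it takes a java.nio.file.Path instead).
    Directory dir = FSDirectory.open(new File("/path/to/index"));
    new IndexUpgrader(dir).upgrade(); // rewrites all segments in the current format
    dir.close();
  }
}
=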

BitSet in Filters

2014-08-12 Thread Sandeep Khanzode
Hi,
 
The current usage of BitSets in filters in Lucene is limited to applying only 
on docIDs i.e. I can only construct a filter out of a BitSet if I have the 
DocumentIDs handy.

However, with every update/delete i.e. CRUD modification, these will change, 
and I have to again redo the whole process to fetch the latest docIDs. 

Assume a scenario where I need to tag millions of documents with a tag like 
Finance, IT, Legal, etc.

Unless, I can cache these filters in memory, the cost of constructing this 
filter at run time per query is not practical. If I could map the documents to 
a numeric long identifier and put them in a BitMap, I could then cache them 
because the size reduces drastically. However, I cannot use this numeric long 
identifier in Lucene filters because it is not a docID but another regular 
field.

Please help with this scenario. Thanks,

---
Thanks n Regards,
Sandeep Ramesh Khanzode

Re: BitSet in Filters

2014-08-12 Thread Sandeep Khanzode
Hi Erick,

I have mentioned everything that is relevant, I believe :).

However, just to give more background: Assume documents of the order of more 
than 300 million, and multiple concurrent users running search. I may front 
Lucene with ElasticSearch, and ES basically calls Lucene TermFilters. My 
filters are broad in nature, so you can take it that any time I filter on a 
tag, it would run into, easily, millions of documents to be accepted in the 
filter.

The only filter that uses a BitSet works with document IDs in Lucene. I would 
have wanted this bitset approach to work on some other regular numeric long 
field so that we can scale, which does not seem likely if I have to use an 
ArrayList of Longs for TermFilters.

Hope that makes the scenario more clear. Please let me know your thoughts.
 
---
Thanks n Regards,
Sandeep Ramesh Khanzode


On Tuesday, August 12, 2014 8:41 PM, Erick Erickson erickerick...@gmail.com 
wrote:
 


bq: Unless, I can cache these filters in memory, the cost of constructing this 
filter at run time per query is not practical

Why do you say that? Do you have evidence? Because lots and lots of Solr 
installations do exactly this and they run fine.

So I suspect there's something you're not telling us about your setup. Are you, 
say, soft committing often? Do you have autowarming specified? 

You're not going to be able to keep your filters based on some other field in 
the document. Internally, Lucene uses the internal doc ID as an index into the 
bitset. That's baked in to very low levels and isn't going to change AFAIK.
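
For reference, a rough sketch of that cached-filter pattern (4.x API; the
"tag" field and the searcher/query variables are illustrative):

=
import org.apache.lucene.index.Term;
import org.apache.lucene.queries.TermFilter;
import org.apache.lucene.search.CachingWrapperFilter;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.TopDocs;

// Build once and reuse across queries: the wrapper caches the per-segment
// DocIdSet (keyed by internal doc IDs) until the reader is reopened.
Filter financeFilter =
    new CachingWrapperFilter(new TermFilter(new Term("tag", "Finance")));
TopDocs hits = searcher.search(query, financeFilter, 100);
=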

Best,
Erick



On Mon, Aug 11, 2014 at 11:53 PM, Sandeep Khanzode 
sandeep_khanz...@yahoo.com.invalid wrote:

Hi,
 
The current usage of BitSets in filters in Lucene is limited to applying only 
on docIDs i.e. I can only construct a filter out of a BitSet if I have the 
DocumentIDs handy.

However, with every update/delete i.e. CRUD modification, these will change, 
and I have to again redo the whole process to fetch the latest docIDs. 

Assume a scenario where I need to tag millions of documents with a tag like 
Finance, IT, Legal, etc.

Unless, I can cache these filters in memory, the cost of constructing this 
filter at run time per query is not practical. If I could map the documents to 
a numeric long identifier and put them in a BitMap, I could then cache them 
because the size reduces drastically. However, I cannot use this numeric long 
identifier in Lucene filters because it is not a docID but another regular 
field.

Please help with this scenario. Thanks,

---
Thanks n Regards,
Sandeep Ramesh Khanzode

Sort, Search Facets

2014-07-08 Thread Sandeep Khanzode
Hi,
 
I am using Lucene 4.7.2 and my primary use case for Lucene is to do three 
things: (a) search, (b) sort by a number of fields for the search results, and 
(c) facet on probably an equal number of fields (probably the most standard use 
cases anyway).

Let us say I have a corpus of more than 100M docs, with each document having 
approx. 10-15 fields excluding the content (body), which will also be one of 
the fields. Out of those 10-15, I have a requirement to enable sorting on all 
of them, and facets as well. That makes a total of approx. ~45 fields to be 
indexed for various reasons: each one once as a String/Long/TextField, once as 
a SortedDocValuesField, and once as a FacetField.
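
Concretely, per logical field I mean roughly this (4.7 sketch; "author" is an
illustrative field name, and the facet side must go through FacetsConfig.build):

=
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.SortedDocValuesField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.facet.FacetField;
import org.apache.lucene.util.BytesRef;

Document doc = new Document();
doc.add(new StringField("author", "smith", Field.Store.NO));        // search
doc.add(new SortedDocValuesField("author", new BytesRef("smith"))); // sort
doc.add(new FacetField("authorFacet", "smith"));                    // facet
// writer.addDocument(facetsConfig.build(taxoWriter, doc));
=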

What will be the impact of this on the indexing operation w.r.t. the time taken 
as well as the extra disk space required? Will it grow linearly with the 
increase in the number of fields?

What is the impact on the memory usage during search time?


I will attempt to benchmark some of these, but if you have any experience with 
this, request you to share the details. Thanks,

---
Thanks n Regards,
Sandeep Ramesh Khanzode

DocIDs from Facet Results

2014-07-07 Thread Sandeep Khanzode
Hi,

For Lucene 4.7.2 Facets, once we invoke FacetsCollector and get the topNChildren 
into FacetResult, is there any mechanism by which, for a particular search 
result, I could get the docIDs corresponding to any facet?

Say, I have a facet defined on Field1. Upon Search and FacetCollection, I get 
FVal1, FVal2, and FVal3 as top3Children along with their corresponding counts. 
Can I look into (a) Field1 and get all docIDs, or (b) FVal1 or FVal2 or FVal3 
and get their corresponding docIds?
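
For illustration, one candidate mechanism I am looking at is drilling down on a
single facet value and collecting the hits (4.7 sketch; facetsConfig,
originalQuery, and searcher are assumed to exist):

=
import org.apache.lucene.facet.DrillDownQuery;
import org.apache.lucene.search.TopDocs;

// Restrict the original query to documents whose Field1 facet is FVal1;
// the matching internal docIDs come back in hits.scoreDocs[i].doc.
DrillDownQuery ddq = new DrillDownQuery(facetsConfig, originalQuery);
ddq.add("Field1", "FVal1");
TopDocs hits = searcher.search(ddq, 1000);
=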
 
---
Thanks n Regards,
Sandeep Ramesh Khanzode

Incremental Field Updates

2014-07-01 Thread Sandeep Khanzode
Hi,

I wanted to know of the best approach to follow if a few fields in my indexed 
documents are changing at run time (after index and before or during search), 
but a majority of them are created at index time.

I could see the JIRA given below but it is scheduled for Lucene 4.9, I believe. 
 

There are a few other approaches, like maintaining a separate index for 
changing fields and using either a ParallelReader or a Join.

Can everyone share their experience for this scenario on how it is handled in 
your systems? Thanks,

[LUCENE-4258] Incremental Field Updates through Stacked Segments - ASF JIRA:
"Shai and I would like to start working on the proposal to Incremental Field
Updates outlined here (http://markmail.org/message/zhrdxxpfk6qvdaex)."
 
 
---
Thanks n Regards,
Sandeep Ramesh Khanzode

Re: Incremental Field Updates

2014-07-01 Thread Sandeep Khanzode
Hi Shai,

So one follow-up question.

Assume that my use case is to have approx. ~50M documents indexed with each 
document having about ~10-15 indexed but not stored fields. These fields will 
never change, but there are another ~5-6 fields that will change and will 
continue to change after the index is written. These ~5-6 fields may also be 
multivalued. The size of this index turns out to be ~120GB. 

In this case, I would like to sort or facet or search on these ~5-6 fields. 
Which approach do you suggest? Should I use BinaryDocValues and update using 
IndexWriter, or use either a ParallelReader or a Join query?
 
---
Thanks n Regards,
Sandeep Ramesh Khanzode


On Tuesday, July 1, 2014 9:53 PM, Shai Erera ser...@gmail.com wrote:
 


Except that Lucene now offers efficient numeric and binary DocValues
updates. See IndexWriter.updateNumeric/Binary...
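
A rough sketch of such an update (API available since 4.6; the id term and
field name are illustrative, and writer is an open IndexWriter):

=
import org.apache.lucene.index.Term;

// Update the "marker" NumericDocValues of the document matching the id term,
// without re-indexing the whole document.
writer.updateNumericDocValue(new Term("id", "doc-42"), "marker", 3L);
writer.commit();
=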

On Jul 1, 2014 5:51 PM, Erick Erickson erickerick...@gmail.com wrote:

 This JIRA is complicated, don't really expect it in 4.9 as it's
 been hanging around for quite a while. Everyone would like this,
 but it's not easy.

 Atomic updates will work, but you have to set stored=true for all
 source fields. Under the covers this actually reads the document
 out of the stored fields, deletes the old one and adds it
 over again.

 FWIW,
 Erick

 On Tue, Jul 1, 2014 at 5:32 AM, Sandeep Khanzode
 sandeep_khanz...@yahoo.com.invalid wrote:
  Hi,
 
  I wanted to know of the best approach to follow if a few fields in my
 indexed documents are changing at run time (after index and before or
 during search), but a majority of them are created at index time.
 
  I could see the JIRA given below but it is scheduled for Lucene 4.9, I
 believe.
 
  There are a few other approaches, like maintaining a separate index for
 changing fields and use either a parallelreader or use a Join.
 
  Can everyone share their experience for this scenario on how it is
 handled in your systems? Thanks,
 
  [LUCENE-4258] Incremental Field Updates through Stacked Segments - ASF JIRA:
  "Shai and I would like to start working on the proposal to Incremental
  Field Updates outlined here (http://markmail.org/message/zhrdxxpfk6qvdaex)."
 
 
  ---
  Thanks n Regards,
  Sandeep Ramesh Khanzode




IndexDocValues

2014-06-27 Thread Sandeep Khanzode
I came across this type when I checked this blog: 
http://blog.trifork.com/2011/10/27/introducing-lucene-index-doc-values/
 
The blog mentions that IndexDocValues are indexed specifically for sorting and 
reduce the overhead created by the FieldCache.

I could not locate this class in the Lucene 4.7.2 hierarchy. Has it been 
replaced by the somewhat similar SortedDocValuesField?

And are there any benchmarks that show the memory and sorting time using this 
field as opposed to sorting on a regular StringField. 
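
For reference, the comparison I have in mind is roughly this (4.x sketch;
"author", searcher, and query are illustrative):

=
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TopDocs;

// Sort by the "author" field; if it was indexed as a SortedDocValuesField,
// the sort reads the doc-values column instead of uninverting the field.
Sort sort = new Sort(new SortField("author", SortField.Type.STRING));
TopDocs td = searcher.search(query, 100, sort);
=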
 
---
Thanks n Regards,
Sandeep Ramesh Khanzode

Searching on Large Indexes

2014-06-27 Thread Sandeep Khanzode
Hi,

I have an index that runs into 200-300GB. It is not frequently updated.

What are the best strategies to query on this index?
1.] Should I, at index time, split the content, like a hash-based partition, 
into multiple separate smaller indexes and aggregate the results 
programmatically (as sketched below)?
2.] Should I replicate this index and provide some sort of document ID, and 
search on each node for a specific range of document IDs?
3.] Is there any way I can split or move individual segments to different nodes 
and aggregate the results?

I am not fully aware of the large scale query strategies. Can you please share 
your findings or experiences? Thanks, 
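
For option 1.], if the partitions live on one machine, I assume something like
MultiReader can aggregate them (4.x sketch; the paths are illustrative):

=
import java.io.File;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.FSDirectory;

// Open each partition and search them as a single logical index.
DirectoryReader part1 = DirectoryReader.open(FSDirectory.open(new File("/idx/part1")));
DirectoryReader part2 = DirectoryReader.open(FSDirectory.open(new File("/idx/part2")));
IndexSearcher searcher = new IndexSearcher(new MultiReader(part1, part2));
=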
 
---
Thanks n Regards,
Sandeep Ramesh Khanzode

SortedDocValuesField

2014-06-26 Thread Sandeep Khanzode
Hi,
 
I was checking the SortedDocValuesField and its performance in sorting, as 
opposed to a normal field (i.e. StringField) and its performance in the same 
sort. So I used the same string/BytesRef value in both fields, and in separate 
JVM processes I launched the two sorts.

I used a RAMDirectory and created a million items. The SortedDocValuesField 
sort took 12/13 seconds and consumed approx 500-550 megs of RAM whereas the 
StringField took 10/11 seconds and consumed 350-400 megs of RAM. 
Is this normal behavior? I was expecting the SDVF to perform better since it is 
indexed for sorting and not stored for any other purpose.

---

Thanks n Regards,
Sandeep Ramesh Khanzode

Re: Custom Sorting

2014-06-25 Thread Sandeep Khanzode
Hi,

Thanks for your reply. 
Actually, I am evaluating both approaches.

With the sort being performed on a field indexed in Lucene itself, my concern 
is with the FieldCache. Very quickly, for multiple clients executing in 
parallel, it bumps up to 8-10GB. This is for 4-5 different sort fields on an 
index corpus of 50M documents. The problem is not so much the memory 
consumption as controlling it. If the max heap argument for the JVM is scaled 
back to 2-3GB, then all clients throw an OOM. Can the FieldCache scale based on 
the max memory available to the JVM, be selectively turned off, or use an 
LRU-type algorithm to purge old entries?

Secondly, regarding the DB approach: yes, it will not perform. However, I just 
wanted to know whether a custom sort function exists that allows one to write 
their own sort on a field that is not indexed by Lucene.

Thanks again,

---
Thanks n Regards,
Sandeep Ramesh Khanzode


On Wednesday, June 25, 2014 1:21 AM, Erick Erickson erickerick...@gmail.com 
wrote:
 


I'm a little confused here. Sure, sorting on a number of fields will
increase memory, the basic idea here is that you need to cache all the
sort values (plus support structures) for performance reasons.

If you create your own custom sort that goes out to a DB and gets the
doc, you have to be prepared for
q=*:*&sort=custom_function
Which means you'll have to fetch the value for each and every document
in the index. If this is a DB call, it will NOT perform.

In order to be performant, you'll need to cache the values. Which is
what is being done _for_ you by the FieldCache.

So I think this is really a false path, or an XY problem. Why do you
think you need to do this?

Best,
Erick


On Tue, Jun 24, 2014 at 10:31 AM, Sandeep Khanzode
sandeep_khanz...@yahoo.com.invalid wrote:
 Hi,

 I am trying to implement a sort order for search results in Lucene 4.7.2.

 If I want to use data for ordering that is not stored in Lucene as Fields, is 
 there any way this can be done?
 Basically, I would have certain data that is associated logically to a 
 document but stored elsewhere, like a DB. Can I create a Custom Sort function 
 on the lines of a FieldComparator to sort based on this data by plugging it 
 inside the sort function?

 I have tested the performance of the Sort function for String and numeric 
 types, and as mentioned in some blog, it seems that the numeric type is much 
 faster compared to the string type. However, if I sort on a number of fields 
 from multiple clients, the memory footprint, due to the FieldCache.DEFAULT 
 impl, increases approximately 5-6 times. If I run this on a machine which 
 does not have this capacity, will I get an OOM or will there be intense 
 thrashing for the memory?


 ---
 Thanks n Regards,
 Sandeep Ramesh Khanzode


Custom Sorting

2014-06-24 Thread Sandeep Khanzode
Hi,

I am trying to implement a sort order for search results in Lucene 4.7.2.

If I want to use data for ordering that is not stored in Lucene as Fields, is 
there any way this can be done?
Basically, I would have certain data that is associated logically to a document 
but stored elsewhere, like a DB. Can I create a Custom Sort function on the 
lines of a FieldComparator to sort based on this data by plugging it inside the 
sort function?  

I have tested the performance of the Sort function for String and numeric 
types, and as mentioned in some blog, it seems that the numeric type is much 
faster compared to the string type. However, if I sort on a number of fields 
from multiple clients, the memory footprint, due to the FieldCache.DEFAULT 
impl, increases approximately 5-6 times. If I run this on a machine which does 
not have this capacity, will I get an OOM or will there be intense thrashing for 
the memory?

 
---
Thanks n Regards,
Sandeep Ramesh Khanzode

Re: Facets in Lucene 4.7.2

2014-06-17 Thread Sandeep Khanzode
*Does commit() stop the world and behave serially to flush the contents?*

On commit there's no stop-the-world activity that's going on. Rather, all
in-memory buffers are flushed, the files are fsync'd, and a new commit point is
generated. Indexing can continue as usual. Concurrency might be affected
though, depending on the speed of your IO system, but there's no intentional
stop-the-world.

* 5.] Does the RAMBufferSizeMB() control the commit intervals, so that when
the limit is reached across all writing threads, the contents are flushed
to disk periodically?*

The RAM buffer limit controls the flush intervals. Commit is an explicit
operation that you have to call yourself, as it's rather expensive (fsync
is expensive). Note that since 4.0, Lucene flushes each thread's indexing
state independently from other threads. So when the RAM buffer fills up, one
thread's indexing state is picked and flushed, while other threads can
continue indexing (whereas before, this flush would be a stop-the-world
action, preventing indexing for a while).
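
A rough sketch of the two knobs (4.x API; dir and analyzer are assumed to
exist):

=
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.util.Version;

IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_47, analyzer);
iwc.setRAMBufferSizeMB(256.0);   // governs automatic flushes, not commits
IndexWriter writer = new IndexWriter(dir, iwc);
// ... addDocument() calls; flushes happen as per-thread buffers fill up ...
writer.commit();                 // explicit, fsync'd commit point
=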

Shai




On Mon, Jun 16, 2014 at 4:57 PM, Sandeep Khanzode 
sandeep_khanz...@yahoo.com.invalid wrote:

 Correction on [4] below. I do get doc/pos/tim/tip/dvd/dvm files in either
 case. What I meant was that the number of those files appears different in
 both cases. Also, does commit() stop the world and behave serially to flush
 the contents?

 ---
 Thanks n Regards,
 Sandeep Ramesh Khanzode


 On Monday, June 16, 2014 7:10 PM, Sandeep Khanzode
 sandeep_khanz...@yahoo.com.INVALID wrote:



 Hi Shai,

 Thanks for the response. Appreciated! I understand that this particular
 use case has to be handled in a different way.

 Can you please help me with the below questions?

 1.] Is there any API that gives me the count of a specific dimension from
 FacetsCollector in response to a search query? Currently, I use
 getTopChildren() with some value and then check the FacetResult object for
 the actual number of dimensions hit along with their occurrences. Also,
 getSpecificValue() does not work without a path attribute to the API.

 2.] Can I find the MAX or MIN value of a Numeric type field written to the
 index?

 3.] I am trying to compare and contrast Lucene Facets with Elastic Search.
 I could determine that ES does search-time faceting and dynamically returns
 the response without any prior faceting during indexing time. If index-time
 lag is not my concern, can I assume that, in general, performance-wise
 Lucene facets would be faster?

 4.] I index a semi-large-ish corpus of 20M files across 50GB. If I do not
 use IndexWriter.commit(), I get standard files like cfe/cfs/si in the index
 directory. However, if I do use the commit(), then as I understand it, the
 state is persisted to the disk. But this time, there are additional file
 extensions like doc/pos/tim/tip/dvd/dvm, etc. I am not sure about this
 difference and its cause.

 5.] Does the RAMBufferSizeMB() control the commit intervals, so that when
 the limit is reached across all writing threads, the contents are flushed
 to disk periodically?

 Appreciate your response to the above queries. Thanks again,



 ---
 Thanks n Regards,
 Sandeep Ramesh Khanzode



 On Sunday, June 15, 2014 10:40 AM, Shai Erera ser...@gmail.com wrote:



 Hi

 Currently there's no way to add e.g. terms to already indexed documents;
 you have to re-index them. The only updatable field type Lucene offers
 currently are DocValues fields. If the list of markers/flags is fixed in
 your case, and you can map them to an integer, I think you could use a
 NumericDocValues field, which supports field-level updates.

 Once you do that, you can then:

 * Count on this field pretty easily. You will need to write a Facets
 implementation, but otherwise it's very easy.

 * Filter queries: you will need to write a Filter which returns a DocIdSet
 of the documents that belong to one category (e.g. Financially Relevant).
 Here you might want to consider caching the result of the Filter, by using
 CachingWrapperFilter.

 It's not the best approach, updatable Terms would better suit your usecase,
 however we don't offer them yet and it will be a while until we do (and IF
 we do). You should also benchmark that approach vs re-indexing the
 documents since the current implementation of updatable doc-values fields
 isn't optimized for a few document updates between index reopens. See here:
 http://shaierera.blogspot.com/2014/04/benchmarking-updatable-docvalues.html

 Shai



 On Fri, Jun 13, 2014 at 10:19 PM, Sandeep Khanzode 
 sandeep_khanz...@yahoo.com.invalid wrote:

  Hi Shai,
 
  Thanks so much for the clear explanation.
 
  I agree on the first question. Taxonomy Writer with a separate index
 would
  probably be my approach too.
 
  For the second question:
  I am a little new to the Facets API so I will try to figure out the
  approach that you outlined below.
 
  However, the scenario is such: Assume a document corpus that is indexed.
  For a user query, a document is returned

Re: Facets in Lucene 4.7.2

2014-06-17 Thread Sandeep Khanzode
Hi,

Thanks for your response. It does sound pretty bad, which is why I am not sure 
whether there is an issue with the code, the index, the searcher, or just the 
machine, as you say. 
I will try with another machine just to make sure and post the results.

Meanwhile, can you tell me if there is anything wrong in the below measurement? 
Or is the API usage or the pattern incorrect?

I used a tool called RAMMap to clean the Windows cache. If I do not, the 
results are very fast as I mentioned already. If I do, then the total time is 
40s. 

Can you please provide any pointers on what could be wrong? I will be checking 
on a Linux box anyway.

=
System.out.println("1. Start Date: " + new Date());
TopDocs topDocs = FacetsCollector.search(searcher, query, 100, fc);
System.out.println("1. End Date: " + new Date());
// Above part takes approx 2-12 seconds depending on the query

System.out.println("2. Start Date: " + new Date());
List<FacetResult> results = new ArrayList<FacetResult>();
Facets facets = new FastTaxonomyFacetCounts(taxoReader, config, fc);
System.out.println("2. End Date: " + new Date());
// Above part takes approx 40-53 seconds depending on the query, for the
// first time on Windows

System.out.println("3. Start Date: " + new Date());
results.add(facets.getTopChildren(1000, "F1"));
results.add(facets.getTopChildren(1000, "F2"));
results.add(facets.getTopChildren(1000, "F3"));
results.add(facets.getTopChildren(1000, "F4"));
results.add(facets.getTopChildren(1000, "F5"));
results.add(facets.getTopChildren(1000, "F6"));
results.add(facets.getTopChildren(1000, "F7"));
System.out.println("3. End Date: " + new Date());
// Above part takes approx less than 1 second
=

---
Thanks n Regards,
Sandeep Ramesh Khanzode


On Tuesday, June 17, 2014 10:15 PM, Shai Erera ser...@gmail.com wrote:
 


Hi

40 seconds for faceted search is ... crazy. Also, note how the times don't
differ much even though the number of hits is much higher (29K vs 15.1M).
That, together with your saying that subsequent queries are much faster (a few
seconds), suggests that something is seriously messed up w/ your environment.
Maybe it's a faulty disk? E.g. after the file system cache is warm, you no
longer hit the disk?

In general, the more hits you have, the more expensive is faceted search.
It's also true for scoring as well (i.e. even without facets). There's just
more work to determine the top results (docs, facets...). With facets, you
can use sampling (see RandomSamplingFacetsCollector), but I would do that
only after you verify that collecting 15M docs is very expensive for you,
even when the file system cache is hot.

I've never seen those numbers before, therefore it's difficult for me to
relate to them.

There's a caching mechanism for facets, through CachedOrdinalsReader. But I
wouldn't go there until you verify that your IO system is good (try another
machine, OS, disk ...) and that the 40s times are truly from the faceting
code.

Shai



On Tue, Jun 17, 2014 at 4:21 PM, Sandeep Khanzode 
sandeep_khanz...@yahoo.com.invalid wrote:

 Hi,

 Thanks again!

 This time, I have indexed data with the following specs. I run into 40+
 seconds for the FastTaxonomyFacetCounts to create all the facets. Is this
 as per your measurements? Subsequent runs fare much better probably because
 of the Windows file system cache. How can I speed this up?
 I believe there was a CategoryListCache earlier. Is there any cache or
 other implementation that I can use?

 Secondly, I had a general question. If I extrapolate these numbers for a
 billion documents, my search and facet number may probably be unusable in a
 real time scenario. What are the strategies employed when you deal with
 such large scale? I am new to Lucene so please also direct me to the
 relevant info sources. Thanks!

 Corpus:
 Count: 20M, Size: 51GB

 Index:
 Size (w/o Facets): 19GB, Size (w/Facets): 20.12GB
 Creation Time (w/o Facets): 3.46hrs, Creation Time (w/Facets): 3.49hrs

 Search Performance:
   With 29055 hits (5 terms in query):
   Query Execution: 8 seconds
   Facet counts execution: 40-45 seconds

   With 4.22M hits (2 terms in query):
   Query Execution: 3 seconds
   Facet counts execution: 42-46 seconds

   With 15.1M hits (1 term in query):
   Query Execution: 2 seconds
   Facet counts execution: 45-53 seconds

   With 6183 hits (5 different values for the same 5 terms),
   without flushing the Windows file cache on the next run:
   Query Execution: 11 seconds
   Facet counts execution: less than 1 second

   With 4.9M hits (1 different value for the 1 term),
   without flushing the Windows file cache on the next run:
   Query Execution: 2 seconds
   Facet counts execution: 3 seconds

Re: Facets in Lucene 4.7.2

2014-06-17 Thread Sandeep Khanzode
If I am counting correctly, the $facets field in the index shows a count of 
approx. 28k. That does not sound like much, I guess. All my facets are flat and 
the FacetsConfig only defines a couple of them to be multi-valued.

Let me know if I am not counting the taxonomy size correctly. The 
taxoReader.getSize() also shows this count.

I will check on a Linux box to make sure. Thanks,
 
---
Thanks n Regards,
Sandeep Ramesh Khanzode


On Tuesday, June 17, 2014 11:28 PM, Shai Erera ser...@gmail.com wrote:
 


Nothing suspicious ... code looks fine. The call to FastTaxoFacetCounts
actually computes the counts ... that's the expensive part of faceted
search.

How big is your taxonomy (number of categories)?
Is it hierarchical (i.e. are your dimensions flat, or deep like A/1/2/3/)?
What does your FacetsConfig look like?

Still, unless maybe your taxonomy is huge (hundreds of millions of
categories), I don't think you could intentionally mess up something that
much to end up w/ 40-45s response times!

Shai


On Tue, Jun 17, 2014 at 8:51 PM, Sandeep Khanzode 
sandeep_khanz...@yahoo.com.invalid wrote:

 Hi,

 Thanks for your response. It does sound pretty bad which is why I am not
 sure whether there is an issue with the code, the index, the searcher, or
 just the machine, as you say.
 I will try with another machine just to make sure and post the results.

 Meanwhile, can you tell me if there is anything wrong in the below
 measurement? Or is the API usage or the pattern incorrect?

 I used a tool called RAMMap to clean the Windows cache. If I do not, the
 results are very fast as I mentioned already. If I do, then the total time
 is 40s.

 Can you please provide any pointers on what could be wrong? I will be
 checking on a Linux box anyway.

 =
 System.out.println("1. Start Date: " + new Date());
 TopDocs topDocs = FacetsCollector.search(searcher, query, 100, fc);
 System.out.println("1. End Date: " + new Date());
 // Above part takes approx 2-12 seconds depending on the query

 System.out.println("2. Start Date: " + new Date());
 List<FacetResult> results = new ArrayList<FacetResult>();
 Facets facets = new FastTaxonomyFacetCounts(taxoReader, config, fc);
 System.out.println("2. End Date: " + new Date());
 // Above part takes approx 40-53 seconds depending on the query, for the
 // first time on Windows

 System.out.println("3. Start Date: " + new Date());
 results.add(facets.getTopChildren(1000, "F1"));
 results.add(facets.getTopChildren(1000, "F2"));
 results.add(facets.getTopChildren(1000, "F3"));
 results.add(facets.getTopChildren(1000, "F4"));
 results.add(facets.getTopChildren(1000, "F5"));
 results.add(facets.getTopChildren(1000, "F6"));
 results.add(facets.getTopChildren(1000, "F7"));
 System.out.println("3. End Date: " + new Date());
 // Above part takes approx less than 1 second
 =

 ---
 Thanks n Regards,
 Sandeep Ramesh Khanzode


 On Tuesday, June 17, 2014 10:15 PM, Shai Erera ser...@gmail.com wrote:



 Hi

 40 seconds for faceted search is ... crazy. Also, note how the times don't
 differ much even though the number of hits is much higher (29K vs 15.1M).
 That, together with your saying that subsequent queries are much faster (a
 few seconds), suggests that something is seriously messed up w/ your
 environment. Maybe it's a faulty disk? E.g. after the file system cache is
 warm, you no longer hit the disk?

 In general, the more hits you have, the more expensive is faceted search.
 It's also true for scoring as well (i.e. even without facets). There's just
 more work to determine the top results (docs, facets...). With facets, you
 can use sampling (see RandomSamplingFacetsCollector), but I would do that
 only after you verify that collecting 15M docs is very expensive for you,
 even when the file system cache is hot.

 I've never seen those numbers before, therefore it's difficult for me to
 relate to them.

 There's a caching mechanism for facets, through CachedOrdinalsReader. But I
 wouldn't go there until you verify that your IO system is good (try another
 machine, OS, disk ...) and that the 40s times are truly from the faceting
 code.

 Shai



 On Tue, Jun 17, 2014 at 4:21 PM, Sandeep Khanzode 
 sandeep_khanz...@yahoo.com.invalid wrote:

  Hi,
 
  Thanks again!
 
  This time, I have indexed data with the following specs. I run into 40+
  seconds for the FastTaxonomyFacetCounts to create all the facets. Is this
  as per your measurements? Subsequent runs fare much better probably
 because
  of the Windows file system cache. How can I speed this up?
  I believe there was a CategoryListCache earlier. Is there any cache or
  other implementation that I can use?
 
  Secondly, I had a general question. If I extrapolate these numbers for a
  billion documents, my search and facet number may probably be unusable
 in a
  real time scenario. What are the strategies employed when you deal

Re: Facets in Lucene 4.7.2

2014-06-16 Thread Sandeep Khanzode
Hi Shai,

Thanks for the response. Appreciated! I understand that this particular use 
case has to be handled in a different way.

Can you please help me with the below questions? 

1.] Is there any API that gives me the count of a specific dimension from 
FacetsCollector in response to a search query? Currently, I use 
getTopChildren() with some value and then check the FacetResult object for the 
actual number of dimensions hit along with their occurrences. Also, 
getSpecificValue() does not work without a path attribute to the API.

2.] Can I find the MAX or MIN value of a Numeric type field written to the 
index?

3.] I am trying to compare and contrast Lucene Facets with Elastic Search. I 
could determine that ES does search-time faceting and dynamically returns the 
response without any prior faceting during indexing time. If index-time lag is 
not my concern, can I assume that, in general, performance-wise Lucene facets 
would be faster?

4.] I index a semi-large-ish corpus of 20M files across 50GB. If I do not use 
IndexWriter.commit(), I get standard files like cfe/cfs/si in the index 
directory. However, if I do use the commit(), then as I understand it, the 
state is persisted to the disk. But this time, there are additional file 
extensions like doc/pos/tim/tip/dvd/dvm, etc. I am not sure about this 
difference and its cause. 

5.] Does the RAMBufferSizeMB() control the commit intervals, so that when the 
limit is reached across all writing threads, the contents are flushed to disk 
periodically?

Appreciate your response to the above queries. Thanks again,

 
---
Thanks n Regards,
Sandeep Ramesh Khanzode


On Sunday, June 15, 2014 10:40 AM, Shai Erera ser...@gmail.com wrote:
 


Hi

Currently there's no way to add e.g. terms to already indexed documents;
you have to re-index them. The only updatable field type Lucene offers
currently are DocValues fields. If the list of markers/flags is fixed in
your case, and you can map them to an integer, I think you could use a
NumericDocValues field, which supports field-level updates.

Once you do that, you can then:

* Count on this field pretty easily. You will need to write a Facets
implementation, but otherwise it's very easy.

* Filter queries: you will need to write a Filter which returns a DocIdSet
of the documents that belong to one category (e.g. Financially Relevant).
Here you might want to consider caching the result of the Filter, by using
CachingWrapperFilter.

It's not the best approach, updatable Terms would better suit your usecase,
however we don't offer them yet and it will be a while until we do (and IF
we do). You should also benchmark that approach vs re-indexing the
documents since the current implementation of updatable doc-values fields
isn't optimized for a few document updates between index reopens. See here:
http://shaierera.blogspot.com/2014/04/benchmarking-updatable-docvalues.html
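
A rough sketch of such a Filter (4.x API; the "marker" field name and the
simple linear scan are illustrative, and you would likely wrap it in a
CachingWrapperFilter):

=
import java.io.IOException;
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.search.Filter;
import org.apache.lucene.util.Bits;
import org.apache.lucene.util.FixedBitSet;

// Accepts documents whose NumericDocValues "marker" field equals the given
// category code.
public class MarkerFilter extends Filter {
  private final long marker;

  public MarkerFilter(long marker) {
    this.marker = marker;
  }

  @Override
  public DocIdSet getDocIdSet(AtomicReaderContext context, Bits acceptDocs)
      throws IOException {
    NumericDocValues values = context.reader().getNumericDocValues("marker");
    if (values == null) {
      return null; // no values in this segment, hence no matches
    }
    int maxDoc = context.reader().maxDoc();
    FixedBitSet bits = new FixedBitSet(maxDoc);
    for (int doc = 0; doc < maxDoc; doc++) {
      if ((acceptDocs == null || acceptDocs.get(doc)) && values.get(doc) == marker) {
        bits.set(doc);
      }
    }
    return bits; // FixedBitSet extends DocIdSet in 4.x
  }
}
=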

Shai



On Fri, Jun 13, 2014 at 10:19 PM, Sandeep Khanzode 
sandeep_khanz...@yahoo.com.invalid wrote:

 Hi Shai,

 Thanks so much for the clear explanation.

 I agree on the first question. Taxonomy Writer with a separate index would
 probably be my approach too.

 For the second question:
 I am a little new to the Facets API so I will try to figure out the
 approach that you outlined below.

 However, the scenario is such: Assume a document corpus that is indexed.
 For a user query, a document is returned and selected by the user for
 editing as part of some use case/workflow. That document is now marked as
 either historically interesting or not, financially relevant, specific to
 media or entertainment domain, etc. by the user. So, essentially the user
 is flagging the document with certain markers.
 Another set of users could possibly want to query on these markers. So,
 lets say, a second user comes along, and wants to see the top documents
 belonging to one category, say, agriculture or farming. Since these markers
 are run time activities, how can I use the facets on them? So, I was
 envisioning facets as the various markers. But, if I constantly re-index or
 update the documents whenever a marker changes, I believe it would not be
 very efficient.

 Is there anything, facets or otherwise, in Lucene that can help me solve
 this use case?

 Please let me know. And, thanks!

 ---
 Thanks n Regards,
 Sandeep Ramesh Khanzode


 On Friday, June 13, 2014 9:51 PM, Shai Erera ser...@gmail.com wrote:



 Hi

 You can check the demo code here:

 https://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_4_8/lucene/demo/src/java/org/apache/lucene/demo/facet/
 .
 This code is updated with each release, so you always get a working code
 examples, even when the API changes.

 If you don't mind managing the sidecar index, which I agree isn't such a
 big deal, then yes - the taxonomy index currently performs the fastest. I
 plan to explore porting the taxonomy-based approach from BinaryDocValues

Re: Facets in Lucene 4.7.2

2014-06-16 Thread Sandeep Khanzode
Correction on [4] below. I do get doc/pos/tim/tip/dvd/dvm files in either case. 
What I meant was that the number of those files appears different in both 
cases. Also, does commit() stop the world and behave serially to flush the 
contents?
 
---
Thanks n Regards,
Sandeep Ramesh Khanzode


On Monday, June 16, 2014 7:10 PM, Sandeep Khanzode 
sandeep_khanz...@yahoo.com.INVALID wrote:
 


Hi Shai,

Thanks for the response. Appreciated! I understand that this particular use 
case has to be handled in a different way.

Can you please help me with the below questions? 

1.] Is there any API that gives me the count of a specific dimension from 
FacetsCollector in response to a search query? Currently, I use 
getTopChildren() with some value and then check the FacetResult object for the 
actual number of dimensions hit along with their occurrences. Also, 
getSpecificValue() does not work without a path attribute to the API.

2.] Can I find the MAX or MIN value of a Numeric type field written to the 
index?

3.] I am trying to compare and contrast Lucene Facets with Elastic Search. I 
could determine that ES does search-time faceting and dynamically returns the 
response without any prior faceting during indexing time. If index-time lag is 
not my concern, can I assume that, in general, performance-wise Lucene facets 
would be faster?

4.] I index a semi-large-ish corpus of 20M files across 50GB. If I do not use 
IndexWriter.commit(), I get standard files like cfe/cfs/si in the index 
directory. However, if I do use the commit(), then as I understand it, the 
state is persisted to the disk. But this time, there are additional file 
extensions like doc/pos/tim/tip/dvd/dvm, etc. I am not sure about this 
difference and its cause. 

5.] Does the RAMBufferSizeMB() control the commit intervals, so that when the 
limit is reached across all writing threads, the contents are flushed to disk 
periodically?

Appreciate your response to the above queries. Thanks again,

 
---
Thanks n Regards,
Sandeep Ramesh Khanzode



On Sunday, June 15, 2014 10:40 AM, Shai Erera ser...@gmail.com wrote:



Hi

Currently there's no way to add e.g. terms to already indexed documents;
you have to re-index them. The only updatable field type Lucene offers
currently are DocValues fields. If the list of markers/flags is fixed in
your case, and you can map them to an integer, I think you could use a
NumericDocValues field, which supports field-level updates.

Once you do that, you can then:

* Count on this field pretty easily. You will need to write a Facets
implementation, but otherwise it's very easy.

* Filter queries: you will need to write a Filter which returns a DocIdSet
of the documents that belong to one category (e.g. Financially Relevant).
Here you might want to consider caching the result of the Filter, by using
CachingWrapperFilter.

It's not the best approach, updatable Terms would better suit your usecase,
however we don't offer them yet and it will be a while until we do (and IF
we do). You should also benchmark that approach vs re-indexing the
documents since the current implementation of updatable doc-values fields
isn't optimized for a few document updates between index reopens. See here:
http://shaierera.blogspot.com/2014/04/benchmarking-updatable-docvalues.html

Shai



On Fri, Jun 13, 2014 at 10:19 PM, Sandeep Khanzode 
sandeep_khanz...@yahoo.com.invalid wrote:

 Hi Shai,

 Thanks so much for the clear explanation.

 I agree on the first question. Taxonomy Writer with a separate index would
 probably be my approach too.

 For the second question:
 I am a little new to the Facets API so I will try to figure out the
 approach that you outlined below.

 However, the scenario is such: Assume a document corpus that is indexed.
 For a user query, a document is returned and selected by the user for
 editing as part of some use case/workflow. That document is now marked as
 either historically interesting or not, financially relevant, specific to
 media or entertainment domain, etc. by the user. So, essentially the user
 is flagging the document with certain markers.
 Another set of users could possibly want to query on these markers. So,
 lets say, a second user comes along, and wants to see the top documents
 belonging to one category, say, agriculture or farming. Since these markers
 are run time activities, how can I use the facets on them? So, I was
 envisioning facets as the various markers. But, if I constantly re-index or
 update the documents whenever a marker changes, I believe it would not be
 very efficient.

 Is there anything, facets or otherwise, in Lucene that can help me solve
 this use case?

 Please let me know. And, thanks!

 ---
 Thanks n Regards,
 Sandeep Ramesh Khanzode


 On Friday, June 13, 2014 9:51 PM, Shai Erera ser...@gmail.com wrote:



 Hi

 You can check the demo code here:

 https://svn.apache.org/repos/asf/lucene/dev/branches

Facets in Lucene 4.7.2

2014-06-13 Thread Sandeep Khanzode
Hi,
 
I am evaluating Lucene Facets for a project. Since there is a lot of change in 
4.7.2 for Facets, I am relying on UTs for reference. Please let me know if 
there are other sources of information. 

I have a couple of questions:

1.] All categories in my application are flat, not hierarchical. But it seems, 
from a few sources, that even so you would want to use a taxonomy-based index 
for performance reasons. It is faster but uses more RAM. Or is the deterrent to 
using it the fact that it is a separate data structure? If one could live with 
the life-cycle management of the extra index, should we go ahead with the 
taxonomy index for better performance across tens of millions of documents? 

Another note to add is that I do not see a scenario wherein I would want to 
re-index my collection over and over again or, in other words, the changes 
would be spread over time. 

2.] I need a type of dynamic facet that allows me to add a flag or marker to 
the document at runtime, since it will change/update every time a user modifies 
or adds to the list of markers. Is this possible to do with the current 
implementation? I believe that currently all faceting is done at indexing 
time.

 
---
Thanks n Regards,
Sandeep Ramesh Khanzode

Re: Facets in Lucene 4.7.2

2014-06-13 Thread Sandeep Khanzode
Hi Shai,
 
Thanks so much for the clear explanation.

I agree on the first question. Taxonomy Writer with a separate index would 
probably be my approach too.

For the second question:
I am a little new to the Facets API so I will try to figure out the approach 
that you outlined below.

However, the scenario is such: Assume a document corpus that is indexed. For a 
user query, a document is returned and selected by the user for editing as part 
of some use case/workflow. That document is now marked as either historically 
interesting or not, financially relevant, specific to media or entertainment 
domain, etc. by the user. So, essentially the user is flagging the document 
with certain markers.
Another set of users could possibly want to query on these markers. So, let's 
say a second user comes along and wants to see the top documents belonging to 
one category, say, agriculture or farming. Since these markers are run-time 
activities, how can I use the facets on them? So, I was envisioning facets as 
the various markers. But, if I constantly re-index or update the documents 
whenever a marker changes, I believe it would not be very efficient. 

Is there anything, facets or otherwise, in Lucene that can help me solve this 
use case? 

Please let me know. And, thanks!

---
Thanks n Regards,
Sandeep Ramesh Khanzode


On Friday, June 13, 2014 9:51 PM, Shai Erera ser...@gmail.com wrote:
 


Hi

You can check the demo code here:
https://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_4_8/lucene/demo/src/java/org/apache/lucene/demo/facet/.
This code is updated with each release, so you always get a working code
examples, even when the API changes.

If you don't mind managing the sidecar index, which I agree isn't such a
big deal, then yes - the taxonomy index currently performs the fastest. I
plan to explore porting the taxonomy-based approach from BinaryDocValues to
the new SortedNumericDocValues (coming out in 4.9) since it might perform
even faster.

I didn't quite get the marker/flag facet. Can you give an example? For
instance, if you can model that as a NumericDocValuesField added to
documents (w/ the different markers/flags translated to numbers), then you
can use Lucene's updatable numeric DocValues and write a custom Facets to
aggregate on that NumericDocValues field.

Shai



On Fri, Jun 13, 2014 at 11:48 AM, Sandeep Khanzode 
sandeep_khanz...@yahoo.com.invalid wrote:

 Hi,

 I am evaluating Lucene Facets for a project. Since there is a lot of
 change in 4.7.2 for Facets, I am relying on UTs for reference. Please let
 me know if there are other sources of information.

 I have a couple of questions:

 1.] All categories in my application are flat, not hierarchical. But, it
 seems from a few sources, that even that notwithstanding, you would want to
  use a taxonomy-based index for performance reasons. It is faster but uses
  more RAM. Or is the deterrent to using it the fact that it is a separate
  data structure? If one could live with the life-cycle management of the extra
 index, should we go ahead with the taxonomy index for better performance
 across tens of millions of documents?

 Another note to add is that I do not see a scenario wherein I would want
 to re-index my collection over and over again or, in other words, the
 changes would be spread over time.

 2.] I need a type of dynamic facet that allows me to add a flag or marker
 to the document at runtime since it will change/update every time a user
 modifies or adds to the list of markers. Is this possible to do with the
  current implementation? I believe that currently all faceting is
 done at indexing time.


 ---
 Thanks n Regards,
 Sandeep Ramesh Khanzode