Re: Lucene Facets Module 4.8.1

2014-06-23 Thread Shai Erera
There is no sample code for doing that, but it's quite straightforward - if
you know you indexed some dimensions under different indexFieldNames,
initialize a facet counts instance per such field name, e.g.:

FastTaxonomyFacetCounts defaultCounts = new FastTaxonomyFacetCounts(...); // for your regular facets
FastTaxonomyFacetCounts cityCounts = new FastTaxonomyFacetCounts(...); // for your CITY facets

Something like that...
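A fuller sketch of that flow against the 4.8.x API (a hedged example - the
"city" index field name matches the thread below; searcher, query,
taxoReader and config are assumed to exist already):

FacetsCollector fc = new FacetsCollector();
FacetsCollector.search(searcher, query, 10, fc); // collect matching docs once

// One counts instance per indexFieldName used at index time.
Facets defaultCounts = new FastTaxonomyFacetCounts(taxoReader, config, fc);
Facets cityCounts = new FastTaxonomyFacetCounts("city", taxoReader, config, fc);

FacetResult topCities = cityCounts.getTopChildren(10, "CITY");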

Shai


On Mon, Jun 23, 2014 at 9:04 AM, Jigar Shah  wrote:

> On commenting out
>
> //config.setIndexFieldName("CITY", "city");
>
> at search time (this is before I call getTopChildren(...)), I get the
> following exception.
>
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 2
> at org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts.count(FastTaxonomyFacetCounts.java:74)
> [lucene-facet-4.8.1.jar:4.8.1 1594670 - rmuir - 2014-05-14 19:23:23]
> at org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts.<init>(FastTaxonomyFacetCounts.java:49)
> [lucene-facet-4.8.1.jar:4.8.1 1594670 - rmuir - 2014-05-14 19:23:23]
> at org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts.<init>(FastTaxonomyFacetCounts.java:39)
> [lucene-facet-4.8.1.jar:4.8.1 1594670 - rmuir - 2014-05-14 19:23:23]
> at org.apache.lucene.facet.DrillSideways.buildFacetsResult(DrillSideways.java:110)
> [lucene-facet-4.8.1.jar:4.8.1 1594670 - rmuir - 2014-05-14 19:23:23]
> at org.apache.lucene.facet.DrillSideways.search(DrillSideways.java:177)
> [lucene-facet-4.8.1.jar:4.8.1 1594670 - rmuir - 2014-05-14 19:23:23]
> at org.apache.lucene.facet.DrillSideways.search(DrillSideways.java:203)
> [lucene-facet-4.8.1.jar:4.8.1 1594670 - rmuir - 2014-05-14 19:23:23]
>
> Application-level exceptions follow.
> ...
> ...
>
>
>
> On Sat, Jun 21, 2014 at 10:56 PM, Michael McCandless <
> luc...@mikemccandless.com> wrote:
>
> > Are you sure it's the same FacetsConfig at search time?  Because the
> > exception implies your CITY field didn't have
> > config.setIndexFieldName("CITY", "city") called.
> >
> > Or, can you try commenting out 'config.setIndexFieldName("CITY",
> > "city")' at index time and see if the exception still happens?
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> >
> > On Sat, Jun 21, 2014 at 1:08 AM, Jigar Shah 
> wrote:
> > > Thanks for helping me.
> > >
> > > Yes, I did a couple of things.
> > >
> > > Below is the simple indexing code which I use:
> > >
> > > TrackingIndexWriter nrtWriter = ...
> > > DirectoryTaxonomyWriter taxoWriter = ...
> > > 
> > > FacetsConfig config = new FacetsConfig();
> > > config.setHierarchical("CITY", true);
> > > config.setMultiValued("CITY", true);
> > > config.setIndexFieldName("CITY", "city"); // I kept dimName different from indexFieldName
> > > 
> > > Added indexing searchable fields...
> > > 
> > >
> > > doc.add(new FacetField("CITY", "India", "Gujarat", "Vadodara"));
> > > doc.add(new FacetField("CITY", "India", "Gujarat", "Ahmedabad"));
> > >
> > > nrtWriter.addDocument(config.build(taxoWriter, doc));
> > >
> > > Below is the code which I use for searching:
> > >
> > > TaxonomyReader taxoReader = new DirectoryTaxonomyReader(taxoWriter);
> > >
> > > Query query = ...
> > > IndexSearcher searcher = ...
> > > DrillDownQuery ddq = new DrillDownQuery(config, query);
> > > DrillSideways ds = new DrillSideways(searcher, config, taxoReader); // Config object is the same one I created before
> > > DrillSidewaysResult result = ds.search(ddq, null, null, start + limit, null, true, true);
> > > ...
> > > Facets f = result.facets;
> > > FacetResult fr = f.getTopChildren(5, "CITY"); // Exception is generated here.
> > > // I didn't really perform any drill-down; it's just the original query,
> > > // but wrapped in a DrillDownQuery.
> > >
> > > ... and the line below gives me an empty collection.
> > >
> > > List<FacetResult> frs = f.getAllDims(5);
> > >
> > > I debugged the source code and found that it internally calls
> > >
> > > FastTaxonomyFacetCounts(indexFieldName, taxoReader, config) // Config
> > > object is the same one I created before
> > >
> > > which then calls
> > >
> > > IntTaxonomyFacets(indexFieldName, taxoReader, config) // Config object
> > > is the same one I created before
> > >
> > > and during these calls the value of indexFieldName is "$facets", defined
> > > by the constant 'public static final String DEFAULT_INDEX_FIELD_NAME =
> > > "$facets";' in FacetsConfig.
> > >
> > > My question is: if I am using the same FacetsConfig while indexing and
> > > searching, why is it not picking up the correct field name, and falling
> > > back to "$facets"?
> > >
> > > Please correct me if I have understood this wrong, or suggest the
> > > correct way to solve the above problem.
> > >
> > > Many Thanks.
> > > Jigar Shah.
> >
> > -
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >
> >
>


frozen in PriorityQueue.downHeap for more than 25 minutes

2014-06-23 Thread Jamie

Hi

While running a search over several million documents, the YourKit
profiler reports a deadlock on the following method. Any ideas?


search worker <--- Frozen for at least 25m 37 sec
org.apache.lucene.util.PriorityQueue.downHeap()
org.apache.lucene.util.PriorityQueue.updateTop()
org.apache.lucene.search.TopFieldCollector$OneComparatorNonScoringCollector.updateBottom(int)
org.apache.lucene.search.TopFieldCollector$OutOfOrderOneComparatorNonScoringCollector.collect(int)
org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll(Collector, Scorer)
org.apache.lucene.search.Weight$DefaultBulkScorer.score(Collector, int)
org.apache.lucene.search.BulkScorer.score(Collector)
org.apache.lucene.search.IndexSearcher.search(List, Weight, Collector)
org.apache.lucene.search.IndexSearcher.search(List, Weight, FieldDoc, int, Sort, boolean, boolean, boolean)
org.apache.lucene.search.IndexSearcher$SearcherCallableWithSort.call() <2 recursive calls>
java.util.concurrent.FutureTask.run()
java.util.concurrent.Executors$RunnableAdapter.call()
java.util.concurrent.FutureTask.run()
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker)
java.util.concurrent.ThreadPoolExecutor$Worker.run()
java.lang.Thread.run()


search worker <--- Frozen for at least 25m 38 sec
[identical stack to the first worker]


search worker <--- Frozen for at least 25m 37 sec
org.apache.lucene.util.PriorityQueue.downHeap()
org.apache.lucene.util.PriorityQueue.pop()
org.apache.lucene.search.TopFieldCollector.populateResults(ScoreDoc[], int)
org.apache.lucene.search.TopDocsCollector.topDocs(int, int)
org.apache.lucene.search.TopDocsCollector.topDocs()
org.apache.lucene.search.IndexSearcher.search(List, Weight, FieldDoc, int, Sort, boolean, boolean, boolean)
org.apache.lucene.search.IndexSearcher$SearcherCallableWithSort.call() <2 recursive calls>
java.util.concurrent.FutureTask.run()
java.util.concurrent.Executors$RunnableAdapter.call()
java.util.concurrent.FutureTask.run()
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker)
java.util.concurrent.ThreadPoolExecutor$Worker.run()
java.lang.Thread.run()


search worker <--- Frozen for at least 25m 37 sec
[identical stack to the first worker]

Much appreciated.

Jamie





-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: frozen in PriorityQueue.downHeap for more than 25 minutes

2014-06-23 Thread Toke Eskildsen
On Mon, 2014-06-23 at 13:33 +0200, Jamie wrote:
> While running a search over several million documents, the YourKit
> profiler reports a deadlock on the following method. Any ideas?

> search worker <--- Frozen for at least 25m 37 sec
> org.apache.lucene.util.PriorityQueue.downHeap()

My guess is that you are requesting several million documents as your
result set, instead of just the top 10 or top 100. 

The heap implementation used by Lucene does not play well with large
result sets. Performance is bad and it allocates an excessive amount of
objects: Your machine is probably busy garbage collecting. The quick fix
is to allocate more memory for Java.

This is not a fault in the implementation as such, but rather the result
of using a heap for a large result set. If you really need a large
result set, I recommend you create your own collector that collects
everything in the most compact way possible and perform the sorting on
the full collection afterwards.

- Toke Eskildsen, State and University Library, Denmark



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: frozen in PriorityQueue.downHeap for more than 25 minutes

2014-06-23 Thread Jamie

Toke

Thanks for the tip. Sadly, we are only requesting a set page size's worth
of documents at a time.


if (startIdx == 0) {
    topDocs = indexSearcher.search(query, queryFilter,
        searchResult.getPageSize(), sort);
} else {
    topDocs = indexSearcher.searchAfter(p.startScoreDoc, query,
        queryFilter, searchResult.getPageSize(), sort);
}
The page size is set to 50,000.

We've noticed that during major collections, search grinds to a halt for 
several minutes.


What are the best JVM collector settings for Lucene searching? We've
tried various options and they don't seem to make much difference.


Regards

Jamie

On 2014/06/23, 1:43 PM, Toke Eskildsen wrote:
> On Mon, 2014-06-23 at 13:33 +0200, Jamie wrote:
> > While running a search over several million documents, the YourKit
> > profiler reports a deadlock on the following method. Any ideas?





-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: frozen in PriorityQueue.downHeap for more than 25 minutes

2014-06-23 Thread Jamie

Toke

How does one sort the results of a collector as opposed to the entire 
result set?


Do I need to implement my own sort algorithm or is there a way to do 
this with Lucene?  If so, which API functions do I need to call?


Thanks

Jamie


On 2014/06/23, 1:43 PM, Toke Eskildsen wrote:
> On Mon, 2014-06-23 at 13:33 +0200, Jamie wrote:
> > While running a search over several million documents, the YourKit
> > profiler reports a deadlock on the following method. Any ideas?
>
> My guess is that you are requesting several million documents as your
> result set, instead of just the top 10 or top 100.
>
> [rest of quoted message snipped]



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: EarlyTerminatingSortingCollector help needed..

2014-06-23 Thread Adrien Grand
On Sun, Jun 22, 2014 at 6:44 PM, Ravikumar Govindarajan
 wrote:
> For a normal sorting-query, on a top-level searcher, I execute
>
> TopDocs docs = searcher.search(query, 50, sortField)
>
> Then I can issue reader.document() for final list of exactly 50 docs, which
> gives me a global order across segments but at the obvious cost of memory...
>
> SortingMergePolicy + ETSC will make me do 50*N [N=no.of.segments] collects,
> which could increase cost of seeks when each segment collects considerable
> hits...

This is not correct. :) ETSC will collect segments one after another
but in the end, what you will get are the top hits for all segments.
This means that even though you have eg. 15 segments, if you requested
50 documents, you will get the top 50 documents out of your
TopHitsCollector.
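A minimal sketch of that flow (hedged - it assumes the index was written
with a SortingMergePolicy on the same Sort, and the "timestamp" field name
is made up; EarlyTerminatingSortingCollector lives in lucene-misc, package
org.apache.lucene.index.sorter):

Sort sort = new Sort(new SortField("timestamp", SortField.Type.LONG));
TopFieldCollector topCollector =
    TopFieldCollector.create(sort, 50, true, false, false, false);
// Each sorted segment stops after collecting 50 docs; the collector's
// shared priority queue still ends up holding the global top 50.
searcher.search(query,
    new EarlyTerminatingSortingCollector(topCollector, sort, 50));
TopDocs top50 = topCollector.topDocs();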

>> - you can afford the merging overhead (ie. for heavy indexing
>> workloads, this might not be the best solution)
>> - there is a single sort order that is used for most queries
>> - you don't need any feature that requires to collect all documents
>> (like computing the total hit count or facets).
>
> Our use-case fits perfectly on all these 3 points, and that's why we wanted
> to explore this. But our final set of results must also be globally
> ordered. Maybe it's a mistake to assume that sorting can be entirely
> replaced with SMP + ETSC...

I don't think it is a mistake, this can help make the execution of
search requests significantly faster.

>> I would not advise to use the stored fields API, even in the context
>> of early termination. Doc values should be more efficient here?
>
> I read your excellent blog post on stored-fields compression, where you've
> mentioned that stored fields now take only one random seek. [
> http://blog.jpountz.net/post/35667727458/stored-fields-compression-in-lucene-4-1
> ]
>
> If so, then what could make DocValues still a winner?

Yes. If you use eg. 2 doc values fields to run your query, it is true
that the number of seeks in the worst case would be 2 for doc values
and only 1 for stored fields, so stored fields might look more
appropriate. However, doc values play much better with the operating
system thanks to column-stride storage since:
 - they allow for lightweight and efficient compression,
 - the filesystem cache doesn't get loaded with field values that you
are not interested in.

When wondering about stored fields vs doc values, the right trade-off
is usually to use:
 - stored fields when looking up several field values for a few documents,
 - doc values when loading a few field values for many documents.
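As a hedged illustration of that trade-off at index time (the field names
are made up):

Document doc = new Document();
// Stored field: row-oriented; cheap to load several values for one hit.
doc.add(new StoredField("title", "Stored fields compression in Lucene 4.1"));
// Doc values: column-stride; cheap to load one value for many documents.
doc.add(new NumericDocValuesField("timestamp", 1403481600000L));
writer.addDocument(doc);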


-- 
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: frozen in PriorityQueue.downHeap for more than 25 minutes

2014-06-23 Thread Toke Eskildsen
On Mon, 2014-06-23 at 13:53 +0200, Jamie wrote:
> if (startIdx == 0) {
>     topDocs = indexSearcher.search(query, queryFilter,
>         searchResult.getPageSize(), sort);
> } else {
>     topDocs = indexSearcher.searchAfter(p.startScoreDoc, query,
>         queryFilter, searchResult.getPageSize(), sort);
> }
> The page size is set to 50,000.

Okay, that was strange. 50K is fine for a heap. How many concurrent
searches are you running?

> What are the best JVM collector settings for Lucene searching? We've
> tried various options and they don't seem to make much difference.

I am no expert there, but I will advise you to check how much free
memory your JVM has when it is running searches. GC tweaks do not help
much if the JVM is nearly out of memory.

- Toke Eskildsen, State and University Library, Denmark



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: frozen in PriorityQueue.downHeap for more than 25 minutes

2014-06-23 Thread Jamie

Toke


On 2014/06/23, 2:08 PM, Toke Eskildsen wrote:
> On Mon, 2014-06-23 at 13:53 +0200, Jamie wrote:
> > if (startIdx == 0) {
> >     topDocs = indexSearcher.search(query, queryFilter,
> >         searchResult.getPageSize(), sort);
> > } else {
> >     topDocs = indexSearcher.searchAfter(p.startScoreDoc, query,
> >         queryFilter, searchResult.getPageSize(), sort);
> > }
> > The page size is set to 50,000.
>
> Okay, that was strange. 50K is fine for a heap. How many concurrent
> searches are you running?

Just one search at a time, although the index searcher is passed an
executor with a thread pool of 16 or so.

> > What are the best JVM collector settings for Lucene searching? We've
> > tried various options and they don't seem to make much difference.
>
> I am no expert there, but I will advise you to check how much free
> memory your JVM has when it is running searches. GC tweaks do not help
> much if the JVM is nearly out of memory.

There is plenty of memory available. The heap is set to 6 GB.

> - Toke Eskildsen, State and University Library, Denmark



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: frozen in PriorityQueue.downHeap for more than 25 minutes

2014-06-23 Thread Toke Eskildsen
On Mon, 2014-06-23 at 13:58 +0200, Jamie wrote:
> How does one sort the results of a collector as opposed to the entire 
> result set?

With only 50K as the page size, this should not be necessary. But for the
record, you do it by implementing a Collector that can potentially hold
all documents in the index (well, their docID & sort key anyway) and
feeding it to the search(Query query, Collector results) method on the
IndexSearcher. When the call has finished, run your own sort and extract
the top-X results.

> Do I need to implement my own sort algorithm or is there a way to do 
> this with Lucene?  If so, which API functions do I need to call?

InPlaceMergeSorter is a nice one to extend. But again, with 50K result
sets, this seems like overkill.
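A minimal sketch of that combination against the 4.x API (hedged - it
assumes the sort key lives in a NumericDocValues field, here called
"timestamp", present in every segment; a packed long[] would be more
compact than the List used below):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;
import org.apache.lucene.util.InPlaceMergeSorter;

public class CompactSortingCollector extends Collector {
  private final List<long[]> hits = new ArrayList<long[]>(); // {globalDocID, sortKey}
  private NumericDocValues values;
  private int docBase;

  @Override
  public void setScorer(Scorer scorer) {} // scores are not needed for DV sorting

  @Override
  public void setNextReader(AtomicReaderContext context) throws IOException {
    docBase = context.docBase;
    values = context.reader().getNumericDocValues("timestamp");
  }

  @Override
  public void collect(int doc) {
    hits.add(new long[] { docBase + doc, values.get(doc) });
  }

  @Override
  public boolean acceptsDocsOutOfOrder() {
    return true;
  }

  /** Run once, after searcher.search(query, collector) returns. */
  public List<long[]> sortedByKey() {
    new InPlaceMergeSorter() {
      @Override protected int compare(int i, int j) {
        return Long.compare(hits.get(i)[1], hits.get(j)[1]);
      }
      @Override protected void swap(int i, int j) {
        long[] tmp = hits.get(i); hits.set(i, hits.get(j)); hits.set(j, tmp);
      }
    }.sort(0, hits.size());
    return hits;
  }
}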

- Toke Eskildsen, State and University Library, Denmark



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene Facets Module 4.8.1

2014-06-23 Thread Jigar Shah
Thanks, this worked for me :)

Is there any advantage to indexing some facets without providing any
indexFieldName?

Thanks




On Mon, Jun 23, 2014 at 12:55 PM, Shai Erera  wrote:

> [earlier messages quoted in full; snipped]

Re: Lucene Facets Module 4.8.1

2014-06-23 Thread Shai Erera
Basically, it's not very common to change the indexFieldName. You should do
that if you e.g. count facets in groups of dimensions, rather than counting
all of them. So for example, if you have 20 dimensions, but you know you
only ever count d1-d5, d6-d12 and d13-d20 together, then separating them
into 3 different indexFieldNames will probably improve performance.

But if you can't make such a decision, it's better not to modify this. When
you initialize a facet counts instance, it counts all the dimensions that
are indexed under that indexFieldName, so if you need the counts of all of
them, or the majority of them, that's OK. But if you know you *always* need
the counts of only a subset of them, then separating that subset into a
different field is better.
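As a hedged illustration (the dimension and field names are made up), the
grouping is declared once on the FacetsConfig at index time:

FacetsConfig config = new FacetsConfig();
// d1-d5 are always counted together, so give them their own index field;
// dimensions left alone stay in the default "$facets" field.
config.setIndexFieldName("d1", "group1");
config.setIndexFieldName("d2", "group1");
// ... and likewise for d3, d4 and d5 ...

A FastTaxonomyFacetCounts opened over "group1" at search time then counts
exactly those dimensions and nothing else.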

Hope that clarifies.

Shai


On Mon, Jun 23, 2014 at 4:18 PM, Jigar Shah  wrote:

> Thanks, this worked for me :)
>
> Is there any advantage to indexing some facets without providing any
> indexFieldName?
>
> Thanks
>
> [earlier quoted messages snipped]

Re: Lucene Facets Module 4.8.1

2014-06-23 Thread Jigar Shah
Thanks very much for this valuable information.

Good to know that the same indexFieldName can be used for multiple
(similar, in some cases) dimensions.

For sure this will help me design the application better.

Thanks once again.


On Mon, Jun 23, 2014 at 7:00 PM, Shai Erera  wrote:

> Basically, it's not very common to change the indexFieldName. [...]
>
> [rest of quoted thread snipped]

Re: EarlyTerminatingSortingCollector help needed..

2014-06-23 Thread Ravikumar Govindarajan
> This means that even though you have eg. 15 segments, if you requested
> 50 documents, you will get the top 50 documents out of your
> TopHitsCollector.

Yes, we can get the top-50 docs in the end. I am not denying that.

Let me re-phrase my question; apologies if I was not clear.

How do we ensure global sort order during a search across all segments of
the index, when using ETSC + SMP, which work only at the per-segment level?


> When wondering about stored fields vs doc values, the right trade-off
> is usually to use:
>  - stored fields when looking up several field values for a few documents,
>  - doc values when loading a few field values for many documents.


Thanks for this clarification. We shall surely move towards doc values...

--
Ravi


On Mon, Jun 23, 2014 at 5:36 PM, Adrien Grand  wrote:

> [full quote of Adrien's message above snipped]


Re: Facet migration 4.6.1 to > 4.7.0

2014-06-23 Thread Nicola Buso
Hi,

On Tue, 2014-06-17 at 17:51 +0300, Shai Erera wrote:
> > - we are extending FacetResultsHandler to change the order of the facet
> > results (i.e. date facets ordered by date instead of count). How can I
> > achieve this now?
>
> Now everything is a Facets. In your case, since you use the taxonomy,
> it's TaxonomyFacets. You can check the class hierarchy, where you have
> IntTaxoFacets (to deal w/ integers) and then TaxoFacetCounts and
> FastTaxoFacetCounts. I think you want to extend either IntTaxoFacets,
> or just TaxonomyFacets. Then if you ask for the 'date' dimension,
> delegate to the one that sorts by the date value, otherwise to the
> default one?
>
> When you say you sort by date, do you count the topN and then sort
> them by date, or do you sort the entire dimension by date and then
> return the topN? If the latter, does it mean you resolve each ordinal to
> its Date value to sort by? It might be a bit expensive to resolve that
> ... I wonder if you could do that w/ NumericDocValues too ... e.g. add
> Year, Month, Day numeric DV fields, then aggregate by their value
> instead of resolving them to ordinals ... it's probably more involved
> than that, i.e. counting 2013/March is more complicated, but there's
> got to be a solution, like maybe ask to count March, but filter the
> query by year:2013 ... need to think about that.

I had an abstract implementation of FacetResultsHandler that permitted
extenders to provide their own PriorityQueue, ordering in my case by label
instead of by value; the previous API worked with an instance of
PriorityQueue, and FacetResultNode was a better container of information
compared to OrdAndValue (at least for my case). I probably need to
reimplement this part again.
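Depending on the case, a simpler route than a custom Facets may be to count
the topN as usual and re-order the returned children afterwards - a hedged
sketch over the new API (this only re-orders the topN by count, i.e. the
first of the two cases Shai distinguishes above):

FacetResult fr = facets.getTopChildren(topN, "date");
LabelAndValue[] byLabel = fr.labelValues.clone();
Arrays.sort(byLabel, new Comparator<LabelAndValue>() {
  @Override
  public int compare(LabelAndValue a, LabelAndValue b) {
    return a.label.compareTo(b.label); // lexicographic; fine for yyyy-MM-dd labels
  }
});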

> > - we have usual IndexReaders opened in groups with MultiReader, then
> > we're merging in RAM the TaxonomyReaders to obtain a correspondence of
> > the MultiReader for the taxonomies. Do you think I can still do this?
>
> The taxonomy in general hasn't changed. Besides CategoryPath, which was
> replaced by String[], it's more or less the same.

OK, I will try to adapt this part.

> > - at some point you removed the residue information from facets and we
> > calculated it differently; am I right that I can now calculate it as
> > FacetResult.childCount - FacetResult.labelValues.length?
>
> If the residue is the number of children that had counts > 0 but are not
> in the topN, then yes, the above computation seems right.
> FR.childCount denotes how many child labels were encountered, while
> FR.labelValues.length is <= N, where N is the topN that you ask to count.

Yes, your assumption is right; I have already sorted this part out.

> > - we are extending TaxonomyFacetsAccumulator to provide:
> >   - specific FacetResultsHandler(s) depending on the facet
> >   - facets other than the topK, if the user selected some facet values
> > from the "residue".
> > Where does the API permit extending the behavior to achieve this?
>
> FacetsCollector hasn't changed much and returns a List<MatchingDocs>.
> The entire additional chain (Accumulator, ResultHandler etc.) is now a
> Facets. So you basically either need to extend Facets (or
> TaxonomyFacets), or write your own class which just processes the
> List<MatchingDocs>.
>
> There's no "right way" to do it, it depends on what you want to
> achieve. If it's e.g. the different sort order (date vs other), I would
> try to extend one of the existing classes (IntTaxoFacets). If it's
> something completely different, e.g. RangeFacetCounts, you should be
> able to just extend Facets. And if it's not a "Facets" thing at all,
> i.e. you don't need its API, just write your own interface to process
> the list of MatchingDocs.
>
> Hope that helps
>
> Shai

Nicola.

> On Tue, Jun 17, 2014 at 5:30 PM, Nicola Buso  wrote:
> > Hi,
> >
> > I'm migrating from lucene 4.6.1 to 4.8.1 and I noticed some Facet API
> > changes happened in 4.7.0, probably mostly related to this ticket:
> > http://issues.apache.org/jira/browse/LUCENE-5339
> >
> > Here are a few questions about some customizations/extensions we did
> > that seem not to have a direct counterpart/extension point in the new
> > API; can someone help with these questions?
> >
> > [the individual questions are quoted above; snipped]

Re: Reusable Performance Tests

2014-06-23 Thread Gaurav gupta
Srividhya,
I am also looking for something similar. I will see if I can find something.

Thanks
On Jun 20, 2014 12:50 PM, "Umashanker, Srividhya" <
srividhya.umashan...@hp.com> wrote:

> Are there any performance test suites available in the Lucene codebase
> which we can reuse to benchmark against our Lucene infrastructure?
>
> We are looking at mainly multithreaded indexing tests.
>
> -Vidhya
>


Re: Reusable Performance Tests

2014-06-23 Thread Michael McCandless
The luceneutil module
(https://code.google.com/a/apache-extras.org/p/luceneutil/ ) has
benchmarking code for indexing; it's what I use to generate Lucene's
nightly performance graphs
(http://people.apache.org/~mikemccand/lucenebench/indexing.html ).
But it's somewhat involved to get it set up ...
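If luceneutil is too heavy, a rough hand-rolled sketch of a multithreaded
indexing benchmark could look like the following (paths and sizes are
arbitrary; it measures raw addDocument throughput only, including the
final commit on close):

import java.io.File;
import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class IndexingBench {
  static final int THREADS = 8, DOCS_PER_THREAD = 100000;

  public static void main(String[] args) throws Exception {
    IndexWriterConfig iwc = new IndexWriterConfig(
        Version.LUCENE_48, new StandardAnalyzer(Version.LUCENE_48));
    final IndexWriter w = new IndexWriter(
        FSDirectory.open(new File("/tmp/bench-index")), iwc);
    ExecutorService pool = Executors.newFixedThreadPool(THREADS);
    long start = System.nanoTime();
    for (int t = 0; t < THREADS; t++) {
      pool.submit(new Runnable() {
        @Override
        public void run() {
          try {
            for (int i = 0; i < DOCS_PER_THREAD; i++) {
              Document doc = new Document();
              doc.add(new TextField("body", "synthetic body text " + i, Store.NO));
              w.addDocument(doc); // IndexWriter is thread-safe
            }
          } catch (IOException e) {
            throw new RuntimeException(e);
          }
        }
      });
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.HOURS);
    w.close(); // commit included in the timing
    double sec = (System.nanoTime() - start) / 1e9;
    System.out.printf("%.0f docs/sec%n", THREADS * DOCS_PER_THREAD / sec);
  }
}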

Mike McCandless

http://blog.mikemccandless.com


On Mon, Jun 23, 2014 at 6:15 PM, Gaurav gupta
 wrote:
> Srividhya,
> I am also looking for something similar. I will see if I can find something.
>
> Thanks
> On Jun 20, 2014 12:50 PM, "Umashanker, Srividhya" <
> srividhya.umashan...@hp.com> wrote:
>
>> Are there any performance test suites available in the Lucene codebase
>> which we can reuse to benchmark against our Lucene infrastructure?
>>
>> We are looking at mainly multithreaded indexing tests.
>>
>> -Vidhya
>>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org