Re: Fast faceting over large number of distinct terms

2013-05-23 Thread David Larochelle
Interesting solution. My concern is how to select the most frequent terms
in the story_text field in a way that would make sense to the user. Only
including the X most common non-stopword terms in a document could easily
cause important patterns to be missed. There's a similar issue with only
returning counts for terms in the top N documents matching a particular
query.

Also, is there an efficient way to add up term counts on the client side? I
thought of using the TermVectorComponent to get document-level frequency
counts and then using something like Hadoop to add them up. However, I
couldn't find any documentation on using the results of a Solr query to
feed a MapReduce operation.
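For what it's worth, the summation step itself doesn't need Hadoop at modest scale. Once per-document term frequencies have been parsed out of the TermVectorComponent response (the parsing is assumed here, since the exact response shape depends on the `wt`/`json.nl` settings), adding them up is a simple fold:

```python
from collections import Counter

def aggregate_term_counts(doc_term_freqs):
    """Sum per-document {term: tf} dicts into one cumulative count.

    doc_term_freqs: iterable of {term: tf} dicts, one per matching
    document, assumed already parsed from the TermVectorComponent
    response (parsing not shown).
    """
    totals = Counter()
    for freqs in doc_term_freqs:
        totals.update(freqs)  # adds counts key-by-key
    return totals

# Toy example with hand-made per-document counts:
docs = [
    {"iraq": 3, "war": 1},
    {"iraq": 1, "election": 2},
]
print(aggregate_term_counts(docs).most_common(2))
# -> [('iraq', 4), ('election', 2)]
```

The names and the toy data here are illustrative; the point is only that a single-process `Counter` fold handles hundreds or thousands of documents easily before MapReduce becomes necessary.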

--

David


 
 
  On Wed, May 22, 2013 at 9:49 PM, David Larochelle 
  dlaroche...@cyber.law.harvard.edu wrote:
 
 I'm trying to quickly obtain cumulative word frequency counts over all
 documents matching a particular query.

 I'm running Solr 4.3.0 on a machine with 16GB of RAM. My index is 2.5 GB
 and has around 350,000 documents.

 My schema includes the following fields:

 <field name="id" type="string" indexed="true" stored="true"
 required="true" multiValued="false" />
 <field name="media_id" type="int" indexed="true" stored="true"
 required="true" multiValued="false" />
 <field name="story_text" type="text_general" indexed="true" stored="true"
 termVectors="true" termPositions="true" termOffsets="true" />

 story_text is used to store free-form text obtained by crawling
 newspapers and blogs.

 Running faceted searches with the fc or fcs methods fails with the error
 "Too many values for UnInvertedField faceting on field story_text":

 http://localhost:8983/solr/query?q=id:106714828_6621&facet=true&facet.limit=10&facet.pivot=publish_date,story_text&rows=0&facet.method=fcs

 Running a faceted search with the 'enum' method succeeds but takes a very
 long time:

 http://localhost:8983/solr/query?q=includes:foobar&facet=true&facet.limit=100&facet.pivot=media_id,includes&facet.method=enum&rows=0

 http://localhost:8983/solr/query?q=includes:mccain&facet=true&facet.limit=100&facet.pivot=media_id,includes&facet.method=enum&rows=0

 The frustrating thing is that even when the query only returns a few
 hundred documents, it still takes 10 minutes or longer to get the
 cumulative word count results.

 Eventually we're hoping to build a system that will return results in a
 few seconds and scale to hundreds of millions of documents.
 Is there any way to get this level of performance out of Solr/Lucene?

 Thanks,

 David
  
 
 
 
 



Re: Fast faceting over large number of distinct terms

2013-05-22 Thread Brendan Grainger
Hi David,

Out of interest, what are you trying to accomplish by faceting over the
story_text field? Is it generally the case that the story_text field will
contain values that are repeated or that categorize your documents somehow?
From your description ("story_text is used to store free-form text
obtained by crawling newspapers and blogs"), it doesn't seem that way, so
I'm not sure faceting is what you want in this situation.

Cheers,
Brendan






-- 
Brendan Grainger
www.kuripai.com


Re: Fast faceting over large number of distinct terms

2013-05-22 Thread David Larochelle
The goal of the system is to obtain data that can be used to generate word
clouds, so that users can quickly get a sense of the aggregate contents of
all documents matching a particular query. For example, a user might want
to see a word cloud of all documents discussing 'Iraq' in a particular
newspaper.

Faceting on story_text gives counts of individual words rather than entire
text strings. I think this is because of the tokenization that happens
automatically as part of the text_general type. I'm happy to look at
alternatives to faceting, but I wasn't able to find one that provided
aggregate word counts for just the documents matching a particular query,
rather than for individual documents or the entire index.

--

David





Re: Fast faceting over large number of distinct terms

2013-05-22 Thread Otis Gospodnetic
Here's a possibility:

At index time, extract important terms (and/or phrases) from this
story_text and store the top N of them in a separate field (which will be
much smaller/shorter). Then facet on that. Or just retrieve it and
manually parse and count in the client, if that turns out to be faster.
I did this in the previous decade, before Solr was available, and it
worked well. I limited my counting to the top N (200?) hits.

Otis
--
Solr & ElasticSearch Support
http://sematext.com/
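A rough sketch of this index-time approach follows. The `top_terms` field name, the tiny stopword list, and the plain frequency ranking are all placeholders; a real pipeline might rank terms by TF-IDF or use a proper keyphrase extractor instead.

```python
from collections import Counter

# A minimal, hand-rolled stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "in", "to", "is", "that"}

def top_terms(story_text, n=200):
    """Pick the N most frequent non-stopword terms from one document.

    Uses raw term frequency for simplicity; swap in TF-IDF or a
    keyphrase extractor for better term selection.
    """
    tokens = [t for t in story_text.lower().split() if t.isalpha()]
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return [term for term, _ in counts.most_common(n)]

# Before sending a document to Solr, store the extracted terms in a
# separate (hypothetical) field, then facet on that field instead of
# the full story_text:
doc = {"id": "1", "story_text": "The war in Iraq dominated the news"}
doc["top_terms"] = top_terms(doc["story_text"], n=200)
```

Because `top_terms` holds only a few hundred values per document, faceting on it stays within UnInvertedField limits even when the full text would not.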








Re: Fast faceting over large number of distinct terms

2013-05-22 Thread Walter Underwood
I would fetch the term vectors for the top N documents and add them up myself. 
You could even scale the term counts by the relevance score for the document. 
That would avoid problems with analyzing ten documents where only the first 
three were really good matches.

I did something similar in a different engine for a kNN classifier.
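Assuming each hit's relevance score and its term vector have already been fetched as a `(score, {term: tf})` pair (the names here are illustrative), the score-weighted summation described above might look like:

```python
from collections import Counter

def weighted_term_counts(scored_docs):
    """Sum term frequencies across documents, weighting each document's
    counts by its relevance score, so that weak trailing matches
    contribute less to the aggregate totals.

    scored_docs: iterable of (score, {term: tf}) pairs.
    """
    totals = Counter()
    for score, freqs in scored_docs:
        for term, tf in freqs.items():
            totals[term] += score * tf
    return totals

hits = [
    (2.0, {"iraq": 2, "war": 1}),     # strong match
    (0.1, {"iraq": 1, "sports": 5}),  # weak match
]
print(weighted_term_counts(hits).most_common(1))
# -> [('iraq', 4.1)]
```

Note how the weak match's five "sports" occurrences contribute only 0.5 to the totals, so they cannot dominate the word cloud the way an unweighted sum would allow.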

wunder


--
Walter Underwood
wun...@wunderwood.org