Re: Fast faceting over large number of distinct terms

2013-05-23 Thread David Larochelle
Interesting solution. My concern is how to select the most frequent terms
in the story_text field in a way that would make sense to the user. Only
including the X most common non-stopword terms in a document could easily
cause important patterns to be missed. There's a similar issue with only
returning counts for terms in the top N documents matching a particular
query.

Also, is there an efficient way to add up term counts on the client side? I
thought of using the TermVectorComponent to get document-level frequency
counts and then using something like Hadoop to add them up. However, I
couldn't find any documentation on using the results of a Solr query to
feed a MapReduce operation.
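For what it's worth, the summation step itself doesn't need Hadoop at modest scale. Once per-document term frequencies have been parsed out of the TermVectorComponent response (the parsing is assumed here, since the exact response shape depends on the `wt`/`json.nl` settings), adding them up is a simple fold:

```python
from collections import Counter

def aggregate_term_counts(doc_term_freqs):
    """Sum per-document {term: tf} dicts into one cumulative count.

    doc_term_freqs: iterable of {term: tf} dicts, one per matching
    document, assumed already parsed from the TermVectorComponent
    response (parsing not shown).
    """
    totals = Counter()
    for freqs in doc_term_freqs:
        totals.update(freqs)  # adds counts key-by-key
    return totals

# Toy example with hand-made per-document counts:
docs = [
    {"iraq": 3, "war": 1},
    {"iraq": 1, "election": 2},
]
print(aggregate_term_counts(docs).most_common(2))
# -> [('iraq', 4), ('election', 2)]
```

The names and the toy data here are illustrative; the point is only that a single-process `Counter` fold handles hundreds or thousands of documents easily before MapReduce becomes necessary.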

--

David


 
 
  On Wed, May 22, 2013 at 9:49 PM, David Larochelle 
  dlaroche...@cyber.law.harvard.edu wrote:
 
 I'm trying to quickly obtain cumulative word frequency counts over all
 documents matching a particular query.

 I'm running Solr 4.3.0 on a machine with 16GB of RAM. My index is 2.5 GB
 and has around 350,000 documents.

 My schema includes the following fields:

 <field name="id" type="string" indexed="true" stored="true"
 required="true" multiValued="false" />
 <field name="media_id" type="int" indexed="true" stored="true"
 required="true" multiValued="false" />
 <field name="story_text" type="text_general" indexed="true" stored="true"
 termVectors="true" termPositions="true" termOffsets="true" />

 story_text is used to store free-form text obtained by crawling
 newspapers and blogs.

 Running faceted searches with the fc or fcs methods fails with the error
 "Too many values for UnInvertedField faceting on field story_text":

 http://localhost:8983/solr/query?q=id:106714828_6621&facet=true&facet.limit=10&facet.pivot=publish_date,story_text&rows=0&facet.method=fcs

 Running a faceted search with the 'enum' method succeeds but takes a very
 long time:

 http://localhost:8983/solr/query?q=includes:foobar&facet=true&facet.limit=100&facet.pivot=media_id,includes&facet.method=enum&rows=0

 http://localhost:8983/solr/query?q=includes:mccain&facet=true&facet.limit=100&facet.pivot=media_id,includes&facet.method=enum&rows=0

 The frustrating thing is that even when the query only returns a few
 hundred documents, it still takes 10 minutes or longer to get the
 cumulative word count results.

 Eventually we're hoping to build a system that will return results in a
 few seconds and scale to hundreds of millions of documents.
 Is there any way to get this level of performance out of Solr/Lucene?

 Thanks,

 David
  
 
 
 
 



Re: Fast faceting over large number of distinct terms

2013-05-22 Thread Brendan Grainger
Hi David,

Out of interest, what are you trying to accomplish by faceting over the
story_text field? Is it generally the case that the story_text field will
contain values that are repeated or that categorize your documents somehow?
From your description ("story_text is used to store free-form text
obtained by crawling newspapers and blogs"), it doesn't seem that way, so
I'm not sure faceting is what you want in this situation.

Cheers,
Brendan






-- 
Brendan Grainger
www.kuripai.com


Re: Fast faceting over large number of distinct terms

2013-05-22 Thread David Larochelle
The goal of the system is to obtain data that can be used to generate word
clouds, so that users can quickly get a sense of the aggregate contents of
all documents matching a particular query. For example, a user might want
to see a word cloud of all documents discussing 'Iraq' in a particular
newspaper.

Faceting on story_text gives counts of individual words rather than entire
text strings. I think this is because of the tokenization that happens
automatically as part of the text_general type. I'm happy to look at
alternatives to faceting, but I wasn't able to find one that provided
aggregate word counts for just the documents matching a particular query,
rather than for individual documents or the entire index.

--

David





Re: Fast faceting over large number of distinct terms

2013-05-22 Thread Otis Gospodnetic
Here's a possibility:

At index time, extract important terms (and/or phrases) from this
story_text and store the top N of them in a separate field (which will be
much smaller/shorter). Then facet on that. Or just retrieve it and
manually parse and count in the client, if that turns out to be faster.
I did this in the previous decade, before Solr was available, and it
worked well. I limited my counting to the top N (200?) hits.

Otis
--
Solr & ElasticSearch Support
http://sematext.com/
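A rough sketch of this index-time approach follows. The `top_terms` field name, the tiny stopword list, and the plain frequency ranking are all placeholders; a real pipeline might rank terms by TF-IDF or use a proper keyphrase extractor instead.

```python
from collections import Counter

# A minimal, hand-rolled stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "in", "to", "is", "that"}

def top_terms(story_text, n=200):
    """Pick the N most frequent non-stopword terms from one document.

    Uses raw term frequency for simplicity; swap in TF-IDF or a
    keyphrase extractor for better term selection.
    """
    tokens = [t for t in story_text.lower().split() if t.isalpha()]
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return [term for term, _ in counts.most_common(n)]

# Before sending a document to Solr, store the extracted terms in a
# separate (hypothetical) field, then facet on that field instead of
# the full story_text:
doc = {"id": "1", "story_text": "The war in Iraq dominated the news"}
doc["top_terms"] = top_terms(doc["story_text"], n=200)
```

Because `top_terms` holds only a few hundred values per document, faceting on it stays within UnInvertedField limits even when the full text would not.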








Re: Fast faceting over large number of distinct terms

2013-05-22 Thread Walter Underwood
I would fetch the term vectors for the top N documents and add them up myself. 
You could even scale the term counts by the relevance score for the document. 
That would avoid problems with analyzing ten documents where only the first 
three were really good matches.

I did something similar in a different engine for a kNN classifier.
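Assuming each hit's relevance score and its term vector have already been fetched as a `(score, {term: tf})` pair (the names here are illustrative), the score-weighted summation described above might look like:

```python
from collections import Counter

def weighted_term_counts(scored_docs):
    """Sum term frequencies across documents, weighting each document's
    counts by its relevance score, so that weak trailing matches
    contribute less to the aggregate totals.

    scored_docs: iterable of (score, {term: tf}) pairs.
    """
    totals = Counter()
    for score, freqs in scored_docs:
        for term, tf in freqs.items():
            totals[term] += score * tf
    return totals

hits = [
    (2.0, {"iraq": 2, "war": 1}),     # strong match
    (0.1, {"iraq": 1, "sports": 5}),  # weak match
]
print(weighted_term_counts(hits).most_common(1))
# -> [('iraq', 4.1)]
```

Note how the weak match's five "sports" occurrences contribute only 0.5 to the totals, so they cannot dominate the word cloud the way an unweighted sum would allow.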

wunder


--
Walter Underwood
wun...@wunderwood.org