Re: Fast faceting over large number of distinct terms
Interesting solution. My concern is how to select the most frequent terms in the story_text field in a way that makes sense to the user. Including only the X most common non-stopword terms from each document could easily cause important patterns to be missed. There's a similar issue with only returning counts for terms in the top N documents matching a particular query.

Also, is there an efficient way to add term counts on the client side? I thought of using the TermVectorComponent to get document-level frequency counts and then using something like Hadoop to add them up. However, I couldn't find any documentation on feeding the results of a Solr query into a map-reduce operation.

--
David

On Wed, May 22, 2013 at 11:12 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: [...]
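For what it's worth, the client-side summation step David asks about is simple once the per-document frequency maps have been parsed out of the TermVectorComponent response. A minimal Python sketch (the function name is made up, and the response-parsing step is deliberately omitted — it assumes you already have one term-to-frequency dict per document):

```python
from collections import Counter

def aggregate_term_counts(per_doc_counts, top_n=None):
    """Sum per-document term frequencies into one cumulative count.

    per_doc_counts: iterable of {term: frequency} dicts, one per document
    (e.g. parsed out of TermVectorComponent responses).
    """
    totals = Counter()
    for counts in per_doc_counts:
        totals.update(counts)  # Counter.update adds counts, not replaces
    return totals.most_common(top_n)

# Example with three documents' term-frequency maps:
docs = [
    {"iraq": 3, "war": 1},
    {"iraq": 1, "election": 2},
    {"war": 4},
]
print(aggregate_term_counts(docs, top_n=2))  # [('war', 5), ('iraq', 4)]
```

For result sets of a few hundred documents this is cheap; the expensive part is fetching the term vectors themselves, not adding them up.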
Re: Fast faceting over large number of distinct terms
Hi David,

Out of interest, what are you trying to accomplish by faceting over the story_text field? Is it generally the case that the story_text field contains values that are repeated or that categorize your documents somehow? From your description ("story_text is used to store free-form text obtained by crawling newspapers and blogs") it doesn't seem that way, so I'm not sure faceting is what you want in this situation.

Cheers,
Brendan

On Wed, May 22, 2013 at 9:49 PM, David Larochelle dlaroche...@cyber.law.harvard.edu wrote:

I'm trying to quickly obtain cumulative word frequency counts over all documents matching a particular query. I'm running Solr 4.3.0 on a machine with 16GB of RAM. My index is 2.5 GB and has around 350,000 documents. My schema includes the following fields:

<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="media_id" type="int" indexed="true" stored="true" required="true" multiValued="false" />
<field name="story_text" type="text_general" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true" />

story_text is used to store free-form text obtained by crawling newspapers and blogs.

Running faceted searches with the fc or fcs methods fails with the error "Too many values for UnInvertedField faceting on field story_text":

http://localhost:8983/solr/query?q=id:106714828_6621&facet=true&facet.limit=10&facet.pivot=publish_date,story_text&rows=0&facet.method=fcs

Running a faceted search with the 'enum' method succeeds but takes a very long time:

http://localhost:8983/solr/query?q=includes:foobar&facet=true&facet.limit=100&facet.pivot=media_id,includes&facet.method=enum&rows=0
http://localhost:8983/solr/query?q=includes:mccain&facet=true&facet.limit=100&facet.pivot=media_id,includes&facet.method=enum&rows=0

The frustrating thing is that even if the query only returns a few hundred documents, it still takes 10 minutes or longer to get the cumulative word count results. Eventually we're hoping to build a system that will return results in a few seconds and scale to hundreds of millions of documents. Is there any way to get this level of performance out of Solr/Lucene?

Thanks,
David

--
Brendan Grainger
www.kuripai.com
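As an aside, the query parameters in the URLs above are ampersand-separated; when constructing them programmatically it is safer to let a URL library handle the escaping of values like `media_id,includes`. A small Python sketch, reusing the host and parameter values from the message above:

```python
from urllib.parse import urlencode

# Parameters for the enum-method pivot facet query shown above.
params = [
    ("q", "includes:foobar"),
    ("facet", "true"),
    ("facet.limit", "100"),
    ("facet.pivot", "media_id,includes"),
    ("facet.method", "enum"),
    ("rows", "0"),
]
# urlencode percent-escapes reserved characters (':' and ',' here)
url = "http://localhost:8983/solr/query?" + urlencode(params)
print(url)
```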
Re: Fast faceting over large number of distinct terms
The goal of the system is to obtain data that can be used to generate word clouds, so that users can quickly get a sense of the aggregate contents of all documents matching a particular query. For example, a user might want to see a word cloud of all documents discussing 'Iraq' in particular newspapers.

Faceting on story_text gives counts of individual words rather than entire text strings. I think this is because of the tokenization that happens automatically as part of the text_general type. I'm happy to look at alternatives to faceting, but I wasn't able to find one that provides aggregate word counts for just the documents matching a particular query, rather than for individual documents or the entire index.

--
David

On Wed, May 22, 2013 at 10:32 PM, Brendan Grainger brendan.grain...@gmail.com wrote: [...]
Re: Fast faceting over large number of distinct terms
Here's a possibility: at index time, extract important terms (and/or phrases) from the story_text and store the top N of them in a separate field (which will be much smaller/shorter). Then facet on that. Or just retrieve that field and manually parse and count in the client, if that turns out to be faster. I did this in the previous decade, before Solr was available, and it worked well. I limited my counting to the top N (200?) hits.

Otis
--
Solr & ElasticSearch Support
http://sematext.com/

On Wed, May 22, 2013 at 10:54 PM, David Larochelle dlaroche...@cyber.law.harvard.edu wrote: [...]
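The index-time extraction Otis describes might look something like the sketch below. The `top_terms` name, the tiny stopword list, and the regex tokenizer are all placeholders — a real pipeline would reuse the output of the same analyzer that indexes story_text so the auxiliary field stays consistent with it:

```python
from collections import Counter
import re

# Placeholder stopword list; a real one would match the analyzer's.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it"}

def top_terms(text, n=200):
    """Pick the N most frequent non-stopword terms from one document,
    to be stored in a small auxiliary field that is cheap to facet on."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS and len(t) > 2)
    return [term for term, _ in counts.most_common(n)]

story = "The war in Iraq dominated the news. Iraq coverage grew and grew."
print(top_terms(story, n=3))
```

Faceting then runs over a field holding at most N terms per document instead of the full tokenized body, which sidesteps the UnInvertedField limit entirely.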
Re: Fast faceting over large number of distinct terms
I would fetch the term vectors for the top N documents and add them up myself. You could even scale the term counts by the relevance score for each document. That would avoid problems with analyzing ten documents where only the first three were really good matches. I did something similar in a different engine for a kNN classifier.

wunder

On May 22, 2013, at 8:12 PM, Otis Gospodnetic wrote: [...]

--
Walter Underwood
wun...@wunderwood.org
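Walter's score-weighted variant could be sketched as follows. This is plain Python with made-up scores; in practice the relevance scores and the term vectors would both come back from Solr for the same top-N result set, and `weighted_term_counts` is a hypothetical helper name:

```python
from collections import defaultdict

def weighted_term_counts(docs):
    """Aggregate per-document term frequencies, scaling each document's
    counts by its relevance score so that marginal matches near the
    bottom of the top-N result set contribute proportionally less.

    docs: iterable of (score, {term: frequency}) pairs.
    """
    totals = defaultdict(float)
    for score, counts in docs:
        for term, freq in counts.items():
            totals[term] += score * freq
    # Highest weighted count first
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

# Three hits: the first two are strong matches, the third barely matches,
# so its terms barely register in the aggregate.
hits = [
    (0.9, {"iraq": 2, "war": 1}),
    (0.8, {"iraq": 1}),
    (0.1, {"election": 5}),
]
print(weighted_term_counts(hits))
```

The unweighted version is the special case where every score is 1.0.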