Faceting by field then query
I have the following schema:

  <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
  <field name="media_id" type="int" indexed="true" stored="true" required="false" multiValued="false" />
  <field name="sentence" type="text_general" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true" />

I'd like to be able to facet by a field and then by queries, i.e.:

  facet_fields: {
    media_id: [
      1: { sentence:foo: 102410, sentence:bar: 29710 },
      2: { sentence:foo: 600,    sentence:bar: 220 },
      3: { sentence:foo: 80,     sentence:bar: 2330 }
    ]
  }

However, when I try:

  http://localhost:8983/solr/collection1/select?q=*:*&wt=json&indent=true&facet=true&facet.query=sentence%3Afoo&facet.query=sentence%3Abar&facet.field=media_id

the facet counts for the queries and for media_id are listed separately rather than hierarchically. I realize that I could use 2 separate requests and programmatically combine the results, but I would much prefer to use a single Solr request. Is there any way to do this in Solr?

Thanks in advance,

David
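For what it's worth, the two-request fallback mentioned above can be expressed as one request per facet query, faceting on media_id each time; the media_id facet counts of the first request are then the sentence:foo counts per media_id value, and likewise for bar:

  http://localhost:8983/solr/collection1/select?q=sentence%3Afoo&rows=0&wt=json&facet=true&facet.field=media_id
  http://localhost:8983/solr/collection1/select?q=sentence%3Abar&rows=0&wt=json&facet=true&facet.field=media_id

On later Solr releases (5.x and up), the JSON Facet API can nest facet queries under a field facet in a single request. A minimal sketch, assuming the same collection1 setup:

  curl http://localhost:8983/solr/collection1/query -d 'q=*:*&rows=0&json.facet={
    media: { type: "terms", field: "media_id", facet: {
      foo: { type: "query", q: "sentence:foo" },
      bar: { type: "query", q: "sentence:bar" } } } }'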
Using CachedSqlEntityProcessor with delta imports in DIH
I'm trying to use the CachedSqlEntityProcessor on a child entity that also has a delta query. Full imports and delta imports of the parent entity work fine; however, delta imports for the child entity have no effect. If I remove the processor="CachedSqlEntityProcessor" attribute from the child entity, the delta import works flawlessly but the full import is very slow. Here's my data-config.xml:

  <dataConfig>
    <xi:include href="db-connection.xml" xmlns:xi="http://www.w3.org/2001/XInclude"/>
    <document>
      <entity name="story_sentences" pk="story_sentences_id"
              query="select story_sentences_id || '_ss' as id, 'ss' as field_type, * from story_sentences"
              deltaImportQuery="select story_sentences_id || '_ss' as id, 'ss' as field_type, * from story_sentences where story_sentences_id=${dataimporter.delta.id}"
              deltaQuery="SELECT story_sentences_id as id, story_sentences_id from story_sentences where db_row_last_updated &gt; '${dih.last_index_time}'">
        <entity name="media_tags_map" pk="media_tags_map_id"
                query="select tags_id as tags_id_media, * from media_tags_map"
                cacheKey="media_id"
                cacheLookup="story_sentences.media_id"
                processor="CachedSqlEntityProcessor"
                deltaQuery="select media_tags_map_id, media_id::varchar from media_tags_map where db_row_last_updated &gt; '${dih.last_index_time}'"
                parentDeltaQuery="select story_sentences_id as id from story_sentences where media_id = ${media_tags_map.media_id}"/>
      </entity>
    </document>
  </dataConfig>

I need to be able to run delta imports based on the media_tags_map table in addition to the story_sentences table. Any idea why delta imports for media_tags_map won't work when the CachedSqlEntityProcessor is used? I've searched extensively but can't find an example that uses both CachedSqlEntityProcessor and deltaQuery on the sub-entity, or any explanation of why the above configuration won't work as expected.

--
Thanks, David
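A commonly suggested workaround when delta imports and cached entity processors don't cooperate is the "delta import via full import" pattern from the DataImportHandler wiki: fold the delta condition into the parent query and run full-import with clean=false. A minimal sketch against the parent entity above (the child entity stays as written; column names are carried over from the config):

  <entity name="story_sentences" pk="story_sentences_id"
          query="select story_sentences_id || '_ss' as id, 'ss' as field_type, * from story_sentences
                 where '${dataimporter.request.clean}' != 'false'
                    or db_row_last_updated &gt; '${dih.last_index_time}'">
    ...
  </entity>

triggered with:

  http://localhost:8983/solr/dataimport?command=full-import&clean=false

Note this only covers changes driven by the parent table; picking up media_tags_map changes would still need an OR condition against that table's db_row_last_updated, e.g. via a join in the parent query.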
Re: SolrCloud and Joins
Thanks Walter,

Existing media sets will rarely change, but new media sets will be added relatively frequently. (There is a many-to-many relationship between media sets and media sources.) Given the size of the data, a new media set that only includes 1% of the collection would include 6 million rows.

Our data is stored in a PostgreSQL database and imported using the DataImportHandler. It takes around 3 days to fully import the data. In the single-shard case, the nice thing about using joins is that the media set to source mapping data could be updated using an hourly cron job while the sentence data could be updated using a delta query.

The obvious alternative to joins is to add the media_sets_id to the sentence data as a multi-valued field. We'll benchmark this. But my concern is that importing the full data will take even longer and that there will be no easy way to automatically update each affected row when a new media set is created. (I could write a separate one-off query for DataImportHandler each time a new media set is added, but this requires a lot of manual interaction.)

Does SolrCloud really not have a simple way to specify which shard to put a document on? I'm considering randomly generating document ID prefixes and then taking their murmurhash to determine which shards they correspond to. I could then explicitly send documents to a particular shard by specifying a document ID prefix. However, this seems like a hackish approach. Is there a better way?

On Mon, Jul 29, 2013 at 12:45 PM, Walter Underwood wun...@wunderwood.org wrote:

A join may seem clean, but it will be slow and (currently) doesn't work in a cluster. You find all the sentences in a media set by searching for that set id and requesting only the sentence_id (yes, you need that). Then you reindex them. With small documents like this, it is probably fairly fast. If you can't estimate how often the media sets will change or the size of the changes, then you aren't ready to choose a design. wunder

On Jul 29, 2013, at 8:41 AM, David Larochelle wrote:

We'd like to be able to easily update the media set to source mapping. I'm concerned that if we store the media_sets_id in the sentence documents, it will be very difficult to add additional media set to source mappings. I imagine that adding a new media set would either require reimporting all 600 million documents or writing complicated application logic to find out which sentences to update. Hence joins seem like a cleaner solution.

--
David

On Mon, Jul 29, 2013 at 11:22 AM, Walter Underwood wun...@wunderwood.org wrote:

Denormalize. Add media_set_id to each sentence document. Done. wunder

On Jul 29, 2013, at 7:58 AM, David Larochelle wrote:

I'm setting up SolrCloud with around 600 million documents. The basic structure of each document is: stories_id: integer, media_id: integer, sentence: text_en

We have a number of stories from different media, and we treat each sentence as a separate document because we need to run sentence-level analytics. We also have a concept of groups or sets of sources.
We've imported this media source to media sets mapping into Solr using the following structure: media_id_inner: integer, media_sets_id: integer

For the single node case, we're able to filter our sources by media_sets_id using a join query like the following:

  http://localhost:8983/solr/select?q={!join+from=media_id_inner+to=media_id}media_sets_id:1

However, this does not work correctly with SolrCloud. The problem is that the join query is performed separately on each of the shards and no shard has the complete media set to source mapping data, so SolrCloud returns incomplete results.

Since the complete media set to source mapping data is comparatively small (~50,000 rows), I would like to replicate it on every shard, so that the results of the individual join queries on separate shards would be equivalent to performing the same query on a single-shard system. However, I can't figure out how to replicate documents on separate shards. The compositeId router has the ability to colocate documents based on a prefix in the document ID, but this isn't what I need. What I would like is some way to either have the media set to source data replicated on every shard or to be able to explicitly upload this data to the individual shards. (For the rest of the data I like the compositeId autorouting.)

Any suggestions?

--
Thanks, David

--
Walter Underwood
wun...@wunderwood.org

--
Walter Underwood
wun...@wunderwood.org
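A sketch of the prefix-probing idea mentioned above, using the same MurmurHash3 implementation the compositeId router uses (org.apache.solr.common.util.Hash, shipped with Solr). The target hash range here is a made-up example; the real range for each shard comes from clusterstate.json, and with a "prefix!id" document ID the router takes the shard-selecting high 16 bits of the hash from the prefix:

  import org.apache.solr.common.util.Hash;

  public class ShardPrefixFinder {
      // Hypothetical target range for one shard of a 4-shard collection;
      // read the actual per-shard ranges from clusterstate.json.
      static final long MIN = 0x80000000L, MAX = 0xbfffffffL;

      public static void main(String[] args) {
          for (int i = 0; i < 1_000_000; i++) {
              String prefix = "p" + i;
              // Seed 0, as in CompositeIdRouter. The high 16 bits of this hash
              // decide the shard when the document ID is "<prefix>!<docid>".
              long h = Hash.murmurhash3_x86_32(prefix, 0, prefix.length(), 0) & 0xffffffffL;
              if (h >= MIN && h <= MAX) {
                  System.out.println("IDs like " + prefix + "!<docid> land on this shard");
                  return;
              }
          }
      }
  }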
SolrCloud and Joins
I'm setting up SolrCloud with around 600 million documents. The basic structure of each document is: stories_id: integer, media_id: integer, sentence: text_en

We have a number of stories from different media, and we treat each sentence as a separate document because we need to run sentence-level analytics. We also have a concept of groups or sets of sources. We've imported this media source to media sets mapping into Solr using the following structure: media_id_inner: integer, media_sets_id: integer

For the single node case, we're able to filter our sources by media_sets_id using a join query like the following:

  http://localhost:8983/solr/select?q={!join+from=media_id_inner+to=media_id}media_sets_id:1

However, this does not work correctly with SolrCloud. The problem is that the join query is performed separately on each of the shards and no shard has the complete media set to source mapping data, so SolrCloud returns incomplete results.

Since the complete media set to source mapping data is comparatively small (~50,000 rows), I would like to replicate it on every shard, so that the results of the individual join queries on separate shards would be equivalent to performing the same query on a single-shard system. However, I can't figure out how to replicate documents on separate shards. The compositeId router has the ability to colocate documents based on a prefix in the document ID, but this isn't what I need. What I would like is some way to either have the media set to source data replicated on every shard or to be able to explicitly upload this data to the individual shards. (For the rest of the data I like the compositeId autorouting.)

Any suggestions?

--
Thanks, David
Re: SolrCloud and Joins
We'd like to be able to easily update the media set to source mapping. I'm concerned that if we store the media_sets_id in the sentence documents, it will be very difficult to add additional media set to source mappings. I imagine that adding a new media set would either require reimporting all 600 million documents or writing complicated application logic to find out which sentences to update. Hence joins seem like a cleaner solution.

--
David

On Mon, Jul 29, 2013 at 11:22 AM, Walter Underwood wun...@wunderwood.org wrote:

Denormalize. Add media_set_id to each sentence document. Done. wunder

On Jul 29, 2013, at 7:58 AM, David Larochelle wrote:

I'm setting up SolrCloud with around 600 million documents. The basic structure of each document is: stories_id: integer, media_id: integer, sentence: text_en

We have a number of stories from different media, and we treat each sentence as a separate document because we need to run sentence-level analytics. We also have a concept of groups or sets of sources. We've imported this media source to media sets mapping into Solr using the following structure: media_id_inner: integer, media_sets_id: integer

For the single node case, we're able to filter our sources by media_sets_id using a join query like the following:

  http://localhost:8983/solr/select?q={!join+from=media_id_inner+to=media_id}media_sets_id:1

However, this does not work correctly with SolrCloud. The problem is that the join query is performed separately on each of the shards and no shard has the complete media set to source mapping data, so SolrCloud returns incomplete results.

Since the complete media set to source mapping data is comparatively small (~50,000 rows), I would like to replicate it on every shard, so that the results of the individual join queries on separate shards would be equivalent to performing the same query on a single-shard system. However, I can't figure out how to replicate documents on separate shards. The compositeId router has the ability to colocate documents based on a prefix in the document ID, but this isn't what I need. What I would like is some way to either have the media set to source data replicated on every shard or to be able to explicitly upload this data to the individual shards. (For the rest of the data I like the compositeId autorouting.)

Any suggestions?

--
Thanks, David

--
Walter Underwood
wun...@wunderwood.org
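Worth noting: on Solr 4.x, atomic updates can soften the reimport concern if every field in the schema is stored and the updateLog is enabled, since a new media set then only requires sending an "add" to a multivalued media_sets_id field on the affected sentence documents rather than reindexing them from source. A hedged sketch (the document ID and set ID are illustrative):

  curl 'http://localhost:8983/solr/update?commit=true' -H 'Content-Type: application/json' -d '
  [ { "id": "12345_ss", "media_sets_id": { "add": 7 } } ]'

Finding which sentence documents to touch still requires a query per new media set (e.g. by media_id), so this reduces, but does not eliminate, the application logic involved.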
Re: Solr indexer and Hadoop
Pardon my unfamiliarity with the Solr development process. Now that it's in the trunk, will it appear in the next 4.X release?

--
David

On Wed, Jun 26, 2013 at 9:42 AM, Erick Erickson erickerick...@gmail.com wrote:

Well, it's been merged into trunk according to the comments, so... Try it on trunk, help with any bugs, buy Mark beer. And, most especially, document up what it takes to make it work. Mark is juggling a zillion things and I'm sure he'd appreciate any help there. Erick

On Tue, Jun 25, 2013 at 11:25 AM, Michael Della Bitta michael.della.bi...@appinions.com wrote:

zomghowcanihelp? :)

Michael Della Bitta, Applications Developer, o: +1 646 532 3062 | c: +1 917 477 7906, appinions inc. “The Science of Influence Marketing”, 18 East 41st Street, New York, NY 10017, t: @appinions https://twitter.com/Appinions | g+: plus.google.com/appinions, w: http://www.appinions.com/

On Tue, Jun 25, 2013 at 2:08 PM, Erick Erickson erickerick...@gmail.com wrote:

You might be interested in following: https://issues.apache.org/jira/browse/SOLR-4916 Best, Erick

On Tue, Jun 25, 2013 at 7:28 AM, Michael Della Bitta michael.della.bi...@appinions.com wrote:

Jack, Sorry, but I don't agree that it's that cut and dried. I've very successfully worked with terabytes of data in Hadoop that was stored on an Isilon mounted via NFS, for example. In cases like this, you're using MapReduce purely for its execution model (which existed far before Hadoop and HDFS ever did).

Michael Della Bitta, Applications Developer, o: +1 646 532 3062 | c: +1 917 477 7906, appinions inc. “The Science of Influence Marketing”, 18 East 41st Street, New York, NY 10017, t: @appinions https://twitter.com/Appinions | g+: plus.google.com/appinions, w: http://www.appinions.com/

On Tue, Jun 25, 2013 at 8:58 AM, Jack Krupansky j...@basetechnology.com wrote:

??? Hadoop=HDFS. If the data is not in Hadoop/HDFS, just use the normal Solr indexing tools, including SolrCell and Data Import Handler, and possibly ManifoldCF.

--
Jack Krupansky

-----Original Message----- From: engy.morsy Sent: Tuesday, June 25, 2013 8:10 AM To: solr-user@lucene.apache.org Subject: Re: Solr indexer and Hadoop

Thank you Jack. So, I need to convert those nodes holding data to HDFS.

--
View this message in context: http://lucene.472066.n3.nabble.com/Solr-indexer-and-Hadoop-tp4072951p4073013.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Fast faceting over large number of distinct terms
Interesting solution. My concern is how to select the most frequent terms in the story_text field in a way that would make sense to the user. Only including the X most common non-stopword terms in a document could easily cause important patterns to be missed. There's a similar issue with only returning counts for terms in the top N documents matching a particular query.

Also, is there an efficient way to add term counts on the client side? I thought of using the TermVectorComponent to get document-level frequency counts and then using something like Hadoop to add them up. However, I couldn't find any documentation on using the results of a Solr query to feed a map-reduce operation.

--
David

On Wed, May 22, 2013 at 11:12 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote:

Here's a possibility: At index time, extract important terms (and/or phrases) from the story_text and store the top N of them in a separate field (which will be much smaller/shorter). Then facet on that. Or just retrieve it and manually parse and count in the client if that turns out to be faster. I did this in the previous decade before Solr was available and it worked well. I limited my counting to the top N (200?) hits. Otis

--
Solr & ElasticSearch Support http://sematext.com/

On Wed, May 22, 2013 at 10:54 PM, David Larochelle dlaroche...@cyber.law.harvard.edu wrote:

The goal of the system is to obtain data that can be used to generate word clouds so that users can quickly get a sense of the aggregate contents of all documents matching a particular query. For example, a user might want to see a word cloud of all documents discussing 'Iraq' in particular newspapers.

Faceting on story_text gives counts of individual words rather than entire text strings. I think this is because of the tokenization that happens automatically as part of the text_general type. I'm happy to look at alternatives to faceting, but I wasn't able to find one that provided aggregate word counts for just the documents matching a particular query rather than for individual documents or the entire index.

--
David

On Wed, May 22, 2013 at 10:32 PM, Brendan Grainger brendan.grain...@gmail.com wrote:

Hi David, Out of interest, what are you trying to accomplish by faceting over the story_text field? Is it generally the case that the story_text field will contain values that are repeated or that categorize your documents somehow? From your description ("story_text is used to store free-form text obtained by crawling newspapers and blogs") it doesn't seem that way, so I'm not sure faceting is what you want in this situation. Cheers, Brendan

On Wed, May 22, 2013 at 9:49 PM, David Larochelle dlaroche...@cyber.law.harvard.edu wrote:

I'm trying to quickly obtain cumulative word frequency counts over all documents matching a particular query. I'm running Solr 4.3.0 on a machine with 16GB of RAM. My index is 2.5GB and has around 350,000 documents. My schema includes the following fields:

  <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
  <field name="media_id" type="int" indexed="true" stored="true" required="true" multiValued="false" />
  <field name="story_text" type="text_general" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true" />

story_text is used to store free-form text obtained by crawling newspapers and blogs.
Running faceted searches with the fc or fcs methods fails with the error "Too many values for UnInvertedField faceting on field story_text":

  http://localhost:8983/solr/query?q=id:106714828_6621&facet=true&facet.limit=10&facet.pivot=publish_date,story_text&rows=0&facet.method=fcs

Running a faceted search with the 'enum' method succeeds but takes a very long time:

  http://localhost:8983/solr/query?q=includes:foobar&facet=true&facet.limit=100&facet.pivot=media_id,includes&facet.method=enum&rows=0
  http://localhost:8983/solr/query?q=includes:mccain&facet=true&facet.limit=100&facet.pivot=media_id,includes&facet.method=enum&rows=0

The frustrating thing is that even if the query only returns a few hundred documents, it still takes 10 minutes or longer to get the cumulative word count results. Eventually we're hoping to build a system that will return results in a few seconds and scale to hundreds of millions of documents. Is there any way to get this level of performance out of Solr/Lucene?

Thanks,

David

--
Brendan Grainger
www.kuripai.com
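The document-level route mentioned at the top of this thread might look like the following, assuming the TermVectorComponent is wired to the example solrconfig's /tvrh handler (parameter values are illustrative):

  http://localhost:8983/solr/tvrh?q=includes:foobar&tv=true&tv.fl=story_text&tv.tf=true&fl=id&rows=100&wt=json

Summing the returned tf values per term across documents yields the cumulative counts, at the cost of shipping every matching document's term vector to the client.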
Re: Fast faceting over large number of distinct terms
The goal of the system is to obtain data that can be used to generate word clouds so that users can quickly get a sense of the aggregate contents of all documents matching a particular query. For example, a user might want to see a word cloud of all documents discussing 'Iraq' in particular newspapers.

Faceting on story_text gives counts of individual words rather than entire text strings. I think this is because of the tokenization that happens automatically as part of the text_general type. I'm happy to look at alternatives to faceting, but I wasn't able to find one that provided aggregate word counts for just the documents matching a particular query rather than for individual documents or the entire index.

--
David

On Wed, May 22, 2013 at 10:32 PM, Brendan Grainger brendan.grain...@gmail.com wrote:

Hi David, Out of interest, what are you trying to accomplish by faceting over the story_text field? Is it generally the case that the story_text field will contain values that are repeated or that categorize your documents somehow? From your description ("story_text is used to store free-form text obtained by crawling newspapers and blogs") it doesn't seem that way, so I'm not sure faceting is what you want in this situation. Cheers, Brendan

On Wed, May 22, 2013 at 9:49 PM, David Larochelle dlaroche...@cyber.law.harvard.edu wrote:

I'm trying to quickly obtain cumulative word frequency counts over all documents matching a particular query. I'm running Solr 4.3.0 on a machine with 16GB of RAM. My index is 2.5GB and has around 350,000 documents. My schema includes the following fields:

  <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
  <field name="media_id" type="int" indexed="true" stored="true" required="true" multiValued="false" />
  <field name="story_text" type="text_general" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true" />

story_text is used to store free-form text obtained by crawling newspapers and blogs.

Running faceted searches with the fc or fcs methods fails with the error "Too many values for UnInvertedField faceting on field story_text":

  http://localhost:8983/solr/query?q=id:106714828_6621&facet=true&facet.limit=10&facet.pivot=publish_date,story_text&rows=0&facet.method=fcs

Running a faceted search with the 'enum' method succeeds but takes a very long time:

  http://localhost:8983/solr/query?q=includes:foobar&facet=true&facet.limit=100&facet.pivot=media_id,includes&facet.method=enum&rows=0
  http://localhost:8983/solr/query?q=includes:mccain&facet=true&facet.limit=100&facet.pivot=media_id,includes&facet.method=enum&rows=0

The frustrating thing is that even if the query only returns a few hundred documents, it still takes 10 minutes or longer to get the cumulative word count results. Eventually we're hoping to build a system that will return results in a few seconds and scale to hundreds of millions of documents. Is there any way to get this level of performance out of Solr/Lucene?

Thanks,

David

--
Brendan Grainger
www.kuripai.com
Aggregate word counts over a subset of documents
Is there a way to get aggregate word counts over a subset of documents? For example, given the following data:

  { "id": "1", "category": "cat1", "includes": "The green car." },
  { "id": "2", "category": "cat1", "includes": "The red car." },
  { "id": "3", "category": "cat2", "includes": "The black car." }

I'd like to be able to get total term frequency counts per category, e.g.:

  <category name="cat1">
    <lst name="the">2</lst>
    <lst name="car">2</lst>
    <lst name="green">1</lst>
    <lst name="red">1</lst>
  </category>
  <category name="cat2">
    <lst name="the">1</lst>
    <lst name="car">1</lst>
    <lst name="black">1</lst>
  </category>

I was initially hoping to do this within Solr, and I tried using the TermFrequencyComponent. This gives term frequencies for individual documents and term frequencies for the entire index but doesn't seem to help with subsets. For example, TermFrequencyComponent would tell me that "car" occurs 3 times over all documents in the index and 1 time in document 1, but not that it occurs 2 times over cat1 documents and 1 time over cat2 documents.

Is there a good way to use Solr/Lucene to gather aggregate results like this? I've been focusing on just using Solr with XML files, but I could certainly write Java code if necessary.

Thanks,

David
Re: Aggregate word counts over a subset of documents
Jason, Thanks so much for your suggestion. This seems to do what I need.

--
David

On Thu, May 16, 2013 at 3:59 PM, Jason Hellman jhell...@innoventsolutions.com wrote:

David, A pivot facet could possibly provide these results via the following syntax:

  facet.pivot=category,includes

We would presume that includes is a tokenized field, and thus a set of facet values would be rendered from the terms resulting from that tokenization. This would be nested in each category… and, of course, the entire set of documents considered for these facets is constrained by the current query. I think this maps to your requirement. Jason

On May 16, 2013, at 12:29 PM, David Larochelle dlaroche...@cyber.law.harvard.edu wrote:

Is there a way to get aggregate word counts over a subset of documents? For example, given the following data:

  { "id": "1", "category": "cat1", "includes": "The green car." },
  { "id": "2", "category": "cat1", "includes": "The red car." },
  { "id": "3", "category": "cat2", "includes": "The black car." }

I'd like to be able to get total term frequency counts per category, e.g.:

  <category name="cat1">
    <lst name="the">2</lst>
    <lst name="car">2</lst>
    <lst name="green">1</lst>
    <lst name="red">1</lst>
  </category>
  <category name="cat2">
    <lst name="the">1</lst>
    <lst name="car">1</lst>
    <lst name="black">1</lst>
  </category>

I was initially hoping to do this within Solr, and I tried using the TermFrequencyComponent. This gives term frequencies for individual documents and term frequencies for the entire index but doesn't seem to help with subsets. For example, TermFrequencyComponent would tell me that "car" occurs 3 times over all documents in the index and 1 time in document 1, but not that it occurs 2 times over cat1 documents and 1 time over cat2 documents.

Is there a good way to use Solr/Lucene to gather aggregate results like this? I've been focusing on just using Solr with XML files, but I could certainly write Java code if necessary.

Thanks,

David
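Concretely, on the sample data above, Jason's suggestion would be a request along these lines (parameter values illustrative; facet.limit=-1 removes the default cap on returned terms):

  http://localhost:8983/solr/select?q=*:*&rows=0&facet=true&facet.pivot=category,includes&facet.limit=-1

One caveat worth keeping in mind: facet counts are document counts, not term frequencies, so a term occurring twice within one document is counted once. On the sample data the two happen to coincide because no term repeats within a single document.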