Facetting by field then query

2014-03-26 Thread David Larochelle
I have the following schema

<field name="id" type="string" indexed="true" stored="true" required="true"
multiValued="false" />
<field name="media_id" type="int" indexed="true" stored="true"
required="false" multiValued="false" />
<field name="sentence" type="text_general" indexed="true" stored="true"
termVectors="true" termPositions="true" termOffsets="true" />


I'd like to be able to facet by a field and then by queries, i.e.:


facet_fields: {
  media_id: [
    1: { sentence:foo: 102410, sentence:bar: 29710 },
    2: { sentence:foo: 600, sentence:bar: 220 },
    3: { sentence:foo: 80, sentence:bar: 2330 }
  ]
}


However, when I try:
http://localhost:8983/solr/collection1/select?q=*:*&wt=json&indent=true&facet=true&facet.query=sentence%3Afoo&facet.query=sentence%3Abar&facet.field=media_id

the facet counts for the queries and media_id are listed separately rather
than hierarchically.

I realize that I could use 2 separate requests and programmatically combine
the results but would much prefer to use a single Solr request.
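
(For reference, a rough sketch of that two-request combination, written
against the SolrJ 4.x API; the core URL, and the idea of encoding the
hierarchy into compound facet.query strings, are my own assumptions rather
than anything Solr provides out of the box.)

import java.util.Map;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FacetByFieldThenQuery {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        // Request 1: discover the media_id values via a plain field facet.
        SolrQuery first = new SolrQuery("*:*");
        first.setRows(0);
        first.setFacet(true);
        first.addFacetField("media_id");
        QueryResponse r1 = solr.query(first);

        // Request 2: one facet.query per (media_id, sentence query) pair.
        SolrQuery second = new SolrQuery("*:*");
        second.setRows(0);
        second.setFacet(true);
        for (FacetField.Count c : r1.getFacetField("media_id").getValues()) {
            second.addFacetQuery("media_id:" + c.getName() + " AND sentence:foo");
            second.addFacetQuery("media_id:" + c.getName() + " AND sentence:bar");
        }
        QueryResponse r2 = solr.query(second);

        // getFacetQuery() is keyed by the facet.query strings themselves, so the
        // per-media_id hierarchy is reassembled by parsing the keys.
        for (Map.Entry<String, Integer> e : r2.getFacetQuery().entrySet()) {
            System.out.println(e.getKey() + " => " + e.getValue());
        }
    }
}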

Is there any way to do this in Solr?

Thanks in advance,


David


Using CachedSqlEntityProcessor with delta imports in DIH

2013-09-23 Thread David Larochelle
I'm trying to use the CachedSqlEntityProcessor on a child entity that also
has a delta query.

Full imports and delta imports of the parent entity work fine; however, delta
imports for the child entity have no effect. If I remove the
processor="CachedSqlEntityProcessor" attribute from the child entity, the
delta import works flawlessly, but the full import is very slow.
Here's my data-config.xml:


<dataConfig>
  <xi:include href="db-connection.xml"
      xmlns:xi="http://www.w3.org/2001/XInclude"/>
  <document>
    <entity name="story_sentences"
        pk="story_sentences_id"
        query="select story_sentences_id || '_ss' as id, 'ss' as field_type, * from story_sentences"
        deltaImportQuery="select story_sentences_id || '_ss' as id, 'ss' as field_type, * from story_sentences where story_sentences_id=${dataimporter.delta.id}"
        deltaQuery="SELECT story_sentences_id as id, story_sentences_id from story_sentences where db_row_last_updated &gt; '${dih.last_index_time}'">
      <entity name="media_tags_map"
          pk="media_tags_map_id"
          query="select tags_id as tags_id_media, * from media_tags_map"
          cacheKey="media_id"
          cacheLookup="story_sentences.media_id"
          processor="CachedSqlEntityProcessor"
          deltaQuery="select media_tags_map_id, media_id::varchar from media_tags_map where db_row_last_updated &gt; '${dih.last_index_time}'"
          parentDeltaQuery="select story_sentences_id as id from story_sentences where media_id = ${media_tags_map.media_id}">
      </entity>
    </entity>
  </document>
</dataConfig>


I need to be able to run delta imports based on the media_tags_map table in
addition to the story_sentences table.
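
(For reference, the delta imports described above are triggered with the
standard DataImportHandler command; the handler path below is an assumption
based on the default configuration.)

http://localhost:8983/solr/collection1/dataimport?command=delta-import&commit=true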

Any idea why delta imports for media_tags_map won't work when the
CachedSqlEntityProcessor is used?

I've searched extensively but can't find an example that uses both
CachedSqlEntityProcessor and deltaQuery on the sub-entity or any
explanation of why the above configuration won't work as expected.

--

Thanks,

David


Re: SolrCloud and Joins

2013-07-31 Thread David Larochelle
Thanks Walter,

Existing media sets will rarely change, but new media sets will be added
relatively frequently. (There is a many-to-many relationship between media
sets and media sources.) Given the size of the data, a new media set that only
includes 1% of the collection would include 6 million rows.

Our data is stored in a PostgreSQL database and imported using the
DataImportHandler. It takes around 3 days to fully import the data.
In the single-shard case, the nice thing about using joins is that the
media set to source mapping data could be updated using an hourly cron job
while the sentence data could be updated using a delta query.

The obvious alternative to joins is to add the media_sets_id to the
sentence data as a multi-valued field. We'll benchmark this, but my concern
is that importing the full data will take even longer and that there will
be no easy way to automatically update each affected row when a new media
set is created. (I could write a separate one-off query for
DataImportHandler each time a new media set is added, but this requires a
lot of manual interaction.)
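
(Purely as an illustration of the application logic in question, not
something suggested in this thread: a sketch of updating the affected
sentence documents with Solr 4.x atomic updates via SolrJ. It assumes every
field in the schema is stored, that media_sets_id is a multiValued field on
the sentence documents, and that the media_id values belonging to the new
set come from the database; the set and media ids below are made up.)

import java.util.Collections;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;
import org.apache.solr.common.SolrInputDocument;

public class AddMediaSetToSentences {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
        int newMediaSetsId = 42;            // hypothetical new media set
        int[] mediaIdsInSet = { 1, 2, 3 };  // hypothetical members of that set

        for (int mediaId : mediaIdsInSet) {
            int start = 0;
            final int rows = 1000;
            while (true) {
                // Page through the ids of the sentences belonging to this source.
                SolrQuery q = new SolrQuery("media_id:" + mediaId);
                q.setFields("id");
                q.setStart(start);
                q.setRows(rows);
                SolrDocumentList page = solr.query(q).getResults();

                for (SolrDocument d : page) {
                    // Atomic update: add the new set id without resending the whole document.
                    SolrInputDocument partial = new SolrInputDocument();
                    partial.addField("id", d.getFieldValue("id"));
                    partial.addField("media_sets_id",
                        Collections.singletonMap("add", newMediaSetsId));
                    solr.add(partial);
                }

                start += rows;
                if (start >= page.getNumFound()) break;
            }
        }
        solr.commit();
    }
}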

Does SolrCloud really not have a simple way to specify which shard to put a
document on? I'm considering randomly generating document ID prefixes and
then taking their murmurhash to determine what shards they correspond to. I
could then explicitly send documents to a particular shard by specifying a
document ID prefix. However, this seems like a hackish approach. Is there a
better way?
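
(For what it's worth, a minimal sketch of the prefix idea with SolrJ 4.x and
the default compositeId router; the ZooKeeper address and the prefix string
are placeholders. Mapping a prefix's murmurhash onto a specific shard's hash
range would still have to be worked out separately, as described above.)

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class PrefixRoutingExample {
    public static void main(String[] args) throws Exception {
        CloudSolrServer solr = new CloudSolrServer("zkhost1:2181,zkhost2:2181");
        solr.setDefaultCollection("collection1");

        SolrInputDocument doc = new SolrInputDocument();
        // With the compositeId router, everything sharing the prefix before "!"
        // hashes to the same shard.
        doc.addField("id", "prefixA!12345");
        doc.addField("media_id", 1);
        doc.addField("sentence", "The green car.");
        solr.add(doc);
        solr.commit();
    }
}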



On Mon, Jul 29, 2013 at 12:45 PM, Walter Underwood wun...@wunderwood.org wrote:

 A join may seem clean, but it will be slow and (currently) doesn't work in
 a cluster.

 You find all the sentences in a media set by searching for that set id and
 requesting only the sentence_id (yes, you need that). Then you reindex
 them. With small documents like this, it is probably fairly fast.

 If you can't estimate how often the media sets will change or the size of
 the changes, then you aren't ready to choose a design.

 wunder

 On Jul 29, 2013, at 8:41 AM, David Larochelle wrote:

  We'd like to be able to easily update the media set to source mapping.
 I'm
  concerned that if we store the media_sets_id in the sentence documents,
 it
  will be very difficult to add additional media set to source mapping. I
  imagine that adding a new media set would either require reimporting all
  600 million documents or writing complicated application logic to find
 out
  which sentences to update. Hence joins seem like a cleaner solution.
 
  --
 
  David
 
 
  On Mon, Jul 29, 2013 at 11:22 AM, Walter Underwood 
 wun...@wunderwood.org wrote:
 
  Denormalize. Add media_set_id to each sentence document. Done.
 
  wunder
 
  On Jul 29, 2013, at 7:58 AM, David Larochelle wrote:
 
  I'm setting up SolrCloud with around 600 million documents. The basic
  structure of each document is:
 
  stories_id: integer, media_id: integer, sentence: text_en
 
  We have a number of stories from different media and we treat each
  sentence
  as a separate document because we need to run sentence level analytics.
 
  We also have a concept of groups or sets of sources. We've imported
 this
  media source to media sets mapping into Solr using the following
  structure:
 
  media_id_inner: integer, media_sets_id: integer
 
  For the single node case, we're able to filter our sources by
  media_set_id
  using a join query like the following:
 
 
 
  http://localhost:8983/solr/select?q={!join+from=media_id_inner+to=media_id}media_sets_id:1
 
 
  However, this does not work correctly with SolrCloud. The problem is
 that
  the join query is performed separately on each of the shards and no
 shard
  has the complete media set to source mapping data. So SolrCloud returns
  incomplete results.
 
  Since the complete media set to source mapping data is comparatively
  small
  (~50,000 rows), I would like to replicate it on every shard. So that
 the
  results of the individual join queries on separate shards would  be
  equivalent to performing the same query on a single shard system.
 
   However, I can't figure out how to replicate documents on separate
  shards. The compositeID router has the ability to colocate documents
  based
  on a prefix in the document ID but this isn't what I need. What I would
  like is some way to either have the media set to source data replicated
  on
  every shard or to be able to explicitly upload this data to the
  individual
  shards. (For the rest of the data I like the compositeID autorouting.)
 
  Any suggestions?
 
  --
 
  Thanks,
 
 
  David
 
  --
  Walter Underwood
  wun...@wunderwood.org
 
 
 
 

 --
 Walter Underwood
 wun...@wunderwood.org






SolrCloud and Joins

2013-07-29 Thread David Larochelle
I'm setting up SolrCloud with around 600 million documents. The basic
structure of each document is:

stories_id: integer, media_id: integer, sentence: text_en

We have a number of stories from different media and we treat each sentence
as a separate document because we need to run sentence level analytics.

We also have a concept of groups or sets of sources. We've imported this
media source to media sets mapping into Solr using the following structure:

media_id_inner: integer, media_sets_id: integer

For the single node case, we're able to filter our sources by media_set_id
using a join query like the following:

http://localhost:8983/solr/select?q={!join+from=media_id_inner+to=media_id}media_sets_id:1

However, this does not work correctly with SolrCloud. The problem is that
the join query is performed separately on each of the shards and no shard
has the complete media set to source mapping data. So SolrCloud returns
incomplete results.

Since the complete media set to source mapping data is comparatively small
(~50,000 rows), I would like to replicate it on every shard, so that the
results of the individual join queries on separate shards would be
equivalent to performing the same query on a single-shard system.

However, I can't figure out how to replicate documents on separate
shards. The compositeID router has the ability to colocate documents based
on a prefix in the document ID, but this isn't what I need. What I would
like is some way to either have the media set to source data replicated on
every shard or to be able to explicitly upload this data to the individual
shards. (For the rest of the data I like the compositeID autorouting.)

Any suggestions?

--

Thanks,


David


Re: SolrCloud and Joins

2013-07-29 Thread David Larochelle
We'd like to be able to easily update the media set to source mapping. I'm
concerned that if we store the media_sets_id in the sentence documents, it
will be very difficult to add additional media set to source mappings. I
imagine that adding a new media set would either require reimporting all
600 million documents or writing complicated application logic to find out
which sentences to update. Hence joins seem like a cleaner solution.

--

David


On Mon, Jul 29, 2013 at 11:22 AM, Walter Underwood wun...@wunderwood.org wrote:

 Denormalize. Add media_set_id to each sentence document. Done.

 wunder

 On Jul 29, 2013, at 7:58 AM, David Larochelle wrote:

  I'm setting up SolrCloud with around 600 million documents. The basic
  structure of each document is:
 
  stories_id: integer, media_id: integer, sentence: text_en
 
  We have a number of stories from different media and we treat each
 sentence
  as a separate document because we need to run sentence level analytics.
 
  We also have a concept of groups or sets of sources. We've imported this
  media source to media sets mapping into Solr using the following
 structure:
 
  media_id_inner: integer, media_sets_id: integer
 
  For the single node case, we're able to filter our sources by
 media_set_id
  using a join query like the following:
 
 
 http://localhost:8983/solr/select?q={!join+from=media_id_inner+to=media_id}media_sets_id:1
 
 
  However, this does not work correctly with SolrCloud. The problem is that
  the join query is performed separately on each of the shards and no shard
  has the complete media set to source mapping data. So SolrCloud returns
  incomplete results.
 
  Since the complete media set to source mapping data is comparatively
 small
  (~50,000 rows), I would like to replicate it on every shard. So that the
  results of the individual join queries on separate shards would  be
  equivalent to performing the same query on a single shard system.
 
  However, I can't figure out how to replicate documents on separate
  shards. The compositeID router has the ability to colocate documents
 based
  on a prefix in the document ID but this isn't what I need. What I would
  like is some way to either have the media set to source data replicated
 on
  every shard or to be able to explicitly upload this data to the
 individual
  shards. (For the rest of the data I like the compositeID autorouting.)
 
  Any suggestions?
 
  --
 
  Thanks,
 
 
  David

 --
 Walter Underwood
 wun...@wunderwood.org






Re: Solr indexer and Hadoop

2013-06-26 Thread David Larochelle
Pardon my unfamiliarity with the Solr development process.

Now that it's in the trunk, will it appear in the next 4.X release?

--

David



On Wed, Jun 26, 2013 at 9:42 AM, Erick Erickson erickerick...@gmail.com wrote:

 Well, it's been merged into trunk according to the comments, so

 Try it on trunk, help with any bugs, buy Mark beer.

 And, most especially, document up what it takes to make it work.
 Mark is juggling a zillion things and I'm sure he'd appreciate any
 help there.

 Erick

 On Tue, Jun 25, 2013 at 11:25 AM, Michael Della Bitta
 michael.della.bi...@appinions.com wrote:
  zomghowcanihelp? :)
 
  Michael Della Bitta
 
  Applications Developer
 
  o: +1 646 532 3062  | c: +1 917 477 7906
 
  appinions inc.
 
  “The Science of Influence Marketing”
 
  18 East 41st Street
 
  New York, NY 10017
 
  t: @appinions https://twitter.com/Appinions | g+:
  plus.google.com/appinions
  w: appinions.com http://www.appinions.com/
 
 
  On Tue, Jun 25, 2013 at 2:08 PM, Erick Erickson erickerick...@gmail.com
 wrote:
 
  You might be interested in following:
  https://issues.apache.org/jira/browse/SOLR-4916
 
  Best
  Erick
 
  On Tue, Jun 25, 2013 at 7:28 AM, Michael Della Bitta
  michael.della.bi...@appinions.com wrote:
   Jack,
  
   Sorry, but I don't agree that it's that cut and dried. I've very
   successfully worked with terabytes of data in Hadoop that was stored
 on
  an
   Isilon mounted via NFS, for example. In cases like this, you're using
    MapReduce purely for its execution model (which existed far before
  Hadoop
   and HDFS ever did).
  
  
   Michael Della Bitta
  
   Applications Developer
  
   o: +1 646 532 3062  | c: +1 917 477 7906
  
   appinions inc.
  
   “The Science of Influence Marketing”
  
   18 East 41st Street
  
   New York, NY 10017
  
   t: @appinions https://twitter.com/Appinions | g+:
   plus.google.com/appinions
   w: appinions.com http://www.appinions.com/
  
  
   On Tue, Jun 25, 2013 at 8:58 AM, Jack Krupansky 
 j...@basetechnology.com
  wrote:
  
   ???
  
   Hadoop=HDFS
  
   If the data is not in Hadoop/HDFS, just use the normal Solr indexing
   tools, including SolrCell and Data Import Handler, and possibly
  ManifoldCF.
  
  
   -- Jack Krupansky
  
   -Original Message- From: engy.morsy
   Sent: Tuesday, June 25, 2013 8:10 AM
   To: solr-user@lucene.apache.org
   Subject: Re: Solr indexer and Hadoop
  
  
   Thank you Jack. So, I need to convert those nodes holding data to
 HDFS.
  
  
  
   --
    View this message in context: http://lucene.472066.n3.nabble.com/Solr-indexer-and-Hadoop-tp4072951p4073013.html
  
   Sent from the Solr - User mailing list archive at Nabble.com.
  
 



Re: Fast faceting over large number of distinct terms

2013-05-23 Thread David Larochelle
Interesting solution. My concern is how to select the most frequent terms
in the story_text field in a way that would make sense to the user. Only
including the X most common non-stopword terms in a document could easily
cause important patterns to be missed. There's a similar issue with only
returning counts for terms in the top N documents matching a particular
query.

Also, is there an efficient way to add up term counts on the client side? I
thought of using the TermVectorComponent to get document-level frequency
counts and then using something like Hadoop to add them up. However, I
couldn't find any documentation on using the results of a Solr query to
feed a MapReduce operation.

--

David


On Wed, May 22, 2013 at 11:12 PM, Otis Gospodnetic 
otis.gospodne...@gmail.com wrote:

 Here's a possibility:

 At index time extract important terms (and/or phrases) from this
 story_text and store top N of them in a separate field (which will be
 much smaller/shorter).  Then facet on that.  Or just retrieve it and
 manually parse and count in the client if that turns out to be faster.
 I did this in the previous decade before Solr was available and it
 worked well.  I limited my counting to top N (200?) hits.

 Otis
 --
 Solr  ElasticSearch Support
 http://sematext.com/





 On Wed, May 22, 2013 at 10:54 PM, David Larochelle
 dlaroche...@cyber.law.harvard.edu wrote:
  The goal of the system is to obtain data that can be used to generate
 word
  clouds so that users can quickly get a sense of the aggregate contents of
  all documents matching a particular query. For example, a user might want
  to see a word cloud of all documents discussing 'Iraq' in a particular
 new
  papers.
 
  Faceting on story_text gives counts of individual words rather than
 entire
  text strings. I think this is because of the tokenization that happens
  automatically as part of the text_general type. I'm happy to look at
  alternatives to faceting but I wasn't able to find one that
  provided aggregate word counts for just the documents matching a
 particular
  query rather than an individual documents  or the entire index.
 
  --
 
  David
 
 
  On Wed, May 22, 2013 at 10:32 PM, Brendan Grainger 
  brendan.grain...@gmail.com wrote:
 
  Hi David,
 
  Out of interest, what are you trying to accomplish by faceting over the
  story_text field? Is it generally the case that the story_text field
 will
  contain values that are repeated or categorize your documents somehow?
   From your description: story_text is used to store free form text
  obtained by crawling new papers and blogs, it doesn't seem that way, so
  I'm not sure faceting is what you want in this situation.
 
  Cheers,
  Brendan
 
 
  On Wed, May 22, 2013 at 9:49 PM, David Larochelle 
  dlaroche...@cyber.law.harvard.edu wrote:
 
   I'm trying to quickly obtain cumulative word frequency counts over all
   documents matching a particular query.
  
   I'm running in Solr 4.3.0 on a machine with 16GB of ram. My index is
 2.5
  GB
   and has around ~350,000 documents.
  
   My schema includes the following fields:
  
    <field name="id" type="string" indexed="true" stored="true" required="true"
    multiValued="false" />
    <field name="media_id" type="int" indexed="true" stored="true"
    required="true" multiValued="false" />
    <field name="story_text" type="text_general" indexed="true" stored="true"
    termVectors="true" termPositions="true" termOffsets="true" />
  
  
   story_text is used to store free form text obtained by crawling new
  papers
   and blogs.
  
   Running faceted searches with the fc or fcs methods fails with the
 error
   Too many values for UnInvertedField faceting on field story_text
  
  
 
  http://localhost:8983/solr/query?q=id:106714828_6621&facet=true&facet.limit=10&facet.pivot=publish_date,story_text&rows=0&facet.method=fcs
  
   Running faceted search with the 'enum' method succeeds but takes a
 very
   long time.
  
  
 
  http://localhost:8983/solr/query?q=includes:foobar&facet=true&facet.limit=100&facet.pivot=media_id,includes&facet.method=enum&rows=0
   
  
 
  http://localhost:8983/solr/query?q=includes:mccain&facet=true&facet.limit=100&facet.pivot=media_id,includes&facet.method=enum&rows=0
   
  
   The frustrating thing is even if the query only returns a few hundred
   documents, it still takes 10 minutes or longer to get the cumulative
 word
   count results.
  
   Eventually we're hoping to build a system that will return results in
 a
  few
   seconds and scale to hundreds of millions of documents.
   Is there anyway to get this level of performance out of Solr/Lucene?
  
   Thanks,
  
   David
  
 
 
 
  --
  Brendan Grainger
  www.kuripai.com
 



Re: Fast faceting over large number of distinct terms

2013-05-22 Thread David Larochelle
The goal of the system is to obtain data that can be used to generate word
clouds so that users can quickly get a sense of the aggregate contents of
all documents matching a particular query. For example, a user might want
to see a word cloud of all documents discussing 'Iraq' in a particular
newspaper.

Faceting on story_text gives counts of individual words rather than entire
text strings. I think this is because of the tokenization that happens
automatically as part of the text_general type. I'm happy to look at
alternatives to faceting, but I wasn't able to find one that
provided aggregate word counts for just the documents matching a particular
query rather than for individual documents or the entire index.

--

David


On Wed, May 22, 2013 at 10:32 PM, Brendan Grainger 
brendan.grain...@gmail.com wrote:

 Hi David,

 Out of interest, what are you trying to accomplish by faceting over the
 story_text field? Is it generally the case that the story_text field will
 contain values that are repeated or categorize your documents somehow?
  From your description: story_text is used to store free form text
 obtained by crawling new papers and blogs, it doesn't seem that way, so
 I'm not sure faceting is what you want in this situation.

 Cheers,
 Brendan


 On Wed, May 22, 2013 at 9:49 PM, David Larochelle 
 dlaroche...@cyber.law.harvard.edu wrote:

  I'm trying to quickly obtain cumulative word frequency counts over all
  documents matching a particular query.
 
  I'm running in Solr 4.3.0 on a machine with 16GB of ram. My index is 2.5
 GB
  and has around ~350,000 documents.
 
  My schema includes the following fields:
 
  <field name="id" type="string" indexed="true" stored="true" required="true"
  multiValued="false" />
  <field name="media_id" type="int" indexed="true" stored="true"
  required="true" multiValued="false" />
  <field name="story_text" type="text_general" indexed="true" stored="true"
  termVectors="true" termPositions="true" termOffsets="true" />
 
 
  story_text is used to store free form text obtained by crawling new
 papers
  and blogs.
 
  Running faceted searches with the fc or fcs methods fails with the error
  Too many values for UnInvertedField faceting on field story_text
 
 
 http://localhost:8983/solr/query?q=id:106714828_6621&facet=true&facet.limit=10&facet.pivot=publish_date,story_text&rows=0&facet.method=fcs
 
  Running faceted search with the 'enum' method succeeds but takes a very
  long time.
 
 
 http://localhost:8983/solr/query?q=includes:foobar&facet=true&facet.limit=100&facet.pivot=media_id,includes&facet.method=enum&rows=0
  
 
 http://localhost:8983/solr/query?q=includes:mccain&facet=true&facet.limit=100&facet.pivot=media_id,includes&facet.method=enum&rows=0
  
 
  The frustrating thing is even if the query only returns a few hundred
  documents, it still takes 10 minutes or longer to get the cumulative word
  count results.
 
  Eventually we're hoping to build a system that will return results in a
 few
  seconds and scale to hundreds of millions of documents.
  Is there anyway to get this level of performance out of Solr/Lucene?
 
  Thanks,
 
  David
 



 --
 Brendan Grainger
 www.kuripai.com



Aggregate word counts over a subset of documents

2013-05-16 Thread David Larochelle
Is there a way to get aggregate word counts over a subset of documents?

For example given the following data:

  {
    "id": 1,
    "category": "cat1",
    "includes": "The green car."
  },
  {
    "id": 2,
    "category": "cat1",
    "includes": "The red car."
  },
  {
    "id": 3,
    "category": "cat2",
    "includes": "The black car."
  }

I'd like to be able to get total term frequency counts per category. e.g.

<category name="cat1">
   <lst name="the">2</lst>
   <lst name="car">2</lst>
   <lst name="green">1</lst>
   <lst name="red">1</lst>
</category>
<category name="cat2">
   <lst name="the">1</lst>
   <lst name="car">1</lst>
   <lst name="black">1</lst>
</category>

I was initially hoping to do this within Solr and I tried using the
TermFrequencyComponent. This gives term frequencies for individual
documents and term frequencies for the entire index but doesn't seem to
help with subsets. For example, TermFrequencyComponent would tell me that
car occurs 3 times over all documents in the index and 1 time in document 1
but not that it occurs 2 times over cat1 documents and 1 time over cat2
documents.

Is there a good way to use Solr/Lucene to gather aggregate results like
this? I've been focusing on just using Solr with XML files but I could
certainly write Java code if necessary.

Thanks,

David


Re: Aggregate word counts over a subset of documents

2013-05-16 Thread David Larochelle
Jason,

Thanks so much for your suggestion. This seems to do what I need.
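
For the archives, the request ends up looking something like the following
(the exact parameters are my own reconstruction rather than anything quoted
in the thread):

http://localhost:8983/solr/collection1/select?q=*:*&rows=0&facet=true&facet.pivot=category,includes&facet.limit=-1&facet.mincount=1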

--

David

On Thu, May 16, 2013 at 3:59 PM, Jason Hellman 
jhell...@innoventsolutions.com wrote:

 David,

 A Pivot Facet could possibly provide these results by the following syntax:

 facet.pivot=category,includes

 We would presume that includes is a tokenized field and thus a set of
  facet values would be rendered from the terms resulting from that
 tokenization.  This would be nested in each category…and, of course, the
 entire set of documents considered for these facets is constrained by the
 current query.

 I think this maps to your requirement.

 Jason

 On May 16, 2013, at 12:29 PM, David Larochelle 
 dlaroche...@cyber.law.harvard.edu wrote:

  Is there a way to get aggregate word counts over a subset of documents?
 
  For example given the following data:
 
   {
     "id": 1,
     "category": "cat1",
     "includes": "The green car."
   },
   {
     "id": 2,
     "category": "cat1",
     "includes": "The red car."
   },
   {
     "id": 3,
     "category": "cat2",
     "includes": "The black car."
   }
 
  I'd like to be able to get total term frequency counts per category. e.g.
 
  <category name="cat1">
    <lst name="the">2</lst>
    <lst name="car">2</lst>
    <lst name="green">1</lst>
    <lst name="red">1</lst>
  </category>
  <category name="cat2">
    <lst name="the">1</lst>
    <lst name="car">1</lst>
    <lst name="black">1</lst>
  </category>
 
  I was initially hoping to do this within Solr and I tried using the
  TermFrequencyComponent. This gives term frequencies for individual
  documents and term frequencies for the entire index but doesn't seem to
  help with subsets. For example, TermFrequencyComponent would tell me that
  car occurs 3 times over all documents in the index and 1 time in
 document 1
  but not that it occurs 2 times over cat1 documents and 1 time over cat2
  documents.
 
  Is there a good way to use Solr/Lucene to gather aggregate results like
  this? I've been focusing on just using Solr with XML files but I could
  certainly write Java code if necessary.
 
  Thanks,
 
  David