Re: Date faceting - howto improve performance

2009-04-27 Thread Ning Li
You mean doc A and doc B will become one doc after adding index 2 to
index 1? I don't think this is currently supported at either the Lucene
level or the Solr level. If index 1 has m docs and index 2 has n docs,
index 1 will have m+n docs after adding index 2 to index 1. Documents
themselves are not modified by an index merge.
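A toy illustration of the distinction (plain Java collections, not the Lucene API): addIndexes performs a union that appends documents, whereas what is being asked for is a join on id that combines fields into one document.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model: a "document" is a field->value map, an "index" is a list of documents.
public class MergeSemantics {

    // What IndexWriter.addIndexes* does, conceptually: append all docs (m + n).
    static List<Map<String, String>> union(List<Map<String, String>> index1,
                                           List<Map<String, String>> index2) {
        List<Map<String, String>> out = new ArrayList<>(index1);
        out.addAll(index2);
        return out;
    }

    // What the join would look like: combine docs sharing the same id into one.
    // Lucene/Solr index merging does NOT do this.
    static Map<String, Map<String, String>> joinById(List<Map<String, String>> index1,
                                                     List<Map<String, String>> index2) {
        Map<String, Map<String, String>> joined = new LinkedHashMap<>();
        for (List<Map<String, String>> index : List.of(index1, index2)) {
            for (Map<String, String> doc : index) {
                joined.computeIfAbsent(doc.get("id"), k -> new HashMap<>()).putAll(doc);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<Map<String, String>> index1 =
                List.of(Map.of("id", "X", "title", "blog entry title"));
        List<Map<String, String>> index2 =
                List.of(Map.of("id", "X", "score", "1.2"));

        System.out.println(union(index1, index2).size());    // union: 2 docs
        System.out.println(joinById(index1, index2).size()); // join: 1 combined doc
    }
}
```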

Cheers,
Ning


On Sat, Apr 25, 2009 at 4:03 PM, Marcus Herou
marcus.he...@tailsweep.com wrote:
 Hmm, looking at the code for the index merger in Solr
 (org.apache.solr.update.DirectUpdateHandler2), I see that
 IndexWriter.addIndexesNoOptimize(dirs) is used (a union of indexes)?

 And the test class org.apache.solr.client.solrj.MergeIndexesExampleTestBase
 suggests:
 add doc A to index1 with id=AAA,name=core1
 add doc B to index2 with id=BBB,name=core2
 merge the two indexes into one index which then contains both docs.
 The resulting index will have 2 docs.

 Great, but in my case I think it should work more like this:

 add doc A to index1 with id=X,title=blog entry title,description=blog entry
 description
 add doc B to index2 with id=X,score=1.2
 somehow add index2 to index1 so that id=X has score=1.2 when searching in index1
 The resulting index should have 1 doc.

 So this is not really what I want, right?

 Sorry for being a smart-ass...

 Kindly

 //Marcus





 On Sat, Apr 25, 2009 at 5:10 PM, Marcus Herou
 marcus.he...@tailsweep.com wrote:

 Guys!

 Thanks for these insights. I think we will head for a Lucene-level merging
 strategy (two or more indexes).
 When merging, I guess the second index needs to have the same doc ids
 somehow. That is an internal id in Lucene, not that easy to get hold of,
 right?

 So you are saying that the Solr ExternalFileField + FunctionQuery stuff
 would not work very well performance-wise, or what do you mean?
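For reference, an ExternalFileField is declared in schema.xml roughly as below (the field and file names here are made up for illustration; values live in a plain text file in the index directory, keyed on the unique id field, and are usable from function queries):

```xml
<!-- schema.xml sketch; "rank" and "externalRank" are assumed names.
     Values come from a file named external_rank in the index directory,
     one "id=value" line per document, keyed on the unique id field. -->
<fieldType name="externalRank" class="solr.ExternalFileField"
           keyField="id" defVal="0" valType="float"/>
<field name="rank" type="externalRank" indexed="false" stored="false"/>
```

The appeal is that the file can be swapped and reloaded without reindexing, which is exactly the "attach a score to existing docs" use case, at the cost of the value only being usable in function queries, not as a regular searchable field.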

 I sure like bleeding edge :)

 Cheers dudes

 //Marcus





 On Sat, Apr 25, 2009 at 3:46 PM, Otis Gospodnetic 
 otis_gospodne...@yahoo.com wrote:


 I should emphasize that the ParallelReader trick I mentioned is something you'd do at
 the Lucene level, outside Solr, and then you'd just slip the modified index
 back into Solr.
 Or, if you like the bleeding edge, perhaps you can make use of Ning Li's
 Solr index merging functionality (patch in JIRA).


 Otis --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



 - Original Message 
  From: Otis Gospodnetic otis_gospodne...@yahoo.com
  To: solr-user@lucene.apache.org
  Sent: Saturday, April 25, 2009 9:41:45 AM
  Subject: Re: Date faceting - howto improve performance
 
 
  Yes, you could simply round the date; no need for a non-date-type field.
  Yes, you can add a field after the fact by making use of ParallelReader
  and merging (I don't recall the details; search the mailing list for
  ParallelReader and Andrzej). I remember he once provided a working recipe.
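Conceptually, ParallelReader presents two indexes as one by aligning them on internal doc ids: doc i of the second index contributes extra fields to doc i of the first. A toy sketch of that alignment (plain Java, not the Lucene API) shows why the parallel index must contain exactly the same documents in the same order:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of ParallelReader: two doc-aligned indexes combined by position.
public class ParallelZip {
    static List<Map<String, String>> zip(List<Map<String, String>> a,
                                         List<Map<String, String>> b) {
        if (a.size() != b.size())
            throw new IllegalArgumentException("indexes must be doc-aligned");
        List<Map<String, String>> out = new ArrayList<>();
        for (int i = 0; i < a.size(); i++) {
            Map<String, String> doc = new HashMap<>(a.get(i));
            doc.putAll(b.get(i)); // doc i of index b contributes its fields
            out.add(doc);
        }
        return out;
    }
}
```

This positional coupling is the catch: building the parallel index means re-adding documents in a controlled order so the internal doc ids line up, which is why it is a Lucene-level trick done outside Solr.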
 
 
  Otis --
  Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
 
 
 
  - Original Message 
   From: Marcus Herou
   To: solr-user@lucene.apache.org
   Sent: Saturday, April 25, 2009 6:54:02 AM
   Subject: Date faceting - howto improve performance
  
   Hi.
  
   One of our faceting use-cases:
   We are creating trend graphs of how many blog posts contain a certain
   term, grouped by day/week/year etc. with the nice DateMathParser
   functions.
  
   The performance degrades really fast and consumes a lot of memory,
   which forces an OOM from time to time.
   We think it is due to the fact that the cardinality of the field
   publishedDate in our index is huge, almost equal to the number of
   documents in the index.
  
   We need to address that...
  
   Some questions:
  
   1. Can a datefield have other date formats than the default of
   yyyy-MM-ddTHH:mm:ssZ ?
  
   2. We are thinking of adding a field to the index which has the format
   yyyy-MM-dd to reduce the cardinality. If that field can't be a date, it
   could perhaps be a string, but the question then is whether faceting
   can be used ?
  
   3. Since we now already have such a huge index, is there a way to add
   a field afterwards and apply it to all documents without actually
   reindexing the whole shebang ?
  
   4. If the field cannot be a string, can we just leave out the
   hour/minute/second information to reduce the cardinality and improve
   performance ? Example: 2009-01-01 00:00:00Z
  
   5. I am afraid that we need to reindex everything to get this to work
   (negates Q3). We currently have 8 shards; what would the most efficient
   way be to reindex the whole shebang ? Dump the entire database to disk
   (sigh), create many xml file splits and use curl in a
   random/hash(numServers) manner on them ?
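The truncation asked about in Q4 is a plain string/date transformation at indexing time; a sketch using only the JDK (Solr itself can do the equivalent with date math such as NOW/DAY):

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.TimeZone;

public class DayRounding {
    // Solr's canonical date format, and a day-granularity variant of it.
    private static final SimpleDateFormat FULL = utc("yyyy-MM-dd'T'HH:mm:ss'Z'");
    private static final SimpleDateFormat DAY = utc("yyyy-MM-dd'T00:00:00Z'");

    private static SimpleDateFormat utc(String pattern) {
        SimpleDateFormat f = new SimpleDateFormat(pattern);
        f.setTimeZone(TimeZone.getTimeZone("UTC"));
        return f;
    }

    /** Drop hour/minute/second so every post on the same day indexes the same value. */
    static String roundToDay(String timestamp) {
        try {
            return DAY.format(FULL.parse(timestamp));
        } catch (ParseException e) {
            throw new IllegalArgumentException("bad timestamp: " + timestamp, e);
        }
    }
}
```

With day granularity, the number of distinct values in the field is bounded by the number of days covered by the corpus rather than by the number of documents, which is what shrinks the faceting memory footprint.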
  
  
   Kindly
  
   //Marcus
  
  
  
  
  
  
  
   --
   Marcus Herou CTO and co-founder Tailsweep AB
   +46702561312
   marcus.he...@tailsweep.com
   http://www.tailsweep.com/
   http://blogg.tailsweep.com/




 --
 Marcus Herou CTO and co-founder Tailsweep AB
 +46702561312
 marcus.he...@tailsweep.com
 http://www.tailsweep.com/
 

Re: solr index size

2009-04-03 Thread Ning Li
Slightly different index sizes (even optimized) are normal - the same
document may get different internal docids in different runs. I don't
know why the numbers of terms are slightly different.


On Fri, Apr 3, 2009 at 7:21 PM, Jun Rao jun...@almaden.ibm.com wrote:


 Hi,

 We built a Solr index on a set of documents a few times. Each time, we did
 an optimize to reduce the index to a single segment. The index sizes are
 slightly different across different runs. Even though the documents are not
 inserted in the same order across runs, it seems to me that the final
 optimized index should be identical. Running CheckIndex showed that the
 number of docs and fields are the same, but the number of terms is
 slightly different. Does anyone know how to explain this? Thanks,

 Jun
 IBM Almaden Research Center
 K55/B1, 650 Harry Road, San Jose, CA  95120-6099

 jun...@almaden.ibm.com


Re: Merging Solr Indexes

2009-04-01 Thread Ning Li
There is a jira issue on supporting index merge:
https://issues.apache.org/jira/browse/SOLR-1051.
But I agree with Otis that you should go with a single index first.

Cheers,
Ning


On Wed, Apr 1, 2009 at 12:06 PM, Otis Gospodnetic
otis_gospodne...@yahoo.com wrote:

 Hi,

 Yes, you can write to the same index from multiple threads.  You still need 
 to keep track of the index size manually, whether you create 1 or N 
 indices/cores.  I'd go with a single index first.

 Otis
 --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



 - Original Message 
 From: vivek sar vivex...@gmail.com
 To: solr-user@lucene.apache.org
 Sent: Wednesday, April 1, 2009 4:26:04 AM
 Subject: Re: Merging Solr Indexes

 Thanks Otis. Could you write to the same core (same index) from multiple
 threads at the same time? I thought each writer would lock the index
 so others could not write at the same time. I'll try it though.

 Another reason for putting indexes in separate cores was to limit the
 index size. Our index can grow up to 50G a day, so I was hoping
 writing to smaller indexes would be faster in separate cores, and if
 needed I can merge them at a later point (like end of day). I want to
 keep daily cores. Isn't this a good idea? How else can I limit the
 index size (besides multiple instances or separate boxes)?

 Thanks,
 -vivek


 On Tue, Mar 31, 2009 at 8:28 PM, Otis Gospodnetic
 wrote:
 
  Let me start with 4)
  Have you tried simply using multiple threads to send your docs to a single
 Solr instance/core? You should get about the same performance as what you
 are trying with your approach below, but without the headache of managing
 multiple cores and index merging (not yet possible to do programmatically).
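The suggestion of multiple feeder threads against one core can be sketched with the JDK only (the Indexer class below is a stand-in for a single Solr core; with a real deployment each task would instead post its batch of documents to the same core over HTTP or SolrJ):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Stand-in for a single Solr core: one shared, thread-safe sink for documents.
class Indexer {
    private final List<String> docs = new ArrayList<>();
    synchronized void add(String doc) { docs.add(doc); }
    synchronized int size() { return docs.size(); }
}

public class ParallelFeed {
    static int feed(Indexer indexer, int threads, int docsPerThread) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            final int id = t;
            pool.submit(() -> {
                for (int i = 0; i < docsPerThread; i++)
                    indexer.add("doc-" + id + "-" + i); // all threads target ONE core
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return indexer.size();
    }
}
```

The point of the sketch is the shape, not the numbers: parallelism lives in the feeders, while the core stays single, so there is nothing to merge afterwards.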
 
  Otis
  --
  Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
 
 
 
  - Original Message 
  From: vivek sar
  To: solr-user@lucene.apache.org
  Sent: Tuesday, March 31, 2009 1:59:01 PM
  Subject: Merging Solr Indexes
 
  Hi,
 
  As part of speeding up the index process I'm thinking of spawning
  multiple threads which will write to different temporary SolrCores.
  Once the index process is done I want to merge all the indexes in the
  temporary cores into a master core. For example, if I want one SolrCore
  per day, then every index cycle I'll spawn 4 threads which will index
  into some temporary index, and once they are done I want to merge all
  these into the day core. My questions:
 
  1) I want to use the same schema and solrconfig.xml for all cores
  without duplicating them - how do I do that?
  2) How do I merge the temporary Solr cores into one master core
  programmatically? I've read the wiki on MergingSolrIndexes, but I
  want to do it programmatically (like in Lucene -
  writer.addIndexes(..)) once the temporary indices are done.
  3) Can I remove the temporary indices once the merge process is done?
  4) Is this the right strategy to speed up indexing?
 
  Thanks,
  -vivek
 
 




Lucene-based Distributed Index Leveraging Hadoop

2008-02-06 Thread Ning Li
a consistent view of the shards in the index. The results of a search
query include either all or none of a recent update to the index. The
details of the algorithm to accomplish this are omitted here, but the
basic flow is pretty simple.

After the Map/Reduce job to update the shards completes, the master will
tell each shard server to prepare the new version of the index. After all
the shard servers have responded affirmatively to the prepare message, the
new index is ready to be queried. An index client will then lazily learn
about the new index when it makes its next getShardLocations() call to the
master.

In essence, a lazy two-phase commit protocol is used, with prepare and
commit messages piggybacked on heartbeats. After a shard has switched to
the new index, the Lucene files in the old index that are no longer needed
can safely be deleted.
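A minimal sketch of the prepare/commit flow described above (plain Java with invented class names; in the real system these messages ride on heartbeats rather than direct method calls):

```java
import java.util.List;

// Toy shard server: can prepare a new index version, then switch on commit.
class ShardServer {
    private long servedVersion = 1;
    private long preparedVersion = -1;

    boolean prepare(long version) {  // phase 1: stage the new index files
        preparedVersion = version;
        return true;                 // ack the prepare message
    }

    void commit(long version) {      // phase 2: atomically switch over
        if (preparedVersion == version) servedVersion = version;
    }

    long servedVersion() { return servedVersion; }
}

public class TwoPhaseCommit {
    // The master commits only if EVERY shard server acked the prepare.
    static boolean publish(List<ShardServer> servers, long newVersion) {
        for (ShardServer s : servers)
            if (!s.prepare(newVersion)) return false; // any failure aborts
        for (ShardServer s : servers)
            s.commit(newVersion);
        return true;
    }
}
```

Because no server switches until every server has prepared, a query routed to any shard sees either the old version everywhere or the new version everywhere, which is the all-or-none visibility the text describes.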

ACHIEVING FAULT-TOLERANCE
We rely on the fault-tolerance of Map/Reduce to guarantee that an index
update will eventually succeed. All shards are stored in HDFS and can be
read by any shard server in a cluster. For a given shard, if one of its
shard servers dies, new search requests are handled by its surviving shard
servers. To ensure that there is always enough coverage for a shard, the
master will instruct other shard servers to take over the shards of a dead
shard server.

PERFORMANCE ISSUES
Currently, each shard server reads a shard directly from HDFS. Experiments
have shown that this approach does not perform very well, with HDFS causing
Lucene to slow down fairly dramatically (by well over 5x when data blocks
are accessed over the network). Consequently, we are exploring different
ways to
leverage the fault tolerance of HDFS and, at the same time, work around its
performance problems. One simple alternative is to add a local file system
cache on each shard server. Another alternative is to modify HDFS so that an
application has more control over where to store the primary and replicas of
an HDFS block. This feature may be useful for other HDFS applications (e.g.,
HBase). We would like to collaborate with other people who are interested in
adding this feature to HDFS.


Regards,
Ning Li


Re: Lucene-based Distributed Index Leveraging Hadoop

2008-02-06 Thread Ning Li
I work for IBM Research. I read the Rackspace article. Rackspace's Mailtrust
has a similar design. Happy to see an existing application on such a system.
Do they plan to open-source it? Is the AOL project an open source project?

On Feb 6, 2008 11:33 AM, Clay Webster [EMAIL PROTECTED] wrote:


 There seem to be a few other players in this space too.

 Are you from Rackspace?
 (http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data)

 AOL also has a Hadoop/Solr project going on.

 CNET does not have much brewing there, although Yonik and I had
 talked about it a bunch -- but that was long ago.

 --cw

 Clay Webster   tel:1.908.541.3724
 Associate VP, Platform Infrastructure http://www.cnet.com
 CNET, Inc. (Nasdaq:CNET) mailto:[EMAIL PROTECTED]




Re: Lucene-based Distributed Index Leveraging Hadoop

2008-02-06 Thread Ning Li
No. I'm curious too. :)

On Feb 6, 2008 11:44 AM, J. Delgado [EMAIL PROTECTED] wrote:

 I assume that Google also has distributed index over their
 GFS/MapReduce implementation. Any idea how they achieve this?

 J.D.



Re: Lucene-based Distributed Index Leveraging Hadoop

2008-02-06 Thread Ning Li
One main focus is to provide fault-tolerance in this distributed index
system. Correct me if I'm wrong, but I think SOLR-303 is focused on merging
results from multiple shards right now. We'd like to start an open source
project for a fault-tolerant distributed index system (or join one if it
already exists) if there is enough interest. Making Solr work on top of such
a system could be an important goal, and SOLR-303 would be a big part of it
in that case.

I should have made it clear that disjoint data sets are not a requirement of
the system.


On Feb 6, 2008 12:57 PM, Ian Holsman [EMAIL PROTECTED] wrote:

 Hi.
 AOL has a couple of projects going on in the lucene/hadoop/solr space,
 and we will be pushing more stuff out as we can. We don't have anything
 going with solr over hadoop at the moment.

 I'm not sure if this would be better than what SOLR-303 does, but you
 should have a look at the work being done there.

 One of the things you mentioned is that the data sets are disjoint.
 SOLR-303 doesn't require this, and allows us to have a document stored
 in multiple shards (with different caching/update characteristics).