Re: Fastest way to import big amount of documents in SolrCloud

2014-05-02 Thread Alexander Kanarsky
If you build your index in Hadoop, read this (it is about Cloudera
Search, but to my understanding it should also work with the Solr Hadoop contrib
since 4.7):
http://www.cloudera.com/content/cloudera-content/cloudera-docs/Search/latest/Cloudera-Search-User-Guide/csug_batch_index_to_solr_servers_using_golive.html
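
If you go the SolrJ route instead, batching documents and feeding them through
CloudSolrServer (possibly from several client threads over disjoint chunks of the
input) usually gets you most of the way there. A minimal sketch, assuming Solr 4.x
SolrJ, with the ZooKeeper hosts, collection name, field names and batch size as
placeholders:

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BulkImport {
    public static void main(String[] args) throws Exception {
        // placeholders: ZooKeeper ensemble and target collection
        CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        server.setDefaultCollection("collection1");

        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        for (int i = 0; i < 1000000; i++) {       // replace with your real data source
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", String.valueOf(i));
            doc.addField("text", "document body " + i);
            batch.add(doc);
            if (batch.size() == 1000) {            // send documents in chunks, not one by one
                server.add(batch);
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            server.add(batch);
        }
        server.commit();                           // one commit at the end, not per batch
        server.shutdown();
    }
}

Running several such clients against different slices of the data is a common way to
use all the nodes in the cluster; just keep a single commit (and, if needed, an
optimize) at the very end.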


On Thu, May 1, 2014 at 1:47 PM, Costi Muraru costimur...@gmail.com wrote:

 Hi guys,

 What would you say is the fastest way to import data into SolrCloud?
 Our use case: each day we do a single import of a large number of documents.

 Should we use SolrJ/DataImportHandler/something else? Or perhaps there is a bulk
 import feature in Solr? I came upon this promising link:
 http://wiki.apache.org/solr/UpdateCSV
 Any idea how UpdateCSV compares performance-wise with
 SolrJ/DataImportHandler?

 If we use SolrJ, should we split the data into chunks and start multiple clients at
 once? That way we could perhaps take advantage of the multiple servers
 in the SolrCloud configuration.

 Either way, after the import is finished, should we do an optimize or a
 commit or none (
 http://wiki.solarium-project.org/index.php/V1:Optimize_command)?

 Any tips and tricks to perform this process the right way are gladly
 appreciated.

 Thanks,
 Costi



Re: Production Release process with Solr 3.5 implementation.

2012-11-01 Thread Alexander Kanarsky
Why not change the order to this:

3. Upgrade Solr Schema (Master) Replication is disabled.
4. Start Index Rebuild. (if step 3)
1. Pull up Maintenance Pages
2. Upgrade DB
5. Upgrade UI code
6. Index build complete ? Start Replication
7. Verify UI and Drop Maintenance Pages.

So your slaves will continue to serve traffic until you're done with the
master index. Or does the master index also import from the same database?


On Thu, Nov 1, 2012 at 4:08 PM, Shawn Heisey s...@elyograg.org wrote:

 On 11/1/2012 2:46 PM, adityab wrote:

 1. Pull up Maintenance Pages
 2. Upgrade DB
 3. Upgrade Solr Schema (Master) Replication is disabled.
 4. Start Index Rebuild. (if step 3)
 5. Upgrade UI code
 6. Index build complete ? Start Replication
 7. Verify UI and Drop Maintenance Pages.

 As #4 takes a couple of hours, compared to all the other steps which run within
 a few minutes, we need to have downtime for that duration.


 What I do is a little bit different. I have two completely independent
 copies of my index, no replication.  The build system maintains each copy
 simultaneously, including managing independent rebuilds.  I used to run two
 copies of my build system, but I recently made it so that one copy manages
 multiple indexes.

 If I need to do an upgrade, I will first test everything out as much as
 possible on my test environment.  Then I will take one copy of my index
 offline, perform the required changes, and reindex.  The UI continues to
 send queries to the online index that hasn't been changed.  At that point,
 we initiate the upgrade sequence you've described, except that instead of
 step 4 taking a few hours, we just have to redirect traffic to the brand
 new index copy.  If everything works out, we then repeat with the other
 index copy.  If it doesn't work out, we revert everything and go back to
 the original index.

 Also, every index has a build core and a live core.  I currently maintain
 the same config in both cores, but it would be possible to change the
 config in the build core, reload or restart Solr, do your reindex, and
 simply do a core swap, which is almost instantaneous.  If you are doing
 replication, swapping cores on the master initiates full replication to the
 slave. Excerpt from my solr.xml:

 <core instanceDir="cores/s0_1/" name="s0live"
 dataDir="../../data/s0_1/"/>
 <core instanceDir="cores/s0_0/" name="s0build"
 dataDir="../../data/s0_0/"/>

 Thanks,
 Shawn

 P.S. Actually, I have three full copies of my index now -- I recently
 upgraded my test server so it has enough disk capacity to hold my entire
 index.  The test server runs a local copy of the build system which keeps
 it up to date with the two production copies.
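
As an aside on the core swap Shawn describes: it is just a CoreAdmin HTTP call, so
it can be scripted. A minimal sketch of triggering it from Java, with the host, port
and core names as placeholders matching the solr.xml excerpt above:

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class SwapCores {
    public static void main(String[] args) throws Exception {
        // placeholders: adjust host, port and core names for your installation
        URL url = new URL("http://localhost:8983/solr/admin/cores"
                + "?action=SWAP&core=s0build&other=s0live");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        int status = conn.getResponseCode();   // issuing the request performs the swap
        InputStream in = conn.getInputStream();
        while (in.read() != -1) {
            // drain the status response Solr sends back
        }
        in.close();
        System.out.println("SWAP returned HTTP " + status);
    }
}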




Re: 400 MB Fields

2011-06-08 Thread Alexander Kanarsky
Otis,

Not sure about Solr, but with Lucene it was certainly doable. I
have seen fields way bigger than 400 MB indexed, sometimes having a large set
of unique terms as well (think something like a log file with lots of
alphanumeric tokens, a couple of gigs in size). While indexing and
querying such things, the I/O, naturally, could easily become a
bottleneck.

-Alexander


Re: copyField generates multiple values encountered for non multiValued field

2011-05-31 Thread Alexander Kanarsky
Alexander,

I saw the same behavior in 1.4.x with non-multiValued fields when
updating a document in the index (i.e. obtaining the doc from the
index, modifying some fields, and then adding the document with the same
id back). I do not know what causes this, but it looks like the
copyField logic completely bypasses the multiValued check and just
adds the value in addition to whatever is already there (instead of
replacing the value). So yes, Solr renders itself into an incorrect state
then (note that the index is still correct from Lucene's
standpoint).
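
A common workaround is to drop the copyField destination from the retrieved document
before re-adding it, so the copyField directive repopulates it cleanly on the way in.
A rough SolrJ sketch, where "field2" stands in for whatever your copyField target is:

import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class ReaddWithoutCopyFieldTarget {
    // Copies a fetched document into a new SolrInputDocument, skipping the
    // copyField destination so re-adding does not stack a second value
    // onto the single-valued field.
    static SolrInputDocument stripCopyFieldTarget(SolrDocument fetched) {
        SolrInputDocument doc = new SolrInputDocument();
        for (String name : fetched.getFieldNames()) {
            if ("field2".equals(name)) {
                continue; // let copyField repopulate it on re-add
            }
            for (Object value : fetched.getFieldValues(name)) {
                doc.addField(name, value);
            }
        }
        return doc;
    }
}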

-Alexander

 


On Wed, 2011-05-25 at 16:50 +0200, Alexander Golubowitsch wrote:
 Dear list,
  
 hope somebody can help me understand/avoid this.
  
 I am sending an add request with allowDuplicates=false to a Solr 1.4.1
 instance.
 This is for debugging purposes, so I am sending the exact same data that are
 already stored in Solr's index.
 I am using the PHP PECL libraries, which fail completely in giving me any
 hint on what goes wrong.
 
 Only sending the same add request again gives me a proper
 SolrClientException that hints:
  
 ERROR: [288400] multiple values encountered for non multiValued field
 field2 [fieldvalue, fieldvalue]
 
 The scenario:
 - field1 is implicitly single value, type text, indexed and stored
 - field2 is generated via a copyField directive in schema.xml, implicitly
 single value, type string, indexed and stored
 
 What appears to happen:
 - On the first add (SolrClient::addDocuments(array(SolrInputDocument
 theDocument))), regular fields like field1 get overwritten as intended
 - field2, defined with a copyField, but still single value, gets
 _appended_ instead
 - When I retrieve the updated document in a query and try to add it again,
 it won't let me because of the inconsistent multi-value state
 - The PECL library, in addition, appears to hit some internal exception
 (that it doesn't handle properly) when encountering multiple values for a
 single value field. That gives me zero results querying a set that includes
 the document via PHP, while the document can be retrieved properly, though
 in inconsistent state, any other way.
 
 But: Solr appears to be generating the corrupted state itself via
 copyField?
 What's going wrong? I'm pretty confused...
 
 Thank you,
  Alex
 




Re: Replication Clarification Please

2011-05-15 Thread Alexander Kanarsky
Ravi,

what is the replication configuration on both master and slave?
Also, could you list the files in the index folder on master and slave
before and after the replication?

-Alexander


On Fri, 2011-05-13 at 18:34 -0400, Ravi Solr wrote:
 Sorry guys spoke too soon I guess. The replication still remains very
 slow even after upgrading to 3.1 and setting the compression off. Now
 I am totally clueless. I have tried everything that I know of to
 increase the speed of replication but failed. if anybody faced the
 same issue, can you please tell me how you solved it.
 
 Ravi Kiran Bhaskar
 
 On Thu, May 12, 2011 at 6:42 PM, Ravi Solr ravis...@gmail.com wrote:
  Thank you Mr. Bell and Mr. Kanarsky, as per your advice we have moved
  from 1.4.1 to 3.1 and have made several changes to configuration. The
  configuration changes have worked nicely till now and the replication
  is finishing within the interval and not backing up. The changes we
  made are as follows
 
  1. Increased the mergeFactor from 10 to 15
  2. Increased ramBufferSizeMB to 1024
  3. Changed lockType to single (previously it was simple)
  4. Set maxCommitsToKeep to 1 in the deletionPolicy
  5. Set maxPendingDeletes to 0
  6. Changed caches from LRUCache to FastLRUCache as we had hit ratios
  well over 75% to increase warming speed
  7. Increased the poll interval to 6 minutes and re-indexed all content.
 
  Thanks,
 
  Ravi Kiran Bhaskar
 
  On Wed, May 11, 2011 at 6:00 PM, Alexander Kanarsky
  alexan...@trulia.com wrote:
  Ravi,
 
  if you have what looks like a full replication each time even though the
  master generation is greater than the slave's, try to watch the index on
  both master and slave at the same time to see which files are getting
  replicated. You may need to adjust your merge factor, as Bill
  mentioned.
 
  -Alexander
 
 
 
  On Tue, 2011-05-10 at 12:45 -0400, Ravi Solr wrote:
  Hello Mr. Kanarsky,
  Thank you very much for the detailed explanation,
  probably the best explanation I found regarding replication. Just to
  be sure, I wanted to test solr 3.1 to see if it alleviates the
  problems... I don't think it helped. The master index version and
  generation are greater than the slave's, yet the slave still replicates the
  entire index from the master (see replication admin screen output below).
  Any idea why it would get the whole index every time even in 3.1, or am
  I misinterpreting the output? However, I must admit that 3.1 finished
  the replication, unlike 1.4.1 which would hang and be backed up
  forever.
 
  Masterhttp://masterurl:post/solr-admin/searchcore/replication
Latest Index Version:null, Generation: null
Replicatable Index Version:1296217097572, Generation: 12726
 
  Poll Interval 00:03:00
 
  Local Index   Index Version: 1296217097569, Generation: 12725
 
Location: /data/solr/core/search-data/index
Size: 944.32 MB
Times Replicated Since Startup: 148
Previous Replication Done At: Tue May 10 12:32:42 EDT 2011
Config Files Replicated At: null
Config Files Replicated: null
Times Config Files Replicated Since Startup: null
Next Replication Cycle At: Tue May 10 12:35:41 EDT 2011
 
  Current Replication StatusStart Time: Tue May 10 12:32:41 EDT 2011
Files Downloaded: 18 / 108
Downloaded: 317.48 KB / 436.24 MB [0.0%]
Downloading File: _ayu.nrm, Downloaded: 4 bytes / 4 bytes [100.0%]
Time Elapsed: 17s, Estimated Time Remaining: 23902s, Speed: 18.67 
  KB/s
 
 
  Thanks,
  Ravi Kiran Bhaskar
 
  On Tue, May 10, 2011 at 4:10 AM, Alexander Kanarsky
  alexan...@trulia.com wrote:
   Ravi,
  
   as far as I remember, this is how the replication logic works (see
   SnapPuller class, fetchLatestIndex method):
  
   1. Does the Slave get the whole index every time during replication or
   just the delta since the last replication happened ?
  
  
    It looks at the index version AND the index generation. If both the
    slave's version and generation are the same as on the master, nothing gets
    replicated. If the master's generation is greater than the slave's, the
    slave fetches the delta files only (even if a partial merge was done
    on the master) and puts the new files from the master into the same index
    folder on the slave (either index or index.timestamp, see further
    explanation). However, if the master's index generation is equal to or
    less than the one on the slave, the slave does a full replication by
    fetching all files of the master's index and placing them into a
    separate folder on the slave (index.timestamp). Then, if the fetch is
    successful, the slave updates (or creates) the index.properties file
    and puts the name of the current index folder there. The old
    index.timestamp folder(s) will be kept in 1.4.x - which was treated
    as a bug - see SOLR-2156 (this was fixed in 3.1). After this, the
    slave does a commit or a core reload depending on whether the config files
    were

Re: Replication Clarification Please

2011-05-11 Thread Alexander Kanarsky
Ravi,

if you have what looks like a full replication each time even though the
master generation is greater than the slave's, try to watch the index on
both master and slave at the same time to see which files are getting
replicated. You may need to adjust your merge factor, as Bill
mentioned.

-Alexander



On Tue, 2011-05-10 at 12:45 -0400, Ravi Solr wrote:
 Hello Mr. Kanarsky,
 Thank you very much for the detailed explanation,
 probably the best explanation I found regarding replication. Just to
 be sure, I wanted to test solr 3.1 to see if it alleviates the
 problems... I don't think it helped. The master index version and
 generation are greater than the slave's, yet the slave still replicates the
 entire index from the master (see replication admin screen output below).
 Any idea why it would get the whole index every time even in 3.1, or am
 I misinterpreting the output? However, I must admit that 3.1 finished
 the replication, unlike 1.4.1 which would hang and be backed up
 forever.
 
 Masterhttp://masterurl:post/solr-admin/searchcore/replication
   Latest Index Version:null, Generation: null
   Replicatable Index Version:1296217097572, Generation: 12726
 
 Poll Interval 00:03:00
 
 Local Index   Index Version: 1296217097569, Generation: 12725
 
   Location: /data/solr/core/search-data/index
   Size: 944.32 MB
   Times Replicated Since Startup: 148
   Previous Replication Done At: Tue May 10 12:32:42 EDT 2011
   Config Files Replicated At: null
   Config Files Replicated: null
   Times Config Files Replicated Since Startup: null
   Next Replication Cycle At: Tue May 10 12:35:41 EDT 2011
 
 Current Replication StatusStart Time: Tue May 10 12:32:41 EDT 2011
   Files Downloaded: 18 / 108
   Downloaded: 317.48 KB / 436.24 MB [0.0%]
   Downloading File: _ayu.nrm, Downloaded: 4 bytes / 4 bytes [100.0%]
   Time Elapsed: 17s, Estimated Time Remaining: 23902s, Speed: 18.67 KB/s
 
 
 Thanks,
 Ravi Kiran Bhaskar
 
 On Tue, May 10, 2011 at 4:10 AM, Alexander Kanarsky
 alexan...@trulia.com wrote:
  Ravi,
 
  as far as I remember, this is how the replication logic works (see
  SnapPuller class, fetchLatestIndex method):
 
  1. Does the Slave get the whole index every time during replication or
  just the delta since the last replication happened ?
 
 
   It looks at the index version AND the index generation. If both the
   slave's version and generation are the same as on the master, nothing gets
   replicated. If the master's generation is greater than the slave's, the
   slave fetches the delta files only (even if a partial merge was done
   on the master) and puts the new files from the master into the same index
   folder on the slave (either index or index.timestamp, see further
   explanation). However, if the master's index generation is equal to or
   less than the one on the slave, the slave does a full replication by
   fetching all files of the master's index and placing them into a
   separate folder on the slave (index.timestamp). Then, if the fetch is
   successful, the slave updates (or creates) the index.properties file
   and puts the name of the current index folder there. The old
   index.timestamp folder(s) will be kept in 1.4.x - which was treated
   as a bug - see SOLR-2156 (this was fixed in 3.1). After this, the
   slave does a commit or a core reload depending on whether the config files
   were replicated. There is another bug in 1.4.x that fails replication
   if the slave needs to do a full replication AND the config files were
   changed - also fixed in 3.1 (see SOLR-1983).
 
  2. If there is a huge number of queries being done on the slave, will it
  affect the replication? How can I improve the performance? (see the
  replication details at the bottom of the page)


   From my experience, about half of the replication time is spent flushing
   the transferred data to disk, so the I/O impact is significant.
 
  3. Will the segment names be the same on master and slave after
  replication? I see that they are different. Is this correct? If it
  is correct, how does the slave know what to fetch the next time, i.e.
  the delta?


   They should be the same. The slave fetches only the changed files (see
   above); also look at the SnapPuller code.
 
  4. When and why does the index.TIMESTAMP folder get created ? I see
  this type of folder getting created only on slave and the slave
  instance is pointing to it.
 
 
  See above.
 
  5. Does replication process copy both the index and index.TIMESTAMP
  folder ?
 
 
   The index.timestamp folder gets created only if a full replication has
   happened at least once. Otherwise, the slave will use the index
   folder.
 
  6. What happens if the replication kicks off even before the previous
  invocation has completed? Will the 2nd invocation block, or will
  it go through, causing more confusion?


   There is a lock (snapPullLock in ReplicationHandler) that prevents two
   replications from running simultaneously

Re: Replication Clarification Please

2011-05-10 Thread Alexander Kanarsky
Ravi,

as far as I remember, this is how the replication logic works (see
SnapPuller class, fetchLatestIndex method):

 1. Does the slave get the whole index every time during replication or
 just the delta since the last replication happened?


It looks at the index version AND the index generation. If both the
slave's version and generation are the same as on the master, nothing gets
replicated. If the master's generation is greater than the slave's, the
slave fetches the delta files only (even if a partial merge was done
on the master) and puts the new files from the master into the same index
folder on the slave (either index or index.timestamp, see further
explanation). However, if the master's index generation is equal to or
less than the one on the slave, the slave does a full replication by
fetching all files of the master's index and placing them into a
separate folder on the slave (index.timestamp). Then, if the fetch is
successful, the slave updates (or creates) the index.properties file
and puts the name of the current index folder there. The old
index.timestamp folder(s) will be kept in 1.4.x - which was treated
as a bug - see SOLR-2156 (this was fixed in 3.1). After this, the
slave does a commit or a core reload depending on whether the config files
were replicated. There is another bug in 1.4.x that fails replication
if the slave needs to do a full replication AND the config files were
changed - also fixed in 3.1 (see SOLR-1983).
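
If you want to check the version/generation comparison yourself, both master and
slave expose these values through the ReplicationHandler. A small sketch, with the
URL and core name as placeholders:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class CheckIndexVersion {
    public static void main(String[] args) throws Exception {
        // placeholder URL: run against both the master and the slave core
        URL url = new URL("http://localhost:8983/solr/searchcore/replication"
                + "?command=indexversion&wt=json");
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(url.openStream(), "UTF-8"));
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line); // response contains indexversion and generation
        }
        reader.close();
    }
}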

 2. If there is a huge number of queries being done on the slave, will it
 affect the replication? How can I improve the performance? (see the
 replication details at the bottom of the page)


From my experience, about half of the replication time is spent flushing the
transferred data to disk, so the I/O impact is significant.

 3. Will the segment names be the same on master and slave after
 replication? I see that they are different. Is this correct? If it
 is correct, how does the slave know what to fetch the next time, i.e.
 the delta?


They should be the same. The slave fetches only the changed files (see
above); also look at the SnapPuller code.

 4. When and why does the index.TIMESTAMP folder get created? I see
 this type of folder getting created only on the slave, and the slave
 instance is pointing to it.


See above.

 5. Does the replication process copy both the index and index.TIMESTAMP
folders?


The index.timestamp folder gets created only if a full replication has
happened at least once. Otherwise, the slave will use the index
folder.

 6. What happens if the replication kicks off even before the previous
 invocation has completed? Will the 2nd invocation block, or will
 it go through, causing more confusion?


There is a lock (snapPullLock in ReplicationHandler) that prevents two
replications from running simultaneously. If there is no bug, it should just
return silently from the replication call. (I personally never had a
problem with this, so it looks like there is no bug :)

 7. If I have to prep a new master-slave combination, is it OK to copy
 the respective contents into the new master and slave and start Solr? Or
 do I have to wipe the new slave and let it replicate from its new
 master?


If the new master has a different index, the slave will create a new
index.timestamp folder. There is no need to wipe it.

 8. Doing an 'ls | wc -l' on the index folder of master and slave gave 194
 and 17968 respectively... The slave has a lot of segments_xxx files. Is
 this normal?


No, it looks like in your case the slave continues to replicate to the
same folder for a long time period, but the old files are not getting
deleted for some reason. Try to restart the slave or do a core reload on
it to see if the old segments are gone.

-Alexander



Re: Multicore Relaod Theoretical Question

2011-01-24 Thread Alexander Kanarsky
Em,

that's correct. You can use 'lsof' to see file handles still in use.
See 
http://0xfe.blogspot.com/2006/03/troubleshooting-unix-systems-with-lsof.html,
Recipe #11.

-Alexander

On Sun, Jan 23, 2011 at 1:52 AM, Em mailformailingli...@yahoo.de wrote:

 Hi Alexander,

 thank you for your response.

 You said that the old index files were still in use. That means Linux does
 not *really* delete them until Solr frees its locks from it, which happens
 while reloading?



 Thank you for sharing your experiences!

 Kind regards,
 Em


 Alexander Kanarsky wrote:

 Em,

 yes, you can replace the index (get the new one into a separate folder
 like index.new and then rename it to the index folder) outside of
 Solr, then just do the HTTP call to reload the core.

 Note that the old index files may still be in use (they continue to serve
 queries while reloading), even if the old index folder is deleted
 - that is on Linux filesystems; not sure about NTFS.
 That means the space on disk will be freed only when the old files are
 no longer referenced by a Solr searcher.

 -Alexander

 On Sat, Jan 22, 2011 at 1:51 PM, Em mailformailingli...@yahoo.de wrote:

 Hi Erick,

 thanks for your response.

 Yes, it's really not that easy.

 However, the target is to avoid any kind of master-slave-setup.

  The most recent idea I got is to create a new core with a data-dir
 pointing
 to an already existing directory with a fully optimized index.

 Regards,
 Em
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Multicore-Relaod-Theoretical-Question-tp2293999p2310709.html
 Sent from the Solr - User mailing list archive at Nabble.com.




 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Multicore-Relaod-Theoretical-Question-tp2293999p2312778.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Multicore Relaod Theoretical Question

2011-01-22 Thread Alexander Kanarsky
Em,

yes, you can replace the index (get the new one into a separate folder
like index.new and then rename it to the index folder) outside of
Solr, then just do the HTTP call to reload the core.

Note that the old index files may still be in use (they continue to serve
queries while reloading), even if the old index folder is deleted
- that is on Linux filesystems; not sure about NTFS.
That means the space on disk will be freed only when the old files are
no longer referenced by a Solr searcher.
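
A rough sketch of that sequence, with the paths and core name as placeholders (the
rename-then-reload order matters so the searcher never sees a half-copied index):

import java.io.File;
import java.net.HttpURLConnection;
import java.net.URL;

public class ReplaceIndexAndReload {
    public static void main(String[] args) throws Exception {
        // placeholders: data directory and core name
        File current = new File("/var/solr/data/core0/index");
        File fresh   = new File("/var/solr/data/core0/index.new");
        File old     = new File("/var/solr/data/core0/index.old");

        // swap the directories on disk; the running searcher keeps serving
        // from the already-open files of the old index
        if (!current.renameTo(old) || !fresh.renameTo(current)) {
            throw new IllegalStateException("rename failed");
        }

        // then ask Solr to reload the core so it opens the new index
        URL reload = new URL("http://localhost:8983/solr/admin/cores"
                + "?action=RELOAD&core=core0");
        HttpURLConnection conn = (HttpURLConnection) reload.openConnection();
        System.out.println("RELOAD returned HTTP " + conn.getResponseCode());
        conn.disconnect();
    }
}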

-Alexander

On Sat, Jan 22, 2011 at 1:51 PM, Em mailformailingli...@yahoo.de wrote:

 Hi Erick,

 thanks for your response.

 Yes, it's really not that easy.

 However, the target is to avoid any kind of master-slave-setup.

 The most recent idea I got is to create a new core with a data-dir pointing
 to an already existing directory with a fully optimized index.

 Regards,
 Em
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Multicore-Relaod-Theoretical-Question-tp2293999p2310709.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: old index files not deleted on slave

2011-01-22 Thread Alexander Kanarsky
I see the file

-rw-rw-r-- 1 feeddo feeddo0 Dec 15 01:19
lucene-cdaa80c0fefe1a7dfc7aab89298c614c-write.lock

was created on Dec 15. At the end of the replication, as far as I
remember, the SnapPuller tries to open the writer to ensure the old
files are deleted, and in your case it cannot obtain a lock on the
index folder on Dec 16, 17, and 18. Can you reproduce the problem if
you delete the lock file, restart the slave and try replication again?
Do you have any other Writer(s) open for this folder outside of this core?

-Alexander

On Sat, Jan 22, 2011 at 3:52 PM, feedly team feedly...@gmail.com wrote:
 The file system checked out, I also tried creating a slave on a
 different machine and could reproduce the issue. I logged SOLR-2329.

 On Sat, Dec 18, 2010 at 8:01 PM, Lance Norskog goks...@gmail.com wrote:
 This could be a quirk of the native locking feature. What's the file
 system? Can you fsck it?

 If this error keeps happening, please file this. It should not happen.
 Add the text above and also your solrconfigs if you can.

 One thing you could try is to change from the native locking policy to
 the simple locking policy - but only on the child.

 On Sat, Dec 18, 2010 at 4:44 PM, feedly team feedly...@gmail.com wrote:
 I have set up index replication (triggered on optimize). The problem I
 am having is the old index files are not being deleted on the slave.
 After each replication, I can see the old files still hanging around
 as well as the files that have just been pulled. This causes the data
 directory size to increase by the index size every replication until
 the disk fills up.

 Checking the logs, I see the following error:

 SEVERE: SnapPull failed
 org.apache.solr.common.SolrException: Index fetch failed :
        at 
 org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:329)
        at 
 org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:265)
        at org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:159)
        at 
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
        at 
 java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
        at 
 java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
        at 
 java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:181)
        at 
 java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:205)
        at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)
 Caused by: org.apache.lucene.store.LockObtainFailedException: Lock
 obtain timed out:
 NativeFSLock@/var/solrhome/data/index/lucene-cdaa80c0fefe1a7dfc7aab89298c614c-write.lock
        at org.apache.lucene.store.Lock.obtain(Lock.java:84)
        at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:1065)
        at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:954)
        at 
 org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:192)
        at 
 org.apache.solr.update.UpdateHandler.createMainIndexWriter(UpdateHandler.java:99)
        at 
 org.apache.solr.update.DirectUpdateHandler2.openWriter(DirectUpdateHandler2.java:173)
        at 
 org.apache.solr.update.DirectUpdateHandler2.forceOpenWriter(DirectUpdateHandler2.java:376)
        at org.apache.solr.handler.SnapPuller.doCommit(SnapPuller.java:471)
        at 
 org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:319)
        ... 11 more

 lsof reveals that the file is still opened from the java process.

 I am running 4.0 rev 993367 with patch SOLR-1316. Otherwise, the setup
 is pretty vanilla. The OS is linux, the indexes are on local
 directories, write permissions look ok, nothing unusual in the config
 (default deletion policy, etc.). Contents of the index data dir:

 master:
 -rw-rw-r-- 1 feeddo feeddo  191 Dec 14 01:06 _1lg.fnm
 -rw-rw-r-- 1 feeddo feeddo  26M Dec 14 01:07 _1lg.fdx
 -rw-rw-r-- 1 feeddo feeddo 1.9G Dec 14 01:07 _1lg.fdt
 -rw-rw-r-- 1 feeddo feeddo 474M Dec 14 01:12 _1lg.tis
 -rw-rw-r-- 1 feeddo feeddo  15M Dec 14 01:12 _1lg.tii
 -rw-rw-r-- 1 feeddo feeddo 144M Dec 14 01:12 _1lg.prx
 -rw-rw-r-- 1 feeddo feeddo 277M Dec 14 01:12 _1lg.frq
 -rw-rw-r-- 1 feeddo feeddo  311 Dec 14 01:12 segments_1ji
 -rw-rw-r-- 1 feeddo feeddo  23M Dec 14 01:12 _1lg.nrm
 -rw-rw-r-- 1 feeddo feeddo  191 Dec 18 01:11 _24e.fnm
 -rw-rw-r-- 1 feeddo feeddo  26M Dec 18 01:12 _24e.fdx
 -rw-rw-r-- 1 feeddo feeddo 1.9G Dec 18 01:12 _24e.fdt
 -rw-rw-r-- 1 feeddo feeddo 483M Dec 18 01:23 _24e.tis
 -rw-rw-r-- 1 feeddo feeddo  15M Dec 18 01:23 _24e.tii
 -rw-rw-r-- 1 feeddo feeddo 146M Dec 18 01:23 

Re: Can I host TWO separate datasets in Solr?

2011-01-21 Thread Alexander Kanarsky
Igor,

you can set up two different Solr cores in solr.xml and search them separately.
See the multicore example in the Solr distribution.

-Alexander

On Fri, Jan 21, 2011 at 3:51 PM, Igor Chudov ichu...@gmail.com wrote:
 I would like to have two sets of data and search them separately (they are
 used for two different websites).

 How can I do it?

 Thanks!



Re: Solr + Hadoop

2011-01-13 Thread Alexander Kanarsky
Joan,

make sure that you are running the job on a Hadoop 0.21 cluster. (It
looks like you have compiled the apache-solr-hadoop jar with Hadoop
0.21 but are using it on a 0.20 cluster.)

-Alexander


Re: Creating Solr index from map/reduce

2011-01-03 Thread Alexander Kanarsky
Joan,

the current version of the patch assumes the location and names of the
schema and solrconfig files ($SOLR_HOME/conf); this is hardcoded (see
the SolrRecordWriter constructor). Multi-core configuration with
separate configuration locations via solr.xml is not supported for
now. As a workaround, you could link or copy the schema and
solrconfig files to match the hardcoded assumption.

Thanks,
-Alexander

On Wed, Dec 29, 2010 at 2:50 AM, Joan joan.monp...@gmail.com wrote:
 If I rename my custom schema file (schema-xx.xml), which is located in
 SOLR_HOME/schema/, and then I copy it to the conf folder and finally I try to
 run CSVIndexer, it shows me another error:

 Caused by: java.lang.RuntimeException: Can't find resource 'solrconfig.xml'
 in classpath or
 '/tmp/hadoop-root/mapred/local/taskTracker/archive/localhost/tmp/b7611d6d-9cc7-4237-a240-96ecaab9f21a.solr.zip/conf/'

 I don't understand, because I have a Solr configuration file (solr.xml) where
 I define all cores:

  <core name="core_name"
        instanceDir="solr-data/index"
        config="solr/conf/solrconfig_xx.xml"
        schema="solr/schema/schema_xx.xml"
        properties="solr/conf/solrcore.properties"/>

 But I think that when I run CSVIndexer, it doesn't know that solr.xml exists,
 and it tries to look for schema.xml and solrconfig.xml by default in the
 default folder (conf).



 2010/12/29 Joan joan.monp...@gmail.com

 Hi,

 I'm trying to generate a Solr index from Hadoop (map/reduce), so I'm using this
 patch SOLR-1301 https://issues.apache.org/jira/browse/SOLR-1301, however
 I can't get it to work.

 When I try to run CSVIndexer with some arguments: <Solr index directory>
 -solr <Solr home> <input, in this case a CSV file>

 I'm runnig CSVIndexer:

 HADOOP_INSTALL/bin/hadoop jar my.jar CSVIndexer <INDEX_FOLDER> -solr
 /SOLR_HOME <CSV_FILE_PATH>

 Before that I run CSVIndexer, I've put csv file into HDFS.

 My Solr home doesn't have the default configuration file layout; it is divided
 into multiple folders:

 /conf
 /schema

 I have custom Solr configuration files, so CSVIndexer can't find schema.xml;
 obviously it won't be able to find it because that file doesn't exist. In my
 case, the file is named schema-xx.xml, and CSVIndexer is looking for it
 inside the conf folder and doesn't know that the schema folder exists. And I have
 a Solr configuration file (solr.xml) where I configure multiple cores.

 I tried to modify Solr's paths but it is still not working.

 I understand that CSVIndexer copies the specified Solr home into HDFS
 (/tmp/hadoop-user/mapred/local/taskTracker/archive/...) and when it tries to
 find schema.xml, it doesn't exist:

 10/12/29 10:18:11 INFO mapred.JobClient: Task Id :
 attempt_201012291016_0002_r_00_1, Status : FAILED
 java.lang.IllegalStateException: Failed to initialize record writer for
 my.jar, attempt_201012291016_0002_r_00_1
         at
 org.apache.solr.hadoop.SolrRecordWriter.<init>(SolrRecordWriter.java:253)
         at
 org.apache.solr.hadoop.SolrOutputFormat.getRecordWriter(SolrOutputFormat.java:152)
         at
 org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:553)
         at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
         at org.apache.hadoop.mapred.Child.main(Child.java:170)
 Caused by: java.io.FileNotFoundException: Source
 '/tmp/hadoop-guest/mapred/local/taskTracker/archive/localhost/tmp/e8be5bb1-e910-47a1-b5a7-1352dfec2b1f.solr.zip/conf/schema.xml'
 does not exist
         at org.apache.commons.io.FileUtils.copyFile(FileUtils.java:636)
         at org.apache.commons.io.FileUtils.copyFile(FileUtils.java:606)
         at
 org.apache.solr.hadoop.SolrRecordWriter.<init>(SolrRecordWriter.java:222)
         ... 4 more



Re: Searching with wrong keyboard layout or using translit

2010-10-28 Thread Alexander Kanarsky
Pavel,

I think there is no single way to implement this. Some ideas that
might be helpful:

1. Consider adding additional terms while indexing. This assumes
converting the Russian text to both translit and wrong-keyboard
forms and indexing the converted terms along with the original terms (i.e. your
Analyzer/Filter should produce Moskva and Vjcrdf for the term Москва). You
may re-use the same field (if you plan for simple term queries) or
create separate fields for the generated terms (better for phrase and
proximity queries etc., since it keeps the original text's positional
info). Then the query could use any of these forms to fetch the
document. If you use separate fields, you'll need to expand/create
your query to search them, of course. (See the sketch after this list.)
2. If you have to index just the original Russian text, you might
generate all term forms while analyzing the query; then you could
treat the converted terms as synonyms and use a combination of
TermQueries for all term forms, or a MultiPhraseQuery for phrases.
For Solr, in this case you will probably need to add a custom filter
similar to SynonymFilter.
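
As a very rough illustration of the first idea, a custom TokenFilter (sketched
against the newer Lucene analysis attribute API) could stack a converted form at the
same position as each original token; the conversion itself is only a stub here and
would map Cyrillic characters to translit or to the Latin keys sharing the same
keyboard position:

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.util.AttributeSource;

public final class LayoutVariantFilter extends TokenFilter {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final PositionIncrementAttribute posAtt = addAttribute(PositionIncrementAttribute.class);
    private String pending;                    // converted form waiting to be emitted
    private AttributeSource.State savedState;  // attributes of the original token

    public LayoutVariantFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (pending != null) {
            restoreState(savedState);
            termAtt.setEmpty().append(pending);
            posAtt.setPositionIncrement(0);    // same position as the original token
            pending = null;
            return true;
        }
        if (!input.incrementToken()) {
            return false;
        }
        String converted = convert(termAtt.toString());
        if (converted != null && converted.length() > 0) {
            pending = converted;
            savedState = captureState();
        }
        return true;
    }

    // Stub: a real implementation would use a character mapping table,
    // e.g. "москва" -> "moskva" (translit) or "москва" -> "vjcrdf" (layout).
    private String convert(String term) {
        return null;
    }
}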

Hope this helps,
-Alexander

On Wed, Oct 27, 2010 at 1:31 PM, Pavel Minchenkov char...@gmail.com wrote:
 Hi,

 When I try to search Google with the wrong keyboard layout, it corrects
 my query, example: http://www.google.ru/search?q=vjcrdf (I typed the word
 Moscow in Russian but in the English keyboard layout).
 Also, when I'm searching using translit, it does the same:
 http://www.google.ru/search?q=moskva

 What is the right way to implement this feature in Solr?

 --
 Pavel Minchenkov



Re: I was at a search vendor round table today...

2010-09-22 Thread Alexander Kanarsky
  He said some other things about a huge petabyte hosted search collection 
 they have used by banks..

In the context of your discussion, this reference sounds really, really funny... :)

-Alexander

On Wed, Sep 22, 2010 at 1:17 PM, Grant Ingersoll gsing...@apache.org wrote:

 On Sep 22, 2010, at 2:04 PM, Smiley, David W. wrote:

 (I don't twitter or blog so I thought I'd send this message here)

 Today at work (at MITRE outside DC) there was (is) a day of technical 
 presentations about topics related to information dissemination and 
 discovery (broad squishy words there, but mostly covered search) at which 
 I spoke about the value of faceting, and gave a quick Solr pitch.  There was 
 an hour vendor panel in which a representative from Autonomy, Microsoft 
 (i.e. FAST), Google, Vivisimo, and Endeca had the opportunity to espouse the 
 virtues of their product, and fit in an occasional jab at their competitors 
 next to them.  In the absence of a suitable representative for Solr (e.g. 
 Lucid) I pointed out how open-source Solr has democratized (i.e. made 
 free) search and faceting when it used to require paying lots of money.  And 
 I asked them how their products have reacted to this new reality.  Autonomy 
 acknowledged they used to make millions on simple engagements in the distant 
 past but that isn't the case these days.  He said some other things about a 
 huge petabyte hosted search collection they have used by banks... I forget 
 what else he said.  I forgot what Google said.  Vivisimo quoted Steve 
 Ballmer, saying open source is as free as a free puppy (not a bad point 
 IMO).

 Too funny.  Hadn't heard that one before.  Presumably meaning you have to 
 care and feed it, despite the fact that you really do love it and it is cute 
 as hell?  The care and feeding is true of the commercial ones, too, 
 especially in terms of cost for supporting features you never use, but love 
 (as in we love using this tool) is usually not a word I hear associated in 
 those respects too often, but of course that is likely self selecting.

 Endeca claimed to be happy Solr exists because it raises the awareness of 
 faceted search, but then claimed it would not scale and they should then 
 upgrade to Endeca.  (!)  I found that claim ridiculous, of course.

 Having replaced all the above on a number of occasions w/ Solr at both a 
 significant cost savings on licensing, dev time, and hardware, I would agree 
 that claim is quite ridiculous.  Besides, in my experience, the scale claim 
 is silly.  Everyone (customers) says they need scale, but few of them really 
 know what scale is, so it is all relative.   For some, scale is 1M docs, for 
 others it's 1B+ docs;  for others it's 100K queries per day, for others it's 
 100M per day.  (BTW, I've seen Lucene/Solr do both, just fine.  Not that it 
 is a free lunch, but neither are the other ones despite what they say.)


 Speaking of performance, on a large scale search project where we're using 
 Solr in place of a MarkLogic prototype (because ML is so friggin expensive, 
 for one reason), the search results were so fast (~150ms) vs. the ML's 
 results of 2-3 seconds, that the UI engineers building the interface on top 
 of the XML output thought Solr was broken because it was so fast.  The quote 
 was It's so fast, it's broken.    In other words, they were used to 2-3 
 second response times and so if the results came back as fast as what Solr 
 has been doing, then surely there's a bug.  There's no bug.  :)  Admittedly, 
 I think it was a bit of an apples and oranges comparison but I love that 
 quote nonetheless.


 I love it.  I have had the same experience where people think it's broken b/c 
 it's so fast.  Large vendor named above took 24 hours to index 4M records 
 (they weren't even doing anything fancy on the indexing side) and search was 
 slow too.  Solr took about 40 minutes to index all the content and search was 
 blazing.  Same content, faster indexing, better search results, a lot less 
 time.

 At any rate, enough of tooting our own horn.  Thanks for sharing!

 -Grant


 --
 Grant Ingersoll
 http://www.lucidimagination.com/