Re: Fastest way to import big amount of documents in SolrCloud
If you build your index in Hadoop, read this (it is about Cloudera Search, but to my understanding it should also work with the Solr Hadoop contrib since 4.7): http://www.cloudera.com/content/cloudera-content/cloudera-docs/Search/latest/Cloudera-Search-User-Guide/csug_batch_index_to_solr_servers_using_golive.html

On Thu, May 1, 2014 at 1:47 PM, Costi Muraru costimur...@gmail.com wrote:

Hi guys,

What would you say is the fastest way to import data into SolrCloud? Our use case: each day, do a single import of a large number of documents. Should we use SolrJ/DataImportHandler/other? Or perhaps there is a bulk import feature in Solr? I came upon this promising link: http://wiki.apache.org/solr/UpdateCSV Any idea how UpdateCSV compares performance-wise with SolrJ/DataImportHandler?

If SolrJ, should we split the data in chunks and start multiple clients at once? That way we could perhaps take advantage of the number of servers in the SolrCloud configuration. Either way, after the import is finished, should we do an optimize, a commit, or neither (http://wiki.solarium-project.org/index.php/V1:Optimize_command)?

Any tips and tricks to perform this process the right way are gladly appreciated.

Thanks, Costi
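The "split the data in chunks and start multiple clients" idea from the question can be sketched roughly as below. This is only an illustration with Python's standard library; `post_batch` is a hypothetical stand-in for whatever client actually sends the update (SolrJ, an HTTP POST to /update, etc.), and the batch/worker numbers are arbitrary:

```python
from concurrent.futures import ThreadPoolExecutor

def chunked(docs, size):
    """Split a document list into batches of at most `size` documents."""
    for i in range(0, len(docs), size):
        yield docs[i:i + size]

def post_batch(batch):
    # Stand-in for a real update call (e.g. an HTTP POST of this batch
    # to a SolrCloud node); here it just reports how many docs it "sent".
    return len(batch)

def parallel_import(docs, batch_size=1000, workers=4):
    batches = list(chunked(docs, batch_size))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        sent = sum(pool.map(post_batch, batches))
    # A single commit (and optionally an optimize) would go here,
    # after the whole run -- not one commit per batch.
    return sent

total = parallel_import([{"id": str(i)} for i in range(2500)], batch_size=1000)
print(total)  # 2500
```

The design point is that batching amortizes per-request overhead and multiple concurrent clients let several SolrCloud nodes index in parallel, while committing once at the end avoids repeated searcher reopens.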
Re: Production Release process with Solr 3.5 implementation.
Why not change the order to this:

3. Upgrade Solr Schema (Master). Replication is disabled.
4. Start Index Rebuild (if step 3).
1. Pull up Maintenance Pages
2. Upgrade DB
5. Upgrade UI code
6. Index build complete? Start Replication
7. Verify UI and Drop Maintenance Pages.

So your slaves will continue to serve traffic until you're done with the master index. Or does the master index also import from the same database?

On Thu, Nov 1, 2012 at 4:08 PM, Shawn Heisey s...@elyograg.org wrote:

On 11/1/2012 2:46 PM, adityab wrote:

1. Pull up Maintenance Pages
2. Upgrade DB
3. Upgrade Solr Schema (Master). Replication is disabled.
4. Start Index Rebuild (if step 3).
5. Upgrade UI code
6. Index build complete? Start Replication
7. Verify UI and Drop Maintenance Pages.

As #4 takes a couple of hours, compared to all the other steps which run within a few minutes, we need to have downtime for that duration.

What I do is a little bit different. I have two completely independent copies of my index, no replication. The build system maintains each copy simultaneously, including managing independent rebuilds. I used to run two copies of my build system, but I recently made it so that one copy manages multiple indexes.

If I need to do an upgrade, I will first test everything out as much as possible in my test environment. Then I will take one copy of my index offline, perform the required changes, and reindex. The UI continues to send queries to the online index that hasn't been changed. At that point, we initiate the upgrade sequence you've described, except that instead of step 4 taking a few hours, we just have to redirect traffic to the brand new index copy. If everything works out, we then repeat with the other index copy. If it doesn't work out, we revert everything and go back to the original index.

Also, every index has a build core and a live core. I currently maintain the same config in both cores, but it would be possible to change the config in the build core, reload or restart Solr, do your reindex, and simply do a core swap, which is almost instantaneous. If you are doing replication, swapping cores on the master initiates full replication to the slave. Excerpt from my solr.xml:

<core instanceDir="cores/s0_1/" name="s0live" dataDir="../../data/s0_1/"/>
<core instanceDir="cores/s0_0/" name="s0build" dataDir="../../data/s0_0/"/>

Thanks, Shawn

P.S. Actually, I have three full copies of my index now -- I recently upgraded my test server so it has enough disk capacity to hold my entire index. The test server runs a local copy of the build system which keeps it up to date with the two production copies.
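The core swap Shawn mentions is done through the CoreAdmin handler's SWAP action. A small illustration of composing that request follows; the host, port, and core names are assumptions matching the solr.xml excerpt above, and in practice you would issue the resulting URL as a plain HTTP GET:

```python
from urllib.parse import urlencode

def swap_url(base, core, other):
    """Build a CoreAdmin SWAP request URL. SWAP atomically exchanges the
    names of two cores, so queries hitting the "live" name start being
    served by the freshly built index."""
    return base + "/admin/cores?" + urlencode(
        {"action": "SWAP", "core": core, "other": other})

url = swap_url("http://localhost:8983/solr", "s0build", "s0live")
print(url)
```

After the swap, the core that was `s0build` answers to the `s0live` name (and vice versa), which is why the cutover is "almost instantaneous".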
Re: 400 MB Fields
Otis, not sure about Solr, but with Lucene it was certainly doable. I have seen fields far bigger than 400 MB indexed, sometimes with a large set of unique terms as well (think something like a log file with lots of alphanumeric tokens, a couple of gigabytes in size). While indexing and querying such fields, I/O could naturally become a bottleneck. -Alexander
Re: copyField generates multiple values encountered for non multiValued field
Alexander, I saw the same behavior in 1.4.x with non-multivalued fields when updating a document in the index (i.e. obtaining the doc from the index, modifying some fields, and then adding the document with the same id back). I do not know what causes this, but it looks like the copyField logic completely bypasses the multivaluedness check and simply adds the value in addition to whatever is already there (instead of replacing the value). So yes, Solr renders itself into an incorrect state (note that the index is still correct from Lucene's standpoint). -Alexander

On Wed, 2011-05-25 at 16:50 +0200, Alexander Golubowitsch wrote:

Dear list, hope somebody can help me understand/avoid this. I am sending an add request with allowDuplicates=false to a Solr 1.4.1 instance. This is for debugging purposes, so I am sending the exact same data that are already stored in Solr's index. I am using the PHP PECL libraries, which fail completely in giving me any hint on what goes wrong. Only sending the same add request again gives me a proper SolrClientException that hints:

ERROR: [288400] multiple values encountered for non multiValued field field2 [fieldvalue, fieldvalue]

The scenario:
- field1 is implicitly single-valued, type text, indexed and stored
- field2 is generated via a copyField directive in schema.xml, implicitly single-valued, type string, indexed and stored

What appears to happen:
- On the first add (SolrClient::addDocuments(array(SolrInputDocument theDocument))), regular fields like field1 get overwritten as intended
- field2, defined with a copyField but still single-valued, gets _appended_ instead
- When I retrieve the updated document in a query and try to add it again, it won't let me because of the inconsistent multi-value state
- The PECL library, in addition, appears to hit some internal exception (that it doesn't handle properly) when encountering multiple values for a single-valued field. That gives me zero results when querying a set that includes the document via PHP, while the document can be retrieved properly, though in an inconsistent state, any other way.

But: Solr appears to be generating the corrupted state itself via copyField? What's going wrong? I'm pretty confused... Thank you, Alex
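A common workaround for this class of problem (implied by the discussion above, not an official fix) is to strip the copyField destination fields from a document fetched back from the index before re-adding it, so the copyField logic regenerates them instead of appending a second value. A minimal sketch, using the thread's field2 as the assumed destination name:

```python
# Assumed set of copyField destination fields from this schema;
# field2 is the copyField target mentioned in the thread.
COPYFIELD_DESTS = {"field2"}

def prepare_for_readd(stored_doc):
    """Drop copyField destination fields from a document retrieved from
    the index, so that re-adding the document lets copyField regenerate
    them rather than appending to a single-valued field."""
    return {k: v for k, v in stored_doc.items() if k not in COPYFIELD_DESTS}

doc = {"id": "288400", "field1": "fieldvalue", "field2": "fieldvalue"}
clean = prepare_for_readd(doc)
print(clean)  # {'id': '288400', 'field1': 'fieldvalue'}
```

The same idea applies to any stored-then-re-added document: only the copyField *sources* should be sent back; the destinations are derived data.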
Re: Replication Clarification Please
Ravi, what is the replication configuration on both master and slave? Also, could you list the files in the index folder on master and slave before and after the replication? -Alexander

On Fri, 2011-05-13 at 18:34 -0400, Ravi Solr wrote:

Sorry guys, spoke too soon I guess. The replication still remains very slow even after upgrading to 3.1 and turning compression off. Now I am totally clueless. I have tried everything that I know of to increase the speed of replication, but failed. If anybody has faced the same issue, can you please tell me how you solved it. Ravi Kiran Bhaskar

On Thu, May 12, 2011 at 6:42 PM, Ravi Solr ravis...@gmail.com wrote:

Thank you Mr. Bell and Mr. Kanarsky. As per your advice we have moved from 1.4.1 to 3.1 and have made several changes to the configuration. The configuration changes have worked nicely so far: the replication is finishing within the interval and not backing up. The changes we made are as follows:

1. Increased the mergeFactor from 10 to 15
2. Increased ramBufferSizeMB to 1024
3. Changed lockType to single (previously it was simple)
4. Set maxCommitsToKeep to 1 in the deletionPolicy
5. Set maxPendingDeletes to 0
6. Changed caches from LRUCache to FastLRUCache, as we had hit ratios well over 75%, to increase warming speed
7. Increased the poll interval to 6 minutes and re-indexed all content.

Thanks, Ravi Kiran Bhaskar

On Wed, May 11, 2011 at 6:00 PM, Alexander Kanarsky alexan...@trulia.com wrote:

Ravi, if you have what looks like a full replication each time even if the master generation is greater than the slave's, try to watch the index on both master and slave at the same time to see which files are getting replicated. You probably may need to adjust your merge factor, as Bill mentioned. -Alexander

On Tue, 2011-05-10 at 12:45 -0400, Ravi Solr wrote:

Hello Mr. Kanarsky, thank you very much for the detailed explanation, probably the best explanation I have found regarding replication. Just to be sure, I wanted to test Solr 3.1 to see if it alleviates the problems... I don't think it helped. The master index version and generation are greater than the slave's, yet the slave still replicates the entire index from the master (see the replication admin screen output below). Any idea why it would get the whole index every time even in 3.1, or am I misinterpreting the output? However, I must admit that 3.1 finished the replication, unlike 1.4.1, which would hang and be backed up forever.

Master http://masterurl:post/solr-admin/searchcore/replication
Latest Index Version: null, Generation: null
Replicatable Index Version: 1296217097572, Generation: 12726
Poll Interval: 00:03:00

Local Index
Index Version: 1296217097569, Generation: 12725
Location: /data/solr/core/search-data/index
Size: 944.32 MB
Times Replicated Since Startup: 148
Previous Replication Done At: Tue May 10 12:32:42 EDT 2011
Config Files Replicated At: null
Config Files Replicated: null
Times Config Files Replicated Since Startup: null
Next Replication Cycle At: Tue May 10 12:35:41 EDT 2011

Current Replication Status
Start Time: Tue May 10 12:32:41 EDT 2011
Files Downloaded: 18 / 108
Downloaded: 317.48 KB / 436.24 MB [0.0%]
Downloading File: _ayu.nrm, Downloaded: 4 bytes / 4 bytes [100.0%]
Time Elapsed: 17s, Estimated Time Remaining: 23902s, Speed: 18.67 KB/s

Thanks, Ravi Kiran Bhaskar

On Tue, May 10, 2011 at 4:10 AM, Alexander Kanarsky alexan...@trulia.com wrote:

Ravi, as far as I remember, this is how the replication logic works (see the SnapPuller class, fetchLatestIndex method):

1. Does the slave get the whole index every time during replication, or just the delta since the last replication?

It looks at the index version AND the index generation. If both the slave's version and generation are the same as on the master, nothing gets replicated. If the master's generation is greater than the slave's, the slave fetches the delta files only (even if a partial merge was done on the master) and puts the new files from the master into the same index folder on the slave (either index or index.<timestamp>, see the further explanation). However, if the master's index generation is equal to or less than the one on the slave, the slave does a full replication by fetching all files of the master's index and placing them into a separate folder on the slave (index.<timestamp>). Then, if the fetch is successful, the slave updates (or creates) the index.properties file and puts there the name of the current index folder. The old index.<timestamp> folder(s) would be kept in 1.4.x, which was treated as a bug (see SOLR-2156; this was fixed in 3.1). After this, the slave does a commit or reloads the core depending on whether the config files were replicated.
Re: Replication Clarification Please
Ravi, if you have what looks like a full replication each time even if the master generation is greater than the slave's, try to watch the index on both master and slave at the same time to see which files are getting replicated. You probably may need to adjust your merge factor, as Bill mentioned. -Alexander

On Tue, 2011-05-10 at 12:45 -0400, Ravi Solr wrote:

Hello Mr. Kanarsky, thank you very much for the detailed explanation, probably the best explanation I have found regarding replication. [...]
Re: Replication Clarification Please
Ravi, as far as I remember, this is how the replication logic works (see the SnapPuller class, fetchLatestIndex method):

1. Does the slave get the whole index every time during replication, or just the delta since the last replication?

It looks at the index version AND the index generation. If both the slave's version and generation are the same as on the master, nothing gets replicated. If the master's generation is greater than the slave's, the slave fetches the delta files only (even if a partial merge was done on the master) and puts the new files from the master into the same index folder on the slave (either index or index.<timestamp>, see the further explanation). However, if the master's index generation is equal to or less than the one on the slave, the slave does a full replication by fetching all files of the master's index and placing them into a separate folder on the slave (index.<timestamp>). Then, if the fetch is successful, the slave updates (or creates) the index.properties file and puts there the name of the current index folder. The old index.<timestamp> folder(s) would be kept in 1.4.x, which was treated as a bug (see SOLR-2156; this was fixed in 3.1). After this, the slave does a commit or reloads the core depending on whether the config files were replicated. There is another bug in 1.4.x that fails the replication if the slave needs to do a full replication AND the config files were changed; this is also fixed in 3.1 (see SOLR-1983).

2. If a huge number of queries are being done on the slave, will it affect the replication? How can I improve the performance? (see the replication details at the bottom of the page)

From my experience, half of the replication time is the time when the transferred data is flushed to disk, so the I/O impact is important.

3. Will the segment names be the same on master and slave after replication? I see that they are different. Is this correct? If it is correct, how does the slave know what to fetch the next time, i.e. the delta?

They should be the same. The slave fetches the changed files only (see above); also look at the SnapPuller code.

4. When and why does the index.TIMESTAMP folder get created? I see this type of folder getting created only on the slave, and the slave instance is pointing to it.

See above.

5. Does the replication process copy both the index and index.TIMESTAMP folders?

The index.<timestamp> folder gets created only if a full replication has happened at least once. Otherwise, the slave will use the index folder.

6. What happens if the replication kicks off before the previous invocation has completed? Will the second invocation block, or will it go through, causing more confusion?

There is a lock (snapPullLock in ReplicationHandler) that prevents two replications from running simultaneously. If there is no bug, it should just return silently from the replication call. (I personally never had a problem with this, so it looks like there is no bug. :)

7. If I have to prep a new master-slave combination, is it OK to copy the respective contents into the new master and slave and start Solr? Or do I have to wipe the new slave and let it replicate from its new master?

If the new master has a different index, the slave will create a new index.<timestamp> folder. There is no need to wipe it.

8. Doing an 'ls | wc -l' on the index folder of master and slave gave 194 and 17968 respectively... the slave has a lot of segments_xxx files. Is this normal?

No, it looks like in your case the slave continued to replicate into the same folder for a long time, but the old files are not getting deleted for some reason. Try to restart the slave or do a core reload on it to see if the old segments go away. -Alexander
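The version/generation decision described in answer 1 can be condensed into a toy model (this is an illustration of the logic as explained above, not Solr's actual code; the returned strings are just labels):

```python
def replication_action(master_version, master_gen, slave_version, slave_gen):
    """Toy model of SnapPuller.fetchLatestIndex's decision, as described
    in the thread: identical version and generation -> nothing to do;
    newer master generation -> fetch only the changed files into the
    current index dir; otherwise -> full fetch into index.<timestamp>."""
    if master_version == slave_version and master_gen == slave_gen:
        return "skip"
    if master_gen > slave_gen:
        return "delta into index/"
    return "full copy into index.<timestamp>/"

print(replication_action(100, 12726, 100, 12726))  # skip
print(replication_action(101, 12726, 100, 12725))  # delta into index/
print(replication_action(99, 12700, 100, 12725))   # full copy into index.<timestamp>/
```

Note the last case matches the surprising behavior in the thread: a master whose generation is *behind* the slave's (e.g. after restoring or rebuilding the master index) forces a full copy, not a delta.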
Re: Multicore Relaod Theoretical Question
Em, that's correct. You can use 'lsof' to see file handles still in use. See http://0xfe.blogspot.com/2006/03/troubleshooting-unix-systems-with-lsof.html, Recipe #11. -Alexander

On Sun, Jan 23, 2011 at 1:52 AM, Em mailformailingli...@yahoo.de wrote:

Hi Alexander, thank you for your response. You said that the old index files were still in use. That means Linux does not *really* delete them until Solr frees its locks on them, which happens while reloading? Thank you for sharing your experiences! Kind regards, Em

Alexander Kanarsky wrote: [...]
Re: Multicore Relaod Theoretical Question
Em, yes, you can replace the index (get the new one into a separate folder like index.new and then rename it to the index folder) outside Solr, then just do the HTTP call to reload the core. Note that the old index files may still be in use (continuing to serve queries while reloading), even if the old index folder is deleted; that is the case on Linux filesystems, not sure about NTFS. That means the space on disk will be freed only when the old files are no longer referenced by the Solr searcher. -Alexander

On Sat, Jan 22, 2011 at 1:51 PM, Em mailformailingli...@yahoo.de wrote:

Hi Erick, thanks for your response. Yes, it's really not that easy. However, the target is to avoid any kind of master-slave setup. The most recent idea I got is to create a new core with a data dir pointing to an already existing directory with a fully optimized index. Regards, Em
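The rename-then-reload sequence above can be sketched as follows. The folder names (index.new, index.old) are the illustrative ones from the message; the actual core-reload HTTP call is left as a comment since it depends on your host and core name:

```python
import os
import shutil
import tempfile

def swap_index(data_dir):
    """Rename data_dir/index.new over data_dir/index, keeping the old
    copy as index.old. Afterwards you would issue the core RELOAD call;
    the old files stay usable (and keep serving queries) until the old
    searcher releases them, as noted above."""
    idx = os.path.join(data_dir, "index")
    if os.path.exists(idx):
        os.rename(idx, os.path.join(data_dir, "index.old"))
    os.rename(os.path.join(data_dir, "index.new"), idx)
    # here: HTTP GET /admin/cores?action=RELOAD&core=<name>

# Demonstrate on a throwaway directory layout.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "index"))
os.makedirs(os.path.join(root, "index.new"))
open(os.path.join(root, "index.new", "segments_1"), "w").close()
swap_index(root)
swapped = os.path.exists(os.path.join(root, "index", "segments_1"))
kept_old = os.path.isdir(os.path.join(root, "index.old"))
print(swapped, kept_old)  # True True
shutil.rmtree(root)
```

Keeping index.old around until the reload completes is deliberate: on Linux the deleted-but-open files would not free disk space anyway, and it gives you a trivial rollback path.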
Re: old index files not deleted on slave
I see the file

-rw-rw-r-- 1 feeddo feeddo 0 Dec 15 01:19 lucene-cdaa80c0fefe1a7dfc7aab89298c614c-write.lock

was created on Dec 15. At the end of the replication, as far as I remember, the SnapPuller tries to open a writer to ensure the old files are deleted, and in your case it cannot obtain a lock on the index folder on Dec 16, 17, 18. Can you reproduce the problem if you delete the lock file, restart the slave, and try the replication again? Do you have any other writer(s) open for this folder outside of this core? -Alexander

On Sat, Jan 22, 2011 at 3:52 PM, feedly team feedly...@gmail.com wrote:

The file system checked out; I also tried creating a slave on a different machine and could reproduce the issue. I logged SOLR-2329.

On Sat, Dec 18, 2010 at 8:01 PM, Lance Norskog goks...@gmail.com wrote:

This could be a quirk of the native locking feature. What's the file system? Can you fsck it? If this error keeps happening, please file this. It should not happen. Add the text above and also your solrconfigs if you can. One thing you could try is to change from the native locking policy to the simple locking policy - but only on the child.

On Sat, Dec 18, 2010 at 4:44 PM, feedly team feedly...@gmail.com wrote:

I have set up index replication (triggered on optimize). The problem I am having is that the old index files are not being deleted on the slave. After each replication, I can see the old files still hanging around, as well as the files that have just been pulled. This causes the data directory size to increase by the index size on every replication until the disk fills up.
Checking the logs, I see the following error:

SEVERE: SnapPull failed
org.apache.solr.common.SolrException: Index fetch failed :
        at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:329)
        at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:265)
        at org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:159)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
        at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:181)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:205)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)
Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock@/var/solrhome/data/index/lucene-cdaa80c0fefe1a7dfc7aab89298c614c-write.lock
        at org.apache.lucene.store.Lock.obtain(Lock.java:84)
        at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:1065)
        at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:954)
        at org.apache.solr.update.SolrIndexWriter.init(SolrIndexWriter.java:192)
        at org.apache.solr.update.UpdateHandler.createMainIndexWriter(UpdateHandler.java:99)
        at org.apache.solr.update.DirectUpdateHandler2.openWriter(DirectUpdateHandler2.java:173)
        at org.apache.solr.update.DirectUpdateHandler2.forceOpenWriter(DirectUpdateHandler2.java:376)
        at org.apache.solr.handler.SnapPuller.doCommit(SnapPuller.java:471)
        at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:319)
        ... 11 more

lsof reveals that the file is still opened by the java process. I am running 4.0 rev 993367 with patch SOLR-1316. Otherwise, the setup is pretty vanilla. The OS is Linux, the indexes are on local directories, write permissions look OK, nothing unusual in the config (default deletion policy, etc.).

Contents of the index data dir:

master:
-rw-rw-r-- 1 feeddo feeddo  191 Dec 14 01:06 _1lg.fnm
-rw-rw-r-- 1 feeddo feeddo  26M Dec 14 01:07 _1lg.fdx
-rw-rw-r-- 1 feeddo feeddo 1.9G Dec 14 01:07 _1lg.fdt
-rw-rw-r-- 1 feeddo feeddo 474M Dec 14 01:12 _1lg.tis
-rw-rw-r-- 1 feeddo feeddo  15M Dec 14 01:12 _1lg.tii
-rw-rw-r-- 1 feeddo feeddo 144M Dec 14 01:12 _1lg.prx
-rw-rw-r-- 1 feeddo feeddo 277M Dec 14 01:12 _1lg.frq
-rw-rw-r-- 1 feeddo feeddo  311 Dec 14 01:12 segments_1ji
-rw-rw-r-- 1 feeddo feeddo  23M Dec 14 01:12 _1lg.nrm
-rw-rw-r-- 1 feeddo feeddo  191 Dec 18 01:11 _24e.fnm
-rw-rw-r-- 1 feeddo feeddo  26M Dec 18 01:12 _24e.fdx
-rw-rw-r-- 1 feeddo feeddo 1.9G Dec 18 01:12 _24e.fdt
-rw-rw-r-- 1 feeddo feeddo 483M Dec 18 01:23 _24e.tis
-rw-rw-r-- 1 feeddo feeddo  15M Dec 18 01:23 _24e.tii
-rw-rw-r-- 1 feeddo feeddo 146M Dec 18 01:23
Re: Can I host TWO separate datasets in Solr?
Igor, you can set up two different Solr cores in solr.xml and search them separately. See the multicore example in the Solr distribution. -Alexander

On Fri, Jan 21, 2011 at 3:51 PM, Igor Chudov ichu...@gmail.com wrote:

I would like to have two sets of data and search them separately (they are used for two different websites). How can I do it? Thanks!
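For reference, a minimal legacy-style (pre-5.x) solr.xml along the lines of the multicore example might look like this; the core names and instance directories are illustrative, and each core gets its own conf/ and data/ under its instanceDir:

```xml
<!-- solr.xml: two independent cores, queried at .../solr/site_a and .../solr/site_b -->
<solr persistent="true">
  <cores adminPath="/admin/cores">
    <core name="site_a" instanceDir="site_a"/>
    <core name="site_b" instanceDir="site_b"/>
  </cores>
</solr>
```

Each website then points its queries at its own core URL, so the two datasets never mix.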
Re: Solr + Hadoop
Joan, make sure that you are running the job on a Hadoop 0.21 cluster. (It looks like you have compiled the apache-solr-hadoop jar with Hadoop 0.21 but are using it on a 0.20 cluster.) -Alexander
Re: Creating Solr index from map/reduce
Joan, the current version of the patch assumes the location and names of the schema and solrconfig files ($SOLR_HOME/conf); it is hardcoded (see the SolrRecordWriter constructor). A multi-core configuration with separate configuration locations via solr.xml is not supported for now. As a workaround, you could link or copy the schema and solrconfig files to match the hardcoded assumption. Thanks, -Alexander

On Wed, Dec 29, 2010 at 2:50 AM, Joan joan.monp...@gmail.com wrote:

If I rename my custom schema file (schema-xx.xml), which is located in SOLR_HOME/schema/, and then copy it to the conf folder and try to run CSVIndexer, it shows me another error:

Caused by: java.lang.RuntimeException: Can't find resource 'solrconfig.xml' in classpath or '/tmp/hadoop-root/mapred/local/taskTracker/archive/localhost/tmp/b7611d6d-9cc7-4237-a240-96ecaab9f21a.solr.zip/conf/'

I don't understand, because I have a Solr configuration file (solr.xml) where I define all cores:

<core name="core_name" instanceDir="solr-data/index" config="solr/conf/solrconfig_xx.xml" schema="solr/schema/schema_xx.xml" properties="solr/conf/solrcore.properties"/>

But I think that when I run CSVIndexer, it doesn't know that solr.xml exists, and it tries to look for schema.xml and solrconfig.xml by default in the default folder (conf).

2010/12/29 Joan joan.monp...@gmail.com

Hi, I'm trying to generate a Solr index from Hadoop (map/reduce), so I'm using this patch SOLR-1301 https://issues.apache.org/jira/browse/SOLR-1301; however, I can't get it to work. I'm running CSVIndexer with some arguments (the directory for the Solr index, -solr with the Solr home, and the input CSV):

HADOOP_INSTALL/bin/hadoop jar my.jar CSVIndexer INDEX_FOLDER -solr /SOLR_HOME CSV_FILE_PATH

Before running CSVIndexer, I put the CSV file into HDFS. My Solr home doesn't have the default file layout; it is divided into multiple folders, /conf and /schema. I have custom Solr configuration files, so CSVIndexer can't find schema.xml; obviously it won't be able to find it, because in my case this file is named schema-xx.xml, and CSVIndexer looks for it inside the conf folder and doesn't know that the schema folder exists. And I have a Solr configuration file (solr.xml) where I configure multiple cores. I tried to modify Solr's paths, but it still doesn't work. I understand that CSVIndexer copies the specified Solr home into HDFS (/tmp/hadoop-user/mapred/local/taskTracker/archive/...), and when it tries to find schema.xml, it doesn't exist:

10/12/29 10:18:11 INFO mapred.JobClient: Task Id : attempt_201012291016_0002_r_00_1, Status : FAILED
java.lang.IllegalStateException: Failed to initialize record writer for my.jar, attempt_201012291016_0002_r_00_1
        at org.apache.solr.hadoop.SolrRecordWriter.init(SolrRecordWriter.java:253)
        at org.apache.solr.hadoop.SolrOutputFormat.getRecordWriter(SolrOutputFormat.java:152)
        at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:553)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: java.io.FileNotFoundException: Source '/tmp/hadoop-guest/mapred/local/taskTracker/archive/localhost/tmp/e8be5bb1-e910-47a1-b5a7-1352dfec2b1f.solr.zip/conf/schema.xml' does not exist
        at org.apache.commons.io.FileUtils.copyFile(FileUtils.java:636)
        at org.apache.commons.io.FileUtils.copyFile(FileUtils.java:606)
        at org.apache.solr.hadoop.SolrRecordWriter.init(SolrRecordWriter.java:222)
        ... 4 more
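The "link or copy the files to match the hardcoded assumption" workaround amounts to staging the custom files under the names the patch expects. A small sketch (file names taken from the thread; paths are illustrative):

```python
import os
import shutil
import tempfile

def stage_conf(solr_home, schema_src, config_src):
    """Copy custom schema/solrconfig files into $SOLR_HOME/conf under the
    hardcoded names (schema.xml, solrconfig.xml) that SolrRecordWriter
    looks for, per the workaround described above."""
    conf = os.path.join(solr_home, "conf")
    os.makedirs(conf, exist_ok=True)
    shutil.copyfile(schema_src, os.path.join(conf, "schema.xml"))
    shutil.copyfile(config_src, os.path.join(conf, "solrconfig.xml"))
    return conf

# Demonstrate with empty placeholder files in a throwaway Solr home.
home = tempfile.mkdtemp()
for name in ("schema-xx.xml", "solrconfig_xx.xml"):
    open(os.path.join(home, name), "w").close()
conf = stage_conf(home,
                  os.path.join(home, "schema-xx.xml"),
                  os.path.join(home, "solrconfig_xx.xml"))
staged = sorted(os.listdir(conf))
print(staged)  # ['schema.xml', 'solrconfig.xml']
shutil.rmtree(home)
```

On a real filesystem you could use symlinks (os.symlink) instead of copies so the canonical files stay in their custom locations.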
Re: Searching with wrong keyboard layout or using translit
Pavel, I think there is no single way to implement this. Some ideas that might be helpful:

1. Consider adding additional terms while indexing. This assumes converting the Russian text to both translit and wrong-keyboard forms and indexing the converted terms along with the original terms (i.e. your Analyzer/Filter should produce Moskva and Vjcrdf for the term Москва). You may reuse the same field (if you plan for simple term queries) or create separate fields for the generated terms (better for phrase, proximity queries, etc., since it keeps the original text's positional info). Then the query could use any of these forms to fetch the document. If you use separate fields, you'll need to expand/create your query to search them, of course.

2. If you have to index just the original Russian text, you might generate all the term forms while analyzing the query; then you could treat the converted terms as synonyms and use a combination of TermQuery for all term forms, or a MultiPhraseQuery for phrases. For Solr, in this case you will probably need to add a custom filter similar to SynonymFilter.

Hope this helps, -Alexander

On Wed, Oct 27, 2010 at 1:31 PM, Pavel Minchenkov char...@gmail.com wrote:

Hi, when I'm trying to search Google with the wrong keyboard layout, it corrects my query, for example: http://www.google.ru/search?q=vjcrdf (I typed the word Moscow in Russian but in the English keyboard layout). Also, when I'm searching using translit, it does the same: http://www.google.ru/search?q=moskva What is the right way to implement this feature in Solr? -- Pavel Minchenkov
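Idea 1 above (emitting wrong-keyboard and translit variants alongside the original term) can be sketched as a plain function before wrapping it into a real TokenFilter. The ЙЦУКЕН-to-QWERTY table below covers the basic lowercase letters, and the translit table is deliberately partial, just enough for the thread's example word:

```python
# Lowercase Russian ЙЦУКЕН keys mapped to the same physical QWERTY keys.
RU_TO_QWERTY = dict(zip("йцукенгшщзфывапролдячсмить",
                        "qwertyuiopasdfghjklzxcvbnm"))
# Partial, illustrative transliteration table (real rules need digraphs
# like ш -> sh, я -> ya, etc.).
RU_TRANSLIT = {"м": "m", "о": "o", "с": "s", "к": "k", "в": "v", "а": "a"}

def wrong_layout(token):
    """What the token looks like typed on an English layout."""
    return "".join(RU_TO_QWERTY.get(ch, ch) for ch in token)

def translit(token):
    return "".join(RU_TRANSLIT.get(ch, ch) for ch in token)

def variants(token):
    # All forms a custom filter could index (or query) for one term.
    return {token, wrong_layout(token), translit(token)}

print(sorted(variants("москва")))  # ['moskva', 'vjcrdf', 'москва']
```

Indexing all three variants at the same position (a zero position-increment, SynonymFilter-style) is what keeps phrase and proximity queries working across the forms.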
Re: I was at a search vendor round table today...
He said some other things about a huge petabyte hosted search collection they have used by banks.. In the context of your discussion this reference sounds really, really funny... :) -Alexander On Wed, Sep 22, 2010 at 1:17 PM, Grant Ingersoll gsing...@apache.org wrote: On Sep 22, 2010, at 2:04 PM, Smiley, David W. wrote: (I don't twitter or blog so I thought I'd send this message here) Today at work (at MITRE outside DC) there was (is) a day of technical presentations about topics related to information dissemination and discovery (broad squishy words there, but mostly covering search) at which I spoke about the value of faceting and gave a quick Solr pitch. There was an hour-long vendor panel in which representatives from Autonomy, Microsoft (i.e. FAST), Google, Vivisimo, and Endeca had the opportunity to espouse the virtues of their products, and fit in an occasional jab at the competitors next to them. In the absence of a suitable representative for Solr (e.g. Lucid), I pointed out how open-source Solr has democratized (i.e. made free) search and faceting when it used to require paying lots of money, and I asked them how their products have reacted to this new reality. Autonomy acknowledged they used to make millions on simple engagements in the distant past but that isn't the case these days. He said some other things about a huge petabyte hosted search collection they have used by banks... I forget what else he said. I forgot what Google said. Vivisimo quoted Steve Ballmer, saying open source is "as free as a free puppy" (not a bad point IMO). Too funny. Hadn't heard that one before. Presumably meaning you have to care for and feed it, despite the fact that you really do love it and it is cute as hell? The care and feeding is true of the commercial ones, too, especially in terms of supporting features you never use, but love (as in "we love using this tool") is not a word I hear associated with them too often, though of course that is likely self-selecting.
Endeca claimed to be happy Solr exists because it raises awareness of faceted search, but then claimed it would not scale and that users should then upgrade to Endeca. (!) I found that claim ridiculous, of course. Having replaced all of the above on a number of occasions with Solr, at a significant savings in licensing, dev time, and hardware, I would agree that claim is quite ridiculous. Besides, in my experience, the scale claim is silly. Everyone (customers) says they need scale, but few of them really know what scale is, so it is all relative. For some, scale is 1M docs, for others it's 1B+ docs; for some it's 100K queries per day, for others it's 100M per day. (BTW, I've seen Lucene/Solr do both, just fine. Not that it is a free lunch, but neither are the other ones, despite what they say.) Speaking of performance, on a large-scale search project where we're using Solr in place of a MarkLogic prototype (because ML is so friggin expensive, for one reason), the search results were so fast (~150ms, vs. ML's 2-3 seconds) that the UI engineers building the interface on top of the XML output thought Solr was broken because it was so fast. The quote was "It's so fast, it's broken." In other words, they were used to 2-3 second response times, so if the results came back as fast as Solr had been delivering them, surely there was a bug. There's no bug. :) Admittedly, I think it was a bit of an apples-and-oranges comparison, but I love that quote nonetheless. I love it. I have had the same experience where people think it's broken b/c it's so fast. A large vendor named above took 24 hours to index 4M records (and they weren't even doing anything fancy on the indexing side), and search was slow too. Solr took about 40 minutes to index all the content and search was blazing. Same content, faster indexing, better search results, a lot less time. At any rate, enough of tooting our own horn. Thanks for sharing!
-Grant -- Grant Ingersoll http://www.lucidimagination.com/