The best way to delete several documents?
Dear Solr users,

Every Friday I need to delete some documents from my Solr db (around 100~200 docs). Could you help me choose the best way to delete these documents? I have the unique ID of each document.

Another question: how can I disable the possibility of doing

http://localhost:8983/solr/update?stream.body=<delete><query>id:298253</query></delete>&commit=true

from a web browser? I would like operations on my db to be possible only from a command line, such as:

java -jar -Durl="http://localhost:8983/solr/update?stream.body=<delete><query>id:298253</query></delete>&commit=true" post.jar

Is that possible?

Thanks a lot,
Bruno
Re: [ANNOUNCE] Web Crawler
Hi,

I'm trying to configure crawl-anywhere version 3.0.3 on my local system, following the steps on http://www.crawl-anywhere.com/installation-v300/. However, crawlerws is failing, and the browser shows the following error message for http://localhost:8080/crawlerws/:

<error><errno>1</errno><errmsg>Missing action</errmsg></error>

I'm not sure where I'm going wrong. Could you please help me resolve the problem? Thank you.
Indexing a text file in Solr
I have a large Arabic text file containing tweets, one tweet per line, that I want to index in Solr such that each line of the file becomes a separate Solr document.

What I have tried so far: I know how to index SQL database records in Solr, I know how to change the Solr schema to fit the data and how to work with the Data Import Handler, and I know the queries used to index data in Solr.

What I want to know is how to index a text file in Solr so that each line is treated as its own Solr document.
Re: The best way to delete several documents?
Hi,

The best approach is to find a single query that matches all the docs you want to remove. If that is not simple, you can use the following syntax to remove a group of docs by ID (provided your default query operator is OR):

id:(1 2 3 4 5)

Regards.

On 27 January 2013 11:47, Bruno Mannina <bmann...@free.fr> wrote: Every Friday I need to delete some documents from my Solr db (around 100~200 docs). Could you help me choose the best way to delete these documents?
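A minimal SolrJ sketch of the delete-by-query approach described above; the base URL and document IDs are placeholders, and it assumes a Solr 4.x SolrJ client and the default OR query operator:

import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class DeleteByIdQuery {
    public static void main(String[] args) throws Exception {
        // Placeholder URL: point this at your own Solr core.
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");

        // One delete-by-query request covers the whole group of IDs.
        solr.deleteByQuery("id:(298253 298254 298255)");
        solr.commit();

        solr.shutdown();
    }
}

The same request can also be sent as an XML update message over HTTP; the SolrJ form is just one way to script it.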
Re: The best way to delete several documents?
Hi,

Even if I have one or two thousand IDs?

Thanks

On 27/01/2013 13:15, Marcin Rzewucki wrote: The best approach is to find a single query that matches all the docs you want to remove. If that is not simple, you can use the syntax id:(1 2 3 4 5) to remove a group of docs by ID.
Re: The best way to delete several documents?
You can write a script and remove, say, 50 docs in one call. That is always better than removing them one by one.

Regards.

On 27 January 2013 13:17, Bruno Mannina <bmann...@free.fr> wrote: Even if I have one or two thousand IDs?
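A sketch of that scripted, batched approach in SolrJ (hypothetical IDs and batch size; assumes a Solr 4.x SolrJ client):

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class BatchDelete {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");

        // Hypothetical list of a couple of thousand IDs, e.g. read from a file.
        List<String> ids = new ArrayList<String>();
        for (int i = 0; i < 2000; i++) {
            ids.add(String.valueOf(298000 + i));
        }

        // Send the deletes in chunks of 50 rather than one request per ID.
        int batchSize = 50;
        for (int i = 0; i < ids.size(); i += batchSize) {
            List<String> batch = ids.subList(i, Math.min(i + batchSize, ids.size()));
            solr.deleteById(batch);
        }

        // A single commit at the end keeps the overhead low.
        solr.commit();
        solr.shutdown();
    }
}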
Re: The best way to delete several documents?
Yep, OK, thanks!

On 27/01/2013 13:27, Marcin Rzewucki wrote: You can write a script and remove, say, 50 docs in one call. That is always better than removing them one by one.
Re: [ANNOUNCE] Web Crawler
This is actually showing that it works: crawlerws is used by the Crawl Anywhere UI, which passes it the correct arguments when needed.

SivaKarthik wrote: I'm trying to configure crawl-anywhere 3.0.3 on my local system, but crawlerws is failing and throwing the error <error><errno>1</errno><errmsg>Missing action</errmsg></error> in the browser at http://localhost:8080/crawlerws/.
Re: SolrCloud index recovery
Hi Mark,

I see no such issues in Solr 4.1. It seems to work fine. Thanks.

On 24 January 2013 03:58, Mark Miller <markrmil...@gmail.com> wrote: Yeah, I don't know what you are seeing offhand. You might try Solr 4.1 and see if it's something that has been resolved. - Mark

On Jan 23, 2013, at 3:14 PM, Marcin Rzewucki <mrzewu...@gmail.com> wrote: Guys, I pasted the full log for you (see the pastebin URL). Yes, it is Solr 4.0. Two cores are in sync, but the third one is not:

INFO: PeerSync Recovery was not successful - trying replication. core=ofac
INFO: Starting Replication Recovery. core=ofac

It started replication and even says it was done successfully:

INFO: Replication Recovery was successful - registering as Active. core=ofac

but index files were not downloaded. It's empty, no docs. Also, I do not see a replication.properties file. The tlog dir is empty and the index dir contains only 3 files: segments.gen, segments_7 and write.lock. It seems to be a tough issue. Anyway, thanks for your help.

On 23 January 2013 15:41, Mark Miller <markrmil...@gmail.com> wrote: Looks like it shows 3 cores start - 2 with versions that decide they are up to date and one that replicates. The one that replicates doesn't have much logging showing that activity. Is this Solr 4.0? - Mark

On Jan 23, 2013, at 9:27 AM, Upayavira <u...@odoko.co.uk> wrote: Mark, take a peek at the pastebin URL Marcin mentioned earlier (http://pastebin.com/qMC9kDvt) - is there enough info there? Upayavira

On Wed, Jan 23, 2013, at 02:04 PM, Mark Miller wrote: Was your full log stripped? You are right, we need more. Yes, the peer sync failed, but then you cut out all the important stuff about the replication attempt that happens after. - Mark

On Jan 23, 2013, at 5:28 AM, Marcin Rzewucki <mrzewu...@gmail.com> wrote: Hi, Previously, I took only the lines related to the collection I tested, so maybe some interesting part was missing. I'm sending the full log this time. It ends with: INFO: Finished recovery process. core=ofac. The issue I described is related to the collection called ofac. I hope the log is meaningful now. It is trying to do the replication, but it seems not to know which files to download. Regards.

On 23 January 2013 10:39, Upayavira <u...@odoko.co.uk> wrote: The first stage is identifying whether it can sync with transaction logs. It couldn't, because there's no index. So the logs you have shown make complete sense. It then says 'trying replication', which is what I would expect, and that is the bit you are saying has failed. So the interesting part is likely immediately after the snippet you showed. Upayavira

On Wed, Jan 23, 2013, at 07:40 AM, Marcin Rzewucki wrote: OK, so I did yet another test. I stopped Solr, removed the whole data/ dir and started Solr again. The directories were recreated fine, but the missing files were not downloaded from the leader. The log is attached (I took the lines related to my test with 2 lines of context; I hope it helps). I could find the following warning message:

Jan 23, 2013 7:16:08 AM org.apache.solr.update.PeerSync sync
INFO: PeerSync: core=ofac url=http://replica_host:8983/solr START replicas=[http://leader_host:8983/solr/ofac/] nUpdates=100
Jan 23, 2013 7:16:08 AM org.apache.solr.update.PeerSync sync
WARNING: no frame of reference to tell if we've missed updates
Jan 23, 2013 7:16:08 AM org.apache.solr.cloud.RecoveryStrategy doRecovery
INFO: PeerSync Recovery was not successful - trying replication. core=ofac

So it did not know which files to download?? Could you help me solve this problem? Thanks in advance. Regards.
On 22 January 2013 23:06, Yonik Seeley <yo...@lucidworks.com> wrote:

On Tue, Jan 22, 2013 at 4:37 PM, Marcin Rzewucki <mrzewu...@gmail.com> wrote: Sorry, my mistake. I did 2 tests: in the 1st I removed just the index directory, and in the 2nd test I removed both the index and tlog directories. The log lines I sent are related to the first case, so Solr could read the tlog directory at that moment. Anyway, do you have an idea why it did not download files from the leader?

For your 1st test, if you only deleted the index and not the transaction logs, Solr will look at the transaction logs to try to determine whether it is up to date or not (by comparing with peers). If you want to clear out all the data, remove the entire data directory.

-Yonik
http://lucidworks.com
[Announce] Apache Solr 3.6.2 with RankingAlgorithm 1.4.3 available for download now -- includes experimental TimedSerialMergeScheduler
Hi:

I am very excited to announce the availability of Apache Solr 3.6.2 with RankingAlgorithm30 1.4.3 with realtime-search support. realtime-search is very fast NRT and allows you not only to look up a document by id but also to search in realtime; see http://tgels.org/realtime-nrt.jsp. The update performance is about 10,000 docs/sec. Query performance is in milliseconds, allowing you to query a 10m-document Wikipedia index (complete index) in 50 ms.

This release also includes an experimental TimedSerialMergeScheduler (http://rankingalgorithm.1050964.n5.nabble.com/TimedSerialMergerScheduler-java-allows-merges-to-be-deferred-to-a-known-time-like-11pm-or-1am-tp5706350.html) that allows you to postpone your merges to off-hours, like 11pm or 1am, increasing performance.

RankingAlgorithm30 1.4.3 supports the entire Lucene query syntax, +/- and AND/OR boolean queries. You can get more information about realtime-search performance here: http://solr-ra.tgels.org/wiki/en/Near_Real_Time_Search_ver3.x

You can download Solr 3.6.2 with RankingAlgorithm30 1.4.3 from here: http://solr-ra.tgels.org

Please download and give the new version a try.

Note:
1. Apache Solr 3.6.2 with RankingAlgorithm30 1.4.3 is an external project.
2. realtime-search has been contributed back to Apache Solr, see https://issues.apache.org/jira/browse/SOLR-3816

Regards,
Nagendra Nagarajayya
http://solr-ra.tgels.org
http://elasticsearch-ra.tgels.org
http://rankingalgorithm.tgels.org
Re: java.lang.IllegalArgumentException: ./collection_shard2_1/data/index does not exist
Thanks Marcin. I found your post via a Google search and there was no reply attached to it, so I thought no one had replied. Apologies and thanks again.

On Sat, Jan 26, 2013 at 6:48 PM, Marcin Rzewucki <mrzewu...@gmail.com> wrote: Hi, actually Mark Miller replied to this issue, and it seems to be fixed in Solr 4.1 as far as I checked. Anyway, it was harmless both for querying and indexing. Regards.

On 26 January 2013 20:14, Prashant Saraswat <prashant.saras...@pixalsoft.com> wrote: Hi guys, we are using Solr 4.0 in a 2-shard cluster with replication enabled. On Solr startup we get an exception like this:

WARNING: Could not getStatistics on info bean org.apache.solr.handler.ReplicationHandler
java.lang.IllegalArgumentException: ./collectionOne_shard2_1/data/index does not exist
  at org.apache.commons.io.FileUtils.sizeOfDirectory(FileUtils.java:2074)
  at org.apache.solr.handler.ReplicationHandler.getIndexSize(ReplicationHandler.java:477)
  at org.apache.solr.handler.ReplicationHandler.getStatistics(ReplicationHandler.java:525)
  at org.apache.solr.core.JmxMonitoredMap$SolrDynamicMBean.getMBeanInfo(JmxMonitoredMap.java:231)
  at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.getNewMBeanClassName(DefaultMBeanServerInterceptor.java:321)...

This directory doesn't exist, but we do have a directory like index.1234... Indexing and search seem to be fine. Can someone confirm that this is harmless? Marcin Rzewucki asked the same question on December 28, 2012 and got no response. Can someone kindly respond please? Thanks, PixalSoft
secure Solr server
Before Solr 4.0, I secured Solr by enabling password protection in Jetty. However, password protection makes SolrCloud not work. We use EC2 now, and we need the web admin interface of Solr to be accessible (with a password) from anywhere. How do you protect your Solr server from unauthorized access? Thanks, Ming
Re: secure Solr server
You can define a security filter in WEB-INF/web.xml on specific URL patterns. You might want to set the URL pattern to /admin/*. [Find examples here: http://stackoverflow.com/questions/7920092/how-can-i-bypass-security-filter-in-web-xml ]

On Sun, Jan 27, 2013 at 8:07 PM, Mingfeng Yang <mfy...@wisewindow.com> wrote: How do you protect your Solr server from unauthorized access?
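A rough sketch of what such a constraint could look like in web.xml, using a standard servlet security constraint with BASIC auth on the admin URLs; the role name, realm name and URL pattern here are examples only, and the actual users and roles still have to be defined in the container (e.g. a Jetty realm):

<security-constraint>
  <web-resource-collection>
    <web-resource-name>Solr admin</web-resource-name>
    <url-pattern>/admin/*</url-pattern>
  </web-resource-collection>
  <auth-constraint>
    <!-- Only users in this (example) role may reach /admin/* -->
    <role-name>solr-admin</role-name>
  </auth-constraint>
</security-constraint>

<login-config>
  <auth-method>BASIC</auth-method>
  <realm-name>Solr</realm-name>
</login-config>

<security-role>
  <role-name>solr-admin</role-name>
</security-role>

Requests to other paths (e.g. /update) would need their own constraints or network-level restrictions if they should also be protected.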
RE: The best way to delete several documents?
Hi Bruno,

Why don't you write a deletedPkQuery to delete these documents and set your cron to run a delta import every Friday?

Regards
Harshvardhan Ojha

-----Original Message-----
From: Bruno Mannina [mailto:bmann...@free.fr]
Sent: Sunday, January 27, 2013 6:03 PM
To: solr-user@lucene.apache.org
Subject: Re: The best way to delete several documents?

Yep, OK, thanks!
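For context, deletedPkQuery is a DataImportHandler delta-import feature; a rough data-config.xml sketch, assuming a hypothetical SQL table named docs with a deleted flag and a last_modified column (table, columns and field names are illustrative only):

<entity name="doc" pk="id"
        query="SELECT id, title FROM docs"
        deltaQuery="SELECT id FROM docs WHERE last_modified &gt; '${dataimporter.last_index_time}'"
        deltaImportQuery="SELECT id, title FROM docs WHERE id = '${dih.delta.id}'"
        deletedPkQuery="SELECT id FROM docs WHERE deleted = 1">
  <field column="id" name="id"/>
  <field column="title" name="title"/>
</entity>

The cron job would then hit the handler with command=delta-import, and rows returned by deletedPkQuery would be removed from the index during that run.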
Re: SOLR 4.1 Out Of Memory error After commit of a few thousand Solr Docs
Hi Shawn,

Thanks for your reply. After following your suggestions we were able to index 30k documents. I have some queries:
1) What is stored in RAM while only indexing is going on? How do I calculate the RAM/heap requirements for our documents?
2) The document cache, filter cache, etc. are populated while querying (correct me if I am wrong). Are there any caches that are populated while indexing?

Thanks,
Rahul

On Sat, Jan 26, 2013 at 11:46 PM, Shawn Heisey <s...@elyograg.org> wrote:

On 1/26/2013 12:55 AM, Rahul Bishnoi wrote: Thanks for the quick reply and for addressing each point queried. The additional information you asked for is below: OS = Ubuntu 12.04 (64-bit), Sun Java 7 (64-bit), total RAM = 8GB. The solrconfig.xml is available at http://pastebin.com/SEFxkw2R

Rahul,

The MaxPermGenSize could be a contributing factor. The documents where you have 1000 words are somewhat large, though your overall index size is pretty small. I would try removing the MaxPermGenSize option and see what happens. You can also try reducing the ramBufferSizeMB in solrconfig.xml. The default in previous versions of Solr was 32, which is big enough for most things, unless you are indexing HUGE documents like entire books.

It looks like you have the cache sizes under query at values close to default. I wouldn't decrease the documentCache any - in fact, an increase might be a good thing there. As for the others, you could probably reduce them. The filterCache size I would start at 64 or 128. Watch your cache hit ratios to see whether the changes make things remarkably worse.

If that doesn't help, try increasing the -Xmx option - first 3072m, then 4096m. You could go as high as 6GB and not run into any OS cache problems with your small index size, though you might run into long GC pauses.

Indexing, especially of big documents, is fairly memory intensive. Some queries can be memory intensive as well, especially those using facets or a lot of clauses. Under normal operation, I could probably get away with a 3GB heap size, but I have it at 8GB because otherwise a full reindex (full-import from MySQL) runs into OOM errors.

Thanks,
Shawn
Re: SOLR 4.1 Out Of Memory error After commit of a few thousand Solr Docs
On 1/27/2013 10:28 PM, Rahul Bishnoi wrote: 1) What is stored in RAM while only indexing is going on? How do I calculate the RAM/heap requirements for our documents? 2) Are there any caches that are populated while indexing?

If anyone catches me making statements that are not true, please feel free to correct me.

The caches are indeed only used during querying. If you are not making queries at all, they aren't much of a factor.

I can't give you any definitive answers to your question about RAM usage and how to calculate RAM/heap requirements. I can make some general statements without looking at the code, just based on what I've learned so far about Solr, and about Java in general.

You would have an exact copy of the input text for each field initially, which would ultimately get used for the stored data (for those fields that are stored). Each one is probably just a plain String, though I don't know, as I haven't read the code. If the field is not being stored or copied, then it would be possible to get rid of that data as soon as it is no longer required for indexing. I don't have any idea whether the Solr/Lucene code actually gets rid of the exact copy in this way.

If you are storing term vectors, additional memory would be needed for that. I don't know if that involves lots of objects or if it's one object with index information. Based on my experience, term vectors can be bigger than the stored data for the same field.

Tokenization and filtering is where I imagine most of the memory would get used. If you're using a filter like EdgeNGram, that's a LOT of tokens. Even if you're just tokenizing words, it can add up. There is also space required for the inverted index, norms, and other data/metadata. If each token is a separate Java object (which I do not know), there would be a fair amount of memory overhead involved. A String object in Java has something like 40 bytes of overhead above and beyond the space required for the data. Also, strings in Java are internally represented in UTF-16, so each character actually takes two bytes. http://www.javamex.com/tutorials/memory/string_memory_usage.shtml

The finished documents stack up in the ramBufferSizeMB space until it gets full or a hard commit is issued, at which point they are flushed to disk as a Lucene segment. One thing that I'm not sure about is whether an additional RAM buffer is allocated for further indexing while the flush is happening, or if it flushes and then re-uses the buffer for subsequent documents.

Another way that it can use memory is when merging index segments. I don't know how much memory gets used for this process.

On Solr 4 with the default directory factory, part of a flushed segment may remain in RAM until enough additional segment data is created. The amount of memory used by this feature should be pretty small, unless you have a lot of cores on a single JVM. That extra memory can be eliminated by using MMapDirectoryFactory instead of NRTCachingDirectoryFactory, at the expense of fast near-real-time index updates.

Thanks,
Shawn
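As a back-of-the-envelope illustration of the per-String cost described above (a sketch only; the 40-byte figure is approximate and the real footprint varies by JVM, pointer size and interning):

public class StringMemoryEstimate {
    // Rough per-String cost: ~40 bytes of object/array overhead
    // plus 2 bytes per character (UTF-16), as described above.
    static long approxBytes(String s) {
        return 40 + 2L * s.length();
    }

    public static void main(String[] args) {
        // A short 5-character token still costs roughly 50 bytes.
        System.out.println(approxBytes("solr!") + " bytes (approx.)");

        // So a 1000-word document tokenized into ~1000 small strings can
        // use tens of kilobytes before the inverted index, norms and term
        // vectors are even counted.
        long total = 0;
        for (int i = 0; i < 1000; i++) {
            total += approxBytes("token");
        }
        System.out.println(total + " bytes for ~1000 small tokens (approx.)");
    }
}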
[SOLR 4.0] Number of fields vs searching speed
Hi guys, what is the relation between the number of indexed fields and search speed? For example, I have the same number of records and the same Solr search query, but 100 indexed fields per record in case 1 and 1000 fields in case 2. It's obvious that the search time in case 2 will be greater, but by how much? 10 times? Or is there a different relation between the number of indexed fields and search time? Thanks a lot!