Re: Caused by: org.noggit.JSONParser$ParseException: Expected ',' or '}': char=",position=312 BEFORE='ssions"
Yes, absolutely correct: the comma is missing at the end of line 10. All key-value pairs inside the same block should be comma-separated, except the last one. From: Shawn Heisey Reply: solr-user@lucene.apache.org Date: April 25, 2017 at 2:29:03 PM To: solr-user@lucene.apache.org Subject: Re: Caused by: org.noggit.JSONParser$ParseException: Expected ',' or '}': char=",position=312 BEFORE='ssions" On 4/25/2017 12:10 PM, bay chae wrote: > https://stackoverflow.com/questions/43618000/solr-standalone-basicauth-org-noggit-jsonparserparseexception < https://stackoverflow.com/questions/43618000/solr-standalone-basicauth-org-noggit-jsonparserparseexception> > > Hi I am following guides on security.json in https://cwiki.apache.org/confluence/display/solr/Rule-Based+Authorization+Plugin < https://cwiki.apache.org/confluence/display/solr/Rule-Based+Authorization+Plugin>. > > But when solr starts up I am getting: > > Caused by: org.noggit.JSONParser$ParseException: Expected ',' or '}': char=",position=312 BEFORE='ssions":[{"name":"security-edit", "role":"admin"}] "' AFTER='user-role":{"solr":"admin"} }} Looks like the JSON on that documentation page is incorrect, and has been wrong for a very long time. It doesn't validate when run through a JSON validator. If I add a comma at the end of line 10 (just before "user-role"), then it validates. I do not know whether this is the correct fix, but I think it probably is. Before I update the documentation, I would like somebody who's familiar with this file to tell me whether I've got the right fix. Thanks, Shawn
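For illustration only, here is a minimal sketch of the authorization section in question, pieced together from the fragments visible in the exception message (the wiki's full security.json also contains an authentication block and may differ in other details). The fix under discussion is the comma after the "permissions" array, just before "user-role":

  {
    "authorization": {
      "class": "solr.RuleBasedAuthorizationPlugin",
      "permissions": [{"name": "security-edit", "role": "admin"}],
      "user-role": {"solr": "admin"}
    }
  }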
Re: CPU Intensive Scoring Alternatives
Walter, I use BM25, which is the default for Solr 6.3, and in the Solr logs I clearly saw a correlation between the number of hits and response times; it is almost linear, even with an underloaded system. With “solrmeter” at 10 requests per second, CPU goes to 400% on a 12-core hyperthreaded machine, and with 20 requests per second it goes to 1100%. No issues with GC. Java 8u121 from Oracle, 64-bit. 20 requests per second to Solr 6, are you kidding? I never expected that for the simplest queries. Doug, I was never able to make the “mm” parameter work for me; I cannot understand how it works. I use eDisMax and a few “text_general” fields, with the Solr default operator “OR” and the default “mm” (which should be “1” for “OR”). From: Walter Underwood <wun...@wunderwood.org> Reply: solr-user@lucene.apache.org Date: February 21, 2017 at 5:24:23 PM To: solr-user@lucene.apache.org Subject: Re: CPU Intensive Scoring Alternatives 300 ms seems pretty good for 200 million documents. Is that average? Median? 95th percentile? Why are you sure it is because the huge number of hits? That would be unusual. The size of the posting lists is a more common cause. Why do you think it is caused by tf.idf? That should be faster than BM25. Does host have enough RAM to hold most or all of the index in file buffers? What are the hit rates on your caches? Are you using fuzzy matches? N-gram prefix matching? Phrase matching? Shingles? What version of Java are you running? What garbage collector? wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Feb 21, 2017, at 10:42 AM, Doug Turnbull < dturnb...@opensourceconnections.com > wrote: > > With that many documents, why not start with an AND search and reissue an > OR query if there's no results? My strategy is to prefer an AND for large > collections (or a higher mm than 1) and prefer closer to an OR for smaller > collections. > > -Doug > > On Tue, Feb 21, 2017 at 1:39 PM Fuad Efendi <f...@efendi.ca > wrote: > >> Thank you Ahmet, I will try it; sounds reasonable >> >> >> From: Ahmet Arslan <iori...@yahoo.com.invalid> >> Reply: solr-user@lucene.apache.org, Ahmet Arslan <iori...@yahoo.com> >> Date: February 21, 2017 at 3:02:11 AM >> To: solr-user@lucene.apache.org >> Subject: Re: CPU Intensive Scoring Alternatives >> >> Hi, >> >> New default similarity is BM25. >> May be explicitly set similarity to tf-idf and see how it goes? >> >> Ahmet >> >> >> On Tuesday, February 21, 2017 4:28 AM, Fuad Efendi <f...@efendi.ca> wrote: >> Hello, >> >> >> Default TF-IDF performs poorly with the indexed 200 millions documents. >> Query "Michael Jackson" may run 300ms, and "Michael The Jackson" over 3 >> seconds. eDisMax. Because default operator "OR" and stopword "The" we have >> 50-70 millions documents as a query result, and scoring is CPU intensive. >> What to do?
Our typical queries return over million documents, and response >> times of simple queries ranges from 50 milliseconds to 5-10 seconds >> depending on result set. >> >> This was just an exaggerated example with stopword “the”, but even simplest >> query “Michael Jackson” runs 300ms instead of 3ms just because huge number >> of hits and TF-IDF calculations. Solr 6.3. >> >> >> Thanks, >> >> -- >> >> Fuad Efendi >> >> (416) 993-2060 >> >> http://www.tokenizer.ca <http://www.tokenizer.ca/> >> Search Relevancy, Recommender Systems >>
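Since the “mm” (minimum should match) parameter keeps coming up in this thread, here is a minimal sketch of an eDisMax request that raises it above the effective default of 1 for OR queries; the collection, fields and values are invented for illustration:

  /solr/mycollection/select?defType=edismax&q=Michael%20The%20Jackson&qf=title%20description&mm=2&rows=10

With mm=2, at least two of the optional query terms must match, which cuts the result set (and the scoring work) well below what a plain OR over all terms produces.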
Re: CPU Intensive Scoring Alternatives
Thank you Ahmet, I will try it; sounds reasonable From: Ahmet Arslan <iori...@yahoo.com.invalid> <iori...@yahoo.com.invalid> Reply: solr-user@lucene.apache.org <solr-user@lucene.apache.org> <solr-user@lucene.apache.org>, Ahmet Arslan <iori...@yahoo.com> <iori...@yahoo.com> Date: February 21, 2017 at 3:02:11 AM To: solr-user@lucene.apache.org <solr-user@lucene.apache.org> <solr-user@lucene.apache.org> Subject: Re: CPU Intensive Scoring Alternatives Hi, New default similarity is BM25. May be explicitly set similarity to tf-idf and see how it goes? Ahmet On Tuesday, February 21, 2017 4:28 AM, Fuad Efendi <f...@efendi.ca> wrote: Hello, Default TF-IDF performs poorly with the indexed 200 millions documents. Query "Michael Jackson" may run 300ms, and "Michael The Jackson" over 3 seconds. eDisMax. Because default operator "OR" and stopword "The" we have 50-70 millions documents as a query result, and scoring is CPU intensive. What to do? Our typical queries return over million documents, and response times of simple queries ranges from 50 milliseconds to 5-10 seconds depending on result set. This was just an exaggerated example with stopword “the”, but even simplest query “Michael Jackson” runs 300ms instead of 3ms just because huge number of hits and TF-IDF calculations. Solr 6.3. Thanks, -- Fuad Efendi (416) 993-2060 http://www.tokenizer.ca Search Relevancy, Recommender Systems
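For reference, Ahmet's suggestion of explicitly switching back to classic tf-idf is a one-line schema change in Solr 6.x; a minimal sketch:

  <!-- schema.xml: global similarity, reverting from the BM25 default to classic tf-idf -->
  <similarity class="solr.ClassicSimilarityFactory"/>

Per-field-type overrides are also possible (via solr.SchemaSimilarityFactory), but the global form is the simplest thing to A/B test against the BM25 default.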
CPU Intensive Scoring Alternatives
Hello, Default TF-IDF performs poorly with 200 million indexed documents. The query "Michael Jackson" may run 300ms, and "Michael The Jackson" over 3 seconds. eDisMax. Because of the default operator "OR" and the stopword "The", we get 50-70 million documents as a query result, and scoring is CPU intensive. What to do? Our typical queries return over a million documents, and response times of simple queries range from 50 milliseconds to 5-10 seconds depending on the result set. This was just an exaggerated example with the stopword “the”, but even the simplest query “Michael Jackson” runs 300ms instead of 3ms just because of the huge number of hits and TF-IDF calculations. Solr 6.3. Thanks, -- Fuad Efendi (416) 993-2060 http://www.tokenizer.ca Search Relevancy, Recommender Systems
Re: Solr 5.5.0 MSSQL Datasource Example
Perhaps this answers your question: http://stackoverflow.com/questions/27418875/microsoft-sqlserver-driver-datasource-have-password-empty Try different one as per Eclipse docs, http://www.eclipse.org/jetty/documentation/9.4.x/jndi-datasource-examples.html jdbc/DSTest user pass dbname localhost 1433 -- Fuad Efendi (416) 993-2060 http://www.tokenizer.ca Search Relevancy, Recommender Systems From: Per Newgro <per.new...@gmx.ch> <per.new...@gmx.ch> Reply: solr-user@lucene.apache.org <solr-user@lucene.apache.org> <solr-user@lucene.apache.org> Date: February 7, 2017 at 10:15:42 AM To: solr-user-group <solr-user@lucene.apache.org> <solr-user@lucene.apache.org> Subject: Solr 5.5.0 MSSQL Datasource Example Hello, has someone a working example for MSSQL Datasource with 'Standard Microsoft SQL Driver'. My environment: debian Java 8 Solr 5.5.0 Standard (download and installed as service) server/lib/ext sqljdbc4-4.0.jar Global JNDI resource defined server/etc/jetty.xml java:comp/env/jdbc/mydb ip mydb user password or 2nd option tried java:comp/env/jdbc/mydb jdbc:sqlserver://ip;databaseName=mydb; user password collection1/conf/db-data-config.xml ... This leads to SqlServerException login failed for user. at com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:216) at com.microsoft.sqlserver.jdbc.TDSTokenHandler.onEOF(tdsparser.java:254) at com.microsoft.sqlserver.jdbc.TDSParser.parse(tdsparser.java:84) at com.microsoft.sqlserver.jdbc.SQLServerConnection.sendLogon(SQLServerConnection.java:2908) at com.microsoft.sqlserver.jdbc.SQLServerConnection.logon(SQLServerConnection.java:2234) at com.microsoft.sqlserver.jdbc.SQLServerConnection.access$000(SQLServerConnection.java:41) at com.microsoft.sqlserver.jdbc.SQLServerConnection$LogonCommand.doExecute(SQLServerConnection.java:2220) at com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:5696) at com.microsoft.sqlserver.jdbc.SQLServerConnection.executeCommand(SQLServerConnection.java:1715) at com.microsoft.sqlserver.jdbc.SQLServerConnection.connectHelper(SQLServerConnection.java:1326) at com.microsoft.sqlserver.jdbc.SQLServerConnection.login(SQLServerConnection.java:991) at com.microsoft.sqlserver.jdbc.SQLServerConnection.connect(SQLServerConnection.java:827) at com.microsoft.sqlserver.jdbc.SQLServerDataSource.getConnectionInternal(SQLServerDataSource.java:621) at com.microsoft.sqlserver.jdbc.SQLServerDataSource.getConnection(SQLServerDataSource.java:57) at org.apache.solr.handler.dataimport.JdbcDataSource$1.getFromJndi(JdbcDataSource.java:256) at org.apache.solr.handler.dataimport.JdbcDataSource$1.call(JdbcDataSource.java:182) at org.apache.solr.handler.dataimport.JdbcDataSource$1.call(JdbcDataSource.java:172) at org.apache.solr.handler.dataimport.JdbcDataSource.getConnection(JdbcDataSource.java:463) at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.(JdbcDataSource.java:309) ... 12 more But when i remove the jndi datasource and rewrite the dataimport data source to driver="com.microsoft.sqlserver.jdbc.SQLServerDriver" br/> url="jdbc:sqlserver://ip;databaseName=mydb" user="user" password="password" /> ... Then it works. But this way i need to configure the db in every core. I would like to avoid that. Thanks Per
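The JNDI resource definitions in both messages lost their XML markup in the archive; the surviving values ("jdbc/DSTest user pass dbname localhost 1433") match the example in the Jetty documentation linked above. A rough sketch of that server/etc/jetty.xml entry for the Microsoft driver, with placeholder host, database and credentials:

  <New id="DSTest" class="org.eclipse.jetty.plus.jndi.Resource">
    <Arg></Arg>
    <Arg>jdbc/DSTest</Arg>
    <Arg>
      <New class="com.microsoft.sqlserver.jdbc.SQLServerDataSource">
        <Set name="User">user</Set>
        <Set name="Password">pass</Set>
        <Set name="DatabaseName">dbname</Set>
        <Set name="ServerName">localhost</Set>
        <Set name="PortNumber">1433</Set>
      </New>
    </Arg>
  </New>

The data-config.xml dataSource can then reference it by JNDI name (for example jndiName="java:comp/env/jdbc/DSTest") instead of repeating the URL and credentials in every core.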
Re: Solr 5.3.1: Collection reload results in IndexWriter is closed exception
Were you indexing new documents while reloading? “Previously we’ve done reloads of a collection after changing solrconfig.xml without any issues.” -- Fuad Efendi (416) 993-2060 http://www.tokenizer.ca Search Relevancy, Recommender Systems From: Kelly, Frank <frank.ke...@here.com> Reply: solr-user@lucene.apache.org Date: February 7, 2017 at 12:19:21 PM To: solr-user@lucene.apache.org Subject: Solr 5.3.1: Collection reload results in IndexWriter is closed exception Just wondering if anyone has seen this before and might understand why this is happening Environment: Solr 5.3.1 in Solr Cloud (3 shards each with 3 replicas across 3 EC2 Vms) 100m documents (20+ GB index) Previously we’ve done reloads of a collection after changing solrconfig.xml without any issues. This time we saw it across 3 of 3 environments where we got several Solr’s showing “IndexWriter is closed” errors and had to stop and restart those Solr instances. In our final environment we skipped the RELOAD and just did solr stop, solr start. The solrconfig.xml change we made was turning on the replication handler (not sure if this has any bearing on the issue) 96 6 6 Is there anything “unsafe” about reload on a collection that is handling live traffic in that version? Cheers! -Frank Frank Kelly, Principal Software Engineer, HERE, 5 Wayside Rd, Burlington, MA 01803, USA
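For reference, the collection RELOAD under discussion is the Collections API call (as opposed to restarting nodes); a typical invocation, with host and collection name as placeholders:

  http://localhost:8983/solr/admin/collections?action=RELOAD&name=collection1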
Re: Help with design choice: join or multiValued field
Correct: a multivalued field with up to 10,000 shop IDs. Use case: a shopping network in the U.S., for example for a big brand such as Walmart, where the user implicitly provides an IP address or explicitly a postal code, so that we can find items in his/her neighbourhood. You basically provide the “join” information via this 10,000-sized collection of IDs per document. It has almost no impact on index size. The user query needs to provide the list of preferred shop IDs (if, for example, we know the user’s geo location). And for this “Walmart” use case you may also need an “Available Online Only” option, etc. From: Karl Kildén Reply: solr-user@lucene.apache.org Date: February 6, 2017 at 5:57:41 AM To: solr-user@lucene.apache.org Subject: Help with design choice: join or multiValued field Hello! I have Items and I have Shops. This is a e-commerce system with items from thousands of shops all though the inventory is often similar between shops. Some users can shop from any shop and some only from their default one. One item can exist in about 1 shops. - When a user logs in they may have a shop pre selected so when they search for items we need to get all matching documents but if it's' found in their pre selected shop we should mark it out in the UI. - They need to be able to filter out only items in their current shop - Items found in their shop should always be boosted heavily TLDR: Either we just have a multiValued field on the item document with all shops. This would be a multivalued field with 1 rows Or Could we have a new document ShopItem that has the shopId and the itemId (think join table). Then we join this document instead... But we still need to get the Item document back, and we need bq boosting on item?
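A minimal sketch of what the multivalued-field approach looks like at query time; the field name shop_ids and the shop ID are invented for illustration. The fq variant restricts results to the user's current shop, and the bq variant keeps all matches but boosts the current shop heavily:

  /solr/items/select?q=tank+top&fq=shop_ids:shop_42

  /solr/items/select?defType=edismax&q=tank+top&bq=shop_ids:shop_42^100

The bq form supports "show everything, but mark/boost what is in the pre-selected shop"; the fq form is the "only my current shop" filter.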
Re: Time of insert
No; a historical log of document updates is not provided. Users need to implement such functionality themselves if needed. From: Mahmoud Almokadem Reply: solr-user@lucene.apache.org Date: February 6, 2017 at 3:32:34 PM To: solr-user@lucene.apache.org Subject: Time of insert Hello, I'm using dih on solr 6 for indexing data from sql server. The document can be indexed many times according to the updates on it. Is that available to get the first time the document inserted to solr? And how to get the dates of the document updated? Thanks for help, Mahmoud
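If only the time of the latest (re)index is needed, Solr can stamp it itself with an update processor; the original insert time, by contrast, has to come from the source database (for example a created_at column selected in the DIH query). A sketch of the update-processor approach, with the chain and field names chosen for illustration:

  <!-- solrconfig.xml -->
  <updateRequestProcessorChain name="add-timestamp" default="true">
    <processor class="solr.TimestampUpdateProcessorFactory">
      <str name="fieldName">indexed_at</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>

  <!-- schema.xml -->
  <field name="indexed_at" type="date" indexed="true" stored="true"/>

Note that a full reindex overwrites the whole document, so this records when the current version was indexed, not when the document first appeared.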
Re: How-To: Secure Solr by IP Address
*Deserves* to mention: I run Solr on 8080 port, and Firewall blocks *port* 8080. It is not indeed securing by IP address! “block by IP” vs. “block by port number” “block *all* services run on a machine by IP address” vs. “block only Jetty” and etc. Still need option for Jetty, it will simplify life ;) On November 4, 2016 at 12:05:13 PM, Fuad Efendi (f...@efendi.ca) wrote: Yes we need that documented, http://stackoverflow.com/questions/8924102/restricting-ip-addresses-for-jetty-and-solr Of course Firewall is a must for extremely strong environments / large corporations, DMZ, and etc; IPTables is the simplest solution if you run Linux; my vendor 1and1.com provides firewall functionality too - but I wouldn’t trust it: what if local at 1and1.com servers (in the same rack for example) can bypass this firewall? Having option to configure Jetty minimizes dependencies. In real production I’d use all possible options: firewall(s) + iptable + Jetty config + DMZ(s) -- Fuad Efendi (416) 993-2060 http://www.tokenizer.ca Search Relevancy, Recommender Systems On November 4, 2016 at 9:28:21 AM, David Smiley (david.w.smi...@gmail.com) wrote: I was just researching how to secure Solr by IP address and I finally figured it out. Perhaps this might go in the ref guide but I'd like to share it here anyhow. The scenario is where only "localhost" should have full unfettered access to Solr, whereas everyone else (notably web clients) can only access some whitelisted paths. This setup is intended for a single instance of Solr (not a member of a cluster); the particular config below would probably need adaptations for a cluster of Solr instances. The technique here uses a utility with Jetty called IPAccessHandler -- http://download.eclipse.org/jetty/stable-9/apidocs/org/eclipse/jetty/server/handler/IPAccessHandler.html For reasons I don't know (and I did search), it was recently deprecated and there's another InetAccessHandler (not in Solr's current version of Jetty) but it doesn't support constraints incorporating paths, so it's a non-option for my needs. First, Java must be told to insist on it's IPv4 stack. This is because Jetty's IPAccessHandler simply doesn't support IPv6 IP matching; it throws NPEs in my experience. In recent versions of Solr, this can be easily done just by adding -Djava.net.preferIPv4Stack=true at the Solr start invocation. Alternatively put it into SOLR_OPTS perhaps in solr.in.sh. Edit server/etc/jetty.xml, and replace the line mentioning ContextHandlerCollection with this: 127.0.0.1 -.-.-.-|/solr/techproducts/select false This mechanism wraps ContextHandlerCollection (which ultimately serves Solr) with this handler that adds the constraints. These constraints above allow localhost to do anything; other IP addresses can only access /solr/techproducts/select. That line could be duplicated for other white-listed paths -- I recommend creating request handlers for your use, possibly with invariants to further constraint what someone can do. note: I originally tried inserting the IPAccessHandler in server/contexts/solr-jetty-context.xml but found that there's a bug in IPAccessHanlder that fails to consider when HttpServletRequest.getPathInfo is null. And it wound up letting everything through (if I recall). But I like it up in server.xml anyway as it intercepts everything ~ David -- Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker LinkedIn: http://linkedin.com/in/davidwsmiley | Book: http://www.solrenterprisesearchserver.com
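For the firewall side of the discussion, a small iptables sketch (the port and the allowed subnet are made-up examples): permit a trusted network to reach the Jetty port and drop everyone else:

  iptables -A INPUT -p tcp --dport 8080 -s 10.0.0.0/8 -j ACCEPT
  iptables -A INPUT -p tcp --dport 8080 -j DROP

As the message above points out, this blocks the whole port for outside addresses; it does not give the per-path control that the Jetty-level IPAccessHandler configuration provides.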
Re: How-To: Secure Solr by IP Address
Yes we need that documented, http://stackoverflow.com/questions/8924102/restricting-ip-addresses-for-jetty-and-solr Of course Firewall is a must for extremely strong environments / large corporations, DMZ, and etc; IPTables is the simplest solution if you run Linux; my vendor 1and1.com provides firewall functionality too - but I wouldn’t trust it: what if local at 1and1.com servers (in the same rack for example) can bypass this firewall? Having option to configure Jetty minimizes dependencies. In real production I’d use all possible options: firewall(s) + iptable + Jetty config + DMZ(s) -- Fuad Efendi (416) 993-2060 http://www.tokenizer.ca Search Relevancy, Recommender Systems On November 4, 2016 at 9:28:21 AM, David Smiley (david.w.smi...@gmail.com) wrote: I was just researching how to secure Solr by IP address and I finally figured it out. Perhaps this might go in the ref guide but I'd like to share it here anyhow. The scenario is where only "localhost" should have full unfettered access to Solr, whereas everyone else (notably web clients) can only access some whitelisted paths. This setup is intended for a single instance of Solr (not a member of a cluster); the particular config below would probably need adaptations for a cluster of Solr instances. The technique here uses a utility with Jetty called IPAccessHandler -- http://download.eclipse.org/jetty/stable-9/apidocs/org/eclipse/jetty/server/handler/IPAccessHandler.html For reasons I don't know (and I did search), it was recently deprecated and there's another InetAccessHandler (not in Solr's current version of Jetty) but it doesn't support constraints incorporating paths, so it's a non-option for my needs. First, Java must be told to insist on it's IPv4 stack. This is because Jetty's IPAccessHandler simply doesn't support IPv6 IP matching; it throws NPEs in my experience. In recent versions of Solr, this can be easily done just by adding -Djava.net.preferIPv4Stack=true at the Solr start invocation. Alternatively put it into SOLR_OPTS perhaps in solr.in.sh. Edit server/etc/jetty.xml, and replace the line mentioning ContextHandlerCollection with this: 127.0.0.1 -.-.-.-|/solr/techproducts/select false This mechanism wraps ContextHandlerCollection (which ultimately serves Solr) with this handler that adds the constraints. These constraints above allow localhost to do anything; other IP addresses can only access /solr/techproducts/select. That line could be duplicated for other white-listed paths -- I recommend creating request handlers for your use, possibly with invariants to further constraint what someone can do. note: I originally tried inserting the IPAccessHandler in server/contexts/solr-jetty-context.xml but found that there's a bug in IPAccessHanlder that fails to consider when HttpServletRequest.getPathInfo is null. And it wound up letting everything through (if I recall). But I like it up in server.xml anyway as it intercepts everything ~ David -- Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker LinkedIn: http://linkedin.com/in/davidwsmiley | Book: http://www.solrenterprisesearchserver.com
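David's jetty.xml snippet lost its markup in the archive; only the values 127.0.0.1, -.-.-.-|/solr/techproducts/select, and false survive. A rough reconstruction of the shape he describes (an IPAccessHandler wrapping the ContextHandlerCollection), based on those values and the linked javadoc; treat it as a sketch, not the original posting:

  <Set name="handler">
    <New class="org.eclipse.jetty.server.handler.IPAccessHandler">
      <Call name="addWhite"><Arg>127.0.0.1</Arg></Call>
      <Call name="addWhite"><Arg>-.-.-.-|/solr/techproducts/select</Arg></Call>
      <Set name="whiteListByPath">false</Set>
      <Set name="handler">
        <New id="Contexts" class="org.eclipse.jetty.server.handler.ContextHandlerCollection"/>
      </Set>
    </New>
  </Set>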
Re: Different Sorts based on Different Groups
Hi Gustatec, Relevancy tuning is a really *huge* area; check this book when you have a chance: https://www.manning.com/books/relevant-search Default Solr sorting is based on the TF/IDF algorithm, and that kind of sorting is not necessarily ‘relevancy’. A trivial solution for the clothes-store domain would be this one; it is easier to explain with examples: Product 1 Name: "Russell Athletic Men's Basic Tank Top" Categories: “Shirt”, “Sleeveless Shirt”, “Tank Top” Product 2 Name: "Russell Athletic Men's Cotton Muscle Shirt" Categories: “Shirt”, “Sleeveless Shirt”, “Tank Top” You may notice that the first product has “Top” repeated in both the product name and a category, and the second one has “Shirt” repeated. Now, having this real-life example, you can play with a boost query, boosting results whose product name contains words from the category name: category:”Tank Top” & bq:”name:tank^10 OR name:top^5" Solr provides the “boost query” to tune the ordering of output results; check the “bq” parameter in the docs at https://cwiki.apache.org/confluence/display/solr/The+DisMax+Query+Parser I went from a real-life scenario; your scenario and possible solutions could be very different. I recently had an assignment at a well-known retail shop where we even designed pre-query custom boosts so that we could customize typical (most important for the business) queries as per business needs. Thanks, -- Fuad Efendi (416) 993-2060 http://www.tokenizer.ca Search Relevancy, Recommender Systems On November 4, 2016 at 10:57:02 AM, Gustatec (gusta...@gmail.com) wrote: Hello everyone! I'm currently using Solr in a project (pretty much an e-commerce POC) and came across with the following sort situation: I have two products one called Product1 and other one called Product2, both of them belongs to the same categories, Shirt(ID 1) and Tank-Top(ID 2) When i query for any of these categories, it returns both of the products, in the same order. Is it possible to do some kind of grouping sort in query? So when i query for category Shirt, it returns first Product1 then Product2 and when i do the same query for category Tank-Top it would return first Product2 then Product1? By asking that i wonder if its possible to make a product more relevant, based on the query. So product1 relevancy would be Category ID | Priority 1 | 1 2 | 2 And product2 would be Category ID | Priority 1 | 2 2 | 1 Is it possible to achieve this "elevate" funcionality in query? i thought in doing a _sort field for all categories, but we are actually talking about a few hundred categories, so i dont know if would be viable to create one sort field for each one of them in every single doc... Ps: I asks if its achievable that in query because i dont know if there is any other way of changing the elevate.xml file without having to restart my solr instance Sorry for my bad english, and thanks in advance! -- View this message in context: http://lucene.472066.n3.nabble.com/Different-Sorts-based-on-Different-Groups-tp4304516.html Sent from the Solr - User mailing list archive at Nabble.com.
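A minimal sketch of such a request, with the collection, field names and boosts purely illustrative; the bq clauses boost items whose name repeats words from the selected category:

  /solr/products/select?defType=edismax&q=*:*&fq=category:%22Tank+Top%22&bq=name:tank^10+name:top^5

Clauses in a boost query are optional, so documents matching them simply score higher; they are not required to match.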
Re: Problem with Password Decryption in Data Import Handler
Then I can only guess that in current configuration decrypted password is empty string. Try to manually replace some characters in encpwd.txt file to see if you get different errors; try to delete this file completely to see if you get different errors. Try to add new line in this file; try to change password in config file. On November 2, 2016 at 5:23:33 PM, Jamie Jackson (jamieja...@gmail.com) wrote: I should have mentioned that I verified connectivity with plain passwords: From the same machine that Solr's running on: solr@000650cbdd5e:/opt/solr$ mysql -uroot -pOakton153 -h local.mysite.com mysite -e "select 'foo' as bar;" +-+ | bar | +-+ | foo | +-+ Also, if I add the plain-text password to the config, it connects fine: So that is why I claim to have a problem with encryptKeyFile, specifically, because I've eliminated general connectivity/authentication problems. Thanks, Jamie On Wed, Nov 2, 2016 at 4:58 PM, Fuad Efendi <f...@efendi.ca> wrote: > In MySQL, this command will explicitly allow to connect from > remote ICZ2002912 host, check MySQL documentation: > > GRANT ALL ON mysite.* TO 'root’@'ICZ2002912' IDENTIFIED BY ‘Oakton123’; > > > > On November 2, 2016 at 4:41:48 PM, Fuad Efendi (f...@efendi.ca) wrote: > > This is the root of the problem: > "Access denied for user 'root'@'ICZ2002912' (using password: NO) “ > > > First of all, ensure that plain (non-encrypted) password settings work for > you. > > Check that you can connect using MySQL client from ICZ2002912 to your > MySQL & Co. instance > > I suspect you need to allow MySQL & Co. to accept connections > from ICZ2002912. Plus, check DNS resolution, etc. > > > Thanks, > > > -- > Fuad Efendi > (416) 993-2060 > http://www.tokenizer.ca > Recommender Systems > > > On November 2, 2016 at 2:37:08 PM, Jamie Jackson (jamieja...@gmail.com) > wrote: > > I'm at a brick wall. Here's the latest status: > > Here are some sample commands that I'm using: > > *Create the encryptKeyFile and encrypted password:* > > > encrypter_password='this_is_my_encrypter_password' > plain_db_pw='Oakton153' > > cd /var/docker/solr_stage2/credentials/ > echo -n "${encrypter_password}" > encpwd.txt > echo -n "${plain_db_pwd}" > plaindbpwd.txt > openssl enc -aes-128-cbc -a -salt -in plaindbpwd.txt -k > "${encrypter_password}" > > rm plaindbpwd.txt > > That generated this as the password, by the way: > > U2FsdGVkX19pBVTeZaSl43gFFAlrx+Th1zSg1GvlX9o= > > *Configure DIH configuration:* > > > > driver="org.mariadb.jdbc.Driver" > url="jdbc:mysql://local.mysite.com:3306/mysite" > user="root" > password="U2FsdGVkX19pBVTeZaSl43gFFAlrx+Th1zSg1GvlX9o=" > encryptKeyFile="/opt/solr/credentials/encpwd.txt" > /> > ... > > > By the way, /var/docker/solr_stage2/credentials/ is mapped to > /opt/solr/credentials/ in the docker container, so that's why the paths > *seem* different (but aren't, really). > > > *Authentication error when data import is run:* > > Exception while processing: question document : > SolrInputDocument(fields: > []):org.apache.solr.handler.dataimport.DataImportHandlerException: > Unable to execute query: select 'foo' as bar; Processing > Document # 1 > at org.apache.solr.handler.dataimport.DataImportHandlerException. > wrapAndThrow(DataImportHandlerException.java:69) > at org.apache.solr.handler.dataimport.JdbcDataSource$ > ResultSetIterator.(JdbcDataSource.java:323) > at org.apache.solr.handler.dataimport.JdbcDataSource. > getData(JdbcDataSource.java:283) > at org.apache.solr.handler.dataimport.JdbcDataSource. 
> getData(JdbcDataSource.java:52) > at org.apache.solr.handler.dataimport.SqlEntityProcessor. > initQuery(SqlEntityProcessor.java:59) > at org.apache.solr.handler.dataimport.SqlEntityProcessor. > nextRow(SqlEntityProcessor.java:73) > at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow( > EntityProcessorWrapper.java:244) > at org.apache.solr.handler.dataimport.DocBuilder. > buildDocument(DocBuilder.java:475) > at org.apache.solr.handler.dataimport.DocBuilder. > buildDocument(DocBuilder.java:414) > at org.apache.solr.handler.dataimport.DocBuilder. > doFullDump(DocBuilder.java:329) > at org.apache.solr.handler.dataimport.DocBuilder.execute( > DocBuilder.java:232) > at org.apache.solr.handler.dataimport.DataImporter. > doFullImport(Dat
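For reference, the dataSource element under discussion, whose markup is partly lost above; the attribute values are the ones quoted earlier in the thread, and everything else should be treated as illustrative:

  <dataSource
      driver="org.mariadb.jdbc.Driver"
      url="jdbc:mysql://local.mysite.com:3306/mysite"
      user="root"
      password="U2FsdGVkX19pBVTeZaSl43gFFAlrx+Th1zSg1GvlX9o="
      encryptKeyFile="/opt/solr/credentials/encpwd.txt"/>

When encryptKeyFile is set, DIH decrypts the password attribute with the key read from that file; a decryption problem can leave the effective password empty, which is the guess in the reply above and matches the "(using password: NO)" in the stack trace.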
Re: Problem with Password Decryption in Data Import Handler
In MySQL, this command will explicitly allow to connect from remote ICZ2002912 host, check MySQL documentation: GRANT ALL ON mysite.* TO 'root’@'ICZ2002912' IDENTIFIED BY ‘Oakton123’; On November 2, 2016 at 4:41:48 PM, Fuad Efendi (f...@efendi.ca) wrote: This is the root of the problem: "Access denied for user 'root'@'ICZ2002912' (using password: NO) “ First of all, ensure that plain (non-encrypted) password settings work for you. Check that you can connect using MySQL client from ICZ2002912 to your MySQL & Co. instance I suspect you need to allow MySQL & Co. to accept connections from ICZ2002912. Plus, check DNS resolution, etc. Thanks, -- Fuad Efendi (416) 993-2060 http://www.tokenizer.ca Recommender Systems On November 2, 2016 at 2:37:08 PM, Jamie Jackson (jamieja...@gmail.com) wrote: I'm at a brick wall. Here's the latest status: Here are some sample commands that I'm using: *Create the encryptKeyFile and encrypted password:* encrypter_password='this_is_my_encrypter_password' plain_db_pw='Oakton153' cd /var/docker/solr_stage2/credentials/ echo -n "${encrypter_password}" > encpwd.txt echo -n "${plain_db_pwd}" > plaindbpwd.txt openssl enc -aes-128-cbc -a -salt -in plaindbpwd.txt -k "${encrypter_password}" rm plaindbpwd.txt That generated this as the password, by the way: U2FsdGVkX19pBVTeZaSl43gFFAlrx+Th1zSg1GvlX9o= *Configure DIH configuration:* ... By the way, /var/docker/solr_stage2/credentials/ is mapped to /opt/solr/credentials/ in the docker container, so that's why the paths *seem* different (but aren't, really). *Authentication error when data import is run:* Exception while processing: question document : SolrInputDocument(fields: []):org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to execute query: select 'foo' as bar; Processing Document # 1 at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:69) at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.(JdbcDataSource.java:323) at org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:283) at org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:52) at org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:59) at org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:73) at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:244) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:475) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:414) at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:329) at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:232) at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:416) at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:480) at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:461) Caused by: java.sql.SQLInvalidAuthorizationSpecException: Could not connect: Access denied for user 'root'@'ICZ2002912' (using password: NO) at org.mariadb.jdbc.internal.util.ExceptionMapper.get(ExceptionMapper.java:123) at org.mariadb.jdbc.internal.util.ExceptionMapper.throwException(ExceptionMapper.java:71) at org.mariadb.jdbc.Driver.connect(Driver.java:109) at org.apache.solr.handler.dataimport.JdbcDataSource$1.call(JdbcDataSource.java:192) at 
org.apache.solr.handler.dataimport.JdbcDataSource$1.call(JdbcDataSource.java:172) at org.apache.solr.handler.dataimport.JdbcDataSource.getConnection(JdbcDataSource.java:503) at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.(JdbcDataSource.java:313) ... 12 more Caused by: org.mariadb.jdbc.internal.util.dao.QueryException: Could not connect: Access denied for user 'root'@'ICZ2002912' (using password: NO) at org.mariadb.jdbc.internal.protocol.AbstractConnectProtocol.authentication(AbstractConnectProtocol.java:524) at org.mariadb.jdbc.internal.protocol.AbstractConnectProtocol.handleConnectionPhases(AbstractConnectProtocol.java:472) at org.mariadb.jdbc.internal.protocol.AbstractConnectProtocol.connect(AbstractConnectProtocol.java:374) at org.mariadb.jdbc.internal.protocol.AbstractConnectProtocol.connectWithoutProxy(AbstractConnectProtocol.java:763) at org.mariadb.jdbc.internal.util.Utils.retrieveProxy(Utils.java:469) at org.mariadb.jdbc.Driver.connect(Driver.java:104) ... 16 more On Thu, Oct 6, 2016 at 2:42 PM, Jamie Jackson <jamieja...@gmail.com> wrote: > It happens to be ten characters. > > On Thu, Oct 6, 2016 at 12:44 PM, Alexandre Rafalovitch <arafa...@gmail.com > > wrote: > >> How long is the encryption key (file content)? Because the code I am >> looking at seems to expect it to be at most 100 ch
Re: Problem with Password Decryption in Data Import Handler
This is the root of the problem: "Access denied for user 'root'@'ICZ2002912' (using password: NO) “ First of all, ensure that plain (non-encrypted) password settings work for you. Check that you can connect using MySQL client from ICZ2002912 to your MySQL & Co. instance I suspect you need to allow MySQL & Co. to accept connections from ICZ2002912. Plus, check DNS resolution, etc. Thanks, -- Fuad Efendi (416) 993-2060 http://www.tokenizer.ca Recommender Systems On November 2, 2016 at 2:37:08 PM, Jamie Jackson (jamieja...@gmail.com) wrote: I'm at a brick wall. Here's the latest status: Here are some sample commands that I'm using: *Create the encryptKeyFile and encrypted password:* encrypter_password='this_is_my_encrypter_password' plain_db_pw='Oakton153' cd /var/docker/solr_stage2/credentials/ echo -n "${encrypter_password}" > encpwd.txt echo -n "${plain_db_pwd}" > plaindbpwd.txt openssl enc -aes-128-cbc -a -salt -in plaindbpwd.txt -k "${encrypter_password}" rm plaindbpwd.txt That generated this as the password, by the way: U2FsdGVkX19pBVTeZaSl43gFFAlrx+Th1zSg1GvlX9o= *Configure DIH configuration:* ... By the way, /var/docker/solr_stage2/credentials/ is mapped to /opt/solr/credentials/ in the docker container, so that's why the paths *seem* different (but aren't, really). *Authentication error when data import is run:* Exception while processing: question document : SolrInputDocument(fields: []):org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to execute query: select 'foo' as bar; Processing Document # 1 at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:69) at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.(JdbcDataSource.java:323) at org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:283) at org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:52) at org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:59) at org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:73) at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:244) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:475) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:414) at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:329) at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:232) at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:416) at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:480) at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:461) Caused by: java.sql.SQLInvalidAuthorizationSpecException: Could not connect: Access denied for user 'root'@'ICZ2002912' (using password: NO) at org.mariadb.jdbc.internal.util.ExceptionMapper.get(ExceptionMapper.java:123) at org.mariadb.jdbc.internal.util.ExceptionMapper.throwException(ExceptionMapper.java:71) at org.mariadb.jdbc.Driver.connect(Driver.java:109) at org.apache.solr.handler.dataimport.JdbcDataSource$1.call(JdbcDataSource.java:192) at org.apache.solr.handler.dataimport.JdbcDataSource$1.call(JdbcDataSource.java:172) at org.apache.solr.handler.dataimport.JdbcDataSource.getConnection(JdbcDataSource.java:503) at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.(JdbcDataSource.java:313) ... 
12 more Caused by: org.mariadb.jdbc.internal.util.dao.QueryException: Could not connect: Access denied for user 'root'@'ICZ2002912' (using password: NO) at org.mariadb.jdbc.internal.protocol.AbstractConnectProtocol.authentication(AbstractConnectProtocol.java:524) at org.mariadb.jdbc.internal.protocol.AbstractConnectProtocol.handleConnectionPhases(AbstractConnectProtocol.java:472) at org.mariadb.jdbc.internal.protocol.AbstractConnectProtocol.connect(AbstractConnectProtocol.java:374) at org.mariadb.jdbc.internal.protocol.AbstractConnectProtocol.connectWithoutProxy(AbstractConnectProtocol.java:763) at org.mariadb.jdbc.internal.util.Utils.retrieveProxy(Utils.java:469) at org.mariadb.jdbc.Driver.connect(Driver.java:104) ... 16 more On Thu, Oct 6, 2016 at 2:42 PM, Jamie Jackson <jamieja...@gmail.com> wrote: > It happens to be ten characters. > > On Thu, Oct 6, 2016 at 12:44 PM, Alexandre Rafalovitch <arafa...@gmail.com > > wrote: > >> How long is the encryption key (file content)? Because the code I am >> looking at seems to expect it to be at most 100 characters. >> >> Regards, >> Alex. >> >> Newsletter and res
Re: Timeout occured while waiting response from server at: http://***/solr/commodityReview
My 2 cents (rounded): Quote: "the size of our index data is more than 30GB every year now” - is it the size of *data* or the size of *index*? This is super important! You can have petabytes of data, growing terabytes a year, and your index files will grow only few gigabytes a year at most. Note also that Lucene index files are immutable: it means that, for example, if your index files total size is 25Gb in a filesystem, then having at least 25Gb+2Gb of free RAM available (for index files + for OS) will be beneficial (as already mentioned in this thread). However, caching of index files in a RAM won’t reduce search performance from minutes of response time to milliseconds. If you really have timeouts (and I believe you use at least 60 seconds timeout settings for SolrJ) then possible reasons could be: 1. “Shared VM” such as Amazon shared nodes, sometimes they just stop for few minutes 2. Garbage collection in Java 3. Sophisticated Solr query such as faceting and aggregations, with inadequately configured field cache and other caches Having 100Gb index files in a filesystem cannot cause more than a few milliseconds response times for trivial queries such as “text:Solr”! (Exception: faceting) You need to isolate (troubleshoot) your timeouts, and you mentioned it only happens during new queries to the new searcher after replication from Master to Slave. Which means Case #3: improperly configured cache parameters. You need warm-up query. New Solr searcher will become available after internal caches warmed up (prepopulated with data). Memory estimate example: suppose you configured Solr such a way that it will use field cache for SKU field. Suppose SKU field is 64 bytes in average (UTF8 will take 2 bytes per character), and you have 100 millions of documents. Then you will need 6,400,000,000 bytes for just this instance of a field cache, more than 4Gb! This is basic formula. If you have few such fields, then you will need ton of memory, and you need few minutes to warm-up field cache. Calculate it properly: 8Gb or 24Gb? Consider sharding / SolrCloud if you need huge memory just for field cache. And you will be forced to consider it if you gave more that 2 billions documents (am I right? Lucene internal limitation, Integer.MAX_INT) Thanks, -- Fuad Efendi (416) 993-2060 http://www.tokenizer.ca Search Relevancy and Recommender Systems On November 2, 2016 at 1:11:10 PM, Erick Erickson (erickerick...@gmail.com) wrote: You need to move to SolrCloud when it's time to shard ;). More seriously, at some point simply adding more memory will not be adequate. Either your JVM heap will to grow to a point where you start encountering GC pauses or the time to serve requests will increase unacceptably. "when?" you ask? well unfortunately there are no guidelines that can be guaranteed, here's a long blog on the subject: https://lucidworks.com/blog/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/ The short form is you need to stress-test your index and query patterns. Now, I've seen 20M docs strain a 32G Java heap. I've seen 300M docs give very nice response times with 12G of memory. It Depends (tm). Whether to put Solr on bare metal or not: There's inevitably some penalty for a VM. That said there are lots of places that use VMs successfully. Again, stress testing is the key. 
And finally, using docValues for any field that sorts, facets or groups will reduce the JVM requirements significantly, albeit by using OS memory space, see Uwe's excellent blog: http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html Best, Erick On Tue, Nov 1, 2016 at 10:23 PM, Kent Mu <solr.st...@gmail.com> wrote: > Thanks, I got it, Erick! > > the size of our index data is more than 30GB every year now, and it is > still growing up, and actually our solr now is running on a virtual > machine. so I wonder if we need to deploy solr in a physical machine, or I > can just upgrade the physical memory of our Virtual machines? > > Best, > Kent > > 2016-11-02 11:33 GMT+08:00 Erick Erickson <erickerick...@gmail.com>: > >> Kent: OK, I see now. Then a minor pedantic point... >> >> It'll avoid confusion if you use master and slaves >> rather than master and replicas when talking about >> non-cloud setups. >> >> The equivalent in SolrCloud is leader and replicas. >> >> No big deal either way, just FYI. >> >> Best, >> Erick >> >> On Tue, Nov 1, 2016 at 8:09 PM, Kent Mu <solr.st...@gmail.com> wrote: >> > Thanks a lot for your reply, Shawn! >> > >> > no other applications on the server, I agree with you that we need to >> > upgrade physical memory, and allocat
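Erick's docValues point translates into a per-field schema attribute; a sketch with an invented field name:

  <field name="category" type="string" indexed="true" stored="true" docValues="true"/>

Fields with docValues that are used for sorting, faceting, or grouping are read as column data through the OS page cache rather than un-inverted onto the Java heap, which is what shrinks the JVM requirement he mentions.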
Re: Timeout occured while waiting response from server at: http://***/solr/commodityReview
Quote: It takes place not often. after analysis, we find that only when the replicas Synchronous Data from master solr server. it seem that when the replicas block search requests when synchronizing data from master, is that true? Solr makes new searcher available after replication complete, and new *trivial* searches should take milliseconds of response time even with zero cache tunings including OS managed caches for filesystem. However, if first search coming uses faceting (which uses field caches) then it may takes from seconds to minutes to many minutes just to warm up internal caches. Solr has the way to warm up internal caches before making new searcher available: https://cwiki.apache.org/confluence/display/solr/Query+Settings+in+SolrConfig Make this queries typical for your use cases (for instance, *:* with faceting): Thanks, -- Fuad Efendi (416) 993-2060 http://www.tokenizer.ca Search Relevancy and Recommender Systems On November 1, 2016 at 12:07:50 PM, Kent Mu (solr.st...@gmail.com) wrote: Hi friends! We come across an issue when we use the solrj(4.9.1) to connect to solr server, our deployment is one master with 10 replicas. we index data to the master, and search data from the replicas via load balancing. the error stack is as below: *Timeout occured while waiting response from server at: http://review.solrsearch3.cnsuning.com/solr/commodityReview <http://review.solrsearch3.cnsuning.com/solr/commodityReview>* org.apache.solr.client.solrj.SolrServerException: Timeout occured while waiting response from server at: http://review.solrsearch3.cnsuning.com/solr/commodityReview at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:562) ~[solr-solrj-4.9.1.jar:4.9.1 1625909 - mike - 2014-09-18 04:09:05] at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210) ~[solr-solrj-4.9.1.jar:4.9.1 1625909 - mike - 2014-09-18 04:09:05] at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206) ~[solr-solrj-4.9.1.jar:4.9.1 1625909 - mike - 2014-09-18 04:09:05] at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:91) ~[solr-solrj-4.9.1.jar:4.9.1 1625909 - mike - 2014-09-18 04:09:05] at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:310) ~[solr-solrj-4.9.1.jar:4.9.1 1625909 - mike - 2014-09-18 04:09:05] It takes place not often. after analysis, we find that only when the replicas Synchronous Data from master solr server. it seem that when the replicas block search requests when synchronizing data from master, is that true? I wonder if it is because that our solr server hardware configuration is too low? the physical memory is 8G with 4 cores. and the JVM we set is Xms512m, Xmx7168m. looking forward to your reply. Thanks!
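The warming-query example referred to above ("*:* with faceting") lost its XML in the archive. A minimal sketch of a newSearcher listener in solrconfig.xml, with the facet field name invented for illustration:

  <listener event="newSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst>
        <str name="q">*:*</str>
        <str name="facet">true</str>
        <str name="facet.field">category</str>
        <str name="rows">0</str>
      </lst>
    </arr>
  </listener>

With useColdSearcher left at its default of false, the new searcher is not exposed to traffic until these queries have run, so the first real request after replication does not pay the cache warm-up cost.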
Foot, Inch: Stripping Out Special Characters: DisMax: WhitespaceTokenizer vs. Keyword Tokenizer
Hello, I finally got it to work: searching for 5’ 3” (5 feet 3 inches). It is strange to me that if I use WhitespaceTokenizer in the field's query-time analyzer, it receives only 5 and 3, with the special characters removed. It is also strange that eDisMax does not strip out an odd number of quotes. But it works fine with KeywordTokenizer. Any idea why? Thanks, -- Fuad Efendi http://www.tokenizer.ca Data Mining, Vertical Search
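For context, a sketch of the two analyzer shapes being compared; the field type names are illustrative, and this is not a recommendation, just the two configurations mentioned above:

  <fieldType name="text_ws" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    </analyzer>
  </fieldType>

  <fieldType name="text_kw" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
    </analyzer>
  </fieldType>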
Re: Stopping Solr JVM on OOM
The best practice: do not ever try to catch Throwable or its descendants Error, VirtualMachineError, OutOfMemoryError, etc. Never ever. Also, do not swallow InterruptedException in a loop. A few simple rules to avoid a hanging application. If we follow these, there will be no question of "what is the best way to stop Solr when it gets into OOM" (or just becomes unresponsive because of swallowed exceptions). -- Fuad Efendi 416-993-2060(cell) On February 25, 2016 at 2:37:45 PM, CP Mishra (mishr...@gmail.com) wrote: Looking at the previous threads (and in our tests), oom script specified at command line does not work as OOM exception is trapped and converted to RuntimeException. So, what is the best way to stop Solr when it gets in OOM state? The only way I see is to override multiple handlers and do System.exit() from there. Is there a better way? We are using Solr with default Jetty container. Thanks, CP Mishra
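As an illustration of the two rules above (plain Java, nothing Solr-specific; the work inside the loop is a placeholder): catching Throwable also traps OutOfMemoryError, which the JVM cannot recover from, and swallowing InterruptedException keeps a loop spinning after it has been asked to stop. A minimal sketch of the safer shape:

  public final class Worker implements Runnable {
      @Override
      public void run() {
          // Do NOT wrap this loop in catch (Throwable t): that would also trap
          // OutOfMemoryError and other errors the JVM cannot recover from.
          while (!Thread.currentThread().isInterrupted()) {
              try {
                  doWork();
              } catch (InterruptedException e) {
                  // Restore the interrupt flag and let the loop exit instead of swallowing it.
                  Thread.currentThread().interrupt();
              } catch (Exception e) {
                  // Handle only exceptions we can actually recover from.
                  System.err.println("recoverable: " + e);
              }
          }
      }

      private void doWork() throws InterruptedException {
          Thread.sleep(1000); // placeholder for real work
      }
  }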
RE: Solr HTTP client authentication
“I can manually create an httpclient and set up authentication but then I can't use solrj.” Yes, correct, except that you _can_ use SolrJ with this custom HttpClient instance (which will handle authentication, and will support cookies, SSL or plain HTTP, Keep-Alive, etc.). You can provide the custom HttpClient to SolrJ at construction: final HttpSolrServer myHttpSolrServer = new HttpSolrServer(SOLR_URL_BASE + "/" + SOLR_CORE_NAME, myHttpClient); Best Regards, http://www.tokenizer.ca -Original Message- From: Anurag Sharma [mailto:anura...@gmail.com] Sent: November-17-14 11:21 AM To: solr-user@lucene.apache.org Subject: Re: Solr HTTP client authentication I think Solr encourages SSL rather than authentication On Mon, Nov 17, 2014 at 6:08 PM, Bai Shen baishen.li...@gmail.com wrote: I am using solrj to connect to my solr server. However I need to authenticate against the server and can not find out how to do so using solrj. Is this possible or do I need to drop solrj? I can manually create an httpclient and set up authentication but then I can't use solrj. Thanks.
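A sketch of what that looks like with SolrJ 4.x and Apache HttpClient 4.x, using basic authentication; the URL, core name, and credentials are placeholders:

  import org.apache.http.auth.AuthScope;
  import org.apache.http.auth.UsernamePasswordCredentials;
  import org.apache.http.impl.client.DefaultHttpClient;
  import org.apache.solr.client.solrj.impl.HttpSolrServer;

  public class AuthenticatedSolrClientExample {
      public static void main(String[] args) {
          // HttpClient that answers the server's basic-auth challenge.
          DefaultHttpClient httpClient = new DefaultHttpClient();
          httpClient.getCredentialsProvider().setCredentials(
                  AuthScope.ANY,
                  new UsernamePasswordCredentials("solr-user", "secret"));

          // SolrJ reuses this client (auth, cookies, keep-alive) for every request.
          HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1", httpClient);
          // ... solr.query(...), solr.add(...), etc.
      }
  }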
Please add me: FuadEfendi
Hi, A few months ago I was able to modify the Wiki; I can't do it now, probably because of http://wiki.apache.org/solr/ContributorsGroup Please add me: FuadEfendi Thanks! -- Fuad Efendi, PhD, CEO C: (416)993-2060 F: (416)800-6479 Tokenizer Inc., Canada http://www.tokenizer.ca
contributor group
Hi, Please add me: FuadEfendi Thanks! -- http://www.tokenizer.ca
RE: Can SOLR Index UTF-16 Text
Something is missing from the body of your email... As I pointed out in my previous message, in general Solr can index _everything_ (provided that you have a Tokenizer for it); but, in addition to _indexing_, you need an HTTP-based _search_ which must understand UTF-16 (for instance). The easiest solution is to convert files to UTF-8 before indexing and to use UTF-8 as the default Java character encoding (java -Dfile.encoding=UTF-8 ...; including even the Tomcat HTTP settings). This is really the simplest... and the fastest by performance... and you should be able to use the Highlighter feature, etc... -Fuad Efendi http://www.tokenizer.ca -Original Message- From: vybe3142 [mailto:vybe3...@gmail.com] Sent: October-03-12 12:30 PM To: solr-user@lucene.apache.org Subject: Re: Can SOLR Index UTF-16 Text Thanks for all the responses. Problem partially solved (see below) 1. In a sense, my question is theoretical since the input to our SOLR server is (currently) UTF-8 files produced by a third party text extraction utility (not Tika). On the server side, we read and index the text via a custom data handler. Last week, I tried a UTF-16 file to see what would happen, and it wasn't handled correctly, as explained in my original question. 2. The file is UTF 16 3. We can either (a) stream the data to SOLR in the call or (b) use the stream.file parameter to provide the file path to the SOLR handler. Assuming case (a) Here's how the SOLRJ request is constructed (code edited for conciseness) If I replace the last line with things work What would I need to do in case (b), where the raw file is loaded remotely i.e. my handler reads the file directly In this case, how can I control what the content type is? Thanks -- View this message in context: http://lucene.472066.n3.nabble.com/Can-SOLR-Index-UTF-16-Text-tp4010834p4011634.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: Can SOLR Index UTF-16 Text
Hi, my previous message was partially wrong: Please note that ANY IMAGINABLE SOLUTION will use encoding/decoding; and the real question is where should it happen? A. (Solr) Java Container is responsible for UTF-16 - Java String B. Client will do UTF-8 -UTF-16 before submitting data to (Solr) Java Container And the correct answer is A. Because Java internally stores everything in UTF-16. So that overhead of (Document)UTF16-(Java)UTF16 is absolutely minimal (and performance is the best possible; although file sizes could be higher...) You need to start SOLR (Tomcat Java) with the parameter java -Dfile.encoding=UTF-16 http://docs.oracle.com/javase/7/docs/api/java/nio/charset/Charset.html And, possibly, configure HTTP Connector of Tomcat to UTF-16 Connector port=8080 URIEncoding=UTF-16/ (and use proper encoding HTTP Request Headers when you POST your file to Solr) -Fuad Efendi http://www.tokenizer.ca -Original Message- From: Fuad Efendi [mailto:f...@efendi.ca] Sent: October-03-12 1:30 PM To: solr-user@lucene.apache.org Subject: RE: Can SOLR Index UTF-16 Text Something is missing from the body of your Email... As I pointed in my previous message, in general Solr can index _everything_ (provided that you have Tokenizer for that); but, additionally to _indexing_ you need an HTTP-based _search_ which must understand UTF-16 (for instance) Easiest solution is to transfer files to UTF-8 before indexing and to use UTF-8 as a as default Java character encoding ( java -Dfile.encoding=UTF-8 ...; including even Tomcat HTTP settings). This is really the simplest... and fastest by performance... and you should be able to use Highlighter feature and etc... -Fuad Efendi http://www.tokenizer.ca -Original Message- From: vybe3142 [mailto:vybe3...@gmail.com] Sent: October-03-12 12:30 PM To: solr-user@lucene.apache.org Subject: Re: Can SOLR Index UTF-16 Text Thanks for all the responses. Problem partially solved (see below) 1. In a sense, my question is theoretical since the input to out SOLR server is (currently) UTF-8 files produced by a third party text extraction utility (not Tika). On the server side, we read and index the text via a custom data handler. Last week, I tried a UTF-16 file to see what would happen, and it wasn't handled correctly, as explained in my original question. 2. The file is UTF 16 3. We can either (a)stream the data to SOLR in the call or (b)use the stream.file parameter to provide the file path to the SOLR handler. Assuming case (a) Here's how the SOLRJ request is constructed (code edited for conciseness) If I replace the last line with things work What would I need to do in case (b), . wherer the raw file is loaded remotely i.e. my handler reads the file directly In this case, how can I control what the content type is ? Thanks -- View this message in context: http://lucene.472066.n3.nabble.com/Can-SOLR-Index-UTF-16-Text-tp4010834p4011 634.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: Can SOLR Index UTF-16 Text
Solr can index bytearrays too: unigram, bigram, trigram... even bitsets, tritsets, qatrisets ;- ) LOL I got strong cold... BTW, don't forget to configure UTF-8 as your default (Java) container encoding... -Fuad
Re: UnInvertedField limitations
Hi Jack, 24bit = 16M possibilities, it's clear; just to confirm... the rest is unclear, why 4-byte can have 4 million cardinality? I thought it is 4 billions... And, just to confirm: UnInvertedField allows 16M cardinality, correct? On 12-08-20 6:51 PM, Jack Krupansky j...@basetechnology.com wrote: It appears that there is a hard limit of 24-bits or 16M for the number of bytes to reference the terms in a single field of a single document. It takes 1, 2, 3, 4, or 5 bytes to reference a term. If it took 4 bytes, that would allow 16/4 or 4 million unique terms - per document. Do you have such large documents? This appears to be a hard limit based of 24-bytes in a Java int. You can try facet.method=enum, but that may be too slow. What release of Solr are you running? -- Jack Krupansky -Original Message- From: Fuad Efendi Sent: Monday, August 20, 2012 4:34 PM To: Solr-User@lucene.apache.org Subject: UnInvertedField limitations Hi All, I have a problemŠ (Yonik, please!) help me, what is Term count limits? I possibly have 256,000,000 different terms in a fieldŠ or 16,000,000? Thanks! 2012-08-20 16:20:19,262 ERROR [solr.core.SolrCore] - [pool-1-thread-1] - : org.apache.solr.common.SolrException: Too many values for UnInvertedField faceting on field enrich_keywords_string_mv at org.apache.solr.request.UnInvertedField.init(UnInvertedField.java:179) at org.apache.solr.request.UnInvertedField.getUnInvertedField(UnInvertedField .j ava:668) at org.apache.solr.request.SimpleFacets.getTermCounts(SimpleFacets.java:326) at org.apache.solr.request.SimpleFacets.getFacetFieldCounts(SimpleFacets.java :4 23) at org.apache.solr.request.SimpleFacets.getFacetCounts(SimpleFacets.java:206) at org.apache.solr.handler.component.FacetComponent.process(FacetComponent.ja va :85) at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHa nd ler.java:204) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBas e. java:129) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1561) -- Fuad Efendi http://www.tokenizer.ca
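For reference, Jack's facet.method=enum suggestion is just a request parameter; an illustrative request using the field name from the stack trace above:

  /solr/core/select?q=*:*&rows=0&facet=true&facet.field=enrich_keywords_string_mv&facet.method=enum

facet.method=enum enumerates the terms and intersects per-term filters instead of building the UnInvertedField structure, so it avoids the "Too many values" limit, at the cost of speed when the field has many unique terms.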
Re: UnInvertedField limitations
Hi Lance, The use case is keyword extraction, and it could be 2- and 3-grams (2- and 3-word combinations); so theoretically we can have 10,000^3 = 1,000,000,000,000 3-grams for English alone... of course my suggestion is to use statistics and to build a dictionary of such 3-word combinations (remove the top, remove the tail, using frequencies)... and to hard-limit this dictionary to 1,000,000... That was a business requirement which is technically impossible to implement (as real-time query results); we don't even use word stemming etc... -Fuad On 12-08-20 7:22 PM, Lance Norskog goks...@gmail.com wrote: Is this required by your application? Is there any way to reduce the number of terms? A workaround is to use shards. If your terms follow Zipf's Law each shard will have fewer than the complete number of terms. For N shards, each shard will have ~1/N of the singleton terms. For 2-count terms, 1/N or 2/N will have that term. Now I'm interested but not mathematically capable: what is the general probabilistic formula for splitting Zipf's Law across shards? On Mon, Aug 20, 2012 at 3:51 PM, Jack Krupansky j...@basetechnology.com wrote: It appears that there is a hard limit of 24 bits, or 16M, for the number of bytes used to reference the terms in a single field of a single document. It takes 1, 2, 3, 4, or 5 bytes to reference a term. If it took 4 bytes, that would allow 16M/4, or 4 million, unique terms - per document. Do you have such large documents? This appears to be a hard limit based on 24 bits of a Java int. You can try facet.method=enum, but that may be too slow. What release of Solr are you running? -- Jack Krupansky -Original Message- From: Fuad Efendi Sent: Monday, August 20, 2012 4:34 PM To: Solr-User@lucene.apache.org Subject: UnInvertedField limitations Hi All, I have a problem... (Yonik, please!) Help me: what are the term count limits? I possibly have 256,000,000 different terms in a field... or 16,000,000? Thanks! 2012-08-20 16:20:19,262 ERROR [solr.core.SolrCore] - [pool-1-thread-1] - : org.apache.solr.common.SolrException: Too many values for UnInvertedField faceting on field enrich_keywords_string_mv at org.apache.solr.request.UnInvertedField.init(UnInvertedField.java:179) at org.apache.solr.request.UnInvertedField.getUnInvertedField(UnInvertedField.java:668) at org.apache.solr.request.SimpleFacets.getTermCounts(SimpleFacets.java:326) at org.apache.solr.request.SimpleFacets.getFacetFieldCounts(SimpleFacets.java:423) at org.apache.solr.request.SimpleFacets.getFacetCounts(SimpleFacets.java:206) at org.apache.solr.handler.component.FacetComponent.process(FacetComponent.java:85) at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:204) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1561) -- Fuad Efendi http://www.tokenizer.ca -- Lance Norskog goks...@gmail.com
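One way to formalize Lance's question (my own sketch, assuming documents are distributed across shards uniformly at random): a term that occurs in k documents lands on a given shard with probability

P(term on a given shard) = 1 - (1 - 1/N)^k

For k = 1 this is exactly 1/N (the singleton case above); for k = 2 it is 2/N - 1/N^2 (slightly below the 2/N upper bound); and for large k it approaches 1, i.e. very frequent terms end up on every shard. The expected number of distinct terms per shard is then the sum of this expression over all terms in the collection.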
RE: Solr-4.0.0-Beta Bug with Load Term Info in Schema Browser
This is a bug in the Solr 4.0.0-Beta Schema Browser: Load Term Info shows 9682 for News, but a direct query shows 3577. /solr/core0/select?q=channel:News&facet=true&facet.field=channel&rows=0
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">1</int>
    <lst name="params">
      <str name="facet">true</str>
      <str name="q">channel:News</str>
      <str name="facet.field">channel</str>
      <str name="rows">0</str>
    </lst>
  </lst>
  <result name="response" numFound="3577" start="0"/>
  <lst name="facet_counts">
    <lst name="facet_queries"/>
    <lst name="facet_fields">
      <lst name="channel">
        <int name="News">3577</int>
        <int name="Blogs">0</int>
        <int name="Message Boards">0</int>
        <int name="Video">0</int>
      </lst>
    </lst>
    <lst name="facet_dates"/>
    <lst name="facet_ranges"/>
  </lst>
</response>
-Original Message- Sent: August-24-12 11:29 PM To: solr-user@lucene.apache.org Cc: sole-...@lucene.apache.org Subject: RE: Solr-4.0.0-Beta Bug with Load Term Info in Schema Browser Importance: High Any news? CC: Dev -Original Message- Subject: Solr-4.0.0-Beta Bug with Load Term Info in Schema Browser Hi there, Load Term Info shows 3650 for a specific term MyTerm, but when I execute the query channel:MyTerm it shows 650 documents found... possibly a bug... It happens after I commit data too, nothing changes; and this field is a single-valued, non-tokenized string. -Fuad -- Fuad Efendi 416-993-2060 http://www.tokenizer.ca
Solr-4.0.0-Beta Bug with Load Term Info in Schema Browser
Hi there, Load Term Info shows 3650 for a specific term MyTerm, but when I execute the query channel:MyTerm it shows 650 documents found... possibly a bug... It happens after I commit data too, nothing changes; and this field is a single-valued, non-tokenized string. -Fuad -- Fuad Efendi 416-993-2060 http://www.tokenizer.ca
RE: Solr-4.0.0-Beta Bug with Load Term Info in Schema Browser
Any news? CC: Dev -Original Message- Subject: Solr-4.0.0-Beta Bug with Load Term Info in Schema Browser Hi there, Load Term Info shows 3650 for a specific term MyTerm, but when I execute the query channel:MyTerm it shows 650 documents found... possibly a bug... It happens after I commit data too, nothing changes; and this field is a single-valued, non-tokenized string. -Fuad -- Fuad Efendi 416-993-2060 http://www.tokenizer.ca
Re: Near Real Time + Facets + Hierarchical Faceting (Pivot Table) with Date Range: huge data set
NRT does not work because the index updates hundreds of times per second vs. a cache warm-up time of a few minutes, so we are in a loop. "allowing you to query your huge index in ms": Solr also allows querying in ms, so what is the difference? No one can sort 1,000,000 terms in descending count order faster than the current Solr implementation, and FieldCache/UnInvertedField can't be used together with NRT when the cache is discarded a few times per second! - Fuad http://www.tokenizer.ca On 12-08-14 8:17 AM, Nagendra Nagarajayya nnagaraja...@transaxtions.com wrote: You should try realtime NRT available with Apache Solr 4.0 with RankingAlgorithm 1.4.4, which allows faceting in realtime. RankingAlgorithm 1.4.4 also provides an age feature that allows you to retrieve the most recently changed docs in realtime, allowing you to query your huge index in ms. You can get more information and also download from here: http://solr-ra.tgels.org Regards - Nagendra Nagarajayya http://solr-ra.tgels.org http://rankingalgorithm.tgels.org ps. Note: Apache Solr 4.0 with RankingAlgorithm 1.4.4 is an external implementation On 8/13/2012 11:38 AM, Fuad Efendi wrote: SOLR-4.0 I am trying to implement this; funny idea to share: 1. http://wiki.apache.org/solr/HierarchicalFaceting unfortunately it does not support date ranges. However, a workaround: use the String type instead of *_tdt and define fields such as published_hour, published_day, published_week... Of course you will need to stick with a timezone; but you can add an index (or indexes) for each timezone. And most important, string facets are much faster than date trie ranges. 2. Our index is over 100 million documents (from social networks) and rapidly grows (millions a day); cache warm-up takes a few minutes; Near-Real-Time does not work with faceting. However... another workaround: we can have a Daily Core (optimized at midnight), plus a Current Core (only today's data, optimized), plus a Last Hour Core (near real time). The Last Hour data is small enough that we can use facets with the Near Real Time feature. A service layer will accumulate search results from the three layers, so it will be near real time. Any thoughts? Thanks,
UnInvertedField limitations
Hi All, I have a problem... (Yonik, please!) Help me: what are the term count limits? I possibly have 256,000,000 different terms in a field... or 16,000,000? Can I temporarily disable this feature? Thanks! 2012-08-20 16:20:19,262 ERROR [solr.core.SolrCore] - [pool-1-thread-1] - : org.apache.solr.common.SolrException: Too many values for UnInvertedField faceting on field enrich_keywords_string_mv at org.apache.solr.request.UnInvertedField.init(UnInvertedField.java:179) at org.apache.solr.request.UnInvertedField.getUnInvertedField(UnInvertedField.java:668) at org.apache.solr.request.SimpleFacets.getTermCounts(SimpleFacets.java:326) at org.apache.solr.request.SimpleFacets.getFacetFieldCounts(SimpleFacets.java:423) at org.apache.solr.request.SimpleFacets.getFacetCounts(SimpleFacets.java:206) at org.apache.solr.handler.component.FacetComponent.process(FacetComponent.java:85) at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:204) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1561) -- Fuad Efendi http://www.tokenizer.ca
Near Real Time + Facets + Hierarchical Faceting (Pivot Table) with Date Range: huge data set
SOLR-4.0 I am trying to implement this; funny idea to share: 1. http://wiki.apache.org/solr/HierarchicalFaceting unfortunately it does not support date ranges. However, a workaround: use the String type instead of *_tdt and define fields such as published_hour, published_day, published_week. Of course you will need to stick with a timezone; but you can add an index (or indexes) for each timezone. And most important, string facets are much faster than date trie ranges. 2. Our index is over 100 million documents (from social networks) and rapidly grows (millions a day); cache warm-up takes a few minutes; Near-Real-Time does not work with faceting. However, another workaround: we can have a Daily Core (optimized at midnight), plus a Current Core (only today's data, optimized), plus a Last Hour Core (near real time). The Last Hour data is small enough that we can use facets with the Near Real Time feature. A service layer will accumulate search results from the three layers, so it will be near real time. Any thoughts? Thanks, -- Fuad Efendi 416-993-2060 Tokenizer Inc., Canada http://www.tokenizer.ca http://www.linkedin.com/in/lucene
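A minimal sketch (my own illustration, not from the original thread) of how those string bucket values could be computed at indexing time; the field names match the ones above, and the GMT timezone is an assumption:

import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.TimeZone;

public class DateBuckets {
    // One timezone per index, as suggested above (GMT chosen here for illustration).
    private static final TimeZone TZ = TimeZone.getTimeZone("GMT");

    public static String publishedHour(Date d) {
        return format(d, "yyyy-MM-dd'T'HH");   // e.g. 2012-08-13T11
    }

    public static String publishedDay(Date d) {
        return format(d, "yyyy-MM-dd");        // e.g. 2012-08-13
    }

    public static String publishedWeek(Date d) {
        Calendar c = Calendar.getInstance(TZ);
        c.setTime(d);
        // week-of-year bucket, e.g. 2012-W33
        return String.format("%04d-W%02d", c.get(Calendar.YEAR), c.get(Calendar.WEEK_OF_YEAR));
    }

    private static String format(Date d, String pattern) {
        SimpleDateFormat f = new SimpleDateFormat(pattern);
        f.setTimeZone(TZ);
        return f.format(d);
    }
}

Each value goes into a plain string field and is faceted on directly, which is the string-facets-faster-than-trie-ranges point made above.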
RE: Using Solr 3.4 running on tomcat7 - very slow search
FWIW, when asked at what point one would want to split JVMs and shard, on the same machine, Grant Ingersoll mentioned 16GB, and precisely for GC cost reasons. You're way above that. - his index is 75G, and Grant mentioned RAM heap size; we can use terabytes of index with 16Gb memory.
Solr Consultant Available in Canada: Solr, HBase, Hadoop, Mahout, Lily
Hi, If anyone is interested, I am available for full-time assignments; I have been involved in the Hadoop/Lucene/Solr world since 2005 (Nutch). Recently implemented a Lily-Framework-based distributed task executor which is currently used for Vertical Search by leading insurance companies and media: RSS, CVS, Web Services, Moreover, Web Ping, SQL import, sitemaps-based, intranets, and more. In addition to that, I can design super-rich UIs extremely fast using tools such as Liferay Portal, Apache Wicket, Vaadin. Thanks, -- Fuad Efendi 416-993-2060 Tokenizer Inc., Canada http://www.tokenizer.ca http://www.linkedin.com/in/lucene
Re: How to accelerate your Solr-Lucene appication by 4x
I agree that SSD boosts performance... In some rare not-real-life scenario: - super frequent commits That's it, nothing more except the fact that Lucene compile time including tests takes up to two minutes on MacBook with SSD, or forty-fifty minutes on Windows with HDD. Of course, with non-empty maven repository in both scenario, to be fair. another scenario: imagine google file system is powered by SSD instead of cheapest HDD... HAHAHA!!! Can we expect response time 0.1 milliseconds instead of 30-50? And final question... Will SSD improve performance of fuzzy search? Range queries? Etc I just want to say that SSD is faster than HDD but it doesn't mean anything... -Fuad Sent from my iPad On 2012-01-19, at 9:40 AM, Peter Velikin pe...@velobit.com wrote: All, Point taken: my message should have been written more succinctly and just stuck to the facts. Sorry for the sales pitch! However, I believe that adding SSD as a means to accelerate the performance of your Solr cluster is an important topic to discuss on this forum. There are many options for you to consider. I believe VeloBit would be the best option for many, but you have choices, some of them completely free. If interested, send me a note and I'll be happy to tell you about the different options (free or paid) you can consider. Solr clusters are I/O bound. I am arguing that before you buy additional servers, replace your existing servers with new ones, or swap your hard disks, you should try adding SSD as a cache. If the promise is that adding 1 SSD could save you the cost of 3 additional servers, you should try it. Has anyone else tried adding SSDs as a cache to boost the performance of Solr clusters? Can you share your results? Best regards, Peter Velikin VP Online Marketing, VeloBit, Inc. pe...@velobit.com tel. 978-263-4800 mob. 617-306-7165 VeloBit provides plug play SSD caching software that dramatically accelerates applications at a remarkably low cost. The software installs seamlessly in less than 10 minutes and automatically tunes for fastest application speed. Visit www.velobit.com for details.
Re: jetty error, broken pipe
It's not Jetty. It is broken TCP pipe due to client-side. It happens when client closes TCP connection. And I even had this problem with recent Tomcat 6. Problem disappeared after I explicitly tuned keep-alive at Tomcat, and started using monitoring thread with HttpClient and SOLRJ... Fuad Efendi http://www.tokenizer.ca Sent from my iPad On 2011-11-19, at 9:14 PM, alx...@aim.com wrote: Hello, I use solr 3.4 with jetty that is included in it. Periodically, I see this error in the jetty output SEVERE: org.mortbay.jetty.EofException at org.mortbay.jetty.HttpGenerator.flush(HttpGenerator.java:791) at org.mortbay.jetty.AbstractGenerator$Output.flush(AbstractGenerator.java:569) at org.mortbay.jetty.HttpConnection$Output.flush(HttpConnection.java:1012) at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:296) at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:140) at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:229) ... ... ... Caused by: java.net.SocketException: Broken pipe at java.net.SocketOutputStream.socketWrite0(Native Method) at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109) at java.net.SocketOutputStream.write(SocketOutputStream.java:153) at org.mortbay.io.ByteArrayBuffer.writeTo(ByteArrayBuffer.java:368) at org.mortbay.io.bio.StreamEndPoint.flush(StreamEndPoint.java:129) at org.mortbay.io.bio.StreamEndPoint.flush(StreamEndPoint.java:161) at org.mortbay.jetty.HttpGenerator.flush(HttpGenerator.java:714) ... 25 more 2011-11-19 20:50:00.060:WARN::Committed before 500 null||org.mortbay.jetty.EofException|?at org.mortbay.jetty.HttpGenerator.flush(HttpGenerator.java:791)|?at org.mortbay.jetty.AbstractGenerator$Output.flush(AbstractGenerator.java:569)|?at org.mortbay.jetty.HttpConnection$Output.flush(HttpConnection.java:1012)|?at sun.nio.cs.StreamEncoder.implFlush(S I searched web and the only advice I get is to upgrade to jetty 6.1, but I think the version included in solr is 6.1.26. Any advise is appreciated. Thanks. Alex.
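On the keep-alive / monitoring-thread point above: this is roughly the kind of idle-connection eviction thread I mean, sketched against Apache HttpClient 4.x's pooling connection manager (the exact manager class depends on your HttpClient version; treat this as illustrative, not as the code we actually run):

import java.util.concurrent.TimeUnit;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

public class IdleConnectionMonitor extends Thread {
    private final PoolingHttpClientConnectionManager connManager;
    private volatile boolean shutdown;

    public IdleConnectionMonitor(PoolingHttpClientConnectionManager connManager) {
        this.connManager = connManager;
        setDaemon(true);
    }

    @Override
    public void run() {
        try {
            while (!shutdown) {
                synchronized (this) {
                    wait(5000);
                    // Close connections the peer has already dropped,
                    // plus anything idle longer than 30 seconds.
                    connManager.closeExpiredConnections();
                    connManager.closeIdleConnections(30, TimeUnit.SECONDS);
                }
            }
        } catch (InterruptedException ex) {
            // terminate
        }
    }

    public void shutdownMonitor() {
        shutdown = true;
        synchronized (this) {
            notifyAll();
        }
    }
}

The idea is that the client stops reusing half-dead pooled connections, so the server is less likely to find itself writing a response into a socket that has already been closed on the client side.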
Re: HBase Datasource
I am using Lily for atomic index updates ( implemented very nice; transactionally; plus MapReduce; plus auto-denormaluzing) http://www.lilyproject.org It slows down mean time 7-10 times, but TPS still the same - Fuad http://www.tokenizer.ca Sent from my iPad On 2011-11-10, at 9:59 PM, Mark static.void@gmail.com wrote: Has anyone had any success/experience with building a HBase datasource for DIH? Are there any solutions available on the web? Thanks.
Re: solr keeps dying every few hours.
EC2 7.5Gb (large CPU instance, $0.68/hour) sucks. Unpredictably, there are errors such as User time: 0 seconds Kernel time: 0 seconds Real time: 600 seconds How can clock time be higher in such extent? Only if _another_ user used 600 seconds CPU: _virtualization_ My client have had constant problems. We are moving to dedicated hardware (25 times cheaper in average; Amazon sells 1 Tb of EBS for $100/month, plus additional costs for I/O) I have a large ec2 instance(7.5 gb ram), it dies every few hours with out of heap memory issues. I started upping the min memory required, currently I use -Xms3072M . Large CPU instance is virtualization and behaviour is unpredictable. Choose cluster instance with explicit Intel XEON CPU (instead of CPU-Units) and compare behaviour; $1.60/hour. Please share results. Thanks, -- Fuad Efendi 416-993-2060 Tokenizer Inc., Canada Data Mining, Search Engines http://www.tokenizer.ca On 11-08-17 5:56 PM, Jason Toy jason...@gmail.com wrote: I've only set set minimum memory and have not set maximum memory. I'm doing more investigation and I see that I have 100+ dynamic fields for my documents, not the 10 fields I quoted earlier. I also sort against those dynamic fields often, I'm reading that this potentially uses a lot of memory. Could this be the cause of my problems and if so what options do I have to deal with this? On Wed, Aug 17, 2011 at 2:46 PM, Markus Jelsma markus.jel...@openindex.iowrote: Keep in mind that a commit warms up another searcher and potentially doubling RAM consumption in the back ground due to cache warming queries being executed (newSearcher event). Also, where is your Xmx switch? I don't know how your JVM will behave if you set Xms Xmx. 65m docs is quite a lot but it should run fine with 3GB heap allocation. It's a good practice to use a master for indexing without any caches and warm- up queries when you exceed a certain amount of documents, it will bite. I have a large ec2 instance(7.5 gb ram), it dies every few hours with out of heap memory issues. I started upping the min memory required, currently I use -Xms3072M . I insert about 50k docs an hour and I currently have about 65 million docs with about 10 fields each. Is this already too much data for one box? How do I know when I've reached the limit of this server? I have no idea how to keep control of this issue. Am I just supposed to keep upping the min ram used for solr? How do I know what the accurate amount of ram I should be using is? Must I keep adding more memory as the index size grows, I'd rather the query be a little slower if I can use constant memory and have the search read from disk. -- - sent from my mobile 6176064373
Re: solr keeps dying every few hours.
I agree with Yonik of course; But You should see OOM errors in this case. In case of virtualization however it is unpredictable and if JVM doesn't have few bytes to output OOM into log file (because we are catching throwable and trying to generate HTTP 500 instead !!! Freaky) Ok Sorry for not contributing a patch -Fuad (ZooKeeper) http://www.OutsideIQ.com On 11-08-17 6:01 PM, Yonik Seeley yo...@lucidimagination.com wrote: On Wed, Aug 17, 2011 at 5:56 PM, Jason Toy jason...@gmail.com wrote: I've only set set minimum memory and have not set maximum memory. I'm doing more investigation and I see that I have 100+ dynamic fields for my documents, not the 10 fields I quoted earlier. I also sort against those dynamic fields often, I'm reading that this potentially uses a lot of memory. Could this be the cause of my problems and if so what options do I have to deal with this? Yes, that's most likely the problem. Sorting on an integer field causes a FieldCache entry with an int[maxDoc] (i.e. 4 bytes per document in the index, regardless of if it has a value for that field or not). Sorting on a string field is 4 bytes per doc in the index (the ords) plus the memory to store the actual unique string values. -Yonik http://www.lucidimagination.com On Wed, Aug 17, 2011 at 2:46 PM, Markus Jelsma markus.jel...@openindex.iowrote: Keep in mind that a commit warms up another searcher and potentially doubling RAM consumption in the back ground due to cache warming queries being executed (newSearcher event). Also, where is your Xmx switch? I don't know how your JVM will behave if you set Xms Xmx. 65m docs is quite a lot but it should run fine with 3GB heap allocation. It's a good practice to use a master for indexing without any caches and warm- up queries when you exceed a certain amount of documents, it will bite. I have a large ec2 instance(7.5 gb ram), it dies every few hours with out of heap memory issues. I started upping the min memory required, currently I use -Xms3072M . I insert about 50k docs an hour and I currently have about 65 million docs with about 10 fields each. Is this already too much data for one box? How do I know when I've reached the limit of this server? I have no idea how to keep control of this issue. Am I just supposed to keep upping the min ram used for solr? How do I know what the accurate amount of ram I should be using is? Must I keep adding more memory as the index size grows, I'd rather the query be a little slower if I can use constant memory and have the search read from disk. -- - sent from my mobile 6176064373
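A back-of-the-envelope check of Yonik's numbers for the setup described above (65 million docs, sorting on roughly 100 dynamic fields); the per-field cost below is just the int[maxDoc] case, and string sort fields cost more:

public class FieldCacheEstimate {
    public static void main(String[] args) {
        long maxDoc = 65000000L;   // documents in the index
        long bytesPerDoc = 4L;     // int[maxDoc] per numeric sort field
        long sortFields = 100L;    // dynamic fields actually sorted on

        long perField = maxDoc * bytesPerDoc;   // 260,000,000 bytes, roughly 250 MB per field
        long total = perField * sortFields;     // 26,000,000,000 bytes, roughly 24 GB

        System.out.println("per field: " + perField + " bytes");
        System.out.println("total:     " + total + " bytes");
    }
}

Roughly 24 GB of FieldCache against a 3 GB heap makes the periodic OOMs unsurprising, so cutting the number of sortable dynamic fields (or sharding) is the realistic fix.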
Re: solr keeps dying every few hours.
I forgot to add: company from UK, something log related (please have a look at recent LucidImagination -managed Solr Revolution conference blogs; company provides log analyzer service; http://loggly.com/) they have 16,000 cores per Solr instance (multi-tenancy); of course they have at least 100k fields per instance they don't have any problem outside Amazon ;))) -- Fuad Efendi 416-993-2060 Tokenizer Inc., Canada Data Mining, Search Engines http://www.tokenizer.ca On 11-08-17 11:08 PM, Fuad Efendi f...@efendi.ca wrote: more investigation and I see that I have 100+ dynamic fields for my documents, not the 10 fields I quoted earlier. I also sort against those
Solr Performance Tuning: -XX:+AggressiveOpts
Anyone tried this? I can not start Solr-Tomcat with following options on Ubuntu: JAVA_OPTS=$JAVA_OPTS -Xms2048m -Xmx2048m -Xmn256m -XX:MaxPermSize=256m JAVA_OPTS=$JAVA_OPTS -Dsolr.solr.home=/data/solr -Dfile.encoding=UTF8 -Duser.timezone=GMT -Djava.util.logging.config.file=/data/solr/logging.properties -Djava.net.preferIPv4Stack=true JAVA_OPTS=$JAVA_OPTS -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -XX:+AggressiveOpts -XX:NewSize=64m -XX:MaxNewSize=64m -XX:CMSInitiatingOccupancyFraction=77 -XX:+CMSParallelRemarkEnabled JAVA_OPTS=$JAVA_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/data/solr/solr-gc.log Tomcat log (something about PorterStemFilter; Solr 3.3.0): INFO: Server startup in 2683 ms # # A fatal error has been detected by the Java Runtime Environment: # # SIGSEGV (0xb) at pc=0x7f5c6f36716e, pid=7713, tid=140034519381760 # # JRE version: 6.0_26-b03 # Java VM: Java HotSpot(TM) 64-Bit Server VM (20.1-b02 mixed mode linux-amd64 compressed oops) # Problematic frame: # J org.apache.lucene.analysis.PorterStemFilter.incrementToken()Z # [thread 140034523637504 also had an error] [thread 140034520434432 also had an error] # An error report file with more information is saved as: # [thread 140034520434432 also had an error] # # If you would like to submit a bug report, please visit: # http://java.sun.com/webapps/bugreport/crash.jsp # However, I can start it and run without any problems by removing -XX:+AggressiveOpts (which has to be default setting in upcoming releases Java 6) Do we need to disable -XX:-DoEscapeAnalysis as IBM suggests? http://www-01.ibm.com/support/docview.wss?uid=swg21422605 Thanks, Fuad Efendi http://www.tokenizer.ca
Re: Solr Performance Tuning: -XX:+AggressiveOpts
Thanks Robert!!! Submitted On 26-JUL-2011 - yesterday. This option was popular in Hbase On 11-07-27 3:58 PM, Robert Muir rcm...@gmail.com wrote: Don't use this option, these optimizations are buggy: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7070134 On Wed, Jul 27, 2011 at 3:56 PM, Fuad Efendi f...@efendi.ca wrote: Anyone tried this? I can not start Solr-Tomcat with following options on Ubuntu: JAVA_OPTS=$JAVA_OPTS -Xms2048m -Xmx2048m -Xmn256m -XX:MaxPermSize=256m JAVA_OPTS=$JAVA_OPTS -Dsolr.solr.home=/data/solr -Dfile.encoding=UTF8 -Duser.timezone=GMT -Djava.util.logging.config.file=/data/solr/logging.properties -Djava.net.preferIPv4Stack=true JAVA_OPTS=$JAVA_OPTS -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -XX:+AggressiveOpts -XX:NewSize=64m -XX:MaxNewSize=64m -XX:CMSInitiatingOccupancyFraction=77 -XX:+CMSParallelRemarkEnabled JAVA_OPTS=$JAVA_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/data/solr/solr-gc.log Tomcat log (something about PorterStemFilter; Solr 3.3.0): INFO: Server startup in 2683 ms # # A fatal error has been detected by the Java Runtime Environment: # # SIGSEGV (0xb) at pc=0x7f5c6f36716e, pid=7713, tid=140034519381760 # # JRE version: 6.0_26-b03 # Java VM: Java HotSpot(TM) 64-Bit Server VM (20.1-b02 mixed mode linux-amd64 compressed oops) # Problematic frame: # J org.apache.lucene.analysis.PorterStemFilter.incrementToken()Z # [thread 140034523637504 also had an error] [thread 140034520434432 also had an error] # An error report file with more information is saved as: # [thread 140034520434432 also had an error] # # If you would like to submit a bug report, please visit: # http://java.sun.com/webapps/bugreport/crash.jsp # However, I can start it and run without any problems by removing -XX:+AggressiveOpts (which has to be default setting in upcoming releases Java 6) Do we need to disable -XX:-DoEscapeAnalysis as IBM suggests? http://www-01.ibm.com/support/docview.wss?uid=swg21422605 Thanks, Fuad Efendi http://www.tokenizer.ca -- lucidimagination.com
Re: 400 MB Fields
I think the question is strange... Maybe you are wondering about possible OOM exceptions? I think we can pass Lucene a single document containing a comma-separated list of term, term, ... (a few billion times)... except for stored fields and the TermVectorComponent... I believe thousands of companies have already indexed millions of documents with an average size of a few hundred MBytes... There should not be any limits (except InputSource vs. ByteArray). 100,000 _unique_ terms vs. a single document containing 100,000,000,000,000 non-unique terms (and trying to store offsets)... What about the Spell Checker feature? Has anyone tried to index a single terabyte-sized document? Personally, I indexed only small (up to 1000 bytes) document fields, but I believe 500 MB is a very common use case with PDFs (which vendors use Lucene already? Eclipse? To index the Eclipse Help file? Even Microsoft uses Lucene...) Fuad On 11-06-07 7:02 PM, Erick Erickson erickerick...@gmail.com wrote: From older (2.4) Lucene days, I once indexed the 23 volume Encyclopedia of Michigan Civil War Volunteers in a single document/field, so it's probably within the realm of possibility at least <g>... Erick On Tue, Jun 7, 2011 at 6:59 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Hello, What are the biggest document fields that you've ever indexed in Solr or that you've heard of? Ah, it must be Tom's Hathi trust. :) I'm asking because I just heard of a case of an index where some documents have a field that can be around 400 MB in size! I'm curious if anyone has any experience with such monster fields? Crazy? Yes, sure. Doable? Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/
Re: 400 MB Fields
Hi Otis, I am recalling the pagination feature; it is still unresolved (with the default scoring implementation): even with small documents, searching/retrieving documents 1 to 10 can take 0 milliseconds, but documents 100,000 to 100,010 can take a few minutes (I saw it with a trunk version 6 months ago, with very small documents and 100 million docs in total); it is advisable to restrict search results to the top 1000 in any case (as with Google)... I believe things can go wrong; yes, most plain text retrieved from books should be about 2 KB per page, 500 pages, so roughly 1,000,000 bytes (or double it for UTF-8). Theoretically, it doesn't make any sense to index a BIG document containing all terms from a dictionary without any term frequency calcs, but even with them... I can't imagine we should index 1000s of docs where each is just a (different) version of the whole Wikipedia; that would be wrong design... Ok, use case: index a single HUGE document. What will we do? Create an index with _the_only_ document? Then all searches will return the same result (or nothing)? Paginate it; split it into pages. I am pragmatic... Fuad On 11-06-07 8:04 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Hi, I think the question is strange... May be you are wondering about possible OOM exceptions? No, that's an easier one. I was more wondering whether with 400 MB Fields (indexed, not stored) it becomes incredibly slow to: * analyze * commit / write to disk * search I think we can pass to Lucene single document containing comma separated list of term, term, ... (few billion times)... Except stored and TermVectorComponent...
Re: URGENT HELP: Improving Solr indexing time
Hi Rohit, I am currently working on https://issues.apache.org/jira/browse/SOLR-2233 which fixes multithreading issues How complex is your dataimport schema? SOLR-2233 (multithreading, better connection handling) improves performance... Especially if SQL is extremely complex and uses few long-running CachedSqlEntityProcessors and etc. Also, check your SQL and indexes, in most cases you can _significantly_ improve performance by simply adding appropriate (for your specific SQL) indexes. I noticed that even very experienced DBAs sometimes create index KEY1, KEY2, and developer executes query WHERE KEY2=? ORDER BY KEY1 - check everything... Thanks, -- Fuad Efendi 416-993-2060 Tokenizer Inc., Canada Data Mining, Search Engines http://www.tokenizer.ca http://www.tokenizer.ca/ On 11-06-05 12:09 AM, Rohit Gupta ro...@in-rev.com wrote: No didn't double post, my be it was in my outbox and went out again. The queries outside solr dont take so long, to return around 50 rows it takes 250 seconds, so I am doing a delta import of around 500,000 rows at a time. I have tried turning auto commit on and things are moving a bit faster now. Are there any more tweeking i can do? Also, planning to move to master-salve model, but am failing to understand where to start exactly. Regards, Rohit From: lee carroll lee.a.carr...@googlemail.com To: solr-user@lucene.apache.org Sent: Sun, 5 June, 2011 4:59:44 AM Subject: Re: URGENT HELP: Improving Solr indexing time Rohit - you have double posted maybe - did Otis's answer not help with your issue or at least need a response to clarify ? On 4 June 2011 22:53, Chris Cowan chrisco...@plus3network.com wrote: How long does the query against the DB take (outside of Solr)? If that's slow then it's going to take a while to update the index. You might need to figure a way to break things up a bit, maybe use a delta import instead of a full import. Chris On Jun 4, 2011, at 6:23 AM, Rohit Gupta wrote: My Solr server takes very long to update index. The table it hits to index is huge with 10Million + records , but even in that case I feel this is very long time to index. Below is the snapshot of the /dataimport page str name=statusbusy/str str name=importResponseA command is still running.../str lst name=statusMessages str name=Time Elapsed1:53:39.664/str str name=Total Requests made to DataSource16276/str str name=Total Rows Fetched24237/str str name=Total Documents Processed16273/str str name=Total Documents Skipped0/str str name=Full Dump Started2011-06-04 11:25:26/str /lst How can i determine why this is happening and how can I improve this. During all our test on the local server before the migration we could index 5 million records in 4-5 hrs, but now its taking too long on the live server. Regards, Rohit
RE: DIH: Exception with Too many connections
Hi, There is existing bug in DataImportHandler described (and patched) at https://issues.apache.org/jira/browse/SOLR-2233 It is not used in a thread safe manner, and it is not appropriately closed reopened (why?); and new connection is opened unpredictably. It may cause Too many connections even for huge SQL-side max_connections. If you are interested, I can continue work on SOLR-2233. CC: dev@lucene (is anyone working on DIH improvements?) Thanks, Fuad Efendi http://www.tokenizer.ca/ -Original Message- From: François Schiettecatte [mailto:fschietteca...@gmail.com] Sent: May-31-11 7:44 AM To: solr-user@lucene.apache.org Subject: Re: DIH: Exception with Too many connections Hi You might also check the 'max_user_connections' settings too if you have that set: # Maximum number of connections, and per user max_connections = 2048 max_user_connections = 2048 http://dev.mysql.com/doc/refman/5.5/en/server-system-variables.html Cheers François So, if the number of threads in the process list is larger than max_connections, I would get the too many connections error. Am I thinking the right way?
WIKI alerts
Anyone noticed that it doesn't work? Already 2 weeks https://issues.apache.org/jira/browse/INFRA-3667 I don't receive WIKI change notifications. I CC to 'Apache Wiki' wikidi...@apache.org Something is bad. -Fuad
RE: Solr memory consumption
It could be environment specific (specific to your top command implementation, OS, etc.). On CentOS I see 2986m of virtual memory although I run with -Xmx2g; you see 10g virtual although -Xmx6g. Don't trust it too much... the top command may count OS buffers for opened files, network sockets, the JVM's own shared libraries, etc. (which are outside Java GC responsibility), in addition to JVM heap memory... it counts all memory, not sure... If you don't see big values in the %wa column (which means I/O wait - disk/swap usage), everything is fine... -Original Message- From: Denis Kuzmenok Sent: May-31-11 4:18 PM To: solr-user@lucene.apache.org Subject: Solr memory consumption I run multi-core Solr with flags: -Xms3g -Xmx6g -D64, but I see this in top after 6-8 hours, and it is still rising: 17485 test214 10.0g 7.4g 9760 S 308.2 31.3 448:00.75 java -Xms3g -Xmx6g -D64 -Dsolr.solr.home=/home/test/solr/example/multicore/ -jar start.jar Are there any ways to limit memory for sure? Thanks
RE: Solr vs ElasticSearch
Interesting wording: "we want real-time search, we want simple multi-tenancy, and we want a solution that is built for the cloud". And later, "built on top of Lucene". Is that possible? :) (What does "real-time search" mean anyway... and what is "cloud"?) The community is growing! P.S. I never used ElasticSearch, but I used Compass before moving to SOLR. And Compass uses wording like "real-time *transactional* search". Yes, it's good and it has its own use case (small databases, reduced development time, junior-level staff, single-JVM environment). I'd consider the requirements first, then see which tool simplifies my task (fulfils most requirements). It could be Elastic, or SOLR, or Compass, or direct Lucene, or even SQL, SequenceFile, an in-memory TreeSet, etc. It also depends on requirements, budget, team skills. -Original Message- From: Mark Sent: May-31-11 10:33 PM To: solr-user@lucene.apache.org Subject: Solr vs ElasticSearch I've been hearing more and more about ElasticSearch. Can anyone give me a rough overview on how these two technologies differ. What are the strengths/weaknesses of each. Why would one choose one over the other? Thanks
Re: Solr vs ElasticSearch
Nice article... 2 ms better than 20 ms, but in another chart 50 seconds are not as good as 3 seconds... Sorry for my vision... SOLR pushed into Lucene Core huge amount of performance improvements... Sent on the TELUS Mobility network with BlackBerry -Original Message- From: Shashi Kant sk...@sloan.mit.edu Sender: shashi@gmail.com Date: Wed, 1 Jun 2011 01:01:51 To: solr-user@lucene.apache.org Reply-To: solr-user@lucene.apache.org Subject: Re: Solr vs ElasticSearch Here is a very interesting comparison http://engineering.socialcast.com/2011/05/realtime-search-solr-vs-elasticsearch/ -Original Message- From: Mark Sent: May-31-11 10:33 PM To: solr-user@lucene.apache.org Subject: Solr vs ElasticSearch I've been hearing more and more about ElasticSearch. Can anyone give me a rough overview on how these two technologies differ. What are the strengths/weaknesses of each. Why would one choose one of the other? Thanks
Re: Out of memory error
Related: SOLR-846 Sent on the TELUS Mobility network with BlackBerry -Original Message- From: Erick Erickson erickerick...@gmail.com Date: Tue, 7 Dec 2010 08:11:41 To: solr-user@lucene.apache.org Reply-To: solr-user@lucene.apache.org Subject: Re: Out of memory error Have you seen this page? http://wiki.apache.org/solr/DataImportHandlerFaq http://wiki.apache.org/solr/DataImportHandlerFaqSee especially batchsize, but it looks like you're already on to that. Do you have any idea how big the records are in the database? You might try adjusting the rambuffersize down, what is it at now? In general, what are our Solr commit options? Does anything get to Solr or is the OOM when the SQL is executed? The first question to answer is whether you index anything at all... There's a little-know DIH debug page you can access at: .../solr/admin/dataimport.jsp that might help, and progress can be monitored at: .../solr/dataimport DIH can be interesting, you get finer control with SolrJ and a direct JDBC connection. If you don't get anywhere with DIH. Scattergun response, but things to try... Best Erick On Tue, Dec 7, 2010 at 12:03 AM, sivaprasad sivaprasa...@echidnainc.comwrote: Hi, When i am trying to import the data using DIH, iam getting Out of memory error.The below are the configurations which i have. Database:Mysql Os:windows No Of documents:15525532 In Db-config.xml i made batch size as -1 The solr server is running on Linux machine with tomcat. i set tomcat arguments as ./startup.sh -Xms1024M -Xmx2048M Can anybody has idea, where the things are going wrong? Regards, JS -- View this message in context: http://lucene.472066.n3.nabble.com/Out-of-memory-error-tp2031761p2031761.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Out of memory error
Batch size -1??? Strange but could be a problem. Note also you can't provide parameters to default startup.sh command; you should modify setenv.sh instead --Original Message-- From: sivaprasad To: solr-user@lucene.apache.org ReplyTo: solr-user@lucene.apache.org Subject: Out of memory error Sent: Dec 7, 2010 12:03 AM Hi, When i am trying to import the data using DIH, iam getting Out of memory error.The below are the configurations which i have. Database:Mysql Os:windows No Of documents:15525532 In Db-config.xml i made batch size as -1 The solr server is running on Linux machine with tomcat. i set tomcat arguments as ./startup.sh -Xms1024M -Xmx2048M Can anybody has idea, where the things are going wrong? Regards, JS -- View this message in context: http://lucene.472066.n3.nabble.com/Out-of-memory-error-tp2031761p2031761.html Sent from the Solr - User mailing list archive at Nabble.com. Sent on the TELUS Mobility network with BlackBerry
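On the batchSize="-1" point: for MySQL that setting is usually intentional rather than a mistake; if I remember correctly, DIH maps it to a streaming result set, which is the standard Connector/J idiom sketched below (plain JDBC, connection details and table/column names made up for illustration):

import java.sql.*;

public class MySqlStreamingSketch {
    public static void main(String[] args) throws SQLException {
        // Hypothetical connection details, for illustration only.
        Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/mydb", "user", "password");

        // Connector/J only streams rows one at a time with exactly this combination:
        Statement stmt = conn.createStatement(
                ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
        stmt.setFetchSize(Integer.MIN_VALUE);

        ResultSet rs = stmt.executeQuery("SELECT id, title FROM documents");
        while (rs.next()) {
            // hand each row to the indexer instead of buffering all rows in memory
            System.out.println(rs.getLong("id") + " " + rs.getString("title"));
        }
        rs.close();
        stmt.close();
        conn.close();
    }
}

Without streaming, Connector/J materializes the whole result set in memory, and with 15 million rows that alone can explain the OOM regardless of the -Xmx setting.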
Re: Dataimporthandler crashed raidcontroller
I experienced similar problems. It was because we didn't perform load stress tests properly, before going to production. Nothing is forever, replace controller, change hardware vendor, maintain low temperature inside a rack. Thanks --Original Message-- From: Robert Gründler To: solr-user@lucene.apache.org ReplyTo: solr-user@lucene.apache.org Subject: Dataimporthandler crashed raidcontroller Sent: Nov 4, 2010 7:21 PM Hi all, we had a severe problem with our raidcontroller on one of our servers today during importing a table with ~8 million rows into a solr index. After importing about 4 million documents, our server shutdown, and failed to restart due to a corrupt raid disk. The Solr data import was the only heavy process running on that machine during the crash. Has anyone experienced hdd/raid-related problems during indexing large sql databases into solr? thanks! -robert Sent on the TELUS Mobility network with BlackBerry
RE: Need feedback on solr security
You could set a firewall that forbid any connection to your Solr's server port to everyone, except the computer that host your application that connect to Solr. So, only your application will be able to connect to Solr. I believe firewalling is the only possible solution since SOLR doesn't use cookies/sessionIDs However, 'firewall' can be implemented as an Apache HTTPD Server (or any other front-end configured to authenticate users). (you can even configure CISCO PIX (etc.) Firewall to authenticate users.) HTTPD is easiest, but I haven't tried. But again, if your use case is many users, many IPs you need good front-end (web application); if it is not the case - just restrict access to specific IP. -Fuad http://www.tokenizer.ca
RE: Need feedback on solr security
For Making by solr admin password protected, I had used the Path Based Authentication form http://wiki.apache.org/solr/SolrSecurity. In this way my admin area,search,delete,add to index is protected.But Now when I make solr authenticated then for every update/delete from the fornt end is blocked without authentication. Correct, SOLR doesn't use HTTP Session (Session Cookies, Session IDs); and it shouldn't do that. If you have such use case (Authenticated Session) you will need front-end web application.
Range Queries, Geospatial
Hi, I've read a very interesting interview with Ryan, http://www.lucidimagination.com/Community/Hear-from-the-Experts/Podcasts-and-Videos/Interview-Ryan-McKinley Another finding is https://issues.apache.org/jira/browse/SOLR-773 (lucene/contrib/spatial) Is there any more stuff going on for SOLR 1.5 (and existing SOLR 1.4)? I need filtering on 2 dimensions, like x:[1 TO 10100] y:[7900 TO 8000] (that's why I need SOLR :))) Any thoughts? I'd love to implement something quick, simple and efficient if it doesn't exist yet, like an R-Tree (http://en.wikipedia.org/wiki/R-tree), or Geohash (http://en.wikipedia.org/wiki/Geohash) I haven't tried Local Lucene and SOLR-773 yet. Thanks!
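Until something smarter (R-tree / Geohash) is in place, the brute-force way to express that 2-dimensional filter is simply two range filter queries; a small SolrJ sketch (field names x and y as in the example above):

import org.apache.solr.client.solrj.SolrQuery;

public class TwoDimensionalFilter {
    public static SolrQuery build() {
        SolrQuery q = new SolrQuery("*:*");
        // Each dimension becomes its own cached filter; the intersection
        // of the two filter bitsets is the bounding box.
        q.addFilterQuery("x:[1 TO 10100]");
        q.addFilterQuery("y:[7900 TO 8000]");
        q.setRows(10);
        return q;
    }
}

With x and y indexed as trie numeric fields the range part is reasonably cheap; the R-tree/Geohash ideas only start to pay off when a single box filter like this is not selective enough.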
RE: For caches, any reason to not set initialSize and size to the same value?
Funny, Arrays.copy() for a HashMap... but something similar... Anyway, I use the same values for initial size and max size, to be safe... and to get the OOM at startup :) -Original Message- From: Fuad Efendi [mailto:f...@efendi.ca] Sent: February-12-10 6:55 PM To: solr-user@lucene.apache.org; yo...@lucidimagination.com Subject: RE: For caches, any reason to not set initialSize and size to the same value? I always use initial size = max size, just to avoid Arrays.copyOf()... The initial (default) capacity for HashMap is 16; when it is not enough, entries are rehashed into a new 32-slot table, then 64, ... - too much wasted work and space! (same for ConcurrentHashMap) Excuse me if I didn't understand the question... -Fuad http://www.tokenizer.ca -Original Message- From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley Sent: February-12-10 6:30 PM To: solr-user@lucene.apache.org Subject: Re: For caches, any reason to not set initialSize and size to the same value? On Fri, Feb 12, 2010 at 5:23 PM, Jay Hill jayallenh...@gmail.com wrote: If I've done a lot of research and have a very good idea of what my cache sizes are, having monitored the stats right before commits, is there any reason why I wouldn't just set the initialSize and size counts to the same values? Is there any reason to set a smaller initialSize if I know reliably where my limit will almost always be? Probably not much... The only savings will be the 8 bytes (on a 64 bit proc) per unused array slot (in the HashMap). Maybe we should consider removing the initialSize param from the example config to reduce the amount of stuff a user needs to think about. -Yonik http://www.lucidimagination.com
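A toy illustration of the resizing point (my own sketch; Solr's cache classes do something similar internally):

import java.util.HashMap;
import java.util.Map;

public class CacheSizingExample {
    public static void main(String[] args) {
        // A default HashMap starts with a 16-slot table and doubles it
        // (rehashing every entry) each time it fills up: 16 -> 32 -> 64 -> ...
        Map<String, Object> growing = new HashMap<String, Object>();

        // Pre-sizing to the expected maximum avoids all of those rehashes,
        // which is the effect of setting initialSize equal to size in solrconfig.xml.
        Map<String, Object> preSized = new HashMap<String, Object>(16384);

        for (int i = 0; i < 10000; i++) {
            growing.put("key" + i, i);
            preSized.put("key" + i, i);
        }
    }
}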
RE: expire/delete documents
or since you specifically asked about deleting anything older than X days (in this example I'm assuming x=7)... <delete><query>createTime:[NOW-7DAYS TO *]</query></delete> That range is the wrong way around; to match documents older than 7 days it should be: createTime:[* TO NOW-7DAYS]
RE: analysing wild carded terms
hello *, quick question: what would I have to change in the query parser to allow wildcarded terms to go through text analysis? I believe it is illogical: wildcarded terms go through the terms enumerator, not the analyzer.
RE: Solr integration with document management systems
SOLR doesn't come with such things... Look at www.liferay.com; they have plugin for SOLR (in SVN trunk) so that all documents / assets can be automatically indexed by SOLR (and you have full freedom with defining specific SOLR schema settings); their portlets support WebDAV, and Open Office looks almost like Sharepoint -Fuad -Original Message- From: ST ST [mailto:stst2...@gmail.com] Sent: February-06-10 6:46 PM To: solr-user@lucene.apache.org Subject: Solr integration with document management systems Folks, Does Solr 1.4 come with integration with existing document management systems ? Are there any other open source projects based on Solr which provide this capability ? Thanks
RE: Fundamental questions of how to build up solr for huge portals
- what's the best way to use solr to get the best performance for a huge portal with 5000 users that might expand fast? 5000 users: 200 TPS, for instance, equals 12,000 concurrent users (each user makes 1 request per minute); so a single SOLR instance is more than enough. Why 200 TPS? It is the bottom line, for fuzzy search (I recently improved it). In real life, on real hardware, 1000 TPS (using caching, not frequently using fuzzy search, etc.), which equals 60,000 concurrent users, and subsequently more than 600,000 total users. The rest depends on your design... If you have separate portals A, B, C - create a field with values A, B, C. Liferay Portal nicely integrates with SOLR... each kind of Portlet object (Forum Post, Document, Journal Article, etc.) can implement the searchable interface and be automatically indexed. But Liferay is Java-based, JSR-168, JSR-286 (and it supports PHP portlets, but I never tried them). Fuad Efendi +1 416-993-2060 http://www.linkedin.com/in/liferay -Original Message- From: Peter [mailto:zarato...@gmx.net] Sent: January-16-10 10:17 AM To: solr-user@lucene.apache.org Subject: Fundamental questions of how to build up solr for huge portals Hello! Our team wants to use Solr for a community portal built up out of 3 and more sub-portals. We are unsure how we should build up the whole architecture, because we have more than one portal and we want to make them all connected and searchable by Solr. Could some experts help us with these questions? - what's the best way to use Solr to get the best performance for a huge portal with 5000 users that might expand fast? - which client to use (Java, PHP...)? Right now the portal is almost entirely PHP/MySQL based. But we want to make Solr as good as it can be in all ways (performance, accessibility, good programming practice, using all the features of Lucene - like tagging, faceting and so on...) We are thankful for every suggestion :) Thanks, Peter
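The arithmetic behind those figures, under the stated assumption of one request per user per minute: concurrent users = TPS * 60, so 200 TPS supports about 200 * 60 = 12,000 concurrent users, and 1000 TPS about 1000 * 60 = 60,000. The step from 60,000 concurrent users to 600,000 total users implicitly assumes that roughly one registered user in ten is active at any given moment.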
RE: Solr response extremely slow
'!' :))) Plus, FastLRUCache (the previous one was synchronized) (and of course warming-up time) - i.e., start complaining only after making sure there is nothing to complain about :) (and of course the OS needs time to cache filesystem blocks, and Java HotSpot needs to warm up, ... - a few minutes at least...) On Feb 3, 2010, at 1:38 PM, Rajat Garg wrote: Solr Specification Version: 1.3.0 Solr Implementation Version: 1.3.0 694707 - grantingersoll - 2008-09-12 11:06:47 There's the problem right there... that grantingersoll guy :) (kidding) Sounds like you're just hitting cache warming which can take a while. Have you tried Solr 1.4? Faceting performance, for example, is dramatically improved, among many other improvements. Erik
RE: fuzzy matching / configurable distance function?
The Levenshtein algorithm is currently hardcoded (the FuzzyTermEnum class) in Lucene 2.9.1 and 3.0... There are samples of other distances in the contrib folder. If you want to play with distance, check http://issues.apache.org/jira/browse/LUCENE-2230 It works if the distance is an integer and follows the metric space axioms: D(a,b) = D(b,a) and D(a,b) + D(b,c) >= D(a,c). Probably SOLR can provide more freedom with plugged-in distances... -Fuad -Original Message- From: Joe Calderon [mailto:calderon@gmail.com] Sent: February-04-10 2:34 PM To: solr-user@lucene.apache.org Subject: fuzzy matching / configurable distance function? is it possible to configure the distance formula used by fuzzy matching? i see there are others under the function query page under strdist, but i'm wondering if they are applicable to fuzzy matching thx much --joe
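For anyone who wants to experiment with plugged-in distances, here is a self-contained sketch of the classic Levenshtein (edit) distance, the metric that FuzzyTermEnum hardcodes; it is integer-valued and satisfies the symmetry and triangle-inequality axioms mentioned above:

public class Levenshtein {
    /** Dynamic-programming edit distance between two strings. */
    public static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;   // i deletions
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;   // j insertions
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                d[i][j] = Math.min(Math.min(
                        d[i - 1][j] + 1,           // delete from a
                        d[i][j - 1] + 1),          // insert into a
                        d[i - 1][j - 1] + cost);   // substitute (or match)
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("lucene", "lucien"));   // prints 2
    }
}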
SOLR Performance Tuning: Fuzzy Search
I was lucky to contribute an excellent solution: http://issues.apache.org/jira/browse/LUCENE-2230 Even the 2nd edition of Lucene in Action advocates using fuzzy search only in exceptional cases. Another solution would be 2-step indexing (it may work for many use cases), but it is not a spellchecker: 1. Create a regular index 2. Create a dictionary of terms 3. For each term, find the nearest terms (for instance, stick with distance=2) 4. Use copyField in SOLR, or something similar to a synonym dictionary; or, for instance, generate a specific Query Parser... 5. Of course, a custom request handler, etc. It may work well (but only if the query contains a term from the dictionary; it can't work as a spellchecker). Combining the 2 algorithms can boost performance dramatically... Fuad Efendi +1 416-993-2060 http://www.linkedin.com/in/liferay Tokenizer Inc. http://www.tokenizer.ca/ Data Mining, Vertical Search
RE: Comparison of Solr with Sharepoint Search
I can only tell that Liferay Portal (WebDAV) Document Library Portlet has same functionality as Sharepoint (it has even /servlet/ URL with suffix '/sharepoint'); Liferay also has plugin (web-hook) for SOLR (it has generic search wrapper; any kind of search service provider can be hooked in Liferay) All assets (web content, message board posts, documents, and etc.) can implement indexing interface and get indexed (Lucene, SOLR, etc) So far, it is the best approach. You can enjoy configuring SOLR analyzers/fields/language/stemmers/dictionaries/... You can't do it with MS-Sharepoint (or, for instance, their close competitors Alfresco)!!! -Fuad http://www.tokenizer.ca -Original Message- From: Chris Hostetter [mailto:hossman_luc...@fucit.org] Sent: January-26-10 7:49 PM To: solr-user@lucene.apache.org Subject: Re: Comparison of Solr with Sharepoint Search : Has anyone done a functionality comparison of Solr with Sharepoint/Fast : Search? there's been some discussion on this over the years comparing Solr with FAST if you go looking for it... http://old.nabble.com/SOLR-X-FAST-to14284618.html http://old.nabble.com/Replacing-FAST-functionality-at-sesam.no- td19186109.html http://old.nabble.com/Experiences-from-migrating-from-FAST-to-Solr- td26371613.html http://sesat.no/moving-from-fast-to-solr-review.html ...i have no idea about Sharepoint Search (isn't that actaully a seperate system? ... Microsoft Search Server or something?) -Hoss
RE: Solr vs. Compass
Why to embed indexing as a transaction dependency? Extremely weird idea. There is nothing weird about different use cases requiring different approaches If you're just thinking documents and text search ... then its less of an issue. If you have an online application where the indexing is being used to drive certain features (not just search), then the transactionality is quite useful. I mean: - Primary Key Constraint in RDBMS is not the same as an index - Index in RDBMS: data is still searchable, even if we don't have index Are you sure that index in RDBMS is part of transaction in current implementations of Oracle, IBM, SUN? I never heard such staff, there are no such requirements for transactions. I am talking about transactions and referential integrity, and not about indexed non-tokenized single-valued field Social Insurance Number. It could be done asynchronously outside of transaction, I can't imagine use case when it must be done inside transaction / failing transaction when it can't be done. Primary Key Constraint is different use case, it is not necessarily indexing of data. Especially for Hibernate where we mostly use surrogate auto-generated keys. -Fuad
RE: Solr vs. Compass
Even if commit takes 20 minutes? I've never seen a commit take 20 minutes... (anything taking that long is broken, perhaps in concept) An index merge can take from a few minutes to a few hours. That's why nothing beats SOLR Master/Slave and sharding for huge datasets. And reopening the IndexReader after each commit may take at least a few seconds (although it depends on usage patterns). IndexReader or IndexSearcher will only see the index as of the point in time that it was opened. Any changes committed to the index after the reader was opened are not visible until the reader is re-opened. I am wondering how Compass opens a new instance of IndexReader (after each commit!) - is it really implemented? I can't believe it! It will probably work fine for small datasets (less than 100k), and 1 TPD (transaction per day)... Very expensive and unnatural ACID... -Fuad
Is there limit on size of query string?
Is there a limit on the size of the query string? It looks like I get exceptions when the query string is longer than about 400 characters (on average). Thanks!
RE: Solr vs. Compass
Yes, transactional, I tried it: do we really need transactional? Even if commit takes 20 minutes? It's their selling point nothing more. HBase is not transactional, and it has specific use case; each tool has specific use case... in some cases Compass is the best! Also, note that Compass (Hibernate) ((RDBMS)) use specific business domain model terms with relationships; huge overhead to convert relational into object-oriented (why for? Any advantages?)... Lucene does it behind-the-scenes: you don't have to worry that field USA (3 characters) is repeated in few millions documents, and field Canada (6 characters) in another few; no any relational, it's done automatically without any Compass/Hibernate/Table(s) Don't think relational. I wrote this 2 years ago: http://www.theserverside.com/news/thread.tss?thread_id=50711#272351 Fuad Efendi +1 416-993-2060 http://www.tokenizer.ca/ -Original Message- From: Uri Boness [mailto:ubon...@gmail.com] Sent: January-21-10 11:35 AM To: solr-user@lucene.apache.org Subject: Re: Solr vs. Compass In addition, the biggest appealing feature in Compass is that it's transactional and therefore integrates well with your infrastructure (Spring/EJB, Hibernate, JPA, etc...). This obviously is nice for some systems (not very large scale ones) and the programming model is clean. On the other hand, Solr scales much better and provides a load of functionality that otherwise you'll have to custom build on top of Compass/Lucene. Lukáš Vlček wrote: Hi, I think that these products do not compete directly that much, each fit different business case. Can you tell us more about our specific situation? What do you need to search and where your data is? (DB, Filesystem, Web ...?) Solr provides some specific extensions which are not supported directly by Lucene (faceted search, DisMax... etc) so if you need these then your bet on Compass might not be perfect. On the other hand if you need to index persistent Java objects then Compass fits perfectly into this scenario (and if you are using Spring and JPA then setting up search can be matter of several modifications to configuration and annotations). Compass is more Hibernate search competitor (but Compass is not limited to Hibernate only and is not even limited to DB content as well). Regards, Lukas On Thu, Jan 21, 2010 at 4:40 PM, Ken Lane (kenlane) kenl...@cisco.comwrote: We are knee-deep in a Solr project to provide a web services layer between our Oracle DB's and a web front end to be named later to supplement our numerous Business Intelligence dashboards. Someone from a peer group questioned why we selected Solr rather than Compass to start development. The real reason is that we had not heard of Compass until that comment. Now I need to come up with a better answer. Does anyone out there have experience in both approaches who might be able to give a quick compare and contrast? Thanks in advance, Ken
RE: Solr vs. Compass
Of course, I understand what transaction means; have you guys been thinking some about what may happen if we transfer $123.45 from one banking account to another banking account, and MySQL forgets to index decimal during transaction, or DBA was weird and forgot to create an index? Absolutely nothing. Why to embed indexing as a transaction dependency? Extremely weird idea. But I understand some selling points... SOLR: it is faster than Lucene. Filtered queries run faster than traditional AND queries! And this is real selling point. Thanks, Fuad Efendi +1 416-993-2060 http://www.linkedin.com/in/liferay Tokenizer Inc. http://www.tokenizer.ca/ Data Mining, Vertical Search -Original Message- From: Fuad Efendi [mailto:f...@efendi.ca] Sent: January-22-10 11:23 PM To: solr-user@lucene.apache.org Subject: RE: Solr vs. Compass Yes, transactional, I tried it: do we really need transactional? Even if commit takes 20 minutes? It's their selling point nothing more. HBase is not transactional, and it has specific use case; each tool has specific use case... in some cases Compass is the best! Also, note that Compass (Hibernate) ((RDBMS)) use specific business domain model terms with relationships; huge overhead to convert relational into object-oriented (why for? Any advantages?)... Lucene does it behind-the-scenes: you don't have to worry that field USA (3 characters) is repeated in few millions documents, and field Canada (6 characters) in another few; no any relational, it's done automatically without any Compass/Hibernate/Table(s) Don't think relational. I wrote this 2 years ago: http://www.theserverside.com/news/thread.tss?thread_id=50711#272351 Fuad Efendi +1 416-993-2060 http://www.tokenizer.ca/ -Original Message- From: Uri Boness [mailto:ubon...@gmail.com] Sent: January-21-10 11:35 AM To: solr-user@lucene.apache.org Subject: Re: Solr vs. Compass In addition, the biggest appealing feature in Compass is that it's transactional and therefore integrates well with your infrastructure (Spring/EJB, Hibernate, JPA, etc...). This obviously is nice for some systems (not very large scale ones) and the programming model is clean. On the other hand, Solr scales much better and provides a load of functionality that otherwise you'll have to custom build on top of Compass/Lucene. Lukáš Vlček wrote: Hi, I think that these products do not compete directly that much, each fit different business case. Can you tell us more about our specific situation? What do you need to search and where your data is? (DB, Filesystem, Web ...?) Solr provides some specific extensions which are not supported directly by Lucene (faceted search, DisMax... etc) so if you need these then your bet on Compass might not be perfect. On the other hand if you need to index persistent Java objects then Compass fits perfectly into this scenario (and if you are using Spring and JPA then setting up search can be matter of several modifications to configuration and annotations). Compass is more Hibernate search competitor (but Compass is not limited to Hibernate only and is not even limited to DB content as well). Regards, Lukas On Thu, Jan 21, 2010 at 4:40 PM, Ken Lane (kenlane) kenl...@cisco.comwrote: We are knee-deep in a Solr project to provide a web services layer between our Oracle DB's and a web front end to be named later to supplement our numerous Business Intelligence dashboards. Someone from a peer group questioned why we selected Solr rather than Compass to start development. 
The real reason is that we had not heard of Compass until that comment. Now I need to come up with a better answer. Does anyone out there have experience in both approaches who might be able to give a quick compare and contrast? Thanks in advance, Ken
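On the "filtered queries run faster than traditional AND queries" point above: the usual pattern is to keep the relevance part in q and move the restricting clause into fq, which Solr caches as a bit set in the filterCache and reuses across requests. A minimal SolrJ sketch (field names and URL are illustrative):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class FilterQueryExample {
        public static void main(String[] args) throws Exception {
            CommonsHttpSolrServer server =
                    new CommonsHttpSolrServer("http://localhost:8983/solr"); // placeholder URL

            // Instead of q=laptop AND country:USA, score only on "laptop"
            // and apply country:USA as a cached, non-scoring filter.
            SolrQuery query = new SolrQuery("laptop");
            query.addFilterQuery("country:USA");

            QueryResponse rsp = server.query(query);
            System.out.println("hits: " + rsp.getResults().getNumFound());
        }
    }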
RE: SOLR Performance Tuning: Fuzzy Searches, Distance, BK-Tree
http://issues.apache.org/jira/browse/LUCENE-2230 Enjoy! -Original Message- From: Fuad Efendi [mailto:f...@efendi.ca] Sent: January-19-10 11:32 PM To: solr-user@lucene.apache.org Subject: SOLR Performance Tuning: Fuzzy Searches, Distance, BK-Tree Hi, I am wondering: will SOLR or Lucene use caches for fuzzy searches? I mean per-term caching or something, internal to Lucene, or may be SOLR (SOLR may use own query parser)... Anyway, I implemented BK-Tree and playing with it right now, I altered FuzzyTermEnum class of Lucene... http://en.wikipedia.org/wiki/BK-tree - it seems performance of fuzzy searches boosted at least hundred times, but I need to do more tests... repeated similar (slightly different) queries run with better performance, probably because of OS-level file caching... but it could be that of BK-Tree distance! (although I need to use classic int instead of float distance by Lucene/Levenstein etc.) Thanks, Fuad Efendi +1 416-993-2060 http://www.tokenizer.ca/ Data Mining, Vertical Search
SOLR Performance Tuning: Fuzzy Searches, Distance, BK-Tree
Hi, I am wondering: will SOLR or Lucene use caches for fuzzy searches? I mean per-term caching or something, internal to Lucene, or may be SOLR (SOLR may use own query parser)... Anyway, I implemented BK-Tree and playing with it right now, I altered FuzzyTermEnum class of Lucene... http://en.wikipedia.org/wiki/BK-tree - it seems performance of fuzzy searches boosted at least hundred times, but I need to do more tests... repeated similar (slightly different) queries run with better performance, probably because of OS-level file caching... but it could be that of BK-Tree distance! (although I need to use classic int instead of float distance by Lucene/Levenstein etc.) Thanks, Fuad Efendi +1 416-993-2060 http://www.tokenizer.ca/ Data Mining, Vertical Search
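For readers who have not met the data structure mentioned above: a BK-tree indexes terms by integer edit distance and uses the triangle inequality to prune the search. A minimal standalone sketch (an illustrative implementation, not the actual FuzzyTermEnum patch; the distance function is the classic dynamic-programming Levenshtein):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Minimal BK-tree keyed by integer Levenshtein distance.
    public class BKTree {
        private final String term;
        private final Map<Integer, BKTree> children = new HashMap<Integer, BKTree>();

        public BKTree(String term) { this.term = term; }

        public void add(String word) {
            int d = distance(word, term);
            if (d == 0) return;                      // already present
            BKTree child = children.get(d);
            if (child == null) children.put(d, new BKTree(word));
            else child.add(word);
        }

        // Collect all terms within maxDist of the query; the triangle inequality
        // lets us skip subtrees whose edge label is outside [d - maxDist, d + maxDist].
        public List<String> search(String query, int maxDist) {
            List<String> hits = new ArrayList<String>();
            search(query, maxDist, hits);
            return hits;
        }

        private void search(String query, int maxDist, List<String> hits) {
            int d = distance(query, term);
            if (d <= maxDist) hits.add(term);
            for (int i = Math.max(1, d - maxDist); i <= d + maxDist; i++) {
                BKTree child = children.get(i);
                if (child != null) child.search(query, maxDist, hits);
            }
        }

        // Classic dynamic-programming Levenshtein distance (integer, as suggested above).
        static int distance(String a, String b) {
            int[][] dp = new int[a.length() + 1][b.length() + 1];
            for (int i = 0; i <= a.length(); i++) dp[i][0] = i;
            for (int j = 0; j <= b.length(); j++) dp[0][j] = j;
            for (int i = 1; i <= a.length(); i++) {
                for (int j = 1; j <= b.length(); j++) {
                    int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                    dp[i][j] = Math.min(Math.min(dp[i - 1][j] + 1, dp[i][j - 1] + 1),
                                        dp[i - 1][j - 1] + cost);
                }
            }
            return dp[a.length()][b.length()];
        }
    }

Built once over the term dictionary, a search with maxDist=2 only visits a small fraction of the tree, which is where the reported speedup over a linear FuzzyTermEnum scan would come from.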
RE: SOLR: Replication
Thank you Yonik, excellent WIKI! I'll try without APR, I believe it's environmental issue; 100Mbps switched should do 10 times faster (current replica speed is 1Mbytes/sec) -Original Message- From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley Sent: January-03-10 10:03 AM To: solr-user@lucene.apache.org Subject: Re: SOLR: Replication On Sat, Jan 2, 2010 at 11:35 PM, Fuad Efendi f...@efendi.ca wrote: I tried... I set APR to improve performance... server is slow while replica; but top shows only 1% of I/O wait... it is probably environment specific; So you're saying that stock tomcat (non-native APR) was also 10 times slower? but the same happened in my home-based network, rsync was 10 times faster... I don't know details of HTTP-replica, it could be base64 or something like that; RAM-buffer, flush to disk, etc. The HTTP replication is using binary. If you look here, it was benchmarked to be nearly as fast as rsync: http://wiki.apache.org/solr/SolrReplication It does do a fsync to make sure that the files are on disk after downloading, but that shouldn't make too much difference. -Yonik http://www.lucidimagination.com
SOLR: Replication
I used RSYNC before, and a 20Gb replication took less than an hour (20-40 minutes); now, with HTTP, it takes 5-6 hours... The admin screen shows 952Kb/sec average speed; 100Mbps network, full-duplex; I am using Tomcat Native for APR. 10 times slower... -Fuad http://www.tokenizer.ca
RE: SOLR: Replication
Hi Yonik, I tried... I set APR to improve performance... server is slow while replica; but top shows only 1% of I/O wait... it is probably environment specific; but the same happened in my home-based network, rsync was 10 times faster... I don't know details of HTTP-replica, it could be base64 or something like that; RAM-buffer, flush to disk, etc. -Fuad -Original Message- From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley Sent: January-02-10 5:52 PM To: solr-user@lucene.apache.org Subject: Re: SOLR: Replication On Sat, Jan 2, 2010 at 5:48 PM, Fuad Efendi f...@efendi.ca wrote: I used RSYNC before, and 20Gb replica took less than an hour (20-40 minutes); now, HTTP, and it takes 5-6 hours... Admin screen shows 952Kb/sec average speed; 100Mbps network, full- duplex; I am using Tomcat Native for APR. 10x times slow... Hmmm, did you try w/o native APR? -Yonik http://www.lucidimagination.com
SOLR: Portlet (Plugin) for Liferay Portal
SOLR Users == I am in the middle of development of generic (configurable) _portlet_ (JSR-286) for Liferay Portal (MIT-like license) which I am going to share, Have a look at (my profile) powered by Liferay Portal: http://www.tokenizer.ca/web/guru :smile: Home page: http://www.tokenizer.ca/ http://www.liferay.com - native multi-hosting support (I can power multiple DNS with single Tomcat-Liferay instance; I can even assign DNS to personal profiles) BTW, Liferay Portal has generic wrapper around Lucene, and recently SOLR! All content (including Articles, BLOGs, Documents, Pages, WIKIs, Forum Posts) is automatically indexed. Having separate SOLR definitely helps: instead of hardcoding (with Lucene) we can now intelligently manage stop words, stemming, language settings, and more. Fuad Efendi +1 416-993-2060 http://www.linkedin.com/in/liferay Tokenizer Inc. http://www.tokenizer.ca/ Data Mining, Vertical Search
SOLR Performance Tuning: Pagination
I used pagination for a while till I found this... I have a filtered query ID:[* TO *] returning 20 million results (no faceting), and pagination always seemed to be fast. However, it is fast only with low values such as start=12345. Queries like start=28838540 take 40-60 seconds, and even cause OutOfMemoryException. I use highlighting, faceting on a non-tokenized Country field, and the standard handler. It even seems to be a bug... Fuad Efendi +1 416-993-2060 http://www.linkedin.com/in/liferay Tokenizer Inc. http://www.tokenizer.ca/ Data Mining, Vertical Search
RE: SOLR Performance Tuning: Pagination
Grant, Eric, Walter, and SOLR, Thank you so much for very prompt responses (with links!) From time to time I try to share... Happy Holidays!!! -Original Message- From: Walter Underwood [mailto:wun...@wunderwood.org] Sent: December-24-09 1:51 PM To: solr-user@lucene.apache.org Subject: Re: SOLR Performance Tuning: Pagination Some bots will do that, too. Maybe badly written ones, but we saw that at Netflix. It was causing search timeouts just before a peak traffic period, so we set a page limit in the front end, something like 200 pages. It makes sense for that to be very slow, because a request for hit 28838540 means that Solr has to calculate the relevance for 28838540 + 10 documents. Fuad: Why are you benchmarking this? What user is looking at 20M documents? wunder On Dec 24, 2009, at 10:44 AM, Erik Hatcher wrote: On Dec 24, 2009, at 11:36 AM, Walter Underwood wrote: When do users do a query like that? --wunder Well, SolrEntityProcessor users do :) http://issues.apache.org/jira/browse/SOLR-1499 (which by the way I plan on polishing and committing over the holidays) Erik On Dec 24, 2009, at 8:09 AM, Fuad Efendi wrote: I used pagination for a while till found this... I have filtered query ID:[* TO *] returning 20 millions results (no faceting), and pagination always seemed to be fast. However, fast only with low values for start=12345. Queries like start=28838540 take 40-60 seconds, and even cause OutOfMemoryException. I use highlight, faceting on nontokenized Country field, standard handler. It even seems to be a bug... Fuad Efendi +1 416-993-2060 http://www.linkedin.com/in/liferay Tokenizer Inc. http://www.tokenizer.ca/ Data Mining, Vertical Search
RE: SOLR Performance Tuning: Pagination
Not users... robots! Slurp/Yahoo, Googlebot, etc. I had friendly URLs for query with filters like http://.../USA/ showing all documents from SOLR with country=USA, with pagination; I disabled it now. But URLs like http://.../?q=USA are still dangerous, I need to limit pagination programmatically. -Original Message- From: Walter Underwood [mailto:wun...@wunderwood.org] Sent: December-24-09 11:37 AM To: solr-user@lucene.apache.org Subject: Re: SOLR Performance Tuning: Pagination When do users do a query like that? --wunder On Dec 24, 2009, at 8:09 AM, Fuad Efendi wrote: I used pagination for a while till found this... I have filtered query ID:[* TO *] returning 20 millions results (no faceting), and pagination always seemed to be fast. However, fast only with low values for start=12345. Queries like start=28838540 take 40-60 seconds, and even cause OutOfMemoryException. I use highlight, faceting on nontokenized Country field, standard handler. It even seems to be a bug... Fuad Efendi +1 416-993-2060 http://www.linkedin.com/in/liferay Tokenizer Inc. http://www.tokenizer.ca/ Data Mining, Vertical Search
RE: SOLR Performance Tuning: Pagination
Hi Walter, you are right, it was mostly robots (Googlebot, Yahoo/Slurp, etc); I have friendly URLs like http://www.tokenizer.org/USA/?page=7 (30mlns docs, 3mlns pages) http://www.tokenizer.org/www.newegg.com/ http://www.tokenizer.org/www.newegg.com/?sort=link&dir=asc&q=Opteron And even this: http://www.tokenizer.org/AMD/Opteron/8350/ I disabled processing for URLs with no query parameter (empty results); but I should really limit pagination programmatically... fortunately http://www.tokenizer.org/?q=USA returns 50k documents (search doesn't use the Country field). But some queries may return a huge number of documents (better to tune the stop-word list) -Fuad -Original Message- From: Walter Underwood [mailto:wun...@wunderwood.org] Sent: December-24-09 1:51 PM To: solr-user@lucene.apache.org Subject: Re: SOLR Performance Tuning: Pagination Some bots will do that, too. Maybe badly written ones, but we saw that at Netflix. It was causing search timeouts just before a peak traffic period, so we set a page limit in the front end, something like 200 pages. It makes sense for that to be very slow, because a request for hit 28838540 means that Solr has to calculate the relevance for 28838540 + 10 documents. Fuad: Why are you benchmarking this? What user is looking at 20M documents? wunder On Dec 24, 2009, at 10:44 AM, Erik Hatcher wrote: On Dec 24, 2009, at 11:36 AM, Walter Underwood wrote: When do users do a query like that? --wunder Well, SolrEntityProcessor users do :) http://issues.apache.org/jira/browse/SOLR-1499 (which by the way I plan on polishing and committing over the holidays) Erik On Dec 24, 2009, at 8:09 AM, Fuad Efendi wrote: I used pagination for a while till found this... I have filtered query ID:[* TO *] returning 20 millions results (no faceting), and pagination always seemed to be fast. However, fast only with low values for start=12345. Queries like start=28838540 take 40-60 seconds, and even cause OutOfMemoryException. I use highlight, faceting on nontokenized Country field, standard handler. It even seems to be a bug... Fuad Efendi +1 416-993-2060 http://www.linkedin.com/in/liferay Tokenizer Inc. http://www.tokenizer.ca/ Data Mining, Vertical Search
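One way to "limit pagination programmatically", as mentioned above, is to clamp the start offset in the front end before it ever reaches Solr. A hypothetical sketch; the 200-page cap mirrors the front-end limit Walter described:

    // Hypothetical front-end guard: cap how deep bots (or users) can page.
    public class PaginationGuard {
        private static final int PAGE_SIZE = 10;
        private static final int MAX_PAGES = 200;                          // assumed cap
        private static final int MAX_START = (MAX_PAGES - 1) * PAGE_SIZE;

        // Parse the requested start offset and clamp it to a safe range.
        public static int safeStart(String startParam) {
            int start;
            try {
                start = Integer.parseInt(startParam);
            } catch (NumberFormatException e) {
                return 0;
            }
            if (start < 0) return 0;
            return Math.min(start, MAX_START);
        }

        public static void main(String[] args) {
            System.out.println(safeStart("28838540")); // -> 1990, not a 40-second deep page
            System.out.println(safeStart("120"));      // -> 120
        }
    }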
RE: SOLR Performance Tuning: Disable INFO Logging.
Can you quickly explain what you did to disable INFO-Level? I am from a PHP background and am not so well versed in Tomcat or Java. Is this a section in solrconfig.xml or did you have to edit Solr Java source and recompile?

1. Create a file called logging.properties with the following content (I created it in the /home/tomcat/solr folder):

    .level=INFO
    handlers= java.util.logging.ConsoleHandler, java.util.logging.FileHandler
    java.util.logging.FileHandler.formatter = java.util.logging.SimpleFormatter
    java.util.logging.FileHandler.level = INFO
    java.util.logging.ConsoleHandler.formatter = java.util.logging.SimpleFormatter
    java.util.logging.ConsoleHandler.level = ALL
    org.apache.solr.level=SEVERE

2. Modify file tomcat_installation/bin/catalina.sh to include the following (as a first line in the script):

    JAVA_OPTS=... ... ... -Djava.util.logging.config.file=/home/tomcat/solr/logging.properties

(this line may include more parameters such as -Xmx8196m for memory, -Dfile.encoding=UTF8 -Dsolr.solr.home=/home/tomcat/solr -Dsolr.data.dir=/home/tomcat/solr for SOLR, etc.)

With these settings, SOLR (and Tomcat) will use standard Java 5/6 logging capabilities. Log output will default to the standard /logs folder of Tomcat. You may find additional logging configuration settings by googling for "Java 5 Logging" etc.

2009/12/20 Fuad Efendi f...@efendi.ca: After researching how to configure default SOLR Tomcat logging, I finally disabled INFO-level for SOLR. And performance improved at least 7 times!!! ('at least 7' because I restarted server 5 minutes ago; caches are not prepopulated yet) Before that, I had 300-600 ms in HTTPD log files in average, and 4%-8% I/O wait whenever top commands shows SOLR on top. Now, I have 50ms-100ms in average (total response time logged by HTTPD). P.S. Of course, I am limited in RAM, and I use slow SATA... server is moderately loaded, 5-10 requests per second. P.P.S. And suddenly synchronous I/O by Java/Tomcat Logger slows down performance much higher than read-only I/O of Lucene. Fuad Efendi +1 416-993-2060 http://www.linkedin.com/in/liferay Tokenizer Inc. http://www.tokenizer.ca/ Data Mining, Vertical Search Fuad Efendi +1 416-993-2060 http://www.linkedin.com/in/liferay Tokenizer Inc. http://www.tokenizer.ca/ Data Mining, Vertical Search
SOLR Performance Tuning: Disable INFO Logging.
After researching how to configure default SOLR Tomcat logging, I finally disabled INFO-level for SOLR. And performance improved at least 7 times!!! ('at least 7' because I restarted server 5 minutes ago; caches are not prepopulated yet) Before that, I had 300-600 ms in HTTPD log files in average, and 4%-8% I/O wait whenever top commands shows SOLR on top. Now, I have 50ms-100ms in average (total response time logged by HTTPD). P.S. Of course, I am limited in RAM, and I use slow SATA... server is moderately loaded, 5-10 requests per second. P.P.S. And suddenly synchronous I/O by Java/Tomcat Logger slows down performance much higher than read-only I/O of Lucene. Fuad Efendi +1 416-993-2060 http://www.linkedin.com/in/liferay Tokenizer Inc. http://www.tokenizer.ca/ Data Mining, Vertical Search
RE: SOLR Performance Tuning: Disable INFO Logging.
We were talking about GC options a lot; don't forget to enclose the following into if (log.isInfoEnabled()):

    ...
    final NamedList<Object> responseHeader = new SimpleOrderedMap<Object>();
    rsp.add("responseHeader", responseHeader);

    NamedList toLog = rsp.getToLog();
    //toLog.add("core", getName());
    toLog.add("webapp", req.getContext().get("webapp"));
    toLog.add("path", req.getContext().get("path"));
    toLog.add("params", "{" + req.getParamString() + "}");

    handler.handleRequest(req,rsp);
    setResponseHeaderValues(handler,req,rsp);

    StringBuilder sb = new StringBuilder();
    for (int i=0; i<toLog.size(); i++) {
      String name = toLog.getName(i);
      Object val = toLog.getVal(i);
      sb.append(name).append("=").append(val).append(" ");
    }
    log.info(logid + sb.toString());
    ...

-Fuad -Original Message- From: Fuad Efendi [mailto:f...@efendi.ca] Sent: December-20-09 2:54 PM To: solr-user@lucene.apache.org Subject: SOLR Performance Tuning: Disable INFO Logging. After researching how to configure default SOLR Tomcat logging, I finally disabled INFO-level for SOLR. And performance improved at least 7 times!!! ('at least 7' because I restarted server 5 minutes ago; caches are not prepopulated yet) Before that, I had 300-600 ms in HTTPD log files in average, and 4%-8% I/O wait whenever top commands shows SOLR on top. Now, I have 50ms-100ms in average (total response time logged by HTTPD). P.S. Of course, I am limited in RAM, and I use slow SATA... server is moderately loaded, 5-10 requests per second. P.P.S. And suddenly synchronous I/O by Java/Tomcat Logger slows down performance much higher than read-only I/O of Lucene. Fuad Efendi +1 416-993-2060 http://www.linkedin.com/in/liferay Tokenizer Inc. http://www.tokenizer.ca/ Data Mining, Vertical Search
RE: solr stops running periodically
By that I mean that the java/tomcat process just disappears. I had a similar problem when I started Tomcat via SSH and then improperly closed SSH without the exit command. In some cases (OutOfMemory) there is not enough memory to generate a log (or the CPU can be so overloaded by the Garbage Collector that you would have to wait a few days until the LOG is generated) - but the process can't disappear... A process can't simply disappear... if it is a JVM crash you should see a dump file (you may need to set a specific JVM option to generate a dump file in case of a crash) -Original Message- From: athir nuaimi [mailto:at...@nuaim.com] Sent: November-15-09 1:46 PM To: solr-user@lucene.apache.org Subject: solr stops running periodically We have 4 machines running solr. On one of the machines, every 2-3 days solr stops running. By that I mean that the java/tomcat process just disappears. If I look at the catalina logs, I see normal log entries and then nothing. There is no shutdown messages like you would normally see if you sent a SIGTERM to the process. Obviously this is a problem. I'm new to solr/java so if there are more diagnostic things I can do I'd appreciate any tips/advice. thanks in advance Athir
RE: Lucene FieldCache memory requirements
Sorry Mike, Mark, I am confused again... Yes, I need some more memory for processing (while FieldCache is being loaded), obviously, but it was not main subject... With StringIndexCache, I have 10 arrays (cardinality of this field is 10) storing (int) Lucene Document ID. Except: as Mark said, you'll also need transient memory = pointer (4 or 8 bytes) * (1+maxdoc), while the FieldCache is being loaded. Ok, I see it: final int[] retArray = new int[reader.maxDoc()]; String[] mterms = new String[reader.maxDoc()+1]; I can't track right now (limited in time), I think mterms is local variable and will size down to 0... So that correct formula is... weird one... if you don't want unexpected OOM or overloaded GC (WeakHashMaps...): [some heap] + [Non-Tokenized_Field_Count] x [maxdoc] x [4 bytes + 8 bytes] (for 64-bit) -Fuad -Original Message- From: Michael McCandless [mailto:luc...@mikemccandless.com] Sent: November-03-09 5:00 AM To: solr-user@lucene.apache.org Subject: Re: Lucene FieldCache memory requirements On Mon, Nov 2, 2009 at 9:27 PM, Fuad Efendi f...@efendi.ca wrote: I believe this is correct estimate: C. [maxdoc] x [4 bytes ~ (int) Lucene Document ID] same as [String1_Document_Count + ... + String10_Document_Count + ...] x [4 bytes per DocumentID] That's right. Except: as Mark said, you'll also need transient memory = pointer (4 or 8 bytes) * (1+maxdoc), while the FieldCache is being loaded. After it's done being loaded, this sizes down to the number of unique terms. But, if Lucene did the basic int packing, which really we should do, since you only have 10 unique values, with a naive 4 bits per doc encoding, you'd only need 1/8th the memory usage. We could do a bit better by encoding more than one document at a time... Mike
RE: Lucene FieldCache memory requirements
Any thoughts regarding the subject? I hope FieldCache doesn't use more than 6 bytes per document-field instance... I am too lazy to research Lucene source code, I hope someone can provide exact answer... Thanks Subject: Lucene FieldCache memory requirements Hi, Can anyone confirm Lucene FieldCache memory requirements? I have 100 millions docs with non-tokenized field country (10 different countries); I expect it requires array of (int, long), size of array 100,000,000, without any impact of country field length; it requires 600,000,000 bytes: int is pointer to document (Lucene document ID), and long is pointer to String value... Am I right, is it 600Mb just for this country (indexed, non-tokenized, non-boolean) field and 100 millions docs? I need to calculate exact minimum RAM requirements... I believe it shouldn't depend on cardinality (distribution) of field... Thanks, Fuad
RE: Lucene FieldCache memory requirements
I am not using Lucene API directly; I am using SOLR which uses Lucene FieldCache for faceting on non-tokenized fields... I think this cache will be lazily loaded, until user executes sorted (by this field) SOLR query for all documents *:* - in this case it will be fully populated... Subject: Re: Lucene FieldCache memory requirements Which FieldCache API are you using? getStrings? or getStringIndex (which is used, under the hood, if you sort by this field). Mike On Mon, Nov 2, 2009 at 2:27 PM, Fuad Efendi f...@efendi.ca wrote: Any thoughts regarding the subject? I hope FieldCache doesn't use more than 6 bytes per document-field instance... I am too lazy to research Lucene source code, I hope someone can provide exact answer... Thanks Subject: Lucene FieldCache memory requirements Hi, Can anyone confirm Lucene FieldCache memory requirements? I have 100 millions docs with non-tokenized field country (10 different countries); I expect it requires array of (int, long), size of array 100,000,000, without any impact of country field length; it requires 600,000,000 bytes: int is pointer to document (Lucene document ID), and long is pointer to String value... Am I right, is it 600Mb just for this country (indexed, non-tokenized, non-boolean) field and 100 millions docs? I need to calculate exact minimum RAM requirements... I believe it shouldn't depend on cardinality (distribution) of field... Thanks, Fuad
RE: Lucene FieldCache memory requirements
Thank you very much Mike, I found it: org.apache.solr.request.SimpleFacets ... // TODO: future logic could use filters instead of the fieldcache if // the number of terms in the field is small enough. counts = getFieldCacheCounts(searcher, base, field, offset,limit, mincount, missing, sort, prefix); ... FieldCache.StringIndex si = FieldCache.DEFAULT.getStringIndex(searcher.getReader(), fieldName); final String[] terms = si.lookup; final int[] termNum = si.order; ... So that 64-bit requires more memory :) Mike, am I right here? [(8 bytes pointer) + (4 bytes DocID)] x [Number of Documents (100mlns)] (64-bit JVM) 1.2Gb RAM for this... Or, may be I am wrong: For Lucene directly, simple strings would consume an pointer (4 or 8 bytes depending on whether your JRE is 64bit) per doc, and the string index would consume an int (4 bytes) per doc. [8 bytes (64bit)] x [number of documents (100mlns)]? 0.8Gb Kind of Map between String and DocSet, saving 4 bytes... Key is String, and Value is array of 64-bit pointers to Document. Why 64-bit (for 64-bit JVM)? I always thought it is (int) documentId... Am I right? Thanks for pointing to http://issues.apache.org/jira/browse/LUCENE-1990! Note that for your use case, this is exceptionally wasteful. This is probably very common case... I think it should be confirmed by Lucene developers too... FieldCache is warmed anyway, even when we don't use SOLR... -Fuad -Original Message- From: Michael McCandless [mailto:luc...@mikemccandless.com] Sent: November-02-09 6:00 PM To: solr-user@lucene.apache.org Subject: Re: Lucene FieldCache memory requirements OK I think someone who knows how Solr uses the fieldCache for this type of field will have to pipe up. For Lucene directly, simple strings would consume an pointer (4 or 8 bytes depending on whether your JRE is 64bit) per doc, and the string index would consume an int (4 bytes) per doc. (Each also consume negligible (for your case) memory to hold the actual string values). Note that for your use case, this is exceptionally wasteful. If Lucene had simple bit-packed ints (I've opened LUCENE-1990 for this) then it'd take much fewer bits to reference the values, since you have only 10 unique string values. Mike On Mon, Nov 2, 2009 at 3:57 PM, Fuad Efendi f...@efendi.ca wrote: I am not using Lucene API directly; I am using SOLR which uses Lucene FieldCache for faceting on non-tokenized fields... I think this cache will be lazily loaded, until user executes sorted (by this field) SOLR query for all documents *:* - in this case it will be fully populated... Subject: Re: Lucene FieldCache memory requirements Which FieldCache API are you using? getStrings? or getStringIndex (which is used, under the hood, if you sort by this field). Mike On Mon, Nov 2, 2009 at 2:27 PM, Fuad Efendi f...@efendi.ca wrote: Any thoughts regarding the subject? I hope FieldCache doesn't use more than 6 bytes per document-field instance... I am too lazy to research Lucene source code, I hope someone can provide exact answer... Thanks Subject: Lucene FieldCache memory requirements Hi, Can anyone confirm Lucene FieldCache memory requirements? I have 100 millions docs with non-tokenized field country (10 different countries); I expect it requires array of (int, long), size of array 100,000,000, without any impact of country field length; it requires 600,000,000 bytes: int is pointer to document (Lucene document ID), and long is pointer to String value... 
Am I right, is it 600Mb just for this country (indexed, non-tokenized, non-boolean) field and 100 millions docs? I need to calculate exact minimum RAM requirements... I believe it shouldn't depend on cardinality (distribution) of field... Thanks, Fuad
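A minimal sketch of the StringIndex layout being discussed, assuming the Lucene 2.9-era FieldCache API (index path and field name are illustrative): order holds one int per document and lookup holds the unique terms, so the per-document cost is the int array, not the strings.

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.FieldCache;
    import org.apache.lucene.store.FSDirectory;
    import java.io.File;

    public class StringIndexDemo {
        public static void main(String[] args) throws Exception {
            IndexReader reader = IndexReader.open(
                    FSDirectory.open(new File("/path/to/index")), true); // hypothetical path

            // One int per document (order) plus one String per unique term (lookup).
            FieldCache.StringIndex si =
                    FieldCache.DEFAULT.getStringIndex(reader, "country"); // illustrative field

            int docId = 0;
            int ord = si.order[docId];        // ordinal of this doc's term
            String country = si.lookup[ord];  // lookup[0] is the "no value" slot
            System.out.println("doc 0 country = " + country);

            reader.close();
        }
    }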
RE: Lucene FieldCache memory requirements
Simple field (10 different values: Canada, USA, UK, ...), 64-bit JVM... no difference between maxdoc and maxdoc + 1 for such estimate... difference is between 0.4Gb and 1.2Gb... So, let's vote ;) A. [maxdoc] x [8 bytes ~ pointer to String object] B. [maxdoc] x [8 bytes ~ pointer to Document object] C. [maxdoc] x [4 bytes ~ (int) Lucene Document ID] - same as [String1_Document_Count + ... + String10_Document_Count] x [4 bytes ~ DocumentID] D. [maxdoc] x [4 bytes + 8 bytes ~ my initial naive thinking...] Please confirm that it is Pointer to Object and not Lucene Document ID... I hope it is (int) Document ID... -Original Message- From: Mark Miller [mailto:markrmil...@gmail.com] Sent: November-02-09 6:52 PM To: solr-user@lucene.apache.org Subject: Re: Lucene FieldCache memory requirements It also briefly requires more memory than just that - it allocates an array the size of maxdoc+1 to hold the unique terms - and then sizes down. Possibly we can use the getUnuiqeTermCount method in the flexible indexing branch to get rid of that - which is why I was thinking it might be a good idea to drop the unsupported exception in that method for things like multi reader and just do the work to get the right number (currently there is a comment that the user should do that work if necessary, making the call unreliable for this). Fuad Efendi wrote: Thank you very much Mike, I found it: org.apache.solr.request.SimpleFacets ... // TODO: future logic could use filters instead of the fieldcache if // the number of terms in the field is small enough. counts = getFieldCacheCounts(searcher, base, field, offset,limit, mincount, missing, sort, prefix); ... FieldCache.StringIndex si = FieldCache.DEFAULT.getStringIndex(searcher.getReader(), fieldName); final String[] terms = si.lookup; final int[] termNum = si.order; ... So that 64-bit requires more memory :) Mike, am I right here? [(8 bytes pointer) + (4 bytes DocID)] x [Number of Documents (100mlns)] (64-bit JVM) 1.2Gb RAM for this... Or, may be I am wrong: For Lucene directly, simple strings would consume an pointer (4 or 8 bytes depending on whether your JRE is 64bit) per doc, and the string index would consume an int (4 bytes) per doc. [8 bytes (64bit)] x [number of documents (100mlns)]? 0.8Gb Kind of Map between String and DocSet, saving 4 bytes... Key is String, and Value is array of 64-bit pointers to Document. Why 64-bit (for 64-bit JVM)? I always thought it is (int) documentId... Am I right? Thanks for pointing to http://issues.apache.org/jira/browse/LUCENE-1990! Note that for your use case, this is exceptionally wasteful. This is probably very common case... I think it should be confirmed by Lucene developers too... FieldCache is warmed anyway, even when we don't use SOLR... -Fuad -Original Message- From: Michael McCandless [mailto:luc...@mikemccandless.com] Sent: November-02-09 6:00 PM To: solr-user@lucene.apache.org Subject: Re: Lucene FieldCache memory requirements OK I think someone who knows how Solr uses the fieldCache for this type of field will have to pipe up. For Lucene directly, simple strings would consume an pointer (4 or 8 bytes depending on whether your JRE is 64bit) per doc, and the string index would consume an int (4 bytes) per doc. (Each also consume negligible (for your case) memory to hold the actual string values). Note that for your use case, this is exceptionally wasteful. 
If Lucene had simple bit-packed ints (I've opened LUCENE-1990 for this) then it'd take much fewer bits to reference the values, since you have only 10 unique string values. Mike On Mon, Nov 2, 2009 at 3:57 PM, Fuad Efendi f...@efendi.ca wrote: I am not using Lucene API directly; I am using SOLR which uses Lucene FieldCache for faceting on non-tokenized fields... I think this cache will be lazily loaded, until user executes sorted (by this field) SOLR query for all documents *:* - in this case it will be fully populated... Subject: Re: Lucene FieldCache memory requirements Which FieldCache API are you using? getStrings? or getStringIndex (which is used, under the hood, if you sort by this field). Mike On Mon, Nov 2, 2009 at 2:27 PM, Fuad Efendi f...@efendi.ca wrote: Any thoughts regarding the subject? I hope FieldCache doesn't use more than 6 bytes per document-field instance... I am too lazy to research Lucene source code, I hope someone can provide exact answer... Thanks Subject: Lucene FieldCache memory requirements Hi, Can anyone confirm Lucene FieldCache memory requirements? I have 100 millions docs with non-tokenized field country (10 different countries); I expect it requires array of (int, long), size of array 100,000,000
RE: Lucene FieldCache memory requirements
I just did some tests on a completely new index (Slave): sorting by a low-cardinality non-tokenized field (such as Country) takes milliseconds, but a sort (ascending) on a tokenized field with many unique terms took 30 seconds (initially). A second sort (descending) took milliseconds. Generic query *:*; FieldCache is not used for tokenized fields... so how is it sorted :) Fortunately, no OOM. -Fuad
RE: Lucene FieldCache memory requirements
Mark, I don't understand this: so with a ton of docs and a few uniques, you get a temp boost in the RAM reqs until it sizes it down. Sizes down??? Why is it called Cache indeed? And how SOLR uses it if it is not cache? And this: A pointer for each doc. Why can't we use (int) DocumentID? For me, it is natural; 64-bit pointer to an object in RAM is not natural (in Lucene world)... So, is it [maxdoc]x[4-bytes], or [maxdoc]x[8-bytes]?... -Fuad
RE: Lucene FieldCache memory requirements
I believe this is correct estimate: C. [maxdoc] x [4 bytes ~ (int) Lucene Document ID] same as [String1_Document_Count + ... + String10_Document_Count + ...] x [4 bytes per DocumentID] So, for 100 millions docs we need 400Mb for each(!) non-tokenized field. Although FieldCacheImpl is based on WeakHashMap (somewhere...), we can't rely on sizing down with SOLR faceting features I think I finally found the answer... /** Expert: Stores term text values and document ordering data. */ public static class StringIndex { ... /** All the term values, in natural order. */ public final String[] lookup; /** For each document, an index into the lookup array. */ public final int[] order; ... } Another API: /** Checks the internal cache for an appropriate entry, and if none * is found, reads the term values in codefield/code and returns an array * of size codereader.maxDoc()/code containing the value each document * has in the given field. * @param reader Used to get field values. * @param field Which field contains the strings. * @return The values in the given field for each document. * @throws IOException If any error occurs. */ public String[] getStrings (IndexReader reader, String field) throws IOException; Looks similar; cache size is [maxdoc]; however values stored are 8-byte pointers for 64-bit JVM. private MapClass?,Cache caches; private synchronized void init() { caches = new HashMapClass?,Cache(7); ... caches.put(String.class, new StringCache(this)); caches.put(StringIndex.class, new StringIndexCache(this)); ... } StringCache and StringIndexCache use WeakHashMap internally... but objects won't be ever garbage collected in a faceted production system... SOLR SimpleFacets don't use getStrings API, so the hope is memory requirements are minimized. However, Lucene may use it internally for some queries (or, for instance, to get access to a nontokenized cached field without reading index)... to be safe, use this in your basic memory estimates: [512Mb ~ 1Gb] + [non_tokenized_fields_count] x [maxdoc] x [8 bytes] -Fuad -Original Message- From: Fuad Efendi [mailto:f...@efendi.ca] Sent: November-02-09 7:37 PM To: solr-user@lucene.apache.org Subject: RE: Lucene FieldCache memory requirements Simple field (10 different values: Canada, USA, UK, ...), 64-bit JVM... no difference between maxdoc and maxdoc + 1 for such estimate... difference is between 0.4Gb and 1.2Gb... So, let's vote ;) A. [maxdoc] x [8 bytes ~ pointer to String object] B. [maxdoc] x [8 bytes ~ pointer to Document object] C. [maxdoc] x [4 bytes ~ (int) Lucene Document ID] - same as [String1_Document_Count + ... + String10_Document_Count] x [4 bytes ~ DocumentID] D. [maxdoc] x [4 bytes + 8 bytes ~ my initial naive thinking...] Please confirm that it is Pointer to Object and not Lucene Document ID... I hope it is (int) Document ID... -Original Message- From: Mark Miller [mailto:markrmil...@gmail.com] Sent: November-02-09 6:52 PM To: solr-user@lucene.apache.org Subject: Re: Lucene FieldCache memory requirements It also briefly requires more memory than just that - it allocates an array the size of maxdoc+1 to hold the unique terms - and then sizes down. Possibly we can use the getUnuiqeTermCount method in the flexible indexing branch to get rid of that - which is why I was thinking it might be a good idea to drop the unsupported exception in that method for things like multi reader and just do the work to get the right number (currently there is a comment that the user should do that work if necessary, making the call unreliable for this). 
Fuad Efendi wrote: Thank you very much Mike, I found it: org.apache.solr.request.SimpleFacets ... // TODO: future logic could use filters instead of the fieldcache if // the number of terms in the field is small enough. counts = getFieldCacheCounts(searcher, base, field, offset,limit, mincount, missing, sort, prefix); ... FieldCache.StringIndex si = FieldCache.DEFAULT.getStringIndex(searcher.getReader(), fieldName); final String[] terms = si.lookup; final int[] termNum = si.order; ... So that 64-bit requires more memory :) Mike, am I right here? [(8 bytes pointer) + (4 bytes DocID)] x [Number of Documents (100mlns)] (64-bit JVM) 1.2Gb RAM for this... Or, may be I am wrong: For Lucene directly, simple strings would consume an pointer (4 or 8 bytes depending on whether your JRE is 64bit) per doc, and the string index would consume an int (4 bytes) per doc. [8 bytes (64bit)] x [number of documents (100mlns)]? 0.8Gb Kind of Map between String and DocSet, saving 4 bytes... Key is String, and Value is array of 64-bit pointers to Document. Why 64-bit (for 64
RE: Lucene FieldCache memory requirements
Hi Mark, Yes, I understand it now; however, how will StringIndexCache size down in a production system faceting by Country on a homepage? This is SOLR specific... Lucene specific: Lucene doesn't read from disk if it can retrieve field value for a specific document ID from cache. How will it size down in purely Lucene-based heavy-loaded production system? Especially if this cache is used for query optimizations. -Original Message- From: Mark Miller [mailto:markrmil...@gmail.com] Sent: November-02-09 8:53 PM To: solr-user@lucene.apache.org Subject: Re: Lucene FieldCache memory requirements

  static final class StringIndexCache extends Cache {
    StringIndexCache(FieldCache wrapper) {
      super(wrapper);
    }

    @Override
    protected Object createValue(IndexReader reader, Entry entryKey) throws IOException {
      String field = StringHelper.intern(entryKey.field);
      final int[] retArray = new int[reader.maxDoc()];
      String[] mterms = new String[reader.maxDoc()+1];
      TermDocs termDocs = reader.termDocs();
      TermEnum termEnum = reader.terms (new Term (field));
      int t = 0;  // current term number

      // an entry for documents that have no terms in this field
      // should a document with no terms be at top or bottom?
      // this puts them at the top - if it is changed, FieldDocSortedHitQueue
      // needs to change as well.
      mterms[t++] = null;

      try {
        do {
          Term term = termEnum.term();
          if (term==null || term.field() != field) break;

          // store term text
          // we expect that there is at most one term per document
          if (t >= mterms.length) throw new RuntimeException ("there are more terms than " +
                  "documents in field \"" + field + "\", but it's impossible to sort on " +
                  "tokenized fields");
          mterms[t] = term.text();

          termDocs.seek (termEnum);
          while (termDocs.next()) {
            retArray[termDocs.doc()] = t;
          }

          t++;
        } while (termEnum.next());
      } finally {
        termDocs.close();
        termEnum.close();
      }

      if (t == 0) {
        // if there are no terms, make the term array
        // have a single null entry
        mterms = new String[1];
      } else if (t < mterms.length) {
        // if there are less terms than documents,
        // trim off the dead array space
        String[] terms = new String[t];
        System.arraycopy (mterms, 0, terms, 0, t);
        mterms = terms;
      }

      StringIndex value = new StringIndex (retArray, mterms);
      return value;
    }
  };

The formula for a String Index fieldcache is essentially the String array of unique terms (which does indeed size down at the bottom) and the int array indexing into the String array. Fuad Efendi wrote: To be correct, I analyzed FieldCache awhile ago and I believed it never sizes down...

  /**
   * Expert: The default cache implementation, storing all values in memory.
   * A WeakHashMap is used for storage.
   *
   * <p>Created: May 19, 2004 4:40:36 PM
   *
   * @since lucene 1.4
   */

Will it size down? Only if we are not faceting (as in SOLR v.1.3)... And I am still unsure, Document ID vs. Object Pointer. I don't understand this: so with a ton of docs and a few uniques, you get a temp boost in the RAM reqs until it sizes it down. Sizes down??? Why is it called Cache indeed? And how SOLR uses it if it is not cache? -- - Mark http://www.lucidimagination.com
RE: Lucene FieldCache memory requirements
Even in simplistic scenario, when it is Garbage Collected, we still _need_to_be_able_ to allocate enough RAM to FieldCache on demand... linear dependency on document count... Hi Mark, Yes, I understand it now; however, how will StringIndexCache size down in a production system faceting by Country on a homepage? This is SOLR specific... Lucene specific: Lucene doesn't read from disk if it can retrieve field value for a specific document ID from cache. How will it size down in purely Lucene-based heavy-loaded production system? Especially if this cache is used for query optimizations.
RE: Lucene FieldCache memory requirements
FieldCache internally uses a WeakHashMap... nothing wrong with that, but... no Garbage Collection tuning will help if the allocated RAM is not enough to hold the Weak** entries as if they were Strong**, especially for SOLR faceting... 10%-15% of CPU taken by GC has been reported... -Fuad
Lucene FieldCache memory requirements
Hi, Can anyone confirm Lucene FieldCache memory requirements? I have 100 million docs with a non-tokenized field country (10 different countries); I expect it requires an array of (int, long), size of array 100,000,000, without any impact of the country field length; it requires 600,000,000 bytes: int is a pointer to the document (Lucene document ID), and long is a pointer to the String value... Am I right, is it 600Mb just for this country (indexed, non-tokenized, non-boolean) field and 100 million docs? I need to calculate exact minimum RAM requirements... I believe it shouldn't depend on cardinality (distribution) of the field... Thanks, Fuad
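To make the estimate concrete, a small back-of-the-envelope sketch under the assumptions discussed in the replies above (4 bytes per document for the StringIndex order array; the unique term strings are negligible for 10 countries):

    // Rough FieldCache.StringIndex sizing for one non-tokenized field.
    public class FieldCacheEstimate {
        public static void main(String[] args) {
            long maxDoc = 100000000L;    // 100 million documents
            int uniqueTerms = 10;        // e.g. 10 countries
            int avgTermBytes = 60;       // assumed per-String overhead + characters

            long orderArray = maxDoc * 4L;                         // int[] order
            long lookupArray = (long) uniqueTerms * avgTermBytes;  // String[] lookup, tiny

            long total = orderArray + lookupArray;
            System.out.printf("~%.0f MB per field (order: %.0f MB, lookup: %d bytes)%n",
                    total / 1e6, orderArray / 1e6, lookupArray);
            // -> roughly 400 MB per sorted/faceted string field at 100M docs
        }
    }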
RE: Too many open files
I had an extremely specific use case: about 5000 documents-per-second (small documents) update rate, and some documents can be repeatedly sent to SOLR with a different timestamp field (and the same unique document ID). Nothing breaks, just a great performance gain which was impossible with the 32Mb buffer (- it caused constant index merges, 5 times more CPU than index updates). Nothing breaks... with indexMerge=10 I don't have ANY merge during 24 hours; segments are large (a few of 4Gb-8Gb, and one large union); I have merges explicitly only, at night, when I issue a commit. Of course, it depends on the use case; for applications such as a Content Management System we don't need a high ramBufferSizeMB (few updates a day sent to SOLR)... -Original Message- From: Mark Miller [mailto:markrmil...@gmail.com] Sent: October-23-09 5:28 PM To: solr-user@lucene.apache.org Subject: Re: Too many open files 8 GB is much larger than is well supported. It's diminishing returns over 40-100 and mostly a waste of RAM. Too high and things can break. It should be well below 2 GB at most, but I'd still recommend 40-100. Fuad Efendi wrote: Reason of having big RAM buffer is lowering frequency of IndexWriter flushes and (subsequently) lowering frequency of index merge events, and (subsequently) merging of a few larger files takes less time... especially if RAM Buffer is intelligent enough (and big enough) to deal with 100 concurrent updates of existing document without 100-times flushing to disk of 100 document versions. I posted here thread related; I had 1:5 timing for Update:Merge (5 minutes merge, and 1 minute update) with default SOLR settings (32Mb buffer). I increased buffer to 8Gb on Master, and it triggered significant indexing performance boost... -Fuad http://www.linkedin.com/in/liferay -Original Message- From: Mark Miller [mailto:markrmil...@gmail.com] Sent: October-23-09 3:03 PM To: solr-user@lucene.apache.org Subject: Re: Too many open files I wouldn't use a RAM buffer of a gig - 32-100 is generally a good number. Fuad Efendi wrote: I was partially wrong; this is what Mike McCandless (Lucene-in-Action, 2nd edition) explained at Manning forum: mergeFactor of 1000 means you will have up to 1000 segments at each level. A level 0 segment means it was flushed directly by IndexWriter. After you have 1000 such segments, they are merged into a single level 1 segment. Once you have 1000 level 1 segments, they are merged into a single level 2 segment, etc. So, depending on how many docs you add to your index, you could have 1000s of segments w/ mergeFactor=1000. http://www.manning-sandbox.com/thread.jspa?threadID=33784&tstart=0 So, in case of mergeFactor=100 you may have (theoretically) 1000 segments, 10-20 files each (depending on schema)... mergeFactor=10 is the default setting... ramBufferSizeMB=1024 means that you need at least double the Java heap, but you have -Xmx1024m... -Fuad I am getting too many open files error. Usually I test on a server that has 4GB RAM and assigned 1GB for tomcat (set JAVA_OPTS=-Xms256m -Xmx1024m), ulimit -n is 256 for this server and has the following settings in solrconfig.xml:

  <useCompoundFile>true</useCompoundFile>
  <ramBufferSizeMB>1024</ramBufferSizeMB>
  <mergeFactor>100</mergeFactor>
  <maxMergeDocs>2147483647</maxMergeDocs>
  <maxFieldLength>1</maxFieldLength>

-- - Mark http://www.lucidimagination.com -- - Mark http://www.lucidimagination.com
RE: Too many open files
Thanks for pointing to it, but it is so obvious:

1. Buffer is used as a RAM storage for index updates
2. int has 2 x Gb different values (2^32)
3. We can have _up_to_ 2Gb of _Documents_ (stored as key-value pairs, inverted index)

In case of the 5 fields which I have, I need 5 arrays (up to 2Gb of size each) to store inverted pointers, so that there is no theoretical limit. Also, from the javadoc in IndexWriter:

  * <p><b>NOTE</b>: because IndexWriter uses
  * <code>int</code>s when managing its internal storage,
  * the absolute maximum value for this setting is somewhat
  * less than 2048 MB. The precise limit depends on
  * various factors, such as how large your documents are,
  * how many fields have norms, etc., so it's best to set
  * this value comfortably under 2048.</p>

Note also, I use norms etc...
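A minimal sketch of the knobs being discussed, using the Lucene 2.9-era IndexWriter API directly (path and values are illustrative; the ~2048 MB ceiling comes from the javadoc quoted above):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;
    import java.io.File;

    public class WriterTuning {
        public static void main(String[] args) throws Exception {
            IndexWriter writer = new IndexWriter(
                    FSDirectory.open(new File("/path/to/index")),   // hypothetical path
                    new StandardAnalyzer(Version.LUCENE_29),
                    IndexWriter.MaxFieldLength.UNLIMITED);

            // A larger buffer means fewer flushes and fewer merges,
            // but it must stay comfortably under the ~2048 MB limit.
            writer.setRAMBufferSizeMB(256.0);

            // A higher mergeFactor defers merges (more segments, more open files);
            // the default of 10 is usually the safer starting point.
            writer.setMergeFactor(10);

            // ... add documents ...
            writer.close();
        }
    }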