How to index on basis of a condition?
Hi, I want to index a particular field based on an if() condition. Can I do it through DIH? Please suggest. -- Thanks, Pawan Darira
RE: FieldCache
I don't think it is an XY problem. I indexed about 90 million sentences and the PAS (predicate argument structures) they consist of (about 500 million). Then I try to do NER (named entity recognition) by searching about 5 million entities. For each entity I need all the search results, not just the top X. Since about 10 percent of the entities are highly frequent (i.e. there are more than 5 million hits for "human"), it takes very long to obtain the data from the index. Very long means about a day with 15 distributed Katta nodes. Katta is just a distribution and shard balancing solution on top of Lucene.

Initially, I tried distributed search with Solr, but it was too slow to retrieve a large set of documents. Then I switched to Lucene and made some improvements. I enabled the field cache for my ID field and another single-char field (PAS type) to get the benefit of accessing the fields with an array. Unfortunately, the IDs are too large to fit in memory. I gave 12 GB of RAM to each node and also tried to use the MMapDirectory and/or CompressedOops. Lucene always runs out of memory.

Then I investigated the storage of the fields. String fields are stored in UTF-8 encoding, but my ID will never contain multi-byte UTF-8 characters. It follows a numeric schema but does not fit into a single long, so I encoded it into a byte array of 11 bytes (compared to 30 bytes in UTF-8 encoding). Then I changed the field description in schema.xml to binary. I still use the EmbeddedSolrServer to create the indices. Also, I had to remove the uniqueKey node, because binary fields cannot be indexed, which is a requirement for the unique key. After reindexing I discovered that non-indexed or binary fields cannot be used with the FieldCache.

Then I tried to use IndexableBinaryStringTools to re-encode my 11-byte array. The size increased to 7 characters (= 14 bytes), which is still a gain of more than 50 percent compared to the UTF-8 encoding. BTW: I found no sample of how to use the IndexableBinaryStringTools class except in the unit tests. Unfortunately, I was not able to use it with the EmbeddedSolrServer and the Lucene client. The search result never looked identical to the IDs used to create the SolrInputDocument. I assume that the char[] returned from IndexableBinaryStringTools.encode is encoded as UTF-8 again and then stored; at some point the information is lost and cannot be recovered.

Recently I upgraded to trunk (4.0) and tried to use the BytesRefs from FieldCache.DEFAULT.getTerms directly, but the bytes are encoded in a form unknown to me and cannot be decoded with IndexableBinaryStringTools.decode.

The question is now: how can I increase the performance of binary field retrieval without exploding the memory? I also read some comments which suggest using payloads, but I never tried this approach. The column-stride fields approach (LUCENE-2186) also looks promising but is not released yet.

BTW: I made some tests with a smaller index and the ID encoded as a string. Using the field cache improves the hit retrieval dramatically (from 18 seconds down to 2 seconds per query, with a large number of results). -- Kind regards, Mathias

-----Original Message----- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Saturday, 23 October 2010 21:40 To: solr-user@lucene.apache.org Subject: Re: FieldCache

Why do you want to? Basically, the caches are there to improve #searching#. To search something, you must index it. Retrieving it is usually a rare enough operation that caching is irrelevant.
This smells like an XY problem, see: http://people.apache.org/~hossman/#xyproblem If this seems like gibberish, could you explain your problem a little more? Best Erick

On Thu, Oct 21, 2010 at 10:20 AM, Mathias Walter mathias.wal...@gmx.net wrote: Hi, does a field which should be cached need to be indexed? I have a binary field which is just stored. Retrieving it via FieldCache.DEFAULT.getTerms returns empty BytesRefs. Then I found the following post: http://www.mail-archive.com/d...@lucene.apache.org/msg05403.html How can I use the FieldCache with a binary field? -- Kind regards, Mathias
Re: How to index on basis of a condition?
Do you want to use a field's content to decide whether the document should be indexed or not? You could write an UpdateProcessor for that, simply aborting the chain for the docs that don't pass your test.

    @Override
    public void processAdd(AddUpdateCommand cmd) throws IOException {
      SolrInputDocument doc = cmd.getSolrInputDocument();
      String value = (String) doc.getFieldValue("myfield");
      String condition = "foobar";
      if (condition.equals(value)) {
        super.processAdd(cmd);
      }
    }

But if what you meant was to skip only that field if it does not match the condition, you could use doc.removeField(name) instead. Now you can feed your content using whatever method you like. -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com

On 25. okt. 2010, at 08.38, Pawan Darira wrote: Hi, I want to index a particular field based on an if() condition. Can I do it through DIH? Please suggest. -- Thanks, Pawan Darira
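As a fuller picture, here is a minimal, self-contained sketch of how that override could be packaged as a processor factory. The class name is illustrative, import packages have moved between Solr releases (e.g. SolrQueryResponse moved from org.apache.solr.request to org.apache.solr.response), and the factory still has to be registered in an updateRequestProcessorChain in solrconfig.xml:

    import java.io.IOException;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.solr.request.SolrQueryRequest;
    import org.apache.solr.request.SolrQueryResponse;
    import org.apache.solr.update.AddUpdateCommand;
    import org.apache.solr.update.processor.UpdateRequestProcessor;
    import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

    public class ConditionalIndexProcessorFactory extends UpdateRequestProcessorFactory {
      @Override
      public UpdateRequestProcessor getInstance(SolrQueryRequest req,
          SolrQueryResponse rsp, UpdateRequestProcessor next) {
        return new UpdateRequestProcessor(next) {
          @Override
          public void processAdd(AddUpdateCommand cmd) throws IOException {
            SolrInputDocument doc = cmd.getSolrInputDocument();
            String value = (String) doc.getFieldValue("myfield");
            // only pass matching documents down the chain; everything else is dropped
            if ("foobar".equals(value)) {
              super.processAdd(cmd);
            }
          }
        };
      }
    }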
Re: a bug of solr distributed search
On Thu, 2010-07-22 at 04:21 +0200, Li Li wrote: But it shows a problem of distributed search without common idf. A doc will get different scores in different shards. Bingo. I really don't understand why this fundamental problem with sharding isn't mentioned more often. Every time the advice "use sharding" is given, it should be followed with a "but be aware that it will make relevance ranking unreliable". Regards, Toke Eskildsen
Seattle Scalability Meetup: Rackspace OpenStack, Karmasphere Hadoop, Wed Oct 27
Link/Details: http://www.meetup.com/Seattle-Hadoop-HBase-NoSQL-Meetup/calendar/13704371/ This meetup focuses on Scalability and technologies to enable handling large amounts of data: Hadoop, HBase, distributed NoSQL databases, and more! There's not only a focus on technology, but also everything surrounding it including operations, management, business use cases, etc. We've had great success in the past, and are growing quickly! Including guests from LinkedIn, Amazon, Twitter, Facebook, Cloudant, and 10gen/MongoDB. This month's guests: Mike Mayo, Rackspace, Learn details on Rackspace's new Open Cloud offering -- a complete scalable cloud stack, but open source! Abe Taha, VP Engineering, Karmasphere: Karmasphere produces a Hadoop development environment. Learn more about working with Hadoop effectively, and see their exciting new offerings. Location: Amazon HQ, Von Vorst Building, 426 Terry Ave N., Seattle, WA 98109-5210 Afterparty: Fierabend, 422 Yale Ave N -- Bradford Stephens, Founder, Drawn to Scale drawntoscalehq.com 727.697.7528 http://www.drawntoscalehq.com -- The intuitive, cloud-scale data solution. Process, store, query, search, and serve all your data. http://www.roadtofailure.com -- The Fringes of Scalability, Social Media, and Computer Science
Re: Import From MYSQL database
Why don't you paste the log excerpt here which is generated when you are trying to import the data?
Re: a bug of solr distributed search
On 2010-10-25 11:22, Toke Eskildsen wrote: On Thu, 2010-07-22 at 04:21 +0200, Li Li wrote: But it shows a problem of distributed search without common idf. A doc will get different scores in different shards. Bingo. I really don't understand why this fundamental problem with sharding isn't mentioned more often. Every time the advice "use sharding" is given, it should be followed with a "but be aware that it will make relevance ranking unreliable".

The reason is twofold, I think:

* there is an exact solution to this problem, namely to make two distributed calls instead of one (a first call to collect per-shard IDFs for the given query terms, and a second call to submit a query rewritten with the global IDFs). This solution is implemented in SOLR-1632, with some caching to reduce the cost for common queries. However, this means that for every query you now need to make two calls instead of one, which potentially doubles the time to return results (for simple common queries - for rare complex queries the time will still be dominated by the query runtime on the shard servers).

* another reason is that in many, many cases the difference between using the exact global IDF and per-shard IDFs is not that significant. If shards are more or less homogeneous (e.g. you assign documents to shards by hash(docId)) then term distributions will also be similar. So the question is whether you can accept an N% variance in scores across shards, or whether you want to bear the cost of an additional distributed RPC for every query...

To summarize, I would qualify your statement with: "...if the composition of your shards is drastically different". Otherwise the cost of using global IDF is not worth it, IMHO. -- Best regards, Andrzej Bialecki - Information Retrieval, Semantic Web; Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
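For concreteness, the "global IDF" in the first bullet boils down to summing per-shard statistics before applying the usual formula. A hedged Java sketch, mirroring Lucene's DefaultSimilarity idf (SOLR-1632 itself does considerably more, e.g. caching and query rewriting):

    public class GlobalIdfSketch {
      /**
       * Combine per-shard numDocs/docFreq into one IDF, following Lucene's
       * DefaultSimilarity: idf = 1 + ln(numDocs / (docFreq + 1)).
       */
      public static double globalIdf(long[] shardNumDocs, long[] shardDocFreq) {
        long numDocs = 0;
        long docFreq = 0;
        for (long n : shardNumDocs) numDocs += n;
        for (long d : shardDocFreq) docFreq += d;
        return 1.0 + Math.log((double) numDocs / (double) (docFreq + 1));
      }

      public static void main(String[] args) {
        // two roughly homogeneous shards: per-shard and global IDFs barely differ
        System.out.println(globalIdf(new long[] {1000, 1000}, new long[] {10, 12}));
        System.out.println(globalIdf(new long[] {1000}, new long[] {10}));
      }
    }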
Re: Solr Javascript+JSON not optimized for SEO
The solution is to offer both, and provide a fallback for browsers that don't support javascript (e.g. Googlebot). I would also ponder the question "how does this ajax feature help my users?". If you can't find a good answer to that, you should probably just not use ajax. (NB: "it's faster" is not a valid answer!) -Nick

On Sun, Oct 24, 2010 at 12:30 AM, PeterKerk vettepa...@hotmail.com wrote: Unfortunately it's not online yet, but is there anything I can clarify in more detail? Thanks!
Re: Solr Javascript+JSON not optimized for SEO
Offering both... that sounds to me like duplicating development efforts? Or am I overlooking something here? Nick Jenkin-2 wrote: NB: "it's faster" is not a valid answer! Why is it not valid? Because it's not necessarily faster, or...? And what about user experience? Instead of needing to refresh the entire page I can now do partial page updates? Thanks!
Re: a bug of solr distributed search
On Mon, 2010-10-25 at 11:50 +0200, Andrzej Bialecki wrote: * there is an exact solution to this problem, namely to make two distributed calls instead of one (first call to collect per-shard IDFs for given query terms, second call to submit a query rewritten with the global IDFs). This solution is implemented in SOLR-1632, with some caching to reduce the cost for common queries. I must admit that I have not tried the patch myself. Looking at https://issues.apache.org/jira/browse/SOLR-1632 I see that the last comment is from LiLi with a failed patch, but as there are no further comments it is unclear whether the problem is general or just with LiLi's setup. I might be a bit harsh here, but the other comments on the JIRA issue also indicate that one would have to be somewhat adventurous to run this in production. * another reason is that in many many cases the difference between using exact global IDF and per-shard IDFs is not that significant. If shards are more or less homogenous (e.g. you assign documents to shards by hash(docId)) then term distributions will also be similar. While I agree on the validity of the solution, it does put some serious constraints on the shard setup. To summarize, I would qualify your statement with: ...if the composition of your shards is drastically different. Otherwise the cost of using global IDF is not worth it, IMHO. Do you know of any studies of the differences in ranking with regard to index distribution by hashing, logical grouping, and distributed IDF? Regards, Toke Eskildsen
solr 1.4 suggester component
hi, I was looking into using the solr suggester component as described in http://wiki.apache.org/solr/Suggester I have a file which has words and phrases in it. I was wondering how to make the following possible. The file has: "rebate form", "form". When I look for "form", or even "for", I would like "rebate form" to be included too. I tried using

    <str name="lookupImpl">org.apache.solr.spelling.suggest.jaspell.JaspellLookup</str>

but no luck. The wiki suggests some one-liner change to get fuzzy suggestions, but I'm not sure what that one-liner change would be. Also the wiki suggests: "If you want to use a dictionary file that contains phrases (actually, strings that can be split into multiple tokens by the default QueryConverter) then define a different QueryConverter" - but I don't see the desired result. Here is my solrconfig.xml:

    <searchComponent class="solr.SpellCheckComponent" name="suggest">
      <lst name="spellchecker">
        <str name="name">suggest</str>
        <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
        <!-- <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str> -->
        <str name="lookupImpl">org.apache.solr.spelling.suggest.jaspell.JaspellLookup</str>
        <str name="sourceLocation">american-english.txt</str>
        <str name="field">name</str> <!-- the indexed field to derive suggestions from -->
        <float name="threshold">0.005</float>
        <str name="buildOnCommit">false</str>
        <queryConverter name="queryConverter" class="org.apache.solr.spelling.MySpellingQueryConverter"/>
      </lst>
    </searchComponent>

    <requestHandler class="org.apache.solr.handler.component.SearchHandler" name="/suggest">
      <lst name="defaults">
        <str name="spellcheck">false</str>
        <str name="spellcheck.dictionary">suggest</str>
        <str name="spellcheck.onlyMorePopular">true</str>
        <str name="spellcheck.count">5</str>
        <str name="spellcheck.collate">true</str>
      </lst>
      <arr name="components">
        <str>suggest</str>
      </arr>
    </requestHandler>
Re: AW: FieldCache
On Mon, 2010-10-25 at 09:41 +0200, Mathias Walter wrote: [...] I enabled the field cache for my ID field and another single-char field (PAS type) to get the benefit of accessing the fields with an array. Unfortunately, the IDs are too large to fit in memory. I gave 12 GB of RAM to each node and also tried to use the MMapDirectory and/or CompressedOops. Lucene always runs out of memory. That is a known problem with Lucene 3.x and earlier. The cache uses Strings for the terms, which has a lot of overhead. As you discovered, reducing the length of the IDs does not help much. [Encoding ID as 11 stored bytes] Recently I upgraded to trunk (4.0) and tried to use the BytesRefs from FieldCache.DEFAULT.getTerms directly. But the bytes are encoded in an unknown form (unknown to me) and cannot be decoded with IndexableBinaryStringTools.decode. It depends on what you put into it, but if you represent your IDs as normal Strings at index time, they will be stored in UTF-8 encoding. Since you're using 11 ASCII characters for an ID, this means 11 bytes. You can get your Strings back by calling myBytesRef.utf8ToString(). The overhead of BytesRefs is a lot lower than that of Strings, so simply indexing your IDs and using the field cache might solve your problem when you're using trunk. - Toke
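Roughly, that retrieval looks like the following on trunk. This is a sketch against the late-2010 trunk API, where getTerms returns a FieldCache.DocTerms; the exact signatures have been in flux between trunk revisions, so check your checkout:

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.FieldCache;
    import org.apache.lucene.util.BytesRef;

    public class IdLookupSketch {
      public static void dumpIds(IndexReader reader) throws Exception {
        // one cached BytesRef per document for the "id" field
        FieldCache.DocTerms ids = FieldCache.DEFAULT.getTerms(reader, "id");
        BytesRef ref = new BytesRef();
        for (int docId = 0; docId < reader.maxDoc(); docId++) {
          ids.getTerm(docId, ref);
          // round-trips cleanly only when the field was indexed as a plain string
          System.out.println(docId + " -> " + ref.utf8ToString());
        }
      }
    }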
Re: Modelling Access Control
Many thanks for all the responses. I now plan on benchmarking and validating both the filter query approach, and maintaining the ACL entirely outside of Solr. I'll decide from there. Paul
Re: FieldCache
On Mon, Oct 25, 2010 at 3:41 AM, Mathias Walter mathias.wal...@gmx.net wrote: I indexed about 90 million sentences and the PAS (predicate argument structures) they consist of (about 500 million). Then I try to do NER (named entity recognition) by searching about 5 million entities. For each entity I need all the search results, not just the top X. Since about 10 percent of the entities are highly frequent (i.e. there are more than 5 million hits for "human"), it takes very long to obtain the data from the index. Very long means about a day with 15 distributed Katta nodes. Katta is just a distribution and shard balancing solution on top of Lucene. If you aren't getting top-N results/doing search, are you sure a search engine library/server is the right tool for this job? Then I tried to use IndexableBinaryStringTools to re-encode my 11-byte array. The size increased to 7 characters (= 14 bytes), which is still a gain of more than 50 percent compared to the UTF-8 encoding. BTW: I found no sample of how to use the IndexableBinaryStringTools class except in the unit tests. It is deprecated in trunk, because you can index binary terms (your own byte[]) directly if you want. To do this, you need to use a custom AttributeFactory. See src/test/org/apache/lucene/index/Test2BTerms or https://issues.apache.org/jira/browse/LUCENE-2551 for examples of how to do this.
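For reference, a round trip with IndexableBinaryStringTools looks roughly like this, using the array-based overloads (older releases expose ByteBuffer/CharBuffer variants instead, so adjust to your version):

    import java.util.Arrays;
    import org.apache.lucene.util.IndexableBinaryStringTools;

    public class BinaryIdRoundTrip {
      public static void main(String[] args) {
        byte[] id = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11}; // an 11-byte packed ID

        int encLen = IndexableBinaryStringTools.getEncodedLength(id, 0, id.length);
        char[] encoded = new char[encLen];
        IndexableBinaryStringTools.encode(id, 0, id.length, encoded, 0, encLen);

        // the char[] must reach the index untouched; any intermediate
        // re-encoding (e.g. through a plain UTF-8 String field) breaks it
        int decLen = IndexableBinaryStringTools.getDecodedLength(encoded, 0, encoded.length);
        byte[] decoded = new byte[decLen];
        IndexableBinaryStringTools.decode(encoded, 0, encoded.length, decoded, 0, decLen);

        System.out.println(Arrays.equals(id, decoded)); // expect: true
      }
    }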
Re: Integrating Carrot2/Solr Deafult Example
On Oct 24, 2010, at 1:45 PM, Eric Martin wrote: Hello, Welcome to all. I am a very basic user. I have limited knowledge. I read the documentation, I have an 'example' Solr installation working on my server. I have Drupal 6. I have Drupal using Solr (apachesolr) as its default search engine. I have 1 document in the database that is searchable for testing purposes. I would like to know, if I am using all default paths in my Solr installation, how do I enable Carrot2? Once enabled, how do I verify that it is clustering properly? You would verify it is working by asking it to do some clustering and getting back cluster results. Can you run the example in the wiki page and get results? The Carrot2 doc I read: http://download.carrot2.org/head/manual/index.html#chapter.application-suite The Solr clustering wiki I read: http://wiki.apache.org/solr/ClusteringComponent I know this is really basic stuff and I really appreciate the help. I fumbled my way through installing Solr on my own, setting up Drupal, etc. I am a former Natural V2 3270 programmer (basic flat file OO) and have limited experience in PHP, Java, Jetty etc. However, I can read code, decipher what it is doing, and find a solution and then implement it. I just really have no foundation for Carrot2/Solr, yet. Any help, pointers and "look here"s would very much be appreciated.
RE: FieldCache
Hi Mathias, [...] I tried to use IndexableBinaryStringTools to re-encode my 11 byte array. The size was increased to 7 characters (= 14 bytes) which is still a gain of more than 50 percent compared to the UTF8 encoding. BTW: I found no sample how to use the IndexableBinaryStringTools class except in the unit tests. IndexableBinaryStringTools will eventually be deprecated and then dropped, in favor of native indexable/searchable binary terms. More work is required before these are possible, though. Well-maintained unit tests are not a bad way to describe functionality... I assume that the char[] returned from IndexableBinaryStringTools.encode is encoded in UTF-8 again and then stored. At some point the information is lost and cannot be recovered. Can you give an example? This should not happen. Steve
RE: FieldCache
Hi Robert, On 10/25/2010 at 8:20 AM, Robert Muir wrote: it is deprecated in trunk, because you can index binary terms (your own byte[]) directly if you want. To do this, you need to use a custom AttributeFactory. It's not actually deprecated yet. See src/test/org/apache/lucene/index/Test2BTerms or https://issues.apache.org/jira/browse/LUCENE-2551 for examples of how to do this. AFAICT, Test2BTerms only deals with the indexing side of this issue, and doesn't test searching. LUCENE-2551 does, however, test searching. Why hasn't this been committed yet? I had just assumed that it was because fully indexable/searchable binary terms were not yet ready for prime time. I hadn't realized that native binary terms were fully functional - is there any reason why integers (for example) could not be directly indexable/searchable? Steve
Re: Modelling Access Control
On Mon, Oct 25, 2010 at 8:16 AM, Paul Carey paul.p.ca...@gmail.com wrote: Many thanks for all the responses. I now plan on benchmarking and validating both the filter query approach, and maintaining the ACL entirely outside of Solr. I'll decide from there. Paul Great. I am looking forward to some feedback on the benchmarks. -- °O° Good Enough is not good enough. To give anything less than your best is to sacrifice the gift. Quality First. Measure Twice. Cut Once. http://www.israelekpo.com/
Re: EmbeddedSolrServer with one core and schema.xml loaded via ClassLoader, is it possible?
I've found two ways which allow me to load all the config files from a jar file; however, with the first solution I cannot specify the dataDir. This is the first way:

    System.setProperty("solr.solr.home", solrHome);
    CoreContainer.Initializer initializer = new CoreContainer.Initializer();
    CoreContainer coreContainer = initializer.initialize();
    EmbeddedSolrServer server = new EmbeddedSolrServer(coreContainer, coreName);

This is what http://wiki.apache.org/solr/Solrj suggests; however, using this way it's not possible to specify the dataDir, which is, by default, ${solr.solr.home}/data/index. This is my attempt to do the same, but in a way that lets me specify the dataDir:

    System.setProperty("solr.solr.home", solrHome);
    System.setProperty("solr.core.dataDir", dataDir);
    CoreContainer coreContainer = new CoreContainer();
    SolrConfig solrConfig = new SolrConfig();
    IndexSchema indexSchema = new IndexSchema(solrConfig, null, null);
    SolrCore core = new SolrCore(dataDir, indexSchema);
    core.setName(coreName);
    coreContainer.register(core, false);
    EmbeddedSolrServer server = new EmbeddedSolrServer(coreContainer, coreName);

Do you see any problems with the second solution? Is there a better way? Paolo

Paolo Castagna wrote: Hi, I am trying to use EmbeddedSolrServer with just one core and I'd like to load solrconfig.xml, schema.xml and other configuration files from a jar via getResourceAsStream(...). I've tried to use SolrResourceLoader, but all my attempts failed with a RuntimeException: Can't find resource [...]. Is it possible to construct an EmbeddedSolrServer loading all the config files from a jar file? Thank you in advance for your help, Paolo
RE: How to index on basis of a condition?
Assuming you're talking about data that comes from a DB, I find it easiest to do this kind of logic on the DB's side (mysql example):

    SELECT IF(someField = someValue, desiredValue, NULL) AS desiredName FROM someTable

If that's not possible, you can use the RegexTransformer (http://wiki.apache.org/solr/DataImportHandler#RegexTransformer) or (worst case and worst performance) the ScriptTransformer (http://wiki.apache.org/solr/DataImportHandler#ScriptTransformer) and actually write a JS script to do your logic. Ephraim Ofir

-Original Message- From: Jan Høydahl / Cominvent [mailto:jan@cominvent.com] Sent: Monday, October 25, 2010 10:23 AM To: solr-user@lucene.apache.org Subject: Re: How to index on basis of a condition?

Do you want to use a field's content to decide whether the document should be indexed or not? You could write an UpdateProcessor for that, simply aborting the chain for the docs that don't pass your test.

    @Override
    public void processAdd(AddUpdateCommand cmd) throws IOException {
      SolrInputDocument doc = cmd.getSolrInputDocument();
      String value = (String) doc.getFieldValue("myfield");
      String condition = "foobar";
      if (condition.equals(value)) {
        super.processAdd(cmd);
      }
    }

But if what you meant was to skip only that field if it does not match the condition, you could use doc.removeField(name) instead. Now you can feed your content using whatever method you like. -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com

On 25. okt. 2010, at 08.38, Pawan Darira wrote: Hi, I want to index a particular field based on an if() condition. Can I do it through DIH? Please suggest. -- Thanks, Pawan Darira
Re: FieldCache
On Mon, Oct 25, 2010 at 9:00 AM, Steven A Rowe sar...@syr.edu wrote: It's not actually deprecated yet. You are right! Only in my patch! AFAICT, Test2BTerms only deals with the indexing side of this issue, and doesn't test searching. LUCENE-2551 does, however, test searching. Why hasn't this been committed yet? I had just assumed that it was because fully indexable/searchable binary terms were not yet ready for prime time. I hadn't realized that native binary terms were fully functional - is there any reason why integers (for example) could not be directly indexable/searchable? They are! Term itself now holds a BytesRef behind the scenes, and pretty much everything is fully functional (for example, the collated sort use case works with the patch in LUCENE-2551). But the short answer is we still need to fix TermRangeQuery to just work on bytes. The problem is I didn't link the dependent issue: LUCENE-2514 (I just did this). There is a patch to fix all the range query stuff there... it's not finished, but not far off. The basic idea is to make using [ICU]CollationAnalyzer the supported way of doing this, including queryparser support, etc. The long answer is that even after LUCENE-2514 is resolved, there are still some things to figure out: for example, how should we properly expose stuff like this in Solr? Do we really need to modify the TokenizerFactories to take an AttributeFactory and add an AttributeFactoryFactory? Or is it better to add a Solr fieldtype for these kinds of things, and do it that way? Or we could just add a special CollatedKeywordTokenizerFactory with the current model that supports the sorting use case easily - but we still want range query support, I think...
ApacheCon Atlanta next week
Hi All, Just a couple of notes about ApacheCon next week for those who either are attending or are thinking of attending. 1. There will be Lucene and Solr 2-day trainings done by Erik Hatcher (Solr) and me (Lucene). It's not too late to sign up. See http://na.apachecon.com/c/acna2010/schedule/grid 2. We've got a good deal of content on Lucene, Solr, Tika, Mahout, etc. planned for the week (Thursday and Friday). Again, see http://na.apachecon.com/c/acna2010/schedule/grid 3. There will be a Meetup on Tuesday night. See http://wiki.apache.org/apachecon/ApacheMeetupsNa10. On this front, we are looking for people interested in giving 20-30 min. presentations on what they are doing with any of the Lucene ecosystem technologies. If you are interested, let me know. Otherwise, we will likely make it more informal as a networking/Q&A meetup. Hope to see you there, Grant
Re: solr 1.4 suggester component
Try here: http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/ For the infix-type match you're using, you might not want the edge version of ngram... Best Erick

On Mon, Oct 25, 2010 at 8:16 AM, abhayd ajdabhol...@hotmail.com wrote: hi, I was looking into using the solr suggester component as described in http://wiki.apache.org/solr/Suggester I have a file which has words and phrases in it. I was wondering how to make the following possible. The file has: "rebate form", "form". When I look for "form", or even "for", I would like "rebate form" to be included too. [...]
How to use AND as opposed to OR as the default query operator.
Hi Everybody, I simply want to use AND as the default operator in queries. When a user searches for "Jennifer Lopez", solr converts this to a "Jennifer OR Lopez" query. On the other hand, I want solr to treat this query as "Jennifer AND Lopez" and not as "Jennifer OR Lopez". In other words, I want a default AND behavior in phrase queries instead of OR. I have seen in this presentation http://www.slideshare.net/pittaya/using-apache-solr on slide number 52 that this OR behavior is configurable. Could you please tell me where this configuration is located? I could not locate it in schema.xml. Swapnonil Mukherjee +91-40092712 +91-9007131999
Re: How to use AND as opposed to OR as the default query operator.
http://wiki.apache.org/solr/SchemaXml#Default_query_parser_operator

On Monday 25 October 2010 15:41:50 Swapnonil Mukherjee wrote: Hi Everybody, I simply want to use AND as the default operator in queries. When a user searches for "Jennifer Lopez", solr converts this to a "Jennifer OR Lopez" query. [...] Could you please tell me where this configuration is located? I could not locate it in schema.xml. Swapnonil Mukherjee -- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536600 / 06-50258350
DataImporter using pure solr add XML
Looking at DataImporter I'm not sure if it's possible to import using a standard <add><doc>...</doc></add> xml document representing a document add operation. Generating <add><doc> is quite expensive in my application, and I have cached all those documents into a text column in a MySQL database. It would be easier for me to push all updated documents directly from the database instead of passing multiple xml files posted in stream mode to Solr. Thank you. Dario.
Re: How to use AND as opposed to OR as the default query operator.
Which query handler are you using? For the standard query handler you can set q.op per request or set defaultOperator in schema.xml. For a dismax handler you will have to work with min-should-match (mm).

On Mon, Oct 25, 2010 at 6:41 AM, Swapnonil Mukherjee swapnonil.mukher...@gettyimages.com wrote: Hi Everybody, I simply want to use AND as the default operator in queries. When a user searches for "Jennifer Lopez", solr converts this to a "Jennifer OR Lopez" query. [...] Could you please tell me where this configuration is located? I could not locate it in schema.xml. Swapnonil Mukherjee +91-40092712 +91-9007131999
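For the per-request variant, a quick SolrJ sketch (the server URL and query text are illustrative):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

    public class DefaultOperatorExample {
      public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrQuery query = new SolrQuery("Jennifer Lopez");
        query.set("q.op", "AND"); // require all terms, for this request only
        System.out.println(server.query(query).getResults().getNumFound());
      }
    }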
Re: How to use AND as opposed to OR as the default query operator.
Hi Pradeep, I am using the standard query parser. I made the changes in schema.xml and it works. It is also good to know that this can be done on a per-query basis as well. Swapnonil Mukherjee

On 25-Oct-2010, at 7:48 PM, Pradeep Singh wrote: Which query handler are you using? For the standard query handler you can set q.op per request or set defaultOperator in schema.xml. For a dismax handler you will have to work with min-should-match (mm). [...]
Re: solr 1.4 suggester component
hi erick, Thanks for the link. The problem is we don't want to have another solr core for implementing this, so I was trying the suggester component, as it allows file-based auto-suggest. It works fine; the only issue is how to get the prefix ignored. Any idea?
London open-source search social - 28th Oct - NEW VENUE
Just a reminder that we're meeting this Thursday near St James Park/Westminster. Details on the Meetup page: http://www.meetup.com/london-search-social/ Rich -- Richard Marr
Re: OutOfMemory and auto-commit
Yes, that's my question too. Anyone? Dennis Gearon wrote: How is this avoided? Dennis Gearon --- On Thu, 10/21/10, Lance Norskog goks...@gmail.com wrote: From: Lance Norskog goks...@gmail.com Subject: Re: OutOfMemory and auto-commit To: solr-user@lucene.apache.org Date: Thursday, October 21, 2010, 9:53 PM Yes. Indexing activity suspends until the commit finishes, then starts again. Having both queries and indexing on the same Solr will have this memory problem. Lance On Thu, Oct 21, 2010 at 1:16 PM, Jonathan Rochkind rochk...@jhu.edu wrote: If I do _not_ have any auto-commit enabled, and add 500k documents and commit at the end, no problem. If I instead set auto-commit maxDocs to 100000 (a pretty large number), and try to add 500k docs, with autocommits theoretically happening every 100k... I run into an OutOfMemory error. Can anyone think of any reasons that would cause this, and how to resolve it? All I can think of is that in the first case, my newSearcher and firstSearcher warming queries don't run until the 'document add' is completely done. In the second case, there are newSearcher and firstSearcher warming queries happening at the same time another process is continuing to stream 'add's to Solr. Although at a maxDocs of 100000 I shouldn't (I think) get _overlapping_ warming queries; the warming queries should be done before the next commit, I think. But nonetheless, just the fact that warming queries are happening at the same time 'add's are continuing to stream - could that be enough to somehow increase memory usage enough to run into OOM? -- Lance Norskog goks...@gmail.com
Re: Modelling Access Control
Dennis Gearon wrote: why use filter queries? Wouldn't reducing the set headed into the filters by putting it in the main query be faster? (A question to learn, since I do NOT know :-) No. At least as I understand it. In the best case, the filter query will be a lot faster, because filter queries are cached separately in the filter cache. So if the existing filter query can be found in the cache, it'll be a lot faster. If it's not in the cache, the performance should be pretty much the same as if you had included it as an additional clause in the main q query. The reasons to put it in an fq filter are: 1) The caching behavior. You can have that certain part of the query be cached on its own, speeding up any subsequent queries that use that same fq. 2) Simplification of client code. You can leave your 'q' however you want it, using whatever kind of query parser you want too (dismax, whatever), and just add on the 'fq' without touching the 'q'. This is a lot easier to do, and, especially when you're using it for access control like this, a lot harder for a bug to creep in. Jonathan
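In SolrJ terms, the split Jonathan describes is just the following (the field name acl_groups and the group clause are made up for illustration):

    import org.apache.solr.client.solrj.SolrQuery;

    public class AclFilterQuery {
      /** The user input stays in q; the ACL restriction rides along as fq. */
      public static SolrQuery build(String userInput, String groupClause) {
        SolrQuery query = new SolrQuery(userInput);
        // the fq is cached in the filterCache independently of q,
        // so repeated ACL clauses are answered from the cache
        query.addFilterQuery("acl_groups:(" + groupClause + ")");
        return query;
      }
    }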
Re: How to use AND as opposed to OR as the default query operator.
However, for user-entered queries, I suggest you take a look at dismax; it is a lot more suitable for user-entered queries than the standard solr-lucene query parsers. Markus Jelsma wrote: http://wiki.apache.org/solr/SchemaXml#Default_query_parser_operator [...]
Re: Modelling Access Control
I'll also be interested in how that works for you. Bringing out the whole dataset, not filtered by some kind of access control, will mean that you will then have to do the filtering of the result set in your server-side/command-line program. So the speed comparison of the filter query vs. the outside-language environment will be very interesting :-) I will also do this, but in about 3-5 months. I will report on it then. Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others' mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die. --- On Mon, 10/25/10, Paul Carey paul.p.ca...@gmail.com wrote: From: Paul Carey paul.p.ca...@gmail.com Subject: Re: Modelling Access Control To: solr-user@lucene.apache.org Date: Monday, October 25, 2010, 5:16 AM Many thanks for all the responses. I now plan on benchmarking and validating both the filter query approach, and maintaining the ACL entirely outside of Solr. I'll decide from there. Paul
Re: Modelling Access Control
Thanks for that insight, a lot. Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others' mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die. --- On Mon, 10/25/10, Jonathan Rochkind rochk...@jhu.edu wrote: From: Jonathan Rochkind rochk...@jhu.edu Subject: Re: Modelling Access Control To: solr-user@lucene.apache.org Date: Monday, October 25, 2010, 8:19 AM Dennis Gearon wrote: why use filter queries? Wouldn't reducing the set headed into the filters by putting it in the main query be faster? [...]
Does anyone notice this site?
I happened to bump into this site: http://www.solr.biz/ They said they are also developing a search engine? Does this have any connection to the open-source Solr?
RE: Does anyone notice this site?
This is not legal advice. Take this as it is. Just off my head and what I know. I did not research this, but could, if Solr wants me to. From a marketing standpoint, probably. From a legal standpoint: they can do whatever they want with the name Solr so long as they maintain a distance between any trademarked name and the fundamental use of the trademark, unless there is a substantial connection between the trademark name and recognition. Of course, that is to be determined by a few factors: length in business, trademarks carried, whether or not the offending trademark makes a claim (not making a claim limits your recovery substantially and may even null it), etc. They are also in South Africa, so throw in international law. Of course, you also have fair use law. Well, this can get tricky. Here is an example: myspace.com and moremyspace.com. If moremyspace.com is used as a social networking site, then myspace has a claim. If it is used as a social networking site in parody, then myspace has no legal claim whatsoever. Another example is booble.com (not a work-safe link!). That case lasted many years and google lost. Trademarks are a very tricky business and one that I will never practice. Anyway, seeing as how they are making a search engine, they are using a lower-level FQDN, and they have not made a dent in the industry, it would be futile to do anything but send them an email laying claim to the name Solr. *If you do not send them a letter/email laying claim to Solr, you will lose your rights to fight that battle with IANA, etc., or the ability to seek legal remedy.* Eric Law Student - Second Year

-Original Message- From: scott chu [mailto:scott@udngroup.com] Sent: Monday, October 25, 2010 9:55 AM To: solr-user@lucene.apache.org Subject: Does anyone notice this site?

I happened to bump into this site: http://www.solr.biz/ They said they are also developing a search engine? Does this have any connection to the open-source Solr?
Re: Does anyone notice this site?
On Oct 25, 2010, at 12:54 PM, scott chu wrote: I happened to bump into this site: http://www.solr.biz/ They said they are also developing a search engine? Does this have any connection to the open-source Solr? No, there is no connection, and they likely should not be using the name that way, as Solr is a TM of the ASF.
Re: a bug of solr distributed search
On 2010-10-25 13:37, Toke Eskildsen wrote: On Mon, 2010-10-25 at 11:50 +0200, Andrzej Bialecki wrote: * there is an exact solution to this problem, namely to make two distributed calls instead of one (first call to collect per-shard IDFs for given query terms, second call to submit a query rewritten with the global IDFs). This solution is implemented in SOLR-1632, with some caching to reduce the cost for common queries. I must admit that I have not tried the patch myself. Looking at https://issues.apache.org/jira/browse/SOLR-1632 I see that the last comment is from LiLi with a failed patch, but as there are no further comments it is unclear whether the problem is general or just with LiLi's setup. I might be a bit harsh here, but the other comments on the JIRA issue also indicate that one would have to be somewhat adventurous to run this in production.

Oh, definitely this is not production quality yet - there are known bugs, for example, that I need to fix, and then it needs to be forward-ported to trunk. It shouldn't be too much work to bring it back into a usable state.

* another reason is that in many many cases the difference between using exact global IDF and per-shard IDFs is not that significant. If shards are more or less homogenous (e.g. you assign documents to shards by hash(docId)) then term distributions will also be similar. While I agree on the validity of the solution, it does put some serious constraints on the shard setup.

True. But this is the simplest setup that may just be enough.

To summarize, I would qualify your statement with: ...if the composition of your shards is drastically different. Otherwise the cost of using global IDF is not worth it, IMHO. Do you know of any studies of the differences in ranking with regard to indexing distribution by hashing, logical grouping, and distributed IDF?

Unfortunately, this information is surprisingly scarce - research predating the year 2000 is often not applicable, and most current research concentrates on P2P systems, which are really a different ball of wax. Here are a few papers that I found that are related to this issue:

* Global Term Weights in Distributed Environments, H. Witschel, 2007 (Elsevier)
* KLEE: A Framework for Distributed Top-k Query Algorithms, S. Michel, P. Triantafillou, G. Weikum, VLDB'05 (ACM)
* Exploring the Stability of IDF Term Weighting, Xin Fu and Miao Chen, 2008 (Springer Verlag)
* A Comparison of Techniques for Estimating IDF Values to Generate Lexical Signatures for the Web, M. Klein, M. Nelson, WIDM'08 (ACM)
* Comparison of different Collection Fusion Models in Distributed Information Retrieval, Alexander Steidinger - this paper gives a nice comparison framework for different strategies for joining partial results; apparently we use the most primitive strategy explained there, based on raw scores...

These papers likely don't fully answer your question, but at least they provide a broader picture of the issue... -- Best regards, Andrzej Bialecki - Information Retrieval, Semantic Web; Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
DIH wiht several Cores
Hello. I have 7 cores. Each core has its own index and its own import. I want one DIH with a URL like http://host/solr/dih. Is it possible for that single DIH to use a different index folder per core, or is it necessary that each core use its own DIH with the solrconfig from each core?
Re: Does anyone notice this site?
fwiw, our proxy server has blocked this site for malicious content. Peter On Mon, Oct 25, 2010 at 1:25 PM, Grant Ingersoll gsing...@apache.orgwrote: On Oct 25, 2010, at 12:54 PM, scott chu wrote: I happen to bump into this site: http://www.solr.biz/ They said they are also developing a search engine? Is this any connection to open source Solr? No, it is not a connection and they likely should not be using the name that way, as Solr is a TM of the ASF.
Re: Solr ExtractingRequestHandler with Compressed files
There was this issue with the previous version of Solr, wherein only the file names from the zip used to get indexed. We had faced the same issue and ended up using the Solr trunk, which has the Tika version upgraded and works fine. The Solr version 1.4.1 should also have the fix included. Try using it. Regards, Jayendra

On Fri, Oct 22, 2010 at 6:02 PM, Joey Hanzel phan...@nearinfinity.com wrote: Hi, Has anyone had success using ExtractingRequestHandler and Tika with any of the compressed file formats (zip, tar, gz, etc)? I am sending solr the archived.tar file using curl:

    curl "http://localhost:8983/solr/update/extract?literal.id=doc1&fmap.content=body_texts&commit=true" -H 'Content-type:application/octet-stream' --data-binary @/home/archived.tar

The result I get when I query the document is that the filenames inside the archive are indexed as the body_texts, but the content of those files is not extracted or included. This is not the behavior I expected. Ref: http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Content-Extraction-Tika#article.tika.example. When I send one of the actual documents inside the archive using the same curl command, the extracted content is then stored in the body_texts field. Am I missing a step for the compressed files? I have added all the extraction dependencies as indicated by mat in http://outoftime.lighthouseapp.com/projects/20339/tickets/98-solr-cell and am able to successfully extract data from MS Word, PDF, and HTML documents. I'm using the following library versions: Solr 1.4.0, Solr Cell 1.4.1, with Tika Core 0.4. Given everything I have read, this version of Tika should support extracting data from all files within a compressed file. Any help or suggestions would be appreciated.
RE: FieldCache
Hi, On Mon, Oct 25, 2010 at 3:41 AM, Mathias Walter mathias.wal...@gmx.net wrote: [...] If you aren't getting top-N results/doing search, are you sure a search engine library/server is the right tool for this job? No, I'm not sure, but I didn't find another solution. Any other solution also has to create some kind of index and has to provide some search API. Because I need SpanNearQuery and PhraseQuery to find some multi-term entities, I think Solr/Lucene is a good starting point. Also, I need the classic top-N results for the web application, so a single solution is preferred. Then I tried to use IndexableBinaryStringTools to re-encode my 11-byte array. [...] It is deprecated in trunk, because you can index binary terms (your own byte[]) directly if you want. To do this, you need to use a custom AttributeFactory. How do I use it with Solr, i.e. how do I set up a schema.xml using a custom AttributeFactory? -- Kind regards, Mathias
Re: FieldCache
On Mon, Oct 25, 2010 at 3:41 PM, Mathias Walter mathias.wal...@gmx.net wrote: How do I use it with Solr, i.e. how do I set up a schema.xml using a custom AttributeFactory? At the moment there is no way to specify an AttributeFactory (AttributeFactoryFactory? heh) in the schema.xml, nor do the TokenizerFactories have any way to use anything but the default. So, in order to do this at the moment, you need to make a custom TokenizerFactory hardwired to your AttributeFactory... take a look at KeywordTokenizerFactory: you could make a MyKeywordTokenizerFactory that, instead of invoking new KeywordTokenizer(input) in its create() method, would use the KeywordTokenizer(AttributeFactory, Reader, int) ctor with your custom AttributeFactory.
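Concretely, such a hardwired factory could look roughly like this. MyAttributeFactory stands in for whatever custom factory you write (it is not part of Lucene/Solr), 256 is KeywordTokenizer's default buffer size, and package locations vary between versions:

    import java.io.Reader;
    import org.apache.lucene.analysis.KeywordTokenizer;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.util.AttributeSource.AttributeFactory;
    import org.apache.solr.analysis.BaseTokenizerFactory;

    public class MyKeywordTokenizerFactory extends BaseTokenizerFactory {
      // hypothetical custom factory producing binary-capable term attributes
      private final AttributeFactory factory = new MyAttributeFactory();

      @Override
      public Tokenizer create(Reader input) {
        // same as KeywordTokenizerFactory, but routed through our factory
        return new KeywordTokenizer(factory, input, 256);
      }
    }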
command line to check if Solr is up running
As we know, we can use a browser to check if Solr is running by going to http://$hostName:$portNumber/$masterName/admin, say http://localhost:8080/solr1/admin. My question is: are there any ways to check it using the command line? I used curl "http://localhost:8080" to check my Tomcat; it worked fine. However, I get no response if I try curl "http://localhost:8080/solr1/admin" (even when my Solr is running). Does anyone know any command line alternatives? Thanks, Xin
Re: command line to check if Solr is up running
you could look at the ping stuff: http://wiki.apache.org/solr/SolrConfigXml#The_Admin.2BAC8-GUI_Section cheers, rob

On Mon, Oct 25, 2010 at 3:56 PM, Xin Li x...@book.com wrote: As we know, we can use a browser to check if Solr is running by going to http://$hostName:$portNumber/$masterName/admin, say http://localhost:8080/solr1/admin. My question is: are there any ways to check it using the command line? [...]
Re: command line to check if Solr is up running
My question is: are there any ways to check it using the command line? I used curl "http://localhost:8080" to check my Tomcat; it worked fine. However, I get no response if I try curl "http://localhost:8080/solr1/admin" (even when my Solr is running). Does anyone know any command line alternatives? What about curl "solr/admin/ping?echoParams=none&omitHeader=on"
RE: command line to check if Solr is up running
Thanks Bob and Ahmet, curl "http://localhost:8080/solr1/admin/ping" works fine :) Xin

-Original Message- From: Ahmet Arslan [mailto:iori...@yahoo.com] Sent: Monday, October 25, 2010 4:03 PM To: solr-user@lucene.apache.org Subject: Re: command line to check if Solr is up running

My question is: are there any ways to check it using the command line? [...] What about curl "solr/admin/ping?echoParams=none&omitHeader=on"
error in Solr log when adding documents?
Has anyone seen anything like this before? The error message does not give me very much information; I'm not sure what's going on.

    Oct 25, 2010 4:11:02 PM org.apache.solr.common.SolrException log
    SEVERE: org.apache.solr.common.SolrException: ERROR adding document SolrInputDocument [lengthy serialized hash of document being added is here]
        at org.apache.solr.handler.BinaryUpdateRequestHandler$2.document(BinaryUpdateRequestHandler.java:81)
        at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$2.readOuterMostDocIterator(JavaBinUpdateRequestCodec.java:136)
        at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$2.readIterator(JavaBinUpdateRequestCodec.java:126)
        at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:210)
        at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$2.readNamedList(JavaBinUpdateRequestCodec.java:112)
        at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:175)
        at org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:101)
        at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec.unmarshal(JavaBinUpdateRequestCodec.java:141)
        at org.apache.solr.handler.BinaryUpdateRequestHandler.parseAndLoadDocs(BinaryUpdateRequestHandler.java:68)
        at org.apache.solr.handler.BinaryUpdateRequestHandler.access$000(BinaryUpdateRequestHandler.java:46)
        at org.apache.solr.handler.BinaryUpdateRequestHandler$1.load(BinaryUpdateRequestHandler.java:55)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
        at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:852)
        at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
        at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
        at java.lang.Thread.run(Thread.java:619)
Re: DataImporter using pure solr add XML
On Mon, Oct 25, 2010 at 10:12 AM, Dario Rigolin <dario.rigo...@comperio.it> wrote:

Looking at DataImporter, I'm not sure if it's possible to import using a standard <add><doc>...</doc></add> XML document representing a document add operation. Generating <add><doc> is quite expensive in my application, and I have cached all those documents in a text column in a MySQL database. It would be easier for me to push all updated documents directly from the database instead of passing them via multiple XML files posted in stream mode to Solr. Thank you. Dario.

Dario,

Technically, nothing is stopping you from using the DIH to import your XML document(s). However, note that the <add><doc>...</doc></add> structure is not required. In fact, you can make up your own structure for the documents, so long as you configure the DIH to recognize it. At minimum, you should be able to use something to the effect of:

    <dataSource type="FileDataSource" encoding="UTF-8" />
    <document>
      <entity name="some_unique_name_for_the_entity"
              rootEntity="false"
              dataSource="null"
              processor="FileListEntityProcessor"
              fileName="some_regex_matching_your_files.*\.xml$"
              baseDir="/path/to/xml/files"
              newerThan="${dataimporter.some_unique_name_for_the_entity.last_index_time}">
        <entity name="another_unique_entity_name"
                dataSource="some_unique_name_for_the_entity"
                processor="XPathEntityProcessor"
                url="${some_unique_name_for_the_entity.fileAbsolutePath}"
                forEach="/XMLROOT/CHILD_NODE"
                stream="true">
          <!-- An optional list of <field /> definitions if your XML schema does not match that of Solr -->
        </entity>
      </entity>
    </document>

The breakdown is as follows: The <dataSource /> defines the document encoding that Solr should use for your XML files. The top-level <entity /> creates the list of files to parse (hence why the fileName attribute supports regular expressions). The dataSource attribute needs to be set to "null" here (I'm using 1.4.1, and AFAIK this is the same in 1.3 as well). The rootEntity="false" is important to tell Solr that it should not try to define fields from this entity. The second-level <entity /> is where the documents found in the file list are processed and parsed. Its dataSource attribute needs to be the name of the top-level <entity />. The url attribute is defined as the absolute path to the file generated by the top-level entity. The forEach attribute is the key component here; it is the minimum XPath needed to iterate over your document structure. So if, for example, you had:

    <XMLROOT>
      <CHILD_NODE>
        <field1>data</field1>
        <field2>more data</field2>
        ...
      </CHILD_NODE>
    </XMLROOT>

then forEach="/XMLROOT/CHILD_NODE" would iterate over each CHILD_NODE. Also note that, in my experience, case sensitivity matters when writing your XPath instructions. I hope this helps!

- Ken Stanley
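For the sample structure above, the optional <field /> definitions inside the second-level entity might look like the following sketch. The column names field1/field2 are just the placeholders from the example; map them to whatever actually exists in your schema.xml:

    <!-- one <field /> per XML element you want mapped to a Solr field -->
    <field column="field1" xpath="/XMLROOT/CHILD_NODE/field1" />
    <field column="field2" xpath="/XMLROOT/CHILD_NODE/field2" />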
replication with multicores
On my master, for the forum core, I have the following in forum/conf/solrconfig.xml:

    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="master">
        <str name="replicateAfter">startup</str>
        <str name="replicateAfter">commit</str>
        <str name="replicateAfter">optimize</str>
      </lst>
    </requestHandler>

Then on the slave, for the forum core, I have the following in forum/conf/solrconfig.xml:

    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="slave">
        <str name="masterUrl">http://host.domain.com:8983/solr/forum/replication</str>
        <str name="pollInterval">00:00:20</str>
        <str name="compression">internal</str>
      </lst>
    </requestHandler>

Then if I hit the following URL for my master, http://host.domain.com:8983/solr/forum/admin/replication/index.jsp, I see:

    Local Index
    Index Version: 1278007696445, Generation: 38534
    Location: /data/solr/product/index

It is replicating another core's data... not sure how or why. Any pointers to what I might be doing wrong? And replication is working for the product core, but I don't have anything set up in that core.

Mike
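One thing worth checking, as an assumption since the full setup isn't shown here: replication is configured per core, and each slave core's masterUrl must point at the matching core on the master. If the product core is somehow picking up the forum core's config (for example via a shared or copied solrconfig.xml), that could produce exactly this kind of cross-core replication. A sketch of what the product core's slave config would look like, assuming the same host and port as above:

    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="slave">
        <!-- note the core name in the URL: product, not forum -->
        <str name="masterUrl">http://host.domain.com:8983/solr/product/replication</str>
        <str name="pollInterval">00:00:20</str>
      </lst>
    </requestHandler>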
Re: DIH with several Cores
Unfortunately, what you are asking for is not possible. The DIH needs to be configured separately for each core. I have a similar situation with my Solr application. I am solving it by creating a custom index feeder that is aware of all of the cores and of which documents to send to which core.
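To make the "custom index feeder" idea concrete, here is a minimal sketch of routing documents to per-core SolrJ clients. The core names, URLs, and the "type"-field routing rule are made up for illustration; the poster's actual feeder is not shown in the thread:

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class MultiCoreFeeder {
        // One SolrJ client per core, keyed by a routing name.
        private final Map<String, SolrServer> cores = new HashMap<String, SolrServer>();

        public MultiCoreFeeder() throws Exception {
            // Hypothetical core names and URLs.
            cores.put("forum", new CommonsHttpSolrServer("http://localhost:8983/solr/forum"));
            cores.put("product", new CommonsHttpSolrServer("http://localhost:8983/solr/product"));
        }

        /** Route a document to a core; the "type" field rule is just an example. */
        public void feed(SolrInputDocument doc) throws Exception {
            String coreName = (String) doc.getFieldValue("type");
            SolrServer target = cores.get(coreName);
            if (target == null) {
                throw new IllegalArgumentException("No core for type: " + coreName);
            }
            target.add(doc);
        }

        public void commitAll() throws Exception {
            for (SolrServer server : cores.values()) {
                server.commit();
            }
        }
    }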
Re: Failing to successfully import international characters via DIH
As it turns out, the issue was somewhere in MySQL. I'm not sure exactly where, but it was something to do with BLOB. I changed the text field from BLOB to VARCHAR and started using mysql_real_escape_string in my PHP code, and everything started working just fine. Thanks for the help.
After the slave node pulls the index from the master, when will Solr delete the tmp index dir?
I noticed that the slave node has some temporary index.x directories that were created during index sync with the master, but they are not removed even after several days. So when will Solr delete the temporary index directories?