Re: Preserve XML hierarchy
On Thu, Jul 14, 2011 at 8:43 PM, Lucas Miguez lucas.mig...@gmail.com wrote: Thanks for your help! The DIH XPathEntityProcessor helps me index the XML files, but does it also help me know which node a match comes from? Following the example in my previous post: imagine the user searches for the word "zona"; then I have to show the TitleP, the TextP, the TitlePart, the TextPart and all the TextSubPart elements that are children of gSubPart. Well, I tried to create TextPart, TitlePart, etc. with the XPath expression of the location in the original XML, using dynamic fields, for example: <dynamicField name="TextPart*" multiValued="true" indexed="true" ... /> (there should be no space between TextPart and * to have the XPath associated with the field), but I don't know how to search across all TextPart* fields... [...] You can search in individual fields, e.g. with ?q=TitlePart:myterm. For searching in all TextPart* fields, the easiest way is probably to copy the fields into a single full-text search field. With the default Solr schema, this can be done by adding a directive like <copyField source="TextPart*" dest="text"/>. This copies all matching fields into the field "text", which is searched by default. Thus, ?q=myterm will find myterm in all TextPart* fields. Regards, Gora
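For the archives, a minimal schema.xml sketch of what Gora describes. The field type "text" and the stored/indexed flags below are illustrative assumptions, not taken from Lucas's actual schema:

```xml
<!-- catch-all dynamic field for the XPath-derived field names -->
<dynamicField name="TextPart*" type="text" indexed="true" stored="true" multiValued="true"/>

<!-- single full-text field that everything gets funneled into -->
<field name="text" type="text" indexed="true" stored="false" multiValued="true"/>

<!-- copy every TextPart* field into "text" at index time -->
<copyField source="TextPart*" dest="text"/>
```

With `<defaultSearchField>text</defaultSearchField>` in place (as in the stock example schema of that era), ?q=myterm then searches every TextPart* field at once while the individual fields remain queryable by name.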
Re: SolrJ Collapsable Query Fails
Hi, Thanks for the information. However, I still have one more problem. I am iterating over the values of the NamedList. I have 2 values, one being 'responseHeader' and the other being 'grouped'. I would like to access some information stored within the grouped section, which has data structured like so: grouped={attr_directory={matches=4,groups=[{groupValue=C:\Users\rvassallo\Desktop\Index,doclist={numFound=2,start=0,docs=[SolrDocument[{attr_meta=[Author, kcook, Last-Modified, 2011-03-02T14:14:18Z... With the 'get(group)' method I am only able to access the entire '{attr_directory={matches=4,g...' section. Is there some functionality which allows me to get other data? Something like this for instance: 'get(group.matches)' or maybe 'get(group.attr_directory.matches)' (which would yield the value 4), or do I need to process the String that 'get(...)' returns to get what I need? Thanks :) On Thu, Jul 14, 2011 at 12:52 PM, Ahmet Arslan iori...@yahoo.com wrote: See Yonik's reply: http://search-lucene.com/m/tCmky1v94D92/ In short you need to use NamedList<Object> getResponse(). I am currently trying to run a collapsing (group) query with SolrJ on Solr 3.3. The problem is that when I run the query through the web interface, with this URL: http://localhost:8080/solr/select/?q=attr_content%3Alynx&sort=attr_location+desc&group=true&group.field=attr_directory I am able to see the XML which is returned.
The problem, though, is that when I try to run the same query through SolrJ, using this code:

SolrQuery queryString = new SolrQuery();
for (String param : query.keySet()) {
    if (param.equals("fq")) {
        queryString.addFilterQuery(query.get(param));
    } else {
        queryString.setParam(param, query.get(param));
    }
}
System.out.println(queryString.toString());
QueryResponse response = server.query(queryString); // Exception takes place at this line
SolrDocumentList docList = response.getResults();

which constructs a URL like so: q=attr_content%3Alynx&sort=attr_location+desc&group=true&group.field=attr_directory This throws an exception: Caused by: org.apache.solr.common.SolrException: parsing error at org.apache.solr.client.solrj.impl.XMLResponseParser.processResponse(XMLResponseParser.java:145) at org.apache.solr.client.solrj.impl.XMLResponseParser.processResponse(XMLResponseParser.java:106) at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:477) at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244) at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:89) ... 3 more Caused by: javax.xml.stream.XMLStreamException: ParseError at [row,col]:[3,30088] Message: error reading value:LST at org.apache.solr.client.solrj.impl.XMLResponseParser.readArray(XMLResponseParser.java:324) at org.apache.solr.client.solrj.impl.XMLResponseParser.readNamedList(XMLResponseParser.java:245) at org.apache.solr.client.solrj.impl.XMLResponseParser.readNamedList(XMLResponseParser.java:244) at org.apache.solr.client.solrj.impl.XMLResponseParser.readNamedList(XMLResponseParser.java:244) at org.apache.solr.client.solrj.impl.XMLResponseParser.processResponse(XMLResponseParser.java:130) I have tried it with both Jetty and Tomcat; the error is the same for both. I have managed to get other queries to run (with both servers), so I presume that the problem lies with this particular type of query.
Any insight on this problem will be highly appreciated, Thanks :)
POST VS GET and NON English Characters
Hello, We have implemented Solr search in several languages. Initially we used the GET method for querying, but later moved to the POST method to accommodate lengthy queries. When we moved from GET to POST, German characters could no longer be searched, and I had to use the function utf8_decode in my application for the search to work for German characters. Currently I am doing this while querying using the POST method (we are using the standard request handler): $this->_queryterm = iconv("UTF-8", "ISO-8859-1//TRANSLIT//IGNORE", $this->_queryterm); This makes the query work for German characters and other languages, but does not work for certain characters in Lithuanian and Spanish. Examples: *Not working* - Iš - Estremadūros - sNaująjį - MEDŽIAGOTYRA - MEDŽIAGOS - taškuose *Working* - garbę - ieškoti - ispanų Any ideas/input? Regards Sujatha
Re: POST VS GET and NON English Characters
Hi Arun, This looks like an encoding issue to me. Can you change your browser settings to UTF-8 and hit the search URL via the GET method? We faced a similar problem with Chinese and Korean languages, and this solved it. / Pankaj Bhatt. 2011/7/15 Sujatha Arun suja.a...@gmail.com Hello, We have implemented Solr search in several languages. Initially we used the GET method for querying, but later moved to the POST method to accommodate lengthy queries. When we moved from GET to POST, German characters could no longer be searched, and I had to use the function utf8_decode in my application for the search to work for German characters. Currently I am doing this while querying using the POST method (we are using the standard request handler): $this->_queryterm = iconv("UTF-8", "ISO-8859-1//TRANSLIT//IGNORE", $this->_queryterm); This makes the query work for German characters and other languages, but does not work for certain characters in Lithuanian and Spanish. Examples: *Not working* - Iš - Estremadūros - sNaująjį - MEDŽIAGOTYRA - MEDŽIAGOS - taškuose *Working* - garbę - ieškoti - ispanų Any ideas/input? Regards Sujatha
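If the servlet container is decoding request parameters with its platform default charset, forcing UTF-8 at both layers is the usual fix. A hedged sketch for Tomcat's server.xml (note the URIEncoding attribute only affects GET query strings; port and protocol values below are just the stock defaults):

```xml
<!-- server.xml: decode GET query strings as UTF-8 -->
<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           URIEncoding="UTF-8"/>
```

For POST bodies, the container honors the request's character encoding instead, so make sure the client sends Content-Type: application/x-www-form-urlencoded; charset=UTF-8, or call request.setCharacterEncoding("UTF-8") in a servlet filter before any parameter is read. With both in place the iconv/utf8_decode workaround should no longer be needed.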
Re: SolrCloud Sharding
Thanks Shalin. I don't necessarily have an issue running off this patch, but before I do that or implement my own sharding logic, I wonder if you could let me know your thoughts on the stability of the patch? How well it works, basically. On Thu, Jul 14, 2011 at 4:51 PM, Shalin Shekhar Mangar shalinman...@gmail.com wrote: On Thu, Jul 14, 2011 at 12:29 AM, Jamie Johnson jej2...@gmail.com wrote: Reading the SolrCloud wiki I see that there are goals to support different sharding algorithms; what is currently implemented today? Is the sharding logic the responsibility of the application doing the indexing? Nothing has been committed to trunk yet. So, right now, sharding is the responsibility of the client. You may want to follow the JIRA issue: https://issues.apache.org/jira/browse/SOLR-2341 -- Regards, Shalin Shekhar Mangar.
Need Suggestion
I am facing some performance issues on my Solr installation (a server with 3 cores). I am indexing live Twitter data based on certain keywords; as you can imagine, the rate at which documents arrive is very high, so updates to the cores are very frequent and regular. Given below are the document counts on my three cores. Twitter - 26874747 Core2 - 3027800 Core3 - 6074253 My server configuration has 8GB RAM, but now we are experiencing a performance drop. What can be done to improve this? Also, I have a few questions. 1. Does a high number of commits take a lot of memory? Will reducing the number of commits per hour help? 2. Most of my queries are field- or date-faceting based; how can I improve those? Regards, Rohit
Re: Need Suggestion
Below are some things to do to improve search latency: 1) Do bulk inserts. 2) Commit after every ~5000 docs. 3) Optimize once a day. 4) Use the fq parameter in search queries. What is the size of the JVM heap you are using? On 15 July 2011 17:44, Rohit ro...@in-rev.com wrote: I am facing some performance issues on my Solr installation (a server with 3 cores). I am indexing live Twitter data based on certain keywords; as you can imagine, the rate at which documents arrive is very high, so updates to the cores are very frequent and regular. Given below are the document counts on my three cores. Twitter - 26874747 Core2 - 3027800 Core3 - 6074253 My server configuration has 8GB RAM, but now we are experiencing a performance drop. What can be done to improve this? Also, I have a few questions. 1. Does a high number of commits take a lot of memory? Will reducing the number of commits per hour help? 2. Most of my queries are field- or date-faceting based; how can I improve those? Regards, Rohit Mobile: +91-9901768202 About Me: http://about.me/rohitg -- Thanks and Regards Mohammad Shariq
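A sketch of what suggestions 1-2 look like in solrconfig.xml; the values below are illustrative, not taken from this thread, and batching the commits server-side like this lets the client stop committing explicitly at all:

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <!-- commit roughly every 5000 buffered docs... -->
    <maxDocs>5000</maxDocs>
    <!-- ...or at most once a minute, whichever comes first -->
    <maxTime>60000</maxTime>
  </autoCommit>
</updateHandler>
```

Fewer, larger commits means fewer searcher reopens and cache warm-ups, which is usually where the memory and CPU spikes come from on a write-heavy index.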
Re: SolrJ Collapsable Query Fails
Thanks for the information. However, I still have one more problem. I am iterating over the values of the NamedList. I have 2 values, one being 'responseHeader' and the other being 'grouped'. I would like to access some information stored within the grouped section, which has data structured like so: grouped={attr_directory={matches=4,groups=[{groupValue=C:\Users\rvassallo\Desktop\Index,doclist={numFound=2,start=0,docs=[SolrDocument[{attr_meta=[Author, kcook, Last-Modified, 2011-03-02T14:14:18Z... With the 'get(group)' method I am only able to access the entire '{attr_directory={matches=4,g...' section. Is there some functionality which allows me to get other data? Something like this for instance: 'get(group.matches)' or maybe 'get(group.attr_directory.matches)' (which would yield the value 4), or do I need to process the String that 'get(...)' returns to get what I need? Thanks :) I think accessing the relevant portion of a NamedList is troublesome. I suggest you look at existing code in SolrJ, e.g. how facet info is extracted from a NamedList. I am sending you the piece of code that I used to access the grouped info. Hopefully it can give you some idea.
NamedList signature = (NamedList) groupedInfo.get("attr_directory");
if (signature == null) return new ArrayList(0);
matches.append(signature.get("matches"));
@SuppressWarnings("unchecked")
ArrayList<NamedList> groups = (ArrayList<NamedList>) signature.get("groups");
ArrayList resultItems = new ArrayList(groups.size());
StringBuilder builder = new StringBuilder();
for (NamedList res : groups) {
    ResultItem resultItem = null;
    String hash = null;
    Integer found = null;
    for (int i = 0; i < res.size(); i++) {
        String n = res.getName(i);
        Object o = res.getVal(i);
        if ("groupValue".equals(n)) {
            hash = (String) o;
        } else if ("doclist".equals(n)) {
            DocList docList = (DocList) o;
            found = docList.matches();
            try {
                final SolrDocumentList list = SolrPluginUtils.docListToSolrDocumentList(docList, searcher, fields, null);
                builder.setLength(0);
                if (list.size() > 0) resultItem = solrDocumentToResultItem(list.get(0), debug);
                for (final SolrDocument document : list) builder.append(document.getFieldValue("id")).append(',');
            } catch (final IOException e) {
                LOG.error("Unexpected Error", e);
            }
        }
    }
    if (found != null && found > 1 && resultItem != null) {
        resultItem.setHash(hash);
        resultItem.setFound(found);
        builder.setLength(builder.length() - 1);
        resultItem.setId(builder.toString());
    } // debug
    resultItems.add(resultItem);
}
return resultItems;
Re: deletedPkQuery fails
Hi Elaine, I think you have a syntax error in your query. I'd recommend first trying the query in a SQL client until you get it right. This part seems strange to me: and pl.deleted='' having count(*)=0 *Juan* On Wed, Jul 13, 2011 at 5:09 PM, Elaine Li elaine.bing...@gmail.com wrote: Hi Folks, I am trying to use deletedPkQuery to enable deltaImport to remove the inactive products from Solr. I keep getting a syntax error saying the query syntax is not right. I have tried many alternatives to the following query. Although all of them work at the mysql prompt directly, none works in the Solr handler. Can anyone give me a hint for debugging this type of problem? Is there anything special about deletedPkQuery I am not aware of? deletedPkQuery="select p.pId as id from products p join products_large pl on p.pId=pl.pId where p.pId= ${dataimporter.delta.id} and pl.deleted='' having count(*)=0" Jul 13, 2011 4:02:23 PM org.apache.solr.handler.dataimport.DataImporter doDeltaImport SEVERE: Delta Import Failed org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to execute query: select p.pId as id from products p join products_large pl on p.pId=pl.pId where p.pId= and pl.deleted='' having count(*)=0 Processing Document # 1 at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72) at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.init(JdbcDataSource.java:253) at org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:210) at org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:39) at org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:58) at org.apache.solr.handler.dataimport.SqlEntityProcessor.nextDeletedRowKey(SqlEntityProcessor.java:91) at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextDeletedRowKey(EntityProcessorWrapper.java:258) at
org.apache.solr.handler.dataimport.DocBuilder.collectDelta(DocBuilder.java:636) at org.apache.solr.handler.dataimport.DocBuilder.doDelta(DocBuilder.java:258) at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:172) at org.apache.solr.handler.dataimport.DataImporter.doDeltaImport(DataImporter.java:352) at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:391) at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:370) Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'and pl.deleted='' having count(*)=0' at line 1 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source) at java.lang.reflect.Constructor.newInstance(Unknown Source) at com.mysql.jdbc.Util.handleNewInstance(Util.java:407) at com.mysql.jdbc.Util.getInstance(Util.java:382) at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1052) at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3603) at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3535) at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:1989) at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2150) at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2620) at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2570) at com.mysql.jdbc.StatementImpl.execute(StatementImpl.java:779) at com.mysql.jdbc.StatementImpl.execute(StatementImpl.java:622) at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.init(JdbcDataSource.java:246) Elaine
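For reference: deletedPkQuery is executed on its own during a delta-import to fetch the primary keys of deleted rows, so it cannot reference ${dataimporter.delta.id} (which is empty at that point, as the "where p.pId=" in the logged SQL shows); it normally filters on ${dataimporter.last_index_time} instead. A hedged data-config sketch; the deleted and last_modified columns are illustrative assumptions, not taken from Elaine's schema:

```xml
<!-- data-config.xml: deletedPkQuery must return the PK column on its own,
     typically restricted to rows changed since the last import -->
<entity name="product" pk="id"
        query="select pId as id from products"
        deletedPkQuery="select pId as id from products
                        where deleted = '1'
                          and last_modified &gt; '${dataimporter.last_index_time}'">
</entity>
```

Note also that a bare HAVING without GROUP BY, as in the original query, is what MySQL is objecting to once the empty variable is substituted in.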
High Query Volume
Hello, I am using Solr MoreLikeThis for finding similar results. I have all the data indexed into my Solr server, and the index is huge - it runs to millions of documents. What I am trying to do is: given an ID, check the contents of that ID and give me results similar to those contents. My problem is that, since the contents of an ID can be very large, the term vectors/term frequencies become huge too. Also, the maximum number of query terms (mlt.maxqt) to include in the generated query depends on the ID, as some contents run to hundreds of terms and some to millions. As I have the ID and its contents, I can compute and pass mlt.maxqt per ID. So, depending on the contents, my query limit is sometimes mlt.maxqt=100, sometimes mlt.maxqt=1000, and sometimes even mlt.maxqt=10… If mlt.maxqt=100, the result comes back pretty fast, but with mlt.maxqt=1000 or more it is obviously far too slow. Is there any way to solve this issue - scale Solr in some way? Is there any way to handle a huge query volume in searching? I know the default query term limit is 25, but I need a lot more than that. Am I using the right tool (Solr MoreLikeThis)? Also, I have Solr running with 2GB and my application running with 2GB. Any thoughts and help would be really helpful. Thank you in advance. -- View this message in context: http://lucene.472066.n3.nabble.com/High-Query-Volume-tp3172274p3172274.html Sent from the Solr - User mailing list archive at Nabble.com.
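Not a cure for the query-term explosion, but for completeness, a sketch of a dedicated MoreLikeThis handler with mlt.maxqt capped server-side, so individual requests can't blow up the generated query. The handler name, field list, and values below are illustrative assumptions:

```xml
<!-- solrconfig.xml: dedicated MLT endpoint with bounded defaults -->
<requestHandler name="/mlt" class="solr.MoreLikeThisHandler">
  <lst name="defaults">
    <str name="mlt.fl">content</str>   <!-- field(s) to mine for "interesting" terms -->
    <int name="mlt.maxqt">100</int>    <!-- cap the generated query at 100 terms -->
    <int name="mlt.mintf">2</int>      <!-- ignore terms appearing fewer than 2 times -->
  </lst>
</requestHandler>
```

Queried as /mlt?q=id:12345, the handler builds the similarity query from that document's contents; clients can still override mlt.maxqt per request, but the default stays bounded.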
DIH full-import - when is commit() actually triggered?
Hello, I am running a full-import with a quite plain data-config (a root entity with three sub-entities) from a JDBC datasource. This import is expected to add approximately 10 million documents. What I now see from my logfiles is that a newSearcher event is fired about every five seconds. This causes a lot of load on the machine. While searching *:* via the admin interface it appears that on every new commit about 1,000 docs are newly added. This is the batchSize I configured in the datasource definition, but I don't think that is related. In solrconfig I have:

<updateHandler class="solr.DirectUpdateHandler2" enable="true">
  <maxPendingDeletes>10</maxPendingDeletes>
  <autoCommit>
    <maxDocs>10</maxDocs> <!-- maximum uncommitted docs before autocommit triggered -->
    <maxTime>30</maxTime>
  </autoCommit>
</updateHandler>

What other parameters in solrconfig.xml or in my data-config may be related to this behaviour? Any hint is appreciated. Thanks frank -- Kind regards, Frank Wesemann Fotofinder GmbH USt-IdNr. DE812854514 Software Development Web: http://www.fotofinder.com/ Potsdamer Str. 96 Tel: +49 30 25 79 28 90 10785 Berlin Fax: +49 30 25 79 28 999 Registered office: Berlin, Amtsgericht Berlin Charlottenburg (HRB 73099) Managing Director: Ali Paczensky
Data Import from a Queue
Does anyone know of any existing examples of importing data from a queue into Solr? Thank you.
RE: ' invisible ' words
Hi deniz, You can use Luke (http://www.getopt.org/luke/) to see how that field is indexed - which words are actually in it. That may help you figure out how you indexed your field. Thanks. Jagdish -Original Message- From: deniz [mailto:denizdurmu...@gmail.com] Sent: Thursday, July 14, 2011 2:57 PM To: solr-user@lucene.apache.org Subject: Re: ' invisible ' words well i know it is totally weird... i have tried many things, including the ones in this forum, but the result is the same... somehow some words are just invisible... - Zeki ama calismiyor... Calissa yapar... -- View this message in context: http://lucene.472066.n3.nabble.com/invisible-words-tp3158060p3168598.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: Is it possible to extract all the tokens from solr?
Check the LukeRequestHandler at http://wiki.apache.org/solr/LukeRequestHandler - this will give you all you need. Thanks, Jagdish -Original Message- From: pravesh [mailto:suyalprav...@yahoo.com] Sent: Thursday, July 14, 2011 2:50 PM To: solr-user@lucene.apache.org Subject: Re: Is it possible to extract all the tokens from solr? You can use Lucene for doing this. It provides the TermEnum API to enumerate all terms of a field (or fields). Solr 1.4+ also provides a special request handler for this purpose. Check if that helps. Thanx Pravesh -- View this message in context: http://lucene.472066.n3.nabble.com/Is-it-possible-to-extract-all-the-tokens-from-solr-tp3168362p3168589.html Sent from the Solr - User mailing list archive at Nabble.com.
Max Rows
Hi guys! For the past year I've been using Solr with ColdFusion as a search engine for a library; so far so good. I've managed to index different collections using only the 5 custom fields (category, custom1 ... 5) available. Since it has worked so well, I decided to use Solr to make something like a general search of all the collections. What I did was, in the cfsearch tag's "collection" attribute, separate all of the collections with commas. Works like a charm, but there is one problem: the maxrows attribute is set to 10, which should mean 10 results per page, but when you put several collections in the attribute it adds 10 results per page per collection. So if I have 4 comma-separated collections, maxrows forces itself to 40 results per page instead of the 10 total I'm aiming for. My question is: Is there a way to fix this? Is there a way to make the maxrows attribute global and prevent it from adding more rows per collection? Thanks in advance. Alex.
Re: Need Suggestion
I am using -Xms2g and -Xmx6g. What would be the ideal JVM size? Regards, Rohit From: Mohammad Shariq shariqn...@gmail.com To: solr-user@lucene.apache.org Sent: Fri, 15 July, 2011 7:27:38 PM Subject: Re: Need Suggestion Below are some things to do to improve search latency: 1) Do bulk inserts. 2) Commit after every ~5000 docs. 3) Optimize once a day. 4) Use the fq parameter in search queries. What is the size of the JVM heap you are using? On 15 July 2011 17:44, Rohit ro...@in-rev.com wrote: I am facing some performance issues on my Solr installation (a server with 3 cores). I am indexing live Twitter data based on certain keywords; as you can imagine, the rate at which documents arrive is very high, so updates to the cores are very frequent and regular. Given below are the document counts on my three cores. Twitter - 26874747 Core2 - 3027800 Core3 - 6074253 My server configuration has 8GB RAM, but now we are experiencing a performance drop. What can be done to improve this? Also, I have a few questions. 1. Does a high number of commits take a lot of memory? Will reducing the number of commits per hour help? 2. Most of my queries are field- or date-faceting based; how can I improve those? Regards, Rohit Mobile: +91-9901768202 About Me: http://about.me/rohitg -- Thanks and Regards Mohammad Shariq
Re: DIH full-import - when is commit() actually triggered?
I am running a full import with a quite plain data-config (a root entity with three sub-entities) from a JDBC datasource. This import is expected to add approximately 10 million documents. What I now see from my logfiles is that a newSearcher event is fired about every five seconds. This is triggered by autoCommit, every 300,000 milliseconds. You need to remove the <maxTime>...</maxTime> element to disable this mechanism.
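A hedged sketch of the change this reply suggests, applied to the updateHandler from Frank's solrconfig (the maxDocs value below is illustrative):

```xml
<updateHandler class="solr.DirectUpdateHandler2" enable="true">
  <autoCommit>
    <!-- commit on document count only; <maxTime> removed so no
         time-based commits fire during the long full-import -->
    <maxDocs>100000</maxDocs>
  </autoCommit>
</updateHandler>
```

With time-based autocommit off, a DIH full-import issues a single commit at the end of the run by default (unless commit=false is passed on the request), so the constant newSearcher churn during indexing should stop.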
Re: How to use solr.PatternReplaceFilterFactory with ampersand in pattern
That works. Thanks. From: Markus Jelsma markus.jel...@openindex.io To: solr-user@lucene.apache.org Cc: M Singh mans6si...@yahoo.com Sent: Thu, July 14, 2011 4:37:57 PM Subject: Re: How to use solr.PatternReplaceFilterFactory with ampersand in pattern You're in XML, so you must escape it properly with &amp; etc. Hi: I am using solr.PatternReplaceFilterFactory with a pattern as follows to match ampersand and $ signs: <filter class="solr.PatternReplaceFilterFactory" pattern="(&|$)" replacement=""/> I am getting an error due to the embedded ampersand: [Fatal Error] schema.xml:82:71: The entity name must immediately follow the '&' in the entity reference. Exception in thread main org.xml.sax.SAXParseException: The entity name must immediately follow the '&' in the entity reference. at org.apache.xerces.parsers.DOMParser.parse(Unknown Source) Is there any way to make it work? Appreciate your help. Thanks.
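For the archives, a sketch of the escaped form Markus means; the exact pattern is an assumption reconstructed from the description of matching ampersand and $ signs:

```xml
<!-- schema.xml: & must be written as &amp; inside XML attribute values,
     and $ needs a backslash because it is a regex end-of-input anchor -->
<filter class="solr.PatternReplaceFilterFactory"
        pattern="(&amp;|\$)" replacement="" replace="all"/>
```

The XML parser resolves &amp; back to a plain & before the regex is compiled, so the filter sees the pattern (&|\$) and strips every ampersand and dollar sign from each token.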
Re: Solr Ecosystem / Integration wiki pages
: integrating Solr with other applications. What isn't there is a list of : what web/email/file crawlers exist, data integration pipelines, and there : are some other odds and ends like distributions/forks of Solr (Lucid : Constellio), and Solandra. So I started to put together this page: : http://wiki.apache.org/solr/SolrEcosystem instead of essentially : duplicating what's on SolrIntegration I linked to it. I suspect that some : might feel that all this information should live on SolrIntegration and so I : should move this. Yes? I really liked the idea of naming this Solr : Ecosystem but I admit that when it comes down to it, it's basically about : integrating with Solr. : : Any thoughts on this from anyone? Looks fine to me. I think it makes sense to have a distinction between the ecosystem of tools and such that might be of interest to people using Solr (which may or may not know about Solr directly), and tools that exist specifically to integrate Solr with other things. I updated both pages to try and clarify their purpose. One thing that would be nice on the Ecosystem page is to better call out when/how these things can be used with Solr by linking to info about that, rather than just putting a *S* next to them -- if there isn't a document somewhere on those sites mentioning Solr, then claiming they have some level of Solr integration is kind of misleading. -Hoss
Re: Extending Solr Highlighter to pull information from external source
Boy, it's been a long time since I first wrote this; sorry for the delay. I think I have this working as I expect with a test implementation. I created the following interface:

public interface SolrExternalFieldProvider extends NamedListInitializedPlugin {
    public String[] getFieldContent(String key, SchemaField field, SolrQueryRequest request);
}

I then added the following to DefaultSolrHighlighter, in init():

SolrExternalFieldProvider defaultProvider =
    solrCore.initPlugins(info.getChildren("externalFieldProvider"),
        externalFieldProviders, SolrExternalFieldProvider.class, null);
if (defaultProvider != null) {
    externalFieldProviders.put("", defaultProvider);
    externalFieldProviders.put(null, defaultProvider);
}

then in doHighlightByHighlighter I added the following:

if (schemaField != null && !schemaField.stored()) {
    SolrExternalFieldProvider externalFieldProvider = this.getExternalFieldProvider(fieldName, params);
    if (externalFieldProvider != null) {
        SchemaField keyField = schema.getUniqueKeyField();
        String key = doc.getValues(keyField.getName())[0]; // I know this field exists and is not multivalued
        if (key != null && key.length() > 0) {
            docTexts = externalFieldProvider.getFieldContent(key, schemaField, req);
        }
    } else {
        docTexts = new String[]{};
    }
} else {
    docTexts = doc.getValues(fieldName);
}

This worked for me. I needed to include the req because there are some additional things that I need to have from it; I figure this is probably something other folks will need as well. I tried to follow the pattern used for the other highlighter pieces, in that you can have different externalFieldProviders for each field. I'm more than happy to share the actual classes with the community or add them to one of the JIRA issues mentioned below; I haven't done so yet because I don't know how to build patches.
On Mon, Jun 20, 2011 at 11:47 PM, Michael Sokolov soko...@ifactory.com wrote: I found https://issues.apache.org/jira/browse/SOLR-1397 but there is not much going on there. LUCENE-1522 https://issues.apache.org/jira/browse/LUCENE-1522 has a lot of fascinating discussion on this topic though. There are a couple of long-lived issues in JIRA for this (I'd like to try to search for them, but I can't access JIRA right now). For FVH, it would need to be modified at the Lucene level to use external data. koji Koji - is that really so? It appears to me that one could extend BaseFragmentsBuilder and override createFragments(IndexReader reader, int docId, String fieldName, FieldFragList fieldFragList, int maxNumFragments, String[] preTags, String[] postTags, Encoder encoder), providing a version that retrieves text from some external source rather than from Lucene fields. It sounds to me like a really useful modification in Lucene core would be to retain match points that have already been computed during scoring, so the highlighter doesn't have to attempt to reinvent all that logic! This has all been discussed at length in LUCENE-1522 already, but is there any recent activity? My hope is that since (at least in my test) search code seems to spend 80% of its time highlighting, folks will take up this banner and do the plumbing needed to improve it - it should lead to huge speed-ups for searching! I'm continuing to read, but am not really capable of making a meaningful contribution at this point. -Mike
Re: Solr Ecosystem / Integration wiki pages
Thanks for offering feedback; if nobody commented I was going to send an FYI post to the dev list. Comments below. On Jul 15, 2011, at 3:39 PM, Chris Hostetter wrote: : integrating Solr with other applications. What isn't there is a list of : what web/email/file crawlers exist, data integration pipelines, and there : are some other odds and ends like distributions/forks of Solr (Lucid : Constellio), and Solandra. So I started to put together this page: : http://wiki.apache.org/solr/SolrEcosystem instead of essentially : duplicating what's on SolrIntegration I linked to it. I suspect that some : might feel that all this information should live on SolrIntegration and so I : should move this. Yes? I really liked the idea of naming this Solr : Ecosystem but I admit that when it comes down to it, it's basically about : integrating with Solr. : : Any thoughts on this from anyone? Looks fine to me. I think it makes sense to have a distinction between the ecosystem of tools and such that might be of interest to people using Solr (which may or may not know about Solr directly), and tools that exist specifically to integrate Solr with other things. I updated both pages to try and clarify their purpose. I noticed your change on IntegratingSolr but not SolrEcosystem, which is still at rev#3. One thing that would be nice on the Ecosystem page is to better call out when/how these things can be used with Solr by linking to info about that, rather than just putting a *S* next to them -- if there isn't a document somewhere on those sites mentioning Solr, then claiming they have some level of Solr integration is kind of misleading. I agree that adding a link would be helpful. By the way, every place there is an *S* was deliberately placed there by me because I identified the existence of Solr-specific integration. Do you believe I misattributed an *S*? ~ David
Re: Extending Solr Highlighter to pull information from external source
I added the highlighting code I am using to this JIRA (https://issues.apache.org/jira/browse/SOLR-1397). Afterwards I noticed this JIRA (https://issues.apache.org/jira/browse/SOLR-1954) which talks about another solution. I think David's patch would have worked equally well for my problem, just would require later doing the highlighting on the clients end. I'll have to give this a whirl over the weekend. On Fri, Jul 15, 2011 at 3:55 PM, Jamie Johnson jej2...@gmail.com wrote: Boy it's been a long time since I first wrote this, sorry for the delay I think I have this working as I expect with a test implementation. I created the following interface public interface SolrExternalFieldProvider extends NamedListInitializedPlugin { public String[] getFieldContent(String key, SchemaField field, SolrQueryRequest request); } I then added to DefaultSolrHighlighter the following: in init() SolrExternalFieldProvider defaultProvider = solrCore.initPlugins(info.getChildren(externalFieldProvider) , externalFieldProviders,SolrExternalFieldProvider.class,null); if(defaultProvider != null){ externalFieldProviders.put(, defaultProvider); externalFieldProviders.put(null, defaultProvider); } then in doHighlightByHighlighter I added the following if(schemaField != null !schemaField.stored()){ SolrExternalFieldProvider externalFieldProvider = this.getExternalFieldProvider(fieldName, params); if(externalFieldProvider != null){ SchemaField keyField = schema.getUniqueKeyField(); String key = doc.getValues(keyField.getName())[0]; //I know this field exists and is not multivalued if(key != null key.length() 0){ docTexts = externalFieldProvider.getFieldContent(key, schemaField, req); } } else { docTexts = new String[]{}; } } else { docTexts = doc.getValues(fieldName); } This worked for me. I needed to include the req because there are some additional thing that I need to have from it, I figure this is probably something else folks will need as well. 
I tried to follow the pattern used for the other highlighter pieces, in that you can have different externalFieldProviders for each field. I'm more than happy to share the actual classes with the community or add them to one of the JIRA issues mentioned below; I haven't done so yet because I don't know how to build patches.

On Mon, Jun 20, 2011 at 11:47 PM, Michael Sokolov soko...@ifactory.com wrote:

I found https://issues.apache.org/jira/browse/SOLR-1397 but there is not much going on there. LUCENE-1522 (https://issues.apache.org/jira/browse/LUCENE-1522) has a lot of fascinating discussion on this topic though.

There are a couple of long-lived issues in JIRA for this (I'd like to try to search them, but I couldn't access JIRA now). For FVH, it needs to be modified at the Lucene level to use external data. -- koji

Koji - is that really so? It appears to me that we could extend BaseFragmentsBuilder and override

    createFragments(IndexReader reader, int docId, String fieldName,
        FieldFragList fieldFragList, int maxNumFragments,
        String[] preTags, String[] postTags, Encoder encoder)

providing a version that retrieves text from some external source rather than from Lucene fields. It sounds to me like a really useful modification in Lucene core would be to retain match points that have already been computed during scoring, so the highlighter doesn't have to attempt to reinvent all that logic! This has all been discussed at length in LUCENE-1522 already, but is there any recent activity? My hope is that since (at least in my test) search code seems to spend 80% of its time highlighting, folks will take up this banner and do the plumbing needed to improve it - should lead to huge speed-ups for searching! I'm continuing to read, but not really capable of making a meaningful contribution at this point. -Mike
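For anyone following along, the provider idea in this thread can be sketched outside of Solr. Below is a minimal, self-contained stand-in: plain Strings replace SchemaField/SolrQueryRequest, and MapBackedProvider is a hypothetical in-memory source (a real provider would hit a database or file store). Everything here is illustrative, not the actual patch code.

```java
import java.util.HashMap;
import java.util.Map;

public class ExternalFieldDemo {
    // Simplified stand-in for the SolrExternalFieldProvider idea: given a
    // document key and a field name, fetch the text to highlight from a
    // source outside the index.
    interface ExternalFieldProvider {
        String[] getFieldContent(String key, String fieldName);
    }

    // Hypothetical provider backed by an in-memory map keyed by "docKey/field".
    static class MapBackedProvider implements ExternalFieldProvider {
        private final Map<String, String[]> store = new HashMap<>();

        void put(String key, String field, String... content) {
            store.put(key + "/" + field, content);
        }

        // Mirrors the thread's contract: empty array when nothing is found,
        // so the highlighter has nothing to fragment.
        public String[] getFieldContent(String key, String fieldName) {
            String[] content = store.get(key + "/" + fieldName);
            return content != null ? content : new String[0];
        }
    }

    public static void main(String[] args) {
        MapBackedProvider provider = new MapBackedProvider();
        provider.put("doc1", "body", "external text to highlight");
        System.out.println(provider.getFieldContent("doc1", "body")[0]);
    }
}
```

In the real patch the lookup key comes from the schema's uniqueKey field, which is why the snippet above in the thread reads it from doc.getValues(keyField.getName()).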
Indexing PDF documents with no UniqueKey
I want to index PDF (and other rich) documents. I am using the DataImportHandler. Here is how my schema.xml looks:

    <field name="title" type="text" indexed="true" stored="true" multiValued="false"/>
    <field name="description" type="text" indexed="true" stored="true" multiValued="false"/>
    <field name="date_published" type="string" indexed="false" stored="true" multiValued="false"/>
    <field name="link" type="string" indexed="true" stored="true" multiValued="false" required="false"/>
    <dynamicField name="attr_*" type="textgen" indexed="true" stored="true" multiValued="false"/>
    <uniqueKey>link</uniqueKey>

As you can see, I have set link as the unique key so that documents are not duplicated when the indexing happens again. Now I have the file paths stored in a database, and I have set the DataImportHandler to get a list of all the file paths and index each document. To test it I used the tutorial.pdf file that comes with the example docs in Solr. The problem is, of course, this PDF document won't have a field 'link'. I am thinking of a way I can manually set the file path as link when indexing these documents. I tried the data-config settings below,

    <entity name="fileItems" rootEntity="false" dataSource="dbSource" query="select path from file_paths">
      <entity name="tika-test" processor="TikaEntityProcessor" url="${fileItems.path}" dataSource="fileSource">
        <field column="title" name="title" meta="true"/>
        <field column="Creation-Date" name="date_published" meta="true"/>
        <entity name="filePath" dataSource="dbSource" query="SELECT path FROM file_paths as link where path = '${fileItems.path}'">
          <field column="link" name="link"/>
        </entity>
      </entity>
    </entity>

where I create a sub-entity which queries for the path name and makes it return the results in a column titled 'link'.
But I still see this error:

    WARNING: Error creating document : SolrInputDocument[{date_published=date_published(1.0)={2011-06-23T12:47:45Z}, title=title(1.0)={Solr tutorial}}]
    org.apache.solr.common.SolrException: Document is missing mandatory uniqueKey field: link

Is there any way for me to create a field called link for the PDF documents? This was already asked here before (http://lucene.472066.n3.nabble.com/Trouble-with-exception-Document-Null-missing-required-field-DocID-td1641048.html), but the solution provided uses ExtractingRequestHandler and I want to do it through the DataImportHandler.
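One commonly suggested alternative, in case it helps: instead of re-querying the database in a sub-entity, DIH's TemplateTransformer can copy the parent entity's path into the link field directly. A sketch of what the data-config might look like (untested against this exact setup; attribute values other than the transformer are taken from the config above):

```xml
<entity name="fileItems" rootEntity="false" dataSource="dbSource"
        query="select path from file_paths">
  <entity name="tika-test" processor="TikaEntityProcessor"
          url="${fileItems.path}" dataSource="fileSource"
          transformer="TemplateTransformer">
    <field column="title" name="title" meta="true"/>
    <field column="Creation-Date" name="date_published" meta="true"/>
    <!-- copy the parent entity's path straight into the uniqueKey field,
         instead of re-querying the database for it -->
    <field column="link" template="${fileItems.path}"/>
  </entity>
</entity>
```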
Re: Extending Solr Highlighter to pull information from external source
I tried the patch at SOLR-1397 but it didn't work as I'd expect.

    <lst name="highlighting">
      <lst name="1">
        <arr name="subject_phonetic">
          <str><em>Test</em> subject message</str>
        </arr>
        <arr name="subject_phonetic_startPos"><int>0</int></arr>
        <arr name="subject_phonetic_endPos"><int>29</int></arr>
      </lst>
    </lst>

The start position is right, but the end position seems to be the length of the field.

On Fri, Jul 15, 2011 at 4:25 PM, Jamie Johnson jej2...@gmail.com wrote:

I added the highlighting code I am using to this JIRA (https://issues.apache.org/jira/browse/SOLR-1397). Afterwards I noticed this JIRA (https://issues.apache.org/jira/browse/SOLR-1954), which talks about another solution. I think David's patch would have worked equally well for my problem; it would just require doing the highlighting on the client's end later. I'll have to give this a whirl over the weekend.

On Fri, Jul 15, 2011 at 3:55 PM, Jamie Johnson jej2...@gmail.com wrote:

Boy, it's been a long time since I first wrote this, sorry for the delay. I think I have this working as I expect with a test implementation.
I created the following interface:

    public interface SolrExternalFieldProvider extends NamedListInitializedPlugin {
        public String[] getFieldContent(String key, SchemaField field, SolrQueryRequest request);
    }

I then added to DefaultSolrHighlighter the following, in init():

    SolrExternalFieldProvider defaultProvider = solrCore.initPlugins(
        info.getChildren("externalFieldProvider"),
        externalFieldProviders, SolrExternalFieldProvider.class, null);
    if (defaultProvider != null) {
        externalFieldProviders.put("", defaultProvider);
        externalFieldProviders.put(null, defaultProvider);
    }

then in doHighlightByHighlighter I added the following:

    if (schemaField != null && !schemaField.stored()) {
        SolrExternalFieldProvider externalFieldProvider = this.getExternalFieldProvider(fieldName, params);
        if (externalFieldProvider != null) {
            SchemaField keyField = schema.getUniqueKeyField();
            String key = doc.getValues(keyField.getName())[0]; // I know this field exists and is not multivalued
            if (key != null && key.length() > 0) {
                docTexts = externalFieldProvider.getFieldContent(key, schemaField, req);
            }
        } else {
            docTexts = new String[]{};
        }
    } else {
        docTexts = doc.getValues(fieldName);
    }

This worked for me. I needed to include the req because there are some additional things that I need to have from it; I figure this is probably something else folks will need as well. I tried to follow the pattern used for the other highlighter pieces, in that you can have different externalFieldProviders for each field. I'm more than happy to share the actual classes with the community or add them to one of the JIRA issues mentioned below; I haven't done so yet because I don't know how to build patches.
On Mon, Jun 20, 2011 at 11:47 PM, Michael Sokolov soko...@ifactory.com wrote:

I found https://issues.apache.org/jira/browse/SOLR-1397 but there is not much going on there. LUCENE-1522 (https://issues.apache.org/jira/browse/LUCENE-1522) has a lot of fascinating discussion on this topic though.

There are a couple of long-lived issues in JIRA for this (I'd like to try to search them, but I couldn't access JIRA now). For FVH, it needs to be modified at the Lucene level to use external data. -- koji

Koji - is that really so? It appears to me that we could extend BaseFragmentsBuilder and override

    createFragments(IndexReader reader, int docId, String fieldName,
        FieldFragList fieldFragList, int maxNumFragments,
        String[] preTags, String[] postTags, Encoder encoder)

providing a version that retrieves text from some external source rather than from Lucene fields. It sounds to me like a really useful modification in Lucene core would be to retain match points that have already been computed during scoring, so the highlighter doesn't have to attempt to reinvent all that logic! This has all been discussed at length in LUCENE-1522 already, but is there any recent activity? My hope is that since (at least in my test) search code seems to spend 80% of its time highlighting, folks will take up this banner and do the plumbing needed to improve it - should lead to huge speed-ups for searching! I'm continuing to read, but not really capable of making a meaningful contribution at this point. -Mike
Re: SolrCloud Sharding
On Fri, Jul 15, 2011 at 4:51 PM, Jamie Johnson jej2...@gmail.com wrote: Thanks Shalin. I don't necessarily have an issue running off this patch, but before I do that or implement my own sharding logic, I wonder if you could let me know your thoughts on the stability of the patch? How well it works, basically. To be frank, I've no idea. This is just the beginning of this feature, so you have to assume that the final result that goes into Solr can be very different. -- Regards, Shalin Shekhar Mangar.
Analysis page output vs. actually getting search matches, a discrepancy?
I have a problem searching for one mfg name (out of our 10MM product titles). It is indexed in a text-type field having about the same analyzer settings as the Solr example text field definition, and most everything works fine, but we found this one example where I cannot get a direct hit. In the Field Analysis page, it sure looks like it would *have* to match, but sadly during searches it just doesn't. I can get it to match by turning off 'split on case change', but that breaks many other searches like 'appleTV' which need to split on case change to match 'apple tv' in our content!

If I search for SterlingTek's anything, I get zero results. If I change the casing to Sterlingtek's in my query, I get all the results. If I turn off 'split on case change', the first gets results also. See the verbose analysis output below for the actual filter settings; I put the non-verbose output first for easier reading (hope the tables don't get lost during posting to this group). The analysis shows a complete match-up, and that is what I don't get:

Field Analysis
Field value (Index): SterlingTek's NB-2LH
Field value (Query): SterlingTek's NB-2LH

Index Analyzer
SterlingTek's NB-2LH
SterlingTek's NB-2LH
SterlingTek's NB-2LH
Sterling Tek NB 2 LH SterlingTek
sterling tek nb 2 lh sterlingtek
sterling tek nb 2 lh sterlingtek
sterling tek nb 2 lh sterlingtek

Note every field is highlighted in the last line above, meaning all have a match, right???
Query Analyzer
SterlingTek's NB-2LH
SterlingTek's NB-2LH
SterlingTek's NB-2LH
Sterling Tek NB 2 LH
sterling tek nb 2 lh
sterling tek nb 2 lh
sterling tek nb 2 lh

VERBOSE OUTPUT FOLLOWS:

Index Analyzer

org.apache.solr.analysis.WhitespaceTokenizerFactory {}
  term position:    1 2
  term text:        SterlingTek's NB-2LH
  term type:        word word
  source start,end: 0,13 14,20

org.apache.solr.analysis.SynonymFilterFactory {synonyms=index_synonyms.txt, expand=true, ignoreCase=true}
  term position:    1 2
  term text:        SterlingTek's NB-2LH
  term type:        word word
  source start,end: 0,13 14,20

org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt, ignoreCase=true}
  term position:    1 2
  term text:        SterlingTek's NB-2LH
  term type:        word word
  source start,end: 0,13 14,20

org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=0, splitOnCaseChange=1, generateNumberParts=1, catenateWords=1, generateWordParts=1, catenateAll=0, catenateNumbers=1}
  term position:    1 2 3 4 5
  term text:        Sterling Tek NB 2 LH SterlingTek
  term type:        word word word word word word
  source start,end: 0,8 8,11 14,16 17,18 18,20 0,11

org.apache.solr.analysis.LowerCaseFilterFactory {}
  term position:    1 2 3 4 5
  term text:        sterling tek nb 2 lh sterlingtek
  term type:        word word word word word word
  source start,end: 0,8 8,11 14,16 17,18 18,20 0,11

com.lucidimagination.solrworks.analysis.LucidKStemFilterFactory {protected=protwords.txt}
  term position:    1 2 3 4 5
  term text:        sterling tek nb 2 lh sterlingtek
  term type:        word word word word word word
  source start,end: 0,8 8,11 14,16 17,18 18,20 0,11

org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
  term position:    1 2 3 4 5
  term text:        sterling tek nb 2 lh sterlingtek
  term type:        word word word word word word
  source start,end: 0,8 8,11 14,16 17,18 18,20 0,11

Query Analyzer

org.apache.solr.analysis.WhitespaceTokenizerFactory {}
  term position:    1 2
  term text:        SterlingTek's NB-2LH
  term type:        word word
  source start,end: 0,13 14,20

org.apache.solr.analysis.SynonymFilterFactory {synonyms=query_synonyms.txt, expand=true, ignoreCase=true}
  term position:    1 2
  term text:        SterlingTek's NB-2LH
  term type:        word word
  source start,end: 0,13 14,20

org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt, ignoreCase=true}
  term position:    1 2
  term text:        SterlingTek's NB-2LH
  term type:        word word
  source start,end: 0,13 14,20

org.apache.solr.analysis.WordDelimiterFilterFactory {preserveOriginal=0, splitOnCaseChange=1, generateNumberParts=1, catenateWords=0, generateWordParts=1, catenateAll=0, catenateNumbers=0}
  term position:    1 2 3 4 5
  term text:        Sterling Tek NB 2 LH
  term type:        word word word word word
  source start,end: 0,8 8,11 14,16 17,18 18,20
Re: Analysis page output vs. actually getting search matches, a discrepancy?
: Subject: Analysis page output vs. actually getting search matches,
: a discrepancy?

99% of the time when people ask questions like this, it's because of confusion about how/when QueryParsing comes into play (as opposed to analysis) -- analysis.jsp only shows you part of the equation, it doesn't know what query parser you are using. You mentioned that you aren't getting matches when you expect them, and you provided the analysis.jsp output, but you didn't mention anything about the request you are making, the query parser used, etc. It would be good to know the full query URL, along with the debugQuery output showing the final query toString info. If that info doesn't clear up the discrepancy, you should also take a look at the explainOther info for the doc that you expect to match that isn't -- if you still aren't sure what's going on, post all of that info to solr-user and folks can probably help you make sense of it. (All that said: in some instances this type of problem is simply that someone changed the schema and didn't reindex everything, so the indexed terms don't really match what you think they do.) -Hoss
Re: Getting the indexed value rather than the stored value
: However, when I get the value of the field from a Solr query, I get the
: original sentence (some sentence like this) which is not what I want (in
: this particular case).

The stored field is always the original stored value -- analysis is only used for producing the indexed terms.

: For now, I ended up creating a custom UpdateProcessor and configured it in
: solrconfig.xml, but I would still like to know if there's a way through the
: Solr API to get the actual indexed value (like the way the Solr API does it)

An UpdateProcessor is definitely the right way to go about a problem like this. Solr actually doesn't have an efficient way to get the indexed values for a document; the very nature of the indexed values is that they are an *inverted* index -- it's efficient to go from indexed term -> doc, not the other way around. The caveat to this is that things like the FieldCache and UnInvertedField can be used internally for fast lookup of indexed terms, but they have a heavy initialization cost to build up these data structures for each newSearcher. Bottom line: an UpdateProcessor (or generating this value in your indexing code) is the way to go. -Hoss
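To illustrate the "generating this value in your indexing code" route: the sketch below mimics a trivial analysis chain (whitespace tokenize + lowercase) outside Solr, producing the string you would store in a separate field at index time. A real UpdateRequestProcessor would apply the same kind of transformation inside Solr before the document is indexed; the class and method names here are made up for the example, and a real chain would usually do more (stemming, stop words, etc.).

```java
import java.util.Locale;

public class PreAnalyzedValue {
    // Apply a simple whitespace-tokenize + lowercase chain and rejoin the
    // tokens, so the "analyzed" form can be stored alongside the original.
    static String analyzed(String original) {
        StringBuilder out = new StringBuilder();
        for (String token : original.trim().split("\\s+")) {
            if (out.length() > 0) out.append(' ');
            out.append(token.toLowerCase(Locale.ROOT));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: some sentence like this
        System.out.println(analyzed("Some Sentence LIKE This"));
    }
}
```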
RE: Analysis page output vs. actually getting search matches, a discrepency?
Hi Chris, Well, to start from the bottom of your list there: I restrict my testing to one sku, continuously reindexing the sku after every indexer-side change, and I reload the core every time also. I just search from the admin page using the word in question and the exact match on the sku field (the unique one) like this:

    <response>
      <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">6</int>
        <lst name="params">
          <str name="indent">on</str>
          <str name="start">0</str>
          <str name="q">SterlingTek's NB-2LH sku:216473417</str>
          <str name="bbb">a</str>
          <str name="rows">10</str>
          <str name="version">2.2</str>
        </lst>
      </lst>
    </response>

I will have to find out more about query parsers before I can answer the rest. Will reply to that later... and it's Friday after all! :) Thanks

-----Original Message-----
From: Chris Hostetter [mailto:hossman_luc...@fucit.org]
Sent: Friday, July 15, 2011 4:36 PM
To: solr-user@lucene.apache.org
Subject: Re: Analysis page output vs. actually getting search matches, a discrepancy?

: Subject: Analysis page output vs. actually getting search matches,
: a discrepancy?

99% of the time when people ask questions like this, it's because of confusion about how/when QueryParsing comes into play (as opposed to analysis) -- analysis.jsp only shows you part of the equation, it doesn't know what query parser you are using. You mentioned that you aren't getting matches when you expect them, and you provided the analysis.jsp output, but you didn't mention anything about the request you are making, the query parser used, etc. It would be good to know the full query URL, along with the debugQuery output showing the final query toString info. If that info doesn't clear up the discrepancy, you should also take a look at the explainOther info for the doc that you expect to match that isn't -- if you still aren't sure what's going on, post all of that info to solr-user and folks can probably help you make sense of it.
(all that said: in some instances this type of problem is simply that someone changed the schema and didn't reindex everything, so the indexed terms don't really match what you think they do) -Hoss
Index rows with NULL value
Hi, It seems that Solr does not index a row when some column of this row has a NULL value. How can I make Solr index these rows? Thanks, Ruixiang
how to get one word frequency from a document
Hi All, I am trying to use TermVectorComponent to get the word frequency from a particular document. Here is the URL I used: q=someword+id%3Asomedoc&qt=tvrh&tv.all=true. But the result includes all the words' frequencies in that document. Are there any query filters or request parameters that I can use to get one particular word's frequency from a particular document? Thanks a lot. -- Allen
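As far as I know there is no tv.* parameter that restricts the output to a single term, so one option is to fetch the document's term vector and pick the term out client-side. Below is a sketch using plain nested Maps as a stand-in for the parsed response (with SolrJ you would walk the returned NamedList the same way); the doc/field/term names are placeholders, and the "tf" key mirrors what the TermVectorComponent output is understood to contain.

```java
import java.util.HashMap;
import java.util.Map;

public class TermFreqLookup {
    // Pull one term's frequency out of a nested doc -> field -> term -> stats
    // structure; returns null at any level where the key is absent.
    @SuppressWarnings("unchecked")
    static Integer termFreq(Map<String, Object> termVectors,
                            String docId, String field, String term) {
        Map<String, Object> fields = (Map<String, Object>) termVectors.get(docId);
        if (fields == null) return null;
        Map<String, Object> terms = (Map<String, Object>) fields.get(field);
        if (terms == null) return null;
        Map<String, Object> stats = (Map<String, Object>) terms.get(term);
        if (stats == null) return null;
        return (Integer) stats.get("tf");
    }

    public static void main(String[] args) {
        // Build a tiny fake response: doc "somedoc", field "text",
        // term "someword" with tf = 3.
        Map<String, Object> stats = new HashMap<>();
        stats.put("tf", 3);
        Map<String, Object> terms = new HashMap<>();
        terms.put("someword", stats);
        Map<String, Object> fields = new HashMap<>();
        fields.put("text", terms);
        Map<String, Object> tv = new HashMap<>();
        tv.put("somedoc", fields);

        System.out.println(termFreq(tv, "somedoc", "text", "someword")); // 3
    }
}
```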
Re: POST VS GET and NON English Characters
It works fine with the GET method, but I am wondering why it does not with the POST method.

2011/7/15 pankaj bhatt panbh...@gmail.com

Hi Arun, This looks like an encoding issue to me. Can you change your browser settings to UTF-8 and hit the search URL via the GET method? We faced a similar problem with Chinese and Korean languages; this solved the problem. / Pankaj Bhatt.

2011/7/15 Sujatha Arun suja.a...@gmail.com

Hello, We have implemented Solr search in several languages. Initially we used the GET method for querying, but later moved to the POST method to accommodate lengthy queries. When we moved from the GET to the POST method, the German characters could no longer be searched, and I had to use the function utf8_decode in my application for the search to work for German characters. Currently I am doing this while querying using the POST method (we are using the standard request handler):

    $this->_queryterm = iconv("UTF-8", "ISO-8859-1//TRANSLIT//IGNORE", $this->_queryterm);

This makes the query work for German characters and other languages, but does not work for certain characters in Lithuanian and Spanish. Example:

*Not working:* Iš - Estremadūros - sNaująjį - MEDŽIAGOTYRA - MEDŽIAGOS - taškuose
*Working:* garbę - ieškoti - ispanų

Any ideas/input? Regards, Sujatha
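The byte-level reason a Latin-1 round trip preserves German but can mangle Lithuanian shows up if you percent-encode the same strings in both charsets. A self-contained Java sketch (URLEncoder is used here only to make the bytes visible; this is not the thread's actual client code, and "Straße"/"Iš" are just sample words): German ß exists in ISO-8859-1, but Lithuanian š does not, so it degrades to '?' before it ever reaches Solr.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class EncodingCompare {
    // Percent-encode a string using the given charset; characters the
    // charset cannot represent are replaced with '?' (0x3F) before encoding.
    static String enc(String s, String charset) {
        try {
            return URLEncoder.encode(s, charset);
        } catch (UnsupportedEncodingException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // German ß survives both charsets (different bytes, no data loss)
        System.out.println(enc("Straße", "UTF-8"));      // Stra%C3%9Fe
        System.out.println(enc("Straße", "ISO-8859-1")); // Stra%DFe
        // Lithuanian š is not in Latin-1: it collapses to '?' (%3F)
        System.out.println(enc("Iš", "UTF-8"));          // I%C5%A1
        System.out.println(enc("Iš", "ISO-8859-1"));     // I%3F
    }
}
```

This suggests the durable fix is to send the POST body as UTF-8 end to end (Content-Type charset on the request, URIEncoding/UTF-8 filter on the container) rather than converting the query to ISO-8859-1.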