Re: DataImport TXT file entity processor
an EntityProcessor looks right to me. It may help us add more attributes if needed. PlainTextEntityProcessor looks like a good name. It can also be used to read HTML etc. --Noble

On Sat, Jan 24, 2009 at 12:37 PM, Shalin Shekhar Mangar shalinman...@gmail.com wrote: On Sat, Jan 24, 2009 at 5:56 AM, Nathan Adams na...@umich.edu wrote: Is there a way to use Data Import Handler to index non-XML (i.e. simple text) files (either via HTTP or FileSystem)? I need to put the entire contents of a text file into a single field of a document; the other fields are being pulled out of Oracle...

Not yet, but I think it would be nice to have. Can you open an issue in Jira? I think importing from HTTP was something another user had asked for recently. How do you get the url/path of this text file? That would help decide whether we need a Transformer or an EntityProcessor for these tasks. -- Regards, Shalin Shekhar Mangar. -- --Noble Paul
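For readers wondering what the proposed processor's configuration might look like, here is a hypothetical data-config sketch. PlainTextEntityProcessor and its plainText column are only the names suggested in this thread, not a shipped feature at the time of writing; the data source type and URL are placeholders:

```xml
<!-- hypothetical sketch: processor and column names follow the
     naming discussed above, not an existing feature -->
<dataConfig>
  <dataSource type="HttpDataSource" />
  <document>
    <entity name="txt"
            processor="PlainTextEntityProcessor"
            url="http://example.com/docs/notes.txt">
      <!-- "plainText" is an assumed column carrying the whole file body -->
      <field column="plainText" name="body" />
    </entity>
  </document>
</dataConfig>
```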
Re: Should I extend DIH to handle POST too?
That does not look like a great option. DIH looks like overkill for this use case. You can write a simple UpdateHandler to do that. All you need to do is extend ContentStreamHandlerBase and register it as an UpdateHandler.

On Sat, Jan 24, 2009 at 12:34 PM, Shalin Shekhar Mangar shalinman...@gmail.com wrote: There's another option: using DIH with Solrj. Take a look at: https://issues.apache.org/jira/browse/SOLR-853 There's a patch there but it hasn't been updated to trunk. A contribution would be most welcome.

On Sat, Jan 24, 2009 at 3:11 AM, Gunaranjan Chandraraju chandrar...@apple.com wrote: Hi, I had earlier described my requirement of needing to 'post XMLs as-is' to SOLR and have it handled just as the DIH would do on import, using the mapping in data-config.xml. I got multiple answers for the 'post approach', the top two being: use SOLR CELL, or use SOLRJ. In general I would like to keep all the 'data conversion' inside the SOLR-powered search system rather than having clients do the XSL and transform the XML before sending it (CELL approach). My question is: how should I design this?

- Tomcat servlet that provides this 'post' endpoint. It accepts the XML over HTTP, transforms it and calls SOLRJ to update. This is the same Tomcat that houses SOLR.
- SOLR handler (is this the right way?)
- Take this a step further and implement it as an extension to DIH: a handler that will refer to the DIH data-config.xml and use the same transformation. This way I can invoke an import for 'batched files' or do a 'post' for the same XML with the same data-config mapping being applied. Maybe it can be a separate handler that just refers to the same data-config.xml and is not necessarily bundled with the DIH handler code.

Looking for some advice. If the DIH extension is the way to go then I would be happy to extend it and contribute that back to SOLR. Regards, Guna -- Regards, Shalin Shekhar Mangar. -- --Noble Paul
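A sketch of the registration such a handler would need in solrconfig.xml; the class name com.example.XmlPostHandler is a hypothetical placeholder for a class extending ContentStreamHandlerBase:

```xml
<!-- solrconfig.xml sketch: com.example.XmlPostHandler is hypothetical;
     it would extend ContentStreamHandlerBase, transform the posted XML
     and index the result -->
<requestHandler name="/update/xmlpost" class="com.example.XmlPostHandler" />
```

Clients could then POST their XML as-is to /update/xmlpost and leave the transformation to the server side.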
Re: How to make Relationships work for Multi-valued Index Fields?
Hello, I am also a newbie and was wanting to do almost the exact same thing. I was planning on doing the equivalent of:

<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8" />
  <document>
    <entity name="f" processor="FileListEntityProcessor"
            baseDir="***" fileName=".*xml" rootEntity="false" dataSource="null">
      <!-- ***changed***: added rootEntity="false" -->
      <entity name="record" processor="XPathEntityProcessor" stream="false"
              rootEntity="false" forEach="/record" url="${f.fileAbsolutePath}">
        <!-- ***change***: added commonField="true" -->
        <field column="ID" xpath="/record/@id" commonField="true"/>
        <!-- Address -->
        <entity name="record_adr" processor="XPathEntityProcessor" stream="false"
                forEach="/record/address" url="${f.fileAbsolutePath}">
          <field column="address_street" xpath="/record/address/@street" />
          <field column="address_state" xpath="/record/address//@state" />
          <field column="address_type" xpath="/record/address//@type" />
        </entity>
      </entity>
    </entity>
  </document>
</dataConfig>

ID is no longer unique within Solr; there would be multiple documents with a given ID, one for each address. You can then search on ID and get the three addresses, and you can also search on an address more sensibly. I have not been able to try this yet as other issues are still to be dealt with. Comments?

Hi, I may be completely off on this, being new to SOLR, but I am not sure how to index related groups of fields in a document and preserve their 'grouping'. I would appreciate any help on this. Detailed description of the problem below. I am trying to index an entity that can have multiple occurrences in the same document, e.g. Address. The address could be Shipping, Home, Office etc. Each address element has multiple values in it, like street and state. Thus each address element is a group, with the state and street in one address element being related to each other. It looks like this in my source XML:

<record>
  <coreInfo id="123" ... />
  <address street="XYZ1" state="CA" ... type="home" />
  <address street="XYZ2" state="CA" ... type="Office" />
  <address street="XYZ3" state="CA" type="Other" />
</record>

I have set up my DIH to treat these as entities as below:

<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8" />
  <document>
    <entity name="f" processor="FileListEntityProcessor"
            baseDir="***" fileName=".*xml" rootEntity="false" dataSource="null">
      <entity name="record" processor="XPathEntityProcessor" stream="false"
              forEach="/record" url="${f.fileAbsolutePath}">
        <field column="ID" xpath="/record/@id" />
        <!-- Address -->
        <entity name="record_adr" processor="XPathEntityProcessor" stream="false"
                forEach="/record/address" url="${f.fileAbsolutePath}">
          <field column="address_street" xpath="/record/address/@street" />
          <field column="address_state" xpath="/record/address//@state" />
          <field column="address_type" xpath="/record/address//@type" />
        </entity>
      </entity>
    </entity>
  </document>
</dataConfig>

The problem is as follows. DIH seems to treat these as entities, but Solr seems to flatten them out on indexing into fields of a single document (losing the entity part). So when I search for an ID, in the response all the street fields are bunched together, followed by all the state fields, type etc. Thus I can't associate which street address corresponds to which address type in the response. What seems harder is this: say I need to query on street=XYZ1 and type=Office. This should NOT return a document, since the street for the Office address is XYZ2 and not XYZ1. However, when I query for address_street:XYZ1 AND address_type:Office I get back this document. The problem seems to be that while DIH allows 'entities' within a document, the SOLR schema does not preserve them: it 'flattens' all of them out as fields of the document. I could work around the problem by creating SOLR fields like home_address_street and office_address_street and do some XPath mapping. However I don't want to do that, as we can have multiple 'other' addresses. Also I have other fields whose type is not easily distinguished, like address.

As I mentioned, being new to SOLR I might have completely goofed on a way to set it up; I'd much appreciate any direction on it. I am using SOLR 1.3. Regards, Guna -- === Fergus
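To make the flattening concrete, here is a sketch of what the indexed document would look like in a query response, with the values bunched per field and the per-address grouping lost (field names follow the config in this thread):

```xml
<doc>
  <str name="ID">123</str>
  <!-- parallel lists: nothing records that XYZ2 belongs with "Office" -->
  <arr name="address_street"><str>XYZ1</str><str>XYZ2</str><str>XYZ3</str></arr>
  <arr name="address_state"><str>CA</str><str>CA</str><str>CA</str></arr>
  <arr name="address_type"><str>home</str><str>Office</str><str>Other</str></arr>
</doc>
```

This is why a query combining address_street:XYZ1 with address_type:Office still matches: both values are present somewhere in the same document.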
Re: How to make Relationships work for Multi-valued Index Fields?
Nesting one XPathEntityProcessor inside another XPathEntityProcessor is possible only if a field in the XML is a filename/url. What is the purpose of nesting like this? Is it because you have multiple addresses? The possible solutions are discussed elsewhere in this thread.

On Sat, Jan 24, 2009 at 2:41 PM, Fergus McMenemie fer...@twig.me.uk wrote: Hello, I am also a newbie and was wanting to do almost the exact same thing. I was planning on doing the equivalent of the configuration shown earlier in this thread, with rootEntity="false" and commonField="true" added to the record entity. ID is no longer unique within Solr; there would be multiple documents with a given ID, one for each address. You can then search on ID and get the three addresses, and you can also search on an address more sensibly. I have not been able to try this yet as other issues are still to be dealt with. Comments?
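For the flattened single-document approach being discussed, the address columns would also need to be declared multi-valued in schema.xml; a sketch (the string type and indexed/stored settings are assumptions):

```xml
<!-- schema.xml sketch; types and options are assumptions -->
<field name="ID"             type="string" indexed="true" stored="true"/>
<field name="address_street" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="address_state"  type="string" indexed="true" stored="true" multiValued="true"/>
<field name="address_type"   type="string" indexed="true" stored="true" multiValued="true"/>
```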
Re: Master failover - seeking comments
Did you look at the new in-built replication? http://wiki.apache.org/solr/SolrReplication#head-0e25211b6ef50373fcc2f9a6ad40380c169a5397 It can help you decide where to replicate from during runtime. Look at the snappull command; you can pass the masterUrl at the time of replication.

On Fri, Jan 23, 2009 at 7:55 PM, edre...@ha edre...@homeaway.com wrote: Thanks for the response. Let me clarify things a bit.

Regarding the slaves: our project is a web application. It is our desire to embed Solr into the web application. The web applications are configured with a local embedded Solr instance configured as a slave, and a remote Solr instance configured as a master. We have a requirement for real-time updates to the Solr indexes. Our strategy is to use the local embedded Solr instance as a read-only repository. Any time a write is made, we will send it to the remote master. Once a user pushes a write operation to the remote master, all subsequent read operations for this user are made against the master for the duration of the session. This approximates real-time updates and seems to work for our purposes. Writes to our system are a small percentage of read operations. Now, back to the original question: we're simply looking for a failover solution if the master server goes down. Oh, and we are using the replication scripts to sync the servers.

It seems like you are trying to write to Solr directly from your front-end application. This is why you are thinking of multiple masters. I'll let others comment on how easy/hard/correct the solution would be.

Well, yes. We have business requirements that want updates to Solr to be real-time, or as close to that as possible, so when a user changes something, our strategy was to save it to the DB and push it to the Solr master as well. Although, we will have a background application that will help ensure that Solr is in sync with the DB for times when Solr is down and the DB is not.

But do you really need to have live writes? Can they be channeled through a background process? Since you anyway cannot do a commit per write, the advantage of live writes is minimal. Moreover, you would need to invest a lot of time in handling availability concerns to avoid losing updates. If you log/record the write requests to an intermediate store (or queue), you can do with one master (with another host on standby acting as a slave).

We do need to have live writes, as I mentioned above. The concern you mention about losing live writes is exactly why we are looking at a master Solr server failover strategy. We thought about having a backup Solr server that is a slave to the master and could be easily reconfigured as a new master in a pinch. Our operations team has pushed us to come up with a solution that would be more seamless. This is why we came up with a master/master solution where both masters are also slaves to each other. To test this, I ran the following scenario:

1) Slave 1 (S1) is configured to use M2 as its master.
2) We push an update to M2.
3) We restart S1, now pointing to M1.
4) We wait for M1 to sync from M2.
5) We then sync S1 to M1.
6) Success!

How do you co-ordinate all this?

This was just a test scenario I ran manually to see if the setup I described above would even work. Is there a wiki page that outlines typical web application Solr deployment strategies? There are a lot of questions on the forum about this type of thing (including this one). For those who have expertise in this area, I'm sure there are many who could benefit from this (hint hint). As before, any comments or suggestions on the above would be much appreciated. Thanks, Erik -- View this message in context: http://www.nabble.com/Master-failover---seeking-comments-tp21614750p21625324.html Sent from the Solr - User mailing list archive at Nabble.com. -- --Noble Paul
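A sketch of the slave-side configuration for the in-built replication mentioned at the top of this message; host names and the poll interval are placeholders:

```xml
<!-- solrconfig.xml on a slave; masterUrl and pollInterval are placeholders -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master1:8983/solr/replication</str>
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>
```

Because snappull accepts a masterUrl parameter at request time, a slave could be pointed at the surviving master during failover without a restart, i.e. a request to /replication with command=snappull and masterUrl set to the other master.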
Re: Random queries extremely slow
Use multiple boxes, with a mirroring delay from one to another, like a pipeline.

2009/1/22 oleg_gnatovskiy oleg_gnatovs...@citysearch.com Well, this probably isn't the cause of our random slow queries, but might be the cause of the slow queries after pulling a new index. Is there anything we could do to reduce the performance hit we take from this happening?

Otis Gospodnetic wrote: Here is one example: pushing a large, newly optimized index onto the server. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message - From: oleg_gnatovskiy oleg_gnatovs...@citysearch.com To: solr-user@lucene.apache.org Sent: Thursday, January 22, 2009 2:22:51 PM Subject: Re: Random queries extremely slow

What are some things that could happen to force files out of the cache on a Linux machine? I don't know what kinds of events to look for...

yonik wrote: On Thu, Jan 22, 2009 at 1:46 PM, oleg_gnatovskiy wrote: Hello. Our production servers are operating relatively smoothly most of the time, running Solr with 19 million listings. However, every once in a while the same query that used to take 100 milliseconds takes 6000.

Anything else happening on the system that may have forced some of the index files out of the operating system disk cache at these times? -Yonik

-- View this message in context: http://www.nabble.com/Random-queries-extremely-slow-tp21610568p21611240.html Sent from the Solr - User mailing list archive at Nabble.com. -- View this message in context: http://www.nabble.com/Random-queries-extremely-slow-tp21610568p21611454.html Sent from the Solr - User mailing list archive at Nabble.com. -- Alexander Ramos Jardim
Re: Results not appearing
They all appear in the stats admin page under the numDocs/maxDocs fields. I don't explicitly send a commit command, but my posting ends like this (suggesting they are committed):

SimplePostTool: POSTing file 21166.xml
SimplePostTool: POSTing file 21169.xml
SimplePostTool: COMMITting Solr index changes..

I just tried re-posting all the documents set as text -- will that update the currently indexed documents? (bearing in mind the unique key, Message-ID, will be included again) When I try searching I still get 0 results for anything included in the Message-ID and Content fields, both of which should be indexed and returning results... Cheers for any help!

ryguasu wrote: These might be obvious, but: * I assume you did a Solr commit command after indexing, right? * If you are using the fieldtype definitions from the default schema.xml, then your string fields are not being analyzed, which means you should expect search results only if you enter the entire, exact value of one of the Message-ID or Date fields in your query. Is that your intention? And yes, your analysis of stored seems correct. Stored fields are those whose values you need back at query time, and indexed fields are those you can do queries on. For a few complications, see http://wiki.apache.org/solr/FieldOptionsByUseCase

On Fri, Jan 23, 2009 at 8:04 PM, Johnny X jonathanwel...@gmail.com wrote: I've indexed my XML using the below in the schema:

<field name="Message-ID" type="string" indexed="true" stored="true" required="true"/>
<field name="Date" type="string" indexed="false" stored="true"/>
<field name="From" type="string" indexed="false" stored="true"/>
<field name="To" type="string" indexed="false" stored="true"/>
<field name="Subject" type="string" indexed="false" stored="true"/>
<field name="Mime-Version" type="string" indexed="false" stored="true"/>
<field name="Content-Type" type="string" indexed="false" stored="true"/>
<field name="Content-Transfer-Encoding" type="string" indexed="false" stored="true"/>
<field name="X-From" type="string" indexed="false" stored="true"/>
<field name="X-To" type="string" indexed="false" stored="true"/>
<field name="X-cc" type="string" indexed="false" stored="true"/>
<field name="X-bcc" type="string" indexed="false" stored="true"/>
<field name="X-Folder" type="string" indexed="false" stored="true"/>
<field name="X-Origin" type="string" indexed="false" stored="true"/>
<field name="X-FileName" type="string" indexed="false" stored="true"/>
<field name="Content" type="string" indexed="true" stored="true"/>

<uniqueKey>Message-ID</uniqueKey>

However, searching via the Message-ID or Content fields returns 0 results. Using Luke I can still see these fields are stored, however. Out of interest, by setting the other fields to just stored="true", can they be returned in a query as part of a search? Cheers. -- View this message in context: http://www.nabble.com/Results-not-appearing-tp21637069p21637069.html Sent from the Solr - User mailing list archive at Nabble.com. -- View this message in context: http://www.nabble.com/Results-not-appearing-tp21637069p21640562.html Sent from the Solr - User mailing list archive at Nabble.com.
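A sketch of the change ryguasu's point suggests, assuming the stock "text" fieldType from the default schema.xml is available:

```xml
<!-- schema.xml sketch: an analyzed alternative for full-text search.
     "text" here is the tokenized fieldType shipped in the default schema. -->
<field name="Content" type="text" indexed="true" stored="true"/>
```

With type="string" the whole field value is indexed as a single token, so only an exact match on the entire content succeeds; with an analyzed type, the individual words inside the field become searchable.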
Re: How to make Relationships work for Multi-valued Index Fields?
Hi Fergus, XPathEntityProcessor can read multivalued fields easily, e.g.:

<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8" />
  <document>
    <entity name="f" processor="FileListEntityProcessor"
            baseDir="***" fileName=".*xml" rootEntity="false" dataSource="null">
      <entity name="record" processor="XPathEntityProcessor"
              forEach="/record" url="${f.fileAbsolutePath}">
        <!-- ***change*** -->
        <field column="ID" xpath="/record/@id" commonField="true"/>
        <field column="address_street" xpath="/record/address/@street" />
        <field column="address_state" xpath="/record/address/@state" />
        <field column="address_type" xpath="/record/address/@type" />
      </entity>
    </entity>
  </document>
</dataConfig>

In this case address_street, address_state and address_type will all be returned as separate lists while parsing. If you wish to put them into multiple fields, you can write a transformer, iterate through the lists, and put the values into separate fields. If there are 3 address tags then you get a List<String> for each field, where the length of the list == 3. If an item is missing it is added as a null. Ensure that the fields are marked as multiValued="true" in the schema.xml; otherwise it does not return a List<String>. If there is no corresponding mapping in schema.xml you can explicitly put it here in the dataconfig.xml, e.g.:

<field column="address_state" multiValued="true" xpath="/record/address/@state" />

I saw the syntax '/record/address//@state'. '//' is not supported; you will have to give the full path explicitly. --Noble

On Sat, Jan 24, 2009 at 2:57 PM, Noble Paul നോബിള്‍ नोब्ळ् noble.p...@gmail.com wrote: nesting of an XPathEntityProcessor into another XPathEntityProcessor is possible only if a field in an xml is a filename/url. what is the purpose of nesting like this? is it because you have multiple addresses? the possible solutions are discussed elsewhere in this thread

On Sat, Jan 24, 2009 at 2:41 PM, Fergus McMenemie fer...@twig.me.uk wrote: Hello, I am also a newbie and was wanting to do almost the exact same thing.
Re: Results not appearing
If it helps, everything appears when I use Luke to search through the index... but the search in that returns nothing either. When I search using the admin page for the word 'Phillip' (which appears the most in all of the documents) I get the following:

<?xml version="1.0" encoding="UTF-8" ?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
    <lst name="params">
      <str name="indent">on</str>
      <str name="start">0</str>
      <str name="q">phillip</str>
      <str name="rows">10</str>
      <str name="version">2.2</str>
    </lst>
  </lst>
  <result name="response" numFound="0" start="0" />
</response>

Duh...?

Johnny X wrote: They all appear in the stats admin page under the numDocs/maxDocs fields. I don't explicitly send a commit command, but my posting ends like this (suggesting they are committed): SimplePostTool: POSTing file 21166.xml SimplePostTool: POSTing file 21169.xml SimplePostTool: COMMITting Solr index changes.. I just tried re-posting all the documents set as text -- will that update the currently indexed documents? (bearing in mind the unique key, Message-ID, will be included again) When I try searching I still get 0 results for anything included in the Message-ID and Content fields, both of which should be indexed and returning results... Cheers for any help!

ryguasu wrote: These might be obvious, but: * I assume you did a Solr commit command after indexing, right? * If you are using the fieldtype definitions from the default schema.xml, then your string fields are not being analyzed, which means you should expect search results only if you enter the entire, exact value of one of the Message-ID or Date fields in your query. Is that your intention? And yes, your analysis of stored seems correct. Stored fields are those whose values you need back at query time, and indexed fields are those you can do queries on.
Re: solr-duplicate post management
On Thu, Jan 22, 2009 at 2:33 PM, S.Selvam Siva s.selvams...@gmail.com wrote: On Thu, Jan 22, 2009 at 7:12 AM, Chris Hostetter hossman_luc...@fucit.org wrote:

: what i need is, to log the existing urlid and the new urlid (of course both will not be the same), when a .xml file of the same id (unique field) is posted.
: I want to make this by modifying the solr source. Which file do i need to modify so that i could get the above details in the log?
: I tried with DirectUpdateHandler2.java (which removes the duplicate entries), but efforts in vain.

DirectUpdateHandler2.java (on the trunk) delegates to Lucene-Java's IndexWriter.updateDocument method when you have a uniqueKey and you aren't allowing duplicates -- this method doesn't give you any way to access the old document(s) that had that existing key. The easiest way to make a change like what you are interested in might be an UpdateProcessor that does a lookup/search for the uniqueKey of each document about to be added, to see if it already exists. That's probably about as efficient as you can get, and it would be nicely encapsulated. You might also want to take a look at SOLR-799, where some work is being done to create UpdateProcessors that can do near-duplicate detection... http://wiki.apache.org/solr/Deduplication https://issues.apache.org/jira/browse/SOLR-799 -Hoss

Hi, I added some code to DirectUpdateHandler2.java's doDeletions() (Solr 1.2.0) and got the solution I wanted (logging the duplicate post entry, i.e. the old field and new field of the duplicate post):

Document d1 = searcher.doc(prev);         // existing doc, to be deleted
Document d2 = searcher.doc(tdocs.doc());  // new doc
String oldname = d1.get("name");
String id1 = d1.get("id");
String newname = d2.get("name");
String id2 = d1.get("id");
out3.write(id1 + "," + oldname + "," + newname + "\n");

But I don't know whether the performance of Solr will be affected by this. Any comment on the performance issue for the above solution is welcome... -- Yours, S.Selvam
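The SOLR-799 / Deduplication approach Hoss points to is configured as an update processor chain; a sketch following the style described on the wiki page (the chain name and the "fields" list are placeholders):

```xml
<!-- solrconfig.xml sketch per the Deduplication wiki page;
     the chain name and "name,content" field list are placeholders -->
<updateRequestProcessorChain name="dedupe">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <bool name="overwriteDupes">true</bool>
    <str name="fields">name,content</str>
    <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```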
Re: faceting question
Is there no other way than to use the patch, since query A is a superset of B? If it is not doable, I will probably use some caching technique. Best.

On Sat, Jan 24, 2009 at 9:14 AM, Shalin Shekhar Mangar shalinman...@gmail.com wrote: On Sat, Jan 24, 2009 at 6:56 AM, Cam Bazz camb...@gmail.com wrote: Hello; I have a multivalued field named tagList which may contain multiple tags. I am making a query like: tagList:a AND tagList:b AND tagList:c and I am also getting a tagList facet returning me some values. What I would like is for Solr to return me facets as if the query were: tagList:a AND tagList:b. Is that even possible?

If I understand correctly: 1. You want to query for tagList:a AND tagList:b AND tagList:c. 2. At the same time, you want to request facets for tagList, but only for tagList:a AND tagList:b. If that is correct, you can use the features introduced by https://issues.apache.org/jira/browse/SOLR-911 However, you may need to put #1 as fq instead of q. -- Regards, Shalin Shekhar Mangar.
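The SOLR-911 features Shalin refers to are the tag/exclude local params; a sketch of the request parameters for this thread's case, where only the tagList:c constraint is excluded from faceting (syntax assumes the patch as described in the issue):

```
q=*:*
fq=tagList:a AND tagList:b
fq={!tag=xc}tagList:c
facet=true
facet.field={!ex=xc}tagList
```

The main result set is still filtered by all three tags, while the facet counts for tagList are computed as if only tagList:a AND tagList:b applied.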
Re: Results not appearing
I should clarify that I misspoke before; I thought you had indexed=true on Message-Id and Date, whereas you had it on Message-Id and Content. It sounds like you figured this out and interpreted my reply in a useful way nonetheless, though. So that's good. The post tool should be a valid way to commit. As for your technique of updating the field types and reindexing the documents, I think it should be fine provided you kept the field type for the Message-Id field as string. If you changed it to text along with the other field types, then there's a chance your update technique might instead of the effect of inserting a duplicate copy of each document, so there are two copies of each document, one searchable, and one not searchable. (I'm not totally sure about this, but it's a worry I would have.) That doesn't sound like what's happened to you, though. Could the problem be that you're not specifying which field to query? If you're using the standard query analyzer and the stock schema.xml, then the default field name is text, whereas you don't have a field called text in your schema. In that setup if you want to search on the Content field you need to say so explicitly, like so: Content:phillip On Sat, Jan 24, 2009 at 7:25 AM, Johnny X jonathanwel...@gmail.com wrote: If it helps, everything appears when I use Luke to search through the index...but the search in that returns nothing either. When I search using the admin page for the word 'Phillip' (which appears the most in all of the documents) I get the following: ?xml version=1.0 encoding=UTF-8 ? - response - lst name=responseHeader int name=status0/int int name=QTime0/int - lst name=params str name=indenton/str str name=start0/str str name=qphillip/str str name=rows10/str str name=version2.2/str /lst /lst result name=response numFound=0 start=0 / /response Duh...? Johnny X wrote: They all appear in the stats admin page under the NumDocs maxDocs fields. 
I don't explicitly send a commit command, but my posting ends like this (suggesting they are committed): SimplePostTool: POSTing file 21166.xml SimplePostTool: POSTing file 21169.xml SimplePostTool: COMMITting Solr index changes.. I just tried re-posting all the documents set as text -- will that update the current documents indexed? (bearing in mind the unique key, Message-ID, will be included again) When I try searching I still get 0 results for anything included in the Message-ID and Content fields, both of which should be indexed and returning results... Cheers for any help!

ryguasu wrote: These might be obvious, but:
* I assume you did a Solr commit command after indexing, right?
* If you are using the fieldtype definitions from the default schema.xml, then your string fields are not being analyzed, which means you should expect search results only if you enter the entire, exact value of one of the Message-ID or Date fields in your query. Is that your intention?

And yes, your analysis of stored seems correct. Stored fields are those whose values you need back at query time, and indexed fields are those you can do queries on.
For a few complications, see http://wiki.apache.org/solr/FieldOptionsByUseCase

On Fri, Jan 23, 2009 at 8:04 PM, Johnny X jonathanwel...@gmail.com wrote: I've indexed my XML using the below in the schema:

<field name="Message-ID" type="string" indexed="true" stored="true" required="true"/>
<field name="Date" type="string" indexed="false" stored="true"/>
<field name="From" type="string" indexed="false" stored="true"/>
<field name="To" type="string" indexed="false" stored="true"/>
<field name="Subject" type="string" indexed="false" stored="true"/>
<field name="Mime-Version" type="string" indexed="false" stored="true"/>
<field name="Content-Type" type="string" indexed="false" stored="true"/>
<field name="Content-Transfer-Encoding" type="string" indexed="false" stored="true"/>
<field name="X-From" type="string" indexed="false" stored="true"/>
<field name="X-To" type="string" indexed="false" stored="true"/>
<field name="X-cc" type="string" indexed="false" stored="true"/>
<field name="X-bcc" type="string" indexed="false" stored="true"/>
<field name="X-Folder" type="string" indexed="false" stored="true"/>
<field name="X-Origin" type="string" indexed="false" stored="true"/>
<field name="X-FileName" type="string" indexed="false" stored="true"/>
<field name="Content" type="string" indexed="true" stored="true"/>
<uniqueKey>Message-ID</uniqueKey>

However searching via the Message-ID or Content fields returns 0 results. Using Luke I can still see these fields are stored, however. Out of interest, by setting the other fields to just stored=true, can they be returned in a query as part of a search? Cheers.

-- View this message in context: http://www.nabble.com/Results-not-appearing-tp21637069p21637069.html Sent from the Solr - User mailing list archive at Nabble.com.

-- View this message in context: http://www.nabble.com/Results-not-appearing-tp21637069p21641692.html Sent from the Solr - User mailing list archive at Nabble.com.
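One way to avoid having to qualify every query as suggested above is to point the default search field at Content in schema.xml. A sketch, using the field name from the schema quoted in this thread (Solr 1.x syntax):

```xml
<!-- schema.xml: unqualified query terms will now search the Content field -->
<defaultSearchField>Content</defaultSearchField>
```

With this in place, a bare query for phillip behaves like Content:phillip.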
Re: Solr stemming - preserve original words
I still don't understand your final goal, but if you want to get an output in the form of run(40) = 20 from running, 10 from run, 8 from runners, 2 from runner, you need to index your documents using the standard analyzer, walk through the index using org.apache.lucene.index.IndexReader, and stem each term using a stemmer. Storing the stem (key) and the original word list (value) in a map will give that kind of output.

However, if seeing something like the following list (not exactly what you want, but similar) on schema.jsp would help you: run=run run=running run=runner run=runners then add one line of code, newstr = newstr + "=" + new String(termBuffer, 0, len); to org.apache.solr.analysis.EnglishPorterFilterFactory.java between lines #116 and #117. Rename the file, compile the code, and put your jar file in the lib directory under your Solr home. Now you can use your new FilterFactory in your schema.xml.

--- On Sat, 1/24/09, Thushara Wijeratna thu...@gmail.com wrote: From: Thushara Wijeratna thu...@gmail.com Subject: Re: Solr stemming - preserve original words To: solr-user@lucene.apache.org, iori...@yahoo.com Date: Saturday, January 24, 2009, 1:53 AM Chris, Ahmet - thanks for the responses. Ahmet - yes, I want to see run as a top term + the original words that formed that term. The reason is that due to mis-stemming, the terms could become non-English; e.g. permanent would stem to perm, archive would become archiv. I need to extract a set of keywords from the indexed content - I'd like these to be correct full English words. Thanks, thushara
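The walk-the-terms idea above can be sketched as follows. This is a standalone illustration of grouping original surface forms under their stem; the crude suffix-stripping stem() is only a stand-in for the real Porter stemmer, and a real version would pull its terms from IndexReader rather than from a list.

```java
import java.util.*;

// Sketch: map each stem to the original words that collapsed into it.
// stem() here is a toy stand-in for the actual Porter stemmer.
public class StemGroups {
    static String stem(String w) {
        if (w.endsWith("ning")) return w.substring(0, w.length() - 4); // running -> run
        if (w.endsWith("ners")) return w.substring(0, w.length() - 4); // runners -> run
        if (w.endsWith("ner"))  return w.substring(0, w.length() - 3); // runner  -> run
        return w;
    }

    // Group the surface forms by their stem, keeping both sides sorted.
    public static Map<String, Set<String>> group(List<String> terms) {
        Map<String, Set<String>> m = new TreeMap<>();
        for (String t : terms) {
            m.computeIfAbsent(stem(t), k -> new TreeSet<>()).add(t);
        }
        return m;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("run", "running", "runner", "runners");
        System.out.println(group(terms)); // {run=[run, runner, runners, running]}
    }
}
```

Counting term frequencies per original word (the "20 from running" part) would additionally need the document frequencies from the index, which the real IndexReader walk provides.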
size of solr update document a limitation?
Hello Solr experts, is it good practice to post large Solr update documents (e.g. 100KB-2MB)? Will Solr do the necessary tricks to make the field use a Reader instead of Strings? Thanks in advance, Paul
Re: Results not appearing
Thanks for the reply. I ended up fixing it by re-installing Tomcat and starting over. Searches now appear to work. Because I'm testing at the moment, though: is it possible to delete the index and start afresh in future? At the moment I backed up the original index folder... if I just replace the current one with that backup, will it work, or will other parts of Solr recognise it's changed and as a result not work? What's the best solution for removing the index? Cheers.

ryguasu wrote: I should clarify that I misspoke before; I thought you had indexed=true on Message-Id and Date, whereas you had it on Message-Id and Content. It sounds like you figured this out and interpreted my reply in a useful way nonetheless, though. So that's good. The post tool should be a valid way to commit. [...]
Re: Results not appearing
Without stopping Solr itself, a Solr client can remove all the documents in an index by doing a delete-by-query with the query *:* (without quotes). For XML interface clients, see http://wiki.apache.org/solr/UpdateXmlMessage. Solrj would have another way to do it. You'll need to do a commit after this to flush your changes.

Alternatively, you can stop Solr and delete the whole data/ directory, which includes the index directory. If you do this, Solr will create a fresh new one the next time it starts up.

For backups it might be a better habit to back up the data/ directory rather than just the data/index directory. Assuming your schema.xml hasn't changed, you should be able to replace one data/ directory with another. If you're changing your schema file, though, you need to make sure you restore a version of that file that is consistent with the one that you indexed with.

On Sat, Jan 24, 2009 at 5:43 PM, Johnny X jonathanwel...@gmail.com wrote: Thanks for the reply. I ended up fixing it by re-installing Tomcat and starting over. Searches now appear to work. [...]
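The delete-everything sequence described above, expressed as XML update messages (each posted separately to the update handler, e.g. with the post tool):

```xml
<!-- First post: remove every document in the index -->
<delete><query>*:*</query></delete>

<!-- Second post: commit, so the deletion becomes visible to searches -->
<commit/>
```

After the commit, numDocs on the stats page should drop to 0 while the data/ directory remains in place.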
Re: How to make Relationships work for Multi-valued Index Fields?
I made this approach work with XPath and XSL. However, this approach creates multiple fields like this: address_state_1 address_state_2 ... address_state_10 and credit_card_1 credit_card_2 credit_card_3. How do I search for a credit card? The query syntax does not seem to support wildcards in field names; e.g. I can't seem to do this: credit_card*:1234 4567 7890 1234. On the search side I would not know how many credit card fields got created for a document, so I need that to be dynamic. -g

On Jan 22, 2009, at 11:54 PM, Shalin Shekhar Mangar wrote: Oops, one more gotcha. The dynamic field support is only in 1.4 trunk.

On Fri, Jan 23, 2009 at 1:24 PM, Shalin Shekhar Mangar shalinman...@gmail.com wrote: On Fri, Jan 23, 2009 at 1:08 PM, Gunaranjan Chandraraju chandrar...@apple.com wrote:

<record>
  <coreInfo id="123" ... />
  <address street="XYZ1" State="CA" ... type="home"/>
  <address street="XYZ2" state="CA" ... type="Office"/>
  <address street="XYZ3" state="CA" type="Other"/>
</record>

I have set up my DIH to treat these as entities as below:

<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document>
    <entity name="f" processor="FileListEntityProcessor" baseDir="***"
            fileName=".*xml" rootEntity="false" dataSource="null">
      <entity name="record" processor="XPathEntityProcessor" stream="false"
              forEach="/record" url="${f.fileAbsolutePath}">
        <field column="ID" xpath="/record/@id"/>
        <!-- Address -->
        <entity name="record_adr" processor="XPathEntityProcessor" stream="false"
                forEach="/record/address" url="${f.fileAbsolutePath}">
          <field column="address_street" xpath="/record/address/@street"/>
          <field column="address_state" xpath="/record/address//@state"/>
          <field column="address_type" xpath="/record/address//@type"/>
        </entity>
      </entity>
    </entity>
  </document>
</dataConfig>

I think the only way is to create a dynamic field for each attribute (street, state, etc.). Write a transformer to copy the fields from your data config to appropriately named dynamic fields (e.g. street_1, state_1, etc.).
To maintain this counter you will need to get/store it with Context#getSessionAttribute(name, val, Context.SCOPE_DOC) and Context#setSessionAttribute(name, val, Context.SCOPE_DOC). I can't think of an easier way. -- Regards, Shalin Shekhar Mangar. -- Regards, Shalin Shekhar Mangar.
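The renaming-with-a-counter logic a DIH Transformer would perform can be sketched standalone. A real Transformer implements transformRow(Map, Context) and keeps the counter in a SCOPE_DOC session attribute as described above; this sketch (class and method names are illustrative) just shows the numbering step over a list of child-entity rows.

```java
import java.util.*;

// Sketch: turn repeated child-entity rows into numbered dynamic fields,
// e.g. two address rows become address_state_1 and address_state_2.
// In a real DIH Transformer, "counter" would be the SCOPE_DOC attribute.
public class NumberingSketch {
    public static Map<String, String> numberRows(List<Map<String, String>> rows) {
        Map<String, String> doc = new LinkedHashMap<>();
        int counter = 0; // per-document counter
        for (Map<String, String> row : rows) {
            counter++;
            for (Map.Entry<String, String> e : row.entrySet()) {
                doc.put(e.getKey() + "_" + counter, e.getValue());
            }
        }
        return doc;
    }

    public static void main(String[] args) {
        List<Map<String, String>> rows = Arrays.asList(
            Collections.singletonMap("address_state", "CA"),
            Collections.singletonMap("address_state", "NY"));
        System.out.println(numberRows(rows)); // {address_state_1=CA, address_state_2=NY}
    }
}
```

The numbered names then match a dynamicField pattern like address_state_* in schema.xml.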
Re: How to make Relationships work for Multi-valued Index Fields?
For searching you need to put them in a single field. Use copyField in schema.xml to achieve that.

On Sun, Jan 25, 2009 at 7:39 AM, Gunaranjan Chandraraju chandrar...@apple.com wrote: I made this approach work with XPath and XSL. However, this approach creates multiple fields like this: address_state_1 address_state_2 ... address_state_10 and credit_card_1 credit_card_2 credit_card_3. How do I search for a credit card? The query syntax does not seem to support wildcards in field names; e.g. I can't seem to do this: credit_card*:1234 4567 7890 1234. On the search side I would not know how many credit card fields got created for a document, so I need that to be dynamic. -g [...] -- --Noble Paul
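A schema.xml sketch of the copyField approach suggested above, assuming the numbered fields are declared via a dynamicField pattern (the field names and types here are illustrative, not from the original thread):

```xml
<!-- Numbered dynamic fields produced by the transformer (stored only) -->
<dynamicField name="credit_card_*" type="string" indexed="false" stored="true"/>

<!-- One searchable catch-all field -->
<field name="credit_card_all" type="string" indexed="true" stored="false" multiValued="true"/>

<!-- Funnel every credit_card_* value into the catch-all at index time -->
<copyField source="credit_card_*" dest="credit_card_all"/>
```

A query then targets the single field, e.g. credit_card_all:"1234 4567 7890 1234", regardless of how many numbered fields a document got.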