Re: What is the right way to bring a failed SolrCloud node back online?
We are working on a new mode (which should become the default) where ZooKeeper will be treated as the truth for a cluster. This mode will be able to handle situations like this - if the cluster state says a core should exist on a node and it doesn’t, it will be created on startup. The way things work currently is this kind of hybrid situation where the truth is partly in ZooKeeper and partly on each node. This is not ideal at all. I think this new mode is very important, and it will be coming shortly. Until then, I’d recommend writing this logic externally as you suggest (I’ve seen it done before). - Mark http://about.me/markrmiller On Jan 24, 2014, at 12:01 PM, Nathan Neulinger nn...@neulinger.org wrote: I have an environment where new collections are being added frequently (isolated per customer), and the backup is virtually guaranteed to be missing some of them. As it stands, bringing up the restored/out-of-date instance results in those collections being stuck in 'Recovering' state, because the cores don't exist on the resulting server. This can also be extended to the case of restoring a completely blank instance. Is there any way to tell SolrCloud "try recreating any missing cores for this collection based on where you know they should be located"? Or do I need to actually determine a list of cores (..._shardX_replicaY) and trigger the core creates myself, at which point I gather that it will start recovery for each of them? -- Nathan Nathan Neulinger nn...@neulinger.org Neulinger Consulting (573) 612-1412
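For anyone landing on this thread later, here is a rough sketch of the kind of external logic being discussed above (this is not the SOLR-5665 script; it assumes SolrJ 4.x, and the ZooKeeper address, node name and URLs are placeholders you would substitute):

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.CoreAdminRequest;
import org.apache.solr.common.cloud.*;

public class RecreateMissingCores {
  public static void main(String[] args) throws Exception {
    String zkHost = "zk1:2181,zk2:2181,zk3:2181";
    String nodeName = "host1:8983_solr";             // the node being restored
    String coreAdminUrl = "http://host1:8983/solr";  // its CoreAdmin endpoint

    CloudSolrServer cloud = new CloudSolrServer(zkHost);
    cloud.connect();
    ClusterState state = cloud.getZkStateReader().getClusterState();

    HttpSolrServer admin = new HttpSolrServer(coreAdminUrl);
    for (String collection : state.getCollections()) {
      for (Slice slice : state.getSlices(collection)) {
        for (Replica replica : slice.getReplicas()) {
          if (!nodeName.equals(replica.getStr("node_name"))) continue;
          String coreName = replica.getStr("core");  // e.g. mycoll_shard1_replica2
          // A real script would first check CoreAdmin STATUS and skip cores
          // that already exist locally; creating the core kicks off recovery.
          CoreAdminRequest.Create create = new CoreAdminRequest.Create();
          create.setCoreName(coreName);
          create.setCollection(collection);
          create.setShardId(slice.getName());
          admin.request(create);
        }
      }
    }
    cloud.shutdown();
  }
}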
Re: Solr server requirements for 100+ million documents
Dumping the raw data would probably be a good idea. I guarantee you'll be re-indexing the data several times as you change the schema to accommodate different requirements... But it may also be worth spending some time figuring out why the DB access is slow. Sometimes one can tune that. If you go the SolrJ route, you also have the possibility of setting up N clients to work simultaneously, sometimes that'll help. FWIW, Erick On Sat, Jan 25, 2014 at 11:06 PM, Susheel Kumar susheel.ku...@thedigitalgroup.net wrote: Hi Kranti, Attach are the solrconfig schema xml for review. I did run indexing with just few fields (5-6 fields) in schema.xml keeping the same db config but Indexing almost still taking similar time (average 1 million records 1 hr) which confirms that the bottleneck is in the data acquisition which in our case is oracle database. I am thinking to not use dataimporthandler / jdbc to get data from Oracle but to rather dump data somehow from oracle using SQL loader and then index it. Any thoughts? Thnx -Original Message- From: Kranti Parisa [mailto:kranti.par...@gmail.com] Sent: Saturday, January 25, 2014 12:08 AM To: solr-user@lucene.apache.org Subject: Re: Solr server requirements for 100+ million documents can you post the complete solrconfig.xml file and schema.xml files to review all of your settings that would impact your indexing performance. Thanks, Kranti K. Parisa http://www.linkedin.com/in/krantiparisa On Sat, Jan 25, 2014 at 12:56 AM, Susheel Kumar susheel.ku...@thedigitalgroup.net wrote: Thanks, Svante. Your indexing speed using db seems to really fast. Can you please provide some more detail on how you are indexing db records. Is it thru DataImportHandler? And what database? Is that local db? We are indexing around 70 fields (60 multivalued) but data is not populated always in all fields. The average size of document is in 5-10 kbs. -Original Message- From: saka.csi...@gmail.com [mailto:saka.csi...@gmail.com] On Behalf Of svante karlsson Sent: Friday, January 24, 2014 5:05 PM To: solr-user@lucene.apache.org Subject: Re: Solr server requirements for 100+ million documents I just indexed 100 million db docs (records) with 22 fields (4 multivalued) in 9524 sec using libcurl. 11 million took 763 seconds so the speed drops somewhat with increasing dbsize. We write 1000 docs (just an arbitrary number) in each request from two threads. If you will be using solrcloud you will want more writer threads. The hardware is a single cheap hp DL320E GEN8 V2 1P E3-1220V3 with one SSD and 32GB and the solr runs on ubuntu 13.10 inside a esxi virtual machine. /svante 2014/1/24 Susheel Kumar susheel.ku...@thedigitalgroup.net Thanks, Erick for the info. For indexing I agree the more time is consumed in data acquisition which in our case from Database. For indexing currently we are using the manual process i.e. Solr dashboard Data Import but now looking to automate. How do you suggest to automate the index part. Do you recommend to use SolrJ or should we try to automate using Curl? -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Friday, January 24, 2014 2:59 PM To: solr-user@lucene.apache.org Subject: Re: Solr server requirements for 100+ million documents Can't be done with the information you provided, and can only be guessed at even with more comprehensive information. 
Here's why: http://searchhub.org/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/ Also, at a guess, your indexing speed is so slow due to data acquisition; I rather doubt you're being limited by raw Solr indexing. If you're using SolrJ, try commenting out the server.add() bit and running again. My guess is that your indexing speed will be almost unchanged, in which case the data acquisition process is where you should concentrate your efforts. As a comparison, I can index 11M Wikipedia docs on my laptop in 45 minutes without any attempts at parallelization. Best, Erick On Fri, Jan 24, 2014 at 12:10 PM, Susheel Kumar susheel.ku...@thedigitalgroup.net wrote: Hi, Currently we are indexing 10 million documents from the database (10 db data entities); index size is around 8 GB on a Windows virtual box. Indexing in one shot takes 12+ hours, while indexing in parallel in separate cores and merging them together takes 4+ hours. We are looking to scale to 100+ million documents and are looking for recommendations on server requirements for the below parameters for a production environment. There can be 200+ users performing searches at the same time. No of physical servers (considering SolrCloud) Memory requirement Processor requirement (# cores) Linux as OS as opposed to Windows Thanks in advance. Susheel
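As a rough sketch of the timing test Erick describes (SolrJ 4.x assumed; fetchNextBatch() is a placeholder for whatever JDBC or file reading you already do):

import java.util.List;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class IndexingTimer {
  // Placeholder for your own data-acquisition code (JDBC, dump files, ...).
  static List<SolrInputDocument> fetchNextBatch() { return null; }

  public static void main(String[] args) throws Exception {
    HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
    long start = System.currentTimeMillis();
    List<SolrInputDocument> batch;
    while ((batch = fetchNextBatch()) != null) {
      server.add(batch);  // comment out this line and rerun: if the elapsed time
                          // barely changes, data acquisition is the bottleneck
    }
    server.commit();
    System.out.println("Elapsed ms: " + (System.currentTimeMillis() - start));
    server.shutdown();
  }
}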
Re: What is the right way to bring a failed SolrCloud node back online?
Thanks, yeah, I did just that - and sent the script in on SOLR-5665 if anyone wants a copy. The script is trivial, but you're welcome to stick it in contrib or something if it's at all useful to anyone. -- Nathan On 01/26/2014 08:28 AM, Mark Miller wrote: We are working on a new mode (which should become the default) where ZooKeeper will be treated as the truth for a cluster. This mode will be able to handle situations like this - if the cluster state says a core should exist on a node and it doesn’t, it will be created on startup. The way things work currently is this kind of hybrid situation where the truth is partly in ZooKeeper and partly on each node. This is not ideal at all. I think this new mode is very important, and it will be coming shortly. Until then, I’d recommend writing this logic externally as you suggest (I’ve seen it done before). - Mark http://about.me/markrmiller On Jan 24, 2014, at 12:01 PM, Nathan Neulinger nn...@neulinger.org wrote: I have an environment where new collections are being added frequently (isolated per customer), and the backup is virtually guaranteed to be missing some of them. As it stands, bringing up the restored/out-of-date instance results in those collections being stuck in 'Recovering' state, because the cores don't exist on the resulting server. This can also be extended to the case of restoring a completely blank instance. Is there any way to tell SolrCloud "try recreating any missing cores for this collection based on where you know they should be located"? Or do I need to actually determine a list of cores (..._shardX_replicaY) and trigger the core creates myself, at which point I gather that it will start recovery for each of them? -- Nathan Nathan Neulinger nn...@neulinger.org Neulinger Consulting (573) 612-1412 -- Nathan Neulinger nn...@neulinger.org Neulinger Consulting (573) 612-1412
Fwd: Search Engine Framework decision
Hi, I want to create a POC to search our INTRANET along with documents uploaded to the intranet. Documents (PDF, Excel, Word documents, text files, images, videos) also exist on SHAREPOINT. SharePoint has authentication access at the module level (folder level). My intranet website is http://myintranet/ http://sparsh/ and the SharePoint URL is different. Documents also exist in file folders. I have the below queries: A) Which crawler framework do I use along with Solr for this POC, Nutch or Apache ManifoldCF? B) Is it possible to crawl SharePoint documents using Nutch? If yes, would a configuration-level change alone make this possible, or do I have to write code to parse and send to Solr? C) Which versions of Solr+Nutch+MCF should be used? The Nutch version has a dependency on the Solr version. Would Nutch 1.7 work properly with Solr 4.6.0? -- Rashmi Be the change that you want to see in this world! -- Rashmi Be the change that you want to see in this world! www.minnal.zor.org disha.resolve.at www.artofliving.org
RE: Solr server requirements for 100+ million documents
Thank you Erick for your valuable inputs. Yes, we have to re-index data again and again. I'll look into the possibility of tuning db access. On SolrJ and automating the indexing (incremental as well as one time) I want to get your opinion on the below two points. We will be indexing separate sets of tables with similar data structure - Should we use SolrJ and write Java programs that can be scheduled to trigger indexing on demand/schedule based? - Is using SolrJ a better idea even for searching than using SolrNet? Our frontend is in .Net, so we started using SolrNet, but I am afraid that down the road when we scale/support SolrCloud, SolrJ would be better? Thanks Susheel -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Sunday, January 26, 2014 8:37 AM To: solr-user@lucene.apache.org Subject: Re: Solr server requirements for 100+ million documents Dumping the raw data would probably be a good idea. I guarantee you'll be re-indexing the data several times as you change the schema to accommodate different requirements... But it may also be worth spending some time figuring out why the DB access is slow. Sometimes one can tune that. If you go the SolrJ route, you also have the possibility of setting up N clients to work simultaneously, sometimes that'll help. FWIW, Erick On Sat, Jan 25, 2014 at 11:06 PM, Susheel Kumar susheel.ku...@thedigitalgroup.net wrote: Hi Kranti, Attach are the solrconfig schema xml for review. I did run indexing with just few fields (5-6 fields) in schema.xml keeping the same db config but Indexing almost still taking similar time (average 1 million records 1 hr) which confirms that the bottleneck is in the data acquisition which in our case is oracle database. I am thinking to not use dataimporthandler / jdbc to get data from Oracle but to rather dump data somehow from oracle using SQL loader and then index it. Any thoughts? Thnx -Original Message- From: Kranti Parisa [mailto:kranti.par...@gmail.com] Sent: Saturday, January 25, 2014 12:08 AM To: solr-user@lucene.apache.org Subject: Re: Solr server requirements for 100+ million documents can you post the complete solrconfig.xml file and schema.xml files to review all of your settings that would impact your indexing performance. Thanks, Kranti K. Parisa http://www.linkedin.com/in/krantiparisa On Sat, Jan 25, 2014 at 12:56 AM, Susheel Kumar susheel.ku...@thedigitalgroup.net wrote: Thanks, Svante. Your indexing speed using db seems to really fast. Can you please provide some more detail on how you are indexing db records. Is it thru DataImportHandler? And what database? Is that local db? We are indexing around 70 fields (60 multivalued) but data is not populated always in all fields. The average size of document is in 5-10 kbs. -Original Message- From: saka.csi...@gmail.com [mailto:saka.csi...@gmail.com] On Behalf Of svante karlsson Sent: Friday, January 24, 2014 5:05 PM To: solr-user@lucene.apache.org Subject: Re: Solr server requirements for 100+ million documents I just indexed 100 million db docs (records) with 22 fields (4 multivalued) in 9524 sec using libcurl. 11 million took 763 seconds so the speed drops somewhat with increasing dbsize. We write 1000 docs (just an arbitrary number) in each request from two threads. If you will be using solrcloud you will want more writer threads. The hardware is a single cheap hp DL320E GEN8 V2 1P E3-1220V3 with one SSD and 32GB and the solr runs on ubuntu 13.10 inside a esxi virtual machine.
/svante 2014/1/24 Susheel Kumar susheel.ku...@thedigitalgroup.net Thanks, Erick for the info. For indexing I agree the more time is consumed in data acquisition which in our case is from the Database. For indexing currently we are using the manual process i.e. Solr dashboard Data Import but now looking to automate. How do you suggest to automate the index part. Do you recommend to use SolrJ or should we try to automate using Curl? -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Friday, January 24, 2014 2:59 PM To: solr-user@lucene.apache.org Subject: Re: Solr server requirements for 100+ million documents Can't be done with the information you provided, and can only be guessed at even with more comprehensive information. Here's why: http://searchhub.org/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/ Also, at a guess, your indexing speed is so slow due to data acquisition; I rather doubt you're being limited by raw Solr indexing. If you're using SolrJ, try commenting out the server.add() bit and running again. My guess is that your indexing speed will be almost unchanged, in which case the data acquisition process is where you should concentrate your efforts. As a comparison, I can index 11M Wikipedia docs on my
Re: Fwd: Search Engine Framework decision
Rashmi, As far as I know Nutch is a web crawler. I don't think it can crawl documents from Microsoft SharePoint. ManifoldCF is a better fit in your case. Regarding versioning, if you don't have previous setups, then use the latest versions of each. Ahmet On Sunday, January 26, 2014 5:24 PM, rashmi maheshwari maheshwari.ras...@gmail.com wrote: Hi, I want to create a POC to search our INTRANET along with documents uploaded to the intranet. Documents (PDF, Excel, Word documents, text files, images, videos) also exist on SHAREPOINT. SharePoint has authentication access at the module level (folder level). My intranet website is http://myintranet/ http://sparsh/ and the SharePoint URL is different. Documents also exist in file folders. I have the below queries: A) Which crawler framework do I use along with Solr for this POC, Nutch or Apache ManifoldCF? B) Is it possible to crawl SharePoint documents using Nutch? If yes, would a configuration-level change alone make this possible, or do I have to write code to parse and send to Solr? C) Which versions of Solr+Nutch+MCF should be used? The Nutch version has a dependency on the Solr version. Would Nutch 1.7 work properly with Solr 4.6.0? -- Rashmi Be the change that you want to see in this world! -- Rashmi Be the change that you want to see in this world! www.minnal.zor.org disha.resolve.at www.artofliving.org
Re: Solr server requirements for 100+ million documents
1) That's what I'd do. For incremental updates you might have to create a trigger on the main table and insert rows into another table that is then used to do the incremental updates. This is particularly relevant for deletes. Consider the case where you've ingested all your data and then rows are deleted. Removing those same documents from Solr requires either a) re-indexing everything, or b) getting all the docs in Solr and comparing them with the rows in the DB, etc., which is expensive, or c) recording the changes as above and just processing deletes from the change table. 2) SolrJ is usually the most current. I don't know how much work SolrNet gets. However, under the covers it's all just HTTP calls, and since you have access in either to just adding HTTP parameters, you should be able to get the full functionality out of either. I _think_ that I'd go with whatever you're most comfortable with. Best, Erick On Sun, Jan 26, 2014 at 9:54 AM, Susheel Kumar susheel.ku...@thedigitalgroup.net wrote: Thank you Erick for your valuable inputs. Yes, we have to re-index data again and again. I'll look into the possibility of tuning db access. On SolrJ and automating the indexing (incremental as well as one time) I want to get your opinion on the below two points. We will be indexing separate sets of tables with similar data structure - Should we use SolrJ and write Java programs that can be scheduled to trigger indexing on demand/schedule based? - Is using SolrJ a better idea even for searching than using SolrNet? Our frontend is in .Net, so we started using SolrNet, but I am afraid that down the road when we scale/support SolrCloud, SolrJ would be better? Thanks Susheel -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Sunday, January 26, 2014 8:37 AM To: solr-user@lucene.apache.org Subject: Re: Solr server requirements for 100+ million documents Dumping the raw data would probably be a good idea. I guarantee you'll be re-indexing the data several times as you change the schema to accommodate different requirements... But it may also be worth spending some time figuring out why the DB access is slow. Sometimes one can tune that. If you go the SolrJ route, you also have the possibility of setting up N clients to work simultaneously, sometimes that'll help. FWIW, Erick On Sat, Jan 25, 2014 at 11:06 PM, Susheel Kumar susheel.ku...@thedigitalgroup.net wrote: Hi Kranti, Attach are the solrconfig schema xml for review. I did run indexing with just few fields (5-6 fields) in schema.xml keeping the same db config but Indexing almost still taking similar time (average 1 million records 1 hr) which confirms that the bottleneck is in the data acquisition which in our case is oracle database. I am thinking to not use dataimporthandler / jdbc to get data from Oracle but to rather dump data somehow from oracle using SQL loader and then index it. Any thoughts? Thnx -Original Message- From: Kranti Parisa [mailto:kranti.par...@gmail.com] Sent: Saturday, January 25, 2014 12:08 AM To: solr-user@lucene.apache.org Subject: Re: Solr server requirements for 100+ million documents can you post the complete solrconfig.xml file and schema.xml files to review all of your settings that would impact your indexing performance. Thanks, Kranti K. Parisa http://www.linkedin.com/in/krantiparisa On Sat, Jan 25, 2014 at 12:56 AM, Susheel Kumar susheel.ku...@thedigitalgroup.net wrote: Thanks, Svante. Your indexing speed using db seems to really fast. Can you please provide some more detail on how you are indexing db records.
Is it thru DataImportHandler? And what database? Is that local db? We are indexing around 70 fields (60 multivalued) but data is not populated always in all fields. The average size of document is in 5-10 kbs. -Original Message- From: saka.csi...@gmail.com [mailto:saka.csi...@gmail.com] On Behalf Of svante karlsson Sent: Friday, January 24, 2014 5:05 PM To: solr-user@lucene.apache.org Subject: Re: Solr server requirements for 100+ million documents I just indexed 100 million db docs (records) with 22 fields (4 multivalued) in 9524 sec using libcurl. 11 million took 763 seconds so the speed drops somewhat with increasing dbsize. We write 1000 docs (just an arbitrary number) in each request from two threads. If you will be using solrcloud you will want more writer threads. The hardware is a single cheap hp DL320E GEN8 V2 1P E3-1220V3 with one SSD and 32GB and the solr runs on ubuntu 13.10 inside a esxi virtual machine. /svante 2014/1/24 Susheel Kumar susheel.ku...@thedigitalgroup.net Thanks, Erick for the info. For indexing I agree the more time is consumed in data acquisition which in our case from Database. For indexing currently we are using the manual process i.e. Solr dashboard Data Import but now looking to automate. How do you suggest to automate the index part.
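To make the change-table idea from Erick's reply above concrete, here is a rough sketch of the delete side of option c). The CHANGE_LOG table, its columns and the connection details are hypothetical - they would be whatever your trigger actually writes:

import java.sql.*;
import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class ProcessDeletes {
  public static void main(String[] args) throws Exception {
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
    Connection db = DriverManager.getConnection(
        "jdbc:oracle:thin:@//dbhost:1521/ORCL", "user", "password");

    // Collect the ids that the trigger recorded as deleted since the last run.
    List<String> ids = new ArrayList<String>();
    PreparedStatement ps = db.prepareStatement(
        "SELECT doc_id FROM change_log WHERE op = 'DELETE' AND processed = 'N'");
    ResultSet rs = ps.executeQuery();
    while (rs.next()) {
      ids.add(rs.getString(1));
    }
    rs.close();
    ps.close();

    if (!ids.isEmpty()) {
      solr.deleteById(ids);   // remove them from the index
      solr.commit();
      db.createStatement().executeUpdate(
          "UPDATE change_log SET processed = 'Y' WHERE op = 'DELETE'");
    }
    db.close();
    solr.shutdown();
  }
}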
Re: Solr server requirements for 100+ million documents
Erick's probably too modest to say so ;=) , but he wrote a great blog entry on indexing with SolrJ - http://searchhub.org/2012/02/14/indexing-with-solrj/ . I took the guts of the code in that blog and easily customized it to write a very fast indexer (content from MySQL, I excised all the Tika code as I am not using it). You should replace StreamingUpdateSolrServer by ConcurrentUpdateSolrServer and experiment to find the optimal number of threads to configure. -Simon On Sun, Jan 26, 2014 at 11:28 AM, Erick Erickson erickerick...@gmail.comwrote: 1 That's what I'd do. For incremental updates you might have to create a trigger on the main table and insert rows into another table that is then used to do the incremental updates. This is particularly relevant for deletes. Consider the case where you've ingested all your data then rows are deleted. Removing those same documents from Solr requires either a re-indexing everything or b getting all the docs in Solr and comparing them with the rows in the DB etc. This is expensive. c recording the changes as above and just processing deletes from the change table. 2 SolrJ is usually the most current. I don't know how much work SolrNet gets. However, under the covers it's just HTTP calls so since you have access in either to just adding HTTP parameters, you should be able to get the full functionality out of either. I _think_ that I'd go with whatever you're most comfortable with. Best, Erick On Sun, Jan 26, 2014 at 9:54 AM, Susheel Kumar susheel.ku...@thedigitalgroup.net wrote: Thank you Erick for your valuable inputs. Yes, we have to re-index data again again. I'll look into possibility of tuning db access. On SolrJ and automating the indexing (incremental as well as one time) I want to get your opinion on below two points. We will be indexing separate sets of tables with similar data structure - Should we use SolrJ and write Java programs that can be scheduled to trigger indexing on demand/schedule based. - Is using SolrJ a better idea even for searching than using SolrNet? As our frontend is in .Net so we started using SolrNet but I am afraid down the road when we scale/support SolrClod using SolrJ is better? Thanks Susheel -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Sunday, January 26, 2014 8:37 AM To: solr-user@lucene.apache.org Subject: Re: Solr server requirements for 100+ million documents Dumping the raw data would probably be a good idea. I guarantee you'll be re-indexing the data several times as you change the schema to accommodate different requirements... But it may also be worth spending some time figuring out why the DB access is slow. Sometimes one can tune that. If you go the SolrJ route, you also have the possibility of setting up N clients to work simultaneously, sometimes that'll help. FWIW, Erick On Sat, Jan 25, 2014 at 11:06 PM, Susheel Kumar susheel.ku...@thedigitalgroup.net wrote: Hi Kranti, Attach are the solrconfig schema xml for review. I did run indexing with just few fields (5-6 fields) in schema.xml keeping the same db config but Indexing almost still taking similar time (average 1 million records 1 hr) which confirms that the bottleneck is in the data acquisition which in our case is oracle database. I am thinking to not use dataimporthandler / jdbc to get data from Oracle but to rather dump data somehow from oracle using SQL loader and then index it. Any thoughts? 
Thnx -Original Message- From: Kranti Parisa [mailto:kranti.par...@gmail.com] Sent: Saturday, January 25, 2014 12:08 AM To: solr-user@lucene.apache.org Subject: Re: Solr server requirements for 100+ million documents can you post the complete solrconfig.xml file and schema.xml files to review all of your settings that would impact your indexing performance. Thanks, Kranti K. Parisa http://www.linkedin.com/in/krantiparisa On Sat, Jan 25, 2014 at 12:56 AM, Susheel Kumar susheel.ku...@thedigitalgroup.net wrote: Thanks, Svante. Your indexing speed using db seems to really fast. Can you please provide some more detail on how you are indexing db records. Is it thru DataImportHandler? And what database? Is that local db? We are indexing around 70 fields (60 multivalued) but data is not populated always in all fields. The average size of document is in 5-10 kbs. -Original Message- From: saka.csi...@gmail.com [mailto:saka.csi...@gmail.com] On Behalf Of svante karlsson Sent: Friday, January 24, 2014 5:05 PM To: solr-user@lucene.apache.org Subject: Re: Solr server requirements for 100+ million documents I just indexed 100 million db docs (records) with 22 fields (4 multivalued) in 9524 sec using libcurl. 11 million took 763 seconds so the speed drops somewhat with increasing dbsize. We write 1000 docs (just an arbitrary number) in each
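A rough sketch of the ConcurrentUpdateSolrServer approach described above (SolrJ 4.x; fetchNextBatch() stands in for your own MySQL/Oracle reading code, and the queue size and thread count are exactly the knobs to experiment with):

import java.util.List;
import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BulkIndexer {
  // Placeholder for your own data-acquisition code.
  static List<SolrInputDocument> fetchNextBatch() { return null; }

  public static void main(String[] args) throws Exception {
    // buffer up to 10000 docs and send them with 4 background threads
    ConcurrentUpdateSolrServer server =
        new ConcurrentUpdateSolrServer("http://localhost:8983/solr/collection1", 10000, 4);
    List<SolrInputDocument> batch;
    while ((batch = fetchNextBatch()) != null) {
      server.add(batch);           // adds are queued and streamed in the background
    }
    server.blockUntilFinished();   // wait for the queue to drain
    server.commit();
    server.shutdown();
  }
}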
Re: How to handle multiple sub second updates to same SOLR Document
Sent from my iPhone. On 26 Jan 2014, at 06:13, Shalin Shekhar Mangar shalinman...@gmail.com wrote: There is no timestamp versioning as such in Solr but there is a new document based versioning which will allow you to specify your own (externally assigned) versions. See the Document Centric Versioning Constraints section at https://cwiki.apache.org/confluence/display/solr/Updating+Parts+of+Documents Sub-second soft auto commit can be expensive but it is hard to say if it will be too expensive for your use-case. You must benchmark it yourself. On Sat, Jan 25, 2014 at 11:51 PM, christopher palm cpa...@gmail.com wrote: I have a scenario where the same SOLR document is being updated several times within a few ms of each other due to how the source system is sending in field updates on the document. The problem I am trying to solve is that the order of these updates isn’t guaranteed once the multi threaded SOLRJ client starts sending them to SOLR, and older updates are overlaying the newer updates on the same document. I would like to use a timestamp versioning so that the older document change won’t be sent into SOLR, but I didn’t see any automated way of doing this based on the document timestamp. Is there a good way to handle this scenario in SOLR 4.6? It seems that we would have to be soft auto committing with a subsecond level as well, is that even possible? Thanks, Chris -- Regards, Shalin Shekhar Mangar.
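A minimal sketch of what the client side looks like with document-based versioning, assuming DocBasedVersionConstraintsProcessorFactory has been configured on the update chain as described on that wiki page, with a long field named (here, hypothetically) my_version_l:

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class VersionedUpdate {
  public static void main(String[] args) throws Exception {
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "doc-42");
    doc.addField("title", "newer title");
    // Source-system timestamp used as the external version. With the version
    // constraint processor in the chain, an update whose my_version_l is lower
    // than the indexed one is dropped, so out-of-order sends from the
    // multi-threaded client can no longer overwrite newer data.
    doc.addField("my_version_l", 1390760000000L);

    solr.add(doc);
    solr.commit();
    solr.shutdown();
  }
}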
Tie breakers when sorting equal items
I promised to ask this on the forum just to confirm what I assume is true. Suppose you're returning results using a sort order based on some field (so, not relevancy). For example, suppose it's a date field which indicates when the document was loaded into the Solr index. Suppose two items have exactly the same date/time in the field. Would Solr return the two items in the order in which they were inserted? I would assume that the answer is not necessarily. I know that you can have secondary sort fields if a suitable field exists that would provide the desired functionality. I know that I could set up some kind of numbering scheme that would provide the same result (the customer doesn't want to pay for that). So, I'm really just asking if Solr has any guarantees that when you sort on a field and two items have the same value, they will be sorted in the order they were inserted into the index. Again, I assume the answer is no, but I said I would ask.
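For reference, the usual way to get a deterministic order is an explicit secondary sort on the uniqueKey field, since Solr makes no promise about the relative order of documents with equal sort values. A minimal SolrJ sketch, where load_date is a hypothetical name for the date field:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class TieBreakerQuery {
  public static void main(String[] args) throws Exception {
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
    SolrQuery q = new SolrQuery("*:*");
    q.addSort("load_date", SolrQuery.ORDER.asc);  // primary sort
    q.addSort("id", SolrQuery.ORDER.asc);         // tie-breaker on the uniqueKey
    System.out.println(solr.query(q).getResults().getNumFound());
    solr.shutdown();
  }
}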
Re: How to run a subsequent update query to documents indexed from a dataimport query
Hi all, Any ideas on how to run a reindex update process for all the imported documents from a /dataimport query? Appreciate your help. Thanks, Dileepa On Thu, Jan 23, 2014 at 12:21 PM, Dileepa Jayakody dileepajayak...@gmail.com wrote: Hi All, I did some research on this and found some alternatives useful to my usecase. Please give your ideas. Can I update all documents indexed after a /dataimport query using the last_indexed_time in dataimport.properties? If so can anyone please give me some pointers? What I currently have in mind is something like below; 1. Store the indexing timestamp of the document as a field eg: <field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/> 2. Read the last_index_time from the dataimport.properties 3. Query all document id's indexed after the last_index_time and send them through the Stanbol update processor. But I have a question here; Does the last_index_time refer to when the dataimport is started (onImportStart) or when the dataimport is finished (onImportEnd)? If it's the onImportEnd timestamp, then this solution won't work because the timestamp indexed in the document field will be: onImportStart < doc-index-timestamp < onImportEnd. Another alternative I can think of is to trigger an update chain via an EventListener configured to run after a dataimport is processed (onImportEnd). In this case can the context in DIH give the list of document ids processed in the /dataimport request? If so I can send those doc ids with an /update query to run the Stanbol update process. Please give me your ideas and suggestions. Thanks, Dileepa On Wed, Jan 22, 2014 at 6:14 PM, Dileepa Jayakody dileepajayak...@gmail.com wrote: Hi All, I have a Solr requirement to send all the documents imported from a /dataimport query to go through another update chain as a separate background process. Currently I have configured my custom update chain in the /dataimport handler itself. But since my custom update process needs to connect to an external enhancement engine (Apache Stanbol) to enhance the documents with some NLP fields, it has a negative impact on the /dataimport process. The solution will be to have a separate update process running to enhance the content of the documents imported from /dataimport. Currently I have configured my custom Stanbol Processor as below in my /dataimport handler.

<requestHandler name="/dataimport" class="solr.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
    <str name="update.chain">stanbolInterceptor</str>
  </lst>
</requestHandler>

<updateRequestProcessorChain name="stanbolInterceptor">
  <processor class="com.solr.stanbol.processor.StanbolContentProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

What I need now is to separate the 2 processes of dataimport and stanbol-enhancement. So this is like running a separate re-indexing process periodically over the documents imported from /dataimport for Stanbol fields. The question is how to trigger my Stanbol update process for the documents imported from /dataimport? In Solr, to trigger an /update query we need to know the id and the fields of the document to be updated. In my case I need to run all the documents imported from the previous /dataimport process through a stanbol update.chain. Is there a way to keep track of the document ids imported from /dataimport? Any advice or pointers will be really helpful. Thanks, Dileepa
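A rough sketch of the timestamp approach described in steps 1-3 above, assuming the timestamp field from step 1 and SolrJ 4.x; the field names, chain name and paging are illustrative only:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class ReprocessSinceLastImport {
  public static void main(String[] args) throws Exception {
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
    String lastIndexTime = "2014-01-26T12:00:00Z";  // read from dataimport.properties

    SolrQuery q = new SolrQuery("timestamp:[" + lastIndexTime + " TO *]");
    q.setFields("id", "content");
    q.setRows(1000);  // page with start/rows for larger result sets

    UpdateRequest update = new UpdateRequest();
    update.setParam("update.chain", "stanbolInterceptor");  // run only the Stanbol chain
    for (SolrDocument found : solr.query(q).getResults()) {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", found.getFieldValue("id"));
      doc.addField("content", found.getFieldValue("content"));
      // Note: re-adding only some fields replaces the whole document, so in
      // practice either copy every stored field here or use atomic updates.
      update.add(doc);
    }
    update.process(solr);
    solr.commit();
    solr.shutdown();
  }
}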
What is the last_index_time in dataimport.properties?
Hi All, Can I please know what timestamp in the dataimport process is recorded as the last_index_time in dataimport.properties? Is it the time that the last dataimport process started? OR is it the time that the last dataimport process finished? Thanks, Dileepa
Re: What is the last_index_time in dataimport.properties?
Hi Dileepa, It is the time that the last dataimport process started. So it is safe to use it when considering updated documents during the import. Ahmet On Sunday, January 26, 2014 9:10 PM, Dileepa Jayakody dileepajayak...@gmail.com wrote: Hi All, Can I please know what timestamp in the dataimport process is recorded as the last_index_time in dataimport.properties? Is it the time that the last dataimport process started? OR is it the time that the last dataimport process finished? Thanks, Dileepa
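For completeness, a small sketch of reading that value back out of dataimport.properties (the path is an assumption - the file sits next to the DIH config in the core's conf directory):

import java.io.FileInputStream;
import java.util.Properties;

public class ReadLastIndexTime {
  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    props.load(new FileInputStream("solr/collection1/conf/dataimport.properties"));
    String lastIndexTime = props.getProperty("last_index_time");
    System.out.println("last_index_time = " + lastIndexTime);
    // DIH stores it as "yyyy-MM-dd HH:mm:ss"; convert it to ISO-8601/UTC
    // (e.g. 2014-01-26T12:00:00Z) before using it in a Solr date range query.
  }
}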
Re: What is the last_index_time in dataimport.properties?
Hi Ahmet, Thanks a lot. Does it mean I can use the last_index_time to query documents indexed during the last dataimport request? I need to run a subsequent update process on all documents imported from a dataimport. Thanks, Dileepa On Mon, Jan 27, 2014 at 1:33 AM, Ahmet Arslan iori...@yahoo.com wrote: Hi Dileepa, It is the time that the last dataimport process started. So it is safe to use it when considering updated documents during the import. Ahmet On Sunday, January 26, 2014 9:10 PM, Dileepa Jayakody dileepajayak...@gmail.com wrote: Hi All, Can I please know what timestamp in the dataimport process is recorded as the last_index_time in dataimport.properties? Is it the time that the last dataimport process started? OR is it the time that the last dataimport process finished? Thanks, Dileepa
Re: What is the last_index_time in dataimport.properties?
Hi, last_index_time is traditionally used to query the database. But it seems that you want to query Solr, right? On Sunday, January 26, 2014 11:15 PM, Dileepa Jayakody dileepajayak...@gmail.com wrote: Hi Ahmet, Thanks a lot. Does it mean I can use the last_index_time to query documents indexed during the last dataimport request? I need to run a subsequent update process on all documents imported from a dataimport. Thanks, Dileepa On Mon, Jan 27, 2014 at 1:33 AM, Ahmet Arslan iori...@yahoo.com wrote: Hi Dileepa, It is the time that the last dataimport process started. So it is safe to use it when considering updated documents during the import. Ahmet On Sunday, January 26, 2014 9:10 PM, Dileepa Jayakody dileepajayak...@gmail.com wrote: Hi All, Can I please know what timestamp in the dataimport process is recorded as the last_index_time in dataimport.properties? Is it the time that the last dataimport process started? OR is it the time that the last dataimport process finished? Thanks, Dileepa
Re: How to run a subsequent update query to documents indexed from a dataimport query
Hi, Here is what I understand from your Question. You have a custom update processor that runs with DIH. But it is slow. You want to run that text enhancement component after DIH. How would this help to speed up things? In this approach you will read/query/search already indexed and committed solr documents and run text enhancement thing on them. Probably this process will add new additional fields. And then you will update these solr documents? Did I understand your use case correctly? On Sunday, January 26, 2014 8:43 PM, Dileepa Jayakody dileepajayak...@gmail.com wrote: Hi all, Any ideas on how to run a reindex update process for all the imported documents from a /dataimport query? Appreciate your help. Thanks, Dileepa On Thu, Jan 23, 2014 at 12:21 PM, Dileepa Jayakody dileepajayak...@gmail.com wrote: Hi All, I did some research on this and found some alternatives useful to my usecase. Please give your ideas. Can I update all documents indexed after a /dataimport query using the last_indexed_time in dataimport.properties? If so can anyone please give me some pointers? What I currently have in mind is something like below; 1. Store the indexing timestamp of the document as a field eg: field name=timestamp type=date indexed=true stored=true default=NOW multiValued=false/ 2. Read the last_index_time from the dataimport.properties 3. Query all document id's indexed after the last_index_time and send them through the Stanbol update processor. But I have a question here; Does the last_index_time refer to when the dataimport is started(onImportStart) or when the dataimport is finished (onImportEnd)? If it's onImportEnd timestamp, them this solution won't work because the timestamp indexed in the document field will be : onImportStart doc-index-timestamp onImportEnd. Another alternative I can think of is trigger an update chain via a EventListener configured to run after a dataimport is processed (onImportEnd). In this case can the context in DIH give the list of document ids processed in the /dataimport request? If so I can send those doc ids with an /update query to run the Stanbol update process. Please give me your ideas and suggestions. Thanks, Dileepa On Wed, Jan 22, 2014 at 6:14 PM, Dileepa Jayakody dileepajayak...@gmail.com wrote: Hi All, I have a Solr requirement to send all the documents imported from a /dataimport query to go through another update chain as a separate background process. Currently I have configured my custom update chain in the /dataimport handler itself. But since my custom update process need to connect to an external enhancement engine (Apache Stanbol) to enhance the documents with some NLP fields, it has a negative impact on /dataimport process. The solution will be to have a separate update process running to enhance the content of the documents imported from /dataimport. Currently I have configured my custom Stanbol Processor as below in my /dataimport handler. requestHandler name=/dataimport class=solr.DataImportHandler lst name=defaults str name=configdata-config.xml/str str name=update.chainstanbolInterceptor/str /lst /requestHandler updateRequestProcessorChain name=stanbolInterceptor processor class=com.solr.stanbol.processor.StanbolContentProcessorFactory/ processor class=solr.RunUpdateProcessorFactory / /updateRequestProcessorChain What I need now is to separate the 2 processes of dataimport and stanbol-enhancement. So this is like runing a separate re-indexing process periodically over the documents imported from /dataimport for Stanbol fields. 
The question is how to trigger my Stanbol update process to the documents imported from /dataimport? In Solr to trigger /update query we need to know the id and the fields of the document to be updated. In my case I need to run all the documents imported from the previous /dataimport process through a stanbol update.chain. Is there a way to keep track of the documents ids imported from /dataimport? Any advice or pointers will be really helpful. Thanks, Dileepa
Re: How to run a subsequent update query to documents indexed from a dataimport query
Hi Ahmet, On Mon, Jan 27, 2014 at 3:26 AM, Ahmet Arslan iori...@yahoo.com wrote: Hi, Here is what I understand from your Question. You have a custom update processor that runs with DIH. But it is slow. You want to run that text enhancement component after DIH. How would this help to speed up things? In this approach you will read/query/search already indexed and committed solr documents and run text enhancement thing on them. Probably this process will add new additional fields. And then you will update these solr documents? Did I understand your use case correctly? Yes, that is exactly what I want to achieve. I want to separate out the enhancement process from the dataimport process. The dataimport process will be invoked by a client when new data is added/updated to the mysql database. Therefore the dataimport process with mandatory fields of the documents should be indexed as soon as possible. Mandatory fields are mapped to the data table columns in the data-config.xml and the normal /dataimport process doesn't take much time. The enhancements are done in my custom processor by sending the content field of the document to an external Stanbol[1] server to detect NLP enhancements. Then new NLP fields are added to the document (detected persons, organizations, places in the content) in the custom update processor and if this is executed during the dataimport process, it takes a lot of time. The NLP fields are not mandatory for the primary usage of the application which is to query documents with mandatory fields. The NLP fields are required for custom queries for Person, Organization entities. Therefore the NLP update process should be run as a background process detached from the primary /dataimport process. It should not slow down the existing /dataimport process. That's why I am looking for the best way to achieve my objective. I want to implement a way to separately update the imported documents from /dataimport to detect NLP enhancements. Currently I'm having the idea of adopting a timestamp based approach to trigger a /update query to all documents imported after the last_index_time in dataimport.prop and update them with NLP fields. Hope my requirement is clear :). Appreciate your suggestions. [1] http://stanbol.apache.org/ On Sunday, January 26, 2014 8:43 PM, Dileepa Jayakody dileepajayak...@gmail.com wrote: Hi all, Any ideas on how to run a reindex update process for all the imported documents from a /dataimport query? Appreciate your help. Thanks, Dileepa On Thu, Jan 23, 2014 at 12:21 PM, Dileepa Jayakody dileepajayak...@gmail.com wrote: Hi All, I did some research on this and found some alternatives useful to my usecase. Please give your ideas. Can I update all documents indexed after a /dataimport query using the last_indexed_time in dataimport.properties? If so can anyone please give me some pointers? What I currently have in mind is something like below; 1. Store the indexing timestamp of the document as a field eg: field name=timestamp type=date indexed=true stored=true default=NOW multiValued=false/ 2. Read the last_index_time from the dataimport.properties 3. Query all document id's indexed after the last_index_time and send them through the Stanbol update processor. But I have a question here; Does the last_index_time refer to when the dataimport is started(onImportStart) or when the dataimport is finished (onImportEnd)? 
If it's onImportEnd timestamp, them this solution won't work because the timestamp indexed in the document field will be : onImportStart doc-index-timestamp onImportEnd. Another alternative I can think of is trigger an update chain via a EventListener configured to run after a dataimport is processed (onImportEnd). In this case can the context in DIH give the list of document ids processed in the /dataimport request? If so I can send those doc ids with an /update query to run the Stanbol update process. Please give me your ideas and suggestions. Thanks, Dileepa On Wed, Jan 22, 2014 at 6:14 PM, Dileepa Jayakody dileepajayak...@gmail.com wrote: Hi All, I have a Solr requirement to send all the documents imported from a /dataimport query to go through another update chain as a separate background process. Currently I have configured my custom update chain in the /dataimport handler itself. But since my custom update process need to connect to an external enhancement engine (Apache Stanbol) to enhance the documents with some NLP fields, it has a negative impact on /dataimport process. The solution will be to have a separate update process running to enhance the content of the documents imported from /dataimport. Currently I have configured my custom Stanbol Processor as below in my /dataimport handler. requestHandler name=/dataimport class=solr.DataImportHandler lst name=defaults str
Re: What is the last_index_time in dataimport.properties?
Yes Ahmet. I want to use the last_index_time to find the documents imported in the last /dataimport process and send them through an update process. I have explained this requirement in my other thread. Thanks, Dileepa On Mon, Jan 27, 2014 at 3:23 AM, Ahmet Arslan iori...@yahoo.com wrote: Hi, last_index_time is traditionally used to query the database. But it seems that you want to query Solr, right? On Sunday, January 26, 2014 11:15 PM, Dileepa Jayakody dileepajayak...@gmail.com wrote: Hi Ahmet, Thanks a lot. Does it mean I can use the last_index_time to query documents indexed during the last dataimport request? I need to run a subsequent update process on all documents imported from a dataimport. Thanks, Dileepa On Mon, Jan 27, 2014 at 1:33 AM, Ahmet Arslan iori...@yahoo.com wrote: Hi Dileepa, It is the time that the last dataimport process started. So it is safe to use it when considering updated documents during the import. Ahmet On Sunday, January 26, 2014 9:10 PM, Dileepa Jayakody dileepajayak...@gmail.com wrote: Hi All, Can I please know what timestamp in the dataimport process is recorded as the last_index_time in dataimport.properties? Is it the time that the last dataimport process started? OR is it the time that the last dataimport process finished? Thanks, Dileepa
Complication - can block joins help?
OK, In order to do boosting, we often will create a dynamic field in SOLR. For example: A professional hires out for work, and I want to boost those who do woodworking. George Smith builds chairs and builds desks. He builds the most desks in the country (350 a year), and his closest competitor does 200 a year. id (integer) = 1 name (string) = George Smith work (multiValued field) = chairs, desks num_desk (dynamic field num*) = 500 Then I would do something like: q=num_desk^5.0 Is there a way to do this without a dynamic field? I thought about a field: desk|500 (use bar delimiter). But I couldn't see how to have the value indexed so it could easily be used to boost those who do the most. If you think of all the types of work, this could be 50,000 dynamic fields. Probably a performance hog. Dr. Smith - Angioplasty - performs 70 of these a year -- Bill Bell billnb...@gmail.com cell 720-256-8076
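For what it's worth, a per-provider volume boost like this is normally expressed as a function boost rather than as the query itself; a sketch of that, using the field names from the example above and assuming the edismax parser (whether the raw count should be scaled or log-dampened first is a separate tuning question):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class VolumeBoostQuery {
  public static void main(String[] args) throws Exception {
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
    SolrQuery q = new SolrQuery("work:desks");  // the actual search
    q.set("defType", "edismax");
    q.set("bf", "num_desk^5.0");                // additive boost from the volume field
    System.out.println(solr.query(q).getResults());
    solr.shutdown();
  }
}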
Re: How to run a subsequent update query to documents indexed from a dataimport query
Hi Dileepa, If I understand correctly, this is what happens in your system: 1. DIH sends data to Solr 2. You have written a custom update processor ( http://wiki.apache.org/solr/UpdateRequestProcessor) which then asks your Stanbol server for metadata, adds it to the document and then indexes it. It's the part where you query the Stanbol server and wait for the response which takes time, and you want to reduce this. In my view, instead of waiting for the response from the Stanbol server and then indexing, you could send the required field data from the doc to your Stanbol server and continue. Once Stanbol has enriched the document, you re-index the document and update it with the metadata. This method makes you re-index the document but the changes from your client would be visible faster. Alternatively you could do the same thing at the DIH level by writing a custom Transformer ( http://wiki.apache.org/solr/DataImportHandler#Writing_Custom_Transformers) On Mon, Jan 27, 2014 at 10:44 AM, Dileepa Jayakody dileepajayak...@gmail.com wrote: Hi Ahmet, On Mon, Jan 27, 2014 at 3:26 AM, Ahmet Arslan iori...@yahoo.com wrote: Hi, Here is what I understand from your Question. You have a custom update processor that runs with DIH. But it is slow. You want to run that text enhancement component after DIH. How would this help to speed up things? In this approach you will read/query/search already indexed and committed solr documents and run text enhancement thing on them. Probably this process will add new additional fields. And then you will update these solr documents? Did I understand your use case correctly? Yes, that is exactly what I want to achieve. I want to separate out the enhancement process from the dataimport process. The dataimport process will be invoked by a client when new data is added/updated to the mysql database. Therefore the dataimport process with mandatory fields of the documents should be indexed as soon as possible. Mandatory fields are mapped to the data table columns in the data-config.xml and the normal /dataimport process doesn't take much time. The enhancements are done in my custom processor by sending the content field of the document to an external Stanbol[1] server to detect NLP enhancements. Then new NLP fields are added to the document (detected persons, organizations, places in the content) in the custom update processor and if this is executed during the dataimport process, it takes a lot of time. The NLP fields are not mandatory for the primary usage of the application which is to query documents with mandatory fields. The NLP fields are required for custom queries for Person, Organization entities. Therefore the NLP update process should be run as a background process detached from the primary /dataimport process. It should not slow down the existing /dataimport process. That's why I am looking for the best way to achieve my objective. I want to implement a way to separately update the imported documents from /dataimport to detect NLP enhancements. Currently I'm having the idea of adopting a timestamp based approach to trigger a /update query to all documents imported after the last_index_time in dataimport.prop and update them with NLP fields. Hope my requirement is clear :). Appreciate your suggestions. [1] http://stanbol.apache.org/ On Sunday, January 26, 2014 8:43 PM, Dileepa Jayakody dileepajayak...@gmail.com wrote: Hi all, Any ideas on how to run a reindex update process for all the imported documents from a /dataimport query?
Appreciate your help. Thanks, Dileepa On Thu, Jan 23, 2014 at 12:21 PM, Dileepa Jayakody dileepajayak...@gmail.com wrote: Hi All, I did some research on this and found some alternatives useful to my usecase. Please give your ideas. Can I update all documents indexed after a /dataimport query using the last_indexed_time in dataimport.properties? If so can anyone please give me some pointers? What I currently have in mind is something like below; 1. Store the indexing timestamp of the document as a field eg: field name=timestamp type=date indexed=true stored=true default=NOW multiValued=false/ 2. Read the last_index_time from the dataimport.properties 3. Query all document id's indexed after the last_index_time and send them through the Stanbol update processor. But I have a question here; Does the last_index_time refer to when the dataimport is started(onImportStart) or when the dataimport is finished (onImportEnd)? If it's onImportEnd timestamp, them this solution won't work because the timestamp indexed in the document field will be : onImportStart doc-index-timestamp onImportEnd. Another alternative I can think of is trigger an update chain via a EventListener configured to run after a dataimport is processed
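A rough sketch of the second pass suggested above, done as an atomic update so that only the NLP fields change while the fields indexed by the fast /dataimport pass are left alone (atomic updates in 4.x require all other fields to be stored; the field name and the enhance() stub are placeholders for the real Stanbol client call):

import java.util.Collections;
import java.util.List;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class StanbolEnrichmentUpdater {
  // Placeholder: call Stanbol's enhancer endpoint and extract the entities.
  static List<String> enhance(String content) { return Collections.emptyList(); }

  public static void main(String[] args) throws Exception {
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

    String docId = "doc-42";
    String content = "...";                   // the stored content field of that doc
    List<String> persons = enhance(content);  // slow call, done outside /dataimport

    // Atomic update: "set" only the NLP field on the existing document.
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", docId);
    doc.addField("person_entities", Collections.singletonMap("set", persons));
    solr.add(doc);
    solr.commit();
    solr.shutdown();
  }
}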
Re: Complication - can block joins help?
Is there an example of using payloads for 4.6? Without any custom code for this? On Sun, Jan 26, 2014 at 10:30 PM, William Bell billnb...@gmail.com wrote: OK, In order to do boosting, we often will create a dynamic field in SOLR. For example: A professional hires out for work, and I want to boost those who do woodworking. George Smith builds chairs and builds desks. He builds the most desks in the country (350 a year), and his closest competitor does 200 a year. id (integer) = 1 name (string) = George Smith work (multiValued field) = chairs, desks num_desk (dynamic field num*) = 500 Then I would do something like: q=num_desk^5.0 Is there a way to do this without a dynamic field? I thought about a field: desk|500 (use bar delimiter). But I couldn't see how to have the value indexed so it could easily be used to boost those who do the most. If you think of all the types of work, this could be 50,000 dynamic fields. Probably a performance hog. Dr. Smith - Angioplasty - performs 70 of these a year -- Bill Bell billnb...@gmail.com cell 720-256-8076 -- Bill Bell billnb...@gmail.com cell 720-256-8076