Re: UpdateProcessor as a batch
maybe introduce a distributed queue such as Apache Ignite, Hazelcast or even Redis. Read from the queue in batches, do your lookup, then index the same batch. Just a thought.

Mike St. John

On Nov 3, 2016 3:58 PM, "Erick Erickson" <erickerick...@gmail.com> wrote:
> I thought we might be talking past each other...
>
> I think you're into "roll your own" here. Anything that accumulated docs for a while, did a batch lookup on the external system, then passed on the docs runs the risk of losing docs if the server is abnormally shut down.
>
> I guess ideally you'd like to augment the list coming in rather than the docs once they're removed from the incoming batch and passed on, but I admit I have no clue where to do that. Possibly in an update chain? If so, you'd need to be careful to only augment when they'd reached their final shard leader, or all at once before distribution to shard leaders.
>
> Is the expense for the external lookup doing the actual lookups or establishing the connection? Would having some kind of shared connection to the external source be worthwhile?
>
> FWIW,
> Erick
>
> On Thu, Nov 3, 2016 at 12:06 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:
> > Hi - I believe I did not explain myself well enough.
> >
> > Getting the data into Solr is not a problem; various sources index docs to Solr, all in fine batches as everyone should do indeed. The thing is that I need to do some preprocessing before it is indexed. Normally, UpdateProcessors are the way to go. I've made quite a few of them and they work fine.
> >
> > The problem is, I need to do a remote lookup for each document being indexed. Right now, I make an external connection for each doc being indexed in the current UpdateProcessor. This is still fast. But the remote backend supports batched lookups, which are faster.
> >
> > This is why I'd love to be able to buffer documents in an UpdateProcessor, and if there are enough, do a remote lookup for all of them, do some processing and let them be indexed.
> >
> > Thanks,
> > Markus
> >
> > -Original message-
> >> From: Erick Erickson <erickerick...@gmail.com>
> >> Sent: Thursday 3rd November 2016 19:18
> >> To: solr-user <solr-user@lucene.apache.org>
> >> Subject: Re: UpdateProcessor as a batch
> >>
> >> I _thought_ you'd been around long enough to know about the options I mentioned ;).
> >>
> >> Right. I'd guess you're in UpdateHandler.addDoc and there's really no batching at that level that I know of. I'm pretty sure that even indexing batches of 1,000 documents from, say, SolrJ go through this method.
> >>
> >> I don't think there's much to be gained by any batching at this level; it pretty immediately tells Lucene to index the doc.
> >>
> >> FWIW,
> >> Erick
> >>
> >> On Thu, Nov 3, 2016 at 11:10 AM, Markus Jelsma <markus.jel...@openindex.io> wrote:
> >> > Erick - in this case data can come from anywhere. There is one piece of code all incoming documents, regardless of their origin, are passed through: the update handler and update processors of Solr.
> >> >
> >> > In my case that is the most convenient point to partially modify the documents, instead of moving that logic to separate places.
> >> >
> >> > I've seen the ContentStream in SolrQueryRequest and I probably could tear incoming data apart and put it back together again, but that would not be as easy as working with already deserialized objects such as SolrInputDocument.
> >> >
> >> > UpdateHandler doesn't seem to work on a list of documents; it looks like it works on incoming stuff, not a whole list. I've also looked at whether I could buffer a batch in an UpdateProcessor, work on them, and release them, but that seems impossible.
> >> >
> >> > Thanks,
> >> > Markus
> >> >
> >> > -Original message-
> >> >> From: Erick Erickson <erickerick...@gmail.com>
> >> >> Sent: Thursday 3rd November 2016 18:57
> >> >> To: solr-user <solr-user@lucene.apache.org>
> >> >> Subject: Re: UpdateProcessor as a batch
> >> >>
> >> >> Markus:
> >> >>
> >> >> How are you indexing? SolrJ has a client.add(List<SolrInputDocument>) form, and post.jar lets you add as many documents as you want in a batch.
> >> >>
> >> >> Best,
> >> >> Erick
> >> >>
> >> >> On Thu, Nov 3, 2016 at 10:18 AM, Markus Jelsma <markus.jel...@openindex.io> wrote:
> >> >> > Hi - I need to process a batch of documents on update, but I cannot seem to find a point where I can hook in and process a list of SolrInputDocuments, not in UpdateProcessor nor in UpdateHandler.
> >> >> >
> >> >> > For now I let it go and implemented it on a per-document basis; it is fast, but I'd prefer batches. Is that possible at all?
> >> >> >
> >> >> > Thanks,
> >> >> > Markus
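The buffering idea discussed in this thread can be sketched outside of Solr entirely. Below is a minimal, hypothetical batch buffer (the names BatchBuffer and onBatch are illustrative, not Solr APIs): it accumulates items and hands off a whole batch every N adds, with an explicit flush for commit/shutdown, which is exactly where Erick's data-loss caveat applies.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/**
 * Sketch of the buffering approach from the thread: accumulate documents,
 * and once the buffer reaches batchSize, hand the whole batch to a callback
 * (e.g. one batched remote lookup followed by indexing the same batch).
 */
class BatchBuffer<T> {
    private final int batchSize;
    private final Consumer<List<T>> onBatch;
    private final List<T> buffer = new ArrayList<>();

    BatchBuffer(int batchSize, Consumer<List<T>> onBatch) {
        this.batchSize = batchSize;
        this.onBatch = onBatch;
    }

    /** Add one document; triggers a flush when the batch is full. */
    void add(T doc) {
        buffer.add(doc);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    /**
     * Must also be called on commit/shutdown, or buffered docs are lost --
     * the exact risk Erick points out for abnormal shutdowns.
     */
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        onBatch.accept(new ArrayList<>(buffer)); // hand off a copy
        buffer.clear();
    }
}
```

This is only the buffering half; wiring it into an UpdateProcessor safely (surviving restarts, flushing on commit) is the part the thread concludes has no clean hook in Solr.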
Re: indexing db records via SolrJ
Take a look at some of the integrations people are using with Apache Storm; we do something similar on a larger scale, having created a pgsql spout and a Solr indexing bolt.

-msj

On Mon, Mar 16, 2015 at 11:08 AM, Hal Roberts <hrobe...@cyber.law.harvard.edu> wrote:

We import anywhere from five to fifty million small documents a day from a Postgres database. I wrestled to get the DIH stuff to work for us for about a year and was much happier when I ditched that approach and switched to writing the few hundred lines of relatively simple code to handle directly the logic of what gets updated and how it gets queried from Postgres ourselves.

The DIH stuff is great for lots of cases, but if you are getting to the point of trying to hack its undocumented internals, I suspect you are better off spending a day or two of your time just writing all of the update logic yourself. We found a relatively simple combination of Postgres triggers, export to CSV based on those triggers, and then just calling update/csv to work best for us.

-hal

On 3/16/15 9:59 AM, Shawn Heisey wrote:

On 3/16/2015 7:15 AM, sreedevi s wrote:

I had checked this post. I don't know whether this is possible, but my query is whether I can use the configuration for DIH for indexing via SolrJ.

You can use SolrJ for accessing DIH. I have code that does this, but only for full index rebuilds. It won't be particularly obvious how to do it. Writing code that can interpret DIH status and know when it finishes, succeeds, or fails is very tricky, because DIH only uses human-readable status info, not machine-readable, and the info is not very consistent.

I can't just share my code, because it's extremely convoluted ... but the general gist is to create a SolrQuery object, use setRequestHandler to set the handler to /dataimport or whatever your DIH handler is, and set the other parameters on the request, like command=full-import and so on.
Thanks,
Shawn

--
Hal Roberts
Fellow, Berkman Center for Internet & Society
Harvard University
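The SolrJ gist Shawn describes maps onto plain HTTP requests against the DIH handler. A sketch of the request cycle, assuming a handler registered at /dataimport and a core named mycore (both placeholders):

```
# start a full import (returns immediately; the import runs in the background)
http://localhost:8983/solr/mycore/dataimport?command=full-import&clean=true&commit=true

# poll the human-readable status until the import finishes
http://localhost:8983/solr/mycore/dataimport?command=status
```

In SolrJ terms, the same thing is a SolrQuery with setRequestHandler("/dataimport") and command passed as an ordinary parameter, exactly as Shawn outlines; the tricky part he warns about is parsing the inconsistent status response to decide success or failure.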
Migrating from master/slave to solrcloud.
Is there a quick way to go from a single-index master/slave setup to SolrCloud without a full reindex? thanks. Msj
new collection clustering class not found.
I have a cluster with several collections using the same config in ZooKeeper. When I add a new collection through the collections API, it throws org.apache.solr.common.SolrException: Error loading class 'solr.clustering.ClusteringComponent'. When I query all the other collections, clustering works fine, and in the Solr logs I can see the other collections loading up the clustering libs. I've tried adding the libs to the shared lib, but that's causing other issues. Anyone seen anything similar with Solr 4.4.0? thanks msj
creating collections dynamically.
Is there any way to create collections dynamically? Having some issues using the collections API; needing to pass dataDir etc. to the cores doesn't seem to work correctly. thanks. msj
Re: creating collections dynamically.
Thanks Shawn, I'll give it a try.

msj

On Fri, Nov 8, 2013 at 10:29 PM, Shawn Heisey <s...@elyograg.org> wrote:

On 11/8/2013 7:39 PM, mike st. john wrote:

Is there any way to create collections dynamically? Having some issues using the collections API; needing to pass dataDir etc. to the cores doesn't seem to work correctly.

You can't pass dataDir with the collections API. It is concerned with the entire collection, not individual cores. With SolrCloud, you really shouldn't be trying to override those things. One reason you might want to do this is that you want to share one instanceDir with all your cores. This is basically unsupported with SolrCloud, because the config is in ZooKeeper, not on the disk. The dataDir defaults to $instanceDir/data.

If you *really* want to go against recommendations and control all the directories yourself, you can build the cores using the CoreAdmin API instead of the Collections API. The wiki page on SolrCloud has some details on how to do this.

Thanks,
Shawn
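For reference, the CoreAdmin route Shawn points to does accept dataDir per core. A hypothetical CREATE call for one core of a sharded collection (host, core name, shard and path are all placeholders):

```
http://localhost:8983/solr/admin/cores?action=CREATE&name=mycoll_shard1_replica1&collection=mycoll&shard=shard1&dataDir=/mnt/solr/data/shard1
```

Repeat per core, varying name, shard and dataDir; as Shawn warns, the paths must be distinct or every core will try to use the same directory.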
Re: multiple update processor chains.
Alexandre, it was set up with multiple processors and working fine. I just noticed in the docs it mentioned you could have multiple chains; it seemed to make sense to have the ability to chain the defined processors in order without the need to merge them into a single update processor definition.

thanks
msj

On Mon, Sep 9, 2013 at 12:28 AM, Alexandre Rafalovitch <arafa...@gmail.com> wrote:

Only one chain per handler. But then you can define any sequence inside the chain, so why do you care about multiple chains?

Regards,
Alex.
Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)

On Mon, Sep 9, 2013 at 5:43 AM, mike st. john <mstj...@gmail.com> wrote:

Is it possible to have multiple chains run by default? I've tried adding multiple update.chain entries for the UpdateRequestHandler but it didn't seem to work. Wondering if it's even possible.

Thanks
msj
Re: multiple update processor chains.
You're correct, it's not specifically for the update.chain. My mistake.

thanks
msj

On Mon, Sep 9, 2013 at 3:34 AM, Alexandre Rafalovitch <arafa...@gmail.com> wrote:

Which section in the docs specifically? I thought it was multiple chains per config file, but you had to choose your specific chain for individual processors. I might be wrong though.

Regards,
Alex.
Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)

On Mon, Sep 9, 2013 at 1:51 PM, mike st. john <mstj...@gmail.com> wrote:

Alexandre, it was set up with multiple processors and working fine. I just noticed in the docs it mentioned you could have multiple chains; it seemed to make sense to have the ability to chain the defined processors in order without the need to merge them into a single update processor definition.

thanks
msj
Re: collections api setting dataDir
Hi, I've sorted it all out. Basically a few replicas had failed and the counts on the replicas were less than the leader; I basically killed the index on those replicas and let them recover. Thanks for the help.

msj

On Mon, Sep 9, 2013 at 11:08 AM, Shawn Heisey <s...@elyograg.org> wrote:

On 9/7/2013 2:25 PM, mike st. john wrote:

Yes, the collections API ignored it. What I ended up doing was just building out some fairness in regards to creating the cores and calling CoreAdmin to create the cores; seemed to work OK. Only issue I'm having now, and I'm still investigating, is subsequent queries are returning different counts.

Every time I have seen distributed queries return different counts on different runs, it is because documents with the same value in the uniqueKey field exist in more than one shard. If you are letting SolrCloud route your documents automatically, this shouldn't happen ... but if you are using distrib=false or a router that doesn't do it automatically, then it could.

The Collections API doesn't do the dataDir parameter. I suspect this is because you could pass an absolute path in, which would break things because every core would be trying to use the same dataDir. If you want a directory other than ${instanceDir}/data for dataDir, then you will need to create each core individually rather than use the Collections API.

Java does have the capability to determine whether a path is relative or absolute, but it is safer to just ignore that parameter, especially given the fact that a single cloud is usually on many servers, and there's no reason those servers can't be running wildly different operating systems. Half your cloud could be on a Linux/UNIX OS and half of it could be on Windows. I personally find it better to let the Collections API do its thing and use the default.

Thanks,
Shawn
multiple update processor chains.
Is it possible to have multiple chains run by default? I've tried adding multiple update.chain entries for the UpdateRequestHandler but it didn't seem to work. Wondering if it's even possible.

Thanks
msj
Re: collections api setting dataDir
Thanks Erick. Yes, the collections API ignored it. What I ended up doing was just building out some fairness in regards to creating the cores and calling CoreAdmin to create the cores; seemed to work OK. Only issue I'm having now, and I'm still investigating, is subsequent queries are returning different counts.

msj

On Sat, Sep 7, 2013 at 1:58 PM, Erick Erickson <erickerick...@gmail.com> wrote:

Did you try just specifying dataDir=blah? I haven't tried this, but the notes for the collections API indicate they're sugar around core creation commands; see: http://wiki.apache.org/solr/CoreAdmin#CREATE

FWIW,
Erick

On Fri, Sep 6, 2013 at 4:23 PM, mike st. john <mstj...@gmail.com> wrote:

Is there any way to change the dataDir while creating a collection via the collections API?
collections api setting dataDir
Is there any way to change the dataDir while creating a collection via the collections API?
Re: Odd behavior after adding an additional core.
Hi,

curl 'http://192.168.0.1:8983/solr/admin/collections?action=CREATE&name=collectionx&numShards=4&replicationFactor=1&collection.configName=config1'

After that, I added approx 100k documents and verified they were in the index and distributed across the shards. I then decided to start adding some replicas via CoreAdmin:

curl 'http://192.168.0.1:8983/solr/admin/cores?action=CREATE&name=collectionx_ex_replica1&collection=collectionx&collection.configName=config1'

Adding the core produced the following: it took away leader status from the leader on the shard it was replicating, inserted itself as down, and changed the doc routing to implicit.

Thanks.

On Fri, Sep 6, 2013 at 4:24 AM, Shalin Shekhar Mangar <shalinman...@gmail.com> wrote:

Can you give exact steps to reproduce this problem? Also, are you sure you supplied numShards=4 while creating the collection?

On Fri, Sep 6, 2013 at 12:20 AM, mike st. john <mstj...@gmail.com> wrote:

Using Solr 4.4, I used the collection admin to create a collection: 4 shards, replicationFactor of 1. I did this so I could index my data, then bring in replicas later by adding cores via CoreAdmin. I added a new core via CoreAdmin, and what I noticed shortly after adding the core: the leader of the shard where the new replica was placed was marked active, the new core was marked as the leader, and the routing was now set to implicit. I've replicated this on another Solr setup as well. Any ideas?

Thanks
msj

--
Regards,
Shalin Shekhar Mangar.
Odd behavior after adding an additional core.
Using Solr 4.4, I used the collection admin to create a collection: 4 shards, replicationFactor of 1. I did this so I could index my data, then bring in replicas later by adding cores via CoreAdmin. I added a new core via CoreAdmin, and what I noticed shortly after adding the core: the leader of the shard where the new replica was placed was marked active, the new core was marked as the leader, and the routing was now set to implicit. I've replicated this on another Solr setup as well. Any ideas?

Thanks
msj
setting the collection in cloudsolrserver without using setdefaultcollection.
Is there any way to set the collection without calling setDefaultCollection on CloudSolrServer? I'm using CloudSolrServer with Spring and would like to autowire it.

Thanks
msj
Re: Updating clusterstate from the zookeeper
You can use the Eclipse plugin for ZooKeeper: http://www.massedynamic.org/mediawiki/index.php?title=Eclipse_Plug-in_for_ZooKeeper

-Msj.

On Fri, Apr 19, 2013 at 1:53 PM, Michael Della Bitta <michael.della.bi...@appinions.com> wrote:

I would like to know the answer to this as well.

Michael Della Bitta
Appinions
18 East 41st Street, 2nd Floor
New York, NY 10017-6271
www.appinions.com
Where Influence Isn’t a Game

On Thu, Apr 18, 2013 at 8:15 PM, Manuel Le Normand <manuel.lenorm...@gmail.com> wrote:

Hello,

After creating a distributed collection on several different servers, I sometimes have to deal with failing servers (cores appear not available = grey) or failing cores (down / unable to recover = brown / red). In case I wish to delete this erroneous collection (through the collection API), only the green nodes get erased, leaving a meaningless unavailable collection in the clusterstate.json.

Is there any way to edit the clusterstate.json explicitly? If not, how do I update it so the collection as above gets deleted?

Cheers,
Manu
Best way to backup solr?
Hi, what's the best option for backing up SolrCloud: replicate each shard? Thanks msj
writing doc to another collection from UpdateReqeustProcessor
What's the best approach to writing the current doc inside an UpdateRequestProcessor to another collection? Would I just call up CloudSolrServer and process it as I normally would in SolrJ?

Thanks
msj
Re: inconsistent number of results returned in solr cloud
Check for dup ids; a quick way is to facet using the id as a field and set the mincount to 2.

-Mike

Hardik Upadhyay wrote:

Hi, I am using Solr 4.0 (not BETA) and have created a 2 shard, 2 replica configuration. But when I query Solr with a filter query it returns inconsistent result counts. Without the filter query it returns the same consistent result count. I don't understand why. Can anyone help with this?

Best Regards,
Hardik Upadhyay
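Mike's duplicate check as a concrete request; this sketch assumes the uniqueKey field is literally named id and is indexed so it can be faceted on:

```
/solr/collection1/select?q=*:*&rows=0&facet=true&facet.field=id&facet.mincount=2&facet.limit=-1
```

Any id that comes back with a count of 2 or more exists in more than one shard (or was indexed twice), which is a classic cause of counts that change between runs.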
Having an issue where atomic updates are treated as new docs running in solrcloud on 4.1
Hi, running Tomcat, Solr 4.1, distributed: 4 shards, 2 replicas per shard. Everything works fine searching, but I'm trying to use this instance as a NoSQL solution as well. What I've noticed: when I send a partial update, I'll receive "missing required field" if the document is not located on the URL I'm sending the update to, implying that it's not distributing the updates to the correct servers.

thanks for any help.

Mike
Re: Having an issue where atomic updates are treated as new docs running in solrcloud on 4.1
Hi Michael, ah, that seems to be the issue; it's set to implicit. This install was originally a 4.0 install; when it moved to 4.1, the problems started. Is there an easy way to change the router to compositeId?

-Mike

Michael Della Bitta wrote:

Hi Mike,

Are you sure you're sending it to the collection URL as opposed to one of the shard URLs? If you go to the Cloud tab, click on Tree, and then click on clusterstate.json, what is the value of router for that collection?

Michael Della Bitta
Appinions
18 East 41st Street, 2nd Floor
New York, NY 10017-6271
www.appinions.com
Where Influence Isn’t a Game

On Mon, Mar 4, 2013 at 12:44 PM, mike st. john <mstj...@gmail.com> wrote:

Hi, running Tomcat, Solr 4.1, distributed: 4 shards, 2 replicas per shard. Everything works fine searching, but I'm trying to use this instance as a NoSQL solution as well. What I've noticed: when I send a partial update, I'll receive "missing required field" if the document is not located on the URL I'm sending the update to, implying that it's not distributing the updates to the correct servers.

thanks for any help.

Mike
Re: Having an issue where atomic updates are treated as new docs running in solrcloud on 4.1
Mark, the odd piece here I think was: this was a 4.0 collection, numShards=4 etc. Moved to 4.1, I would assume the doc router would have been set to compositeId, not implicit. Or is the move from 4.0 to 4.1 a complete rebuild from the collections up?

-Mike

Mark Miller wrote:

On Mar 4, 2013, at 3:27 PM, Michael Della Bitta <michael.della.bi...@appinions.com> wrote:

I personally don't know of one other than starting over with a new collection, but I'd love to be proven wrong, because I'm actually in the same boat as you!

I think it might be possible by using a ZooKeeper tool to edit clusterstate.json (I like using the ZooKeeper Eclipse plugin for this type of thing). If you create a new collection with the same number of shards, and be sure to specify numShards, you will see the hash ranges that should be used for each shard. Try updating the clusterstate.json to match, with the right router and hash ranges.

- Mark
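Following Mark's suggestion, the fields to edit are each shard's range and the collection-level router. A rough sketch of the relevant portion of clusterstate.json for a hypothetical 4-shard collection (the ranges shown are the standard even split of the 32-bit hash space across four shards; the exact layout varies by version, and 4.1 stores router as a plain string):

```json
"mycollection": {
  "shards": {
    "shard1": { "range": "80000000-bfffffff", "replicas": { } },
    "shard2": { "range": "c0000000-ffffffff", "replicas": { } },
    "shard3": { "range": "0-3fffffff",        "replicas": { } },
    "shard4": { "range": "40000000-7fffffff", "replicas": { } }
  },
  "router": "compositeId"
}
```

Creating a throwaway collection with the same numShards, as Mark suggests, is the safest way to confirm the exact ranges and field names your version writes before editing by hand.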
Re: Having an issue where atomic updates are treated as new docs running in solrcloud on 4.1
Thanks Mark. That worked great.

-Mike

Mark Miller wrote:

Honestly, I'm not sure. Yonik did some testing around upgrading from 4.0 to 4.1 and said this was fine, but it sounds like perhaps there are some hitches.

- Mark

On Mar 4, 2013, at 3:35 PM, mike st. john <mstj...@gmail.com> wrote:

Mark, the odd piece here I think was: this was a 4.0 collection, numShards=4 etc. Moved to 4.1, I would assume the doc router would have been set to compositeId, not implicit. Or is the move from 4.0 to 4.1 a complete rebuild from the collections up?

-Mike
atomic updates fail with solrcloud , and real time get throwing NPE
Atomic updates are failing in SolrCloud unless the update is sent to the shard where the doc resides. Real-time get throws an NPE when run without distrib=false. Tried with 4.1 and a 4.2 snapshot. Any ideas?

Thanks.
msj