Re: UpdateProcessor as a batch

2016-11-03 Thread mike st. john
Maybe introduce a distributed queue such as Apache Ignite, Hazelcast or
even Redis. Read from the queue in batches, do your lookup, then index the
same batch.

just a thought.

Mike St. John.
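
A rough sketch of that queue-and-batch idea, with an in-JVM BlockingQueue
standing in for the distributed queue (Ignite, Hazelcast and Redis all
expose a similar poll/drain style of API) and a 4.x-era SolrJ SolrServer
doing the indexing; lookupBatch() is a hypothetical call to the external
backend:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.common.SolrInputDocument;

public class QueueBatchIndexer implements Runnable {

    private static final int BATCH_SIZE = 500;

    private final BlockingQueue<SolrInputDocument> queue =
            new LinkedBlockingQueue<SolrInputDocument>();
    private final SolrServer solr;  // e.g. a shared CloudSolrServer for the target collection

    public QueueBatchIndexer(SolrServer solr) {
        this.solr = solr;
    }

    // producers drop docs here instead of indexing them one at a time
    public void enqueue(SolrInputDocument doc) {
        queue.offer(doc);
    }

    @Override
    public void run() {
        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>(BATCH_SIZE);
        try {
            while (!Thread.currentThread().isInterrupted()) {
                // block for the first doc, then drain whatever else is waiting
                SolrInputDocument first = queue.poll(1, TimeUnit.SECONDS);
                if (first == null) {
                    continue;
                }
                batch.add(first);
                queue.drainTo(batch, BATCH_SIZE - 1);
                lookupBatch(batch);  // one remote round trip for the whole batch
                solr.add(batch);     // then index the same batch
                batch.clear();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } catch (Exception e) {
            e.printStackTrace();  // real code would retry or dead-letter the batch
        }
    }

    private void lookupBatch(List<SolrInputDocument> docs) {
        // hypothetical: augment every doc with a single call to the external backend
    }
}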

On Nov 3, 2016 3:58 PM, "Erick Erickson" <erickerick...@gmail.com> wrote:

> I thought we might be talking past each other...
>
> I think you're into "roll your own" here. Anything that
> accumulated docs for a while, did a batch lookup
> on the external system, then passed on the docs
> runs the risk of losing docs if the server is abnormally
> shut down.
>
> I guess ideally you'd like to augment the list coming in
> rather than the docs once they're removed from the
> incoming batch and passed on, but I admit I have no
> clue where to do that. Possibly in an update chain? If
> so, you'd need to be careful to only augment when
> they'd reached their final shard leader or all at once
> before distribution to shard leaders.
>
> Is the expense for the external lookup doing the actual
> lookups or establishing the connection? Would
> having some kind of shared connection to the external
> source be worthwhile?
>
> FWIW,
> Erick
>
> On Thu, Nov 3, 2016 at 12:06 PM, Markus Jelsma
> <markus.jel...@openindex.io> wrote:
> > Hi - i believe i did not explain myself well enough.
> >
> > Getting the data into Solr is not a problem: various sources index docs
> > to Solr, all in nice batches, as everyone should. The thing is that i
> > need to do some preprocessing before it is indexed. Normally,
> > UpdateProcessors are the way to go. I've made quite a few of them and
> > they work fine.
> >
> > The problem is, i need to do a remote lookup for each document being
> > indexed. Right now, i make an external connection for each doc being
> > indexed in the current UpdateProcessor. This is still fast, but the
> > remote backend supports batched lookups, which are faster.
> >
> > This is why i'd love to be able to buffer documents in an
> > UpdateProcessor and, if there are enough, do a remote lookup for all
> > of them, do some processing, and let them be indexed.
> >
> > Thanks,
> > Markus
> >
> >
> >
> > -----Original message-----
> >> From: Erick Erickson <erickerick...@gmail.com>
> >> Sent: Thursday 3rd November 2016 19:18
> >> To: solr-user <solr-user@lucene.apache.org>
> >> Subject: Re: UpdateProcessor as a batch
> >>
> >> I _thought_ you'd been around long enough to know about the options I
> >> mentioned ;).
> >>
> >> Right. I'd guess you're in UpdateHandler.addDoc and there's really no
> >> batching at that level that I know of. I'm pretty sure that even
> >> indexing batches of 1,000 documents from, say, SolrJ go through this
> >> method.
> >>
> >> I don't think there's much to be gained by any batching at this level,
> >> it pretty immediately tells Lucene to index the doc.
> >>
> >> FWIW
> >> Erick
> >>
> >> On Thu, Nov 3, 2016 at 11:10 AM, Markus Jelsma
> >> <markus.jel...@openindex.io> wrote:
> >> > Erick - in this case data can come from anywhere. There is one piece
> >> > of code that all incoming documents, regardless of their origin,
> >> > pass through: the update handler and update processors of Solr.
> >> >
> >> > In my case that is the most convenient point to partially modify the
> >> > documents, instead of moving that logic to separate places.
> >> >
> >> > I've seen the ContentStream in SolrQueryRequest and i probably could
> >> > tear incoming data apart and put it back together again, but that
> >> > would not be as easy as working with already deserialized objects
> >> > such as SolrInputDocument.
> >> >
> >> > UpdateHandler doesn't seem to work on a list of documents; it looks
> >> > like it works on incoming docs one at a time, not a whole list. I've
> >> > also looked at whether i could buffer a batch in an UpdateProcessor,
> >> > work on them, and release them, but that seems impossible.
> >> >
> >> > Thanks,
> >> > Markus
> >> >
> >> > -----Original message-----
> >> >> From: Erick Erickson <erickerick...@gmail.com>
> >> >> Sent: Thursday 3rd November 2016 18:57
> >> >> To: solr-user <solr-user@lucene.apache.org>
> >> >> Subject: Re: UpdateProcessor as a batch
> >> >>
> >> >> Markus:
> >> >>
> >> >> How are you indexing? SolrJ has a client.add(List<SolrInputDocument>)
> >> >> form, and post.jar lets you add as many documents as you want in a
> >> >> batch.
> >> >>
> >> >> Best,
> >> >> Erick
> >> >>
> >> >> On Thu, Nov 3, 2016 at 10:18 AM, Markus Jelsma
> >> >> <markus.jel...@openindex.io> wrote:
> >> >> > Hi - i need to process a batch of documents on update but i cannot
> >> >> > seem to find a point where i can hook in and process a list of
> >> >> > SolrInputDocuments, not in UpdateProcessor nor in UpdateHandler.
> >> >> >
> >> >> > For now i let it go and implemented it on a per-document basis; it
> >> >> > is fast, but i'd prefer batches. Is that possible at all?
> >> >> >
> >> >> > Thanks,
> >> >> > Markus
> >> >>
> >>
>
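
For reference, a buffering UpdateProcessor along the lines Markus describes
might look roughly like the sketch below. It only buffers within a single
update request, flushing from finish(), which sidesteps the lost-documents
risk Erick raises. batchLookup() is hypothetical, the matching
UpdateRequestProcessorFactory is omitted, and SolrInputDocument.deepCopy()
is assumed to be available:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;

public class BatchLookupProcessor extends UpdateRequestProcessor {

    private static final int BATCH_SIZE = 100;
    private final List<AddUpdateCommand> buffer =
            new ArrayList<AddUpdateCommand>(BATCH_SIZE);

    public BatchLookupProcessor(UpdateRequestProcessor next) {
        super(next);
    }

    @Override
    public void processAdd(AddUpdateCommand cmd) throws IOException {
        // copy, because some loaders recycle the same command object
        AddUpdateCommand copy = new AddUpdateCommand(cmd.getReq());
        copy.solrDoc = cmd.getSolrInputDocument().deepCopy();
        copy.overwrite = cmd.overwrite;
        copy.commitWithin = cmd.commitWithin;
        buffer.add(copy);
        if (buffer.size() >= BATCH_SIZE) {
            flush();
        }
    }

    @Override
    public void finish() throws IOException {
        flush();          // drain whatever is left for this request
        super.finish();
    }

    private void flush() throws IOException {
        if (buffer.isEmpty()) {
            return;
        }
        batchLookup(buffer);                 // one remote call for the whole batch
        for (AddUpdateCommand cmd : buffer) {
            super.processAdd(cmd);           // hand each doc on down the chain
        }
        buffer.clear();
    }

    private void batchLookup(List<AddUpdateCommand> cmds) {
        // hypothetical: augment cmd.getSolrInputDocument() for every command
    }
}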


Re: indexing db records via SolrJ

2015-03-16 Thread mike st. john
Take a look at some of the integrations people are using with apache storm,
  we do something similar on a larger scale , having created a pgsql spout
and having a solr indexing bolt.


-msj

On Mon, Mar 16, 2015 at 11:08 AM, Hal Roberts <hrobe...@cyber.law.harvard.edu> wrote:

 We import anywhere from five to fifty million small documents a day from a
 postgres database.  I wrestled to get the DIH stuff to work for us for
 about a year and was much happier when I ditched that approach and switched
 to writing the few hundred lines of relatively simple code that directly
 handle the logic of what gets updated and how it gets queried from
 postgres.

 The DIH stuff is great for lots of cases, but if you are getting to the
 point of trying to hack its undocumented internals, I suspect you are
 better off spending a day or two of your time just writing all of the
 update logic yourself.

 We found that a relatively simple combination of postgres triggers, export
 to csv based on those triggers, and then just calling update/csv works best
 for us.

 -hal
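
The last step of that pipeline, posting a triggered CSV export, can also be
driven from SolrJ; a minimal sketch, assuming 4.x-era classes and made-up
file and URL names:

import java.io.File;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class CsvPoster {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
        ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/csv");
        req.addFile(new File("export.csv"), "text/csv");  // the triggered export
        req.setParam("commit", "true");                   // commit after the load
        solr.request(req);
    }
}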


 On 3/16/15 9:59 AM, Shawn Heisey wrote:

 On 3/16/2015 7:15 AM, sreedevi s wrote:

 I had checked this post. I don't know whether this is possible, but my
 query is whether I can use the DIH configuration for indexing via SolrJ.


 You can use SolrJ for accessing DIH.  I have code that does this, but
 only for full index rebuilds.

 It won't be particularly obvious how to do it.  Writing code that can
 interpret DIH status and know when it finishes, succeeds, or fails is
 very tricky, because DIH only uses human-readable status info, not
 machine-readable, and the info is not very consistent.

 I can't just share my code, because it's extremely convoluted ... but
 the general gist is to create a SolrQuery object, use setRequestHandler
 to set the handler to /dataimport or whatever your DIH handler is, and
 set the other parameters on the request, like "command" to "full-import",
 and so on.

 Thanks,
 Shawn
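
A sketch of the gist Shawn describes, with a made-up URL and handler name;
interpreting the returned human-readable status is the tricky part he warns
about and is not attempted here:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class DihTrigger {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
        SolrQuery q = new SolrQuery();
        q.setRequestHandler("/dataimport");  // or whatever your DIH handler is
        q.set("command", "full-import");
        q.set("clean", "true");
        q.set("commit", "true");
        QueryResponse rsp = solr.query(q);
        System.out.println(rsp.getResponse());  // raw, human-readable DIH status
    }
}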


 --
 Hal Roberts
 Fellow
 Berkman Center for Internet & Society
 Harvard University



Migrating from master/slave to solrcloud.

2014-11-09 Thread mike st. john
Is there a quick way to go from single index master/slave to solrcloud
without a full reindex?


thanks.

Msj


new collection clustering class not found.

2013-11-09 Thread mike st. john
I have a cluster with several collections using the same config in zk.
When i add a new collection through the collection api it throws
org.apache.solr.common.SolrException: Error loading class
'solr.clustering.ClusteringComponent'


When i query all the other collections, clustering works fine, and in the
solr logs i can see the other collections loading up the clustering libs.

I've tried adding the libs to the sharedLib, but that's causing other issues.


anyone see anything similar with solr 4.4.0?

thanks

msj


creating collections dynamically.

2013-11-08 Thread mike st. john
Is there any way to create collections dynamically?


Having some issues using the collections api: i need to pass dataDir etc to
the cores and it doesn't seem to work correctly.


thanks.

msj


Re: creating collections dynamically.

2013-11-08 Thread mike st. john
thanks shawn,  i'll give it a try.


msj


On Fri, Nov 8, 2013 at 10:29 PM, Shawn Heisey s...@elyograg.org wrote:

 On 11/8/2013 7:39 PM, mike st. john wrote:

 Is there any way to create collections dynamically?


 Having some issues using the collections api: i need to pass dataDir etc
 to the cores and it doesn't seem to work correctly.


 You can't pass dataDir with the collections API. It is concerned with the
 entire collection, not individual cores. With SolrCloud, you really
 shouldn't be trying to override those things.  One reason you might want to
 do this is that you want to share one instanceDir with all your cores.
  This is basically unsupported with SolrCloud, because the config is in
 zookeeper, not on the disk.  The dataDir defaults to $instanceDir/data.

 If you *really* want to go against recommendations and control all the
 directories yourself, you can build the cores using the CoreAdmin API
 instead of the Collections API.  The wiki page on SolrCloud has some
 details on how to do this.

 Thanks,
 Shawn
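
If you do go against recommendations, per-core creation with an explicit
dataDir can be scripted through SolrJ's CoreAdmin wrapper; a sketch,
assuming the 4.x CoreAdminRequest.Create setters, with every name and path
made up:

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.CoreAdminRequest;

public class CreateCoreWithDataDir {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new HttpSolrServer("http://localhost:8983/solr");
        CoreAdminRequest.Create create = new CoreAdminRequest.Create();
        create.setCoreName("collectionx_shard1_replica2");
        create.setCollection("collectionx");
        create.setShardId("shard1");
        create.setInstanceDir("collectionx_shard1_replica2");
        create.setDataDir("/mnt/bigdisk/collectionx/shard1/data");  // the whole point
        create.process(solr);
    }
}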




Re: multiple update processor chains.

2013-09-09 Thread mike st. john
Alexandre,

it was set up with multiple processors and working fine. I just noticed in
the docs that you can have multiple chains; it seemed to make sense to have
the ability to chain the defined processors in order without the need to
merge them into a single update processor definition.

thanks
msj


On Mon, Sep 9, 2013 at 12:28 AM, Alexandre Rafalovitch <arafa...@gmail.com> wrote:

 Only one chain per handler. But then you can define any sequence inside the
 chain, so why do you care about multiple chains?

 Regards,
Alex.

 Personal website: http://www.outerthoughts.com/
 LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
 - Time is the quality of nature that keeps events from happening all at
 once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)
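
For reference, the arrangement Alexandre describes would look something
like this in solrconfig.xml; a sketch with a made-up chain name and stock
processor factories:

<!-- one chain per handler; the full sequence lives inside the chain -->
<updateRequestProcessorChain name="mychain">
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.DistributedUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">mychain</str>
  </lst>
</requestHandler>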


 On Mon, Sep 9, 2013 at 5:43 AM, mike st. john mstj...@gmail.com wrote:

  is it possible to have multiple chains run by default?
 
  i've tried adding multiple update.chains for the  UpdateRequestHandler
 but
  it didn't seem to work.
 
 
  wondering if its even possible.
 
 
 
  Thanks
 
  msj
 



Re: multiple update processor chains.

2013-09-09 Thread mike st. john
You're correct, it's not specifically for the update.chain. My mistake.

thanks

msj


On Mon, Sep 9, 2013 at 3:34 AM, Alexandre Rafalovitch <arafa...@gmail.com> wrote:

 Which section in the docs specifically? I thought it was multiple chains
 per config file, but you had to choose your specific chain for individual
 processors.

 I might be wrong though.

 Regards,
Alex.

 Personal website: http://www.outerthoughts.com/
 LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
 - Time is the quality of nature that keeps events from happening all at
 once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


 On Mon, Sep 9, 2013 at 1:51 PM, mike st. john mstj...@gmail.com wrote:

  Alexandre,
 
  it was set up with multiple processors and working fine. I just noticed
  in the docs that you can have multiple chains; it seemed to make sense
  to have the ability to chain the defined processors in order without
  the need to merge them into a single update processor definition.
 
  thanks
  msj
 
 
  On Mon, Sep 9, 2013 at 12:28 AM, Alexandre Rafalovitch
  arafa...@gmail.comwrote:
 
   Only one chain per handler. But then you can define any sequence inside
  the
   chain, so why do you care about multiple chains?
  
   Regards,
  Alex.
  
   Personal website: http://www.outerthoughts.com/
   LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
   - Time is the quality of nature that keeps events from happening all at
   once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
 book)
  
  
   On Mon, Sep 9, 2013 at 5:43 AM, mike st. john mstj...@gmail.com
 wrote:
  
    is it possible to have multiple chains run by default?
    
    i've tried adding multiple update.chains for the UpdateRequestHandler
    but it didn't seem to work.
   
   
wondering if its even possible.
   
   
   
Thanks
   
msj
   
  
 



Re: collections api setting dataDir

2013-09-09 Thread mike st. john
hi,

i've sorted it all out. Basically, a few replicas had failed and the counts
on those replicas were less than the leader's. I killed the index on those
replicas and let them recover.


Thanks for the  help.

msj


On Mon, Sep 9, 2013 at 11:08 AM, Shawn Heisey s...@elyograg.org wrote:

 On 9/7/2013 2:25 PM, mike st. john wrote:
  yes, the collections api ignored it. What i ended up doing was just
  building out some fairness in regards to creating the cores and calling
  coreadmin to create the cores; that seemed to work ok. The only issue i'm
  having now, and i'm still investigating, is that subsequent queries are
  returning different counts.

 Every time I have seen distributed queries return different counts on
 different runs, it is because documents with the same value in the
 UniqueKey field exist in more than one shard.  If you are letting
 SolrCloud route your documents automatically, this shouldn't happen ...
 but if you are using distrib=false or a router that doesn't do it
 automatically, then it could.

 The Collections API doesn't do the dataDir parameter.  I suspect this is
 because you could pass an absolute path in, which would break things
 because every core would be trying to use the same dataDir.  If you want
 a directory other than ${instanceDir}/data for dataDir, then you will
 need to create each core individually rather than use the Collections API.

 Java does have the capability to determine whether a path is relative or
 absolute, but it is safer to just ignore that parameter, especially
 given the fact that a single cloud is usually on many servers, and
 there's no reason those servers can't be running wildly different
 operating systems.  Half your cloud could be on a Linux/UNIX OS and half
 of it could be on Windows.

 I personally find it better to let the Collections API do its thing and
 use the default.

 Thanks,
 Shawn




multiple update processor chains.

2013-09-08 Thread mike st. john
is it possible to have multiple chains run by default?

i've tried adding multiple update.chains for the UpdateRequestHandler but
it didn't seem to work.


wondering if its even possible.



Thanks

msj


Re: collections api setting dataDir

2013-09-07 Thread mike st. john
Thanks erick,

yes, the collections api ignored it. What i ended up doing was just
building out some fairness in regards to creating the cores and calling
coreadmin to create the cores; that seemed to work ok. The only issue i'm
having now, and i'm still investigating, is that subsequent queries are
returning different counts.


msj




On Sat, Sep 7, 2013 at 1:58 PM, Erick Erickson <erickerick...@gmail.com> wrote:

 Did you try just specifying dataDir=blah? I haven't tried this, but the
 notes for
 the collections API indicate they're sugar around core creation commands,
 see: http://wiki.apache.org/solr/CoreAdmin#CREATE

 FWIW,
 Erick


 On Fri, Sep 6, 2013 at 4:23 PM, mike st. john mstj...@gmail.com wrote:

  is there any way to change the dataDir while creating a collection via
 the
  collection api?
 



collections api setting dataDir

2013-09-06 Thread mike st. john
is there any way to change the dataDir while creating a collection via the
collection api?


Re: Odd behavior after adding an additional core.

2013-09-06 Thread mike st. john
hi,

curl '
http://192.168.0.1:8983/solr/admin/collections?action=CREATE&name=collectionx&numShards=4&replicationFactor=1&collection.configName=config1
'

after that, i added approx 100k documents and verified they were in the
index and distributed across the shards.


i then decided to start adding some replicas via coreadmin.

curl '
http://192.168.0.1:8983/solr/admin/cores?action=CREATE&name=collectionx_ex_replica1&collection=collectionx&collection.configName=config1
'


Adding the core produced the following: it took away leader status from
the leader on the shard it was replicating, inserted itself as down, and
changed the doc routing to implicit.


Thanks.



On Fri, Sep 6, 2013 at 4:24 AM, Shalin Shekhar Mangar <shalinman...@gmail.com> wrote:

 Can you give exact steps to reproduce this problem?

 Also, are you sure you supplied numShards=4 while creating the collection?

 On Fri, Sep 6, 2013 at 12:20 AM, mike st. john mstj...@gmail.com wrote:
  using solr 4.4, i used collection admin to create a collection: 4 shards,
  replication factor of 1.

  i did this so i could index my data, then bring in replicas later by
  adding cores via coreadmin.


  i added a new core via coreadmin. What i noticed shortly after adding the
  core: the leader of the shard where the new replica was placed was marked
  active, the new core was marked as the leader, and the routing was now
  set to implicit.
 
 
 
  i've replicated this on another solr setup as well.
 
 
  Any ideas?
 
 
  Thanks
 
  msj



 --
 Regards,
 Shalin Shekhar Mangar.



Odd behavior after adding an additional core.

2013-09-05 Thread mike st. john
using solr 4.4, i used collection admin to create a collection: 4 shards,
replication factor of 1.

i did this so i could index my data, then bring in replicas later by adding
cores via coreadmin.


i added a new core via coreadmin. What i noticed shortly after adding the
core: the leader of the shard where the new replica was placed was marked
active, the new core was marked as the leader, and the routing was now set
to implicit.



i've replicated this on another solr setup as well.


Any ideas?


Thanks

msj


setting the collection in cloudsolrserver without using setdefaultcollection.

2013-05-21 Thread mike st. john
Is there any way to set the collection without passing setDefaultCollection
in cloudsolrserver?

I'm using cloudsolrserver with spring, and would like to autowire it.



Thanks

msj


Re: Updating clusterstate from the zookeeper

2013-04-19 Thread mike st. john
you can use the eclipse plugin for zookeeper.


http://www.massedynamic.org/mediawiki/index.php?title=Eclipse_Plug-in_for_ZooKeeper


-Msj.


On Fri, Apr 19, 2013 at 1:53 PM, Michael Della Bitta <michael.della.bi...@appinions.com> wrote:

 I would like to know the answer to this as well.

 Michael Della Bitta

 
 Appinions
 18 East 41st Street, 2nd Floor
 New York, NY 10017-6271

 www.appinions.com

 Where Influence Isn’t a Game


 On Thu, Apr 18, 2013 at 8:15 PM, Manuel Le Normand <manuel.lenorm...@gmail.com> wrote:
  Hello,
  After creating a distributed collection on several different servers I
  sometimes get to deal with failing servers (cores appear not available =
  grey) or failing cores (Down / unable to recover = brown / red).
  In case i wish to delete this erroneous collection (through the
  collection API) only the green nodes get erased, leaving a meaningless
  unavailable collection in the clusterstate.json.

  Is there any way to edit the clusterstate.json explicitly? If not, how
  do i update it so that the collection above gets deleted?

  Cheers,
  Manu



Best way to back up solr?

2013-03-16 Thread mike st. john

Hi,

What's the best option for backing up solrcloud?

Replicate each shard?


Thanks

msj


writing doc to another collection from UpdateRequestProcessor

2013-03-11 Thread mike st. john
What's the best approach to writing the current doc, from inside an
UpdateRequestProcessor, to another collection?



Would i just call up CloudSolrServer and process it as i normally would
in solrj?




Thanks
msj
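
A sketch of that approach: forwarding a copy of each doc to a second
collection through an ordinary shared SolrJ client from inside the
processor. How the client is wired in (the factory, or e.g. Spring) is
omitted, and SolrInputDocument.deepCopy() is assumed to be available:

import java.io.IOException;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;

public class CopyToCollectionProcessor extends UpdateRequestProcessor {

    private final SolrServer otherCollection;  // e.g. a shared CloudSolrServer

    public CopyToCollectionProcessor(SolrServer otherCollection,
                                     UpdateRequestProcessor next) {
        super(next);
        this.otherCollection = otherCollection;
    }

    @Override
    public void processAdd(AddUpdateCommand cmd) throws IOException {
        // copy the doc so later processors can't mutate what we send away
        SolrInputDocument copy = cmd.getSolrInputDocument().deepCopy();
        try {
            otherCollection.add(copy);  // index the copy into the other collection
        } catch (Exception e) {
            throw new IOException("copy to other collection failed", e);
        }
        super.processAdd(cmd);          // continue down the normal chain
    }
}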


Re: inconsistent number of results returned in solr cloud

2013-03-08 Thread mike st. john

check for dup ids.

A quick way is to facet on the id field and set the mincount to 2.


-Mike
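
That check as a SolrJ query, assuming id is the uniqueKey field and a
made-up URL; any facet bucket that comes back is a duplicated id:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;

public class DupIdCheck {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collectionx");
        SolrQuery q = new SolrQuery("*:*");
        q.setRows(0);           // we only want the facet counts
        q.setFacet(true);
        q.addFacetField("id");
        q.setFacetMinCount(2);  // only ids that occur more than once
        q.setFacetLimit(-1);    // no cap on the number of buckets
        QueryResponse rsp = solr.query(q);
        FacetField ids = rsp.getFacetField("id");
        if (ids != null && ids.getValues() != null) {
            for (FacetField.Count c : ids.getValues()) {
                System.out.println(c.getName() + " occurs " + c.getCount() + " times");
            }
        }
    }
}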

Hardik Upadhyay wrote:


HI

I am using solr 4.0 (not BETA), and have created a 2 shard, 2 replica
configuration. But when I query solr with a filter query it returns
inconsistent result counts.

Without the filter query it returns the same consistent result count.
I don't understand why.

Can any one help in this?

Best Regards

Hardik Upadhyay




Having an issue where atomic updates are treated as new docs running in solrcloud on 4.1

2013-03-04 Thread mike st. john

Hi,

running tomcat, solr 4.1, distributed, 4 shards, 2 replicas per shard.
Everything works fine searching, but i'm trying to use this instance as
a nosql solution as well. What i've noticed: when i send a partial
update i'll receive "missing required field" if the document is not
located on the url i'm sending the update to, implying that it's not
distributing the updates to the correct servers.


thanks for any help.


Mike


Re: Having an issue where atomic updates are treated as new docs running in solrcloud on 4.1

2013-03-04 Thread mike st. john

Hi michael,

ah, thats seems to be the issue, its set to implicit.

This originally was a 4.0 install; when it moved to 4.1, the problems
started.


Is there an easy way to change the router to compositeId?


-Mike

Michael Della Bitta wrote:


Hi Mike,

Are you sure you're sending it to the collection URL as opposed to one of
the shard URLs?

If you go to the Cloud tab, click on Tree, and then click on
clusterstate.json, what is the value of "router" for that collection?


Michael Della Bitta


Appinions
18 East 41st Street, 2nd Floor
New York, NY 10017-6271

www.appinions.com

Where Influence Isn’t a Game


On Mon, Mar 4, 2013 at 12:44 PM, mike st. john <mstj...@gmail.com> wrote:


Hi,

running tomcat, solr 4.1, distributed, 4 shards, 2 replicas per shard.
Everything works fine searching, but i'm trying to use this instance as a
nosql solution as well. What i've noticed: when i send a partial update
i'll receive "missing required field" if the document is not located on
the url i'm sending the update to, implying that it's not distributing
the updates to the correct servers.

thanks for any help.


Mike


Re: Having an issue where atomic updates are treated as new docs running in solrcloud on 4.1

2013-03-04 Thread mike st. john

Mark,

the odd piece here i think was: this was a 4.0 collection, numShards=4
etc.


moved to 4.1, i would assume the doc router would have been set to
compositeId, not implicit. Or is the move from 4.0 to 4.1 a complete
rebuild from the collections up?


-Mike

Mark Miller wrote:


On Mar 4, 2013, at 3:27 PM, Michael Della Bitta <michael.della.bi...@appinions.com> wrote:




I personally don't know of one other than starting over with a new
collection, but I'd love to be proven wrong, because I'm actually in the
same boat as you!



I think it might be possible by using a zookeeper tool to edit 
clusterstate.json (i like using the zk eclipse plugin for this type of 
thing).


If you create a new collection with the same number of shards and be 
sure to specify num shards, you will see the hash ranges that should 
be used for each shard. Try updating the clusterstate.json to match - 
with the right router and hash ranges.


- Mark
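
A hedged sketch of that zookeeper edit using the raw ZooKeeper client; the
host string is an assumption, real code should wait for the connect event
before reading, and the shard hash ranges need editing along with the
router:

import java.nio.charset.StandardCharsets;

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class EditClusterState {
    public static void main(String[] args) throws Exception {
        // connection is asynchronous; a real tool waits for SyncConnected
        ZooKeeper zk = new ZooKeeper("localhost:2181", 10000, new Watcher() {
            @Override
            public void process(WatchedEvent event) {
                // ignore events in this sketch
            }
        });

        Stat stat = new Stat();
        byte[] data = zk.getData("/clusterstate.json", false, stat);
        String json = new String(data, StandardCharsets.UTF_8);
        System.out.println(json);  // inspect the router and hash ranges here

        // the 4.1-era format stores the router as a plain string; the shard
        // "range" values must also be copied from a freshly created collection
        String edited = json.replace("\"router\":\"implicit\"",
                                     "\"router\":\"compositeId\"");

        // conditional write: fails if anything else changed the node meanwhile
        zk.setData("/clusterstate.json", edited.getBytes(StandardCharsets.UTF_8),
                stat.getVersion());
        zk.close();
    }
}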


Re: Having an issue where atomic updates are treated as new docs running in solrcloud on 4.1

2013-03-04 Thread mike st. john

thanks mark.

That worked great.


-Mike

Mark Miller wrote:


Honestly, I'm not sure. Yonik did some testing around upgrading from 
4.0 to 4.1 and said this was fine - but it sounds like perhaps there 
are some hitches.


- Mark

On Mar 4, 2013, at 3:35 PM, mike st. john <mstj...@gmail.com> wrote:



Mark,

the odd piece here i think was: this was a 4.0 collection, numShards=4
etc.


moved to 4.1, i would assume the doc router would have been set to
compositeId, not implicit. Or is the move from 4.0 to 4.1 a complete
rebuild from the collections up?


-Mike

Mark Miller wrote:


On Mar 4, 2013, at 3:27 PM, Michael Della Bitta <michael.della.bi...@appinions.com> wrote:




I personally don't know of one other than starting over with a new
collection, but I'd love to be proven wrong, because I'm actually in the
same boat as you!



I think it might be possible by using a zookeeper tool to edit 
clusterstate.json (i like using the zk eclipse plugin for this type 
of thing).


If you create a new collection with the same number of shards and be 
sure to specify num shards, you will see the hash ranges that should 
be used for each shard. Try updating the clusterstate.json to match 
- with the right router and hash ranges.


- Mark






atomic updates fail with solrcloud, and real-time get throws NPE

2013-03-03 Thread mike st. john
Atomic updates are failing in solrcloud unless the update is sent to
the shard where the doc resides. Real-time get is throwing an NPE when
run without distrib=false.


tried with 4.1 and 4.2 snapshot.


Any ideas?


Thanks.

msj